Archive for the ‘statistics’ Category
So it was OpenTech weekend. I wasn’t presenting anything (although I’m kicking myself for not having done a talk on Tropo and Phono) but of course I was there. This year’s was, I think, a bit better than last year’s – the schedule filled up late on, and there were a couple of really good workshop sessions. As usual, it was also the drinking conference with a code problem (the bar was full by the end of the first session).
Things to note: everyone loves Google Refine, and I really enjoyed the Refine HOWTO session, which was also the one where the presenter asked if anyone present had ever written a screen-scraper and 60-odd hands reached for the sky. Basically, it lets you slurp up any even vaguely tabular data and identify transformations you need to clean it up – for example, identifying particular items, data formats, or duplicates – and then apply them to the whole thing automatically. You can write your own functions for it in several languages and have the application call them as part of the process. Removing cruft from data is always incredibly time consuming and annoying, so it’s no wonder everyone likes the idea of a sensible way of automating it. There’s been some discussion on the ScraperWiki mailing list about integrating Refine into SW in order to provide a data-scrubbing capability and I wouldn’t be surprised if it goes ahead.
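To give a flavour of what this kind of clean-up involves, here's a minimal sketch in plain Python (not Refine's GREL, and with invented rows): normalise whitespace and case, then drop the duplicates that fall out.

```python
# Sketch of the sort of transformation Refine automates: normalise
# whitespace and case across a row, then dedupe. Example rows invented.
def clean(rows):
    seen = set()
    out = []
    for row in rows:
        key = tuple(" ".join(cell.split()).lower() for cell in row)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

rows = [("Dept of Health ", "2011-06-01"),
        ("dept of  health", "2011-06-01"),
        ("Scottish Office", "2011-06-02")]
print(clean(rows))  # the two 'Dept of Health' rows collapse into one
```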
Tim Ireland’s presentation on the political uses of search-engine optimisation was typically sharp and typically amusing – I especially liked his point that the more specific a search term, the less likely it is to lead the searcher to a big newspaper website. Also, he made the excellent point that mass audiences and target audiences are substitutes for each other, and the ultimate target audience is one person – the MP (or whoever) themselves.
The Sukey workshop was very cool – much discussion about propagating data by SMS in a peer-to-peer topology, on the basis that everyone has a bucket of inclusive SMS messages and this beats paying through the nose for Clickatell or MBlox to send out bulk alerts. They are facing a surprisingly common mobile tech issue, which is that when you go mobile, most of the efficient push-notification technologies you can use on the Internet stop being efficient. If you want to use XMPP or SIP messaging, your problem is that the users’ phones have to maintain an active data connection and/or recreate one as soon after an interruption as possible. Mobile networks analogise an Internet connection to a phone call – the terminal requests a PDP (Packet Data Protocol) context from the network – and as a result, the radio in the phone stays in an active state as long as the “call” is going on, whether any data is being transferred or not.
This is the inverse of the way they handle incoming messages or phone calls – in that situation, the radio goes into a low power standby mode until the network side signals it on a special paging channel. At the moment, there’s no cross-platform way to do this for incoming Internet packets, although there are some device-specific ways of getting around it at a higher level of abstraction. Hence the interest of using SMS (or indeed MMS).
Their other main problem is the integrity of their data – even without deliberate disinformation, there’s plenty of scope for drivel, duplicates, cockups etc to get propagated, and a risk of a feedback loop in which the crap gets pushed out to users, they send it to other people, and it gets sucked up from Twitter or whatever back into the system. This intersects badly with their use cases – it strikes me, and I said as much, that moderation is a task that requires a QWERTY keyboard, a decent-sized monitor, and a shirt-sleeve working environment. You can’t skim-read through piles of comments on a 3″ mobile phone screen in the rain, nor can you edit them on a greasy touchscreen, and you certainly can’t do either while looking out that you don’t get hit over the head by the cops.
Fortunately, there is no shortage of armchair revolutionaries on the web who could actually contribute something by reviewing batches of updates, and once you have reasonably large buckets of good stuff and crap you can use Bayesian filtering to automate part of the process.
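The Bayesian filtering step could be as simple as this sketch, assuming you've already got hand-reviewed buckets of good stuff and crap to train on (all the example updates here are invented):

```python
import math
from collections import Counter

# Tiny Laplace-smoothed naive Bayes filter trained on hand-moderated
# "good" and "crap" updates; equal class priors assumed for the sketch.
def train(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def log_score(update, counts, vocab=1000):
    # log P(words | class), with add-one smoothing over an assumed vocab
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in update.lower().split())

good = train(["police at corner of main st", "kettle forming near bank"])
crap = train(["lol fake", "retweet this fake rumour lol"])
is_crap = log_score("fake rumour lol", crap) > log_score("fake rumour lol", good)
print(is_crap)  # True
```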
Francis Davey’s OneClickOrgs project is coming along nicely – it automates the process of creating an organisation with legal personality and a constitution and what not, and they’re looking at making it able to set up co-ops and other types of organisation.
I didn’t know that OpenStreetMap is available through multiple different tile servers, so you can make use of Mapquest’s CDN to serve out free mapping.
OpenCorporates is trying to make a database of all the world’s companies (they’re already getting on for four million), and the biggest problem they have is working out how to represent inter-company relationships, which have the annoying property that they are a directed graph but not a directed acyclic graph – it’s perfectly possible and indeed common for company X to own part of company Y which owns part of company X, perhaps through the intermediary of company Z.
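That X→Y→Z→X loop is exactly the case a naive tree-shaped schema can't represent. A depth-first-search sketch (company names and the edge list invented) shows how you'd at least detect such a cycle:

```python
# Detecting an ownership cycle with DFS. Edges run owner -> owned;
# the companies and shareholdings are made up for illustration.
def find_cycle(edges):
    graph = {}
    for owner, owned in edges:
        graph.setdefault(owner, []).append(owned)

    def dfs(node, path, visiting):
        for nxt in graph.get(node, []):
            if nxt in visiting:
                # close the loop at the point we first saw this node
                return path[path.index(nxt):] + [nxt]
            found = dfs(nxt, path + [nxt], visiting | {nxt})
            if found:
                return found
        return None

    for start in graph:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

# X owns part of Y, Y owns part of Z, Z owns part of X
print(find_cycle([("X", "Y"), ("Y", "Z"), ("Z", "X")]))  # ['X', 'Y', 'Z', 'X']
```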
OpenTech’s precursor, Notcon, was heavier on the hardware/electronics side than OT usually is, but this year there were quite a few hardware projects. However, I missed the one that actually included a cat.
What else? LinkedGov is a bit like ScraperWiki but with civil servants and a grant from the Technology Strategy Board. Francis Maude is keen. Kumbaya is an encrypted, P2P online backup application which has the feature that you only have to store data from people you trust. (Oh yes, and apparently nobody did any of this stuff two years ago. Time to hit the big brown bullshit button.)
As always, the day after is a bit of an enthusiasm killer. I’ve spent part of today trying to implement monthly results for my lobby metrics project and it looks like it’s much harder than I was expecting. Basically, NetworkX is fundamentally node-oriented and the dates of meetings are edge properties, so you can’t just subgraph nodes with a given date. This may mean I’ll have to rethink the whole implementation. Bugger.
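One possible way round it: since the dates live on the edges, filter the edge list by month first and rebuild the graph from what survives, rather than trying to subgraph nodes. A plain-dict sketch of the idea (the real project uses NetworkX, where you'd feed the filtered edge list back into a fresh graph):

```python
# Build a per-month graph by filtering edges on their date attribute.
# Edge format and the example meetings are assumed for illustration.
def monthly_graph(edges, month):
    # edges: (minister, lobbyist, {"date": "YYYY-MM-DD"})
    keep = [(u, v) for u, v, attrs in edges
            if attrs["date"].startswith(month)]
    graph = {}
    for u, v in keep:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

edges = [("Letwin", "LobbyCo", {"date": "2011-05-03"}),
         ("Maude", "LobbyCo", {"date": "2011-06-10"})]
print(sorted(monthly_graph(edges, "2011-06")))  # ['LobbyCo', 'Maude']
```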
I’m also increasingly tempted to scrape the competition’s meetings database into ScraperWiki as there doesn’t seem to be any way of getting at it without the HTML wrapping. Oddly, although they’ve got the Department of Health’s horrible PDFs scraped, they haven’t got the Scottish Office although it’s relatively easy, so it looks like this wouldn’t be a 100% solution. However, their data cleaning has been much more effective – not surprising as I haven’t really been trying. This has some consequences – I’ve only just noticed that I’ve hugely underestimated Oliver Letwin’s gatekeepership, which should be 1.89 rather than 1.05. Along with his network degree of 2.67 (the eighth highest) this suggests that he should be a highly desirable target for any lobbying you might want to do.
OKTrends has an amusing post, but what I like about it is that it’s consilient with the process I defined here. My idea was that songs that were rated 5 might be good, but might also just be violently weird to the reviewer. By the same logic, the same must be true of the 1s. Assuming that my tastes aren’t the same as the reviewer’s, the information in the reviews was whether the music was either mediocre or potentially interesting. The output is here.
The OKTrends people seem to have rediscovered the idea independently looking at dating profiles – it’s better to be ugly to some and beautiful to others than it is to be boringly acceptable to everybody.
Via Bruce Schneier’s blog, an interesting paper in PNAS on false positives and looking for terrorists. Even if the assumptions of profiling are valid, and the target-group really is more likely to contain terrorists, it still isn’t a good policy. Because the inter-group difference in the proportion of terrorists is small relative to the absolute scarcity of terrorists in the population, profiling means that you hugely over-sample the people who match the profile. Although it magnifies the hit-rate, it also magnifies the false positive rate, and because a search carried out on someone matching the profile is one not carried out elsewhere, it increases the chance of missing someone.
In fact, if you profile, you need to balance this by searching non-profiled people more often.
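The arithmetic is worth seeing in the small. A sketch with invented numbers (not the paper's): 100 terrorists in a population of 100 million, and a profile whose target group is ten times as likely, per capita, to contain one.

```python
# Base-rate arithmetic for profiling; all figures are toy assumptions.
population = 100_000_000
terrorists = 100
profiled_fraction = 0.1   # 10% of the population matches the profile
lift = 10                 # profiled people are 10x as likely, per capita

profiled = population * profiled_fraction
# split the terrorists so the profiled group is 10x denser per capita
t_profiled = terrorists * (lift * profiled_fraction) / (
    lift * profiled_fraction + (1 - profiled_fraction))
hit_rate = t_profiled / profiled
print(round(t_profiled), round(1 / hit_rate))  # 53 190000
```

Even with a generous tenfold lift, you'd search roughly 190,000 profiled people per terrorist found, and nearly half the terrorists are outside the profile anyway – which is the paper's point about balancing with non-profiled searches.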
The operators of Deepwater Horizon disabled a lot of alarms in order to stop false alarms waking everyone up at all hours. Shock! In some ways, though, that was better than this story about a US hospital, from comp.risks. There, a patient died when an alarm was missed. Why? Too many alarms, beeps, and general noise, and people had turned off some devices’ alarms in order to get rid of them.
Unlike Transocean, they had a solution – remove the off switches, because that way, they’ll damn well have to listen. At least the oil people didn’t think that would work. Of course, they didn’t think that if your warning system goes off so often that nobody can sleep when nothing unusual is going on, there’s something wrong with the system.
So the England Zombies are looking more like Fast Zombies again. If I’ve bored you by talking up James Milner, I’d like to take this opportunity to claim my bragging rights. Here’s something interesting; back at the weekend, in the depths of self-loathing, the Obscurer published a table showing the teams with various statistics, including shots on goal. It struck me that England were looking rather good on that, and that the top four looked mostly like a plausible semi-final line up. So I’ve put together a spreadsheet ranking the teams by shots on target/matches played.
(oh, for fucksake – it’s fucking google spreadsheets. wordpress.com, get a clue. here.)
Data source here. Having fifa.com in my browser history makes me feel dirty for some reason.
That puts England 5th in the world – quarter finals again – but ahead of all three possible opponents in the second round, Germany (7 on target/game vs. 7.333 – Google Spreadsheets is lax about sig figs), Ghana, and Serbia, and well ahead of the Netherlands and Italy. Further, out of the top four, Spain aren’t looking a cert to qualify out of their group, and they have an even worse tradition of World Cup choking than we do. This may be daft, sunshine and beer optimism; but it’s daft, sunshine and beer optimism with data.
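The ranking itself is a one-line sort. A sketch using the two rates quoted above (England 7.333 on target per game over three matches, i.e. 22 shots; Germany 7 per game, i.e. 21) plus an invented Spain row:

```python
# Rank teams by shots on target per match played. England and Germany
# figures are inferred from the rates quoted above; Spain is invented.
teams = {"England": (22, 3), "Germany": (21, 3), "Spain": (20, 3)}
ranked = sorted(teams, key=lambda t: teams[t][0] / teams[t][1], reverse=True)
print(ranked)  # ['England', 'Germany', 'Spain']
```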
Update: Well, would you look at that.
The Institute for Public Policy Research has issued a report on the correlates of BNP membership and support (pdf).
Fascinatingly, they reckon that there is very little or no correlation between BNP support and key socio-economic indicators like GVA per capita, growth, unemployment, immigration, etc. It’s as if a typical BNP supporter was, well, a case of free-floating extremism. (A dedicated swallower of fascism; an accident waiting to happen.)
Oddly enough, this replicates an earlier result.
The Nottingham University Politics blog has a more nuanced response, but I’m quite impressed by the fact that two analyses based on two different metrics of BNP support – votes in the IPPR study, membership in mine – converged on the same result.
Here’s something interesting. I grabbed the last 6 months’ worth of national opinion polls from Wellsy’s and graphed the Tory lead in percentage points. On the tiny chart below, you’ll observe that the mean is 10 points; the hatched area shows one standard deviation each side of the mean, and I’ve plotted a linear trend through it. (You can see a full-size version of it here.)
The interesting bit; there are 21 polls, out of 158, that showed a Conservative lead of more than one standard deviation greater than the mean. All of them occurred before the 29th of January. There are 24 that showed a lead more than one standard deviation less than the mean. 20 out of 24 occurred since the 19th of February. What on earth could have happened between these dates?
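The check itself is easy to replicate. A sketch on ten made-up leads rather than the 158 real polls: take the mean and standard deviation, then count the polls falling more than one standard deviation either side.

```python
import statistics

# Count polls more than one (population) standard deviation from the
# mean; the leads here are invented, not the real polling series.
leads = [14, 13, 12, 10, 10, 9, 8, 7, 6, 5]
mean = statistics.mean(leads)
sd = statistics.pstdev(leads)
high = [x for x in leads if x > mean + sd]
low = [x for x in leads if x < mean - sd]
print(mean, len(high), len(low))  # 9.4 2 2
```

With the real data, the interesting part is the timestamps on the `high` and `low` lists, not the counts.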
Well, this is hardly surprising; the FBI was in the habit of pretending to be on a terrorism case every time they wanted telecoms traffic data. Their greed for call-detail records is truly impressive. Slurp! Unsurprisingly, the lust for CDRs and the telcos’ eagerness to shovel them in rapidly got the better of their communications analysis unit’s capacity to crunch them.
Meanwhile, Leah Farrell wonders about the problems of investigating “edge-of-network” connections. Obviously, these are going to be the interesting ones. Let’s have a toy model; if you dump the CDRs for a group of suspects, 10 men in Bradford, and pour them into a visualisation tool, the bulk of the connections on the social network graph will be between the terrorists themselves, which is only of interest for what it tells you about the group dynamics. There will be somebody who gets a lot of calls from the others, and they will probably be important; but as I say, most of the connections will be between members of the group because that’s what the word “group” means. If the likelihood of any given link in the network being internal to it isn’t very high, then you’re not dealing with anything that could be meaningfully described as a group.
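The toy model amounts to no more than degree counting. A sketch with invented names and call pairs: pour the CDRs in, and the node with the most connections falls out.

```python
from collections import Counter

# Degree count over a toy CDR dump; callers, callees, and the call
# pairs are all invented for illustration.
calls = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E"), ("A", "C")]
degree = Counter()
for caller, callee in calls:
    degree[caller] += 1
    degree[callee] += 1
print(degree.most_common(1))  # [('B', 4)] -- the one who gets all the calls
```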
By definition, though, if you’re trying to find other terrorists, they will be at the edge of this network; if they weren’t, they’d either be in it already, or else they would be multiple hops away, not yet visible. So, any hope of using this data to map the concealed network further must begin at the edge of the sub-network we know about. And the principle that the ability to improve a design occurs primarily at the interfaces – which is also the prime location for screwing it up – points the same way.
But there’s a really huge problem here. The modelling assumptions are that a group is defined by being significantly more likely to communicate among itself than with any other subset of the phone book, that the group is small relative to the world around it, and that it is boring; everyone has roughly similar phoning behaviour, and therefore who they call is the question that matters. I think these are reasonable.
The problem is that it’s exactly at the edge of the network that the numbers of possible connections start to curve upwards, and that the density of suspects in the population falls. Some more assumptions; an average node talks to x others, with calls being distributed among them on a well-behaved curve. Therefore, the set of possibilities is multiplied by x for each link you follow outwards; even if you pick the top 10% of the calling distribution, you’re going to fall off the edge as the false positives pile up. After three hops and x=8, we’re looking at 512 contacts from the top 10% of the calling distribution alone.
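The fan-out arithmetic can be checked directly: following x contacts per node multiplies the candidate set by x at every hop.

```python
# Candidate-set growth per hop, with x = 8 contacts followed per node
# (the top 10% of the calling distribution, per the assumption above).
x = 8
for hops in range(1, 4):
    print(hops, x ** hops)  # ends with: 3 512
```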
In fact, it’s probably foolish to assume that suspects would be in the top 10% of the distribution; most people have mothers, jobs, and the like, and you also have to imagine that the other side would deliberately try to minimise their phoning or, more subtly, to flatten the distribution by splitting their communications over a lot of different phone numbers. Actually, one flag of suspicion might be people who were closely associated by other evidence who never called each other, but the false positive rate for that would be so high that it’s only realistically going to be hindsight.
Conclusions? The whole project of big-scale database-driven social network analysis is based on the wrong assumptions, which are drawn either from military signals intelligence or from classical policing. Military traffic analysis works because it assumes that the available signals are a subset of a much bigger total, and that this total is large compared to the world. This makes sense because that’s what the battlefield of electronic warfare is meant to look like – cleared of civilian activity, dominated by one side or the other’s military traffic. Working from the subset of enemy traffic that gets captured, it’s possible to infer quite a lot about the system it belongs to.
Police investigation works because it limits the search space and proceeds along multiple lines of enquiry; rather than pulling CDRs and assuming the three commonest numbers must be suspects, it looks for suspects based on the witness and forensic evidence of the case, and then uses other sources of data to corroborate or refute suspicion.
To summarise, traffic analysis works on the assumption that there is an army out there. We can only see part of it, but we can make inferences about the rest because we know there is an army. Police investigation works on the observation that there has been a crime, and the assumption that probably, only a small number of people are possible suspects.
So, I’m a bit underwhelmed by projects like this. One thing that social network datamining does, undoubtedly, achieve is to create handsome data visualisations. But this is dangerous; it’s an opportunity to mistake beauty for truth. (And they will look great on a PowerPoint slide!)
Another, more insidious, more sinister one is to reinforce the assumptions we went into the exercise with. Traffic-analysis methodology will produce patterns; our brains love patterns. But the surge of false positives means that once you get past the first couple of hops, essentially everything you see will be a false positive result. If you’ve already primed your mind with the idea that there is a sinister network of subversives everywhere, techniques like this will convince you even further.
Unconsciously, this may even be the purpose of the exercise – the latent content of Evan Kohlmann. At the levels of numbers found in telco billing systems, everyone will eventually be a suspect if you just traverse enough links.
Which reminded me of Evelyn Waugh, specifically the Sword of Honour trilogy. Here’s his comic counterintelligence officer, Colonel Grace-Groundling-Marchpole:
Colonel Marchpole’s department was so secret that it communicated only with the War Cabinet and the Chiefs of Staff. Colonel Marchpole kept his information until it was asked for. To date that had not occurred and he rejoiced under neglect. Premature examination of his files might ruin his private, undefined Plan. Somewhere, in the ultimate curlicues of his mind, there was a Plan.
Given time, given enough confidential material, he would succeed in knitting the entire quarrelsome world into a single net of conspiracy in which there were no antagonists, only millions of men working, unknown to one another, for the same end; and there would be no more war.
Want a positive idea? One reading of this and this would be that the failure of intelligence isn’t a failure to collect or analyse information about the world, or rather it is, but it is caused by a failure to collect and analyse information about ourselves.
Quite ridiculous microtale about the head of MI6’s wife being on Facebook. But what’s this, from Patrick Mercer MP?
The Conservative MP Patrick Mercer, who chairs the counter-terrorism sub-committee, said the mistake had left the Sawers family “extremely vulnerable”. Referring to Miliband’s suggestion that the incident was not significant, Mercer said: “If that is the case why has the site being taken down?” He also pointed out that military chiefs had warned that the Taliban get 80% of their intelligence from Twitter and Facebook.
Can he really believe this? Eighty per cent? What percentage of users of either are located in Afghanistan? I’m going to stick a target on the wall and say it’s much less than 1%, so this suggests that a very few people are very insecure indeed. Perhaps we could just ask the guy to knock it off, or post him to the Falklands?
I’d be surprised if 80 per cent of their intelligence didn’t come from informers, friendly civilians reporting where our patrols go, if not more. Rather like it did in Northern Ireland. And Patrick Mercer of all people ought to be well aware of the possibilities…
He’s got form for Chris Morris-esque nonsense, mind you; remember his role in the Glen Jenvey/Comedy Gladio affair? Some people are, indeed, very insecure indeed about the world of today, and it remains truly remarkable just what stuff a lot of MPs will happily read out to the camera without passing it through their brains. The question remains whether Facebook is a made-up Web site.
Following on from the last post, we’re unlikely to have funding to dose every school kid in Britain with radioactive markers and fMRI-scan them a term later to see how their neurons are getting on any time soon, even if you could get that past the ethics committee and the Nuclear Dread. So unless someone comes up with a field-expedient diagnostic test, we’ll need some other way of assessing the problem. Which means that this annoyed me.
So some firm decided to try analysing the primary school SAT results better. They broke down the UK into much smaller units than Local Education Authorities or even schools – neighbourhoods of 300 people on average. They then classified them into 24 groups based on demographic and socio-economic indicators, looked at the average results for each group, and arrived at an expected score for each school based on the distribution of those groups in the school’s intake. They then compared the actual results to see which schools were really doing better or worse.
And they got quite a lot of criticism for not using a database of pupils that…wait for it…the government won’t let them use. This is a pity. Ever since Pierre Bourdieu, we’ve been well aware that there is much more to class than money. With all that data, we could do a lot of interesting things; we could, for example, use principal components analysis to establish objectively defined groups and see how well schools are doing that way. We could benchmark them against the Flynn effect, and I suspect quite a lot of schools would turn out just to be tracking the gradual uplift overall. But if we can’t see the data we can’t do anything.
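The principal components idea could be sketched like this, on invented neighbourhood indicators (income, qualifications, overcrowding – not the firm's actual 24 groups): centre the data, take the singular value decomposition, and read off how much variation each derived axis explains.

```python
import numpy as np

# Minimal PCA via SVD on made-up neighbourhood indicators; the point is
# that the component axes are derived from the data, not hand-made.
data = np.array([[30.0, 0.40, 0.05],
                 [12.0, 0.10, 0.30],
                 [28.0, 0.35, 0.08],
                 [14.0, 0.12, 0.25]])
centred = data - data.mean(axis=0)
# principal components are the right singular vectors of the centred data
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / (singular_values**2).sum()
print(explained[0] > 0.9)  # one axis dominates this toy data
```

(In practice you'd standardise the columns first, since income swamps the fractional indicators here; the unscaled version keeps the sketch short.)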