Archive for the ‘Python’ Category

The question isn’t so much “did Eric Pickles eat all the pies?”, it’s “who paid for the pies, and how many did he declare in the register of members’ interests?”. TBIJ is on an absolute tear on Tory lobbying stories at the moment, and the combination of photo and caption for the Eric Pickles one is masterly.

But this story reveals more than it says. So, four cabinet ministers accepted donations to their private offices since May, 2010. Those would be William Hague, George Osborne, Liam Fox, and Michael Gove, or to put it another way, most of Atlantic Bridge and the core of the neo-conservative group within the Conservative Party. I do not think this is a coincidence.

Curiously, it seems that if you get donations to your private office you don’t also get them to your constituency party branch and vice versa, with the exceptions of George Osborne and Michael Gove, who would have more jam on it, wouldn’t they?

Pickles, for his part, received zero, which makes perfect sense. You can’t eat money, and as for spending it on unofficial advisers, that only makes sense if you ever take advice from other people, and the Bradford food-mountain has always known he’s right.

Meanwhile, Lord Astor of Hever turns up as a trustee of the Bridge and a pal of the Werritty-funding SAS walt, Iraq contract hunter, and intimate of mercenaries Tim Spicer and Anthony Buckingham.

I think I’ve said before that Astor of Hever came out of the Lobster Project proof-of-concept script as a surprisingly important gatekeeper – although he isn’t a major node in himself, people who meet him also tend to get one-to-one meetings with the most important ministers. His weighted network degree – a measurement of how many links in the lobbying network involve him, adjusted for how many people took part in the meetings – is 0.125, pretty low (78th in the league), but his gatekeepership metric is 2.533, the third highest overall and the very highest score for a minister with UK-wide responsibility. (I discount the gatekeepership numbers for Scottish and Welsh ministers, as their role is partly to represent Scottish and Welsh interests and they are structurally heavily lobbied.)

The gatekeepership metric in Lobster is a ratio of ratios: the average weighted network degree of the lobbies that met a given minister, relative to the average across all lobbies, divided by that minister’s own network degree relative to the average minister’s. This captures the degree to which meeting that minister was associated with meeting more or less important ones, while taking into account the fact that some ministerial jobs are more important than others. If it is greater than 1, you’re likely to get a boost; if less, you’re being heard out.
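For concreteness, here’s a minimal sketch of that calculation in NetworkX terms – not the Lobster Project’s actual code, and the ‘kind’ and ‘weight’ attribute names are my own assumptions about how the graph might be labelled.

import networkx as nx

def weighted_degree(G, n):
    return G.degree(n, weight='weight')

def gatekeepership(G, minister):
    # Assumes each node carries a 'kind' attribute of 'minister' or 'lobby'
    # (hypothetical labels, not the real Lobster schema).
    ministers = set(n for n, d in G.nodes(data=True) if d.get('kind') == 'minister')
    lobbies = set(n for n, d in G.nodes(data=True) if d.get('kind') == 'lobby')

    # Average weighted degree of the lobbies that met this minister...
    met = [n for n in G.neighbors(minister) if n in lobbies]
    met_avg = sum(weighted_degree(G, n) for n in met) / float(len(met))

    # ...relative to the average lobby, divided by this minister's own degree
    # relative to the average minister.
    lobby_avg = sum(weighted_degree(G, n) for n in lobbies) / float(len(lobbies))
    minister_avg = sum(weighted_degree(G, n) for n in ministers) / float(len(ministers))
    return (met_avg / lobby_avg) / (weighted_degree(G, minister) / minister_avg)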

A limitation is that obviously, the Prime Minister can’t help you meet a more important minister, so it doesn’t yet deal with the situation where you meet the PM to get your word across and are then referred to a junior minister for action. I accept that this is a problem, although you would expect that it is easier to lobby the small fry, so the metric is nevertheless useful. However, at a network degree of 0.125, Lord Astor is not affected by this phenomenon.

OK, so we have a prediction – other ministers involved with the Werritty/Fox/Atlantic Bridge case will demonstrate unusually high gatekeepership. Step forward Gerald Howarth MP, Minister for International Security Strategy, who achieves a gatekeepership of 2.36, the fourth highest overall and the second highest UK-wide, on a network degree of 1.2. That’s some pull, when you note that he’s a significant node in terms of quantity.

Lobster detected a sinister network of influence! How awesome is that?

My lobbying project has been entered in the Open Data Challenge! Someone posted this to the MySociety list, with rather fewer than the advertised 36 hours left. I was at a wedding and didn’t read it at the time. After my partner and I had tried to invent a tap routine to the back end of Prince’s “Alphabet Street” and had got up at 8am to make it for the sadistic bed & breakfast breakfast and gone back to help clean up and drink any unaccountably unconsumed champagne, and the only thing left to look forward to was the end of the day, I remembered the message and noted that I had to get it filed before midnight.

So it was filed in the Apps category – there’s an Ideas category but that struck me as pathetic, and after all there is some running code. I pushed on to try and get something out under the Visualisation category but ManyEyes was a bit broken that evening and anyway its network diagram view starts to suck after a thousand or so vertices.

As a result, the project now has a name and I have some thin chance of snagging an actual Big Society cheque for a few thousand euros and a trip to Brussels. (You’ve got to take the rough with the smooth.)

The most recent experiment with the Lobster Project – see, it’s got a name! It’s got you in its grip before you’re born…it lets you think you’re king when you’re really a prawn…whoops, wrong shellfish – was to try out a new centrality metric, networkx.algorithms.centrality.betweenness_centrality. This is defined as the fraction of the shortest paths between all the pairs of nodes in the network that pass through a given node. As you have probably guessed, this is quite an inefficient metric to compute, and the T1700 lappy took over a minute to crunch it, compared to 7 seconds to complete the processing script without it. Perhaps the new KillPad would do better, but the difference is big enough that it’s obviously my fault.
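By way of illustration, a quick timing sketch along those lines – using a stand-in graph, since the real lobbying network isn’t reproduced here, and modern NetworkX exposes the function as nx.betweenness_centrality:

import time
import networkx as nx

G = nx.karate_club_graph()   # stand-in for the lobbying network

t0 = time.time()
degree = dict(G.degree(weight='weight'))      # the cheap, weighted graph degree
t1 = time.time()
betweenness = nx.betweenness_centrality(G)    # fraction of all shortest paths through each node
t2 = time.time()

print('weighted degree: %.4fs, betweenness: %.4fs' % (t1 - t0, t2 - t1))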

Worth bothering with?

As far as I can see, though, it’s also not very useful. The results are correlated (R^2 = 0.64) with the infinitely faster weighted graph degree. (It also confirms that Francis Maude is the secret ruler of the world.)

The NX functions I’m really interested in, though, are the ones for clique discovery and blockmodelling. It’s obvious that with getting on for 3,000 links and more to come, any visualisation is going to need a lot of reduction. Blockmodelling basically chops your network into groups of nodes you provide and aggregates the links between those groups – it’s one way, for example, to get department level results.

But I’d be really interested to use empirical clique discovery to feed into blockmodelling – the API for the former generates a Python list of cliques, which are themselves lists of nodes, and the latter accepts a list of nodes or a list of lists (of nodes). Another interesting option might be to blockmodel by edge attribute, which would be a way of deriving results for the content of meetings via the “Purpose of meeting” field. However, that would require creating a list of unique meeting subjects, then iterating over it to create lists of nodes with at least one edge having that subject, and then shoving the resulting list-of-lists into the blockmodeller.

That’s a lorra lorra iteratin’ by anybody’s standards, even if, this being Python, most of it will end up being rolled up in a couple of seriously convoluted list comps. Oddly enough, it would be far easier in a query language or an ORM, but I’ve not heard of anything that lets you do SQL queries against an NX graph.
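Something like this, for instance – a toy graph rather than the real datastore, and ‘subject’ is my stand-in name for the “Purpose of meeting” edge attribute:

import networkx as nx

G = nx.MultiGraph()
G.add_edge('DECC', 'EDF Energy', subject='Energy policy')
G.add_edge('DECC', 'Greenpeace', subject='Energy policy')
G.add_edge('HMT', 'CBI', subject='Budget')

# One node list per unique meeting subject, rolled up into list comps.
subjects = set(d['subject'] for u, v, d in G.edges(data=True))
node_lists = [sorted(set(n for u, v, d in G.edges(data=True)
                           if d['subject'] == s
                           for n in (u, v)))
              for s in subjects]

# Empirically discovered cliques come out in the same list-of-lists shape,
# ready to hand to the blockmodeller.
cliques = list(nx.find_cliques(nx.Graph(G)))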

Having got this far, I notice that I’ve managed to blog my enthusiasm back up.

Anyway, I think it’s perhaps time for a meetup on this next week with Who’s Rob-bying.

So it was OpenTech weekend. I wasn’t presenting anything (although I’m kicking myself for not having done a talk on Tropo and Phono) but of course I was there. This year’s was, I think, a bit better than last year’s – the schedule filled up late on, and there were a couple of really good workshop sessions. As usual, it was also the drinking conference with a code problem (the bar was full by the end of the first session).

Things to note: everyone loves Google Refine, and I really enjoyed the Refine HOWTO session, which was also the one where the presenter asked if anyone present had ever written a screen-scraper and 60-odd hands reached for the sky. Basically, it lets you slurp up any even vaguely tabular data and identify transformations you need to clean it up – for example, identifying particular items, data formats, or duplicates – and then apply them to the whole thing automatically. You can write your own functions for it in several languages and have the application call them as part of the process. Removing cruft from data is always incredibly time consuming and annoying, so it’s no wonder everyone likes the idea of a sensible way of automating it. There’s been some discussion on the ScraperWiki mailing list about integrating Refine into SW in order to provide a data-scrubbing capability and I wouldn’t be surprised if it goes ahead.

Tim Ireland’s presentation on the political uses of search-engine optimisation was typically sharp and typically amusing – I especially liked his point that the more specific a search term, the less likely it is to lead the searcher to a big newspaper website. Also, he made the excellent point that mass audiences and target audiences are substitutes for each other, and the ultimate target audience is one person – the MP (or whoever) themselves.

The Sukey workshop was very cool – much discussion about propagating data by SMS in a peer-to-peer topology, on the basis that everyone has a bucket of inclusive SMS messages and this beats paying through the nose for Clickatell or MBlox to send out bulk alerts. They are facing a surprisingly common mobile tech issue, which is that when you go mobile, most of the efficient push-notification technologies you can use on the Internet stop being efficient. If you want to use XMPP or SIP messaging, your problem is that the users’ phones have to maintain an active data connection and/or recreate one as soon after an interruption as possible. Mobile networks analogise an Internet connection to a phone call – the terminal requests a PDP (Packet Data Protocol) context from the network – and as a result, the radio in the phone stays in an active state as long as the “call” is going on, whether any data is being transferred or not.

This is the inverse of the way they handle incoming messages or phone calls – in that situation, the radio goes into a low power standby mode until the network side signals it on a special paging channel. At the moment, there’s no cross-platform way to do this for incoming Internet packets, although there are some device-specific ways of getting around it at a higher level of abstraction. Hence the interest of using SMS (or indeed MMS).

Their other main problem is the integrity of their data – even without deliberate disinformation, there’s plenty of scope for drivel, duplicates, cockups etc to get propagated, and a risk of a feedback loop in which the crap gets pushed out to users, they send it to other people, and it gets sucked up from Twitter or whatever back into the system. This intersects badly with their use cases – it strikes me, and I said as much, that moderation is a task that requires a QWERTY keyboard, a decent-sized monitor, and a shirt-sleeve working environment. You can’t skim-read through piles of comments on a 3″ mobile phone screen in the rain, nor can you edit them on a greasy touchscreen, and you certainly can’t do either while looking out that you don’t get hit over the head by the cops.

Fortunately, there is no shortage of armchair revolutionaries on the web who could actually contribute something by reviewing batches of updates, and once you have reasonably large buckets of good stuff and crap you can use Bayesian filtering to automate part of the process.
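To give an idea of what that filtering could look like – a deliberately tiny, made-up sketch of a naive Bayes classifier over hand-moderated “good” and “crap” buckets, not anything Sukey actually runs:

import math
from collections import Counter

def train(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    return counts, sum(counts.values())

def log_likelihood(text, counts, total, vocab):
    # Laplace-smoothed log-likelihood of the update under one bucket's word model.
    return sum(math.log((counts[w] + 1.0) / (total + len(vocab)))
               for w in text.lower().split())

good = ["police line moving north up whitehall", "kettle forming at bank junction"]
crap = ["win a free ipad click here now", "totally unrelated nonsense"]

good_counts, good_total = train(good)
crap_counts, crap_total = train(crap)
vocab = set(good_counts) | set(crap_counts)

update = "kettle reported on whitehall"
score = (log_likelihood(update, good_counts, good_total, vocab)
         - log_likelihood(update, crap_counts, crap_total, vocab))
print('probably good' if score > 0 else 'probably crap')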

Francis Davey’s OneClickOrgs project is coming along nicely – it automates the process of creating an organisation with legal personality and a constitution and what not, and they’re looking at making it able to set up co-ops and other types of organisation.

I didn’t know that OpenStreetMap is available through multiple different tile servers, so you can make use of Mapquest’s CDN to serve out free mapping.

OpenCorporates is trying to make a database of all the world’s companies (they’re already getting on for four million), and the biggest problem they have is working out how to represent inter-company relationships, which have the annoying property that they are a directed graph but not a directed acyclic graph – it’s perfectly possible and indeed common for company X to own part of company Y which owns part of company X, perhaps through the intermediary of company Z.
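A toy illustration of why that matters, with invented companies and stakes – ownership is a directed graph, and it’s the possibility of cycles that breaks any DAG assumption:

import networkx as nx

ownership = nx.DiGraph()
ownership.add_edge('Company X', 'Company Y', stake=0.30)   # X owns 30% of Y
ownership.add_edge('Company Y', 'Company Z', stake=0.50)   # Y owns half of Z
ownership.add_edge('Company Z', 'Company X', stake=0.10)   # ...and Z owns a slice of X

print(nx.is_directed_acyclic_graph(ownership))   # False - not a DAG
print(list(nx.simple_cycles(ownership)))         # the X -> Y -> Z -> X loop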

OpenTech’s precursor, Notcon, was heavier on the hardware/electronics side than OT usually is, but this year there were quite a few hardware projects. However, I missed the one that actually included a cat.

What else? LinkedGov is a bit like ScraperWiki but with civil servants and a grant from the Technology Strategy Board. Francis Maude is keen. Kumbaya is an encrypted, P2P online backup application which has the feature that you only have to store data from people you trust. (Oh yes, and apparently nobody did any of this stuff two years ago. Time to hit the big brown bullshit button.)

As always, the day after is a bit of an enthusiasm killer. I’ve spent part of today trying to implement monthly results for my lobby metrics project and it looks like it’s much harder than I was expecting. NetworkX is fundamentally node-oriented, and the dates of meetings are edge properties, so you can’t just subgraph nodes with a given date. This may mean I’ll have to rethink the whole implementation. Bugger.
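One possible way round it, sketched below: rather than subgraphing nodes, rebuild a graph from just those edges whose date falls in the month in question. The ‘date’ and ‘weight’ attribute names here are assumptions, not the scraper’s real schema.

import datetime
import networkx as nx

def monthly_subgraph(G, year, month):
    # New multigraph containing only the edges (meetings) dated in that month.
    H = nx.MultiGraph()
    for u, v, data in G.edges(data=True):
        d = data.get('date')
        if d and d.year == year and d.month == month:
            H.add_edge(u, v, **data)
    return H

G = nx.MultiGraph()
G.add_edge('BIS', 'CBI', date=datetime.date(2010, 10, 14), weight=1)
october = monthly_subgraph(G, 2010, 10)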

I’m also increasingly tempted to scrape the competition‘s meetings database into ScraperWiki as there doesn’t seem to be any way of getting at it without the HTML wrapping. Oddly, although they’ve got the Department of Health’s horrible PDFs scraped, they haven’t got the Scottish Office even though it’s relatively easy, so it looks like this wouldn’t be a 100% solution. However, their data cleaning has been much more effective – not surprising as I haven’t really been trying. This has some consequences – I’ve only just noticed that I’ve hugely underestimated Oliver Letwin’s gatekeepership, which should be 1.89 rather than 1.05. Along with his network degree of 2.67 (the eighth highest) this suggests that he should be a highly desirable target for any lobbying you might want to do.

After this post and the outstanding response to it, I’ve just been working on the lobby project’s underpinnings, specifically to backport some data cleaning from the analyser script into the original scraper, and to fix the one-edge-per-row version of the scraper. As a result I’ve had to flush the datastore and also search out some URIs that have changed. So far we’ve recreated 931 out of 1,721 meetings, although we’re getting the dreaded “Execution status: run interrupted by a timeout”. Actually, we’ve got 1,747 meetings back, and we’ve got rid of some crap. Anyone wanting the dataset can get it from the Scraperwiki API here or here for linkwise rather than meetingwise (previously “coming soon”, now available) as either json-dict or csv. A full SQL syntax is available.

With luck, there will also be some more data quite soon. On the analysis score, notably, this and also this seem useful. The first estimates the value of a node based on its edges, which is fundamentally what I’m trying to achieve, and the second finds the cliques in the network a given node belongs to.

Regarding visualisation issues, I think one of my mistakes last time out was to visualise the data as a multigraph – i.e. a structure permitting zero or more links between each pair of nodes, so the same two nodes can be joined by multiple links. This invariably means a lot of links. The nature of the data – multiple meetings are absolutely central to the whole project, and lobbies meet ministers at different times and on different issues – enforces an underlying multigraph structure. But it would be possible to condense it for visualisation purposes – if we rolled up all the links between the same pair of nodes into one, we could tot up their weights and show that in the visualisation, as a thicker line for example.
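The roll-up itself is simple enough – a sketch, assuming each edge carries a ‘weight’ attribute (defaulting to 1 where it doesn’t):

import networkx as nx

def collapse(multigraph):
    # Fold parallel edges between the same pair of nodes into a single edge,
    # summing their weights so a visualiser can draw a thicker line.
    simple = nx.Graph()
    for u, v, data in multigraph.edges(data=True):
        w = data.get('weight', 1)
        if simple.has_edge(u, v):
            simple[u][v]['weight'] += w
        else:
            simple.add_edge(u, v, weight=w)
    return simple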

So OpenSUSE 11.4 was out this week. As the Jedi said here:

gah! suse is never totally easy

Indeed. I thought I’d do an online upgrade, so I scheduled this to happen when I was in the office and therefore had a fast Internet link available. I applied all the remaining 11.3 updates, configured the three additional repos, did a “zypper ref” and then a “zypper dup”, paged through the Flash player licence, and watched it report 500 odd MB of packages to grab. Much churning later, it started to miss packages, which I installed manually. Eventually, it finished, and I ran “zypper verify” to check it out. This reported that vim-data was missing, so I installed it, and went for a reboot.

Oh dear, the new distro apparently didn’t know what an ext4 filesystem was. And although I could still start 11.3 from the boot menu, KDE wasn’t working. So, back at home, I downloaded the ISO image (2 hours 20 odd minutes at home), burned a disc, and prepared for a clean install, which failed with a message about running out of processes in this runlevel. You guessed it, dodgy install media. Wiped and downloaded again. I check the MD5 hash. It’s a miss. I start the download again and go out. I come back to find the laptop has rebooted and has got to the failure point in 11.4. How? What? I restart in Windows and discover that 678 of 695MB has been fetched before something happened. It dawns on me that Microsoft has force-rebooted the bastard through Windows Update although I set it to do nothing of the sort. I’m getting seriously pissed off now. I download it again, from a different mirror (ox.ac.uk rather than Kent Uni mirrorservice.org). More hours. I check the MD5 hash. What do you know, it’s wrong. And it’s the same hash as last time. As an experiment, I burn it anyway, boot it, and run the media check utility.

Which fails at 63%, block 226192, in exactly the same location as the first time around. Riight, it looks like Novell has pushed a crappy image out to all the damn mirrors. Well, I can still get a Linux shell in 11.3, so I run it up, hook an ethernet cable to the linksys box, run dhclient, and repeat the command-line distro upgrade. Although Zypper still thinks all the dependencies are in place, when I tell it to “zypper dup”, it still manages to find 258 package changes left to do from the original upgrade. It takes an age, but eventually, completes, and it’s shutdown -r now time. And everything now works, right down to hibernated browser tabs.

Except for Python packages, of course. Pythonistas tend to dote on easy_install, but I’m still annoyed that I have to update this stuff out of sync with my Linux environment, especially as it lives in my root partition. Would it be so hard to put everything in PyPI into an RPM repository and never worry about it ever again? This is actually an important lesson about the mobile app stores, and the original app store itself, Firefox extensions. Freedom goes with structure.

Lessons from this: once an upgrade shows any signs of weirdness, abort it and start again. And don’t expect online upgrade to work first time – this happened to me with a past OpenSUSE upgrade, come to think of it, but I clearly learned nothing.

Things to get out of the data in this scraper of mine: for each lobby, the monthly meeting counts, degrees in the weighted multigraph, impact factor (i.e. graph degree/meetings to give an idea of productivity), most met ministers, most met departments, topics. For each ministry, meeting counts, most met lobbies, most discussed topics. For each PR agency (Who’s Lobbying had or has a list of clients for some of them), the same metrics as for lobbies. Summary dashboard: top lobbies, top lobbyists, top topics, graph visualisation, top 10 rising and falling lobbies by impact.
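As a rough sketch of the per-lobby end of that list – the field names (‘organisation’, ‘minister’, ‘date’) are my guesses at a cleaned-up row format, not the releases’ actual headers:

from collections import Counter

def lobby_summary(rows, G, lobby):
    mine = [r for r in rows if r['organisation'] == lobby]
    monthly = Counter(r['date'][:7] for r in mine)    # assumes ISO dates, e.g. '2010-10'
    degree = G.degree(lobby, weight='weight')         # weighted multigraph degree
    return {
        'monthly_meetings': dict(monthly),
        'degree': degree,
        'impact': degree / float(len(mine)),          # graph degree per meeting
        'most_met_ministers': Counter(r['minister'] for r in mine).most_common(5),
    }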

Things I’d like to have but aren’t sure how to implement: a metric of gatekeeper-ness for ministers, for example, how often a lobby met a more powerful minister after meeting this one, and its inverse, a metric of how many low-value meetings a minister had. I’ve already done some scripting for this, and NetworkX will happily produce most of the numbers, although the search for an ideal charting solution goes on. Generating the graph and subgraphs is computationally expensive, so I’m thinking of doing this when the data gets loaded up and storing the results, rather than doing the sums at runtime.

Where’s that Django tutorial? Unfortunately it’s 7.05 pm on Sunday and it’s looking unlikely I’ll do it this weekend…

Oh yes, so the IBM ManyEyes people fixed their computer.


I’ve got much more data now – I still need to do the four (key) departments that release in PDF format, and flush the existing stuff to replace the records with ones with standardised dates, but that should give you an idea. Hit the button in the visualisation with a network on it to redraw the force-directed graph.

So I was moaning about the Government and the release of lists of meetings with external organisations. Well, what about some action? I’ve written a scraper that aggregates all the existing data and sticks it in a sinister database. At the moment, the Cabinet Office, DEFRA, and the Scottish Office have coughed up the files and are all included. I’m going to add more departments as they become available. Scraperwiki seems to be a bit sporky this evening; the whole thing has run to completion, although for some reason you can’t see all the data, and I’ve added the link to the UK Open Government Licence twice without it being saved.

A couple of technical points: to start with, I’d like to thank this guy who wrote an alternative to Python’s csv module’s wonderful DictReader class. DictReader is lovely because it lets you open a CSV (or indeed anything-separated value) file and keep the rows of data linked to their column headers as python dictionaries. Unfortunately, it won’t handle Unicode or anything except UTF-8. Which is a problem if you’re Chinese, or as it happens, if you want to read documents produced by Windows users, as they tend to use Really Strange characters for trivial things like apostrophes (\x92, can you believe it?). This, however, will process whatever encoding you give it and will still give you dictionaries. Thanks!
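Something along these lines, in other words – not the linked module itself, just a minimal Python 2 sketch of the same idea: decode each row with whatever encoding you’re given, and still hand back dictionaries keyed by column header.

import csv

def unicode_dict_reader(f, encoding='cp1252', **kwargs):
    # Wraps csv.DictReader (which only deals in bytes under Python 2) and
    # decodes every value, so \x92-style Windows apostrophes come out right.
    for row in csv.DictReader(f, **kwargs):
        yield dict((key, value.decode(encoding) if value else value)
                   for key, value in row.iteritems())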

I also discovered something fun about ScraperWiki itself. It’s surprisingly clever under the bonnet – I was aware of various smart things with User Mode Linux and heavy parallelisation going on, and I recall Julian Todd talking about his plans to design a new scaling architecture based on lots of SQLite databases in RAM as read-slaves. Anyway, I had kept some URIs in a list, which I was then planning to loop through, retrieving the data and processing it. One of the URIs, DEFRA’s, ended like so: oct2010.csv.

Obviously, I liked the idea of generating the filename programmatically, in the expectation of future releases of data. For some reason, though, the parsing kept failing as soon as it got to the DEFRA page. Weirdly, what was happening was that the parser would run into a chunk of HTML and, obviously enough, choke. But there was no HTML. Bizarre. Eventually I thought to look in the Scraperwiki debugger’s Sources tab. To my considerable surprise, all the URIs were being loaded at once, in parallel, before the processing of the first file began. This was entirely different from the flow of control in my program, and as a result, the filename was not generated before the HTTP request was issued. DEFRA was 404ing, and because the csv module takes a file object rather than a string, I was using urllib.urlretrieve() rather than urlopen() or scraperwiki.scrape(). Hence the HTML.
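For what it’s worth, StringIO would have bridged the file-object gap – a sketch only, with an invented URL rather than DEFRA’s real one, showing how the string from scraperwiki.scrape() can be handed to csv as a file object:

import csv
from StringIO import StringIO
import scraperwiki

url = 'http://example.gov.uk/transparency/oct2010.csv'   # hypothetical URL
data = scraperwiki.scrape(url)            # fetch the body as a string, no temp file needed
for row in csv.DictReader(StringIO(data)):
    print(row)                            # stand-in for the real per-row processing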

So, Scraperwiki does a silent optimisation and loads all your data sources in parallel on startup. Quite cool, but I have to say that some documentation of this feature might be nice, as multithreading is usually meant to be voluntary :-)

TODO, meanwhile: at the moment, all the organisations that take part in a given meeting are lumped together. I want to break them out, to facilitate counting the heaviest lobbyists and feeding visualisation tools. Also, I’d like to clean up the “Purpose of meeting” field so as to be able to do the same for subject matter.

Update: Slight return. Fixed the unique keying requirement by creating a unique meeting id.

Update Update: Would anyone prefer if the data output schema was link-oriented rather than event-oriented? At the moment it preserves the underlying structure of the data releases, which have one row for each meeting. It might be better, when I come to expand the Name of External Org field, to have a row per relationship, i.e. edge in the network. This would help a lot with visualisation. In that case, I’d create a non-unique meeting identifier to make it possible to recreate the meetings by grouping on that key, and instead have a unique constraint on an identifier for each link.
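To make that concrete, a sketch of the link-oriented expansion – the separators and the extra field names below are guesses, not the releases’ real schema:

import re

def split_into_links(meeting):
    # One row (edge) per organisation named in the lumped-together field,
    # with a unique link id plus the non-unique meeting id to group on.
    orgs = re.split(r',|;| and ', meeting.get('Name of External Org', ''))
    for n, org in enumerate(o.strip() for o in orgs if o.strip()):
        yield {
            'link_id': '%s-%d' % (meeting['meeting_id'], n),
            'meeting_id': meeting['meeting_id'],
            'organisation': org,
            'minister': meeting.get('Minister'),     # hypothetical field name
            'date': meeting.get('Date'),             # hypothetical field name
        }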

Update Update Update: So I made one.

Progress update on fixing the Vfeed.

Dubai Airport has done something awful to their Web site. Where once flights were organised in table rows with class names like “data-row2”, now exactly half the flights are like that; the flights have been split between separate arrival, departure, and cargo-only pages; each page only shows the latest dozen or so movements; and the rows that aren’t “data-row2” have no class attributes at all, just random HTML colours.

And the airline names have disappeared, replaced by their logos as GIFs. Unhelpful, but then, why should they want to help me?

Anyway, I’ve solved the parsing issue with the following horrible hack.
output = [[td.string or td.img["src"] for td in tr.findAll(True) if td.string or td.img]
          for tr in soup.findAll('tr', bgcolor=lambda value: value in ('White', '#F7F7DE'))]

As it happened, I later realised I didn’t need to bother grabbing the logo filenames in order to extract airline identifiers from them, so the td.img["src"] bit can be dropped.

But it looks like I’m going to have to do the lookup from ICAO or IATA identifiers to airline names myself – necessary if I’m to avoid remaking the whitelist, the database, and the stats script. Fortunately, there’s a list on Wikipedia. The good news is that I’ve come up with a way of differentiating the ICAO and the IATA codes in the flight numbers. ICAO codes are always three alphabetical characters; IATA codes are two alphanumeric characters, and aren’t necessarily globally unique. In a flight number, either can be followed by a number of variable length.

But if the third character in the flight number is a digit, the first two characters must be an IATA identifier; if it’s a letter, the first three must be an ICAO identifier.
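In code, that rule is about a one-liner – the flight numbers below are just examples of the two shapes:

def carrier_code(flight_number):
    # Third character a digit => two-character IATA prefix; otherwise three-letter ICAO.
    if flight_number[2].isdigit():
        return 'IATA', flight_number[:2]
    return 'ICAO', flight_number[:3]

assert carrier_code('EK0201') == ('IATA', 'EK')    # Emirates by IATA code
assert carrier_code('UAE201') == ('ICAO', 'UAE')   # Emirates by ICAO code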

a pain in the arse…

Oh Gawd, this is precisely one of the things I hate about Symbian S60 development.

OK, so I’ve now got a version of the PythonForS60 runtime that doesn’t require a note from my parents or Jack Straw or God or someone to use GPS; but at some point they’ve pushed out an update to the phone which means I no longer get to choose where things sent to it are saved. A .py file is treated as a plain text file, which is OK, but this means that it gets saved as a “note”. What this means in practice is that it doesn’t appear anywhere in the damn filesystem. (If anyone knows where the bloody things are saved, and how to get at them, thanks in advance.)

So even having got away from the invisible Finnish policemen, I still can’t run my own fucking code already without installing the 9 billion gigabyte Windows-only SDK, “signing up” for God knows how many vacuous “beta user accounts”, and generally hopping about like a blue arsed fly doing absolutely nothing productive. Can we please please please get away from this crap? Can we?