Archive for the ‘WhoseKidAreYou’ Category

OK, so I did two things – I upgraded to OpenSUSE 11.2/KDE 4.3, which is great, and I’ve installed SQUIN, the semantic Web query server, on my laptop in order to work on WhoseKidAreYou. The concept of SQUIN is that it provides a SPARQL endpoint for running queries over the various interlinked sets of data that conform to the Linked Data standard.

So, I should be able to pull data from the FOAF db, from DBpedia, and all sorts of other stuff in the same query statement. Cool. And you’ve got to hand it to them, as well, the install is almost comically easy. But, as with SPARQL in general, there are things I’m not getting. The idea of Linked Data is that you should be able to follow links from a record retrieved from one DB into another related one – for example, if the DBpedia record for somebody contains FOAF information, the query client should note the link, recurse along it into FOAF, and get you any information that matches your query that’s in FOAF as well as DBpedia.

You’d think that the main problem would be constraining the search and filtering the results. Essentially, I’m trying to replicate the behaviour of a cynical and intelligent person searching the Web for the authors of everything they read, and it’s obvious that someone doing that uses most of their brain effort to sieve the search results. Similarly, if you’re writing a SQL query to pull data out of a classical relational database, your biggest concern is usually how to filter, reduce, group, aggregate, summarise, or limit the volume of data that comes back.

But I find the difficult bit with SPARQL is maximising the volume of data that comes back. It’s incredibly easy to get nothing at all for quite trivial queries. Another thing is that if one of the variables in the query doesn’t match, none of them do, and the query will return nothing. You can use the OPTIONAL keyword, but as far as I can see, you need to OPTIONAL each and every statement. The syntax is annoyingly “almost, but not quite, entirely unlike SQL” and it’s oddly difficult to get a data variable, rather than a URI, into your query.
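To make the all-or-nothing behaviour concrete, here is a minimal sketch in plain JavaScript rather than SPARQL. It assumes the usual SQL analogy (required patterns behave like an inner join, OPTIONAL like a left outer join), and the data and function names are invented for illustration.

```javascript
// Toy data, invented for illustration: two people, only one of whom has a page.
const people = [
  { person: "Alice", influencedBy: "Bob" },
  { person: "Carol", influencedBy: "Bob" },
];
const pages = [{ person: "Alice", page: "http://example.org/alice" }];

// Every pattern required: a row survives only if ALL patterns match,
// so Carol (who has no page) disappears from the results entirely.
function requiredJoin(left, right) {
  const rows = [];
  for (const l of left) {
    for (const r of right) {
      if (l.person === r.person) rows.push({ ...l, page: r.page });
    }
  }
  return rows;
}

// OPTIONAL: keep the left-hand row even when nothing matches,
// leaving the optional variable unbound (null here).
function optionalJoin(left, right) {
  const rows = [];
  for (const l of left) {
    const matches = right.filter((r) => r.person === l.person);
    if (matches.length === 0) rows.push({ ...l, page: null });
    for (const r of matches) rows.push({ ...l, page: r.page });
  }
  return rows;
}

console.log(requiredJoin(people, pages).length); // 1
console.log(optionalJoin(people, pages).length); // 2
```

Seen this way, wrapping each uncertain statement in its own OPTIONAL is less a quirk than the only way to say "keep the row anyway".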

Also, I find the Linked Data element of this a little hard to visualise. Presumably, if you want to query across datasets, you need to use prefixed namespaces that are common to them all. I think, but I’m not sure, that you can mix multiple prefixed namespaces.

Regarding SQUIN itself, I’m also suspicious that the queries return very, very fast; there’s not enough time for it to be doing any recursing that involves multiple network round trips. Here’s an example:

PREFIX foaf:
PREFIX dbproperty:
PREFIX dbresource:
SELECT ?influenced ?page ?knows ?knowspage
WHERE {
?name dbproperty:Name dbresource:Martin_Amis .
?influenced dbproperty:influencedBy ?name .
OPTIONAL
{
?page foaf:page ?influenced .
}
OPTIONAL
{
?knows foaf:knows ?influenced .
?knowspage foaf:page ?knows .
}
}

This should declare the query variables in the SELECT clause, get the value of the Person/Name property of the DBpedia article Martin_Amis, bind it to ?name, then get all the values of the Person/influencedBy property that match ?name, bind them to ?influenced, and then the FOAF:Page values that match ?influenced. We’re then going to query FOAF for the FOAF:Knows values for each of the influenced, and their home pages.
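One thing that might be worth checking against the query above, offered as a guess rather than a diagnosis: a triple pattern only matches in subject, predicate, object order, so which side of the property a variable sits on matters. The sketch below is a toy matcher over invented triples (the ex: names and the example.org URL are made up, and this is not how SQUIN or DBpedia work internally).

```javascript
// Invented toy triples in [subject, predicate, object] order; not real DBpedia data.
const toyTriples = [
  ["ex:Author_B", "ex:influencedBy", "ex:Author_A"],
  ["ex:Author_B", "foaf:page", "http://example.org/wiki/Author_B"],
];

// Match a single pattern against the triples. A term starting with "?" is a
// variable and binds to whatever is in that position; any other term must
// equal the value in that position exactly.
function matchPattern(pattern, triples) {
  const bindings = [];
  for (const triple of triples) {
    const binding = {};
    let ok = true;
    for (let i = 0; i < 3 && ok; i++) {
      const term = pattern[i];
      if (term.startsWith("?")) {
        if (term in binding && binding[term] !== triple[i]) ok = false;
        else binding[term] = triple[i];
      } else if (term !== triple[i]) {
        ok = false;
      }
    }
    if (ok) bindings.push(binding);
  }
  return bindings;
}

// The resource is the subject and the page is the object: one match.
const forward = matchPattern(["ex:Author_B", "foaf:page", "?page"], toyTriples);
// The same terms the other way round: no match, although the data is there.
const reversed = matchPattern(["?page", "foaf:page", "ex:Author_B"], toyTriples);

console.log(forward.length, reversed.length); // 1 0
```

If the real data stores the page as the object of foaf:page, a pattern written with the page variable as the subject would come back empty in just this way.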

As it’s uncertain whether they have them, it’s an OPTIONAL clause, as is the one that gets the foaf:pages in the first place. DBpedia’s SNORQL interface chokes on the reference to Martin Amis (who wouldn’t). SQUIN considers it valid SPARQL, but produces no results whatsoever. If you browse over here, you’ll find that all the values involved are present and as described; and, indeed, the first influencedBy has a foaf:page attribute. In general, semantic web things seem to be good at failing to return data they actually have under the attributes they have for it.

What is it that I’m missing? Is there a huge tarball of data I need to load in SQUIN? Surely the point of Linked Data and semantics is that you don’t have to scrape the Web and snarf it all into a big database, but rather treat data on Web sites as if it were in a database?

wkay – update

So where’s WhoseKidAreYou? “Well, I’m working on it” is the short answer. I have recently reorganised the code in the user script, and I’ve been fiddling with Sindice, a semantic/linked data search engine. I’m fairly certain, however, that the first version out will work like this.

The user script tries a range of XPath and DOM parsing options to obtain a byline and identify the element in the page that contains it; it then does various paper-specific things to clean up the data and convert it to wiki-style Name_Surname, and templates this into the query. The query is fired as an XMLHttpRequest in the background – because you can do cross-domain requests inside Greasemonkey – and the page renders anyway while waiting. When the queries to Sindice/DBpedia/Sourcewatch/Tobacco Archives come back, the results get templated into a chunk of HTML, if they contain hits, and then this is used to replace the byline.
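The steps above can be sketched roughly as follows. This is not the actual user script: the helper names, the query shape, the markup and the DBpedia resource namespace are all assumptions made for illustration. GM_xmlhttpRequest is the real Greasemonkey call for cross-domain requests, and it is only referenced inside sendQuery so that the rest runs outside a browser.

```javascript
// Hypothetical helper: strip a leading "By", collapse whitespace,
// and join the words with underscores, wiki-style.
function toWikiName(byline) {
  return byline.replace(/^\s*by\s+/i, "").trim().split(/\s+/).join("_");
}

// Template the name into a query. The resource namespace is an assumption.
function buildQuery(wikiName) {
  const iri = "<http://dbpedia.org/resource/" + encodeURIComponent(wikiName) + ">";
  return "SELECT ?property ?value WHERE { " + iri + " ?property ?value . } LIMIT 50";
}

// SPARQL endpoints conventionally accept the query as a "query" GET parameter.
function buildRequestUrl(endpoint, query) {
  return endpoint + "?query=" + encodeURIComponent(query);
}

function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;")
    .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Template the hits into HTML, or fall back to a Wikipedia edit link.
function renderByline(wikiName, hits) {
  const label = escapeHtml(wikiName.replace(/_/g, " "));
  if (hits.length === 0) {
    const editUrl = "https://en.wikipedia.org/w/index.php?title=" +
      encodeURIComponent(wikiName) + "&action=edit";
    return '<span class="wkay">' + label + ' (<a href="' + escapeHtml(editUrl) +
      '">add them to Wikipedia</a>)</span>';
  }
  const items = hits.map((hit) => "<li>" + escapeHtml(hit) + "</li>").join("");
  return '<div class="wkay"><strong>' + label + "</strong><ul>" + items + "</ul></div>";
}

// Greasemonkey only: fire the request in the background and hand back
// the bindings from a SPARQL JSON result set.
function sendQuery(url, onResults) {
  GM_xmlhttpRequest({
    method: "GET",
    url: url,
    headers: { Accept: "application/sparql-results+json" },
    onload: (response) => onResults(JSON.parse(response.responseText).results.bindings),
  });
}

const wikiName = toWikiName("By Hugo Rifkind");
const requestUrl = buildRequestUrl("http://dbpedia.org/sparql", buildQuery(wikiName));
console.log(wikiName); // Hugo_Rifkind
```

Inside the user script, the string returned by renderByline would replace the byline element once sendQuery calls back.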

Otherwise, a default element pointing you to a Wikipedia edit page will be inserted. That way, we get a triangular feedback going.

No blogging this weekend, due to engineering work. Specifically, it’s quite incredible the difference between the amount of blog I can produce in an afternoon and the amount of code, which sort of bears out all the misgivings you might have about the whole blogging project. (Not that those wouldn’t have been better raised in 2002, but there you go.)

It was WhoseKidAreYou taking up the time; I’m learning steadily about SPARQL and various JavaScript things, notably XPath, which I have to agree is a pretty cool way of dismantling, remantling, and generally fiddling with HTML/XML documents, even compared to BeautifulSoup. (You can search for a pattern – for example, anything with the class attribute “byline” – and then index into the results by a filesystem-like / notation, which is handy when the material you need is inside a sensibly named entity but wrapped in random tags, a surprisingly common antipattern.) I’ve also identified 11 newspapers’ patterns for bylines, and in the cases where the metadata is in the meta tags, like it should be, I’ve also identified where the byline block appears in the text.

For example – the Torygraph puts the name of the author in a meta name="author" content="A. N. Other", and they then have a div class="byline", but the byline div also includes the timestamp, so we’re getting the byline from the meta tags to save post-processing it and then identifying the div for later.
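As a rough sketch of that two-step lookup, assuming the page looks exactly as described: the function names are mine, and findByline only works in a browser or user script, where document.evaluate and XPathResult exist.

```javascript
// XPath for any element whose class list contains the given class name.
// The concat/normalize-space dance avoids matching e.g. "bylineExtra".
function classXPath(className) {
  return "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
}

// XPath for the content attribute of <meta name="author" ...>.
const AUTHOR_META_XPATH = "//meta[@name='author']/@content";

// Browser only: read the author name from the meta tag (no timestamp to
// strip) and remember the byline element so it can be replaced later.
function findByline(doc) {
  const author = doc
    .evaluate(AUTHOR_META_XPATH, doc, null, XPathResult.STRING_TYPE, null)
    .stringValue.trim();
  const element = doc
    .evaluate(classXPath("byline"), doc, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null)
    .singleNodeValue;
  return { author, element };
}
```

In the user script this would be called as findByline(document); other papers would presumably need their own expressions in place of these two.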

Fair enough; the next problem is the SPARQL query, which seems to be remarkably ticklish and easy to break. The problem with this semantic web stuff is that it’s so damn semantic; everything wants very closely specifying. In theory, it should be possible to grab a whole variety of data on the overentitled brat in question – employment, publications, criminal record, whatever. Which is nice.

The downside is, though, that DBpedia is dependent on decent infoboxes in Wikipedia articles to work. So if you want to help with WKAY (I like the acronym – sounds like a Mexican radio station in a Jack Kerouac novel) and you aren’t coding, why not go and contribute relatives to Wikipedia?

Actually, I don’t think the Wikimedia Foundation will let you do that, even if The Register likes to call them a cult. I mean, contribute other people’s relatives. No. No. No slavery or grave-robbing, please. I mean, go and edit prominent idiots’ Wikipedia entries and record whose kids they are, and pretty up the info boxes.

Speaking of info boxes, at the moment they are the best paradigm I can think of for displaying the data when we get it. It’s pretty trivial to template HTML in Greasemonkey and to replace elements on the page with it, but I want it to look good. If Dan Lockton wants to join the Ggroup, that would be very helpful.

I’d like to introduce you to a new project. The other day, I was reading an imbecilic union-bashing editorial by one “Hugo Rifkind”, and I wondered… whose kid are you? Wikipedia informed me that diary columnist (it’s like a journalist but not quite) Rifkind is indeed the former Defence and Foreign Secretary’s son, and he’s “written” a “book” about “the London media world” called Overexposure (corrected from “Overexposed”), which kicks the bottom out of the rotting barrel of satire.

And there, I had it – we need a Web site to monitor nepotism, and backscratching influence-peddling more generally. WhoseKidAreYou! There’s been quite a lot of work on designing machine-readable ways of expressing relationships between people, but to start with, I reckon we need a decent wiki server or else perhaps a Django install, and the British journalists section of Wikipedia as a start. We can crowdsource the rest; we’ve got bitterness and resentment on our side, plus a powerful kicker of personal loathing!

We’ll need to hold basic biographical data, plus job and publication history, a link to corresponding Wikipedia data, and of course, the crucial affiliations. Not just WhoseKidAreYou, but also WhoseThinktankDoYou“Work”For. Once we’ve got a reasonable amount of data, we can think about social-graph visualisations and other fancy twirls; we could also do a browser extension that picks out bylines, searches the DB in background, and shows a notification. “Did you know this was written by Christopher Hitchens’ illegitimate son, working for a thinktank founded by Douglas Murray?”

I am deadly serious about this, and I would like your comments. The project isn’t really suited to MySociety.org – it’s far from neutral and it’s explicitly partisan and generally vicious – so it’ll have to be unilateral. I’ve set up a Google group (aka a mailing list/usenet group) over here.

UPDATE: More is here, including how to take part.

UPDATE UPDATE: Hugo Rifkind has been in touch, to point out that I misspelled the title of his book.