SQUINing like the proverbial 747
OK, so I did two things: I upgraded to OpenSUSE 11.2/KDE 4.3, which is great, and I installed SQUIN, the semantic Web query server, on my laptop in order to work on WhoseKidAreYou. The concept of SQUIN is that it provides a SPARQL endpoint for queries over the various interlinked sets of data that follow the Linked Data principles.
So, I should be able to pull data from the FOAF db, from DBpedia, and all sorts of other stuff in the same query statement. Cool. And you’ve got to hand it to them: the install is almost comically easy. But, as with SPARQL in general, there are things I’m not getting. The idea of Linked Data is that you should be able to follow links from a record retrieved from one DB into another related one. For example, if the DBpedia record for somebody contains FOAF information, the query client should note the link, recurse along it into FOAF, and get you any information matching your query that’s in FOAF as well as DBpedia.
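To make the "note the link, recurse along it" idea concrete, here is a minimal sketch of link traversal in Python. It is not how SQUIN is implemented; the FAKE_WEB dictionary, the example.org URIs and the dereference function are all hypothetical stand-ins for real HTTP lookups, so that the sketch runs offline.

```python
from collections import deque

# Hypothetical stand-in for the Web: each URI maps to the triples a server
# would return when that URI is dereferenced. All data here is invented.
FAKE_WEB = {
    "http://example.org/dbpedia/Alice": [
        ("http://example.org/dbpedia/Alice", "name", "Alice"),
        ("http://example.org/dbpedia/Alice", "sameAs",
         "http://example.org/foaf/alice"),
    ],
    "http://example.org/foaf/alice": [
        ("http://example.org/foaf/alice", "knows",
         "http://example.org/foaf/bob"),
    ],
    "http://example.org/foaf/bob": [
        ("http://example.org/foaf/bob", "name", "Bob"),
    ],
}


def dereference(uri):
    """Pretend HTTP GET: return the triples published at `uri`, if any."""
    return FAKE_WEB.get(uri, [])


def traverse(seed, max_documents=10):
    """Breadth-first link traversal: fetch a URI, collect its triples,
    then queue every URI mentioned in those triples."""
    seen, queue, triples = set(), deque([seed]), []
    while queue and len(seen) < max_documents:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in dereference(uri):
            triples.append((s, p, o))
            for term in (s, o):
                # Only URIs can be followed; literals such as "Bob" cannot.
                if term.startswith("http://") and term not in seen:
                    queue.append(term)
    return triples, seen


triples, visited = traverse("http://example.org/dbpedia/Alice")
print(len(visited), "documents fetched,", len(triples), "triples collected")
```

In a real client every call to dereference would be a network round trip, which is why the max_documents cap is there: without some limit, following every link could go on indefinitely.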
You’d think that the main problem would be constraining the search and filtering the results. Essentially, I’m trying to replicate the behaviour of a cynical and intelligent person searching the Web for the authors of everything they read, and it’s obvious that someone doing that uses most of their brain effort to sieve the search results. Similarly, if you’re writing a SQL query to pull data out of a classical relational database, your biggest concern is usually how to filter, reduce, group, aggregate, summarise, or limit the volume of data that comes back.
But I find the difficult bit with SPARQL is maximising the volume of data that comes back. It’s incredibly easy to get nothing at all for quite trivial queries. Another thing is that if one of the statements in the query doesn’t match, none of them do, and the query will return nothing. You can use the OPTIONAL keyword, but as far as I can see, you need to OPTIONAL each and every statement. The syntax is annoyingly “almost, but not quite, entirely unlike SQL” and it’s oddly difficult to get a plain data value, rather than a URI, into your query.
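The all-or-nothing behaviour is easier to see in a toy model. The sketch below is a few lines of Python, not a SPARQL engine: the data, the match and join helpers, and the property names are all invented for illustration. It treats each statement as a join and OPTIONAL as a left join, which is roughly how the SPARQL specification describes them.

```python
# Toy data: (subject, predicate, object). Note that nobody has a homepage.
DATA = [
    ("amis", "name", "Martin Amis"),
    ("amis", "influencedBy", "someone"),
]


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`.
    Terms starting with '?' are variables. Returns a new dict or None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return result


def join(bindings, pattern, data, optional=False):
    """Extend every binding with every triple matching `pattern`.
    A required pattern drops bindings that match nothing; an optional
    one keeps them unchanged (a left join)."""
    out = []
    for b in bindings:
        candidates = (match(pattern, t, b) for t in data)
        extended = [m for m in candidates if m is not None]
        if extended:
            out.extend(extended)
        elif optional:
            out.append(b)
    return out


start = [{}]
named = join(start, ("?p", "name", "?n"), DATA)

# Required second statement: no homepage exists, so everything is lost.
strict = join(named, ("?p", "homepage", "?h"), DATA)

# Same statement marked optional: the name survives, ?h stays unbound.
relaxed = join(named, ("?p", "homepage", "?h"), DATA, optional=True)

print("strict :", strict)
print("relaxed:", relaxed)
```

The strict version comes back empty even though the name is sitting right there in the data, which is the effect described above; the relaxed version keeps the row and simply leaves the homepage variable unbound.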
Also, I find the Linked Data element of this a little hard to visualise. Presumably, if you want to query across datasets, you need to use prefixed namespaces that are common to them all. I think, but I’m not sure, that you can mix multiple prefixed namespaces.
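On the prefix question: SPARQL does allow any number of PREFIX declarations in one query, and a prefix is only shorthand for the start of a full URI. A small Python sketch of that expansion follows. The foaf namespace is the standard one; mapping the labels dbresource and dbproperty to the DBpedia resource and property namespaces is my assumption about what they stand for in this post, and the expand and prologue helpers are made up for the example.

```python
# Prefix label -> namespace URI. The labels are arbitrary; only the URIs
# matter to the endpoint.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbresource": "http://dbpedia.org/resource/",  # assumed mapping
    "dbproperty": "http://dbpedia.org/property/",  # assumed mapping
}


def expand(qname, prefixes=PREFIXES):
    """Turn a prefixed name such as 'foaf:knows' into a full URI."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {qname!r}")
    return prefixes[prefix] + local


def prologue(prefixes=PREFIXES):
    """Build the PREFIX lines that would go at the top of a query."""
    return "\n".join(f"PREFIX {p}: <{uri}>" for p, uri in prefixes.items())


print(prologue())
print(expand("foaf:knows"))
print(expand("dbresource:Martin_Amis"))
```

So mixing vocabularies in one query is fine as far as the syntax goes. Whether a given endpoint actually holds data under each of those URIs is a separate question.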
Regarding SQUIN itself, I’m also suspicious that the queries return very, very fast; there’s not enough time for it to be doing any recursing that involves multiple network round trips. Here’s an example:
SELECT ?influenced ?page ?knows ?knowspage
WHERE {
  ?name dbproperty:Name dbresource:Martin_Amis .
  ?influenced dbproperty:influencedBy ?name .
  OPTIONAL { ?page foaf:page ?influenced . }
  ?knows foaf:knows ?influenced .
  OPTIONAL { ?knowspage foaf:page ?knows . }
}
This should declare the query variables in the SELECT clause, get the value of the Person/Name property of the DBpedia article Martin_Amis, bind it to ?name, then get all the values of the Person/influencedBy property that match ?name, bind them to ?influenced, and then get the FOAF:Page values that match ?influenced. We’re then going to query FOAF for the FOAF:Knows values for each of the influenced, and their home pages.
As it’s uncertain whether they have them, that’s an OPTIONAL clause, as is the one that gets the foaf:pages in the first place. DBpedia’s SNORQL interface chokes on the reference to Martin Amis (who wouldn’t?). SQUIN considers it valid SPARQL, but produces no results whatsoever. If you browse over here, you’ll find that all the values involved are present and as described; and, indeed, the first influencedBy has a foaf:page attribute. In general, semantic web things seem to be good at failing to return data they actually have under the attributes they have for it.
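One thing that may be worth checking, offered as a guess rather than a diagnosis: each SPARQL statement is read as subject, predicate, object, so a statement with ?name first asks for things whose Name property is the resource Martin_Amis, not for the Name of Martin_Amis. The Python sketch below uses invented toy data (not copied from DBpedia) and a hypothetical solve helper to show how much the order matters.

```python
# Toy data in (subject, predicate, object) order, as RDF stores it.
DATA = {
    ("Martin_Amis", "name", "Martin Amis"),
    ("Martin_Amis", "influencedBy", "Example_Author"),
}


def solve(pattern, data=DATA):
    """Return the bindings for one (subject, predicate, object) pattern.
    Terms beginning with '?' are variables; everything else must be equal."""
    results = []
    for triple in sorted(data):
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No break: every position of the pattern matched this triple.
            results.append(binding)
    return results


# Variable in subject position, resource in object position.
variable_first = solve(("?name", "name", "Martin_Amis"))

# Resource in subject position, variable in object position.
resource_first = solve(("Martin_Amis", "name", "?name"))

print("variable first:", variable_first)
print("resource first:", resource_first)
```

With this toy data, the first form finds nothing and the second finds the name. Whether that is what is happening with the real query depends on which way round DBpedia stores these properties, which the browsable page should show.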
What is it that I’m missing? Is there a huge tarball of data I need to load in SQUIN? Surely the point of Linked Data and semantics is that you don’t have to scrape the Web and snarf it all into a big database, but rather treat data on Web sites as if it were in a database?