give IT yahoos United States (dollars)

So, there’s this rumour-surrounded gadget that GIYUS wants people to install on their computers as part of the War on Terror. Obviously, I wondered exactly how it worked: did it analyse the Web sites you visit semantically, so as to target its talking points precisely? Did it use some sort of social recommendation mechanism? I also wondered whether there was any way of characterising the network traffic it generated and estimating how many people were using it.

So I did the obvious thing and I actually downloaded it. It’s packaged as a Firefox extension (.xpi); extensions consist of JavaScript files for the application logic and XUL (XML User interface Language) for the look’n’feel, all wrapped up in a ZIP archive. If you don’t have the source of one, all you need to do is pass it through an archive tool and extract all files, and then you can read them in a text editor.
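
If you want to look inside one yourself, Python’s standard zipfile module is enough; the filename here is just illustrative, not necessarily what the download is called.

```python
import zipfile

# An .xpi is a plain ZIP archive: unpack it and the JavaScript and XUL
# files inside can be read in any text editor.
with zipfile.ZipFile("giyus.xpi") as xpi:  # illustrative filename
    xpi.extractall("giyus-src")
    for name in xpi.namelist():
        print(name)
```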

And actually, it’s kind of disappointing: no folksonomy, no textual analysis, not even crude keyword matching. It just grabs an RSS feed from ws.collactive.com, passing in the string “GIYUS” (presumably to ensure it gets the right one), checks whether any items in it aren’t already cached, and if so fires a graphical alert containing the message. It’s basically an e-mail list gussied up in Web 2.0 finery, with the feature that it’s marginally less trivial to forward the content to non-subscribers. It doesn’t even appear to spy on your browsing history.
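
The whole logic amounts to something like the following reconstruction; the feed URL, query parameter and polling interval are my guesses rather than anything lifted verbatim from the extension’s code.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

# Reconstructed sketch of the extension's behaviour: poll the feed, remember
# which item IDs have been seen, and raise an alert for anything new.
FEED_URL = "http://ws.collactive.com/feed?channel=GIYUS"  # guessed URL and parameter
seen = set()

def poll_once():
    with urllib.request.urlopen(FEED_URL) as resp:
        tree = ET.parse(resp)
    for item in tree.iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen:
            seen.add(guid)
            # The real extension pops a graphical alert; printing stands in for that.
            print("New talking point:", item.findtext("title"))

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60 * 60)  # check hourly; the real interval is unknown
```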

Of course, there could be some server-side magic involved. You can usually get a rough idea of location from an IP address, and a rough idea is probably best in terms of hit-rate (you’ve a much better chance of getting your geotargeting right for “North London” than for “Archway”). And you can draw some conclusions from browser credentials – OS, screen, browser type and version, etc. For example, perhaps you’d want to serve the red-meat “civilian deaths are all a fake” stuff to MSIE 5/6 users in the US heartland and the Decent Left stuff to Mac users in North London. So I considered actually installing the extension; but then I realised I didn’t want a simulated Melanie Phillips on my sofa any more than I wanted the real thing. However, it’s possible to view the feed on the Web anyway, so I checked.
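
To be clear about what I’m imagining, the server-side branching would only need to be as crude as this; the function, the region labels and the rules are entirely invented.

```python
# Pure speculation about what the server could do: pick a feed variant from the
# request's User-Agent plus a coarse IP-to-region lookup done elsewhere.
def choose_feed_variant(user_agent: str, region: str) -> str:
    ua = user_agent.lower()
    if "msie" in ua and region == "US":
        return "red-meat"     # the "civilian deaths are all a fake" material
    if "macintosh" in ua and region == "UK":
        return "decent-left"  # the softer sell
    return "generic"          # what everyone else (and apparently I) would get
```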

But they may not even be doing that; I’m on a weird niche ISP, with a Linux machine, in North London, and the feed I see at http://ws.giyus.org/points/list is deeply generic.

Surely, though, it’s possible to do better than this? I envisage a sort of Web force multiplier that would analyse the texts you read as you browse and compute some kind of digest hash, doing the same for every link you send anyone else and stashing the hash of each link on a remote server. As you browse, it compares the hash of the current page with the ones in the DB and returns a list of possibly appropriate arguments – the strength of this being that they could be data, poetry, code, pictures, video, or indeed anything. We could incorporate some sort of social element, too, to keep a check on quality.
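
Something like this hypothetical client-side flow, in other words; every name here is invented, the store is faked as a dict, and the exact-match hash is only a placeholder for the similarity measure discussed in the next paragraph.

```python
import hashlib

# Hypothetical flow: digest the page being read, ask a shared store for links
# previously sent along with similar text, and surface them as suggestions.
link_store = {}  # digest -> URLs shared along with text that hashed to it

def digest(text: str) -> str:
    # Placeholder only: a cryptographic hash destroys similarity, which is
    # exactly what a real version would need to preserve.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def record_shared_link(url: str, text: str) -> None:
    link_store.setdefault(digest(text), []).append(url)

def suggestions_for(page_text: str) -> list[str]:
    return link_store.get(digest(page_text), [])
```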

Who here knows about corpus analysis? Most of the academic papers my casual search found gave me that “dog listening to music” feeling. What I need is something like a rather bad crypto hash function – one where two texts with different content would produce non-randomly different hashes. Obviously we’d filter the text with a list of stop words like search engines do, so as to strip out the thes and ands. We could, for example, use (say) the distribution of words in Wikipedia as a common baseline, and measure how the distribution of significant words in the target texts differs from it.
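
As a strawman, here’s roughly what I have in mind; the stop-word list is truncated and the baseline is a stand-in dict where a real word-frequency table built from Wikipedia would go.

```python
import re
from collections import Counter

# Crude "bad hash": drop stop words, score each remaining word by how
# over-represented it is relative to a baseline distribution, and take the
# top-scoring words as the text's signature. Two texts count as similar when
# their signatures overlap.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "that"}
BASELINE = {"people": 0.002, "government": 0.001}  # stand-in for Wikipedia frequencies

def signature(text: str, top_n: int = 20) -> set[str]:
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    # Over-representation ratio vs. the baseline; words absent from the
    # baseline get a small floor, so rare terms score as highly significant.
    scored = {w: (c / total) / BASELINE.get(w, 1e-6) for w, c in counts.items()}
    return set(sorted(scored, key=scored.get, reverse=True)[:top_n])

def similarity(a: str, b: str) -> float:
    sa, sb = signature(a), signature(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0  # Jaccard overlap
```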


  1. “What I need is something like a rather bad crypto hash function – one where two texts with different content would produce non-randomly different hashes. Obviously we’d filter the text with a list of stop words like search engines do, so as to strip out the thes and ands. We could, for example, use (say) the distribution of words in Wikipedia as a common baseline, and measure how the distribution of significant words in the target texts differs from it.”

    Why not use the ZIP file compression algorithm? It seems to have been used with some success in identifying different languages and authors, for example:

    Zip Programs Can Identify Language Of Any Document
    http://www.unisci.com/stories/20021/0204024.htm

    “Data compression routines can accurately identify the language, and even the author, of a document without requiring anyone to bother reading the text.

    The key to the analysis is the measurement of the compression efficiency that a program achieves when an unknown document is appended to various reference documents.

    […]

    The researchers found that file compression analysis worked well in identifying the language of files as short as twenty characters in length, and could correctly sort books by author more than 93% of the time.”

    reference to:

    D. Benedetto, E. Caglioti, and V. Loreto, Physical Review Letters, 28 January 2002

  2. yorksranter

    Interesting suggestion; thanks a lot! A rough sketch of how that might work is below.
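
    The idea, as I understand it from the paper, is that a text compresses better when appended to a reference it resembles; zlib stands in for the actual zip algorithm, and all the names here are mine.

    ```python
    import zlib

    # Compression-based matching after Benedetto, Caglioti & Loreto: the extra
    # bytes needed to compress an unknown text appended to a reference text
    # measures how unlike the reference it is.
    def compressed_size(text: str) -> int:
        return len(zlib.compress(text.encode("utf-8"), 9))

    def relative_entropy(reference: str, unknown: str) -> int:
        return compressed_size(reference + unknown) - compressed_size(reference)

    def best_match(references: dict[str, str], unknown: str) -> str:
        # references: e.g. {"english": <English sample>, "italian": <Italian sample>}
        return min(references, key=lambda name: relative_entropy(references[name], unknown))
    ```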

  1. twitbook: book of twits « Alternate Seat of TYR

    […] As always, if you want a practical policy recommendation, make tools. A little investment in annoying javascript thingies pays off hugely by improving the productivity of your trolls; and it doesn’t have to be technically very interesting. […]



