ambassador, with this pdf you are spoiling us

So I was trying to parse the London Diplomatic List (this month’s edition yet to make an appearance). Cian suggested pulling out the fontspec tags on the grounds that they’re often redundant and it might be possible to identify groups among them. So I did just that and then a little bit of data reduction.

25 tag declarations squash to 11 unique font/size/colour declarations. Mmm, compression. The bad news is that, for example, countries and ambassadors (or rather, chiefs of mission – not all of them are ambassadors) are in font 1 – but font 1 is actually identical to fonts 2, 7, and 8, which include diplomats’ names, spouses, and styles. The good news is that at least font-grouping will help to filter the crap like lists of national days and page numbers and obvious MS Word copy-paste artefacts.
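For anyone wanting to repeat the reduction, here is a minimal sketch of how it could go. It assumes the PDF was converted with something like `pdftohtml -xml`, so that each `fontspec` element carries `id`, `family`, `size` and `color` attributes and each `text` element carries a `font` attribute. The sample XML, the function names `canonical_fonts` and `group_text`, and the example strings are all made up for illustration; none of it is taken from the real list.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample, shaped like the assumed converter output.
SAMPLE = """<pdf2xml>
<page number="1">
<fontspec id="0" size="8" family="Times" color="#000000"/>
<fontspec id="1" size="11" family="Arial" color="#000000"/>
<fontspec id="2" size="11" family="Arial" color="#000000"/>
<text font="0">1</text>
<text font="1">EXAMPLELAND</text>
<text font="2">Mrs <b>Example</b> Spouse</text>
</page>
</pdf2xml>"""


def canonical_fonts(root):
    """Map every fontspec id to the first declared id with the same
    (family, size, color), so redundant declarations collapse together."""
    first_seen = {}  # (family, size, color) -> first id declared with it
    canon = {}       # every id -> canonical id
    for spec in root.iter("fontspec"):
        key = (spec.get("family"), spec.get("size"), spec.get("color"))
        canon[spec.get("id")] = first_seen.setdefault(key, spec.get("id"))
    return canon


def group_text(root, canon):
    """Collect the text of each <text> element under its canonical font id."""
    groups = defaultdict(list)
    for node in root.iter("text"):
        # itertext() also picks up text nested inside <b> or <i> children.
        content = "".join(node.itertext()).strip()
        groups[canon.get(node.get("font"))].append(content)
    return dict(groups)


root = ET.fromstring(SAMPLE)
canon = canonical_fonts(root)
groups = group_text(root, canon)

print(len(canon), "declarations,", len(set(canon.values())), "unique")
for font_id, lines in groups.items():
    print(font_id, lines)
```

Dropping a whole group (page numbers, say) is then just a matter of skipping its canonical id. It does nothing to separate countries from spouses when they share a font, which is the bad news above.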

(In case this blog still eats embedded spreadsheets: here’s a link.)


  1. Cian

    Well, if particular font ids are used uniquely for particular purposes, then it doesn’t really matter that they refer to the same font. So you could treat any item with font id 1 as X, any item with font id 2 as Y, etc.

    It sounds like the originating Word file was generated using some kind of OLE bridge to a database, in which case, while it’s a hack, you can treat the font ids as category keys.

    Alternatively I’ve completely misunderstood, and I should really look at the PDF sometime…

  2. The September 2010 London Diplomatic List is now online:

    Do the three changes to this list now effectively name the expelled Israeli diplomat / alleged Mossad station chief?
