scraping the barrel

I’ve finally got around to answering my own question here. The scraper is a work in progress at the moment; the original PDF is rendered by pdftohtml into a tiresomely semi-structured (i.e. worse than no structure) tagpile. I was trying to tackle this through recursion, but I might instead try Python’s continue keyword, or perhaps pre-tokenise the document based on the number of blank lines between blocks and then deal with the blocks.
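
A minimal sketch of the blank-line pre-tokenising idea, assuming the converted document is available as one plain-text string. The function name `split_blocks`, the threshold of two blank lines and the sample text are all made up for illustration; the real document may need a different threshold.

```python
import re

def split_blocks(text, min_blank_lines=2):
    """Split text into blocks separated by at least `min_blank_lines` blank lines."""
    # A run of N blank lines shows up as N+1 consecutive newlines
    # (blank lines may still contain stray spaces or tabs).
    separator = re.compile(r"\n(?:[ \t]*\n){%d,}" % min_blank_lines)
    blocks = [block.strip("\n") for block in separator.split(text)]
    # Drop anything that is empty or whitespace only.
    return [block for block in blocks if block.strip()]

# Hypothetical sample: one blank line inside a block, three between blocks.
sample = "ARMENIA\nline one\n\nline two\n\n\n\nAZERBAIJAN\nline three\n"
blocks = split_blocks(sample)
for block in blocks:
    print(repr(block))
```

Each block can then be handled by an ordinary loop rather than by recursion over the tag soup.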

This all depends on the thing actually having any underlying structure, of course – it may be assembled by copy-and-paste, so anything I do will blow up every month. The things I do for England…

  1. Cian

    Use the -xml output. It’s quite easy to sort that into lines/columns (just eyeball the XML to work out where the fixed positions for columns are). Then you can either use the structure of the document itself (hanging indents, right padding, distance between lines) or, if it’s really bad, a RegEx that spots the end/beginning of a sentence, to recreate the flow of the paragraphs in HTML.
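
A rough sketch of the lines/columns sorting, not Cian's framework. It assumes each text element carries integer `top` and `left` attributes; the sample page, the `COLUMN_STARTS` values and the 3-pixel tolerance are invented for illustration and would have to be read off the real XML.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right

# Hypothetical sample page; real output will have many more elements.
SAMPLE = """<page number="1">
<text top="100" left="50" width="60" height="12" font="1">ARMENIA</text>
<text top="101" left="300" width="40" height="12" font="2">12</text>
<text top="120" left="50" width="80" height="12" font="2">Yerevan</text>
<text top="120" left="300" width="40" height="12" font="2">7</text>
</page>"""

# Assumed left edges of the columns, found by eyeballing the `left` values.
COLUMN_STARTS = [0, 250]

def rows_and_columns(page, column_starts, tolerance=3):
    """Group text elements into rows by `top`, then into columns by `left`."""
    items = sorted(
        (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
        for t in page.iter("text")
    )
    lines = []
    for top, left, content in items:
        # Elements whose `top` is close to the start of the current line join it.
        if lines and top - lines[-1][0] <= tolerance:
            lines[-1][1].append((left, content))
        else:
            lines.append((top, [(left, content)]))
    rows = []
    for _, cells in lines:
        row = [""] * len(column_starts)
        for left, content in sorted(cells):
            # Pick the last column whose start is at or before this element.
            col = max(0, bisect_right(column_starts, left) - 1)
            row[col] = (row[col] + " " + content).strip()
        rows.append(row)
    return rows

page = ET.fromstring(SAMPLE)
table = rows_and_columns(page, COLUMN_STARTS)
for row in table:
    print(row)
```

The tolerance is there because elements on the same visual line often differ by a pixel or two in `top`.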

    I have a framework that provides the functions needed for this, but it needs cleaning up. Plus porting from Haskell to Python (don’t ask. It seemed like a good idea at the time).

    Also, the best version of pdftohtml is distributed with Calibre. It fixes some annoying problems with character encodings.

    • yorksranter

      Under ScraperWiki, -xml gets passed by default. What I’m getting is each line in a text element with a bunch of guessy attributes.

  2. Cian

    Guessy attributes?
    What you should have is a structure of roughly this format: a sequence of page elements.

    And then within each page you’ll get either font definitions (fontspec elements) or text elements, which will have a position/dimension plus a font id.
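
The markup in this comment was eaten by the blog software, so the sample below is a reconstruction of what pdftohtml -xml output typically looks like, not the original snippet. The attribute values are invented. The sketch walks the pages and pulls out position, font id and content for each text element.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the XML; every value here is made up for illustration.
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="14" family="Times" color="#000000"/>
    <fontspec id="1" size="10" family="Times" color="#000000"/>
    <text top="100" left="50" width="60" height="14" font="0"><b>ARMENIA</b></text>
    <text top="120" left="50" width="200" height="10" font="1">Some body text</text>
  </page>
</pdf2xml>"""

def text_records(xml_text):
    """Return one dict per text element, with page number, position and font id."""
    root = ET.fromstring(xml_text)
    records = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            records.append({
                "page": int(page.get("number")),
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "font": t.get("font"),
                # itertext() also collects content nested in <b> or <i>.
                "text": "".join(t.itertext()),
            })
    return records

records = text_records(SAMPLE)
for record in records:
    print(record)
```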

    If it’s not that, then either they’re using the default option (html) and just wrapping it in XML, which gives pretty horrible results, or they’re using something else (pdf2text?) and wrapping it in XML.

    One problem you may have encountered is that the words aren’t very well defined. Unfortunately that’s the nature of PDFs. It’s basically Postscript with knobs on. Spaces aren’t necessary for printing, so they don’t get added in. This means you have to infer them, using font info. pdftohtml does a really good job of this generally, so if it fails that’s usually a sign of a really poorly formatted PDF.

    If you want tables you’re out of luck. If you know enough about the format of the PDF you’re interested in, it’s possible (if messy). You can use PDFMiner to find horizontal and vertical vectors (lines to you and me), and use some kind of heuristic to work out if they define the boundary of a table. It’s not much fun, and it’s hard to write anything very reusable or general.
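
One possible heuristic of that kind, sketched without PDFMiner itself: it assumes the line segments have already been extracted into plain tuples (that step is not shown), and treats a set of lines as a table only if each one crosses at least two lines of the other orientation. The tuple layout, the function names and the tolerance are assumptions for illustration.

```python
def find_grid(horizontals, verticals, tolerance=2.0):
    """Guess a table grid from line segments.

    horizontals: list of (y, x0, x1); verticals: list of (x, y0, y1).
    Returns (xs, ys), the sorted grid coordinates, or None if no grid is found.
    """
    def crosses(h, v):
        y, hx0, hx1 = h
        x, vy0, vy1 = v
        return (hx0 - tolerance <= x <= hx1 + tolerance
                and vy0 - tolerance <= y <= vy1 + tolerance)

    # Keep only lines that cross at least two lines of the other orientation;
    # this throws away underlines and other stray rules.
    hs = [h for h in horizontals if sum(crosses(h, v) for v in verticals) >= 2]
    vs = [v for v in verticals if sum(crosses(h, v) for h in horizontals) >= 2]
    if len(hs) < 2 or len(vs) < 2:
        return None
    xs = sorted({v[0] for v in vs})
    ys = sorted({h[0] for h in hs})
    return xs, ys

def cells(xs, ys):
    """Bounding boxes (x0, y0, x1, y1) of the cells between adjacent grid lines."""
    return [(x0, y0, x1, y1)
            for y0, y1 in zip(ys, ys[1:])
            for x0, x1 in zip(xs, xs[1:])]

# Hypothetical segments: a 2x2 grid plus one stray underline at y=400.
horizontals = [(100, 50, 350), (150, 50, 350), (200, 50, 350), (400, 50, 60)]
verticals = [(50, 100, 200), (200, 100, 200), (350, 100, 200)]
grid = find_grid(horizontals, verticals)
boxes = cells(*grid)
print(grid)
print(boxes)
```

Text elements can then be assigned to whichever cell box contains their position. As the comment says, this is unlikely to generalise much beyond the PDF it was tuned for.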

    • yorksranter

      Like this:

      text height="456" width="567" font="1">ARMENIA </text
      (tags deliberately borked to make them show)
      The big probby is that the font attribute is guessy – plain text is font="2", headings are font="1", except when they’re not. Dumping the whole thing gives a separate bounding-box for each *character*.

  3. Cian

    Do you have a link to the PDF? I’ll see what I get.

    The font ids correspond to fontspec tags, which are defined on the first page where the font is used. That then gives you the size of the font, its name (or family) and the colour. It’s worth running through the XML once with lxml and just grabbing all those fontspecs and placing them in a list, and then sorting the list by size and family. Quite often you’ll then see that you can basically treat three font ids as identical. Sometimes they are identical – this is usually the sign of a borked PDF. It kind of depends upon the fonts used. So for some fonts they might be technically different fonts for italic/bold and regular.
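
A small sketch of that fontspec pass. It uses the standard library's ElementTree so it runs without extra installs; lxml's etree offers the same `fromstring` and `iter` calls. The sample XML and the function names are invented for illustration, and treating fonts with equal size, family and colour as interchangeable is an assumption that may not hold for every PDF.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: ids 1 and 2 describe the same font.
SAMPLE = """<pdf2xml>
  <page number="1">
    <fontspec id="0" size="14" family="Times" color="#000000"/>
    <fontspec id="1" size="10" family="Times" color="#000000"/>
    <fontspec id="2" size="10" family="Times" color="#000000"/>
  </page>
  <page number="2">
    <fontspec id="3" size="10" family="Helvetica" color="#000000"/>
  </page>
</pdf2xml>"""

def font_groups(xml_text):
    """Collect every fontspec, sort by size and family, and group identical ones."""
    root = ET.fromstring(xml_text)
    specs = sorted(
        (int(f.get("size")), f.get("family"), f.get("color"), f.get("id"))
        for f in root.iter("fontspec")
    )
    groups = defaultdict(list)
    for size, family, color, font_id in specs:
        groups[(size, family, color)].append(font_id)
    return specs, dict(groups)

def canonical_ids(groups):
    """Map each font id to the first id in its group."""
    return {font_id: ids[0] for ids in groups.values() for font_id in ids}

specs, groups = font_groups(SAMPLE)
canonical = canonical_ids(groups)
for spec in specs:
    print(spec)
print(canonical)
```

Rewriting each text element's font attribute through `canonical` before classifying headings and body text should make the ids a little less guessy.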

    Basically the more professionally a PDF was created, the easier it is to scrape. LaTeX and Adobe tools are fine. Something created with a dodgy freeware PDF printer driver, not so much.
