PDF conversion and client-side scraping

One of my ongoing projects is called smusique, a mobile web-based sheet music viewer. The application was designed for tablets, enabling a rich browsing experience through sheet music. Though some sheet music is encumbered by licenses, most of the classic stuff is legally available online via sheet music databases such as the Petrucci Library at imslp.org.

While the focus of my project was to design and develop a delightful tablet-based interface for sheet music, I needed to solve some technical problems:

Figure out how to deal with PDFs in the browser.
Create or find a database with a large enough corpus of searchable sheet music.

Read on to find out how I solved these problems using the AppEngine conversion API and through a Chrome extension that does client-side scraping.

Dealing with PDFs

Since digital sheets are largely in PDF format, I needed some solution for showing PDFs inside a web user interface. Unfortunately, showing PDFs inline is pretty much impossible on today's web. Ideally, the browser would have support for something like the following:

<img src="something.pdf" page="3" />

But this functionality simply doesn't exist in a widely supported fashion. Given how important showing PDFs is for my application, I needed a workaround. A few options came to mind:

Attempt to show PDFs in an iframe, and programmatically scroll to the right position based on the page number.
Convert PDFs on the client side using something like PDFJS
Build my own PDF conversion server

I experimented briefly with the first approach and unsurprisingly found mixed levels of browser support for PDF rendering inside iframes. Some mobile browsers allowed PDFs to be opened within iframes, but exhibited unexpected scaling behavior, could not be programmatically scrolled, or both.

I did not seriously consider PDFJS as a solution, judging that running PDF conversion in a mobile web browser would be prohibitively slow.

Just as I was bracing to take the plunge and write my own ImageMagick-based conversion API running on Django/Slicehost, I discovered the very new AppEngine conversion API.

The conversion API

Turns out that AppEngine provides a new experimental file format conversion API. It allows you to map from a variety of input file formats to a variety of outputs, including PDF to PNGs (one per page), which is exactly what I wanted. The full list of conversion paths is available in the docs. Some of the more exciting features include image to text conversions via OCR, and generating images from HTML.

My conversion was really straightforward to implement:

from google.appengine.api import conversion

def convert_pdf(self, pdf_data):
    """Converts PDF to PNG images. Returns a list of PNG data, one per page."""
    # Wrap the raw PDF bytes in an asset and request PNG output.
    asset = conversion.Asset('application/pdf', pdf_data, 'sheet.pdf')
    conversion_request = conversion.Conversion(asset, 'image/png')
    result = conversion.convert(conversion_request)
    if result.assets:
        return [page.data for page in result.assets]
    else:
        raise Exception('Conversion failed: %d %s'
                % (result.error_code, result.error_text))

Note that the API methods recently changed (maybe in AppEngine 1.6?) from conversion.ConversionRequest to conversion.Conversion. I should have expected breaking changes since the conversion API is still experimental, but it stumped me for a little while anyway.

The API works pretty well, but doesn't provide much meaningful feedback when running in the dev server. So far I've only managed to overload the conversion service a few times, with very large (e.g. ~100 page) PDFs. That said, a lot of sheet music is incredibly long, especially in an orchestral setting. So, if I were doing this for production, I would probably need to build a custom solution.

One other caveat with this approach is that I have no idea how much money I would be charged for using this service. Conversion is quite intensive, and given the recently increased rates, I would be wary.

Once the images are converted, I upload them to an Amazon S3 bucket where I keep my data. To do this, I use S3.py, a really simple library for interacting with Amazon S3:

import S3

def upload_helper(self, path, data, contentType):
    """Uploads data to S3 as a publicly readable object. Returns its URL."""
    conn = S3.AWSAuthConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    options = {'x-amz-acl': 'public-read', 'Content-Type': contentType}
    # The response is not checked here, so a failed upload goes unnoticed.
    response = conn.put(DEFAULT_BUCKET, path, S3.S3Object(data), options)
    return URL_FORMAT % {'bucket': DEFAULT_BUCKET, 'path': path}
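
To show how conversion and upload might fit together, here is a minimal sketch written for Python 3. The names store_pages and page_path, and the key layout, are assumptions made for illustration and are not taken from the smusique source. The upload argument is any callable that takes (path, data, content type) and returns a URL, the same shape as upload_helper.

```python
def page_path(piece_id, page_number):
    """Build a storage key such as 'bwv-846/page-003.png' (hypothetical layout)."""
    return '%s/page-%03d.png' % (piece_id, page_number)


def store_pages(piece_id, png_pages, upload):
    """Upload each converted page and return the resulting URLs in page order.

    `upload` is assumed to take (path, data, content_type) and return a URL.
    """
    urls = []
    # Number pages from 1, the way a reader would count them.
    for number, data in enumerate(png_pages, start=1):
        urls.append(upload(page_path(piece_id, number), data, 'image/png'))
    return urls


if __name__ == '__main__':
    # Stand-in uploader so the sketch runs without AWS credentials.
    def fake_upload(path, data, content_type):
        return 'https://example-bucket.s3.amazonaws.com/' + path

    print(store_pages('bwv-846', [b'page one', b'page two'], fake_upload))
```

Zero-padding the page number keeps the keys in page order when a bucket listing is sorted alphabetically.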

The full code for the AppEngine server is open source on GitHub.

Client-side scraping

Unfortunately IMSLP doesn't provide a useful API out of the box. I wasn't ambitious enough to create a full API for it, but needed some interim solution. I figured that for a demo, it wouldn't be such a terrible experience to have to seed the database with the repertoire you were interested in if it was easy enough to do.

Ordinarily, one might write a little scraper utility in their favorite scripting language. However, scraping HTML from Python (or any other scripting language for that matter) is really not my favorite activity. Additionally, relying on the command line excludes the target audience (musicians) for this application.

As a workaround, I came across a potentially interesting idea: client-side scraping with a Chrome extension. Let me explain.

Suppose you want to scrape part of a corpus of data that's available through a website, but want to let the user decide which parts they are interested in. Simply use a Chrome extension that injects code into the target page, fetches the interesting part of the DOM using selectors and perhaps jQuery for convenience, and then uploads the data to some server. I used content scripts for this purpose. In the manifest, the entry looks as follows:

"content_scripts": [{
  "matches": ["http://imslp.org/wiki/*"],
  "js": ["imslp.js"],
  "css": ["imslp.css"]
}]

Then, imslp.js does the scraping, which can happen automatically as a user navigates through a page, or by adding extra elements to the page. This simple IMSLP scraper creates a "send to smusique" button beside each PDF on IMSLP:

[Screenshot: "send to smusique" buttons beside the PDFs on an IMSLP page]

Once clicked, imslp.js gets the URL to fetch, extracts all of the metadata of the current piece via CSS selectors, and makes a cross-domain request to the converter with all of this information.
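
On the receiving end, the converter has to decide whether to trust what the extension sends. Here is a minimal sketch of such a check, written for Python 3 with only the standard library. The JSON field names (pdf_url, title, composer), the allow-list of hosts, and the function name are all hypothetical, since the actual request format is not described here.

```python
import json
from urllib.parse import urlparse

ALLOWED_HOSTS = {'imslp.org', 'www.imslp.org'}  # assumed allow-list


def parse_scrape_request(body):
    """Validate a JSON payload from the extension and return a clean dict.

    Raises ValueError if the payload is malformed or points somewhere
    other than IMSLP.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        raise ValueError('request body is not valid JSON')
    if not isinstance(payload, dict):
        raise ValueError('request body must be a JSON object')

    pdf_url = payload.get('pdf_url', '')
    if not isinstance(pdf_url, str):
        raise ValueError('pdf_url must be a string')
    parsed = urlparse(pdf_url)
    # Only fetch from the site the content script runs on.
    if parsed.scheme not in ('http', 'https') or parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError('pdf_url must point at imslp.org')

    title = str(payload.get('title', '')).strip()
    if not title:
        raise ValueError('title is required')

    return {
        'pdf_url': pdf_url,
        'title': title,
        'composer': str(payload.get('composer', '')).strip(),
    }
```

Checking the host before fetching anything keeps the converter from being used to download and convert arbitrary URLs.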

This approach is much harder to prevent than traditional server-side scraping. Restricting by User-Agent is futile, since the header is trivial to spoof, but many sites render their content with JavaScript and serve an otherwise content-free page, which does defeat a simple server-side scraper. In contrast, there's very little a content provider can do to protect against client-side scraping, which makes it a very powerful technique. If the content is in the DOM, it can be extracted.
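
The difference can be illustrated with a small sketch using Python 3's standard html.parser. Both HTML snippets are made up for the example: the first stands in for a page rendered on the server, the second for the empty shell that a JavaScript-rendered site returns to a server-side scraper.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at a .pdf file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if href.lower().endswith('.pdf'):
            self.links.append(href)


def pdf_links(html):
    """Return all PDF links found in the given HTML string."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical markup: content present in the served HTML.
SERVER_RENDERED = '<ul><li><a href="/files/prelude.pdf">Prelude</a></li></ul>'
# Hypothetical markup: content only appears after scripts run in a browser.
JS_SHELL = '<div id="app"></div><script src="/bundle.js"></script>'

print(pdf_links(SERVER_RENDERED))  # the link is found
print(pdf_links(JS_SHELL))         # nothing to find without running the script
```

A content script sees the page after those scripts have run, so it would find the links in both cases.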

The source of the extension is somewhat messy, but it is also available on GitHub.

▪

February 8, 2012

Boris Smus

interaction engineering
