Using Scrapy and PyV8 to scrape inline JavaScript.

When using Scrapy, scraping HTML with selectors is easy, but inline JavaScript objects embedded in the HTML are another story.

I'm using PyV8 to evaluate the imported scripts and inline JavaScript. The JavaScript objects in gistfile2.py allow the JavaScript libraries to access browser variables such as window, history and selectors. The functions I implemented are sufficient to run jQuery and other frameworks. This works in roughly the same way as the Google crawler, which also interprets and evaluates JavaScript.
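
To give an idea of what such browser objects can look like, here is a minimal sketch. It is not the content of gistfile2.py: the `Location` class and every attribute shown are assumptions made for illustration, and a real page will need far more than this. With PyV8 installed the classes derive from `PyV8.JSClass`; the fallback to `object` only keeps the sketch importable without it.

```python
try:
    import PyV8  # optional; only needed to actually run JavaScript
    JSBase = PyV8.JSClass
except ImportError:
    JSBase = object  # fallback so the sketch can be read and tested without PyV8


class Location(JSBase):
    """Hypothetical stand-in for window.location."""

    def __init__(self, href):
        self.href = href


class Global(JSBase):
    """Hypothetical global object exposing a few browser variables."""

    def __init__(self, url):
        self.location = Location(url)
        self.history = []  # placeholder, not a real History implementation

    @property
    def window(self):
        # In a browser, `window` is the global object itself.
        return self
```

Libraries such as jQuery look up names like `window` and `location` on the global object, which is why these have to exist before any script is evaluated.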

Warning: this is merely a proof of concept rather than production-ready code.

First you need to download and install PyV8.

Google V8 - http://code.google.com/p/v8/
PyV8 - http://code.google.com/p/pyv8/

This is the code of the scraper:

This code mimics the browser:

As you can see, it gets a page, creates a new context using Global() and evaluates all script tags. If a script tag is remote, it downloads the script and runs it. The end result is that you can simply refer to objects defined within the page, in this case ProducsData, and use them as Python objects.
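
The steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gist's code: `ScriptCollector`, `evaluate_page` and the `evaluate` and `fetch` callables are hypothetical names, and the standard library HTML parser is used here in place of Scrapy selectors so the example runs on its own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Collect <script> tags in document order (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.scripts = []  # entries: ("remote", absolute_url) or ("inline", source)
        self._in_inline = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Remote script: resolve the URL against the page URL.
            self.scripts.append(("remote", urljoin(self.base_url, src)))
        else:
            # Inline script: start buffering its text.
            self._in_inline = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_inline:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_inline:
            self.scripts.append(("inline", "".join(self._buffer)))
            self._in_inline = False


def evaluate_page(html, base_url, evaluate, fetch):
    """Pass every script of the page to `evaluate`, in document order.

    `evaluate` takes JavaScript source; `fetch` takes a URL and returns source.
    Both are supplied by the caller (assumptions of this sketch).
    """
    collector = ScriptCollector(base_url)
    collector.feed(html)
    collector.close()
    for kind, value in collector.scripts:
        evaluate(fetch(value) if kind == "remote" else value)
```

With PyV8, `evaluate` could be the `eval` method of an entered `PyV8.JSContext(Global())`, after which the page's objects should be reachable from Python through the context. Scripts are run in document order because later scripts usually depend on earlier ones, for example inline code that relies on jQuery being loaded first.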

To do:

  • make a nice library
  • cache the evaluated context and downloaded scripts
  • further enhance the browser mimicking
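
As a starting point for the caching item in the list above, downloaded scripts could be kept in memory, keyed by URL. This is only a sketch of one possible approach: `ScriptCache` and its `fetch` argument are hypothetical names, and caching the evaluated context itself is not covered.

```python
class ScriptCache:
    """Remember downloaded script sources by URL (hypothetical helper)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: URL -> JavaScript source
        self._sources = {}

    def get(self, url):
        # Download only on the first request for a given URL.
        if url not in self._sources:
            self._sources[url] = self._fetch(url)
        return self._sources[url]
```

Since many pages of one site tend to load the same libraries, this avoids downloading them again for every scraped page.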