Screen scraping with jQuery
During the course of my job I often find myself faced with the task of migrating information from an existing website to our own content management system. In the past my approach has been to assess the source code of the existing site and see whether it’s feasible to use a combination of curl, regular expressions and string manipulation. Sometimes this is straightforward, but increasingly it is becoming less viable because it is too labour-intensive.
I’ve been using jQuery a lot recently and it occurred to me that I could use jQuery’s selectors to target the information I’m interested in on a web page, and then use Ajax to POST it to my own script, which would be waiting to do something useful with the data, e.g. validate it and save it in a database. For educational purposes I was keen to keep this completely client-side if possible (except for a script to receive the information). A server-side solution is covered at the end.
The situation I was up against was a page that had a heap of data in a table (about 90 items), but the table was interspersed with random images to split it up and make it more pleasing to the eye. Fortunately for me, all of the data that I wanted was neatly wrapped in <div class="information"></div> tags. Selecting these div tags with jQuery is really easy using $('div.information').
My first problem was that in order to use jQuery, the web page you’re looking at has to be using it. Fortunately there’s a quick bookmarklet called jQuerify that allows you to load jQuery onto any web page. Once you’ve got that then you can write further bookmarklets of your own to do stuff.
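A bookmarklet of this kind typically works by injecting a script element into the page. The sketch below is an assumption about the general technique, not jQuerify’s actual source; loadScript and the http://localhost/jquery.js URL are hypothetical names used for illustration.

```javascript
// Hypothetical helper: inject a <script> tag so the browser fetches and runs src.
// The document is passed in as doc so the function has no hidden dependencies.
function loadScript(doc, src, onLoad) {
  var script = doc.createElement('script');
  script.setAttribute('src', src);
  if (typeof onLoad === 'function') {
    script.onload = onLoad; // called once the script has been fetched and executed
  }
  doc.getElementsByTagName('head')[0].appendChild(script);
  return script;
}

// Usage in a browser console (assumes a copy of jQuery is served from this URL):
// loadScript(document, 'http://localhost/jquery.js', function () {
//   console.log('jQuery loaded:', typeof jQuery);
// });
```

Passing the document in as a parameter keeps the helper easy to try outside a real page, for example with a stub object.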
So, my evil evil plan was to combine a jQuery selector, jQuery’s each() construct, and jQuery’s ajax support to post the content of each div to a “scraper” script, like so:
$('div.information').each(function () {
    $.post('http://localhost/scraper.php', { data: this.innerHTML });
});
I loaded my source page, clicked the jQuerify bookmarklet and then pasted the code above into the Firebug console (what, oh you’ll need that…) and it was flawless … except that the browser security model stepped in and prevented the Ajax call, because the XMLHttpRequest object is not allowed to post information from one domain to another. I was stuck. I googled around for a while looking for workarounds and investigated JSONP, but that transport seemed geared more towards retrieving information than posting it.
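The rule the browser applies here can be sketched in a few lines: two URLs share an origin only if their scheme, host and port all match. The function name sameOrigin and the www.example.com address are assumptions made for illustration; the snippet relies on the standard URL class, so it needs a reasonably modern browser or Node.js.

```javascript
// Illustrative only: compare the origin (scheme + host + port) of two URLs.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// Hypothetical source page and the local scraper script from the example above.
var page = 'http://www.example.com/products.html';
var scraper = 'http://localhost/scraper.php';

console.log(sameOrigin(page, scraper));                         // false
console.log(sameOrigin(scraper, 'http://localhost/other.php')); // true
```

A page on one origin calling a script on another is exactly the first case, which is why the post from the scraped page to localhost was refused.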
So, I was stuck with a simple question: “How can I get information from one site to another by using the browser?” The simplest answer is of course to have a form on the source website that, when submitted, posts to the target. Thanks to the power of JavaScript, modifying the DOM of a loaded web page is a doddle. Therefore it should be simple to create a form on the page after it has loaded (client side, remember), create and populate some form fields with data and then submit the form to my scraper script.
Suddenly my intentions had outgrown a bookmarklet, but I would still need one for jQuerify and one for my “Scraper Utils”. My new bookmarklet simply asked jQuery to load a local JavaScript file in exactly the same way that jQuery was loaded in the first place:
javascript:$.getScript('http://localhost/scraper.js');
Now I had the freedom to write as much code as I liked in my local scraper.js file.
var Scraper = {};

// Create an empty form that posts to the scraper script and append it to the page.
Scraper.createForm = function () {
    var form = document.createElement('form');
    form.setAttribute('method', 'POST');
    form.setAttribute('action', 'http://localhost/scraper.php');
    document.getElementsByTagName('body')[0].appendChild(form);
    return form;
};

Scraper.createSubmitButton = function () {
    var button = document.createElement('input');
    button.setAttribute('type', 'submit');
    return button;
};

Scraper.createFormField = function (name) {
    var field = document.createElement('textarea');
    field.setAttribute('name', name);
    field.setAttribute('rows', 10);
    field.setAttribute('cols', 50);
    return field;
};

var ScraperForm = Scraper.createForm();

// Add one textarea per matching div, holding that div's HTML.
$('div.information').each(function () {
    var field = ScraperForm.appendChild(Scraper.createFormField('data[]'));
    field.value = this.innerHTML;
});

// Create a submit button that we can post with:
ScraperForm.appendChild(Scraper.createSubmitButton());
You can see here that I’ve set up a few functions, createForm(), createFormField() and createSubmitButton(), and then at the bottom I tie them together with the $('div.information').each(…) construct. The end result is that when I click the bookmarklet that loads the scraper.js script, a form is created at the bottom of the page, with one textarea per div.information holding the innerHTML from that div.
Then, by clicking the Submit button, the browser posts all of that information across to http://localhost/scraper.php where I then collect the information from $_POST['data'] and poke it into a database.
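It may help to see roughly what the browser sends. Because every textarea shares the name data[], the form body repeats that key once per field, and PHP gathers the repeated values into the array $_POST['data']. The sketch below mimics that encoding with the standard URLSearchParams class; encodeFields and the sample values are made up for illustration.

```javascript
// Illustrative only: build a form-encoded body in which one field name repeats.
function encodeFields(name, values) {
  var params = new URLSearchParams();
  values.forEach(function (value) {
    params.append(name, value); // append keeps duplicates, unlike set
  });
  return params.toString();
}

// Two hypothetical snippets of scraped HTML.
var body = encodeFields('data[]', ['<b>Item one</b>', 'Item two']);

console.log(body);                                       // the encoded request body
console.log(new URLSearchParams(body).getAll('data[]')); // the values decoded again
```

The square brackets in the field name are what tell PHP to build an array; without them, each repeated value would overwrite the previous one.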
It’s pretty rough and ready, but it could easily be extended to do other things, such as letting you specify the selector and the target URL for the post when you click the Scraper bookmarklet.
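One way that extension might look is sketched below. It is an assumption rather than tested production code: buildScraperForm and its parameters are hypothetical names, it uses the native querySelectorAll instead of jQuery, and it asks for the two values with prompt().

```javascript
// Hypothetical extension: build the scraper form for any selector and target URL.
// doc is the page's document, passed in so the function can be tried with a stub.
function buildScraperForm(doc, selector, targetUrl) {
  var form = doc.createElement('form');
  form.setAttribute('method', 'POST');
  form.setAttribute('action', targetUrl);

  // One textarea per matched element, all posted under the name data[].
  var matches = doc.querySelectorAll(selector);
  for (var i = 0; i < matches.length; i++) {
    var field = doc.createElement('textarea');
    field.setAttribute('name', 'data[]');
    field.value = matches[i].innerHTML;
    form.appendChild(field);
  }

  var button = doc.createElement('input');
  button.setAttribute('type', 'submit');
  form.appendChild(button);

  doc.body.appendChild(form);
  return form;
}

// In a browser, ask for both values first; the defaults match the example above.
if (typeof document !== 'undefined' && typeof prompt === 'function') {
  var selector = prompt('Selector to scrape', 'div.information');
  var targetUrl = prompt('URL to post to', 'http://localhost/scraper.php');
  if (selector && targetUrl) {
    buildScraperForm(document, selector, targetUrl);
  }
}
```

Cancelling either prompt returns null, so in that case no form is added to the page.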
Server Side Solution
On my travels I also came across the “PHP Simple HTML DOM Parser”, which offers similar selector-based access on the server, like so:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br/>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br/>';
You can get hold of this from SourceForge at the PHP Simple HTML DOM Parser website.





