
Saturday, February 11, 2012

Less Painful AJAX / JavaScript Web Scraping

If you read my previous post, you'll see that scraping AJAX pages can be a pain. So I wrote a little Java program to make it easier. It takes a list of URLs to scrape, renders each one in a browser, and saves the rendered HTML (both normal and AJAX-generated) and screenshots to a folder.

Here's the how-to video:



You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty is implied. You can get the file here:

 http://dl.dropbox.com/u/1015920/VancouverData/ajaxscraper.zip

Mad props to the Selenium team

10 comments:

  1. I really, genuinely want to thank you for all your efforts in making this site and videos. And I want to thank you even more for the ingenious ajaxscraper, which saved my day in scraping an important bulk of data from an organisational website to use in an academic study. The website was dynamically created and loaded with JavaScript, which made it impossible to get the essential data out of it. I found your new post (this one) just when I had almost lost hope, and it saved my day. Ajaxscraper worked very well. Thank you, thank you and thank you.

  2. @emre glad you like it. And glad it helped. Please share it around! Cheers

    Neil

  3. Thanks for all your videos. I love them. I have a project I'm going to attempt which would use the clustering examples from your videos, and I wanted to ask you about how my data is initially structured. I work in the oil and gas markets, and I want to profile and cluster various groups of wells by their decline rates over time. My data would be in an MS Excel spreadsheet with one well per row: the first column holds the well name and each following column holds a monthly oil production number. It would look like this:

    Well NAME   Month1  Month2  Month3  Month4  Month5  Month6  Month7 ...
    Gonzales    100     96      93      85      70      65      30
    Dewitt      300     200     100     50      20      10      2
    Vancouver   50      45      30      28      23      22      21
    Horizon     450     430     420     410     380     350     200
    ...
    ...
    ..

    Would you have any idea how I should initially feed my data into RapidMiner before I use the cluster operators?

  4. @Neil

    Looks like you have a time series there. The layout looks right in that the attributes are on the top and the entities on the left.

    You might start by changing the numbers to their rates of change. For example, instead of Horizon Month1->Month2 as 450->430, you could have Horizon ToMonth2 as -4.44%, and similar, so that everything is comparable.

    Clustering would find observations (wells in this case) that are similar based on the provided attributes, which are the monthly volumes or rates of change. Is that what you're looking for?

  5. Hi Neil,

    Recently I was doing a lot of web mining with RapidMiner. Nowadays you can hardly avoid websites with dynamically loaded content (using AJAX). Many months have passed since my last RapidMiner activities, so I searched the forums to find out whether there had been any progress in handling JavaScript. I found nothing on that except a topic where you pointed to Selenium and Chrome for AJAX scraping. That finally led me to your blog and your ajaxscraper tool. The video demonstrates a nice piece of software and made me want to have a look at it. Sadly it doesn't work for me.

    I haven't searched for alternatives yet, but I wanted to let you know about possible issues. I have JRE 1.6.0 (update 38) installed and tried both the 32-bit and the 64-bit versions. When executing the jar file I get the "Opening browser..." line printed to the console but nothing else happens. Only the output folder with its two subfolders is generated. Firefox is at 17.0.1. Any ideas? Maybe some more debug output might help?

    Regards
    Matthias

  6. I am working on extracting information about third-party advertisements on a given webpage. I tried an HTML parser such as htmlunit, but realized that most of the third-party ads are dynamic and their information cannot be extracted with static parsing. Most of them are inside iframe tags. Is there any way I can get the information about the ads embedded inside these iframes?

    Can I use htmlunit or Selenium to do something like this? These webdrivers just simulate the functions of web browsers, so I thought I could use them from Java.
    OR
    Can I make use of the Adblock Plus libraries in some way to do the required task? Adblock Plus removes ads, so instead of blocking the ads, I could use it just to get the information about them. Is this possible? How?

    I have been working on this for the past 10 days and I am kind of stuck. I am asking this question personally to you because I have asked this question on several forums but have failed to receive a satisfying response. Would be great if you can kindly give me some clue so that I can start working on it. Any help would be greatly appreciated.

  7. Can you please share the source code as well ?

  8. I have the same problem as @Matthias. I just get the "Opening browser..." text and then nothing happens.

  9. Hi Neil, first, thank you for posting videos and tutorials. I appreciate your efforts very much. I am currently trying to mine a website called https://www.cdproject.net/ for research. I have followed your instructions, but I end with the following errors:

    [I have successfully installed phpunit and selenium]
    (1) Hard way of scraping
    after I run phpunit functional I get: '".\php.exe" is not recognized as an internal or external command...'
    (I have added the path variable, and in PEAR I have set the following:
    SET "PHP_PEAR_PHP_BIN=php\php.exe")

    What's interesting is after I run ' pear install phpunit/PHPUnit '
    and run ' phpunit functional ' again I get this:

    require_once(File/Iterator/Autoload.php) .... in C:\php\pear\phpunit\Autoload.php on line 45

    I checked this path to make sure I have autoload in that directory.
    I have also added the following: include_path = "c:\php\pear" in my php.ini-dist file.

    I was wondering if you have any suggestions on what to check.

    (2) Easy way of scraping
    I recently installed Firefox to try the easy scraping option. Unfortunately, it freezes at "Opening browser..."
    (scraping the default URLs of vancouverdata.blogspot.ca and rapidminer).

    I will try restarting the computer to see if that does anything.

    Last question:
    Is it possible to mine a site with a login, e.g. https://www.cdproject.net/?

    Thank you for all the help!
    Chris

  10. Hi,

    Great program, but I'm having trouble getting it to work: it fails when opening the browser.
    The error is below. Do you have any idea what causes this issue?
    Opening browser...
    Unable to bind to locking port 7054 within 45000 ms
    Build info: version: '2.19.0', revision: '15849', time: '2012-02-08 16:12:19'
    System info: os.name: 'Windows 7', os.arch: 'amd64', os.version: '6.1', java.version: '1.7.0_05'
    Driver info: driver.version: FirefoxDriver

    Thank you

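The rate-of-change suggestion from comment 4 can be sketched in a few lines. This is only an illustration and not part of ajaxscraper: the language (Python) is my choice, the numbers are the four sample wells from comment 3, and `rate_of_change` is a hypothetical helper name. It assumes no month has a production of zero, since that would cause a division by zero.

```python
# Sample monthly production figures copied from comment 3.
wells = {
    "Gonzales": [100, 96, 93, 85, 70, 65, 30],
    "Dewitt": [300, 200, 100, 50, 20, 10, 2],
    "Vancouver": [50, 45, 30, 28, 23, 22, 21],
    "Horizon": [450, 430, 420, 410, 380, 350, 200],
}


def rate_of_change(series):
    """Return month-over-month percentage changes (hypothetical helper).

    Assumes every value used as a divisor is non-zero.
    """
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(series, series[1:])
    ]


# One row per well, one value per transition (ToMonth2, ToMonth3, ...).
rates = {name: rate_of_change(values) for name, values in wells.items()}

for name, changes in rates.items():
    print(name, " ".join(f"{c:.2f}%" for c in changes))
```

Each well ends up with one fewer column than it started with, because the first month has nothing to compare against. For Horizon the first value works out to -4.44%, matching the example in comment 4. The resulting table could then be saved and imported into RapidMiner in place of the raw volumes; that step is not shown here.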