Tuesday, 24 September 2013

Selenium IDE and Web Scraping

Selenium is a browser automation framework that includes an IDE, a Remote Control server and bindings for various languages, including Java, .NET, Ruby, Python and others. In this post we touch on the basic structure of the framework and its application to Web Scraping.
What is Selenium IDE


Selenium IDE is an integrated development environment for Selenium scripts. It is implemented as a Firefox plugin and records browser interactions so that they can be edited afterwards. This works well for composing and debugging software tests. The Selenium Remote Control is a server built for a particular environment; it lets custom scripts drive the browsers it controls. Selenium runs on Windows, Linux, and Mac OS X. For how the various Selenium components are supported by the major browsers, read here.
What Selenium does and how it applies to Web Scraping

Basically, Selenium automates browsers, and this ability can clearly be applied to web scraping. Since browsers (and therefore Selenium) support JavaScript, jQuery and other ways of working with dynamic content, why not use this to our benefit in web scraping rather than trying to catch Ajax events with plain code? The second reason for this kind of scrape automation is browser-fashion data access (though today most libraries emulate this).
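To make the point about dynamic content concrete, here is a minimal sketch of waiting for Ajax-loaded elements instead of intercepting the Ajax calls. It assumes a driver object from Selenium's Python WebDriver bindings; the helper name `wait_for_elements` and its parameters are made up for this illustration. The bindings also ship their own `WebDriverWait` class for the same purpose, and this dependency-free loop only shows the idea.

```python
import time


def wait_for_elements(driver, css_selector, timeout=10.0, poll=0.5):
    """Poll the page until at least one element matches, or raise TimeoutError.

    `driver` is assumed to be a Selenium WebDriver instance; only its
    find_elements(by, value) method is used here.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the locator strategy string used by the bindings
        found = driver.find_elements("css selector", css_selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no element matched {css_selector!r} within {timeout}s"
            )
        time.sleep(poll)
```

The browser runs the page's JavaScript on its own; the script only has to wait until the content it wants shows up in the DOM.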

Yes, Selenium automates browsers, but how do we control Selenium from a custom script to automate a browser for web scraping? There are Selenium libraries (bindings) for PHP and other languages that let scripts call and use Selenium. Selenium clients can be written with these libraries in almost any language we prefer, for example Perl, Python, Java or PHP. Those libraries (the API), together with the Java-written server that invokes browsers to perform actions, constitute Selenium RC (Remote Control). Remote Control automatically loads the Selenium Core into the browser to control it. For more details on Selenium components refer to here.



A tough scrape task for a programmer

“…cURL is good, but it is very basic.  I need to handle everything manually; I am creating HTTP requests by hand.
This gets difficult – I need to do a lot of work to make sure that the requests that I send are exactly the same as the requests that a browser would send, both for my sake and for the website’s sake. (For my sake because I want to get the right data, and for the website’s sake because I don’t want to cause error messages or other problems on their site because I sent a bad request that messed with their web application).  And if there is any important javascript, I need to imitate it with PHP.
It would be a great benefit to me to be able to control a browser like Firefox with my code. It would solve all my problems regarding the emulation of a real browser…
it seems that Selenium will allow me to do this…” -Ryan S

Yes, that’s what we will consider below.
Scrape with Selenium

In order to create scripts that interact with the Selenium Server (Selenium RC, Selenium Remote WebDriver) or to create a local Selenium WebDriver script, we need language-specific client drivers (also called Formatters; they are included in the selenium-ide-1.10.0.xpi package). The Selenium servers, drivers and bindings are available at the Selenium download page.
The basic recipe for scraping with Selenium:

    Use the Chrome or Firefox browser
    Get Firebug or Chrome Dev Tools (Ctrl+Shift+I) in action
    Install the requirements (Remote Control or WebDriver, libraries and so on)
    Selenium IDE: record a ‘test’ run through a site, adding some assertions
    Export it as a Python (or other language) script
    Edit the script (loops, data extraction, db input/output)
    Run the script against the Remote Control server
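
As a rough illustration of the "edit it" step, here is a sketch of what an exported script might look like once a loop and some data extraction are added. It assumes the Python WebDriver bindings (the `selenium` package) and a local Firefox rather than the Remote Control server; the function names `make_driver` and `scrape_titles` and the example URL are made up for this sketch.

```python
from typing import Dict, Iterable, List


def make_driver():
    """Start a local Firefox session (needs the selenium package and Firefox)."""
    # Imported lazily so the extraction helper below can be used without it.
    from selenium import webdriver
    return webdriver.Firefox()


def scrape_titles(driver, urls: Iterable[str]) -> List[Dict[str, str]]:
    """Visit each URL in the same browser session and record the page title."""
    rows = []
    for url in urls:
        driver.get(url)  # the browser loads the page and runs its JavaScript
        rows.append({"url": url, "title": driver.title})
    return rows


# Possible usage (starts a real browser, so it is left commented out):
#
#     driver = make_driver()
#     try:
#         for row in scrape_titles(driver, ["http://example.com/"]):
#             print(row)
#     finally:
#         driver.quit()
```

The rows could just as well be written to a database at this point; the list of dictionaries is only the simplest place to put them.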

Short intro slides on scraping tough websites with Python & Selenium are here (as Google Docs slides) and here (SlideShare).
Selenium components for Firefox installation guide

For how to install the Selenium IDE in Firefox, see here, starting at slide 21. The Selenium Core and Remote Control installation instructions are there too.
Extracting dynamic content using jQuery/JavaScript with Selenium

One programmer is doing a similar thing …

1. launch a Selenium RC (Remote Control) server
2. load a page
3. inject the jQuery script
4. select the content of interest using jQuery/JavaScript
5. send it back to the PHP client as JSON.
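
The same steps can be sketched in Python. Note that this is a swap: it uses the WebDriver bindings (`execute_script` and `execute_async_script`) instead of the Remote Control server and PHP client described above. The jQuery CDN URL, the function name `extract_links` and the choice of extracting links are assumptions made for this illustration.

```python
import json

# Assumed CDN location for jQuery; any reachable copy would do.
JQUERY_URL = "https://code.jquery.com/jquery-3.7.1.min.js"

# Step 3: inject jQuery unless the page already has it. The last argument
# passed to an async script is the callback that ends the call.
INJECT_JQUERY = """
var done = arguments[arguments.length - 1];
if (window.jQuery) { done(true); return; }
var s = document.createElement('script');
s.src = arguments[0];
s.onload = function () { done(true); };
s.onerror = function () { done(false); };
document.head.appendChild(s);
"""

# Steps 4 and 5: select with jQuery and hand the result back as JSON.
EXTRACT_LINKS = """
return JSON.stringify(jQuery(arguments[0]).map(function () {
    return {text: jQuery(this).text(), href: jQuery(this).attr('href')};
}).get());
"""


def extract_links(driver, url, selector="a"):
    """Load a page, inject jQuery, and return the matched links as dicts."""
    driver.get(url)  # step 2: load a page
    loaded = driver.execute_async_script(INJECT_JQUERY, JQUERY_URL)
    if not loaded:
        raise RuntimeError("jQuery could not be injected")
    return json.loads(driver.execute_script(EXTRACT_LINKS, selector))
```

Returning a JSON string and decoding it on the client side mirrors step 5; the bindings can also return plain arrays and objects directly.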

He finds it particularly easy and convenient to use jQuery for
screen scraping, rather than PHP/XPath.
Conclusion

The Selenium IDE is a popular tool for browser automation, mostly for software testing, yet Web Scraping techniques for tough dynamic websites may also be implemented with the IDE along with the Selenium Remote Control server. These are the basic steps:

    Record the ‘test’ browser behavior in the IDE and export it as a script in the programming language of your choice
    The exported script runs against the Remote Control server, which makes the browser send HTTP requests; the script then catches the Ajax-powered responses to extract content.

Selenium-based Web Scraping is an easy task for small-scale projects, but it consumes a lot of memory, since it launches a full browser instance for each session.



Source: http://extract-web-data.com/selenium-ide-and-web-scraping/
