Implementation Notes

Scraping with Multiple Tor Instances

Google scrapes the web every day to keep its index up to date: it fetches websites, stores their information, and presents it to the viewer.

Web scrapers can retrieve data far faster than a human can, so a badly behaved scraper can degrade the performance of the website it targets. The server has to serve many requests per second, and its performance suffers.

The Process

The most basic way for a website to detect scraping is unusual traffic coming from a single IP. If a scraper makes thousands of requests from the same IP within a very short time, the website will normally blacklist that IP. The block can be temporary or permanent, and either way it becomes impossible to scrape from that IP.
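The per-IP check described above can be sketched as a sliding-window counter. This is a minimal illustration of the idea, not how any particular website implements it; the class name `IpRateMonitor` and the threshold values are hypothetical.

```python
from collections import defaultdict, deque


class IpRateMonitor:
    """Blacklist an IP that sends too many requests within a time window."""

    def __init__(self, max_requests=100, window_seconds=10.0):
        # Both thresholds are illustrative assumptions, not real-world values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blacklist = set()

    def allow(self, ip, now):
        """Record a request at time `now`; return False if the IP is blocked."""
        if ip in self.blacklist:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_requests:
            self.blacklist.add(ip)
            return False
        return True
```

A scraper that stays under the threshold for each IP, or spreads its requests over many IPs, never trips a check of this kind.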

Tor, which is used for anonymous surfing over the Internet, was used in one of our clients’ applications. It changes your outgoing IP, so different sites can be fetched from different IPs.

The application of vteam #579’s client required data on millions of businesses from different regions of the world. For this purpose, we had to go through the yellow pages of every region. Yellow pages websites are the business directories of a region. Each website applies many security measures so that its data cannot be scraped.

Solution

vteams engineer Ali Mehdi installed Tor and used it as a proxy server when sending requests with cURL. The outgoing IP was changed every 15 requests, and a gap of 5 seconds was added between requests. This made it difficult for the websites to detect the scraper.
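The schedule described above (a new outgoing IP every 15 requests, 5 seconds between requests) could look roughly like the sketch below. It is not the original implementation. It assumes that `curl` is installed, that Tor listens on its default SOCKS port 9050, and that `ControlPort 9051` is enabled in `torrc` with a password the script knows. The function names are hypothetical.

```python
import socket
import subprocess
import time

TOR_SOCKS = "127.0.0.1:9050"        # Tor's default SOCKS port (assumed unchanged)
TOR_CONTROL = ("127.0.0.1", 9051)   # assumes ControlPort 9051 is enabled in torrc


def fetch_via_tor(url):
    """Fetch a URL with curl, routed through the local Tor SOCKS proxy."""
    result = subprocess.run(
        ["curl", "--silent", "--socks5-hostname", TOR_SOCKS, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def request_new_circuit(password=""):
    """Ask Tor for fresh circuits (usually a new exit IP) via the control port."""
    with socket.create_connection(TOR_CONTROL, timeout=10) as conn:
        conn.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not conn.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor control authentication failed")
        conn.sendall(b"SIGNAL NEWNYM\r\n")
        if not conn.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor rejected the NEWNYM signal")


def scrape(urls, fetch=fetch_via_tor, rotate=request_new_circuit,
           sleep=time.sleep, requests_per_ip=15, delay_seconds=5):
    """Fetch each URL, pausing between requests and rotating the IP periodically."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            if index % requests_per_ip == 0:
                rotate()          # switch to a new circuit after every 15 requests
            sleep(delay_seconds)  # 5-second gap between consecutive requests
        pages.append(fetch(url))
    return pages
```

Passing `fetch`, `rotate`, and `sleep` as parameters keeps the schedule testable without a running Tor process. Note that Tor rate-limits `NEWNYM` signals, so rotating much more often than this may have no effect.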

Once the scraper was complete, it was put on production as a separate service so that it could run in the background.

Once the scraper was ready and live, the client could clone it onto hundreds of servers. Every server had Tor installed and the same script running, so data was fetched at a very fast rate.

Outcomes

  1. Data was scraped at a much faster pace
  2. Millions of records could be scraped each day
  3. Blocking by websites was prevented efficiently
  4. Exit nodes were also configured for Tor, so that a site could be hit from a particular region or from a specific country’s IPs. For example, the client could choose to hit a China-targeted website with only China’s IPs. In this way one can avoid being blacklisted in that region.
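Restricting exit nodes by country is done in Tor's `torrc` file with the `ExitNodes` and `StrictNodes` options. As a rough sketch, the hypothetical helper below generates those lines for a list of two-letter country codes. How the original project managed its configuration is not stated, so treat this as one possible approach.

```python
def exit_node_config(country_codes, strict=True):
    """Build torrc lines that restrict Tor exit nodes to the given countries."""
    # torrc expects country codes in braces, e.g. {cn}, separated by commas.
    nodes = ",".join("{" + code.lower() + "}" for code in country_codes)
    lines = ["ExitNodes " + nodes]
    if strict:
        # StrictNodes 1 tells Tor not to fall back to exits outside the list.
        lines.append("StrictNodes 1")
    return "\n".join(lines)


if __name__ == "__main__":
    print(exit_node_config(["CN"]))
```

The generated lines would be appended to `torrc`, after which Tor has to be restarted or reloaded for them to take effect.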