Scraping with Multiple TOR Instances

Google scrapes the web every day to keep its index up to date: it fetches websites, stores their information and presents it to the viewer.

Web scrapers can retrieve data far faster than a human being can, so bad scraping practices can hurt the performance of the website being scraped. In that case the server has to serve many requests per second from the scraper, which slows it down for everyone else.

The Process

The most basic sign that a website is being scraped is unusual traffic coming from a single IP. If a scraper makes thousands of requests from the same IP within a very short time, websites normally blacklist that IP. In this way a website can block the IP temporarily or permanently, making it impossible to scrape from that IP.
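To make the detection side concrete, here is a minimal sketch of how a website might flag a single IP that sends too many requests in a short time. The class name IpRateMonitor and the thresholds are hypothetical and chosen for illustration only; real sites use their own, usually more elaborate, rules.

```python
from collections import defaultdict, deque


class IpRateMonitor:
    """Flags an IP that sends more than max_requests within window_seconds.

    Hypothetical helper for illustration; the thresholds are assumptions.
    """

    def __init__(self, max_requests=100, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if the IP now exceeds the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A scraper that spreads its requests over time and over several IPs keeps each per-IP count below such a limit, which is the idea behind the approach described in the rest of this post.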

TOR, which is used for anonymous surfing over the Internet, was used in one of our clients' applications. It routes your traffic through its network so that the target site sees a different outgoing IP, and that IP can be changed while fetching different sites over the Internet.

The application built for vteam #579's client required data on millions of businesses from different regions of the world. To collect it, we had to go through the yellow pages of every region. Yellow pages websites are the business directories of a region. Each of these websites applies a number of security measures so that its data cannot be scraped.

Solution

vteams engineer Ali Mehdi installed TOR and used it as a proxy server when sending requests with cURL. The outgoing IP was changed every 15 requests, and a gap of 5 seconds was added between requests. This made it difficult for the websites to detect the scraper.
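The schedule described above can be sketched as follows. This is not the original scraper, only an illustration of the rotation logic: the function names curl_command and scrape are hypothetical, and the proxy address assumes TOR's default SOCKS port (9050) on the local machine. The fetch, rotate_ip and sleep callables are passed in so the loop can be tried without a running TOR instance.

```python
import time

REQUESTS_PER_IP = 15   # from the text: new outgoing IP every 15 requests
DELAY_SECONDS = 5      # from the text: 5-second gap between requests
TOR_SOCKS_PROXY = "127.0.0.1:9050"  # assumption: TOR's default SOCKS port


def curl_command(url, proxy=TOR_SOCKS_PROXY):
    """Build a curl command line that sends the request through the proxy."""
    # --socks5-hostname also resolves host names through the proxy.
    return ["curl", "--silent", "--socks5-hostname", proxy, url]


def scrape(urls, fetch, rotate_ip, sleep=time.sleep,
           requests_per_ip=REQUESTS_PER_IP, delay=DELAY_SECONDS):
    """Fetch each URL, pausing between requests and rotating the IP
    after every `requests_per_ip` requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # gap between consecutive requests
            if index % requests_per_ip == 0:
                rotate_ip()  # hypothetical hook: ask TOR for a new identity
        results.append(fetch(url))
    return results
```

In a real run, fetch could execute curl_command(url) with subprocess.run, and rotate_ip could, for example, send the NEWNYM signal to TOR's control port; both are assumptions about the setup rather than details given in the case study.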

Once the scraper was complete, it was deployed to production as a service that runs in the background.

Once the scraper was ready and live, the client could clone it onto hundreds of servers. Every server had TOR installed and the script running, so data was fetched at a very fast rate.

Outcomes

Data was scraped at a much faster rate
Millions of records could be scraped each day
Blocking by the target websites was avoided
Exit Nodes were also configured for TOR, so that a website could be hit from a particular region or from a specific country's IPs. For example, the client could choose to hit a China-targeted website with only China's IPs. In this way one can avoid being blacklisted in that region.
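As an illustration of the exit node setting, the sketch below generates the relevant lines of a TOR configuration file (torrc). The helper name exit_node_torrc is hypothetical and the port is TOR's default; SocksPort and ExitNodes are standard torrc options, and country codes are written in braces. The configuration actually used in the project is not given in this post.

```python
def exit_node_torrc(country_codes, socks_port=9050):
    """Return torrc lines that ask TOR to exit through the given countries.

    Hypothetical helper; country_codes are two-letter codes such as "cn".
    """
    if not country_codes:
        raise ValueError("at least one country code is required")
    # TOR expects country codes wrapped in braces, e.g. {cn}.
    nodes = ",".join("{" + code.lower() + "}" for code in country_codes)
    return f"SocksPort {socks_port}\nExitNodes {nodes}\n"


# Example: exit only through nodes located in China.
print(exit_node_torrc(["cn"]), end="")
```

Restricting exits to one country shrinks the pool of available IPs, so the rotation interval may need adjusting for such targets.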

Muhammad Ahmad
