
Scraping with Multiple TOR Instances

  • February 3, 2017

Google scrapes the web every day to keep its index up to date: it fetches websites, stores what it finds, and presents that information to searchers.

Web scrapers can retrieve data far faster than a human visitor, so a badly behaved scraper can degrade the performance of the website being scraped. The server ends up serving many requests per second from a single client, which slows it down for everyone else.

The Process

The most basic way for a website to detect that it is being scraped is unusual traffic coming from a single IP. If a scraper makes thousands of requests from the same IP within a very short time, the website will normally blacklist that IP. The block can be temporary or permanent, and either way it becomes impossible to scrape from that IP.
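
To make the detection side concrete, here is a minimal sketch of a per-IP sliding-window counter. It is an illustration only: the class name `IpRateMonitor`, the thresholds, and the in-memory storage are assumptions, not how any particular website implements blacklisting.

```python
from collections import defaultdict, deque


class IpRateMonitor:
    """Flags an IP that sends more than max_requests within window_seconds.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """

    def __init__(self, max_requests=100, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if the IP now exceeds the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A scraper that spreads its requests over time and over several IPs stays under a counter like this, which is the idea behind the approach described in the rest of the article.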

TOR, which is used for anonymous browsing over the Internet, was used in one of our clients' applications. It changes your outgoing IP, so requests to different sites over the Internet do not all appear to come from the same address.

The application built for the client of vteam #579 required data on millions of businesses from different regions of the world. To collect it, we had to go through the yellow pages of every region. Yellow pages websites are the business directories of a region. Each of these websites applies many security measures so that its data cannot be scraped.

Solution

vteams engineer Ali Mehdi installed TOR and used it as a proxy server when sending requests with CURL. The outgoing IP was changed every 15 requests, and a gap of 5 seconds was added between requests. This made it difficult for the websites to detect the scraper.
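
The article does not include the original script, so the following is only a sketch of the schedule described above: fetch through TOR with CURL, wait 5 seconds between requests, and ask for a new identity every 15 requests. The addresses `127.0.0.1:9050` and `127.0.0.1:9051` are assumed defaults (the control port must be enabled in torrc), and the function names are hypothetical.

```python
import socket
import subprocess
import time

TOR_SOCKS = "127.0.0.1:9050"       # assumed local Tor SOCKS address
TOR_CONTROL = ("127.0.0.1", 9051)  # assumed ControlPort, enabled in torrc


def fetch_via_tor(url):
    """Fetch a URL with curl, routed through the local Tor SOCKS proxy."""
    result = subprocess.run(
        ["curl", "--silent", "--socks5-hostname", TOR_SOCKS, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def request_new_identity(password=""):
    """Ask Tor for fresh circuits through the control port (SIGNAL NEWNYM)."""
    with socket.create_connection(TOR_CONTROL, timeout=10) as conn:
        for command in (f'AUTHENTICATE "{password}"', "SIGNAL NEWNYM"):
            conn.sendall((command + "\r\n").encode())
            reply = conn.recv(1024).decode()
            if not reply.startswith("250"):
                raise RuntimeError(f"Tor rejected {command!r}: {reply!r}")


def scrape(urls, fetch=fetch_via_tor, rotate=request_new_identity,
           sleep=time.sleep, rotate_every=15, delay_seconds=5):
    """Fetch each URL, pausing between requests and rotating every N requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            if index % rotate_every == 0:
                rotate()          # new identity after every 15th request
            sleep(delay_seconds)  # 5-second gap between consecutive requests
        pages.append(fetch(url))
    return pages
```

The `fetch`, `rotate`, and `sleep` parameters are injectable so the schedule can be exercised without a network. Note that a new circuit does not guarantee a different exit IP, and Tor may rate-limit NEWNYM signals, so the real rotation can be less regular than the schedule suggests.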

Once the scraper was complete, it was put on production as a service that runs in the background.

Once the scraper was ready and live, the client could clone it onto hundreds of servers. Each server has TOR installed and runs the same script, so data is fetched at a very fast rate.
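
The article does not show how the TOR instances were configured. As one possible approach, several instances can run side by side on a single host if each gets its own ports and data directory. The sketch below generates such a torrc per instance; `SocksPort`, `ControlPort`, and `DataDirectory` are standard torrc options, while the function name, port numbering, and directory layout are assumptions.

```python
def torrc_for_instance(index, base_socks_port=9050, base_control_port=9051,
                       data_root="/var/lib/tor-instances"):
    """Return torrc text for the index-th Tor instance on one host.

    Ports advance by 2 per instance so SOCKS and control ports never collide.
    """
    socks_port = base_socks_port + 2 * index
    control_port = base_control_port + 2 * index
    lines = [
        f"SocksPort {socks_port}",
        f"ControlPort {control_port}",
        # Each instance needs its own state directory.
        f"DataDirectory {data_root}/tor{index}",
    ]
    return "\n".join(lines) + "\n"
```

Each generated file can then be passed to its own TOR process with `tor -f <path>`, and each scraper worker pointed at the matching SOCKS port.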

Outcomes

  1. The rate at which data was scraped increased considerably
  2. Millions of records could be scraped each day
  3. Blocking by the target websites was effectively prevented
  4. Exit nodes were also configured for TOR, so that a target could be hit from IPs in a specific region or country. For example, the client could choose to hit a China-targeted website using only Chinese IPs. This helps avoid being blacklisted in that region.
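
The exit node restriction in the last outcome maps onto the torrc options `ExitNodes` and `StrictNodes`. The article does not show the configuration used, so this is a hedged sketch with a hypothetical helper that builds those lines from two-letter country codes.

```python
def exit_node_lines(country_codes, strict=True):
    """Build torrc lines that restrict Tor exit nodes to the given countries."""
    for code in country_codes:
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"expected a two-letter country code, got {code!r}")
    # torrc expects country codes wrapped in braces, e.g. {cn},{us}
    codes = ",".join("{" + code.lower() + "}" for code in country_codes)
    lines = [f"ExitNodes {codes}"]
    if strict:
        # StrictNodes 1 tells Tor not to fall back to other exit nodes.
        lines.append("StrictNodes 1")
    return lines
```

For the China example, `exit_node_lines(["CN"])` yields `ExitNodes {cn}` and `StrictNodes 1`. Restricting exits to one country shrinks the pool of available IPs, so rotation offers less variety than with unrestricted exits.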
