Data should be crawled responsibly so that crawling does not harm the website being scraped.
If a crawler sends multiple requests per second and downloads large files, an under-powered server can struggle to keep up. Crawlers query a website far more frequently than any human visitor, so the server ends up busy handling the crawler's requests instead of serving people. Since crawlers consume a site's resources without contributing anything to it (or its owner), many websites block them, for good reason: most have anti-crawler mechanisms in place.
In a previous post, I discussed how websites block crawlers and how those blocks can be overcome. But as noted there, my credo is that crawlers should be built to minimize, if not eliminate, any detrimental effect on the website.
This is what I mean by "friendly crawling". Be nice. Here's how.
1. Pause between requests.
Don't fire off requests back-to-back. The faster you crawl, the harder it is on the website. Instead, pause between requests to give the server a break.
In Python, put your crawler to sleep for a few seconds after every request:
from time import sleep

sleep(5)  # puts the crawler to sleep for 5 seconds
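To make this reusable, the pause can be wrapped in a small rate limiter that guarantees a minimum delay between consecutive requests, no matter where in your code they are sent from. This is only a sketch; the class name and the five-second default are my own choices, not a standard API:

```python
import time


class RateLimiter:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough so requests are at least min_delay apart."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last = time.monotonic()
```

Call `limiter.wait()` right before each request. The first call returns immediately; every later call sleeps only for whatever part of the delay has not already passed, so the crawler never hammers the server but also never waits longer than necessary.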
2. Reconsider how much data you really need
Go back to your data model and reconsider whether you can limit the data collection. The less data you need, the less work the web server does to serve your crawler.
Do you really need all that data? Is there a way to narrow your data input? These are important questions to keep asking throughout your project.
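Two simple ways to act on this are to keep only the fields your data model actually uses and to stop crawling once you have enough records. Here is a minimal sketch of both ideas; the field names, the record source, and the limit are hypothetical stand-ins for whatever your project needs:

```python
# Assumption: only these fields matter downstream in the data model.
NEEDED_FIELDS = ("title", "price")
MAX_RECORDS = 100  # illustrative cap on how many records to collect


def trim_record(record):
    """Keep only the fields the data model actually uses."""
    return {field: record[field] for field in NEEDED_FIELDS if field in record}


def collect(records, limit=MAX_RECORDS):
    """Gather trimmed records and stop early once the limit is reached."""
    collected = []
    for record in records:
        collected.append(trim_record(record))
        if len(collected) >= limit:
            break  # no need to burden the server with further requests
    return collected
```

Dropping unused fields keeps your storage lean, and the early `break` means the crawler simply stops asking the server for pages it was never going to use.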