We also need to deal with some terminology: crawler and scraper. Let’s sort this out right away.
There is no single clear definition of these terms and, in my experience, people have slightly different conceptions of the scope and functionality of crawlers and scrapers. Here is the working definition of these terms for the purposes of our project.
A crawler calls up one or more web pages and downloads them or some of the files they link to. A crawler can be very simple (calling up one or a few web pages and downloading them) or more complex (calling up a web page, following links to other web pages, and downloading those as well).
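To make this concrete, here is a minimal sketch of the crawling idea in Python, using only the standard library. The class and function names are my own placeholders; a real project would add politeness delays, error handling, and a limit on how many links to follow.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url):
    """Download one page and return its HTML plus the links it contains.

    Following those links (and downloading them too) is what turns this
    simple crawler into a more complex one.
    """
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return html, parser.links
```

Calling `crawl()` once gives you the simple case; looping over the returned links and calling it again gives you the complex case described above.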
A scraper takes the downloaded files and extracts data from them so that the data can be used for analysis. Scrapers can deal not only with downloaded web pages but with all kinds of files. So, in other words: instead of clicking your way through a website and downloading each web page yourself, a computer program does it for you.
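And here is a matching sketch of the scraping step, again standard library only. It assumes (purely for illustration) that the data we want sits inside `<h2>` tags of an already downloaded page; which tags you target depends entirely on the site at hand.

```python
from html.parser import HTMLParser


class HeadingScraper(HTMLParser):
    """Extracts the text inside every <h2> tag of a downloaded page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Only collect text while we are inside an <h2> element.
        if self.in_h2 and data.strip():
            self.headings.append(data.strip())


def scrape_headings(html):
    """Turn one downloaded HTML document into a list of heading strings."""
    scraper = HeadingScraper()
    scraper.feed(html)
    return scraper.headings
```

The point of the division of labor: the crawler produces files, and the scraper turns those files into data you can analyze.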