Web crawling is a valuable skill these days, as it allows you to gather data quickly from many sources for personal or commercial needs. To make the most of it, though, there is a set of rules you should follow. In this post I am going to share some tips and advice on the topic. Most people are well aware of these rules, but they tend to ignore them.
You should always identify yourself in the
User-Agent header of your
requests. Usually it should contain the name of the company you are crawling for,
your name, and/or an email address where the site owners can contact you in case
they want to tell you to stop or ask you some questions (or even give you
hints about how to get to the information you need faster).
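As a minimal sketch of the idea above, the snippet below builds a descriptive User-Agent string and attaches it to a request using Python's standard library. The crawler name, info URL, and email address are placeholders I made up for illustration, and the `name/version (+url; email)` layout is just one common convention, not a requirement.

```python
import urllib.request


def build_user_agent(bot_name, version, info_url, contact_email):
    """Compose a User-Agent string that says who is crawling and how to reach them."""
    return f"{bot_name}/{version} (+{info_url}; {contact_email})"


# Hypothetical identity: replace every value with your own details.
USER_AGENT = build_user_agent(
    "ExampleCrawler", "1.0", "https://example.com/bot", "crawler-admin@example.com"
)


def make_request(url):
    """Create a request object that carries the identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


# Building the request does not touch the network; urllib.request.urlopen would send it.
request = make_request("https://example.com/")
print(request.get_header("User-agent"))
```

Keeping the header in a single constant makes it harder to forget on one code path, since every request goes through `make_request`.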
Read the terms and conditions
Many websites explicitly prohibit crawling for various purposes, most often commercial ones. You must comply with their terms and conditions so that you do not break the law or get banned.
Fetch and respect the robots.txt file
robots.txt is a special file located in the root of the website that
tells web crawlers which resources they are allowed to access, along with various parameters,
so that site owners can make sure you don’t attempt to do nasty things on their website.
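To illustrate, here is a small sketch that checks URLs against robots.txt rules with Python's built-in `urllib.robotparser`. The rules are parsed from an in-memory sample so the snippet runs offline; the sample file contents, the `ExampleCrawler` name, and the URLs are assumptions for the example.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "ExampleCrawler"  # placeholder crawler name


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser that can answer access queries."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch(USER_AGENT, "https://example.com/articles/1"))    # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/data"))  # False
```

In a real crawler you would more likely call `set_url()` with the site's robots.txt address and then `read()` to download it, and check `can_fetch()` before every request.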
There are also other directives that you should respect:
Crawl-delay: the number of seconds to wait between requests, so that you won’t overload their servers. If it is not present, there is no speed restriction, but I suggest using a default delay between requests (somewhere around 0.5 to 1 second).
Sitemap: an XML page that leads you directly to the interesting pages, so that you won’t need to crawl and parse intermediate pages. If you find sitemaps, always stick to them, because their format is standard and unlikely to change.
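The two directives above can be sketched roughly as follows, again with the standard library only. The robots.txt and sitemap samples are invented for the example, the 1 second fallback is my own choice within the range suggested above, and `site_maps()` requires Python 3.8 or newer.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

DEFAULT_DELAY_SECONDS = 1.0  # assumed fallback when no Crawl-delay is declared

# Hypothetical robots.txt and sitemap contents, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/articles/1</loc></url>
  <url><loc>https://example.com/articles/2</loc></url>
</urlset>
"""

# Namespace used by the standard sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def delay_for(parser, user_agent, default=DEFAULT_DELAY_SECONDS):
    """Return the declared Crawl-delay in seconds, or the default if there is none."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def sitemap_urls(sitemap_xml):
    """Extract every <loc> entry from a sitemap <urlset> document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


parser = urllib.robotparser.RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

print(delay_for(parser, "ExampleCrawler"))  # 2.0
print(parser.site_maps())                   # ['https://example.com/sitemap.xml']
print(sitemap_urls(SAMPLE_SITEMAP))
```

A crawler would then sleep for `delay_for(...)` seconds between requests (for instance with `time.sleep`) and walk the URLs returned by `sitemap_urls` instead of discovering pages by following links.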
Don’t extract data with regular expressions
I have seen a lot of web crawlers that abuse regular expressions to extract data, and I have to say that this trend needs to stop. Besides being hard to extend, maintain and debug, regular expressions tend to become outdated very easily and require more maintenance time (a few simple page changes can break your entire extraction logic).
For reference see this:
For extracting data, I recommend using query selectors, XML parsers, or XPath. They are the best choice for this job.
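As a rough illustration of the parser-based approach, the sketch below pulls product data out of a page using the limited XPath support in Python's `xml.etree.ElementTree`. The markup, class names, and fields are invented for the example, and the snippet assumes well-formed markup; real-world HTML is often not, in which case a more tolerant parser such as lxml or Beautiful Soup is the usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
SAMPLE_PAGE = """\
<html>
  <body>
    <div class="product">
      <h2 class="title">Blue Widget</h2>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <h2 class="title">Red Widget</h2>
      <span class="price">12.50</span>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Select product blocks by structure and attributes rather than by text patterns."""
    root = ET.fromstring(markup)
    products = []
    for node in root.findall(".//div[@class='product']"):
        title = node.findtext("h2[@class='title']")
        price = node.findtext("span[@class='price']")
        products.append({"title": title, "price": float(price)})
    return products


for product in extract_products(SAMPLE_PAGE):
    print(product["title"], product["price"])
```

Because the selectors describe where the data lives in the document tree, cosmetic changes such as extra whitespace, reordered attributes, or new wrapper elements around the product list do not break the extraction the way they would with a pattern matched against raw text.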
Spend some time understanding the HTTP protocol
There are a lot of things that can affect how a website is crawled, such as cookies, custom headers, query parameters, content types and many other elements. You should spend some time researching and understanding the purpose of each of them, as they are all heavily used by web servers to serve content. By playing with some of these items, you can drastically reduce the number of requests required to get to the relevant data on a website (for example, by using query parameters to perform filtering and paging).
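To make the filtering and paging example concrete, here is a small sketch that builds listing URLs with query parameters. The endpoint and the parameter names (`category`, `page`, `per_page`) are hypothetical; each site defines its own, which you can usually discover by watching the requests the site itself makes in the browser's developer tools.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, filters, page, page_size):
    """Build one listing URL that applies the filters and asks for a specific page."""
    params = dict(filters)
    params["page"] = page          # hypothetical parameter name
    params["per_page"] = page_size  # hypothetical parameter name
    return f"{base_url}?{urlencode(params)}"


def listing_urls(base_url, filters, total_items, page_size):
    """Return the URLs needed to cover total_items at page_size items per request."""
    page_count = -(-total_items // page_size)  # ceiling division
    return [
        build_listing_url(base_url, filters, page, page_size)
        for page in range(1, page_count + 1)
    ]


# 250 items at 100 per page take 3 requests instead of 250 individual page fetches.
for url in listing_urls("https://example.com/products", {"category": "books"}, 250, 100):
    print(url)
```

Asking the server for a larger page size and filtering on the server side is what cuts the request count, provided the site actually supports those parameters.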
Remember, even if you are allowed to crawl a website and nothing explicitly prohibits it, be gentle with their servers and don’t overload them. Be ethical and precise when crawling a website, and analyze the site first in order to reduce the number of requests you make.