Webcrawl
21 Nov 2023
How do you build a web crawler?
If you have a 1 gigabit connection, that’s about 1000/8 = 125 megabytes / second. A 20 TB hard drive will be filled in about 1.85 days. Websites vary in size, but let’s assume 1 megabyte per web page. That’s about 20 million webpages on the hard drive. If we restrict ourselves to 1 QPS per domain, that’s about 172,800 requests per domain over two days. So that’s about 115 domains at 172.8k pages each. The limitation seems to be how fast you can rack up disks.
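The arithmetic above can be checked in a few lines:

```python
# Back-of-envelope numbers from the paragraph above.
mb_per_sec = 1000 / 8                      # 1 Gbit/s is about 125 MB/s
disk_mb = 20 * 1_000_000                   # 20 TB expressed in MB
fill_days = disk_mb / mb_per_sec / 86_400  # ~1.85 days to fill the disk
pages = disk_mb // 1                       # at ~1 MB per page: 20 million pages
reqs_per_domain = 2 * 86_400               # 1 QPS sustained for two days
domains = pages / reqs_per_domain          # ~115 domains at that politeness rate
```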
DNS
You don’t want to make a DNS request every time you fetch a webpage, so you would want to cache this.
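A minimal sketch of such a cache, with a time-to-live per entry. The `resolve` callable and the `ttl` default are illustrative choices, not any standard API; a real crawler would fall back to something like `socket.getaddrinfo()`:

```python
import time

class DNSCache:
    """TTL-based DNS cache sketch: skip the DNS round trip on fresh hits."""

    def __init__(self, resolve, ttl=300):
        self._resolve = resolve   # callable: hostname -> IP string (injected)
        self._ttl = ttl
        self._cache = {}          # hostname -> (ip, expires_at)

    def lookup(self, host):
        entry = self._cache.get(host)
        if entry and entry[1] > time.time():
            return entry[0]       # fresh cache hit: no DNS request made
        ip = self._resolve(host)
        self._cache[host] = (ip, time.time() + self._ttl)
        return ip

# Usage with a fake resolver, to show the second lookup hits the cache.
calls = []
def fake_resolver(host):
    calls.append(host)
    return "93.184.216.34"       # made-up answer for the sketch

cache = DNSCache(fake_resolver)
cache.lookup("example.com")
cache.lookup("example.com")      # served from cache; resolver called once
```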
HTTP Protocol
HTTP/1.0 opens a new connection for every request.
HTTP/1.1 adds keep-alive (persistent connections), so one TCP connection can be reused for many requests.
HTTP/2 adds multiplexing and header compression, which make connections faster still.
The newer the protocol, the better the performance.
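A sketch of the on-the-wire difference: under HTTP/1.0 the client has to opt in to keeping the connection open, while under HTTP/1.1 persistence is the default, so a crawler pays the TCP (and TLS) handshake once per connection rather than once per page:

```python
def get_request(path, host, version="1.1"):
    """Build the raw request lines for a GET (sketch, not a full client)."""
    lines = [f"GET {path} HTTP/{version}", f"Host: {host}"]
    if version == "1.0":
        # HTTP/1.0 closes the connection after each response unless asked.
        lines.append("Connection: keep-alive")
    # HTTP/1.1 keeps the connection open by default, so no header needed.
    return "\r\n".join(lines) + "\r\n\r\n"
```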
HTTP versus HTTPS
Everything should be over HTTPS. There is no reason for things to be over HTTP unless you are a person who likes walking around naked.
User Agent
Some websites respond differently to different user agents. Googlebot is a special user agent; some websites will respond to Googlebot differently than to a normal user.
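Setting the user agent is just a request header. The crawler name and contact URL below are made up; `urllib` stores header keys capitalized, hence the `"User-agent"` lookup:

```python
from urllib.request import Request

# Identify the crawler via the User-Agent header (name/URL are invented).
req = Request(
    "https://example.com/",
    headers={"User-Agent": "MyCrawler/1.0 (+https://example.com/bot)"},
)
```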
Proxy
You might want to proxy your connection to check connectivity from different routes.
Headless web browser
You can use a web driver to control a web browser, so you can render the DOM by running the JavaScript.
Compression
You want to use less disk space.
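HTML is repetitive, so it compresses well. A sketch of storing pages gzipped, assuming you compress before writing to disk:

```python
import gzip

# A made-up repetitive page body, standing in for real crawled HTML.
html = b"<html>" + b"<p>hello</p>" * 1000 + b"</html>"

stored = gzip.compress(html)      # write this to disk instead of the raw bytes
restored = gzip.decompress(stored)  # lossless: the original page comes back
```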
Parsing
You need to parse the response to get more links to crawl.
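A minimal link extractor using only the standard library; a production crawler would likely use a more tolerant parser (lxml, BeautifulSoup). Relative links are resolved against the page URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/a/")
extractor.feed('<a href="/b">b</a> <a href="c.html">c</a>')
```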
Seeding
You need to decide where to start your crawl. https://info.cern.ch/ is the first website.
Robots.txt
robots.txt sets rules for which pages may be crawled. Following it also helps keep you out of endless dynamically generated pages.
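The standard library can parse these rules. Here the robots.txt body is fed in directly (normally you would fetch it from `/robots.txt`); the rules and crawler name are invented for the example:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /search
Crawl-delay: 5
""".splitlines())

allowed_about = rp.can_fetch("MyCrawler", "https://example.com/about")
allowed_search = rp.can_fetch("MyCrawler", "https://example.com/search?q=x")
delay = rp.crawl_delay("MyCrawler")  # seconds the site asks you to wait
```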
Sitemap.xml
These are webpages that the website owner wants you to crawl, or at least wants you to know exist.
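Sitemaps are plain XML, so extracting the listed URLs is straightforward; the sample document below is made up but follows the sitemaps.org schema:

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-11-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The sitemap namespace must be given explicitly to find the elements.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
urls = [u.findtext("sm:loc", namespaces=ns) for u in root.findall("sm:url", ns)]
```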
RSS feeds
New pages usually show up here first, which is useful if you are looking for fresh content.
Bloom filter
A compact probabilistic data structure that keeps track of URLs you have visited before. It can return false positives (claiming you visited a page you have not) but never false negatives.
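A toy version to show the mechanics: k hash positions per item over an m-bit array, with both sizes picked arbitrarily for the sketch. Real crawlers would size m and k from the expected URL count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives not."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item):
        # Derive k positions by salting the hash with the index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/")
```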
Priority Queue
You might want to crawl some pages sooner than others.
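With `heapq` (a min-heap) the lowest priority number pops first, so 0 means "crawl soonest". A counter breaks ties in insertion order; the URLs and priorities are invented:

```python
import heapq

frontier = []
counter = 0  # tie-breaker so equal priorities pop in insertion order
for priority, url in [(2, "https://example.com/archive"),
                      (0, "https://example.com/news"),
                      (1, "https://example.com/blog")]:
    heapq.heappush(frontier, (priority, counter, url))
    counter += 1

first = heapq.heappop(frontier)[2]   # lowest priority number comes out first
second = heapq.heappop(frontier)[2]
```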
Redirects
The page you end up on may not be the initial page you requested.
Canonical urls
Some websites have duplicate pages, but there is one url that is considered canonical for that page.
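The canonical URL is usually declared in a `<link rel="canonical">` tag, which can be pulled out with the same stdlib parser used for links:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical" ...>, if present."""
    canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

finder = CanonicalFinder()
finder.feed('<head><link rel="canonical" href="https://example.com/page"></head>')
```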
Storage
Common Crawl uses the Web ARChive (WARC) format. It stores both the request and the response. Some people only care about the response. It is useful to know when the webpage was crawled: you may only want the latest version of a page, or you may want to be able to look at the website over time.
Databases
Distributed databases like Redshift usually have a dist/partition key and a sort key. The dist/partition key determines which shard the data is located on, and the sort key determines how data is ordered within the shard.
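A sketch of how a partition key routes rows: hash the key (here the domain, so all pages from one site land on the same shard, where a sort key like crawl time orders them) and take it modulo the shard count. The shard count of 4 is arbitrary:

```python
import hashlib

def shard_for(domain, num_shards=4):
    """Map a partition key to a shard index (illustrative, not Redshift's)."""
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % num_shards
```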
Cookies
Some websites will respond differently depending on cookies. You may want to set cookies.
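A sketch of turning stored cookie values into a `Cookie` request header; the cookie names and values are made up:

```python
from http.cookies import SimpleCookie

# Cookies the crawler wants to send back (invented for the example).
jar = SimpleCookie()
jar["session"] = "abc123"
jar["lang"] = "en"

# Serialize into the value of a "Cookie:" request header.
header = "; ".join(f"{k}={v.value}" for k, v in jar.items())
```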
Rate limiting
You don’t want to be blocked by the websites you are crawling, so you should self-impose a rate limit.
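A minimal per-domain politeness limiter: remember the earliest time each domain may be fetched again. The clock is injectable so the sketch can be exercised without sleeping; the one-second delay is an arbitrary default:

```python
import time

class RateLimiter:
    """Track, per domain, how long to wait before the next fetch."""

    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.next_ok = {}  # domain -> earliest allowed fetch time

    def wait_time(self, domain):
        """Seconds to sleep before fetching from this domain, and book
        the slot so the next request is pushed out by `delay`."""
        now = self.clock()
        wait = max(0.0, self.next_ok.get(domain, 0.0) - now)
        self.next_ok[domain] = now + wait + self.delay
        return wait

# Usage with a fake clock: the first request goes immediately, the
# second must wait out the one-second delay.
fake_now = [100.0]
rl = RateLimiter(delay=1.0, clock=lambda: fake_now[0])
first_wait = rl.wait_time("example.com")
second_wait = rl.wait_time("example.com")
```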
Determining Frontier
Which links do you want to follow? Do you want to download PDFs / images / etc.? Do you want to crawl a limited number of pages from each page? What is the fanout?
References
- https://dev.to/bloomreach/discovery-crawling-billions-of-pages-building-large-scale-crawling-cluster-pt-1-4p6
- https://dev.to/bloomreach/discovery-crawling-billions-of-pages-building-large-scale-crawling-cluster-pt-2-320l
- https://nlp.stanford.edu/IR-book/html/htmledition/crawling-1.html
- https://www.cs.princeton.edu/courses/archive/spr10/cos435/Notes/web_crawling_topost.pdf
- https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- https://almanac.httparchive.org/en/2022/page-weight