Webcrawl
21 Nov 2023
How do you build a web crawler?
If you have a 1 gigabit connection, that’s about 1000/8 = 125 megabytes / second. A 20 TB hard drive will be filled in about 1.85 days. Websites vary in size, but let’s assume 1 megabyte per web page. That’s about 20 million webpages on the hard drive. If we restrict ourselves to 1 QPS per domain, that’s about 172,800 requests per domain over two days. So that’s about 115 domains at 172.8k pages each. The limitation seems to be how fast you can rack up disks.
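The arithmetic above can be checked in a few lines:

```python
# Back-of-envelope numbers from the paragraph above.
mb_per_sec = 1000 / 8                      # 1 Gbit/s is about 125 MB/s
disk_mb = 20 * 1_000_000                   # 20 TB expressed in MB
fill_days = disk_mb / mb_per_sec / 86_400  # ~1.85 days to fill the disk
pages = disk_mb // 1                       # at ~1 MB per page: 20 million pages
reqs_per_domain = 2 * 86_400               # 1 QPS sustained for two days
domains = pages / reqs_per_domain          # ~115 domains at that politeness rate
```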
DNS
You don’t want to make a DNS request every time you fetch a webpage, so you would want to cache this.
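A minimal sketch of such a cache, with a time-to-live per entry. The `resolve` callable and the `ttl` default are illustrative choices, not any standard API; a real crawler would fall back to something like `socket.getaddrinfo()`:

```python
import time

class DNSCache:
    """TTL-based DNS cache sketch: skip the DNS round trip on fresh hits."""

    def __init__(self, resolve, ttl=300):
        self._resolve = resolve   # callable: hostname -> IP string (injected)
        self._ttl = ttl
        self._cache = {}          # hostname -> (ip, expires_at)

    def lookup(self, host):
        entry = self._cache.get(host)
        if entry and entry[1] > time.time():
            return entry[0]       # fresh cache hit: no DNS request made
        ip = self._resolve(host)
        self._cache[host] = (ip, time.time() + self._ttl)
        return ip

# Usage with a fake resolver, to show the second lookup hits the cache.
calls = []
def fake_resolver(host):
    calls.append(host)
    return "93.184.216.34"       # made-up answer for the sketch

cache = DNSCache(fake_resolver)
cache.lookup("example.com")
cache.lookup("example.com")      # served from cache; resolver called once
```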
HTTP Protocol
HTTP/1.0 opens a new connection for every request.
HTTP/1.1 adds keep-alive (persistent connections), so one TCP connection can be reused for many requests.
HTTP/2 adds multiplexing and header compression, which make connections faster still.
The newer the protocol, the better the performance.
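A sketch of the on-the-wire difference: under HTTP/1.0 the client has to opt in to keeping the connection open, while under HTTP/1.1 persistence is the default, so a crawler pays the TCP (and TLS) handshake once per connection rather than once per page:

```python
def get_request(path, host, version="1.1"):
    """Build the raw request lines for a GET (sketch, not a full client)."""
    lines = [f"GET {path} HTTP/{version}", f"Host: {host}"]
    if version == "1.0":
        # HTTP/1.0 closes the connection after each response unless asked.
        lines.append("Connection: keep-alive")
    # HTTP/1.1 keeps the connection open by default, so no header needed.
    return "\r\n".join(lines) + "\r\n\r\n"
```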
HTTP versus HTTPS
Everything should be over HTTPS. There is no reason for things to be over HTTP unless you are a person who likes walking around naked.
User Agent
Some websites respond differently to different user agents. Googlebot is a special user agent; some websites will respond to Googlebot differently than to a normal user.
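Setting the user agent is just a request header. The crawler name and contact URL below are made up; `urllib` stores header keys capitalized, hence the `"User-agent"` lookup:

```python
from urllib.request import Request

# Identify the crawler via the User-Agent header (name/URL are invented).
req = Request(
    "https://example.com/",
    headers={"User-Agent": "MyCrawler/1.0 (+https://example.com/bot)"},
)
```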
Proxy
You might want to proxy your connection to check connectivity from different routes.
Headless web browser
You can use a web driver to control a web browser, so you can render the DOM by running the JavaScript.
Compression
You want to use less disk space.
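HTML is repetitive, so it compresses well. A sketch of storing pages gzipped, assuming you compress before writing to disk:

```python
import gzip

# A made-up repetitive page body, standing in for real crawled HTML.
html = b"<html>" + b"<p>hello</p>" * 1000 + b"</html>"

stored = gzip.compress(html)      # write this to disk instead of the raw bytes
restored = gzip.decompress(stored)  # lossless: the original page comes back
```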
Parsing
You need to parse the response to get more links to crawl.
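A minimal link extractor using only the standard library; a production crawler would likely use a more tolerant parser (lxml, BeautifulSoup). Relative links are resolved against the page URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/a/")
extractor.feed('<a href="/b">b</a> <a href="c.html">c</a>')
```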
Seeding
You need to decide where to start your crawl. https://info.cern.ch/ is the first website.
Robots.txt
robots.txt sets rules for which pages may be crawled. Following it also helps keep you out of endless dynamically generated pages.
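The standard library can parse these rules. Here the robots.txt body is fed in directly (normally you would fetch it from `/robots.txt`); the rules and crawler name are invented for the example:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /search
Crawl-delay: 5
""".splitlines())

allowed_about = rp.can_fetch("MyCrawler", "https://example.com/about")
allowed_search = rp.can_fetch("MyCrawler", "https://example.com/search?q=x")
delay = rp.crawl_delay("MyCrawler")  # seconds the site asks you to wait
```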
Sitemap.xml
These are webpages that the website owner wants you to crawl, or at least wants you to know exist.
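Sitemaps are plain XML, so extracting the listed URLs is straightforward; the sample document below is made up but follows the sitemaps.org schema:

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-11-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The sitemap namespace must be given explicitly to find the elements.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
urls = [u.findtext("sm:loc", namespaces=ns) for u in root.findall("sm:url", ns)]
```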
RSS feeds
New pages usually show up here first, which is useful if you are looking for fresh content.
Bloom filter
A compact probabilistic data structure that keeps track of URLs you have visited before. It can return false positives (claiming you visited a page you have not) but never false negatives.
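A toy version to show the mechanics: k hash positions per item over an m-bit array, with both sizes picked arbitrarily for the sketch. Real crawlers would size m and k from the expected URL count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives not."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item):
        # Derive k positions by salting the hash with the index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/")
```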
Priority Queue
You might want to crawl some pages sooner than others.
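With `heapq` (a min-heap) the lowest priority number pops first, so 0 means "crawl soonest". A counter breaks ties in insertion order; the URLs and priorities are invented:

```python
import heapq

frontier = []
counter = 0  # tie-breaker so equal priorities pop in insertion order
for priority, url in [(2, "https://example.com/archive"),
                      (0, "https://example.com/news"),
                      (1, "https://example.com/blog")]:
    heapq.heappush(frontier, (priority, counter, url))
    counter += 1

first = heapq.heappop(frontier)[2]   # lowest priority number comes out first
second = heapq.heappop(frontier)[2]
```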
Redirects
The page you end up on may not be the initial page you requested.
Canonical urls
Some websites have duplicate pages, but there is one url that is considered canonical for that page.
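The canonical URL is usually declared in a `<link rel="canonical">` tag, which can be pulled out with the same stdlib parser used for links:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical" ...>, if present."""
    canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

finder = CanonicalFinder()
finder.feed('<head><link rel="canonical" href="https://example.com/page"></head>')
```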
Storage
Common Crawl uses the Web ARChive (WARC) format. It stores both the request and the response. Some people only care about the response. It is useful to know when the webpage was crawled: you may only want the latest version of a page, or you may want to be able to look at the website over time.
Databases
Distributed databases like Redshift usually have a dist/partition key and a sort key. The dist/partition key determines which shard the data is located on, and the sort key determines how data is ordered within the shard.
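A sketch of how a partition key routes rows: hash the key (here the domain, so all pages from one site land on the same shard, where a sort key like crawl time orders them) and take it modulo the shard count. The shard count of 4 is arbitrary:

```python
import hashlib

def shard_for(domain, num_shards=4):
    """Map a partition key to a shard index (illustrative, not Redshift's)."""
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % num_shards
```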
Cookies
Some websites will respond differently depending on cookies. You may want to set cookies.
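A sketch of turning stored cookie values into a `Cookie` request header; the cookie names and values are made up:

```python
from http.cookies import SimpleCookie

# Cookies the crawler wants to send back (invented for the example).
jar = SimpleCookie()
jar["session"] = "abc123"
jar["lang"] = "en"

# Serialize into the value of a "Cookie:" request header.
header = "; ".join(f"{k}={v.value}" for k, v in jar.items())
```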
Rate limiting
You don’t want to be blocked by the websites you are crawling, so you should self-impose a rate limit.
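A minimal per-domain politeness limiter: remember the earliest time each domain may be fetched again. The clock is injectable so the sketch can be exercised without sleeping; the one-second delay is an arbitrary default:

```python
import time

class RateLimiter:
    """Track, per domain, how long to wait before the next fetch."""

    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.next_ok = {}  # domain -> earliest allowed fetch time

    def wait_time(self, domain):
        """Seconds to sleep before fetching from this domain, and book
        the slot so the next request is pushed out by `delay`."""
        now = self.clock()
        wait = max(0.0, self.next_ok.get(domain, 0.0) - now)
        self.next_ok[domain] = now + wait + self.delay
        return wait

# Usage with a fake clock: the first request goes immediately, the
# second must wait out the one-second delay.
fake_now = [100.0]
rl = RateLimiter(delay=1.0, clock=lambda: fake_now[0])
first_wait = rl.wait_time("example.com")
second_wait = rl.wait_time("example.com")
```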
Determining Frontier
Which links do you want to follow? Do you want to download PDFs / images / etc.? Do you want to crawl a limited number of pages from each page? What is the fanout?
References
- https://dev.to/bloomreach/discovery-crawling-billions-of-pages-building-large-scale-crawling-cluster-pt-1-4p6
- https://dev.to/bloomreach/discovery-crawling-billions-of-pages-building-large-scale-crawling-cluster-pt-2-320l
- https://nlp.stanford.edu/IR-book/html/htmledition/crawling-1.html
- https://www.cs.princeton.edu/courses/archive/spr10/cos435/Notes/web_crawling_topost.pdf
- https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- https://almanac.httparchive.org/en/2022/page-weight