The Facebook Crawler

The Facebook Crawler scrapes the HTML of a website that is shared on Facebook, whether by copying and pasting the link or through a Facebook social plugin on the website. The crawler gathers, caches, and displays information about the website, such as its title, description, and thumbnail image.

Crawler Requirements

  • Your server must use gzip and deflate encodings.
  • All Open Graph properties must appear within the first 1 MB of your website's response, or they will be cut off.
  • Ensure that the content can be scraped by the crawler within a few seconds or Facebook will be unable to display the content.
  • Your website should either generate and return a response with all required properties according to the bytes specified in the Range header of the crawler request or it should ignore the Range header altogether.
  • Whitelist either the crawler's user agent strings or its IP addresses (the more secure option).
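As a sketch of the Range requirement above: the crawler's ranged request should receive either a 206 partial response containing the required properties, or a full 200 response if the server ignores the Range header. The helper below is hypothetical (the name `range_ok` is not part of any API) and assumes you have already extracted the HTTP status code from a ranged request, for example with curl's `-w '%{http_code}'`.

```shell
# Hypothetical check: given the HTTP status code returned for a ranged
# request, decide whether the server's behavior satisfies the crawler.
#   200 => server ignored the Range header entirely (acceptable)
#   206 => server honored the Range header (acceptable, provided the
#          required Open Graph tags fit within the returned bytes)
range_ok() {
  case "$1" in
    200|206) echo "ok" ;;
    *)       echo "unsupported: status $1" ;;
  esac
}

# In practice you would extract the status from a ranged request, e.g.:
#   status=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Range: bytes=0-524288" "$URL")
#   range_ok "$status"
```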

Crawler IPs and User Agents

The Facebook crawler uses the following user agent strings:

  • facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  • facebookexternalhit/1.1

To get a current list of IP addresses the crawler uses, run the following command.

whois -h whois.radb.net -- '-i origin AS32934' | grep ^route  

These IP addresses change often.
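For instance, the route lines in the whois output can be reduced to a plain list of CIDR blocks suitable for a firewall allowlist. The helper below (`extract_cidrs` is an assumed name, not part of any tool) relies on the RADb output format shown in the Example Response section:

```shell
# Sketch: extract just the CIDR blocks from the whois output, i.e. the
# second field of lines beginning with "route:" or "route6:".
extract_cidrs() {
  awk '/^route6?:/ { print $2 }'
}

# Typical usage (requires network access):
#   whois -h whois.radb.net -- '-i origin AS32934' \
#     | extract_cidrs > fb_crawler_cidrs.txt
```

Because these addresses change often, regenerate the list regularly rather than hard-coding it.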

Example Response

...
route:      69.63.176.0/21
route:      69.63.184.0/21
route:      66.220.144.0/20
route:      69.63.176.0/20
route6:     2620:0:1c00::/40
route6:     2a03:2880::/32
route6:     2a03:2880:fffe::/48
route6:     2a03:2880:ffff::/48
route6:     2620:0:1cff::/48
... 

Troubleshooting

If your website content is not available at the time of scraping, you can force a scrape once it becomes available either by passing the URL through the Sharing Debugger or by using the Graph API.
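A Graph API re-scrape can be triggered with a POST request following the commonly documented `?id={url}&scrape=true` pattern. The sketch below builds such a request URL; treat the exact endpoint and parameters as assumptions to verify against the current Graph API reference, and supply your own access token.

```shell
# Hypothetical helper: build a Graph API request URL that forces a
# re-scrape of the given page.
#   $1 = URL of the page to re-scrape
#   $2 = a valid access token (placeholder here)
build_scrape_request() {
  printf 'https://graph.facebook.com/?id=%s&scrape=true&access_token=%s' "$1" "$2"
}

# Usage (requires a valid token and network access):
#   curl -X POST "$(build_scrape_request 'https://example.com/page' "$ACCESS_TOKEN")"
```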

You can simulate a crawler request with the following code if you need to troubleshoot your website:

curl -v --compressed -H "Range: bytes=0-524288" -H "Connection: close" -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" "$URL"

Crawler Rate Limits

You can label pages and objects to change how long Facebook's crawler will wait before checking them for new content. Use the og:ttl object property to limit crawler access if the crawler is being too aggressive.
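For example, og:ttl is set as a meta tag in the page's HTML head. The value below (345600 seconds, i.e. four days) is illustrative; check the current documentation for the minimum and default values that apply.

```html
<!-- Illustrative og:ttl tag: asks the crawler to wait 345600 seconds
     (four days) before re-scraping this page. -->
<meta property="og:ttl" content="345600" />
```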