The Facebook Crawler

Content is most often shared to Facebook in the form of a web page. The first time someone shares a link, the Facebook crawler scrapes the HTML at that URL to gather, cache, and display information about the content on Facebook, such as a title, description, and thumbnail image.

Crawler Access

The Facebook crawler needs to be able to access your content in order to scrape and share it correctly. Your pages should be visible to the crawler. If you require login or otherwise restrict access to your content, you'll need to whitelist our crawler. You should also exempt it from DDoS protection mechanisms.

If content isn't available at the time of scraping, you can force a rescrape once it becomes available by passing the URL through the Sharing Debugger.

Identifying the Crawler

The Facebook crawler can be identified by either of these user agent strings:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

or

facebookexternalhit/1.1

You can target one of these user agents to serve the crawler a nonpublic version of your page that has only metadata and no actual content. This helps optimize performance and is useful for keeping paywalled content secure.
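
As a minimal sketch of that targeting, the check below looks for the facebookexternalhit token in the User-Agent header and picks which version of the page to serve. The function names and the two variant labels are hypothetical, not part of any Facebook API. A user agent string can be spoofed, so this check alone should not be treated as a security boundary.

```python
def is_facebook_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the crawler token."""
    # A case-insensitive substring match covers both documented forms of
    # the string, with or without the trailing "(+http://...)" part.
    return "facebookexternalhit" in (user_agent or "").lower()


def select_page_variant(user_agent: str) -> str:
    """Choose which version of the page to render for this request."""
    # "metadata_only" and "full_page" are hypothetical labels that a real
    # application would map to its own templates.
    return "metadata_only" if is_facebook_crawler(user_agent) else "full_page"
```

A request handler would pass the incoming User-Agent header to select_page_variant and render the matching template.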

As of May 28th, 2014, you may also see a crawler with the following user agent string:

Facebot

Facebot is Facebook's web crawling robot that helps improve advertising performance. Facebot is designed to be polite. It attempts to access each web server no more than once every few seconds, in line with industry standards, and will respect your robots.txt settings.

Keep in mind that Facebot checks for changes to your server's robots.txt file only a few times a day, so any update takes effect on its next check rather than immediately.
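
To preview how a robots.txt file treats the Facebot user agent, one option is Python's standard urllib.robotparser module. This is only a sketch: the sample robots.txt content, the example URLs, and the helper name are hypothetical, and Python's matching rules may differ in edge cases from the way Facebot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Facebot out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Facebot
Disallow: /private/

User-agent: *
Disallow:
"""


def facebot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if these robots.txt rules let Facebot fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file content as a sequence of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Facebot", url)
```

For example, facebot_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report") evaluates the Disallow rule in the Facebot group of the sample file.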

Crawler rate limiting

You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive.
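
A page sets this property with a meta tag in its head. The sketch below renders such a tag from Python. It assumes that og:ttl is expressed in seconds, as Facebook's Open Graph reference describes it. The helper name and the seven-day value are illustrative choices, not recommendations.

```python
from html import escape


def og_meta_tag(prop: str, content: str) -> str:
    """Render an Open Graph meta tag with HTML-escaped attribute values."""
    return (
        f'<meta property="{escape(prop, quote=True)}" '
        f'content="{escape(content, quote=True)}" />'
    )


# Assumption: og:ttl is a number of seconds until the next rescrape.
SEVEN_DAYS_IN_SECONDS = 7 * 24 * 60 * 60

ttl_tag = og_meta_tag("og:ttl", str(SEVEN_DAYS_IN_SECONDS))
```

Here ttl_tag holds the string to place inside the page's head element.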

Giving the Crawler Access

There are two ways to give the crawler access:

  1. Whitelist the user agent strings listed above, which requires no upkeep.
  2. Whitelist the IP addresses used by the crawler, which is more secure.

To get a current list of the IP addresses the crawler uses, run this command:

whois -h whois.radb.net -- '-i origin AS32934' | grep '^route'

It returns a list of IP address ranges, which change often:

# For example only - over 100 in total
31.13.24.0/21 
66.220.144.0/20    
2401:db00::/32  
2620:0:1c00::/40  
2a03:2880::/32 
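
Once you have the ranges, checking a client address against them can be sketched with Python's standard ipaddress module. The sample lines below reuse the example ranges above and assume the command prints each range after a "route:" or "route6:" label. The function names are hypothetical, and a real allowlist would be refreshed regularly from the full output.

```python
import ipaddress

# Assumed output format: a "route:" or "route6:" label followed by a range.
SAMPLE_WHOIS_LINES = [
    "route:      31.13.24.0/21",
    "route:      66.220.144.0/20",
    "route6:     2401:db00::/32",
    "route6:     2620:0:1c00::/40",
    "route6:     2a03:2880::/32",
]


def parse_routes(lines):
    """Turn labelled route lines into IPv4/IPv6 network objects."""
    networks = []
    for line in lines:
        # Split on the first colon only, so IPv6 ranges stay intact.
        label, sep, value = line.partition(":")
        if sep and label.strip() in ("route", "route6") and value.strip():
            networks.append(ipaddress.ip_network(value.strip()))
    return networks


def is_crawler_ip(address, networks):
    """Return True if the address falls inside any allowlisted range."""
    ip = ipaddress.ip_address(address)
    # Membership is False when the IP versions differ, so mixing IPv4
    # and IPv6 networks in one list is safe.
    return any(ip in network for network in networks)
```

A firewall or middleware layer could call is_crawler_ip with the connecting address and the parsed ranges before deciding whether to bypass a login wall.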

Ensuring reasonable latency

Ensure that the crawler can retrieve the resources referenced in the URLs it crawls reasonably quickly, in no more than a few seconds. If it can't, Facebook will not be able to display the resource.

Canonical URLs

Our crawler fetches content to share by resolving to a URL that you designate as the canonical URL.

As a best practice, you should label all variations of a page with the canonical URL using an og:url tag (preferred) or link rel="canonical". The HTML for the canonical URL itself should also contain an og:url tag to designate itself as the canonical resource.

<meta property="og:url" content="https://example.com/path" />

This ensures that all actions such as likes and shares aggregate at the same URL rather than spreading across multiple versions of a page.

This also means that different versions of the same content will be treated the same, even if they're hosted on separate subdomains or are accessible over both http:// and https://.

If needed, our crawler will follow a chain of redirects to resolve the canonical URL.
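
To confirm that a page declares the canonical URL you expect, you can inspect its HTML yourself. The sketch below uses Python's standard html.parser module and prefers og:url over rel="canonical", mirroring the stated preference. It is an illustrative check with hypothetical names, not a description of how the Facebook crawler is implemented, and it does not follow redirects.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Collect the first og:url and rel="canonical" values in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.og_url: Optional[str] = None
        self.link_canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:url":
            # Keep the first occurrence only.
            self.og_url = self.og_url or attr.get("content")
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.link_canonical = self.link_canonical or attr.get("href")


def declared_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, preferring og:url."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.og_url or finder.link_canonical
```

Running declared_canonical over each variation of a page is a quick way to verify that they all point at the same URL.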

Migrating Content/Updating URLs

If you migrate your content from one URL to another, likes and shares won't automatically migrate. You can retain like and share counts with these two steps:

1. Exempt the Facebook crawler from your HTTP redirect

Use an HTTP 301 or 302 redirect to send people to the new URL when they visit the old URL.

The crawler still needs to be able to access the old page, so exempt the crawler's user agent from the redirect and send the HTTP redirect only to clients other than the Facebook crawler.

The HTML of the old URL should still contain Open Graph tags (including an og:url tag pointing to itself) and return an HTTP 200 response when the crawler loads it.

Also make sure that your AAAA record is updated correctly when you change your URL, as the crawler will look for one and will return response code 0 if it is not found.
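
The redirect exemption in this step can be sketched as a small, framework-independent function that decides how to answer a request for the old URL. The function name, the new URL, and the stub HTML are hypothetical. A real handler would return the old page's full Open Graph markup and plug into your web framework's routing.

```python
from typing import Dict, Tuple

# Stub of the old page: it keeps an og:url tag that points to itself.
OLD_PAGE_HTML = (
    "<html><head>"
    '<meta property="og:url" content="https://example.com/old-url" />'
    "</head></html>"
)
# Hypothetical destination for human visitors.
NEW_URL = "https://example.com/new-url"


def respond_to_old_url(user_agent: str) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a request to the old URL."""
    if "facebookexternalhit" in (user_agent or "").lower():
        # The crawler is exempt from the redirect and gets HTTP 200.
        return 200, {"Content-Type": "text/html; charset=utf-8"}, OLD_PAGE_HTML
    # Everyone else is redirected to the new location.
    return 301, {"Location": NEW_URL}, ""
```

Using 301 here is one of the two redirect codes the step allows; 302 would work the same way in this sketch.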

2. Use the old page as the canonical URL for the new page

Add this tag to the HTML of the new URL:

<meta property="og:url" content="https://example.com/old-url" />

This tells our crawler that the canonical URL is at the old location, and the crawler will use it to generate the number of likes on the page. New likes at either location will aggregate in both places.

Though the og:url tag is preferred, this method will work with link rel="canonical" as well.