The Facebook Crawler

Content is most often shared to Facebook in the form of a web page. The first time someone shares a link, the Facebook crawler scrapes the HTML at that URL to gather, cache, and display information about the content on Facebook, such as a title, description, and thumbnail image. A direct share on Facebook is not the only trigger: other actions, such as embedding any of Facebook's social plugins on the page, can also cause our crawler to scrape it.

Identifying the Crawler

The Facebook crawler can be identified by either of these user agent strings:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

or

facebookexternalhit/1.1
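
For example, a server can recognize crawler requests with a simple prefix check against these strings. Below is a minimal Python sketch; the helper name is illustrative and not part of any Facebook API:

# A minimal sketch of detecting the Facebook crawler from a request's
# User-Agent header. The prefix matches both strings listed above.
FACEBOOK_UA_PREFIX = "facebookexternalhit"

def is_facebook_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string identifies the Facebook crawler."""
    return user_agent.startswith(FACEBOOK_UA_PREFIX)

print(is_facebook_crawler("facebookexternalhit/1.1"))  # True
print(is_facebook_crawler("Mozilla/5.0"))              # False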

Crawler Access

The Facebook crawler needs to be able to access your content in order to scrape and share it correctly. Your pages should be visible to the crawler. If you require login or otherwise restrict access to your content, you'll need to whitelist our crawler. Note that our crawler only accepts gzip and deflate encodings, so make sure your server supports one of those encodings. Please note that the crawler only scrapes the first 1 MB of a page, so any Open Graph properties need to be listed before that cutoff.
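
To confirm your server behaves correctly for such a client, you can request a page while advertising only those encodings. A minimal sketch using the third-party requests library; the URL is a placeholder:

# Spot-check that a page serves gzip or deflate to a crawler-like client.
import requests

resp = requests.get(
    "https://example.com/article",          # placeholder URL
    headers={"Accept-Encoding": "gzip, deflate"},
    timeout=10,
)
print(resp.status_code)
print(resp.headers.get("Content-Encoding"))  # expect "gzip" or "deflate"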

If content isn't available at the time of scraping, you can force a rescrape once it becomes available either by passing the URL through the Sharing Debugger or by using the Graph API.
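
For example, a rescrape can be requested by POSTing the URL to the Graph API with scrape=true. The sketch below uses a placeholder URL and access token; consult the Graph API documentation for your app's exact requirements:

# A minimal sketch of forcing a rescrape through the Graph API.
import requests

resp = requests.post(
    "https://graph.facebook.com",
    data={
        "id": "https://example.com/article",  # the URL to rescrape
        "scrape": "true",
        "access_token": "YOUR_ACCESS_TOKEN",  # placeholder
    },
    timeout=10,
)
print(resp.json())  # the freshly scraped Open Graph data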

There are two ways to give the crawler access:

  1. Whitelist the user agent strings listed above, which requires no upkeep
  2. Whitelist the IP addresses used by the crawler, which is more secure:

Run this command to get a current list of IP addresses the crawler uses.

whois -h whois.radb.net -- '-i origin AS32934' | grep ^route  

It will return a list of IP addresses that change often:

# For example only - over 100 in total
31.13.24.0/21 
66.220.144.0/20    
2401:db00::/32  
2620:0:1c00::/40  
2a03:2880::/32 
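
Because the list changes often, re-fetch it regularly rather than hard-coding it. As a minimal sketch, the ranges above can be tested against an incoming address with Python's standard ipaddress module; the hard-coded ranges here are just the example output:

# Check whether a client IP falls inside the crawler's CIDR ranges.
import ipaddress

CRAWLER_RANGES = [
    "31.13.24.0/21",
    "66.220.144.0/20",
    "2401:db00::/32",
    "2620:0:1c00::/40",
    "2a03:2880::/32",
]

def is_crawler_ip(addr: str) -> bool:
    """Return True if addr is inside one of the crawler's CIDR ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(net) for net in CRAWLER_RANGES)

print(is_crawler_ip("31.13.27.5"))   # True (inside 31.13.24.0/21)
print(is_crawler_ip("203.0.113.9"))  # False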

Ensuring reasonable latency

You need to ensure that the resources referenced in a URL to be crawled can be retrieved by the crawler reasonably quickly, in no more than a few seconds. If the crawler can't retrieve a resource in time, Facebook will not be able to display it.
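
A simple spot check is to time a request with a strict timeout. The sketch below assumes a 3-second budget, which is an illustrative figure rather than an official threshold:

# Time a fetch of a shared resource; the timeout aborts a stalled request.
import time
import requests

url = "https://example.com/article"  # placeholder
start = time.monotonic()
resp = requests.get(url, timeout=3)  # raises if connect/read stalls past 3 s
elapsed = time.monotonic() - start
print(f"{url} returned {resp.status_code} in {elapsed:.2f}s")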

Crawler rate limiting

You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive.
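
og:ttl is expressed in seconds and rendered as a regular Open Graph meta tag. A minimal sketch that emits it alongside other tags; the values are placeholders:

# Emit Open Graph meta tags, including og:ttl, for the page <head>.
# Keeping them early in the page keeps them within the 1 MB scrape window.
OG_TAGS = {
    "og:title": "Example Title",
    "og:ttl": str(28 * 24 * 60 * 60),  # 2419200 seconds = 28 days
}

meta_html = "\n".join(
    f'<meta property="{prop}" content="{content}" />'
    for prop, content in OG_TAGS.items()
)
print(meta_html)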

Facebot Crawler

As of May 28th, 2014, you may also see a crawler with the following user agent string:

Facebot

Facebot is Facebook's web crawling robot that helps improve advertising performance. Facebot is designed to be polite. It attempts to access each web server no more than once every few seconds, in line with industry standards, and will respect your robots.txt settings.

Keep in mind that Facebot checks your server's robots.txt file for changes only a few times a day, so any updates will be picked up on its next check rather than immediately.
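
To verify what your robots.txt currently allows Facebot to fetch, Python's standard robotparser can be used as a quick check; the URLs below are placeholders:

# Ask the robots.txt rules whether Facebot may fetch a given URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("Facebot", "https://example.com/article"))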