Content is most often shared to Facebook in the form of a web page. The first time someone shares a link, the Facebook crawler scrapes the HTML at that URL to gather, cache and display information about the content on Facebook, such as a title, description, and thumbnail image. Sharing a page directly is not the only thing that can trigger a crawl. For example, having any of Facebook's social plugins on the webpage can cause our crawler to scrape that webpage.
The Facebook crawler can be identified by either of these user agent strings:
The Facebook crawler needs to be able to access your content in order to scrape and share it correctly. Your pages should be visible to the crawler. If you require login or otherwise restrict access to your content, you'll need to whitelist our crawler. Note that our crawler only accepts gzip and deflate encodings, so make sure your server uses the right encoding. Your website should either generate and return a response with all required properties according to the bytes specified in the Range header of the crawler request, or it should ignore the Range header altogether. Please note that the crawler only scrapes the first 1 MB of a page, so any Open Graph properties need to be listed before that cutoff.
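To check the 1 MB cutoff on your own pages, something like the following sketch may help. It lists which Open Graph properties appear inside the first megabyte of a page and which only appear later. The function names, the simplified regular expression, and the choice of 1024 * 1024 bytes for "1 MB" are assumptions made for illustration, not a description of how the crawler itself parses pages.

```python
import re
from typing import List

# Assumption: "1 MB" is taken here as 1024 * 1024 bytes.
CRAWL_LIMIT_BYTES = 1024 * 1024

# Simplified pattern: finds <meta ... property="og:..."> with either quote style.
# A real HTML parser would be more robust; this is only a rough check.
OG_PROPERTY_RE = re.compile(rb"""<meta[^>]+property=["'](og:[^"']+)["']""", re.IGNORECASE)


def og_properties_before_cutoff(html: bytes, limit: int = CRAWL_LIMIT_BYTES) -> List[str]:
    """Return og:* property names found within the first `limit` bytes."""
    head = html[:limit]
    return [m.group(1).decode("ascii", "replace") for m in OG_PROPERTY_RE.finditer(head)]


def og_properties_after_cutoff(html: bytes, limit: int = CRAWL_LIMIT_BYTES) -> List[str]:
    """Return og:* property names that only appear past the cutoff."""
    seen = set(og_properties_before_cutoff(html, limit))
    everything = og_properties_before_cutoff(html, len(html))
    return [name for name in everything if name not in seen]
```

If `og_properties_after_cutoff` returns anything for a page, those tags are candidates for moving closer to the top of the document.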
You can simulate a crawler request with the following code if you need to troubleshoot your website:
curl -v --compressed -H "Range: bytes=0-524288" -H "Connection: close" -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" "$URL"
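The curl command above sends `Range: bytes=0-524288`. If your server chooses to honor the header rather than ignore it, the handling could look roughly like this sketch. The function name `apply_range` and its return shape are hypothetical, and only the single `bytes=start-end` form is handled. Any other form is ignored and the full body is returned, which is the second option the text allows.

```python
import re
from typing import Dict, Optional, Tuple

# Only the simple "bytes=start-end" / "bytes=start-" forms are recognized here.
RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def apply_range(body: bytes, range_header: Optional[str]) -> Tuple[int, bytes, Dict[str, str]]:
    """Return (status_code, payload, extra_headers) for a response body."""
    if not range_header:
        return 200, body, {}

    match = RANGE_RE.match(range_header.strip())
    if match is None:
        # Unsupported range form: ignore the header and send everything.
        return 200, body, {}

    total = len(body)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else total - 1

    if start >= total:
        # Requested range starts past the end of the content.
        return 416, b"", {"Content-Range": f"bytes */{total}"}
    if start > end:
        # Malformed range: fall back to the full response.
        return 200, body, {}

    end = min(end, total - 1)  # clip to the available bytes
    payload = body[start:end + 1]
    return 206, payload, {"Content-Range": f"bytes {start}-{end}/{total}"}
```

Whichever option you pick, the part of the page that is returned should still contain all required properties.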
There are two ways to give the crawler access:
Run this command to get a current list of IP addresses the crawler uses.
whois -h whois.radb.net -- '-i origin AS32934' | grep ^route
It will return a list of IP addresses that change often:
# For example only - over 100 in total
22.214.171.124/21
126.96.36.199/20
2401:db00::/32
2620:0:1c00::/40
2a03:2880::/32
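As a rough illustration of how a server might use this information, the sketch below parses `route` lines like those printed by the whois command and checks whether a request's address falls inside one of the ranges. It also includes a simple user agent check based on the `facebookexternalhit` token from the curl example. The function names are hypothetical, and the IPv4 range in the usage note is a documentation placeholder, not a real crawler range. Because the list changes often, you would need to refresh it regularly.

```python
import ipaddress
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Token taken from the user agent string used in the curl example.
CRAWLER_UA_TOKEN = "facebookexternalhit"


def parse_routes(whois_output: str) -> List[Network]:
    """Extract CIDR blocks from 'route:' and 'route6:' lines."""
    networks: List[Network] = []
    for line in whois_output.splitlines():
        if not line.startswith("route"):
            continue
        _, _, value = line.partition(":")  # split on the first colon only
        # strict=False tolerates entries with host bits set.
        networks.append(ipaddress.ip_network(value.strip(), strict=False))
    return networks


def ip_in_networks(remote_ip: str, networks: List[Network]) -> bool:
    """Return True if remote_ip belongs to any of the given networks."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr.version == net.version and addr in net for net in networks)


def looks_like_facebook_crawler(user_agent: str) -> bool:
    """User agent check only; this header is easy to spoof."""
    return CRAWLER_UA_TOKEN in (user_agent or "").lower()
```

For example, with the placeholder input `"route:      192.0.2.0/24\nroute6:     2a03:2880::/32\n"`, `ip_in_networks("192.0.2.17", parse_routes(...))` is true. Matching on IP address is harder to spoof than matching on the user agent string.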
You need to ensure that the resources referenced in URLs to be crawled can be retrieved by the crawler reasonably quickly, in no more than a few seconds. If the crawler isn't able to do this, then Facebook will not be able to display the resource.
You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive.
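A page template could emit the tag with a small helper like the one below. This is a minimal sketch that assumes `og:ttl` takes a whole number of seconds. That unit is not stated in this section, so confirm it, along with any minimum value, in the Open Graph reference before relying on it. The helper name `og_ttl_meta` is hypothetical.

```python
def og_ttl_meta(seconds: int) -> str:
    """Render an og:ttl meta tag.

    Assumption: the value is a positive integer number of seconds
    before the page should be scraped again.
    """
    if seconds <= 0:
        raise ValueError("og:ttl must be a positive number of seconds")
    return f'<meta property="og:ttl" content="{int(seconds)}" />'


# Example: ask for roughly one week between scrapes.
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60
print(og_ttl_meta(ONE_WEEK_SECONDS))
```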
As of May 28th, 2014 you may also see a crawler with the following user agent string:
Facebot is Facebook's web crawling robot that helps improve advertising performance. Facebot is designed to be polite. It attempts to access each web server no more than once every few seconds, in line with industry standards, and will respect your robots.txt settings.
Keep in mind Facebot checks for changes to your server's robots.txt file only a few times a day, so any updates will be picked up on its next crawl and not immediately.
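Before deploying a robots.txt change, you can check locally how its rules apply to a given URL. The sketch below uses Python's standard `urllib.robotparser` and assumes that `Facebot` is the token to use in `User-agent` lines. The helper name and the example URLs are hypothetical, and the standard library's matching rules may differ in detail from Facebot's own.

```python
from urllib.robotparser import RobotFileParser


def facebot_allowed(robots_txt: str, url: str, agent: str = "Facebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(agent, url)


# Hypothetical rules: keep Facebot out of /private/, allow everything else.
EXAMPLE_ROBOTS = """\
User-agent: Facebot
Disallow: /private/

User-agent: *
Disallow:
"""

print(facebot_allowed(EXAMPLE_ROBOTS, "https://example.com/private/page"))
print(facebot_allowed(EXAMPLE_ROBOTS, "https://example.com/public/page"))
```

Since the file is only re-read a few times a day, a change verified this way may still take some hours to affect crawling.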