Facebook Using Hadoop for Large Scale Internal Analytics
June 6th, 2008
| By Justin Smith | 1 Comment » |
Facebook’s engineering team has posted some details on the tools it’s using to analyze the huge data sets it collects. One of the main tools it uses is Hadoop, an open source project that makes it easier to analyze vast amounts of data.
Some interesting tidbits from the post:
- Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
- Facebook has multiple Hadoop clusters deployed now – with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated – from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality.
- Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.

Twitter
Facebook









Strategic Facebook Platform Ecosystem Overview and Guide For Agencies & Brands
French / Français
Spanish / Español
Italian / Italiano
Track Facebook's International Growth in 95 Global Markets with our Monthly Reports and Analysis


June 7th, 2008 at 8:39 am
[...] google reader: Facebook Using Hadoop for Large Scale Internal Analytics Share me: These icons link to social bookmarking sites where readers can share and discover new web pages. [...]