Facebook Using Hadoop for Large Scale Internal Analytics

Facebook’s engineering team has posted some details on the tools it’s using to analyze the huge data sets it collects. One of the main tools it uses is Hadoop, an open source project that makes it easier to analyze vast amounts of data.

Some interesting tidbits, quoted from the post:

  • Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
  • Facebook has multiple Hadoop clusters deployed now – with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated – from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality.
  • Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.
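
To put the quoted figures in perspective, here is a rough back-of-envelope sketch. It uses only the numbers above, treated as lower bounds since the post says "over" for both. Decimal units and the replication factor of 3 (the HDFS default) are assumptions, not something the post states, and the calculation ignores existing data and intermediate job output.

```python
# Figures quoted in the post; decimal units are an assumption
# (1 TB = 1,000 GB, 1 PB = 1,000,000 GB).
COMPRESSED_GB_PER_DAY = 250      # "over 250 gigabytes of compressed data"
UNCOMPRESSED_GB_PER_DAY = 2000   # "over 2 terabytes uncompressed"
CLUSTER_DISK_GB = 1_000_000      # "1 PetaByte of disk space" (largest cluster)


def compression_ratio(uncompressed_gb, compressed_gb):
    """Return how many times smaller the compressed data is."""
    return uncompressed_gb / compressed_gb


def days_to_fill(disk_gb, daily_gb, replication=1):
    """Days until the disk is full if each stored byte is kept `replication` times."""
    return disk_gb / (daily_gb * replication)


ratio = compression_ratio(UNCOMPRESSED_GB_PER_DAY, COMPRESSED_GB_PER_DAY)
days_raw = days_to_fill(CLUSTER_DISK_GB, COMPRESSED_GB_PER_DAY)
# Replication factor 3 is the HDFS default; whether Facebook used it is an assumption.
days_replicated = days_to_fill(CLUSTER_DISK_GB, COMPRESSED_GB_PER_DAY, replication=3)

print(f"Compression ratio: about {ratio:.0f}x")
print(f"Days to fill 1 PB, no replication: {days_raw:.0f}")
print(f"Days to fill 1 PB, 3x replication: {days_replicated:.0f}")
```

Even with threefold replication, the daily load alone would take years to fill the largest cluster, which suggests that much of the disk is there for derived data sets and headroom rather than raw ingest.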



Copyright 2012 WebMediaBrands Inc. All rights reserved.