Keep hearing about Big Data and Hadoop? Having a hard time explaining what is behind the curtain?
The term “big data” comes from computational sciences to describe scenarios where the volume of the data outstrips the tools to store it or process it. Three reasons why we are generating data faster than ever: (1) Processes are increasingly automated; (2) Systems are increasingly interconnected; (3) People are increasingly “living” online.
As huge data sets invaded the corporate world there are new tools to help process big data. Corporations have to run analysis on massive data sets to separate the signal from the noisy data. Hadoop is an emerging framework for Web 2.0 and enterprise businesses who are dealing with data deluge challenges – store, process, index, and analyze large amounts of data as part of their business requirements.
So what’s the big deal? The first phase of e-commerce was primarily about cost and enabling transactions. So everyone got really good at this. Then we saw differentiation around convenience… fulfillment excellence (e.g., Amazon Prime) , or relevant recommendations (if you bought this and then you may like this – next best offer).
Then the game shifted as new data mashups became possible based on… seeing who is talking to who in your social network, seeing who you are transacting with via credit-card data, looking at what you are visiting via clickstreams, influenced by ad clickthru, ability to leverage where you are standing via mobile GPS location data and so on.
The differentiation is shifting to turning volumes of data into useful insights to sell more effectively. For instance, E-bay apparently has 9 petabytes of data in their Hadoop and Teradata cluster. With 97 million active buyers and sellers they have 2 Billion page view and 75 billion database calls each day. E-bay like others is racing to put in the analytics infrastructure to (1) collect real-time data; (2) process data as it flows; (3) explore and visualize.
The continuous challenge in Web 2.0 is how to improve site relevance, performance, understand user behavior, and predictive insight to influence decisions.
This is a never ending arms race as each firm tries to become the portal of choice in a fast changing world. Industries - Travel, Retail, Financial Services, Digital Media, Search etc. – that are consumer oriented are all facing similar real-time information dynamics.
Take for instance, the competitive world of travel – airline, hotel, car rental, vacation rental etc.. Every site has to improve at analytics and machine learning as the contextual data is changing by the second – inventory, pricing, customer comments, peer recommendations, political/economic hotspots, natural disasters like earthquakes etc. Without a sophisticated real-time analytics playbook, sites can become less relevant very quickly.
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.
Hadoop Quick Overview
So, What is Apache Hadoop ? A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main systems:
- Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage.
- MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
The big deal:
- Flexibility – Store any data, Run any analysis.
- Scalability – Start at 1TB/3-nodes grow to petabytes/1000s of nodes.
- Economics – Cost per TB at a fraction of traditional options.
Selecting the right tool for the right job….. Technically, Hadoop, open-source MapReduce framework (written in Java), can store and process gobs of data across many commodity computers. Use Hadoop when dealing with (1) Structured or Not (Flexibility); (2) Scalability of Storage/Compute; (3) Complex Data Processing.
Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications. Use relational databases when dealing with (1) Interactive OLAP Analytics; (2) Multistep ACID Transactions and (3) 100% SQL Compliance.
It is becoming increasingly more difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.
Hadoop is designed to solve a different problem: the fast, reliable analysis of both structured, unstructured and complex data.
Hadoop and related software are designed for 3V’s: (1) Volume – Commodity hardware and open source software lowers cost and increases capacity; (2) Velocity – Data ingest speed aided by append-only and schema-on-read design; and (3) Variety – Multiple tools to structure, process, and access data.
As a result, many IT Engineering teams are deploying the Hadoop ecosystem alongside their legacy IT applications, which allows them to combine old data and new data sets in powerful new ways. It also allows them to offload analysis from the data warehouse (and in some cases, pre-process before putting in the data warehouse).
Technically, Hadoop, a Java based framework, consists of two elements: reliable very large, low-cost data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel/distributed data processing framework called MapReduce.
HDFS is self-healing high-bandwidth clustered storage. Map-Reduce is essentially fault tolerant distributed computing. For more see our primer on Big Data, Hadoop and in-memory Analytics.
Scenarios for Using Hadoop
When a user types a query, it isn’t practical to exhaustively scan millions of items. Instead it makes sense to create an index and use it to rank items and find the best matches. Hadoop provides a distributed indexing capability.
Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing and fault tolerant. It can deliver data — and can run large-scale, high-performance processing batch jobs — in spite of system changes or failures.
Three distinct scenarios for Hadoop are:
1) Hadoop as an ETL and Filtering Platform – One of the biggest challenges with high volume data sources is extracting valuable signal from lot of noise. Loading large, raw data into a MapReduce platform for initial processing is a good way to go. Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set. This output (e.g., hourly index refreshes) can be further analyzed or serve as an input to a more traditional analytic environment like SAS. Typically a small % of a raw data feed is required for any business problem. Hadoop becomes a great tool for extracting these pieces.
2) Hadoop as an exploration engine – Once the data is in the MapReduce cluster, using tools to analyze data where it sits makes sense. As the refined output is in a Hadoop cluster, new data can be added to the existing pile without having to re-index all over again. In other words, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.
3) Hadoop as an Archive. Most of the historical data doesn’t need to be accessed and kept in a SAN environment. This historical data is usually archived by tape or disk to secondary storage or sent offsite. When this data is needed for analysis, it’s painful and costly to retrieve it and load it back up… so most people don’t bother using historical data for their analytics. With cheap storage in a distributed cluster, lot’s of data can be kept “active” for continuous analysis. Hadoop is efficient…it allows better utilization of hardware by allowing the generation of different index types in one cluster.
The Hadoop Stack
It’s important to differentiate Hadoop from the Hadoop stack. Firms like Cloudera sell a set of capabilities around Hadoop called the Cloudera’s Distribution for Hadoop (CDH).
This is a set of projects and management tools designed to lower the cost and complexity of administration and production support services; this includes 24/7 problem resolution support, consultative support, and support for certified integrations.
The 2011 version of CDH looked like this:
The 2012 version of CDH (Apache) looks like this:
The introduction of Hadoop stack is changing the business intelligence (reporting/analytics/data mining), which has been dominated by very expensive relational databases and data warehouse appliance products.
What is Hadoop good for?
Searching, log processing, recommendation systems, data warehousing, video and image analysis, archiving seem to be the initial uses. One prominent space where Hadoop is playing a big role in is data-driven online websites. The four primary areas include:
1) To aggregate “data exhaust” — messages, posts, blog entries, photos, video clips, maps, web graph….
2) To give data context — friends networks, social graphs, recommendations, collaborative filtering….
3) To keep apps running — web logs, system logs, system metrics, database query logs….
4) To deliver novel mashup services – mobile location data, clickstream data, SKUs, pricing…..
Let’s look at a few realworld examples from LinkedIn, CBS Interactive, Explorys and FourSquare. Walt Disney, Wal-mart, General Electric, Nokia, and Bank of America are also applying Hadoop to a variety of tasks including marketing, advertising, and sentiment and risk analysis. IBM used the software as the engine for its Watson computer, which competed with the champions of TV game show Jeopardy.
Hadoop @ LinkedIn
LinkedIn is a massive data hoard whose value is connections. It currently computes more than 100 billion personalized recommendations every week, powering an ever growing assortment of products, including Jobs You May be Interested In, Groups You May Like, News Relevance, and Ad Targeting.
LinkedIn leverages Hadoop to transform raw data to rich features using knowledge aggregated from LinkedIn’s 125 million member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as a “social media feeds”, may contain valuable, real-time information on the LinkedIn member opinions, activities, and mood states.
CBS Interactive - Leveraging Hadoop
CBS Interactive is using Hadoop as the web analytics platform, processing one Billion weblogs daily (grown from 250 million events per day) from hundreds of web site properties.
Who is CBS Interactive? They are the online division for CBS, the broadcast network. They are a top 10 global web property and the largest premium online content network. Some of the brands include: CNET, Last.fm, TV.com, CBS Sports, 60 Minutes, to name a few.
CBS Interactive migrated processing from a proprietary platform to Hadoop to crunch web metrics. The goal was to achieve more robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far). To enable this they built an Extraction, Transformation and Loading ETL framework called Lumberjack, built based on python and streaming.
Explorys and Cleveland Clinic
Explorys, founded in 2009 in partnership with the Cleveland Clinic, is one of the largest clinical repositories in the United States with 10 million lives under contract. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. With billions of clinical and operational events already curated, Explorys helps healthcare leaders leverage analytics for break-through discovery and the improvement of medicine. HBase and Hadoop are at the center of Explorys. Already ingesting billions of anonymized clinical records, Explorys provides a powerful and HIPAA compliant platform for accelerating discovery.
Hadoop @ Orbitz
Travel – air, hotel, car rentals – is an incredibly competitive space. Take the challenge of hotel ranking. Orbitz .com generates ~1.5 million air searches and ~1 million hotel searches a day in 2011. All this activity generates massive amounts of data – over 500 GB/day of log data. The challenge was expensive and difficult to use existing data infrastructure for storing and processing this data.
Orbitz needed an infrastructure that provides (1) long term storage of large data sets; (2) open access for developers and business analysts; (3) ad-hoc quering of data and rapid deploying of reporting applications. They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware. Hive is an open-source data warehousing solution built on top of Hadoop which allows easy data summarization, adhoc querying and analysis of large datasets stored in Hadoop. Hive simplifies Hadoop data analysis — users can use SQL rather than writing low level custom code. Highlevel queries are compiled into Hadoop Mapreduce jobs.
Hadoop @ Foursquare
foursquare is a mobile + location + social networking startup aimed at letting your friends in almost every country know where you are and figuring out where they are.
As a platform foursquare is now aware of 25+ million venues worldwide, each of which can be described by unique signals about who is coming to these places, when, and for how long. To reward and incent users foursquare allows frequent users to collect points, prize “badges,” and eventually, coupons, for check-ins.
Foursquare is built on enabling better mobile + location + social networking by applying machine learning algorithms to the collective movement patterns of millions of people. The ultimate goal is to build new services which help people better explore and connect with places.
Foursquare engineering employs a variety of machine learning algorithms to distill check-in signals into useful data for app and platform. foursquare is enabled by a social recommendation engine and real-time suggestions based on a person’s social graph.
Matthew Rathbone, foursquare engineering, describes the data analytics challenge as follows:
“With over 500 million check-ins last year and growing, we log a lot of data. We use that data to do a lot of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know. However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some ad-hoc data exploration, it required them knowing SCALA and being able to query against production databases.
This has become a larger problem as of late, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example, which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?”
To enable easy access to data foursquare engineering decided to use Apache Hadoop, and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the ruby Thrift client.
What Data Projects is Hadoop Driving?
Now that we have dispensed with examples and you are eager to get started, what data driven projects do you undertake?
Where do you start? Do you do legacy application retrofits to leverage the Hadoop stack? Do you do large scale data center upgrades to handle the terabytes – petabytes – exabytes load? Do you do a Proof of Concept (PoC) project to investigate technologies?
Hadoop and Big Data Vendor Landscape…
Which vendors do you engage with for what. How do you make sense of the landscape?
Jeff Kelly @ Wikibon presents a nice Big Data market segmentation landscape graphic that I found quite interesting.
Lots of new entrants in this space that are raising the innovation quotient. According to GigaOM:
“Cloudera which is synonymous with Hadoop has raised $76 million since 2009 [Has raised another $65 million in December 2012]. Newcomers MapR and Hortonworks have raised $29 million and $50 million. And that’s just at the distribution layer, which is the foundation of any Hadoop deployment. Up the stack, Datameer, Karmasphere and Hadapt have each raised around $10 million, and then are newer funded companies such as Zettaset, Odiago and Platfora. Accel Partners has started a $100 million big data fund to feed applications utilizing Hadoop and other core big data technologies. If anything, funding around Hadoop should increase in 2012, or at least cover a lot more startups.”
Hadoop usage/penetration is growing as more analysts, programmers and – increasingly – processes “use” data. Accelerating data growth drives performance challenges, load time challenges and hardware cost optimization.
The data growth chart below gives you a sense of how quickly we are creating digital data. A few years ago Terabytes were considered a big deal. Now Exabytes are the new Terabytes. Making sense of large data volumes at real-time speed is where we are heading.
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte [where most SME corporations are?]
1000 Petabytes = 1 Exabyte [where most large corporations are?]
1000 Exabytes = 1 Zettabyte [where leaders like Facebook and Google are]
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Hadoop based analytic complexity grows as data mining, predictive modeling and advanced statistics become the norm. Usage growth is driving the need for more analytical sophistication.
Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlined network architectures. As Hadoop graduates from pilots to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative.
Finally, adoption of Hadoop in the enterprise will not be an easy journey, and the hardest steps are often the first. Then, they get harder! Weaning the IT organizations off traditional DB and EDW models to use a new approach can be compared to moving the moon out of its orbit with a spatula… but it can be done.
2012 may be the year Hadoop crosses into mainstream IT.
Sources and References
1) Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses.
Apache Hadoop = HDFS + MapReduce
- Hadoop Distributed File System (HDFS) for storing massive datasets using low-cost storage
- MapReduce, the algorithm on which Google built its empire
MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, and even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set. Think of this as an efficient parallel assembly line for data analysis. MapReduce was first presented to the world via a 2004 white paper in which Google. Yahoo re-implemented this technique and open sourced it via the Apache foundation.
2) Related components often deployed with Hadoop – HBase, Hive, Pig, Oozie, Flume and Sqoop. These components form the core Hadoop Stack.
- HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s BigTable architecture. HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant.
- Hive is a SQL’esque query language for interrogating Apache Hadoop
- Pig, a high-level query language for large-scale data processing
- ZooKeeper, a toolkit of coordination primitives for building distributed systems
3) Hadoop ecosystem is evolving constantly so this makes it tricky for enterprise IT adoption which tends to like stable proven models with a big maintenance tail.
4) It’s important to understand what Hadoop doesn’t do…. Big Data technology like the Hadoop stack does not deliver insight, however – insights depend on analytics that result from combing the results of things like Hadoop MapReduce jobs with manageable “small data” already in the Data warehouse (DW).
5) Foursquare and Hadoop case study writeup by - Matthew Rathbone of the Foursquare Engineering team….
6) Presentation Big Data at FourSquare:
9) Data mining leveraging distributed file systems is a field with multiple techniques. These include: Hadoop, map-reduce; PageRank, topic-sensitive PageRank, spam detection, hubs-and-authorities; similarity search; shingling, minhashing, random hyperplanes, locality-sensitive hashing; analysis of social-network graphs; association rules; dimensionality reduction: UV, SVD, and CUR decompositions; algorithms for very-large-scale mining: clustering, nearest-neighbor search, gradient descent, support-vector machines, classification, and regression; and submodular function optimization.
||Hadoop Distributed File System
||Parallel data-processing framework
||A set of utilities that support the Hadoop subprojects
||Hadoop database for random read/write access
||SQL-like queries and tables on large datasets
||Data flow language and compiler
||Workflow for interdependent Hadoop jobs
||Integration of databases and data warehouses with Hadoop
||Configurable streaming data collection
||Coordination service for distributed applications
||User interface framework and software development kit (SDK) for visual Hadoop applications