It’s like an irritating fly buzzing around your head – “Big Data”. Do I have to hear the term one more time? As a Data Scientist who works on hard data problems daily, I find the term “Big Data” has little meaning, since it lacks any precise definition or structure, the normal comfort food for my mathematical mind. However, I appreciate that the term is “out there” and that people trying to make sense of the mess of data flowing toward them are searching for a common language to discuss their specific challenges. I hope to help by explaining some of the key language circulating around Big Data concepts – the next level of detail in the Big Data discussion.
Where did the term “Big Data” originate? There is no consensus in the community that would bestow credit on one individual; however, Steve Lohr wrote an interesting article on the New York Times Bits Blog that provides some of the history. He leans toward crediting John Mashey, who was the Chief Scientist at Silicon Graphics in the 1990s. When Mr. Lohr asked him about it, Mr. Mashey replied that Big Data is such a simple term, it’s not much of a claim to fame. His role, if any, was to popularize the term within a pocket of the high-tech community in the 1990s: “I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing.” In his own words, Mr. Mashey used the term to label a “range of issues” with no intent for it to represent a single concept.
The motivation for this article comes from a presentation on Big Data that I delivered to a group of executives at the Brookings Institution. The presentation began with polling the group on their definition of Big Data and how it affects them on a daily basis. I wasn’t surprised when no single definition emerged. Clearly, the next level of discussion needs to focus on specific ideas. The remainder of my presentation dug deeper into the most talked-about Big Data concepts, introducing a common language for understanding the important issues.
First, I want to connect the concept of Big Data with the field of analytics. People are thinking about Big Data as some sort of monolithic monster, disconnected from the framework of how data can be used to answer fundamental business questions using analytics (predictive and descriptive). I find that discussions about specific concepts of Big Data are inseparable from the analytic techniques that might be used to make sense of the ocean of data.
Big Data breaks down into five concepts: Volume, Velocity, Variety, Veracity and Value. Value is essential or all of our efforts are a waste of time! Veracity, or truthfulness of the data, is also vital, but we’ll concentrate here on the first three V’s. Some of our thinking on the big 3 is supported by a presentation given by Sue Feldman, CEO of Synthexis, who described “Big Data and Analytics” as a set of technologies that:
- Solve complex information problems economically
- Collect, manage, mine, analyze, and model large volumes or streams of information
- Integrate information from varied sources for deeper and broader understanding
- Deliver faster and better understanding of phenomena like weather, diseases, customers, products, risks, marketplace trends, etc.
- Enable high velocity data capture, discovery, and analysis (emphasis mine)
In this short blog, I’ll connect each of the three big V’s with one or more analytic techniques. Keep in mind that all three concepts may apply simultaneously to a single information source.
Volume
People naturally associate Big Data with a lot of data volume. Most of what we hear from the media centers on terabytes and petabytes of data. The good news is that the cost and physical space of data storage continue to drop. I remember my days working on a suitcase-sized computer, managing an analysis program for a football team within 256 kilobytes of memory and two 512-kilobyte floppy drives. Recently, while perusing the local Best Buy, I noticed that you can buy microSD cards the size of your thumbnail that store 128 gigabytes, or, in 1990s terms, 128,000 times the storage of those two 5.25″ floppy drives. A downside to cheap and compact storage is that we are becoming data packrats, maintaining data out of regulatory fear or a belief that the data might be useful someday.
Dr. Peter Aiken, an expert in data governance, estimates that 80% of the data stored by organizations is what he refers to as ROT (redundant, obsolete, or trivial). We see examples of ROT in nearly every project: data warehouses that duplicate and summarize data from native databases (redundant), sales data from a product discontinued five years ago (obsolete), and the day of the week stored alongside a date (trivial, since it can easily be derived from the date and provides no additional value). Dr. Aiken believes that much of the ROT problem is created by collecting data without an identified business need; in other words, we keep data simply because we have it, not because we necessarily need it.
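To make the “trivial” category concrete, here is a minimal sketch in Python (with a made-up stored date) showing that a day-of-week column never needs to be stored, because it can always be derived on demand:

```python
from datetime import date

# Hypothetical stored transaction date; the day of week is trivial ROT
# because it can always be derived rather than stored alongside it.
transaction_date = date(2016, 3, 14)

day_of_week = transaction_date.strftime("%A")  # derive on demand
print(day_of_week)  # "Monday"
```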
Pruning the data to retain only the useful parts will help reduce the overall volume, but the challenge remains to crunch through the residual information. The most common approach is to copy the information from the source(s) into an analytical software platform (such as SAS, SPSS, or R). But copying data in large volumes takes a considerable amount of time, especially when the data is arriving at high velocity, and processing it on a memory-limited computer can take days or weeks, which may not be an acceptable time frame for the business. This is where volume and velocity begin to overlap. Fortunately, many promising commercial and open source approaches are emerging that help speed up this process.
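As one illustration of the kind of approach that helps, the sketch below (a minimal example, assuming pandas and a hypothetical transactions.csv file with txn_date and amount columns) summarizes a large file in fixed-size chunks instead of copying the entire volume into memory at once:

```python
import pandas as pd

# Hypothetical example: summarize a large transaction file in chunks
# rather than loading the whole volume into memory at once.
daily_totals = {}

for chunk in pd.read_csv("transactions.csv",
                         parse_dates=["txn_date"],
                         chunksize=1_000_000):
    # Reduce each chunk to a small daily summary as it streams through.
    summary = chunk.groupby(chunk["txn_date"].dt.date)["amount"].sum()
    for day, total in summary.items():
        daily_totals[day] = daily_totals.get(day, 0.0) + total

print(f"Summarized {len(daily_totals)} days of transactions")
```

The same idea applies whether the chunks come from flat files, a database, or a message queue: reduce the data to the summaries the analysis actually needs as it streams through.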
Velocity
Data velocity is most easily understood as the rate at which information is being created, with most studies pointing to information creation accelerating. One of our clients generates an average of 10 million retail transactions per day. Beyond even that, consider the following staggering data rates:
- 500 million Tweets sent each day
- More than 4 million hours of content uploaded to YouTube every day
- 3.6 billion Instagram Likes each day
- 4.3 billion Facebook messages posted daily
- 5.75 billion Facebook likes every day
- 6 billion daily Google Searches
Another important consideration is obsolescence, or the relevance period of the incoming data stream. Tweets may have a very short-lived relevance period, while certain emails may have a much longer relevance period dictated by policy or law. Recently I came to the realization that I’m guilty of completely ignoring the relevance period in my personal life. As I was writing this article, I wondered how many years of tax information I had locked away in my old file cabinets. I had a good laugh at myself when I discovered neatly kept tax files all the way back to the early 1980s! Imagine this on a corporate and government-wide basis, especially as data generation continues to accelerate.
Besides considering a data relevance period to manage velocity, strategies to summarize the data based on business needs can significantly reduce the effect velocity has on volume. A good example is retail transaction data, which can be used in many ways, such as detecting embezzlement and measuring sales performance (by employee, retail unit, and product). For detecting embezzlement, transactions might be summarized on a monthly basis into a single score per employee and tracked over a 12-month rolling period. At the same time, sales data for each employee might be summarized by shift for the current sales period, by month for the prior sales period, and by quarter for the past three years. Proper design of the summarized data can make it easy to derive the retail-unit and product-level information without duplicating any of the stored information. These examples are for illustration; there is no rule of thumb that works in all cases. Instead, the underlying business needs and analytics strategy must dictate how to manage velocity.
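As a minimal sketch of this kind of roll-up (assuming pandas and hypothetical employee_id, txn_time, and amount fields, not the client’s actual schema), the same raw transaction stream can be summarized at different grains for different business questions:

```python
import pandas as pd

# Hypothetical raw transaction stream: one row per sale.
txns = pd.DataFrame({
    "employee_id": [101, 101, 102, 102, 101],
    "txn_time": pd.to_datetime([
        "2016-01-05 09:15", "2016-02-11 14:02", "2016-01-20 10:30",
        "2016-04-02 16:45", "2016-05-19 11:10",
    ]),
    "amount": [25.00, 40.50, 12.75, 99.99, 7.25],
})

# Monthly roll-up per employee (e.g., feeding a 12-month rolling score).
monthly = (txns
           .groupby(["employee_id", txns["txn_time"].dt.to_period("M")])
           ["amount"].sum())

# Quarterly roll-up per employee for longer-term sales performance.
quarterly = (txns
             .groupby(["employee_id", txns["txn_time"].dt.to_period("Q")])
             ["amount"].sum())

print(monthly)
print(quarterly)
```

Because each roll-up keeps only the grain the business question needs, stored volume grows with the number of employees and periods rather than with the raw transaction rate.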
Variety
Data variety is perhaps the toughest challenge of the three Big Data concepts, but it also holds the most promise for increasing the value of data assets. We have seen many examples of important discoveries made using simple reporting on connected information, discoveries that were not possible when the information sources were distinct. While our experience with the additional richness provided by data variety has been overwhelmingly positive, there is also a danger of finding relationships that exist merely by random chance. Adopt powerful cross-validation techniques to avoid such false “discoveries”.
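As a minimal illustration of why cross-validation helps (using scikit-learn and purely synthetic data; the library choice and every name here are my own assumptions, not part of the original example), a model fit to random, unrelated features can look convincing in-sample while cross-validation shows it performs no better than chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: 200 records with 20 purely random features and a
# random yes/no outcome, so any apparent relationship is pure chance.
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000)

# An in-sample fit can look deceptively good...
in_sample_accuracy = model.fit(X, y).score(X, y)

# ...while 5-fold cross-validation shows accuracy near chance (~0.5),
# flagging the "discovery" as noise rather than signal.
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()

print(f"In-sample accuracy:       {in_sample_accuracy:.2f}")
print(f"Cross-validated accuracy: {cv_accuracy:.2f}")
```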
For a positive example, one of our customers, the U.S. Postal Service Office of Inspector General, measured a 30% decrease in hours worked per case due to a powerful visualization tool, our Risk Assessment Data Repository (RADR), that allowed investigators to explore connected data in ways not possible before. The customer already had a risk-based model and compared “hours-worked per case” before and after the data was made available through the visualization. Being able to “see” relationships in the risk score and connected data focused their efforts for quicker resolution.
The challenge of variety is reliably connecting different information sources that were never purposefully designed to be connected. A typical example is connecting social media with customer survey information to better understand customer sentiment. Connecting the same people from both sources and within the same time period can sometimes be more challenging than it appears on the surface.
Reliability is a concern because there is no jointly designed and managed “key” shared between the information sources. However, when enough bits of information from multiple sources agree closely enough, confidence in a match rises to a comfortable level; this is typically called a fuzzy match. Fuzzy matching can use a combination of business rules and advanced analytics to measure and boost confidence. The required level of comfort in a fuzzy match depends on organizational risk tolerance and how the connected information will be used.
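As a minimal sketch of the idea (in Python, with an illustrative name-similarity threshold and time-window rule that are my assumptions rather than recommended standards), a simple fuzzy match might combine a business rule with a string-similarity score:

```python
from datetime import date
from difflib import SequenceMatcher

def fuzzy_match(name_a, name_b, date_a, date_b,
                name_threshold=0.85, max_days_apart=30):
    """Illustrative fuzzy match: a business rule (records must fall within
    the same time window) plus a string-similarity score on the names."""
    within_window = abs((date_a - date_b).days) <= max_days_apart
    name_score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return within_window and name_score >= name_threshold, name_score

# A survey respondent and a social media profile name that likely refer
# to the same person, recorded within the same month.
matched, score = fuzzy_match("Jonathan Smith", "Jonathon Smith",
                             date(2016, 3, 1), date(2016, 3, 18))
print(matched, round(score, 2))
```

In practice the matching rules and threshold would be tuned to the organization’s risk tolerance, exactly as described above.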
As Mr. Mashey explained, “Big Data” is such a simple term; it’s not much of a claim to fame. A Big Data effort might also be no claim to fame (more like a Big Data disaster) if it is not connected by analytics to important and specific business or organizational questions. Please ignore any slide that shows a super-clean data analytic environment protected by a very thin layer labeled something like “Data Cleansing”. This is like building a house in the middle of a kudzu field, drawing a circle around it labeled “Kudzu Cleansing”, and hoping your house will not be taken over by the invasive weed. Instead, data analytics begins with the incoming stream of information and actively manages the associated “Big Data” volume, velocity, and variety. When done right, the return on investment makes all the stakeholders happy indeed!