Okay. You got me. I can’t really tell you everything you need to know about big data. The one thing I discovered last week, as I joined more than 2,500 data junkies from around the world for the O’Reilly Strata conference in rainy Santa Clara, California, is that nobody can: not Google, not Intel, not even IBM. All I can guarantee you is that you’ll be hearing a lot more about it.
What is big data? Roughly defined, it refers to massive data sets that can be used to predict or model future events. That can include everything from the online purchase history of millions of Americans (to predict what they’re about to buy) to where people in San Francisco are most likely to jog (according to GPS) to Facebook posts, Twitter trends and 100-year storm records.
With that in mind, here are the three most important things you need to know about big data right now:
1. The data experts are organizing and they want a revolution!
Data mining (the primordial ancestor of what is today called predictive analytics) used to be considered a company- or organization-specific problem. The data and the people who worked on it were, in effect, “siloed.” What would a statistics expert in the military and a number cruncher working in retail marketing have to talk about?
These days, it turns out, there’s a lot to discuss. First, new open source data-crunching tools like Hadoop (a distributed computing framework that lets you gang together thousands of computers to solve problems) are helping organizations big and small develop their own data departments at a small fraction of what specialized software cost a few years ago. That means that the skills that miners are acquiring in one industry, like retail, are increasingly applicable in other sectors, like government. Second, combining data sets yields new insights, and the number of available sets (in some easily crunchable form like XML or just Excel) is growing.
“Over the last couple of years we’ve seen the horizontalization of data scientist,” says Alistair Croll of Bitcurrent, one of the organizers of the conference.
There was a considerable (but not surprising) consensus among attendees that data and analytics should drive a lot more decision-making within organizations, even if that better-informed strategizing comes at the expense of traditional managers, who will argue that their hard-won expertise is much more valuable than any model based on statistics. More and more often, they’ll be shown to be wrong.
There’s plenty of debate over whether everyone who works with large data sets in a technical way should get to call themselves a “data scientist.” It may be a matter of the uniqueness of the research, or just a price point.
According to JC Hertz, “if you’ve got someone in your organization that can do analytics, don’t call them a data scientist. They’ll ask you for a $20,000 raise and then get a job down the road.”
What that means, according to Hertz: “Data driven decisions have consequences. There can be political and cultural fallout. This is a gating condition that you need in the beginning. You have to say, this might [anger] x, y, z, and know that in the beginning. Not just outside the organization, but within. You need to know the political consequences of any given data-driven decision and who that decision will tick off.”
2. You’re going to be asked to opt in to sharing your data a lot more.
A major topic for discussion this week was the Target snafu. As originally reported in the New York Times (reg. req.), Target raised a lot of eyebrows when the company used customer data and predictive analytics to figure out that one of its customers was pregnant and, more remarkably, what trimester she was in. The company sent her some promotional material, and the girl’s father discovered his daughter was pregnant based on the coupons she started receiving from a big-box retailer, which no doubt gave rise to an awkward conversation.
Most of the people I spoke with here agreed that Target made a mistake in that case, but they believed the error wasn’t in collecting the data and then using it for marketing so much as doing so without permission.
Big organizations are just beginning to realize the huge upside potential of using massive amounts of data to predict everything from what their customers are going to start buying to which of their employees will complete a certain project on time. More importantly, that data is getting increasingly easy and cheap to collect, and there’s already an enormous storehouse of it to aid in pattern extrapolation.
So where is the middle ground? According to many of the folks here, it’s the point where people knowingly agree to contribute data. As one programmer put it, “Spying is the act of collecting data [in] secret. Transparent data collection with defined boundaries is NOT spying.”
What that means: more companies will look to make the case that allowing them to track your behavior will benefit you. If enough people buy the pitch, societal attitudes about data tracking will change. There are a lot of things organizations can do to make the offer a good one for consumers, but they haven’t yet.
As Alistair Croll of Bitcurrent put it, “Imagine if that [New York Times] article had said, Target figured out that 1% of its customer base had cancer and it told them. I would sign up for a program that tracked my purchases to let me know if there was a correlation between what I bought and what people that got colon cancer bought.”
3. The stuff you can predict is amazing; the stuff you can’t is frustrating.
This conference was full of amazing examples of people using big data to predict things. According to Google’s Hal Varian, the volume of search queries such as “Sign up for unemployment” can predict future unemployment claims with a high degree of accuracy one week before the U.S. government releases official numbers. Coupon and rebate search queries are an excellent predictor of weak economic times ahead.
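For readers who want to see the basic idea in miniature, here is a rough sketch in Python of how a search-volume signal could be used to predict claims one week ahead. It is not Varian’s actual method, which the talk did not spell out. It simply fits a straight line between last week’s query volume and this week’s claims. The function names (`fit_lagged_line`, `predict_next`) are hypothetical, and the numbers are synthetic, constructed so that claims follow queries exactly; real data would be far noisier.

```python
from statistics import mean


def fit_lagged_line(queries, claims, lag=1):
    """Least-squares fit of claims[t] ~ slope * queries[t - lag] + intercept."""
    if lag < 1 or len(queries) != len(claims) or len(queries) <= lag + 1:
        raise ValueError("need equal-length series longer than lag + 1")
    x = queries[:-lag]  # query volume in the earlier weeks
    y = claims[lag:]    # claims observed `lag` weeks later
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("query volume is constant; slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_next(queries, slope, intercept):
    """Predict the next period's claims from the latest query volume."""
    return slope * queries[-1] + intercept


# Synthetic, illustrative numbers only (NOT real search or labor data).
# Claims were constructed as 2 * (previous week's queries) + 100.
weekly_queries = [50, 55, 53, 60, 66, 64, 70]
weekly_claims = [210, 200, 210, 206, 220, 232, 228]

slope, intercept = fit_lagged_line(weekly_queries, weekly_claims)
forecast = predict_next(weekly_queries, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.1f}")
```

The interesting part is the one-week offset between the two series: the search data is available immediately, while the official figures arrive later, so even a simple model can offer an early read.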
Having said that, the hype on big data is likely to grow faster than the actual capabilities, as are incidents of “data washing,” especially considering how early we are on the hype cycle (see graph above).
“The most prevalent model in the industry to address this problem is MCU, make crap up,” according to marketing guru Avinash Kaushik.
What that means: Too many organizations are too focused on collecting data without a clear sense of what to do with it. The order should be reversed, according to several presenters. If you want to get started with data-driven decision making, first set goals and then start amassing and crunching data sets around those goals.
Most importantly, many agreed that having great data collection and analysis capability is useless if an organization doesn’t have internal processes in place to allow people to use the new info, and not just at the top of the corporate pyramid.
“You’ve got to empower every person to make decisions with data,” according to Kaushik. “Say, ‘You, Janitor! You will be in charge of using Data to make your job better!’”
Bottom line: Big data is going to change the way organizations and individuals deal with information and plan ahead. Many of those transitions will be difficult, but ten years from now, we’ll wonder how we got along without it. Even after the hype cycle on big data goes from peak to valley, there’s still a lot to look forward to.