Big Data as statistical masturbation
Rick Searle   Feb 15, 2015   Utopia or Dystopia  

It’s just possible that there is a looming crisis in yet another technological sector whose proponents have leaped too far ahead, too soon, promising all kinds of things they are unable to deliver. It’s strange how we keep ramming our heads into this same damned wall, but this next crisis is perhaps more important than the deflated hype of earlier eras: our over-optimism about the timeline for human space flight in the 1970’s, the “AI winter” of the 1980’s, or the miracles that seemed just at our fingertips when we cracked the Human Genome while pulling riches out of the air during the dotcom boom, both of which brought us to a state of mania in the 1990’s and early 2000’s.

The thing that separates a potential new crisis in the area of so-called “Big Data” from these earlier ones is that, almost overnight, we have reconstructed much of our economy and national security infrastructure on its yet-to-be-proven premises, eroding our ancient right to privacy in the process. Now we are on the verge of changing not just the nature of the science upon which we all depend, but nearly every other field of human intellectual endeavor. And we have done, and are doing, this despite the fact that the most over-the-top promises of Big Data are about as epistemologically grounded as divining the future by looking at goat entrails.

Well, that might be a little unfair. Big Data is helpful, but the question is: helpful for what? A tool, as opposed to a supposedly magical talisman, has its limits, and understanding those limits should lead us not to jettison large-scale data analysis, but to ask what needs to be done to make these new capacities actually useful, rather than, like all forms of divination, comforting us with the idea that we can know the future and thus somehow exert control over it, when in reality both our foresight and our powers are far more limited.

Start with the issue of the digital economy. One model underlies most of the major Internet giants: Google, Facebook, and to a lesser extent Apple and Amazon, along with a whole set of behemoths few of us can name but that underlie everything we do online, especially data aggregators such as Acxiom. That model is essentially to gather up every last digital record we leave behind, many of them gained in exchange for “free” services, and to use this living archive to target advertisements at us.

It’s not only that this model has provided the infrastructure for an unprecedented violation of privacy by the security state (more on which below); it’s that there’s no real evidence that it even works.

Just anecdotally reflect on your own personal experience. If companies can very reasonably be said to know you better than your mother, your wife, or even you know yourself, why are the ads coming your way so damn obvious, and frankly even oblivious? In my own case, if I shop online for something, a hammer, a car, a pair of pants, I end up getting ads for that very same type of product weeks or even months after I have actually bought a version of the item I was searching for.

In large measure, the Internet is a giant market in which we can find products or information. Targeted ads can only really work if they are able to refract the information I am searching for in their marketed product’s favor — that is, if they lead me to buy something I would not otherwise have purchased. Derek Thompson, in the piece linked to above, points out that this problem is called endogeneity, or more colloquially: “hell, I was going to buy it anyway.”

The problem with this economic model, though, goes even deeper than that. At least one-third of clicks on digital ads aren’t human beings at all but bots that represent a way of gaming advertising revenue like something right out of a William Gibson novel.

Okay, so we have this economic model based on what at root is really just spyware, and despite all the billions poured into it, we have no idea whether it actually affects consumer behavior. That might be merely an annoying feature of the present rather than something to fret about, were it not for the fact that this surveillance architecture has apparently been captured by the security services of the state. Their model is essentially a darker version of its commercial forebear. Here the NSA, GCHQ, et al. hoover up as much of the Internet’s information as they can get their hands on. Ostensibly, they’re doing this so they can algorithmically sort through the data to identify threats.

In this case, we have just as many reasons to suspect that it doesn’t really work, and though the intelligence agencies claim it does, none of them will actually show us their supposed evidence. The reasons to suspect that mass surveillance might suffer from flaws similar to those of mass “personalized” marketing were excellently summed up in a recent Financial Times article by Zeynep Tufekci, when she wrote:

But the assertion that big data is “what it’s all about” when it comes to predicting rare events is not supported by what we know about how these methods work, and more importantly, don’t work. Analytics on massive datasets can be powerful in analysing and identifying broad patterns, or events that occur regularly and frequently, but are singularly unsuited to finding unpredictable, erratic, and rare needles in huge haystacks. In fact, the bigger the haystack — the more massive the scale and the wider the scope of the surveillance — the less suited these methods are to finding such exceptional events, and the more they may serve to direct resources and attention away from appropriate tools and methods.
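To put rough numbers on Tufekci’s haystack point, here is a back-of-the-envelope sketch. The population, prevalence, and accuracy figures are mine, purely for illustration; the base-rate arithmetic is what matters:

```python
# Hypothetical numbers illustrating the base-rate problem with rare-event
# detection: even a very accurate classifier drowns in false positives
# when the event it hunts for is extremely rare.

def flagged_counts(population, prevalence, sensitivity, false_positive_rate):
    """Return (true positives, false positives) for one screening pass."""
    actual = population * prevalence
    true_pos = actual * sensitivity
    false_pos = (population - actual) * false_positive_rate
    return true_pos, false_pos

# Suppose 100 genuine threats hidden among 300 million people, and a
# classifier that catches 99% of them with only a 0.1% false-positive rate.
tp, fp = flagged_counts(300_000_000, 100 / 300_000_000, 0.99, 0.001)

print(f"true positives:  {tp:.0f}")        # 99
print(f"false positives: {fp:,.0f}")       # roughly 300,000
print(f"precision: {tp / (tp + fp):.5f}")  # well under 1% of flags are real
```

Under these (generous) assumptions, fewer than one flag in three thousand points at a real threat — exactly the “resources and attention directed away” problem Tufekci describes.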

I’ll get to what’s epistemologically wrong with using Big Data in the way the NSA does, which Tufekci rightly criticizes, in a moment. But on a personal rather than societal level, the biggest danger from getting the capabilities of Big Data wrong seems most likely to come through its potentially flawed use in medicine.

Here’s the kind of hype we’re in the midst of, as found in a recent article by Tim McDonnell in Nautilus:

We’re well on our way to a future where massive data processing will power not just medical research, but nearly every aspect of society. Viktor Mayer-Schönberger, a data scholar at the University of Oxford’s Oxford Internet Institute, says we are in the midst of a fundamental shift from a culture in which we make inferences about the world based on a small amount of information to one in which sweeping new insights are gleaned by steadily accumulating a virtually limitless amount of data on everything.

The value of collecting all the information, says Mayer-Schönberger, who published an exhaustive treatise entitled Big Data in March, is that “you don’t have to worry about biases or randomization. You don’t have to worry about having a hypothesis, a conclusion, beforehand.” If you look at everything, the landscape will become apparent and patterns will naturally emerge.

Here’s the problem with this line of reasoning, a problem that I think is the same as, and shares the same solution with, the issue of mass surveillance by the NSA and other security agencies. It begins with the idea that “the landscape will become apparent and patterns will naturally emerge.”

The flaw in this reasoning has to do with the way very large data sets work. One would think that sampling millions of people, as ubiquitous monitoring now allows us to do, would offer enormous gains over the population samples of a few thousand to which we used to be confined, yet this isn’t necessarily the case. The problem is that the larger your sample, the greater your chance of false correlations.
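A quick pure-noise simulation (my own illustration, not from the article) shows how this happens: generate enough random “features” and some of them will correlate with any outcome by chance alone.

```python
import random

random.seed(0)

def pearson(x, y):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 200 "people", 2,000 candidate features -- all of it pure random noise.
n_people, n_features = 200, 2000
outcome = [random.gauss(0, 1) for _ in range(n_people)]
features = [[random.gauss(0, 1) for _ in range(n_people)]
            for _ in range(n_features)]

# |r| > 0.14 is roughly the p < .05 significance threshold at n = 200.
spurious = sum(1 for f in features if abs(pearson(f, outcome)) > 0.14)
print(spurious)  # dozens of "significant" correlations, every one of them noise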

Previously I had thought that surely this was a problem statisticians had either solved or were on the verge of solving. They haven’t, at least according to the computer scientist Michael Jordan, who fears that we might be on the verge of a “Big Data winter” similar to the one AI went through in the 1980’s and 90’s. Let’s say you had an extremely large database with multiple forms of metrics:

Now, if I start allowing myself to look at all of the combinations of these features—if you live in Beijing, and you ride bike to work, and you work in a certain job, and are a certain age—what’s the probability you will have a certain disease or you will like my advertisement? Now I’m getting combinations of millions of attributes, and the number of such combinations is exponential; it gets to be the size of the number of atoms in the universe.

Those are the hypotheses that I’m willing to consider. And for any particular database, I will find some combination of columns that will predict perfectly any outcome, just by chance alone. If I just look at all the people who have a heart attack and compare them to all the people that don’t have a heart attack, and I’m looking for combinations of the columns that predict heart attacks, I will find all kinds of spurious combinations of columns, because there are huge numbers of them.

The actual mathematics of distinguishing spurious from potentially useful correlations is, in Jordan’s estimation, far from being worked out:

We are just getting this engineering science assembled. We have many ideas that come from hundreds of years of statistics and computer science. And we’re working on putting them together, making them scalable. A lot of the ideas for controlling what are called familywise errors, where I have many hypotheses and want to know my error rate, have emerged over the last 30 years. But many of them haven’t been studied computationally. It’s hard mathematics and engineering to work all this out, and it will take time.

It’s not a year or two. It will take decades to get right. We are still learning how to do big data well.
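For the curious, the oldest and simplest of the familywise-error controls Jordan alludes to is the Bonferroni correction: when testing m hypotheses at once, demand p < α/m from each one rather than p < α. A toy sketch (my illustration, using uniform p-values to stand in for pure-noise hypotheses):

```python
import random

random.seed(1)

# Under the null hypothesis, p-values are uniformly distributed on [0, 1].
# Simulate 10,000 tests where nothing real is going on.
m, alpha = 10_000, 0.05
p_values = [random.random() for _ in range(m)]

# Naive testing at p < .05 "discovers" about alpha * m = 500 false effects.
naive = sum(p < alpha for p in p_values)

# Bonferroni: test each hypothesis at alpha / m instead, so the chance of
# even ONE false discovery across the whole family stays near alpha.
bonferroni = sum(p < alpha / m for p in p_values)

print(naive, bonferroni)  # hundreds of false hits vs. (almost always) zero
```

The catch, and part of what Jordan means by “hard mathematics and engineering,” is that corrections this blunt also destroy your power to detect real effects, and the smarter alternatives don’t yet scale computationally to databases with millions of columns.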

Alright, now that’s a problem. As you’ll no doubt notice, the danger of false correlation that Jordan identifies as a problem for science is almost exactly the critique Tufekci made of the NSA’s mass surveillance. That is, unless the NSA and its cohorts have actually solved the statistical and engineering problems Jordan identifies and haven’t told us, all the biggest data haystack in the world will yield is too many leads to follow, most of them false, and many of which will drain resources from actual public protection. Perhaps equally troubling: if the security services have solved these statistical and engineering problems and are keeping the solution secret, how much research funding will be wasted, and how many lives lost, because medical scientists were kept from the tools that would have empowered their research?

At least part of the solution to this will be remembering why we developed statistical analysis in the first place. Herbert I. Weisberg, with his recent book Willful Ignorance: The Mismeasure of Uncertainty, has provided a wonderful, short primer on the subject.

Statistical evidence, according to Weisberg, was first introduced into medical research back in the 1950’s as a protection against exaggerated claims of efficacy and widespread quackery. Since then we have come to take a p-value of .05 almost as truth itself. Weisberg’s book is really a plea to clinicians to know their patients, and not to rely almost exclusively on statistical analyses of “average” patients when helping those in their care make life-altering decisions about which medicines to take or procedures to undergo. Weisberg thinks personalized medicine will solve these problems over the long term, and while I won’t go into my doubts about that here, I do think that in the experience of the physician he identifies the root of the solution to our Big Data problem.

Rather than think of Big Data as somehow providing us with a picture of reality “naturally emerging,” as Mayer-Schönberger suggested above, we should start to view it as a way to easily and cheaply gauge the potential validity of a hypothesis. And it’s not only the first step that should continue to be guided by old-fashioned science rather than computer-driven numerology, but the remaining steps as well: a positive signal followed up by actual scientists and other researchers exercising such now-rusting skills as running real experiments and building theories to explain their results. Big Data, done right, won’t turn science into a form of divination, but will instead be used as a primary tool for keeping scientists from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old-school human intelligence, even if it now needs to be empowered by new digital tools. Rather than using Big Data to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted, based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence, and toward more targeted surveillance rather than the mass data grab symbolized by Bluffdale, may be a reality forced on the NSA et al. by events. In part due to the Snowden revelations, terrorist and criminal networks have already abandoned the non-secure public networks the rest of us use. Mass surveillance has lost its raison d’être.

At least in terms of science and medicine, I recently saw a version of how Big Data done right might work. In an article for Quanta and Scientific American, Veronique Greenwood discussed two recent efforts by researchers to use Big Data to find new understandings of, and treatments for, disease.

The physicist (not biologist) Stefan Thurner has created a network model of comorbid diseases, trying to uncover the hidden relationships between different, seemingly unrelated medical conditions. What I find interesting about this is that it gives us a new way of understanding disease, breaking free of hermetically sealed categories that may blind us to mechanisms shared across conditions. I find this especially pressing when it comes to mental health, where the kind of symptom-listing found in the DSM — the Bible of mental health care professionals — has never resulted in a causative model of how conditions such as anxiety or depression actually work, and rests on an antiquated separation between mind and body, not to mention a neglect of the social and environmental factors that give shape to mental health.

Even more interesting, from Greenwood’s piece, are the efforts of Joseph Loscalzo of Harvard Medical School to come up with a whole new model of disease, one that looks beyond genome associations to map out the molecular networks of disease, isolating the statistical correlation between a particular variant of such a map and a disease. This relationship between genes and proteins correlated with a disease is something Loscalzo calls a “disease module.”

Thurner describes the underlying methodology behind his, and by implication Loscalzo’s, efforts to Greenwood this way:

“Once you draw a network, you are drawing hypotheses on a piece of paper,” Thurner said. “You are saying, ‘Wow, look, I didn’t know these two things were related. Why could they be? Or is it just that our statistical threshold did not kick it out?’” In network analysis, you first validate your analysis by checking that it recreates connections that people have already identified in whatever system you are studying. After that, Thurner said, “the ones that did not exist before, those are new hypotheses. Then the work really starts.”

It’s in the next steps — the testing of hypotheses, the development of a stable model — that the most important work really lies. Like any intellectual fad, Big Data has its element of truth. We can now much more easily distill large, and sometimes previously invisible, patterns from the deluge of information in which we are drowning. This has potentially huge benefits for science, medicine, social policy, and law enforcement.

The problem comes from thinking that we are at the point where our data-crunching algorithms can do the work for us, and are about to replace human beings and their skill at investigating problems deeply and in the real world. The danger there is thinking that knowledge could work like self-gratification: a mere thing of the mind, without all the hard work, compromises, and conflict between expectations and reality that go into a real relationship. Ironically, this was a truth perhaps discovered first not by scientists or intelligence agencies but by online dating services. To that strange story, next time….

Rick Searle, an Affiliate Scholar of the IEET, is a writer and educator living in the very non-technological Amish country of central Pennsylvania along with his two young daughters. He is an adjunct professor of political science and history at Delaware Valley College and works for the PA Distance Learning Project.


“Big Data” sounds similar to the misdiagnosis for “Super-intelligence”?

Rick Yes, I think I understand what you are getting at - we should use some plain old “Human” intelligence and do what you hint around but never quite state - “ask the right questions from the data”?

The problem is the larger your sample size the greater your chance at false correlations.

This sounds fundamentally flawed however, the “smaller” the sample size the more chance of false correlations?

Google analytics is used by the World Health Organisation to predict worldwide flu epidemic and to help organise logistics - don’t believe me?.. Google it

IBM Watson.. yadda, yadda, yadda… Global Health Oracle, good or evil, depends on which doc is using it perhaps - choose your GP wisely—if it walks like a duck it may well quack?

Mass surveillance - Here Big Data comes in very handy for the emerging Police state, as they can basically find something to pin you down with, someplace, sometime, unless you’re an Angel - they got Al Capone on Tax evasion. Yes - I agree, being blindsided by colloquialism and more soundbite - “Big Data” is not a carte blanche solution for justifying surveillance.

(ps. If you want to make Google analytics and “Big Data” work “for” democracy instead of against it, then continually type what “gets your goat” in the search engine, like say perhaps, “what is Universal Basic Income?”.. or “Capitalism is not sustainable”.. or “Mass Surveillance is a breach of Human privacy rights”... or “Justin Bebob”.. or..  (you get the “Big” picture?)


Samsung is warning customers to avoid discussing personal information in front of their smart television set.

The warning applies to TV viewers who control their Samsung Smart TV using its voice activation feature.

Such TV sets ‘listen’ to every conversation held in front of them and may share any details they hear with Samsung or third parties, it said.

Privacy campaigners said the technology smacked of the telescreens in George Orwell’s 1984, which spied on citizens.




“The problem is the larger your sample size the greater your chance at false correlations.”

“This sounds fundamentally flawed however, the “smaller” the sample size the more chance of false correlations?”

The problem with samples is that the larger they get, the more likely you are to find correlations that are merely the result of chance. Let’s say you wanted to test a crazy hypothesis like “eyebrow thickness is positively correlated with intelligence.” What is the optimal sample size for testing this question? A very small sample is as likely as not to leave you with silence: you’re not likely to get many thick-eyebrowed people in a random sample of, say, 10. But if you sampled, say, 100 million people, you might find hundreds or thousands of cases where intelligence appears positively correlated with eyebrow thickness. A better-sized sample for testing this hypothesis would probably be 1,000, because you’re more likely to get a group that is truly random.

The example is of course meant to be ridiculous, but my point is that just because we can now whip very large samples out of thin air doesn’t make them more accurate than truly randomized samples — they’re probably less so.

I think “Google Flu” is laudable, but there are doubts as to its efficacy:

Also, I wonder how negatively impacted the use of such tools will be given the media echo chamber. Can you imagine how many people searched for Ebola symptoms last year?!

What really surprised me as I was doing research for this piece is how statisticians have yet to really solve these so-called “familywise” errors. That is, how do you know that your bushy-eyebrows/intelligence correlation — or more seriously, the correlation between searches for flu symptoms and flu outbreaks, or between ad placement and purchases — isn’t just chance or the result of something else? The epistemological claim behind what the NSA and the GCHQ are doing is that they have solved the problem of these familywise errors. I don’t believe them, but if they have, and are not telling us how, they will actually have killed people by allowing medical research to be less effective than it would be if the breakthrough were shared.

@ Rick.. thanks for the reply

And thanks for the link regarding Google - was unaware of this. However, on reading the article it implies problems with calibration and leading bias in the data collection, as follows..

“Google constantly makes tweaks to its general search algorithm, averaging more than one a day, and the introduction of its “autosuggest” feature may make people more likely to search on terms related to influenza.

One problem in finding out why GFT has run amok is that Google has never disclosed which 45 search terms it uses, nor how it weights them, to generate its forecast.

“We do find evidence that Google changed how it serves up health-related information that likely resulted in more searches for terms related to flu cures, and that these terms tend to be more correlated with GFT than the CDC data,” Lazer commented in an email.

“This suggests that part of the answer is that those (unknown) GFT search terms are related to flu cures, and that the change of the search algorithm drove counts of those search terms up. But we don’t know that for sure. And even if the algorithm did not change at all, how people use tools changes over time – maybe people didn’t think of using Google for health-related information a decade ago (where the training data for GFT came from) and now they are more likely to.”

So again, this is reliant upon the pertinent questions being asked from the data, as well as the laziness in dealing with “thinking” about the data to be retrieved and any overall process fluctuations/corrections that need to be further implemented as per above, (tweaks)?

Yes, I get your point regarding being overwhelmed with “Big data”, and that the resultant blind spots may lead Humans to place too much reliance and overconfidence in it, yet the article above also proves that the focus is also on whether this data is accurate, (scrutiny). The norm customarily is that wild fluctuations in the numbers/stats over short periods, (flu trends most recently etc), need to be questioned for accuracy?

Also, I wonder how negatively impacted the use of such tools will be given the media echo chamber. Can you imagine how many people searched for Ebola symptoms last year?!

Yes I take your point, yet the flipside and down-side would be that there is no provision for “Big data” trends on Ebola if we so choose to ignore them? For example, the very same scenario above may well aid to “pinpoint” an outbreak in a local suburb very quickly, as the data gathered would be highly concentrated - and yet the system cannot be “vigilant” if we restrict its overall data/sample size?

It’s not merely about the media echo chamber/bad news/fear and anxiety channelled - the positive is again also the resultant vigilance from increased knowledge, education and awareness regarding Ebola, (thus “bad data” and exaggerations must be expected and filtered)? As we know Ebola shows symptoms very much like Flu in the earliest onset, as do a lot of other viral diseases, so there is no ultimate “Big data” magic wand for challenging these diseases.

The only real safeguard against global Ebola outbreaks “presently” is that the virus is fast acting and overwhelming, limiting its own ability to incubate for long periods and spread worldwide before epidemics emerge, (I say presently, as there is concern that due to recent outbreaks, mitigation against the disease may be helping it to evolve. The recent outbreak may also be creating those immune as potential carriers for the disease - the downside obviously that someone who has previously had the disease and overcome it could catch it again in future and show very little sign of illness but still spread it?)

So is “Big data” useful against Ebola/Flu and other?
Perhaps it depends how “high” up the priority select list/check boxes one labels it?

Regarding Eyebrows and intelligence - this is not a good example as you only have two datasets, so it is in fact impossible to link intelligence to Eyebrows - you need “more” data and larger sample sizes?


Thanks for the questions: this is the best way for me to clarify my own thinking.

“Regarding Eyebrows and intelligence - this is not a good example as you only have two datasets, so it is in fact impossible to link intelligence to Eyebrows - you need “more” data and larger sample sizes?”

You’re right, I would need at least an extra variable to make that example stick. Where I think we differ is on sample size.

It’s a core tenet of statistics that sample size matters far less than you’d think, so long as what you aim to measure follows a Gaussian bell curve. As long as what you’re measuring falls along that curve, your measurement answers what you set out to test, and your sample is truly random, you don’t need a large sample.

Take something like human height which follows a bell curve. Attempting to measure every person in a population wouldn’t give you a much more accurate approximation of random height than a well designed sample. This is the miracle of statistics!
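A quick simulation of the height example (the population size and the mean and spread of heights are made-up illustrative numbers) shows what I mean:

```python
import random

random.seed(42)

# A synthetic "population" of one million heights in cm, bell-curved
# around 170 with a standard deviation of 10 -- illustrative numbers only.
population = [random.gauss(170, 10) for _ in range(1_000_000)]
census_mean = sum(population) / len(population)

# A modest random sample of just 1,000 people.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

print(round(census_mean, 2), round(sample_mean, 2))
# The sample of 1,000 lands within a fraction of a centimetre of the
# answer you'd get by measuring all one million people.
```

The standard error of a 1,000-person sample here is about 10/√1000 ≈ 0.3 cm, which is why measuring everyone buys you almost nothing extra.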

To wrap this back around to the big data question: what the flood of data allows is for researchers to mine it to uncover previously unknown facts about the world. Yet what I would claim is that there is a temptation to use this data as if your sample had been drawn to answer the very question you are trying to pose. In other words, what you can draw from big data are just hints that something might be true. Such hints shouldn’t be presented as having even the admittedly tentative truth value of properly constrained statistical studies, and this especially applies to science, where any potentially significant signal will have to be fully interrogated and explained. I think we’re in danger of missing what will be the essential skill of the future, which isn’t our ability to analyze data but the more “old-fashioned” forensic skill of actually figuring out whether the data are meaningful.

As for Google Flu trends, I don’t think this will last as a valid way to predict disease outbreaks. The search algorithms it uses were never designed for this purpose and are likely to be very noisy.

I suppose eventually we’ll get to a sort of tricorder, where we use our smartphones to successfully diagnose many of our ailments, including the flu. Given the privacy concerns, hopefully these apps will be closed systems, perhaps with information flowing directly to the CDC or NHS for diseases of public concern, and, when we wish, to our general practitioner who, again hopefully, will be an expert at distinguishing signal from noise.

Or.. we could just ask the TV?

Whatever “Man” seeks to find, (from the data), he finds, (spooky or spiritual or whatever). “They” seek the elusive Higgs here.. They seek the Higgs there.. And some yet insist, it couldn’t possibly be there?... But there it is!!! (Told you we would find it if we looked hard enough) - By the Grace of the mysterious quantum Universe and energy/matter potential.

That’s assuming the Higgs field really does exist anyhoo.. and quarks, and gluons, and strings.. and time…

As is customary, I think we agree on most points as I stated in my first comment, yet I still say… YES! it is down to using some plain old intelligence and the questions we choose to ask of the data.

(Ps. I am in full support of Bell curves and the golden mean)


