Is Big Data a pile of cack?

80% of conclusions drawn from Big Data are incorrect. [link]

The UK government is investing £64 million in Big Data. [link]

I was fooled by Big Data; you don’t have to be.
Here is my story.

Years ago I worked in an Awfully Big Public Sector Organisation and I got my hands on some Big Data. Some HUGE data. The sort of HUGE DATA large public sector organisations specialise in. The pointless sort.

[Image: a metaphor of the previous sentence]

When I started at the Awfully Big Public Sector Organisation I was told in awestruck tones about Data Mining: that we had such Big Data you’d need a hard hat and a pickaxe to get to the bottom of it. This was something the Sick Sigmas specialised in. They would dig down deep, root around, and return to the surface clutching a glowing ball of insight.

I was jealous of them and their imaginary coloured belts, so when the opportunity arose for ME to descend into the depths I was keen as mustard to return with MY bit of glowing insight. I wanted to find out why some of our records did not match up with someone else’s records. They should have done. My job was to find out why they didn’t, so I thought I would root around in the data until I found out.

As the name implies there was an awful lot of this Big Data, so much that Excel didn’t do it justice; there were Big Databases to hold the Big Data. And with Big Data comes the need for Big Queries to go in and extract the data I wanted to look at: a smaller set of data that I could work stuff out from. Sorry for the technical jargon, but it’s inevitable.
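To make “Big Queries” concrete, here is a rough sketch of the kind of extraction step I mean, written in Python. Everything in it is hypothetical: the file names, the record_id column and the status fields are stand-ins for whatever the Big Database actually held.

```python
# A hedged sketch of a "Big Query": pull two sets of records out of the
# Big Data and find the ones that should match but don't.
# All file and column names here are hypothetical stand-ins.
import pandas as pd

# In reality these came out of a Big Database, not tidy CSV files
ours = pd.read_csv("our_records.csv")
theirs = pd.read_csv("their_records.csv")

# Join on a shared record ID and keep only the rows that disagree
merged = ours.merge(theirs, on="record_id", suffixes=("_ours", "_theirs"))
mismatches = merged[merged["status_ours"] != merged["status_theirs"]]

print(f"{len(mismatches)} records that should match but don't")
```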

[Image: my MAD SKILLZ in 1998]

I had MAD SKILLZ at databases, as summarised in this picture, so I felt pretty confident the answer was in the data I pulled out.

So I looked at the data.

For hours…


Suddenly BEHOLD! I found my insight! I had my nugget of the purest knowledge!
I had found things that correlated with other things!
By the power of DATA I had found the truth!

[Image]

BEHOLD! [removes lump of Green from pot] Oh, Edmund… can it be true? That I hold here, in my mortal hand, a nugget of purest Green?

Sadly, it was all cack. Despite all the maths and data, I’d been fooled.

I WAS FOOLED BY RANDOMNESS.

Big Data is Big Noise. The more data you have, the more noise there is in it, as demonstrated by Nassim Taleb below. If you don’t know the difference between signal and noise, you are going to be fooled by the random correlations that exist in the data.
[Image]

The more spurious links, the more noise, the less chance you’ll hit upon the tiny signal hidden inside the data.

Big Data has noise hardwired into it. It is not a bug, it is a feature.
It is to be expected, but if people don’t know about noise, they’ll certainly not find the signal. The rate at which people confuse noise with signal I am calling “the rate of wrongness”.
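You can watch the rate of wrongness happen from the comfort of your own computer. Below is a quick sketch (mine, not Taleb’s, and entirely made up for illustration): feed columns of pure random numbers into a correlation hunt and count the “insights” that fall out. Every column is noise by construction, so every correlation it turns up is spurious.

```python
# Pure noise in, "insights" out. Every column below is random, so every
# correlation we "discover" is spurious by construction.
import numpy as np

rng = np.random.default_rng(1998)  # the year of my MAD SKILLZ
n_rows = 100  # records pulled out of the Big Database

for n_cols in (10, 50, 200):
    data = rng.normal(size=(n_rows, n_cols))   # pure noise
    corr = np.corrcoef(data, rowvar=False)     # every pairwise correlation
    pairs = np.triu_indices(n_cols, k=1)       # count each pair once
    insights = int(np.sum(np.abs(corr[pairs]) > 0.3))
    print(f"{n_cols:4d} columns of noise -> {insights} 'insights'")
```

With 100 rows and 10 columns you usually find nothing. With 200 columns you typically find a few dozen correlations strong enough to look like nuggets of purest Green, and every single one of them is cack. More data, more pairs, more noise.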

This study here shows the rate of wrongness in Big Data is 80%.

Yes, you read that right, 80% of conclusions drawn from Big Data are later found to be incorrect. Incorrect being a polite word for wrong.

So if digging around for correlations, no matter how fancily it is done, produces nonsense most of the time, why do it? I think Big Data is just the latest pair of Magic Goggles. It is technology, and it is new: the worst combination for suckering us in. The data is also just lying around, so analysing it seems free, or as free as £64 million can be.

Why is this a problem? Because Governments will be making policy decisions and choices about public services using Big Data. And as the UK public sector has such a strong track record of ICT projects, it will no doubt be using private sector partners to successfully deliver Big Data projects.

[Image]

Everyone knows this old graph by now: the statistically proven link between pirates and global warming. And yet NOT everyone knows about signal and noise.


I think the £64 million could be better spent.
There are 5.6 million people employed in the public sector in the UK.
If we spent £11 on each person, we could send them one copy each of these gems about understanding numbers…

[Images: the books in question]


8 Responses to Is Big Data a pile of cack?

  1. You rightly identify the need to separate the signal from the noise, fundamental to science of course. So the problem is not as your headline and conclusion seem to imply? The availability of “big data”, which itself means different things to different people of course, is to be welcomed IF correctly interpreted. Better that evidence-based decision making is at least possible. Then at the very least somebody else gets the chance to reanalyse and disagree.

    This seems analogous to the problem of non-critical appraisal of the administrative paper flows associated with e.g. waterfall project planning and management, as you rightly identify on a regular basis in this blog! To be meaningful, the “requirements gathering” and feedback needs to be iterative and based on a critical appraisal of the available evidence, informed and driven by the users, not whatever a project manager/administrator or statistically unaware investigator derives from the paperwork or “big data” alone?


  2. WordPress/iOS7/something has just zilched 2 paras of reply to this for no apparent reason. But suffice to say, just as waterfall project management methodologies make it too easy to be seen to be working on a problem rather than addressing the real challenges driven by the real needs of users, likewise “big data” in itself is obviously not a bad thing. It’s all about the interpretation, and helping with that would be a very good way to spend Government money imho. Just as an uncritical processing of project managerial documents is misleading and gets you nowhere fast, a non-statistical appraisal of datasets will not enable evidence-based decision making. Making the “big data” available at least means someone else can reanalyse and disagree. This is also supposed to be fundamental to scientific practice. Seems to me that this could be a very good way to spend Government money, but it all depends how it’s spent.


  3. jtedds says:

    “Waterfall” project management methodologies make it too easy to be seen to be working on a problem to satisfy audit trails rather than addressing the real challenges driven by the real needs of users. “Big data” in itself is obviously not a bad thing, interpreted correctly. It might reveal patterns in behaviour that address real needs, for example. Spending a relatively small amount of Government money on helping with the interpretation might in fact be a very good idea.

    Just as an uncritical processing of project managerial documents is misleading and gets you nowhere fast, a non-statistical appraisal of datasets will not enable evidence-based decision making. Making the “big data” available at least means someone else can reanalyse and disagree. This is also supposed to be fundamental to scientific practice.

    I think this is what you mean but it starts off sounding like big data is useless and any Government spending on it is a waste of time? That could easily happen but it needn’t if interpreted correctly.


    • ThinkPurpose says:

      My point is, in the real world, person + Big Data = 80% wrong.
      It ‘could’ be done better, but it IS still only 20% right. Same as saying projects would work if only they were ‘done right’. If only they were. But they’re not.


      • ThinkPurpose says:

        I prefer small data. Data collected by people doing the work, using pencil and paper and 5-bar gates. The data being used where it comes from, at source, by people who know the vagaries and strengths of it. Data collected because it helps understand what’s happening there.
        I like hand-hewn data.


  4. Agree about the importance of small data, the context, the hand-hewn. Projects, as you point out, often fail in this fundamental aspect. But these days data, whether big or small, is increasingly out there, aggregated and available. So we’d better help people understand it and make it understandable. Better if it’s the ones who understand the data in context who do this before it is unleashed? Most people feel they need a little help with that, or at least some incentives to make it happen.


  5. Pingback: What is Tiny Data and why is it crucial? | thinkpurpose

  6. Marc says:

    ThinkPurpose & Jonathan +1 for artisinal data! Big data can be converted to tiny data. context purpose meta data

