80% of conclusions drawn from Big Data are incorrect.[link]
The UK government is investing £64 million in Big Data. [link]
I was fooled by Big Data; you don’t have to be.
Here is my story.
Years ago I worked in an Awfully Big Public Sector Organisation and I got my hands on some Big Data. Some HUGE data. The sort of HUGE DATA large public sector organisations specialise in. The pointless sort.
When I started at the Awfully Big Public Sector Organisation I was told in awestruck tones about Data Mining. We had such Big Data you’d need a hard-hat and a pickaxe to get to the bottom of it; this was something the Sick Sigmas specialised in. They would dig down deep, root around and return to the surface clutching a glowing ball of insight.
I was jealous of them and their imaginary coloured belts, so when the opportunity arose for ME to descend into the depths I was keen as mustard to return with MY bit of glowing insight. I wanted to find out why some of our records did not match up with someone else’s records. They should have done, and my job was to find out why they didn’t, so I planned to root around in the data until I had the answer.
As the name implies there was an awful lot of this Big Data, so much that Excel didn’t do it justice; there were Big Databases to hold the Big Data. And with Big Data comes the need for Big Queries: queries that go in and extract the data I wanted to look at, a smaller set I could work stuff out from. Sorry for the technical jargon, but it’s inevitable.
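For the non-jargon-inclined, a Big Query is just that: a question you pose to the database so it hands back only the slice worth looking at. Here is a minimal sketch in Python using SQLite; the table, columns and figures are all invented for illustration (the real database and its schema are not in this post), but the shape of the job is the same: find the records where our figures and someone else’s don’t match.

```python
import sqlite3

# Hypothetical schema: one table holding the same record ids from two sources.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER, source TEXT, value REAL)")
con.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "ours", 10.0), (1, "theirs", 12.0),
     (2, "ours", 7.0), (2, "theirs", 7.0)],
)

# The "Big Query": join our records to theirs and keep only the disagreements.
mismatches = con.execute("""
    SELECT a.id, a.value AS ours, b.value AS theirs
    FROM records a JOIN records b ON a.id = b.id
    WHERE a.source = 'ours' AND b.source = 'theirs'
      AND a.value <> b.value
""").fetchall()
print(mismatches)  # [(1, 10.0, 12.0)] — record 1 doesn't match, record 2 does
```

Scale the table up to millions of rows and swap SQLite for something industrial, and that is roughly what rooting around in a Big Database looks like.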
I had MAD SKILLZ at databases, as summarised in this picture, so I felt pretty confident the answer was in the data I pulled out.
So I looked at the data.
Suddenly BEHOLD! I found my insight! I had my nugget of the purest knowledge!
I had found things that correlated with other things!
By the power of DATA I had found the truth!
I WAS FOOLED BY RANDOMNESS.
Big Data is Big Noise. The more data you have, the more noise there is in it, as demonstrated by Nassim Taleb below. If you don’t know the difference between signal and noise, you are going to be fooled by the random correlations that exist in the data.
The more spurious links, the more noise, the less chance you’ll hit upon the tiny signal hidden inside the data.
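You don’t have to take anyone’s word for this; you can watch the noise appear yourself. The sketch below generates 50 completely independent streams of random numbers — pure noise, no signal anywhere by construction — and then counts how many pairs look “significantly” correlated. The 0.197 cut-off is the rough two-sided 5% significance threshold for 100 observations.

```python
import numpy as np

rng = np.random.default_rng(42)

n_series, n_points = 50, 100
# 50 independent series of Gaussian noise: there is NOTHING to find here.
data = rng.standard_normal((n_series, n_points))

corr = np.corrcoef(data)
iu = np.triu_indices(n_series, k=1)  # each pair once, skipping the diagonal
pairwise = corr[iu]

threshold = 0.197  # approx. 5% significance cut-off for n=100
spurious = int(np.sum(np.abs(pairwise) > threshold))
print(f"{len(pairwise)} pairs tested, {spurious} 'significant' correlations found")
```

With 50 series there are 1,225 pairs to test, so at a 5% threshold you should expect around sixty “discoveries” from data that contains no signal at all. Add more series and the pairs grow quadratically, and with them the spurious insights.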
Big Data has noise hardwired into it. It is not a bug; it is a feature.
It is to be expected, but if people don’t know about noise, they’ll certainly never find the signal. The rate at which people confuse noise with signal is what I am calling “the rate of wrongness”.
Yes, you read that right: 80% of conclusions drawn from Big Data are later found to be incorrect. Incorrect being a polite word for wrong.
So if digging around for correlations, no matter how fancily done, produces nonsense most of the time, why do it? I think this is just the latest pair of Magic Goggles. It is technology, and it is new: the worst combination for suckering us in. The data is also just lying around, so analysing it seems free, or as free as £64 million can be.
Why is this a problem? Because Governments will be making policy decisions and choices about public services using Big Data. And as the UK public sector has such a strong track record of ICT projects, it will no doubt be using private sector partners to successfully deliver Big Data projects.
Everyone knows this old graph by now, the statistically proven link between pirates and global warming. And yet NOT everyone knows about signal and noise.
I think the £64 million could be better spent.
There are 5.6 million people employed in the public sector in the UK.
If we spent £11 on each person, we could send them one copy each of these gems about understanding numbers…
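A quick back-of-envelope check, using the figures above:

```python
workers = 5_600_000       # public sector employees in the UK (figure from the post)
price_per_book = 11       # pounds per copy

total = workers * price_per_book
print(f"£{total:,}")      # £61,600,000 — comfortably under the £64 million
```

One statistics book each, with a couple of million pounds to spare.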