Monday, 16 June 2014

Big data is only relatively 'big'


These days everyone's talking about 'big data'. What they mean is this: by capturing and analysing the trillions of interactions that computers can now record, experts can have a good old look at anything from traffic flows to the spread of diseases - either in each victim's body, or in society at large - and then work out what's gumming up any of those systems.

The problem? They're getting carried away in a numbers-heavy, but theory-light, game that's as old as state-based numbers themselves. As Radio Four's admirable More or Less programme has recently pointed out, it's not the sheer amount of data that matters, so much as what you do with it. In 1936, a massive, millions-strong opinion poll run by the Literary Digest got the outcome of the US Presidential race spectacularly wrong - because it sent out poll cards using telephone directories and the like, massively over-sampling richer Americans more likely to vote for President Roosevelt's Republican challenger, Alf Landon. Opinion polling has in fact surprised us again and again - as anyone who lived through the UK's 1992 General Election will remember - because the best way to take samples and ensure a balanced survey changes all the time.
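To see why sheer size doesn't rescue a lopsided sample, here is a rough sketch in Python. Every number in it is invented for illustration (the share of 'richer' voters, how each group votes, how heavily the big poll over-samples the rich); none of it is the real 1936 data, and the function name is just a label of convenience. It pits a huge poll drawn from a skewed list against a small one drawn at random.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Hypothetical electorate: 30% 'richer' voters who mostly oppose the incumbent,
# 70% others who mostly back him. These shares are assumptions, not history.
population = []
for _ in range(100_000):
    rich = rng.random() < 0.30
    backs_incumbent = rng.random() < (0.35 if rich else 0.70)
    population.append((rich, backs_incumbent))

def incumbent_share(people):
    """Fraction of the given people who back the incumbent."""
    return sum(1 for _, backs in people if backs) / len(people)

true_share = incumbent_share(population)

# The big poll: 50,000 replies, but richer voters are five times as likely
# to be on the mailing list (an assumed figure standing in for phone books).
weights = [5 if rich else 1 for rich, _ in population]
big_biased = incumbent_share(rng.choices(population, weights=weights, k=50_000))

# The small poll: only 1,000 replies, but everyone is equally likely to be asked.
small_random = incumbent_share(rng.sample(population, 1_000))

print(f"true share of the vote:    {true_share:.3f}")
print(f"big, skewed poll (50,000): {big_biased:.3f}")
print(f"small, fair poll (1,000):  {small_random:.3f}")
```

Under these made-up assumptions the big poll calls the race for the challenger, while the small one lands close to the real result. Adding more replies to the skewed list only makes the wrong answer more precise.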

Consider Google's flu indicators, most recently one of the data industry's Great White Hopes. The idea was that you'd look at flu's spread by registering what people searched for online - but the relationship between searches and the spread of the virus has gone way off kilter in recent years. Statisticians had no model, no theory of how one set of numbers related to the other. And the point about two moving targets is that their relationship is dynamic, and subject to change - probably change under constraint, change subject to rules, but potentially rapid alteration all the same.

We've been here so many times that a historian doesn't really know where to start. Florence Nightingale's use of graphs and charts. The visualisation breakthrough that allowed nineteenth-century doctors to trace the dirty water source of cholera outbreaks. Census data and migration figures. The last two hundred years are absolutely full of eureka moments when experts proclaimed that they'd made the big data breakthrough. Consider the late Tony Benn's words, as Minister of Technology and computer enthusiast, in the late 1960s: 'communism is going to score heavily over capitalism with the advent of computers because the Communists got centralised control early but didn't know what to do with it… [but] computers now gave them a tool for management'. Not a great prophecy as things turned out, but a precise and revealing indicator of just how long decision-makers have assumed computers' power to relate one thing to another would change the world. And when we look back, IBM's tape-to-tape data reels didn't do much to predict or rapidly tackle oil shocks or the threat of HIV. The world turned out to be more complicated than we thought - though our desperate clutching at anything that might give us certainty in a world of chaos is as instructive as Benn's hopes for big data technology all those years ago.

No doubt our data will look pitifully small in years to come. That's why it's useful to take a historical and a long view of such hopes and fears. Remember that the next time you hear someone claiming they can 'solve' anything with numbers, won't you?