Doubting Data

 

By Don Varyu

April, 2021

 
 

My friend slammed the brakes on his career, made a sharp turn, returned to school and earned an advanced degree in data science. I told him that was great, but I didn’t really understand what he was studying. I knew that back in the 1950s some companies began using people called “efficiency experts.” And I knew from my own career that little black boxes connected to a few hundred television sets would separate TV’s victors from the victims, deciding which programs and personalities survive and which disappear. But all that didn’t seem so much like science as a different way of counting.

So, I turned to my old buddy Wikipedia and it told me, “data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insight from structured and unstructured data and apply knowledge and actionable insights from data across a broad range of application domains.”

Aha! So obvious! 

I translate this as finding even more ways to count things…throwing it all into a computer…and deciding what to do.

In any case, I know my friend has accelerated into the fast lane of the zeitgeist, and he’s steering directly into the future.


 
 
However, I still have some doubts. Or maybe it’s just that the science isn’t perfected yet. But if the idea is that you can quantify so many more things, and analyze them in so many more ways, that you can make much better predictions about the future, then there are results I can’t quite figure out. Let me cite three:

1) Election polling. The first Presidential election poll was conducted in just one state all the way back in 1824. It showed Andrew Jackson with a commanding lead over John Quincy Adams. Of course, the “methodology” back then was primitive, so we can’t really criticize anyone after Jackson eventually lost that year in a race decided ultimately by the House of Representatives. But here we are, nearly 200 years later, and just how far have we come? We still can’t figure out how to get this stuff right? How can that be? We have data scientists!

But they were wrong on Hillary, and they saw Biden being anointed early on election night. Granted, each of the Democrats did win the popular vote. But the pollsters didn’t come close to predicting how small those margins would be.

2) Wall Street and Gamestop. No entity relies more on data than Wall Street. So much, in fact, that some of the data scientists there fear that they'll soon be replaced entirely by computers. But for the moment, it’s still up to the analysts to mine the data and determine who’ll make billions and who’ll be left in the dust.  

Last winter’s Gamestop bubble gave them all pause. The short story goes like this. An outside investor optimistically began buying the game retailer’s shares, despite the fact that little data existed to support that optimism. Big hedge funds consequently decided there was a big opportunity and began “selling short,” betting the price was bound to fall. They saw boatloads of cash about to dock. But at that point an unruly mob of online investors began buying, and boy, did they keep buying. During the first 27 days of 2021, Gamestop’s stock price soared into the stratosphere, up about 20 times over its New Year’s Day value! Yes, some of those marauding investors wanted to get in early enough to get rich. But many also saw the chance to stick it to the hedge funds; the higher the stock price rose, the more the funds were forced to put up actual assets to cover their paper losses. And pay up they did. One fund saw half of its billions in assets disappear, and it nearly went bankrupt. And in the process, the data analysts lost a lot of swagger. No algorithm saw this coming.

3) March Madness. In March, to a lot of people, if you don’t have a “bracket”, you don’t really matter. When the NCAA holds its month-long tournament featuring the best in college basketball, a good part of the nation is held rapt—and a lot of them invest a few bucks guessing what’s going to happen. In filling out those brackets, fans are largely dependent on data developed by the media and by the NCAA itself. In each of four sub-brackets, 16 teams are ranked by the NCAA, with the top team (#1) first facing off against the worst (#16), #2 against #15, #3 against #14, etc. So those early rounds are pretty easy to pick, while the challenge builds as “better” teams move on.

So, it’s easy—except when it isn’t. The NCAA employs a private algorithm to mix up its own secret sauce of excellence. It won’t tell anyone what it is, but it should normally be able to predict winners from longshots. But it’s not working so well. 

For example, this year the strongest conference in the nation was the Midwest’s Big Ten (actually 14 teams; don’t ask). It was also said to be one of the strongest conferences for any year in history. Four of its teams were among the top eight overall seeds. Nine schools in all made the tournament.

Then…only one of the nine made it past the first weekend (two games). And it fell before making it to the Final Four.

More widely, one of every three games in the first two rounds of the tournament was an upset. A few more stumbles and we’d have been in coin-toss territory.  The NCAA’s secret data sauce had spoiled.


So, is this evidence that the entire realm of data science is a fraud? Not at all. Something else is at work here. And it derives from the words of early computer coders who declared, “garbage in, garbage out.” In other words, flawed data on the front end mean flawed results and analysis on the back end. And in each of the cases above, the flaw is the same: human beings.

Political polling is almost comically unscientific. With phone surveys, only about six percent of people respond; roughly 16 people out of 17 don’t answer. And most polls don’t include cell phone users! What kind of sample is that? In addition, a separate poll done before the 2020 election showed that 62% of Americans fear sharing their political views publicly. So if they do answer, are they telling the truth? Some organizations privately urge people to deliberately respond with lies, in order to throw the pollsters off. So, what voters really believe is only revealed on election day. This is a classic case of garbage in.

The Gamestop incident played out differently. The internet investors who inflated that bubble weren’t dealing with deep data; they were just out for a little vengeance (and yeah, profit). If people refuse to act on sound data, how can you expect accurate data-driven results?

And finally, with college basketball, the actions of 18-to-21-year-olds playing with maximum emotion in front of national TV audiences almost define the limits of guesswork—kids will be kids.


So, I’m left thinking here that the fly in the ointment may be semantics. I’m willing to accept “data analyst”, but not quite “data scientist.” Science is a realm that may describe humans in terms of anatomy or biology or even psychology.

But if data scientists propose to use their computations to ultimately predict what human beings will do, all I can say is “good luck.”

