July 20, 2023
Jill Lepore is a Harvard historian whose wide-ranging scholarship has embraced the broad sweep of American history (These Truths), Native American-colonial conflict (The Name of War), superheroes (The Secret History of Wonder Woman), and even the Beatles (introduction to Paul McCartney’s 1964: Eyes of the Storm).
In her recent New Yorker essay, “The Data Delusion,” Lepore examines what in the 1990s used to be called “knowledge management.”
“Big data” was just a subject for pulp fiction in the 1930s; Lepore cites a short story in the magazine Amazing Stories in which a millionaire hires hundreds of scholars to read up on all subjects, secretly intending to remove their brains after five years and hook them all into a cerebral network of all human knowledge that he will be able to control.
With the advent of the Internet, of course, no brain need be removed, and the idea that there is a free on-line library of all knowledge is taken for granted.
But is all knowledge truly available? I would say it is not, and Lepore seems to agree. Big Data has come to dominate one sphere of human endeavor after another, in some ways to the detriment of traditional fields of study and thought. A lot of money in her own field of history is going toward “the digitization of human knowledge.” Lepore quotes a researcher who:
…argues that “history as a data science has to prove itself in the most rigorous way possible: by making predictions about what newly available sources will reveal.” But history is not a predictive science, and if it were it wouldn’t be history.
New Yorker, March 2023
A more basic objection to this approach is that prediction based on past data (outside the realm of the hard sciences) is futile. Aside from the fact that any harvest of “big data” is doomed to be selective and therefore biased in unexpected ways, there is no particular reason to think that past “data” in a field such as history will be useful in predicting anything at all about the future. All “data” from history can show us is that a particular set of circumstances has happened to occur simultaneously in the past; the only thing we can conclude from this is that this particular “snapshot” combination of circumstances (e.g., high inflation and high unemployment, or a Republican president and a tax hike) can coexist, so some or all of them might coexist again in the future.
It’s not only history that is not a predictive science (or a “science” at all, at least not in the sense that chemistry or nuclear physics are). Political science, sociology, and even economics have never been able to predict the future, and they should not be expected to do so. Algorithms derived from past experience in these fields will fail, because of what the eighteenth- and nineteenth-century French mathematician and scientist Laplace called “our ignorance of true causes.” Hard science works because we have reduced its subject matter to an atomic level, where we have identified reliably replicable processes that can be depended upon to operate in the same way over and over.
Among the social sciences, even economics has not arrived at anything like “true causes” that correspond to the molecular-atomic level of physical science. The fact that a certain consumer or investor has acted in a certain manner repeatedly over a period of time does not in any way guarantee that she will act the same way in the future. Nor is she like all other consumers or investors. So she cannot be deemed the “atom” of economic “science,” upon which a system of prediction might be erected.
Economics, like other social sciences, deals in “intersubjective realities,” agreed-upon imaginary constructs that help with the smooth functioning of society. Money is one such “intersubjective reality”; we think it is real, but unlike a hydrogen atom, for example, a currency does not maintain an objective, unchanging value.
Finance is no more a science than economics is. The basic model for the valuation of securities that I was taught in business school, the Capital Asset Pricing Model, has been shown (in part by its very inventors) to be unreliable, and no surer theory has clearly emerged to replace it. Finance does purport to be predictive to some extent, yet its foremost exponents manifestly failed in that predictive function in 1929, briefly in 1987, again for a bit in 1998, and in 2007-8.
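(For readers who never sat through that business-school lecture, the textbook form of the model is simple enough to state – this is a sketch of the standard Sharpe–Lintner version, not anything peculiar to my old course notes:

E[R_i] = R_f + \beta_i \, (E[R_m] - R_f)

That is, the expected return on an asset is the risk-free rate plus the asset’s beta times the expected excess return of the market. Everything hinges on beta being a stable, measurable quantity – exactly the kind of stability the intersubjective world rarely supplies.)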
Any “science” that purports to predict the future values of intersubjective realities (such as those historians are concerned with – societal cohesion, international crises, political power dynamics) is not a science at all, at least not in the way physics is. And this brings me back to a distinction Lepore makes near the beginning of her piece, one that I quibble with:
[I]magine that all the world’s knowledge is stored, and organized, in …four drawers. …The drawers are labelled, from top to bottom, “Mysteries,” “Facts,” “Numbers,” and “Data.” Mysteries are things only God knows, like what happens when you’re dead. …[A] few centuries ago, during the scientific revolution, a lot of those folders were moved into the next drawer down, “Facts,” which contains files about things humans can prove by way of observation, detection, and experiment. “Numbers,” second from the bottom, holds censuses, polls, tallies, national averages—the measurement of anything that can be counted, ever since the rise of statistics, around the end of the eighteenth century. Near the floor, the drawer marked “Data” holds knowledge that humans can’t know directly but must be extracted by a computer, or even by an artificial intelligence. It used to be empty, but it started filling up about a century ago, and now it’s so jammed full it’s hard to open.
Starting from the bottom, “data” cannot be “extracted” by a computer without either humans supplying the computer with said data or a direct sensory connection from the computer to the outside world – something like the connection our human brains have, via our senses, to the outside world. I think it is fair to say that the percentage of reality currently covered by computers’ direct sensation is very small (though growing). The “data” jamming the bottom drawer is, in the main, not this type of data but, rather, things like optical character recognition scans of existing documents, records, books, and other media, plus a lot of point-of-sale economic data, medical records, and other material that does not (yet) come from “artificial intelligence,” nor from what might be called “direct computer sensation.”
But I will say that even her “data” drawer is not wholly without form or substance. No data that has ever been collected, I would venture, has completely lacked definition. All data humans have ever collected was collected only because someone was looking for something that fit certain parameters, while leaving everything else in the world out. Just as our eyes see only a narrow portion of the electromagnetic spectrum, data-collection programs collect only what has been predetermined to be appropriate for the needs of the collectors in question. So there is really no such thing as completely undifferentiated data.
Because of this, there is no clear distinction between “data” and “numbers.” Usually data already has some numerical component; often it is purely numerical. So the distinction cannot be numerical. Rather, the difference is that the “Numbers” are data that have been fed through some sort of algorithm to arrive at a higher-level picture of reality.
“Facts,” the next drawer up (which I will call “Knowledge”), would presumably be about drawing logical causal conclusions from the picture of reality that the “Numbers” gave us; and what would be left unknown would be “Mysteries.”
I think that one mistake is in thinking there can be one filing cabinet for both “hard science” phenomena and intersubjective realities. Hard-science stuff can use the filing cabinet she has described: data can be crunched into numerical arrays (“Numbers”), logical causal conclusions (“Knowledge”) can be drawn from them, and the “Mysteries” drawer can contain less and less. But the intersubjective (including history, economics, sociology, finance, politics, international affairs) can’t really be fit into this Procrustean bed.
In a way, the intersubjective demands that we boldly start with the Mysteries. No data can tell us when a stock-market crash is coming (at least not until after the fact). So we have a duty to speculate about a range of different possible causal explanations for the Mysteries, and Data and Numbers be damned. (Even in history, some of the most basic questions famously remain unanswered – What caused the First World War? Would slavery have withered away without the Civil War? What is the meaning of the Second Amendment? All proper subjects for alternative hypotheses, at least to begin with.)
Likewise, for planning and strategy purposes in the realm of the intersubjective, rather than extrapolating from past Data and Numbers and Knowledge, we must begin by imagining a variety of potential future outcomes, work our way back to the present explaining why each could plausibly happen, and plan for that full range of plausible uncertainty.
“Numbers” and “Data” – and even “Knowledge” – lull us to sleep with the assumptions that we actually understand how the intersubjective world works, and that the future will operate the same way the past has. The cure for this delusion is rigorous imagination.
It can be terrifying to realize how little we actually can assume about the intersubjective worlds in which we spend almost all of our conscious time. But imagination can make us smarter and braver. And identifying the right questions is often far more important than any “Knowledge.”
I think Jill Lepore would agree with that. (And to paraphrase a blurb I read once about Michael Lewis, I would read an 800-page Betamax instruction manual if Lepore wrote it. You should do likewise.)