Prediction and Explanation in Social Systems
For centuries, prediction has been consid- [...]

[...] terms of predictive accuracy. We believe that [...]

[...] operating characteristic (ROC) curve] on a subset of the data could easily reach the conclusion that [...]

[...] establishing consensus on the substantive problems that are to be solved. If early detection of popular content is the goal, for example, then peeking strategies are admissible, but if explanation is the goal, then they are not. Likewise, AUC is an appropriate metric when balanced classification (i.e., between classes of equal size) is a meaningful objective, whereas R² or root mean square error (RMSE) may be more appropriate when the actual cascade size is of interest. Second, where specific problems can be agreed upon, claims about prediction can be evaluated using the “common task framework” (e.g., the Netflix prize), in which competing algorithms are evaluated by independent third parties on standardized, publicly available data sets, agreed-upon performance metrics, and high-quality baselines (13). Third, in the absence of common tasks and data, researchers should transparently distinguish exploratory from confirmatory research. In exploratory analyses, researchers are free to study different tasks, fit [...]

[...] research, researchers may then choose to engage in confirmatory research, which allows them to make stronger claims. To qualify research as confirmatory, however, researchers should be required to preregister their research designs, including data preprocessing choices, model specifications, evaluation metrics, and out-of-sample predictions, in a public forum such as the Open Science Framework (https://osf.io). Although strict adherence to these guidelines may not always be possible, following them would dramatically improve the reliability and robustness of results, as well as facilitating comparisons across studies.

Limits to prediction

How predictable is human behavior? There is no single answer to this question because human behavior spans the gamut from highly regular to wildly unpredictable. At one extreme, a study of 50,000 mobile phone users (14) found that in any given hour, users were in their most-visited location 70% of the time; thus, one could [...]

[...]fect, the reason could be insufficient data and/or modeling sophistication, but it could also be that the phenomenon itself is unpredictable, and hence that predictive accuracy is subject to some fundamental limit. In other words, to the extent that outcomes in complex social systems resemble the outcome of a die roll more than the return of Halley’s Comet, the potential for accurate predictions will be correspondingly constrained.

To illustrate the potential for predictive limits, consider again the problem of predicting diffusion cascades. As with “success” in many domains [e.g., in cultural markets (8)], the distribution of outcomes resembles Fig. 2 (top) in two important respects: First, both the average and modal success is low (i.e., most tweets, books, songs, or people experience modest success), and second, the right tail is highly skewed, consistent with the observation that a small fraction of items (“viral” tweets, best-selling books, hit songs, or celebrities) are orders of magnitude more successful than average. The key question posed by [...]
[...] world, in contrast, even a “perfect” predictor would yield mediocre performance, no better than predicting that all items will experience the same (i.e., average) level of success (11). It follows, therefore, that the more that outcomes are determined by extrinsic random factors, the lower the theoretical best performance that can be attained by any model.

Aside from some special cases (14), the problem of specifying a theoretical limit to predictive accuracy for any given complex social system remains open, but it ought to be of interest both to social scientists and computer scientists. For computer scientists, if the best-known per- [...]

Fig. 2. Schematic illustration of two stylized explanations for an empirically observed distribution of success. In the observed world (top), the distribution of success is right-skewed and heavy-tailed, implying that most items experience relatively little success, whereas a tiny minority experience extraordinary success. In “skill world” (bottom left), the observed distribution is revealed to comprise many item-specific distributions sharply peaked around the expected value of some (possibly un- [...]
[Figure panels: observed world (top), P(Success) versus Success; “Skill World” (bottom left) and “Luck World” (bottom right), each P(Success|skill) versus Success.]

[...] and justifying the specific choices made during the modeling process. These requirements do not preclude exploratory studies, which remain both necessary and desirable for a variety of reasons (for example, to deepen understanding of the data, to clarify conceptual disagreements or ambiguities, or to generate hypotheses). When evaluating claims about predictive accuracy, however, preference should be given to studies that use standardized benchmarks that have been agreed upon by the field or, alternatively, to confirmatory studies that preregister their predictions. Mechanisms revealed in this manner are more likely to be replicable, and hence to qualify as “true,” than [...]
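The contrast between the two stylized worlds of Fig. 2 can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the article: the lognormal distributions, the parameter values (`SKILL_SIGMA`, `NOISE_SIGMA`), and the function names are all assumptions chosen so that both worlds produce the same right-skewed marginal distribution of success. In each world, a "perfect" predictor is taken to be one that knows the true expected success of every item, and its accuracy is scored with R².

```python
import math
import random
import statistics

# Hypothetical parameters, chosen only for illustration.
SKILL_SIGMA = 1.0   # spread of (log) skill across items
NOISE_SIGMA = 0.05  # spread of (log) item-level noise in "skill world"


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def best_possible_r2(world, n_items=20000, seed=0):
    """R^2 of a predictor that knows each item's true expected success."""
    rng = random.Random(seed)
    outcomes, predictions = [], []
    for _ in range(n_items):
        if world == "skill":
            # Success is tightly concentrated around an item-specific skill.
            skill = rng.lognormvariate(0.0, SKILL_SIGMA)
            outcome = skill * rng.lognormvariate(0.0, NOISE_SIGMA)
            expected = skill * math.exp(NOISE_SIGMA ** 2 / 2)
        elif world == "luck":
            # Every item draws from the same broad distribution, so the
            # best available prediction is the common mean.
            sigma = math.sqrt(SKILL_SIGMA ** 2 + NOISE_SIGMA ** 2)
            outcome = rng.lognormvariate(0.0, sigma)
            expected = math.exp(sigma ** 2 / 2)
        else:
            raise ValueError(f"unknown world: {world!r}")
        outcomes.append(outcome)
        predictions.append(expected)
    return r_squared(outcomes, predictions)


if __name__ == "__main__":
    for world in ("skill", "luck"):
        print(world, round(best_possible_r2(world), 3))
```

Under these assumptions, the best attainable R² is close to 1 in the skill world and close to 0 in the luck world, even though the observed distribution of success is the same in both. This is the sense in which the marginal distribution alone cannot reveal how predictable individual outcomes are.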
Science (ISSN 1095-9203) is published by the American Association for the Advancement of Science. 1200 New York Avenue NW,
Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 2017, American Association for the Advancement of Science