
ESSAY

Prediction and explanation in social systems

Jake M. Hofman,* Amit Sharma,* Duncan J. Watts*

Microsoft Research, 641 Avenue of the Americas, 7th Floor, New York, NY 10003, USA.
*Corresponding author. Email: jmh@microsoft.com (J.M.H.); amshar@microsoft.com (A.S.); duncan@microsoft.com (D.J.W.)

Historically, social scientists have sought out explanations of human and social phenomena that provide interpretable causal mechanisms, while often ignoring their predictive accuracy. We argue that the increasingly computational nature of social science is beginning to reverse this traditional bias against prediction; however, it has also highlighted three important issues that require resolution. First, current practices for evaluating predictions must be better standardized. Second, theoretical limits to predictive accuracy in complex social systems must be better characterized, thereby setting expectations for what can be predicted or explained. Third, predictive accuracy and interpretability must be recognized as complements, not substitutes, when evaluating explanations. Resolving these three issues will lead to better, more replicable, and more useful social science.

For centuries, prediction has been considered an indispensable element of the scientific method. Theories are evaluated on the basis of their ability to make falsifiable predictions about future observations—observations that come either from the world at large or from experiments designed specifically to test the theory. Historically, this process of prediction-driven explanation has proven uncontroversial in the physical sciences, especially in cases where theories make relatively unambiguous predictions and data are plentiful. Social scientists, in contrast, have generally deemphasized the importance of prediction relative to explanation, which is often understood to mean the identification of interpretable causal mechanisms. In part, this emphasis may reflect the intrinsic complexity of human social systems and the relative paucity of available data. But it also partly reflects the widespread adoption within the social and behavioral sciences of a particular style of thinking that emphasizes unbiased estimation of model parameters over predictive accuracy (1). Rather than asking whether a given theory can predict some outcome of interest, the accepted practice in social science instead asks whether a particular coefficient in an idealized model is statistically significant and in the direction predicted by the theory.

Recently, this practice has come under increasing criticism, in large part out of concern that an unthinking "search for statistical significance" (2) has resulted in the proliferation of nonreplicable findings (3, 4). Concurrently, growing interest among computational scientists in traditionally social scientific topics, such as the evolution of social networks (5), the diffusion of information (6, 7), and the generation of inequality (8), along with massive increases in the volume and type of social data available to researchers (9), has raised awareness of methods from machine learning that evaluate performance largely in terms of predictive accuracy. We believe that the confluence of these two trends presents an opportune moment to revisit the historical separation of explanation and prediction in the social sciences, with productive lessons for both points of view. On the one hand, social scientists could benefit by paying more attention to predictive accuracy as a measure of explanatory power; on the other hand, computer scientists could benefit by paying more attention to the substantive relevance of their predictions, rather than to predictive accuracy alone.
Standards for prediction

Predictive modeling has generated enormous progress in artificial intelligence (AI) applications (e.g., speech recognition, language translation, and driverless vehicles), in part because AI researchers have converged on simple-to-understand quantitative metrics that can be compared meaningfully across studies and over time. In light of this history, it is perhaps surprising that applications of similar methods in the social sciences often fail to adhere to common reporting and evaluation standards, making progress impossible to assess. The reason for this incoherence is that prediction results depend on many of the same "researcher degrees of freedom" that lead to false positives in traditional hypothesis testing (3). For example, consider the question of predicting the size of online diffusion "cascades" to understand how information spreads through social networks, a topic of considerable recent interest (6, 7, 10, 11). Although seemingly unambiguous, this question can be answered only after it has first been translated into a specific computational task, which in turn requires the researcher to make a series of subjective choices, including the selection of the task, data set, model, and performance metric. Depending on which specific set of choices the researcher makes, what appear to be very different answers can be obtained.

To illustrate how seemingly innocuous design choices can affect stated results, we reanalyzed data from (11) comprising all posts made to Twitter during the month of February 2015 that contained links to the top 100 most popular websites, as measured by unique visitors. In addition to holding the data set fixed, for simplicity, we also restricted our analysis to a single choice of model, reported in (11), that predicts cascade size as a linear function of the average past performance of the "seed" individual (i.e., the one who initiated the cascade). Even with the data source and model held fixed, Fig. 1 (top) shows that many potential research designs remain: Each node represents a decision that a researcher must make, and each distinct path from the root of the tree to a terminal leaf node represents a potential study (12). We emphasize that none of these designs is intrinsically wrong. Nevertheless, Fig. 1 (bottom) shows that different researchers—each making individually defensible choices—can arrive at qualitatively different answers to the same question. For example, a researcher who chose to measure the AUC [the area under the receiver operating characteristic (ROC) curve] on a subset of the data could easily reach the conclusion that their predictions were "extremely accurate" [e.g., (10)], whereas a different researcher who decided to measure the coefficient of determination (R2) on the whole data set would conclude that 60% of variance could not be explained [e.g., (6)].
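To make this divergence concrete, consider the following minimal sketch (our illustration, not the authors' analysis; the single-feature model, coefficients, and top-1% threshold are assumptions chosen for convenience). The same predictions look "extremely accurate" when scored as a classifier with AUC, yet leave roughly 60% of variance unexplained when scored as a regression with R2:

```python
import numpy as np
from sklearn.metrics import r2_score, roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Toy stand-in for the cascade task: log cascade size depends weakly on one
# observable feature (e.g., the seed's average past performance).
feature = rng.normal(size=n)
log_size = 0.8 * feature + rng.normal(size=n)

prediction = 0.8 * feature  # the true model, i.e., the best possible predictor

# Design 1: regression on the full data set, scored with R^2 (~0.39 here,
# so ~60% of variance is "unexplained").
print("R^2:", round(r2_score(log_size, prediction), 2))

# Design 2: classification ("is this cascade in the top 1%?"), scored with
# AUC (~0.9 here, so the very same predictions look highly accurate).
label = log_size > np.quantile(log_size, 0.99)
print("AUC:", round(roc_auc_score(label, prediction), 2))
```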
Reality is even more complicated than our simple example would suggest, for at least three reasons. First, researchers typically start with different data sets and choose among potentially many different model classes; thus, the schematic in Fig. 1 is only a portion of the full design space. Second, researchers often reuse the same data set to assess the out-of-sample performance of many candidate models before choosing one. The resulting process, sometimes called "human-in-the-loop overfitting," can produce gross overestimates of predictive performance that fail to generalize to new data sets. Third, in addition to arriving at different answers to the same question, researchers may choose similar-sounding prediction tasks that correspond to different substantive questions. For example, a popular variant of the task described above is to observe the progression of a cascade for some time before making a prediction about its eventual size (7). "Peeking" strategies of this sort generally yield much better predictive performance than ex ante predictions, which use only information available before a given cascade begins. Importantly, however, they achieve this gain by, in effect, changing the objective from explanation (i.e., which features account for success?) to early detection (i.e., which cascades will continue to spread?). Using the same language ("predicting cascades") to describe both exercises therefore creates confusion about what has been accomplished, as well as how to compare results across studies.
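The gap between the two exercises is easy to reproduce on synthetic data. In this minimal sketch (ours; the lognormal cascade model and the 30 to 70% observation window are illustrative assumptions), the same regression looks mediocre ex ante and nearly perfect with peeking, even though only the peeking variant has seen part of the outcome it is "predicting":

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical cascade data: the seed's past performance is known ex ante;
# the size reached after some observation window is known only by peeking.
seed_past = rng.lognormal(0.0, 1.0, n)               # available before the cascade
final_size = seed_past * rng.lognormal(0.0, 1.5, n)  # heavy-tailed outcome
early_size = final_size * rng.uniform(0.3, 0.7, n)   # a peek partway through

y = np.log(final_size)
X_ex_ante = np.log(seed_past).reshape(-1, 1)
X_peeking = np.log(early_size).reshape(-1, 1)

# Both models "predict cascade size", but they answer different questions.
print("ex ante R^2:", round(LinearRegression().fit(X_ex_ante, y).score(X_ex_ante, y), 2))  # ~0.3
print("peeking R^2:", round(LinearRegression().fit(X_peeking, y).score(X_peeking, y), 2))  # ~1.0
```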
Resolving these issues is nontrivial; nevertheless, some useful lessons can be learned from the past three decades of progress in the AI applications of machine learning, as well as from recent efforts to improve the replicability of scientific claims in behavioral science (3, 4, 12). First, comparability of results would be improved by establishing consensus on the substantive problems that are to be solved.

If early detection of popular content is the goal, for example, then peeking strategies are admissible, but if explanation is the goal, then they are not. Likewise, AUC is an appropriate metric when balanced classification (i.e., between classes of equal size) is a meaningful objective, whereas R2 or root mean square error (RMSE) may be more appropriate when the actual cascade size is of interest. Second, where specific problems can be agreed upon, claims about prediction can be evaluated using the "common task framework" (e.g., the Netflix prize), in which competing algorithms are evaluated by independent third parties on standardized, publicly available data sets, agreed-upon performance metrics, and high-quality baselines (13). Third, in the absence of common tasks and data, researchers should transparently distinguish exploratory from confirmatory research. In exploratory analyses, researchers are free to study different tasks, fit multiple models, try various exclusion rules, and test on multiple performance metrics. When reporting their findings, however, they should transparently declare their full sequence of design choices to avoid creating a false impression of having confirmed a hypothesis rather than simply having generated one (3). Relatedly, they should report performance in terms of multiple metrics to avoid creating a false appearance of accuracy. In cases where data are abundant, moreover, researchers can increase the validity of exploratory research by using a three-way split of their data into a training set used to fit models, a validation set used to select any free parameters that control model capacity and to compare different models, and a test set that is used only once to quote final performance. Last, having generated a firm hypothesis through exploratory research, researchers may then choose to engage in confirmatory research, which allows them to make stronger claims. To qualify research as confirmatory, however, researchers should be required to preregister their research designs, including data preprocessing choices, model specifications, evaluation metrics, and out-of-sample predictions, in a public forum such as the Open Science Framework (https://osf.io). Although strict adherence to these guidelines may not always be possible, following them would dramatically improve the reliability and robustness of results, as well as facilitate comparisons across studies.
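A minimal sketch of the three-way split just described, using scikit-learn (the synthetic data set, the ridge-regression candidates, and the 60/20/20 proportions are our assumptions, not choices made in the essay):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data; in a real study, X and y come from the data set under analysis.
X, y = make_regression(n_samples=2_000, n_features=20, noise=10.0, random_state=0)

# Split once, up front: 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Fit every candidate on the training set and select on the validation set.
candidates = [Ridge(alpha=a) for a in (0.1, 1.0, 10.0, 100.0)]
best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))

# Touch the test set exactly once, to quote final performance. Re-tuning after
# seeing this number reintroduces human-in-the-loop overfitting.
print("test R^2:", round(best.score(X_test, y_test), 3))
```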
Fig. 1. A single question may correspond to many research designs, each yielding different answers. (Top) A depiction of the many choices involved in translating the problem of understanding diffusion cascades into a concrete prediction task, including the choice of data source, task (regression or classification), evaluation metric (R2, RMSE, or MAE for regression; Accuracy, AUC, or F1 for classification), and data preprocessing. The preprocessing choices shown at the terminal nodes refer to the threshold used to filter observations for regression or define successful outcomes for classification. Cascade sizes were log-transformed for all of the regression tasks. (Bottom) The results of each prediction task, for each metric, as a function of the threshold used in each task. The lower limit of each vertical axis gives the worst possible performance on each metric, and the top gives the best. Dashed lines represent the performance of a naive predictor (always forecasting the global mean for regression or the positive class for classification), and solid lines show the performance of the fitted model. R2, coefficient of determination; AUC, area under the ROC curve; RMSE, root mean squared error; MAE, mean absolute error; F1 score, the harmonic mean of precision and recall.

Limits to prediction

How predictable is human behavior? There is no single answer to this question because human behavior spans the gamut from highly regular to wildly unpredictable. At one extreme, a study of 50,000 mobile phone users (14) found that in any given hour, users were in their most-visited location 70% of the time; thus, one could achieve 70% accuracy on average with the simple heuristic "Jane will be at her usual spot today." At the other extreme, so-called "black swan" events (e.g., the impact of the Web or the 2008 financial crisis) are thought to be intrinsically impossible to predict in any meaningful sense (15). Last, for outcomes of intermediate predictability, such as presidential elections, stock market movements, and feature film revenues, the difficulty of prediction can vary tremendously with the details of the task (e.g., predicting box office revenues a week versus a year in advance). To evaluate the accuracy of any particular predictive model, therefore, we require not only the relevant baseline comparison—that is, the best known performance—but also an understanding of the best possible performance. The latter is important because when predictions are imperfect, the reason could be insufficient data and/or modeling sophistication, but it could also be that the phenomenon itself is unpredictable, and hence that predictive accuracy is subject to some fundamental limit. In other words, to the extent that outcomes in complex social systems resemble the outcome of a die roll more than the return of Halley's Comet, the potential for accurate predictions will be correspondingly constrained.

To illustrate the potential for predictive limits, consider again the problem of predicting diffusion cascades. As with "success" in many domains [e.g., in cultural markets (8)], the distribution of outcomes resembles Fig. 2 (top) in two important respects: First, both the average and modal success are low (i.e., most tweets, books, songs, or people experience modest success), and second, the right tail is highly skewed, consistent with the observation that a small fraction of items ("viral" tweets, best-selling books, hit songs, or celebrities) are orders of magnitude more successful than average. The key question posed by this picture, both for prediction and for explanation, is what determines the position of a given item in this highly unequal distribution. One extreme stylized explanation, which we label "skill world" (Fig. 2, bottom left), holds that success is almost entirely explained by some property that is intrinsic, albeit possibly hard to measure, which can be interpreted loosely as skill, quality, or fitness. At the opposite extreme, what we call "luck world" (Fig. 2, bottom right) contends that skill has very little impact on eventual success, which is instead driven almost entirely by other factors, such as luck, that are external to the item in question and effectively random in nature. Where exactly the real world lies in between these two extremes has important consequences for prediction. In skill world, for example, if one could hypothetically measure skill, then in principle it would be possible to predict success with almost perfect precision.

In luck world, in contrast, even a "perfect" predictor would yield mediocre performance, no better than predicting that all items will experience the same (i.e., average) level of success (11). It follows, therefore, that the more that outcomes are determined by extrinsic random factors, the lower the theoretical best performance that can be attained by any model.
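This limit is easy to see in simulation. The sketch below (ours, with Gaussian outcomes standing in for the heavy-tailed distributions of Fig. 2) gives an oracle predictor each item's true skill and varies the fraction of outcome variance that skill accounts for; even this hypothetically perfect predictor cannot exceed that fraction:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle_r2(skill_share, n=100_000):
    """R^2 attained by a predictor that knows each item's true skill, when
    `skill_share` of outcome variance is intrinsic (skill world -> 1.0)
    and the remainder is extrinsic luck (luck world -> 0.0)."""
    skill = rng.normal(0.0, np.sqrt(skill_share), n)
    luck = rng.normal(0.0, np.sqrt(1.0 - skill_share), n)
    success = skill + luck
    prediction = skill  # best possible prediction: E[success | skill]
    residual = np.sum((success - prediction) ** 2)
    total = np.sum((success - success.mean()) ** 2)
    return 1.0 - residual / total

for share in (0.9, 0.5, 0.1):  # from skill world toward luck world
    print(f"skill share {share:.0%} -> oracle R^2 ~ {oracle_r2(share):.2f}")
```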
Aside from some special cases (14), the problem of specifying a theoretical limit to predictive accuracy for any given complex social system remains open, but it ought to be of interest both to social scientists and computer scientists. For computer scientists, if the best-known performance is well below what is theoretically possible, efforts to find better model classes, construct more informative features, or collect more or better data might be justified. If, however, the best-known model is already close to the theoretical limit, scientific effort might be better allocated to other tasks, such as devising interventions that do not rely on accurate predictions (16). For social scientists, benchmarking of this sort could also be used to evaluate causal explanations. For example, to the extent that a hypothesized mechanism accounts for less observed variance than the theoretical limit, it is likely that other mechanisms remain to be identified. Conversely, where the theoretical limit is low (i.e., where outcomes are intrinsically unpredictable), expectations for what can be explained should be reduced accordingly. For example, although success is likely determined to some extent by intrinsic factors such as quality or skill, it also likely depends to some (potentially large) extent on extrinsic factors such as luck and cumulative advantage (8). Depending on the balance between these two sets of factors, any explanation for why a particular person, product, or idea succeeded when other similar entities did not will be limited, not because we lack the appropriate model of success, but rather because success itself is in part random (17).

Fig. 2. Schematic illustration of two stylized explanations for an empirically observed distribution of success. In the observed world (top), the distribution of success is right-skewed and heavy-tailed, implying that most items experience relatively little success, whereas a tiny minority experience extraordinary success. In "skill world" (bottom left), the observed distribution is revealed to comprise many item-specific distributions sharply peaked around the expected value of some (possibly unobservable) measure of skill; thus, conditioning correctly on skill accounts for almost all observed variance. In contrast, in "luck world" (bottom right), almost all the observed variance is attributable to extrinsic random factors; thus, conditioning on even a hypothetically perfect measure of skill would explain very little variance. [Adapted from (11)]

Prediction versus interpretation

Conversations about the place of prediction in social science almost always elicit the objection that an emphasis on predictive accuracy leads to complex, uninterpretable models that generalize poorly and offer little insight. There is merit to this objection: The best-performing models are often complex, and, as we have already emphasized, an unthinking focus on predictive accuracy can lead to spurious claims. However, it does not follow that predictive accuracy is necessarily at odds with insight into causal mechanisms, for three reasons. First, simple models do not necessarily generalize better than complex models (1, 18). Rather, generalization error is a property of the entire modeling process, including researcher degrees of freedom (3) and algorithmic constraints on the model search (18). Generalization error should therefore be minimized directly, as illustrated by ensemble methods such as bagging and boosting (19), which often succeed in lowering generalization error despite increasing model complexity. Second, there is increasing evidence from the machine learning literature that the trade-off between predictive accuracy and interpretability may be less severe than once thought. Specifically, by optimizing first for generalization error and then searching for simpler and more interpretable versions of the resulting model, it may be possible to achieve close to optimal prediction (subject to the limits discussed above) while also gaining insight into the relevant mechanisms (20). Third, it is important to clarify that "understanding" is often used to refer both to the subjective feeling of having made sense of something (i.e., interpreted it) and also to having successfully accounted for observed empirical regularities (i.e., predicted it). Although these two notions of understanding are frequently conflated, neither one necessarily implies the other: It is both possible to make sense of something ex post that cannot be predicted ex ante and to make successful predictions that are not interpretable (17). Moreover, although subjective preferences may differ, there is no scientific basis for privileging either form of understanding over the other (18).
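The first of these reasons can be illustrated with a short scikit-learn sketch (our example; the synthetic task and the tree-based models are assumptions): a bagged ensemble of 200 trees is far more complex than a single tree, yet it typically generalizes better, because averaging many overfit trees reduces variance.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression benchmark; any tabular data set would do.
X, y = make_friedman1(n_samples=2_000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
bagged = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=200, random_state=0
).fit(X_train, y_train)

# The ensemble has ~200x the parameters of the single tree, yet usually
# shows lower generalization error on held-out data.
print("single tree R^2:", round(single.score(X_test, y_test), 2))
print("bagged trees R^2:", round(bagged.score(X_test, y_test), 2))
```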
None of this is to suggest that complex predictive modeling should supplant traditional approaches to social science. Rather, we advocate a hybrid approach in which researchers start with a question of substantive interest and design the prediction exercise to address that question, clearly stating and justifying the specific choices made during the modeling process. These requirements do not preclude exploratory studies, which remain both necessary and desirable for a variety of reasons—for example, to deepen understanding of the data, to clarify conceptual disagreements or ambiguities, or to generate hypotheses. When evaluating claims about predictive accuracy, however, preference should be given to studies that use standardized benchmarks that have been agreed upon by the field or, alternatively, to confirmatory studies that preregister their predictions. Mechanisms revealed in this manner are more likely to be replicable, and hence to qualify as "true," than mechanisms that are proposed solely on the basis of exploratory analysis and interpretive plausibility. Properly understood, in other words, prediction and explanation should be viewed as complements, not substitutes, in the pursuit of social scientific knowledge.

REFERENCES AND NOTES

1. L. Breiman, Stat. Sci. 16, 199–231 (2001).
2. G. Gigerenzer, J. Socio-Econ. 33, 587–606 (2004).
3. J. P. Simmons, L. D. Nelson, U. Simonsohn, Psychol. Sci. 22, 1359–1366 (2011).
4. Open Science Collaboration, Science 349, aac4716 (2015).
5. D. Liben‐Nowell, J. Kleinberg, J. Am. Soc. Inf. Sci. Technol. 58, 1019–1031 (2007).
6. E. Bakshy, J. M. Hofman, W. A. Mason, D. J. Watts, "Everyone's an influencer: Quantifying influence on Twitter," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (ACM, 2011), pp. 65–74.
7. J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, J. Leskovec, "Can cascades be predicted?" in Proceedings of the 23rd International Conference on World Wide Web (ACM, 2014), pp. 925–936.
8. M. J. Salganik, P. S. Dodds, D. J. Watts, Science 311, 854–856 (2006).
9. D. Lazer et al., Science 323, 721–723 (2009).
10. M. Jenders, G. Kasneci, F. Naumann, "Analyzing and predicting viral tweets," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), pp. 657–664.
11. T. Martin, J. M. Hofman, A. Sharma, A. Anderson, D. J. Watts, "Exploring limits to prediction in complex social systems," in Proceedings of the 25th International Conference on World Wide Web (International World Wide Web Conference Committee, 2016), pp. 683–694.
12. A. Gelman, E. Loken, Am. Sci. 102, 460 (2014).
13. M. Liberman, Comput. Linguist. 36, 595–599 (2010).
14. C. Song, Z. Qu, N. Blumm, A.-L. Barabási, Science 327, 1018–1021 (2010).
15. N. N. Taleb, The Black Swan: The Impact of the Highly Improbable (Random House, 2007).
16. D. J. Watts, Everything Is Obvious: Once You Know the Answer (Crown Business, 2011).
17. D. J. Watts, Am. J. Sociol. 120, 313–351 (2014).
18. P. Domingos, Data Min. Knowl. Discov. 3, 409–425 (1999).
19. R. E. Schapire, "The boosting approach to machine learning: An overview," in Nonlinear Estimation and Classification (Springer, 2003), pp. 149–171.
20. M. T. Ribeiro, S. Singh, C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2016).

10.1126/science.aal3856

