Diebold Big Data
Abstract: I investigate Big Data, the phenomenon, the term, and the discipline, with
emphasis on origins of the term, in industry and academics, in computer science and
statistics/econometrics. Big Data the phenomenon continues unabated, Big Data the term is now
firmly entrenched, and Big Data the discipline is emerging.
∗
For useful communications I thank – without implicating in any way – Larry Brown, Xu Cheng, Flavio
Cunha, Susan Diebold, Dean Foster, Michael Halperin, Steve Lohr, John Mashey, Tom Nickolas, Lauris
Olson, Mallesh Pai, Marco Pospiech, Frank Schorfheide, Minchul Shin, and Mike Steele. I also thank, again
without implicating, Stephen Feinberg, Douglas Laney and Fred Shapiro, with whom I have not had the
pleasure of communicating, but who are friends of friends, and whose insights were valuable.
1 Introduction
Big Data is at the heart of modern science and business. Premier scientific groups are
intensely focused on it, as is society at large, as documented by major reports in the
business and popular press, such as Steve Lohr’s “How Big Data Became so Big” (New York
Times, August 12, 2012).1
academics were aware of the emerging phenomenon but not the term.7 There is, however,
some pre-2000 (non-academic, unpublished) activity that is spot-on. In particular, Big Data
the term, coupled with awareness of Big Data the phenomenon, was clearly percolating at
Silicon Graphics (SGI) in the mid 1990s. John Mashey, retired former Chief Scientist at SGI,
produced a 1998 SGI slide deck entitled “Big Data and the Next Wave of InfraStress,” which
demonstrates clear awareness of Big Data the phenomenon.8,9 Relatedly, SGI ran an ad that
featured the term Big Data in Black Enterprise (March 1996, p. 60), several times in
InfoWorld (starting November 17, 1997, p. 30), and several times in CIO (starting February 15,
1998, p. 5). Clearly then, Mashey and the SGI community were on to Big Data early, using
it both as a unifying theme for technical seminars and as an advertising hook.
There is also at least one more relevant pre-2000 Big Data reference in computer science.
It is subsequent to Mashey et al., but interestingly, it comes from the academic as opposed
to industry part of the computer science community, and it not only uses the term but also
demonstrates some awareness of the phenomenon. Weiss and Indurkhya (1998) note that “...
very large collections of data ... are now being compiled into centralized data warehouses,
allowing analysts to make use of powerful methods to examine data more comprehensively.
In theory, ‘Big Data’ can lead to much stronger conclusions for data-mining applications,
but in practice many difficulties arise.”
Finally, arriving on the scene later but also going beyond previous work in compelling
ways, Laney (2001) highlighted the “Three V’s” of Big Data (Volume, Variety and Velocity)
in an unpublished 2001 research note at META Group.10 Laney’s note is clearly relevant,
and it goes beyond my exclusive focus on volume, producing a significantly enriched
conceptualization of the Big Data phenomenon.11 In short, if Laney arrived slightly late, he
nevertheless brought more to the table.
ing awareness of it, instead reporting exclusively on a particular technology, the so-called high-performance
parallel interface.
7
See, for example, Massive Data Sets: Proceedings of a Workshop, Committee on Applied and Theoretical
Statistics, National Research Council (National Academies Press, 1997), http://www.nap.edu/catalog.php?record_id=5505.
8
http://static.usenix.org/event/usenix99/invited_talks/mashey.pdf.
9
Mashey notes in private communication that the deck was for a “living talk” and hence updated regularly,
so that the 1998 version is not the earliest. The earliest deck of which he is aware (and hence I am aware)
is from 1997.
10
META is now part of Gartner.
11
http://goo.gl/Bo3GS.
4 Big Data the Discipline
Big Data is now not only a phenomenon and term, but also a discipline. It leaves me with
mixed, but ultimately positive, feelings. At first pass it sounds like marketing fluff, as do
other information technology sub-disciplines with catchy names like “artificial intelligence,”
“data mining” and “machine learning.” Indeed it’s hard to resist smirking when told that
Big Data has now arrived as a new discipline and business, and that major firms are rushing
to create new executive titles like “Vice President for Big Data.”12 But as I have argued,
the phenomenon behind the term is very real, so it may be natural and desirable for a
corresponding new discipline to emerge, whatever its executive titles.
It’s not obvious, however, that a new discipline is required, or that Big Data is a new
discipline. Skeptics will argue that traditional disciplines like computer science, statistics
and x-metrics are perfectly capable of confronting the new phenomenon, so that Big Data
is not a new discipline, but rather just a box drawn around some traditional disciplines.
But it’s hard not to notice that the whole of the emerging Big Data discipline seems greater
than the sum of its parts. That is, by drawing on perspectives from a variety of traditional
disciplines, Big Data is not merely taking us to bigger traditional places. Rather, it’s taking
us to very new places, unimaginable only a short time ago, ranging from cloud computing
and associated massively-parallel algorithms, to methods for controlling false-discovery rates
when testing millions of hypotheses, with much in between. Indeed one could argue that,
in a landscape littered with failed attempts at interdisciplinary collaboration, Big Data is
emerging as a major interdisciplinary triumph.
5 Conclusion
The term “Big Data,” which spans computer science and statistics/econometrics, probably
originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid 1990s,
in which John Mashey figured prominently. The first significant academic references are
arguably Weiss and Indurkhya (1998) in computer science and Diebold (2000) in statistics/econometrics.
An unpublished 2001 research note by Douglas Laney at META Group (now part of Gartner) enriched
the concept significantly. Big Data the phenomenon continues unabated, and Big Data the
discipline is emerging.
12
Seriously. Lohr reports the title “Vice President for Big Data” in his earlier-mentioned Times piece, at
http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html.
References
Andersen, T.G., T. Bollerslev, P.F. Christoffersen, and F.X. Diebold (2013), “Financial Risk
Measurement for Financial Risk Management,” In M. Harris, G. Constantinides and R.
Stulz (eds.), Handbook of the Economics of Finance, Volume 2, Part B, Elsevier, 1127-1220.
Diebold, F.X. (2000), “Big Data Dynamic Factor Models for Macroeconomic Measurement
and Forecasting,” Discussion Read to the Eighth World Congress of the Econometric
Society, Seattle, August.
http://www.ssc.upenn.edu/~fdiebold/papers/paper40/temp-wc.PDF.
Diebold, F.X. (2003), “Big Data Dynamic Factor Models for Macroeconomic Measurement
and Forecasting: A Discussion of the Papers by Reichlin and Watson,” In M. Dewatripont,
L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics:
Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge
University Press, 115-122.
Laney, D. (2001), “3-D Data Management: Controlling Data Volume, Velocity and Variety,”
META Group Research Note, February 6.
http://goo.gl/Bo3GS.
Reichlin, L. (2003), “Factor Models in Large Cross Sections of Time Series,” In M. Dewatripont,
L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics:
Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge
University Press, 47-86.
Tilly, C. (1984), “The Old New Social History and the New Old Social History,” Review
(Fernand Braudel Center), 7, 363–406.
Weiss, S.M. and N. Indurkhya (1998), Predictive Data Mining: A Practical Guide, Morgan
Kaufmann Publishers, Inc.