A single pass through the data would then take about 15 minutes at a typical 90-megabyte-per-second sustained read speed,9 shamefully underutilizing the CPU the whole time. In fact, our table of all the people in the world will fit in the memory of a single, $15,000 Dell server with 128-GB RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this big data, particularly in a world where a single research site, the LHC (Large Hadron Collider) at CERN (European Organization for Nuclear Research), is expected to produce 150,000 times as much raw data each year.10

For many commonly used applications, however, our hypothetical 6.75-billion-row dataset would in fact pose a significant challenge. I tried loading my fake 100-GB world census into a commonly used enterprise-grade database system (PostgreSQL6) running on relatively hefty hardware (an eight-core Mac Pro workstation with 20-GB RAM and two terabytes of RAID 0 disk) but had to abort the bulk load process after six hours as the database storage had already reached many times the size of the original binary dataset, and the workstation's disk was nearly full. (Part of this, of course, was a result of the unpacking of the data. The original file stored fields bit-packed rather than as distinct integer fields, but subsequent tests revealed that the database was using three to four times as much storage as would be necessary to store each field as a 32-bit integer. This sort of data inflation is typical of a traditional RDBMS and shouldn't necessarily be seen as a problem, especially to the extent that it is part of a strategy to improve performance. After all, disk space is relatively cheap.)

I was successfully able to load subsets consisting of up to 1 billion rows of just three columns:
country (eight bits, 256 possible values), age (seven bits, 128 possible values), and sex (one bit, two values). This was only 2 percent of the raw data, although it ended up consuming more than 40 GB in the DBMS. I then tested the following query, essentially the same computation as the left side of figure 1:

SELECT country,age,sex,count(*) FROM people GROUP BY country,age,sex;

This query ran in a matter of seconds on small subsets of the data, but execution time increased rapidly as the number of rows grew past 1 million (figure 2). Applied to the entire billion rows, the query took more than 24 hours, suggesting that PostgreSQL was not scaling gracefully to this big dataset, presumably because of a poor choice of algorithm for the given data and query. Invoking the DBMS's built-in EXPLAIN facility revealed the problem: while the query planner chose a reasonable hash-table-based aggregation strategy for small tables, on larger tables it switched to sorting by the grouping columns: a viable, if suboptimal, strategy given a few million rows, but a very poor one when facing a billion.
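The hash-table aggregation strategy that works well here is easy to sketch outside the database. The following Python sketch (hypothetical: it assumes the three columns are available as parallel in-memory sequences rather than inside PostgreSQL) counts every (country, age, sex) combination in one sequential pass, which is essentially what a hash-based GROUP BY plan does.

```python
from collections import Counter

def group_counts(countries, ages, sexes):
    """One sequential pass with a hash table: the moral equivalent of
    GROUP BY country, age, sex with count(*)."""
    counts = Counter()
    for key in zip(countries, ages, sexes):
        counts[key] += 1      # at most 256 * 128 * 2 = 65,536 distinct keys
    return counts

# Tiny illustrative input; the real data would be a billion rows streamed from disk.
print(group_counts([1, 1, 2], [30, 30, 45], [0, 1, 0]))
```

Because there are at most 256 * 128 * 2 = 65,536 distinct groups, the hash table stays small no matter how many rows are scanned; the total cost is a single pass over the data.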
Figure 2. PostgreSQL performance on the query SELECT country,age,sex,count(*) FROM people GROUP BY country,age,sex, plotted against the number of rows (10^3 to 10^9). Note: Curves of linear, linearithmic, and quadratic growth are shown for comparison.
PostgreSQL tracks statistics such as the minimum and maximum value of each column in a table (and I verified that it had correctly identified the ranges of all three columns), so it could have chosen a hash-table strategy with confidence. It's worth noting, however, that even had the table's statistics not been known, on a billion rows it would take far less time to do an initial scan and determine the distributions than to embark on a full-table sort.

PostgreSQL's difficulty here was in analyzing the stored data, not in storing it. The database didn't blink at loading or maintaining a database of a billion records; presumably there would have been no difficulty storing the entire 6.75-billion-row, 10-column table had I had sufficient free disk space. Here's the big truth about big data in traditional databases: it's easier to get the data in than out. Most DBMSs are designed for efficient transaction processing: adding, updating, searching for, and retrieving small amounts of information in a large database. Data is typically acquired in a transactional fashion: imagine a user logging into a retail Web site (account data is retrieved; session information is added to a log), searching for products (product data is searched for and retrieved; more session information is acquired), and making a purchase (details are inserted in an order database; user information is updated). A fair amount of data has been added effortlessly to a database that (if it belongs to a large site that has been in operation for a while) probably already constitutes big data. There is no pathology here; this story is repeated in countless ways, every second of the day, all over the world.

The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it, and naturally we want the answer in seconds or minutes! The pathologies of big data are primarily those of analysis. This may be a slightly controversial assertion, but I would argue that transaction processing and data storage are largely solved problems. Short of LHC-scale science, few enterprises generate data at such a rate that acquiring and storing it pose major challenges today. In business applications, at least, data warehousing is ordinarily regarded as the solution to the database problem (data goes in but doesn't come out). A data warehouse has been classically defined as "a copy of transaction data specifically structured for query and analysis,"4 and the general approach is commonly understood to be bulk extraction of the data from an operational database, followed by reconstitution in a different database in a form that is more suitable for analytical queries (the so-called extract, transform, load process, or sometimes extract, load, transform).

Merely saying, "We will build a data warehouse," is not sufficient when faced with a truly huge accumulation of data. How must data be structured for query and analysis, and how must analytical databases and tools be designed to handle it efficiently? Big data changes the answers to these questions, as traditional techniques such as RDBMS-based dimensional modeling and cube-based OLAP (online analytical processing) turn out to be either too slow or too limited to support asking the really interesting questions about warehoused data. To understand how to avoid the pathologies of big data, whether in the context of a data warehouse or in the physical or social sciences, we need to consider what really makes it big.
Consider logs of Web hits or retail transactions, both common sources of big data: in the real world, data is not just a big set of random numbers; it tends to exhibit predictable characteristics. For one thing, as a rule, the largest cardinalities of most datasets (specifically, the number of distinct entities about which observations are made) are small compared with the total number of observations. This is hardly surprising. Human beings are making the observations, or being observed as the case may be, and there are no more than 6.75 billion of them at the moment, which sets a rather practical upper bound. The objects about which we collect data, if they are of the human world (Web pages, stores, products, accounts, securities, countries, cities, houses, phones, IP addresses), tend to be fewer in number than the total world population. Even in scientific datasets, a practical limit on cardinalities is often set by such factors as the number of available sensors (a state-of-the-art neurophysiology dataset, for example, might reflect 512 channels of recording5) or simply the number of distinct entities that humans have been able to detect and identify (the largest astronomical catalogs, for example, include several hundred million objects8).

What makes most big data big is repeated observations over time and/or space. The Web log records millions of visits a day to a handful of pages; the cellphone database stores time and location every 15 seconds for each of a few million phones; the retailer has thousands of stores, tens of thousands of products, and millions of customers but logs billions and billions of individual transactions in a year. Scientific measurements are often made at a high time resolution (thousands of samples a second in neurophysiology, far more in particle physics) and really start to get huge when they involve two or three dimensions of space as well; fMRI neuroimaging studies can generate hundreds or even thousands of gigabytes in a single experiment. Imaging in general is the source of some of the biggest big data out there, but the problems of large image data are a topic for an article by themselves; I won't consider them further here.

The fact that most large datasets have inherent temporal or spatial dimensions, or both, is crucial to understanding one important way that big data can cause performance problems, especially when databases are involved. It would seem intuitively obvious that data with a time dimension, for example, should in most cases be stored and processed with at least a partial temporal ordering to preserve locality of reference as much as possible when data is consumed in time order. After all, most nontrivial analyses will involve at the very least an aggregation of observations over one or more contiguous time intervals. One is more likely, for example, to be looking at the purchases of a randomly selected set of customers over a particular time period than of a contiguous range of customers (however defined) at a randomly selected set of times. The point is even clearer when we consider the demands of time-series analysis and forecasting, which aggregate data in an order-dependent manner (e.g., cumulative and moving-window functions, lead and lag operators, etc.). Such analyses are necessary for answering most of the truly interesting questions about temporal data, broadly: What happened? Why did it happen? What's going to happen next?
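To make the order-dependence concrete, here is a minimal moving-window sketch in Python (the data layout is hypothetical: a stream of numeric readings already sorted by timestamp). It consumes observations strictly in time order and never revisits earlier data, which is exactly the access pattern that temporal storage ordering preserves.

```python
from collections import deque

def moving_average(readings, window=4):
    """Yield the trailing moving average of a time-ordered stream of readings,
    in one sequential pass."""
    buf = deque(maxlen=window)
    total = 0.0
    for x in readings:
        if len(buf) == buf.maxlen:
            total -= buf[0]       # drop the reading about to leave the window
        buf.append(x)
        total += x
        yield total / len(buf)

print(list(moving_average([10, 12, 11, 15, 14, 13], window=3)))
```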
The prevailing database model today, however, is the relational database, and this model explicitly ignores the ordering of rows in tables.1 Database implementations that follow this model, eschewing the idea of an inherent order on tables, will inevitably end up retrieving data in a nonsequential fashion once it grows large enough that it no longer fits in memory. As the total amount of data stored in the database grows, the problem only becomes more significant.
To achieve acceptable performance for highly order-dependent queries on truly large data, one must be willing to consider abandoning the purely relational database model for one that recognizes the concept of inherent ordering of data down to the implementation level. Fortunately, this point is slowly starting to be recognized in the analytical database sphere.

Not only in databases, but also in application programming in general, big data greatly magnifies the performance impact of suboptimal access patterns. As dataset sizes grow, it becomes increasingly important to choose algorithms that exploit the efficiency of sequential access as much as possible at all stages of processing. Aside from the obvious point that a 10:1 increase in processing time (which could easily result from a high proportion of nonsequential accesses) is far more painful when the units are hours than when they are seconds, increasing data sizes mean that data access becomes less and less efficient. The penalty for inefficient access patterns increases disproportionately as the limits of successive stages of hardware are exhausted: from processor cache to memory, memory to local disk, and (rarely nowadays!) disk to off-line storage.

On typical server hardware today, completely random memory access on a range much larger than cache size can be an order of magnitude or more slower than purely sequential access, but completely random disk access can be five orders of magnitude slower than sequential access (figure 3). Even state-of-the-art solid-state (flash) disks, although they have much lower seek latency than magnetic disks, can differ in speed by roughly four orders of magnitude between random and sequential access patterns. The results for the test shown in figure 3 are the number of four-byte integer values read per second from a 1-billion-long (4 GB) array on disk or in memory; random disk reads are for 10,000 indices chosen at random between one and 1 billion. A further point that's widely underappreciated: in modern systems, as demonstrated in the figure, random access to memory is typically slower than sequential access to disk. Note that random reads from disk are more than 150,000 times slower than sequential access; SSD improves on this ratio by less than one order of magnitude.
Note: Disk tests were carried out on a freshly booted machine (a Windows 2003 server with 64-GB RAM and eight 15,000-RPM SAS disks in a RAID 5 configuration) to eliminate the effect of operating-system disk caching; the SSD test used a latest-generation Intel high-performance SATA SSD.
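The memory half of this comparison can be roughed out in a few lines of NumPy. This is only an illustrative sketch (the array size, index count, and the absolute numbers it prints are arbitrary and will vary with hardware), not a reproduction of the compiled benchmark behind figure 3, and it does not attempt the disk tests, which require defeating the operating system's cache as the note above describes.

```python
import time
import numpy as np

N = 100_000_000                          # 100 million four-byte integers (~400 MB)
data = np.ones(N, dtype=np.int32)

t0 = time.perf_counter()
_ = data.sum()                           # sequential scan of the whole array
t1 = time.perf_counter()

idx = np.random.randint(0, N, size=10_000_000)   # 10 million random positions
t2 = time.perf_counter()
_ = data[idx].sum()                      # gather at random positions
t3 = time.perf_counter()

print(f"sequential: {N / (t1 - t0):,.0f} values/s")
print(f"random:     {len(idx) / (t3 - t2):,.0f} values/s")
```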
In a very real sense, all of the modern forms of storage improve only in degree, not in their essential nature, upon that most venerable and sequential of storage media: the tape.

The huge cost of random access has major implications for analysis of large datasets (whereas it is typically mitigated by various kinds of caching when data sizes are small). Consider, for example, joining large tables that are not both stored and sorted by the join key: say, a series of Web transactions and a list of user/account information. The transaction table has been stored in time order, both because that is the way the data was gathered and because the analysis of interest (tracking navigation paths, say) is inherently temporal. The user table, of course, has no temporal dimension. As records from the transaction table are consumed in temporal order, accesses to the joined user table will be effectively random, at great cost if the table is large and stored on disk. If sufficient memory is available to hold the user table, performance will be improved by keeping it there.
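In code, keeping the small table "there" amounts to loading it into an in-memory hash map keyed on the join column and then streaming the large table in its stored (temporal) order. A minimal sketch, with hypothetical file and column names:

```python
import csv

def stream_joined(transactions_path, users):
    """Stream transactions in stored (time) order, joining each row against an
    in-memory user table: sequential I/O on the big table, O(1) RAM lookups on
    the small one."""
    with open(transactions_path, newline="") as f:
        for row in csv.DictReader(f):        # rows arrive in timestamp order
            user = users.get(row["userid"])  # in-memory hash lookup
            if user is not None:
                yield {**row, **user}

# users: userid -> attributes, small enough to hold in RAM (hypothetical data)
users = {"u1": {"age": "34", "sex": "F", "country": "BR"}}
# for rec in stream_joined("transactions.csv", users): ...
```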
Figure 4. Denormalizing for sequential access: a 1-billion-row transaction table (transid, timestamp, page, userid) combined with a 10-million-row user table (userid, age, sex, country, and other attributes) yields a single 1-billion-row, 13-column denormalized table.
Because random access in RAM is itself expensive, and RAM is a scarce resource that may simply not be available for caching large tables, the best solution when constructing a large database for analytical purposes (e.g., in a data warehouse) may, surprisingly, be to build a fully denormalized table: that is, a table including each transaction along with all user information that is relevant to the analysis (figure 4). Denormalizing a 10-million-row, 10-column user information table onto a 1-billion-row, four-column transaction table adds substantially to the size of data that must be stored (the denormalized table is more than three times the size of the original tables combined). If data analysis is carried out in timestamp order but requires information from both tables, then eliminating random look-ups in the user table can improve performance greatly. Although this inevitably requires much more storage and, more importantly, more data to be read from disk in the course of the analysis, the advantage gained by doing all data access in sequential order is often enormous.
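The "more than three times" figure can be checked with a quick cell-count estimate (a rough sketch only; real sizes depend on column widths and storage format):

```python
# Rough size arithmetic for figure 4, counting cells rather than bytes.
transaction_cells = 1_000_000_000 * 4     # 1-billion-row, four-column transaction table
user_cells        = 10_000_000 * 10       # 10-million-row, 10-column user table
denormalized      = 1_000_000_000 * 13    # joined result: 13 columns per transaction

print(denormalized / (transaction_cells + user_cells))   # ~3.2x the originals combined
```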
HARD LIMITS
Another major challenge for data analysis is exemplified by applications with hard limits on the size of data they can handle. Here, one is dealing mostly with the end-user analytical applications that constitute the last stage in analysis. Occasionally the limits are relatively arbitrary; consider the 256-column, 65,536-row bound on worksheet size in all versions of Microsoft Excel prior to the most recent one. Such a limit might have seemed reasonable in the days when main RAM was measured in megabytes, but it was clearly obsolete by 2007 when Microsoft updated Excel to accommodate up to 16,384 columns and 1 million rows. Enough for anyone? Excel is not targeted at users crunching truly huge datasets, but the fact remains that anyone working with a 1-million-row dataset (a list of customers along with their total purchases for a large chain store, perhaps) is likely to face a 2-million-row dataset sooner or later, and Excel has placed itself out of the running for the job.

In designing applications to handle ever-increasing amounts of data, developers would do well to remember that hardware specs are improving too, and keep in mind the so-called ZOI (zero-one-infinity) rule, which states that a program should "allow none of foo, one of foo, or any number of foo."11 That is, limits should not be arbitrary; ideally, one should be able to do as much with software as the hardware platform allows.

Of course, hardware (chiefly memory and CPU limitations) is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R.7 Memory-bound applications naturally exhibit higher performance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you're out of luck. On most hardware platforms, there's a much harder limit on memory expansion than disk expansion: the motherboard has only so many slots to fill.

The problem often goes further than this, however. Like most other aspects of computer hardware, maximum memory capacities increase with time; 32 GB is no longer a rare configuration for a desktop workstation, and servers are frequently configured with far more than that. There is no guarantee, however, that a memory-bound application will be able to use all installed RAM. Even under modern 64-bit operating systems, many applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces; this often translates into a 2- or 3-GB working-set limitation.
Finally, even where a 64-bit binary is available (removing the absolute address-space limitation), all too often relics from the age of 32-bit code still pervade software, particularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 2^31 - 1, or about 2 billion, rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world census example ends up being too big for R to handle.
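The arithmetic behind that limit is simple enough to check directly:

```python
max_int32 = 2**31 - 1          # largest value of a signed 32-bit integer
rows      = 6_750_000_000      # the world-census dataset

print(max_int32)               # 2,147,483,647 -- about 2.1 billion
print(rows / max_int32)        # ~3.1: the dataset is over three times too long to index
```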
Distributing data and computation across many machines offers a way around these limits, at least for computations, whether simple aggregations or other, more complicated tasks, that minimize communication between nodes. Naturally, distributed analysis of big data comes with its own set of gotchas. One of the major problems is nonuniform distribution of work across nodes. Ideally, each node will have the same amount of independent computation to do before results are consolidated across nodes. If this is not the case, then the node with the most work will dictate how long we must wait for the results, and this will obviously be longer than we would have waited had work been distributed uniformly; in the worst case, all the work may be concentrated in a single node and we will get no benefit at all from parallelism.
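A toy model makes the cost of skew obvious: elapsed time for the parallel phase is set by the busiest node, not by the average load. The numbers below are invented purely for illustration.

```python
def elapsed_time(work_per_node):
    """Parallel elapsed time is governed by the most heavily loaded node
    (work units are arbitrary)."""
    return max(work_per_node)

balanced = [100] * 10            # 1,000 units of work spread evenly over 10 nodes
skewed   = [910] + [10] * 9      # the same 1,000 units, mostly on one node

print(elapsed_time(balanced))    # 100 -- the full 10x speedup over one machine
print(elapsed_time(skewed))      # 910 -- barely better than no parallelism at all
```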
Figure 5. Two ways to distribute 10 years of sensor data for 1,000 sites over 10 machines: partitioned by sensor site (node 1 holds every timestamped reading for sensors 1-100, node 2 for sensors 101-200, ..., node 10 for sensors 901-1,000) or partitioned by time (node 1 holds readings from all 1,000 sensors for 1999, node 2 for 2000, ..., node 10 for 2008). Each row records a timestamp, a sensor ID, and a reading.
Whether this is a problem or not will tend to be determined by how the data is distributed across nodes; unfortunately, in many cases this can come into direct conflict with the imperative to distribute data in such a way that processing at each node is local. Consider, for example, a dataset that consists of 10 years of observations collected at 15-second intervals from 1,000 sensor sites. There are more than 20 million observations for each site; and, because the typical analysis would involve time-series calculations (say, looking for unusual values relative to a moving average and standard deviation), we decide to store the data ordered by time for each sensor site (figure 5), distributed over 10 computing nodes so that each one gets all the observations for 100 sites (a total of 2 billion observations per node). Unfortunately, this means that whenever we are interested in the results of only one or a few sensors, most of our computing nodes will be totally idle. Whether the rows are clustered by sensor or by time stamp makes a big difference in the degree of parallelism with which different queries will execute. We could, of course, store the data ordered by time, one year per node, so that each sensor site is represented in each node (we would need some communication between successive nodes at the beginning of the computation to prime the time-series calculations). This approach, however, runs into difficulty if we suddenly need an intensive analysis of the past year's worth of data. Storing the data both ways would provide optimal efficiency for both kinds of analysis, but the larger the dataset, the more likely it is that two copies would be simply too much data for the available hardware resources.

Another important issue with distributed systems is reliability. Just as a four-engine airplane is more likely to experience an engine failure in a given period than a two-engine craft with equivalent engines, so too is a cluster of 10 machines 10 times more likely than a single machine to require a service call. Unfortunately, many of the components that get replicated in clusters (power supplies, disks, fans, cabling, etc.) tend to be unreliable. It is, of course, possible to make a cluster arbitrarily resistant to single-node failures, chiefly by replicating data across the nodes. Happily, there is perhaps room for some synergy here: data replicated to improve the efficiency of different kinds of analyses, as above, can also provide redundancy against the inevitable node failure. Once again, however, the larger the dataset, the more difficult it is to maintain multiple copies of the data.
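The airplane analogy is just the usual independence arithmetic: if each machine fails in a given period with probability p, the chance that a 10-node cluster needs a service call is 1 - (1 - p)^10, which is roughly 10p when p is small. A quick check with an arbitrarily assumed p:

```python
p = 0.02                                   # assumed per-node failure probability per period
nodes = 10

cluster_failure = 1 - (1 - p) ** nodes     # probability that at least one node fails
print(cluster_failure)                     # ~0.183, close to the 10 * p = 0.2 approximation
```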
A META-DEFINITION
I have tried here to provide an overview of a few of the issues that can arise when analyzing big data: the inability of many off-the-shelf packages to scale to large problems; the paramount importance of avoiding suboptimal access patterns as the bulk of processing moves down the storage hierarchy; and replication of data for storage and efficiency in distributed processing.

I have not yet answered the question I opened with: what is big data, anyway? I will take a stab at a meta-definition: big data should be defined at any point in time as data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time. In the early 1980s, it was a dataset that was so large that a robotic tape monkey was required to swap thousands of tapes in and out. In the 1990s, perhaps, it was any data that transcended the bounds of Microsoft Excel and a desktop PC, requiring serious software on Unix workstations to analyze. Nowadays, it may mean data that is too large to be placed in a relational database and analyzed with the help of a desktop statistics/visualization package; data, perhaps, whose analysis requires massively parallel software running on tens, hundreds, or even thousands of servers.
In any case, as analyses of ever-larger datasets become routine, the definition will continue to shift, but one thing will remain constant: success at the leading edge will be achieved by those developers who can look past the standard, off-the-shelf techniques and understand the true nature of the hardware resources and the full panoply of algorithms that are available to them.

REFERENCES
1. Codd, E. F. 1970. A relational model of data for large shared data banks. Communications of the ACM 13(6): 377-387.
2. IBM 3850 Mass Storage System; http://www.columbia.edu/acis/history/mss.html.
3. IBM Archives: IBM 3380 direct access storage device; http://www-03.ibm.com/ibm/history/exhibits/storage/storage_3380.html.
4. Kimball, R. 1996. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. New York: John Wiley & Sons.
5. Litke, A. M., et al. 2004. What does the eye tell the brain? Development of a system for the large-scale recording of retinal output activity. IEEE Transactions on Nuclear Science 51(4): 1434-1440.
6. PostgreSQL: The world's most advanced open source database; http://www.postgresql.org.
7. The R Project for Statistical Computing; http://www.r-project.org.
8. Sloan Digital Sky Survey; http://www.sdss.org.
9. Throughput and Interface Performance. Tom's Winter 2008 Hard Drive Guide; http://www.tomshardware.com/reviews/hdd-terabyte-1tb,2077-11.html.
10. WLCG (Worldwide LHC Computing Grid); http://lcg.web.cern.ch/LCG/public/.
11. Zero-One-Infinity Rule; http://www.catb.org/~esr/jargon/html/Z/Zero-One-Infinity-Rule.html.

ADAM JACOBS is senior software engineer at 1010data Inc., where, among other roles, he leads the continuing development of Tenbase, the company's ultra-high-performance analytical database engine. He has more than 10 years of experience with distributed processing of big datasets, starting in his earlier career as a computational neuroscientist at Weill Medical College of Cornell University (where he holds the position of Visiting Fellow) and at UCLA. He holds a Ph.D. in neuroscience from UC Berkeley and a B.A. in linguistics from Columbia University.
© 2009 ACM 1542-7730/09/0700 $10.00