Data Science Research at Stanford
2017–2018
Recently there has been a paradigm shift in the way data is used. Today researchers are
mining data for patterns and trends that lead to new hypotheses. This shift is caused by
the huge volumes of data available from web query logs, social media posts and blogs,
satellites, sensors, and medical devices.
Data-centered research faces many challenges. Current data management and analysis
techniques do not scale to the huge volumes of data that we expect in the future.
New analysis techniques that use machine learning and data mining require careful
tuning and expert direction. In order to be effective, data analysis must be combined
with knowledge from domain experts. Future breakthroughs will often require intimate
and combined knowledge of algorithms, data management, the domain data, and the
intended applications.
The Stanford Data Science Initiative (SDSI) consists of data science research, shared data and computing infrastructure, shared
tools and techniques, industrial links, and education. SDSI has strong ties to groups across
Stanford University such as medicine, computational social science, biology, energy,
and theory.
SDSI’s emphasis is on research for accelerating data science, discovery, and application
across the university and throughout business and society. As this brochure shows, SDSI
has been funding research and enabling collaboration across all seven schools at Stanford
and with industry in medicine, climate change, social science, energy, and more. Every
company and industry on the planet is affected by the data science and artificial intelligence
revolution. Smart companies are exploiting this revolution as an opportunity to improve
competitive advantage and increase market share, revenue, and profitability. SDSI provides
visibility into emerging technology, accurate assessments of current capability, and
impactful insights in many specific domains.
There are many pressing questions about the use of data science in civil society. Challenges
include algorithmic bias and fairness, the interpretability and accountability of decisions
made by autonomous systems, security and privacy of data, decisions made using shared
data, and the impact of data science on law, transportation, markets, and national defense.
Autonomous vehicles, robots, intelligent agents, and other forms of automation will cause
job loss and a growing need for education. Education is being transformed by massive online
courses and automatic tutoring. SDSI strives to work with industry and academia to ensure
that we are asking the right questions and developing the most impactful innovations.
Jure Leskovec and Steve Eglash
The initial research plan is built around three interrelated levels of analysis: individual, group, and
society. At each level, we are investigating the interplay between static and dynamic properties,
and paying special attention to the ethical and economic issues that arise when confronting major
scientific challenges like this one. Our ultimate goal is to identify ways in which scientists, engineers,
community builders, and community leaders can contribute to the development of more
productive, vibrant, and informed teams, online and offline communities, and societies.
The goal of this project is to develop data science tools and statistical models that bring networks
and language together in order to make more and better predictions about both. Our focus is
on joint models of language and network structure. This brings natural language processing and
social network analysis together to provide a detailed picture not only of what is being said in a
community, but also who is saying it, how the information is being transmitted through the
network, how that transmission affects network structure, and, coming full circle, how those
evolving structures affect linguistic expression. We plan to develop statistical models using diverse
data sets, including not only online social networks (Twitter, Reddit, Facebook), but also hyperlink
networks of news outlets (using massive corpora we collect on an ongoing basis) and networks of
political groups, labs, and corporations.
The high-level, long-term goal is to research how to use the Internet of Things to collect data on
human behavior in a manner that preserves privacy but provides sufficient information to allow
interventions which modify that behavior. We are exploring this research question in the context
of water conservation at Stanford: how can smart water fixtures collect data on how students use
water, such that dormitories can make interventions to reduce water use, while keeping detailed
water use data private?
Towards this end, we have deployed a water use sensing network in Stanford dormitories. Our pilot
deployment in the winter of 2017 showed several interesting results, such as that the average and
median shower lengths for men and women are the same. More importantly, using this network
we have been able to determine that placards placed within the showers suggesting that students
use less water are correlated with 10% shorter showers. Furthermore, while the average shower length is
8.8 minutes, there is an extremely long tail, with 20% of showers being longer than 15 minutes.
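As an illustrative sketch only (the deployment's data and analysis pipeline are not public), summary
statistics of the kind reported above can be computed from a simple log of per-shower durations; the
numbers below are synthetic and assume the numpy package.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log of shower durations in minutes; the real sensor data is not public.
durations = rng.lognormal(mean=2.0, sigma=0.5, size=1000)

mean_len = durations.mean()
median_len = np.median(durations)
share_over_15 = (durations > 15).mean()

print(f"mean: {mean_len:.1f} min, median: {median_len:.1f} min, "
      f"showers over 15 min: {share_over_15:.0%}")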
This project aims to improve in-season predictions of yields for major crops in the
United States, as well as a related goal of mapping soil properties across major
agricultural states, and mapping crop locations and crop types around the world.
The project uses a combination of graphical models, approximate Bayesian
computation, and crop simulation models to make predictions based on weather and
satellite data.
This work has been published at AAAI and has been recognized with a best student paper
award and an award from the World Bank Big Data Challenge competition.
cs.stanford.edu/~ermon/group/website/papers/jiaxuan_AAAI17.pdf
www.worldbank.org/en/news/feature/2017/03/27/and-the-winners-of-the-big-data-innovation-challenge-are#
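For illustration, the sketch below shows rejection-based approximate Bayesian computation (ABC),
one of the techniques named above, calibrating a toy "crop simulator." The simulator, prior, and
tolerance are hypothetical stand-ins, not the project's actual models; numpy is assumed.

import numpy as np

rng = np.random.default_rng(1)

def crop_simulator(weather, sensitivity):
    """Toy simulator: yield responds linearly to weather plus noise (illustration only)."""
    return sensitivity * weather + rng.normal(0.0, 0.5, size=weather.shape)

weather = rng.normal(25.0, 3.0, size=50)           # e.g., seasonal temperatures
observed_yield = crop_simulator(weather, 0.8)       # pretend these are the observed yields

accepted = []
for _ in range(20000):
    theta = rng.uniform(0.0, 2.0)                   # prior over the unknown parameter
    simulated = crop_simulator(weather, theta)
    # Accept the draw if the simulated yields are close to the observed yields.
    if np.mean((simulated - observed_yield) ** 2) < 1.0:
        accepted.append(theta)

print(f"approximate posterior mean of sensitivity: {np.mean(accepted):.2f} "
      f"({len(accepted)} accepted draws)")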
Mendelian diseases are caused by single-gene mutations. In aggregate, they affect 3% (~250M) of the world’s
population. The diagnosis of thousands of Mendelian disorders has been radically transformed by genome
sequencing. The potential to change so many lives for the better is held back by the associated human
labor costs. Genome sequencing itself is a simple, fast procedure costing hundreds of dollars. The mostly
manual process of determining which, if any, of a patient’s sequenced variants is responsible for their
phenotypes, against an exploding body of literature, makes genetic diagnosis 10X more expensive,
unsustainably slow, and incompletely reproducible.
Our project, a unique collaboration between Stanford’s Computer Science department and the Medical
Genetics Division of Stanford’s children’s hospital, aims to develop and deploy a first-of-its-kind computer
system to greatly accelerate the clinical diagnosis workflow and to derive novel disease-gene hypotheses
from it. This effort will produce a proof-of-principle workflow and broadly deployable tools that significantly
improve diagnostic throughput and greatly reduce both the time expert clinicians spend reaching a diagnosis
and the associated costs, thereby making genomic testing accessible, reproducible, and ubiquitous.
A first flagship analysis web portal for the project has launched at:
https://AMELIE.stanford.edu
In the late 2000’s, the prices of many staple crops sold on markets in low- and middle-income African
countries tripled. Higher prices may compromise households’ ability to purchase enough food, or
alternatively increase incomes for food-producing households. Despite these different potential effects,
the net impact of this “food crisis” on the health of vulnerable populations remains unknown.
We extracted data on the local prices of four major staple crops—maize, rice, sorghum, and wheat—from
98 markets in 12 African countries (2002–2013), and studied their relationship to under-five mortality
from Demographic and Health Surveys. Using within-country fixed effects models, distributed lag models,
and instrumental variable approaches, we used the dramatic price increases in 2007–2008 to test the
relationship between food prices and under-five mortality, controlling for secular trends, gross domestic
product per capita, urban residence, and seasonality.
The prices of all four commodities tripled, on average, between 2006 and 2008. We did not find any model
specification in which the increased prices of maize, sorghum, or wheat were consistently associated
with increased under-five mortality. Indeed, price increases for these commodities were more commonly
associated with (statistically insignificant) lower mortality in our data. A $1 increase in the price per kg of
sorghum, a common African staple, was associated with 0.07–4.50 fewer child deaths per 10,000 child-
months, depending on the specification (p=0.25–0.98). In rural areas, where higher food prices may benefit
households that are net food producers, increasing maize prices were associated with lower child mortality
compared with urban households (12.4 fewer child deaths per 10,000 child-months with each $1 increase in
the price of maize; p=0.04).
We did not detect a significant overall relationship between increased prices of maize, rice, sorghum, or
wheat and increased under-five mortality. There is some suggestion that food-producing areas may benefit
from higher prices, while urban areas may be harmed.
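As a rough illustration of the within-country fixed-effects approach described above, the sketch below fits
a panel regression with country and year fixed effects on synthetic data; the variables, data, and
specification are hypothetical, and the pandas and statsmodels packages are assumed.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for c in [f"country_{i}" for i in range(12)]:
    base_mortality = rng.uniform(50, 150)            # country-specific level (fixed effect)
    for y in range(2002, 2014):
        price = rng.uniform(0.2, 1.5) * (3 if y in (2007, 2008) else 1)
        mortality = base_mortality - 2.0 * price + rng.normal(0, 5)
        rows.append({"country": c, "year": y, "price": price, "mortality": mortality})

df = pd.DataFrame(rows)

# Country and year fixed effects absorb country-level differences and secular trends.
model = smf.ols("mortality ~ price + C(country) + C(year)", data=df).fit()
print(model.params["price"], model.pvalues["price"])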
BLOCKCHAIN
Dan Boneh
masses; the distorted apparent shapes of galaxies contain information about the
gravitational effects of mass in other galaxies along the line of sight. Our proposed
work is to take the first step in using all of this information in a giant hierarchical
inference of our Universe’s cosmological and galaxy population model hyper-parameters,
after explicit marginalization of the parameters describing millions—and
perhaps billions—of individual galaxies. We will need to develop the statistical machinery to perform this
inference, and implement it at the appropriate computational scale. Training and testing will require large
cosmological simulations, generating plausible mock galaxy catalogs; we plan to make all of our data public to
enable further investigations of this type.
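The sketch below illustrates, in drastically simplified form, the kind of hierarchical inference with
per-object marginalization described above: a single population hyper-parameter is inferred while each
galaxy's latent value is marginalized numerically. All quantities are synthetic and numpy is assumed.

import numpy as np

rng = np.random.default_rng(3)

POP_MEAN, POP_SCATTER, NOISE = 0.3, 0.1, 0.05
latent = rng.normal(POP_MEAN, POP_SCATTER, size=200)     # true per-galaxy values (unknown in practice)
observed = latent + rng.normal(0.0, NOISE, size=200)     # noisy measurements

def log_likelihood(mu, n_samples=2000):
    """log p(data | mu), marginalizing every galaxy's latent value by Monte Carlo."""
    draws = rng.normal(mu, POP_SCATTER, size=n_samples)   # samples of the latent value under mu
    dens = np.exp(-0.5 * ((observed[:, None] - draws[None, :]) / NOISE) ** 2)
    per_galaxy = dens.mean(axis=1) / (NOISE * np.sqrt(2 * np.pi))
    return np.log(per_galaxy).sum()

grid = np.linspace(0.1, 0.5, 81)
logL = np.array([log_likelihood(mu) for mu in grid])
print("maximum-likelihood population mean:", grid[np.argmax(logL)])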
Electronic interfaces to the brain are increasingly being used to treat incurable disease, and eventually may
be used to augment human function. An important requirement to improve the performance of such devices
is that they be able to recognize and effectively interact with the neural circuitry to which they are connected.
An example is retinal prostheses for treating incurable blindness. Early devices of this form exist now, but only
deliver limited visual function, in part because they do not recognize the diverse cell types in the retina to which
they connect. We have developed automated classifiers for functional identification of retinal ganglion cells,
the output neurons of the retina, based solely on recorded voltage patterns on an electrode array similar to the
ones used in retinal prostheses. Our large collection of data—hundreds of recordings from primate retina over
18 years—made an exploration of automated methods for cell type identification possible for the first time.
We trained classifiers based on features extracted from electrophysiological images (spatiotemporal voltage
waveforms), inter-spike intervals (auto-correlations), and functional coupling between cells (cross-correlations),
and were able to routinely identify cell types with high accuracy. Based on this work, we are now developing the
techniques necessary for a retinal prosthesis to exploit this information by encoding the visual signal in a way
that optimizes artificial vision.
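A minimal sketch of this style of automated cell-type classification appears below, with synthetic
features standing in for the electrical-image, inter-spike-interval, and coupling features described
above; the feature values, labels, and model choice are illustrative, and scikit-learn is assumed.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_cells, n_features = 600, 20

# Pretend feature matrix: each row is one recorded cell, columns are summary features
# (e.g., waveform shape statistics, ISI histogram bins, coupling strengths).
X = rng.normal(size=(n_cells, n_features))
cell_types = rng.integers(0, 4, size=n_cells)       # four hypothetical cell classes
X[np.arange(n_cells), cell_types] += 2.0            # make the classes separable for the demo

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, cell_types, cv=5)
print("cross-validated accuracy:", scores.mean())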
A system developed at Stanford enables distant hospitals and clinics to use confidential healthcare data to
create decision support applications without sharing any patient data among those institutions, thus
facilitating multi-institution research studies on massive datasets. This collaboration between Microsoft
and Stanford will develop an MS Azure application based on this system, providing a solution that is robust,
usable, and widely deployable at many healthcare institutions.
MyHEART COUNTS
Euan Ashley
The MyHeart Counts study—launched in the spring of 2015 on Apple’s Research Kit platform—seeks to mine
the treasure trove of heart health and activity data that can be gathered in a population through mobile
phone apps. Because the average adult in the U.S. checks his/her phone dozens of times each day, phone
apps that target cardiovascular health are a promising tool to quickly gather large amounts of data about
a population’s health and fitness, and ultimately to influence people to make healthier choices. In the first
8 months, over 40,000 users downloaded the app. Participants recorded physical activity, filled out health
questionnaires, and completed a 6-minute walk test. We applied unsupervised machine learning techniques
to cluster subjects into activity groups, such as those more active on weekends.
We then developed algorithms to uncover associations between the clusters of
accelerometry data and subjects’ self-reported health and well-being outcomes. Our
results, published in JAMA Cardiology in December 2016, are in line with the accepted
medical wisdom that more active people are at lower risk for diabetes, heart disease,
and other health problems. However, there is more to the story, as we learned that
certain activity patterns are healthier than others. For example, subjects who were
active throughout the day in brief intervals had lower incidence of heart disease
compared to those who were active for the same total amount of time, but got all
their activity in a single longer session. In the second iteration of our study, we aim to
answer more complex research questions, focusing on gene-environment interactions
as well as discovering the mechanisms that are most effective in encouraging people
to lead more active lifestyles. The app now provides users with different forms of
coaching as well as graphical feedback about their performance throughout the
duration of the study. Additionally, we know it’s not just environmental factors that
affect heart health, so MyHeart Counts in collaboration with 23andMe has added a function in the app to
allow participants who already have a 23andMe account to seamlessly upload their genome data to our
servers. This data is coded and available to approved researchers. Combining data collected through the
app with genetic data allows us to do promising and exciting research in heart health.
Download the MyHeart Counts app from the iTunes store and come join our research!
https://arxiv.org/abs/1407.5675
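As a hedged illustration of the unsupervised clustering step described above, the sketch below groups
synthetic hourly activity profiles with k-means; the data, features, and cluster count are hypothetical
stand-ins rather than MyHeart Counts data, and scikit-learn is assumed.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_subjects, hours_per_week = 500, 24 * 7

# Hypothetical "activity counts per hour of the week" for each subject.
profiles = rng.poisson(lam=50, size=(n_subjects, hours_per_week)).astype(float)
weekend_hours = np.arange(hours_per_week) >= 24 * 5
profiles[: n_subjects // 2, weekend_hours] *= 3      # make half the subjects weekend-heavy

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
for k in range(4):
    members = profiles[kmeans.labels_ == k]
    print(f"cluster {k}: {len(members)} subjects, weekend share of activity "
          f"{members[:, weekend_hours].sum() / members.sum():.0%}")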
The FIND FH EHR project (Flag, Identify, Network and Deliver for Familial
Hypercholesterolemia) aims to pioneer new techniques for the identification
of individuals with Familial Hypercholesterolemia (FH) within electronic health
records (EHRs). FH is a common but vastly underdiagnosed, inherited form
of high cholesterol and coronary heart disease that is potentially devastating if
undiagnosed but can be ameliorated with early identification and proactive
treatment. Traditionally, patients with a phenotype (such as FH) are identified
through rule-based definitions whose creation and validation are time consuming.
Machine learning approaches to phenotyping are limited by the paucity of labeled
training datasets. In this project, we have demonstrated the feasibility of utilizing
noisy labeled training sets to learn phenotype models from the patient’s clinical record. We have searched
both structured and unstructured data within EHRs to identify possible FH patients. Individuals with
possible FH have been “flagged” and are being contacted in a HIPAA-compliant manner to encourage
guideline-based screening and therapy. Algorithms developed have been tested in datasets from
collaborating institutions and are broadly applicable to several different EHR platforms. Furthermore, the
principles can be applied to multiple conditions thereby extending the utility of this approach. The project
is in partnership with the FH Foundation (www.thefhfoundation.org), a non-profit organization founded
and led by FH patients that is dedicated to improving the awareness and treatment of FH.
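The sketch below illustrates the noisy-labeling idea in miniature: an imperfect keyword rule produces
silver-standard labels from clinical notes, and a classifier is then trained on those labels. The notes,
rule, and model are hypothetical and assume scikit-learn; they are not the FIND FH algorithms.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "LDL 260 mg/dL, strong family history of early myocardial infarction",
    "routine visit, cholesterol within normal limits",
    "tendon xanthomas noted, started high-intensity statin",
    "patient reports seasonal allergies, no cardiac complaints",
]

def noisy_label(note: str) -> int:
    """Imperfect keyword rule standing in for a curated labeling heuristic."""
    keywords = ("xanthoma", "family history", "ldl 2")
    return int(any(k in note.lower() for k in keywords))

X = TfidfVectorizer().fit_transform(notes)
y = np.array([noisy_label(n) for n in notes])

model = LogisticRegression().fit(X, y)          # trained on noisy labels, not gold labels
print(model.predict(X))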
What do I mean by
“a purposeful university”?
I mean a university that promotes and
celebrates excellence not as an end in
itself, but rather as a means to multiply its
beneficial impact on society.
Marc Tessier-Lavigne
President, Stanford University
Patient-reported outcomes (PROs) are seen as a way to improve the quality of adverse event reports, and
mobile apps that facilitate direct reporting to FAERS are currently under development. Research to date has
focused on challenges in usability, leaving unanswered the question of data quality assurance—another key
barrier to adoption. The impact of mobile apps on pharmacovigilance practices must be quantifiable for
them to be a viable data capture technology, and currently no cost effective means of evaluation exists.
The overall goal of this project is to use artificial intelligence to quantitatively measure the impacts of mobile
app design factors on the assessment value of safety reports. Our research supports wider efforts to use
mobile technology to integrate PROs into pharmacovigilance practices by providing a means of quality
assurance testing for mobile health apps. We will leverage our automated adverse event (AE) report
assessment tool and software engineering expertise to build a mobile app and companion server for event
data collection and report quality assessment. This will provide us with a means of measuring the impact
of mobile app design features on the quality of AE reports.
To achieve our research goal we will (1) research use cases, conduct competitive landscape analysis,
and establish baseline usability and quality benchmarks. We will (2) build a prototype, and validate
its quality and usability against our previously established benchmarks. Finally, we will (3) measure the
report quality impact of selected app features.
This disconnect is the web’s dark matter: the vast canyon of human experience that goes undocumented or
is overshadowed by the small proportion to which we ourselves, the web’s users and content generators, give
disproportionate airtime. Our goal is to quantify this dark matter across the entire web. We’ll be crawling the
entire web (as archived by CommonCrawl) and categorizing webpages about a variety of topics, primarily
political opinions including viewpoints on marriage equality, abortion rights, marijuana legalization, and other
highly relevant issues. We will then compare the proportions of web pages supporting these issues to the
offline proportions gathered via Pew and Gallup surveys. The result will tell us what impact the web is having
on content production, as well as the impact it has on people when they browse it.
Professor Bernstein’s research applies a computational lens to empower large groups of people connecting
and working online. He designs crowdsourcing and social computing systems that enable people to
connect toward more complex, fulfilling goals. His systems have been used to convene on-demand “flash”
organizations, engage over a thousand people worldwide in open-ended research, and shed light on the
dynamics of antisocial behavior online.
REINFORCEMENT LEARNING
Emma Brunskill
A goal of Brunskill’s work is to increase human potential through advancing interactive machine learning.
Revolutions in storage and computation have made it easy to capture and react to sequences of decisions
made and their outcomes. Simultaneously, due to the rise of chronic health conditions and the demand for
educated workers, there is an urgent need for more scalable solutions to help people reach their full
potential. Interactive machine learning systems could be a key part of the solution. To enable this, Brunskill’s
lab’s work spans from advancing our theoretical understanding of reinforcement learning, to developing new
self-optimizing tutoring systems that we test with learners and in the classroom. Their applications focus on
education since education can radically transform the opportunities available to an individual.
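For readers unfamiliar with reinforcement learning, the sketch below shows textbook tabular Q-learning on
a toy tutoring-style decision problem; the states, actions, and dynamics are invented for illustration and
do not represent Brunskill's systems. Numpy is assumed.

import numpy as np

rng = np.random.default_rng(6)
N_STATES, N_ACTIONS = 4, 2          # skill levels 0-3; actions: assign an easy (0) or hard (1) task
GOAL = N_STATES - 1

def step(state, action):
    """Toy dynamics: a well-matched task tends to raise skill; reward on reaching mastery."""
    good_match = (action == 1) if state >= 2 else (action == 0)
    if rng.random() < (0.8 if good_match else 0.2):
        state = min(state + 1, GOAL)
    reward = 1.0 if state == GOAL else 0.0
    return state, reward

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for _ in range(5000):
    s = 0
    for _ in range(20):                              # finite-horizon episode
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == GOAL:
            break

print("greedy policy by skill level:", np.argmax(Q, axis=1))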
COMPUTATIONAL LOGIC
Michael Genesereth
The Stanford Logic Group’s work aims at developing innovative logic-based technologies to realize the vision
of a “Declarative Enterprise”—an enterprise in which declaratively defined business policies act as executable
specifications of its business operations. The group’s work in the relevant time period addresses
(a) the development of techniques and tools for the easy creation and maintenance of user-friendly web forms
based on formal encodings of laws, regulations, and business policies (Smart Forms), and (b) the development
of techniques and tools for integrated read and write access to structured data (Jabberwocky). Below we
present a short summary of the research and development work conducted.
SMART FORMS
Smart Forms technology makes it so easy to create, maintain, and evaluate powerful yet user-friendly web
forms that they can be created and maintained by domain experts themselves. In particular, neither the
creation and maintenance of Smart Forms nor the evaluation of the data entered into a Smart Form requires
traditional procedural coding.
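The sketch below illustrates the "no procedural coding" idea in miniature: form constraints are written as
declarative data and checked by a single generic evaluator. The rule format and field names are
hypothetical and are not the Logic Group's actual Smart Forms encoding.

# Declarative constraints: domain experts change RULES, never the evaluator code.
RULES = [
    {"field": "age", "op": "ge", "value": 18, "message": "applicant must be 18 or older"},
    {"field": "state", "op": "in", "value": ["CA", "NY"], "message": "state not supported"},
]

OPS = {"ge": lambda a, b: a >= b, "in": lambda a, b: a in b}

def validate(form: dict, rules=RULES) -> list:
    """Generic evaluator: returns the messages of all violated rules."""
    return [r["message"] for r in rules if not OPS[r["op"]](form.get(r["field"]), r["value"])]

print(validate({"age": 17, "state": "CA"}))   # -> ['applicant must be 18 or older']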
JABBERWOCKY
Jabberwocky is a browser-based explorer for integrated structured data on the Web. Jabberwocky integrates
structured open data from authoritative sources and makes it easy for end users to browse, as well as
expressively query, Jabberwocky’s data graph in an ad hoc fashion. Currently, in order to find answers to their
complex questions, end users have to browse multiple web sites with different designs, data presentation
rationales, and query capabilities. Jabberwocky fills this gap.
Other projects include leveraging big sensor data to understand human mobility and obesity, as well as
open-domain social media analysis. Leskovec’s group used big data from
smartphones tracking the activity levels of hundreds of thousands of people around
the globe to understand human mobility and its relation to obesity. Considering that
an estimated 5.3 million people die from causes associated with physical inactivity
every year, they looked for a simple and convenient way to measure activity across
millions of people to help figure out why obesity is a bigger problem in some countries
than others.
PARALLEL COMPUTING
Kunle Olukotun
The core of the Stanford Pervasive Parallelism Lab’s research agenda is to allow the domain expert to
develop parallel software without becoming an expert in parallel programming. The approach is to
use a layered system based on DSLs, a common parallel compiler and runtime infrastructure, and an
underlying architecture that provides efficient mechanisms for communication, synchronization, and
performance monitoring.
WELD
Matei Zaharia
Weld is a runtime for improving the performance of data-intensive applications. It optimizes across libraries
and functions by expressing the core computations in libraries using a small common intermediate
representation, similar to CUDA and OpenCL.
Modern analytics applications combine multiple functions from different libraries and frameworks to
build complex workflows. Even though individual functions can achieve high performance in isolation,
the performance of the combined workflow is often an order of magnitude below hardware limits due
to extensive data movement across the functions. Weld’s take on solving this problem is to lazily build up
a computation for the entire workflow, optimizing and evaluating it only when a result is needed.
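The sketch below illustrates the lazy-evaluation idea in miniature: library calls build up a small
expression graph instead of computing immediately, and the whole pipeline is fused into one pass over the
data when a result is requested. This is a conceptual illustration only, not Weld's actual IR or API.

class Lazy:
    def __init__(self, data, ops=None):
        self.data, self.ops = data, list(ops or [])

    def map(self, fn):
        return Lazy(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return Lazy(self.data, self.ops + [("filter", pred)])

    def evaluate(self):
        """Single fused pass over the data: no intermediate arrays are materialized."""
        out = []
        for x in self.data:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    x = fn(x)
                elif kind == "filter" and not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

# Two "library" operations compose without moving data until evaluate() is called.
pipeline = Lazy(range(1_000_000)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
print(len(pipeline.evaluate()))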
The end goal is to solve the remaining hurdles in patient matching, automated cohort building, and
statistical inference so that for a specific case, we can instantly generate a report with a descriptive summary
of similar patients in Stanford’s clinical data warehouse, the common treatment choices made, and the
observed outcomes after specific treatment choices. This project pursues a unique opportunity to generate
actionable insights from the large amounts of health data that are routinely generated as a byproduct of
clinical processes.
For more information about the Shah Lab, please see the website https://shahlab.stanford.edu
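A minimal sketch of the "similar patients" report idea appears below: given an index patient's feature
vector, the nearest patients are found and their treatments and outcomes summarized. The features,
treatments, and outcomes are synthetic placeholders rather than Stanford clinical warehouse data, and the
numpy and pandas packages are assumed.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_patients = 1000

features = rng.normal(size=(n_patients, 10))                     # e.g., labs, demographics
treatments = rng.choice(["drug_A", "drug_B", "watchful_waiting"], size=n_patients)
outcomes = rng.choice(["improved", "unchanged", "worse"], size=n_patients)

index_patient = rng.normal(size=10)
distances = np.linalg.norm(features - index_patient, axis=1)
similar = np.argsort(distances)[:50]                             # 50 most similar patients

report = pd.DataFrame({"treatment": treatments[similar], "outcome": outcomes[similar]})
print(report.groupby("treatment")["outcome"].value_counts(normalize=True))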
HUMAN GENOME
Michael Snyder
The Snyder laboratory was the first to perform a large-scale functional
genomics project in any organism, and it has developed many technologies in
genomics and proteomics. These include the development of proteome chips,
high-resolution tiling arrays for the entire human genome, methods for global
mapping of transcription factor binding sites (ChIP-chip, now replaced by ChIP-seq),
paired-end sequencing for mapping structural variation in eukaryotes, de novo
sequencing of genomes using high-throughput technologies, and RNA-Seq.
These technologies have been used to characterize genomes, proteomes, and
regulatory networks.
Seminal findings from the Snyder laboratory include the discovery that much more
of the human genome is transcribed and contains regulatory information than was
previously appreciated, and that a high diversity of transcription factor binding occurs
both between and within species.
FOUNDING MEMBERS
REGULAR MEMBERS
sdsi.stanford.edu