Data Science Research at Stanford
2017–2018
Recently there has been a paradigm shift in the way data is used. Today researchers are
mining data for patterns and trends that lead to new hypotheses. This shift is caused by
the huge volumes of data available from web query logs, social media posts and blogs,
satellites, sensors, and medical devices.
Data-centered research faces many challenges. Current data management and analysis
techniques do not scale to the huge volumes of data that we expect in the future.
New analysis techniques that use machine learning and data mining require careful
tuning and expert direction. In order to be effective, data analysis must be combined
with knowledge from domain experts. Future breakthroughs will often require intimate
and combined knowledge of algorithms, data management, the domain data, and the
intended applications.
The Stanford Data Science Initiative (SDSI) consists of data science research, shared data and computing infrastructure, shared
tools and techniques, industrial links, and education. SDSI has strong ties to groups across
Stanford University such as medicine, computational social science, biology, energy,
and theory.
SDSI’s emphasis is on research for accelerating data science, discovery, and application
across the university and throughout business and society. As this brochure shows, SDSI
has been funding research and enabling collaboration across all seven schools at Stanford
and with industry in medicine, climate change, social science, energy, and more. Every
company and industry on the planet is affected by the data science and artificial intelligence
revolution. Smart companies are exploiting this revolution as an opportunity to improve
competitive advantage and increase market share, revenue, and profitability. SDSI provides
visibility into emerging technology, accurate assessments of current capability, and
impactful insights in many specific domains.
There are many pressing questions about the use of data science in civil society. Challenges
include algorithmic bias and fairness, the interpretability and accountability of decisions
made by autonomous systems, security and privacy of data, decisions made using shared
data, and the impact of data science on law, transportation, markets, and national defense.
Autonomous vehicles, robots, intelligent agents, and other forms of automation will cause
job loss and a growing need for education. Education is being transformed by massive online
courses and automatic tutoring. SDSI strives to work with industry and academia to ensure
that we are asking the right questions and developing the most impactful innovations.
Jure Leskovec and Steve Eglash
The initial research plan is built around three interrelated levels of analysis: individual, group, and
society. At each level, we are investigating the interplay between static and dynamic properties,
and paying special attention to the ethical and economic issues that arise when confronting major
scientific challenges like this one. Our ultimate goal is to identify ways in which scientists, engineers,
community builders, and community leaders can contribute to the development of more
productive, vibrant, and informed teams, online and offline communities, and societies.
The goal of this project is to develop data science tools and statistical models that bring networks
and language together in order to make more and better predictions about both. Our focus is
on joint models of language and network structure. This brings natural language processing and
social network analysis together to provide a detailed picture not only of what is being said in a
community, but also who is saying it, how the information is being transmitted through the
network, how that transmission affects network structure, and, coming full circle, how those
evolving structures affect linguistic expression. We plan to develop statistical models using diverse
data sets, including not only online social networks (Twitter, Reddit, Facebook), but also hyperlink
networks of news outlets (using massive corpora we collect on an ongoing basis) and networks of
political groups, labs, and corporations.
The high-level, long-term goal is to research how to use the Internet of Things to collect data on
human behavior in a manner that preserves privacy but provides sufficient information to allow
interventions which modify that behavior. We are exploring this research question in the context
of water conservation at Stanford: how can smart water fixtures collect data on how students use
water, such that dormitories can make interventions to reduce water use, while keeping detailed
water use data private?
Towards this end, we have deployed a water use sensing network in Stanford dormitories. Our pilot
deployment in the winter of 2017 showed several interesting results, such as that the average and
median shower lengths for men and women are the same. More importantly, using this network
we have been able to determine that placards placed within the showers suggesting that students
use less water are correlated with 10% shorter showers. Furthermore, while the average shower length is
8.8 minutes, there is an extremely long tail, with 20% of showers being longer than 15 minutes.
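As an illustrative sketch only (the deployment's data and analysis pipeline are not public), summary
statistics of the kind reported above can be computed from a simple log of per-shower durations; the
numbers below are synthetic and assume the numpy package.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log of shower durations in minutes; the real sensor data is not public.
durations = rng.lognormal(mean=2.0, sigma=0.5, size=1000)

mean_len = durations.mean()
median_len = np.median(durations)
share_over_15 = (durations > 15).mean()

print(f"mean: {mean_len:.1f} min, median: {median_len:.1f} min, "
      f"showers over 15 min: {share_over_15:.0%}")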
This project aims to improve in-season predictions of yields for major crops in the
United States, as well as a related goal of mapping soil properties across major
agricultural states, and mapping crop locations and crop types around the world.
The project uses a combination of graphical models, approximate Bayesian
computation, and crop simulation models to make predictions based on weather and
satellite data.
This work has been published at AAAI and has been recognized with a best student paper
award and an award from the World Bank Big Data Challenge competition.
cs.stanford.edu/~ermon/group/website/papers/jiaxuan_AAAI17.pdf
www.worldbank.org/en/news/feature/2017/03/27/and-the-winners-of-the-big-data-innovation-challenge-are#
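For illustration, the sketch below shows rejection-based approximate Bayesian computation (ABC),
one of the techniques named above, calibrating a toy "crop simulator." The simulator, prior, and
tolerance are hypothetical stand-ins, not the project's actual models; numpy is assumed.

import numpy as np

rng = np.random.default_rng(1)

def crop_simulator(weather, sensitivity):
    """Toy simulator: yield responds linearly to weather plus noise (illustration only)."""
    return sensitivity * weather + rng.normal(0.0, 0.5, size=weather.shape)

weather = rng.normal(25.0, 3.0, size=50)           # e.g., seasonal temperatures
observed_yield = crop_simulator(weather, 0.8)       # pretend these are the observed yields

accepted = []
for _ in range(20000):
    theta = rng.uniform(0.0, 2.0)                   # prior over the unknown parameter
    simulated = crop_simulator(weather, theta)
    # Accept the draw if the simulated yields are close to the observed yields.
    if np.mean((simulated - observed_yield) ** 2) < 1.0:
        accepted.append(theta)

print(f"approximate posterior mean of sensitivity: {np.mean(accepted):.2f} "
      f"({len(accepted)} accepted draws)")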
Mendelian diseases are caused by single-gene mutations. In aggregate, they affect 3% (~250M) of the world’s
population. The diagnosis of thousands of Mendelian disorders has been radically transformed by genome
sequencing. The potential to change so many lives for the better is held back by the associated human
labor costs. Genome sequencing itself is a simple, fast procedure costing hundreds of dollars. The mostly
manual process of determining which, if any, of a patient’s sequenced variants is responsible for their
phenotypes, against an exploding body of literature, makes genetic diagnosis 10X more expensive,
unsustainably slow, and incompletely reproducible.
Our project, a unique collaboration between Stanford’s Computer Science department and the Medical
Genetics Division of Stanford’s children’s hospital, aims to develop and deploy a first-of-its-kind computer
system to greatly accelerate the clinical diagnosis workflow and to derive novel disease-gene hypotheses
from it. This effort will produce a proof-of-principle workflow and broadly deployable tools that significantly
improve diagnostic throughput and greatly reduce both the time expert clinicians spend reaching a diagnosis
and the associated costs, thereby making genomic testing accessible, reproducible, and ubiquitous.
A first flagship analysis web portal for the project has launched at:
https://AMELIE.stanford.edu
In the late 2000’s, the prices of many staple crops sold on markets in low- and middle-income African
countries tripled. Higher prices may compromise households’ ability to purchase enough food, or
alternatively increase incomes for food-producing households. Despite these different potential effects,
the net impact of this “food crisis” on the health of vulnerable populations remains unknown.
We extracted data on the local prices of four major staple crops—maize, rice, sorghum, and wheat—from
98 markets in 12 African countries (2002–2013), and studied their relationship to under-five mortality
from Demographic and Health Surveys. Using within-country fixed effects models, distributed lag models,
and instrumental variable approaches, we used the dramatic price increases in 2007–2008 to test the
relationship between food prices and under-five mortality, controlling for secular trends, gross domestic
product per capita, urban residence, and seasonality.
The prices of all four commodities tripled, on average, between 2006 and 2008. We did not find any model
specification in which the increased prices of maize, sorghum, or wheat were consistently associated
with increased under-five mortality. Indeed, price increases for these commodities were more commonly
associated with (statistically insignificant) lower mortality in our data. A $1 increase in the price per kg of
sorghum, a common African staple, was associated with 0.07–4.50 fewer child deaths per 10,000 child-
months, depending on the specification (p=0.25–0.98). In rural areas, where higher food prices may benefit
households that are net food producers, increasing maize prices were associated with lower child mortality
compared with urban households (12.4 fewer child deaths per 10,000 child-months with each $1 increase in
the price of maize; p=0.04).
We did not detect a significant overall relationship between increased prices of maize, rice, sorghum, or
wheat and increased under-five mortality. There is some suggestion that food-producing areas may benefit
from higher prices, while urban areas may be harmed.
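As a rough illustration of the within-country fixed-effects approach described above, the sketch below fits
a panel regression with country and year fixed effects on synthetic data; the variables, data, and
specification are hypothetical, and the pandas and statsmodels packages are assumed.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for c in [f"country_{i}" for i in range(12)]:
    base_mortality = rng.uniform(50, 150)            # country-specific level (fixed effect)
    for y in range(2002, 2014):
        price = rng.uniform(0.2, 1.5) * (3 if y in (2007, 2008) else 1)
        mortality = base_mortality - 2.0 * price + rng.normal(0, 5)
        rows.append({"country": c, "year": y, "price": price, "mortality": mortality})

df = pd.DataFrame(rows)

# Country and year fixed effects absorb country-level differences and secular trends.
model = smf.ols("mortality ~ price + C(country) + C(year)", data=df).fit()
print(model.params["price"], model.pvalues["price"])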
BLOCKCHAIN
Dan Boneh
masses; the distorted apparent shapes of galaxies contain information about the
gravitational effects of mass in other galaxies along the line of sight. Our proposed
work is to take the first step in using all of this information in a giant hierarchical
inference of our Universe’s cosmological and galaxy population model hyper-parameters,
after explicit marginalization of the parameters describing millions—and
perhaps billions—of individual galaxies. We will need to develop the statistical machinery to perform this
inference, and implement it at the appropriate computational scale. Training and testing will require large
cosmological simulations, generating plausible mock galaxy catalogs; we plan to make all of our data public to
enable further investigations of this type.
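The sketch below illustrates, in drastically simplified form, the kind of hierarchical inference with
per-object marginalization described above: a single population hyper-parameter is inferred while each
galaxy's latent value is marginalized numerically. All quantities are synthetic and numpy is assumed.

import numpy as np

rng = np.random.default_rng(3)

POP_MEAN, POP_SCATTER, NOISE = 0.3, 0.1, 0.05
latent = rng.normal(POP_MEAN, POP_SCATTER, size=200)     # true per-galaxy values (unknown in practice)
observed = latent + rng.normal(0.0, NOISE, size=200)     # noisy measurements

def log_likelihood(mu, n_samples=2000):
    """log p(data | mu), marginalizing every galaxy's latent value by Monte Carlo."""
    draws = rng.normal(mu, POP_SCATTER, size=n_samples)   # samples of the latent value under mu
    dens = np.exp(-0.5 * ((observed[:, None] - draws[None, :]) / NOISE) ** 2)
    per_galaxy = dens.mean(axis=1) / (NOISE * np.sqrt(2 * np.pi))
    return np.log(per_galaxy).sum()

grid = np.linspace(0.1, 0.5, 81)
logL = np.array([log_likelihood(mu) for mu in grid])
print("maximum-likelihood population mean:", grid[np.argmax(logL)])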
Electronic interfaces to the brain are increasingly being used to treat incurable disease, and eventually may
be used to augment human function. An important requirement to improve the performance of such devices
is that they be able to recognize and effectively interact with the neural circuitry to which they are connected.
An example is retinal prostheses for treating incurable blindness. Early devices of this form exist now, but only
deliver limited visual function, in part because they do not recognize the diverse cell types in the retina to which
they connect. We have developed automated classifiers for functional identification of retinal ganglion cells,
the output neurons of the retina, based solely on recorded voltage patterns on an electrode array similar to the
ones used in retinal prostheses. Our large collection of data—hundreds of recordings from primate retina over
18 years—made an exploration of automated methods for cell type identification possible for the first time.
We trained classifiers based on features extracted from electrophysiological images (spatiotemporal voltage
waveforms), inter-spike intervals (auto-correlations), and functional coupling between cells (cross-correlations),
and were able to routinely identify cell types with high accuracy. Based on this work, we are now developing the
techniques necessary for a retinal prosthesis to exploit this information by encoding the visual signal in a way
that optimizes artificial vision.
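A minimal sketch of this style of automated cell-type classification appears below, with synthetic
features standing in for the electrical-image, inter-spike-interval, and coupling features described
above; the feature values, labels, and model choice are illustrative, and scikit-learn is assumed.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_cells, n_features = 600, 20

# Pretend feature matrix: each row is one recorded cell, columns are summary features
# (e.g., waveform shape statistics, ISI histogram bins, coupling strengths).
X = rng.normal(size=(n_cells, n_features))
cell_types = rng.integers(0, 4, size=n_cells)       # four hypothetical cell classes
X[np.arange(n_cells), cell_types] += 2.0            # make the classes separable for the demo

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, cell_types, cv=5)
print("cross-validated accuracy:", scores.mean())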
A system developed at Stanford enables distant hospitals and clinics to use confidential healthcare data to
create decision support applications without sharing any patient data among those institutions, thus
facilitating multi-institution research studies on massive datasets. This collaboration between Microsoft
and Stanford will develop an MS Azure application based on this system, providing a solution that is robust,
usable, and widely deployable at many healthcare institutions.
MyHEART COUNTS
Euan Ashley
The MyHeart Counts study—launched in the spring of 2015 on Apple’s Research Kit platform—seeks to mine
the treasure trove of heart health and activity data that can be gathered in a population through mobile
phone apps. Because the average adult in the U.S. checks his/her phone dozens of times each day, phone
apps that target cardiovascular health are a promising tool to quickly gather large amounts of data about
a population’s health and fitness, and ultimately to influence people to make healthier choices. In the first
8 months, over 40,000 users downloaded the app. Participants recorded physical activity, filled out health
questionnaires, and completed a 6-minute walk test. We applied unsupervised machine learning techniques
to cluster subjects into activity groups, such as those more active on weekends.
We then developed algorithms to uncover associations between the clusters of
accelerometry data and subjects’ self-reported health and well-being outcomes. Our
results, published in JAMA Cardiology in December 2016, are in line with the accepted
medical wisdom that more active people are at lower risk for diabetes, heart disease,
and other health problems. However, there is more to the story, as we learned that
certain activity patterns are healthier than others. For example, subjects who were
active throughout the day in brief intervals had lower incidence of heart disease
compared to those who were active for the same total amount of time, but got all
their activity in a single longer session. In the second iteration of our study, we aim to
answer more complex research questions, focusing on gene-environment interactions
as well as discovering the mechanisms that are most effective in encouraging people
to lead more active lifestyles. The app now provides users with different forms of
coaching as well as graphical feedback about their performance throughout the
duration of the study. Additionally, we know it’s not just environmental factors that
affect heart health, so MyHeart Counts in collaboration with 23andMe has added a function in the app to
allow participants who already have a 23andMe account to seamlessly upload their genome data to our
servers. This data is coded and available to approved researchers. Combining data collected through the
app with genetic data allows us to do promising and exciting research in heart health.
Download the MyHeart Counts app from the iTunes store and come join our research!
https://arxiv.org/abs/1407.5675
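As a hedged illustration of the unsupervised clustering step described above, the sketch below groups
synthetic hourly activity profiles with k-means; the data, features, and cluster count are hypothetical
stand-ins rather than MyHeart Counts data, and scikit-learn is assumed.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_subjects, hours_per_week = 500, 24 * 7

# Hypothetical "activity counts per hour of the week" for each subject.
profiles = rng.poisson(lam=50, size=(n_subjects, hours_per_week)).astype(float)
weekend_hours = np.arange(hours_per_week) >= 24 * 5
profiles[: n_subjects // 2, weekend_hours] *= 3      # make half the subjects weekend-heavy

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
for k in range(4):
    members = profiles[kmeans.labels_ == k]
    print(f"cluster {k}: {len(members)} subjects, weekend share of activity "
          f"{members[:, weekend_hours].sum() / members.sum():.0%}")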
The FIND FH EHR project (Flag, Identify, Network and Deliver for Familial
Hypercholesterolemia) aims to pioneer new techniques for the identification
of individuals with Familial Hypercholesterolemia (FH) within electronic health
records (EHRs). FH is a common but vastly underdiagnosed, inherited form
of high cholesterol and coronary heart disease that is potentially devastating if
undiagnosed but can be ameliorated with early identification and proactive
treatment. Traditionally, patients with a phenotype (such as FH) are identified
through rule-based definitions whose creation and validation are time consuming.
Machine learning approaches to phenotyping are limited by the paucity of labeled
training datasets. In this project, we have demonstrated the feasibility of utilizing
noisy labeled training sets to learn phenotype models from the patient’s clinical record. We have searched
both structured and unstructured data within EHRs to identify possible FH patients. Individuals with
possible FH have been “flagged” and are being contacted in a HIPAA-compliant manner to encourage
guideline-based screening and therapy. Algorithms developed have been tested in datasets from
collaborating institutions and are broadly applicable to several different EHR platforms. Furthermore, the
principles can be applied to multiple conditions thereby extending the utility of this approach. The project
is in partnership with the FH Foundation (www.thefhfoundation.org), a non-profit organization founded
and led by FH patients that is dedicated to improving the awareness and treatment of FH.
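The sketch below illustrates the noisy-labeling idea in miniature: an imperfect keyword rule produces
silver-standard labels from clinical notes, and a classifier is then trained on those labels. The notes,
rule, and model are hypothetical and assume scikit-learn; they are not the FIND FH algorithms.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "LDL 260 mg/dL, strong family history of early myocardial infarction",
    "routine visit, cholesterol within normal limits",
    "tendon xanthomas noted, started high-intensity statin",
    "patient reports seasonal allergies, no cardiac complaints",
]

def noisy_label(note: str) -> int:
    """Imperfect keyword rule standing in for a curated labeling heuristic."""
    keywords = ("xanthoma", "family history", "ldl 2")
    return int(any(k in note.lower() for k in keywords))

X = TfidfVectorizer().fit_transform(notes)
y = np.array([noisy_label(n) for n in notes])

model = LogisticRegression().fit(X, y)          # trained on noisy labels, not gold labels
print(model.predict(X))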
What do I mean by
“a purposeful university”?
I mean a university that promotes and
celebrates excellence not as an end in
itself, but rather as a means to multiply its
beneficial impact on society.
Marc Tessier-Lavigne
President, Stanford University
Patient-reported outcomes (PROs) are seen as a way to improve the quality of adverse event reports, and
mobile apps that facilitate direct reporting to FAERS are currently under development. Research to date has
focused on challenges in usability, leaving unanswered the question of data quality assurance—another key
barrier to adoption. The impact of mobile apps on pharmacovigilance practices must be quantifiable for
them to be a viable data capture technology, and currently no cost effective means of evaluation exists.
The overall goal of this project is to use artificial intelligence to quantitatively measure the impacts of mobile
app design factors on the assessment value of safety reports. Our research supports wider efforts to use
mobile technology to integrate PROs into pharmacovigilance practices by providing a means of quality
assurance testing for mobile health apps. We will leverage our automated adverse event (AE) report
assessment tool and software engineering expertise to build a mobile app and companion server for event
data collection and report quality assessment. This will provide us with a means of measuring the impact
of mobile app design features on the quality of AE reports.
To achieve our research goal we will (1) research use cases, conduct competitive landscape analysis,
and establish baseline usability and quality benchmarks. We will (2) build a prototype, and validate
its quality and usability against our previously established benchmarks. Finally, we will (3) measure the
report quality impact of selected app features.
This disconnect is the web’s dark matter: the vast canyon of human experience that goes undocumented or
is overshadowed by the small proportion to which we ourselves, the web’s users and content generators, give
disproportionate airtime. Our goal is to quantify this dark matter across the entire web. We’ll be crawling the
entire web (as archived by CommonCrawl) and categorizing webpages about a variety of topics, primarily
political opinions including viewpoints on marriage equality, abortion rights, marijuana legalization, and other
highly relevant issues. We will then compare the proportions of web pages supporting these issues to the
offline proportions gathered via Pew and Gallup surveys. The result will tell us what impact the web is having
on content production, as well as the impact it has on people when they browse it.
Professor Bernstein’s research applies a computational lens to empower large groups of people connecting
and working online. He designs crowdsourcing and social computing systems that enable people to
connect toward more complex, fulfilling goals. His systems have been used to convene on-demand “flash”
organizations, engage over a thousand people worldwide in open-ended research, and shed light on the
dynamics of antisocial behavior online.
REINFORCEMENT LEARNING
Emma Brunskill
A goal of Brunskill’s work is to increase human potential through advancing interactive machine learning.
Revolutions in storage and computation have made it easy to capture and react to sequences of decisions
made and their outcomes. Simultaneously, due to the rise of chronic health conditions and the demand for
educated workers, there is an urgent need for more scalable solutions to help people reach their full
potential. Interactive machine learning systems could be a key part of the solution. To enable this, Brunskill’s
lab’s work spans from advancing our theoretical understanding of reinforcement learning, to developing new
self-optimizing tutoring systems that we test with learners and in the classroom. Their applications focus on
education since education can radically transform the opportunities available to an individual.
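For readers unfamiliar with reinforcement learning, the sketch below shows textbook tabular Q-learning on
a toy tutoring-style decision problem; the states, actions, and dynamics are invented for illustration and
do not represent Brunskill's systems. Numpy is assumed.

import numpy as np

rng = np.random.default_rng(6)
N_STATES, N_ACTIONS = 4, 2          # skill levels 0-3; actions: assign an easy (0) or hard (1) task
GOAL = N_STATES - 1

def step(state, action):
    """Toy dynamics: a well-matched task tends to raise skill; reward on reaching mastery."""
    good_match = (action == 1) if state >= 2 else (action == 0)
    if rng.random() < (0.8 if good_match else 0.2):
        state = min(state + 1, GOAL)
    reward = 1.0 if state == GOAL else 0.0
    return state, reward

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for _ in range(5000):
    s = 0
    for _ in range(20):                              # finite-horizon episode
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == GOAL:
            break

print("greedy policy by skill level:", np.argmax(Q, axis=1))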
COMPUTATIONAL LOGIC
Michael Genesereth
The Stanford Logic Group’s work aims at developing innovative logic-based technologies to realize the vision
of a “Declarative Enterprise”—an enterprise in which declaratively defined business policies act as executable
specifications of its business operations. The group’s work in the relevant time period addresses
(a) the development of techniques and tools for the easy creation and maintenance of user-friendly web forms
based on formal encodings of laws, regulations, and business policies (Smart Forms), and (b) the development
of techniques and tools for integrated read and write access to structured data (Jabberwocky). Below we
present a short summary of the research and development work conducted.
SMART FORMS
Smart Forms technology makes it so easy to create, maintain, and evaluate powerful yet user-friendly web
forms that they can be created and maintained by domain experts themselves. In particular, neither the
creation and maintenance of Smart Forms nor the evaluation of the data entered into a Smart Form requires
traditional procedural coding.
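The sketch below illustrates the "no procedural coding" idea in miniature: form constraints are written as
declarative data and checked by a single generic evaluator. The rule format and field names are
hypothetical and are not the Logic Group's actual Smart Forms encoding.

# Declarative constraints: domain experts change RULES, never the evaluator code.
RULES = [
    {"field": "age", "op": "ge", "value": 18, "message": "applicant must be 18 or older"},
    {"field": "state", "op": "in", "value": ["CA", "NY"], "message": "state not supported"},
]

OPS = {"ge": lambda a, b: a >= b, "in": lambda a, b: a in b}

def validate(form: dict, rules=RULES) -> list:
    """Generic evaluator: returns the messages of all violated rules."""
    return [r["message"] for r in rules if not OPS[r["op"]](form.get(r["field"]), r["value"])]

print(validate({"age": 17, "state": "CA"}))   # -> ['applicant must be 18 or older']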
JABBERWOCKY
Jabberwocky is a browser-based explorer for integrated structured data on the Web. Jabberwocky integrates
structured open data from authoritative sources and makes it easy for end users to browse, as well as
expressively query, Jabberwocky’s data graph in an ad hoc fashion. Currently, in order to find answers to their
complex questions, end users have to browse multiple web sites with different designs, data presentation
rationales, and query capabilities. Jabberwocky fills this gap.
Other projects include leveraging big sensor data to understand human mobility and obesity, as well as
open-domain social media analysis. Leskovec’s group used big data from
smartphones tracking the activity levels of hundreds of thousands of people around
the globe to understand human mobility and its relation to obesity. Considering that
an estimated 5.3 million people die from causes associated with physical inactivity
every year, they looked for a simple and convenient way to measure activity across
millions of people to help figure out why obesity is a bigger problem in some countries
than others.
PARALLEL COMPUTING
Kunle Olukotun
The core of the Stanford Pervasive Parallelism Lab’s research agenda is to allow the domain expert to
develop parallel software without becoming an expert in parallel programming. The approach is to
use a layered system based on DSLs, a common parallel compiler and runtime infrastructure, and an
underlying architecture that provides efficient mechanisms for communication, synchronization, and
performance monitoring.
WELD
Matei Zaharia
Weld is a runtime for improving the performance of data-intensive applications. It optimizes across libraries
and functions by expressing the core computations in libraries using a small common intermediate
representation, similar to CUDA and OpenCL.
Modern analytics applications combine multiple functions from different libraries and frameworks to
build complex workflows. Even though individual functions can achieve high performance in isolation,
the performance of the combined workflow is often an order of magnitude below hardware limits due
to extensive data movement across the functions. Weld’s take on solving this problem is to lazily build up
a computation for the entire workflow, optimizing and evaluating it only when a result is needed.
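The sketch below illustrates the lazy-evaluation idea in miniature: library calls build up a small
expression graph instead of computing immediately, and the whole pipeline is fused into one pass over the
data when a result is requested. This is a conceptual illustration only, not Weld's actual IR or API.

class Lazy:
    def __init__(self, data, ops=None):
        self.data, self.ops = data, list(ops or [])

    def map(self, fn):
        return Lazy(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return Lazy(self.data, self.ops + [("filter", pred)])

    def evaluate(self):
        """Single fused pass over the data: no intermediate arrays are materialized."""
        out = []
        for x in self.data:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    x = fn(x)
                elif kind == "filter" and not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

# Two "library" operations compose without moving data until evaluate() is called.
pipeline = Lazy(range(1_000_000)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
print(len(pipeline.evaluate()))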
The end goal is to solve the remaining hurdles in patient matching, automated cohort building, and
statistical inference so that for a specific case, we can instantly generate a report with a descriptive summary
of similar patients in Stanford’s clinical data warehouse, the common treatment choices made, and the
observed outcomes after specific treatment choices. This project pursues a unique opportunity to generate
actionable insights from the large amounts of health data that are routinely generated as a byproduct of
clinical processes.
For more information about the Shah Lab, please see the website https://shahlab.stanford.edu
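A minimal sketch of the "similar patients" report idea appears below: given an index patient's feature
vector, the nearest patients are found and their treatments and outcomes summarized. The features,
treatments, and outcomes are synthetic placeholders rather than Stanford clinical warehouse data, and the
numpy and pandas packages are assumed.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_patients = 1000

features = rng.normal(size=(n_patients, 10))                     # e.g., labs, demographics
treatments = rng.choice(["drug_A", "drug_B", "watchful_waiting"], size=n_patients)
outcomes = rng.choice(["improved", "unchanged", "worse"], size=n_patients)

index_patient = rng.normal(size=10)
distances = np.linalg.norm(features - index_patient, axis=1)
similar = np.argsort(distances)[:50]                             # 50 most similar patients

report = pd.DataFrame({"treatment": treatments[similar], "outcome": outcomes[similar]})
print(report.groupby("treatment")["outcome"].value_counts(normalize=True))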
HUMAN GENOME
Michael Snyder
The Snyder laboratory was the first to perform a large-scale functional
genomics project in any organism, and it has developed many technologies in
genomics and proteomics. These include the development of proteome chips,
high-resolution tiling arrays for the entire human genome, methods for global
mapping of transcription factor binding sites (ChIP-chip, now replaced by ChIP-seq),
paired-end sequencing for mapping structural variation in eukaryotes, de novo
sequencing of genomes using high-throughput technologies, and RNA-Seq.
These technologies have been used to characterize genomes, proteomes, and
regulatory networks.
Seminal findings from the Snyder laboratory include the discovery that much more
of the human genome is transcribed and contains regulatory information than was
previously appreciated, and that a high diversity of transcription factor binding occurs
both between and within species.
FOUNDING MEMBERS
REGULAR MEMBERS
sdsi.stanford.edu