Big Data 101
© ArchonMagnus
Traditional Research
1. Generate a hypothesis.
2. Assemble a sample population and a control group.
3. Expose both to an intervention (drug, treatment, etc.).
4. Do statistical analysis to identify causal relationships.
5. Rinse and repeat…
©Mark A. Hicks
Types of Data
Quantitative Data
• Measurable
• Collected through measuring things that have a fixed reality
• Closed-ended
Qualitative Data
• Descriptive
• Collected through observation, field work, focus groups, interviews, recording or filming conversations
• Open-ended
Big Data
Data that is too large or too
complex to be managed using
traditional data processing,
analysis, and storage
techniques.
The 4 V’s of Big Data
• Volume: the amount of data
• Variety: the types of data
• Velocity: the frequency of data
• Veracity: the quality of data
Volume: scale of data
• 90% of today’s data has been created in just the last 2 years
• Every day we create 2.5 quintillion bytes of data or enough to fill 10
million Blu-ray discs
• 40 zettabytes (40 trillion gigabytes) of data will be created by 2020,
an increase of 300 times from 2005, and the equivalent of 5,200
gigabytes of data for every man, woman and child on Earth
• Most companies in the US have over 100 terabytes (100,000
gigabytes) of data stored
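The unit conversions in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 gigabyte is 10^9 bytes and 1 zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) byte units; an assumption, since the slide does not say
# whether it uses decimal or binary prefixes.
GB = 10**9
TB = 10**12
ZB = 10**21

# 40 zettabytes expressed in gigabytes.
total_gb = (40 * ZB) // GB

# 100 terabytes expressed in gigabytes.
company_gb = (100 * TB) // GB

# World population implied by "5,200 gigabytes per person".
implied_population = total_gb / 5200

print(f"40 ZB  = {total_gb:,} GB")
print(f"100 TB = {company_gb:,} GB")
print(f"Implied population: {implied_population / 1e9:.2f} billion")
```

Under these assumptions, 40 zettabytes works out to 40 trillion gigabytes and 100 terabytes to 100,000 gigabytes, matching the figures quoted in the bullets.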
Variety: different forms of data
Velocity: analysis of streaming data
Veracity: trustworthiness of data
• Origin
• Authenticity
• Trustworthiness
• Completeness
• Integrity
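One of the veracity dimensions listed above, completeness, lends itself to a simple measurement. The sketch below is only an illustration: the `completeness` function, the sample records, and the field names (`patient_id`, `age`, `source`) are all hypothetical and do not come from any particular dataset or tool.

```python
def completeness(records, required_fields):
    """Return the fraction of required fields that are present and non-empty."""
    total = len(records) * len(required_fields)
    if total == 0:
        # Nothing to check, so treat the (empty) dataset as complete.
        return 1.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical records: the second is missing an age, the third a source.
records = [
    {"patient_id": "A1", "age": 54, "source": "clinic"},
    {"patient_id": "A2", "age": None, "source": "clinic"},
    {"patient_id": "A3", "age": 61, "source": ""},
]

score = completeness(records, ["patient_id", "age", "source"])
print(f"Completeness: {score:.0%}")
```

A score like this covers only one dimension; origin, authenticity, and integrity would each need their own checks.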
Value
Big Data and Research
Big Data Mining
1. Collect Big Data or obtain
access to a repository.
2. Perform data analysis to
explore patterns (pattern
recognition, predictive
analytics).
3. Identify potential correlations.
4. Good enough!
©Rina Piccolo
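Steps 2 and 3 above can be sketched with a Pearson correlation coefficient, one common way to flag a potential linear relationship between two variables. This is a minimal sketch, not a full analysis pipeline: the `pearson` helper and the sample values are made up for illustration, and a strong coefficient only suggests a correlation, not a causal relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical observations for two variables drawn from a repository.
variable_a = [1, 2, 3, 4, 5]
variable_b = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(variable_a, variable_b)
print(f"Pearson r = {r:.3f}")
```

Values of r near 1 or -1 indicate a strong linear association, and values near 0 indicate little or none.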
Big Data in Health Care
• Faster and cheaper technology and data storage
• Widespread sensing devices
• An increase in “born” digital data
• Greater availability of data via repositories
• Data sharing mandates
© NEC Corporation of America
©Hellerhoff
Greater availability of data via repositories
As of April 2016, the Registry of Research Data Repositories (re3data.org) listed 1,500 research data repositories. Currently, 458 are keyworded “medicine.”
Sharing mandates
[Figure: bar chart by year, 2003 to 2016 (x-axis: Year); labeled values include 201, 463, and 723.]
Hurdles and Risks
• Unstructured Data (~75% of data in the healthcare environment)
• Data privacy/security (HIPAA Compliance, Patient Confidentiality,
Personally Identifiable Information/PII)
• Inconsistent, incomplete, unavailable, poor quality or invalid data
• Poor analysis/analytics leading to erroneous correlations/conclusions
• Misused data
Big Data and Librarians