DS - Module 1
Module 1
Introduction
• Basic Terminologies
• Similarity Measures
• Data Structures
• Algorithms
Basic Terminologies
• Data: it can be generated, collected, or retrieved.
Basic Terminologies
• Data: facts with no meaning
• Information: learning from facts
• Knowledge: practical understanding of a subject
• Understanding: the ability to absorb knowledge and learn to reason
• Wisdom: the quality of having experience and good judgement; the ability to think and foresee
• Validity: ways to confirm truth
(Diagram: Data → processed → Information → validation → Knowledge → thinking → Wisdom)
Science is a systematic discipline that builds and organises
knowledge in the form of testable hypotheses and predictions
about the world.
Big Data and Data Science Hype
1. Lack of Clear Definitions:
• The terms "Big Data" and "data science" are often used without clear
definitions.
• Questions arise about the relationship between them and whether data
science is exclusive to certain industries.
• Ambiguities make these terms seem almost meaningless.
2. Respect for Existing Work:
• Lack of recognition for researchers in academia and industry who have
worked on similar concepts for years.
• Media portrays machine learning as a recent invention, overlooking the long
history of work by statisticians, computer scientists, mathematicians, and
engineers.
Big Data and Data Science Hype
3. Excessive Hype:
• Hype surrounding data science is criticized for
using over-the-top phrases and creating
unrealistic expectations.
• Comparisons to the pre-financial crisis "Masters of
the Universe" era are seen as detrimental.
• Excessive hype can obscure the real value
underneath and turn people away.
Big Data and Data Science Hype
4. Overlap with Statistics:
• Statisticians already consider themselves working on the
"Science of Data."
• Data science is argued to be more than just a rebranding of
statistics or machine learning, but media often portrays it as
such, especially in the context of the tech industry.
5. Debating the Term "Science":
• Some question whether anything that has to label itself as a
science truly is one.
• The term "data science" may not strictly represent a scientific
discipline but might be more of a craft or a different kind of field.
Getting Past The Hype
• Amid the hype, there's a kernel of truth: data science
represents something genuinely new but is at risk of
premature rejection due to unrealistic expectations.
Why Now!? (Data Science Popularity)
1. Data Abundance and Computing Power:
• We now have massive amounts of data about various aspects of
our lives.
• There's also an abundance of inexpensive computing power,
making it easier to process and analyze this data.
2. Datafication of Offline Behavior:
• Our online activities, like shopping, communication, and
expressing opinions, are commonly tracked.
• The trend of collecting data about our offline behavior has also
started, similar to the online data collection revolution.
Why Now!?
3. Data's Influence Across Industries:
• Data is not limited to the internet; it's prevalent in finance, medical industry,
pharmaceuticals, bioinformatics, social welfare, government, education,
retail, and more.
• Many sectors are experiencing a growing influence of data, sometimes
reaching the scale of "big data."
4. Real-Time Data as Building Blocks:
• The interest in new data is not just due to its sheer volume but because it
can be used in real time to create data products.
• Examples include recommendation systems on Amazon, friend
recommendations on Facebook, trading algorithms in finance, personalized
learning in education, and data-driven policies in government.
Why Now!?
5. Culturally Saturated Feedback Loop:
• A significant shift is occurring where our behavior influences products,
and products, in turn, shape our behavior.
• Technology, with its large-scale data processing capabilities, increased
memory, and bandwidth, along with cultural acceptance, enables this
feedback loop.
6. Emergence of a Feedback Loop:
• The interaction between behavior and products creates a feedback
loop, influencing both culture and technology.
• The book aims to initiate a conversation about understanding and
responsibly managing this loop, addressing ethical and technical
considerations.
Datafication
• Datafication is described as a process of taking all aspects of life and
transforming them into data.
• Examples include how "likes" on social media quantify friendships, Google's
augmented-reality glasses datafy the gaze, and Twitter datafies thoughts.
• People's actions, whether online or in the physical world, are being recorded
for later analysis.
• Datafication occurs intentionally, such as when actively engaging in
social media, or unintentionally through passive actions like browsing the
web or walking around with sensors and cameras capturing data.
• Datafication ranges from intentional participation in social media
experiments to unintentional surveillance and stalking.
• Regardless of individual intentions, the outcome is the same – datafication.
The Current Landscape
• So what is Data Science?
• The passage raises questions about what data
science is and whether it's something new or a
rebranding of statistics or analytics.
• Data science is described as a blend of hacking
and statistics. It involves practical knowledge
of tools and materials along with a theoretical
understanding of possibilities.
• Drew Conway's Venn diagram from 2010 is
mentioned as a representation of data science
skills.
• Data science involves skills such as traditional
statistics, data munging (parsing, scraping, and
formatting data), and other technical abilities.
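Data munging rarely gets illustrated on slides, so here is a minimal sketch of what "parsing and formatting data" can look like in practice. The raw records and the helper function are hypothetical, purely for illustration.

```python
# A minimal data-munging sketch (hypothetical data): cleaning messy raw
# strings into numbers before any statistics can be done.
raw_records = ["  172 cm ", "180cm", "N/A", "165 CM"]

def parse_height_cm(text):
    """Extract a numeric height in cm, or None if the value is unparseable."""
    cleaned = text.strip().lower().replace("cm", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

heights = [h for h in (parse_height_cm(r) for r in raw_records) if h is not None]
print(heights)  # [172.0, 180.0, 165.0]
```

Real munging code adds more cases (other units, thousands separators, sentinel values), but the shape is the same: normalize, attempt to parse, and decide what to do with failures.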
A Data Science Profile (Skillset needed)
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
A Data Science Profile (Skillset needed)
• The following experiment was done:
• Students were given index cards and asked to profile their skill levels in
different data science domains.
• The domains include computer science, math, statistics, machine
learning, domain expertise, communication and presentation skills, and
data visualization.
But how do you build a model?
• Parameters:
• Acknowledge that the actual numerical values of parameters are initially unknown.
• Parameters represent the coefficients or constants in mathematical models.
• Visual Representation:
• Alternatively, some individuals may start with visual representations, like diagrams illustrating
data flow.
• Use arrows to depict how variables impact each other or represent temporal changes.
• A visual representation helps in forming an abstract picture of relationships before translating
them into mathematical equations.
But how do you build a model?
• Functional Form of Data: Art and Science
• Determining the functional form involves both art and science.
• Lack of guidance in textbooks despite its critical role in modeling.
• Making assumptions about the underlying structure of reality.
• Challenges and Lack of Global Standards
• Lack of global standards for making assumptions.
• Need for standards in making and explaining choices.
• Making assumptions in a thoughtful way, despite the absence of clear
guidelines.
• Where to Start in Modeling: Not Obvious
• Starting point not obvious, similar to the meaning of life.
But how do you build a model?
• Exploratory Data Analysis (EDA)
• Introduction to EDA as a starting point.
• Making plots and building intuition for the dataset.
• Emphasizing the importance of trial and error and iteration.
• Mystery of Modeling Until Practice
• Modeling appears mysterious until practiced extensively.
• Starting with simple approaches and gradually increasing complexity.
• Using exploratory techniques like histograms and scatterplots.
• Writing Down Assumptions and Starting Simple
• Advantages of writing down assumptions.
• Starting with the simplest models and building complexity.
• Encouraging the use of full-blown sentences to express assumptions.
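The EDA advice above (make plots, build intuition, start simple) can be sketched in a few lines. The dataset here is simulated, since no real data accompanies the slide; in a headless setting, `numpy.histogram` and the sample correlation stand in for a plotted histogram and scatterplot.

```python
# EDA sketch on simulated data (assumption: no real dataset is given here).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1_000)   # a measured quantity
y = 2.0 * x + rng.normal(scale=5, size=1_000)  # a second, correlated quantity

# Histogram: bin counts reveal the overall shape of x
# (plt.hist(x, bins=20) would draw this).
counts, edges = np.histogram(x, bins=20)

# Scatterplot intuition without a display: the sample correlation
# summarizes how strongly y moves with x.
r = np.corrcoef(x, y)[0, 1]
print(counts.sum(), round(r, 2))
```

With matplotlib available, the same two checks become `plt.hist(x)` and `plt.scatter(x, y)`; the point is to look before modeling.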
But how do you build a model?
• Trade-off Between Simple and Accurate Models
• Acknowledging the trade-off between simplicity and accuracy.
• Simple models may be easier to interpret and understand.
• Highlighting that simple models can often achieve a significant level of
accuracy.
• Building a range of Potential Models
• Introducing probability distributions as fundamental components.
• Stressing the significance of comprehending and applying probability
distributions.
Probability Distribution
• Probability distributions are fundamental in statistical models.
• Back in the day, before computers, scientists observed real-world phenomena, took measurements, and noticed that certain mathematical shapes kept reappearing.
• The classical example is the height of humans, following a
normal distribution—a bell-shaped curve, also called a
Gaussian distribution, named after Gauss.
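The height example can be simulated directly. The mean of 170 cm and standard deviation of 10 cm below are assumed values for illustration; the characteristic check is that roughly 68% of draws fall within one standard deviation of the mean.

```python
# Sketch: simulated human heights (assumed mean 170 cm, sd 10 cm)
# come out bell-shaped, i.e. approximately Gaussian.
import numpy as np

rng = np.random.default_rng(1)
heights = rng.normal(loc=170, scale=10, size=100_000)

print(round(heights.mean(), 1), round(heights.std(), 1))

# About 68% of values within one standard deviation of the mean
# is a signature of the normal distribution.
within_1sd = np.mean(np.abs(heights - 170) < 10)
print(round(within_1sd, 2))
```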
Probability Distribution
(Figure: illustrations of common probability distribution shapes)
• Not all processes generate data that looks like a named
distribution, but many do. We can use these functions as
building blocks of our models.
• It’s beyond the scope of the book / syllabus to go into each of
the distributions in detail, but we look at the figure as an
illustration of the various common shapes.
• Note that they only have names because someone observed
them enough times to think they deserved names.
• There is actually an infinite number of possible distributions.
Probability Distribution
• Each distribution has corresponding functions.
• For example, the normal distribution with mean μ and standard deviation σ is written as:
• p(x) = (1/(σ√(2π))) e^(−(x−μ)² / (2σ²))
• Calculating Probability:
• If we want to find the probability that the next bus arrives between 12 and 13
minutes (with the random variable x measured in minutes), we need to find the area
under the curve of the PDF between x = 12 minutes and x = 13 minutes.
• This is done by integrating the PDF from 12 to 13:
• P(12 ≤ X ≤ 13) = ∫₁₂¹³ p(x) dx
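The area-under-the-curve calculation can be checked numerically. The slide gives no rate parameter, so the exponential model with λ = 1/10 (mean wait of 10 minutes) below is an assumption for illustration.

```python
# Probability that the bus arrives between minute 12 and minute 13,
# assuming an exponential waiting-time model with rate lam = 1/10.
import math

lam = 1 / 10  # assumed rate parameter (per minute)

def exp_pdf(x, lam):
    return lam * math.exp(-lam * x)

# Closed form: integral of lam*e^(-lam*x) from a to b = e^(-lam*a) - e^(-lam*b)
closed = math.exp(-lam * 12) - math.exp(-lam * 13)

# Numeric check: midpoint Riemann sum of the PDF over [12, 13]
n = 10_000
dx = 1 / n
numeric = sum(exp_pdf(12 + (i + 0.5) * dx, lam) * dx for i in range(n))

print(round(closed, 4), round(numeric, 4))
```

Both routes give the same area, which is the whole point: a probability for a continuous variable is an integral of the density.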
Probability Distribution and
Probability Density Function
• Probability distribution functions and probability density functions are functions defined
over the sample space that assign a probability value to each element.
• Probability distribution functions are defined for discrete random variables (where they
are also called probability mass functions), while probability density functions are defined
for continuous random variables.
• Distribution of probability values (i.e. probability distributions) are best portrayed by
the probability density function and the probability distribution function.
• The probability distribution function can be represented as values in a table, but that
is not possible for the probability density function because the variable is continuous.
• When plotted, the probability distribution function gives a bar plot while the
probability density function gives a curve.
• The height/length of the bars of the probability distribution function must add to 1
while the area under the curve of the probability density function must add to 1.
• In both cases, all the values of the function must be non-negative.
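The sum-to-1 versus integrate-to-1 contrast above can be verified directly. The fair die and the exponential rate λ = 0.5 are illustrative choices, not values from the slide.

```python
# Discrete vs. continuous: probabilities sum to 1, densities integrate to 1.
import math

# Discrete: a fair six-sided die (a probability mass function)
pmf = {face: 1 / 6 for face in range(1, 7)}
total_mass = sum(pmf.values())

# Continuous: exponential density p(x) = lam * e^(-lam*x), lam assumed = 0.5;
# numerically integrate, truncating the infinite tail at x = 50.
lam = 0.5
n, upper = 100_000, 50.0
dx = upper / n
area = sum(lam * math.exp(-lam * (i + 0.5) * dx) * dx for i in range(n))

print(round(total_mass, 6), round(area, 6))
```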
Probability Distribution
• Choosing the Right Distribution:
• One way to determine the appropriate probability distribution for a random variable is by
conducting experiments and collecting data. By analyzing the data and plotting it, we can
approximate the probability distribution function (PDF).
• Alternatively, if we have prior knowledge or experience with a real-world phenomenon,
we might use a known distribution that fits that phenomenon. For instance, waiting times
often follow an exponential distribution, which has the form p(x) = λe^(-λx), where λ is a
parameter.
• Joint Distributions:
• In scenarios involving multiple random variables, we use joint distributions, denoted as
p(x, y). These distributions assign probabilities to combinations of values of the variables.
• The joint distribution function is defined over a plane, where each point corresponds to a
pair of values for the variables.
• Similar to single-variable distributions, the integral of the joint distribution over the entire
plane must equal 1 to represent probabilities.
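"Conducting experiments and collecting data" to pick a distribution can be mimicked in simulation: generate exponential waiting times, then recover the parameter from the sample. The true λ = 0.2 here is an assumption for illustration.

```python
# Sketch of choosing/fitting a distribution from data: simulate exponential
# waiting times, then estimate lam from the sample.
import numpy as np

rng = np.random.default_rng(2)
true_lam = 0.2
waits = rng.exponential(scale=1 / true_lam, size=100_000)

# For p(x) = lam * e^(-lam*x) the mean is 1/lam, so a natural
# estimate is lam_hat = 1 / (sample mean).
lam_hat = 1 / waits.mean()
print(round(lam_hat, 3))
```

On real data the same step follows a histogram check that the shape actually looks exponential; fitting a named distribution only makes sense after the plot supports it.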
Probability Distribution
• Conditional Distributions:
• Conditional distributions, denoted as p(x|y), represent the distribution of one variable
given a specific value of another variable.
• In practical terms, conditioning corresponds to subsetting or filtering the data based
on certain criteria.
• For example, in user-level data for Amazon.com, we might want to analyze the
amount of money spent by users given their gender or other characteristics. This
analysis involves conditional distributions.
• If we consider X to be the random variable that represents the amount of money
spent, then we can look at the distribution of money spent across all users, and
represent it as p(X)
• We can then take the subset of users who looked at more than five items before
buying anything, and look at the distribution of money spent among these users.
• Let Y be the random variable that represents number of items looked at,
• then p (X | Y > 5) would be the corresponding conditional distribution.
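The "conditioning is subsetting" idea can be sketched on simulated user data. All the numbers below (Poisson browsing counts, the spend model) are hypothetical; the Amazon example gives none.

```python
# p(X | Y > 5) as filtering: X = money spent, Y = items looked at.
import numpy as np

rng = np.random.default_rng(3)
n_users = 50_000
items_viewed = rng.poisson(lam=4, size=n_users)            # Y
# Assumption for illustration: spend rises with engagement, plus noise
spend = 5.0 * items_viewed + rng.exponential(10, n_users)  # X

# p(X): spend across all users
mean_all = spend.mean()

# p(X | Y > 5): the same variable, restricted to the filtered subset
mask = items_viewed > 5
mean_given = spend[mask].mean()

print(round(mean_all, 1), round(mean_given, 1))
```

Conditioning on heavy browsers shifts the spend distribution upward here, which is exactly what comparing p(X) with p(X | Y > 5) is meant to reveal.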
Joint Probability
• Joint probability is the probability of two (or more) events
happening simultaneously. It is denoted as P(A∩B) for two events A
and B, which reads as the probability of both A and B occurring.
• For two events A and B, the joint probability is defined as:
• P(A∩B)=P(both A and B occur)
• Note: If A and B are dependent, the joint probability is calculated using conditional
probability: P(A∩B) = P(A∣B) ⋅ P(B).
• Examples of Joint Probability - Rolling Two Dice
• Let A be the event that the first die shows a 3.
• Let B be the event that the second die shows a 5.
Joint Probability
• The joint probability P(A∩B) is the probability that the first die
shows a 3 and the second die shows a 5. Since the outcomes
are independent,
• P(A∩B) = P(A) ⋅ P(B).
• Given: P(A) = 1/6 and P(B) = 1/6, so
• ⇒ P(A∩B) = 1/6 × 1/6 = 1/36.
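The 1/36 result can be confirmed by brute force: enumerate all 36 equally likely outcomes and count the favorable ones.

```python
# Verify P(A ∩ B) = 1/36 for the two-dice example by enumeration.
from itertools import product

outcomes = list(product(range(1, 7), range(1, 7)))  # (first die, second die)
favorable = [o for o in outcomes if o[0] == 3 and o[1] == 5]
p_joint = len(favorable) / len(outcomes)

print(p_joint)  # 0.02777... = 1/36
```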
Conditional Probability
• Conditional probability is the probability of an event occurring given that another
event has already occurred. It provides a way to update our predictions or beliefs
about the occurrence of an event based on new information.
• The conditional probability of event A given event B is denoted as P(A ∣B) and is
defined by the formula:
• P(A∣B)=P(A∩B) / P(B)
Where:
• P(A∩B) is the joint probability of both events A and B occurring.
• P(B) is the probability of event B occurring.
Conditional Probability
Examples of Conditional Probability
• Suppose we have a deck of 52 cards, and we want to find the
probability of drawing an Ace given that we have drawn a red card.
• Let A be the event of drawing an Ace.
• Let B be the event of drawing a red card.
• There are 2 red Aces in a deck (Ace of hearts and Ace of diamonds)
and 26 red cards in total.
P(A∣B) = P(A∩B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
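The card result checks out by enumerating the full 52-card deck and applying the conditional-probability formula directly.

```python
# Verify P(Ace | red) = 1/13 by enumerating the deck.
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]  # first two are red
deck = list(product(ranks, suits))

red = [c for c in deck if c[1] in ("hearts", "diamonds")]  # event B
red_aces = [c for c in red if c[0] == "A"]                 # event A ∩ B

p_b = len(red) / len(deck)             # 26/52
p_a_and_b = len(red_aces) / len(deck)  # 2/52
p_a_given_b = p_a_and_b / p_b

print(round(p_a_given_b, 4))  # 0.0769 = 1/13
```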
Aspect Joint Probability Conditional Probability