
Data Science (21CS772)

Module 1
Introduction
Basic Terminologies
• Data
• It can be generated, collected, retrieved, or processed
• (Slide diagram also lists related module topics: Similarity Measures, Data Structures, Algorithms)
Basic Terminologies
• (Figure: the DIKW pyramid. Data, once processed, becomes Information; Information, once validated, becomes Knowledge; thinking about Knowledge leads to Wisdom.)
• Data: facts with no meaning
• Information: learning from facts
• Knowledge: practical understanding of a subject
• Understanding: the ability to absorb knowledge and learn to reason
• Wisdom: the quality of having experience and good judgement; the ability to think and foresee
• Validity: ways to confirm truth
Science is a systematic discipline that builds and organises
knowledge in the form of testable hypotheses and predictions
about the world.
Big Data and Data Science Hype
1. Lack of Clear Definitions:
• The terms "Big Data" and "data science" are often used without clear
definitions.
• Questions arise about the relationship between them and whether data
science is exclusive to certain industries.
• Ambiguities make these terms seem almost meaningless.
2. Respect for Existing Work:
• Lack of recognition for researchers in academia and industry who have
worked on similar concepts for years.
• Media portrays machine learning as a recent invention, overlooking the long
history of work by statisticians, computer scientists, mathematicians, and
engineers.
Big Data and Data Science Hype
3. Excessive Hype:
• Hype surrounding data science is criticized for
using over-the-top phrases and creating
unrealistic expectations.
• Comparisons to the pre-financial crisis "Masters of
the Universe" era are seen as detrimental.
• Excessive hype can obscure the real value
underneath and turn people away.
Big Data and Data Science Hype
4. Overlap with Statistics:
• Statisticians already consider themselves working on the
"Science of Data."
• Data science is argued to be more than just a rebranding of
statistics or machine learning, but media often portrays it as
such, especially in the context of the tech industry.
5. Debating the Term "Science":
• Some question whether anything that has to label itself as a
science truly is one.
• The term "data science" may not strictly represent a scientific
discipline but might be more of a craft or a different kind of field.
Getting Past The Hype
• Amid the hype, there's a kernel of truth: data science
represents something genuinely new but is at risk of
premature rejection due to unrealistic expectations.
Why Now!? (Data Science Popularity)
1. Data Abundance and Computing Power:
• We now have massive amounts of data about various aspects of
our lives.
• There's also an abundance of inexpensive computing power,
making it easier to process and analyze this data.
2. Datafication of Offline Behavior:
• Our online activities, like shopping, communication, and
expressing opinions, are commonly tracked.
• The trend of collecting data about our offline behavior has also
started, similar to the online data collection revolution.
Why Now!?
3. Data's Influence Across Industries:
• Data is not limited to the internet; it's prevalent in finance, medical industry,
pharmaceuticals, bioinformatics, social welfare, government, education,
retail, and more.
• Many sectors are experiencing a growing influence of data, sometimes
reaching the scale of "big data."
4. Real-Time Data as Building Blocks:
• The interest in new data is not just due to its sheer volume but because it
can be used in real time to create data products.
• Examples include recommendation systems on Amazon, friend
recommendations on Facebook, trading algorithms in finance, personalized
learning in education, and data-driven policies in government.
Why Now!?
5. Culturally Saturated Feedback Loop:
• A significant shift is occurring where our behavior influences products,
and products, in turn, shape our behavior.
• Technology, with its large-scale data processing capabilities, increased
memory, and bandwidth, along with cultural acceptance, enables this
feedback loop.
6. Emergence of a Feedback Loop:
• The interaction between behavior and products creates a feedback
loop, influencing both culture and technology.
• The book aims to initiate a conversation about understanding and
responsibly managing this loop, addressing ethical and technical
considerations.
Datafication
• Datafication is described as a process of taking all aspects of life and
transforming them into data.
• Examples include how "likes" on social media quantify friendships, Google's
augmented-reality glasses datafy the gaze, and Twitter datafies thoughts.
• People's actions, whether online or in the physical world, are being recorded
for later analysis.
• Datafication occurs intentionally, such as when actively engaging in
social media, or unintentionally through passive actions like browsing the
web or walking around with sensors and cameras capturing data.
• Datafication ranges from intentional participation in social media
experiments to unintentional surveillance and stalking.
• Regardless of individual intentions, the outcome is the same – datafication.
The Current Landscape
• So what is Data Science?
• The passage raises questions about what data
science is and whether it's something new or a
rebranding of statistics or analytics.
• Data science is described as a blend of hacking
and statistics. It involves practical knowledge
of tools and materials along with a theoretical
understanding of possibilities.
• Drew Conway's Venn diagram from 2010 is
mentioned as a representation of data science
skills.
• Data science involves skills such as traditional
statistics, data munging (parsing, scraping, and
formatting data), and other technical abilities.
A Data Science Profile (Skillset needed)
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
A Data Science Profile (Skillset needed)
• The following experiment was done:
• Students were given index cards and asked to profile their skill levels in
different data science domains.
• The domains include computer science, math, statistics, machine
learning, domain expertise, communication and presentation skills, and
data visualization.

• An example of a data science profile is shown, indicating the relative skill levels in each domain.
A Data Science Profile (Skillset needed)

• There was noticeable variation in the skill profiles of each student, especially considering the diverse backgrounds of the class, including many students from the social sciences.
• So a data science team works best when
different skills (profiles) are represented
across different people, because nobody is
good at everything.
OK, So What Is a Data Scientist, Really?
• In Academia:
• An academic data scientist is described as a scientist trained in
various disciplines, working with large amounts of data. They must
address computational challenges posed by the structure, size, and
complexity of data while solving real-world problems.
• Articulating data science in academia involves emphasizing
commonalities in computational and deep data problems across
disciplines. Collaboration among researchers from different
departments can lead to solving real-world problems.
OK, So What Is a Data Scientist, Really?
• In Industry:
• Chief Data Scientist's Role:
A chief data scientist in industry sets the data strategy of the company,
covering aspects such as data collection infrastructure, privacy concerns,
user-facing data, decision-making processes, and integration into products.
They manage teams, communicate with leadership, and focus on
innovation and research goals.
• General Data Scientist Skills:
A data scientist in industry extracts meaning from and interprets data using
tools and methods from statistics and machine learning. They engage in
collecting, cleaning, and processing data, requiring persistence, statistical
knowledge, and software engineering skills.
OK, So What Is a Data Scientist, Really?
• In Industry:
• Exploratory Data Analysis:
Exploratory data analysis involves visualization and data sense
to find patterns, build models, and algorithms. Data scientists
contribute to understanding product usage, the overall health
of the product, and designing experiments.
• Communication and Decision-Making:
Data scientists communicate with team members, engineers,
and leadership using clear language and data visualizations.
They play a critical role in data-driven decision-making
processes.
Case Studies
• Case Study 1: IBM Watson Health
• IBM Watson Health employs data science to enhance healthcare by providing personalized
diagnostic and treatment recommendations. Watson's natural language processing
capabilities enable it to sift through vast medical literature and patient records to assist
doctors in making more informed decisions.
• Data science has significantly aided IBM Watson Health in healthcare diagnostics and
personalized treatment in:
• IBM Watson Health has demonstrated a 15% increase in the accuracy of cancer diagnoses when
assisting oncologists in analyzing complex medical data, including genomic information and
medical journals.
• In a recent clinical trial, IBM Watson Health's AI-powered recommendations helped reduce the
average time it takes to develop a personalized cancer treatment plan from weeks to just a few
days, potentially improving patient outcomes and survival rates.
• Watson's data-driven insights have contributed to a 30% reduction in medication errors in some
healthcare facilities by flagging potential drug interactions and allergies in patient records.
• IBM Watson Health has processed over 200 million pages of medical literature to date, providing
doctors with access to a vast knowledge base that can inform their diagnostic and treatment
decisions.
Case Studies
• Case Study 2: Urban planning and smart cities
• Singapore is pioneering the smart city concept, using data science to optimize urban
planning and public services. They gather data from various sources, including sensors
and citizen feedback, to manage traffic flow, reduce energy consumption, and improve the
overall quality of life in the city-state.
• Here’s how data science helped Singapore in efficient urban planning:
• Singapore's real-time traffic management system, powered by data analytics, has led to a 25%
reduction in peak-hour traffic congestion, resulting in shorter commute times and lower fuel
consumption.
• Through its data-driven initiatives, Singapore has achieved a 15% reduction in energy
consumption across public buildings and street lighting, contributing to significant
environmental sustainability gains.
• Citizen feedback platforms have seen 90% of reported issues resolved within 48 hours, reflecting
the city's responsiveness in addressing urban challenges through data-driven decision-making.
• The implementation of predictive maintenance using data science has resulted in a 30%
decrease in the downtime of critical public infrastructure, ensuring smoother operations and
minimizing disruptions for residents.
Case Studies
• Case Study 3: E-commerce personalization and recommendation
systems
• Amazon, the e-commerce giant, heavily relies on data science to personalize the
shopping experience for its customers. They use algorithms to analyze customers'
browsing and purchasing history, making product recommendations tailored to
individual preferences. This approach has contributed significantly to Amazon's success
and customer satisfaction by reducing customer service response times by 40%.
• Additionally, Amazon leverages data science for:
• Amazon's data-driven product recommendations have led to a 29% increase in average order
value as customers are more likely to add recommended items to their carts.
• A study found that Amazon's personalized shopping experience has resulted in a 68%
improvement in click-through rates on recommended products compared to non-personalized
suggestions.
• Customer service response times have been reduced by 40% due to fewer inquiries related
to product recommendations, as customers find what they need more easily.
• Amazon's personalized email campaigns, driven by data science, have shown an 18% higher
open rate and a 22% higher conversion rate compared to generic email promotions.
Case Studies
• Case Study 4: Transportation and route optimization
• Uber revolutionized the transportation industry by using data science to optimize
ride-sharing and delivery routes. Their algorithms consider real-time traffic
conditions, driver availability, and passenger demand to provide efficient, cost-
effective transportation services. Other use cases include:
• Uber's data-driven routing and matching algorithms have led to an average 20%
reduction in travel time for passengers, ensuring quicker and more efficient
transportation.
• By optimizing driver routes and minimizing detours, Uber has contributed to a 30%
decrease in fuel consumption for drivers, resulting in cost savings and reduced
environmental impact.
• Uber's real-time demand prediction models have helped reduce passenger wait times
by 25%, enhancing customer satisfaction and increasing the number of rides booked.
• Over the past decade, Uber's data-driven approach has enabled 100 million active
users to complete over 15 billion trips, demonstrating the scale and impact of their
transportation services.
What is Data Science?
• Data Science is a multidisciplinary
field that focuses on finding
actionable insights from large sets of
structured and unstructured data.
• Data Science experts integrate
computer science, predictive analytics,
statistics and Machine Learning to mine
very large data sets, with the goal of
discovering relevant insights that can
help the organisation move forward, and
identifying specific future events.
Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
• What is Big Data?
• Big data refers to extremely large and diverse collections of
structured, unstructured, and semi-structured data that continues to
grow exponentially over time.
• These datasets are so huge and complex in volume, velocity, and
variety, that traditional data management systems cannot store,
process, and analyze them.
• Big Data is a term often used loosely, but it generally refers to three
things:
• a set of technologies,
• a potential revolution in measurement, and
• a philosophy about decision-making.
Statistical Inference
• The world is complex, random, and uncertain, functioning as a
massive data-generating machine.
• Everyday activities, from commuting to work, shopping, and even
biological processes, potentially produce data.
• Processes in our lives are inherently data-generating, and
understanding them is crucial for problem-solving.
• Data represents traces of real-world processes, collected through
subjective data collection methods.
Statistical Inference
• Two sources of randomness and uncertainty: (i) underlying
process and (ii) uncertainty in data collection methods.
• Data scientists turn the world into data through subjective
observation and data collection.
• Capturing the world in data is not enough; the challenge is to
understand the complex processes behind the data.
• Need for simplification: transforming captured traces into
comprehensible forms, often through statistical estimators.
• Statistical Inference is the discipline focused on developing
procedures and methods to extract meaning from data generated
by stochastic processes.
Populations and Samples
• Population
• Population refers to the entire set of objects or units, not limited to people (e.g.,
tweets, photographs, stars).
• Denoted by N, representing the total number of observations in the population.
• If characteristics of all objects in the population are measured, we have a complete
set of observations.
• Example:
• Suppose your population was all emails sent last year by employees at a huge
corporation, BigCorp.
• Then a single observation could be a list of things: the sender’s name, the list of
recipients, date sent, text of email, number of characters in the email, number
of sentences in the email, number of verbs in the email, and the length of time
until first reply.
Populations and Samples
• Sampling
• A sample is a subset of units (size n) taken to draw conclusions about the population.
• Different sampling mechanisms exist, and awareness is crucial to avoid biases.
• Sampling Methods Example: BigCorp Email:
• Two reasonable methods:
• Randomly selecting 1/10th of all employees and taking their emails, or
• sampling 1/10th of all emails sent each day.
• Both methods yield the same sample size but can lead to different conclusions
about the underlying distribution of emails.
• Biases can be introduced during sampling, distorting data.
• Distorted data can lead to incorrect and biased conclusions, especially
in complex algorithms and models.
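The two sampling schemes can be made concrete with a short Python sketch. This is a minimal illustration, assuming a hypothetical pandas DataFrame named emails with made-up sender, date, and n_chars columns; it is not BigCorp's actual data or pipeline.

```python
import pandas as pd

# Hypothetical email log: one row per email sent at BigCorp last year.
emails = pd.DataFrame({
    "sender":  ["alice", "bob", "alice", "carol", "dave", "bob"],
    "date":    pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03",
                               "2024-01-03", "2024-01-04", "2024-01-04"]),
    "n_chars": [120, 54, 300, 80, 45, 210],
})

FRACTION = 0.5  # the slide uses 1/10; a larger fraction keeps this toy sample non-empty

# Scheme 1: randomly select a fraction of employees, keep every email they sent.
chosen = emails["sender"].drop_duplicates().sample(frac=FRACTION, random_state=0)
sample_by_employee = emails[emails["sender"].isin(chosen)]

# Scheme 2: sample the same fraction of the emails sent on each day.
sample_by_day = emails.groupby("date").sample(frac=FRACTION, random_state=0)

# Same nominal sampling rate, but the two samples can suggest different
# distributions of per-email quantities (e.g. characters per email).
print(sample_by_employee["n_chars"].mean(), sample_by_day["n_chars"].mean())
```

Even with the same nominal sampling rate, the employee-based sample clusters emails by sender, so the two schemes can paint different pictures of the underlying email distribution, which is exactly the kind of bias the slide warns about.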
Modeling
• We need to build models from collected data.
• The term "model" has different meanings, causing confusion in discussions.
• Example:
• Data models for storing data (database managers) vs.
• statistical models (central to this course).
• Reference to a provocative Wired magazine piece by Chris Anderson:
• Argues that complete information from data eliminates the need for models; correlation alone
suffices.
• But the book's author, Rachel, doesn't agree with Anderson; she thinks models are still very important.
• Recognizing the Media's Role:
• The media plays a big part in how people see data science and modeling.
• It's crucial to think carefully and judge opinions, especially from those who don't work directly with
data.
• Data scientists should share their well-informed opinions in discussions about these
topics.
What is a model?
• Humans understand the world by creating various representations.
• Architects use blueprints, molecular biologists use visualizations, and statisticians/data scientists use
mathematical functions.
• Statisticians and data scientists express uncertainty and randomness in data-generating processes
through mathematical functions.
• Models serve as lenses to understand and represent reality, whether in architecture, biology, or
mathematics.
• A model is an artificial construction that simplifies reality by removing or abstracting irrelevant details.
• Attention must be given to these abstracted details post-analysis to ensure nothing crucial was
overlooked.
• Imagine creating a model to predict students' final grades based on various factors.
• The model might consider variables like attendance, study hours, participation in class, and past academic performance.
• Students' contact details, being irrelevant to the prediction, can be abstracted away.
Statistical modelling
• Draw a conceptual picture of the underlying process before diving into
data and coding.
• Identify the sequence of events, factors influencing each other, and
causation relationships.
• Consider questions like what comes first, what influences what, and
what causes what.
• Formulate testable hypotheses based on your initial understanding of
the process.
• Different Thinking Styles:
• Recognize that people have different preferences in expressing relationships.
• Some individuals lean towards mathematical expressions, while others prefer
visual representations.
Statistical modeling
• Mathematical Expression:
• If inclined towards math, use expressions with Greek letters for parameters and Latin letters
for data.
• Example: For a potential linear relationship between columns x and y, express it as
• y = β0 + β1x
• Where β0 and β1 are parameters.

• Acknowledge that the actual numerical values of parameters are unknown initially.
• Parameters represent the coefficients or constants in mathematical models.
• Visual Representation:
• Alternatively, some individuals may start with visual representations, like diagrams illustrating
data flow.
• Use arrows to depict how variables impact each other or represent temporal changes.
• A visual representation helps in forming an abstract picture of relationships before translating
them into mathematical equations.
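To make the distinction between parameters and data concrete, here is a minimal Python sketch of the linear form y = β0 + β1x above. The numeric values of β0 and β1 are made up purely for illustration; in a real analysis they are unknown and have to be estimated from the data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "true" parameters (the Greek letters in the math); values invented.
beta0, beta1 = 2.0, 0.5

# Data (the Latin letters): x is observed, y is generated by the assumed
# linear process plus random noise, the source of uncertainty in the model.
x = rng.uniform(0, 10, size=100)
y = beta0 + beta1 * x + rng.normal(0, 1, size=100)

# In a real analysis only x and y are visible; beta0 and beta1 are unknown
# and have to be estimated from the data (see "Fitting a model" later).
```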
But how do you build a model?
• Functional Form of Data: Art and Science
• Determining the functional form involves both art and science.
• Lack of guidance in textbooks despite its critical role in modeling.
• Making assumptions about the underlying structure of reality.
• Challenges and Lack of Global Standards
• Lack of global standards for making assumptions.
• Need for standards in making and explaining choices.
• Making assumptions in a thoughtful way, despite the absence of clear
guidelines.
• Where to Start in Modeling: Not Obvious
• Starting point not obvious, similar to the meaning of life.
But how do you build a model?
• Exploratory Data Analysis (EDA)
• Introduction to EDA as a starting point.
• Making plots and building intuition for the dataset.
• Emphasizing the importance of trial and error and iteration.
• Mystery of Modeling Until Practice
• Modeling appears mysterious until practiced extensively.
• Starting with simple approaches and gradually increasing complexity.
• Using exploratory techniques like histograms and scatterplots.
• Writing Down Assumptions and Starting Simple
• Advantages of writing down assumptions.
• Starting with the simplest models and building complexity.
• Encouraging the use of full-blown sentences to express assumptions.
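A minimal sketch of the exploratory step described above, using matplotlib on simulated data (the dataset and variable names are invented): a histogram to build intuition about a single variable, and a scatterplot as a first look at how two variables relate.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram: the distribution of a single variable.
ax1.hist(y, bins=20)
ax1.set_title("Histogram of y")

# Scatterplot: the relationship between two variables, before any modeling.
ax2.scatter(x, y, s=10)
ax2.set_title("y vs x")

plt.tight_layout()
plt.show()
```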
But how do you build a model?
• Trade-off Between Simple and Accurate Models
• Acknowledging the trade-off between simplicity and accuracy.
• Simple models may be easier to interpret and understand.
• Highlighting that simple models can often achieve a significant level of
accuracy.
• Building a range of Potential Models
• Introducing probability distributions as fundamental components.
• Stressing the significance of comprehending and applying probability
distributions.
Probability Distribution
• Probability distributions are fundamental in statistical models.
• Back in the day, before computers, scientists observed real-
world phenomenon, took measurements, and noticed that
certain mathematical shapes kept reappearing.
• The classical example is the height of humans, following a
normal distribution—a bell-shaped curve, also called a
Gaussian distribution, named after Gauss.
Probability Distribution
Probability Distribution
• Not all processes generate data that looks like a named
distribution, but many do. We can use these functions as
building blocks of our models.
• It’s beyond the scope of the book / syllabus to go into each of
the distributions in detail, but we look at the figure as an
illustration of the various common shapes.
• Note that they only have names because someone observed
them enough times to think they deserved names.
• There is actually an infinite number of possible distributions.
Probability Distribution
• Each distribution has corresponding functions.
• For example, the normal (Gaussian) distribution is written as:
• p(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
• The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution).
• The parameter σ controls how spread out the distribution is.
• This is the general functional form, but for specific real-world phenomena these parameters take actual numerical values, which we can estimate from the data.
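As a sketch of how those parameters can be estimated from data, the following Python snippet fits a normal distribution to simulated height measurements with scipy; the sample values are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated heights (cm): a stand-in for a real-world, roughly normal measurement.
heights = rng.normal(loc=170, scale=8, size=1_000)

# Estimate the parameters from the data: mu-hat is the sample mean,
# sigma-hat the sample standard deviation (norm.fit returns both).
mu_hat, sigma_hat = norm.fit(heights)

# Evaluate the fitted density p(x) at a few points.
xs = np.array([150, 170, 190])
print(mu_hat, sigma_hat, norm.pdf(xs, loc=mu_hat, scale=sigma_hat))
```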
Probability Distribution
• Random Variable (x or y):
• A random variable is a variable whose possible values are outcomes of a random
phenomenon. It's represented by symbols like x or y.
• Probability Distribution (p(x)):
• A probability distribution is a mathematical function that provides the probabilities
of occurrence of different possible outcomes in an experiment.
• For a continuous random variable, the probability distribution is described by a
probability density function (PDF), denoted as p(x), which assigns probabilities to
intervals rather than individual values.
• Probability Density Function (PDF):
• A probability density function maps the values of a random variable to non-
negative real numbers. It's denoted as p(x). For a PDF to be valid, its integral over
its entire domain must equal 1. This ensures that the total probability across all
possible outcomes is 1.
Probability Distribution
• Example (Time until the next bus):
• In this example, the random variable x represents the amount of time (in minutes) until the next bus arrives. Since the arrival time can vary and is uncertain, it's a random variable.
• Given PDF (p(x)):
• The probability density function (PDF) for the time until the next bus arrives is given as
p(x) = 2e^(−2x)
• Calculating Probability:
• If we want to find the probability that the next bus arrives between 12 and 13 minutes, we need to find the area under the curve of the PDF between x = 12 and x = 13.
• This is done by integrating the PDF from 12 to 13:
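The integral can be computed numerically; here is a minimal Python sketch using scipy, with the analytic answer e^(−24) − e^(−26) shown alongside as a check.

```python
import numpy as np
from scipy.integrate import quad

# PDF from the slide: p(x) = 2 * exp(-2x), x = minutes until the next bus.
def p(x):
    return 2 * np.exp(-2 * x)

# Probability the bus arrives between 12 and 13 minutes: area under p(x).
prob, _ = quad(p, 12, 13)

# Analytic check: the integral evaluates to e^(-2*12) - e^(-2*13).
print(prob, np.exp(-24) - np.exp(-26))
```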
Probability Distribution and
Probability Density Function
• Probability distribution function and probability density function are functions defined
over the sample space, to assign the relevant probability value to each element.
• Probability distribution functions are defined for the discrete random variables while
probability density functions are defined for the continuous random variables.
• Distribution of probability values (i.e. probability distributions) are best portrayed by
the probability density function and the probability distribution function.
• The probability distribution function can be represented as values in a table, but that
is not possible for the probability density function because the variable is continuous.
• When plotted, the probability distribution function gives a bar plot while the
probability density function gives a curve.
• The heights of the bars of the probability distribution function must sum to 1, while the area under the curve of the probability density function must equal 1.
• In both cases, all the values of the function must be non-negative.
Probability Distribution
• Choosing the Right Distribution:
• One way to determine the appropriate probability distribution for a random variable is by
conducting experiments and collecting data. By analyzing the data and plotting it, we can
approximate the probability distribution function (PDF).
• Alternatively, if we have prior knowledge or experience with a real-world phenomenon,
we might use a known distribution that fits that phenomenon. For instance, waiting times
often follow an exponential distribution, which has the form p(x) = λe^(-λx), where λ is a
parameter.
• Joint Distributions:
• In scenarios involving multiple random variables, we use joint distributions, denoted as
p(x, y). These distributions assign probabilities to combinations of values of the variables.
• The joint distribution function is defined over a plane, where each point corresponds to a
pair of values for the variables.
• Similar to single-variable distributions, the integral of the joint distribution over the entire
plane must equal 1 to represent probabilities.
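Picking up the exponential waiting-time distribution mentioned under "Choosing the Right Distribution" above, here is a minimal sketch of estimating the rate parameter λ from data with scipy; the waiting times are simulated rather than real measurements.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(3)

# Simulated waiting times with true rate lambda = 2 (mean wait = 1/2).
waits = rng.exponential(scale=0.5, size=5_000)

# Fit an exponential distribution; scipy parameterises it by scale = 1/lambda.
loc, scale = expon.fit(waits, floc=0)   # pin loc at 0 so scale is 1/lambda
lam_hat = 1 / scale

print(lam_hat)   # close to 2; the maximum-likelihood estimate is 1/mean(waits)
```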
Probability Distribution
• Conditional Distributions:
• Conditional distributions, denoted as p(x|y), represent the distribution of one variable
given a specific value of another variable.
• In practical terms, conditioning corresponds to subsetting or filtering the data based
on certain criteria.
• For example, in user-level data for Amazon.com, we might want to analyze the
amount of money spent by users given their gender or other characteristics. This
analysis involves conditional distributions.
• If we consider X to be the random variable that represents the amount of money
spent, then we can look at the distribution of money spent across all users, and
represent it as p(X)
• We can then take the subset of users who looked at more than five items before
buying anything, and look at the distribution of money spent among these users.
• Let Y be the random variable that represents number of items looked at,
• then p (X | Y > 5) would be the corresponding conditional distribution.
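Conditioning as subsetting can be shown in a few lines of pandas. This is a hedged sketch with a hypothetical users table and made-up column names (amount_spent for X, items_viewed for Y), not Amazon's actual data.

```python
import pandas as pd

# Hypothetical user-level data: money spent (X) and items viewed (Y).
users = pd.DataFrame({
    "amount_spent": [0, 12.5, 40.0, 7.2, 99.0, 15.0, 0, 63.5],
    "items_viewed": [1, 3, 8, 2, 12, 6, 4, 9],
})

# p(X): distribution of money spent across all users.
all_spent = users["amount_spent"]

# p(X | Y > 5): conditioning = subsetting to users who viewed more than 5 items.
conditional_spent = users.loc[users["items_viewed"] > 5, "amount_spent"]

print(all_spent.mean(), conditional_spent.mean())
```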
Joint Probability
• Joint probability is the probability of two (or more) events
happening simultaneously. It is denoted as P(A∩B) for two events A
and B, which reads as the probability of both A and B occurring.
• For two events A and B, the joint probability is defined as:
• P(A∩B)=P(both A and B occur)
• Note: If A and B are dependent, the joint probability is calculated
using conditional probability
• Examples of Joint Probability - Rolling Two Dice
• Let A be the event that the first die shows a 3.
• Let B be the event that the second die shows a 5.
Joint Probability

• The joint probability P(A∩B) is the probability that the first die
shows a 3 and the second die shows a 5. Since the outcomes
are independent,
• P(A∩B) = P(A) ⋅ P(B).
• Given: P(A) = 1/6 and P(B) = 1/6, so
• ⇒ P(A∩B) = 1/6 × 1/6 = 1/36.
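A quick Python check of this result, computing the exact product of the two probabilities and confirming it with a simple dice simulation:

```python
from fractions import Fraction
import random

# Exact value by independence: P(A ∩ B) = P(A) * P(B) = 1/6 * 1/6.
exact = Fraction(1, 6) * Fraction(1, 6)   # 1/36

# Monte-Carlo check: roll two dice many times and count (first = 3, second = 5).
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    die1, die2 = random.randint(1, 6), random.randint(1, 6)
    if die1 == 3 and die2 == 5:
        hits += 1

print(exact, hits / trials)   # exact = 1/36, simulated frequency ≈ 0.028
```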
Conditional Probability
• Conditional probability is the probability of an event occurring given that another
event has already occurred. It provides a way to update our predictions or beliefs
about the occurrence of an event based on new information.
• The conditional probability of event A given event B is denoted as P(A ∣B) and is
defined by the formula:
• P(A∣B)=P(A∩B) / P(B)
Where:
• P(A∩B) is the joint probability of both events A and B occurring.
• P(B) is the probability of event B occurring.
Conditional Probability
Examples of Conditional Probability
• Suppose we have a deck of 52 cards, and we want to find the
probability of drawing an Ace given that we have drawn a red card.
• Let A be the event of drawing an Ace.
• Let B be the event of drawing a red card.
• There are 2 red Aces in a deck (Ace of hearts and Ace of diamonds)
and 26 red cards in total.
P(A∣B) = P(A∩B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
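The same answer can be verified by enumerating a deck in Python; counting red aces among red cards gives P(A|B) directly, since every card is equally likely:

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))                         # 52 cards

red = [c for c in deck if c[1] in ("hearts", "diamonds")]  # event B: 26 cards
red_aces = [c for c in red if c[0] == "A"]                 # A ∩ B: 2 cards

# P(A | B) = P(A ∩ B) / P(B) = |A ∩ B| / |B| under equally likely outcomes.
print(len(red_aces) / len(red))   # 2 / 26 = 1/13
```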
Joint Probability vs. Conditional Probability
• Definition: joint is the probability of two or more events occurring together; conditional is the probability of an event given that another event has occurred.
• Notation: P(A∩B) or P(A, B) for joint; P(A∣B) or P(B∣A) for conditional.
• Formula: joint is P(A∩B); conditional is P(A∣B) = P(A∩B) / P(B).
• Example: joint is the probability of rolling a 2 and flipping heads, P(2 ∩ Heads); conditional is the probability of rolling a 2 given that the coin flip is heads, P(2 ∣ Heads).
• Calculation context: joint is calculated from a joint probability distribution; conditional is calculated using the joint probability and the marginal probability of the given condition.
• Dependencies: joint involves multiple events happening simultaneously; conditional depends on the occurrence of another event.
• Use case: joint is used to find the likelihood of combined events in probabilistic models; conditional is used to update the probability of an event based on new information.
Fitting a model
• Fitting a model involves estimating the parameters of the model
using observed data. This process helps approximate the real-
world mathematical process that generated the data.
• It often requires optimization methods like maximum likelihood
estimation to find the best parameters that fit the data.
• The parameters estimated from the data are themselves
functions of the data and are called estimators.
• Once the model is fitted, you can express it in a mathematical
form, such as y = 7.2 + 4.5x, which represents the relationship
between variables based on the assumption that the data follows
a specific pattern.
Fitting a model
• Coding the Model:
• Coding the model involves reading in the data and specifying
the mathematical form of the model.
• Programming languages like R or Python use built-in
optimization methods to find the most likely values of the
parameters given the data.
• While you should understand that optimization is happening,
you typically don't need to code this part yourself, as it's
handled by the programming language's functions.
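As a minimal sketch of what the language's built-in fitting can look like in Python, the snippet below estimates the two parameters of a linear model by least squares with numpy.polyfit on simulated data; the true parameter values (7.2 and 4.5, echoing the earlier example) are assumed only to generate the toy data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Data assumed to come from y = beta0 + beta1 * x plus noise.
x = rng.uniform(0, 10, size=200)
y = 7.2 + 4.5 * x + rng.normal(0, 2, size=200)

# np.polyfit performs a least-squares fit; the returned coefficients are the
# estimators beta1-hat and beta0-hat (highest degree first).
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

print(f"fitted model: y = {beta0_hat:.1f} + {beta1_hat:.1f}x")
```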
Fitting a model
• Overfitting:
• Overfitting occurs when a model is too complex and fits the
training data too closely, capturing noise or random
fluctuations rather than the underlying pattern.
• When you apply an overfitted model to new data (not used for
training) for prediction, it performs poorly because it hasn't
generalized well beyond the training data.
• Overfitting can be detected by evaluating the model's
performance on unseen data using metrics like accuracy.
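A minimal sketch of detecting overfitting by holding out data, using simulated data and numpy polynomial fits: a needlessly flexible model scores well on the training points but much worse on the unseen ones.

```python
import numpy as np

rng = np.random.default_rng(11)

# Noisy data from a simple underlying process.
x = rng.uniform(0, 1, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=40)

# Hold out part of the data: fit on the training set, evaluate on unseen data.
train, test = np.arange(30), np.arange(30, 40)

def mse(degree):
    coeffs = np.polyfit(x[train], y[train], degree)
    pred_train = np.polyval(coeffs, x[train])
    pred_test = np.polyval(coeffs, x[test])
    return np.mean((y[train] - pred_train) ** 2), np.mean((y[test] - pred_test) ** 2)

# A high-degree polynomial fits the training data almost perfectly but does
# much worse on the held-out data than a simple line: a symptom of overfitting.
print("degree 1  (train MSE, test MSE):", mse(1))
print("degree 15 (train MSE, test MSE):", mse(15))
```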
