FDSA Unit 1
COMPILED BY
DEAN-ACADEMICS PRINCIPAL
SEMESTER IV
23ADE401 DATA SCIENCE AND ANALYTICS
L T P C
3 0 2 4
OBJECTIVES
To understand the techniques and processes of data science
Need for data science – benefits and uses – facets of data – data science process – setting the research goal
– retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build the models
– presenting and building applications.
Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient – regression – regression line – least squares
regression line – standard error of estimate – interpretation of r2 – multiple regression equations –
regression toward the mean.
UNIT IV T-TEST 9
t-test for one sample – sampling distribution of t – t-test procedure – degrees of freedom – estimating
the standard error – case studies – t-test for two independent samples – statistical hypotheses –
sampling distribution – test procedure – p-value – statistical significance – estimating effect size –
t-test for two related samples
F-test – ANOVA – estimating effect size – multiple comparisons – case studies – analysis of variance
with repeated measures – two-factor experiments – three F-tests – two-factor ANOVA – other types of
ANOVA – introduction to chi-square tests
TOTAL:45+15=60 PERIODS
LABORATORY PART
LIST OF EXPERIMENTS
(All Experiments to be conducted)
REFERENCE(S) :
1) Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
2) Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”, CRC Press, 2022.
E-RESOURCES:
1) https://nptel.ac.in/courses/108/105/106105219/ (Introduction to Data Science)
2) https://nptel.ac.in/courses/108/102/106102132/ (time series)
CO    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
1      1    1    3    1    3    -    -    -    3    2     2     2     3     1     2
2      2    1    3    2    1    -    -    -    2    1     1     3     3     3     2
3      3    3    1    2    2    -    -    -    3    2     1     2     3     1     3
4      3    1    2    2    2    -    -    -    1    2     1     3     3     1     1
5      1    1    2    3    2    -    -    -    3    2     1     2     3     3     3
6      -    -    -    -    -    -    -    -    -    -     -     -     -     -     -
AVG    2    1    2    2    2    0    0    0    2    2     1     2     3     2     2
1- Low 2-Medium 3-High ‘-‘ – No Correlation
SENGUNTHAR ENGINEERING COLLEGE (AUTONOMOUS), TIRUCHENGODE-637 205
LECTURE PLAN
1. Introducing Data Science – David Cielen, Arno D. B. Meysman (T1)
2. Think Stats: Exploratory Data Analysis in Python – Allen B. Downey (R1)
UNIT I
S.No | Topic | Periods | Reference | Teaching Aid
2 | Benefits and uses | 1 | T1-Ch1.2, R1-Ch1.2 | Chalk/Talk
3 | Facets of data | 1 | T1-Ch1.3, R1-Ch1.2 | Chalk/Talk
4 | Data science process | 1 | T1-Ch1.4 | Chalk/Talk
5 | Setting the research goal | 1 | T1-Ch1.5 | Chalk/Talk
6 | Retrieving data | 1 | T1-Ch1.6 | Chalk/Talk
7 | Cleansing, integrating, transforming data | 1 | T1-Ch1.7 | PPT
UNIT II
PROCESS MANAGEMENT
3 | More about z scores, correlation, scatter plots | 2 | T1-Ch2.3, R1-Ch2.4 | Chalk/Talk
4 | Correlation coefficient for quantitative data, computational formula for correlation coefficient | 1 | T1-Ch2.4, R1-Ch2.5 | Chalk/Talk
5 | Regression – regression line – least squares regression line | 1 | T1-Ch2.5 | PPT
7 | Regression toward the mean | 1 | T1-Ch2.8 | Chalk/Talk
UNIT III
INFERENTIAL STATISTICS
1 | Populations, samples, random sampling | 1 | T1-Ch3.1 | Chalk/Talk
2 | Probability and statistics, sampling distribution, creating a sampling distribution | 1 | T1-Ch3.2, R1-Ch3.3 | Chalk/Talk
3 | Mean of all sample means, standard error of the mean, other sampling distributions | 1 | T1-Ch3.3 | Chalk/Talk
4 | Hypothesis testing | 1 | T1-Ch3.4, R1-Ch3.5 | Chalk/Talk
5 | Z-test, z-test procedure | 1 | T1-Ch3.5, 3.6 | PPT
6 | Statement of the problem | 1 | T1-Ch3.7 | PPT
7 | Null hypothesis, alternate hypotheses | 1 | T1-Ch3.8 | Chalk/Talk
8 | Decision rule – calculations – decisions | 1 | T1-Ch3.10 | Chalk/Talk
9 | Interpretations | 1 | T1-Ch3.11 | Chalk/Talk
UNIT IV
T-TEST
2 | T-test procedure, degrees of freedom, estimating the standard error | 1 | T1-Ch4.3, R1-Ch4.3 | Chalk/Talk
3 | Case studies: t-test for two independent samples, statistical hypotheses | 2 | T1-Ch4.4, R1-Ch4.4 | Chalk/Talk
3 | Case studies: analysis of variance with repeated measures, two-factor experiments | 1 | T1-Ch5.6, R1-Ch5.7 | Chalk/Talk
4 | Three F-tests | 1 | T1-Ch5.7 | Chalk/Talk
5 | Two-factor ANOVA, other types of ANOVA | 2 | T1-Ch5.9, 5.10 | Chalk/Talk
6 | ANOVA | 1 | T1-Ch5.11 | PPT
7 | Introduction to chi-square tests | 1 | T1-Ch5.13 | Chalk/Talk
Total periods: 45
Revision: 5
Total: 50
UNIT – I
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
Data Science is about finding patterns in data through analysis and making future predictions.
By using Data Science, companies are able to make better decisions, perform predictive analysis, and discover patterns in their data.
Data Science can be applied in nearly every part of a business where data is available.
Examples are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
Human resource professionals use people analytics and text mining to screen candidates, monitor
the mood of employees, and study informal networks among coworkers.
People analytics is the central theme in the book Moneyball: The Art of Winning an Unfair Game.
In the book (and movie) we saw that the traditional scouting process for American baseball was
random, and replacing it with correlated signals changed everything. Relying on statistics allowed
them to hire the right players and pit them against the opponents where they would have the
biggest advantage.
Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services.
At the time of writing, at least 50% of trades worldwide are performed automatically by machines,
based on algorithms developed by quants (as data scientists who work on trading algorithms are often called).
Governmental organizations are also aware of data’s value. Many governmental organizations
not only rely on internal data scientists to discover valuable information, but also share their data
with the public. You can use this data to gain insights or build data-driven applications. Data.gov is
but one example; it’s the home of the US Government’s open data
1.3 Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require different tools and techniques.
One purpose of data science is to structure data, making it interpretable and easy to work with.
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record; it is often easy to store in tables within databases or Excel files.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying.
A human-written email is a good example: although email contains structured elements such as the sender, title, and body, the message text itself is free-form and context-dependent.
Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don’t generalize well to other domains.
Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text. This
shouldn’t be a surprise though: humans struggle with natural language as well.
Have two people listen to the same conversation. Will they get the same meaning? The meaning of
the same words can vary when coming from someone upset or joyous.
Machine-generated data
Machine-generated data is becoming a major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost &
Sullivan to refer to the integration of complex physical machinery with networked sensors and
software) will be approximately $540 billion in 2020. IDC (International Data Corporation) has
estimated there will be 26 times more connected things than people in 2020.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Streaming data
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data store in
a batch.
Although this isn’t really a different type of data, we treat it here as such because you need to
adapt your process to deal with this type of information.
Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
1.4 DATA SCIENCE PROCESS
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any
data science project, you need to determine the basic requirements, priorities, and project budget.
In this phase, we determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the business problem and formulate initial hypotheses.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform
the following tasks:
Data cleaning
Data Reduction
Data integration
Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we determine the various methods and techniques to establish
the relations between input variables. We apply exploratory data analysis (EDA), using various
statistical formulas and visualization tools, to understand the relations between variables and to see what
the data can tell us. Common tools used for model planning are SQL analysis services, R, and SAS/ACCESS.
4. Model-building: In this phase, the process of model building starts. We create datasets for training
and testing purposes. We apply different techniques such as association, classification, and clustering to
build the model.
5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code,
and technical documents. This phase provides you a clear overview of complete project performance and
other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the initial
phase. We communicate the findings and the final result to the business team.
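As an illustration of how these phases fit together in code, here is a minimal Python sketch; it assumes pandas and scikit-learn are available, and the synthetic dataset and the choice of a random forest classifier are illustrative rather than prescribed by the process.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Discovery happens on paper; a synthetic dataset stands in for business data
#    (the feature names f0..f4 and the binary target are hypothetical).
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["target"] = y

# 2. Data preparation: remove duplicates and missing values
df = df.drop_duplicates().dropna()

# 3. Model planning: a quick exploratory look at the variables
print(df.describe())
print(df.corr()["target"].round(2))

# 4. Model building: split into training and testing sets, then fit a model
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 5-6. Operationalize and communicate: report performance to the business team
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))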
Finance industries have always faced fraud and the risk of losses, and data science helps address
both problems.
Most finance companies look for data scientists to help them avoid risk and losses while increasing
customer satisfaction.
A project starts by understanding the what, the why, and the how of your project.
What does the company expect you to do? And why does management place such a value on your
research?
Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone
detected? Answering these three questions (what, why, how) is the goal of the first phase, so that
everybody knows what to do and can agree on the best course of action.
An essential outcome is the research goal that states the purpose of your assignment in a
clear and focused manner.
Understanding the business goals and context is critical for project success.
Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your
research is going to change the business, and understand how they’ll use your results.
Nothing is more frustrating than spending months researching something until you have that one
moment of brilliance and solve the problem, but when you report your findings back to the organization,
everyone immediately realizes that you misunderstood their question. Don’t skim over this phase lightly.
Many data scientists fail here: despite their mathematical wit and scientific brilliance, they never seem to
grasp the business goals and context.
A project charter requires teamwork, and your input covers at least the following:
A timeline
Your client can use this information to make an estimation of the project costs and the data and people
required for your project to become a success.
Sometimes you need to go into the field and design a data collection process yourself, but most of
the time you won’t be involved in this step
Many companies will have already collected and stored the data for you, and what they don’t have
can often be bought from third parties.
Don’t be afraid to look outside your organization for data, because more and more organizations are
making even high-quality data freely available for public and commercial use.
FIG 1.7: Retrieving Data
Data can be stored in many forms, ranging from simple text files to tables in a database.
This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs
polishing to be of any use to you.
Your first act should be to assess the relevance and quality of the data that’s readily available within
your company.
Most companies have a program for maintaining key data, so much of the cleaning work may
already be done.
This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
The primary goal of a database is data storage, while a data warehouse is designed for reading and
analyzing that data. A data mart is a subset of the data warehouse and geared toward serving a
specific business unit.
While data warehouses and data marts are home to preprocessed data, data lakes contain data in
its natural or raw format.
But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.
Finding data even within your own company can sometimes be a challenge.
As companies grow, their data becomes scattered around many places. Knowledge of the data may
be dispersed as people change positions and leave the company.
Documentation and metadata aren’t always the top priority of a delivery manager, so it’s possible
you’ll need to develop some Sherlock Holmes–like skills to find all the lost bits.
Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls.
These “walls” are mandatory and well-regulated for customer data in most countries.
This is for good reasons, too; imagine everybody in a credit card company having access to your
spending habits. Getting access to the data may take time and involve company politics.
If data isn’t available inside your organization, look outside your organization’s walls.
For instance, Nielsen and GFK are well known for this in the retail industry.
Other companies provide data so that you, in turn, can enrich their services and ecosystem. Such is
the case with Twitter, LinkedIn, and Facebook.
Although data is considered an asset more valuable than oil by certain companies, more and more
governments and organizations share their data for free with the world.
The information they share covers a broad range of topics such as the number of accidents or
amount of drug abuse in a certain region and its demographics.
This data is helpful when you want to enrich proprietary data but also convenient when training your
data science skills at home. Table 1.5.1 shows only a small selection from the growing number of
open-data providers.
Table 1. 5.1. A list of open-data providers that should get you started
Freebase.org – An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
Expect to spend a good portion of your project time doing data correction and cleansing, sometimes
up to 80%.
The retrieval of data is the first time you’ll inspect the data in the data science process.
Most of the errors you’ll encounter during the data-gathering phase are easy to spot, but being too
careless will make you spend many hours solving data issues that could have been prevented
during data import.
You’ll investigate the data during the import, data preparation, and exploratory phases.
During data retrieval, you check to see if the data is equal to the data in the source document and
look to see if you have the right data types.
This shouldn’t take too long; when you have enough evidence that the data is similar to the data you
find in the source document, you stop.
If you did a good job during the previous phase, the errors you find now are also present in the
source document.
The focus is on the content of the variables: you want to get rid of typos and other data entry errors
and bring the data to a common standard among the data sets.
For example, you might correct USQ to USA and United Kingdom to UK.
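A small pandas sketch of these retrieval-time checks; the sample records (including the USQ typo) are hypothetical.

import pandas as pd

# Hypothetical raw data as it might arrive from a source system
raw = pd.DataFrame({"country": ["USA", "USQ", "United Kingdom", "UK"],
                    "amount": ["10", "12", "9", "11"]})

# Check that the data types match the source document
print(raw.dtypes)                          # "amount" arrived as text, not a number
raw["amount"] = pd.to_numeric(raw["amount"])

# Bring values to a common standard: correct typos and unify spellings
raw["country"] = raw["country"].replace({"USQ": "USA", "United Kingdom": "UK"})
print(raw)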
During the exploratory phase your focus shifts to what you can learn from the data.
Now you assume the data to be clean and look at the statistical properties such as distributions,
correlations, and outliers.
For instance, when you discover outliers in the exploratory phase, they can point to a data entry
error.
Now that you understand how the quality of the data is improved during the process, we’ll look
deeper into the data preparation step.
If data is incorrect, outcomes and algorithms are unreliable, even though they may look
correct. There is no one absolute way to prescribe the exact steps in the data cleaning
process because the processes will vary from dataset to dataset.
But it is crucial to establish a template for your data cleaning process so you know you are
doing it the right way every time.
Data cleaning is the process that removes data that does not belong in your dataset. Data
transformation is the process of converting data from one format or structure into another.
Transformation processes are also referred to as data wrangling or data munging: transforming
and mapping data from one "raw" form into another format for warehousing and analysis. This
section focuses on the process of cleaning that data.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.
FIG 1.8: Data Science Process
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations happen most often during data collection. When you combine data
sets from multiple places, scrape data, or receive data from clients or multiple departments, there are
opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this
process. Irrelevant observations are observations that do not fit into the specific problem you are trying
to analyze. For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make analysis more
efficient and minimize distraction from your primary target, as well as create a more manageable and more
performant dataset.
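A minimal pandas sketch of this step; the customer records and the 1981-1996 millennial birth-year range are illustrative assumptions.

import pandas as pd

# Hypothetical customer records with a duplicate and an out-of-scope generation
customers = pd.DataFrame({"name": ["Asha", "Asha", "Ravi", "Meena"],
                          "birth_year": [1992, 1992, 1958, 1996]})

# Remove duplicate observations created during collection or merging
customers = customers.drop_duplicates()

# Remove irrelevant observations, e.g. keep only millennial customers
millennials = customers[customers["birth_year"].between(1981, 1996)]
print(millennials)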
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For
example, you may find “N/A” and “Not Applicable” both appear, but they should be analyzed as the same
category.
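A short pandas sketch of fixing such structural errors; the survey labels are hypothetical.

import pandas as pd

# Hypothetical responses with inconsistent capitalization and equivalent labels
df = pd.DataFrame({"status": ["N/A", "Not Applicable", "employed", "Employed "]})

# Normalize case and whitespace, then map equivalent labels to one category
df["status"] = df["status"].str.strip().str.lower()
df["status"] = df["status"].replace({"n/a": "not applicable"})
print(df["status"].value_counts())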
Step 3: Filter unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the data
you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so
will help the performance of the data you are working with. However, sometimes it is the appearance of an
outlier that will prove a theory you are working on. Remember: just because an outlier exists doesn’t mean
it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be
irrelevant for analysis or is a mistake, consider removing it.
You can’t ignore missing data because many algorithms will not accept missing values. There are a
couple of ways to deal with missing data. Neither is optimal, but both can be considered.
1. As a first option, you can drop observations that have missing values, but doing this will drop or lose
information, so be mindful of this before you remove it.
2. As a second option, you can impute missing values based on other observations; again, there is an
opportunity to lose integrity of the data because you may be operating from assumptions and not
actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null values.
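The sketch below illustrates options 1 and 2 with pandas; the small data frame and the use of the column median for imputation are illustrative assumptions.

import pandas as pd
import numpy as np

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 58000]})

# Option 1: drop observations that contain missing values (information is lost)
dropped = df.dropna()

# Option 2: impute missing values from other observations (here, the column median)
imputed = df.fillna(df.median(numeric_only=True))

print(dropped)
print(imputed)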
At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation:
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize
your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality data
in your organization. To do this, you should document the tools you might use to create this culture and
what data quality means to you.
Determining the quality of data requires an examination of its characteristics, then weighing those
characteristics according to what is most important to your organization and the application(s) for which
they will be used.
1. Validity. The degree to which your data conforms to defined business rules or constraints.
3. Consistency. Ensure your data is consistent within the same dataset and/or across multiple data
sets.
4. Uniformity. The degree to which the data is specified using the same unit of measure.
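A brief pandas sketch of how such characteristics might be checked; the order records and the business rules are hypothetical.

import pandas as pd

# Hypothetical order records used to illustrate the checks
orders = pd.DataFrame({"order_id": [1, 2, 2, 4],
                       "quantity": [3, -1, 5, 2],
                       "weight_unit": ["kg", "kg", "lb", "kg"]})

# Validity: quantities must conform to the business rule "greater than zero"
print("Invalid rows:", (orders["quantity"] <= 0).sum())

# Consistency: order IDs should be unique within the dataset
print("Duplicate IDs:", orders["order_id"].duplicated().sum())

# Uniformity: all weights should be expressed in the same unit of measure
print("Units used:", orders["weight_unit"].unique())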
Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:
Ability to map the different functions and what your data is intended to do.
Monitoring errors and better reporting to see where errors are coming from, making it easier to fix
incorrect or corrupt data for future applications.
Using tools for data cleaning will make for more efficient business practices and quicker decision-
making.
1.6.2 OUTLIERS
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum values.
FIG 1.9: Outliers
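A minimal pandas sketch of this idea; the sample values are invented, and the 1.5 * IQR rule used alongside the minimum/maximum check is one common convention, not the only one.

import pandas as pd

# Hypothetical measurements containing one suspicious value
values = pd.Series([12, 14, 13, 15, 11, 14, 98])

# A table with the minimum and maximum values is the easiest first check
print(values.describe()[["min", "max"]])

# An IQR rule (or a box plot of the same series) flags the distant observation
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])   # 98 stands out

# values.plot.box() would show the same point graphically (requires matplotlib)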
Data cleaning is useful because you need to sanitize data while gathering it. Typical causes of data
inconsistencies and errors include duplicate records, data-entry mistakes, and inconsistent naming or formats.
Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with Data and
extracting useful insights. Data Scientists sift through Unstructured Data to find patterns and infer
relationships between Data elements. Data Scientists use Statistics and Visualization tools to summarise
Central Measurements and variability to perform EDA.
If Data skewness persists, appropriate transformations are used to scale the distribution around its
mean. When Datasets have a lot of features, exploring them can be difficult. As a result, to reduce the
complexity of Model inputs, Feature Selection is used to rank them in order of significance in Model Building
for enhanced efficiency. Using Business Intelligence tools like Tableau, Micro Strategy, etc. can be quite
beneficial in this step. This step is crucial in Data Science Modeling as the Metrics are studied carefully for
validation of Data Outcomes.
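A short EDA sketch with pandas and NumPy; the synthetic income column and the log transform are illustrative assumptions.

import pandas as pd
import numpy as np

# Hypothetical right-skewed feature, e.g. customer income
df = pd.DataFrame({"income": np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000)})

# Summarize central measurements and variability
print(df["income"].describe())
print("Skewness:", round(df["income"].skew(), 2))

# If skewness persists, apply a transformation to scale the distribution
df["log_income"] = np.log(df["income"])
print("Skewness after log transform:", round(df["log_income"].skew(), 2))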
Feature Selection is the process of identifying and selecting the features that contribute the most to
the prediction variable or output that you are interested in, either automatically or manually.
The presence of irrelevant characteristics in your Data can reduce the Model accuracy and cause your
Model to train based on irrelevant features. In other words, only if the features are strong enough will the
Machine Learning Algorithm give good outcomes. Two types of characteristics must be addressed: irrelevant
features and redundant features.
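A minimal feature selection sketch with scikit-learn; the data is synthetic, and SelectKBest with an ANOVA F-score is just one of many possible ranking methods.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which only 3 are informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Rank features by their univariate relationship with the target and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("Scores per feature:", selector.scores_.round(1))
print("Selected feature indices:", selector.get_support(indices=True))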
This is one of the most crucial processes in Data Science Modeling, as the Machine Learning
Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from; the Model is
selected based on the problem. There are three types of Machine Learning methods that are incorporated:
1) Supervised Learning
It is based on labelled results of a previous operation that is related to the existing business operation.
Based on previous patterns, Supervised Learning aids in the prediction of an outcome. Some of the
Supervised Learning Algorithms are (a brief sketch follows the list):
Linear Regression
Random Forest
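A brief scikit-learn sketch of supervised learning with the two algorithms listed above; the regression data is synthetic, and the split and hyperparameters are illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic labelled data standing in for records of a previous operation
X, y = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit two supervised learners on the labelled training data
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

# Predict outcomes for unseen data and compare the fit (R^2 score)
print("Linear regression R^2:", round(linear.score(X_test, y_test), 3))
print("Random forest R^2:", round(forest.score(X_test, y_test), 3))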
2) Unsupervised Learning
This form of learning has no pre-existing outcome or pattern. Instead, it concentrates on examining
the interactions and connections between the presently available Data points. Some of the Unsupervised
Learning Algorithms are (a brief sketch follows the list):
Hierarchical Clustering
Anomaly Detection
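A brief scikit-learn sketch of unsupervised learning; the points are synthetic and unlabelled, and agglomerative clustering plus Isolation Forest stand in for the hierarchical clustering and anomaly detection listed above.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import IsolationForest

# Unlabelled points: no pre-existing outcome, only the data itself
X, _ = make_blobs(n_samples=200, centers=3, random_state=2)

# Hierarchical (agglomerative) clustering groups similar points together
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("Cluster sizes:", np.bincount(labels))

# Anomaly detection flags points that do not fit the overall structure
flags = IsolationForest(random_state=2).fit_predict(X)   # -1 marks anomalies
print("Anomalies found:", (flags == -1).sum())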
3) Reinforcement Learning
It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts with the real
world. In simple terms, it is a mechanism by which a system learns from its mistakes and improves over
time. Some of the Reinforcement Learning Algorithms are (a minimal sketch follows the list):
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
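A minimal tabular Q-Learning sketch in Python; the five-state toy environment, its rewards, and the learning parameters are all invented for illustration.

import numpy as np

# Toy environment: 5 states in a row; taking action 1 (right) in the last state
# earns a reward of 1, every other step earns 0. Action 0 moves left.
n_states, n_actions = 5, 2
q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    for _ in range(20):
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(q[state].argmax())
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if (action == 1 and state == n_states - 1) else 0.0
        # Learn from the error between the current estimate and the observed return
        q[state, action] += alpha * (reward + gamma * q[next_state].max() - q[state, action])
        state = next_state

print(np.round(q, 2))   # the "move right" column should dominate in every state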
This is the next phase, and it’s crucial to check that our Data Science Modeling efforts meet the
expectations. The Data Model is applied to the Test Data to check whether it’s accurate and captures all desirable
features. You can further test your Data Model to identify any adjustments that might be required to
enhance the performance and achieve the desired results. If the required precision is not achieved, you can
go back to Step 5 (Machine Learning Algorithms), choose an alternate Data Model, and then test the model
again.
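A short scikit-learn sketch of this evaluation step; the data is synthetic, and the accuracy metric with five-fold cross-validation is an illustrative choice.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=6, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)
model = RandomForestClassifier(random_state=3).fit(X_train, y_train)

# Apply the model to held-out test data to check that it generalizes
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Cross-validation gives a more stable estimate before deciding to deploy
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))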
The Model which provides the best result based on test findings is completed and deployed in the
production environment whenever the desired result is achieved through proper testing as per the business
needs. This concludes the process of Data Science Modeling.
Applications of Data Science
Every industry benefits from the experience of Data Science companies, but the most common areas
where Data Science techniques are employed are the following:
Banking and Finance: The banking industry can benefit from Data Science in many aspects. Fraud
Detection is a well-known application in this field that assists banks in reducing non-performing
assets.
Healthcare: Health concerns are being monitored and prevented using Wearable Data. The Data
acquired from the body can be used in the medical field to prevent future calamities.
Marketing: Marketing offers a lot of potential, such as a more effective price strategy. Pricing based
on Data Science can help companies like Uber and E-Commerce businesses enhance their profits.
Government Policies: Based on Data gathered through surveys and other official sources, the
government can use Data Science to better build policies that cater to the interests and wishes of
the people.
PART-A Questions
PART-B
1. Explain the use of Data Science tools and techniques in practical scenarios.