
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

23ADE401 DATA SCIENCE AND ANALYTICS


FOR IV SEMESTER AI&DS STUDENTS

COMPILED BY

S.SANTHI PRIYA , AP /AI&DS

FACULTY INCHARGE ACADEMIC COORDINATOR HoD

DEAN-ACADEMICS PRINCIPAL

SEMESTER IV
23ADE401 DATA SCIENCE AND ANALYTICS
L T P C
3 0 2 4
OBJECTIVES
 To understand the techniques and processes of data science

 To understand skills in data preparatory and preprocessing steps

 To understand the mathematical skills in statistics

 To understand inferential data analytics

 To study predictive models from data

 To learn about various data analytics techniques

UNIT I INTRODUCTION TO DATA SCIENCE 9

Need for data science – benefits and uses – facets of data – data science process – setting the research goal
– retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build the models
– presenting and building applications.

UNIT II PROCESS MANAGEMENT 9

Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient – regression – regression line – least squares
regression line – standard error of estimate – interpretation of r2 – multiple regression equations –
regression toward the mean .

UNIT III INFERENTIAL STATISTICS 9

Populations – samples – random sampling – probability and statistics Sampling distribution –


creating a sampling distribution – mean of all sample means – standard error of the mean – other
sampling distributions Hypothesis testing – z-test – z-test procedure – statement of the problem –
null hypothesis – alternate hypotheses – decision rule – calculations – decisions - interpretations .

UNIT IV T-TEST 9

t-test for one sample – sampling distribution of t – t-test procedure – degrees of freedom – estimating
the standard error – case studies t-test for two independent samples – statistical hypotheses –
sampling distribution – test procedure – p-value – statistical significance – estimating effect size – t-
test for two related samples

UNIT V ANALYSIS OF VARIANCE 9

F-test – ANOVA – estimating effect size – multiple comparisons – case studies Analysis of variance
with repeated measures Two-factor experiments – three f-tests – two-factor ANOVA – other types of
ANOVA Introduction to chi-square tests

TOTAL:45+15=60 PERIODS
LABORATORY PART
LIST OF EXPERIMENTS
(All Experiments to be conducted)

1. Working with Numpy arrays

2. Working with Pandas data frames
3. Basic plots using Matplotlib
4. Frequency distributions, Averages, Variability
5. Normal curves
6. Correlation and scatter plots
7. Correlation coefficient
8. Regression
9. Z-test case study
10. T-test case studies
OUTCOMES

Upon completion of the course, the students will be able to:


 Explain the data analytics pipeline
 Represent the useful information using mathematical skills
 Perform statistical inferences from data
 Analyze the variance in the data
 Build models for predictive analytics
 Develop various applications that use AI
TEXT BOOKS:
1) David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (first two chapters for Unit I).
2) Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.

REFERENCE(S) :
1) Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
2) Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”, CRC
Press, 2022

E-RESOURCES:
1) https://nptel.ac.in/courses/108/105/106105219/ (Introduction to data science)
2) https://nptel.ac.in/courses/108/102/106102132/ (time series)

Mapping of COs-POs & PSOs

CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3

1    1 1 3 1 3 - - - 3 2 2 2   3 1 2

2    2 1 3 2 1 - - - 2 1 1 3   3 3 2

3    3 3 1 2 2 - - - 3 2 1 2   3 1 3

4    3 1 2 2 2 - - - 1 2 1 3   3 1 1

5    1 1 2 3 2 - - - 3 2 1 2   3 3 3

6    - - - - - - - - - - - -   - - -

AVG  2 1 2 2 2 0 0 0 2 2 1 2   3 2 2

1 - Low, 2 - Medium, 3 - High, '-' - No Correlation
SENGUNTHAR ENGINEERING COLLEGE (AUTONOMOUS), TIRUCHENGODE - 637 205

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

LECTURE PLAN

Subject Name : DATA SCIENCE AND ANALYTICS

Subject Code : 23ADE401

Name of Faculty/Designation : S.SANTHIPRIYA AP/AI&DS

Course : IV SEMESTER B.TECH - AI&DS

Academic year : 2024 - 2025

RECOMMENDED TEXT BOOKS / REFERENCE BOOKS

S.No | Title of the Book | Author | Reference
1 | Introducing Data Science | David Cielen, Arno D. B. Meysman | T1
2 | Think Stats: Exploratory Data Analysis in Python | Allen B. Downey | R1
UNIT I

INTRODUCTION TO DATA SCIENCE


S.No. | Topics to be Covered | Periods | Reference | Teaching Method
1 | Need for data science | 1 | T1-Ch1.1, R1-Ch1.1 | Chalk/Talk
2 | Benefits and uses | 1 | T1-Ch1.2, R1-Ch1.2 | Chalk/Talk
3 | Facets of data | 1 | T1-Ch1.3, R1-Ch1.2 | Chalk/Talk
4 | Data science process | 1 | T1-Ch1.4 | Chalk/Talk
5 | Setting the research goal | 1 | T1-Ch1.5 | Chalk/Talk
6 | Retrieving data | 1 | T1-Ch1.6 | Chalk/Talk
7 | Cleansing, integrating, transforming data | 1 | T1-Ch1.7 | PPT
8 | Exploratory data analysis | 1 | T1-Ch1.7.3 | Chalk/Talk
9 | Build the models, presenting and building applications | 1 | T1-Ch1.8 | Chalk/Talk

UNIT II
PROCESS MANAGEMENT

1 | Normal distributions, z scores | 1 | T1-Ch2.1 | Chalk/Talk
2 | Normal curve problems, finding proportions, finding scores | 2 | T1-Ch2.2, R1-Ch2.3 | Chalk/Talk
3 | More about z scores, correlation, scatter plots | 2 | T1-Ch2.3, R1-Ch2.4 | Chalk/Talk
4 | Correlation coefficient for quantitative data, computational formula for correlation coefficient | 1 | T1-Ch2.4, R1-Ch2.5 | Chalk/Talk
5 | Regression, regression line, least squares regression line | 1 | T1-Ch2.5 | PPT
6 | Standard error of estimate, interpretation of r2, multiple regression equations | 1 | T1-Ch2.7 | Chalk/Talk
7 | Regression toward the mean | 1 | T1-Ch2.8 | Chalk/Talk
UNIT III
INFERENTIAL STATISTICS
1 | Populations, samples, random sampling | 1 | T1-Ch3.1 | Chalk/Talk
2 | Probability and statistics, sampling distribution, creating a sampling distribution | 1 | T1-Ch3.2, R1-Ch3.3 | Chalk/Talk
3 | Mean of all sample means, standard error of the mean, other sampling distributions | 1 | T1-Ch3.3 | Chalk/Talk
4 | Hypothesis testing | 1 | T1-Ch3.4, R1-Ch3.5 | Chalk/Talk
5 | Z-test, z-test procedure | 1 | T1-Ch3.5, 3.6 | PPT
6 | Statement of the problem | 1 | T1-Ch3.7 | PPT
7 | Null hypothesis, alternate hypotheses | 1 | T1-Ch3.8 | Chalk/Talk
8 | Decision rule, calculations, decisions | 1 | T1-Ch3.10 | Chalk/Talk
9 | Interpretations | 1 | T1-Ch3.11 | Chalk/Talk

UNIT IV
T-test

1 | T-test for one sample, sampling distribution of t | 1 | T1-Ch4.1, 4.2 | Chalk/Talk
2 | T-test procedure, degrees of freedom, estimating the standard error | 1 | T1-Ch4.3, R1-Ch4.3 | Chalk/Talk
3 | Case studies: t-test for two independent samples, statistical hypotheses | 2 | T1-Ch4.4, R1-Ch4.4 | Chalk/Talk
4 | Sampling distribution, test procedure | 1 | T1-Ch4.5, 4.6 | Chalk/Talk
5 | p-value, statistical significance | 1 | T1-Ch4.7, 4.9 | Chalk/Talk
6 | Estimating effect size | 1 | T1-Ch4.10, R1-Ch4.8 | PPT
7 | T-test for two related samples | 2 | T1-Ch4.11 | Chalk/Talk
UNIT V
ANALYSIS OF VARIANCE
1 | F-test, ANOVA | 1 | T1-Ch5.2, 5.3 | Chalk/Talk
2 | Estimating effect size, multiple comparisons | 2 | T1-Ch5.4, 5.5 | Chalk/Talk
3 | Case studies: analysis of variance with repeated measures, two-factor experiments | 1 | T1-Ch5.6, R1-Ch5.7 | Chalk/Talk
4 | Three F-tests | 1 | T1-Ch5.7 | Chalk/Talk
5 | Two-factor ANOVA | 2 | T1-Ch5.9, 5.10 | Chalk/Talk
6 | Other types of ANOVA | 1 | T1-Ch5.11 | PPT
7 | Introduction to chi-square tests | 1 | T1-Ch5.13 | Chalk/Talk

Total 45
Revision 5
Total 50
UNIT – I

INTRODUCTION TO DATA SCIENCE


Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build
the models – presenting and building applications.

1.1 NEED FOR DATA SCIENCE

1.1.1 INTRODUCTION TO DATA SCIENCE

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.

 Data Science is about data gathering, analysis and decision-making.

 Data Science is about finding patterns in data, through analysis, and making future predictions.
By using Data Science, companies are able to make:

 Better decisions (should we choose A or B)


 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information, in the data)

1.1.2 NEED FOR DATA SCIENCE


Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.

Examples of where Data Science is needed:

 For route planning: To discover the best routes to ship


 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available.
Examples are:
 Consumer goods
 Stock markets
 Industry
 Politics
 Logistics companies
 E-commerce

1.2 Benefits and uses


 Commercial companies in almost every industry use data science and big data to gain insights into
their customers, processes, staff, competition, and products. Many companies use data science to
offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their
offerings.
 A good example of this is Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet.
 MaxPoint (http://maxpoint.com/us) is another example of real-time personalized advertising.

 Human resource professionals use people analytics and text mining to screen candidates, monitor
the mood of employees, and study informal networks among coworkers.

 People analytics is the central theme in the book Moneyball: The Art of Winning an Unfair
Game.

 In the book (and movie) we saw that the traditional scouting process for American baseball was
random, and replacing it with correlated signals changed everything. Relying on statistics allowed
them to hire the right players and pit them against the opponents where they would have the
biggest advantage.

 Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services.

 At the time of writing this book, at least 50% of trades worldwide are performed automatically by
machines based on algorithms developed by quants (as data scientists who work on trading algorithms are often called).

 Governmental organizations are also aware of data’s value. Many governmental organizations
not only rely on internal data scientists to discover valuable information, but also share their data
with the public. You can use this data to gain insights or build data-driven applications. Data.gov is
but one example; it’s the home of the US Government’s open data
1.3 Facets of data
In data science and big data you’ll come across many different types of data, and each of them

tends to require different tools and techniques.

The main categories of data are these:


1) Structured
2) Unstructured
3) Natural language
4) Machine-generated
5) Graph-based
6) Audio, video

1.3.1 What is Data?

 Data is a collection of information.

 One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:

 Structured data
 Unstructured data

 Structured Data

 Structured data is organized and easier to work with.

FIG 1.1 : Structured Data


 Unstructured Data

 Unstructured data is data that isn’t easy to fit into a data model because the content is context-
specific or varying.

 One example of unstructured data is your regular email.

 Although email contains structured elements such as the sender, title, and body text, the free-form message text itself is hard to fit into a fixed data model.

FIG 1.2 : Unstructured Data
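To make the contrast concrete, here is a minimal sketch (the customer table and email text are invented for illustration) showing structured data in a pandas DataFrame next to unstructured free-form text:

```python
import pandas as pd

# Structured data: fixed fields, one record per row (hypothetical values)
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Chennai", "Salem", "Erode"],
})
print(customers)

# Unstructured data: free-form email text; only some elements (sender, subject) are structured
email = """From: ravi@example.com
Subject: Delivery delay
Hi team, my order hasn't arrived yet and I'd like an update, please."""
print(email)
```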

Natural language

 Natural language is a special type of unstructured data.

 It's challenging to process because it requires knowledge of specific data science techniques and
linguistics.

 The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don't generalize well to other domains.

 Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text. This
shouldn’t be a surprise though: humans struggle with natural language as well.

 It’s ambiguous by nature.

 The concept of meaning itself is questionable here.

 Have two people listen to the same conversation. Will they get the same meaning? The meaning of
the same words can vary when coming from someone upset or joyous.
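As a hedged illustration of why this is hard, the naive keyword-based sentiment scorer below (the word lists are made up; no real NLP library is used) misreads negation, which is exactly the kind of ambiguity dedicated techniques try to handle:

```python
# Naive keyword-based sentiment scoring (illustrative only)
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "terrible", "sad", "hate"}

def naive_sentiment(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(naive_sentiment("I love this great product"))  #  2 -> positive
print(naive_sentiment("Not bad at all"))             # -1 -> wrongly negative: negation is missed
```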
Machine-generated data

 Machine-generated data is information that’s automatically created by a computer, process,


application, or other machine without human intervention.

 Machine-generated data is becoming a major data resource and will continue to do so.

 Wikibon has forecast that the market value of the industrial Internet (a term coined by Frost &
Sullivan to refer to the integration of complex physical machinery with networked sensors and
software) will be approximately $540 billion in 2020. IDC (International Data Corporation) has
estimated there will be 26 times more connected things than people in 2020.

 This network is commonly referred to as the internet of things.

 The analysis of machine data relies on highly scalable tools, due to its high volume and speed.

 Examples of machine data are web server logs, call detail records, network event logs, and
telemetry.

FIG 1.3 : Machine-generated data
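A small sketch of turning machine-generated data into something analyzable: the (invented) web server log lines below are parsed into structured fields with a regular expression:

```python
import re

# Hypothetical web server log lines (common log format style)
logs = [
    '192.168.1.10 - - [10/Jan/2025:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.11 - - [10/Jan/2025:13:55:40 +0530] "POST /login HTTP/1.1" 401 512',
]

pattern = re.compile(r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)')

for line in logs:
    m = pattern.match(line)
    if m:
        ip, timestamp, method, path, status, size = m.groups()
        print(ip, method, path, status, size)
```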

Graph-based or Network Data


 “Graph data” can be a confusing term because any data can be shown in a graph.

 “Graph” in this case points to mathematical graph theory.

 A mathematical structure to model pair-wise relationships between objects.


 Graph or network data is, in short, data that focuses on the relationship or adjacency of
objects.
 The graph structures use nodes, edges, and properties to represent and store graphical data
FIG 1.4 : Graph-based or Network Data
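A minimal sketch of graph data as nodes and edges, using a plain adjacency list; the tiny social network below is invented for illustration:

```python
# Adjacency-list representation of a small, hypothetical social graph
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice"],
    "Dave":  ["Bob"],
}

# A simple relationship query: who are the friends of friends of Alice?
friends = set(graph["Alice"])
friends_of_friends = {fof for f in friends for fof in graph[f]} - friends - {"Alice"}
print(friends_of_friends)  # {'Dave'}
```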

Streaming data

 While streaming data can take almost any of the previous forms, it has an extra property.

 The data flows into the system when an event happens instead of being loaded into a data store in
a batch.

 Although this isn’t really a different type of data, we treat it here as such because you need to
adapt your process to deal with this type of information.

 Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
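A minimal sketch of the streaming idea, using a Python generator to stand in for an event source (the price events are invented); each event is processed as it arrives rather than being loaded in a batch:

```python
import time
import random

def event_stream(n=5):
    """Yield invented events one at a time as they 'happen', instead of loading a batch."""
    for i in range(n):
        time.sleep(0.1)  # pretend we are waiting for the next event
        yield {"event_id": i, "price": round(random.uniform(99, 101), 2)}

running_total = 0.0
for event in event_stream():
    running_total += event["price"]  # process each event as it arrives
    print(event, "running total:", round(running_total, 2))
```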
1.4 DATA SCIENCE PROCESS

The life cycle of data science is illustrated in the diagram below.

FIG 1.5 : Data Science Process


The main phases of data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any
data science project, you need to determine the basic requirements, priorities, and project budget.
In this phase, we need to determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the business problem at a first-hypothesis level.

2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform
the following tasks:

 Data cleaning
 Data Reduction
 Data integration
 Data transformation

After performing all the above tasks, we can easily use this data for our further processes.

3. Model Planning: In this phase, we need to determine the various methods and techniques to establish
the relationships between input variables. We will apply exploratory data analysis (EDA), using various
statistical formulas and visualization tools, to understand the relations between variables and to see what the data
can tell us. Common tools used for model planning are:

 SQL Analysis Services


 R
 SAS
 Python

4. Model-building: In this phase, the process of model building starts. We will create datasets for training
and testing purposes. We will apply different techniques such as association, classification, and clustering to
build the model.

Following are some common Model building tools:

 SAS Enterprise Miner


 WEKA
 SPSS Modeler
 MATLAB

5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code,
and technical documents. This phase provides you with a clear overview of the complete project's performance and
other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the initial
phase. We will communicate the findings and final result to the business team.

Applications of Data Science:


 Image recognition and speech recognition:
 Data science is currently used for image and speech recognition. When you upload an image on
Facebook and start getting suggestions to tag your friends, this automatic tagging suggestion
uses an image recognition algorithm, which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond to your
voice, this is made possible by speech recognition algorithms.
 Gaming world:
 In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports,
Sony, and Nintendo are widely using data science to enhance the user experience.
 Internet search:
 When we want to search for something on the internet, we use different search
engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science
technology to make the search experience better, and you can get a search result within a fraction of
a second.
 Transport:
Transport industries are also using data science technology to create self-driving cars. Self-driving
cars can help reduce the number of road accidents.
 Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is being used for
tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
 Recommendation systems:
 Most companies, such as Amazon, Netflix, Google Play, etc., use data science
technology to create a better user experience with personalized recommendations. For example, when
you search for something on Amazon, you start getting suggestions for similar products;
this is because of data science technology.
 Risk detection:

 Finance industries have always had issues with fraud and the risk of losses, but with the help of data
science, these risks can be reduced.
 Most finance companies look for data scientists to help avoid risk and losses
while increasing customer satisfaction.

1.4.1 Defining research goals and creating a project charter

 A project starts by understanding the what, the why, and the how of your project.
 What does the company expect you to do? And why does management place such a value on your
research?

 Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone
detected? Answering these three questions (what, why, how) is the goal of the first phase, so that
everybody knows what to do and can agree on the best course of action.

FIG 1.6 : Setting The Research Goal

 Spend time understanding the goals and context of your research

 An essential outcome is the research goal that states the purpose of your assignment in a
clear and focused manner.

 Understanding the business goals and context is critical for project success.

 Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your
research is going to change the business, and understand how they’ll use your results.

Nothing is more frustrating than spending months researching something until you have that one
moment of brilliance and solve the problem, but when you report your findings back to the organization,
everyone immediately realizes that you misunderstood their question. Don’t skim over this phase lightly.
Many data scientists fail here: despite their mathematical wit and scientific brilliance, they never seem to
grasp the business goals and context.

1.4.2 Create a project charter


Clients like to know upfront what they’re paying for, so after you have a good understanding of the business
problem, try to get a formal agreement on the deliverables. All this information is best collected in a project
charter. For any significant project this would be mandatory.

A project charter requires teamwork, and your input covers at least the following:

 A clear research goal

 The project mission and context

 How you’re going to perform your analysis

 What resources you expect to use

 Proof that it’s an achievable project, or proof of concepts

 Deliverables and a measure of success

 A timeline

Your client can use this information to make an estimation of the project costs and the data and people
required for your project to become a success.

1.5 RETRIEVING DATA

 The next step in data science is to retrieve the required data .

 Sometimes you need to go into the field and design a data collection process yourself, but most of
the time you won’t be involved in this step

 Many companies will have already collected and stored the data for you, and what they don’t have
can often be bought from third parties.

 Don’t be afraid to look outside your organization for data, because more and more organizations are
making even high-quality data freely available for public and commercial use.
FIG 1.7 : Retrieving Data

 Data can be stored in many forms, ranging from simple text files to tables in a database.

 The objective now is acquiring all the data you need.

 This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs
polishing to be of any use to you.

1.5.1. Start With Data Stored Within The Company

 Your first act should be to assess the relevance and quality of the data that’s readily available within
your company.

 Most companies have a program for maintaining key data, so much of the cleaning work may
already be done.

 This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.

 The primary goal of a database is data storage, while a data warehouse is designed for reading and
analyzing that data. A data mart is a subset of the data warehouse and geared toward serving a
specific business unit.

 While data warehouses and data marts are home to preprocessed data, data lakes contain data in
its natural or raw format.

 But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.

 Finding data even within your own company can sometimes be a challenge.
 As companies grow, their data becomes scattered around many places. Knowledge of the data may
be dispersed as people change positions and leave the company.

 Documentation and metadata aren’t always the top priority of a delivery manager, so it’s possible
you’ll need to develop some Sherlock Holmes–like skills to find all the lost bits.

 Getting access to data is another difficult task.

 Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.

 These policies translate into physical and digital barriers called Chinese walls.

 These “walls” are mandatory and well-regulated for customer data in most countries.

 This is for good reasons, too; imagine everybody in a credit card company having access to your
spending habits. Getting access to the data may take time and involve company politics.

1.5.2. Don’t be afraid to shop around

 If data isn’t available inside your organization, look outside your organization’s walls.

 Many companies specialize in collecting valuable information.

 For instance, Nielsen and GFK are well known for this in the retail industry.

 Other companies provide data so that you, in turn, can enrich their services and ecosystem. Such is
the case with Twitter, LinkedIn, and Facebook.

 Although data is considered an asset more valuable than oil by certain companies, more and more
governments and organizations share their data for free with the world.

 This data can be of excellent quality;

 it depends on the institution that creates and manages it.

 The information they share covers a broad range of topics such as the number of accidents or
amount of drug abuse in a certain region and its demographics.

 This data is helpful when you want to enrich proprietary data but also convenient when training your
data science skills at home. Table 1.5.1 shows only a small selection from the growing number of
open-data providers.
Table 1.5.1. A list of open-data providers that should get you started

Open data site | Description
Data.gov | The home of the US Government's open data
https://open-data.europa.eu/ | The home of the European Commission's open data
Freebase.org | An open database that retrieves its information from sites like Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org | Open data initiative from the World Bank
Aiddata.org | Open data for international development
Open.fda.gov | Open data from the US Food and Drug Administration

1.5.3. Do data quality checks now to prevent problems later.

 Expect to spend a good portion of your project time doing data correction and cleansing, sometimes
up to 80%.

 The retrieval of data is the first time you’ll inspect the data in the data science process.

 Most of the errors you’ll encounter during the data-gathering phase are easy to spot, but being too
careless will make you spend many hours solving data issues that could have been prevented
during data import.

 You’ll investigate the data during the import, data preparation, and exploratory phases.

 The difference is in the goal and the depth of the investigation.

 During data retrieval, you check to see if the data is equal to the data in the source document and
look to see if you have the right data types.

 This shouldn’t take too long; when you have enough evidence that the data is similar to the data you
find in the source document, you stop.

 With data preparation, you do a more elaborate check.

 If you did a good job during the previous phase, the errors you find now are also present in the
source document.

 The focus is on the content of the variables: you want to get rid of typos and other data entry errors
and bring the data to a common standard among the data sets.

 For example, you might correct USQ to USA and United Kingdom to UK.
 During the exploratory phase your focus shifts to what you can learn from the data.

 Now you assume the data to be clean and look at the statistical properties such as distributions,
correlations, and outliers.

 You’ll often iterate over these phases.

 For instance, when you discover outliers in the exploratory phase, they can point to a data entry
error.

 Now that you understand how the quality of the data is improved during the process, we’ll look
deeper into the data preparation step.

1.6 THE STEPS IN DATA CLEANSING


 Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
 When combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled.

 If data is incorrect, outcomes and algorithms are unreliable, even though they may look
correct. There is no one absolute way to prescribe the exact steps in the data cleaning
process because the processes will vary from dataset to dataset.

 But it is crucial to establish a template for your data cleaning process so you know you are
doing it the right way every time.

 Difference between data cleaning and data transformation

Data cleaning is the process that removes data that does not belong in your dataset. Data
transformation is the process of converting data from one format or structure into another.
Transformation processes can also be referred to as data wrangling or data munging: transforming
and mapping data from one "raw" data form into another format for warehousing and analyzing. This
section focuses on the process of cleaning that data.

 Techniques used for data cleaning

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your organization.
FIG 1.8 : Data Science Process

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you combine data
sets from multiple places, scrape data, or receive data from clients or multiple departments, there are
opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this
process. Irrelevant observations are when you notice observations that do not fit into the specific problem
you are trying to analyze. For example, if you want to analyze data regarding millennial customers, but your
dataset includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient and minimize distraction from your primary target—as well as creating a more
manageable and more performant dataset.
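A minimal pandas sketch of this step (the column names and values are hypothetical): duplicates are dropped and observations outside the target group are filtered out:

```python
import pandas as pd

# Invented example records
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age_group":   ["millennial", "boomer", "boomer", "millennial"],
    "spend":       [120.0, 80.0, 80.0, 95.0],
})

df = df.drop_duplicates()                 # remove duplicate observations
df = df[df["age_group"] == "millennial"]  # keep only observations relevant to the analysis
print(df)
```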
Step 2: Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For
example, you may find “N/A” and “Not Applicable” both appear, but they should be analyzed as the same
category.
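For instance, a hedged pandas sketch (the survey responses are invented) that standardizes capitalization and merges labels that mean the same thing:

```python
import pandas as pd

# Invented survey responses with inconsistent labels
df = pd.DataFrame({"status": ["Yes", "N/A", "not applicable", "no", "Not Applicable"]})

# Standardize capitalization and merge labels that mean the same thing
df["status"] = (
    df["status"]
    .str.strip()
    .str.lower()
    .replace({"n/a": "not applicable"})
)
print(df["status"].value_counts())
```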

Step 3: Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the data
you are analyzing. If you have a legitimate reason to remove an outlier, like improper data-entry, doing so
will help the performance of the data you are working with. However, sometimes it is the appearance of an
outlier that will prove a theory you are working on. Remember: just because an outlier exists, doesn’t mean
it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be
irrelevant for analysis or is a mistake, consider removing it.
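One common way to flag candidate outliers before deciding what to do with them is the interquartile-range rule; a brief sketch with invented numbers:

```python
import pandas as pd

# Invented measurements; 250 looks like a data-entry error
s = pd.Series([52, 49, 51, 48, 50, 53, 47, 250])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print("outliers:", s[mask].tolist())  # inspect before removing: an outlier is not always wrong
cleaned = s[~mask]
```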

Step 4: Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There are a
few ways to deal with missing data; none of them is ideal, but each can be considered (a minimal pandas sketch follows the list below).

1. As a first option, you can drop observations that have missing values, but doing this will drop or lose
information, so be mindful of this before you remove it.

2. As a second option, you can input missing values based on other observations; again, there is an
opportunity to lose integrity of the data because you may be operating from assumptions and not
actual observations.

3. As a third option, you might alter the way the data is used to effectively navigate null values.
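The minimal pandas sketch below walks through the three options with invented values:

```python
import pandas as pd
import numpy as np

# Invented records with missing values
df = pd.DataFrame({"age": [25, np.nan, 31], "income": [40000, 52000, np.nan]})

option1 = df.dropna()                           # 1) drop observations with missing values
option2 = df.fillna(df.mean(numeric_only=True)) # 2) fill from other observations (here: column means)
option3 = df.copy()                             # 3) keep the NaNs and handle them explicitly downstream

print(option1, option2, sep="\n\n")
```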

Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation:

 Does the data make sense?


 Does the data follow the appropriate rules for its field?

 Does it prove or disprove your working theory, or bring any insight to light?

 Can you find trends in the data to help you form your next theory?

 If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-
making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize
your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality data
in your organization. To do this, you should document the tools you might use to create this culture and
what data quality means to you.


Components of quality data

Determining the quality of data requires an examination of its characteristics, then weighing those
characteristics according to what is most important to your organization and the application(s) for which
they will be used.

Five characteristics of quality data

1. Validity. The degree to which your data conforms to defined business rules or constraints.

2. Accuracy. Ensure your data is close to the true values.

3. Completeness. The degree to which all required data is known.

4. Consistency. Ensure your data is consistent within the same dataset and/or across multiple data
sets.

5. Uniformity. The degree to which the data is specified using the same unit of measure.

Benefits of data cleaning

Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:

 Removal of errors when multiple sources of data are at play.

 Fewer errors make for happier clients and less-frustrated employees.

 Ability to map the different functions and what your data is intended to do.

 Monitoring errors and better reporting to see where errors are coming from, making it easier to fix
incorrect or corrupt data for future applications.

 Using tools for data cleaning will make for more efficient business practices and quicker decision-
making.
1.6.2 OUTLIERS

An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum values.

FIG 1.9 : Outliers
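As suggested above, a table of minimum and maximum values per column is often enough to spot suspicious observations; a brief sketch with invented data:

```python
import pandas as pd

# Invented records; an age of 422 is almost certainly a typo
df = pd.DataFrame({
    "age":    [23, 31, 28, 422],
    "height": [1.71, 1.65, 1.80, 1.75],
})

# Minimum and maximum per column: the simplest outlier check
print(df.agg(["min", "max"]))
```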

1.7 COMMON ERRORS THAT OCCUR IN THE DATA CLEANSING PROCESS


Data Cleansing: Problems and Solutions
It is more important for any organization to have the right data as compared to a large data
set. Data cleansing solutions can have several problems during the process of data scrubbing. The
company needs to understand the various problems and figure out how to tackle them. Some of the
key data cleaning problems and solutions include
Data is never static
It is important that the data cleansing process arranges the data so that it is easily accessible
to everyone who needs it. The warehouse should contain unified data and not in a scattered
manner. The data warehouse must have a documented system which is helpful for the employees
to easily access the data from different sources. Data cleaning also further helps to improve the
data quality by removing inaccurate data as well as corrupt and duplicate entries.

Incorrect data may lead to bad decisions


While operating your business you rely on certain source of data, based on which you make
most of your business decisions. If the data has a lot of errors, the decisions you take may be
incorrect and prove to be hazardous for your business. The way you collect data and how your data
warehouse functions can easily have an impact on your productivity.

Incorrect data can affect client records


Complete client records are only possible when the names and addresses match. Names and
addresses of the client can be poor sources of data. To avoid these mistakes, companies should
provide external references which are capable of verifying the data, supplementing data points and
correcting any inconsistencies.

Develop a data cleansing framework in advance


Data cleansing can be a time consuming and expensive job for your company. Once the data
is cleaned it needs to be stored in a secure location. The staff should keep a complete log of the
entire process so as to ascertain which data went through which process. If a data scrubbing
framework is not created in advance, the entire process can become repetitive.

Big data can bring in bigger problems


Big data needs regular cleansing to maintain its effectiveness. It requires complex computer
analysis of semi-structured or structured, voluminous data. Data cleansing helps in
extracting information from such a big set of data and coming up with data that can be used to
make key business decisions.

 It is good with large databases and datasets


 It predicts future results
 It creates actionable insights
 It utilizes the automated discovery of patterns
1.8 THE STEPS INVOLVED IN DATA SCIENCE MODELLING

The key steps involved in Data Science Modelling are:

 Step 1: Understanding the Problem


 Step 2: Data Extraction

 Step 3: Data Cleaning

 Step 4: Exploratory Data Analysis

 Step 5: Feature Selection

 Step 6: Incorporating Machine Learning Algorithms

 Step 7: Testing the Models

 Step 8: Deploying the Model

Step 1: Understanding the Problem


The first step involved in Data Science Modelling is understanding the problem. A Data Scientist
listens for keywords and phrases when interviewing a line-of-business expert about a business challenge.
The Data Scientist breaks down the problem into a procedural flow that always involves a holistic
understanding of the business challenge, the Data that must be collected, and the various Artificial Intelligence
and Data Science approaches that can be used to address the problem.

Step 2: Data Extraction


The next step in Data Science Modelling is Data Extraction. Not just any Data, but the Unstructured
Data pieces you collect that are relevant to the business problem you're trying to address. Data Extraction is
done from various sources, such as online platforms, surveys, and existing Databases.

Step 3: Data Cleaning

Data Cleaning is useful as you need to sanitize Data while gathering it. The following are some of the
most typical causes of Data Inconsistencies and Errors:

 Duplicate items are reduced from a variety of Databases.


 The error with the input Data in terms of Precision.

 Changes, Updates, and Deletions are made to the Data entries.

 Variables with missing values across multiple Databases.


Step 4: Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with Data and
extracting useful insights. Data Scientists sift through Unstructured Data to find patterns and infer
relationships between Data elements. Data Scientists use Statistics and Visualization tools to summarise
Central Measurements and variability to perform EDA.

If Data skewness persists, appropriate transformations are used to scale the distribution around its
mean. When Datasets have a lot of features, exploring them can be difficult. As a result, to reduce the
complexity of Model inputs, Feature Selection is used to rank them in order of significance in Model Building
for enhanced efficiency. Using Business Intelligence tools like Tableau, MicroStrategy, etc. can be quite
beneficial in this step. This step is crucial in Data Science Modeling as the Metrics are studied carefully for
validation of Data Outcomes.
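A brief hedged sketch of the EDA moves described above (summary statistics, a distribution plot, and a log transform for skewed data), using a synthetic dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed income data
rng = np.random.default_rng(42)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.6, size=1000)})

print(df["income"].describe())            # central measurements and variability

df["log_income"] = np.log(df["income"])   # transformation to reduce skewness
df.hist(column=["income", "log_income"], bins=30)
plt.tight_layout()
plt.show()
```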

Step 5: Feature Selection

Feature Selection is the process of identifying and selecting the features that contribute the most to
the prediction variable or output that you are interested in, either automatically or manually.

The presence of irrelevant characteristics in your Data can reduce the Model accuracy and cause your
Model to train based on irrelevant features. In other words, if the features are strong enough, the Machine
Learning Algorithm will give fantastic outcomes. Two types of characteristics must be addressed (a small feature-ranking sketch follows the list below):

 Consistent characteristics that are unlikely to change.


 Variable characteristics whose values change over time.
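A small sketch of one simple, assumption-laden way to rank features: absolute correlation with the target, computed on synthetic data (real projects would usually also use model-based importance):

```python
import numpy as np
import pandas as pd

# Synthetic features and target
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "noise":     rng.normal(size=n),
})
df["target"] = 3 * df["feature_a"] - 2 * df["feature_b"] + rng.normal(scale=0.5, size=n)

# Rank features by absolute correlation with the target
ranking = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```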

Step 6: Incorporating Machine Learning Algorithms

This is one of the most crucial processes in Data Science Modelling as the Machine Learning
Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from, the Model is
selected based on the problem. There are three types of Machine Learning methods that are incorporated:

1) Supervised Learning

Supervised Learning is based on the results of previous operations related to the existing business operation.
Based on previous patterns, it aids in the prediction of an outcome (a brief sketch follows the list below). Some of the
Supervised Learning Algorithms are:

 Linear Regression
 Random Forest

 Support Vector Machines
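A brief sketch of supervised learning using scikit-learn's linear regression (this assumes scikit-learn is installed; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic labelled data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)  # learn from labelled examples
print("learned coefficients:", model.coef_)
print("R^2 on unseen data:", model.score(X_test, y_test))
```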


2) Unsupervised Learning

This form of learning has no pre-existing labels or outcomes. Instead, it concentrates on examining
the interactions and connections between the presently available Data points (a clustering sketch follows the list below). Some of the Unsupervised
Learning Algorithms are:

 KNN (k-Nearest Neighbours)


 K-means Clustering

 Hierarchical Clustering

 Anomaly Detection
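A brief sketch of unsupervised learning using k-means clustering from scikit-learn (assumed installed; the two groups of points are synthetic and carry no labels):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points with no labels attached
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=2).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("first five labels:", kmeans.labels_[:5])
```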

3) Reinforcement Learning

It is a fascinating Machine Learning technique in which an agent interacts with a dynamic, real-world
environment. In simple terms, it is a mechanism by which a system learns from its mistakes and improves over
time (a minimal Q-learning sketch follows the list below). Some of the Reinforcement Learning Algorithms are:

 Q-Learning
 State-Action-Reward-State-Action (SARSA)

 Deep Q Network
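A minimal sketch of the tabular Q-learning update on a toy, entirely invented one-dimensional world, just to show the learn-from-feedback loop:

```python
import random

# Toy world: states 0..4, actions move right (+1) or left (-1); reward 1 for reaching state 4
n_states = 5
actions = [+1, -1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount factor, exploration rate

for _ in range(200):                    # episodes: the system improves from its own experience
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.choice(actions)                      # explore
        else:
            a = max(actions, key=lambda x: Q[(s, x)])       # exploit current knowledge
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update rule
        Q[(s, a)] += alpha * (reward + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})  # expected: +1 everywhere
```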


Step 7: Testing the Models

This is the next phase, and it's crucial to check that our Data Science Modeling efforts meet the
expectations. The Data Model is applied to the Test Data to check whether it's accurate and houses all desirable
features. You can further test your Data Model to identify any adjustments that might be required to
enhance the performance and achieve the desired results. If the required precision is not achieved, you can
go back to Step 6 (Incorporating Machine Learning Algorithms), choose an alternate Data Model, and then test the model
again.

Step 8: Deploying the Model

The Model which provides the best result based on test findings is completed and deployed in the
production environment whenever the desired result is achieved through proper testing as per the business
needs. This concludes the process of Data Science Modeling.
Applications of Data Science

Every industry benefits from the experience of Data Science companies, but the most common areas
where Data Science techniques are employed are the following:

 Banking and Finance: The banking industry can benefit from Data Science in many aspects. Fraud
Detection is a well-known application in this field that assists banks in reducing non-performing
assets.
 Healthcare: Health concerns are being monitored and prevented using Wearable Data. The Data
acquired from the body can be used in the medical field to prevent future calamities.

 Marketing: Marketing offers a lot of potential, such as a more effective price strategy. Pricing based
on Data Science can help companies like Uber and E-Commerce businesses enhance their profits.

 Government Policies: Based on Data gathered through surveys and other official sources, the
government can use Data Science to better build policies that cater to the interests and wishes of
the people
PART-A Questions

1. What is Data Science?


2. Differentiate between Data Analytics and Data Science
3. What are the challenges in Data Science?
4. List the facets of data.
5. What are the steps in data science process ?
6. Explain unstructured data and give an example.
7. What do you understand about linear regression?
8. What are outliers?
9. What do you understand by logistic regression?
10. What is a confusion matrix?
11. What do you understand about the true-positive rate and false-positive rate?
12. How is Data Science different from traditional application programming?
13. Explain the differences between supervised and unsupervised learning.
14. What is the difference between the long format data and wide format data?
15. Mention some techniques used for sampling. What is the main advantage of sampling?

PART-B
1. Explain the use of Data Science tools and techniques in practical scenarios.

2. Illustrate the facets of data in detail.

3. Explain the importance of data quality, integrity, and reliability.

4. Mention the importance of cleaning data for accurate analysis.

5. Explain the common errors that occur in the data cleansing process.

6. Describe Steps Involved in Data Science Modeling.
