CCS346 EDA Unit 1 Notes
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA-
Data transformation techniques-merging database, reshaping and pivoting, Transformation techniques.
EDA FUNDAMENTALS
Data and Information
Data is a collection of facts such as numbers, discrete objects, figures, words, symbols, events,
measurements, observations, or descriptions of things.
Data can be Qualitative or Quantitative. Qualitative data is descriptive information (It describes
something). Quantitative data is numerical information (numbers).
Quantitative data can be Discrete or Continuous. Discrete data take only certain values (whole numbers).
Continuous data take any value (within a range). Discrete data is counted and continuous data is measured.
Example: Qualitative data
Most common names in India
Colour of a Dress
Example: Quantitative data
Height, Weight, Leaves on the tree, Customers in a shop.
Information is defined as classified or organized data that has meaningful value. Information is
processed data used to make decisions and take action. Processed data must meet the following criteria:
Accuracy
Completeness
Timeliness
Data science is the field of study that combines knowledge of mathematics and statistics to extract
meaningful insights from data. Data science deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information and make business decisions.
Data science combines math and statistics, specialized programming, advanced analytics, Artificial
Intelligence (AI) and machine learning to extract meaningful insights from data and used to guide decision
making and strategic planning.
Data Science is extraction, preparation, analysis, visualization, and maintenance of information. Data
science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.
The Data Science Lifecycle
Data science's lifecycle consists of five distinct stages, each with its own tasks:
1. Capture - Data acquisition, data entry, signal reception, data extraction - This stage involves gathering
raw structured and unstructured data.
2. Maintain - Data warehousing, data cleansing, data staging, data processing, data architecture - This
stage covers taking the raw data and putting it in a form that can be used.
3. Process - Data mining, clustering/classification, data modeling, data summarization - Data scientists take
the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in
predictive analysis.
4. Analyze - Exploratory/confirmatory, predictive analysis, regression, text mining, qualitative analysis -
This stage involves performing the various analyses on the data.
5. Communicate - Data reporting, data visualization, business intelligence, decision making - In this final
step, analysts prepare the analyses in easily readable forms such as charts, graphs and reports.
Data science tools
Data science tools used in various stages of the data science process are:
o Data Analysis - SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner
o Data Warehousing - Informatica, AWS Redshift, Wega
o Data Visualization - Jupyter, Tableau, Cognos, RAW
o Machine Learning - Spark MLlib, Mahout, Azure ML Studio
Phases of data analysis
3. Data processing
• Preprocessing involves the process of selecting and organizing the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset, placing them under the right tables, structuring
them, and exporting them in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing
value check.
• This stage involves responsibilities such as matching the correct record, finding inaccuracies in the
dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets.
• An example of data cleaning is using outlier detection methods for quantitative data cleaning.
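As a hedged sketch of the outlier-detection example mentioned above (the column name, the values, and the 1.5 threshold are illustrative assumptions, not from the notes), the interquartile range (IQR) rule can be used to flag suspicious quantitative values with pandas:

```python
import pandas as pd

# Illustrative data: one value (250) is clearly inconsistent with the rest.
weights = pd.Series([68, 72, 65, 70, 250, 71, 69], name="weight_kg")

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)                                  # flags the 250 entry
clean = weights[(weights >= lower) & (weights <= upper)]
```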
5. EDA
• Exploratory data analysis is the stage where the message contained in the data is actually understood.
• Several types of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent relationships among different variables, such as
correlation or causation. These models or equations involve one or more variables that depend on other
variables to cause an event.
• For example, when buying pens, the total price of pens
(Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity).
Hence, the model would be Total = UnitPrice * Quantity. Here, the total price is dependent on the unit price.
Hence, the total price is referred to as the dependent variable and the unit price is referred to as an
independent variable.
• In general, a model always describes the relationship between independent and dependent variables.
• Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship between data, model, and the error still holds true:
Data = Model + Error
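A minimal sketch of Data = Model + Error using the pen example: a line is fitted to made-up purchase data and the leftover error (residuals) is inspected. The numbers and the use of numpy.polyfit are illustrative assumptions, not part of the notes.

```python
import numpy as np

quantity = np.array([1, 2, 3, 4, 5, 6])
# Observed totals: roughly 10 per pen, plus a small amount of "error".
total = np.array([10.2, 19.8, 30.5, 39.7, 50.3, 59.6])

# Model: Total = UnitPrice * Quantity (polyfit with degree 1 also estimates an intercept).
unit_price, intercept = np.polyfit(quantity, total, 1)
model = unit_price * quantity + intercept
error = total - model          # Data = Model + Error  =>  Error = Data - Model

print(unit_price)              # close to 10
print(error)                   # small residuals around zero
```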
7. Data Product
• Any computer software that uses data as inputs, produces outputs, and provides feedback based on the
output to control the environment is referred to as a data product.
• A data product is generally based on a model developed during data analysis.
Example: a recommendation model that inputs user purchase history and recommends a related item that
the user is highly likely to buy.
8. Communication
• This stage deals with disseminating the results to end stakeholders to use the result for business
intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar
charts to show the analyzed result.
Applications of Data Science:
Healthcare
Gaming
Image Recognition
Recommendation Systems
Fraud Detection
Speech Recognition
Airline Route Planning
Virtual Reality
SIGNIFICANCE OF EDA
➢ Different fields such as science, economics, engineering, and marketing accumulate and store data primarily
in electronic databases. Appropriate and well-established decisions should be made using the data collected.
➢ It is practically impossible to make sense of datasets containing more than a handful of data points
without the help of computer programs. To extract insights from the collected data and to make further
decisions, data mining is performed, which involves distinct analysis processes.
➢ Exploratory data analysis is the key and first exercise in data mining. It allows us to visualize data to
understand it as well as to create hypotheses (ideas) for further analysis. The exploratory analysis centers
around creating a synopsis of data or insights for the next steps in a data mining project. EDA actually
reveals the ground truth about the content without making any underlying assumptions.
➢ Hence, data scientists use this process to actually understand what type of modeling and hypotheses can
be created. Key components of exploratory data analysis include summarizing data, statistical analysis, and
visualization of data.
➢ Python provides expert tools for exploratory analysis:
• Pandas for summarizing
• SciPy, along with others, for statistical analysis
• Matplotlib and Plotly for visualizations
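A minimal sketch of how these libraries divide the work during exploration (the column names and values are made up for illustration):

```python
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 35, 31, 42, 29, 51, 38, 27],
                   "income": [28, 52, 44, 61, 39, 75, 58, 33]})

# Pandas: summarizing
print(df.describe())

# SciPy: statistical analysis (e.g., skewness of a variable)
print(stats.skew(df["income"]))

# Matplotlib: visualization
df["income"].plot(kind="hist", bins=5)
plt.xlabel("income")
plt.show()
```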
STEPS IN EDA
The four different steps involved in exploratory data analysis are,
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
1. Problem Definition
• It is essential to define the business problem to be solved before trying to extract useful insights from the data.
• The problem definition works as the driving force for a data analysis plan execution. The main tasks
involved in problem definition are
o defining the main objective of the analysis
o defining the main deliverables
o outlining the main roles and responsibilities
o obtaining the current status of the data
o defining the timetable, and performing cost/benefit analysis
• Based on the problem definition, an execution plan can be created.
2. Data Preparation
• This step involves methods for preparing the dataset before actual analysis. This step involves
o defining the sources of data
o defining data schemas and tables
o understanding the main characteristics of the data
o cleaning the dataset
o deleting non-relevant datasets
o transforming the data
o dividing the data into required chunks for analysis
3. Data Analysis
• This is one of the most crucial steps, as it deals with descriptive statistics and analysis of the data.
• The main tasks involve
o summarizing the data
o finding hidden correlations and relationships among the data (see the sketch after this list)
o developing predictive models
o evaluating the models
o calculating the accuracies
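As a small hedged illustration of the "finding hidden correlations" task (the columns and values are synthetic, not from the notes), pandas can compute a correlation matrix directly:

```python
import pandas as pd

df = pd.DataFrame({"hours_studied": [2, 4, 5, 7, 8, 10],
                   "exam_score":    [48, 55, 60, 72, 78, 90],
                   "shoe_size":     [7, 9, 8, 7, 10, 8]})

# Pairwise Pearson correlations: hours_studied and exam_score are strongly
# correlated, while shoe_size is unrelated to either.
print(df.corr())
```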
MAKING SENSE OF DATA
➢ A dataset contains many observations about a particular object. For example, a dataset about patients in a
hospital describes each patient by variables such as patient ID, name, address, date of birth, email, gender,
and weight; each of these features is a variable.
➢ Each observation can have a specific value for each of these variables.
For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = yoshmimukhiya@gmail.com
Weight = 10
Gender = Female
➢ These datasets are stored in hospitals and are presented for analysis.
➢ Most of this data is stored in some sort of database management system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation describes variables (PatientID, name, address, dob, email, gender, and weight).
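A minimal sketch of such a table in pandas, using only the single example patient shown above (the remaining observations from the notes' table are not reproduced here):

```python
import pandas as pd

patients = pd.DataFrame({
    "PatientID": [1001],
    "Name": ["Yoshmi Mukhiya"],
    "Address": ["Mannsverk 61, 5094, Bergen, Norway"],
    "DateOfBirth": pd.to_datetime(["2018-07-10"]),
    "Email": ["yoshmimukhiya@gmail.com"],
    "Weight": [10],
    "Gender": ["Female"],
})

print(patients)
print(patients.dtypes)   # each column is a variable, each row an observation
```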
Types of datasets
➢ Most datasets broadly fall into two groups—Numerical Data and Categorical Data.
Numerical data
Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, if we flip a coin, the number of heads in 200 coin flips can only take a finite set of values,
from 0 to 200.
➢ A variable that represents a discrete dataset is referred to as a discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
Example:
o The Country variable can have values such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval measure of scale or ratio measure of scale
Example:
o The temperature of a city
o The weight variable is a continuous variable
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of the movies.
➢ This data is often referred to as qualitative datasets in statistics.
Examples of categorical data
o Gender (Male, Female, Other, or Unknown)
o Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never
Married, Domestic Partner, Unmarried, Widowed, or Unknown)
o Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery,
Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
o Blood type (A, B, AB, or O)
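As a hedged sketch (illustrative values), categorical data such as blood type can be stored explicitly with the pandas category dtype, which records the allowed labels without implying any order:

```python
import pandas as pd

blood = pd.Series(["A", "O", "B", "AB", "O", "A"], dtype="category")
print(blood.cat.categories)    # the distinct labels: A, AB, B, O
print(blood.value_counts())    # counts per category (no ordering implied)
```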
Measurement scales
➢ There are four different types of measurement scales in statistics:
Nominal,
Ordinal,
Interval and
Ratio.
Nominal Scale:
• A nominal scale associates numbers with variables for naming or labeling the object. It is the
most basic of the four levels of measurement. This is a method of measuring the objects or events
into a discrete category. It assigns a number to an object only for the identification of the object. So
it is a categorical data or qualitative data.
• The numbers are only used for labeling variables and without any quantitative value. The scales are
referred to as labels.
Ordinal Scale:
It establishes the rank between the variables of a scale but not the difference value between the
variables. The ordinal scale is the next level of data measurement scale. Here the “Ordinal” is the
indication of “Order”.
Ordinal measurement assigns a numerical value to the variables based on their relative ranking.
In a nominal scale, there is no predefined order for arranging the data.
➢ The main difference in the ordinal and nominal scale is the order.
Example of ordinal scale using the Likert scale:
➢ The answer to the question is scaled down to five different ordinal values, Strongly Agree, Agree,
Neutral, Disagree, and Strongly Disagree.
➢ These Scales are referred to as the Likert scale.
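A minimal sketch of an ordinal (Likert) variable in pandas, where the order of the five responses is made explicit (the survey answers themselves are illustrative assumptions):

```python
import pandas as pd

levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
answers = pd.Series(["Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"])

# Ordered categorical: the ranking between values is known, but not the distance.
likert = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))
print(likert.min(), likert.max())      # order-aware comparisons now work
print((likert >= "Neutral").sum())     # how many responses are Neutral or better
```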
More examples of the Likert scale: (survey-question figures not reproduced)
Interval Scale:
In an interval scale, both the order of the values and the exact differences between the values are
significant, but there is no true zero point. Because the differences are meaningful, the measures of
central tendency—mean, median, mode—and standard deviations can be calculated on interval data.
Ratio Scale:
Ratio scale is the most advanced measurement scale, which has variables that are labeled in order
and have a calculated difference between variables. This scale has a fixed starting point, i.e., the actual
zero value is present. Ratio scale is purely quantitative. Among the four levels of measurement, ratio scale
is the most precise.
Examples of ratio scales are age, weight, height, income, distance, etc.
Examples: the measure of energy, mass, length, duration, electrical energy and volume.
SOFTWARE TOOLS FOR EDA
NumPy
➢ NumPy is Numerical Python, which is a Python library. NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
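A small hedged example of NumPy arrays (the values are chosen only for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)         # (2, 3): two observations, three values each
print(a.mean(axis=0))  # column means: [2.5 3.5 4.5]
print(a.T @ a)         # matrix product, one of the linear-algebra operations
```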
Pandas
➢ Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning,
exploring, and manipulating data.
SciPy
➢ SciPy stands for Scientific Python which is a scientific computation library that uses NumPy.
➢ It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is
open source so we can use it freely. SciPy has optimized and added functions that are frequently used in
NumPy and Data Science.
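A hedged sketch of the kind of utility SciPy adds on top of NumPy, here a one-dimensional optimization (the function is an arbitrary example, not from the notes):

```python
from scipy import optimize

# Minimize f(x) = (x - 3)^2 + 1; the minimum is at x = 3 with value 1.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(result.x, result.fun)   # approximately 3.0 and 1.0
```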
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
➢ It provides a huge library of customizable plots used to create professional reporting applications,
interactive analytical applications, complex dashboard applications, web/GUI applications, etc.
VISUAL AIDS FOR EDA
Univariate analysis (Used for univariate data - the data containing one variable)
Univariate plots show the frequency or the distribution shape of a variable. Below are the visual tools used
to analyze univariate data. Histograms are two-dimensional plots in which the x-axis is divided into a range
of numerical bins or time intervals. The y-axis shows the frequency values, which are counts of
occurrences of values for each bin.
Example:
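Since the original example figure is not reproduced here, a minimal Matplotlib sketch of a histogram (using randomly generated data as an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

heights = np.random.normal(loc=170, scale=8, size=500)   # synthetic data

plt.hist(heights, bins=20)   # x-axis: bins of values, y-axis: counts per bin
plt.xlabel("height (cm)")
plt.ylabel("frequency")
plt.title("Histogram of a single (univariate) variable")
plt.show()
```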
Bivariate Plots (Used for bivariate data - the data containing two variables)
Example:
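As a hedged sketch of a bivariate plot (synthetic data, illustrative variable names), one variable can be plotted against another as a scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

temperature = np.random.uniform(15, 35, 100)              # synthetic
sales = 20 * temperature + np.random.normal(0, 40, 100)   # loosely related to it

plt.scatter(temperature, sales)
plt.xlabel("temperature (°C)")
plt.ylabel("ice-cream sales")
plt.title("Bivariate plot: two variables at a time")
plt.show()
```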
Multivariate Plots (Used for Multivariate data - the data containing more than two variables)
Multivariate data is high-dimensionality data and has applications in areas such as deep learning, for
example, visualizing natural-language text or images.
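A minimal sketch of one common multivariate visual aid, a scatter-plot matrix over several numeric columns (the DataFrame here is synthetic):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(200, 4), columns=["a", "b", "c", "d"])

# One panel per pair of variables, histograms on the diagonal.
scatter_matrix(df, diagonal="hist", figsize=(8, 8))
plt.show()
```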
Scatter Plot
A scatter plot shows the values of two variables as points in a plane (a third variable can be encoded by
color or size) and is used to reveal relationships or correlations between them.
DATA TRANSFORMATION TECHNIQUES
Data transformation techniques are techniques used to transform the raw data into a clean and ready-to-use
dataset.
There are different types of data transformation:
1. Data smoothing
Smoothing is a technique where an algorithm is applied in order to remove noise from the dataset when
trying to identify a trend. Noise can have a bad effect on the data, and by eliminating or reducing it one can
extract better insights or identify patterns that would not be seen otherwise.
There are three algorithm types that help with data smoothing:
o Clustering: Where one can group similar values together to form a cluster while labeling any value out
of the cluster as an outlier.
o Binning: Using an algorithm for binning will help to split the data into bins and smooth the data values
within each bin (see the sketch after this list).
o Regression: Regression algorithms are used to identify the relation between two dependent attributes and
help to predict an attribute based on the value of the other.
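A hedged sketch of smoothing by bin means, as mentioned in the binning item above (the column name, values, and bin count are assumptions): each value is replaced by the mean of the bin it falls into.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], name="price")

# Split the values into 3 equal-width bins, then smooth each value by
# replacing it with the mean of its bin.
bins = pd.cut(prices, bins=3)
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))
```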
2. Attribute construction
3. Data Generalization
4. Data Aggregation
5. Data Discretization
6. Data Normalization
7. Integration
8. Manipulation