Data Analytics – Module 1
• Importance of Data Analytics
1. Improving decision-making:
By analyzing data, companies can make informed decisions.
2. Identifying trends:
It helps in spotting trends that can shape future strategies.
3. Solving problems:
Data analytics can help find the root cause of issues within a business.
4. Boosting efficiency:
By understanding what works and what doesn't, companies can
streamline their processes.
• Steps in Data Analytics
1. Data Collection:
This is the first step where raw data is collected from various sources, like
databases, surveys, social media, sensors, etc.
2. Data Cleaning:
The raw data often has errors or missing values. Data cleaning involves correcting
these errors and filling in missing information to make the data accurate and ready
for analysis.
3. Data Analysis:
This is the most important step where data is analyzed using different techniques.
These could include:
➢ Statistical analysis: Applying mathematical formulas to data to find
averages, trends, or patterns.
➢ Visualization: Using charts, graphs, and tables to represent data so it’s easier
to understand.
4. Data Interpretation:
After analyzing, the results need to be interpreted. This means making sense of the
findings and seeing how they apply to real-world problems or decisions.
5. Data Reporting:
Once the analysis is complete, the insights are shared with others through reports
or dashboards. These insights help decision-makers in an organization.
• Example
Let’s take the example of an online shopping website.
Data Collection: The website collects data about what products customers are
browsing, what they buy, and the ratings they give.
Data Cleaning: They remove duplicate data (if a user accidentally clicked multiple
times) or incorrect entries.
Data Analysis: The company analyzes customer buying patterns. For example,
they might notice that people tend to buy more shoes during certain times of the
year.
Data Interpretation: From this, they learn that they should increase their stock of
shoes before certain holidays to meet demand.
Data Reporting: The analysis is presented to the sales team, who can then plan
promotions and stock inventory accordingly.
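To make the pipeline concrete, here is a minimal pandas sketch of the shoe-sales example above. The DataFrame, column names, and values are hypothetical; a real pipeline would read from a database or site logs instead.

```python
# Minimal sketch of the five steps for the online-store example.
# All column names and values below are hypothetical.
import pandas as pd

# Data Collection: in practice this would come from a database or site logs.
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "product":  ["shoes", "shoes", "shirt", "shoes", "hat"],
    "month":    ["Nov", "Nov", "Jun", "Dec", "Jun"],
    "quantity": [2, 2, 1, 3, 1],
})

# Data Cleaning: drop duplicate records (e.g., accidental double clicks).
orders = orders.drop_duplicates()

# Data Analysis: total quantity sold per product and month.
sales_by_month = orders.groupby(["product", "month"])["quantity"].sum()

# Data Interpretation / Reporting: this summary is what would feed a
# report or dashboard for the sales team.
print(sales_by_month)
```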
o Ordinal Data: Categories with a meaningful order but no fixed difference between
them.
Example: Customer satisfaction ratings such as Poor, Average, Good, Excellent.
a. Discrete Data
Discrete data consists of countable values, usually whole numbers, that are counted
rather than measured.
Example:
a) Number of students in a class: 25, 30, 40.
b) Number of cars sold in a month: 100, 150, 200.
b. Continuous Data
Continuous data refers to numbers that can take any value within a given
range. It includes decimal points and is measured rather than counted.
Example:
a) A person's height: 170.5 cm.
b) The temperature on a given day: 22.4 °C.
3. Primary Data
Primary data is data collected first-hand by the researcher for a specific purpose, for
example through surveys, interviews, or experiments.
Advantages:
o It is specific to the current research needs.
o It is up-to-date and collected with known methods.
Disadvantages:
o It can be expensive and time-consuming to collect.
4. Secondary Data
Secondary data is the data that has already been collected by someone else and is available
for you to use. This could include data from government reports, research papers, or
company records.
Example:
a) A company uses census data from the government to understand population trends.
b) A student uses data from a research paper to support their thesis.
Advantages:
o It’s quick and inexpensive to obtain.
o It’s already available and ready for analysis.
Disadvantages:
o It may not be specific to the current research needs.
o It may be outdated or collected with unknown methods.
5. Structured Data
Structured data is data organized in a fixed format, such as rows and columns in a table,
making it easy to store, search, and analyze.
Example:
a) A spreadsheet listing the sales of a store, with columns for the date, product, and
number of items sold.
b) A table showing customer information such as name, age, and contact details.
6. Unstructured Data
Unstructured data is data that doesn’t have a pre-defined format or organization. This type
of data is harder to analyze because it doesn’t fit neatly into tables.
Example:
a) Social media posts containing free text, images, and videos.
b) Emails and open-ended customer reviews.
7. Time-series Data
Time-series data is data collected at regular intervals over a period of time, showing how
values change over time.
Example:
a) Stock prices recorded every day for a year.
b) Monthly sales data of a product over the past five years.
8. Cross-sectional Data
Cross-sectional data is collected at one point in time, providing a snapshot of a situation. It
doesn’t show changes over time but focuses on understanding a specific moment.
Example:
a) A survey taken by customers on a particular day to measure their satisfaction.
b) A health survey that measures the fitness levels of people at one point in time.
• Data Quality Dimensions
1. Accuracy
Accuracy means the data correctly represents the real-world values or facts it is
meant to describe. It should be error-free and provide a true picture.
2. Completeness
Completeness means that all the necessary information is present. There should be
no missing values or gaps in the data.
Example: If you have a form with a customer’s name, address, and email
but the email field is blank, the data is incomplete.
3. Consistency
Consistency means that data is uniform and follows the same format across all
sources. There should be no contradictions in the data.
4. Timeliness
Timeliness means that the data is up-to-date and available when needed. Data needs
to be relevant to the time frame in which decisions are being made.
Example: Using last year’s sales data to make decisions for this year may
not be useful if the market conditions have changed.
5. Validity
Validity refers to how well the data matches the rules and requirements for its
intended use. Data should follow the correct format and meet predefined standards.
Example: If a date field requires the format "DD/MM/YYYY," and the data
is entered as "31/13/2023," it is invalid because the 13th month doesn’t
exist.
6. Uniqueness
Uniqueness means that each data entry is unique and not duplicated. Duplicate
records can cause confusion and errors.
Example: If the same customer is recorded twice in a database, once as
"Anna Siju" and again as "Anna.S," there is a duplication issue.
7. Relevance
Relevance means that the data should be directly related to the purpose for which
it is being used. Irrelevant data is not useful for analysis.
8. Accessibility
Accessibility means that data should be easily available and understandable for
those who need it. If people cannot access the data or understand it, it is not useful.
Example: If a company's sales team needs access to customer data but the
database is too complicated for them to use, the data lacks accessibility.
9. Reliability
Reliability means that the data consistently gives the same results over time,
provided nothing has changed. It should be trustworthy and come from a reliable
source.
Example: If two different employees are recording sales figures and their
records always match, the data is reliable.
10. Security
Security means that the data is protected from unauthorized access and breaches,
using measures such as encryption and access controls.
• Data Preprocessing
Without proper preprocessing, using raw data can lead to inaccurate analysis, wrong predictions,
or poor decisions.
3. Data Transformation: After cleaning, the data may need to be transformed into a suitable
format. This step involves techniques such as normalization (scaling numeric values to a
common range) and encoding (converting categorical values into numbers).
4. Data Reduction: This step reduces the size of the dataset without losing essential
information. Techniques used for data reduction include:
Feature selection: Choosing only the most important features for analysis.
Example: In a dataset with 100 variables, only the top 10 that have the most
impact on customer behavior are selected for analysis.
5. Data Integration: If data comes from multiple sources (e.g., different databases or systems),
it is combined into a single dataset. This is necessary to avoid inconsistencies when
analyzing the data together.
Example: Combining sales data from both online and offline stores into one file to
get a complete picture of sales performance.
6. Data Discretization: If continuous data (like age, income) needs to be grouped into
categories or bins, discretization is applied. This makes the data easier to analyze.
Example: Grouping customer ages into categories like "18-25," "26-35," and "36-
50."
Example: Preprocessing a healthcare dataset of patient records (age, gender, blood
pressure, cholesterol) collected from several hospitals.
Steps:
1. Data Cleaning:
Fill in missing blood pressure readings with the average blood pressure for
the patient's age group.
Convert cholesterol values into the same unit (mg/dL) for consistency.
2. Data Transformation:
Normalize cholesterol and blood pressure values to a scale of 0 to 1.
Convert gender ("Male"/"Female") into numerical values (0 and 1).
3. Data Reduction:
Keep only the attributes most relevant to the analysis (feature selection).
4. Data Integration:
Combine patient data from different hospitals into one dataset.
5. Data Discretization:
Group patients' ages into ranges (e.g., "18-30," "31-45," "46-60") for easier
analysis.
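A minimal pandas sketch of the cleaning, transformation, and discretization steps above, assuming a hypothetical patient table. Filling missing blood pressure with the overall column mean (rather than the age-group mean) is a simplification for illustration.

```python
# Minimal sketch of the preprocessing steps on a hypothetical patient table.
import pandas as pd

patients = pd.DataFrame({
    "age":         [25, 40, 55, 33, 48],
    "gender":      ["Male", "Female", "Female", "Male", "Female"],
    "systolic_bp": [120, None, 140, 130, None],          # missing readings
    "cholesterol": [180.0, 200.0, 240.0, 210.0, 190.0],  # already in mg/dL
})

# 1. Data Cleaning: fill missing blood pressure readings (overall mean used
# here; the notes suggest the mean of the patient's age group).
patients["systolic_bp"] = patients["systolic_bp"].fillna(patients["systolic_bp"].mean())

# 2. Data Transformation: min-max normalize numeric columns to a 0-1 scale,
# and encode gender as 0/1.
for col in ["systolic_bp", "cholesterol"]:
    lo, hi = patients[col].min(), patients[col].max()
    patients[col + "_norm"] = (patients[col] - lo) / (hi - lo)
patients["gender_code"] = patients["gender"].map({"Male": 0, "Female": 1})

# 5. Data Discretization: group ages into ranges.
patients["age_group"] = pd.cut(patients["age"], bins=[17, 30, 45, 60],
                               labels=["18-30", "31-45", "46-60"])
print(patients)
```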
3. Finance:
Application: Fraud detection in banking transactions.
Preprocessing: Financial data often contains outliers (e.g., unusually large transactions)
and may be incomplete (e.g., missing transaction details). Data preprocessing involves
removing outliers, filling missing details (e.g., using past transaction history), and
standardizing transaction amounts. The preprocessed data is used in machine learning
models to detect fraudulent activities by identifying abnormal patterns.
4. Marketing:
Preprocessing: Marketers use customer data like demographics, purchasing behavior, and
online activity. This data may contain noise (irrelevant information) or be inconsistent (e.g.,
inconsistent age formats). Preprocessing cleans and organizes the data, encoding
categorical values (e.g., gender as 0 and 1) and normalizing numeric data (e.g., purchase
frequency). This processed data is then used to divide customers into groups for more
personalized and effective advertising campaigns.
5. Social Media:
• Data Collection and Management
Data management involves organizing, storing, and maintaining the collected data so that
it is accurate, consistent, and available for use when needed. Proper data management
ensures data is secure and easy to access for analysis.
1. Surveys
Surveys collect data by asking people a structured set of questions.
How It Works: Surveys can be conducted online, via paper forms, phone calls, or in person.
Respondents answer questions about their behavior, opinions, preferences, etc.
Example: A company might send a customer satisfaction survey via email to ask
customers how satisfied they are with a recent purchase. The data collected helps
the company understand its service quality and areas for improvement.
2. Interviews
Interviews involve direct communication between an interviewer and the participant. The
interviewer asks questions and may clarify or expand on responses to collect detailed data.
How It Works: Interviews can be structured (with a fixed set of questions) or unstructured
(more open conversation). They can be conducted in person, over the phone, or via video
calls.
Example: A researcher may interview teachers to understand the challenges they
face in the classroom. This method allows for in-depth responses that can provide
insight into specific issues.
3. Observations
Observation involves watching and recording events or behaviors as they occur in a natural
setting. This method is non-intrusive and does not rely on participants' responses.
How It Works: The observer collects data by simply watching people or events without
interacting with them.
Example: A retail store manager might observe how customers move through the
store to see which areas attract more attention. This data helps the store optimize
product placement.
4. Experiments
In experiments, researchers manipulate one or more variables and observe the effects on
another variable. This method is commonly used in scientific research.
How It Works: The experiment is conducted in controlled conditions where one variable
(the independent variable) is changed, and its effect on another variable (the dependent
variable) is measured.
Example: A company might run an A/B test, showing two versions of a web page to
different users to see which design leads to more purchases.
5. Secondary Data Collection
Secondary data is data that has already been collected by someone else for a different
purpose. It involves using existing data sources rather than collecting new data.
How It Works: This data can be obtained from books, websites, government reports,
research papers, or databases.
6. Online Data Collection
With the rise of the internet, a lot of data is collected online through websites, social media,
and apps. This method uses automated tools to gather data.
How It Works: Data is collected through online forms, tracking user activity on websites
(e.g., what pages they visit), or social media platforms.
Example: An e-commerce site might collect data on what products a customer
views and adds to their cart. This data helps personalize product recommendations.
Data Management:
Once the data is collected, it needs to be properly managed to ensure it is useful and accessible for
analysis. Here are key steps involved in data management:
3. Data Security: Ensuring the data is secure is a crucial part of data management.
Sensitive data must be protected from unauthorized access, and measures like encryption
and passwords are used.
4. Data Backup: Regular backups of the data ensure that if the original data is lost, it can
be recovered.
5. Data Cleaning: Raw data may contain errors, duplicates, or inconsistencies. Data
cleaning involves checking for these issues and correcting them to ensure the accuracy and
reliability of the data.
Example: A school collects student data and stores it in a cloud system for easy access.
The school ensures the data is password-protected to maintain privacy and regularly backs
up the data to avoid any loss.
Primary data is collected first-hand for a specific purpose, which can be expensive or
time-consuming. Secondary data is usually cheaper and easier to access, but it may not
always be specific to the current research needs.
1. Sensors and Internet of Things (IoT) Devices: Data collected from devices
that measure environmental factors like temperature, traffic, or air quality.
3. Streaming Data: Data collected from real-time streams such as social media
feeds, online gaming, or video streaming platforms.
Example: A media company uses data from its streaming service to
recommend content to users based on their viewing habits.
1. Exploring Data
Exploring data involves looking into the dataset to understand its structure, quality, and
key features. This is often done through simple visualizations and statistics. Here’s how
you can explore data:
Steps in Exploring Data:
1. Understand the Data Structure:
Look at the dataset to understand what types of data you have (e.g., numerical,
categorical, dates).
Example: A sales dataset may include columns for product names, sales amounts, dates
of purchase, and customer locations.
3. Summary Statistics:
Calculate basic statistics like mean, median, mode, minimum, maximum, and standard
deviation for numerical data.
Example: In a dataset of test scores, you might calculate the average score and identify
the highest and lowest scores.
4. Data Visualization:
Create visual representations like histograms, bar charts, or scatter plots to observe
trends and outliers.
Example: A scatter plot can show if there’s a relationship between advertising spend
and sales revenue.
5. Identify Outliers:
Look for data points that stand out from the rest and may be errors or extreme cases.
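The exploration steps above can be sketched with pandas and Matplotlib. The test-score DataFrame is hypothetical; `describe()` covers the summary statistics, and the common 1.5 × IQR rule stands in for outlier detection.

```python
# Minimal sketch of the data exploration steps on hypothetical test scores.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E", "F"],
    "score":   [72, 85, 90, 68, 95, 14],   # 14 looks like a possible outlier
})

# Understand the data structure: column names and data types.
print(scores.dtypes)

# Summary statistics: mean, min, max, standard deviation, quartiles.
print(scores["score"].describe())

# Data visualization: a histogram of the score distribution.
scores["score"].plot(kind="hist", bins=5, title="Score distribution")
plt.show()

# Identify outliers: flag values far outside the quartiles (1.5 * IQR rule).
q1, q3 = scores["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = scores[(scores["score"] < q1 - 1.5 * iqr) |
                  (scores["score"] > q3 + 1.5 * iqr)]
print(outliers)
```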
2. Fixing Data
After exploring the data, it’s important to clean and fix any issues so the data is accurate,
complete, and ready for analysis. Here are the common steps to fix data:
Steps in Fixing Data:
1. Handling Missing Data:
Decide whether to fill in missing values or to drop rows and columns that have too
many gaps.
Example: If 80% of the entries in a column for "customer comments" are missing,
it might make sense to drop that column.
2. Dealing with Outliers:
Investigate Outliers: Look into whether outliers are valid or caused by errors. If
they are errors, remove or correct them.
Transformation: In some cases, outliers are extreme but valid data points. You
might use techniques like log transformation to minimize their impact in the
analysis.
3. Standardizing Data:
Consistent Formatting: Ensure all data is in a consistent format (e.g., dates are in
the same format, text is properly capitalized).
Fixing Typos and Mistakes: Identify and correct any typos or errors in the dataset.
Example: In a product dataset, if a product price is listed as $1000 instead of $100,
this is an error and should be corrected.
Removing Duplicates: Check for and remove any duplicate rows in the dataset.
Example: In a customer list, the same customer might be listed twice due to a
duplicate entry, which should be removed.
Ensure Correct Data Types: Make sure that each column has the correct data type
(e.g., numerical data is not stored as text).
Example: In a dataset of sales transactions, the price might be stored as text (e.g.,
“$50”), which should be converted to numerical values for analysis.
Scenario:
o You are analyzing sales data for an online retail store, but there are issues with the
dataset.
o Some sales records are missing customer email addresses.
o A few sales records have negative quantities, which is not possible.
o The date format in some records is inconsistent (some are in "MM/DD/YYYY" format,
and others are in "DD-MM-YYYY" format).
o Duplicate entries exist where the same order is recorded twice.
Steps to Fix the Data:
1. Handling Missing Data: For missing customer email addresses, you decide to fill in a
placeholder email like “unknown@example.com” for records where it’s missing.
2. Correcting Outliers: For sales with negative quantities, you correct the records by either
setting the quantity to zero or investigating further to fix the actual quantity sold.
3. Standardizing Data: You reformat all the dates to follow the "YYYY-MM-DD" format,
ensuring consistency.
4. Removing Duplicates: You find and remove all duplicate sales records, keeping only the
unique entries.
By the end of this process, your dataset is now clean and ready for further analysis, such as
calculating total sales or identifying trends in customer purchases.
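A minimal pandas sketch of the four fixes is shown below, using a hypothetical sales table; the placeholder email and the two date formats mirror the scenario above.

```python
# Minimal sketch of the four data-fixing steps on hypothetical sales records.
import pandas as pd

sales = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "email":    ["a@x.com", None, None, "c@x.com"],
    "quantity": [2, 1, 1, -3],
    "date":     ["03/25/2023", "26-03-2023", "26-03-2023", "03/27/2023"],
})

# 1. Handling missing data: fill missing emails with a placeholder.
sales["email"] = sales["email"].fillna("unknown@example.com")

# 2. Correcting outliers: negative quantities are impossible, so set them to 0
# (or investigate the true quantity sold).
sales.loc[sales["quantity"] < 0, "quantity"] = 0

# 3. Standardizing data: try MM/DD/YYYY first, fall back to DD-MM-YYYY,
# then write everything out as YYYY-MM-DD.
parsed = pd.to_datetime(sales["date"], format="%m/%d/%Y", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(sales["date"], format="%d-%m-%Y",
                                      errors="coerce"))
sales["date"] = parsed.dt.strftime("%Y-%m-%d")

# 4. Removing duplicates: keep only unique order records.
sales = sales.drop_duplicates()
print(sales)
```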
Data Exploration
Data exploration is the process of analyzing datasets to discover patterns, spot anomalies, test
hypotheses, and check assumptions. It’s an essential first step in any data analysis or data science
project because it helps you understand the dataset before performing advanced modeling or
analysis.
Types of Data Exploration
There are several methods of exploring data. These methods can broadly be classified into
univariate, bivariate, and multivariate explorations:
1. Univariate Data Exploration:
This method involves exploring one variable at a time.
Goal: To understand the distribution, central tendency (mean, median, mode), and spread
(variance, standard deviation) of the data.
Techniques:
Summary Statistics: Calculating measures like mean, median, mode, and standard
deviation for a single variable.
Data Visualization: Using plots like histograms, box plots, and pie charts to visualize the
distribution of the variable.
Example: Exploring the sales price of products in a dataset to understand the average price
and the spread of prices.
2. Bivariate Data Exploration:
This method involves exploring two variables at a time to understand the relationship
between them.
Techniques:
Scatter Plots: Used to visualize the relationship between two numerical variables.
Correlation Coefficients: To measure the strength and direction of the relationship (e.g.,
Pearson correlation).
Cross Tabulations (Contingency Tables): Used when exploring relationships between
categorical variables.
Example: Exploring the relationship between advertising spend and sales revenue to see if
there is a positive correlation.
3. Multivariate Data Exploration:
This method involves exploring more than two variables simultaneously.
Goal: To understand interactions between multiple variables, such as how a group of factors
together influences a certain outcome.
Techniques:
Multidimensional Plots: Heat maps, pair plots, or 3D scatter plots to visualize interactions.
Principal Component Analysis (PCA): To reduce the dimensionality of data and identify
important features.
Example: Analyzing the effect of advertising spend, product quality, and customer reviews
together on overall sales.
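As a sketch of multivariate exploration, the snippet below applies PCA from scikit-learn (an assumed dependency) to synthetic data with three related variables; every value here is made up for illustration.

```python
# Minimal PCA sketch on synthetic data with three related variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ad_spend = rng.normal(100, 20, 50)
quality = rng.normal(7, 1, 50)
reviews = quality + rng.normal(0, 0.5, 50)      # correlated with quality
X = np.column_stack([ad_spend, quality, reviews])

# Reduce three variables to two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# The explained variance ratio shows how much information each component
# keeps; highly correlated variables compress well.
print(pca.explained_variance_ratio_)
```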
Data Exploration Techniques:
1. Summary Statistics
Calculating measures such as mean, median, mode, and standard deviation to summarize
the data.
2. Data Visualization
Example: Creating histograms to visualize the distribution of product prices, or scatter plots
to see the relationship between two variables (like sales and advertising).
3. Data Grouping
Grouping records by a category (e.g., sales by region) to compare summary statistics
across groups.
4. Outlier Detection
Identifying data points that are significantly different from others.
Example: Detecting an unusually high purchase price in a sales dataset, which may indicate
either an error or a special case.
Exploration Tools
There are several tools available for data exploration, ranging from basic tools to advanced
platforms.
1. Microsoft Excel
Excel is a widely used tool for data exploration, especially for small datasets. It
provides basic summary statistics, simple visualizations, and data manipulation
functions.
Features:
Pivot tables, built-in charts, and summary functions (e.g., AVERAGE, COUNT).
2. Tableau
Tableau is a business intelligence tool that allows users to create interactive and
shareable dashboards. It’s widely used for data visualization and exploration.
Features:
Drag-and-drop interface for building complex visualizations.
1. Data Storage
Data storage refers to the methods and technologies used to keep data safe and accessible. There
are several types of storage systems and media:
i. Hard Disk Drives (HDDs): Traditional storage devices that use spinning disks to
read and write data. They offer large storage capacities but are slower compared to
newer technologies.
Example: A computer’s internal HDD where files, applications, and
operating system data are stored.
ii. Solid State Drives (SSDs): Storage devices that use flash memory to store data,
providing faster access speeds and greater reliability compared to HDDs.
Example: SSDs in modern laptops and smartphones for quicker boot times
and file access.
iii. Optical Discs: Media like CDs, DVDs, and Blu-ray discs that use laser technology
to read and write data.
iv. Cloud Storage: Online services that store data on remote servers accessed over
the internet.
Examples:
Google Drive: Provides online storage for documents, photos, and other files.
Database Storage:
Systems designed to store and manage structured data efficiently, often used in businesses and
organizations.
Types:
Relational Databases: Use tables to store data in rows and columns, supporting SQL
queries.
Example: MySQL or PostgreSQL for managing sales and customer records.
NoSQL Databases: Store data in flexible, non-tabular formats such as documents or
key-value pairs.
Example: MongoDB or Cassandra for handling large volumes of diverse data types.
2. Data Management
Data management involves organizing, maintaining, and ensuring the quality and security of data
throughout its lifecycle. This includes tasks like data organization, backup, and ensuring data
integrity.
Key Components of Data Management
1. Data Organization:
Data Structuring: Organizing data in a logical way to make it easily accessible. This
includes creating schemas, tables, and indexes in databases.
Example: In a customer database, structuring data into tables like "Customers,"
"Orders," and "Products" to organize information effectively.
Metadata Management: Managing data about data, which includes details about data
sources, data structure, and data definitions.
Example: Documenting the data fields in a sales database, such as field names,
types, and descriptions.
2. Data Backup and Recovery:
Backup: Regularly copying data to prevent loss in case of hardware failure, accidental
deletion, or corruption.
Example: Setting up automatic backups to an external hard drive or cloud storage
to protect important files.
Recovery: Restoring data from backups to recover from data loss incidents.
3. Data Integrity:
Data Validation: Checking data for accuracy and completeness when entering or
importing it.
4. Data Security:
Protecting data from unauthorized access and breaches.
Techniques:
Encryption: Converting data into a secure format that can only be read by
authorized users.
5. Data Lifecycle Management:
Stages: Creation, storage, use, archiving, and deletion of data.
Example Scenario:
A small business needs to manage its customer data, sales records, and financial
information.
Storage Solution:
• Cloud Storage: The business uses Google Drive to store customer contact information and
sales reports. This allows access from any device and provides automatic backups.
• Database Storage: The business uses MySQL to manage customer transactions, sales
records, and inventory. This helps in querying data efficiently and generating reports.
• Organize Data: Create structured tables in MySQL for customers, orders, and products.
• Backup Data: Schedule daily backups of the MySQL database to a cloud storage service.
• Ensure Data Integrity: Validate data entry forms to check for correct and complete customer
details.
• Secure Data: Encrypt sensitive customer information and set access controls to ensure only
authorized staff can access financial records.
• Manage Data Lifecycle: Archive old sales records that are no longer needed for daily
operations but must be retained for regulatory compliance. Delete obsolete or redundant
data periodically.
4. Comprehensive Analysis:
Example: Combining financial data with market research can help investors make
more informed decisions about stock purchases.
Examples:
• IoT Devices: Data from smart home devices, wearable technology, and
industrial sensors.
• Environmental Sensors: Data on temperature, humidity, and pollution
levels.
2. Data Cleaning:
Standardize Formats: Ensure data from different sources follows the same format
for consistency.
3. Data Transformation:
Merge Data: Combine datasets using common fields or identifiers.
Example: Merging customer purchase data with CRM data based on
customer IDs.
4. Data Storage:
Unified Storage: Store integrated data in a central repository like a data warehouse
or database.
5. Data Analysis:
Analyze Patterns: Use analytical tools to uncover insights from the integrated data.
6. Data Visualization:
Visualize Insights: Create dashboards or reports to present findings from the
combined data.
Example: Developing a dashboard that shows sales trends alongside social
media engagement metrics.
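The transformation/merge step can be sketched with pandas. The `customer_id` key and both tables below are hypothetical stand-ins for the purchase and CRM data mentioned above.

```python
# Minimal sketch of merging two datasets on a shared identifier.
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount":      [250.0, 90.0, 40.0],
})
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment":     ["premium", "regular", "regular"],
})

# Merge on the common identifier to get one unified dataset, ready to be
# loaded into a warehouse or analyzed directly.
combined = purchases.merge(crm, on="customer_id", how="inner")
print(combined)
```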
1. Data Integration Platforms:
These platforms provide tools for extracting, transforming, and loading (ETL) data
from various sources into a unified system.
2. Data Warehousing Solutions:
These solutions store large volumes of integrated data and support complex queries
and analysis.
3. Business Intelligence Tools:
These tools enable users to create visualizations and reports from integrated data
sources.
4. Data Integration Services:
Example Scenario:
Scenario:
A retail company wants to understand the impact of marketing campaigns on sales.
Data Sources:
Internal Data: Sales records, customer purchase history, and marketing campaign details.
External Data: Social media engagement metrics from an analytics platform.
Collection: Gather sales data from the company's database and social media metrics from
an analytics platform.
Cleaning: Standardize formats (e.g., dates, currencies) and handle any missing or
inconsistent data.
Transformation: Merge sales data with social media engagement data using common
identifiers such as campaign names.
Storage: Store the combined data in a cloud-based data warehouse.
Analysis: Perform analysis to identify correlations between marketing efforts and sales
performance.
Visualization: Create a dashboard showing sales trends and social media engagement
metrics for campaign evaluation.
1. Data Integration
Combine data from different sources into a unified format for analysis and reporting.
Data Virtualization:
Create a virtual view of data from different sources without moving the data
physically.
Benefits: Provides a unified view and real-time access to data without duplication.
2. Data Consolidation
Aggregate data from multiple sources into a single system or platform.
Data Warehousing:
Central repository that stores data from various sources, optimized for querying and
analysis.
Benefits: Scalable and can handle large volumes of diverse data types.
Data Validation:
Implement rules to ensure data meets specific criteria.
Example: Validate email addresses and phone numbers during data entry.
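A minimal sketch of rule-based validation with Python's `re` module; the patterns are deliberately simple illustrations, not production-grade rules.

```python
# Minimal sketch of rule-based validation for emails and phone numbers.
import re

def is_valid_email(value: str) -> bool:
    # One "@" and at least one "." in the domain part.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

def is_valid_phone(value: str) -> bool:
    # Exactly 10 digits, ignoring spaces and dashes.
    digits = re.sub(r"[\s-]", "", value)
    return digits.isdigit() and len(digits) == 10

print(is_valid_email("anna@example.com"))   # True
print(is_valid_email("anna@example"))       # False
print(is_valid_phone("98765 43210"))        # True
```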
Data Profiling:
Analyze data sources to understand their structure, content, and quality.
Data Cataloging:
Maintain an organized inventory of data assets and their metadata.
6. Data Governance
Establish policies and procedures for data management.
Data Policies:
Define rules for how data is accessed, used, and protected.
Example Scenario:
Scenario:
A retail company wants to integrate customer feedback from social media with sales data to
improve marketing strategies.
• Data Integration:
Use ETL tools to extract customer feedback from social media platforms and sales data
from internal systems.
Transform the data to a common format and load it into a central data warehouse.
• Data Consolidation:
Store the integrated data in a data warehouse for easy access and analysis.
• Data Synchronization:
Synchronize sales data with social media metrics in real-time to keep insights up-to-date.
• Metadata Management:
Document data sources, field definitions, and formats for the integrated data.
• Data Governance:
Apply policies controlling who can access and modify the combined data.
• Descriptive Statistics
• Measures of central tendency include the mean, median, and mode.
• Measures of data dispersion include quartiles, the interquartile range (IQR), and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.
a) Measuring Central Tendency: Mean
• The most common and most effective numerical measure of the “center” of
a set of data is the arithmetic mean.
▪ Arithmetic Mean: x̄ = (x₁ + x₂ + … + xₙ) / n
▪ Although the mean is the single most useful quantity for describing a data
set, it is not always the best way of measuring the center of the data.
➢ A major problem with the mean is its sensitivity to extreme
(outlier) values.
➢ Even a small number of extreme values can corrupt the mean.
▪ To offset the effect caused by a small number of extreme values, we can
instead use the trimmed mean.
▪ A trimmed mean is obtained after chopping off values at the high and
low extremes.
Example
What are the central tendency measures (mean, median, mode) for the following attributes?
attr1 = {2, 4, 4, 6, 8, 24}
mean = (2+4+4+6+8+24)/6 = 8
median = (4+6)/2 = 5 (average of the two middle values)
mode = 4 (the only value that appears more than once)
attr2 = {2, 4, 7, 10, 12}
mean = (2+4+7+10+12)/5 = 7 (average of all values)
median = 7 (middle value)
mode = none (all values have the same frequency)
attr3 = {xs, s, s, s, m, m, l}
mean = not meaningful for ordinal values
median = s (the middle of the seven ordered values)
mode = s (appears most often)
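These results can be checked with Python's statistics module; the trimmed mean uses scipy (an assumed dependency) to show how chopping off extremes tames the outlier 24 in attr1.

```python
# Minimal sketch verifying the central tendency measures for attr1.
import statistics
from scipy import stats

attr1 = [2, 4, 4, 6, 8, 24]
print(statistics.mean(attr1))    # 8
print(statistics.median(attr1))  # 5.0
print(statistics.mode(attr1))    # 4

# Trimmed mean: chop off 20% of values at each extreme before averaging,
# which reduces the influence of the outlier 24.
print(stats.trim_mean(attr1, proportiontocut=0.2))  # mean of [4, 4, 6, 8]
```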
Variance
• Variance measures how spread out the values in a data set are. It is
defined as the expected value of the squared deviation from the mean. It is
calculated by:
1. Finding the mean (average) of the data set.
2. Subtracting the mean from each data point to get the deviations from the
mean.
3. Squaring each of the deviations.
4. Calculating the average of the squared deviations. This is the variance.
Standard Deviation
• Standard deviation is a measure of the amount of variation or dispersion of a set
of values from the mean value.
• It is calculated as the square root of the variance, which is the average squared
deviation from the mean.
Interquartile Range (IQR)
• Interquartile Range (IQR) is a measure of statistical dispersion that represents the
middle 50% of a data set.
• It is calculated as the difference between the 75th percentile (Q3) and the
25th percentile (Q1) of the data i.e., IQR = Q3 − Q1.
Examples for Dispersion
Let’s consider the same dataset of daily temperatures recorded over a week: 22°C,
23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
Range: Maximum temperature – Minimum temperature = 25°C – 20°C =
5°C
Variance: Variance = (Sum of squared differences from the mean) /
(Number of data points)
Mean = (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43 °C
Sum of squared differences from the mean = (22 – 22.43)² + (23 – 22.43)² +
(21 – 22.43)² + (25 – 22.43)² + (22 – 22.43)² + (24 – 22.43)² + (20 – 22.43)²
= (–0.43)² + (0.57)² + (–1.43)² + (2.57)² + (–0.43)² + (1.57)² + (–2.43)²
= 0.18 + 0.33 + 2.04 + 6.61 + 0.18 + 2.47 + 5.90
= 17.71
Thus, Variance = 17.71 / 7 ≈ 2.53
Standard Deviation: Take the square root of the variance to get the
standard deviation.
Thus, Standard Deviation ≈ √2.53 ≈ 1.59 °C
Interquartile Range (IQR): First Quartile (Q1) = 21°C Third Quartile
(Q3) = 24°C
Thus, IQR = Q3 − Q1 = 24°C – 21°C = 3°C
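A short Python sketch verifying these numbers. Quartiles are computed as the medians of the lower and upper halves (excluding the middle value), matching the convention used above; note that other quartile conventions give slightly different values.

```python
# Minimal sketch verifying the dispersion calculations for the temperatures.
import statistics

temps = [22, 23, 21, 25, 22, 24, 20]

mean = sum(temps) / len(temps)
variance = sum((t - mean) ** 2 for t in temps) / len(temps)  # population variance
std_dev = variance ** 0.5

print(round(mean, 2))      # 22.43
print(round(variance, 2))  # 2.53
print(round(std_dev, 2))   # 1.59

# IQR: medians of the lower and upper halves, excluding the middle value.
s = sorted(temps)                 # [20, 21, 22, 22, 23, 24, 25]
q1 = statistics.median(s[:3])     # 21
q3 = statistics.median(s[4:])     # 24
print(q3 - q1)                    # IQR = 3
```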
It is crucial to understand a data set in depth before you perform data analysis and run it through
an algorithm. You need to know the patterns in your data, determine which variables are
important, and identify which ones do not play a significant role in the output. Further, some
variables may be correlated with others, and you also need to recognize errors in your data.
Exploratory data analysis can do all of this. It helps you gather insights, better sense the data, and
remove irregularities and unnecessary values.
1. Understand the Problem
• Familiarize yourself with the data set, understand the domain, and identify the
objectives of the analysis.
2. Data Collection
• Collect the required data from various sources such as databases, web scraping, or
APIs.
3. Data Cleaning
4. Data Transformation
5. Data Integration
6. Data Exploration
• Univariate Analysis: Examine each variable individually with summary statistics and
distribution plots.
• Bivariate Analysis: Analyze the relationship between two variables with scatter plots,
correlation coefficients, and cross-tabulations.
7. Data Visualization
• Visualize data distributions and relationships using visual tools such as bar charts, line
charts, scatter plots, heatmaps, and box plots.
8. Descriptive Statistics
• Calculate central tendency measures (mean, median, mode) and dispersion measures
(range, variance, standard deviation).
9. Pattern and Outlier Detection
• Detect patterns, trends, and outliers in the data using visualizations and statistical
methods.
10. Hypothesis Testing
• Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to
validate assumptions or relationships in the data.
11. Documentation and Iteration
• Document the EDA process, findings, and insights in a clear, structured way.
• Continuously refine the analysis based on feedback and additional questions raised
during the process.
1. Univariate Analysis
• Techniques:
2. Bivariate Analysis
• Techniques:
❖ Scatter plots.
3. Multivariate Analysis
• Techniques:
❖ Cluster analysis.
4. Descriptive Statistics
• Techniques:
❖ Frequency distributions.
5. Graphical Analysis
• Techniques:
6. Dimensionality Reduction
• Purpose: To simplify models, reduce computation time, and mitigate the curse of
dimensionality.
• Techniques:
Using the following tools for exploratory data analysis, data scientists can gain deeper insights
and prepare data for advanced analytics and modeling.
1. Python Libraries
• Pandas: Provides data structures and functions needed to manipulate structured data
seamlessly.
• Matplotlib: A foundational plotting library for Python.
• Use: Basic plots like line charts, scatter plots, and bar charts.
• Seaborn: A statistical visualization library built on top of Matplotlib.
• Use: Advanced visualizations like heatmaps, violin plots, and pair plots.
2. R Libraries
• ggplot2: A framework for creating graphics using the principles of the Grammar of
Graphics.
• dplyr: A set of tools for data manipulation, offering consistent verbs to address common
data manipulation tasks.
• tidyr: Provides functions to help you organize your data in a tidy way.
• shiny: An R package that makes building interactive web apps straight from R easy.
3. Other Tools and Environments
• Jupyter Notebook: An open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text.
• RStudio: An integrated development environment for R that offers tools for writing and
debugging code, building software, and analyzing data.
• Tableau: A top data visualization tool that facilitates the creation of diverse charts and
dashboards.
• Power BI: A Microsoft business analytics service offering interactive visualizations and
business intelligence features.
• SAS: A software suite developed by SAS Institute for advanced analytics, business
intelligence, data management, and predictive analytics.
• OpenRefine: A powerful tool for cleaning messy data, transforming formats, and
enhancing it with web services and external data.
• SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and
query relational databases.
Similarity Measures
• L2 or Euclidean
The most common and intuitive distance metric is L2, or Euclidean, distance. We
can imagine this as the straight-line distance between two data objects; for example,
how far your screen is from your face.
• Cosine Similarity
We use the term "cosine similarity" or "cosine distance" to denote the difference in
orientation between two vectors. For example, two vectors pointing in the same direction
are maximally similar, while two pointing in opposite directions are maximally dissimilar.
Cosine similarity is a good place to start when considering the similarity between two
vectors.
• Jaccard Similarity
Jaccard similarity, also known as the Jaccard index, is a statistic used to measure the
similarity between two data sets. It is measured as the size of the intersection of
two sets divided by the size of their union.
For example: Given two sets A and B, their Jaccard similarity is given by
J(A, B) = |A ∩ B| / |A ∪ B|.
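A plain-Python sketch of the three similarity measures; the vectors and sets below are hypothetical.

```python
# Minimal sketch of Euclidean distance, cosine similarity, and Jaccard similarity.
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

# Euclidean (L2) distance: straight-line distance between two points.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cosine similarity: 1.0 means the vectors point in the same direction.
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cosine = dot / (norm_a * norm_b)

# Jaccard similarity: |A ∩ B| / |A ∪ B| for two sets.
A, B = {1, 2, 3, 4}, {3, 4, 5}
jaccard = len(A & B) / len(A | B)

print(euclidean)  # ~3.74
print(cosine)     # 1.0 (b is a scaled copy of a)
print(jaccard)    # 2/5 = 0.4
```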
Dissimilarity Measures
• Manhattan Distance
The Manhattan distance, also called taxicab distance or city-block distance, is
another popular distance metric. Imagine you are on a two-dimensional plane and
can move only along the axes: the Manhattan distance is the total distance traveled,
i.e., the sum of the absolute differences of the coordinates.
• Hamming Distance
Hamming distance between two codewords is the number of places by which the
codewords differ. For two codes c1 and c2, the Hamming distance is denoted
as d(c1, c2).
It is the number of positions at which corresponding symbols of two equal length
strings are different. It is named after Richard Hamming, an American
mathematician and computer engineer.
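Both dissimilarity measures in a short plain-Python sketch, with hypothetical points and codewords.

```python
# Minimal sketch of Manhattan distance and Hamming distance.
p = (1, 2)
q = (4, 6)

# Manhattan distance: sum of absolute coordinate differences
# (moving only along the axes, like city blocks).
manhattan = sum(abs(x - y) for x, y in zip(p, q))
print(manhattan)  # |1-4| + |2-6| = 7

# Hamming distance: number of positions where two equal-length
# strings (or codewords) differ.
c1, c2 = "10110", "11100"
hamming = sum(s1 != s2 for s1, s2 in zip(c1, c2))
print(hamming)  # positions 2 and 4 differ (counting from 1) -> 2
```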
• Graphical Representation of Data
Advantages:
• It improves the way of analyzing and learning, as a graphical representation makes data
easy to understand.
• It can be used in almost all fields from mathematics to physics to psychology and so on.
• It is easy to understand for its visual impacts.
• It shows the whole and huge data in an instance.
• It is mainly used in statistics to determine the mean, median, and mode of different data
sets.
Disadvantages:
• The main disadvantage of graphical representation of data is that it takes a lot of effort as
well as resources to find the most appropriate data and then represent it graphically.
Rules for Graphical Representation of Data:
• Suitable Title: The title of the graph should be appropriate and indicate the subject of
the presentation.
• Measurement Unit: The measurement unit in the graph should be mentioned.
• Proper Scale: A proper scale needs to be chosen to represent the data accurately.
• Index: For better understanding, index the appropriate colors, shades, lines, designs in
the graphs.
• Data Sources: Data sources should be mentioned at the bottom of the graph wherever
necessary.
• Simple: The construction of a graph should be easily understood.
• Neat: The graph should be visually neat in terms of size and font to read the data
accurately.
Data is represented in different types of graphs such as plots, pies, diagrams, etc. They
are as follows,
1. Bar Graph
• A bar graph is a graph that shows complete data with rectangular bars and the
heights of bars are proportional to the values that they represent. The bars in the
graph can be shown vertically or horizontally.
• Bar graphs, also known as bar charts, are a pictorial representation of grouped data
and one of the ways of handling data. A bar graph is an excellent tool to represent
data that are:
1. Independent of one another and
2. That do not need to be in any specific order while being represented.
• The bars give a visual display for comparing quantities in different categories.
• A bar graph has two axes, horizontal and vertical, also called the x-axis and
y-axis, along with a title, labels, and a scale range.
Types of Bar Graphs
The bars in bar graphs can be plotted horizontally or vertically, but the most commonly
used bar graph is the vertical bar graph. Apart from the vertical and horizontal bar
graphs, there are two more types of bar graphs, which are given below:
• Grouped Bar Graph
• Stacked Bar Graph
2. Pie Chart
• A pie chart is a type of graph that records data in a circular form, divided into sectors
that each represent a part of the whole.
• Each of these sectors or slices represents the proportionate part of the whole.
• Pie charts, also commonly known as pie diagrams, help in interpreting and representing
data more clearly. They are also used to compare the given data.
3. Line Graph
• A line graph is a type of chart or graph that is used to show information that changes over
time. A line graph can be plotted using several points connected by straight lines.
4. Histogram
• A histogram is the graphical representation of data where data is grouped into continuous
number ranges and each range corresponds to a vertical bar.
• The horizontal axis displays the number range.
• The vertical axis (frequency) represents the amount of data that is present in each range.
• The number ranges depend upon the data that is being used.
5. Scatter Plot
• A scatter plot is a means to represent data in a graphical format.
• A simple scatter plot makes use of the Coordinate axes to plot the points, based on their
values.
• For example, data for age (of a child, in years) and height (of the child, in feet)
can be represented as a scatter plot, as in the sketch below.
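A minimal Matplotlib sketch drawing all five graph types with small hypothetical datasets.

```python
# Minimal sketch of the five graph types using hypothetical data.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5, figsize=(20, 3))

# 1. Bar graph: comparing quantities across categories.
axes[0].bar(["A", "B", "C"], [5, 9, 3])
axes[0].set_title("Bar")

# 2. Pie chart: proportions of a whole.
axes[1].pie([40, 35, 25], labels=["X", "Y", "Z"])
axes[1].set_title("Pie")

# 3. Line graph: change over time.
axes[2].plot([2019, 2020, 2021, 2022], [10, 14, 12, 18])
axes[2].set_title("Line")

# 4. Histogram: distribution over continuous ranges.
axes[3].hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
axes[3].set_title("Histogram")

# 5. Scatter plot: age (years) vs height (feet), values hypothetical.
axes[4].scatter([3, 5, 7, 9, 11], [3.0, 3.5, 4.0, 4.4, 4.8])
axes[4].set_title("Scatter")

plt.tight_layout()
plt.show()
```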