
Introduction to Data Analytics and Machine Learning Module 1

i) Introduction to Data Analytics:


• What is Data Analytics?
Data Analytics is the process of analyzing raw data to find trends, patterns, and
insights that help in decision-making. It involves collecting, organizing, and
examining data to make it meaningful and useful.

• Why is Data Analytics Important?


In today’s world, businesses and organizations collect a huge amount of data.
However, data alone isn’t valuable unless we can understand what it’s telling us.
Data Analytics helps in:

1. Improving decision-making:
By analyzing data, companies can make informed decisions.
2. Identifying trends:
It helps in spotting trends that can shape future strategies.
3. Solving problems:
Data analytics can help find the root cause of issues within a business.
4. Boosting efficiency:
By understanding what works and what doesn't, companies can
streamline their processes.
• Steps in Data Analytics
1. Data Collection:
This is the first step where raw data is collected from various sources, like
databases, surveys, social media, sensors, etc.
2. Data Cleaning:
The raw data often has errors or missing values. Data cleaning involves correcting
these errors and filling in missing information to make the data accurate and ready
for analysis.
3. Data Analysis:
This is the most important step where data is analyzed using different techniques.
These could include:
➢ Statistical analysis: Applying mathematical formulas to data to find
averages, trends, or patterns.
➢ Visualization: Using charts, graphs, and tables to represent data so it’s easier
to understand.
4. Data Interpretation:
After analyzing, the results need to be interpreted. This means making sense of the
findings and seeing how they apply to real-world problems or decisions.


5. Data Reporting:
Once the analysis is complete, the insights are shared with others through reports
or dashboards. These insights help decision-makers in an organization.
• Example
Let’s take the example of an online shopping website.

Data Collection: The website collects data about what products customers are
browsing, what they buy, and the ratings they give.
Data Cleaning: They remove duplicate data (if a user accidentally clicked multiple
times) or incorrect entries.

Data Analysis: The company analyzes customer buying patterns. For example,
they might notice that people tend to buy more shoes during certain times of the
year.
Data Interpretation: From this, they learn that they should increase their stock of
shoes before certain holidays to meet demand.

Data Reporting: The analysis is presented to the sales team, who can then plan
promotions and stock inventory accordingly.
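
The five steps above can be sketched in a few lines of Python with the pandas library. The file name orders.csv and its columns (product, quantity, order_date) are made up purely for illustration; this is a minimal sketch of the flow, not a complete pipeline.

```python
import pandas as pd

# Data Collection: load raw order data (hypothetical file and column names)
orders = pd.read_csv("orders.csv")   # columns: product, quantity, order_date

# Data Cleaning: drop duplicate entries and rows with missing key values
orders = orders.drop_duplicates().dropna(subset=["product", "quantity"])

# Data Analysis: total units sold per product per month
orders["order_date"] = pd.to_datetime(orders["order_date"])
monthly = orders.groupby([orders["order_date"].dt.month, "product"])["quantity"].sum()

# Data Interpretation / Reporting: list the top sellers so the sales team can plan stock
print(monthly.sort_values(ascending=False).head(10))
```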

ii) Types of Data:


Data can be classified into different types based on how it is represented, stored, and used.
Understanding these types helps in selecting the right method to analyze the data. Here are the
main types of data explained in simple terms:

1. Qualitative Data (Categorical Data)


Qualitative data describes qualities or characteristics. This type of data is not numeric and
cannot be measured in numbers. It is used to classify things into categories.

Example:

a) Colors of cars: Red, Blue, Black.


b) Types of fruits: Apple, Banana, Orange.
c) Customer feedback: “Excellent,” “Good,” “Poor.”

Qualitative data can be divided into two subtypes:

o Nominal Data: Categories without any order.


Example: Gender (Male, Female), Types of pets (Dog, Cat, Bird).


o Ordinal Data: Categories with a meaningful order but no fixed difference between
them.

Example: Customer satisfaction levels (Satisfied, Neutral, Unsatisfied),


Education level (High school, Bachelor’s, Master’s).

2. Quantitative Data (Numerical Data)


Quantitative data is numerical and can be measured. It answers questions like “how much?”
or “how many?”. Quantitative data can be further divided into two types:
a. Discrete Data
Discrete data refers to numbers that are counted and are whole numbers
(integers). It cannot take decimal or fractional values.

Example:
a) Number of students in a class: 25, 30, 40.
b) Number of cars sold in a month: 100, 150, 200.

b. Continuous Data
Continuous data refers to numbers that can take any value within a given
range. It includes decimal points and is measured rather than counted.
Example:

a) Height of students: 5.2 feet, 5.8 feet, 6.1 feet.


b) Temperature: 72.5°F, 98.6°F.
3. Primary Data
Primary data is the data that you collect yourself, typically through surveys, experiments,
or observations. It is first-hand data, meaning it hasn’t been collected or processed by
anyone else.
Example:
a) A company conducts a survey to find out customer preferences for a new product.
b) A scientist collects data by conducting an experiment in a lab.
Advantages:

o It's specific to the research purpose.


o The data is up-to-date and relevant.

Disadvantages:

o Collecting primary data can be time-consuming and expensive.


4. Secondary Data
Secondary data is the data that has already been collected by someone else and is available
for you to use. This could include data from government reports, research papers, or
company records.

Example:
a) A company uses census data from the government to understand population trends.
b) A student uses data from a research paper to support their thesis.

Advantages:
o It’s quick and inexpensive to obtain.
o It’s already available and ready for analysis.
Disadvantages:

o It may not be specific to your needs.


o The data could be outdated or incomplete.
5. Structured Data
Structured data is organized and easily searchable in databases or spreadsheets. It is stored
in rows and columns, making it easy to analyze using formulas and queries.

Example:
a) A spreadsheet listing the sales of a store, with columns for the date, product, and
number of items sold.
b) A table showing customer information such as name, age, and contact details.
6. Unstructured Data
Unstructured data is data that doesn’t have a pre-defined format or organization. This type
of data is harder to analyze because it doesn’t fit neatly into tables.

Example:

a) Social media posts, emails, or videos.


b) Customer reviews on an e-commerce website.
7. Time Series Data
Time series data is data collected over a period of time. This type of data shows changes
or trends over time.


Example:
a) Stock prices recorded every day for a year.
b) Monthly sales data of a product over the past five years.
8. Cross-sectional Data
Cross-sectional data is collected at one point in time, providing a snapshot of a situation. It
doesn’t show changes over time but focuses on understanding a specific moment.

Example:
a) A survey taken by customers on a particular day to measure their satisfaction.
b) A health survey that measures the fitness levels of people at one point in time.
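
One quick way to see these data types in practice is to inspect column data types with pandas. The small table below is invented for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Orange"],          # qualitative, nominal
    "satisfaction": ["Good", "Excellent", "Poor"],   # qualitative, ordinal
    "cars_sold": [100, 150, 200],                    # quantitative, discrete
    "height_ft": [5.2, 5.8, 6.1],                    # quantitative, continuous
    "recorded_on": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),  # time dimension
})

# Mark the ordinal column with an explicit category order
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["Poor", "Good", "Excellent"],
                                    ordered=True)

print(df.dtypes)  # object / category / int64 / float64 / datetime64[ns]
```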

iii) Quality of Data:


The quality of data refers to how reliable, accurate, and fit the data is for the purpose for which it is being
used. High-quality data ensures that analysis and decision-making are accurate and trustworthy,
while poor-quality data can lead to mistakes and bad decisions.

Characteristics of Quality Data:


To determine if data is of high quality, it must meet certain characteristics. Here are the
main characteristics of quality data explained in simple terms:

1. Accuracy

Accuracy means the data correctly represents the real-world values or facts it is
meant to describe. It should be error-free and provide a true picture.

Example: A customer’s phone number should have the correct digits. If


there is a mistake in the phone number, the data is not accurate.

2. Completeness

Completeness means that all the necessary information is present. There should be
no missing values or gaps in the data.

Example: If you have a form with a customer’s name, address, and email
but the email field is blank, the data is incomplete.

3. Consistency

Consistency means that data is uniform and follows the same format across all
sources. There should be no contradictions in the data.


Example: If a customer’s name is recorded as "Anna Siju" in one system


and "Anna.S" in another, the data is inconsistent.

4. Timeliness

Timeliness means that the data is up-to-date and available when needed. Data needs
to be relevant to the time frame in which decisions are being made.
Example: Using last year’s sales data to make decisions for this year may
not be useful if the market conditions have changed.

5. Validity

Validity refers to how well the data matches the rules and requirements for its
intended use. Data should follow the correct format and meet predefined standards.

Example: If a date field requires the format "DD/MM/YYYY," and the data
is entered as "31/13/2023," it is invalid because the 13th month doesn’t
exist.

6. Uniqueness

Uniqueness means that each data entry is unique and not duplicated. Duplicate
records can cause confusion and errors.
Example: If the same customer is recorded twice in a database, once as
"Anna Siju" and again as "Anna.S," there is a duplication issue.

7. Relevance

Relevance means that the data should be directly related to the purpose for which
it is being used. Irrelevant data is not useful for analysis.

Example: If you are analyzing customer feedback to improve product


quality, collecting data on customer age might not be relevant to this
purpose.

8. Accessibility

Accessibility means that data should be easily available and understandable for
those who need it. If people cannot access the data or understand it, it is not useful.

Example: If a company's sales team needs access to customer data but the
database is too complicated for them to use, the data lacks accessibility.

9. Reliability


Reliability means that the data consistently gives the same results over time,
provided nothing has changed. It should be trustworthy and come from a reliable
source.

Example: If two different employees are recording sales figures and their
records always match, the data is reliable.

10. Security

Security means that data is protected from unauthorized access or corruption.


Sensitive data, like customer personal details, should be kept secure to prevent
misuse.

Example: A company needs to ensure that customer credit card information


is encrypted and only accessible by authorized personnel.

Examples of Good vs. Poor Data Quality:


Good Data Quality : A bank maintains a customer database with correct, up-to-date
contact information, and all transactions are recorded without errors. This ensures smooth
operations and proper communication with customers.
Poor Data Quality : A company uses customer data for marketing campaigns, but many
of the email addresses are invalid, some records are duplicated, and others are missing
critical information like phone numbers. This can lead to failed marketing efforts and
wasted resources.
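
Several of the characteristics above (completeness, uniqueness, validity, timeliness) can be checked programmatically. The following is a minimal pandas sketch; customers.csv and its columns are hypothetical, and the email pattern is only a rough illustration, not a full validator.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical columns: name, email, phone, signup_date

# Completeness: count missing values per column
print(customers.isna().sum())

# Uniqueness: flag duplicated customer records
print(customers.duplicated(subset=["name", "email"]).sum(), "duplicate rows")

# Validity: rough email format check
valid_email = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
print((~valid_email).sum(), "invalid email addresses")

# Timeliness/consistency: make sure dates parse; unparseable entries become NaT
customers["signup_date"] = pd.to_datetime(customers["signup_date"], errors="coerce")
print(customers["signup_date"].isna().sum(), "unparseable signup dates")
```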

iv) Data Preprocessing:


Data preprocessing is the process of transforming raw data into a clean and organized format,
making it ready for analysis or machine learning models. Since raw data often contains errors,
missing values, or is in an unsuitable format, preprocessing helps in preparing it for better results
and accurate decision-making.

Why is Data Preprocessing Important?


Raw data can have issues such as:

1. Missing values: Some information may be incomplete.

2. Inconsistent data: Data might be recorded in different formats or units.


3. Noise: Irrelevant data that can confuse the analysis.

4. Outliers: Extremely high or low values that may distort results.


Without proper preprocessing, using raw data can lead to inaccurate analysis, wrong predictions,
or poor decisions.

Steps in Data Preprocessing:


1. Data Collection: The first step is gathering the raw data from various sources. Data can
come from surveys, databases, sensors, social media, or websites. Once collected, it is
stored in a format (like a spreadsheet or database) for further processing.

Example: An e-commerce company collects customer data like names, addresses,


product purchases, and reviews.
2. Data Cleaning: This is the most crucial step. Data cleaning involves:
Handling missing data: You can either remove the rows/columns with missing data
or fill in the missing values using methods like:
Mean or median values (for numerical data).
The most frequent category (for categorical data).

Correcting errors: Fixing incorrect or inconsistent data entries.

Removing duplicates: Removing repeated entries.


Example: In a customer database, if some addresses are missing, you can
remove those rows or fill the missing values with the city name, if available.

3. Data Transformation: After cleaning, the data may need to be transformed into a suitable
format. This step involves:

Normalization/Standardization: Scaling numerical data to a common range so that
no single variable dominates others in the analysis.

Normalization: Converts data to a range between 0 and 1.

Standardization: Scales data so it has a mean of 0 and a standard deviation of 1.

Encoding categorical data: Converting text categories into numerical values.
Example: Converting "Male" and "Female" to 0 and 1 for analysis.
Example: If a dataset contains customer ages that range from 18 to
80, normalization might scale them to a range between 0 and 1.

4. Data Reduction: This step reduces the size of the dataset without losing essential
information. Techniques used for data reduction include:


Dimensionality reduction: Reducing the number of features or variables by combining similar ones.

Feature selection: Choosing only the most important features for analysis.
Example: In a dataset with 100 variables, only the top 10 that have the most
impact on customer behavior are selected for analysis.

5. Data Integration: If data comes from multiple sources (e.g., different databases or systems),
it is combined into a single dataset. This is necessary to avoid inconsistencies when
analyzing the data together.
Example: Combining sales data from both online and offline stores into one file to
get a complete picture of sales performance.

6. Data Discretization: If continuous data (like age, income) needs to be grouped into
categories or bins, discretization is applied. This makes the data easier to analyze.

Example: Grouping customer ages into categories like "18-25," "26-35," and "36-
50."

Example of Data preprocessing :


Imagine a healthcare company wants to analyze patient data to predict the likelihood of
heart disease. The raw data might include:
o Age, gender, cholesterol levels, blood pressure, and other medical records.
o Some patients may have missing blood pressure readings, and cholesterol levels may
be recorded in different units (mg/dL and mmol/L).

Steps:
1. Data Cleaning:

Fill in missing blood pressure readings with the average blood pressure for
the patient's age group.

Convert cholesterol values into the same unit (mg/dL) for consistency.

2. Data Transformation:
Normalize cholesterol and blood pressure values to a scale of 0 to 1.
Convert gender ("Male"/"Female") into numerical values (0 and 1).

3. Data Reduction:

Remove unnecessary features like patient ID or address, which don’t affect


the heart disease prediction.


4. Data Integration:
Combine patient data from different hospitals into one dataset.

5. Data Discretization:
Group patients' ages into ranges (e.g., "18-30," "31-45," "46-60") for easier
analysis.
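
A minimal pandas sketch of this healthcare scenario is shown below. The file patients.csv and its column names are assumptions made for illustration; the conversion factor used is the standard one for cholesterol (1 mmol/L is about 38.67 mg/dL), and min-max normalization follows x' = (x - min) / (max - min).

```python
import pandas as pd

patients = pd.read_csv("patients.csv")  # hypothetical columns: patient_id, age, gender,
                                        # cholesterol, cholesterol_unit, blood_pressure

# 1. Data Cleaning: fill missing blood pressure with the mean of the patient's age group
patients["age_group"] = pd.cut(patients["age"], bins=[17, 30, 45, 60, 120],
                               labels=["18-30", "31-45", "46-60", "60+"])
patients["blood_pressure"] = patients.groupby("age_group")["blood_pressure"] \
                                     .transform(lambda s: s.fillna(s.mean()))

# Convert cholesterol recorded in mmol/L to mg/dL for consistency
mmol = patients["cholesterol_unit"] == "mmol/L"
patients.loc[mmol, "cholesterol"] = patients.loc[mmol, "cholesterol"] * 38.67

# 2. Data Transformation: min-max normalization and encoding gender as 0/1
for col in ["cholesterol", "blood_pressure"]:
    patients[col] = (patients[col] - patients[col].min()) / (patients[col].max() - patients[col].min())
patients["gender"] = patients["gender"].map({"Male": 0, "Female": 1})

# 3. Data Reduction: drop identifiers that do not affect the prediction
patients = patients.drop(columns=["patient_id"])

# 5. Data Discretization: the age_group column created above already bins the ages
print(patients.head())
```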

v) Example Applications of Data Preprocessing:


1. Healthcare:
Application: Predicting diseases based on patient data.
Preprocessing: In healthcare, data from various sources (medical records, test results) may
have missing values or inconsistencies. Data preprocessing is used to clean the data by
filling missing values (e.g., average blood pressure), converting units (e.g., blood sugar
levels), and normalizing medical records to ensure consistency. After preprocessing, the
cleaned data can be used to train machine learning models to predict diseases like diabetes
or heart disease.
2. E-commerce:

Application: Personalized product recommendations.


Preprocessing: In e-commerce, customer behavior data such as purchase history, browsing
patterns, and clicks may contain incomplete or inconsistent information. Data
preprocessing cleans and integrates data from various platforms (website, mobile app). The
data is then transformed (e.g., normalizing purchase amounts) and encoded (e.g.,
converting product categories into numerical values) to build recommendation systems that
suggest products based on customer preferences.

3. Finance:
Application: Fraud detection in banking transactions.

Preprocessing: Financial data often contains outliers (e.g., unusually large transactions)
and may be incomplete (e.g., missing transaction details). Data preprocessing involves
removing outliers, filling missing details (e.g., using past transaction history), and
standardizing transaction amounts. The preprocessed data is used in machine learning
models to detect fraudulent activities by identifying abnormal patterns.
4. Marketing:

Application: Customer segmentation for targeted advertising.


Preprocessing: Marketers use customer data like demographics, purchasing behavior, and
online activity. This data may contain noise (irrelevant information) or be inconsistent (e.g.,
inconsistent age formats). Preprocessing cleans and organizes the data, encoding
categorical values (e.g., gender as 0 and 1) and normalizing numeric data (e.g., purchase
frequency). This processed data is then used to divide customers into groups for more
personalized and effective advertising campaigns.

5. Social Media:

Application: Sentiment analysis of customer reviews.


Preprocessing: Social media data often contains text data (customer reviews or comments),
which may be unstructured and noisy (with slang, emojis, etc.). Data preprocessing
includes removing irrelevant text (like special characters or stop words), converting text
into numerical form (using techniques like one-hot encoding or TF-IDF), and normalizing
text data. After preprocessing, the data can be analyzed for sentiment (positive, negative,
or neutral) to understand customer opinions about a product or service.
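
As a concrete illustration of the sentiment-analysis preprocessing described above, the sketch below cleans a few made-up reviews and converts them to TF-IDF features. It assumes the scikit-learn library is installed; the reviews themselves are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = pd.Series([
    "Great product!!! totally worth it :)",
    "worst purchase ever... do NOT buy",
    "ok-ish, delivery was slow",
])  # made-up examples

# Remove special characters and lower-case the text (a very small cleaning step)
cleaned = reviews.str.lower().str.replace(r"[^a-z\s]", " ", regex=True)

# Convert text into numerical features; common English stop words are dropped
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(cleaned)

print(features.shape)                      # (number of reviews, vocabulary size)
print(vectorizer.get_feature_names_out())  # the remaining terms
```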

vi) Data Collection and Management


Data collection is the process of gathering information from various sources to analyze and
use for decision-making. This information can be collected manually or automatically, and
it is essential for any organization, business, or research to make informed decisions.

Data management involves organizing, storing, and maintaining the collected data so that
it is accurate, consistent, and available for use when needed. Proper data management
ensures data is secure and easy to access for analysis.

Data Collection Methods


There are different methods to collect data depending on the type of information required, the
resources available, and the goal of the research or project. Here are some common methods of
data collection:

1. Surveys and Questionnaires

Surveys and questionnaires involve asking a set of predefined questions to collect
information from respondents. These questions can be open-ended (allowing detailed
responses) or closed-ended (like Yes/No or multiple-choice).

How It Works: Surveys can be conducted online, via paper forms, phone calls, or in person.
Respondents answer questions about their behavior, opinions, preferences, etc.


Example: A company might send a customer satisfaction survey via email to ask
customers how satisfied they are with a recent purchase. The data collected helps
the company understand its service quality and areas for improvement.

2. Interviews
Interviews involve direct communication between an interviewer and the participant. The
interviewer asks questions and may clarify or expand on responses to collect detailed data.

How It Works: Interviews can be structured (with a fixed set of questions) or unstructured
(more open conversation). They can be conducted in person, over the phone, or via video
calls.
Example: A researcher may interview teachers to understand the challenges they
face in the classroom. This method allows for in-depth responses that can provide
insight into specific issues.

3. Observations
Observation involves watching and recording events or behaviors as they occur in a natural
setting. This method is non-intrusive and does not rely on participants' responses.
How It Works: The observer collects data by simply watching people or events without
interacting with them.
Example: A retail store manager might observe how customers move through the
store to see which areas attract more attention. This data helps the store optimize
product placement.
4. Experiments

In experiments, researchers manipulate one or more variables and observe the effects on
another variable. This method is commonly used in scientific research.
How It Works: The experiment is conducted in controlled conditions where one variable
(the independent variable) is changed, and its effect on another variable (the dependent
variable) is measured.

Example: A pharmaceutical company might conduct an experiment to test the


effectiveness of a new drug. They give one group of patients the drug and another
group a placebo, then compare the health outcomes.
5. Secondary Data Collection

Secondary data is data that has already been collected by someone else for a different
purpose. It involves using existing data sources rather than collecting new data.


How It Works: This data can be obtained from books, websites, government reports,
research papers, or databases.

Example: A business might use government census data to analyze population


trends instead of conducting its own survey. This saves time and resources.
6. Online Data Collection

With the rise of the internet, a lot of data is collected online through websites, social media,
and apps. This method uses automated tools to gather data.
How It Works: Data is collected through online forms, tracking user activity on websites
(e.g., what pages they visit), or social media platforms.
Example: An e-commerce site might collect data on what products a customer
views and adds to their cart. This data helps personalize product recommendations.

Data Management:
Once the data is collected, it needs to be properly managed to ensure it is useful and accessible for
analysis. Here are key steps involved in data management:

1. Data Storage: Collected data is stored in databases, spreadsheets, or data warehouses.
It can be stored physically or digitally. For digital storage, cloud-based storage services
like Google Drive or Dropbox are often used.
2. Data Organization: Organizing data helps ensure it can be easily retrieved and
analyzed. Data is categorized, labeled, and structured in a way that makes it easy to
understand and use.

3. Data Security: Ensuring the data is secure is a crucial part of data management.
Sensitive data must be protected from unauthorized access, and measures like encryption
and passwords are used.
4. Data Backup: Regular backups of the data ensure that if the original data is lost, it can
be recovered.

5. Data Cleaning: Raw data may contain errors, duplicates, or inconsistencies. Data
cleaning involves checking for these issues and correcting them to ensure the accuracy and
reliability of the data.

Example of Data Collection and Management:


Let’s say a school wants to improve its curriculum based on student feedback. They use
surveys to collect feedback from students about their learning experiences. Some students
may submit incomplete responses, so the school cleans the data to remove or fix any
missing values. After collecting and cleaning the data, they organize it in a spreadsheet and


store it in a cloud system for easy access. The school ensures the data is password-protected
to maintain privacy and regularly backs up the data to avoid any loss.
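
A tiny pandas sketch of the school example might look like the following; the file student_feedback.csv and its columns are assumptions for illustration, and storage and backup would happen outside this snippet.

```python
import pandas as pd

responses = pd.read_csv("student_feedback.csv")  # hypothetical columns: student_id, course, rating, comments

# Data cleaning: drop responses with no rating, fill missing comments
responses = responses.dropna(subset=["rating"])
responses["comments"] = responses["comments"].fillna("No comment")

# Data organization: one summary row per course, saved to a file that is then
# uploaded to cloud storage and included in the regular backups
summary = responses.groupby("course")["rating"].agg(["mean", "count"])
summary.to_csv("feedback_summary.csv")
```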

vii) Sources of Data


Data can come from a wide range of sources depending on the type of information needed and the
goal of the research or analysis. These sources can be categorized into primary and secondary data.
Understanding the sources of data helps businesses, researchers, and organizations make informed
decisions by gathering the right information.

1. Primary Data Sources


Primary data is original data collected first-hand for a specific purpose. It is gathered directly from
the source through various methods, such as surveys, interviews, and experiments. Since primary
data is tailored to the specific needs of a study or project, it is often more accurate and relevant.
Examples of Primary Data Sources:
1. Surveys and Questionnaires: Collecting data directly from individuals by
asking them specific questions.

Example: A company conducts a customer satisfaction survey to


learn how happy customers are with their service.
2. Interviews: One-on-one conversations where detailed information is
gathered by asking open-ended questions.
Example: A researcher interviews doctors to understand their views
on a new medical treatment.

3. Observations: Watching and recording behaviors or events as they happen


in a natural setting without interacting with the subjects.

Example: A wildlife researcher observes animals in their natural


habitat to gather data on their behavior.

4. Experiments: Conducting controlled experiments where variables are


manipulated to measure their effect on other variables.

Example: A scientist tests the effectiveness of a new fertilizer on


plant growth by setting up different test conditions.

2. Secondary Data Sources


Secondary data is information that has already been collected and published by someone else. It is
used when there is no need to gather new data or when gathering primary data would be too


expensive or time-consuming. Secondary data is usually cheaper and easier to access, but it may
not always be specific to the current research needs.

Examples of Secondary Data Sources:


1. Government Reports and Statistics: Official data published by government
agencies, such as census reports, employment statistics, or economic
indicators.

Example: A company might use government-provided population


data to determine where to open new stores based on population
growth trends.
2. Research Papers and Academic Journals: Published findings from studies
conducted by other researchers or institutions.
Example: A student might use data from a scientific paper on climate
change to support their research project.
3. Books and Newspapers: Books, newspapers, and magazines can provide
data that is already analyzed and interpreted by others.
Example: A historian might use data from old newspapers to
understand public opinion during a historical event.
4. Databases and Online Repositories: Data stored in large databases, often
available for free or by subscription. This includes libraries, data
warehouses, and online archives.
Example: A business analyst might use data from an online financial
database to track stock market trends.

3. Internal Data Sources


Internal data is collected from within an organization or business. It is generated from the day-to-
day activities of the organization and can provide valuable insights into its operations.

Examples of Internal Data Sources:


1. Sales Records: Data on how much product was sold, where, and when.
Example: A retailer analyzes sales data to identify which products
are most popular during specific seasons.

2. Employee Data: Information about employees, including their performance,


attendance, and job satisfaction.


Example: A company uses employee data to evaluate job


performance and make decisions about promotions or training
programs.

3. Website Analytics: Data collected from a company’s website, such as the


number of visitors, time spent on pages, and user interactions.

Example: An e-commerce site uses website analytics to understand


customer behavior and optimize product placements on its
homepage.

4. External Data Sources


External data comes from outside an organization and is usually collected by third-party sources.
This type of data is useful for understanding market trends, competitors, and broader industry
insights.
Examples of External Data Sources:
1. Social Media: Data from social media platforms, including user posts,
comments, likes, and shares.

Example: A marketing team analyzes customer feedback from social


media posts to improve their advertising strategy.
2. Market Research Firms: Organizations that collect and sell data about
industry trends, customer preferences, and competitor analysis.
Example: A company buys market research data to learn more about
the latest trends in consumer electronics.
3. Public Databases: Data collected by non-profit organizations, academic
institutions, or governments that is made available to the public.
Example: A researcher uses data from a public health database to
study the spread of infectious diseases.

5. Big Data Sources


Big data refers to large, complex datasets that are often collected in real time and can come from
multiple sources. This data is used in advanced analytics to find patterns and make predictions.
Examples of Big Data Sources:

1. Sensors and Internet of Things (IoT) Devices: Data collected from devices
that measure environmental factors like temperature, traffic, or air quality.


Example: A smart city project uses data from traffic sensors to


manage traffic flow and reduce congestion.

2. Online Transactions: Data generated from online purchases, payments, and


other financial transactions.
Example: A bank analyzes transaction data to detect fraudulent
activities by looking for unusual spending patterns.

3. Streaming Data: Data collected from real-time streams such as social media
feeds, online gaming, or video streaming platforms.
Example: A media company uses data from its streaming service to
recommend content to users based on their viewing habits.

viii) Exploring and Fixing Data


When working with data, especially in data analytics or data science, the initial step is to explore
the data and identify any issues that need to be fixed before performing analysis. The process of
exploring and fixing data ensures that it is clean, complete, and ready for use. This process is also
called data cleaning or data wrangling.

1. Exploring Data

Exploring data involves looking into the dataset to understand its structure, quality, and
key features. This is often done through simple visualizations and statistics. Here’s how
you can explore data:
Steps in Exploring Data:
1. Understand the Data Structure:

Look at the dataset to understand what types of data you have (e.g., numerical,
categorical, dates).
Example: A sales dataset may include columns for product names, sales amounts, dates
of purchase, and customer locations.

2. Check for Missing Values:


Look for any gaps where data is missing.
Example: In a customer database, some customers may not have provided their phone
numbers or email addresses.

3. Summary Statistics:


Calculate basic statistics like mean, median, mode, minimum, maximum, and standard
deviation for numerical data.

Example: In a dataset of test scores, you might calculate the average score and identify
the highest and lowest scores.
4. Data Visualization:

Create visual representations like histograms, bar charts, or scatter plots to observe
trends and outliers.
Example: A scatter plot can show if there’s a relationship between advertising spend
and sales revenue.
5. Identify Outliers:

Look for data points that stand out from the rest and may be errors or extreme cases.

Example: A temperature reading of 150°C in a dataset of weather data could be an


outlier or error, as it’s not realistic.
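
The exploration steps above map directly onto a few pandas and Matplotlib calls. This is a minimal sketch; sales.csv and its columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales.csv")  # hypothetical columns: product, amount, ad_spend, date

# 1. Understand the data structure
print(sales.dtypes)

# 2. Check for missing values
print(sales.isna().sum())

# 3. Summary statistics (mean, min, max, standard deviation, ...)
print(sales["amount"].describe())

# 4. Visualization: distribution of sales and relationship with advertising spend
sales["amount"].hist(bins=30)
sales.plot.scatter(x="ad_spend", y="amount")
plt.show()

# 5. A crude outlier check: values more than 3 standard deviations from the mean
z = (sales["amount"] - sales["amount"].mean()) / sales["amount"].std()
print(sales[z.abs() > 3])
```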

2. Fixing Data
After exploring the data, it’s important to clean and fix any issues so the data is accurate,
complete, and ready for analysis. Here are the common steps to fix data:
Steps in Fixing Data:

1. Handling Missing Data:


Filling Missing Values: Replace missing values with appropriate ones like the
mean, median, or a default value.
Example: In a dataset with missing ages of customers, you could fill in the average
age of other customers as a reasonable estimate.
Removing Rows/Columns with Missing Data: If a large portion of a row or column
is missing, you might remove it.

Example: If 80% of the entries in a column for "customer comments" are missing,
it might make sense to drop that column.
2. Dealing with Outliers:
Investigate Outliers: Look into whether outliers are valid or caused by errors. If
they are errors, remove or correct them.

Example: If a person’s age is listed as 200 years in a customer dataset, it is clearly


an error and should be corrected or removed.


Transformation: In some cases, outliers are extreme but valid data points. You
might use techniques like log transformation to minimize their impact in the
analysis.

3. Standardizing Data:
Consistent Formatting: Ensure all data is in a consistent format (e.g., dates are in
the same format, text is properly capitalized).

Example: In a sales dataset, the date of purchase might appear as "01/02/2024" in


one entry and "1st Feb 2024" in another. You should standardize them to one format
like "YYYY-MM-DD" (e.g., 2024-02-01).
Categorical Data Consistency: Ensure that categorical data (e.g., gender, regions)
is consistent and uses the same labels.
Example: In a dataset of employees, some entries might list gender as “Male” while
others use “M.” These should be made consistent.
4. Correcting Data Entry Errors:

Fixing Typos and Mistakes: Identify and correct any typos or errors in the dataset.
Example: In a product dataset, if a product price is listed as $1000 instead of $100,
this is an error and should be corrected.
Removing Duplicates: Check for and remove any duplicate rows in the dataset.

Example: In a customer list, the same customer might be listed twice due to a
duplicate entry, which should be removed.

5. Converting Data Types:

Ensure Correct Data Types: Make sure that each column has the correct data type
(e.g., numerical data is not stored as text).
Example: In a dataset of sales transactions, the price might be stored as text (e.g.,
“$50”), which should be converted to numerical values for analysis.

6. Handling Inconsistent Data:

Reformatting Data: Sometimes, data collected from different sources may be


inconsistent. Reformatting ensures everything matches.
Example: In one dataset, the country name might be “United States,” and in another,
it’s listed as “USA.” These should be made consistent.

Example of Fixing Data in Action:


Scenario:
o You are analyzing sales data for an online retail store, but there are issues with the
dataset.
o Some sales records are missing customer email addresses.
o A few sales records have negative quantities, which is not possible.
o The date format in some records is inconsistent (some are in "MM/DD/YYYY" format,
and others are in "DD-MM-YYYY" format).
o Duplicate entries exist where the same order is recorded twice.
Steps to Fix the Data:
1. Handling Missing Data: For missing customer email addresses, you decide to fill in a
placeholder email like “unknown@example.com” for records where it’s missing.

2. Correcting Outliers: For sales with negative quantities, you correct the records by either
setting the quantity to zero or investigating further to fix the actual quantity sold.
3. Standardizing Data: You reformat all the dates to follow the "YYYY-MM-DD" format,
ensuring consistency.

4. Removing Duplicates: You find and remove all duplicate sales records, keeping only the
unique entries.

By the end of this process, your dataset is now clean and ready for further analysis, such as
calculating total sales or identifying trends in customer purchases.
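
The four fixing steps in this scenario could be sketched in pandas as follows; orders.csv and its column names are assumptions, and entries whose dates cannot be parsed are simply marked for review rather than guessed.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical columns: order_id, email, quantity, order_date

# 1. Handling missing data: placeholder email where the address is missing
orders["email"] = orders["email"].fillna("unknown@example.com")

# 2. Correcting outliers/errors: negative quantities are impossible, set them to zero
orders.loc[orders["quantity"] < 0, "quantity"] = 0

# 3. Standardizing data: parse the dates; entries that cannot be parsed become NaT
orders["order_date"] = pd.to_datetime(orders["order_date"], dayfirst=True, errors="coerce")

# 4. Removing duplicates: keep only one row per order
orders = orders.drop_duplicates(subset=["order_id"])

print(orders.head())
```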

Data Exploration
Data exploration is the process of analyzing datasets to discover patterns, spot anomalies, test
hypotheses, and check assumptions. It’s an essential first step in any data analysis or data science
project because it helps you understand the dataset before performing advanced modeling or
analysis.
Types of Data Exploration

There are several methods of exploring data. These methods can broadly be classified into
univariate, bivariate, and multivariate explorations:

1. Univariate Data Exploration:


This method focuses on exploring a single variable at a time.

Goal: To understand the distribution, central tendency (mean, median, mode), and spread
(variance, standard deviation) of the data.
Techniques:


Summary Statistics: Calculating measures like mean, median, mode, and standard
deviation for a single variable.

Data Visualization: Using plots like histograms, box plots, and pie charts to visualize the
distribution of the variable.
Example: Exploring the sales price of products in a dataset to understand the average price
and the spread of prices.

2. Bivariate Data Exploration:


This method examines the relationship between two variables.
Goal: To identify correlations or associations between variables, such as how one variable
influences another.

Techniques:

Scatter Plots: Used to visualize the relationship between two numerical variables.
Correlation Coefficients: To measure the strength and direction of the relationship (e.g.,
Pearson correlation).
Cross Tabulations (Contingency Tables): Used when exploring relationships between
categorical variables.
Example: Exploring the relationship between advertising spend and sales revenue to see if
there is a positive correlation.
3. Multivariate Data Exploration:
This method involves exploring more than two variables simultaneously.

Goal: To understand interactions between multiple variables, such as how a group of factors
together influences a certain outcome.
Techniques:
Multidimensional Plots: Heat maps, pair plots, or 3D scatter plots to visualize interactions.

Principal Component Analysis (PCA): To reduce the dimensionality of data and identify
important features.

Example: Analyzing the effect of advertising spend, product quality, and customer reviews
together on overall sales.
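
A short Python sketch of the three kinds of exploration, assuming scikit-learn is available; marketing.csv and its columns are invented for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("marketing.csv")  # hypothetical columns: ad_spend, quality_score, review_score, sales

# Univariate: distribution of a single variable
print(df["sales"].describe())

# Bivariate: correlation between advertising spend and sales (Pearson by default)
print(df["ad_spend"].corr(df["sales"]))

# Multivariate: compress several drivers into two principal components (PCA)
features = df[["ad_spend", "quality_score", "review_score"]]
components = PCA(n_components=2).fit_transform(features)
print(components[:5])
```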

Common Techniques in Data Exploration

1. Summary Statistics


Calculation of simple statistics that describe basic properties of the dataset.


Example: Finding the mean, median, minimum, and maximum values for the variable
"customer age" in a customer dataset.
2. Data Visualization
The graphical representation of data to make patterns, trends, and relationships more
understandable.

Example: Creating histograms to visualize the distribution of product prices, or scatter plots
to see the relationship between two variables (like sales and advertising).
3. Data Grouping

Grouping data into categories or clusters to summarize or find patterns.


Example: Grouping sales data by product categories to see which products perform better.

4. Outlier Detection
Identifying data points that are significantly different from others.

Example: Detecting an unusually high purchase price in a sales dataset, which may indicate
either an error or a special case.

5. Missing Value Analysis

Identifying and handling missing data in the dataset.


Example: Checking for missing values in a customer dataset where some customers haven’t
provided their email addresses.
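
Grouping, outlier detection, and missing-value analysis can each be done in one or two pandas calls, as in the sketch below (hypothetical sales.csv; the 1.5 × IQR rule is one common convention, not the only one).

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # hypothetical columns: category, price, customer_email

# Data grouping: revenue per product category
print(sales.groupby("category")["price"].sum())

# Outlier detection with the interquartile range (IQR) rule
q1, q3 = sales["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales["price"] < q1 - 1.5 * iqr) | (sales["price"] > q3 + 1.5 * iqr)]
print(len(outliers), "suspicious price values")

# Missing value analysis
print(sales["customer_email"].isna().sum(), "customers without an email address")
```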

Exploration Tools
There are several tools available for data exploration, ranging from basic tools to advanced
platforms.
1. Microsoft Excel

Excel is a widely used tool for data exploration, especially for small datasets. It
provides basic summary statistics, simple visualizations, and data manipulation
functions.
Features:

Pivot tables for summarizing data.


Chart creation (e.g., bar charts, pie charts, histograms).


Sorting, filtering, and conditional formatting to explore data.


Example: Analyzing a customer list by age groups and creating a bar
chart of the frequency of customers in each age range.
2. Python (Pandas and Matplotlib Libraries)
Python is a popular programming language for data exploration and analysis. The
Pandas library is used for data manipulation, while Matplotlib and Seaborn are used
for visualization.
Features:
DataFrames for handling tabular data.

Functions to calculate summary statistics.


Visualization libraries for generating plots like histograms, scatter plots, and
box plots.
Example: Using Pandas to calculate the average sales and
Matplotlib to create a scatter plot of sales vs. advertising spend.
3. Tableau

Tableau is a business intelligence tool that allows users to create interactive and
shareable dashboards. It’s widely used for data visualization and exploration.

Features:
Drag-and-drop interface for building complex visualizations.

Support for various data sources (databases, Excel, etc.).

Interactive filters and drill-down capabilities for deeper exploration.


Example: Creating an interactive dashboard to explore sales data by
region, product, and time period.
4. Power BI

Power BI is Microsoft’s business analytics tool, which allows users to visualize


data and share insights. It’s especially useful for integrating with other Microsoft
products like Excel and Azure.
Features:

Intuitive drag-and-drop interface for creating visual reports.

Advanced data modeling and analysis capabilities.


Integration with cloud data sources and real-time data.


Example: Creating a dashboard to visualize customer demographics
and purchase trends over time.
5. R (ggplot2 Library)
R is a programming language commonly used for statistical analysis and data
visualization. The ggplot2 library is a powerful tool for creating detailed
visualizations.
Features:
Statistical functions for data exploration.

Flexible visualizations with ggplot2.


Large ecosystem of packages for data manipulation and analysis.

Example: Using ggplot2 to create a box plot of sales data to identify


outliers and understand the distribution.

ix) Data Storage and Management


Data storage and management involves the processes and systems used to save, organize, and
maintain data in an efficient and accessible manner. Proper data management ensures that data is
secure, easily retrievable, and properly organized for analysis and use.

1. Data Storage
Data storage refers to the methods and technologies used to keep data safe and accessible. There
are several types of storage systems and media:

Types of Data Storage


1. Physical Storage:

i. Hard Disk Drives (HDDs): Traditional storage devices that use spinning disks to
read and write data. They offer large storage capacities but are slower compared to
newer technologies.
Example: A computer’s internal HDD where files, applications, and
operating system data are stored.

ii. Solid State Drives (SSDs): Storage devices that use flash memory to store data,
providing faster access speeds and greater reliability compared to HDDs.
Example: SSDs in modern laptops and smartphones for quicker boot times
and file access.


iii. Optical Discs: Media like CDs, DVDs, and Blu-ray discs that use laser technology
to read and write data.

Example: A DVD storing a backup of important files.


iv. External Hard Drives: Portable storage devices that connect to computers via USB
or other interfaces.

Example: An external hard drive used to back up personal photos and


documents.
2. Cloud Storage:
Storing data on remote servers accessed via the internet, managed by cloud service providers.

Advantages: Scalability, accessibility from anywhere with an internet connection, and


reduced need for physical hardware.

Examples:
Google Drive: Provides online storage for documents, photos, and other files.

Dropbox: Offers cloud-based file storage and sharing services.


3. Database Storage:

Systems designed to store and manage structured data efficiently, often used in businesses and
organizations.

Types:
Relational Databases: Use tables to store data in rows and columns, supporting SQL
queries.

Example: MySQL or PostgreSQL databases used to manage customer information


and transactions.
NoSQL Databases: Designed for unstructured or semi-structured data, often used for large-
scale applications.

Example: MongoDB or Cassandra for handling large volumes of diverse data types.
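
To illustrate relational database storage, the sketch below uses SQLite (which ships with Python) instead of MySQL or PostgreSQL; the same rows-and-columns, SQL-query idea applies to all of them. Table and file names are invented.

```python
import sqlite3

conn = sqlite3.connect("shop.db")
conn.execute("""CREATE TABLE IF NOT EXISTS customers (
                    id INTEGER PRIMARY KEY,
                    name TEXT NOT NULL,
                    email TEXT UNIQUE)""")
conn.execute("INSERT OR IGNORE INTO customers (id, name, email) VALUES (?, ?, ?)",
             (1, "Anna Siju", "anna@example.com"))
conn.commit()

# Structured, queryable storage: rows and columns retrieved with SQL
for row in conn.execute("SELECT id, name, email FROM customers"):
    print(row)
conn.close()
```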

2. Data Management
Data management involves organizing, maintaining, and ensuring the quality and security of data
throughout its lifecycle. This includes tasks like data organization, backup, and ensuring data
integrity.
Key Components of Data Management


1. Data Organization:
Data Structuring: Organizing data in a logical way to make it easily accessible. This
includes creating schemas, tables, and indexes in databases.
Example: In a customer database, structuring data into tables like "Customers,"
"Orders," and "Products" to organize information effectively.

Metadata Management: Managing data about data, which includes details about data
sources, data structure, and data definitions.
Example: Documenting the data fields in a sales database, such as field names,
types, and descriptions.
2. Data Backup and Recovery:

Backup: Regularly copying data to prevent loss in case of hardware failure, accidental
deletion, or corruption.
Example: Setting up automatic backups to an external hard drive or cloud storage
to protect important files.

Recovery: Restoring data from backups to recover from data loss incidents.

Example: Using backup files to restore a database after a server crash.


3. Data Integrity:

Ensuring that data is accurate, consistent, and trustworthy.


Techniques:

Data Validation: Checking data for accuracy and completeness when entering or
importing it.

Example: Ensuring that an email field in a database contains valid email


addresses.
Error Checking: Identifying and correcting data errors or discrepancies.

Example: Correcting inconsistencies in customer addresses recorded in a


database.

4. Data Security:
Protecting data from unauthorized access and breaches.

Techniques:


Encryption: Converting data into a secure format that can only be read by
authorized users.

Example: Encrypting sensitive customer information like credit card


numbers.
Access Controls: Implementing permissions and restrictions to control who can
view or modify data.

Example: Restricting access to sensitive financial records to only authorized


employees.
5. Data Lifecycle Management:
Managing the data from its creation and storage to its eventual archiving or deletion.

Stages:

Creation: Collecting and entering data into the system.


Storage: Keeping data in databases or storage systems.

Use: Analyzing and utilizing data for decision-making.


Archiving: Moving data that is no longer actively used but needs to be kept for
reference.
Example: Archiving old customer records that are no longer active but are
retained for compliance purposes.
Deletion: Permanently removing data that is no longer needed.

Example: Deleting outdated test data from a development database.

Example of Data Storage and Management Process:


Scenario:

A small business needs to manage its customer data, sales records, and financial
information.

Storage Solution:

• Cloud Storage: The business uses Google Drive to store customer contact information and
sales reports. This allows access from any device and provides automatic backups.
• Database Storage: The business uses MySQL to manage customer transactions, sales
records, and inventory. This helps in querying data efficiently and generating reports.


Data Management Tasks:

• Organize Data: Create structured tables in MySQL for customers, orders, and products.
• Backup Data: Schedule daily backups of the MySQL database to a cloud storage service.
• Ensure Data Integrity: Validate data entry forms to check for correct and complete customer
details.
• Secure Data: Encrypt sensitive customer information and set access controls to ensure only
authorized staff can access financial records.
• Manage Data Lifecycle: Archive old sales records that are no longer needed for daily
operations but must be retained for regulatory compliance. Delete obsolete or redundant
data periodically.
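
A small validation sketch for the scenario above is given below; customers.csv, the column names, and the 10-digit phone rule are assumptions made only to illustrate data-integrity checks before loading records into the database.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical columns: name, email, phone

# Data validation: a rough, illustrative email pattern check
email_ok = customers["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print(customers[~email_ok])  # rows that need correction

# Data integrity: required fields present and phone numbers digit-only (assumed 10 digits)
complete = customers[["name", "phone"]].notna().all(axis=1)
digits_only = customers["phone"].astype(str).str.fullmatch(r"\d{10}")
print((~(complete & digits_only)).sum(), "records failing the entry checks")
```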

x) Using Multiple Data Sources


Using multiple data sources involves integrating and analyzing data from various origins to gain a
more comprehensive understanding of a topic or to support better decision-making. Combining
data from different sources can provide richer insights, reveal hidden patterns, and improve the
accuracy of analysis.

1. Why Use Multiple Data Sources?


1. Enhanced Insight:
Example: A business analyzing customer behavior might combine sales data from
their own systems with social media data to understand how online sentiment
influences purchasing decisions.
2. Improved Accuracy:
Example: Combining data from surveys with transaction data can provide a fuller
picture of customer satisfaction and spending habits.
3. Cross-Validation:
Example: Using data from different sensors in a manufacturing process to validate
and cross-check the accuracy of machine performance metrics.

4. Comprehensive Analysis:
Example: Combining financial data with market research can help investors make
more informed decisions about stock purchases.

2. Types of Data Sources


1. Internal Data Sources:


Data generated within an organization or system.


Examples:

• Customer Databases: Information about customers' transactions,


preferences, and interactions.
• Sales Records: Data on product sales, revenue, and inventory.
• Employee Data: HR records, performance evaluations, and internal
communications.
2. External Data Sources:
Data obtained from outside the organization.

Examples:

• Market Research Reports: Industry trends, competitor analysis, and market


forecasts.
• Social Media Data: Public posts, reviews, and user interactions from
platforms like Twitter and Facebook.
• Government Data: Economic indicators, demographic statistics, and
regulatory information.
3. Open Data Sources:

Data freely available to the public, often provided by governments or organizations.


Examples:

• Open Government Data Portals: Publicly available datasets on various


topics (e.g., crime statistics, environmental data).
• Public APIs: Services providing data from various domains (e.g., weather,
transportation).
4. Transactional Data Sources:
Data related to transactions and interactions.

Examples:

• E-commerce Transactions: Online purchase details, customer feedback, and


payment information.
• Bank Transactions: Records of deposits, withdrawals, and account balances.

5. Sensor Data Sources:


Data collected from physical sensors.


Examples:

• IoT Devices: Data from smart home devices, wearable technology, and
industrial sensors.
• Environmental Sensors: Data on temperature, humidity, and pollution
levels.

3. Integrating Multiple Data Sources


Integrating data from multiple sources can be complex but is crucial for a comprehensive analysis.
Here are steps and methods to effectively combine data:
1. Data Collection:
Gather Data: Collect data from various sources, ensuring data is relevant and of
high quality.

Example: Downloading sales data from an internal database and social


media metrics from an analytics platform.
2. Data Cleaning:

Standardize Formats: Ensure data from different sources follows the same format
for consistency.

Example: Converting dates and currencies to a uniform format across


datasets.

3. Data Transformation:
Merge Data: Combine datasets using common fields or identifiers.
Example: Merging customer purchase data with CRM data based on
customer IDs.
4. Data Storage:
Unified Storage: Store integrated data in a central repository like a data warehouse
or database.

Example: Using a cloud-based data warehouse to store and manage


combined data from internal and external sources.
5. Data Analysis:

Analyze Patterns: Use analytical tools to uncover insights from the integrated data.

Example: Performing a regression analysis to understand how social media


sentiment impacts sales.


6. Data Visualization:
Visualize Insights: Create dashboards or reports to present findings from the
combined data.
Example: Developing a dashboard that shows sales trends alongside social
media engagement metrics.
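
Steps 1 to 6 can be condensed into a short pandas sketch. The two CSV files, their columns, and the join keys are assumptions chosen to mirror the sales-plus-social-media example above.

```python
import pandas as pd

# Hypothetical extracts from two different systems
sales = pd.read_csv("sales.csv")            # campaign, week, revenue
social = pd.read_csv("social_metrics.csv")  # campaign, week, likes, shares

# Data cleaning: standardize the join keys before merging
for df in (sales, social):
    df["campaign"] = df["campaign"].str.strip().str.lower()

# Data transformation: merge on the common identifiers
combined = sales.merge(social, on=["campaign", "week"], how="left")

# Data analysis: does engagement move with revenue?
print(combined[["revenue", "likes", "shares"]].corr())
```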

4. Tools for Using Multiple Data Sources


Several tools can help with integrating and analyzing data from multiple sources:
1. Data Integration Platforms:

Examples: Talend, Apache Nifi, Informatica

These platforms provide tools for extracting, transforming, and loading (ETL) data
from various sources into a unified system.
2. Data Warehousing Solutions:

Examples: Amazon Redshift, Google BigQuery, Snowflake

These solutions store large volumes of integrated data and support complex queries
and analysis.
3. Business Intelligence Tools:

Examples: Tableau, Power BI, Looker

These tools enable users to create visualizations and reports from integrated data
sources.
4. Data Integration Services:

Examples: Zapier, Microsoft Power Automate


Services that automate the integration of data between different applications and
platforms.

Example Scenario:
Scenario:
A retail company wants to understand the impact of marketing campaigns on sales.

Data Sources:
Internal Data: Sales records, customer purchase history, and marketing campaign details.


External Data: Social media engagement metrics, market research reports.


Process:

Collection: Gather sales data from the company's database and social media metrics from
an analytics platform.
Cleaning: Standardize formats (e.g., dates, currencies) and handle any missing or
inconsistent data.

Transformation: Merge sales data with social media engagement data using common
identifiers such as campaign names.
Storage: Store the combined data in a cloud-based data warehouse.

Analysis: Perform analysis to identify correlations between marketing efforts and sales
performance.

Visualization: Create a dashboard showing sales trends and social media engagement
metrics for campaign evaluation.

Ways to manage multiple data sources


Managing multiple data sources involves organizing, integrating, and maintaining data from
various origins to ensure consistency, accuracy, and accessibility. Here are key ways to manage
multiple data sources effectively:

1. Data Integration
Combine data from different sources into a unified format for analysis and reporting.

ETL (Extract, Transform, Load):


Extract: Pull data from various sources.

Transform: Clean, format, and aggregate data as needed.


Load: Insert the processed data into a centralized repository like a data warehouse.

Tools: Talend, Apache Nifi, Informatica.
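For a feel of what ETL looks like in code, here is a minimal sketch using pandas and the standard library's sqlite3 module in place of a dedicated ETL tool and warehouse; the file and table names are hypothetical:

    import sqlite3
    import pandas as pd

    # Extract: pull data from two hypothetical source files.
    sales = pd.read_csv("sales.csv")          # e.g. order_id, order_date, amount
    feedback = pd.read_csv("feedback.csv")    # e.g. order_id, rating, comment

    # Transform: clean, format, and join the data.
    sales["order_date"] = pd.to_datetime(sales["order_date"])
    merged = sales.merge(feedback, on="order_id", how="left")

    # Load: insert the processed data into a central repository (here, a local SQLite file).
    with sqlite3.connect("warehouse.db") as conn:
        merged.to_sql("sales_with_feedback", conn, if_exists="replace", index=False)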

Data Virtualization:
Create a virtual view of data from different sources without moving the data
physically.

Benefits: Provides a unified view and real-time access to data without duplication.

Tools: Denodo, Cisco Data Virtualization.


2. Data Consolidation
Aggregate data from multiple sources into a single system or platform.

Data Warehousing:
Central repository that stores data from various sources, optimized for querying and
analysis.

Benefits: Facilitates complex queries and reporting.

Tools: Amazon Redshift, Google BigQuery, Snowflake.


Data Lakes:
Storage system that holds raw data in its native format until needed for analysis.

Benefits: Scalable and can handle large volumes of diverse data types.

Tools: AWS S3, Azure Data Lake, Hadoop.


3. Data Synchronization

Ensure data consistency across different systems.


Batch Synchronization:

Periodically update data in bulk.

Benefits: Efficient for large data sets.


Example: Daily updates to synchronize sales data with inventory data.
Real-Time Synchronization:

Continuously update data as changes occur.

Benefits: Ensures immediate consistency across systems.


Tools: Apache Kafka, AWS Kinesis.

4. Data Quality Management


Ensure data accuracy, completeness, and consistency.
Data Cleansing:

Identify and correct errors or inconsistencies in data.

Tools: OpenRefine, Trifacta.


Data Validation:
Implement rules to ensure data meets specific criteria.

Example: Validate email addresses and phone numbers during data entry.
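A small sketch of rule-based validation using Python's re module; the patterns are deliberately simple illustrations, not production-grade validators:

    import re

    EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
    PHONE_RE = re.compile(r"^\+?\d{10,13}$")

    def is_valid_email(value: str) -> bool:
        # True if the value roughly looks like an email address.
        return bool(EMAIL_RE.match(value))

    def is_valid_phone(value: str) -> bool:
        # True for 10-13 digit numbers, optionally prefixed with '+'.
        return bool(PHONE_RE.match(value.replace(" ", "")))

    print(is_valid_email("user@example.com"))   # True
    print(is_valid_phone("+91 9876543210"))     # True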
Data Profiling:

Analyze data to understand its structure, quality, and relationships.

Tools: Talend Data Quality, IBM InfoSphere.


5. Metadata Management
Manage information about the data, such as its source, structure, and relationships.
Metadata Repositories:
Centralized storage for metadata information.

Benefits: Helps in understanding and managing data across different sources.


Tools: Apache Atlas, Collibra.

Data Cataloging:

Maintain an inventory of data assets, including their origins and usage.


Tools: Alation, Data.world.

6. Data Governance
Establish policies and procedures for data management.

Data Policies:

Set rules for data access, security, and quality.


Example: Define who can access sensitive data and how it should be protected.
Data Stewardship:

Assign roles and responsibilities for managing data.

Example: Appoint data stewards to oversee data quality and compliance.


7. Data Integration Platforms

Use specialized platforms to facilitate data integration from multiple sources.

Integration Platforms as a Service (iPaaS):

Cloud-based platforms for integrating applications and data.


Benefits: Scalable and flexible integration capabilities.


Tools: MuleSoft, Dell Boomi.

Data Integration Tools:


Software designed to extract, transform, and load data.

Example: Informatica PowerCenter, Microsoft SQL Server Integration Services (SSIS).

Example Scenario:
Scenario:

A retail company wants to integrate customer feedback from social media with sales data to
improve marketing strategies.

• Data Integration:
Use ETL tools to extract customer feedback from social media platforms and sales data
from internal systems.

Transform the data to a common format and load it into a central data warehouse.

• Data Consolidation:

Store the integrated data in a data warehouse for easy access and analysis.

• Data Synchronization:
Synchronize sales data with social media metrics in real-time to keep insights up-to-date.

• Data Quality Management:

Cleanse the data to remove duplicates and correct errors.


Validate data entries to ensure accuracy.

• Metadata Management:

Document metadata related to customer feedback and sales data.

• Data Governance:

Define policies for data access and security.

• Data Integration Platforms:

Use an iPaaS solution to streamline the integration process.


xi) Basic Statistical Descriptions of Data


• Basic statistical descriptions can be used to identify properties of the data and highlight
which data values should be treated as noise or outliers.
• For data preprocessing tasks, we want to learn about data characteristics regarding both
central tendency and dispersion of the data.

• Measures of central tendency include mean, median, mode, and midrange.

• Measures of data dispersion include quartiles, interquartile range (IQR), and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.
a) Measuring Central Tendency: Mean
• The most common and most effective numerical measure of the “center” of
a set of data is the arithmetic mean.
▪ Arithmetic Mean: the mean of N values x1, x2, ..., xN is their sum divided by N, i.e. mean = (x1 + x2 + ... + xN) / N.
▪ Although the mean is the single most useful quantity for describing a data
set, it is not always the best way of measuring the center of the data.
➢ A major problem with the mean is its sensitivity to extreme
(outlier) values.
➢ Even a small number of extreme values can corrupt the mean.
▪ To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean.
▪ The trimmed mean is obtained by chopping off values at the high and low extremes and then computing the mean of the remaining values.

b) Measuring Central Tendency: Median


▪ Another measure of the center of data is the median.
▪ Suppose that a given data set of N distinct values is sorted in numerical
order.
➢ If N is odd, the median is the middle value of the ordered set;
➢ If N is even, the median is the average of the middle two values.


▪ In probability and statistics, the median generally applies to numeric data; however, we may extend the concept to ordinal data.
➢ Suppose that a given data set of N values for an attribute X is
sorted in increasing order.
➢ If N is odd, then the median is the middle value of the ordered
set.
➢ If N is even, then the median may not be unique.
▪ In this case, the median can be taken as either of the two middlemost values or any value in between them.

c) Measuring Central Tendency: Mode


▪ Another measure of central tendency is the mode.
▪ The mode for a set of data is the value that occurs most frequently in the
set.
➢ It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
➢ Data sets with one, two, or three modes: called unimodal,
bimodal, and trimodal.
➢ At the other extreme, if each data value occurs only once, then
there is no mode.

Measuring Central Tendency - Mean, Median, Mode

Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.

Example


What are central tendency measures (mean, median, mode)for the following attributes?
attr1 = {2,4,4,6,8,24}

mean = (2+4+4+6+8+24)/6 = 8 average of all values


median = (4+6)/2 = 5 avg. of two middle values

mode = 4 most frequent item

attr2 = {2,4,7,10,12}
mean = (2+4+7+10+12)/5 = 7 average of all values
median = 7 middle value
mode = none (no mode) every value has the same frequency

attr3 = {xs,s,s,s,m,m,l}

mean is meaningless for categorical attributes.


median = s middle value

mode = s most frequent item
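These values can be checked quickly with Python's built-in statistics module (a small sketch; statistics.multimode, available from Python 3.8, returns every value that ties for the highest frequency):

    import statistics as st

    attr1 = [2, 4, 4, 6, 8, 24]
    print(st.mean(attr1))       # 8
    print(st.median(attr1))     # 5.0 (average of the two middle values 4 and 6)
    print(st.mode(attr1))       # 4

    attr2 = [2, 4, 7, 10, 12]
    print(st.mean(attr2))       # 7
    print(st.median(attr2))     # 7
    print(st.multimode(attr2))  # [2, 4, 7, 10, 12] -> every value ties, so no useful mode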

Measuring the dispersion of data


What is Dispersion?
Dispersion, also known as variability or spread, measures the extent to which individual
data points deviate from the central value. It provides information about the spread or
distribution of data points in a dataset.
Common measures of dispersion include
• Range
• Variance
• Standard Deviation
• Interquartile Range (IQR)
Range
• In statistics, the range refers to the difference between the highest and lowest
values in a dataset.
• It provides a simple measure of variability, indicating the spread of data points.
• The range is calculated by subtracting the lowest value from the highest value.
• For example, in a dataset {4, 6, 9, 3, 7}, the range is 9 – 3 = 6.
Variance
• Variance is a statistical measure that quantifies the amount of variation or
dispersion of a set of numbers from their mean value. Specifically, variance is


defined as the expected value of the squared deviation from the mean. It is
calculated by:
1. Finding the mean (average) of the data set.
2. Subtracting the mean from each data point to get the deviations from the
mean.
3. Squaring each of the deviations.
4. Calculating the average of the squared deviations. This is the variance.
Standard Deviation
• Standard deviation is a measure of the amount of variation or dispersion of a set
of values from the mean value.
• It is calculated as the square root of the variance, which is the average squared
deviation from the mean.
Interquartile Range (IQR)
• Interquartile Range (IQR) is a measure of statistical dispersion that represents the
middle 50% of a data set.
• It is calculated as the difference between the 75th percentile (Q3) and the
25th percentile (Q1) of the data i.e., IQR = Q3 − Q1.
Examples for Dispersion
Let’s consider the same dataset of daily temperatures recorded over a week: 22°C,
23°C, 21°C, 25°C, 22°C, 24°C, and 20°C.
Range: Maximum temperature – Minimum temperature = 25°C – 20°C =
5°C
Variance: Variance = (Sum of squared differences from the mean) /
(Number of data points)
Mean = (22 + 23 + 21 + 25 + 22 + 24 + 20) / 7 = 157 / 7 ≈ 22.43 °C
Sum of squared differences from the mean = (22 – 22.43)² + (23 – 22.43)² +
(21 – 22.43)² + (25 – 22.43)² + (22 – 22.43)² + (24 – 22.43)² + (20 – 22.43)²
= (–0.43)² + (0.57)² + (–1.43)² + (2.57)² + (–0.43)² + (1.57)² + (–2.43)²
≈ 17.71
Thus, Variance = 17.71 / 7 ≈ 2.53 (°C²)
Standard Deviation: Take the square root of the variance to get the
standard deviation.
Thus, Standard Deviation ≈ √2.53 ≈ 1.59 °C
Interquartile Range (IQR): First Quartile (Q1) = 21°C Third Quartile
(Q3) = 24°C
Thus, IQR = Q3 − Q1 = 24°C – 21°C = 3°C
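The same figures can be reproduced with NumPy (a small sketch; note that np.var uses the population formula, dividing by N, which matches the calculation above, while NumPy's default percentile interpolation gives a slightly different IQR than the median-of-halves rule used in the text):

    import numpy as np

    temps = np.array([22, 23, 21, 25, 22, 24, 20])

    print(temps.max() - temps.min())   # 5    -> range
    print(round(temps.var(), 2))       # 2.53 -> population variance
    print(round(temps.std(), 2))       # 1.59 -> standard deviation
    q1, q3 = np.percentile(temps, [25, 75])
    print(q3 - q1)                     # 2.0  -> IQR with linear interpolation (vs. 3 above)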


xii) Exploratory data analysis


Exploratory Data Analysis is a data analytics process that aims to understand the data in depth and
learn its different characteristics, often using visual means. This allows one to get a better feel for
the data and find useful patterns.

It is crucial to understand your data in depth before you perform data analysis and run it through an algorithm. You need to know the patterns in your data and determine which variables are important and which do not play a significant role in the output. Further, some variables may be correlated with other variables. You also need to recognize errors in your data.

Exploratory data analysis can do all of this. It helps you gather insights, better sense the data, and
remove irregularities and unnecessary values.

• Helps you prepare your dataset for analysis.

• Helps a machine learning model make better predictions on your dataset.

• Gives you more accurate results.

• Helps you choose a better machine learning model.

Steps Involved in Exploratory Data Analysis

1. Understand the Data

• Familiarize yourself with the data set, understand the domain, and identify the
objectives of the analysis.

2. Data Collection

• Collect the required data from various sources such as databases, web scraping, or
APIs.


3. Data Cleaning

• Handle missing values: Impute or remove missing data.

• Remove duplicates: Ensure there are no duplicate records.

• Correct data types: Convert data types to appropriate formats.

• Fix errors: Address any inconsistencies or errors in the data.

4. Data Transformation

• Normalize or standardize the data if necessary.

• Create new features through feature engineering.

• Aggregate or disaggregate data based on analysis needs.

5. Data Integration

• Integrate data from various sources to create a complete data set.

6. Data Exploration

• Univariate Analysis: Analyze individual variables using summary statistics and visualizations (e.g., histograms, box plots).

• Bivariate Analysis: Analyze the relationship between two variables with scatter plots,
correlation coefficients, and cross-tabulations.

• Multivariate Analysis: Investigate interactions between multiple variables using pair plots and correlation matrices.
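A minimal sketch of these three levels of exploration with pandas and seaborn, assuming a DataFrame has already been loaded; the file and column names (customers.csv, age, income, spend) are hypothetical:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("customers.csv")            # hypothetical dataset

    # Univariate: distribution of a single variable.
    print(df["age"].describe())
    df["age"].hist()

    # Bivariate: relationship between two variables.
    print(df[["age", "income"]].corr())
    sns.scatterplot(data=df, x="age", y="income")

    # Multivariate: interactions among several variables at once.
    sns.pairplot(df[["age", "income", "spend"]])
    plt.show()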


7. Data Visualization

• Visualize data distributions and relationships using visual tools such as bar charts, line
charts, scatter plots, heatmaps, and box plots.

8. Descriptive Statistics

• Calculate central tendency measures (mean, median, mode) and dispersion measures
(range, variance, standard deviation).

9. Identify Patterns and Outliers

• Detect patterns, trends, and outliers in the data using visualizations and statistical
methods.

10. Hypothesis Testing

• Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to
validate assumptions or relationships in the data.

11. Data Summarization

• Summarize findings with descriptive statistics, visualizations, and key insights.

12. Documentation and Reporting

• Document the EDA process, findings, and insights in a clear and structured manner.

• Create reports and presentations to convey results to stakeholders.


13. Iterate and Refine

• Continuously refine the analysis based on feedback and additional questions during the
process.

Types of Exploratory Data Analysis (EDA)

1. Univariate Analysis

• Focuses on analyzing a single variable at a time.

• Purpose: To understand the variable's distribution, central tendency, and spread.

• Techniques:

❖ Descriptive statistics (mean, median, mode, variance, standard deviation).

❖ Visualizations (histograms, box plots, bar charts, pie charts).

2. Bivariate Analysis

• Examines the relationship between two variables.

• Purpose: To understand how one variable affects or is associated with another.

• Techniques:

❖ Scatter plots.

❖ Correlation coefficients (Pearson, Spearman).

❖ Cross-tabulations and contingency tables.

❖ Visualizations (line plots, scatter plots, pair plots).


3. Multivariate Analysis

• Investigates interactions between three or more variables.

• Purpose: To understand the complex relationships and interactions in the data.

• Techniques:

❖ Multivariate plots (pair plots, parallel coordinates plots).

❖ Dimensionality reduction techniques (PCA, t-SNE).

❖ Cluster analysis.

❖ Heatmaps and correlation matrices.

4. Descriptive Statistics

• Summarizes the main features of a data set.

• Purpose: To provide a quick overview of the data.

• Techniques:

❖ Measures of central tendency (mean, median, mode).

❖ Measures of dispersion (range, variance, standard deviation).

❖ Frequency distributions.

5. Graphical Analysis

• Uses visual tools to explore data.

• Purpose: To identify patterns, trends, and data anomalies through visualization.

• Techniques:


❖ Charts (bar charts, histograms, pie charts).

❖ Plots (scatter plots, line plots, box plots).

❖ Advanced visualizations (heatmaps, violin plots, pair plots).

6. Dimensionality Reduction

• Reduces the number of variables under consideration.

• Purpose: To simplify models, reduce computation time, and mitigate the curse of
dimensionality.

• Techniques:

❖ Principal Component Analysis (PCA).

❖ t-Distributed Stochastic Neighbor Embedding (t-SNE).

❖ Linear Discriminant Analysis (LDA).

Exploratory Data Analysis Tools

Using the following tools for exploratory data analysis, data scientists can effectively gain deeper
insights and prepare data for advanced analytics and modeling.

1. Python Libraries

• Pandas: Provides data structures and functions needed to manipulate structured data
seamlessly.

• Use: Data cleaning, manipulation, and summary statistics.

• NumPy: Supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.

• Use: Numerical computations and data manipulation.


• Matplotlib: A plotting library that produces static, animated, and interactive


visualizations.

• Use: Basic plots like line charts, scatter plots, and bar charts.

• Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive


statistical graphics.

• Use: Advanced visualizations like heatmaps, violin plots, and pair plots.

• SciPy: Builds on NumPy and provides many higher-level scientific algorithms.

• Use: Statistical analysis and additional mathematical functions.

• Plotly: A graphing library that makes interactive, publication-quality graphs online.

• Use: Interactive and dynamic visualizations.

2. R Libraries

• ggplot2: A framework for creating graphics using the principles of the Grammar of
Graphics.

• Use: Complex and multi-layered visualizations.

• dplyr: A set of tools for data manipulation, offering consistent verbs to address common
data manipulation tasks.

• Use: Data wrangling and manipulation.

• tidyr: Provides functions to help you organize your data in a tidy way.

• Use: Data cleaning and tidying.

• shiny: An R package that makes building interactive web apps straight from R easy.

• Use: Interactive data analysis applications.


• plotly: Also available in R for creating interactive visualizations.

• Use: Interactive visualizations.

3. Integrated Development Environments (IDEs)

• Jupyter Notebook: An open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text.

• Use: Combining code execution, rich text, and visualizations.

• RStudio: An integrated development environment for R that offers tools for writing and
debugging code, building software, and analyzing data.

• Use: R development and analysis.

4. Data Visualization Tools

• Tableau: A top data visualization tool that facilitates the creation of diverse charts and
dashboards.

• Use: Interactive and shareable dashboards.

• Power BI: A Microsoft business analytics service offering interactive visualizations and
business intelligence features.

• Use: Interactive reports and dashboards.

5. Statistical Analysis Tools

• SPSS: A comprehensive statistics package from IBM.

• Use: Complex statistical data analysis.

• SAS: A software suite developed by SAS Institute for advanced analytics, business
intelligence, data management, and predictive analytics.


• Use: Statistical analysis and data management.

6. Data Cleaning Tools

• OpenRefine: A powerful tool for cleaning messy data, transforming formats, and
enhancing it with web services and external data.

• Use: Data cleaning and transformation.

• SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and
query relational databases.

• Use: Data extraction, transformation, and basic analysis.

xiii) Measuring data similarity and dissimilarity


Measuring data similarity and dissimilarity is crucial in many areas of data analysis, such as
clustering, classification, and recommendation systems. These measures help us understand how
alike or different data points or sets are from each other. Here’s a detailed explanation of common
methods to measure similarity and dissimilarity:

Similarity Measures
• L2 or Euclidean
The most common and intuitive distance metric is L2 or Euclidean distance. We
can imagine this as the amount of space between two data objects. For example,
how far your screen is from your face.


How Does L2 or Euclidean Distance Work?
For two points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), the Euclidean distance is the square root of the sum of the squared coordinate differences: d(x, y) = √((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²).
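A small NumPy sketch of the Euclidean distance between two points:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 6.0, 3.0])

    # Square root of the sum of squared coordinate differences.
    print(np.sqrt(np.sum((x - y) ** 2)))   # 5.0
    print(np.linalg.norm(x - y))           # 5.0 (equivalent shortcut)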

• Cosine Similarity
We use the terms “cosine similarity” and “cosine distance” to describe how close or far apart the orientations of two vectors are. For example, how far would you have to turn from facing the front door to face the opposite direction? The cosine similarity measure is a good starting point when considering the similarity between two vectors.


How Does Cosine Similarity Work?
For two vectors A and B, cosine similarity is the cosine of the angle between them, computed as the dot product divided by the product of their lengths: cos(A, B) = (A · B) / (‖A‖ × ‖B‖). A value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and −1 means they point in opposite directions.
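A small NumPy sketch of cosine similarity between two vectors:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # Dot product divided by the product of the vector lengths.
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(cos_sim)   # ~1.0 -> the vectors point in the same direction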

• Jaccard Similarity
Jaccard Similarity, also known as the Jaccard index, is a statistic used to measure the similarity between two data sets. It is measured as the size of the intersection of the two sets divided by the size of their union.
For example, given two sets A and B, their Jaccard Similarity is given by J(A, B) = |A ∩ B| / |A ∪ B|.
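A small Python sketch of the Jaccard similarity between two sets:

    def jaccard_similarity(a: set, b: set) -> float:
        # Size of the intersection divided by the size of the union.
        return len(a & b) / len(a | b)

    A = {"apple", "banana", "cherry"}
    B = {"banana", "cherry", "date"}
    print(jaccard_similarity(A, B))   # 2 / 4 = 0.5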


Dissimilarity Measures
• Manhattan Distance
The Manhattan distance, also called taxicab distance or cityblock distance, is another popular distance metric. Imagine moving between two points in a two-dimensional plane while being allowed to travel only along the axes, like a taxi on a city grid: the Manhattan distance is the total distance covered, i.e. the sum of the absolute differences of the coordinates, d(x, y) = |x1 − y1| + |x2 − y2| (and, in general, the sum over all coordinates).
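A small NumPy sketch of the Manhattan distance:

    import numpy as np

    x = np.array([1, 2])
    y = np.array([4, 6])

    # Sum of absolute coordinate differences (walking along the grid).
    print(np.sum(np.abs(x - y)))   # |1-4| + |2-6| = 3 + 4 = 7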


• Hamming Distance
Hamming distance between two codewords is the number of places by which the
codewords differ. For two codes c1 and c2, the Hamming distance is denoted as d(c1, c2).
It is the number of positions at which corresponding symbols of two equal length
strings are different. It is named after Richard Hamming, an American
mathematician and computer engineer.


E.g.: considering codewords c1 = 0100 and c2 = 1111, the Hamming distance between the two codewords is 3 because they differ at the 1st, 3rd and 4th places in the code.
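A small Python sketch of the Hamming distance between two equal-length codewords:

    def hamming_distance(c1: str, c2: str) -> int:
        # Count the positions at which corresponding symbols differ.
        if len(c1) != len(c2):
            raise ValueError("Codewords must have equal length")
        return sum(s1 != s2 for s1, s2 in zip(c1, c2))

    print(hamming_distance("0100", "1111"))   # 3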

xiv) Graphical Representation of Data

• A graphical representation is a visual display of data and statistics-based results using graphs, plots, and charts. This kind of representation is more effective for understanding and comparing data than a tabular form.
• Graphical representation helps to qualify, sort, and present data in a way that is simple to understand for a larger audience.
• Graphs enable the study of cause-and-effect relationships between two variables through both time series and frequency distributions.
• The data obtained from different surveys is turned into a graphical representation using symbols, such as lines on a line graph, bars on a bar chart, or slices of a pie chart. This visual representation helps in clarity, comparison, and understanding of numerical data.
Principles of Graphical Representation of Data

• The principles of graphical representation are algebraic.


• In a graph, there are two lines known as Axis or Coordinate axis. These are the
X-axis and Y-axis.
• The horizontal axis is the X-axis and the vertical axis is the Y-axis. They are
perpendicular to each other and intersect at O or point of Origin.
• On the right side of the origin, the X-axis has positive values and on the left side negative values. In the same way, the Y-axis has positive values above the origin and negative values below it.
• When the X-axis and Y-axis intersect at the origin, they divide the plane into four parts called Quadrant I, Quadrant II, Quadrant III, and Quadrant IV.


• This form of representation is seen in frequency distributions, which can be represented by five methods, namely the Histogram, Smoothed frequency graph, Pie diagram or Pie chart, Cumulative or Ogive frequency graph, and Frequency Polygon.

Advantages and Disadvantages of Graphical Representation of Data

• It improves the way of analyzing and learning as the graphical representation makes the
data easy to understand.
• It can be used in almost all fields from mathematics to physics to psychology and so on.
• It is easy to understand for its visual impacts.
• It can show a whole, large dataset at a glance.
• It is mainly used in statistics to determine the mean, median, and mode for different data sets.
• The main disadvantage of graphical representation of data is that it takes a lot of effort as
well as resources to find the most appropriate data and then represent it graphically.


Rules of Graphical Representation of Data


While presenting data graphically, there are certain rules that need to be followed. They are listed
below:

• Suitable Title: The title of the graph should be appropriate and indicate the subject of the presentation.
• Measurement Unit: The measurement unit in the graph should be mentioned.
• Proper Scale: A proper scale needs to be chosen to represent the data accurately.
• Index: For better understanding, include an index (legend) of the colors, shades, lines, and designs used in the graph.
• Data Sources: The source of the data should be mentioned at the bottom of the graph wherever necessary.
• Simple: The construction of a graph should be easily understood.
• Neat: The graph should be visually neat in terms of size and font to read the data
accurately.

Uses of Graphical Representation of Data

• The main use of a graphical representation of data is understanding and identifying the trends and patterns in the data. It helps in analyzing large quantities of data, comparing two or more data sets, making predictions, and building firm decisions.
• The visual display of data also helps in avoiding confusion and overlapping of
any information.
• Graphs like line graphs and bar graphs display two or more data sets clearly for easy comparison. This is important both for communicating our findings to others and for our own understanding and analysis of the data.
Types of Graphical Representation of Data

Data is represented in different types of graphs such as plots, pies, diagrams, etc. They
are as follows,


1. Bar Graph

• A bar graph is a graph that shows complete data with rectangular bars and the
heights of bars are proportional to the values that they represent. The bars in the
graph can be shown vertically or horizontally.
• Bar graphs are also known as bar charts; they are a pictorial representation of grouped data and one of the ways of handling data. A bar graph is an excellent tool to represent data that are:
1. Independent of one another and
2. That do not need to be in any specific order while being represented.
• The bars give a visual display for comparing quantities in different categories.
• The bar graphs have two lines, horizontal and vertical axis, also called the x and
y-axis along with the title, labels, and scale range.
Types of Bar Graphs

Bar Graphs are mainly classified into two types:

• Vertical Bar Graph


• Horizontal Bar Graph

The bars in bar graphs can be plotted horizontally or vertically, but the most commonly
used bar graph is the vertical bar graph. Apart from the vertical and horizontal bar
graphs, there are two more types of bar graphs, which are given below:
• Grouped Bar Graph
• Stacked Bar Graph


2. Pie Chart
• A pie chart is a type of graph that presents data in a circular form that is divided into sectors, each representing a part of the whole.
• Each of these sectors or slices represents the proportionate part of the whole.
• Pie charts, also commonly known as pie diagrams, help in interpreting and representing the data more clearly. They are also used to compare the given data.


3. Line Graph
• A line graph is a type of chart or graph that is used to show information that changes over
time. A line graph can be plotted using several points connected by straight lines.


4. Histogram
• A histogram is the graphical representation of data where data is grouped into continuous
number ranges and each range corresponds to a vertical bar.
• The horizontal axis displays the number range.
• The vertical axis (frequency) represents the amount of data that is present in each range.
• The number ranges depend upon the data that is being used.

5. Scatter Plot
• A scatter plot is a means to represent data in a graphical format.
• A simple scatter plot makes use of the Coordinate axes to plot the points, based on their
values.
• For example, data for age (of a child, in years) and height (of the child, in feet) can be represented as a scatter plot.
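To close, here is a short matplotlib sketch that produces each of the chart types described above; the numbers are made up purely for illustration:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 3, figsize=(12, 7))

    # Bar graph: compare quantities across categories.
    axes[0, 0].bar(["A", "B", "C"], [5, 9, 3])
    axes[0, 0].set_title("Bar graph")

    # Pie chart: proportions of a whole.
    axes[0, 1].pie([40, 35, 25], labels=["X", "Y", "Z"])
    axes[0, 1].set_title("Pie chart")

    # Line graph: change over time.
    axes[0, 2].plot([2019, 2020, 2021, 2022], [10, 14, 12, 18])
    axes[0, 2].set_title("Line graph")

    # Histogram: frequency of values falling in continuous ranges.
    axes[1, 0].hist([1, 2, 2, 3, 3, 3, 4, 4, 5, 6], bins=5)
    axes[1, 0].set_title("Histogram")

    # Scatter plot: age (years) vs height (feet) of a child.
    axes[1, 1].scatter([3, 5, 7, 9, 11], [3.0, 3.5, 4.0, 4.4, 4.8])
    axes[1, 1].set_title("Scatter plot")

    axes[1, 2].axis("off")   # unused panel
    plt.tight_layout()
    plt.show()

Each panel highlights the kind of comparison that the corresponding chart type is best suited for.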
