UNIT-IV: Basics of Data Science (7 Hours)
Syllabus: What Is AI? Importance of Data Science, Data Science relation to other fields, Data
Science and Information Science, Computational Thinking, Skills and tools needed to do Data
Science, storing data, combining bytes into larger structures, creating datasets, identifying data
problems, understanding data sources, Exploring data models
What is AI?
Artificial Intelligence (AI) is when computers are designed to think and learn like humans.
They can perform tasks without needing explicit instructions for every step.
• Chatbots: Imagine a chatbot on a website. When you ask it a question like, "What are
your store hours?" it understands your question and provides the answer. It learns from
previous interactions to improve its responses over time.
Data Science (DS) is all about collecting and analyzing data to understand patterns and trends. It
helps organizations make better decisions based on facts.
Why They Are Important
• AI: Makes things easier. For example, if you have a smart assistant on your phone, it can
set reminders, send texts, or play music just by you asking. You don’t have to do
everything manually.
• Data Science: Helps organizations understand what’s happening. For instance, if a store
sees that ice cream sales go up in summer, they can stock more ice cream during that
time.
Data Science's Relation to Other Fields
2. Computer Science
• Explanation: Computer science contributes algorithms and programming skills that are
essential for processing large datasets. It involves data structures, algorithms, and
software development.
• Example: A data scientist might use Python, a programming language, to write scripts
that clean and analyze large datasets. For instance, a company might use machine
learning algorithms to predict customer churn, employing techniques from computer
science to build and deploy these models effectively.
3. Domain Expertise
4. Mathematics
• Explanation: Mathematics is essential for understanding the models and algorithms used
in data analysis. Topics such as linear algebra, calculus, and probability theory form the
basis for many data science techniques.
• Example: In machine learning, algorithms often rely on mathematical concepts. For
instance, gradient descent, used in training models, is based on calculus. Understanding
these mathematical principles helps data scientists optimize their models for better
performance.
5. Business Intelligence
Data Science and Information Science
1. Information Science
• Definition: Information science deals with the collection, classification, organization, and
analysis of information. It focuses on data management, retrieval, and ensuring that
information is accessible and usable.
• Key Activities:
o Data Management: Organizing data in databases and information systems.
o Classification: Categorizing information for easier retrieval.
o Information Retrieval: Developing systems that help users find the information
they need, such as search engines and library databases.
• Example: A librarian using information science principles might categorize books in a
library to help patrons find specific titles or topics easily. They may also implement
digital systems for cataloging and retrieving information.
2. Data Science
• Definition: Data science goes a step further by using statistical and computational
methods to extract insights and knowledge from data. It involves analyzing large datasets
to uncover patterns, make predictions, and inform decision-making.
• Key Activities:
o Data Analysis: Applying statistical techniques to analyze data and identify
trends.
o Predictive Modeling: Using machine learning algorithms to forecast future
outcomes based on historical data.
o Data Visualization: Creating visual representations of data to communicate
findings effectively.
• Example: A data scientist working for a retail company might analyze customer purchase
data to predict which products will be popular during the holiday season, helping the
company stock inventory accordingly.
3. Overlap Between Data Science and Information Science
• Shared Foundations: Both fields require a strong understanding of how to organize and
manage data effectively. Knowledge of data structures, databases, and information
systems is essential in both disciplines.
• Analytical Focus: While information science emphasizes managing and retrieving data,
data science focuses more on analyzing that data to derive insights. This means that data
scientists often utilize tools and methods from information science but apply them in a
more analytical and predictive context.
Computational Thinking
1. Decomposition:
o Explanation: Breaking down a complex problem into smaller, manageable parts.
o Example: When developing a video game, a developer might decompose the
project into components like character design, game mechanics, and level design.
2. Pattern Recognition:
o Explanation: Identifying similarities or patterns in problems to help predict
future outcomes or simplify problem-solving.
o Example: A data scientist analyzing sales data might notice a pattern where sales
increase during specific seasons, which helps in forecasting future sales.
3. Abstraction:
o Explanation: Focusing on the essential features of a problem while ignoring
irrelevant details.
o Example: When designing a database, an engineer might abstract the data by
creating a model that captures only the necessary attributes (e.g., name, date of
birth) while leaving out unnecessary details.
4. Algorithm Design:
o Explanation: Developing a step-by-step procedure or set of rules to solve a
problem or perform a task.
o Example: Creating a recipe to bake a cake can be viewed as an algorithm: gather
ingredients, mix, bake, and cool.
Skills Needed to Do Data Science
1. Statistical Analysis
o Importance: Statistical analysis is fundamental to interpreting data and making
informed decisions based on it. Understanding statistics allows data scientists to
draw valid conclusions from datasets.
o Key Concepts:
▪ Descriptive Statistics: Summarizes data through measures like mean,
median, mode, variance, and standard deviation.
▪ Inferential Statistics: Involves hypothesis testing, confidence intervals,
and regression analysis to infer properties about a population based on
sample data.
o Example: A data scientist might use regression analysis to understand the
relationship between advertising spend and sales revenue, helping businesses
make budgetary decisions.
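To make this concrete, here is a minimal sketch of fitting such a regression in Python with scikit-learn; the advertising and sales figures are made up purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])   # advertising spend (in $1000s)
sales = np.array([120, 190, 250, 330, 390])           # sales revenue (in $1000s)

model = LinearRegression().fit(ad_spend, sales)
print("Estimated extra sales per $1000 of ads:", model.coef_[0])
print("Predicted sales at $60k ad spend:", model.predict([[60]])[0])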
2. Programming Skills
o Importance: Proficiency in programming allows data scientists to manipulate
data, perform analyses, and automate tasks efficiently.
o Common Languages:
▪ Python: Known for its simplicity and readability, Python is widely used
for data analysis, machine learning, and automation. Libraries such as
NumPy, Pandas, and Matplotlib facilitate various data tasks.
▪ R: Primarily used for statistical analysis and visualization. R has
numerous packages tailored for data analysis, like ggplot2 for
visualization and dplyr for data manipulation.
o Example: A data scientist might write a Python script to clean a dataset, perform
analysis, and visualize results using Matplotlib.
3. Data Manipulation and Cleaning
o Importance: Raw data is often messy and unstructured. Data manipulation skills
are essential for cleaning and transforming data into a usable format.
o Key Techniques:
▪ Handling Missing Values: Techniques like imputation or deletion.
▪ Data Transformation: Normalizing or scaling data, converting data
types, and encoding categorical variables.
o Tools:
▪ Pandas (Python): Offers powerful data structures like DataFrames for
data manipulation.
▪ dplyr (R): Provides functions for data manipulation in R.
o Example: A data scientist might use Pandas to remove duplicates and fill missing
values in a customer dataset before analysis.
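A minimal sketch of this kind of cleaning with Pandas; the column names and values below are hypothetical:
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 45, 45, None],
    "city": ["Pune", "Mumbai", "Mumbai", "Nashik"],
})

cleaned = customers.drop_duplicates().copy()                    # drop exact duplicate rows
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())   # impute missing ages with the mean
print(cleaned)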
4. Machine Learning
o Importance: Understanding machine learning algorithms enables data scientists
to build models that can make predictions or classifications based on data.
o Key Concepts:
▪ Supervised Learning: Algorithms trained on labeled data (e.g.,
regression, classification).
▪ Unsupervised Learning: Algorithms that find patterns in unlabeled data
(e.g., clustering, dimensionality reduction).
o Libraries:
▪ scikit-learn (Python): Offers a wide range of algorithms and tools for
machine learning.
▪ TensorFlow and Keras (Python): Used for building deep learning models.
o Example: A data scientist might use scikit-learn to train a model that predicts
customer churn based on historical data.
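A minimal sketch of such a churn model with scikit-learn, using a tiny made-up dataset (a real project would use far more rows and features):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.DataFrame({
    "tenure_months": [1, 24, 3, 36, 5, 48, 2, 60],
    "monthly_spend": [70, 40, 85, 30, 90, 25, 80, 20],
    "churned":       [1,  0,  1,  0,  1,  0,  1,  0],
})

X = data[["tenure_months", "monthly_spend"]]
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)   # train the churn classifier
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))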
5. Data Visualization
o Importance: Data visualization helps in presenting data insights clearly and
effectively, making it easier for stakeholders to understand complex data.
o Key Tools:
▪ Matplotlib (Python): A foundational plotting library for creating static,
animated, and interactive visualizations.
▪ Seaborn (Python): Built on Matplotlib, it provides a high-level interface
for drawing attractive statistical graphics.
▪ Tableau: A user-friendly business intelligence tool for creating interactive
visualizations and dashboards.
o Example: A data scientist might create a dashboard in Tableau to visualize sales
trends, allowing executives to quickly grasp performance metrics.
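Tableau itself is a point-and-click tool, but the same idea can be sketched in code; here is a hedged Matplotlib example with made-up monthly sales figures:
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 160, 210, 240]   # sales in $1000s (illustrative)

plt.plot(months, sales, marker="o")      # line chart of the sales trend
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales ($1000s)")
plt.show()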
6. Database Management
o Importance: Knowledge of databases is essential for retrieving and storing large
volumes of data efficiently.
o Key Skills:
▪ SQL (Structured Query Language): The standard language for querying
and managing relational databases.
o Example: A data scientist might use SQL to extract relevant data from a
company’s database for analysis, such as retrieving customer purchase histories.
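A minimal sketch of running SQL from Python using the built-in sqlite3 module; the table and column names are hypothetical:
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for illustration
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [(1, 250.0), (1, 90.5), (2, 40.0)])

# Retrieve each customer's total purchase amount
rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM purchases GROUP BY customer_id"
).fetchall()
print(rows)   # e.g. [(1, 340.5), (2, 40.0)]
conn.close()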
7. Domain Knowledge
o Importance: Understanding the specific industry or field where one is applying
data science is crucial for accurate interpretation of data and insights.
o Example: In healthcare, a data scientist should understand medical terms and
healthcare processes to analyze patient data effectively and provide actionable
insights.
Tools Needed to Do Data Science
1. Programming Languages
o Python: Versatile and widely used, ideal for data analysis, machine learning, and
scripting.
o R: Specifically designed for statistical analysis and data visualization.
2. Data Manipulation Libraries
o Pandas: Provides data structures for efficiently handling large datasets.
o NumPy: Supports numerical computing and handling large arrays and matrices.
3. Machine Learning Libraries
o scikit-learn: Offers tools for data mining and data analysis, including
classification, regression, and clustering.
o TensorFlow and Keras: Frameworks for building and training deep learning
models.
4. Data Visualization Tools
o Matplotlib: A foundational library for creating static, animated, and interactive
plots.
o Seaborn: Enhances Matplotlib for statistical data visualization.
o Tableau: A powerful tool for creating interactive visualizations and dashboards
for business intelligence.
5. Database Technologies
o SQL: Essential for managing and querying relational databases.
o MongoDB: A NoSQL database for handling unstructured data.
6. Big Data Technologies
o Apache Hadoop: A framework for distributed storage and processing of large
datasets.
o Apache Spark: A fast data processing engine that supports batch and stream
processing.
7. Cloud Platforms
o AWS (Amazon Web Services): Offers a wide range of cloud services, including
data storage and computing.
o Google Cloud Platform: Provides tools for data analytics and machine learning
in a cloud environment.
Storing Data
1. Relational Databases
2. NoSQL Databases
• Description: NoSQL databases are designed for unstructured or semi-structured data and
allow for flexible data models. They can handle large volumes of data and provide
scalability.
• Common Types:
o Document Stores: Store data as JSON-like documents (e.g., MongoDB).
o Key-Value Stores: Store data as key-value pairs (e.g., Redis).
o Column-Family Stores: Organize data into columns rather than rows (e.g.,
Cassandra).
o Graph Databases: Focus on storing and querying data with complex
relationships (e.g., Neo4j).
• Key Characteristics:
o Schema-Less: No rigid schema; data can be added without predefined structure.
o Horizontal Scalability: Can easily scale out by adding more servers.
• Use Case: A social media platform may use a graph database to store user profiles and
their connections, enabling efficient traversal of relationships for features like friend
suggestions.
3. Data Warehouses
• Description: Data warehouses are centralized repositories that store large volumes of
structured data from multiple sources, optimized for query and analysis rather than
transaction processing.
• Key Characteristics:
o ETL Process: Data is extracted, transformed, and loaded (ETL) from various
sources into the warehouse.
o Optimized for Read-Heavy Queries: Designed for complex queries and data
analysis, not for transaction processing.
• Common Technologies:
o Amazon Redshift: A fully managed data warehouse service that provides fast
query performance using SQL.
o Google BigQuery: A serverless data warehouse that enables fast SQL queries on
large datasets.
o Snowflake: A cloud-based data warehousing platform that separates storage and
compute resources for flexibility.
• Use Case: A retail company might use a data warehouse to analyze sales data across
different regions and products, generating insights for marketing strategies.
4. Data Lakes
• Description: A data lake is a centralized repository that holds vast amounts of raw data
in its native format until needed for analysis. It can store structured, semi-structured, and
unstructured data.
• Key Characteristics:
o Schema-On-Read: No need for a predefined schema; the schema is applied when
data is read.
o Scalability: Designed to store large volumes of data at low cost.
• Common Technologies:
o Apache Hadoop: An open-source framework for distributed storage and
processing of large datasets using a cluster of computers.
o Amazon S3: A scalable object storage service that can be used as a data lake.
o Azure Data Lake Storage: A service designed for big data analytics with
hierarchical namespace support.
• Use Case: A healthcare organization might use a data lake to store a variety of data types,
including electronic health records, imaging data, and sensor data from medical devices,
for future analysis.
5. File Systems
• Description: Traditional file systems store data in files, which can be structured or
unstructured. Common formats include CSV, JSON, XML, and Parquet.
• Key Characteristics:
o Simplicity: Easy to implement for small-scale projects.
o Manual Management: Data management and organization require manual effort.
• Use Case: A data scientist may store cleaned datasets as CSV files in a directory for
quick access during analysis and sharing with team members.
6. Cloud Storage
• Description: Cloud storage provides scalable and flexible storage solutions via the
internet, allowing for easy access and management of data.
• Key Characteristics:
o On-Demand Scalability: Easily scale storage up or down based on needs.
o Cost-Effective: Pay for only what you use, with no upfront hardware costs.
• Common Services:
o Amazon S3: An object storage service that offers high availability and durability.
o Google Cloud Storage: Provides scalable storage for various types of data.
o Microsoft Azure Blob Storage: A scalable object storage solution for
unstructured data.
• Use Case: A startup might use cloud storage to save user-generated content, such as
images and videos, ensuring that they can scale as their user base grows.
Considerations for Choosing a Data Storage Solution
1. Data Volume and Velocity:
o Consider the expected data volume and the speed at which data will be generated.
Solutions like NoSQL databases and cloud storage can efficiently handle high-
velocity data streams.
2. Data Structure:
o Assess whether the data is structured, semi-structured, or unstructured. Relational
databases are suitable for structured data, while NoSQL and data lakes
accommodate unstructured data.
3. Access Speed and Query Performance:
o Determine performance requirements for data retrieval and analysis. Data
warehouses and in-memory databases like Redis are optimized for fast query
performance.
4. Scalability:
o Ensure that the chosen solution can scale as data storage needs grow. Cloud
solutions offer significant scalability compared to traditional on-premises
systems.
5. Data Security:
o Implement security measures to protect sensitive data, including encryption,
access controls, and compliance with data protection regulations.
6. Cost:
o Evaluate the cost implications of the storage solution, including ongoing
operational costs, maintenance, and potential retrieval fees. Cloud storage
solutions often provide cost-effective options compared to traditional hardware.
Combining Bytes into Larger Structures
1. Bytes:
o A byte consists of 8 bits and can represent values from 0 to 255.
o Bytes are the basic building blocks for more complex data types.
2. Larger Data Structures:
o Data structures can combine multiple bytes to represent more complex types of
information. Common larger structures include:
▪ Integers: Typically represented by 4 bytes (32 bits) or 8 bytes (64 bits).
▪ Floats: Usually stored in 4 bytes (single precision) or 8 bytes (double
precision).
▪ Characters and Strings: Strings are arrays of characters, with each
character usually taking 1 byte (ASCII) or 2 bytes (UTF-16).
▪ Arrays: A collection of elements of the same type, where each element
can be several bytes.
▪ Structures/Records: Custom data types that group different types
together, often used in languages like C and C++.
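A small illustration using Python's struct module of how these types map to bytes (the format codes below assume 32-bit integers and floats):
import struct

print(struct.pack("<i", 10))        # 32-bit integer 10 -> 4 bytes
print(struct.pack("<f", 3.14))      # 32-bit float 3.14 -> 4 bytes (IEEE 754)
print("Hello".encode("ascii"))      # string -> 5 bytes: b'Hello'
print(struct.pack("<3i", 1, 2, 3))  # array of three 32-bit integers -> 12 bytes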
1. Primitive Data Types
• Integers:
o A 32-bit integer is represented using 4 bytes. For example, the integer 10 is stored
in memory as 00000000 00000000 00000000 00001010.
• Floating Point:
o A 32-bit float (single precision) is also stored in 4 bytes, following the IEEE 754
standard. For instance, the float 3.14 is stored (approximately) in binary as
01000000 01001000 11110101 11000011 (hex 40 48 F5 C3).
2. Strings
• Strings are often stored as arrays of bytes, where each character corresponds to a byte or
more. For example, the string "Hello" in ASCII is stored as the five bytes
72 101 108 108 111 (hex 48 65 6C 6C 6F).
3. Arrays
• An array of integers combines the bytes of its elements. For example, an array of 3
integers (e.g., [1, 2, 3]) would use 12 bytes in total (4 bytes per integer):
00000000000000000000000000000001 (1)
00000000000000000000000000000010 (2)
00000000000000000000000000000011 (3)
4. Structures/Records
• In languages like C or C++, a structure can combine different data types. For example:
struct Person {
    char name[20];   // 20 bytes for name
    int age;         // 4 bytes for age
};
• The memory layout for a Person object combines the bytes for the name and the age: 20
bytes for the name followed by 4 bytes for the age, typically 24 bytes in total (the
compiler may insert padding for alignment).
When combining bytes into larger structures, the order in which bytes are arranged can differ
based on the system architecture:
1. Little-Endian:
o The least significant byte is stored first.
o For example, the integer 1 (binary 00000001) in a 4-byte structure would be stored
as 01 00 00 00 (least significant byte first).
2. Big-Endian:
o The most significant byte is stored first.
o The same integer would be represented as 00 00 00 01 (most significant byte first).
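A quick way to see both byte orders is Python's struct module, as sketched below:
import struct

print(struct.pack("<I", 1))   # little-endian: b'\x01\x00\x00\x00' (LSB first)
print(struct.pack(">I", 1))   # big-endian:    b'\x00\x00\x00\x01' (MSB first)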
Applications
1. Data Serialization:
o Converting complex data structures into a byte stream for storage or transmission.
For example, JSON or Protocol Buffers serialize data structures for network
communication.
2. Network Communication:
o Data is often transmitted over networks in byte streams, requiring proper
structuring to ensure the receiving end interprets the data correctly.
3. File Formats:
o File formats (e.g., BMP, PNG, MP3) define how data is structured in files,
combining bytes into headers, metadata, and content.
4. Database Storage:
o Databases combine bytes to represent various data types, optimizing storage and
retrieval.
Creating a Dataset
1. Problem Definition
Objective: To predict house prices based on various features like size, location, number of
bedrooms, and other relevant factors.
• Identify Objectives: The goal is to develop a model that predicts house prices based on
available features. This could be used by real estate agents, buyers, or analysts to
understand market trends.
• Determine Features: Relevant features might include:
o Size of the house (in square feet)
o Location (e.g., neighborhood, zip code)
o Number of bedrooms
o Number of bathrooms
o Year built
o Lot size
o Amenities (e.g., garage, pool)
2. Data Collection
A. Primary Data Collection
1. Surveys: Create a survey targeting homeowners in a specific area to gather details about
their homes. Questions could include:
o Size of the house
o Number of bedrooms and bathrooms
o Year built
o Current market price (if willing to disclose)
2. Interviews: Conduct interviews with real estate agents to gain insights into market trends
and property features that influence prices.
B. Secondary Data Collection
1. Public Datasets: Utilize existing datasets from government sources or real estate
websites. For example, the U.S. Census Bureau and Zillow often provide relevant data.
2. Web Scraping: Use web scraping tools to collect data from real estate listing sites. Tools
like Beautiful Soup (Python) can extract property listings, including features and prices.
3. APIs: Access data from real estate APIs (e.g., Zillow API) to obtain current property
listings and historical sales data.
The collected data can be organized into columns such as:
Size (sqft) Location Bedrooms Bathrooms Year Built Lot Size (sqft) Price
3. Data Preparation
A. Data Cleaning
B. Data Transformation
4. Dataset Structuring
• Organize the data into a tabular format, where each row corresponds to a property and
each column represents a feature.
After cleaning and transforming, your dataset may look like this:
Size (sqft) Bedrooms Bathrooms Year Built Lot Size (sqft) Price Age Price_per_sqft
5. Data Validation
• Consistency Checks: Ensure all entries follow expected formats. For instance, confirm
that Bedrooms and Bathrooms are non-negative integers.
• Sampling: Randomly sample a portion of the dataset to check for anomalies or
unexpected distributions.
6. Data Storage
• Save the cleaned and structured dataset in a suitable format for analysis, such as:
o CSV File: Easy to read and manipulate.
o Excel File: Useful for sharing and manual inspection.
o Database: Store in a relational database for efficient querying and scalability.
Using Python's Pandas library, you can save the dataset to a CSV file like this (assuming the
cleaned data is in a DataFrame named df; the file name is illustrative):
import pandas as pd

df.to_csv("housing_dataset.csv", index=False)   # write the cleaned dataset to a CSV file
Identifying Data Problems
Common data problems include:
1. Missing Data
o Description: Incomplete records where some values are absent.
o Identification:
▪ Check for null or NaN values in your dataset using methods like isnull() in
Pandas.
▪ Calculate the percentage of missing values in each column.
o Impact: Missing data can bias results and reduce the statistical power of analyses.
2. Duplicate Data
o Description: Multiple identical records in the dataset.
o Identification:
▪ Use functions like duplicated() in Pandas to find duplicate rows.
▪ Analyze the dataset’s cardinality (number of unique values) for fields that
should be unique.
o Impact: Duplicates can skew analysis, leading to overestimation of trends or
patterns.
3. Outliers
o Description: Data points that differ significantly from the rest of the dataset.
o Identification:
▪ Visualize data using box plots or scatter plots to spot anomalies.
▪ Calculate z-scores or use the IQR method to detect extreme values.
o Impact: Outliers can distort statistical analyses and model performance if not
handled appropriately.
4. Inconsistent Data
o Description: Variations in data representation that lead to confusion (e.g., "USA"
vs. "United States").
o Identification:
▪ Use functions to check for unique values in categorical columns.
▪ Analyze the dataset for variations and inconsistencies in spelling,
formatting, or casing.
o Impact: Inconsistent data can lead to misinterpretations and errors in grouping or
analysis.
5. Incorrect Data Types
o Description: Data stored in an inappropriate format (e.g., numbers stored as
strings).
o Identification:
▪ Use the dtypes attribute in Pandas to inspect data types of each column.
▪ Look for discrepancies between expected types and actual types (e.g.,
numerical operations on strings).
o Impact: Incorrect data types can lead to errors in calculations and analyses.
6. Data Bias
o Description: Systematic favoritism in data collection leading to skewed results.
o Identification:
▪ Examine the data collection process to identify any biases (e.g., sampling
bias).
▪ Analyze distributions of data points across different categories.
o Impact: Data bias can produce misleading insights and exacerbate existing
inequalities.
7. Unbalanced Classes
o Description: Significant disparities in class distributions in classification
problems (e.g., many more instances of one class than another).
o Identification:
▪ Use value counts or histograms to visualize class distributions.
o Impact: Unbalanced classes can lead to models that favor the majority class,
reducing overall predictive performance.
8. Noisy Data
o Description: Data containing random errors or fluctuations.
o Identification:
▪ Look for unexpected variance in data patterns through visual inspection or
statistical analysis.
▪ Use methods like signal-to-noise ratio to quantify noise levels.
o Impact: Noise can obscure true patterns in data, making it harder to draw reliable
conclusions.
Techniques for Identifying Data Problems
1. Descriptive Statistics:
o Use summary statistics (mean, median, mode, standard deviation) to quickly
assess data characteristics.
o Identify anomalies by comparing summary statistics across different segments.
2. Data Visualization:
o Create visualizations (e.g., histograms, box plots, scatter plots) to spot trends,
outliers, and distributions.
o Visual tools can highlight patterns that might not be obvious in raw data.
3. Data Profiling:
o Perform data profiling to get a comprehensive view of the dataset, checking for
completeness, accuracy, and consistency.
o Tools like Pandas Profiling or Dask can automate this process.
4. Automated Data Quality Checks:
o Implement automated scripts to regularly check for common data problems
(missing values, duplicates, incorrect types).
o Use data validation libraries like Great Expectations to set expectations and
validate data against them.
5. Cross-Validation with External Sources:
o Validate data against external benchmarks or datasets to identify inconsistencies
or errors.
o For instance, compare survey results with demographic data from national
statistics.
Let’s assume you have a dataset for a customer satisfaction survey with the following columns:
CustomerID  Age  Satisfaction Score  Comments
1           25   8                   Good service
2           NaN  7
3           30   10                  Excellent
4           25   6                   Good
5           40   8                   USA
6           25   8                   Good service
7           30   15                  Excellent
8           25   NaN                 Good
9           25   8                   Unbelievable
Identifying Problems
1. Missing Data:
o CustomerID 2 has a missing age (and an empty comment), and CustomerID 8 has a
missing satisfaction score.
2. Duplicate Data:
o CustomerID 6 has the same responses as CustomerID 1, indicating a potential
duplicate.
3. Outliers:
o CustomerID 7 has a satisfaction score of 15, which exceeds the expected range
(usually between 1-10).
4. Inconsistent Data:
o The Comments for CustomerID 5 contains "USA", which is not a relevant comment
compared to others.
5. Incorrect Data Types:
o Satisfaction Score might be recorded as strings in some cases, which would need to
be converted to numeric values.
1. Missing Data
• Identification:
o Columns Age and Satisfaction Score have missing values (NaN).
o Check with Python's Pandas:
import pandas as pd

# Sample data
data = {
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Age': [25, None, 30, 25, 40, 25, 30, 25, 25],
    'Satisfaction Score': [8, 7, 10, 6, 8, 8, 15, None, 8],
    'Comments': ['Good service', '', 'Excellent', 'Good', 'USA',
                 'Good service', 'Excellent', 'Good', 'Unbelievable']
}
df = pd.DataFrame(data)
print(df.isnull().sum())   # count of missing values in each column
• Impact: Missing data can lead to biased analyses. For instance, if Age is a relevant factor
for satisfaction, ignoring those with missing ages could skew results.
2. Duplicate Data
• Identification:
o Check for duplicate CustomerID or identical rows. In this example, CustomerID 6 has
the same responses as CustomerID 1.
o Use:
# duplicated() with its defaults compares every column, and the unique CustomerID
# would hide the repeat, so compare only the response columns
duplicates = df.duplicated(subset=['Age', 'Satisfaction Score', 'Comments'])
print(df[duplicates])
• Impact: Duplicates can inflate satisfaction scores and lead to incorrect conclusions about
overall customer sentiment.
3. Outliers
• Identification:
o The satisfaction score of 15 for CustomerID 7 exceeds the expected range (typically
between 1 and 10).
o Use visualization (box plots or histograms) to easily spot outliers:
import matplotlib.pyplot as plt

plt.boxplot(df['Satisfaction Score'].dropna())   # drop NaN so the plot renders correctly
plt.title('Boxplot of Satisfaction Scores')
plt.show()
• Impact: Outliers can distort average satisfaction scores and affect model training.
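Beyond the box plot, the outlier can also be flagged numerically; here is a minimal sketch using the IQR rule, continuing with the df built earlier (on such a tiny sample the rule also flags the score of 10):
scores = df["Satisfaction Score"].dropna()
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # standard IQR fences
print(scores[(scores < lower) | (scores > upper)])        # flags 15 (and, here, 10)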
4. Inconsistent Data
• Identification:
o The Comments field contains an irrelevant entry ("USA") compared to other
comments. This inconsistency can lead to confusion during text analysis.
o Check unique values:
unique_comments = df['Comments'].unique()
print(unique_comments)
• Impact: Inconsistent data can lead to misinterpretation during analysis or when training
models that rely on text data.
5. Incorrect Data Types
• Identification:
o If Satisfaction Score is stored as a string instead of a numeric type, it could hinder
calculations.
o Check data types:
print(df.dtypes)
• Impact: Incorrect data types can cause errors during analysis or modeling processes.
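A minimal sketch of fixing such a column with Pandas, continuing with the survey DataFrame; errors="coerce" turns unparseable values into NaN:
import pandas as pd

df["Satisfaction Score"] = pd.to_numeric(df["Satisfaction Score"], errors="coerce")
print(df.dtypes)   # the column is now a numeric type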
Understanding Data Sources
When evaluating potential data sources, consider the following:
1. Data Quality:
o Assess the reliability, accuracy, and completeness of the data.
o Consider the source's reputation and the methodology used for data collection.
2. Relevance:
o Ensure that the data is pertinent to the research question or business problem at
hand.
o Evaluate whether primary or secondary data is more appropriate for your needs.
3. Timeliness:
o Check whether the data is up-to-date and relevant for the current analysis.
o For time-sensitive analyses, ensure that the data reflects the most recent
information.
4. Ethics and Compliance:
o Be aware of data privacy regulations (e.g., GDPR, HIPAA) when using personal
data.
o Obtain necessary permissions when collecting primary data or using secondary
data.
5. Integration:
o Consider how different data sources can be combined. Structured data from a
database might need to be integrated with unstructured text data from customer
reviews.
You want to analyze customer feedback to improve a product. Potential data sources you might
consider include customer surveys, product reviews, social media posts, and support tickets.
Exploring Data Models
Common model types include:
• Regression Analysis: Models relationships between variables; used for predictive tasks.
• Decision Trees: A tree-like model for classification and regression that breaks down data
into subsets.
• Neural Networks: Inspired by the human brain; useful for complex pattern recognition,
especially in unstructured data.
• Support Vector Machines (SVM): Effective for classification tasks, especially in high-
dimensional spaces.
• Clustering Algorithms: Such as K-means and hierarchical clustering, used to group
similar data points.
3. Model Evaluation
• Metrics: Accuracy, precision, recall, F1 score for classification; RMSE, MAE for
regression.
• Cross-Validation: Techniques like k-fold cross-validation help ensure that the model
generalizes well to unseen data (a minimal sketch follows this list).
• Overfitting and Underfitting: Balancing model complexity to avoid these common
pitfalls.
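As referenced above, here is a minimal sketch of k-fold cross-validation with scikit-learn on one of its built-in datasets:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # 5-fold CV
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())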
Overfitting and Underfitting
1. Overfitting
o Definition: Overfitting occurs when a model learns the training data too well,
including noise and outliers, which leads to poor performance on unseen data.
o Example: If a model perfectly predicts training data but fails to generalize to new
data, like memorizing answers to specific questions without understanding the
underlying patterns.
2. Underfitting
o Definition: Underfitting happens when a model is too simple to capture the
underlying structure of the data, leading to poor performance on both training and
test data.
o Example: Using a linear model to predict a complex quadratic relationship will
result in high errors for both training and test datasets.
To address underfitting:
• Increase Model Complexity: Use more features or a more complex algorithm that can
capture the necessary patterns.
• Feature Engineering: Add or modify features to help the model learn better.
4. Data Preparation
Cleaning Data
Data cleaning involves preprocessing raw data to make it suitable for analysis. This step is
crucial because real-world data is often messy and contains various issues.
Feature Engineering
Feature engineering is the process of using domain knowledge to create new features that help
improve model performance.
1. Creating New Features
o Feature Creation Techniques:
▪ Binning: Convert continuous variables into categorical bins.
▪ Example: For age, you might create bins like "0-18," "19-35," "36-
60," "60+" to analyze patterns within age groups.
▪ Datetime Features: Extract useful features from datetime fields.
▪ Example: From a timestamp, you could derive features like "day of
the week," "month," or "hour."
2. Selecting Important Features
o Feature Selection Techniques: Identify which features contribute most to the
predictive power of the model.
▪ Correlation Analysis: Analyze correlation matrices to find highly
correlated features and eliminate those that do not add value.
▪ Recursive Feature Elimination (RFE): Use algorithms to recursively
remove the least important features and evaluate model performance.
▪ Domain Knowledge: Leverage domain knowledge to select features that
are known to be important for the problem at hand.
5. Model Deployment
Integrating a machine learning model into a production system is a critical step that involves
deploying the model in a way that it can be used to make predictions in real time or on-demand.
Here are the key aspects of integration:
1. Deployment Strategies
o Batch Processing: In this approach, predictions are made on a batch of data at
regular intervals. This is suitable for scenarios where real-time predictions are not
required.
▪ Example: A retail company might run a daily job to predict inventory
needs for the next week.
o Real-Time Processing: Models are integrated into applications to provide instant
predictions. This often involves APIs (Application Programming Interfaces) that
serve predictions based on user inputs.
▪ Example: A fraud detection system in banking that evaluates transactions
in real-time to flag suspicious activity.
o Edge Deployment: For applications requiring low latency or working in
constrained environments, models can be deployed on local devices or IoT
devices.
▪ Example: Anomaly detection in manufacturing using sensors directly on
machinery.
2. Model Serving
o REST APIs: Many organizations expose their models through RESTful APIs.
This allows applications to send data to the model and receive predictions.
▪ Example: A web application that predicts customer churn can send user
data to the model via an API endpoint and receive a churn probability in
response. (A minimal sketch of such an endpoint appears after this list.)
o Microservices Architecture: Deploying models as microservices allows for
scalable and independent model updates without affecting the entire application.
▪ Example: Different machine learning models for customer
recommendations, fraud detection, and sentiment analysis can run as
separate services.
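As referenced in the REST API example above, here is a minimal Flask-based sketch of serving a model; the model file name and feature names are hypothetical:
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:        # a previously trained churn model (hypothetical file)
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()           # e.g. {"tenure_months": 3, "monthly_spend": 85}
    features = [[payload["tenure_months"], payload["monthly_spend"]]]
    churn_probability = model.predict_proba(features)[0][1]
    return jsonify({"churn_probability": float(churn_probability)})

if __name__ == "__main__":
    app.run(port=5000)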
3. Containerization and Orchestration
o Docker: Using Docker to package models with their dependencies helps ensure
consistency across environments (development, testing, production).
▪ Example: A data science team can build a Docker container with the
model and all necessary libraries, ensuring it runs the same way in
production as it did in development.
o Kubernetes: This orchestration tool can manage containerized applications,
allowing for scaling, load balancing, and easier deployment of machine learning
models.
▪ Example: Automatically scaling up the number of model instances during
peak usage times for a recommendation engine.
4. CI/CD Pipelines
o Continuous Integration/Continuous Deployment: Setting up CI/CD pipelines
allows for automated testing and deployment of new model versions, facilitating
quick updates and rollbacks.
▪ Example: Using tools like Jenkins or GitHub Actions to automatically
deploy a new model version after passing tests on performance metrics.
Model Monitoring and Maintenance
Once a machine learning model is deployed, it's crucial to monitor its performance over time.
This ensures that the model remains effective and adapts to any changes in the data distribution
or user behavior.
4. Model Retraining
o Scheduled Retraining: Periodically retrain models using fresh data to adapt to
changes in the underlying patterns.
▪ Example: A predictive maintenance model may need retraining as new
sensor data becomes available.
o Automated Retraining Pipelines: Setting up pipelines that automatically retrain
models based on performance degradation or data drift detection.
▪ Example: Using frameworks like MLflow or Kubeflow to facilitate the
retraining and deployment process seamlessly.
5. Logging and Analytics
o Model Logs: Maintain logs of model predictions, input data, and any encountered
errors. This data can provide insights into model performance and areas for
improvement.
▪ Example: Analyzing logs to understand common inputs leading to high
error rates can inform model improvements.
6. User Feedback Incorporation
o Feedback Loops: Implement mechanisms to gather feedback from end users
regarding model predictions, which can inform future improvements.
▪ Example: If users report inaccuracies in predictions, the model can be
adjusted or retrained with this feedback incorporated.
6. Emerging Trends
Automated Machine Learning (AutoML)
Definition: AutoML refers to the use of automated processes to simplify the workflow of
machine learning. This includes automating tasks like model selection, hyperparameter tuning,
feature engineering, and model evaluation, making it easier for users, especially those without
extensive expertise, to develop machine learning models.
1. Model Selection
o AutoML tools automatically evaluate and select the best model for a given dataset
from a variety of algorithms (e.g., decision trees, random forests, neural
networks).
o Example: A user uploads a dataset, and the AutoML tool might test several
algorithms and suggest the one that performs best based on validation metrics.
2. Hyperparameter Tuning
o Hyperparameters are settings that govern the training process of models (e.g.,
learning rate, depth of a tree). AutoML frameworks can perform systematic
searches (like grid search or random search) or more sophisticated methods (like
Bayesian optimization) to find the optimal hyperparameters.
o Example: Instead of manually trying out different values for learning rate and
number of trees in a random forest, AutoML can automate this process and find
the best combination.
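A minimal sketch of such an automated search using scikit-learn's GridSearchCV (a hand-rolled analogue of what AutoML tools do internally); the parameter grid is illustrative:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)                                      # tries every combination with 3-fold CV
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)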
3. Feature Engineering
o AutoML can generate new features or select important features automatically,
which is often one of the most time-consuming tasks in machine learning.
o Example: For a dataset containing user information, AutoML might create new
features such as age groups, interaction terms, or polynomial features.
4. Ensembling
o Many AutoML systems employ ensembling techniques, where multiple models
are combined to improve overall performance.
o Example: Combining predictions from several algorithms (like logistic regression
and gradient boosting) to create a stronger predictive model.
5. User-Friendly Interfaces
o Many AutoML tools provide graphical user interfaces (GUIs) that allow users to
interact with the system without requiring extensive programming knowledge.
o Example: Platforms like Google Cloud AutoML and H2O.ai provide intuitive
interfaces for users to upload data and run models without deep technical
expertise.
Explainable AI (XAI)
Definition: Explainable AI refers to techniques and methods that make the outputs of machine
learning models understandable to humans. XAI aims to ensure transparency, accountability, and
trust in AI systems, especially in critical applications like healthcare and finance.
Importance of XAI:
1. Transparency
o Models should not operate as "black boxes." XAI helps elucidate how models
make decisions, revealing the factors that contributed to a specific prediction.
o Example: In a credit scoring model, XAI tools can show which features (like
income, credit history, etc.) influenced the credit decision and how.
2. Interpretability
o Users should be able to understand model predictions in a human-friendly
manner. This is especially crucial in regulated industries where understanding
decision-making processes is mandatory.
o Example: Using techniques like LIME (Local Interpretable Model-agnostic
Explanations) to explain individual predictions made by complex models like
deep neural networks.
3. Accountability
o With increasing scrutiny on AI systems, organizations need to justify model
predictions. XAI aids in building accountable systems that can be audited and
verified.
o Example: If a model denies a loan application, XAI can provide an explanation
that shows the specific criteria leading to that decision.
4. Bias Detection
o XAI can help identify and mitigate biases in models, ensuring fair treatment of
different groups. Understanding how a model makes decisions can reveal
potential discriminatory patterns.
o Example: Analyzing a hiring algorithm to ensure it does not unfairly
disadvantage certain demographic groups.
Transfer Learning
Definition: Transfer learning is a machine learning technique where a model developed for one
task is reused as the starting point for a model on a second, related task. This approach is
especially valuable in situations where labeled data is scarce or expensive to obtain.
1. Pre-trained Models
o Transfer learning often involves using pre-trained models that have already been
trained on large datasets (e.g., ImageNet for image classification).
o Example: Using a model like VGG16 or ResNet, which has been trained on
millions of images, as a base for a specific image classification task (e.g.,
classifying medical images).
2. Fine-Tuning
o After using a pre-trained model, fine-tuning involves adjusting the model's
weights based on a smaller, task-specific dataset.
o Example: You might take a pre-trained model for recognizing everyday objects
and fine-tune it on a smaller dataset of specific plant species images.
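A minimal sketch of this workflow with TensorFlow/Keras; the input size and the five target classes are assumptions for illustration:
import tensorflow as tf

# Load VGG16 pre-trained on ImageNet, without its original classification head
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # 5 new target classes (e.g. plant species)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)     # fine-tune on the small, task-specific dataset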
3. Reduced Training Time
o Transfer learning significantly reduces training time since the model starts with
learned weights rather than starting from scratch.
o Example: Training a model from scratch on a small dataset might take weeks,
while fine-tuning a pre-trained model can take just a few hours.
4. Performance Boost
o Transfer learning often leads to improved performance, especially when data for
the new task is limited.
o Example: A sentiment analysis model that leverages a pre-trained language
model (like BERT) will typically perform better than one trained from scratch on
a small dataset.