ARTIFICIAL INTELLIGENCE CURRICULUM
Teacher Handbook for Class XII
Subject Code: 843
Acknowledgments
Patrons
Mr. Rahul Singh, IAS, Chairperson, Central Board of Secondary Education
Strategic Guidance
Dr. Biswajit Saha, Director (Skill Education), Central Board of Secondary Education
Sh. Ravinder Pal Singh, Joint Secretary, Department of Skill Education, Central Board of
Secondary Education
Strategic Advisory
Ms. Shipra Sharma, CSR Leader, India/South Asia, IBM
Dr. Mani Madhukar, Program Lead - SkillsBuild, IBM
This revised textbook, designed for students in Classes XI and XII, dives into the captivating
world of AI, offering a comprehensive exploration of its core concepts, applications, and
potential impact. As you embark on this journey, you will not only delve into the fascinating
algorithms that power AI systems, but also examine its ethical considerations and its
profound implications for the future.
This is no longer science fiction. AI is here, and it holds immense potential to improve our
lives in countless ways. This textbook equips you, the future generation, with the knowledge
and critical thinking skills necessary to navigate this rapidly evolving landscape. Through
engaging exercises and thought-provoking questions, you will be challenged to not only
understand AI but also to consider its role in your own future.
The Central Board of Secondary Education (CBSE) recognizes the transformative power of
Artificial Intelligence (AI) and its impact on the future. Building upon this successful introduction, CBSE extended the AI subject to Classes XI & XII starting in the 2020-2021 academic session, thus allowing students to delve deeper into the world of AI and develop a more comprehensive understanding.
This AI Curriculum has been created with the help of teacher advisors managed by 1M1B
and supported by IBM. This curriculum aligns with industry standards as set forth by the
National Skills Qualification Framework (NSQF) at Levels 3 & 4.
CBSE acknowledges and appreciates the valuable contribution of IBM India in developing
the AI curriculum and conducting training programs. This collaborative effort ensures
educators are well-equipped to deliver the AI curriculum effectively.
By working together, CBSE and its partners aim to empower students to embrace the future.
By incorporating AI into their learning experience, students gain the knowledge and skills
necessary to not only understand AI but also leverage its potential to enhance their learning
and future prospects.
The future is full of possibilities, and AI is poised to play a pivotal role. Are you ready to be a
part of it?
Learning Objectives:
1. Review the basics of the NumPy and Pandas libraries, including arrays and essential functions.
2. Efficiently import and export data between CSV files and Pandas DataFrames.
3. Implement the Linear Regression algorithm, including data preparation and model training.
Key Concepts:
1. Recap of NumPy library
2. Recap of Pandas library
3. Importing and Exporting Data between CSV Files and Data Frames
4. Handling missing values
5. Linear Regression algorithm
Learning Outcomes:
Students will be able to:
1. Apply the fundamental concepts of the NumPy and Pandas libraries to perform data
manipulation and analysis tasks.
2. Import and export data between CSV files and Pandas Data Frames, ensuring data
integrity and consistency.
Prerequisites: Foundational understanding of Python from Class XI and familiarity with basic programming concepts.
Become a Python Powerhouse: A Teacher's Guide to Python Programming-II
This lesson transforms your classroom into a hub of Python programming excellence! Students
will master foundational Python libraries and practical techniques, equipping them for advanced
data analysis and machine learning.
5. Practical Applications
• Case Studies:
o Real-life examples like student performance analysis or marketing campaign
insights.
o Activity: Students choose a small-scale project to apply Python skills, analyse data,
and draw actionable insights.
Additional Tips:
• Encourage group work to foster collaboration and problem-solving.
• Provide clear instructions and real-world datasets to make learning relatable.
• Emphasize the iterative nature of data analysis and encourage students to experiment.
By incorporating these elements, educators can equip students with a robust foundation in Python
programming, empowering them to tackle real-world challenges with confidence and creativity.
Teachers can ask the following questions to spark curiosity before starting the topics:
• How do you think businesses analyze the performance of their products or services
using data? What tools might they use to make sense of millions of numbers and
trends? (This question introduces the relevance of libraries like Pandas and NumPy in
real-world scenarios and connects the lesson to practical applications students can relate
to.)
• If you had a magic tool to predict future events (like house prices or exam scores),
what kind of data would you need to make accurate predictions? (This question sets
the stage for discussing Linear Regression and the importance of data handling,
preparation, and analysis in predictive modeling.)
Let's recap a few of these libraries (covered in Class XI) that are incredibly valuable in the realm
of Artificial Intelligence, Data science and analytics.
In NumPy, the number of dimensions of an array is called the rank of the array.
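As a quick illustration (the array values below are arbitrary), the ndim attribute returns this rank:
import numpy as np

arr1 = np.array([1, 2, 3, 4])            # 1-D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array

print(arr1.ndim)   # 1, i.e. rank 1
print(arr2.ndim)   # 2, i.e. rank 2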
1.1.2 Pandas Library
Where and why do we use the Pandas library in Artificial Intelligence?
i) Creation of a Series from Scalar Values- A Series can be created using scalar values as shown
below:
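A minimal sketch (the scalar value and index labels are chosen only for illustration):
import pandas as pd

# A single scalar value is repeated for every index label supplied
s = pd.Series(10.5, index=['a', 'b', 'c'])
print(s)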
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
The dictionary keys become column labels by default in a DataFrame, and the lists become the
columns of data.
To add a new column for another student ‘Fathima’, we can write the following statement:
Result['Fathima']=[89,78,76]
print(Result)
ii) Adding a New Row to a DataFrame:
Result.loc['English'] = [90, 92, 89, 80, 90, 88]
print(Result)
The DataFrame.loc[] method can also be used to change the data values of a row to a particular set of values. For example, to change the marks of Science:
Result.loc['Science'] = [92, 84, 90, 72, 96, 88]
print(Result)
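The snippets above operate on a Result DataFrame built earlier in the chapter. A self-contained sketch with illustrative student names and marks (assumptions, not the handbook's original data) shows the same column and row operations end to end:
import pandas as pd

# Illustrative DataFrame: columns are students, index labels are subjects
Result = pd.DataFrame({'Arnav': [90, 91, 97],
                       'Heena': [92, 81, 45],
                       'Meera': [95, 96, 99],
                       'Joseph': [89, 87, 90],
                       'Suhana': [65, 50, 96]},
                      index=['Maths', 'Science', 'Hindi'])

Result['Fathima'] = [89, 78, 76]                   # add a new column (one value per row)
Result.loc['English'] = [90, 92, 89, 80, 90, 88]   # add a new row (one value per column)
Result.loc['Science'] = [92, 84, 90, 72, 96, 88]   # overwrite an existing row
print(Result)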
1.1.2.4 Attributes of DataFrames
We are going to use following data as example to understand the attributes of a DataFrame.
import pandas as pd
# creating a 2D dictionary
dict = {"Student": pd.Series(["Arnav","Neha","Priya","Rahul"],
index=["Data 1","Data 2","Data 3","Data 4"]),
"Marks": pd.Series([85, 92, 78, 83],
index=["Data 1","Data 2","Data 3","Data 4"]),
"Sports": pd.Series(["Cricket","Volleyball","Hockey","Badminton"],
index=["Data 1","Data 2","Data 3","Data 4"])}
# creating a DataFrame
df = pd.DataFrame(dict)
# printing this DataFrame on the output screen
print(df)
i) DataFrame.index
>>>df.index
ii) DataFrame.columns
>>>df.columns
iii) DataFrame.shape
>>>df.shape
(4,3)
iv) DataFrame.head(n)
>>>df.head(2)
v) DataFrame.tail(n)
>>>df.tail(2)
1.2. Import and Export Data between CSV Files and DataFrames
CSV files, which stand for Comma-Separated Values, are simple text files used to store tabular
data. Each line in a CSV file represents a row in the table, and each value in the row is separated
by a comma. This format is widely used because it is easy to read and write, both for humans and
computers.
In Python, CSV files are incredibly important for data analysis and manipulation. We often use the
Pandas library to load, manipulate, and analyze data stored in CSV files. Pandas provides powerful
tools to read CSV files into Data Frames, which are data structures that allow us to perform
complex operations on the data with ease. This makes CSV files a go-to format for data scientists
and analysts working with Python.
Once the CSV file is uploaded, we can execute the following code to read it into a DataFrame.
import pandas as pd
df=pd.read_csv("studentsmarks.csv")
print(df)
In a Python IDE, we can directly give the complete path of the CSV file within the parentheses.
import pandas as pd
import io
df = pd.read_csv('C:/PANDAS/studentsmarks.csv',sep =",", header=0)
print(df)
The resultout.csv is created
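A minimal sketch of exporting a DataFrame with DataFrame.to_csv() (the DataFrame contents here are illustrative):
import pandas as pd

df = pd.DataFrame({'Name': ['Heena', 'Shefali', 'Meera'],
                   'Maths': [90, 91, 97]})
# Write the DataFrame to a CSV file; index=False omits the row labels
df.to_csv('resultout.csv', index=False)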
Let us feed the data into Python:
import pandas as pd
import numpy as np

ResultSheet = {'Maths': pd.Series([90, 91, 97, 89, 65, 93],
               index=['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']),
               'Science': pd.Series([92, 81, np.nan, 87, 50, 88],
               index=['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']),
               'English': pd.Series([89, 91, 88, 78, 77, 82],
               index=['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']),
               'Hindi': pd.Series([81, 71, 67, 82, np.nan, 89],
               index=['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet']),
               'AI': pd.Series([94, 95, 99, np.nan, 96, 99],
               index=['Heena', 'Shefali', 'Meera', 'Joseph', 'Suhana', 'Bismeet'])}
marks = pd.DataFrame(ResultSheet)
print(marks)
Applying marks.isnull() shows three “True” values, so three pieces of data are missing.
>>>print(marks['Science'].isnull().any())
True
print(marks['Maths'].isnull().any())
False
#To find the total number of NaN in the whole dataset
>>>marks.isnull().sum().sum()
3
# Fill the missing values with 0
FillZero = marks.fillna(0)
print(FillZero)
import pandas as pd
df = pd.read_csv('USA_Housing.csv')
df.head()
Upon examining the count, it's evident that all columns contain 5000 values. This suggests that
there are no missing values in any of the columns.
EXPLORATORY DATA ANALYSIS
From the above output, we understand that 4,000 rows (80% of the 5,000 rows) will be used for training the model.
Applying the Linear Regression Algorithm
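A hedged, self-contained sketch of the typical steps (the feature and target column names below are assumptions about USA_Housing.csv, not necessarily the handbook's exact code):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('USA_Housing.csv')

# Assumed columns: 'Price' is the target, 'Address' is a text column we drop
X = df.drop(columns=['Price', 'Address'])
y = df['Price']

# 80% of the 5000 rows (4000) for training, 20% (1000) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
comparison = pd.DataFrame({'Actual': y_test.values, 'Predicted': predictions})
print(comparison.head())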
We observe that there is a difference between the actual and the predicted values.
Further, we need to calculate the error, evaluate the model and test the accuracy of the model.
This will be covered in the next chapter.
EXERCISES
A. Objective type questions
1. Which of the following is a primary data structure in Pandas?
a) List
b) Tuple
c) Series
d) Matrix
3. In Linear Regression, which library is typically used for importing and managing data?
a) NumPy
b) Pandas
c) Matplotlib
d) Scikit-learn
4. What is the correct syntax to read a CSV file into a Pandas DataFrame?
a) pd.DataFrame("filename.csv")
b) pd.read_csv("filename.csv")
c) pandas.read_file("filename.csv")
d) pd.file_read("filename.csv")
B. Short Answer Questions
1.What is a DataFrame in Pandas?
Ans: A DataFrame is a 2D data structure in Pandas, similar to a table in a database or Excel
sheet. It consists of rows and columns, where each column can hold different types of data.
Example:
import pandas as pd
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
2.Explain the concept of handling missing values in a DataFrame with examples.
Ans:
• Dropping rows/columns:
df = df.dropna()
• Filling with mean/median:
df['column'] = df['column'].fillna(df['column'].mean())
5.How can we add new rows and columns to an existing DataFrame? Explain with code
examples.
Ans:
Add a column:
df['new_column'] = [value1, value2, value3]
Add a row:
df.loc[len(df)] = [value1, value2, value3]
D. Case study
1. A dataset of student marks contains missing values for some subjects. Write Python code
to handle these missing values by replacing them with the mean of the respective
columns.
Ans:
import pandas as pd
df = pd.DataFrame({'Maths': [90, None, 88], 'Science': [None, 92, 85]})
df.fillna(df.mean(), inplace=True)
print(df)
2. Write Python code to load the file into a Pandas DataFrame, calculate the total sales for
each product, and save the results into a new CSV file. Click in the link below to access
sales.csv dataset.
https://drive.google.com/drive/folders/1tLbVXWkKzcp6O_-FAvn9usEWuoF3jTp3?usp=sharing
Ans:
import pandas as pd
df = pd.read_csv('sales.csv')
total_sales = df.groupby('Product')['Sales'].sum()
total_sales.to_csv('total_sales.csv')
4. A company has collected data on employee performance. Some values are missing, and
certain columns are irrelevant. Explain how to clean and preprocess this data for analysis
using Pandas.
Ans:
• Drop irrelevant columns using drop().
• Handle missing values using fillna() or dropna().
• Normalize or scale data if needed.
Example:
df = df.drop(['Irrelevant Column'], axis=1)
df.fillna(0, inplace=True)
References:
https://www.programiz.com/python-programming
https://www.javatpoint.com/python-pandas
https://www.w3schools.com/
UNIT 2: Data Science Methodology: An Analytic Approach to
Capstone Project
Title: Capstone Project Using Data Science Approach: Hands-on, Team Discussion, Web
Methodology search, Case studies
Summary:
The Data Science Methodology put forward by John B. Rollins, a Data Scientist in IBM
Analytics, is discussed here. The major steps involved in practicing Data Science, from forming a concrete business or research problem, to collecting and analyzing data, to building a model, and to understanding the feedback after model deployment, are detailed here.
Students can develop their Capstone Project based on this methodology.
Learning Objectives:
1. Understand the major steps involved in tackling a Data Science problem.
2. Define Data Science methodology and articulate its importance.
3. Demonstrate the steps of Data Science Methodology.
Key Concepts:
1. Introduction to Data Science Methodology
2. Steps for Data Science Methodology
3. Model Validation Techniques
4. Model Performance- Evaluation Metrics
Learning Outcomes:
Students will be able to -
1. Integrate Data Science Methodology steps into the Capstone Project.
2. Identify the best way to represent a solution to a problem.
3. Understand the importance of validating machine learning models
4. Use key evaluation metrics for various machine learning tasks
Prerequisites: Foundational understanding of AI concepts from class XI and familiarity with
the concept of Capstone Projects and their objectives.
Become a Data Detective: A Teacher's Guide to Data Science Methodology
This lesson equips you to transform your classroom into a data detective agency! Students will
learn the Data Science Methodology, a powerful framework to solve problems using data.
• Warm-up Activity: Spark curiosity with an interactive game! Begin by saying, "Let's put on
our data detective hats!" Present a dataset (e.g., historical sales figures, customer reviews
with sentiment analysis scores). Challenge students to ask questions, uncover trends, and
identify potential insights hidden within the data. This ignites their interest and highlights
the detective-like nature of Data Science.
• Key Terminology: Introduce the key terms within the Data Science Methodology
framework:
o Business Understanding: Defining the business problem and its goals.
o Problem Approach: Formulating a plan to solve the problem using data.
o Data Requirements: Identifying the type and amount of data needed.
o Data Collection: Gathering the required data from various sources.
o Data Understanding: Exploring and analyzing the data to understand its structure
and quality.
o Data Preparation: Cleaning and transforming the data for analysis.
o AI Modelling: Building models to make predictions or classifications based on the
data (focusing on Linear Regression in this lesson).
o Evaluation: Assessing the performance of the model (using K-Fold Cross-
Validation).
o Deployment: Integrating the model into a real-world application (briefly discuss for
future lessons).
o Feedback: Monitoring the model's performance and making adjustments as needed
(briefly discuss for future lessons).
• Case Studies: Present real-world case studies where Data Science Methodology was used
to solve a business problem. Guide students through the different stages of the
methodology, asking them to consider:
o What was the business problem?
o How did they approach the problem with data?
o What type of data was required?
5. Choosing Your Detective Tools: Introduction to Python:
• Python Power: Briefly introduce Python as a popular programming language widely used in
Data Science. Highlight its benefits like readability and vast libraries for data analysis (like
NumPy, Pandas, Scikit-learn).
• Train-Test Split: Explain the concept of train-test split, a technique where data is divided
into two sets: training data used to build the model and test data used to evaluate its
performance. Demonstrate this with Python code examples using libraries like Scikit-learn.
• K-Fold Cross-Validation: Introduce K-Fold Cross-Validation, a robust evaluation technique
that splits data into multiple folds, trains the model on different folds, and provides a more
reliable estimate of the model's generalizability. Demonstrate this concept with Python
code examples using libraries like Scikit-learn.
Additional Tips:
• Encourage students to work in pairs or small groups throughout the lesson to foster
collaboration and problem-solving skills.
• Provide online resources and tutorials for students who wish to delve deeper into Python
programming.
• Offer opportunities for students to apply the Data Science Methodology to a small-scale
project of their own, focusing on a specific problem and utilizing the skills learned in train-
test split and K-Fold Cross-Validation.
By incorporating these elements, you can equip students with the skills to approach problems
analytically and unlock the power of data to solve real-world challenges.
Teachers can ask the following questions to spark curiosity before starting the topics:
• Imagine you are at a food festival where dishes are unlabeled, and you need to identify
which cuisine each dish belongs to. What strategies would you use to figure it out? (This
question introduces the concept of problem scoping and analytic approaches in an accessible,
real-world scenario. It sets the stage for discussing data science methodology by focusing on
identifying problems and formulating approaches.)
• How do you think businesses like Amazon or Netflix decide what products or shows to
recommend to you? What kind of questions and data might they consider? (This question
connects students' daily experiences with recommendation systems to concepts like
predictive and prescriptive analytics. It encourages them to think about how data can be used
to solve practical problems.)
A methodology gives the Data Scientist a framework for designing an AI project. The framework helps the team decide on the methods, processes and strategies that will be employed to obtain the correct output required from the AI project. It is the best way to organize the entire project and finish it systematically without overruns in time or cost.
Data Science Methodology is a process with a prescribed sequence of iterative steps that
data scientists follow to approach a problem and find a solution.
Data Science Methodology builds the capacity to handle and comprehend data.
In this unit, we discuss the steps of the Data Science Methodology put forward by John Rollins, a Data Scientist at IBM Analytics. It consists of 10 steps. The foundation methodology of Data Science provides deep insight into how every AI project can be solved from beginning to end. There are five modules, each going through two stages of the methodology, explaining the rationale as to why each stage is required.
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
Source: https://cognitiveclass.ai/courses/data-science-methodology-2
1. From Problem to Approach
In this stage, we first understand the customer's problem by asking questions and trying to comprehend what exactly is required. With this understanding, we can figure out the objectives that support the customer's goal. This is also known as Problem Scoping and definition. The team can use the 5W1H Problem Canvas to deeply understand the issue. This stage also involves using the DT (Design Thinking) Framework.
To solve a problem, it's crucial to understand the customer's needs. This can be achieved
by asking relevant questions and engaging in discussions with all stakeholders. Through this
process, we will be able to identify the specific requirements and create a comprehensive list of
business needs.
Activity 1:
Mr. Pavan Sankar visited a food festival and wanted to sample various cuisines, but due to health concerns he had to avoid certain dishes. However, since the dishes were not categorized by cuisine, he found it challenging and wished for assistance in identifying the cuisines offered.
Q7. What is the name of the dish prepared from the ingredients given in Fig 2.1?
Fig 2.1
Vegetable Pulav
When the business problem has been established clearly, the data scientist will be able to
define the analytical approach to solve the problem. This stage involves seeking clarification from
the person who is asking the question, so as to be able to pick the most appropriate path or
approach. Let us understand this in detail.
This stage involves asking more questions to the stakeholders so that the AI Project team can
decide on the correct approach to solve the problem.
Fig 2.2
Let’s understand each of them.
Descriptive Analytics: This summarizes past data to understand what has happened. It is the
first step undertaken in data analytics to describe the trends and patterns using tools like
graphs, charts etc. and statistical measures like mean, median, mode to understand the
central tendency. This method also examines the spread of data using range, variance and
standard deviation.
For example: To calculate the average marks of students in an exam or analyzing sales data
from the previous year.
Diagnostic Analytics: It helps to understand the reason behind why some things have
happened. This is normally done by analyzing past data using techniques like root cause
analysis, hypothesis testing, correlation analysis etc. The main purpose is to identify the
causes or factors that led to a certain outcome.
For example: If the sales of a company dropped, diagnostic analysis will help to find the cause
for it, by analyzing questions like “Is it due to poor customer service” or “low product quality”
etc.
Predictive Analytics: This uses the past data to make predictions about future events or
trends, using techniques like regression, classification, clustering etc. Its main purpose is to
foresee future outcomes and make informed decisions.
For example: A company can use predictive analytics to forecast its sales, demand, inventory,
customer purchase pattern etc., based on previous sales data.
Prescriptive Analytics: This recommends the action to be taken to achieve the desired
outcome, using techniques such as optimization, simulation, decision analysis etc. Its purpose
is to guide decisions by suggesting the best course of action based on data analysis.
For example: To design the right strategy to increase the sales during festival season by
analyzing past data and thus optimize pricing, marketing, production etc.
[To know more about these analytics, you can go through this Coursera activity]
https://www.coursera.org/learn/data-science-methodology/
Activity 2:
Mr. Pavan Sankar has set his goal to find the dish and its
cuisine using its ingredients. He plans to proceed as
shown in the flowchart in Fig 2.3.
Observe the flowchart and answer the questions.
The requirements of data are determined by the analytic approach chosen in the previous stage. The 5W1H questioning method can be employed in this stage as well to determine the data requirements. At this stage, it is necessary to identify the data content, formats, and sources for initial data collection.
Determining the specific information needed for our analysis or project includes:
● identifying the types of data required, such as numbers, words, or images.
● considering the structure in which the data should be organized, whether it is in a
table, text file, or database.
● identifying the sources from which we can collect the data, and
● noting any necessary cleaning or organization steps required before beginning the analysis.
This stage involves defining our data requirements, including the type, format, source, and
necessary preprocessing steps to ensure the data is usable and accurate for our needs. Data for
a project can be categorized into three types: structured data (organized in tables, e.g., customer
databases), unstructured data (without a predefined structure, e.g., social media posts, images),
and semi-structured data (having some organization, e.g., emails, XML files). Understanding
these data types is essential for effective data collection and management in project development.
Activity 3:
Mr. Pavan Sankar is now ready with a classification approach. Now he needs to identify the data
requirements.
Q1. Write down the name of two cuisines, five dishes from each cuisine and the ingredients
needed for the five dishes separately.
Cuisine: Indian
Dish1 – Aloo gobi- Ingredients – Potato, Cauliflower, Masalas, Oil, Salt
Dish2 – Naan- Ingredients – Flour, Yeast, Salt, Milk
Dish3 – Butter Chicken- Ingredients –Chicken, Butter, Masala, Oil, Salt
Dish4 –Gulab Jamun- Ingredients – Dough, Oil, Sugar
Dish5 –Poha- Ingredients – Rice, Potato, Turmeric
Cuisine – Chinese
Dish1—Manchow soup- Ingredients –Vegetables, Ginger, Garlic, Soy sauce, Chilly
Dish2—Mapo Tofu- Ingredients –Tofu, Pork,Vegetable, Meat, Soy sauce, Chilly, Garlic etc.
Dish3-Chow Mein- Ingredients –Noodles, Sesame oil, Chicken, Garlic, Soy Sauce
Dish4-Chicken Fried Rice- Ingredients –Rice, Vegetables, Chicken, Soy sauce, Oil, Salt
Dish5—Char Siu Ingredients –Soy Sauce, Garlic, Honey, Spices, Sesame oil
Q2. To collect the data on ingredients, in what format should the data be collected?
Data can be collected in a table format. A text file can also be created. For available dishes, images can also be collected.
DBAs and programmers often work together to extract data from both primary and secondary
sources. Once the data is collected, the data scientist will have a good understanding of what they
will be working with. The Data Collection stage may be revisited after the Data Understanding
stage, where gaps in the data are identified, and strategies are developed to either collect
additional data or make substitutions to ensure data completeness.
Activity 4:
Q1. If you need the names of American cuisine, how will you collect the data?
To get the dish names of American cuisine, we can use web scraping. Personal interviews with Americans are also possible if they are available nearby.
Q2. You want to try out some healthy recipes in the Indian culture. Mention the different ways
you could collect the data.
Collect the data directly from the places where the culture is maintained, interview grandparents, and refer to textbooks that have the cultural context.
Q3. How can you collect a large amount of data and where can it be stored?
A large amount of data can be collected through online sources; many websites provide large datasets free of cost. The data can be stored in the form of a CSV file or a relational database in the cloud.
Data Understanding encompasses all activities related to constructing the dataset. In this stage,
we check whether the data collected represents the problem to be solved or not. The relevance,
comprehensiveness, and suitability of the data for addressing the specific problem or question
at hand are evaluated. Techniques such as descriptive statistics (univariate analysis, pairwise
correlation etc.) and visualization (Histogram) can be applied to the dataset, to assess the
content, quality, and initial insights about the data.
Activity 5:
Q1. Semolina which is called rava or suji in Indian households is a by-product of durum wheat.
Name a few dishes made from semolina. How will you differentiate the data of different dishes?
Upma, Rava Kichadi, Kesari, Suji Pancakes, etc.
The main ingredient of all these dishes is suji. Based on whether salt or sugar is added, the dish becomes savoury or sweet. With different ingredients added to the base ingredient suji, different types of dishes are made.
Q2. Given below is a sample of the data collected during the data collection stage. Let us try to understand it.
[Table 2.2: a dish-ingredient matrix. Columns: Dish, Country, and ingredient indicators (Vegetable, Soy sauce, Chicken, Wasabi, Potato, Onion, Sugar, Chilli, Rice, Milk, Fish, Salt, Oil). Rows: Chicken Biriyani (Indian), Kheer (Indian), Pulao (Indiana), Sushi (Japanese) and Fried Rice (Chinese), with 0/1 values indicating whether each ingredient is used. The table intentionally contains invalid entries (Y, N), incorrect values (One, 2) and missing values for students to identify in the questions below.]
a. Basic ingredients of sushi are rice, soy sauce, wasabi and vegetables. Is the dish listed in the data? Are all ingredients available?
Yes, the dish is listed. The Vegetable and Soy sauce values are not available in the data.
b. Find out the ingredients for the dish “Pulao”. Check for invalid data or missing data.
Common ingredients for a fried-rice dish like Pulao are Rice, Vegetables, Oil, Garlic, Soy sauce, Salt, Chilli and Onion.
Here, in the data, Garlic is not found.
c. Inspect all columns for invalid, incorrect or missing data and list them below.
Invalid: Salt (Y), Oil (N), Sugar (N)
Incorrect: Rice (One), Chicken (2)
Missing data: Potato, Soy sauce
d. Which ingredient is common to all dishes? Which ingredient is not used in any dish?
Common - Rice; Not used - Potato
This stage covers all the activities to build the set of data that will be used in the modelling step.
Data is transformed into a state where it is easier to work with.
Data preparation includes:
● cleaning the data (dealing with invalid or missing values, removing duplicate values and assigning a suitable format)
● combining data from multiple sources (archives, tables and platforms)
● transforming data into meaningful input variables
Feature Engineering is a part of Data Preparation. The preparation of data is the most time-consuming step among the Data Science stages.
Feature engineering is the process of selecting, modifying, or creating new features (variables)
from raw data to improve the performance of machine learning models.
For example:
Suppose you are building an ML model to predict the price of houses, and you have the following data:
● Raw Data: Area of the house (in sq.ft), number of bedrooms, and the year the house
was built.
To improve the model's performance, you might create new features such as age of the house
and price per square foot which can be derived from the raw data.
● New Features
1. Age of the house = Current year - Year built.
2. Price per square foot = Price of the house / Area.
These new features can help the model make more accurate predictions.
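A brief sketch of how these derived features might be computed with pandas (the column names and values below are illustrative assumptions):
import pandas as pd

houses = pd.DataFrame({'Area': [1200, 1500, 900],        # in sq. ft
                       'Bedrooms': [3, 4, 2],
                       'YearBuilt': [2005, 1998, 2015],
                       'Price': [300000, 420000, 250000]})

current_year = pd.Timestamp.now().year
houses['Age'] = current_year - houses['YearBuilt']         # age of the house
houses['PricePerSqFt'] = houses['Price'] / houses['Area']  # price per square foot
print(houses)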
Activity 6:
Q1. Are there any textual mistakes in the data given in Table 2.2? Mention if any.
Yes. For the dish Pulao, the value of Country is given as “Indiana” instead of “Indian”.
Q2. In Table 2.2, incorrect data was identified in the columns Rice and Chicken. Write the possible ways to rectify them.
In the column Rice, “One” is written instead of 1, and in the column Chicken, 2 is written. These can be rectified by replacing them with the correct binary values (1 or 0).
Q3. Is the first column name appropriate? Can you suggest a better name?
No. As this is a table of dishes and cuisines, the column name Country can be replaced with “Cuisine”.
Q4. The first three values of the first column seem to be similar. Do we need to make any corrections to this data?
The first three values are similar because the dishes belong to the same cuisine. “Indiana” may be changed to “Indian”.
Q5. Do the dishes with common ingredients come under the same cuisine? Why?
Yes, some ingredients may be common to dishes under the same cuisine. This is determined by the culture and food habits of the people of that cuisine.
Q6. Instead of mentioning whether the ingredient is present or not by using 0’s and 1’s, can you suggest any alternative ways to display the information?
• A table of dish names and their ingredient lists can be used.
• An image of each dish with its ingredients can be collected.
2.1.7 AI modelling
The modelling stage uses the initial version of the prepared dataset and focuses on developing models according to the analytical approach previously defined. The modelling process is usually iterative, leading to adjustments in the preparation of data. For a chosen technique, data scientists can test multiple algorithms to identify the most suitable model for the Capstone Project.
Data Modelling focuses on developing models that are either descriptive or predictive.
1. Descriptive Modeling: It is a concept in data science and statistics that focuses on
summarizing and understanding the characteristics of a dataset without making
predictions or decisions. The goal of descriptive modeling is to describe the data rather
than predict or make decisions based on it. This includes summarizing the main
characteristics, patterns, and trends that are present in the data. Descriptive modeling is
useful when you want to understand what is happening within your data and how it
behaves, but not necessarily why it happens.
2. Predictive modeling: It involves using data and statistical algorithms to identify patterns
and trends in order to predict future outcomes or values. It relies on historical data and
uses it to create a model that can predict future behavior or trends or forecast what might
happen next. It involves techniques like regression, classification, and time-series
forecasting, and can be applied in a variety of fields, from predicting exam scores to
forecasting weather or stock prices. While it is a powerful tool, students must also
understand its limitations and the importance of good data.
The data scientist will use a training set for predictive modeling. A training set is a set of
historical data in which the outcomes are already known. The training set acts like a gauge
to determine if the model needs to be calibrated. In this stage, the data scientist will play
around with different algorithms to ensure that the variables selected are actually required.
Activity 7:
Q1. Name two programming languages which can be used to implement the Decision Tree
Algorithm.
Python, R
Q2. In the problem of identifying dish name and cuisine, if we chose the algorithm Decision Tree
to solve the problem and choose Python as a tool, name some libraries which will help in the
implementation.
numpy, pandas, re, sklearn, matplotlib, itertools, random
2.1.8 Evaluation
https://medium.com/ml-research-lab/part-4-data-science-methodology-from-modelling-to-evaluation-3fb3c0cdf805
Evaluation in an AI project cycle is the process of assessing how well a model performs after
training. It involves using test data to measure metrics like accuracy, precision, recall, or F1 score.
This helps determine if the model is reliable and effective before deploying it in real-world
situations.
Activity 8:
Q1. In the cuisine identification problem, on which set will the Decision tree be built: Training or
Test?
Training
Q2. Name any diagnostic metric which can be used to determine an optimal classification model.
Confusion Matrix, Log loss etc.
2.1.9 Deployment
Deployment refers to the stage where the trained AI model is made available to the users
in real-world applications. Data scientists must make the stakeholders familiar with the tool
produced in different scenarios. Once the model is evaluated and the data scientist is confident it
will work, it is deployed and put to the ultimate test. Depending on the purpose of the model, it
may be rolled out to a limited group of users or in a test environment, to build up confidence in
applying the outcome for use across the board.
Deploying a model into a live business process frequently necessitates the involvement of additional internal teams, skills and technology.
Activity 9:
Q1. Mention some ways to embed the solution into mobiles or websites.
The trained model may be packaged into APK files and integrated into mobile apps (using Thunkable) or into websites created with Weebly.
2. 1.10 Feedback
The last stage in the methodology is feedback. This includes results collected from the
deployment of the model, feedback on the model’s performance from the users and clients, and
observations from how the model works in the deployed environment. This process continues till
the model provides satisfactory and acceptable results.
Feedback from the users will help to refine the model and assess it for performance and
impact. The process from modelling to feedback is highly iterative. Data scientists may automate any or all of the feedback steps so that the model refresh process speeds up and delivers improved results quickly. Feedback from users can be received in many ways.
Throughout the Data Science Methodology, each step sets the stage for the next, making the
methodology cyclical and ensuring refinement at each stage.
Teachers can ask the following questions to spark curiosity before starting the topics:
• Imagine you've trained a self-driving car's AI to recognize stop signs. How would you
ensure that it performs well in both clear and foggy conditions? What steps would you
take to test its reliability? (This question introduces the concept of model validation by
relating it to a real-world scenario where accuracy and reliability are critical.)
• If you were developing a system to predict whether an email is spam, how would you
measure whether your predictions are accurate? What would you do if your system
keeps missing certain spam emails? (Purpose: This question highlights the importance of
evaluation metrics such as precision, recall, and F1-score, encouraging students to think
about practical ways to assess model performance.)
https://www.geeksforgeeks.org/what-is-model-validation-and-why-is-it-important/
Model validation is the step conducted post Model Training, wherein the effectiveness of the
trained model is assessed using a testing dataset. Validating the machine learning model during
the training and development stages is crucial for ensuring accurate predictions. The benefits of
Model Validation include:
• Enhanced model quality
• Reduced risk of errors
• Prevention of overfitting and underfitting
2.2.1 Train Test Split
The train-test split is a technique for evaluating the performance of a machine learning
algorithm. It can be used for classification or regression problems and can be used for any
supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets, as shown in Fig
2.4. The first subset is used to fit/train the model and is referred to as the training dataset. The
second subset is used to test the model. It is with the testing data that predictions are made and
compared to the expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning
model.
Test Dataset: Used to evaluate the fit machine
learning model.
Fig 2.4
The objective is to estimate the performance of the machine learning model on new data, i.e.,
data not used to train the model. This is how we expect to use the model in practice. The
procedure is to fit it on available data with known inputs and outputs, then make predictions on
new examples in the future where we do not have the expected output or target values. The
train-test procedure is appropriate when there is a sufficiently large dataset available.
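A minimal scikit-learn sketch of this split (the synthetic data below is purely illustrative):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples with 2 features (synthetic)
y = np.arange(10)                  # 10 target values (synthetic)

# 70% of the rows are used for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)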
k-fold cross validation
In k-fold cross validation we will be working with k subsets of datasets. For example, if we
divide the data into 5 folds or 5 pieces, as shown in Fig 2.5, each being 20% of the full dataset,
then k=5.
We run an experiment called experiment 1 which uses the first fold as a holdout set
(validation), and remaining four folds as training data. This gives us a measure of model quality
based on a 20% holdout set. We then run a second experiment, where we hold out data from the
second fold. This gives us a second estimate of model quality. We repeat this process, using every
fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some
point.
Fig 2.5
Cross-validation gives a more accurate measure of model quality, which is especially important if
you are making a lot of modelling decisions. However, it can take more time to run, because it
estimates models once for each fold.
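A hedged sketch of 5-fold cross-validation with scikit-learn (the model choice and synthetic data are assumptions for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.arange(40, dtype=float).reshape(20, 2)   # 20 samples, 2 features (synthetic)
y = 3 * X[:, 0] + X[:, 1]                       # synthetic target

model = LinearRegression()
# cv=5: the data is split into 5 folds; each fold is used once as the holdout set
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores)         # one score per fold
print(scores.mean())  # average estimate of model quality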
Train-Test Split:
• Divides the data into a training dataset and a testing dataset.
• There is a clear demarcation between the training data and the testing data.
K-Fold Cross-Validation:
• Divides the dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the remaining data.
• Every data point, at some stage, is part of either the training set or the holdout (testing) set.
Evaluation metrics help assess the performance of a trained model on a test dataset, providing
insights into its strengths and weaknesses. These metrics enable comparison of different models,
including variations of the same model, to select the best-performing one for a specific task.
In classification problems, we categorize the target variable into a finite number of classes, while
in regression problems, the target variable has continuous values. Hence, we have different
evaluation metrics for each type of supervised learning, as depicted in Fig 2.6.
Fig 2.6
https://medium.com/@ladkarsamisha123/most-popular-machine-learning-performance-metrics-part-1-ab7189dce555
Fig 2.7
2. Precision and Recall
Precision measures “What proportion of predicted Positives is truly Positive?”
Precision = TP / (TP + FP)
Recall measures “What proportion of actual Positives is correctly classified?”
Recall = TP / (TP + FN)
Both precision and recall should be as high as possible.
3. F1-score
A good F1 score means that you have low false positives and low false negatives, so you’re
correctly identifying real threats, and you are not disturbed by false alarms. An F1 score is
considered perfect when it is 1, while the model is a total failure when it is 0.
F1 = 2* (precision * recall)/(precision + recall)
4. Accuracy
Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP+TN)/(TP+FP+FN+TN)
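A short sketch computing these classification metrics with scikit-learn (the label vectors below are illustrative):
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (illustrative)

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))    # correct predictions / total predictions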
2. MSE
Mean Squared Error (MSE) is the most commonly used metric to evaluate the performance of a regression model. MSE is the mean (average) of the squared differences between the actual values of the target variable and the predicted values.
3. RMSE
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
RMSE is often preferred over MSE because it is easier to interpret since it is in the same units as
the target variable.
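A minimal sketch computing MSE and RMSE with NumPy and scikit-learn (the values are illustrative):
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative actual values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])      # illustrative predicted values

mse = mean_squared_error(y_actual, y_pred)   # mean of the squared errors
rmse = np.sqrt(mse)                          # in the same units as the target
print(mse, rmse)                             # 0.375 0.612...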
2.4. PRACTICAL ACTIVITIES
2.4.1. Calculate MSE and RMSE values for the data given below using MS Excel.
Predicted value 14 19 17 13 12 7 24 23 17 18
Actual Value 17 18 18 15 18 11 20 18 13 19
2.4.2. Given a confusion matrix, calculate Precision, Recall, F1 score and Accuracy
True Positives (TP) = 35, True Negatives (TN) = 50, False Positives (FP) = 10, False Negatives (FN) = 5
Summary of Metrics:
Precision: 77.8%
Recall: 87.5%
F1 Score: 82.3%
Accuracy: 85%
2.4.3. Python Code to Evaluate a Model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df=pd.read_csv('/kaggle/input/random-salary-data-of-employes-age-wise/Salary_Data.csv', sep=',')
df.head()
df.shape
(30, 2)
df.isnull().sum()
YearsExperience 0
Salary 0
dtype: int64
#Data Preparation
X=np.array(df['YearsExperience']).reshape(-1,1)
Y=np.array(df['Salary']).reshape(-1,1)
print(X.shape,Y.shape)
# Configuring the train-test split
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, shuffle=True, random_state=10)
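A hedged sketch of the typical next steps, continuing with the variables and imports defined above (an illustrative continuation, not necessarily the handbook's exact code):
# Fit the model on the training data
model = LinearRegression()
model.fit(X_train, Y_train)

# Predict on the held-out test data
y_pred = model.predict(x_test)

# Evaluate the fit using MSE and RMSE
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(mse, rmse)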
EXERCISES
9. Identifying the necessary data content, formats and sources for initial data collection is done
in which step of Data Science methodology?
a. Data requirements b. Data Collection c. Data Understanding d. Data Preparation
10. Data sets are available online. From the given options, which one does not provide online
data?
a. UNICEF b.WHO c. Google d. Edge
11. A ____________ set is a set of historical data in which outcomes are already known.
a. Training set b. Test set c. Validation set d. Evaluation set
12. _____________ data set is used to evaluate the fit machine learning model.
a. Training set b. Test set c. Validation set d. Evaluation set
13. x_train,x_test,y_train,y_test = train_test_split (x, y, test_size=0.2)
From the above line of code, identify the training data set size
a. 0.2 b. 0.8 c. 20 d. 80
14. In k-fold cross validation, what does k represent?
a. number of subsets b. number of experiments c. number of folds d. all of
the above
15. Identify the correct points regarding MSE given below:
i. MSE is expanded as Median Squared Error
ii. MSE is standard deviation of the residuals
iii. MSE is preferred with regression
iv. MSE penalize large errors more than small errors
a. i and ii b. ii and iii c. iii and iv d. ii, iii and iv
B. Short Answer Questions
1. How many steps are there in Data Science Methodology? Name them in order.
Ans- There are 10 steps in Data Science Methodology. They are Business Understanding,
Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation,
Modelling, Evaluation, Deployment and Feedback
3. Data is collected from different sources. Explain the different types of sources with example.
Ans - Data can be collected from two sources—Primary data source and Secondary data source
Primary Sources are sources which are created to collect the data for analysis. Examples include
Interviews, Surveys, Marketing Campaigns, Feedback Forms, IOT sensor data etc.,
Secondary data is the data which is already stored and ready for use. Data given in Books,
journals, Websites, Internal transactional databases, etc. are some examples
4. Which step of Data Science Methodology is related to constructing the data set? Explain.
Ans- Data Understanding stage is related to constructing the data set. Here we check whether
the data collected represents the problem to be solved or not. Here we evaluate whether the
data is relevant, comprehensive, and suitable for addressing the specific problem or question at
hand. Techniques such as descriptive statistics and visualization can be applied to the dataset,
to assess the content, quality, and initial insights about the data.
3. F1-score
A good F1 score means that you have low false positives and low false negatives, so you’re
correctly identifying real threats, and you are not disturbed by false alarms. An F1 score is
considered perfect when it is 1, while the model is a total failure when it is 0.
F1 = 2* (precision * recall)/(precision + recall)
4. Accuracy
Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP+TN)/(TP+FP+FN+TN)
We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
2. Data is the main part of any project. How will you find the requirements of data, collect it,
understand the data and prepare it for modelling?
Ans - For any model, the data is made ready through four stages.
1. Data Requirements - In the data requirements stage we identify the necessary data content, formats, and sources for initial data collection. 5W1H questions may be employed. Here we identify the types of data required and decide how to store the data, considering the structure in which it should be organized, whether in a table, a text file, or a database. We also identify the sources from which we can collect the data and note any necessary cleaning or organization steps.
2. Data Collection - In this stage, the data is gathered from the identified primary and secondary sources. Gaps in the collected data are identified, and strategies are developed to collect additional data or make substitutions to ensure the data is complete.
3. Data Understanding- Data Understanding stage is related to constructing the data set. Here we
check whether the data collected represents the problem to be solved or not. Here we evaluate
whether the data is relevant, comprehensive, and suitable for addressing the specific problem or
question at hand. Techniques such as descriptive statistics and visualization can be applied to the
dataset, to assess the content, quality, and initial insights about the data.
4. Data Preparation - The most time-consuming stage is Data Preparation. Here data is transformed into a state where it is easier to work with. Data preparation includes cleaning the data, combining data from multiple sources and transforming the data into meaningful input variables. Feature Engineering is also a part of Data Preparation.
D. Case study
1. Calculate MSE and RMSE values for the data given below using MS Excel.
Steps in Excel:
1. Enter the Data
In Column A, input the actual values.
In Column B, input the predicted values.
Fill in the data under the respective columns.
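One standard way to complete the calculation, assuming the ten data pairs sit in rows 2-11 (a suggested approach, not necessarily the handbook's exact steps):
2. Compute the squared error for each row, e.g. in cell C2 enter =(A2-B2)^2 and fill the formula down to C11.
3. MSE: in an empty cell enter =AVERAGE(C2:C11).
4. RMSE: in another cell enter =SQRT(AVERAGE(C2:C11)).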
2. Given a confusion matrix, calculate Precision, Recall, F1 score and Accuracy.
Summary of Metrics:
Precision: 78.9%
Recall: 83.3%
F1 Score: 81.0%
Accuracy: 82.5%
E. Competency Based Questions
1.A transportation company aims to optimize its delivery routes and schedules to minimize costs
and improve delivery efficiency. The company wants to use Data Science to identify the most
optimal routes and delivery time windows based on historical delivery data and external factors
such as traffic and weather conditions. Various questions are targeted by data scientist to achieve
this business goal. Identify the analytical approach model that can be used for each.
a) determine the most suitable delivery routes for perishable goods, ensuring timely
deliveries without explicitly using past data to make predictions.
b) gather insights on the average delivery times for different vehicle types, how they vary
based on the complexity of delivery route.
c) group delivery routes into different categories based on the average delivery time and order
volume.
Ans. a. Predictive model
b. Descriptive model
c. Classification model
2. A leading investment firm aims to improve their client portfolio management system. They want
to know whether Artificial Intelligence could be used to better understand clients' investment
preferences and risk tolerance levels. Which stage of Data Science methodology can you relate
this to?
Ans. Business Understanding
4. A data scientist working to improve public transportation services by analyzing commuter travel
patterns. He has encountered a scenario where he needs to understand the impact of major
events on commuter behavior. For instance, the city is hosting a large-scale sporting event, and
the data scientist needs to assess how this event affects commuting patterns, such as changes in
peak travel times, shifts in preferred modes of transportation, and alterations in popular routes.
Which stage of Data Science methodology is he in? List the steps he needs to follow.
Ans: The data scientist in this scenario is in the stage of Data Collection within the Data Science
methodology. To address the scenario effectively, the data scientist should:
• Identify Relevant Data Sources
• Gather Data
• Clean and Prepare Data
• Analyze Data
• Interpret Results
5.A data scientist is tasked with developing a machine learning model to predict customer churn
for a small e-commerce startup. A limited dataset is only available for this task. The dataset
contains information about customer demographics, purchase history, website interactions, and
whether they churned or not. Considering the challenge posed by the limited dataset size, which
approach would you recommend the data scientist to use for training the churn prediction model
- a simple train-test split or cross-validation? Justify your recommendation regarding the
dataset's size and generalizability.
Ans: Considering the limited dataset size and the need for robustness and generalizability in the
model, I would recommend using cross-validation approach. Cross-validation involves splitting
the dataset into multiple subsets (folds), training the model on different combinations of these
subsets, and evaluating its performance on the remaining data. This mitigates the risk of
overfitting or underfitting the model, which is common with a small dataset. Additionally, cross-
validation maximizes the utilization of limited data by using each data point for both training and
validation across multiple folds, eliminating the need for additional data.
6.Identify the type of big data analytics (descriptive, predictive) used in the following:
a. A clothing brand monitors social media mentions to understand customer perception. It uses
data in the form of social media posts, comments, and reviews containing brand mentions to
get a clear picture of overall customer sentiment and areas where they excel or fall short.
b. A factory aims to predict equipment failures before they occur to minimize downtime. It uses
Sensor data from machines (temperature, vibration, power consumption) coupled with
historical maintenance records to identify patterns in sensor data that indicate an impending
equipment failure.
Ans:
a. Descriptive Analytics. This involves summarizing and analyzing historical social media data to
understand customer sentiment and perception towards the clothing brand.
b. Predictive Analytics. Specifically, it involves using historical sensor data from machines to
predict equipment failures before they occur, minimizing downtime.
7.Identify the type of big data analytics (diagnostic, prescriptive) used in the following:
a. A subscription service experiences a rise in customer cancellations. It uses Customer account
information, usage data (frequency of logins, features used), and support ticket logs.to identify
potential reasons for churn.
b. A food delivery service wants to improve delivery efficiency and reduce delivery times. It uses
Customer location data, order details, historical delivery times, and traffic patterns to calculate
the most efficient delivery routes.
Ans:
a. Diagnostic Analytics. This involves analyzing customer account information, usage data, and
support ticket logs to diagnose potential reasons for customer cancellations or churn.
b. Prescriptive Analytics. This involves analyzing customer location data, order details, historical
delivery times, and traffic patterns to prescribe the most efficient delivery routes and reduce
delivery times.
RESOURCES
Courses in Data Science Methodology:
1. https://www.coursera.org/learn/data-science-methodology#modules
2. https://cognitiveclass.ai/courses/data-science-methodology-2
UNIT 3: Making Machines See
Unveiling the Magic of Computer Vision: A Teacher's Guide to Making Machines See
This lesson equips you to introduce students to the captivating world of Computer Vision
(CV), where machines learn to "see" and interpret the visual world around them.
4. Seeing the Challenges: Obstacles and Considerations:
• Utilize engaging visuals and interactive activities throughout the lesson to enhance
student understanding.
• Encourage students to explore real-world examples of CV applications through
online resources and demos.
• Provide opportunities for students to discuss the ethical implications and societal
impact of CV.
By incorporating these elements, you can ignite student interest in Computer Vision,
empower them to explore its potential, and encourage them to contribute responsibly to
its future development.
Teachers can ask the following questions to spark curiosity before starting the topics:
• Imagine you take a picture with a digital camera. How do you think the camera
captures the image you see in front of you? What do you think happens to the
picture after you take it? (This question activates students' prior knowledge about
how cameras capture visual information digitally and connects it to the concept of
digital images being made up of pixels.)
• Can you think of any examples of how we describe visual information? We might
say something is red or shaped like a circle. How do you think computers might
understand these visual details in an image? (This question primes students to
consider the challenges of translating visual information into a format computer can
understand, which is what computer vision aims to achieve.)
With the rapid expansion of social media platforms such as Facebook, Instagram, and
Twitter, smartphones have emerged as pivotal tools, thanks to their integrated cameras
facilitating effortless sharing of photos and videos. While the Internet predominantly
consists of text-based content, indexing and searching images present a distinct challenge.
Indexing and searching images involve organizing image data for quick retrieval based on
specific features like colour, texture, shape, or metadata. During indexing, key attributes are
extracted and stored in a searchable format. Searching uses this index to match query
parameters with stored image features, enabling efficient retrieval. Unlike text, which can
be easily processed, algorithms require additional capabilities to interpret image content.
Traditionally, the information conveyed by images and videos has relied heavily on
manually provided meta descriptions. To overcome this limitation, there is a growing need
for computer systems to visually perceive and comprehend images to extract meaningful
information from them. This involves enabling computers to "see" images and decipher their
content, thereby bridging the gap in understanding and indexing visual data. This task is
simple for humans, as seen in the common practice of teaching children to associate an
image, such as an apple, with the letter 'A'. Humans make this connection effortlessly;
enabling computers to comprehend images, however, is a much harder problem. Just as
children learn by repeatedly viewing images to memorize objects or people, we need
computers to develop similar capabilities to analyse our images and videos effectively.
59
defects or issues. Due to its speed, objectivity, continuity, accuracy, and scalability, it can
quickly surpass human capabilities. The latest deep learning models achieve above human-
level accuracy and performance in real-world image recognition tasks such as facial
recognition, object detection, and image classification.
Computer Vision is a field of artificial intelligence (AI) that uses sensing devices and deep learning
models to help systems understand and interpret the visual world.
60
Fig. 3.3: Representing an image using 0’s and 1’s
In representing images digitally, each pixel is assigned a numerical value. For
monochrome images, such as black and white photographs, a pixel's value typically ranges
from 0 to 255. A value of 0 corresponds to black, while 255 represents white.
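To make this concrete, here is a minimal sketch (assuming NumPy is installed; the 3×3 pixel values are made up for illustration) of how a tiny grayscale image can be stored as an array of numbers between 0 and 255:
import numpy as np

# A hypothetical 3x3 grayscale image: 0 = black, 255 = white,
# values in between are shades of grey
tiny_image = np.array([[0, 128, 255],
                       [64, 192, 32],
                       [255, 0, 160]], dtype=np.uint8)

print(tiny_image.shape)                    # (3, 3): 3 rows and 3 columns of pixels
print(tiny_image.min(), tiny_image.max())  # 0 255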
61
Step 3: Convert to Grayscale
● Transform the image into grayscale so it contains only shades of gray (1 channel).
● Use an online grayscale converter, such as Pine tools-
https://pinetools.com/grayscale-image
● Upload your resized image, convert it to grayscale, and download the resulting
image.
Fig. 3.7
Step 5: Copy the Pixel Values
● Once the pixel values are extracted, select all the values from the tool and copy
them.
62
Fig. 3.8: Copy the pixel values
Fig. 3.9: Image formed from 0s and 1s; the pixel values recreate the original grayscale image
In coloured images, each pixel is assigned a specific number based on the RGB
colour model, which stands for Red, Green, and Blue.
1 byte = 8 bits, so the total number of distinct values a channel can take is 2⁸ = 256.
Each colour channel can have a value from 0 to 255, and by combining different intensities
of red, green, and blue, over 16 million (256 × 256 × 256) possible colours can be
represented in an image.
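As a small illustration (a sketch assuming NumPy is available; the pixel values are made up), a single coloured pixel can be stored as three numbers, one per channel:
import numpy as np

# One RGB pixel is stored as three numbers (Red, Green, Blue), each from 0 to 255
red_pixel = np.array([255, 0, 0], dtype=np.uint8)
white_pixel = np.array([255, 255, 255], dtype=np.uint8)

# 256 possible values per channel and 3 channels give:
print(256 ** 3)   # 16777216, i.e. over 16 million possible colours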
63
3.3. COMPUTER VISION – PROCESS:
The Computer Vision process often involves five stages. They are explained below.
3.3.1. Image Acquisition: Image acquisition is the initial stage in the process of computer
vision, involving the capture of digital images or videos. This step is crucial as it provides the
raw data upon which subsequent analysis is based. Digital images can be acquired through
various means, including capturing them with digital cameras, scanning physical
photographs or documents, or even generating them using design software.
The quality and characteristics of the acquired images greatly influence the
effectiveness of subsequent processing and analysis. It is important to understand that the
capabilities and resolutions of different imaging devices play a significant role in
determining the quality of acquired images. Higher-resolution devices can capture finer
details and produce clearer images compared to those with lower resolutions. Moreover,
various factors such as lighting conditions and angles can influence the effectiveness of
image acquisition techniques. For instance, capturing images in low-light conditions may
result in poorer image quality, while adjusting the angle of capture can provide different
perspectives of the scene.
In scientific and medical fields, specialized imaging techniques like MRI (Magnetic
Resonance Imaging) or CT (Computed Tomography) scans are employed to acquire highly
detailed images of biological tissues or structures. These advanced imaging modalities offer
insights into the internal composition and functioning of biological entities, aiding in
diagnosis, research, and treatment planning.
3.3.2. Preprocessing:
Preprocessing in computer vision aims to enhance the quality of the acquired image.
Some of the common techniques are-
a. Noise Reduction: Removes unwanted elements like blurriness, random spots, or
distortions. This makes the image clearer and reduces distractions for algorithms.
Example: Removing grainy effects in low-light photos.
Fig. 3.10: Before: noisy image; After: noise-reduced image
b. Image Normalization: Standardizes pixel values across images for consistency.
Adjusts the pixel values of an image so they fall within a consistent range (e.g., 0–1
or -1 to 1).
Ensures all images in a dataset have a similar scale, helping the model learn better.
Example: Scaling down pixel values from 0–255 to 0–1.
64
Fig. 3.11: Before: distorted image with no normalization
c. Resizing/Cropping: Changes the size or aspect ratio of the image to make it uniform. Ensures
all images have the same dimensions for analysis.
Example: Resizing all images to 224×224 pixels before feeding them into a neural network.
Fig. 3.13
The main goal for preprocessing is to prepare images for computer vision tasks by:
● Removing noise (disturbances).
● Highlighting important features.
● Ensuring consistency and uniformity across the dataset.
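The following is a minimal OpenCV sketch of these preprocessing steps (an assumption: opencv-python is installed and 'example.jpg', used elsewhere in this chapter, is the path to your image):
import cv2

image = cv2.imread('example.jpg')         # Load the image (replace with your own file)

# Noise reduction: a Gaussian blur smooths out grainy, random variations
denoised = cv2.GaussianBlur(image, (5, 5), 0)

# Resizing: make every image the same size, e.g. 224 x 224 pixels
resized = cv2.resize(denoised, (224, 224))

# Normalization: scale pixel values from 0-255 down to the range 0-1
normalized = resized / 255.0

print(normalized.min(), normalized.max())  # values now lie between 0.0 and 1.0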
65
3.3.3. Feature Extraction:
Feature extraction involves identifying and extracting relevant visual patterns or
attributes from the pre-processed image. Feature extraction algorithms vary depending on
the specific application and the types of features relevant to the task. The choice of feature
extraction method depends on factors such as the complexity of the image, the
computational resources available, and the specific requirements of the application.
● Edge detection identifies the boundaries between different regions in an image
where there is a significant change in intensity (a short code sketch follows this list).
● Corner detection identifies points where two or more edges meet. These points are
areas of high curvature in an image; the technique focuses on sharp changes in image
gradients, which often correspond to corners or junctions in objects.
● Texture analysis extracts features like smoothness, roughness, or repetition in an
image.
● Colour-based feature extraction quantifies colour distributions within the image,
enabling discrimination between different objects or regions based on their colour
characteristics.
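A minimal sketch of two of these techniques using OpenCV (assuming opencv-python and NumPy are installed, and 'example.jpg' is your image; the threshold values are illustrative):
import cv2
import numpy as np

image = cv2.imread('example.jpg')   # Replace 'example.jpg' with the path to your image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Edge detection: Canny marks pixels where intensity changes sharply
edges = cv2.Canny(gray, 100, 200)

# Corner detection: Harris responds strongly at points of high curvature
corners = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)

cv2.imshow('Edges', edges)
cv2.waitKey(0)
cv2.destroyAllWindows()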
66
3.3.4. Detection/Segmentation:
Detection and segmentation are fundamental tasks in computer vision, focusing on
identifying objects or regions of interest within an image. These tasks play a pivotal role in
applications like autonomous driving, medical imaging, and object tracking. This crucial
stage is categorized into two primary tasks:
1. Single Object Tasks
2. Multiple Object Tasks
Single Object Tasks: Single object tasks focus on analysing or delineating individual objects
within an image, with two main objectives: image classification and image localization.
Multiple Object Tasks: Multiple object tasks deal with scenarios where an image contains
multiple instances of objects or different object classes. These tasks aim to identify and
distinguish between various objects within the image, and they include:
i) Object Detection: Object detection focuses on identifying and locating multiple
objects of interest within the image. It involves analysing the entire image and
drawing bounding boxes around detected objects, along with assigning class
labels to these boxes. The main difference between classification and detection
is that classification considers the image as a whole and determines its class
whereas detection identifies the different objects in the image and classifies all of
them.
In detection, bounding boxes are drawn around multiple objects and these are
labelled according to their particular class. Object detection algorithms typically
use extracted features and learning algorithms to recognize instances of an object
category. Some of the algorithms used for object detection are: R-CNN (Region-
Based Convolutional Neural Network), R-FCN (Region-based Fully Convolutional
Network), YOLO (You Only Look Once) and SSD (Single Shot Detector).
ii) Image segmentation: It creates a mask around pixels with similar characteristics and
identifies their class in the given input image. Image segmentation helps to gain
a better understanding of the image at a granular level. Pixels are assigned a class
and, for each object, a pixel-wise mask is created in the image. This makes it easy to
identify each object separately from the others. Techniques like edge detection,
which detects discontinuities in brightness, are used in image segmentation.
There are different types of image segmentation available.
Two popular types of segmentation are:
a. Semantic Segmentation: It classifies pixels belonging to a particular class.
Objects belonging to the same class are not differentiated. In this image for example
the pixels are identified under class animals but do not identify the type of animal.
b. Instance Segmentation: It classifies pixels belonging to a particular instance. All
the objects in the image are differentiated even if they belong to the same class. In
this image for example the pixels are separately masked even though they belong to
the same class.
Fig.3.17
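Real semantic or instance segmentation relies on trained deep learning models, but the core idea of a pixel-wise mask can be illustrated with simple thresholding in OpenCV (a simplified sketch, not a segmentation model; it assumes opencv-python is installed and 'example.jpg' is a reasonably high-contrast image):
import cv2

image = cv2.imread('example.jpg')   # Replace with your own image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Thresholding assigns every pixel to foreground (255) or background (0),
# producing a crude pixel-wise mask, similar in spirit to a segmentation map
_, mask = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

cv2.imshow('Mask', mask)
cv2.waitKey(0)
cv2.destroyAllWindows()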
68
3.3.5. High-Level Processing: In the final stage of computer vision, high-level processing
plays a crucial role in interpreting and extracting meaningful information from the detected
objects or regions within digital images. This advanced processing enables computers to
achieve a deeper understanding of visual content and make informed decisions based on
the visual data. Tasks involved in high-level processing include recognizing objects,
understanding scenes, and analysing the context of the visual content. Through
sophisticated algorithms and machine learning techniques, computers can identify and
categorize objects, infer relationships between elements in a scene, and derive insights
from complex visual data. Ultimately, high-level processing empowers computer vision
systems to extract valuable insights and drive intelligent decision-making in various
applications, ranging from autonomous driving to medical diagnostics.
69
3.5. CHALLENGES OF COMPUTER VISION
Computer vision, a vital part of artificial intelligence, faces several hurdles as it strives to
make sense of the visual world around us. These challenges include:
1. Reasoning and Analytical Issues: Computer vision relies on more than just image
identification; it requires accurate interpretation. Robust reasoning and analytical
skills are essential for defining attributes within visual content. Without such
capabilities, extracting meaningful insights from images becomes challenging,
limiting the effectiveness of computer vision systems.
4. Duplicate and False Content: Computer vision introduces challenges related to the
proliferation of duplicate and false content. Malicious actors can exploit
vulnerabilities in image and video processing algorithms to create misleading or
fraudulent content. Data breaches pose a significant threat, leading to the
dissemination of duplicate images and videos, fostering misinformation and
reputational damage.
70
ACTIVITY 3.2 CREATING A WEBSITE CONTAINING AN ML MODEL
1. Go to the website https://teachablemachine.withgoogle.com/
Fig.3.18
Fig.3.19
4. Choose the ‘Image’ project.
5. Select ‘Standard Image Model’.
Fig.3.20
Fig.3.21
You have the option to choose between two methods: using your webcam to capture
images or uploading existing images. For the webcam option, you will need to position the
image in front of the camera and hold down the record button to capture the image.
Alternatively, with the upload option, you have the choice to upload images either from your
local computer or directly from Google Drive.
71
7. Let us name ‘Class 1’ as Kittens and upload pictures already saved on the computer.
Fig.3.22
8. Now, let us add another class of images “Puppies” saved on the computer.
Fig.3.23
9. Click on Train Model.
Fig.3.24
Once the model is trained, you can test it by showing an image in front of the web camera.
Alternatively, you can upload an image from your local computer or Google Drive.
72
Fig.3.25
10. Click on 'Export Model'. A screen will open up as shown. Now, click on 'Upload
my model'.
Fig.3.26
11. Once your model is uploaded, Teachable Machine will create a URL which we will use
in the JavaScript code. Copy the JavaScript code by clicking on 'Copy'.
73
Fig.3.27
12.Open Notepad and paste the JavaScript code and save this file as web.html.
13.Let us now deploy this model in a website.
14.Once you create a free account on Weebly, go to Edit website and create an appealing
website using the tools given.
Fig.3.28
15.Click on Embed Code and drag and place it on the webpage.
Fig.3.29
16. Click on 'Edit HTML Code', copy the JavaScript code from the HTML file
(web.html), and paste it here as shown.
Fig.3.30
Fig.3.31
74
18.Copy the URL and paste it into a new browser window to check the working of your
model.
Fig.3.32
19. Click on 'Start' and show pictures of kittens and puppies to check the predictions of
your model.
Fig.3.33
To use OpenCV in Python, you need to install the library. Use the following command in
your terminal or command prompt:
pip install opencv-python
75
3.7.2. Loading and Displaying an Image: Let us understand the loading and displaying
using a scenario followed by a question.
Scenario- You are working on a computer vision project where you need to load and display an
image. You decide to use OpenCV for this purpose.
Question:
What are the necessary steps to load and display an image using OpenCV? Write a Python code
snippet to demonstrate this.
Solution:
Here's a simple Python script to load and display an image using OpenCV:
import cv2
image = cv2.imread('example.jpg')  # Replace 'example.jpg' with the path to your image
cv2.imshow('Original Image', image)  # Display the image in a window
cv2.waitKey(0)  # Wait for a key press
cv2.destroyAllWindows()  # Close all OpenCV windows
76
new_width = 300
new_height = 300
# Resize the image to the new dimensions
resized_image = cv2.resize(image, (new_width, new_height))
So, the full code will look like this -
import cv2
image = cv2.imread('example.jpg')  # Replace 'example.jpg' with the path to your image
new_width = 300
new_height = 300
# Resize the image to the new dimensions
resized_image = cv2.resize(image, (new_width, new_height))
cv2.imshow('Resized Image', resized_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
77
So, the full code will look like this:
import cv2
image = cv2.imread('example.jpg')  # Replace 'example.jpg' with the path to your image
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # Convert from BGR colour to grayscale
cv2.imshow('Grayscale Image', grayscale_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
EXERCISES
A. Multiple Choice Questions:
1. The field of study that helps to develop techniques to help computers “see”
is________________.
a. Python b. Convolution c. Computer Vision d. Data Analysis
2. Task of taking an input image and outputting/assigning a class label that best describes
the image is ____________.
a. Image classification b. Image localization
c. Image Identification d. Image prioritization
3. Identify the incorrect option
(i) computer vision involves processing and analysing digital images and videos
to understand their content.
(ii) A digital image is a picture that is stored on a computer in the form of a
sequence of numbers that computers can understand.
(iii) RGB colour code is used only for images taken using cameras.
(iv) Image is converted into a set of pixels and less pixels will resemble the original
image.
a. ii b. iii c. iii & iv d. ii & iv
4. The process of capturing a digital image or video using a
digital camera, a scanner, or other imaging devices is related to ________.
a. Image Acquisition b. Preprocessing
c. Feature Extraction d. Detection
5. Which algorithm may be used for supervised learning in computer vision?
a. KNN b. K-means c. K-fold d. KEAM
6. A computer sees an image as a series of ___________
a. colours b. pixels c. objects d. all of the above
78
7. ____________ empowers computer vision systems to extract valuable insights and
drive intelligent decision-making in various applications, ranging from autonomous
driving to medical diagnostics.
a. Low level processing b. High insights
c. High-level processing d. None of the above
8. In Feature Extraction, which technique identifies abrupt changes in pixel intensity
and highlights object boundaries?
a. Edge detection b. Corner detection
c. Texture Analysis d. boundary detection
9. Choose the incorrect statement related to preprocessing stage of computer vision
a. It enhances the quality of acquired image
b. Noise reduction and Image normalization is often employed with images
c. Techniques like histogram equalization can be applied to adjust the distribution
of pixel intensities
d. Edge detection and corner detection are ensured in images.
10. 1 byte = __________ bits
a. 10 b. 8 c. 2 d. 1
3. Write down any two algorithms which can be used for object detection.
Ans- Algorithms used for object detection include R-CNN (Region-Based Convolutional
Neural Network) and R-FCN (Region-based Fully Convolutional Network).
79
5. Write any four applications of computer vision.
Ans –
1) Facial recognition: Popular social media platforms like Facebook use facial recognition
to detect and tag users.
2) Healthcare: Helps in evaluating cancerous tumours and identifying diseases or
abnormalities; also used for object detection and tracking in medical imaging.
3) Surveillance: Live footage from CCTV cameras in public places helps to identify
suspicious behaviour, detect dangerous objects, and prevent crimes by maintaining law
and order.
4) Fingerprint recognition and biometrics: Detects fingerprints and biometrics to validate a
user's identity.
80
leading to the dissemination of duplicate images and videos, fostering
misinformation and reputational damage.
2. The Red Fort is hosting a grand cultural event, and keeping everyone safe is the top priority!
A state-of-the-art security system utilizes different feature extraction techniques to analyse
live video feeds and identify potential issues. Identify the feature extraction technique
that can be used in each of the following situations.
a. A large bag is left unattended near a crowded entrance.
b. A person tries to climb over a wall near a blind spot.
c. A group of people starts pushing and shoving in a congested area.
d. A wanted person with a distinctive red scarf enters the venue.
Ans: a. Texture Analysis
b. Edge detection
c. Corner detection
d. Colour based extraction
3. Which image segmentation technique would be most effective in this scenario: semantic
segmentation or instance segmentation?
a. You are developing a quality control system for a manufacturing plant. The system
needs to analyse images captured by cameras above a conveyor belt to identify
and isolate defective products.
b. You are assisting urban planners in developing a comprehensive land-use map for
a growing city. They require analysis of aerial imagery to classify large areas into
distinct categories like "buildings," "roads," "vegetation," and "parks."
Ans:
a. Instance Segmentation
b. Semantic Segmentation
81
4. Shreyan joined an architecture firm wherein he was asked to check if AI could be used
to create a detailed 3D representation of a city district for urban planning purposes, utilizing
computer vision techniques and data integration from aerial imagery, topographic
surveys, and architectural blueprints. Help him identify the application of computer
vision that can be used for the same.
Ans: 3D Model building using computer vision
5. You work for a fact-checking website. Lately, there's been a surge in INCORRECT news
articles and videos online, often manipulated using computer vision techniques. These
manipulations can make real people appear to be saying things they never did. What is
the concern here?
Ans. False/Fake content
References:
1. https://www.ibm.com/topics/computer-vision
2. https://www.datacamp.com/tutorial/seeing-like-a-machine-a-beginners-guide-
to-image-analysis-in-machine-learning
3. https://medium.com/@edgeneural.ai/computer-vision-making-the-machines-
see-361a3d0cfc3f
4. https://www.javatpoint.com/computer-vision
82
UNIT 4: AI with Orange Data Mining Tool
Title: AI with Orange Data Mining Tool
Approach: Practical work, Group activity, Creativity, Data analysis, Group discussion
Summary: Students will learn to use Orange's intuitive interface and component-
based visual programming approach in the domains of Data Science, Computer Vision,
and Natural Language Processing. They will explore its diverse set of widgets, covering
data visualization, preprocessing, feature selection, modeling, and evaluation, gaining
practical insights into its applications in these fields.
Learning Objectives:
1. Students will gain a comprehensive understanding of Orange Data Mining tool,
empowering them to leverage its capabilities across various domains of Artificial
Intelligence.
2. Students will gain practical insights into its applications in Data Science, Computer
Vision, and Natural Language Processing (NLP) through detailed exploration of
different widgets and functionalities offered by Orange.
Key Concepts:
1. Introduction to Orange Data Mining Tool
2. Components of Orange Data Mining Tool
3. Key domains of AI with Orange data mining tool – Data Science, Computer Vision,
NLP
Learning outcomes:
Students will be able to -
1. Develop proficiency in utilizing the Orange Data Mining tool, enabling them to
navigate its interface, employ its features, and execute data analysis tasks
effectively.
2. Demonstrate the ability to apply Orange in real-world scenarios across diverse
domains of Artificial Intelligence, including Data Science, Computer Vision, and
Natural Language Processing (NLP), through hands-on projects and case studies.
Prerequisites:
● Awareness and understanding of the terms Data Science, Natural Language
Processing, and Computer Vision.
● Basic knowledge of algorithms used in Machine Learning.
83
Unveiling Data Secrets with Orange Data Mining Tool: A Teacher's Guide
This lesson equips you to introduce students to Orange, a user-friendly data mining tool. Students
will learn to manipulate data, create visualizations, and explore machine learning through practical
exercises.
• Introduction: Introduce Orange as a powerful data mining tool that helps us explore data,
visualize patterns, and build predictive models.
2. Orange Tour:
• Machine Learning with Orange: Introduce the concept of Machine Learning and how
Orange can be used to build models.
• Model Evaluation: Explain how to evaluate the performance of models using metrics like
accuracy and confusion matrix.
• Classification with Orange: Demonstrate building a classification model to predict Iris
flower species based on features. Guide students through:
o Selecting a classification algorithm (e.g., K-Nearest Neighbors).
o Splitting data into training and testing sets.
o Training the model and evaluating its performance.
84
5. Text Mining with Orange's "Keyphrases" Widget:
• Unveiling Text Data: Introduce the "Keyphrases" widget to analyze text data.
o Focus: Highlight Orange's capabilities for text mining tasks like identifying key topics,
sentiment analysis, and trends.
• Practical Demonstration: Demonstrate using the "Keyphrases" widget on a sample text
dataset. Discuss the extracted keywords and their potential insights.
• Exploring Data: Guide students through a deeper exploration of the Iris dataset:
o Load the dataset in Orange.
o Visualize characteristics like sepal length and sepal width using scatter plots.
o Analyze patterns and discuss potential relationships between features.
7. Classification in Action:
• Computer Vision with Orange: Briefly introduce Orange's capabilities for computer vision
tasks, including image processing, feature extraction, and object detection.
• Natural Language Processing (NLP) with Orange: Briefly discuss how Orange can be used
for NLP tasks like text preprocessing, sentiment analysis, topic modeling, and classification.
Additional Tips:
• Utilize real-world examples to showcase the applications of data mining and machine
learning.
• Encourage students to ask questions and explore different functionalities within Orange.
• Provide online tutorials and resources for students to delve deeper into specific Orange
features.
By incorporating these elements, you can equip students with a valuable toolkit for data
exploration, visualization, and machine learning using Orange, empowering them to uncover
hidden insights from data.
85
Teachers can ask the following questions to spark curiosity before starting the topics:
• Have you ever wondered how businesses analyze vast amounts of data to make
decisions, like predicting customer preferences or classifying products? What tools or
techniques do you think they use to make sense of such data? (This question introduces
the concept of data mining and piques interest in tools like Orange.)
• Imagine you're tasked with identifying patterns in social media images or analysing
customer reviews for trends. What kind of process or tools might you use to extract
meaningful insights from such diverse data sources? (This question sets the stage for
exploring Orange’s applications in data science, computer vision, and natural language
processing.)
86
Educators and Students: Educators can leverage Orange's intuitive interface and visual
programming environment to introduce students to complex topics in a more approachable
way.
87
4.4.5. Launch Orange:
After installation is complete, launch the Orange tool from your system's applications menu or by
double-clicking its icon.
1. Blank Canvas: The blank canvas is where you build your analysis workflows by dragging and
dropping widgets. It serves as the workspace where you connect widgets together to form
a data analysis pipeline. You can add widgets to the canvas, rearrange them, and connect
them to create a flow of data processing from input to output.
2. Widgets: -
Widgets are graphical elements that
perform specific tasks or operations on
data. When you open Orange, you will
typically see a blank canvas where you can
drag and drop widgets to create your
analysis workflow.
88
4.6. DEFAULT WIDGET CATALOG
Data Widgets: These widgets are used for data manipulation, i.e.,
● File - reads the input data file (a data table with data instances) and sends the dataset
to its output channel.
● Data Table - displays attribute-value data in a spreadsheet.
● SQL Table - reads data from an SQL database.
Fig. 4.4
Fig. 4.5
Fig. 4.6
89
Model Widgets: Widgets that enable users to apply machine learning algorithms, including
classification, regression, clustering, and anomaly detection, to build predictive models and
analyze data patterns.
Fig. 4.7
Fig. 4.8
Fig. 4.9
90
4.7. KEY DOMAINS OF AI WITH ORANGE DATA MINING TOOL
1. Data science with Orange Data Mining
2. Computer Vision with Orange Data Mining
3. Natural Language Processing with Orange Data Mining
91
Data Visualization: - Exploring Iris Flower Dimensions with Orange Data Mining
92
Step 4 - Display the Dataset
in a Data Table- Drag and
drop the "Data Table" widget
onto the canvas. Connect the
output of the "File" widget to
the input of the "Data Table"
widget by dragging the
connector from one widget to
the other.
93
Step 7- Interpret the Scatter Plot
Once the scatter plot is generated, you'll see a graphical representation of the iris Dataset. The
scatter plot will display the relationship between two selected variables, such as sepal length
and sepal width, with each point representing an individual iris sample. You can further
customize the scatter plot and explore other visualizations using additional widgets available in
Orange.
You can experiment with other widgets available in Orange for data visualization, such as
histograms, box plots, or parallel coordinate plots. Each visualization offers unique insights into
the iris dataset, helping you understand the distribution and patterns of flower dimensions.
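A rough code equivalent of this visualization (an assumption: matplotlib and scikit-learn are installed; this is not the Orange workflow itself, only an illustrative sketch using the built-in iris dataset):
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
sepal_length = iris.data[:, 0]   # first column: sepal length (cm)
sepal_width = iris.data[:, 1]    # second column: sepal width (cm)

# Scatter plot: each point is one iris sample, coloured by species
plt.scatter(sepal_length, sepal_width, c=iris.target)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris dataset: sepal length vs sepal width')
plt.show()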
4.7.1.1. CLASSIFICATION: -
Imagine you have just returned from the flower shop with a bunch of irises, and you have
meticulously measured the dimensions of their petals and sepals. Now, armed with these data
features, you are ready to determine the type of iris flower - whether it is iris setosa, iris versicolor,
or iris virginica. We will collect this testing data and feed it into a spreadsheet for further analysis.
Note: The "Data Table" widget simply shows the content of the data in tabular form; connecting it is optional.
94
Step 3: Perform Classification
● Drag and drop the "Predictions" widget
onto the canvas. Connect the output of the
"File" widget (containing the training data)
to the input of the "Predictions" widget.
● Drag another "File" widget onto the canvas
and upload the iris testing dataset created
in the spreadsheet.
● Connect the output of the second "File"
widget (containing the testing data) to the
input of the "Predictions" widget. This
instructs Orange to use the trained model
to classify the samples in testing dataset.
Fig. 4.19
By following the above steps, you have successfully employed Orange data mining software
to classify iris flower types using testing data. This demonstrates the practical application of
machine learning techniques in real-world scenarios, offering insights into the classification
process and its accuracy.
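For teachers who want to show the same idea in code, here is a rough scikit-learn equivalent (an assumption: scikit-learn is installed; this is not the Orange workflow itself, just an illustrative sketch training a classification tree on the built-in iris dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset (sepal/petal measurements and species labels)
X, y = load_iris(return_X_y=True)

# Split into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a classification tree and predict the species of the test samples
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(predictions[:5])              # predicted class labels for the first five test flowers
print(model.score(X_test, y_test))  # overall accuracy on the testing data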
95
Step 1: Perform Evaluation: -
● Begin by adding the "File" widget to the canvas and connect it
to the output of the classification tree model. This will provide
the dataset used in the earlier training phase.
● Connect the same "File" widget to the input of the "Test and
Score" widget. This widget evaluates the model's performance
and calculates metrics such as accuracy, precision, recall, and
F1 score.
Fig. 4.21
● Double-click on the "Test and Score" widget to view the
evaluation metrics. Orange utilizes cross-validation, splitting
the data into 10 subsets for training and testing. This process is repeated multiple times to
ensure robust evaluation.
Fig. 4.22
Step 2: Interpret Evaluation Metrics
Fig. 4.23: Evaluation Metrics
● Analyze the evaluation metrics displayed by the "Test and Score" widget. The accuracy
metric indicates the percentage of correctly classified instances, which in our case is 93%.
● However, to gain deeper insights, we will examine precision, recall, and F1 score.
Precision measures the proportion of true positives among all instances classified as
positive, while recall measures the proportion of actual positives that are correctly
identified. The F1 score is the harmonic mean of precision and recall.
● These metrics collectively provide a comprehensive understanding of the model's
performance across different classes.
96
Fig. 4.24: Confusion Matrix
● Interpret the confusion matrix to identify
any patterns or areas where the model may
be struggling. For instance, while the model accurately identifies iris setosa, it may
struggle with distinguishing between versicolor and virginica.
By evaluating our classification model using Orange, we've gained valuable insights into its
performance. Through metrics such as accuracy, precision, recall, and F1 score, as well as the
confusion matrix with cross-validation, we can assess the model's strengths and weaknesses. This
evaluation process empowers us to make informed decisions and further refine our model for
improved performance.
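Orange's Test and Score and Confusion Matrix widgets compute these values for you; for reference, the same metrics can also be computed in code with scikit-learn (a sketch under the assumption that scikit-learn is installed; the labels below are made up for a three-class iris problem and are not Orange's output):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical true labels and model predictions for a 3-class iris problem
# (0 = setosa, 1 = versicolor, 2 = virginica)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1, 2, 1]

print(accuracy_score(y_true, y_pred))                    # fraction of correct predictions
print(precision_score(y_true, y_pred, average='macro'))  # averaged over the three classes
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))
print(confusion_matrix(y_true, y_pred))                  # rows: actual class, columns: predicted class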
Practical Activity
● Your task is to conduct a Data Science project aiming to differentiate between fruits and
vegetables based on their nutritional characteristics.
● Begin by gathering a dataset containing nutritional information for a variety of fruits and
vegetables, including features such as energy (kcal/kJ), water content (g), protein (g),
total fat (g), carbohydrates (g), fiber (g), sugars (g), calcium (mg), iron (mg), magnesium
(mg), phosphorus (mg), potassium (mg), and sodium (g).
● You can explore reputable sources like Kaggle for this data.
● Next, split the dataset into training and testing subsets. Utilize classification algorithms
available in Orange to train models on the training data, then evaluate their performance
using the testing data.
● Utilize evaluation metrics such as accuracy, precision, recall, and F1-score to assess the
models' effectiveness and compare their performance to identify the most effective one.
97
4.7.2. Computer Vision with Orange
In the vast landscape of data, images represent a rich and diverse source of information.
With advancements in computer vision and machine learning, we can now extract valuable insights
from images using tools like Orange. Let us now explore how Orange transforms images into
numerical representations and enables machine learning on image data.
Step 1: Install Image Analytics Add-On
● To get started, we need to install the Image Analytics add-on in Orange. Go to the "Options"
menu, click on "Add-ons," and install Image Analytics. Restart Orange to enable the add-on
to appear in the widget panel.
98
Step 4: Image Visualization
99
Step 7: Hierarchical Clustering
● Pass the distance matrix to the "Hierarchical Clustering" algorithm by dragging and dropping
the widget onto the canvas.
100
Fig. 4.32: Visualization and Interpretation
With Orange's Image Analytics capabilities, we can unlock the potential of image data and perform
sophisticated analysis and machine learning tasks. By leveraging image embeddings and
clustering algorithms, we can gain insights and make informed decisions from image datasets,
paving the way for innovative applications in various domains.
Practical Activity:
● Cluster images of birds and animals into distinct groups based on their visual characteristics.
● Collect datasets of images containing various species of birds and animals.
● Ensure that each dataset contains a sufficient number of images representing different species
within the respective categories.
● Import the collected image datasets into Orange Data Mining.
● Apply clustering algorithms to group similar images together based on their numerical
representations.
● Analyze the clustering results and interpret the grouping of images into different clusters.
● Identify any patterns or similarities observed within each cluster and between clusters .
101
4.7.3. Natural Language Processing with Orange
Now that we have learned about analyzing data in spreadsheets and images, let us move on
to working with text using Orange Data Mining. Natural Language Processing (NLP) helps us
understand and learn from written words, like finding patterns in documents. With Orange, we will
do different tasks in NLP, which can help us in lots of ways, like analyzing and understanding text
better.
102
Step 4: Visualize Word Frequencies with
Word Cloud
Connect the output of the "Corpus" widget
to the "Word Cloud" widget. The Word
Cloud visually represents word
frequencies in a cloud format, with more
frequent words appearing larger. This
visualization provides an initial glimpse
into the prominent themes or topics within
the text.
103
● Now, the Word Cloud displays only meaningful words, allowing us to better understand the
main themes or topics within the corpus. Here you can see that 'turtle' and 'rabbit' appear
larger because these words are more frequent in the corpus.
With Orange Data Mining, we can leverage the power of NLP to analyze textual data effectively.
Through the steps outlined above, we have seen how to load, visualize, preprocess, and analyze
textual data effectively. From exploring word frequencies with the Word Cloud to cleaning and
refining text through preprocessing, Orange empowers users to derive meaningful insights from
text effortlessly.
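Outside Orange, the same preprocessing idea can be sketched in a few lines of plain Python (a simplified illustration; the sample sentence and the small stop-word list are made up):
from collections import Counter

# A tiny corpus and a small, hand-made stop-word list
text = "The turtle and the rabbit ran a race. The turtle won the race."
stop_words = {"the", "and", "a"}

# Normalize: lowercase, strip punctuation, split into tokens
tokens = [word.strip(".,!?").lower() for word in text.split()]

# Remove stop words, then count how often each remaining word occurs
frequencies = Counter(word for word in tokens if word not in stop_words)

print(frequencies.most_common(3))   # e.g. [('turtle', 2), ('race', 2), ('rabbit', 1)]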
Practical Activity
Your task is to create your own corpus by selecting a story or article of your choice. Ensure the
text is sufficiently long to provide meaningful insights. Once you have your corpus, use Orange
Data Mining to perform: -
1. Text normalization techniques such as converting all text to lowercase to ensure
consistency, tokenization to split the text into individual words or tokens, and removal
of stop words to eliminate common words that do not contribute to the meaning of the text.
2. Analyze the word cloud to identify which words appear most frequently and gain insights
into the main themes or topics of the text.
3. Compare the results of the word cloud analysis before and after applying lemmatization
and stemming to observe any differences in the most commonly used words.
EXERCISES
A. Multiple Choice Questions
1. Which widget helps in knowing the Accuracy, Precision, recall and F1 score values?
a. Confusion matrix b. Test and score c. Word cloud d. Corpus
2. Which widget provides the detailed breakdown of the model's classifications, showing the
number of true positives, true negatives, false positives, and false negatives for each class?
a. Test and score b. Preprocess text c. Confusion Matrix d. Corpus
3. ______________widget performs text normalization by converting text to lowercase,
tokenizing it into individual words, removing punctuation, and filtering out stop words.
Ans- Preprocess text
4. Cross-validation is also known as _______
a. Rotation estimation b. Cross hybrid c. Revolution Estimation d. Estimation
5. In a word cloud, the more frequent a word is in the corpus, the larger it appears.
a. True b. False
6. Which widget is used to transform the raw images into numerical representations?
a. Image viewer b. Image analytics c. Image Embedding d. Distance
7. Which widget is used to compare the embeddings of images and compute their similarities?
a. Distance b. Image Embedding c. Preprocess Text d. Image viewer
104
8. ____________ are lines that link widgets together on the canvas. They represent the flow
of data from one widget to another, indicating how the output of one widget is used as input
for another.
Ans- Connectors
9. Suppose you are working as a data scientist for a social media analytics company. The
company is interested in analyzing and categorizing images shared on its platform to better
understand user behavior and preferences. Which widget will you use to extract numerical
representations of images for analysis?
Ans- Image embedding
4. How does the "Preprocess Text" widget contribute to text analysis in Orange Data Mining?
Ans- “Preprocess text” widget performs text normalization by converting text to lowercase,
tokenizing it into individual words, removing punctuation, and filtering out stop words.
105
C. Competency Based Questions:
2. Arun, a teacher wants to educate his students about distinguishing between animals with two
legs and those with four legs. He has a folder containing mixed images of animals and wants to
organize them into clusters based on this distinction. Which Add-ons should he install in Orange
data mining tool to achieve this image differentiation?
Ans: Arun should install the add-on called "Image Analytics."
3. Sunita enjoys writing and is currently working on a story about nature. Sunita is curious to know
about the most frequently used word in her story and seeks assistance in using the Orange data
mining tool to determine this. Which tool can she use to analyze her story's word frequency?
Ans: Sunita can utilize the Word Cloud tool within Orange Data Mining to analyze her story's
word frequency. This tool visually displays the frequency of words in a cloud format, with more
frequent words appearing larger and more prominent.
4. Rajat and Misty were having an argument on the effectiveness of natural language processing
(NLP) techniques in Orange data mining for text analysis tasks. Can you list the applications of
NLP techniques in Orange data mining that contribute to text analysis?
Ans: NLP techniques in Orange data mining are used for tasks such as text preprocessing,
sentiment analysis, topic modeling, and text classification. These techniques enable analysts
to extract insights from unstructured text data, automate text processing tasks, and gain
valuable information from textual sources in diverse domains.
5. Imagine you are a data scientist working at a recycling plant. The plant recently introduced
robots equipped with machine learning models to sort incoming materials. These robots
categorize items into different bins like plastic, metal, glass, and paper. To improve the robots'
accuracy, you plan to use Orange Data Mining to analyze their performance. Which widget
within Orange can provide a detailed breakdown of the model's classifications for each material
type? Specifically, you are interested in seeing the number of true positives (correctly identified
items), true negatives (correctly rejected items), false positives (incorrectly accepted items),
and false negatives (incorrectly rejected items) for aluminum and tin cans. What widget can
help you achieve this?
Ans: Confusion matrix
REFERENCES:
1. https://orangedatamining.com/docs/
2. https://orangedatamining.com/blog/
3. https://www.youtube.com/@OrangeDataMining
106
UNIT 5: Introduction to Big Data and Data Analytics
Title: Introduction to Big Data and Data Analytics
Approach: Team discussion, Web search
Summary: Students will delve into the world of Big Data, a game-changer in today's
digital age. Students gain insights into the various types of data and their unique
characteristics, equipping them to understand how this vast information is managed
and analysed. The journey continues as students discover the real-world applications
of Big Data and Data Analytics in diverse fields, witnessing how this revolutionary
concept is transforming how we approach data analysis to unlock new possibilities.
Learning Objectives:
1. Students will develop an understanding of the concept of Big Data and its
development in the new digital era.
2. Students will appreciate the role of big data in AI and Data Science.
3. Students will learn to understand the features of Big Data and how these
features are handled in Big Data Analytics.
4. Students will appreciate its applications in various fields and how this new
concept has evolved to bring new dimensions to Data Analysis.
5. Students will understand the term mining data streams.
Key Concepts:
1. Introduction to Big Data
2. Types of Big Data
3. Advantages and Disadvantages of Big Data
4. Characteristics of Big Data
5. Big Data Analytics
6. Working on Big Data Analytics
7. Mining Data Streams
8. Future of Big Data Analytics
Learning Outcomes:
Students will be able to –
1. Define Big Data and identify its various types.
2. Evaluate the advantages and disadvantages of Big Data.
3. Recognize the characteristics of Big Data.
4. Explain the concept of Big Data Analytics and its significance.
5. Describe how Big Data Analytics works.
6. Exploring the future trends and advancements in Big Data Analytics.
107
Demystifying the Big Data Deluge: A Teacher's Guide to Introduction to Big
Data and Data Analytics
This lesson plan equips you to guide students through the vast and exciting world of Big
Data.
• Big Data Analytics in Action: Introduce the various stages of Big Data Analytics:
o Data Collection: Gathering data from diverse sources (e.g., social media,
sensors, transactions).
o Data Storage: Storing massive datasets using specialized solutions (e.g.,
distributed file systems).
o Data Processing: Cleaning, organizing, and preparing data for analysis.
o Data Analysis: Utilizing various techniques (e.g., machine learning, statistics)
to extract insights from data.
o Data Visualization: Presenting data findings in clear and compelling ways (e.g.,
charts, graphs).
• Real-World Applications: Showcase real-world applications of Big Data Analytics
across various sectors:
o Business: Optimizing marketing campaigns, identifying customer trends, and
improving operational efficiency.
o Healthcare: Predicting disease outbreaks, analyzing medical images, and
personalizing patient care.
o Finance: Detecting fraudulent activities, managing risk, and providing
personalized financial recommendations.
• Data Diversity: Define and differentiate between the three main data types:
o Structured Data: Highly organized data with a defined format (e.g., database
tables).
o Semi-structured Data: Partially organized data with some inherent structure
(e.g., emails, logs).
o Unstructured Data: Data with no predefined format (e.g., text, images, videos).
• Advantages and Disadvantages: Discuss the benefits and limitations of Big Data:
108
Advantages:
o Improved decision-making through data-driven insights.
o Enhanced efficiency and innovation across various industries.
o Potential for breakthroughs in scientific research and social good.
• The Future of Big Data: Discuss the booming job opportunities in Big Data Analytics
across various industries and the projected growth of the Big Data market. Explore
emerging markets where Big Data is revolutionizing various sectors.
Additional Tips:
• Utilize interactive activities, case studies, and real-world data examples to solidify
student understanding.
• Encourage students to research specific Big Data applications in their areas of
interest.
• Discuss potential solutions and ethical frameworks for addressing Big Data
challenges like privacy concerns.
By incorporating these elements, you can equip students with the knowledge and skills to navigate
the Big Data landscape and leverage its power to make informed decisions and solve real-world
problems.
109
Teachers can ask the following questions to spark curiosity before starting the topics:
• Imagine you are running a global online store. Every day, thousands of customers
visit, browse, and purchase items. How would you handle and analyse the massive
amount of data generated to improve customer experience and boost sales? (This
introduces the concept of Big Data and its applications, making it relatable to real-
world scenarios.)
• Have you ever wondered how streaming platforms like Netflix suggest shows and
movies you might like? What kind of data do you think they analyze to make those
recommendations? (This connects Big Data Analytics to familiar examples, sparking
curiosity about how data is used for personalization.)
• In a world where millions of tweets and posts are created every second, how do
you think companies or governments use this information to identify trends,
predict outcomes, or make decisions? (This highlights the importance of analysing
unstructured data streams and sets the stage for discussing Big Data characteristics
like velocity and variety.)
Big Data refers to extremely large and complex datasets that regular computer programs
and databases cannot handle. It comes from three main sources: transactional data (e.g.,
online purchases), machine data (e.g., sensor readings), and social data (e.g., social media
posts). To analyze and use Big Data effectively, special tools and techniques are required.
These tools help organizations find valuable insights hidden in the data, which lead to
innovations and better decision-making. For example, companies like Amazon and Netflix
use Big Data to recommend products or shows based on users’ past activities.
110
5.2. Types of Big Data
Fig. 5.2
Aspect | Structured Data | Semi-Structured Data | Unstructured Data
Definition | Quantitative data with a defined structure | A mix of quantitative and qualitative properties | No inherent structures or formal rules
Data Model | Dedicated data model | May lack a specific data model | Lacks a consistent data model
Organization | Organized in clearly defined columns | Less organized than structured data | No organization; exhibits variability over time
Accessibility | Easily accessible and searchable | Accessibility depends on the specific data format | Accessible but may be harder to analyze
Examples | Customer information, transaction records, product directories | XML files, CSV files, JSON files, HTML files, PDFs, semi-structured documents | Audio files, images, video files, emails, social media posts
Big Data is a key to modern innovation. It has changed how organizations analyze and use
information. While it offers great benefits, it also comes with challenges that affect its use
in different industries. In this section, we will be discussing a few pros and cons of big data.
111
Advantages:
● Enhanced Decision Making: Big Data analytics empowers organizations to make
data-driven decisions based on insights derived from large and diverse datasets.
● Improved Efficiency and Productivity: By analyzing vast amounts of data,
businesses can identify inefficiencies, streamline processes, and optimize resource
allocation, leading to increased efficiency and productivity.
● Better Customer Insights: Big Data enables organizations to gain a deeper
understanding of customer behavior, preferences, and needs, allowing for
personalized marketing strategies and improved customer experiences.
● Competitive Advantage: Leveraging Big Data analytics provides organizations with
a competitive edge by enabling them to uncover market trends, identify
opportunities, and stay ahead of competitors.
● Innovation and Growth: Big Data fosters innovation by facilitating the development
of new products, services, and business models based on insights derived from data
analysis, driving business growth and expansion.
Disadvantages:
● Privacy and Security Concerns: The collection, storage, and analysis of large
volumes of data raise significant privacy and security risks, including unauthorized
access, data breaches, and misuse of personal information.
● Data Quality Issues: Ensuring the accuracy, reliability, and completeness of data can
be challenging, as Big Data often consists of unstructured and heterogeneous data
sources, leading to potential errors and biases in analysis.
● Technical Complexity: Implementing and managing Big Data infrastructure and
analytics tools require specialized skills and expertise, leading to technical
challenges and resource constraints for organizations.
● Regulatory Compliance: Organizations face challenges in meeting data protection
laws like GDPR (General Data Protection Regulation) and The Digital Personal Data
Protection Act, 2023. These laws require strict handling of personal data, making
compliance essential to avoid legal risks and penalties.
● Cost and Resource Intensiveness: The cost of acquiring, storing, processing, and
analyzing Big Data, along with hiring skilled staff, can be high. This is especially
challenging for smaller organizations with limited budgets and resources.
Activity: Find the sources of big data using the link UNSTATS
112
5.4. Characteristics of Big Data
The “characteristics of Big Data” refer to the
defining attributes that distinguish large and
complex datasets from traditional data
sources. These characteristics are commonly
described using the "3Vs" framework:
Volume, Velocity, and Variety.
The 6Vs framework provides a holistic view of
Big Data, emphasizing not only its volume,
velocity, and variety but also its veracity,
variability, and value. Understanding and
addressing these six dimensions are essential
for effectively managing, analyzing, and
deriving value from Big Data in various
domains.
Fig. 5.3 Characteristics of Big Data
113
5.4.3. Variety: Big data encompasses data
in various formats, including structured,
unstructured, semi-structured, or highly
complex structured data. These can range
from simple numerical data to complex
and diverse forms such as text, images,
audio, videos, and so on. Storing and
processing unstructured data through
RDBMS is challenging. However,
unstructured data often provides valuable
insights that structured data cannot offer.
Additionally, the variety of data sources
within big data provides information on the
diversity of data.
Fig.5.6 Varieties in Big data
114
5.4.6. Variability: This refers to establishing if the
contextualizing structure of the data stream is regular and
dependable even in conditions of extreme unpredictability. It
defines the need to get meaningful data considering all possible
circumstances.
Fig. 5.9
Case Study: How a Company Uses 3V and 6V Frameworks for Big Data
Company: An OTT Platform ‘OnDemandDrama’
3V Framework:
Volume: OnDemandDrama processes huge amounts of data from millions of users, including watch
history, ratings, searches, and preferences to offer personalized content recommendations.
Velocity: Data is processed in real-time, allowing OnDemandDrama to immediately adjust
recommendations, track the patterns of the users, and offer trending content based on their
activity.
Variety: The platform handles diverse data such as user profiles, watch lists, video content, and
user reviews which are categorized as structured, semi-structured, and unstructured data.
6V Framework:
Along with the above 3 V of big data, the 6V Framework involves 3 more features of big data named
Veracity, Value, and Variability.
Veracity: OnDemandDrama filters out irrelevant or low-quality data (such as incomplete profiles) to
ensure accurate content recommendations.
Value: OnDemandDrama uses the data to personalize user experiences, driving engagement and
retention by recommending shows and movies that match individual tastes.
Variability: OnDemandDrama handles changes or inconsistencies in data streams caused by factors
like user behavior, trends, or any other external events. For example, user preferences can vary
based on region, time, or trends.
By using the 3V and 6V frameworks, OnDemandDrama can manage, process, and derive valuable
insights from its Big Data, which enhances customer satisfaction and drives business decisions.
115
Big data analytics uses advanced analytic techniques against huge, diverse datasets that
include structured, semi-structured, and unstructured data, from different sources, and in
various sizes from terabytes to zettabytes.
116
5.6. Working on Big Data Analytics
Big data analytics involves collecting, processing, cleaning, and analyzing enormous
datasets to improve organizational operations. The working process of big data analytics
includes the following steps –
Examples of Data Analytics tools: Tableau, Apache Hadoop, Cassandra, MongoDB, SAS
We will explore how big data analysis can be performed using Orange Data Mining.
117
It is important to carefully study the dataset and understand the features and target
variable.
● Features: age, gender, chest pain, resting blood pressure (rest_spb), cholesterol,
resting ECG (rest_ecg), maximum heart rate (max_hr), etc.
● Target: diameter narrowing.
118
Here, we will focus on the Normalization technique.
Normalization in data preprocessing refers to scaling numerical values to a specific range
(e.g., 0–1 or -1–1), making them comparable and improving the performance of machine
learning algorithms.
You will see that all numerical values are now scaled between 0 and 1.
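Orange's preprocessing widget does this through the GUI; the same min-max scaling can be sketched with pandas (an assumption: pandas is installed, and the small table below is made up, using two of the feature names listed earlier):
import pandas as pd

# A small, made-up table with two numerical features
df = pd.DataFrame({"age": [29, 45, 61, 38], "max_hr": [170, 150, 120, 160]})

# Min-max normalization: scale every column to the range 0-1
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized)   # all values now lie between 0 and 1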
119
Step 3.1: Upload Data
1. Use the File widget to upload a dataset with missing values.
2. Assign the role of "Target" to the feature you want to predict.
120
Step 3.3: Verify Cleaned Data
1. Connect the Data Table widget to the Impute widget.
2. Open the Data Table to confirm the missing values have been replaced.
Missing values are now filled with the chosen method (e.g., average values).
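An equivalent sketch in pandas (assuming pandas and NumPy are installed; the small table with a missing value is made up):
import pandas as pd
import numpy as np

# A made-up table with a missing cholesterol reading
df = pd.DataFrame({"age": [29, 45, 61], "cholesterol": [204, np.nan, 256]})

# Impute: replace the missing value with the column average
df["cholesterol"] = df["cholesterol"].fillna(df["cholesterol"].mean())

print(df)   # the missing value is now filled with 230.0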
121
Step 4.3: Choose a Validation Method
1. Double-click the Test and Score widget. Select a validation method (e.g., Cross-
Validation).
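For reference, cross-validation can also be run in code with scikit-learn (a sketch under the assumption that scikit-learn is installed, again using the built-in iris data rather than the heart-disease dataset):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: the data is split into 10 parts; each part takes
# a turn as the test set while the model is trained on the other nine
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across the 10 folds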
To understand mining data streams, we first need to understand what a data stream is. A data
stream is a continuous, real-time flow of data generated by various sources. These sources
can include sensors, satellite image data, Internet and web traffic, etc.
Mining data streams refers to the process of extracting meaningful patterns, trends,
and knowledge from a continuous flow of real-time data. Unlike traditional data mining, it
processes data as it arrives, without storing it completely. An example of an area where data
stream mining can be applied is website data. Websites typically receive continuous
streams of data daily. For instance, a sudden spike in searches for "election results" on a
particular day might indicate that elections were recently held in a region or highlight the
level of public interest in the results.
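The key idea, processing each item as it arrives instead of storing everything, can be sketched with a running counter in Python (a toy illustration; the stream of search terms is made up):
from collections import Counter

# Pretend this is a continuous stream of incoming search queries
stream = ["weather", "election results", "cricket", "election results",
          "election results", "weather"]

counts = Counter()
for query in stream:
    # Update the summary as each item arrives; the raw stream is never stored in full
    counts[query] += 1

print(counts.most_common(1))   # [('election results', 3)] -> a sudden spike in interest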
The future of Big Data Analytics is highly influenced by several key technological
advancements that will shape the way data is processed and analyzed. A few of them are:
Activity-1
For each of the fields below, find one video source related to Big Data in that field and note
the insights you draw from it:
Education
Environmental Science
Media and Entertainment
Solution: In this activity, students should be encouraged to search for any one video
source related to each of the fields given above. Their answers may vary, and this should be
encouraged. In the next column, students can write the insights drawn about the concepts
seen in the video. For example, in the field of education, students can write points such as
adaptive learning, assisting management decisions, adaptive content, etc. Encourage the
students to write creative answers for each field; a maximum of four insights per field is
enough.
123
Activity-2
List the steps involved in the working process of Big Data analytics.
Step 1:
Step 2:
Step 3:
Step 4:
Solution:
Step 1 - Gather data
Step 2 - Process data
Step 3 - Clean data
Step 4 - Analyse data
EXERCISES
5. Which technique is commonly used for analyzing large datasets to discover patterns
and relationships?
a) Linear regression b) Data mining c) Decision trees d) Naive Bayes
6. Which term describes the process of extracting useful information from large
datasets?
a) Data analytics b) Data warehousing c) Data integration d) Data virtualization
124
7. Which of the following is a potential benefit of big data analytics?
a) Decreased data security b) Reduced operational efficiency
c) Improved decision-making d) Reduced data privacy
9. What is the primary challenge associated with the veracity aspect of big data?
a) Handling large volumes of data
b) Ensuring data quality and reliability
c) Dealing with diverse data types
d) Managing data processing speed
B. True or False
1. Big data refers to datasets that are too large to be processed by traditional
database systems. (True)
2. Structured data is the primary type of data processed in big data analytics, making
up the majority of datasets. (False)
3. Veracity refers to the trustworthiness and reliability of data in big data analytics.
(True)
4. Real-time analytics involves processing and analyzing data as it is generated, without
any delay. (True)
5. Cloud computing is the only concept used in Big Data Analytics. (False)
6. A CSV file is an example of structured data. (True)
7. “Positive, Negative, and Neutral” are terms related to Sentiment Analysis. (True)
8. Data preprocessing is a critical step in big data analytics, involving cleaning,
transforming, and aggregating data to prepare it for analysis. (True)
9. To analyze vast collections of textual materials to capture key concepts, trends, and
hidden relationships, the concept of Text mining is used. (True)
125
C. Short answer questions
1. Define the term Big Data.
Ans - Big Data refers to a vast collection of data that is characterized by its immense
volume, which continues to expand rapidly over time.
2. Diagnostic Analytics: Analyses past data to understand the reasons behind specific
outcomes.
These types are designed to provide insights at different levels of decision-making and
problem-solving.
127
Step 3. Clean Data
Scrubbing all data, regardless of size, improves quality and yields better results.
Correct formatting and elimination of duplicate or irrelevant data are essential. Dirty
data can lead to inaccurate insights.
Step 4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights.
4. Why is Big Data Analytics important in modern industries and decision-making
processes?
Big Data Analytics is important in modern industries and decision-making processes
because it:
1. Enables Data-Driven Decisions: By analyzing vast and diverse datasets,
organizations can make informed decisions based on insights and trends.
2. Improves Efficiency and Productivity: Identifying inefficiencies and optimizing
resource allocation helps streamline processes.
3. Enhances Customer Insights: Understanding customer behavior and preferences
enables personalized marketing and improved customer experiences.
4. Provides Competitive Advantage: Leveraging analytics helps organizations
uncover market trends, identify opportunities, and stay ahead of competitors.
5. Fosters Innovation and Growth: Insights derived from data analysis drive the
development of new products, services, and business models.
5. A healthcare company is using Big Data analytics to manage patient records, predict
disease outbreaks, and personalize treatments. However, the company is facing
challenges regarding data privacy, as patient information is highly sensitive. What are the
potential risks to patient privacy when using Big Data in healthcare, and how can these be
mitigated?
Potential Risks to Patient Privacy:
1. Unauthorized Access: Sensitive patient information could be accessed by
unauthorized individuals, leading to breaches of confidentiality.
2. Data Breaches: Cyberattacks could expose patient data to malicious actors.
3. Misuse of Personal Information: Patient data might be used for purposes beyond
its intended scope, such as marketing or profiling.
4. Regulatory Non-Compliance: Failing to comply with data protection laws like
GDPR or the Digital Personal Data Protection Act, 2023, could lead to legal and
financial penalties.
Mitigation Strategies:
1. Data Encryption: Encrypt data during storage and transmission to protect against
unauthorized access.
2. Access Controls: Implement strict access controls to ensure that only authorized
personnel can access sensitive data.
3. Anonymization: Remove personally identifiable information (PII) from datasets to
safeguard patient identity during analysis.
4. Regular Audits: Conduct regular security audits to identify and address
vulnerabilities.
5. Compliance with Regulations: Adhere to data protection laws to ensure ethical
handling of sensitive information.
6. Employee Training: Educate staff about data privacy practices and the importance
of protecting patient information.
6. Given the following list of data types, categorize each as Structured, Unstructured, or
Semi-Structured:
a) A customer database with fields such as Name, Address, Phone Number, and
Email. Structured
b) A JSON file containing product information with attributes like name, price, and
specifications. Semi-Structured
c) Audio recordings of customer service calls. Unstructured
d) A sales report in Excel format with rows and columns. Structured
e) A collection of social media posts, including text, images, and hashtags.
Unstructured
f) A CSV file with daily temperature readings for the past year. Structured
129
E. Competency Based Questions:
1. A retail clothing store is experiencing a decline in sales despite strong marketing
campaigns. You are tasked with using big data analytics to identify the root cause.
a. What types of customer data can be analyzed?
b. How can big data analytics be used to identify buying trends and customer
preferences?
c. Can you recommend specific data visualization techniques to present insights
to stakeholders?
d. How might these insights be used to personalize customer experiences and
improve sales?
Ans:
a. Analyze purchase history (items bought together, frequency, time of
purchase), demographics (age, location, income), and browsing behavior
(clicks, time spent on product pages) of the customer.
b. Big data analytics can help
i. identify items that are frequently purchased together to optimize
product placement and promotions.
ii. group customers based on demographics and buying habits
iii. track customer journeys on the website, identify areas of improvement
(e.g., checkout process)
c. Use dashboards and charts (e.g., bar charts of sales by category and customer
demographics) for easy stakeholder comprehension, and heat maps to show
customer browsing behaviour on the website (hotspots which indicate the items of interest).
d. These insights will help the application to
i. recommend relevant products based on a customer's purchase history
and browsing behavior.
ii. tailor promotions and advertisements to specific customer segments.
iii. adjust prices based on demand and customer demographics.
3. A global e-commerce platform is experiencing rapid growth in its user base, with
millions of transactions occurring daily across various product categories. As part of
their data analytics efforts, they are focused on improving the speed and efficiency
of processing incoming data to provide real-time recommendations to users during
their browsing and purchasing journeys. Identify the specific characteristic of big
data (6V's of Big Data) that is most relevant in the above scenario and justify your
answer.
Ans:
In the scenario described, the most relevant characteristic of big data from the 6V's
perspective is Velocity, because it highlights the need for the e-commerce
platform to handle the high speed at which data is generated from millions of
transactions daily. The platform needs to process this data quickly to provide real-time
recommendations during a user's browsing and purchasing journey. Delays in
processing could lead to missed opportunities to influence customer decisions.
Reference links:
• https://www.researchgate.net/publication/259647558_Data_Stream_Mining
• https://www.ibm.com/topics/big-data-analytics
• https://www.researchgate.net/figure/olume-scale-of-Data-from-different-data-sources-26_fig1_324015815
131
UNIT 6: Understanding Neural Networks
Key Concepts:
1. Parts of a neural network.
2. Components of a neural network.
3. Working of a neural network.
4. Types of neural networks, such as feedforward, convolutional, and recurrent.
5. Impact of neural networks on society.
Learning Outcomes:
Students will be able to –
1. Explain the basic structure and components of a neural network.
2. Identify different types of neural networks and their respective applications.
3. Understand machine learning and neural networks through hands-on projects,
interactive visualization tools, and practical Python programming.
Prerequisites:
Basic understanding of machine learning concepts.
132
Unveiling the Neural Network: A Teacher's Guide to Machine Learning
This lesson empowers you to guide students through the fascinating world of Neural
Networks, a cornerstone of Machine Learning.
133
• Structure, Applications, and Implications: Discuss the specific structure,
applications, and implications of each network type for diverse machine learning
domains.
6. The Power and Responsibility of NNs: Societal Impact and Ethical Considerations:
Additional Tips:
• Utilize visuals like diagrams and animations to illustrate Neural Network structures
and processes.
• Encourage students to research specific NN applications in their areas of interest.
• Discuss potential solutions for mitigating ethical concerns related to NNs.
• Explore online resources and interactive tools for further student exploration of
Neural Networks.
By incorporating these elements, you can equip students with a solid foundation in Neural
Networks and empower them to contribute responsibly to the future of this transformative
technology.
134
Teachers can ask the following questions to spark curiosity before starting the topics:
• Can you think of any examples in everyday life where we encounter situations that
involve making choices based on patterns? (This will help connect the concept of
neural networks to real-world applications. Students might provide examples like
filtering spam emails, recommending products on shopping websites, or recognizing
faces in photos.)
• Have you ever learned a new skill by practicing and improving over time? How do
you think this process of learning happens in the brain? (This will help bridge the gap
between biological neurons and artificial neurons. By understanding how our brains
learn and adapt, students can better grasp the concept of neural networks that mimic
this process.)
A network of neurons is called a “Neural Network”. “Neural” comes from the word
neuron of the human nervous system. The neuron is a cell of the nervous system and the basic
unit of the brain; it processes information and transmits it to other nerve cells and muscles.
Artificial Intelligence embeds similar neuron-like behaviour into a network called an
Artificial Neural Network. Like the human brain, such a network can adapt to changing inputs
and generate the best possible outcome without having to be redesigned.
A neural network is a machine learning program, or model, that makes decisions in a manner
similar to the human brain, by using processes that mimic the way biological neurons work together
to identify phenomena, weigh options and arrive at conclusions.
The main advantage of neural networks is that they can extract data features
automatically, without needing any input from the programmer. Artificial Neural Networks
(ANNs) have become very popular because of their applications in chatbots. They also make
our work easier by auto-replying to emails, suggesting email replies, filtering spam, tagging
images on Facebook, showing items of interest on e-shopping portals, and much more. One of
the best-known examples of a neural network is Google’s search algorithm.
6.1.1 PARTS OF A NEURAL NETWORK: -
Every neural network comprises layers of interconnected nodes — an input layer,
hidden layer(s), and an output layer, as shown in Fig 6.1.
1. Input Layer: This layer consists of units representing the input fields. Each unit
corresponds to a specific feature or attribute of the problem being solved.
2. Hidden Layers: These layers, which may include one or more, are located between the
input and output layers. Each hidden layer contains nodes or artificial neurons, which
process the input data. These nodes are interconnected, and each connection has an
associated weight.
3. Output Layer: This layer consists of one or more units representing the target field(s).
The output units generate the final predictions or outputs of the neural network.
Each node is connected to others, and each connection is assigned a weight. If the
output of a node exceeds a specified threshold value, the node is activated, and its output is
passed to the next layer of the network. Otherwise, no data is transmitted to the subsequent
layer.
An Artificial Neural Network (ANN) with two or more hidden layers is known as a Deep
Neural Network. The process of training deep neural networks is called Deep Learning. The
term “deep” in deep learning refers to the number of hidden layers (also called depth) of a
neural network. A neural network that consists of more than three layers—which would be
inclusive of the inputs and the output layers—can be considered a deep learning algorithm.
A neural network that only has three layers is just a basic neural network.
136
• Different types of Activation Functions are Sigmoid Function, Tanh Function, ReLU
(Rectified Linear Unit), etc. (A detailed explanation of activation functions is beyond
the scope of the syllabus)
• These functions help neural networks learn and make decisions by adding non-
linearities to the model, allowing them to understand complex patterns in data.
4. Bias:
• Bias terms are constants added to the weighted sum before applying the
activation function.
• They allow the network to shift the activation function horizontally.
• Bias helps account for any inherent bias in the data.
5. Connections:
• Connections represent the synapses between neurons.
• Each connection has an associated weight, which determines its influence on the
output of the connected neurons.
• Biases (constants) are also associated with each neuron, affecting its activation
threshold.
6. Learning Rule:
• Neural networks learn by adjusting their weights and biases.
• The learning rule specifies how these adjustments occur during training.
• Backpropagation, a common learning algorithm, computes gradients and updates
weights to minimize the network’s error.
7. Propagation Functions:
• These functions define how signals propagate through the network during both
forward and backward passes. Forward pass is known as Forward Propagation and
backward pass is known as Back Propagation.
• Forward Propagation
In forward propagation, input data flows through the network layers and
activations are computed. The predicted output is compared to the actual
target (ground truth), resulting in an error (loss).
• Back Propagation
Backpropagation is the essence of neural network training. It is the practice of
fine-tuning the weights of a neural network based on the error rate (i.e. loss)
obtained in the previous epoch (i.e. iteration.) Proper tuning of the weights ensures
lower error rates, making the model reliable by increasing its generalization and
helping the network to improve its prediction over time.
Backpropagation (short for “backward propagation of errors”) is an algorithm used during
neural network training. It computes how the error (loss) obtained in the previous iteration
(epoch) changes with respect to each weight, and the weights are then adjusted accordingly.
In Back propagation, gradients are propagated to update weights using
optimization algorithms (e.g., gradient descent).
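To make these components concrete, the short NumPy sketch below runs one neuron through a forward pass (weighted sum plus bias, then a sigmoid activation), measures a squared-error loss, and applies a gradient-descent update. It reuses the CASE I numbers from the next section and is only an illustration, not part of the syllabus.
# A single neuron trained with gradient descent (illustration only)
import numpy as np

def sigmoid(z):
    # squashes the weighted sum into the range (0, 1)
    return 1 / (1 + np.exp(-z))

x = np.array([2.0, 3.0, 1.0])      # inputs (features)
w = np.array([0.4, 0.2, 0.6])      # weights
b = 0.1                            # bias
target = 1.0                       # desired output (ground truth)
learning_rate = 0.5

for epoch in range(3):
    # Forward propagation: weighted sum of inputs plus bias, then activation
    z = np.dot(w, x) + b
    y_hat = sigmoid(z)
    loss = 0.5 * (y_hat - target) ** 2

    # Back propagation: gradient of the loss with respect to the weights and bias (chain rule)
    dloss_dz = (y_hat - target) * y_hat * (1 - y_hat)
    dloss_dw = dloss_dz * x
    dloss_db = dloss_dz

    # Learning rule: gradient-descent update of the weights and bias
    w = w - learning_rate * dloss_dw
    b = b - learning_rate * dloss_db
    print(f"epoch {epoch}: prediction = {y_hat:.3f}, loss = {loss:.4f}")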
6.3. WORKING OF A NEURAL NETWORK
Teachers can ask the following question:
When you practice a new skill, like riding a bike, what happens with your physical
movements and how do you eventually improve? How might similar adjustments
happen within a neural network during training?
Imagine each node as a simple calculator. It takes input numbers, multiplies them
by certain values (weights), adds them together, adds an extra number (bias), and then gives
an output. This output is used as input for the next node in the network. The formula would
look something like this:
∑wixi + bias = w1x1 + w2x2 + w3x3 + bias
Fig 6.2 illustrates that the output of a neuron can be expressed as a linear
combination of weight ‘w’ and bias ‘b’, expressed mathematically as w * x + b.
Each input is assigned a weight to show its importance. These weights are used to
calculate the total input by multiplying each input with its weight and adding them together.
This total input then goes through an activation function, which decides if the node should
"fire" or activate based on the result. If it fires, the output is passed to the next layer. This
process continues, with each node passing its output to the next layer, defining the network
as a feedforward network. Let us look at what a single node might look like using binary values.
138
Let us see a simple problem.
CASE I: Let the features be represented as x1,x2 and x3.
Input Layer:
Feature 1, x1 = 2
Feature 2, x2 = 3
Feature 3, x3 = 1
Hidden Layer:
Weight 1, w1 = 0.4
Weight 2, w2 = 0.2
Weight 3, w3 = 0.6
bias = 0.1
threshold = 3.0
Weighted sum = ∑wixi + bias = (0.4 × 2) + (0.2 × 3) + (0.6 × 1) + 0.1 = 2.1
Since 2.1 is less than the threshold of 3.0, the neuron does not activate and its output is 0.
CASE II
Let's say we have another neuron in the output layer with the following weights and bias:
w1 = 0.7
w2 = 0.3
bias = 0.2
The output of the hidden layer (0) is passed as input to the output layer:
Solution:
1. Let us assume that there are three factors influencing your decision-making:
• Are the waves good? (Yes: 1, No: 0)
• Is the line-up empty? (Yes: 1, No: 0)
• Has there been a recent shark attack? (Yes: 0, No: 1)
2. Then, let us assume the following, giving us the following inputs:
• X1 = 1, since the waves are pumping
• X2 = 0, since the crowds are out
• X3 = 1, since there has not been a recent shark attack
3. Now, we need to assign some weights to determine importance. Larger weights signify that
particular variables are of greater importance to the decision or outcome.
• W1 = 5, since large swells do not come often
• W2 = 2, since you are used to the crowds
• W3 = 4, since you have a fear of sharks
Factor Input Weight
Wave Quality 1 5
Lineup Congestion 0 2
Any Shark Activity 1 4
4. Finally, we will assume a threshold value of 3, which would translate to a bias value of –3.
With all the various inputs, we can start to plug in values into the formula to get the desired output.
ŷ = (1*5) + (0*2) + (1*4) – 3 = 6
Because the bias of –3 already builds the threshold of 3 into ŷ, the node fires when ŷ is greater than 0:
If ŷ > 0, then output = 1
If ŷ ≤ 0, then output = 0
Since 6 is greater than 0, we can determine that the output of this node would be 1. In this instance,
you would go surfing.
If we adjust the weights or the threshold, we can achieve different outcomes from the model.
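The same decision can be written as a few lines of Python. This is only a sketch of the worked example above, with the threshold of 3 folded into a bias of -3.
# Perceptron-style decision for the surfing example
def node_output(inputs, weights, bias):
    # weighted sum of the inputs plus the bias
    y_hat = sum(x * w for x, w in zip(inputs, weights)) + bias
    # the node "fires" (outputs 1) only when the result is greater than 0
    return 1 if y_hat > 0 else 0

inputs = [1, 0, 1]       # good waves, line-up not empty, no recent shark attack
weights = [5, 2, 4]      # importance of each factor
bias = -3                # threshold of 3 expressed as a bias of -3

print(node_output(inputs, weights, bias))   # prints 1, so you would go surfing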
6.4. TYPES OF NEURAL NETWORKS
Neural networks can be classified into different types, which are used for different
purposes. While this is not a comprehensive list, the types below are representative of the most
common neural networks that you will come across and their common use cases:
Fig 6.3
6.4.2 Feed Forward Neural Network (FFNN):
FFNNs, also known as multi-layer perceptrons (MLPs), have an input layer, one or more
hidden layers, and an output layer. Data flows only in one direction, from input to output.
They use activation functions and weights to process information in a forward manner
(Fig 6.4). FFNNs are efficient for handling noisy data and are relatively straightforward to
implement, making them versatile tools in various AI applications.
Application: used in tasks like image recognition, natural language processing (NLP), and
regression.
141
Application: Dominant in computer vision for tasks such as object detection, image
recognition, style transfer, and medical imaging.
[To see the working of a CNN, you may watch the video on
https://www.youtube.com/watch?v=K_BHmztRTpA&t=1s ]
Application: Widely employed in generating synthetic data for various tasks like image
generation, style transfer, and data augmentation.
142
6.5. FUTURE OF NN AND ITS IMPACT ON SOCIETY:
Project: Identifying Animals & Birds (using Machine Learning for Kids)
● After adding labels & contents, click on "Describe your model" to know about the neural
network; the neural network related to the model will be displayed.
● Click on "Next" to see the steps of how Deep Learning works.
6.6.2 Activity 2:
Problem: Convert from Celsius to Fahrenheit where the formula is: f=c×1.8+32
It would be simple enough to create a conventional Python function, but that wouldn't be
machine learning. Instead, we will give TensorFlow some sample Celsius values and their
corresponding Fahrenheit values. Then, we will train a model that figures out the above
formula through the training process.
#Importing Libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
#Training Data
c = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
f = np.array([-40, 14, 32, 46, 59, 72, 100], dtype=float)

#Creating a model
#Since the problem is straightforward, this network will require only a single layer, with a single neuron.
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])

#Compile, loss, optimizer
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(0.1),
              metrics=['mean_squared_error'])

#Train the model
history = model.fit(c, f, epochs=500, verbose=False)
print("Finished training the model")

#Training Statistics
plt.xlabel('Epoch Number')
plt.ylabel("Loss Magnitude")
plt.plot(history.history['loss'])
plt.show()

#Predict Values
print(model.predict(np.array([100.0])))

Notes:
• The model definition takes a list of layers as argument, specifying the calculation order from the input to the output.
• input_shape=[1] → This specifies that the input to this layer is a single value, i.e., the shape is a one-dimensional array with one member.
• units=1 → This specifies the number of neurons in the layer. The number of neurons defines how many internal variables the layer has to try to learn to solve the problem.
• Loss function — A way of measuring how far off the predictions are from the desired outcome. The measured difference is called the "loss".
• Optimizer function — A way of adjusting internal values in order to reduce the loss.
146
6.6.3 Activity 3: (**For advanced learners)
The following python program creates and trains a simple artificial neural network
(ANN) using tensorflow and keras to predict Fahrenheit temperatures based on Celsius
temperatures.
In [ ]: #Import the necessary files
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
temp_df = pd.read_csv('cel_fah.csv')
temp_df.head()
plt.scatter(temp_df['Celsius'], temp_df['Fahrenheit'])
X_train = temp_df['Celsius']
y_train = temp_df['Fahrenheit']
# In this neural network, layers are added sequentially, one on top of the other.
model = tf.keras.Sequential()
# Define the network with 1 node in the input layer, 32 nodes in the first hidden layer
model.add(tf.keras.layers.Dense(units=32, input_shape=(1,)))
# now we are adding one more hidden layer to the network with 32 nodes
model.add(tf.keras.layers.Dense(units=32))
# now adding the output layer
model.add(tf.keras.layers.Dense(units=1))
In [ ]: model.summary()
'''Shows the architecture of the model, including the layers and their order.
The output shape of each layer.
The number of parameters (weights) in each layer.
The total number of trainable parameters in the model.'''
In [ ]: # train the network model by iterating 30 times with 20% of data used for validation
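A minimal sketch of this compile-and-train step, assuming mean squared error loss and the Adam optimizer (the exact settings may differ), is given below; .values simply converts the pandas Series to NumPy arrays.
In [ ]: # assumed completion of the training cell
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss='mean_squared_error')
epochs_hist = model.fit(X_train.values, y_train.values, epochs=30, validation_split=0.2)

# the training and validation loss can then be plotted
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.show()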
TensorFlow Playground home screen
149
7. Regularization
The purpose of L1 and L2 regularization is to remove or reduce overfitting.
8. Output
Check the model performance after training the neural network. Observe the Test
loss and Training loss of the model.
EXERCISES
A. Multiple Choice Questions:
1. What is a neural network?
A. A biological network of neurons in the brain.
B. A machine learning model inspired by the human brain.
C. A type of computer hardware used for complex calculations.
D. A mathematical equation for linear regression.
2. What are neurons in a neural network?
A. Cells in the human brain.
B. Mathematical functions that process inputs and produce outputs.
C. Nodes that make up the layers of a neural network.
D. None of the above.
3. What is the role of activation functions in neural networks?
A. They determine the learning rate of the network.
B. They introduce non-linearity, allowing the network to learn complex patterns.
C. They control the size of the neural network.
D. They define the input features of the network.
4. What is backpropagation in neural networks?
A. The process of adjusting weights and biases to minimize the error in the network.
B. The process of propagating signals from the output layer to the input layer.
C. The process of initializing the weights and biases of the network.
D. The process of defining the architecture of the neural network.
5. Which type of neural network is commonly used for image recognition?
A. Feedforward neural network.
B. Convolutional neural network.
C. Recurrent neural network.
D. Perceptron.
6. How do neural networks learn from data?
A. By adjusting their weights and biases based on the error in their predictions.
B. By memorizing the training data.
C. By using a fixed set of rules.
D. By ignoring the input data.
152
3. What are the key components of a neural network?
Ans- The key components of a neural network include neurons, connections
(weights), activation functions, and biases.
5. What are some common types of neural networks and their applications?
Ans- Common types of neural networks include feedforward neural networks (used
for general-purpose learning tasks), convolutional neural networks (used for image
recognition and computer vision), and recurrent neural networks (used for sequential
data processing, such as natural language processing).
FNN: Data flows only in one direction, from input to output. They use activation functions and
weights to process information in a forward manner. This makes them suitable for tasks with
independent inputs, like image classification.
RNN: They have feedback connections that allow data to flow in a loop. If the prediction is
wrong, the learning rate is employed to make small changes, gradually moving towards the
right prediction during backpropagation. This feedback enables RNNs to remember prior
inputs, making them ideal for tasks where context is important.
153
8. What are the potential future developments and trends that will shape the evolution of
neural networks?
i) The sophisticated NN algorithms have significantly enhanced efficiency and productivity
by automating tasks, streamlining processes, and optimizing resource allocation in
industries ranging from manufacturing to finance.
ii) Neural networks have personalized products and services by analysing vast datasets,
leading to tailored recommendations and experiences that cater to individual preferences
and needs.
iii) Economically, the adoption of neural networks has spurred innovation, driven economic
growth, and created new job opportunities in burgeoning fields like data science and
artificial intelligence.
9. What are the four ways in which neural networks help A.I. grow?
i) Advancing Deep learning
ii) Improving Accuracy and Efficiency
iii) Supporting Autonomous Systems
iv) Personalizing Experience
2. Describe the structure of a neural network and the role of each component, including
neurons, connections, activation functions, and biases.
Ans- A neural network consists of layers of neurons organized in a specific structure.
The input layer receives data, which is then passed through one or more hidden
layers before reaching the output layer. Neurons in each layer are connected to
neurons in the adjacent layers through connections (weights). Each neuron computes
a weighted sum of its inputs, applies an activation function to produce an output, and
passes this output to the next layer. Activation functions introduce non-linearity to
the network, allowing it to model complex patterns. Biases are constants added to
the weighted sum before applying the activation function, helping the network
account for any inherent bias in the data.
154
3. Explain how backpropagation works in neural networks and its importance in training the
network.
Ans- Backpropagation is an algorithm used to train neural networks by adjusting the
weights and biases based on the error in the network's predictions. It works by first
making a forward pass through the network to make predictions, then calculating the
error between the predicted output and the actual output. The algorithm then makes
a backward pass through the network, computing the gradient of the error with
respect to each weight and bias. These gradients are used to update the weights and
biases using an optimization algorithm such as gradient descent. Backpropagation is
crucial for training neural networks as it allows the network to learn from the error
and improve its predictions over time.
4. Discuss the different types of neural networks, including feedforward neural networks,
convolutional neural networks, and recurrent neural networks. Provide examples of real-
world applications where each type is used.
Ans- Feedforward neural networks are the simplest type of neural network and are
used for general-purpose learning tasks such as classification and regression.
Convolutional neural networks (CNNs) are specialized for image recognition tasks
and are used in applications like facial recognition and object detection. Recurrent
neural networks (RNNs) are designed for sequential data processing and are used in
applications such as speech recognition and language translation.
2. You are planning to go out for dinner and are deciding between two restaurants. You
consider three factors which are food quality, ambience, and distance. Using a neural
network concept, determine the inputs and weights.
Ans-
155
2. Inputs and Weights:
• ( X1 = 1) (Food Quality is high)
• ( X2 = 1) (Ambience is cozy)
• ( X3 = 1 ) (Distance is nearby)
Weights:
• ( W1 = 3) (Food Quality importance)
• ( W2 = 2) (Ambience importance)
• ( W3 = 1) (Distance importance)
3. You are deciding between two colleges and are considering three factors: academic
reputation, campus facilities, and tuition fees. Using a neural network concept: What would
be the inputs and weights? Apply the formula and calculate the outcome.
156
4. You work for a company that develops software for recognizing handwritten digits. Your
task is to create a neural network model to accurately classify handwritten digits from the
dataset. Which neural network would you use for this?
5. You have built a CNN model for the task in Case study – 2. List the training process that
you will use to train the neural network.
6. You want to do time series forecasting by feeding historical data into a neural network
and training it to predict future values based on past observations. Which neural network
should you use?
157
REFERENCES:
1. https://machinelearningforkids.co.uk/#!/projects
2. Types of Neural Networks and Definition of Neural Network (mygreatlearning.com)
3. https://web.pdx.edu/~nauna/week7b-neuralnetwork.pdf
4. https://realpython.com/courses/build-neural-network-python-ai/
5. https://www.geeksforgeeks.org/neural-networks-a-beginners-guide/
6. https://medium.com/data-science-365/overview-of-a-neural-networks-learning-process-61690a502fa
158
UNIT 7: Generative AI
Prerequisites:
1. Foundational understanding of AI concepts from class XI.
2. Understanding of basic Python, installation and importing of packages.
159
Unleashing Creativity with Machines: A Teacher's Guide to Generative AI
This lesson empowers you to introduce students to the fascinating world of Generative AI,
where machines learn to create entirely new content!
160
• LLM Applications: Highlight the applications of LLMs:
o Content creation assistance (e.g., writing summaries or marketing copy).
o Personalized education and learning experiences.
o Improved human-computer interaction through chatbots and virtual assistants.
• Exploring Challenges: Delve deeper into the ethical challenges of Generative AI:
o Deepfakes: Creating realistic but fabricated videos that can be used for malicious
purposes.
o Bias: Generative models can perpetuate biases present in the training data.
o Copyright: Questions arise regarding ownership of content created by AI.
o Transparency: Understanding how Generative AI models work and their
limitations.
• Fostering Responsible Development: Discuss the importance of responsible
development and deployment of Generative AI technologies. Encourage students to
think critically about the potential impact of this technology.
Additional Tips:
By incorporating these elements, you can equip students to understand the potential of
Generative AI for creative expression while fostering critical thinking about its ethical
implications and responsible use.
161
Teachers can ask the following questions to spark curiosity before starting the topics:
• Have you ever seen an image or video online that seemed too good to be true? How
could you tell if it might be fake? (This question prompts students to think critically
about the media they consume and the challenges of distinguishing real content from
artificial content. It sparks curiosity about how Generative AI can be used to create
realistic-looking fakes.)
• Imagine you could create new and interesting things using a computer program. What
kind of content would you generate (e.g., images, stories, music)? (This question taps
into students' creativity and gets them thinking about the potential applications of
Generative AI. It helps bridge the gap between understanding the technology and its
potential uses in various domains.)
In recent times, we have all come across such pieces of information being widely
circulated in news or online media. Celebrities and famous people are being targeted with
fake images, damaging their reputations and self-esteem. Additionally, some individuals are
using AI tools to generate content and claiming it as their own, thereby misusing the power
of AI technology. But do you know what is behind all of this? This chapter takes you on an
interesting journey where you will explore a new dimension of AI, known as Generative AI.
Fig. 7.1
Generative AI, a facet of artificial intelligence, creates diverse content like audio, text,
images, and more, aiming to generate new data resembling its training samples. It utilises
machine learning algorithms to achieve this, learning from existing datasets. Examples include
ChatGPT, Gemini, Claude, and DALL-E.
162
7.2 Working of Generative AI
Generative AI learns patterns from data and autonomously generates similar samples.
It operates within the realm of deep learning, employing neural networks to understand
intricate patterns. Models such as Generative Adversarial Networks (GANs) and Variational
Autoencoders (VAEs) facilitate tasks like image and text generation.
1. Generative Adversarial Networks (GANs): A Generative Adversarial Network is a type of
neural network architecture. It consists of two networks, a generator, and a discriminator,
that compete against each other. The generator generates new data samples, such as
images or text (which are fake), while the discriminator evaluates these samples to
distinguish between real and fake data. The generator aims to produce samples that are
indistinguishable from real data, while the discriminator aims to differentiate between real
and generated data. Through adversarial training, where these networks challenge one
another, GANs learn to generate increasingly realistic samples. GANs have been
successfully applied in various domains, including image generation, style transfer, and
data augmentation.
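For advanced learners, the generator-versus-discriminator training loop described above can be sketched in a few lines of Keras. This is only an illustration under simplified assumptions (the "real" data here is just numbers drawn from a normal distribution), not a full image-generating GAN.
# A minimal 1-D GAN sketch (illustration only)
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 8

# Generator: turns random noise into a fake 1-D sample
generator = tf.keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(1)
])

# Discriminator: outputs the probability that a sample is real
discriminator = tf.keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(1,)),
    layers.Dense(1, activation='sigmoid')
])
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

# Combined model used to train the generator while the discriminator is frozen
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer='adam', loss='binary_crossentropy')

batch = 32
for step in range(200):
    # 1. Train the discriminator on real samples (label 1) and fake samples (label 0)
    real = np.random.normal(loc=5.0, scale=1.0, size=(batch, 1))
    noise = np.random.normal(size=(batch, latent_dim))
    fake = generator.predict(noise, verbose=0)
    discriminator.train_on_batch(real, np.ones((batch, 1)))
    discriminator.train_on_batch(fake, np.zeros((batch, 1)))
    # 2. Train the generator to fool the discriminator (labels claim "real")
    noise = np.random.normal(size=(batch, latent_dim))
    gan.train_on_batch(noise, np.ones((batch, 1)))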
With the vast amount of data generated globally every day, acquiring new data is
straightforward. However, addressing this immense volume of data requires either the
development of new algorithms or the scaling up of existing ones. These algorithms, rooted in
mathematics—particularly calculus, probability, and statistics—can be broadly categorized into
two types:
1. Discriminative models 2. Generative models
163
Discriminative models focus on defining class boundaries within the data, making them
suitable for tasks like classification. On the other hand, generative models seek to comprehend
the underlying data distribution and generate new samples. In simple words, discriminative
models are used to distinguish between different categories or classes. These models learn the
boundary or difference between classes based on the features of the data. For example, we
want to identify an email as either "spam" or "not spam". A discriminative model would learn
the features like certain words or phrases that differentiate spam emails from non-spam
emails.
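A tiny illustration of such a spam classifier, using scikit-learn (an assumption beyond the syllabus), is shown below; the messages and labels are made up purely for demonstration.
# A discriminative spam / not-spam classifier on toy data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

messages = [
    "Win a free prize now", "Lowest price offer, click here",    # spam
    "Meeting moved to 3 pm", "Please review the attached notes"  # not spam
]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()            # turn each message into word counts (features)
X = vectorizer.fit_transform(messages)

model = LogisticRegression()              # learns a boundary between the two classes
model.fit(X, labels)

test = vectorizer.transform(["Free offer, click now"])
print(model.predict(test))                # expected to print ['spam']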
By efficiently tackling complex tasks, generative models play a crucial role in managing large
datasets.
Fig. 7.2
Generative AI vs Discriminative AI

Purpose (What is it for?)
● Generative AI: Helps create things like images and stories and finds unusual things. It learns from data without needing to be told precisely what to do.
● Discriminative AI: Helps determine what something is or belongs to by looking at its features. It is good at telling different things apart and making decisions based on that.

Models (What are they like?)
● Generative AI: Uses tricks like making things compete against each other or making guesses based on patterns to create new things.
● Discriminative AI: Learns by finding rules to separate things and recognise patterns, like understanding whether something is a dog or a cat.

Training Focus (What did they learn during training?)
● Generative AI: Tries to understand what makes data unique and how to create new data that are similar but different.
● Discriminative AI: Focuses on learning how to draw lines or make rules to tell things apart based on their features.

Application (How are they used in the real world?)
● Generative AI: Helps artists create new artworks, generate new ideas for stories, and find unusual patterns in data.
● Discriminative AI: Powers things like facial and speech recognition and helps make decisions like whether an email is spam or not.

Examples of Algorithms used
● Generative AI: Naïve Bayes, Gaussian discriminant analysis, GANs, VAEs, LLMs, DBMs, Autoregressive models
● Discriminative AI: Logistic Regression, Decision Trees, SVM, Random Forest
Text Generation:
Text generation is when computers write sentences that sound like people wrote them. It
involves creating written content that mimics human language patterns. These models analyse
text data to produce coherent and contextually relevant text. They learn from many written
words to create new sentences that make sense. For example, an AI tool/application might
compose a story that reads like a human authored it. Examples include OpenAI’s ChatGPT,
Perplexity and Google’s Bard (Gemini).
Fig. 7.4 (Source: Software Snapshot)

Video Generation:
It involves creating new videos by learning from existing ones, including animations and visual
effects. These models learn from videos to create realistic and unique visuals, producing new
videos that look authentic. For instance, an AI tool/application might generate a movie scene
that resembles professional filmmaking. Examples include Google’s Lumiere and Deepfake
algorithms for modifying video content.
Fig. 7.5 Video generation
165
Audio Generation:
Audio generation involves computers producing new sounds, such as music or voices, based
on sounds they have heard. It involves generating fresh audio content, including music, sound
effects, and speech, using AI models. These models derive inspiration from existing audio
recordings to generate new audio samples. They learn from existing sounds to create new
ones. For instance, an AI tool/application might compose a song that sounds like a real band
performed it. Examples include Meta AI’s Voicebox and Google’s Music LM.
Fig. 7.6 Google’s Music LM to generate music from text.
Teachers can ask the following questions to spark curiosity before starting the topics:
1. Can machines be creative? Why or why not? (This question primes students to consider
the capabilities of LLMs in tasks traditionally associated with human creativity, sparking
curiosity about how these models can be used for creative writing, music generation, or
other applications.)
2. Imagine you have a magic writing assistant that can help you with different creative
tasks. What kind of help would you find most beneficial? (e.g., brainstorming ideas,
checking grammar, generating different writing styles) (This question taps into
students' creativity and gets them thinking about the potential benefits of LLMs for tasks
like writing and content creation. It helps bridge the gap between understanding the
technology and its potential usefulness in various domains.)
A Large Language Model (LLM) is a deep learning algorithm that can perform a variety of
Natural Language Processing (NLP) tasks, such as generating and classifying text, answering
questions in a conversational manner, and translating text from one language to another.
Fig. 7.7
166
Large Language Models (LLMs) are called large because they are trained on massive datasets
of text and code. These datasets can contain trillions of words, and the quality of the dataset
will affect the language model’s performance.
Fig. 7.8
Image source: https://levelup.gitconnected.com/understanding-the-magic-of-large-language-model-architecture-transformer-models-exposed-f5164b5db174
Transformers in LLMs:
Transformers are a type of neural network architecture that has revolutionized the field
of Natural Language Processing (NLP), particularly in the context of Large Language Models
(LLMs). The primary use of transformers in LLMs is to enable efficient and effective learning of
complex language patterns and relationships within vast amounts of text data.
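As an optional hands-on demonstration (assuming the Hugging Face transformers package is installed, which is beyond the syllabus), a small pre-trained transformer language model can be run locally in a few lines. GPT-2 is an older, much smaller model than today's LLMs, but it illustrates the same next-word prediction idea.
# Generate text with a small pre-trained transformer model
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Artificial intelligence in the classroom can",
                   max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])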
Some leading Large Language Models (LLMs) are:
● OpenAI's GPT-4o: Multimodal and excels in processing and generating both
text and images.
● Google's Gemini 1.5 Pro: Integrates advanced multimodal capabilities for seamless text,
image, and speech understanding.
● Meta's LLaMA 3.1: Open-source and optimized for high efficiency in diverse AI tasks.
● Anthropic's Claude 3.5: Prioritizes safety and interpretability in language model
interactions.
● Mistral AI's Mixtral 8x7B: Implements a sparse mixture of experts for superior
performance with smaller model sizes.
Applications of LLMs:
● Text Generation: LLMs are primarily employed for text generation tasks, including
content creation, dialogue generation, story writing, and poetry generation. They can
produce coherent and contextually relevant text based on given prompts or input. Other
examples include
o translating natural language descriptions into working code and streamlining
development processes.
o autocompleting text and generating continuations for sentences or paragraphs,
enhancing writing tools, and email auto-completion.
● Audio Generation: While LLMs themselves do not directly generate audio signals, they
can indirectly influence audio generation tasks through their text-to-speech (TTS)
capabilities. By generating textual descriptions or scripts, LLMs enable TTS systems to
synthesize natural-sounding speech from text inputs.
● Image Generation: LLMs have been adapted for image captioning tasks, where they
generate textual descriptions or captions for images. While they do not directly generate
images, their understanding of visual content can be leveraged to produce relevant
textual descriptions, enhancing image accessibility and understanding.
● Video Generation: Similarly, LLMs can contribute to video generation tasks by
generating textual descriptions or scripts for video content. These descriptions can be
used to create subtitles, captions, or scene summaries for videos, improving
accessibility and searchability.
Limitations of LLM:
● Processing text requires significant computational resources, leading to high response
time and costs.
● LLMs prioritise fluent natural language over accuracy, which may result in generating
factually incorrect or misleading information with high confidence.
● LLMs might memorize specific details rather than generalize, leading to poor
adaptability.
Case Study: Demystifying LLaMA - A Publicly Trained Powerhouse in the LLM Landscape
Large Language Models (LLMs) are a type of deep learning algorithm revolutionizing
Natural Language Processing (NLP). These advanced AI models excel at various tasks,
including generating human-quality text, translating languages, and answering questions in a
conversational way. Their effectiveness hinges on the quality and size of the training data.
Traditionally, LLMs are trained on massive, proprietary datasets, limiting transparency and
accessibility. This case study explores LLaMA, a unique LLM developed by Meta AI that stands
out for its approach to training data and architecture.
168
LLaMA's Disruptive Training Approach: LLaMA leverages publicly available text and code
scraped from the internet for training. This focus on open data fosters transparency within the
research community and allows for wider accessibility. Additionally, LLaMA's developers
implemented efficient training techniques, requiring less computational power compared to
some LLMs. This translates to better scalability, making LLaMA a more feasible solution for
deployment on a wider range of devices with varying processing capabilities.
Fig. 7.8
Flexibility Through Multi-Model Design: LLaMA offers flexibility through its multi-model
design. Meta releases LLaMA in various sizes, ranging from 7 billion to 65 billion parameters.
This allows users to choose the model that best suits their specific needs and computational
resources. Smaller models can be deployed on devices with lower processing power, ideal for
everyday tasks. Conversely, larger models offer superior performance for complex tasks
requiring more intensive processing power.
Impressive Results Despite Publicly Trained Data: Despite its focus on efficient training with
publicly available data, LLaMA delivers impressive results. It benchmarks competitively or
even surpasses some larger LLMs on various NLP tasks. Areas of particular strength include
text summarization and question answering, showcasing LLaMA's ability to grasp complex
information and generate concise, informative outputs. This positions LLaMA as a valuable tool
with promising applications in education (personalized learning materials), content creation
(brainstorming ideas, generating drafts, translation), and research assistance (analysing vast
amounts of data).
169
7.6 Future of Generative AI
The future of AI focuses on evolving architectures to surpass current capabilities while
prioritizing ethical development to minimize biases and ensure responsible use. Generative AI
will address complex challenges in fields like healthcare and education, enhance NLP tasks like
multilingual translation, and expand in multimedia content creation. Collaboration between
humans and AI will deepen, emphasizing AI's role as a supportive partner across domains.
Teachers can ask the following questions to spark curiosity before starting the topics:
• Imagine you see a funny video online of a celebrity doing something strange. How can
you tell if the video might be fake? What are some reasons why someone might create
a fake video? (This question primes students to think critically about the authenticity of
online content and the potential motives behind creating deepfakes. It sparks a discussion
about the challenges deepfakes pose to trust in media and the spread of misinformation.)
• Have you ever seen an advertisement online that seemed to target you specifically?
How might companies use AI to personalize their marketing strategies? (This question
gets students thinking about the applications of AI in everyday life and how it can be used
to influence them. It can then be transitioned into a discussion about potential biases in
AI algorithms and the importance of fairness and transparency in AI development.)
Generative AI, with its ability to create realistic content such as images, videos, and text,
brings about a multitude of ethical and social considerations. While the technology offers
promising applications, its potential for misuse and unintended consequences raises
significant concerns. In this context, understanding the ethical and social implications of
generative AI becomes crucial for ensuring responsible development and deployment.
1. Deepfake Technology:
The emergence of deepfake AI technology, such
as DeepFaceLab and FaceSwap, raises concerns
about the authenticity of digital content.
Deepfake algorithms can generate compelling
fake images, audio, and videos, jeopardising trust
in media integrity and exacerbating the spread of
misinformation.
Examples: Deepfake AI tools, such as DeepArt's style transfer algorithms, can seamlessly
manipulate visual content, creating deceptive and misleading media. For instance, deepfake
videos have been used to superimpose individuals' faces onto adult content without their
consent, leading to privacy violations and reputational damage.
Fig. 7.8
3. Plagiarism:
Presenting AI-generated content as one's work, whether intentionally or unintentionally, raises
ethical questions regarding intellectual property rights and academic integrity. Moreover, if AI
output significantly resembles copyrighted material, it could potentially infringe upon copyright
laws, leading to legal ramifications.
4. Transparency:
Transparency in the use of generative AI is paramount to maintaining trust and accountability.
Disclosing the use of AI-generated content, particularly in academic and professional settings,
is essential to uphold ethical standards and prevent instances of academic dishonesty. Failure
to disclose AI use can erode trust and undermine the credibility of research and scholarly work.
Points to Remember:
● Be cautious and transparent when using generative AI.
● Respect copyright and avoid presenting AI output as your own.
● Consult your teacher/institution for specific guidelines.
Citing Sources with Generative AI:
● Intellectual Property: Ensure proper attribution for AI-generated content to respect
original creators and comply with copyright laws.
● Accuracy: Verify the reliability of AI-generated information and cite primary data
sources whenever possible to maintain credibility.
● Ethical Use: Acknowledge AI tools and provide context for generated content to
promote transparency and ethical use.
171
Citation Example:
1. Treat the AI as author: Cite the tool name (e.g., Bard) & "Generative AI tool" in the author
spot.
2. Date it right: Use the date you received the AI-generated content, not any tool release date.
3. Show your prompt: Briefly mention the prompt you gave the AI for reference (optional).
APA reference for text generated using Google Gemini on February 20, 2024:
(Optional in parentheses): "Prompt: Explain how to cite generative AI in APA style."
Bard (Generative AI tool). (2024, February 20). How to cite generative AI in APA style.
[Retrieved from [invalid URL removed]]
Note: Generative AI is a weak AI
172
Explore Results: Canva will generate multiple images based on your input. Browse through
the generated images to find the one that best matches your vision.
● Customize (Optional): If desired, you can customize the generated image by adjusting
colours, shapes, and backgrounds using Canva's editing tools.
● Download or Use in Design: Once satisfied with the image, you can download it to your
computer or directly incorporate it into your design project within Canva.
173
● Agree to Terms: Review and agree to the terms of service and privacy policy for Google
Gemini. This step may vary depending on your region and Google's current policies.
● Access Veed AI Text Generator: Log in to your Veed AI account and navigate to the Text
Generation tool.
● Enter Prompt: Input a concise prompt describing the text content you want to generate.
For example, "Write a short story about a lost astronaut searching for their way home."
● Generate Text: Click the "Generate Text" or similar button to initiate the text generation
process. Veed AI will analyse your prompt and generate a corresponding text output.
Review and use the generated text for your projects or creative endeavours.
174
Activity 4. Signing Up for Animaker Site URL: https://www.animaker.com/
● Visit Animaker Website: Open your web browser and navigate to the Animaker website.
● Click on Sign Up: Look for the "Sign Up" or "Get Started" button on the homepage and
click on it.
● Provide Details: Fill out the sign-up form with your email address, password, and other
required information. Once completed, click "Sign Up" to create your Animaker account.
Instructions for Using Animaker AI Video Generation
● Access Animaker AI Video Tool: Log in to your Animaker account and navigate to the AI
Video Generation tool.
● Enter Prompt: Input a detailed and descriptive prompt that outlines the video content
you want to generate. For example, "Create a promotional video for a new product
launch featuring animated characters and dynamic visuals."
● Generate Video: Click the "Generate Video" or similar button to initiate the video
generation process. Animaker AI will analyse your prompt and generate a corresponding
video output. Review and customise the generated video for your marketing campaigns,
presentations, or storytelling projects.
175
Activity 5. ChatGPT Site URL: https://chat.openai.com/
176
escalate the query to a human agent, providing initial support and gathering relevant
information. This not only improves customer satisfaction by offering quick responses but also
reduces the burden on human agents, allowing them to focus on more critical tasks.
Beyond customer service, customized chatbots find applications across various sectors. In
education, they can provide personalized tutoring, answer questions, and handle
administrative tasks. Researchers can leverage chatbots to analyse data, summarize research
papers, and identify relevant articles. In healthcare, chatbots can offer initial medical advice,
schedule appointments, and provide mental health support. For the general public, they can
be used for language learning, financial advice, and cultural insights. By tailoring chatbots to
specific needs, individuals and organizations can unlock their full potential and enhance
productivity, efficiency, and user experience.
Let us learn to create a customized chatbot with the help of the Gemini API. We can follow
these steps to create a chatbot:
• Click on the Create API key button.
• Click on Copy and keep the API key safe.
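Once the API key has been created and copied, a minimal command-line chatbot can be sketched in Python. This assumes the google-generativeai package is installed (pip install google-generativeai); the model name used here is an example and may need to be replaced with a currently available Gemini model.
# A minimal Gemini-based chatbot sketch
import google.generativeai as genai

genai.configure(api_key="PASTE_YOUR_API_KEY_HERE")    # the key copied in the previous step
model = genai.GenerativeModel("gemini-1.5-flash")     # example model name
chat = model.start_chat()                             # keeps the conversation history

while True:
    user_message = input("You: ")
    if user_message.lower() in ("quit", "exit"):
        break
    response = chat.send_message(user_message)
    print("Bot:", response.text)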
EXERCISES
A. Multiple Choice Questions
1. What is the primary objective of generative AI?
a) To classify data into different categories
b) To generate new data resembling its training samples
c) To learn from labelled data for decision-making
d) To predict outcomes based on input features
180
5. In image generation, what analogy describes how generative AI models work?
a) Computers analyse sounds to produce new images.
b) Computers make new pictures based on existing ones, resembling patterns they
have learned.
c) Computers generate images by understanding class boundaries in data.
d) Computers utilise labelled data to predict outcomes.
8. Which technology raises concerns about the authenticity of digital content and the
spread of misinformation?
a) Deepfake AI
b) Generative Adversarial Networks (GANs)
c) Variational Autoencoders (VAEs)
d) Convolutional Neural Networks (CNNs)
B. True/False
1. Generative AI can only generate text and images, not audio or video. (False)
2. Generative Adversarial Networks (GANs) involve competition between the Generator
and the Discriminator networks. (True)
3. Discriminative models generate new data samples similar to the training data. (False)
4. Variational Autoencoders (VAEs) are AI generative models that can be used for tasks
such as image generation. (True)
5. Deepfake technology is an application of generative AI that can create realistic fake videos. (True)
6. Large Language Models (LLMs) like GPT and BERT cannot generate human-like text.
(False)
C. Fill in the Blanks
1. Generative AI utilises machine learning algorithms to learn from existing datasets.
2. Examples of Generative AI include ChatGPT, DALL-E, Gemini, and Claude
3. The full form of GAN is Generative Adversarial Networks.
4. Image Generation is an application of Generative AI where computers create new
images that resemble the ones they have seen before.
5. OpenAI’s ChatGPT and Google’s Bard (Gemini) are examples of Generative AI used for
Text Generation.
6. Large Language Models (LLMs), like GPT models developed by OpenAI, are trained on
vast amounts of text data to understand and generate human-like text.
6. What is a Large Language Model (LLM)?
Ans- LLMs are sophisticated AI models trained on vast text data to understand and generate
human-like text, excelling in natural language processing tasks.
9. What are the limitations and risks involved with Large Language Models?
Ans- Limitations include high computational costs, the potential for generating incorrect
information, and data privacy concerns.
3. Identify one potential ethical consideration the agency must address when using Generative
AI in advertising.
Ans- The agency must ensure that the AI-generated content does not inadvertently propagate
biases or stereotypes, maintaining ethical standards in representation and messaging.
4. What is a significant advantage of using Generative AI for dynamic video ad creation?
Ans- Generative AI can rapidly produce diverse and innovative video content tailored to
different platforms and audience preferences, significantly reducing production time and costs
while enhancing creativity.
5. How can "Creative Horizons" ensure their AI-generated content's originality and copyright
compliance?
Ans- The agency should implement checks to verify the uniqueness of the content, use AI
models trained on copyright-compliant datasets, and provide proper attribution or licenses for
AI-generated outputs, ensuring compliance with copyright laws and ethical guidelines.
F. Ethical Dilemma
Read the following ethical dilemma and provide your response:
Ethical Dilemma Scenario:
An AI development company, "InnovateAI," has created a new Generative AI model that
can produce original music tracks by learning from a vast database of songs across various
genres. The AI's capability to generate music that rivals compositions by human artists has
attracted significant attention in the music industry. However, "InnovateAI" faces an ethical
dilemma: the AI model inadvertently replicates distinctive styles and melodies of existing
copyrighted works, raising concerns about copyright infringement and the originality of AI-
generated music. Furthermore, the company discovered that some AI-generated tracks
contain elements remarkably similar to unreleased songs by living artists, likely due to
including these tracks in the training data without consent.
Discussion Question:
Should "InnovateAI" release this music-generating AI to the public, considering the potential
for copyright infringement and ethical concerns regarding the originality of AI-generated
content? What measures should be taken to address these ethical and legal issues while
advancing technological innovation in the music industry?
Response:
"InnovateAI" faces a complex situation that balances the fine line between innovation and
ethical responsibility. Before releasing the AI model to the public, the company should take
several steps to mitigate potential legal and ethical issues:
1. Transparency: "InnovateAI" should be transparent about the AI's capabilities and
limitations, including the potential for generating content that may resemble existing
copyrighted works. This transparency should extend to how the model was trained,
including the sources of its training data.
2. Consent and Copyright Compliance: The company must ensure that all data used to train
the AI model is either in the public domain, copyrighted with permission, or used under
fair use provisions. This may involve auditing the training dataset to remove any
copyrighted material included without consent.
3. Creative Attribution: "InnovateAI" should consider implementing a system for
attributing the creative influences of AI-generated music. This could involve tagging AI-
generated tracks with metadata referencing the styles or artists that influenced the AI's
composition, indirectly acknowledging human artists' contributions.
4. User Guidelines: Providing clear guidelines regarding the ethical use of AI-generated
music, including advising against passing off AI-generated compositions as entirely
original works without acknowledging the AI's role.
5. Technological Solutions: Developing and integrating technology that can detect and flag
potential copyright issues in AI-generated compositions before making them public.
This tool could help identify elements too closely resembling existing copyrighted
works, allowing for revision or alteration.
6. Engaging with Stakeholders: "InnovateAI" should engage in dialogue with copyright
holders, artists, and legal experts to explore collaborative solutions that respect
copyright while fostering innovation. This could include licensing agreements or
partnerships with music publishers.
By taking these measures, "InnovateAI" can address ethical concerns head-on while
contributing positively to the music industry. It acknowledges the importance of balancing
innovation with respect for existing creative works and copyright laws, ensuring that
advancements in AI benefit all stakeholders in the music ecosystem.
Note to Teachers: Teachers may discuss any other ethical concerns with the class.
1. Anita works for a social media company that is exploring the use of Generative Adversarial
Networks (GANs) to generate personalized content for its users. However, there are concerns
about the ethical implications of using AI to manipulate users' perceptions and behaviours.
What are some potential ethical concerns associated with using GANs to generate
personalized content on social media platforms?
Ans- Some potential ethical concerns include:
• Manipulation of user perceptions, potentially leading to misinformation or exploitation.
• Invasion of privacy.
• Amplification of biases, leading to unfair treatment or discrimination.
2. A research team is using Generative AI to generate synthetic data for training medical
imaging algorithms to detect rare diseases. However, there are concerns about the potential
biases and inaccuracies in the generated data. What steps can the research team take to
mitigate biases and ensure the accuracy and reliability of the synthetic data generated by
Generative AI for medical imaging applications?
Ans- The research team can take the following steps:
• Ensure that the training data used to train the Generative AI model represent a diverse
range of demographics, including underrepresented groups, to mitigate biases.
• Validate the synthetic data generated by the model against real medical imaging data
to ensure accuracy and reliability.
• Involve domain experts, such as medical professionals and imaging specialists, in reviewing and validating the generated data.
• Continuously refine and improve the Generative AI model based on feedback from
domain experts and the performance of the medical imaging algorithms trained on the
synthetic data.
4. An interior design company wants to use Generative AI to create room renderings for client
presentations. They aim to ensure that the generated designs are both attractive and
practical. How can the interior design company ensure that the room renderings generated by
Generative AI are both visually appealing and functional?
Ans- The interior design company can:
• Curate diverse room design data for training.
• Incorporate design principles into the Generative AI model.
• Seek feedback from experienced interior designers.
5. A marketing agency needs visually appealing social media posts for an advertising
campaign. How can the marketing agency use Generative AI to create diverse and engaging
visuals for their campaign?
Ans- The marketing agency can:
• Collect relevant images for training the Generative AI model.
• Generate visuals aligned with the campaign's branding.
• Review and refine the generated visuals for maximum impact.
References/Links of images used in the lesson
● https://www.hindustantimes.com/ht-img/img/2023/04/25/550x309/Anand_mahindra_shares_video_of_a_girl_aging_1682405476306_1682405479636.png
● https://fdczvxmwwjwpwbeeqcth.supabase.co/storage/v1/object/public/images/d6546a37-4ba1-4a5d-87b1-a698c3960c20/fa7f7d26-0ba6-4275-b469-f5a005828775.png
● https://uploads-ssl.webflow.com/61dfc899a471632619dca9dd/62f2dd5573c01a4523b4ace6_Deepfake-Optional-Body-of-Article.jpeg
● https://arxiv.org/html/2405.11029v1#S6
UNIT 8: Data Storytelling
Title: Data Storytelling
Approach: Team discussion, Web search, Case studies
Summary: Students will learn about the importance of storytelling, which has been
used for ages to share knowledge, experiences, and information. They will also
understand how to connect storytelling with data storytelling, a key part of Data
Analysis. This lesson will teach them to combine the three elements of data
storytelling—data, visuals, and narrative—to present complex information engagingly
and effectively. This helps the audience make informed decisions at the right time.
Learning Objectives:
1. Students will understand the benefits and importance of powerful storytelling.
2. Students will appreciate the concept of data storytelling in data analysis,
which is a key part of data science and AI.
3. Students will learn how to combine the elements of data storytelling—data,
visuals, and narrative—to present complex information.
4. Students will learn how to draw insights from a data story.
Key Concepts:
1. Introduction to Storytelling
2. Elements of a Story
3. Introduction to Data Storytelling
4. Why is Data Storytelling Powerful?
5. Essential Elements of Data Storytelling
6. Narrative Structure of a Data Story (Freytag’s Pyramid)
7. Types of Data and Visualizations for Different Data
8. Steps to Create a Story Through Data
9. Ethics in Data Storytelling
Learning Outcomes:
Students will be able to -
1. Identify the difference between storytelling and data storytelling.
2. Understand the key elements of data storytelling.
3. Recognize the importance of data storytelling today.
4. Use the appropriate type of visual for the data.
5. Draw insights from data stories and write simple narratives based on the
visuals.
Unveiling the Power of Data Storytelling: A Teacher's Guide
This lesson empowers you to equip students with the art of Data Storytelling –
transforming data into captivating narratives that inform, persuade, and inspire.
Benefits of data storytelling include that it:
o Persuades audiences to take action based on evidence.
o Creates a deeper emotional connection with the data.
Additional Tips:
• Encourage students to find creative ways to present their Data Stories (e.g.,
infographics, videos).
• Integrate technology tools like data visualization platforms to enhance
storytelling.
• Discuss ethical considerations when presenting data (e.g., avoiding bias, data
source transparency).
Teachers can ask the following questions to spark curiosity before starting the
topics:
• Do you like listening to stories? Why or why not? (This taps into students'
prior knowledge about stories and their enjoyment of them)
• Can you think of any examples of stories that you've heard that taught you
something? (This encourages students to make the connection between
stories and learning)
• Do you think facts and figures can be presented in a story format? Why or
why not? (This gets students thinking about the possibility of data being
presented in a story)
So, what are Stories?
Stories are a valuable form of human expression. They connect us closely with one
another and transport us to different places and times. Stories can be of various
types, like folk tales, fairy tales, fables, and real-life stories. Each type of story creates
a sense of connection, and folk tales, in particular, strengthen our sense of belonging
to our community and help establish our identity.
Every story has a theme or topic. There is always a storyteller and a listener, and
sometimes the listeners can be a group of people. According to the dictionary, a
'story' is a 'factual or fictional narrative,' meaning it tells about an event that can be
true or made up, in a way that the listener experiences or learns something. Stories
can be used to share information, experiences, or viewpoints.
Elements of a Story
1. Characters: The people, animals, or objects featured in a story. They perform the actions and drive the story forward.
2. Plot/Setting: Setting refers to the time and location in which the story takes place. Plot refers to the sequence of events in the story.
3. Conflict: The problem or situation the characters are dealing with. It drives the story forward and is the key element that keeps the story engaging.
4. Resolution: The end of the story, where the characters resolve the conflict. It is the stage after the climax, which is the peak of the story.
5. Insights: The ability to gain a clear, deep, and sometimes sudden understanding of a complicated problem or situation.
Activity 1:
Think of different types of stories (a real story, a mythological story, a fictional story, a folk tale) and then complete the table according to the given headings.
Sample answers (Name of the story, Type of the story, Characters, Insight Gathered/Moral):
• The Diary of Anne Frank (Real Story) - Characters: Anne Frank, Otto Frank (father), Edith Frank (mother), Margot Frank (sister). Insight/Moral: Even in difficult circumstances, hope and the human spirit can endure.
• The Lion King (Mythological Story; resembles coming-of-age stories) - Characters: Simba (lion cub), Mufasa (father), Scar (uncle), Timon & Pumbaa (meerkat & warthog), Rafiki (mandrill). Insight/Moral: The circle of life, overcoming adversity, and the importance of responsibility.
• The Adventures of Pinocchio (Fiction Story) - Characters: Pinocchio (wooden puppet), Geppetto (carver), Jiminy Cricket (conscience), The Fox & The Cat (tricksters), The Blue Fairy. Insight/Moral: Honesty, hard work, and good choices lead to a happy life.
• The Three Little Pigs (Folk Tale) - Characters: Three little pigs, Big Bad Wolf. Insight/Moral: Preparation and hard work are rewarded, while laziness has consequences.
When we interpret such data in a systematic way and present it as a narrative, the practice is known as Data Storytelling. It is increasingly used by analysts and data scientists to communicate their findings and observations from data to technical and non-technical business stakeholders, generally referred to as the audience.
Data storytelling is the art and practice of translating complex data and analytics
into a compelling narrative that is easily understandable and relatable to various
audiences.
Data can be as simple as numbers and digits. When data is represented pictorially, it is known as Data Visualization, which can take the form of different types of charts or graphs. Depending on the requirements, data can also be interpreted in the form of narratives known as Data Stories, which reduce ambiguity, make the context clear, and convey the right meaning for an effective decision-making process.
Activity 3:
Show the following content to the students on screen for approximately 2 minutes.
Story version: A police officer pulls over a car for driving too slowly on the highway. "This is a highway, you must drive at least 80 km/hr." "But the sign says 20!" said the driver. "This is highway route 20, not speed limit 20." The officer then sees the passenger is unconscious. "Is everything OK?" "She's been like that since we turned off highway 180."
Factual version: An established finding in traffic safety is that vehicles should maintain speeds consistent with those of neighboring vehicles. For this reason, federal agencies recommend a minimum speed of eighty kilometers per hour to be adopted by local, state, and provincial governments. Nearly all local governments have conformed to this recommendation as it promotes safety and minimizes confusion when passing from one jurisdiction to another or when merging from one highway onto another.
Now, ask them: "What is the minimum speed limit on a highway?"
With these two activities, we understand that data, when presented in a narrative
format, is better absorbed, retained, and understood compared to a collection of
disjointed facts or figures. Just like the effectiveness of storytelling in memory
retention, data storytelling enhances the comprehension and impact of data insights
by providing context and structure.
Fig. 8.3 Data Storytelling: Makes information more memorable, engaging, and persuasive
The need for data storytelling is gaining importance in all fields. Many
companies and brands are using data storytelling as an effective method of
conveying their message and gaining client loyalty. Data storytelling makes complex
data more accessible and understandable, allowing audiences or stakeholders to
grasp insights easily. Engaging narratives and compelling visuals keep audiences
engaged, increasing retention and attention. Storytelling with data empowers better
decision-making by presenting evidence-based insights.
Examples of some famous brand data stories include Spotify and Uber.
Data: Raw or basic facts about any entity are known as data. Data is the primary building block of every data story. It serves as the foundation for the narrative and visual elements of your story.
Narrative structure of a data story
Most stories follow a common arc – a protagonist who faces a complication
goes on a journey of resolving a difficulty before returning to their normal lives.
Building on Aristotle’s simple model, Freytag developed a more robust narrative framework to better understand the arc or progression of a story. This “pyramid-based” dramatic structure has five key stages:
1. Introduction: The beginning of the story when the setting is established, and
main characters are introduced. It provides the audience with ample
background information to understand what is going to happen.
2. Rising action: The series of events that build up to the climax of the story.
3. Climax: The most intense or important point within the story. It is often an
event in which the fortune of the protagonist turns for the better or worse in
the story.
4. Falling action: The rest of the events that unravel after the main conflict has
occurred, but before the final outcome is decided.
5. Conclusion: The conclusion of the story where all of the conflicts are resolved
and outstanding details are explained.
Visualizations for different data
Data visualization is a powerful way to show context. Data charts can reveal
crucial deviations or affinities in the data that can lead to insights.
Common data types and suitable visualizations:
• Text data - Word Cloud
• Mixed data - FacetGrid
• Data that changes over a period of time - Line Graph
• Numeric data - Bar Chart (compares data between categories using bars), Pie Chart, Scatter Plot, Histogram, Heat Map
• Stocks data - Candlestick Chart
• Geographic data - Map Chart
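For classes that wish to try these charts hands-on, a short Python sketch can produce a bar chart and a line chart. This is only a sketch: it assumes the matplotlib library is installed alongside NumPy and Pandas from earlier units, and all values below are hypothetical sample data.

# Hypothetical sample data; assumes matplotlib is installed.
import matplotlib.pyplot as plt

# Bar chart: compares numeric data between categories.
subjects = ["Maths", "Science", "English", "AI"]
average_marks = [72, 81, 68, 90]
plt.bar(subjects, average_marks)
plt.title("Average Marks by Subject")
plt.ylabel("Marks")
plt.show()

# Line chart: shows how a value changes over time.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
attendance = [88, 85, 90, 93, 95]
plt.plot(months, attendance, marker="o")
plt.title("Monthly Attendance (%)")
plt.ylabel("Attendance")
plt.show()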
Steps to Create a Story Through Data
The narrative should take the focus of the audience to the correct spot and not miss out on important facts. To find compelling stories in data sets, the following steps are to be followed:
1. Collect the data and organize it.
2. Use proper visualization tools to visualize the data.
3. Observe the relationships between the data.
4. Create a simple narrative, hidden in the data, to be communicated to the audience.
Example 1:
Using available data on student enrollment, attendance, and dropout rates,
create a compelling data story that explores the impact of the Mid-Day Meal Scheme
(MDMS) since its launch in 1995. Uncover trends, patterns, and correlations in the
data to tell a story about how the implementation of the MDMS may have influenced
dropout rates in the state over the years. Consider incorporating visualizations,
charts, and graphs to effectively communicate your findings. Additionally, analyze
any external factors or events that might have played a role in shaping these trends.
Your goal is to provide a comprehensive narrative that highlights the relationship
between the MDMS and student dropout rates in the state.
Example 2:
Let us do an activity now to create a data story with the information given below. We
have collected the data. Use the above steps to create an effective Data Story.
Solution:
Step 1: Prepare this data sheet in MS-Excel.
Step 2: Visualize the data using a line chart in MS-Excel, as follows.
Step 3: Narrative:
Covid Vaccine - Gives a Ray of Hope
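The same three steps can also be carried out in Python for classrooms not using MS-Excel. The sketch below is illustrative only: since the original data sheet appears as an image in the handbook, the month-wise vaccination figures are hypothetical placeholders, and pandas and matplotlib are assumed to be installed.

# Hypothetical vaccination figures; assumes pandas and matplotlib are installed.
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Collect the data and organize it.
data = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Doses_lakhs": [30, 55, 95, 150, 210, 290],
})

# Step 2: Visualize the data using a line chart.
plt.plot(data["Month"], data["Doses_lakhs"], marker="o")
plt.title("Covid Vaccine - Gives a Ray of Hope")
plt.xlabel("Month")
plt.ylabel("Doses administered (lakhs)")
plt.show()

# Step 3: Write the simple narrative hidden in the data.
growth = data["Doses_lakhs"].iloc[-1] / data["Doses_lakhs"].iloc[0]
print(f"Vaccinations grew nearly {growth:.0f}x in six months, giving a ray of hope.")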
Ethics in Data Storytelling
When creating data stories, the following ethical principles must be kept in mind:
1. Accuracy: Ensure that the data is accurate, reliable, and truthful. Avoid
manipulating data to support a predetermined narrative.
2. Transparency: Clearly cite the sources of the data, methods used for analysis, and
any limitations or biases. Be transparent about the story's purpose and potential
conflicts of interest.
3. Respect for Privacy: Protect the privacy of individuals and groups represented in
the data. Avoid sharing personal or sensitive information without consent.
Conclusion
A data story does not just happen on its own—it must be curated and prepared
by someone for the benefit of other people. When we effectively combine the right
insights with the right narratives and visuals, we can communicate the data in a
manner that can inspire change. The data stories can help other people to understand
a problem, risk, or opportunity in a meaningful way that compels them to act on it.
So, we can define Data storytelling as a persuasive, structured approach for
communicating insights using narrative elements and explanatory visuals to inform
decisions and drive changes.
Exercises
A. Multiple choice questions
1. Which of the following best describes data storytelling?
a) Presenting raw data without any analysis
b) Communicating insights through data in a compelling narrative format
c) Creating colorful charts and graphs
d) Analyzing data without any visual aids
2. What is the primary goal of data storytelling?
a) To confuse the audience with complex charts
b) To entertain the audience with anecdotes
c) To communicate insights and findings effectively using data
d) To hide information from the audience
3. Which of the following is NOT a component of effective data storytelling?
a) Compelling visuals
b) Clear narrative structure
c) Overcomplicating the message
d) Insightful analysis
4. What role do visuals play in data storytelling?
a) They make the presentation look fancy
b) They distract the audience from the data
c) They help convey complex information quickly and effectively
d) They are not necessary in data storytelling
5. Why is it important to know your audience when creating a data story?
a) To impress them with your knowledge
b) To tailor the message and visuals to their level of understanding and
interests
c) To exclude certain data points that might confuse them
d) To ignore their feedback and preferences
B. True or False
1. The main purpose of a data visualization is to hide the data from the audience
- False
2. Sonnet charts are not a common type of data visualization. - True
3. Data storytelling involves presenting insights and findings in a compelling
narrative format. - True
4. Data storytelling involves presenting raw data without any analysis or
interpretation. - False
5. Data storytelling is only effective if it excludes certain data points that might
confuse the audience. – False
4. Name some graphs which can be used for the following types of data.
a. Text data- Word cloud
b. Data which is changing constantly over a period of time- Line chart
c. Stocks variation- Candle stick graph
d. Mixed data- facet grid
5. Name some important ethical concerns related to Data Storytelling.
Ans- The ethical concerns related to each element of Data Storytelling are as
follows:
1. Data: accuracy concerns
2. Narrative: transparency concerns
3. Visualizations: fallacy concerns
2. Explain the steps to create a Data story.
Ans- If the data collected is represented as just a series of graphs and charts, it will not serve any purpose for the organization. It should be communicated well, with a proper narrative that has context, meaning, relevance, and clarity. The narrative should take the focus of the audience to the correct spot and not miss out on important facts. To find compelling stories in data sets, the following steps are to be followed:
● Collect the data and organize it.
● Use proper visualization tools to visualize the data.
● Then observe the relationships between the data.
● Finally create a simple narrative which is hidden in the data to be
communicated to the audience.
3. What are the different types of data and which type of visualization should we use
for which data?
Ans- Data can be broadly classified as qualitative data and quantitative data. Under qualitative data, we have nominal data and ordinal data; under quantitative data, we have discrete data and continuous data. Further, data can fall under categories such as text data, mixed data, numeric data, stocks data, and geographic data. Based on the data we are handling, we can choose different types of visualizations, such as word cloud, facet grid, line chart, bar chart, pie chart, histogram, bubble chart, heat map, scatter plot, candlestick chart, and map chart.
2. Case Study:
A city government collected data on traffic accidents at intersections to identify
high-risk areas and prioritize safety improvements. They analyzed the data to identify
patterns and trends in accident occurrence and severity. Subsequently, they
developed a data storytelling report to present their findings to city officials and
propose targeted interventions.
What is the primary purpose of the city government's data storytelling report?
A) To analyze public transportation usage.
B) To identify high-risk areas for traffic accidents.
C) To assess air quality levels in the city.
D) To evaluate the performance of road maintenance crews.
3. Case Study:
A healthcare organization conducted a study to analyze patient satisfaction levels
at various hospitals within the network. They collected data on patient experiences,
wait times, staff responsiveness, and overall satisfaction ratings. Using this data, they
created a data storytelling presentation to share insights with hospital administrators
and identify areas for improvement.
What type of data did the healthcare organization primarily analyze in their study?
A) Sales data for medical supplies.
B) Patient satisfaction levels at hospitals.
C) Staffing levels at healthcare facilities.
D) Insurance claims data.
4. Case Study:
A technology company analyzed user engagement data to understand the
effectiveness of its mobile app features. They collected data on user interactions,
session durations, and feature usage patterns. Based on the analysis, they developed
a data storytelling presentation to guide future app development efforts and enhance
user experience.
What was the main objective of the technology company's data analysis?
A) To measure employee productivity.
B) To understand user engagement with the mobile app.
C) To track inventory levels of hardware components.
D) To evaluate customer satisfaction with tech support services.
5. Case Study:
An educational institution conducted a survey to gather feedback from students on
online learning experiences during the COVID-19 pandemic. They collected data on
internet connectivity issues, course content satisfaction, and overall learning
effectiveness. Using this data, they created a data storytelling presentation to inform
future decisions on online course delivery methods.
What motivated the educational institution to conduct the survey and create the data
storytelling presentation?
A) To analyze student enrollment trends.
B) To assess campus infrastructure needs.
C) To gather feedback on online learning experiences.
D) To evaluate faculty performance in virtual classrooms.
F. Competency-Based Questions:
1. What is the primary goal of the human resources department's data storytelling
presentation?
Ans- The primary goal is to address employee concerns and improve job satisfaction.
2. What specific areas of employee feedback did the human resources department
collect data on?
Ans- They collected data on workplace culture and job satisfaction.
3. How does the human resources department plan to utilize the data storytelling
presentation based on the employee satisfaction survey results?
Ans- The plan is to use it to address employee concerns and enhance employee
satisfaction.
4. A healthcare organization analyzed patient satisfaction survey data to identify
areas for improvement in patient care and services. They collected data on patient
feedback, treatment outcomes, and facility experiences. How can data storytelling
be used for improvement in patient care?
Ans-
• Data storytelling helps healthcare organizations improve patient care by
analysing patient feedback and treatment outcomes.
• It provides insights into areas that need enhancement, allowing for data-driven
decisions that enhance patient experience and quality of care.
• This approach enables clear communication of findings and actionable
recommendations, leading to targeted improvements in patient care and
services.
References:
1. https://www.unicef.org/india/media/8746/file/THE%20STATE%20OF%20THE%20WORLD%20%E2%80%99S%20CHILDREN%202023.pdf
2. https://www.linkedin.com/pulse/what-data-storytelling-ram-narayan/