
Data Science using Python Language with Applications

Professor Ali A. Ibrahim, PhD in Artificial Intelligence and Bioinformatics

College of Business Economics

Al-Nahrain University

Preface

In an era driven by an unprecedented surge in data generation and technological advancement, the pursuit of insights from vast and complex datasets has become an indispensable pillar of
modern knowledge acquisition. The emergence of data science as a dynamic and
interdisciplinary field has paved the way for harnessing data-driven approaches to unveil
patterns, extract valuable information, and make informed decisions that shape industries,
academia, and society at large.

This book delves into the heart of data science, offering an in-depth exploration of its
methodologies, principles, and applications. With a rigorous and scientific approach, we embark
on a journey through the landscape of data analysis, employing robust statistical techniques and
cutting-edge machine learning algorithms to decipher hidden relationships, predict trends, and
solve intricate problems.

Through the lens of this book, readers will traverse the intricate terrain of data manipulation,
visualization, and modeling. We draw upon foundational concepts from mathematics, statistics,
and computer science to empower readers with the tools needed to wrangle complex datasets,
identify sources of bias, and ensure the integrity of results. Embracing a scientific mindset, we
emphasize reproducibility and the importance of transparent methodology in the pursuit of
credible data-driven findings.

The chapters within this book are meticulously crafted to guide readers from foundational
concepts to advanced methodologies. We unravel the intricacies of exploratory data analysis,
hypothesis testing, and model validation, while also delving into the nuances of ethical
considerations and the responsible use of data in a rapidly evolving technological landscape.

We extend our gratitude to the vast community of data scientists, statisticians, and researchers
whose groundbreaking work has paved the way for the methodologies outlined in this book. Our
aspiration is for this text to serve as both a comprehensive guide for those new to the field and a
valuable resource for experienced practitioners seeking to deepen their understanding and refine
their skills.

As we venture into the world of data science together, let us embrace curiosity, critical thinking,
and the scientific method, and let us embark on a quest to unlock the insights concealed within
the vast sea of data that surrounds us.

Ali A. Ibrahim
Prof., PhD. in Artificial Intelligence and Bioinformatics
College of Business Economics
Al-Nahrain University
Contents

Preface

Chapter 1: Introduction

Chapter 2: Data Collection

Chapter 3: Data Visualization

Chapter 4: Data Manipulation

Chapter 5: Exploratory Data Analysis

Chapter 6: Statistical Modeling with Programming Concepts

Chapter 7: Case Studies

Chapter 8: Data Science Relationships

Appendix: Further Reading

Index

CHAPTER ONE: INTRODUCTION

1.1 Introduction

Data Science is an interdisciplinary field that involves using statistical and computational
methods to extract insights and knowledge from data. It encompasses a wide range of techniques
and tools, including machine learning, data mining, statistics, and visualization. The goal of data
science is to analyze and interpret complex data sets in order to extract meaningful insights that
can inform decision-making and drive business value.

Data science is used in a variety of industries, including healthcare, finance, marketing, and
technology. It is particularly useful in areas where there are large amounts of data that need to be
analyzed in order to make informed decisions. Data scientists typically work with large data sets,
using programming languages such as Python and R to clean and analyze the data. They then use
statistical techniques and machine learning algorithms to identify patterns and relationships in
the data, and create models to predict future outcomes.

Data science has become increasingly important in recent years, as more and more organizations
have realized the value of data-driven decision making. With the rise of big data and the
increasing availability of data from a variety of sources, data science is playing an increasingly
important role in business strategy and decision-making.

The following are examples of data science applications:

a. Predictive modeling: A common use case for data science is predictive modeling, which
involves using historical data to build a model that can make predictions about future events.
For example, a retailer might use data science to predict which products are likely to be
popular during the upcoming holiday season, so they can stock up accordingly.

b. Customer segmentation: Data science can be used to identify groups of customers with similar
characteristics and behaviors. This can help businesses better target their marketing efforts
and tailor their products and services to the specific needs of each group.

c. Fraud detection: Data science can be used to identify patterns and anomalies in financial
transactions, which can help detect and prevent fraud.

d. Personalized recommendations: Many online retailers and streaming services use data science
to make personalized recommendations to their users based on their past behavior and
preferences.

e. Medical research: Data science is increasingly being used in medical research to analyze large
data sets of patient information in order to identify new treatments and improve patient
outcomes.

Here are some examples of different types of data:

a. Text Data: Text data refers to any data that is in written or textual form. Examples of text data
include emails, social media posts, customer reviews, news articles, and chat logs.

b. Numeric Data: Numeric data refers to data that consists of numbers. Examples of numeric data
include stock prices, sensor readings, temperature readings, and financial data.

c. Categorical Data: Categorical data refers to data that is divided into categories or groups.
Examples of categorical data include gender, age group, educational level, and job title.

d. Image Data: Image data refers to any data that is in the form of images or photographs.
Examples of image data include medical images, satellite images, and photographs.

e. Audio Data: Audio data refers to any data that is in the form of sound recordings. Examples of
audio data include music recordings, podcast episodes, and phone call recordings.

f. Video Data: Video data refers to any data that is in the form of videos or motion pictures.
Examples of video data include movies, TV shows, and security camera footage.

g. Geospatial Data: Geospatial data refers to data that is related to geographic locations.
Examples of geospatial data include maps, GPS coordinates, and weather data.

h. Time Series Data: Time series data refers to any data that is recorded over a period of time.
Examples of time series data include stock prices, weather data, and website traffic data.

See Examples 1.1 to 1.6, which include Figures 1.1 to 1.11.
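The code in those figures is reproduced in the book as screenshots. As a minimal, hedged sketch of the kind of code Figures 1.1 to 1.3 describe (writing a small list of numbers to a text file and reading it back), with invented sample values and the file name data.txt taken from the captions:

# Write a small list of numbers to a text file, then read it back and print it
numbers = [12, 7, 3, 25, 18]          # illustrative sample values

with open("data.txt", "w") as f:
    for n in numbers:
        f.write(f"{n}\n")

with open("data.txt") as f:
    values = [int(line.strip()) for line in f]

print("Values read from data.txt:", values)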

Example 1.1: DataFrame

Fig. 1.1: Create Data file as a list of numbers and save it using Python code.

Example 1.2: Read file

Fig. 1.2: Read the data “text” file and print the output result using Python code.

Fig. 1.3: Output of the data.txt file using Python code.

Example 1.3: Create Data file

Fig. 1.4: Create categories “text” data file using Python code.

Example 1.4: Reading file.

Fig. 1.5: Read data.txt file and print the output result using Python code.

Fig. 1.6: Output of the data.txt file using Python code.

Example 1.5: Text data

Fig. 1.7:Text Data Python code

Fig. 1.8: Output Text Data Python code

Example 1.6: Time Series Data

Fig. 1.9: Time Series Data Python Code.

Fig. 1.10: Time Series Data using Python Code.

Fig.1.11: Output Plot for Time series Data using Python code.

CHAPTER TWO: Data Collection

Data Collection

Generating Random Data

2.1 Data Collection

Data collection is the process of gathering and acquiring data from various sources for analysis
and interpretation. The quality and accuracy of data collected are critical in ensuring that the
resulting analysis and insights are reliable and meaningful. In data science, data collection
involves identifying the relevant data sources, selecting appropriate data collection methods, and
ensuring that the data is clean and well-organized.

Here are some common methods of data collection:

a. Surveys: Surveys are a popular method of data collection and involve gathering
information from individuals through questionnaires or interviews. Surveys can be
conducted online, over the phone, or in person (see Figures 2.1-2.5).
b. Experiments: Experiments involve manipulating variables to study the effects on an
outcome of interest. Experiments can be conducted in controlled laboratory settings or in
the field (see Figure 2.6).
c. Observational studies: Observational studies involve observing and recording data
without manipulating variables. Observational studies can be conducted in natural
settings or in controlled laboratory settings (see Figures 2.7-2.8).
d. Web scraping: Web scraping involves extracting data from websites using automated
tools. Web scraping is a useful method for collecting large amounts of data from online
sources.
e. Social media monitoring: Social media monitoring involves analyzing social media
platforms to gather information about trends, sentiments, and opinions. Social media
monitoring is useful for understanding public opinion.
f. Sensor data collection: Sensor data collection involves collecting data from sensors such
as GPS, accelerometers, and temperature sensors. Sensor data collection is useful for
monitoring physical environments and behaviour.

The goal of data collection is to gather high-quality data that is relevant to the research question
and analysis. Data scientists need to carefully consider the data sources, data collection methods,
and data quality when collecting data to ensure that the analysis and insights derived from the
data are reliable and meaningful. Data collection is an ongoing process, and data scientists need
to continually monitor and update their data sources and collection methods to ensure that the
data remains relevant and accurate.

The following examples illustrate Python code for these data collection methods: Examples 2.1 to 2.3 (Figures 2.1 to 2.4) implement the survey method, and Examples 2.4 and 2.5 (Figures 2.5 to 2.7) generate simple observational data.
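The book's own survey code appears as screenshots in the figures below. As a rough, hedged sketch of what such code could look like (the question set and the file name survey_responses.csv follow the captions, while the exact implementation in the figures may differ):

import csv

def collect_response():
    # Ask the four survey questions described in Fig. 2.1
    return {
        "name": input("What is your name? "),
        "age": input("What is your age? "),
        "country": input("Which country are you from? "),
        "color": input("What is your favorite color? "),
    }

def save_responses(responses, filename="survey_responses.csv"):
    # Save all collected answers to a CSV file, as in Fig. 2.3
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "age", "country", "color"])
        writer.writeheader()
        writer.writerows(responses)

n = int(input("How many people will take the survey? "))
responses = [collect_response() for _ in range(n)]   # collection loop, as in Fig. 2.2
save_responses(responses)
print(f"Saved {len(responses)} responses to survey_responses.csv")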

Example 2.1: Data collect

Fig. 2.1: Python code that asks four questions (name, age, country, and color) in the first part and, in the second part, asks how many people are included in the survey.

Example 2.2: Data collection

Fig. 2.2: A loop that asks the survey questions and collects the answers (responses) using Python code.

Example 2.3: Save file data

Fig. 2.3: A function "save_responses" that saves the answers to the survey questions, using Python code.

Fig. 2.4: Outputs from running the Python code: (a) the name of the CSV data file (survey_responses.csv), (b) the console output of the code, and (c) the contents of the output file (survey_responses.csv).

Example 2.4: Generate data

Fig. 2.5: using python code to generate and print the current temperature.

Example 2.5 : Generate Data:

Fig. 2.6: using python code to create observation data

Fig. 2.7: Statistical output of the python code.

2.2 Generating Random Data

Generating random data is a common task in programming, particularly when simulating scenarios or working with data that exhibits random characteristics. The process involves producing values that are unpredictable and follow a probability distribution.

In Python, the random module provides functions for generating random data. Here are some key
concepts and functions related to generating random data:
a. Randomness and Seed: Randomness refers to the lack of predictability in generated values.
The random module uses a pseudorandom number generator (PRNG), which is an algorithm that
produces a sequence of numbers that appears random but is actually deterministic. By default,
the PRNG is initialized based on the current system time, so running the program multiple times
produces different results. However, you can set a specific seed value using the random.seed()
function to obtain the same sequence of random values each time the program is executed.

b. Uniform Distribution: The random module provides functions for generating random
numbers that follow a uniform distribution. In a uniform distribution, all values in the range have
an equal probability of occurring. For example:

- random.random() returns a random floating-point number between 0 and 1.

- random.uniform(a, b) returns a random floating-point number between a and b.

c. Integer Generation: To generate random integers, you can use the following functions:

- random.randint(a, b) returns a random integer between a and b, inclusive.

- random.randrange(start, stop, step) returns a random integer from the range start (inclusive) to
stop (exclusive) with the specified step.

d. Floating-Point Generation: Besides the random.uniform() function mentioned earlier, you can use the following function:

- random.gauss(mu, sigma) returns a random floating-point number based on a Gaussian (normal) distribution with mean mu and standard deviation sigma.

e. Random Choices: The random.choices() function allows you to randomly select elements
from a given sequence, allowing for sampling with replacement (i.e., the same element can be
chosen multiple times) or sampling without replacement (i.e., each element is chosen only once).

f. Shuffling and Sampling: The random.shuffle() function shuffles a sequence in-place, randomly rearranging its elements. The random.sample() function returns a new list containing a specified number of unique elements randomly selected from a sequence without replacement.

These are just a few of the functions available in the random module for generating random data.
Depending on your specific needs, you can explore additional functions and techniques for
generating random values, such as random choices from custom distributions, permutations, or
combinations.
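As a small illustrative sketch (not taken from the book's figures) that exercises the functions just described; the seed value 42 and the sample values are arbitrary:

import random

random.seed(42)                                   # fix the seed for reproducible results

print(random.random())                            # uniform float in [0, 1)
print(random.uniform(1, 5))                       # uniform float between 1 and 5
print(random.randint(1, 10))                      # integer between 1 and 10, inclusive
print(random.randrange(0, 100, 5))                # multiple of 5 below 100
print(random.gauss(0, 1))                         # draw from a standard normal distribution

colors = ["red", "green", "blue", "yellow"]
print(random.choices(colors, k=3))                # sampling with replacement
print(random.sample(colors, k=2))                 # sampling without replacement
random.shuffle(colors)                            # shuffle the list in place
print(colors)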

The following examples demonstrate generating random data, including the use of randomness and seeds, in Python; see Examples 2.6 to 2.13 and Figures 2.8 to 2.15:

Example 2.6: Generate integer numbers

Fig. 2.8: Generate 5 random integers in the range 1 to 10, with output, using Python code.

Example 2.7: Generate floating-point numbers

Fig. 2.9: Generate 5 random floating-point numbers in the range 1 to 5, with output, using Python code.

Example 2.8: Choosing elements

Fig.2.10 : Random Choices from a Sequence using python code.

Example 2.9: Input Numbers

Fig.2.11 : Shuffling the Input List with output result using Python code.

Example 2.10: Samples

Fig.2.12 : Sampling Without Replacement using Python code.

Example 2.11: Generate random data

Fig.2.13 : Generating Random Data with randomness and seed and output results using Python code.

Example 2.12: Generate data

Fig.2.14: Generate Uniform Distribution output using Python code.

Example 2.13: Generate data

Fig.2.15 : Generate Uniform Distribution output using Python code.

CHAPTER THREE: Data Visualization

Line Chart

Bar Chart

Pie Chart

Scatter Plot

Histogram

Heat Maps

Treemap

Bubble

Choropleth Map

Sankey

Box Plot

Parallel Coordinates Plot

Radar Charts

Network Diagram

Word Cloud

Streamgraphs

3D

Gantt Charts

Data visualization

Data visualization is a rich and multifaceted field with deep theoretical underpinnings. At its core, data
visualization is about representing data visually to facilitate understanding, exploration, and
communication of information. Here are some key theoretical concepts and principles in data
visualization:

- The Visual Encoding Framework: This foundational concept proposes that data attributes should be mapped to visual properties in a way that exploits human perception effectively. Common visual properties include position, length, angle, color, shape, and size. For example, using position along a common scale for two data attributes allows for easy comparison.
- Pre-attentive Processing: The theory of pre-attentive processing suggests that certain visual attributes, like color or shape, can be quickly and accurately perceived by the human brain without conscious attention. Effective data visualization leverages these attributes to highlight important information and make patterns easily discernible.
- Gestalt Principles: These principles, such as proximity, similarity, continuity, and closure, explain how humans naturally group and perceive visual elements. Understanding Gestalt principles helps in designing visualizations that encourage viewers to see patterns and relationships in data.
- Exploratory vs. Explanatory Visualizations: Data visualizations can serve different purposes. Exploratory visualizations are created during the data analysis process to help researchers understand the data themselves. Explanatory visualizations are designed for a broader audience to communicate insights clearly and persuasively.
- Visualization Taxonomies: Various taxonomies categorize different types of visualizations based on their purposes and characteristics. For example, there are hierarchical, network, time-series, and spatial visualizations, among others. Understanding these categories can help choose the most appropriate visualization for a specific dataset and objective.
- Ethical Considerations: As data visualization can influence perception and decision-making, it's important to consider ethical implications. This includes issues related to bias, misrepresentation, and privacy.
- Interactivity and User Experience (UX): Interactive visualizations allow users to explore data actively. Understanding principles of UX design, such as responsiveness, user feedback, and usability, is crucial for creating engaging and effective data visualizations.
- Data Semiotics: This emerging field explores the semiotic aspects of data visualization, considering how symbols and signs in visualizations convey meaning. It delves into the cultural, social, and cognitive aspects of data representation.

These theoretical foundations, among others, provide a solid framework for creating meaningful and
impactful data visualizations. Effective data visualization combines both art and science, utilizing these
theories to communicate complex information clearly and persuasively.

Data visualization encompasses a wide range of types and techniques, each suited to different purposes
and data characteristics. Here are some common types of data visualizations:

3.1 Line Charts: Line charts display data as a series of data points connected by lines. They are
excellent for showing trends and changes in data over time.
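A minimal, hedged sketch of a basic line chart, assuming matplotlib and invented illustrative data (the book's own versions appear in the figures that follow):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]   # illustrative data
sales = [120, 135, 150, 145, 170, 190]

plt.plot(months, sales, marker="o")   # connect the points with a line
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.grid(True)
plt.show()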

Example 3.1: Create a Line Chart

Fig. 3.1: plot Line Chart graph using Python code.

Fig.3.2: Output Line Chart from python code.

Example 3.2: Create a line Chart with two different lines

Fig.3.3: plot Multiple Line Chart using Python code.

Fig. 3.4: Output of Multiple Line Chart.

Example 3.3: create line chart

Fig. 3.5 : Time series line chart using Python code.

Fig. 3.6: Time series line chart using Python code.

Example 3.4: Create Simple Line Chart

Fig. 3.7: Create simple Line Chart using Python code.

Fig. 3.8: Create simple Line Chart using Python code.

Example 3.5: Create Multiple Line Chart

Fig. 3.9: Multiple Line Chart using python code.

Example 3.6: Create Line Chart with Markers

Fig. 3.10: Line Chart with 'o' markers using Python code.

Fig. 3.11: Line Chart with 'o' markers plotted using Python code.

Example 3.7: Create Line Chart with different colors.

Fig. 3.12 : Create Line Chart with different colors using Python code.

Fig. 3.13: Create Line Chart with different colors using Python code.

Example 3.8: Create Line Chart with Different Line Styles

Fig. 3.14: Line Chart with Different Line Styles using Python code.

Fig. 3.15: Line Chart with Different Line Styles using Python code.

Example 3.9: Create Line Chart with a Grid

Fig. 3.16: Line Chart with a Grid using Python code.

Fig. 3.17: Line Chart with a Grid using Python code.

Example 3.10: Create Line Chart

Fig. 3.18: Line Chart using Python code.

Fig. 3.19: Line Chart using Python code.

Example 3.11: Create Step Line Chart

Fig. 3.20: Step Line Chart using Python code.

Fig. 3.21: Step Line Chart using Python code.

Example 3.12: Create Stacked Area Chart

Fig. 3.22: Stacked Area Chart using Python code.

Fig. 3.23: Stacked Area Chart using Python code.

Example 3.13: Create Logarithmic Scale Line Chart

Fig. 3.24: Logarithmic Scale Line Chart using Python code.

Fig. 3.25: Logarithmic Scale Line Chart using Python code.

Example 3.14: Create Line Chart with Date on x-axis

Fig. 3.26: Line Chart with Date on x-axis using Python code.

Fig. 3.27: Line Chart with Date on x-axis using Python code.

Example 3.15: Create Line Chart with Annotations

Fig. 3.28: Line Chart with Annotations using Python code.

Fig. 3.29: Line Chart with Annotations using Python Code.

Example 3.16: Create Dual Y-Axis Line Chart

Fig. 3.30: Dual Y-Axis Line Chart using Python code.

Fig. 3.31: Dual Y-Axis Line Chart using Python code.

Example 3.17: Create Line Chart with Shaded Region

Fig. 3.32: Line Chart with Shaded Region using python code.

Fig. 3.33: Line Chart with Shaded Region using python code.

3.2. Bar Chart:

Bar Charts: Bar charts represent data using rectangular bars of varying lengths or heights. They are effective for comparing values across categories, and can also show changes over time when the categories are time periods.
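A minimal, hedged sketch of a basic bar chart, assuming matplotlib and invented illustrative data:

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]     # illustrative data
values = [23, 45, 12, 36]

plt.bar(categories, values, color="steelblue")
plt.title("Values by Category")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()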

Example 3.18 : Create Bar Chart

Fig. 3.34: Create Bar Chart using Python code.

Fig. 3.35: Output Bar Chart using Python code.

Example 3.19: Create a grouped bar chart

Fig. 3.36: Create Grouped Bar chart using Python code.

Fig. 3.37: Output Grouped Bar chart using Python code.

Example 3.20: Create Bar charts

Fig. 3.38: Different Bar Charts examples using Python code.

Fig. 3.39: Basic Bar Chart using Python code.

Fig. 3.40: Horizontal Bar Chart using Python code.

Fig. 3.41: Grouped Bar Chart using Python code.

Fig. 3.42: Stacked Bar Chart using Python code.

Fig. 3.43: Bar Chart with Error Bars using Python code.

Fig. 3.44: Bar Chart with Custom Colors using Python code.

Fig. 3.45: Bar Chart with Data Labels using Python code.

Fig. 3.46: Horizontal Bar Chart using Python code.

Fig. 3.47: Bar Chart with Logarithmic Scale using Python code.

Fig. 3.48: Bar Chart with 3D Effect using Python code.

Fig. 3.49: Horizontal Stacked Bar Chart using Python code.

3.3. Pie Chart:

Pie Charts: Pie charts represent data as a circle divided into segments, with each segment representing a
proportion of the whole. They are useful for showing parts of a whole, but can be less effective for
precise comparisons.
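A minimal, hedged sketch of a basic pie chart, assuming matplotlib and invented illustrative data:

import matplotlib.pyplot as plt

labels = ["Rent", "Food", "Transport", "Other"]   # illustrative data
shares = [40, 30, 15, 15]

plt.pie(shares, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Monthly Budget Breakdown")
plt.axis("equal")   # keep the pie circular
plt.show()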

Example 3.21: Create Pie Chart

Fig.3.50: Create Pie Chart using Python code.

Fig. 3.51: Output Pie Chart from using Python code.

Example 3.22: Create an exploded pie chart

Fig. 3.52: Create exploded Pie Chart using Python code.

Fig. 3.53: Output of the exploded Pie Chart from Fig. 3.52.

Example 3.23: Create Different types of Pie Charts

Fig. 3.54: Different types for Pie Charts using Python code.

Fig. 3.55: Basic Pie Charts using Python code.

Fig. 3.56: Exploded Pie Charts using Python code.

Fig. 3.57: Donut Pie Charts using Python code.

Fig. 3.58: Pie Charts with Custom Colors using Python code.

Fig. 3.59: Pie Charts with Shadow Effect using Python code.

Fig. 3.60: Pie Charts with Percentage Labels using Python code.

Fig. 3.61: Pie Charts with a Single Exploded using Python code.

Fig. 3.62: Pie Charts with Custom Start Angle using Python code.

Fig. 3.63: Pie Charts with Custom Labels Removed using Python code.

Fig. 3.64: Pie Charts with Legend using Python code.

Fig. 3.65: Nested Pie Chart using Python code.

Example 3.24: Create different types 3D-like Pie Chart

Fig. 3.66: Different types of 3D-like Pie Charts using Python code.

Fig. 3.67: 3D-like Pie Chart with Beveled Edge.

Fig. 3.68: 3D-like Pie Chart with Different Color Map.

Fig. 3.69: 3D-like Donut Pie Chart.

Fig. 3.70: 3D-like Exploded Pie Chart.

Fig. 3.71: 3D-like Pie Chart with Rotated Angle.

3.4 Scatter Plot

Scatter Plots: Scatter plots display individual data points on a two-dimensional grid, with one variable on
each axis. They are great for showing relationships and correlations between two variables.
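A minimal, hedged sketch of a basic scatter plot, assuming matplotlib and NumPy with invented illustrative data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)        # reproducible illustrative data
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

plt.scatter(x, y, alpha=0.7)
plt.title("Relationship between x and y")
plt.xlabel("x")
plt.ylabel("y")
plt.show()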

Example 3.25: Create Scatter Plot

Fig. 3.72: Create Bubble Scatter Plot using Python code.

Fig. 3.73: Output of the Bubble Scatter Plot from Fig. 3.72.

Example 3.26: Create a scatter Plot

Fig. 3.74: Create Scatter Plot with color Mapping and size Variation using Python code.

Fig. 3.75: Output of Scatter Plot with color Mapping and size Variation using Python code.

Example 3.27: Create Different types Scatter Plots.

Fig.3.76: Different types for Scatter Plots using Python Code.

Fig. 3.77: Basic Scatter Plot using Python.

Fig. 3.78: Scatter Plot with colors and sizes using Python.

Fig. 3.79: Scatter Plot with Labels using Python.

Fig. 3.80: Scatter Plot with Regression Line using Python.

Fig. 3.81: Scatter Plot with Custom Marks using Python.

Fig. 3.82: Scatter Plot with Log Scale using Python.

Fig. 3.83: Scatter Plot with Varying Transparency using Python.

Fig. 3.84: Scatter Plot with Error Bars using Python.

Fig. 3.85: Scatter Plot with Categorical Labels using Python.

Fig. 3.86: Scatter Plot with a Color Map using Python.

Fig. 3.87: Scatter Plot with Trendline and Confidence Interval using Python.

3.5. Histograms

Histograms: Histograms are used to represent the distribution of a single numeric variable. They group
data into bins and display the frequency of data points in each bin.
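A minimal, hedged sketch of a basic histogram, assuming matplotlib and NumPy with invented illustrative data:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(1).normal(loc=50, scale=10, size=1000)  # illustrative data

plt.hist(data, bins=20, edgecolor="black")
plt.title("Distribution of Values")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()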

Example 3.28: Create Histogram

Fig. 3.88: Histogram Chart using Python code.

Fig. 3.89: Basic Histogram using Python code.

Fig. 3.90: Histogram with Custom Bin Edges using Python code.

Fig. 3.91: Histogram with Normalized Counts using Python code.

Fig. 3.92: Stacked Histograms using Python code.

Fig. 3.93: Histogram with Different Color using Python Code.

Fig. 3.94: Histogram with Cumulative Counts using Python Code.

Fig. 3.95: Histogram with Log Scale using Python Code.

Fig. 3.96: Histogram with Density Plot using Python Code.

Fig. 3.97: Histogram with Specified Range using Python Code.

Fig. 3.98: Horizontal Histogram using Python Code.

Fig. 3.99: Histogram with Stacked Bins using Python Code.

3.6 Heat map

Heatmaps: Heatmaps use colors to represent data values in a two-dimensional matrix or grid. They are
often used to visualize patterns and relationships in large datasets.
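A minimal, hedged sketch of a heatmap built with plain matplotlib (imshow) and invented illustrative data; the book's figures may instead use seaborn:

import matplotlib.pyplot as plt
import numpy as np

matrix = np.random.default_rng(2).random((6, 6))  # illustrative 6x6 grid of values

plt.imshow(matrix, cmap="viridis")
plt.colorbar(label="Value")
plt.title("Heatmap of a 6x6 Matrix")
plt.show()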

Example 3.29: Create Heatmap

Fig.3.100: Create Heatmap using Python code.

Fig. 3.101: Output of Heatmap using Python code.

Example 3.30: Create Heatmap

Fig. 3.102: Create Heatmap with Annotations using Python code.

Fig. 3.103: Output of Heatmap using Python code.

Example 3.31: Create Heatmap

Fig. 3.104: Heatmap using Python Code.

Fig. 3.105: Basic Heatmap using Python Code.

Fig. 3.106: Heatmap with Custom Colors using Python Code.

Fig. 3.107: Correlation Heatmap using Python Code.

Fig. 3.108: Heatmap with Hierarchical Clustering using Python Code.

Fig. 3.109: Annotated Heatmap using Python Code.

Fig. 3.110: Heatmap with Labels Python Code.

Fig. 3.111: Discrete Heatmap using Python Code.

Fig. 3.112: Discrete Heatmap using Python Code.

Fig. 3.113: Heatmap with Horizontal Color Bar using Python Code.

Fig. 3.114: Square Heatmap using Python Code.

Fig. 3.115: Centered Heatmap Using Python Code.

3.7 Treemaps:
Treemaps display hierarchical data structures as nested rectangles. They are useful for showing
the hierarchical composition of data.
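A minimal, hedged sketch of a treemap, assuming Plotly Express (px.treemap) and an invented two-level hierarchy:

import plotly.express as px

# Illustrative hierarchical data: departments and teams with headcounts
data = {
    "department": ["Sales", "Sales", "Engineering", "Engineering", "HR"],
    "team": ["EMEA", "APAC", "Backend", "Frontend", "Recruiting"],
    "headcount": [30, 20, 40, 25, 10],
}

fig = px.treemap(data, path=["department", "team"], values="headcount")
fig.show()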

Example 3.32: Create Treemap

Fig. 3.116: TreeMap using Python Code.

Fig. 3.117: Basic TreeMap with one level using Python Code.

Fig. 3.118: Treemap with Multiple levels and color mapping using Python Code.

Fig. 3.119: Customized Treemap with hover Info using Python Code.

Fig. 3.120: Sunburst Treemap using Python Code.

Fig. 3.121: Treemap with parent labels using Python Code.

Fig. 3.122: Treemap with Custom Color Mapping with Python Code.

Fig. 3.123: Treemap with Custom Hierarchy and Color Mapping with Python Code.

Fig. 3.124: Treemap using Python Code.

Fig. 3.125: Treemap with Custom template and title using Python Code.

Fig. 3.126: Treemap with zoom and Pan using Python Code.

3.8 Bubble Charts:


Bubble charts are similar to scatter plots but add a third variable by varying the size of data
points as well as their position on the graph.
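A minimal, hedged sketch of a bubble chart, assuming matplotlib and invented illustrative data, where the third variable is mapped to the marker size:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]                   # illustrative data
y = [10, 14, 8, 20, 16]
sizes = [100, 300, 200, 500, 250]     # third variable mapped to bubble size

plt.scatter(x, y, s=sizes, alpha=0.5, c=sizes, cmap="viridis")
plt.colorbar(label="Size variable")
plt.title("Bubble Chart")
plt.xlabel("x")
plt.ylabel("y")
plt.show()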

Example 3.33: Create Bubble Chart

Fig. 3.127: Bubble Charts using Python Code.

Fig. 3.128: Basic Bubble Chart using Python Code.

Fig. 3.129: Bubble Chart with Color Mapping

Fig. 3.130: Bubble Chart with Labels using Python Code.

Fig. 3.131: Bubble Chart with Custom Marker Styles using Python Code.

Fig. 3.132: Bubble Chart with Transparency using Python Code.

Fig. 3.133: Bubble Chart with Log Scale using Python Code.

Fig. 3.134: Bubble Chart with Size Scaling using Python Code.

Fig. 3.135: Bubble Chart with 3D Effect using Python Code.

Fig. 3.136: Bubble Chart with Seaborn using python Code.

Fig. 3.137: Bubble Chart with Categorical Colors using Python Code.

Fig. 3.138: Bubble Chart with Trendline using Python Code.

3.9 Choropleth Maps:


Choropleth maps use color-coded regions (e.g., countries or states) to represent data values.
They are commonly used to visualize geographic data.
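A minimal, hedged sketch of a choropleth map, assuming Plotly Express (px.choropleth) and invented country-level values:

import plotly.express as px

# Illustrative country-level values
data = {
    "country": ["Iraq", "Egypt", "Jordan", "Saudi Arabia"],
    "value": [10, 25, 15, 30],
}

fig = px.choropleth(data, locations="country", locationmode="country names",
                    color="value", title="Illustrative Choropleth Map")
fig.show()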

Example 3.34 : Create Choropleth Map

Fig. 3.139: Choropleth Map using Python code.

Example 3.35: Create Choropleth Map

Fig. 3.140: Choropleth Map using Python code.

Example 3.36: Create Choropleth Map

Fig. 3.141: Choropleth Map using Python code.

Example 3.37: Create Choropleth Map

Fig. 3.142: Choropleth Map using Python code.

3.10 Sankey Diagrams:


Sankey diagrams show the flow of resources or data between multiple entities. They are useful
for visualizing processes or resource allocation.
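A minimal, hedged sketch of a Sankey diagram, assuming Plotly's graph_objects interface and invented flow values:

import plotly.graph_objects as go

fig = go.Figure(go.Sankey(
    node=dict(label=["Source A", "Source B", "Process", "Output 1", "Output 2"]),
    link=dict(
        source=[0, 1, 2, 2],   # indices into the node labels
        target=[2, 2, 3, 4],
        value=[8, 4, 7, 5],
    ),
))
fig.update_layout(title_text="Illustrative Sankey Diagram")
fig.show()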

Example 3.38: Sankey Diagram

Fig. 3.143: Basic Sankey Diagram using Python Code.

Fig. 3.144: Basic Sankey Diagram using Python Code.

Example 3.39: Create Sankey Diagram

Fig. 3.145: Sankey Diagram with Custom Colors using Python Code.

Fig. 3.146: Sankey Diagram with Custom Colors using Python Code.

Example 3.40: Create Sankey Diagram

Fig. 3.147: Vertical Sankey Diagram using Python Code.

Fig. 3.148: Vertical Sankey Diagram using Python Code.

Example 3.41: Create Sankey Diagram

Fig. 3.149: Horizontal Sankey Diagram with Padding using Python Code.

Fig. 3.150: Horizontal Sankey Diagram with Padding using Python Code.

Example 3.42: Create Sankey Diagram

Fig. 3.151: Sankey with Extra Categories using Python Code.

Fig. 3.152: Sankey with Extra Categories using Python Code.

Example 3.43: Create Sankey Diagram

Fig. 3.153: Customized Sankey Diagram using Python Code.

Fig. 3.154: Customized Sankey Diagram using Python Code.

Example 3.44: Create Sankey Diagram

Fig. 3.155: Interactive Sankey Diagram using Python Code.

Fig. 3.156: Interactive Sankey Diagram using Python Code.

Example 3.45: Create Sankey Diagram

Fig. 3.157: Sankey Diagram with Multiple Flows using Python Code.

Fig. 3.158: Sankey Diagram with Multiple Flows using Python Code.

3.11 Box Plots (Box-and-Whisker Plots):


Box plots summarize the distribution of a dataset, showing the median, quartiles, and outliers.
They are helpful for identifying skewness and outliers in data.
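A minimal, hedged sketch of box plots for three groups, assuming matplotlib and NumPy with invented illustrative data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=1.0, size=100) for m in (0, 1, 2)]  # illustrative data

plt.boxplot(groups, labels=["Group A", "Group B", "Group C"])
plt.title("Box Plot of Three Groups")
plt.ylabel("Value")
plt.show()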

Example 3.46: Create Box Plots

Fig. 3.159: Different types of Box-plot using Python Code.

Fig. 3.160: Simple Box Plot of a single dataset using Python Code.

Fig.3.161: Box Plot for multiple datasets using Python Code.

Fig.3.162: Horizontal Box Plot using Python Code.

Fig.3.163: Notched Box Plot using Python Code.

Fig. 3.164: Custom Box Plot Colors using Python Code.

Fig. 3.165: Box Plot with Outliers using Python Code.

Fig. 3.166: Box Plot with horizontal whiskers using Python Code.

Fig. 3.167: Grouped Box Plots using Python Code.

Fig. 3.168: Box Plot with notches and Custom Whisker Caps using Python Code.

Fig. 3.169: Box Plot with horizontal Median Line using Python Code.

Fig. 3.170: Box Plot with Custom x-axis labels using Python Code.

3.12 Parallel Coordinates Plots:


Parallel coordinates plots are used to visualize multivariate data by plotting each variable on a
separate parallel axis. They can reveal patterns and relationships among variables.
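A minimal, hedged sketch of a parallel coordinates plot, assuming pandas' built-in plotting helper and an invented dataset with a class column:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Illustrative multivariate data with a class column
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "height": [1.2, 1.4, 2.0, 2.2],
    "width": [0.8, 0.9, 1.5, 1.6],
    "weight": [10, 12, 20, 22],
})

parallel_coordinates(df, "class", colormap="viridis")
plt.title("Parallel Coordinates Plot")
plt.show()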

Example 3.47: Create Parallel Coordinates Plots

Fig. 3.171: Different types for Parallel Coordinates Plot using Python Code.

Fig. 3.172: Parallel Coordinates Plot Using Python Code.

Fig. 3.173: Parallel Coordinates Plot using Python Code.

Fig. 3.174: Parallel Coordinates Plot using Python Code.

Fig. 3.175: Parallel Coordinates Plot using Python Code.

Fig. 3.176: Parallel Coordinates Plot using Python Code.

3.13 Radar Charts (Spider Charts):


Radar charts display multivariate data on a circular grid with each variable represented by a
spoke. They are useful for comparing items across multiple dimensions.
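A minimal, hedged sketch of a radar chart built on matplotlib's polar axes, with invented dimensions and values:

import matplotlib.pyplot as plt
import numpy as np

labels = ["Speed", "Power", "Agility", "Stamina", "Skill"]   # illustrative dimensions
values = [4, 3, 5, 2, 4]

# Close the polygon by repeating the first value and angle
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values = values + values[:1]
angles = angles + angles[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values, "o-")
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
plt.title("Radar Chart")
plt.show()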

Example 3.48: Create Radar Chart

Fig. 3.177: Different types of Radar Charts.

Fig. 3.178: Basic Radar Chart using Python Code.

Fig. 3.179: Radar Chart with Different Categories using Python Code.

3.14 Network Diagrams:


Network diagrams show relationships between nodes (entities) and edges (connections) in a
network. They are used for visualizing social networks, transportation systems, and more.
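A minimal, hedged sketch of a network diagram, assuming the networkx library and an invented set of relationships:

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([("Ali", "Sara"), ("Sara", "Omar"),      # illustrative relationships
                  ("Omar", "Ali"), ("Omar", "Lina")])

nx.draw(G, with_labels=True, node_color="lightblue", node_size=1500)
plt.title("Simple Network Diagram")
plt.show()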

Example 3.49: Create Network Diagram

Fig. 3.180: Different types of Network Diagrams using Python Code.

Fig. 3.181: Network Diagram Python code Output.

Fig. 3.182: Network Diagram Python code Output.

Fig. 3.183: Network Diagram Python code Output.

Fig. 3.184: Network Diagram Python code Output.

Fig. 3.185: Network Diagram Python code Output.

Fig. 3.186: Network Diagram Python code Output.

Fig. 3.187: Network Diagram Python code Output.

Fig. 3.188: Network Diagram Python code Output.

Fig. 3.189: Network Diagram Python code Output.

Fig. 3.190: Network Diagram Python code Output.

Fig. 3.191: Network Diagram Python code Output.

3.15 Word Clouds:
Word clouds represent the frequency of words or terms in a text by varying the size and color of
the words. They are often used for text analysis and summarization.
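A minimal, hedged sketch of a word cloud, assuming the wordcloud library and an invented snippet of text:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = ("data science python data analysis visualization data "
        "model statistics machine learning data")              # illustrative text

wc = WordCloud(width=600, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()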

Example 3.50: Create Word Cloud

Fig. 3.192: Basic Word Cloud using Python Code.

Fig. 3.193: Basic Word Cloud using Python Code.

Example 3.51: Create Word Cloud

Fig. 3.194: Custom Color Word Cloud using Python Code.

Fig. 3.195: Custom Color Word Cloud using Python Code.

Example 3.52: Create Word Cloud

Fig. 3.196: Word Frequency Word Cloud using Python Code.

Fig. 3.197: Word Frequency Word Cloud using Python Code.

Example 3.53: Create Word Cloud

Fig. 3.198: Large Corpus Word Cloud using Python Code.

Fig. 3.199: Large Corpus Word Cloud using Python Code.

3.16 Streamgraphs:
Streamgraphs display time-series data as stacked areas, allowing you to see how different
categories contribute to a whole over time.
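A minimal, hedged sketch of a streamgraph using matplotlib's stackplot with the "wiggle" baseline, and invented illustrative series:

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)                                   # illustrative time axis
y1, y2, y3 = np.random.default_rng(4).random((3, 10)) * 10

# baseline="wiggle" turns a stacked area chart into a streamgraph-style layout
plt.stackplot(x, y1, y2, y3, baseline="wiggle", labels=["A", "B", "C"])
plt.legend(loc="upper left")
plt.title("Streamgraph")
plt.show()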

Example 3.54: Create Streamgraphs

Fig. 3.200: Basic Streamgraph using python code.

Fig. 3.201: Basic Streamgraph using python code.

Example 3.55: Create Streamgraphs

Fig. 3.202: Custom Color Streamgraph using Python Code.

Fig. 3.203: Custom Color Streamgraph using Python Code.

Example 3.56: Create Streamgraphs

Fig. 3.204: Streamgraph with Smooth Lines using Python Code.

Fig. 3.205: Streamgraph with Smooth Lines using Python Code.

Example 3.57: Create Streamgraphs

Fig. 3.206: Streamgraph with Negative Values using Python Code.

Fig. 3.207: Streamgraph with Negative Values using Python Code.

Example 3.58: Create Streamgraphs

Fig. 3.208: Streamgraph with Custom Labels and Tooltip using Python Code.

Fig. 3.209: Streamgraph with Custom Labels and Tooltip using Python Code.

3.17 3D Visualizations:
These visualizations add a third dimension (depth) to the data representation, making them
suitable for complex spatial data or volumetric data.
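A minimal, hedged sketch of a 3D scatter plot, assuming matplotlib's 3D axes and invented random points:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(5)
x, y, z = rng.random((3, 50))                       # illustrative 3D points

fig = plt.figure()
ax = fig.add_subplot(projection="3d")               # enable the 3D axes
ax.scatter(x, y, z, c=z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.title("3D Scatter Plot")
plt.show()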

Example 3.59: Create 3D Visualizations

Fig. 3.210: 3D Scatter Plot using Python Code.

Fig. 3.211: 3D Scatter Plot using Python Code.

Example 3.60: Create 3D Visualizations

Fig. 3.212: 3D Surface Plot Using Python Code.

Fig. 3.213: 3D Surface Plot Using Python Code.

Example 3.61: Create 3D Visualizations

Fig. 3.214: 3D Line Plot using Python Code.

Fig. 3.215: 3D Line Plot using Python Code.

Example 3.62: Create 3D Visualizations

Fig. 3.216: 3D Bar Plot Using Python Code.

Fig. 3.217: 3D Bar Plot Using Python Code.

Example 3.63: Create 3D Visualizations

Fig. 3.218: 3D Contour Plot Using Python Code.

Fig. 3.219: 3D Contour Plot Using Python Code.

3.18 Gantt Charts:


Gantt charts are used for project management to visualize tasks, their durations, and
dependencies over time.
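A minimal, hedged sketch of a Gantt chart, assuming Plotly Express (px.timeline) and invented task dates:

import plotly.express as px
import pandas as pd

# Illustrative project tasks with start and finish dates
df = pd.DataFrame([
    {"Task": "Planning", "Start": "2024-01-01", "Finish": "2024-01-10"},
    {"Task": "Design",   "Start": "2024-01-08", "Finish": "2024-01-20"},
    {"Task": "Build",    "Start": "2024-01-18", "Finish": "2024-02-05"},
])

fig = px.timeline(df, x_start="Start", x_end="Finish", y="Task")
fig.update_yaxes(autorange="reversed")   # list tasks from top to bottom
fig.show()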

Example 3.64: Create Gantt Charts

Fig. 3.220: Basic Gantt Chart Using Python Code.

Fig. 3.221: Basic Gantt Chart Using Python Code.

Example 3.65: Create Gantt Charts

Fig. 3.222: Gantt Chart with Dependencies using Python Code.

Fig. 3.223: Gantt Chart with Dependencies using Python Code.

CHAPTER FOUR: Data Manipulation

DataFrame

Data Preprocessing

Data Cleaning

Checking for Consistency

Data manipulation

Data manipulation is the process of organizing, arranging, and transforming data in order to make it more useful and informative. It is a fundamental step in data analysis, data mining, and machine learning. This chapter covers the following aspects of data manipulation:

- DataFrame
- Data Preprocessing
- Data Cleaning
- Checking for Consistency

4.1 DataFrame

In Python, a data frame is a two-dimensional, array-like data structure provided by the pandas library. It is a powerful tool for data manipulation, processing, and analysis.

The pandas library provides two primary classes: DataFrame and Series. A DataFrame is a table-like structure consisting of rows and columns, where each column can have a different data type. A Series is a one-dimensional array-like structure that represents a single column of a DataFrame. See Examples 4.1 to 4.4, which include Figures 4.1 to 4.8.
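A minimal, hedged sketch of these two classes (the column names and values are invented for illustration):

import pandas as pd

# A DataFrame: rows and columns, where each column may have its own data type
df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [34, 29, 41],
    "city": ["Baghdad", "Basra", "Erbil"],
})
print(df)

# A Series: a single, one-dimensional column of the DataFrame
ages = df["age"]
print(type(ages))        # <class 'pandas.core.series.Series'>
print(ages.mean())       # simple computations work directly on a Series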

Example 4.1: Create DataFrame.

Fig.4.1 : Create DataFrame with output result using Python code for Example 1.

Example 4.2: Create DataFrame.

Fig.4.2 : Create DataFrame using Python code.


1. Before running the code

2. There are no files (orders_data and customers_data)

Fig. 4.3: State before running the Python code that creates the files.

Fig. 4.4: Save two CSV data files using Python code.

3. After running the cell


4. Create two files (orders_data and customers_data)

Fig.4.5 : After running the Python code.

Fig.4.6 : The output displays the two data.csv files.

Example 4.3: Reading data file.

Fig. 4.7: Read and print the data.csv file.

Example 4.4: Reading data file

Fig. 4.8: Read and print the data.csv file.

4.2: Data preprocessing

Data preprocessing is the process of preparing data for analysis by cleaning, transforming, and
organizing it into a format that is suitable for analysis. The quality and accuracy of data are
critical to the success of any data science project, and data preprocessing is an essential step in
ensuring that the data is clean, complete, and consistent.

Here are some common techniques used in data preprocessing:

a. Data cleaning: Data cleaning involves identifying and correcting errors and inconsistencies in
the data. This includes handling missing values, dealing with outliers, and correcting errors in the
data.

b. Data transformation: Data transformation involves converting the data into a format that is
suitable for analysis. This includes scaling, normalization, and encoding categorical variables.

c. Feature engineering: Feature engineering involves creating new features from existing data to
improve the performance of machine learning models. This includes creating new variables
based on existing ones or combining multiple variables to create new ones.

d. Data reduction: Data reduction involves reducing the size of the data by eliminating redundant
or irrelevant features. This includes feature selection or dimensionality reduction techniques such
as PCA (Principal Component Analysis).

e. Data integration: Data integration involves combining data from multiple sources to create a
unified dataset. This includes merging or joining datasets that have a common variable.

f. Data splitting: Data splitting involves dividing the dataset into training, validation, and testing
sets to evaluate the performance of machine learning models.

The goal of data preprocessing is to prepare the data for analysis by cleaning, transforming, and
organizing it into a format that is suitable for analysis. By using these techniques, data scientists
can ensure that the data is accurate, complete, and consistent, and that it is in a format that is
suitable for analysis. Data preprocessing is an iterative process, and data scientists may need to
revisit and revise their preprocessing steps as they gain more insights from the data.
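As a brief, hedged sketch of several of these steps together (encoding, splitting, and scaling), assuming pandas and scikit-learn and using invented data:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative raw data: a numeric feature, a categorical feature, and a target
df = pd.DataFrame({
    "income": [2500, 4000, 3200, 5000, 2800],
    "city": ["Baghdad", "Basra", "Baghdad", "Erbil", "Basra"],
    "target": [0, 1, 0, 1, 0],
})

df = pd.get_dummies(df, columns=["city"])                 # encode the categorical variable
X, y = df.drop(columns="target"), df["target"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

scaler = StandardScaler()                                 # scale the features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)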

4.3. Data Cleaning

Data cleaning refers to the process of identifying and correcting or removing errors,
inconsistencies, and inaccuracies in a dataset to improve its quality and usefulness for analysis.

It is an essential step in data preprocessing before performing any data analysis or modeling.

The process of data cleaning typically involves several steps, including:

a. Removing duplicates: Identifying and removing duplicate records or observations in the dataset to reduce bias and improve accuracy (see Figures 4.9 and 4.10).
b. Handling missing values: Identifying and handling missing values in the dataset by either removing the records with missing values or imputing the missing values with estimates, such as the mean or median (see Figures 4.11 to 4.18).
c. Handling outliers: Identifying and handling outliers or extreme values in the dataset by
either removing the outliers or replacing them with more reasonable values.
d. Standardizing data formats: Ensuring that the data formats are consistent and
standardized across the dataset to reduce errors and improve comparability.
e. Encoding categorical variables: Converting categorical variables into numerical or binary
format to enable analysis and modeling.

Now, let's see how to perform data cleaning in Python using the pandas library; see the following Examples 4.5 to 4.8, which include Figures 4.9 to 4.18.
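As a minimal, hedged sketch of the pandas operations the steps above describe (dropping duplicates and handling missing values), using invented data; the book's own versions appear in the examples that follow:

import pandas as pd
import numpy as np

# Illustrative messy data with a duplicate row and missing values
df = pd.DataFrame({
    "name": ["Ali", "Sara", "Sara", "Omar", None],
    "age": [34, 29, 29, np.nan, 41],
})

df = df.drop_duplicates()                          # a. remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())     # b. impute missing ages with the mean
df = df.dropna(subset=["name"])                    # drop rows with a missing name
print(df)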

Example 4.5: Create data file

Fig. 4.9: Output DataFrame with two duplicated rows (rows 1 and 4) using Python code.

Example 4.6: read and save data file.

Fig. 4.10: Cleaned duplicated data using Python code.

Example 4.7: Create DataFrame

Fig.4.11: install pandas library.

Fig.4.12: imports two libraries (pandas & numpy).

Fig.4.13: Define the number of rows & columns.

Fig.4.14: Create a Data Frame of random numbers between 0 and 10.

Fig.4.15: Display the Data Frame.

Example 4.8: Create data file with missing values and save file.

Fig.4.16: create missing values for Data Frame.

Fig.4.17: Display the Data Frame with missing values.

Fig.4.18: Save the Data Frame under “data_with_missing_values.csv” file.

4.3.1 Outliers

Outliers, in the realm of statistics and data analysis, refer to data points that deviate significantly
from the overall pattern or distribution of a dataset. These observations lie at an abnormal
distance from other data points, making them stand out distinctly. The identification and
understanding of outliers are crucial in various analytical processes as they can affect the
accuracy and validity of statistical models and interpretations. Outliers are typically identified by
employing various mathematical and statistical techniques, such as the use of z-scores, box plots,
or the calculation of interquartile range (IQR). By quantifying the degree of deviation from the
norm, outliers can be objectively recognized and analyzed.

Outliers play a fundamental role in several areas where data analysis is applied. One key
application is in anomaly detection, which involves identifying abnormal or unexpected
observations. By considering outliers as potential anomalies, statistical models and algorithms
can be designed to automatically detect and flag unusual patterns or outliers in large datasets.
This has numerous practical applications, such as fraud detection in financial transactions,
network intrusion detection in cyber security, or identifying potential outliers in medical data that
might indicate unusual health conditions.

Moreover, outliers can also impact the accuracy and reliability of statistical models and
predictions

Handling outliers is an important aspect of data analysis and statistical modeling. Outliers are
data points that significantly deviate from the rest of the data, and they can have a significant
impact on the analysis results. Outliers can occur due to various reasons such as measurement
errors, data entry errors, or genuine extreme observations. Dealing with outliers requires careful
consideration to ensure that they do not unduly influence the analysis outcome. Here's an
overview of the theory and some examples of how to handle outliers using Python:

4.3.2. Identify outliers:

The first step in handling outliers is to identify them in the dataset. This can be done through
graphical methods (e.g., box plots, scatter plots) or statistical methods (e.g., z-score, modified z-
score, Tukey's fences).

4.3.3. Understand the nature of outliers:

It's important to understand whether outliers are genuine extreme observations or if they are the
result of errors. This understanding helps in deciding how to handle them appropriately.

4.3.4. Decide on the approach:

Depending on the nature of the outliers and the specific analysis goals, there are several
approaches to handle outliers:

a. Removal: If the outliers are due to errors or have a significant impact on the analysis,
they can be removed from the dataset. However, this should be done cautiously, as
removing outliers may affect the representativeness of the data.
b. Transformation: Applying transformations (e.g., logarithmic, square root, or reciprocal
transformations) to the data can sometimes help in reducing the impact of outliers.
c. Winsorization: Winsorization replaces the extreme values with a less extreme but still
plausible value. For example, the outliers can be replaced with the nearest data values
within a certain percentile range.
d. Binning: Binning involves grouping data into bins or intervals and replacing outlier
values with the bin boundaries or central tendency measures.
e. Robust methods: Robust statistical measures, such as the median and the interquartile range, are less influenced by outliers than their traditional counterparts (e.g., the mean and standard deviation).

There are several methods to detect outliers, including:

I. Z-score method: The z-score method is a statistical technique used to identify outlier
values within a dataset. It calculates the deviation of a data point from the mean of the
dataset in terms of standard deviations. The z-score is computed by subtracting the mean
from the data point and dividing the result by the standard deviation. This standardized
value provides a measure of how many standard deviations a data point is away from the
mean. By comparing the z-score of a data point to a threshold value, outliers can be
identified (see Figure 4.19).

This method calculates the z-score of each data point, which represents the number of
standard deviations a data point is from the mean of the dataset. Any data point with a z-
score greater than a certain threshold (usually 2 or 3) is considered an outlier.

Overall, the z-score method serves as a valuable tool for outlier detection in diverse
domains, facilitating decision-making processes and ensuring data accuracy.

II. Interquartile range (IQR) method: The interquartile range (IQR) is a measure of
variability that is used to describe the spread of a dataset. It is calculated as the difference
between the upper quartile (Q3) and the lower quartile (Q1) of the data.

This method calculates the IQR of the dataset, which is the range between the first
quartile (25th percentile) and the third quartile (75th percentile). Any data point outside
the range of 1.5 times the IQR below the first quartile or above the third quartile is
considered an outlier.

To calculate the IQR, you first need to find the median of the dataset. Then, you split the
dataset into two halves: the lower half (values below the median) and the upper half
(values above the median).
Next, you find the median of each half. The lower median is called the first quartile (Q1),
and the upper median is called the third quartile (Q3). The IQR is then calculated as the
difference between Q3 and Q1:

IQR = Q3 - Q1

The IQR is a useful measure of variability because it is less sensitive to extreme values or
outliers than other measures such as the range or standard deviation. It can be used to
identify potential outliers in a dataset or to compare the variability of different datasets.

To identify potential outliers using the IQR method, you can use the following rule: a data point is considered a potential outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

Overall, the IQR method is a simple yet effective way to summarize the spread of a dataset and identify potential outliers.

III. Visualization method: The visualization method is based on the principle of representing data in a graphical or visual format to easily identify patterns, trends, and anomalies. By plotting the data points on a graph or chart, outliers, which are data points that significantly deviate from the expected or normal values, can be visually identified as they appear as distinct and distant from the majority of the data points.

This method involves plotting the data points and visually inspecting the plot for any data
points that are significantly different from other data points. Box plots, scatter plots, and
histograms are commonly used for this method.

This method serves as an intuitive and effective tool for outlier detection, contributing to improved decision-making and problem-solving in diverse domains. See Examples 4.9 to 4.16, which include Figures 4.19 to 4.36.
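As a hedged sketch (not the book's own code, which appears in the figures below), the z-score and IQR rules described above can be applied with NumPy roughly as follows; the sample data and the threshold of 2 are illustrative:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 13, 12, 45])   # 45 is an obvious outlier

# I. Z-score method: flag points more than 2 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z_scores) > 2])

# II. IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower) | (data > upper)])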

Example 4.9: create dataset with outliers

Fig. 4.19: Determine the outlier values using the z-score method with Python code.

Fig. 4.20: IQR and outliers using Python code for Example 4.9.

Fig. 4.21: Statistical output and plot with IQR and outliers using Python code.

Example 4.10: Create dataset without Outliers.

Fig. 4.22: IQR without outliers using Python code for Example 4.10.

Fig. 4.23: Statistical output and plot with IQR, without outliers, using Python code for Example 4.10.

Example 4.11: Visualization method to plot outliers

Fig. 4.24: Plot and determine the outlier value using a box plot with Python code.

Example 4.12: Create dataset with two outliers.

Fig. 4.25: Plot and determine the outlier values using a box plot with Python code.

Example 4.13: Standardized data

Fig. 4.26: Output of standardized data using Python code.

Example 4.14: Standardized data

Fig. 4.27: Output of standardized data using Python code.

Example 4.15: Standardized data

Fig. 4.28: Output of standardized data using Python code.

Example 4.16: Encoding categorical variables

Fig.4.29: import library (pandas)

Fig.4.30: import library (LabelEncoder).

Fig.4.31: Creating and store LabelEncoder under the name “le”

Fig.4.32: Create the color data and store them under the name of “data”.

Fig.4.33: convert the raw data to data Frame.

Fig.4.34: Apply the library “Label Encoding” on the color column.

Fig.4.35: Display the encoded Data Frame.

Fig.4.36: the output of the encoded Data Frame.

4.4: Checking for consistency

Checking for consistency in data is an essential task in data management and analysis.

It involves ensuring that the data is accurate, valid, and coherent, both within individual data sets
and across multiple data sources.

The goal is to identify and resolve any discrepancies, errors, or anomalies that may exist in the
data.

Here's a detailed overview of the theory behind checking for consistency in data:

4.4.1. Data Integrity:

Data integrity refers to the accuracy, completeness, and reliability of data. It ensures that the data
is not corrupted, modified, or tampered with in any unauthorized manner. Various techniques can
be used to ensure data integrity, such as checksums, hash functions, and error detection codes.

4.4.2. Validation Rules:

Validation rules are predefined criteria or constraints that determine the acceptable values and
formats for data. These rules help ensure that data entered into a system meets the specified
criteria. Common validation rules include data type checks (e.g., numeric, alphanumeric), range
checks, format checks (e.g., email addresses, phone numbers), and referential integrity checks
(ensuring data consistency across related tables).
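As a small, hedged sketch of such validation rules in Python (the record, the age range, and the email pattern are illustrative assumptions):

import re

record = {"age": "34", "email": "user@example.com"}   # illustrative record

errors = []
if not record["age"].isdigit() or not (0 < int(record["age"]) < 120):
    errors.append("age must be a whole number between 1 and 119")   # type and range check
if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record["email"]):
    errors.append("email address is not in a valid format")         # format check

print("Record is valid" if not errors else f"Validation errors: {errors}")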

4.4.3. Cross-Field Consistency:

Cross-field consistency involves checking the relationships and dependencies between different
fields within a data set. It ensures that the values in one field are consistent with the values in
related fields. For example, in a customer database, the customer's age should match their birth
date.

4.4.4. Cross-Table Consistency:

Cross-table consistency focuses on checking the consistency of data across multiple tables or
data sources. It ensures that the relationships and references between tables are maintained
correctly. For instance, in a relational database, foreign key constraints are used to enforce
referential integrity between related tables.

4.4.5. Data Profiling:

Data profiling involves analyzing and understanding the structure, content, and quality of data. It
helps identify inconsistencies, duplicates, missing values, outliers, and other data issues. Data
profiling techniques include statistical analysis, pattern recognition, and data visualization.

4.4.6. Data Cleansing:

Data cleansing, or data scrubbing, is the process of identifying and correcting or removing
inconsistencies, errors, and inaccuracies in the data. This may involve tasks like removing
duplicate records, filling in missing values, correcting formatting issues, and resolving conflicts
in data from different sources.

4.4.7. Error Handling and Logging:

When inconsistencies or errors are detected, it's crucial to have proper error handling
mechanisms in place. This includes logging and reporting errors, capturing details about the
nature of the inconsistency, and providing notifications to appropriate stakeholders for further
investigation and resolution.

4.4.8. Data Governance:

Data governance encompasses the policies, processes, and controls put in place to ensure data
quality, consistency, and reliability across an organization. It involves defining data standards,
roles and responsibilities, data management procedures, and enforcing data quality measures.

By applying these principles and techniques, organizations can establish robust mechanisms to
check for consistency in their data, ensuring its accuracy, reliability, and usefulness for various
applications and decision-making processes; see Examples 4.17 to 4.27 and Figures 4.37 to 4.51
below.

Example 4.17: Data Integrity

Fig. 4.37: Data integrity check using Python code.

Example 4.18: Validation Rules

Fig. 4.38: Validation Rules using Python code.

Fig. 4.39: Output of the validation rules Python code.

Example 4.19: Cross-field consistency

Fig 4.40: Cross-field consistency using Python code.

Fig. 4.41: Output of the cross-field consistency Python code.

Example 4.20: Cross-field consistency

Fig. 4.42: Cross-field consistency using Python code.

Fig. 4.43: Output of the cross-field consistency Python code.

Example 4.21: Cross-Table Consistency

Fig. 4.44 : Python code for Cross-Table Consistency.

Fig. 4.45: Output Cross-Table Consistency using Python code.

Example 4.22: Data Profiling

Fig. 4.46: Using Python code for data profiling with statistical output.

Example 4.23: Data Cleansing

Fig.4.47: Create data file “customer_data.csv” using Python code.

Example 4.24: Cleaned dataset.

Fig. 4.48: Using Python code to clean the dataset.

Example 4.25: Error Handling

Fig. 4.49: Error result "Cannot divide by zero" using Python code.
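
The figure's code is not reproduced here, but a comparable error-handling sketch that catches and logs a division-by-zero error could look like this:

import logging

logging.basicConfig(level=logging.ERROR)

def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        logging.error("Cannot divide by zero (a=%s, b=%s)", a, b)
        return None

print(safe_divide(10, 2))   # prints 5.0
print(safe_divide(10, 0))   # logs the error and prints None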

Example 4.26: Data Governance

Fig. 4.50: Create data file “CSV” using Python code.

Example 4.27: Data Governance

Fig. 4.51: Using Python code to govern data.

CHAPTER FIVE: Exploratory Data Analysis

Measure of location

Measures of Dispersion

Measures of Position

Exploratory Data Analysis
5.1 Measure of location

Descriptive statistics is a branch of statistical analysis that focuses on summarizing and


interpreting data using numerical measures. It aims to describe the main characteristics of a
dataset, such as central tendency, variability, and distribution. The principle behind descriptive
statistics is to simplify complex data into manageable and meaningful information, allowing
researchers to gain insights and draw conclusions.

Descriptive statistics finds various applications in different fields, such as economics,


psychology, biology, and social sciences. One prominent application is in data exploration,
where descriptive statistics helps researchers gain an initial understanding of the dataset's
structure and properties. For example, measures like mean, median, and mode provide insights
into the dataset's central tendency, while measures like range, variance, and standard deviation
quantify its variability. Descriptive statistics also aids in summarizing categorical data through
frequency tables or charts, enabling researchers to identify patterns and relationships.
Furthermore, it plays a crucial role in data visualization, where graphical representations like
histograms, box plots, and scatter plots help communicate data patterns effectively. Overall,
descriptive statistics serves as the foundation for statistical analysis, allowing researchers to
make informed decisions and draw meaningful conclusions from data.

Measures of central tendency are statistical measures that aim to describe the central or typical
value of a dataset. They provide a single representative value that summarizes the distribution of
data points. The three commonly used measures of central tendency are the mean, median, and
mode.

Mean: The mean, also known as the arithmetic mean or average, is calculated by summing up all
the values in a dataset and dividing by the total number of values. It is sensitive to extreme
values and provides a measure of the central value around which the data points tend to cluster.

Median: The median is the middle value of a dataset when it is arranged in ascending or
descending order. If there are an odd number of values, the median is the value exactly in the
middle. If there is an even number of values, the median is the average of the two middle values.
The median is less affected by extreme values and provides a measure of the central position in
the dataset.

Mode: The mode is the value(s) that occur(s) most frequently in a dataset. Unlike the mean and
median, the mode does not require numerical data and can be used for both categorical and
numerical variables. A dataset may have one mode (unimodal), two modes (bimodal), or more
than two modes (multimodal). It can also be described as having no mode (no value occurs more
than once).

These measures of central tendency provide different perspectives on the typical value of a
dataset and can be used in different situations. The mean is commonly used when the data is
normally distributed and not influenced by extreme values. The median is useful when the data
has outliers or is skewed. The mode is often employed to describe the most frequently occurring
value or to identify the most common category in categorical data.

It is important to choose the appropriate measure of central tendency based on the characteristics
of the dataset and the purpose of analysis. Using multiple measures of central tendency can
provide a more comprehensive understanding of the dataset and its distribution.

5.1.1 The Mean

The formula for calculating the mean (also known as the average) of a set of numbers is:

x̄ = Σ xᵢ / n

where: xᵢ is the i-th value in the dataset, i = 1, 2, …, n

n: sample size.

Here are numerical Examples 5.1 and 5.2 to illustrate how to calculate the mean:

Example 5.1:

Consider the following set of numbers: 5, 8, 12, 6, 10

Step 1: Add all the numbers together: 5 + 8 + 12 + 6 + 10 = 41

Step 2: Count the total number of numbers in the set: 5 numbers

Step 3: Calculate the mean: Mean = 41 / 5 = 8.2

So, the mean of the given set of numbers is 8.2.

Example 5.2:

Let's take another set of numbers: 3, 5, 7, 1, 2, 4, 9

Step 1: Add all the numbers together: 3 + 5 + 7 + 1 + 2 + 4 + 9 = 31

Step 2: Count the total number of numbers in the set: 7 numbers

Step 3: Calculate the mean: Mean = 31 / 7 ≈ 4.4286 (rounded to four decimal places)

So, the mean of the second set of numbers is approximately 4.4286.

5.1.2 The median

The median is the middle value of a dataset when it is ordered from least to greatest. In case the
dataset has an even number of values, the median is the average of the two middlemost values.
For datasets with an odd number of values:

Arrange the dataset in ascending order. The median is the middle value.

Here are Examples 5.3 to 5.6 to illustrate how to calculate the median:

Example 5.3: (Odd number of values):

Consider the following dataset: 8, 15, 5, 12, 20, 6, 11

Step 1: Arrange the dataset in ascending order: 5, 6, 8, 11, 12, 15, 20

Step 2: Since there are 7 values, the median is the middle value, which is 11.

So, the median of the given dataset is 11.

Example 5.4: (Even number of values):

Let's take another dataset: 3, 7, 1, 5, 2, 4

Step 1: Arrange the dataset in ascending order: 1, 2, 3, 4, 5, 7

Step 2: Since there are 6 values, the median is the average of the two middle values: (3 + 4) / 2 =
3.5

So, the median of the second dataset is 3.5.

Example 5.5: (Repeated values, odd number):

Dataset: 5, 5, 5, 2, 3

Step 1: Arrange the dataset in ascending order: 2, 3, 5, 5, 5

Step 2: Since there are 5 values, the median is the middle value, which is 5.

So, the median of this dataset is 5.

Example 5.6: (Repeated values, even number):

Dataset: 8, 7, 5, 5, 3, 2

Step 1: Arrange the dataset in ascending order: 2, 3, 5, 5, 7, 8

Step 2: Since there are 6 values, the median is the average of the two middle values: (5 + 5) / 2 = 5.
So, the median of this dataset is 5.

5.1.3 The mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode,
multiple modes (bimodal, trimodal, etc.), or no mode (when all values occur with the same
frequency).

Formula for calculating the mode:

For ungrouped data:

The mode is the value that occurs most frequently in the dataset.

Examples 5.7 and 5.8 show how to calculate the mode:

Example 5.7 (Ungrouped data):

Consider the following dataset: 3, 7, 2, 7, 5, 9, 7

To find the mode, we look for the value that appears most frequently. In this case, the value "7"
occurs three times, which is more frequent than any other value.

So, the mode of the given dataset is 7.

Example 5.8 (Ungrouped data with no mode):

Dataset: 4, 1, 6, 2, 7, 3, 5

In this dataset, all the values appear exactly once, and there is no value with a higher frequency
than the others.

So, there is no mode for this dataset.

Example 5.9: Mean, Median, and Mode.

Fig. 5.1: Calculate mean, median and mode using python code.
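
A comparable sketch (not necessarily the exact code of Fig. 5.1) uses Python's built-in statistics module on the dataset from Example 5.7:

import statistics

data = [3, 7, 2, 7, 5, 9, 7]

print("Mean:", statistics.mean(data))      # 40/7 ≈ 5.71
print("Median:", statistics.median(data))  # 7 (middle of the sorted values)
print("Mode:", statistics.mode(data))      # 7 (most frequent value)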

5.2 Measures of Dispersion

Measures of dispersion, also known as measures of variability or spread, quantify the extent to
which data points deviate from the central tendency or average. They provide valuable insights
into the spread, variability, and consistency of a dataset. Some commonly used measures of
dispersion include the range, variance, standard deviation, and interquartile range.

5.2.1 Range: The range is the simplest measure of dispersion and is calculated as the difference
between the maximum and minimum values in a dataset. It gives an indication of the total spread
of the data but does not take into account the distribution of values in between.

Range = Max. – Min.

Example 5.10: For the dataset {10, 15, 20, 25, 30}, the range would be 30 - 10 = 20.

5.2.2 Variance: Variance measures how far each data point in a dataset deviates from the mean.
It is calculated by taking the average of the squared differences between each data point and the
mean. A higher variance indicates greater variability in the data.

Variance (σ²) = Σ (xᵢ − μ)² / n

Example 5.11: For the dataset {10, 15, 20, 25, 30}, if the mean is 20, the variance would be
((10-20)^2 + (15-20)^2 + (20-20)^2 + (25-20)^2 + (30-20)^2) / 5 = 50.

5.2.3 Standard Deviation: The standard deviation is the square root of the variance. It provides
a measure of dispersion that is in the same units as the original data, making it easier to interpret.
A higher standard deviation indicates greater spread or variability in the dataset.

Example 5.12: For the same dataset as above, the standard deviation would be the square root of
the variance, which is √50 ≈ 7.07.

5.2.4 Interquartile Range (IQR): The interquartile range is a measure of statistical dispersion
that focuses on the middle 50% of the data. It is calculated as the difference between the third
quartile (75th percentile) and the first quartile (25th percentile). The IQR is robust to outliers and
is often used to describe the spread of skewed or non-normally distributed data see the following
Figure 5.2:


Fig. 5.2: Interquartile Range.

5.2.5 Coefficient of Variation (CV): CV is the ratio of the standard deviation to the mean,
expressed as a percentage. It allows comparison of dispersion between datasets with different
scales.

CV = (Standard deviation / Mean) * 100.

Example 5.13: If the mean of dataset A is 50 and the standard deviation is 10, the CV would be
(10 / 50) * 100% = 20%.

These measures of dispersion allow for a more comprehensive understanding of the data beyond
just the central tendency. They help identify the spread of values, detect outliers, assess the
variability within a dataset, and compare the variability between different datasets.

It is important to consider the appropriate measure of dispersion based on the characteristics of
the dataset and the specific research or analysis goals. Different measures of dispersion may be
more suitable for different types of data and research questions.

Example 5.14: Measures of dispersion

Fig. 5.3: Measure of Dispersion Python code.

Fig. 5.4: The output of the Measure of Dispersion using Python code.

Fig. 5.5: The measures of dispersion graph produced by the code in Fig. 5.3.
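
Because Figs. 5.3-5.5 are shown only as images, here is a comparable minimal sketch of the same dispersion measures using NumPy on the dataset from Example 5.10:

import numpy as np

data = np.array([10, 15, 20, 25, 30])

data_range = data.max() - data.min()        # Range = 20
variance = np.var(data)                     # Population variance = 50.0
std_dev = np.std(data)                      # Standard deviation ≈ 7.07
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                               # Interquartile range = 10.0
cv = std_dev / data.mean() * 100            # Coefficient of variation ≈ 35.4%

print(data_range, variance, std_dev, iqr, cv)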

5.3 Measures of Position

Measures of Position are statistical metrics that help us understand the relative position of a
specific data point within a dataset. They provide valuable insights into how an individual data
point compares to other data points in the same dataset. Some common examples of measures of
position include frequency distribution, stem and leaf, percentiles, quartiles, and z-scores.

5.3.1 Frequency distribution

Frequency distribution is a statistical representation of data that shows the frequency or count of
each value or range of values in a dataset. It organizes data into different categories or intervals

and provides a summary of how frequently each category occurs. Frequency distribution helps to
identify patterns, central tendencies, and variations within a dataset.

The main components of a frequency distribution are:

 Variable: The variable represents the characteristic or attribute being measured or


observed in the dataset. It can be numerical or categorical.
 Classes or Intervals: In a frequency distribution, the range of values for the variable is
divided into classes or intervals. Classes help to group similar values together and
simplify the representation of data. For numerical variables, classes are often defined by
specifying the lower and upper limits of each interval. For categorical variables, classes
represent distinct categories.
 Frequency: Frequency refers to the number of times a particular value or range of values
occurs within each class. It represents the count or frequency of observations falling
within each interval or category. Frequencies are typically presented as whole numbers.
 Cumulative Frequency: Cumulative frequency is the running total of frequencies for
each class. It provides additional information by showing the accumulation of frequencies
as you move from one class to the next. Cumulative frequency is useful for analyzing the
distribution of data and identifying percentiles or quartiles.

Frequency distribution can be represented using various formats, including tables, histograms,
bar charts, or line graphs. These visual representations help in understanding the distributional
characteristics of the data, such as the shape, central tendency, and dispersion.

Frequency distributions are widely used in descriptive statistics to summarize data, identify
outliers, detect patterns, and make comparisons between different groups or datasets. They
provide a compact and organized way to analyze and interpret large amounts of data.

Let's consider a dataset of test scores for a group of students as follows:

{52, 55, 60, 65, 68, 70, 72, 75, 78, 82, 85, 88, 92, 95, 98}

To create a frequency distribution table with intervals (classes), we'll group the scores into
classes and count the frequency of scores falling into each class.

Example 5.15:

Let's use intervals of width 10, starting from 50 to 100.

Step 1: Create the intervals (classes):

50 - 59

60 - 69

70 - 79

80 - 89

90 - 99

Step 2: Count the frequency of scores in each class:

50 - 59: 2 (52, 55)

60 - 69: 3 (60, 65, 68)

70 - 79: 4 (70, 72, 75,78)

80 - 89: 3 (82, 85, 88)

90 - 99: 3 (92, 95, 98)

Step 3: Create the frequency distribution table:

Class Frequency
50 - 59 2
60 - 69 3
70 - 79 4
80 - 89 3
90 - 99 3

In this frequency distribution table, we've grouped the test scores into intervals (classes) and
displayed the number of scores falling into each interval. This table provides a concise and
informative representation of the data's distribution.

Example 5.16: let’s consider the following data (ages):

{22, 25, 28, 32, 35, 38, 40, 42, 45, 48, 52, 55, 58, 62, 65, 68, 72, 75}

To create a frequency distribution table with intervals (classes), we'll group the ages into classes
and count the frequency of ages falling into each class.

Let's use intervals of width 10, starting from 20 to 80.

Step 1: Create the intervals (classes):

20 - 29

30 - 39

40 - 49

50 - 59

60 - 69

70 - 79

Step 2: Count the frequency of ages in each class:

20 - 29: 3 (22, 25, 28)

30 - 39: 3 (32, 35, 38)

40 - 49: 4 (40, 42, 45, 48)

50 - 59: 3 (52, 55, 58)

60 - 69: 3 (62, 65, 68)

70 - 79: 2 (72, 75)

Step 3: Create the frequency distribution table:

Class Frequency
20 - 29 3
30 - 39 3
40 - 49 4
50 - 59 3
60 – 69 3
70 - 79 2

In this frequency distribution table, we've grouped the ages into intervals (classes) and displayed
the number of participants falling into each interval. This table gives us a clear overview of the
age distribution among the survey participants.
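
The same frequency table can be built in Python with pandas; the sketch below reproduces Example 5.16 under the assumption that half-open classes [20, 30), [30, 40), ... match the book's intervals:

import pandas as pd

ages = pd.Series([22, 25, 28, 32, 35, 38, 40, 42, 45, 48, 52, 55, 58, 62, 65, 68, 72, 75])

bins = [20, 30, 40, 50, 60, 70, 80]                             # class edges
labels = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]

classes = pd.cut(ages, bins=bins, labels=labels, right=False)   # right=False -> [20, 30), [30, 40), ...
print(classes.value_counts().sort_index())                      # frequencies 3, 3, 4, 3, 3, 2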

5.3.2 Stem and Leaf


The stem-and-leaf plot is a data visualization technique used in statistics to display the
distribution of a dataset. It provides a way to organize and represent data in a clear and
concise manner. Let's delve into the details of stem-and-leaf plots:

a. Components of a Stem-and-Leaf Plot:
 Stem: The stem represents the leading digits or the tens-place digits of the data
values. It is typically arranged in ascending order from bottom to top. Each
unique stem corresponds to a group of data values that share the same leading
digit.
 Leaf: The leaf represents the trailing digits or the units-place digits of the data
values. It is usually arranged in ascending order from left to right within each
stem group. Each leaf represents an individual data point belonging to the same
stem.

b. Advantages of Stem-and-Leaf Plots:


 Compact Representation: Stem-and-leaf plots condense data while maintaining
the integrity of the original dataset. They are particularly useful for small to
moderately sized datasets.
 Quick Data Inspection: These plots allow you to quickly identify the distribution
of data, including the range, central tendency, and dispersion.
 Visual Clarity: Stem-and-leaf plots provide a visual representation of data that is
easier to read and interpret than raw data or frequency tables.
c. Creating a Stem-and-Leaf Plot:
Here are the steps to create a stem-and-leaf plot:
 Sort the Data: Arrange your dataset in ascending order.
 Identify Stems and Leaves: For each data point, separate the leading digit (stem)
from the trailing digits (leaves). For example, if you have the number 42, the stem
is 4, and the leaf is 2.
 Organize Data: Group the data points with the same stem together, and arrange
the leaves in ascending order within each group.
 Construct the Plot: Write the stems vertically in ascending order, aligning them
to the left. Then, list the leaves for each stem horizontally, typically in ascending
order.

Example 5.17:
Let's create a stem-and-leaf plot for the dataset: [23, 27, 31, 36, 38, 42].

Stem Leaves
2 3 7
3 1 6 8
4 2

In this example, the stems are 2, 3, and 4, and the corresponding leaves represent the
units digits of the data points.

Interpreting a Stem-and-Leaf Plot:

You can easily determine the range of the data by looking at the minimum and maximum
values of the stems and leaves.

Stem-and-leaf plots can help identify outliers. Values that are far from the main clusters of
leaves may indicate unusual or extreme data points.

Overall, stem-and-leaf plots are a valuable tool for exploratory data analysis, allowing you to
quickly grasp essential characteristics of a dataset. They are especially useful when you want
to maintain the granularity of individual data points while still visualizing the data's
distribution.
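
A small sketch that builds the stem-and-leaf display of Example 5.17 directly in Python (no plotting library required) is shown below:

from collections import defaultdict

data = sorted([23, 27, 31, 36, 38, 42])

stems = defaultdict(list)
for value in data:
    stems[value // 10].append(value % 10)    # tens digit = stem, units digit = leaf

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
# Output:
# 2 | 3 7
# 3 | 1 6 8
# 4 | 2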

Example 5.18 : Stem and leaf

Fig. 5.6: Stem and Leaf using Python code.

Fig.5.7 : Raw data for 5 examples using Python code.

Fig.5.8 : Output Stem and leaf using Python code.

Fig. 5.9: Output plots of the 5 stem-and-leaf examples using Python code.

5.3.3 Percentiles:
Percentiles divide a dataset into 100 equal parts, each representing 1% of the data. They show the
relative standing of a particular value in comparison to the rest of the data. For example, the 25th
percentile (also known as the first quartile) is the value below which 25% of the data falls, and
the 75th percentile (third quartile) is the value below which 75% of the data falls. The 50th
percentile is the median, where half of the data is above and half below.

5.3.4 Quartiles:
Quartiles are specific percentiles that divide the dataset into four equal parts, each representing
25% of the data. The three quartiles are as follows:
First quartile (Q1): The 25th percentile, separates the lowest 25% of the data from the rest.
Second quartile (Q2): The 50th percentile, same as the median, divides the data into two equal
halves.
Third quartile (Q3): The 75th percentile, separates the lowest 75% of the data from the highest
25%.
Quartiles are useful in detecting the spread and skewness of a dataset.

5.3.5. Z-Scores (Standard Scores):


Z-scores represent the number of standard deviations a data point is away from the mean of the
dataset. It standardizes the value, making it easier to compare across different datasets with
varying means and standard deviations. A positive z-score indicates the value is above the mean,
while a negative z-score indicates it is below the mean.
The formula for calculating the z-score of a data point (X) in a dataset with mean (μ) and
standard deviation (σ) is:

Z = (X - μ) / σ

Where: X is the data value being standardized,

μ is the population mean,

σ is the population standard deviation.

A z-score of 0 means the data point is equal to the mean, a z-score of +1 means it is one standard
deviation above the mean, and a z-score of -2 means it is two standard deviations below the
mean, and so on.

These measures of position play a crucial role in understanding the distribution of data and
identifying potential outliers or extreme values within a dataset. They help in making
comparisons and drawing meaningful conclusions from the data at hand.
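
As a quick illustration of these measures of position, the following sketch computes the quartiles, the IQR, and z-scores with NumPy on the test-score dataset from Example 5.15:

import numpy as np

data = np.array([52, 55, 60, 65, 68, 70, 72, 75, 78, 82, 85, 88, 92, 95, 98])

q1, median, q3 = np.percentile(data, [25, 50, 75])   # quartiles Q1, Q2 (median), Q3
iqr = q3 - q1                                        # interquartile range

z_scores = (data - data.mean()) / data.std()         # z-score of every value

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("z-score of the first value:", round(z_scores[0], 2))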

Example 5.19: Percentile

Fig. 5.10: Determine 50 percentile using Python code.

Example 5.20: percentile

Fig. 5.11: Determine 95 percentile using Python code.

Example 5.21: IQR

Fig. 5.12: Interquartile Range Python code.

Fig. 5.13: Output the IQR Python code

Example 5.22: z-score

Fig. 5.14: z-score Python code.

Fig. 5.15: Output z-score Python code.

CHAPTER SIX: Statistical Modeling with Programming Concepts

Variables with programming concepts


Parameters
Probability Distribution
Correlation
Regression analysis
Hypothesis testing
Model Assumptions
Model Evaluation
t-test
F-test
Chi-Square

6.1. Variables with programming concepts

In the context of statistics and programming, variables are containers used to store and
manipulate data. They are an essential concept in any programming language, including Python.
Here are some key details about variables:

a. Definition and Purpose: A variable is a named storage location that holds a value. It acts as a
reference to a specific memory address where the data is stored. Variables are used to store
different types of data, such as numbers, strings, boolean values, or complex objects.

b. Variable Names: Variables are typically assigned names to provide a meaningful


representation of the data they store. Variable names should follow certain rules, such as starting
with a letter or underscore, being case-sensitive, and avoiding reserved words or special
characters.

c. Data Types: Variables can have different data types, which determine the kind of values they
can hold. Common data types include integers (int), floating-point numbers (float), strings (str),
booleans (bool), and more complex types like lists, dictionaries, or objects.

d. Variable Assignment: Assigning a value to a variable is done using the assignment operator
(=). It associates a value with the variable name, allowing it to be used and referenced later in the
program.

e. Variable Reassignment: Variables can be reassigned to new values during program


execution. This flexibility allows data to be updated or modified as needed.

f. Variable Scope: Variables have a scope, which defines the portion of the program where the
variable is visible and accessible. The scope can be global (accessible throughout the program)
or local (limited to a specific block or function).

g. Data Mutation: Depending on the data type, variables can be mutable or immutable. Mutable
variables can be modified directly, while immutable variables cannot be changed after
assignment. For example, strings and tuples are immutable, while lists and dictionaries are
mutable.

h. Variable Operations: Variables can participate in various operations, such as arithmetic


calculations, string concatenation, logical operations, and comparisons. These operations can be
performed using appropriate operators depending on the data type.

i. Variable Interpolation: In some programming languages, including Python, variables can be


inserted into strings using string interpolation. This allows the values of variables to be
dynamically included within a string, making it easier to create dynamic output.

j. Variable Scope and Lifetime: Variables have a defined scope and lifetime, which determine
when they are created, accessed, and destroyed. The scope and lifetime of a variable depend on
where it is declared and how it is used within the program.

Understanding variables and how to work with them is fundamental in programming and
statistical analysis. They enable the storage and manipulation of data, making it easier to perform
calculations, track information, and build complex systems; see Figure 6.1.

Example 6.1: Variables

Fig. 6.1: Variables Python code.

Fig. 6.2: Output of the code in Fig. 6.1.

6.2 Parameters

In the context of probability distributions, parameters are numerical values that define the
characteristics of a distribution. These values determine the shape, location, and scale of the
distribution. Here are some key details about parameters:

a. Definition: Parameters are fixed values that are used to define and specify a particular
probability distribution. They describe important characteristics of the distribution, such as its
center, spread, skewness, and kurtosis.

b. Types of Parameters: The specific parameters and their interpretation depend on the
distribution being considered. Some common parameters include:

- Mean (μ): The mean represents the average or central value of the distribution. It
determines the center or location of the distribution.

- Standard Deviation (σ): The standard deviation measures the spread or variability of the
distribution. It indicates how much the values typically deviate from the mean.

- Variance (σ²): The variance is the square of the standard deviation. It represents the
average of the squared deviations from the mean and provides a measure of dispersion.

- Shape Parameters: Some distributions have additional parameters that control the shape
of the distribution. For example, the gamma distribution has shape and scale
parameters.

c. Interpretation: Parameters provide important information about the distribution. For


example, the mean can give insight into the expected value or central tendency, while the
standard deviation can indicate the degree of dispersion or variability of the data.

d. Estimation: Estimating the parameters is a common task in statistics. Given a set of observed
data, statistical methods can be used to estimate the parameters that best fit the data to a specific
distribution. This process is known as parameter estimation.

e. Hypothesis Testing: Parameters play a crucial role in hypothesis testing. Researchers often
test hypotheses about the values of certain parameters in a population based on sample data.
Hypothesis tests help determine if the observed data provides evidence to support or reject
certain parameter values.

f. Relationships between Parameters: In some distributions, parameters may be related to each


other. For example, in the normal distribution, changing the mean or standard deviation affects
the shape and position of the bell curve.

Understanding and specifying the correct parameters is crucial for accurately representing and
analyzing data using probability distributions. The choice of parameters affects the distribution's
behavior and enables meaningful interpretation and inference based on the distribution.

Example 6.2: Parameters

Fig. 6.3: Parameters Python code.

Fig. 6.4: Output Parameters using Python code.

6.3 Probability Distribution

Probability distribution, in the field of statistics and probability theory, refers to a mathematical
function that describes the likelihood of different outcomes occurring in a given set of events or
experiments. It provides a systematic way of characterizing the uncertainty associated with these
outcomes by assigning probabilities to their occurrence. Probability distributions are defined by
their shape, parameters, and specific mathematical formulas, which allow for quantification and
analysis of random variables.

Applications: Probability distributions find extensive applications across various fields. One
common application is in risk assessment and decision-making processes. By understanding the
probability distribution of potential outcomes, decision-makers can evaluate the likelihood of
different scenarios and make informed choices. Probability distributions are also fundamental in
statistical inference, where they are used to estimate population parameters based on sample
data. In addition, they play a crucial role in modeling and simulating complex systems, such as
financial markets, weather patterns, and biological phenomena. By incorporating probability
distributions into these models, researchers can generate reliable predictions and understand the
underlying dynamics of the system.

The behavior of a random variable is described by its probability distribution. For a discrete
random variable, the probability distribution is often represented by a probability mass function
(PMF), which assigns probabilities to each possible value. For a continuous random variable, the
probability distribution is described by a probability density function (PDF), which gives the
probability of the variable falling within a certain range.

6.3.1 Random Variables

Random variables are a fundamental concept in probability theory and statistics. They are used
to model and describe uncertain or random quantities in mathematical terms. A random variable
is a variable whose possible values are outcomes of a random phenomenon. It assigns a
numerical value to each outcome of an experiment.

a. Definition: A random variable is a function that maps the outcomes of a random experiment to
numerical values. It assigns a real number to each outcome in the sample space.

To generate random numbers in Python, you can use the random module. This module provides
a number of functions for generating random numbers, including:

random(): Generates a random floating-point number between 0 and 1.

uniform(): Generates a random floating-point number between two specified numbers.

randint(): Generates a random integer number between two specified numbers.
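
A short sketch of the three functions listed above (seeded only to make the run reproducible):

import random

random.seed(0)                 # optional: reproducible output

print(random.random())         # random float in [0.0, 1.0)
print(random.uniform(10, 20))  # random float between 10 and 20
print(random.randint(1, 6))    # random integer between 1 and 6, inclusive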

Example 6.3 : Random Variable.

Fig. 6.5: Random variable Using Python code.

Example 6.4: Random variable.

Fig. 6.6: Random variable Using Python code.

Example 6.5: Random variable.

Fig. 6.7: Random variable Using Python code.

Here are some key points about random variables:

b. Types of random variables: Random variables can be classified into two main types: discrete
random variables and continuous random variables.

I. Discrete random variables: These variables take on a countable set of values. For example,
the number of heads obtained when flipping a coin multiple times is a discrete random
variable, as it can only take on the values 0, 1, 2, and so on.

II. Continuous random variables: These variables can take on any value within a certain range
or interval. Examples include the height of a person, the time it takes for a car to travel a
certain distance, etc.

 Discrete probability distributions are used to describe random variables that can take on a
finite or countable number of values. Some common examples of discrete probability
distributions include:
I. Bernoulli distribution is one of the simplest and fundamental discrete probability
distributions. It models a random experiment with only two possible outcomes, often
referred to as "success" and "failure." These outcomes are typically represented as 1
(success) and 0 (failure).

The Bernoulli distribution is characterized by a single parameter, usually denoted as "p," which
represents the probability of success in a single trial. The probability of failure is then given by
(1 - p).

The probability mass function (PMF) of the Bernoulli distribution is defined as follows:

P(X = x) = p^x · (1 − p)^(1 − x),  for x ∈ {0, 1}
Where:

P(X = x) is the probability of observing the outcome x (either 1 or 0).

x: is the outcome of the experiment, with x = 1 representing a success and x = 0 representing a


failure.

p: is the probability of success in a single trial.

Properties of the Bernoulli Distribution:

 The outcomes are mutually exclusive and exhaustive, meaning that only one of the two
outcomes can occur in a single trial.
 The mean (expected value) of a Bernoulli random variable X is E(X) = p. It represents
the average probability of success in a single trial.
 The variance of X is Var(X) = p * (1 - p). The variance is a measure of how spread out
the distribution is around its mean.

Applications of the Bernoulli Distribution:

The Bernoulli distribution finds various applications in statistics and probability theory, as well
as in real-world scenarios. Some common applications include:

 Modeling binary outcomes: It is used to model events with only two possible outcomes,
such as success/failure, yes/no, heads/tails, etc.
 Bernoulli trials: It is used to represent a single trial in a sequence of independent binary
experiments, where each trial has the same probability of success.

 Decision-making and classification: In machine learning and data analysis, the Bernoulli
distribution is used to model binary classification problems.
 Binomial distribution: The Bernoulli distribution is the basis for the binomial distribution,
which represents the number of successes in a fixed number of independent Bernoulli
trials.

Overall, the Bernoulli distribution serves as a building block for more complex probability
distributions and provides a foundation for understanding the behavior of random events with
two possible outcomes.

Example 6.6: Shooting a free throw in basketball

In this example, the probability of success (making the free throw) is p = 0.7 and the probability
of failure (missing the free throw) is q = 1 - p = 0.3. Here are some numerical examples of
Bernoulli trials and their probabilities:

 Trial: Shoot a free throw.


 Success: The ball goes through the hoop.
 Failure: The ball misses the hoop.
 Probability of success: p = 0.7
 Probability of failure: q = 0.3

You can use the Bernoulli distribution to calculate the probability of any event that has two
possible outcomes, such as winning a coin toss, rolling a six on a die, or making a free throw in
basketball.
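
For instance, the free-throw example (p = 0.7) can be explored with scipy.stats.bernoulli, as in the minimal sketch below:

from scipy.stats import bernoulli

p = 0.7                                      # probability of making the free throw

print(bernoulli.pmf(1, p))                   # P(success) = 0.7
print(bernoulli.pmf(0, p))                   # P(failure) = 0.3
print(bernoulli.mean(p), bernoulli.var(p))   # mean = p, variance = p*(1-p) = 0.21
print(bernoulli.rvs(p, size=10))             # simulate 10 free throws (array of 0s and 1s)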

Example 6.7: Getting a job offer after an interview

In this example, the probability of success (getting a job offer) is p and the probability of failure
(not getting a job offer) is q = 1 - p. The value of p will vary depending on the person's
qualifications, experience, and the competitiveness of the job market.

Example 6.8: Getting a customer to buy a product

In this example, the probability of success (the customer buying the product) is p and the
probability of failure (the customer not buying the product) is q = 1 - p. The value of p will vary
depending on the product, the customer's needs, and the sales skills of the salesperson.

These are just a few examples of the many ways that the Bernoulli distribution can be used to
model real-world situations.

Example 6.9: Bernoulli distribution

Fig. 6.8: Bernoulli distribution using Python code

Fig. 6.9: Numerical output Bernoulli distribution using Python code

Example 6.10: Bernoulli distribution

Fig. 6.10: Bernoulli distribution using Python code

Fig. 6.11 : Numerical output Bernoulli distribution using Python code

Fig. 6.12: Output graph of Bernoulli distribution using Python code

II. Binomial distribution: This distribution is used to describe the probability of a certain
number of successes in a fixed number of trials, where each trial has only two possible
outcomes (success or failure).

The PMF of the binomial distribution is given by:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k),  with C(n, k) = n! / [k! (n − k)!]
where:

X is the random variable, which represents the number of successes in n trials


n is the number of trials
p is the probability of success on each trial.

Example 6.11: Binomial Distribution

Suppose you have a list of 50 email subscribers and you send them an email campaign. You
know from previous campaigns that the average open rate for your emails is 20%. This means
that the probability of any individual subscriber opening your email is 0.20. use the binomial
distribution to calculate the probability of a certain number of subscribers opening your email.
For example, the probability of 10 subscribers opening your email is:

P(X = 10) = C(50, 10) · (0.2)^10 · (0.8)^40 ≈ 0.1398

The probability of 10 people opening the email is therefore about 0.1398.

Example 6.12: Binomial Distribution

Suppose the total number of free throws attempted was 20, number of successful throws equal
15, and the probability of making free throw was 0.7. Find the probability of making exactly 15
free throws.

P(X = 15) = C(20, 15) · (0.7)^15 · (0.3)^5 ≈ 0.178863

The probability of making exactly 15 free throws is 0.178863.
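
Both worked examples can be checked with scipy.stats.binom; the sketch below is one way to do so (not necessarily the code of the following figures):

from scipy.stats import binom

print(binom.pmf(10, n=50, p=0.2))   # Example 6.11: ≈ 0.1398
print(binom.pmf(15, n=20, p=0.7))   # Example 6.12: ≈ 0.1789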

Example 6.13: Binomial Distribution.

Fig. 6.13: Binomial Distribution using Python code.

Fig. 6.14: Output results for the binomial distribution using Python code.

Example 6.14: Binomial Distribution

Fig. 6.15: Binomial Distribution using Python code.

Fig. 6.16: Output plot for the probability of exactly 10 people opening the email (≈ 0.1398), using Python code.

Fig. 6.17: Output plot for probability of exactly 15 free throws: 0.1780 using Python code.

Example 6.15: Binomial Distribution.

Fig. 6.18: Binomial Distribution using Python code.

Fig. 6.19: Output results for the binomial distribution using Python code.

Example 6.16: Binomial Distribution

Fig. 6.20: Binomial Distribution using Python code.

Fig. 6.21: Output plot for the probability of exactly 10 people opening the email (≈ 0.1398), using Python code.

Fig. 6.22 : Output plot for probability of exactly 15 free throws: 0.1780 using Python code.

III. Poisson distribution: This distribution is used to describe the probability of a certain
number of events occurring in a fixed time interval or space, where the events occur
independently of each other and at a constant rate.

The PMF of the Poisson distribution is given by:

P(X = k) = (λ^k · e^(−λ)) / k!
where:

X is the random variable, which represents the number of events occurring in a fixed time
interval or space

λ (lambda) is the average rate of occurrence of events in the given interval.

k is the number of events we want to find the probability for.

k! (k factorial) is the factorial of k, i.e., k! = k * (k-1) * (k-2) * ... * 2 * 1.

Key characteristics of the Poisson distribution are:

1. Events occur independently and randomly in the given interval.

2. The probability of more than one event occurring in an infinitesimally small interval
approaches zero.

3. The mean (average) and the variance of the Poisson distribution are both equal to λ.

Now, let's look at some examples of the Poisson distribution:

1. Customer Arrivals: The Poisson distribution can model the number of customer
arrivals at a store during a specific hour. Suppose, on average, 5 customers arrive per
hour (λ = 5). You can use the Poisson distribution to find the probability of exactly 3
customers arriving in the next hour (k = 3).

P(X = 3) = (5^3 · e^(−5)) / 3! = (125 × 0.006738) / 6 ≈ 0.140

2. Defects in a Product: In manufacturing, the Poisson distribution can model the number
of defects in a batch of products. If, on average, there are 2 defects per batch (λ = 2), you
can find the probability of having exactly 1 defect in the next batch (k = 1).

3. Accidents per Day: The Poisson distribution can be used to model the number of
accidents occurring at a specific location in a day. If, on average, there are 0.5 accidents

per day (λ = 0.5), you can calculate the probability of having exactly 2 accidents in the
next day (k = 2).

4. Call Center Calls: Call centers can use the Poisson distribution to model the number of
incoming calls per minute or per hour. If, on average, there are 10 calls per minute (λ =
10), you can find the probability of receiving exactly 15 calls in the next minute (k = 15).

These examples illustrate how the Poisson distribution can be applied in various
scenarios to model the number of events occurring within a fixed interval of time or
space, given an average rate of occurrence.
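
These probabilities can be verified with scipy.stats.poisson, as in the short sketch below:

from scipy.stats import poisson

print(poisson.pmf(3, mu=5))     # customer arrivals: P(X = 3) ≈ 0.140
print(poisson.pmf(1, mu=2))     # product defects:   P(X = 1) ≈ 0.271
print(poisson.pmf(2, mu=0.5))   # accidents per day: P(X = 2) ≈ 0.076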

Example 6.17: Poisson distribution

Fig. 6.23: Poisson distribution using Python code.

Fig. 6.24: output plot of Poisson distribution using Python code.

Fig. 6.25: output plot of Poisson distribution using Python code.

IV. Geometric distribution: This distribution is used to describe the probability of the first
success occurring on the nth trial, where each trial has only two possible outcomes
(success or failure).

The probability mass function (PMF) of the geometric distribution is given by:

P(X = k) = (1 − p)^(k − 1) · p,  for k = 1, 2, 3, …
Where:
P(X = k) is the probability of getting the first success on the k-th trial.
p is the probability of success in a single trial .
Key characteristics of the geometric distribution are:
1. Each trial is independent and has only two possible outcomes: success or failure.
2. The distribution is memoryless, meaning the probability of success on the next trial is
the same regardless of how many trials have been conducted before.

Now, let's look at some examples of the geometric distribution:

1. Waiting for a Bus: Suppose the probability of a bus arriving at a specific stop is 0.2 (p
= 0.2). You can use the geometric distribution to calculate the probability of waiting for
5 bus arrivals before the first bus arrives (k = 5).
P(X = 5) = (1 − 0.2)^4 · (0.2) = (0.8)^4 · (0.2) ≈ 0.082
2. Email Spam: In email filtering, the geometric distribution can be used to model the
number of non-spam emails received before the first spam email arrives. If the
probability of receiving a spam email is 0.1 (p = 0.1), you can find the probability of
receiving 10 non-spam emails before the first spam email (k = 10).
3. Product Defects: In quality control, the geometric distribution can model the number
of non-defective items produced before the first defective item is found. If the
probability of producing a defective item is 0.03 (p = 0.03), you can calculate the
probability of producing 6 non-defective items before the first defective item (k = 6).
P(X = 6) = (1 − 0.03)^5 · (0.03) = (0.97)^5 · (0.03) ≈ 0.02576

These examples demonstrate how the geometric distribution can be applied in various
scenarios to model the number of trials needed before the first success occurs, given the
probability of success in each trial (p).
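
The bus and defect calculations above can be reproduced with scipy.stats.geom, where geom.pmf(k, p) is the probability that the first success occurs on trial k:

from scipy.stats import geom

print(geom.pmf(5, 0.2))    # bus example:     (0.8)^4 * 0.2   ≈ 0.082
print(geom.pmf(6, 0.03))   # defects example: (0.97)^5 * 0.03 ≈ 0.026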

Example 6.18: Geometric probability

Fig. 6.26: Geometric probability using Python code.

Fig. 6.27: Statistical output Geometric probability using Python code.

All discrete probability distributions must satisfy the following properties:

The probability of any event must be between 0 and 1.


The sum of the probabilities of all possible events must be equal to 1.
In addition to these general properties, there are a number of other properties that are specific
to different types of discrete probability distributions. For example, the mean and variance of
a binomial distribution are given by np and npq, respectively, where q = 1 - p.

 Continuous probability distributions

Continuous probability distributions are a powerful tool for modeling uncertainty in


situations where the variable of interest can take on any value within a continuous range.
Unlike discrete distributions, which deal with outcomes like rolls of a die or counts of
successes in trials, continuous distributions handle variables like height, weight, or time,
which can theoretically have an infinite number of possible values.

a. Normal Distribution
One commonly used continuous probability distribution is the normal distribution (also
known as the Gaussian distribution). It is characterized by its bell-shaped curve and is often
used to model real-world phenomena. The normal distribution is fully defined by its mean
(μ) and standard deviation (σ). The PDF of the normal distribution is given by the
following formula:

f(x) = (1 / (σ √(2π))) · exp{ −(x − μ)^2 / (2σ^2) }

where:
x is the random variable
μ is the mean of the distribution
σ is the standard deviation of the distribution
e is the base of the natural logarithm (approximately 2.71828).

Example 6.19: Normal distribution

Fig. 6.28: Normal distribution using Python code.

Fig. 6.29: Output plot for normal distribution using Python code.

Example 6.20: Normal Distribution

Fig. 6.30: Normal distribution using Python code.

Fig. 6.31: Output plot for two normal distribution using Python code.

Example 6.21: Normal distribution

Fig. 6.32: Normal Distribution using Python code.

Fig. 6.33: Output plot for two normal distribution using Python code.

Example 6.22: Normal probability distribution.

Let's say we have a normally distributed population of heights with a mean of 68 inches and a
standard deviation of 3 inches. We want to find the probability that a randomly selected person
is shorter than 70 inches, P(X < 70).

1. Mean μ = 68 inches.
2. Standard deviation σ = 3 inches.
3. X = 70 inches.

Plug these values into the z-score formula z = (X − μ) / σ.

Now, let's calculate this step by step:

Step 1: Calculate the z-score (standardized score): z = (70 − 68) / 3 = 2/3 ≈ 0.67.
Step 2: Look up the z-score in the standard normal distribution table or use a calculator to
find the cumulative probability associated with z-score. In this case, P(Z < 2/3).

Step 3: Find the probability using the cumulative probability from the standard normal
distribution table.

Look up the value in a z-table: P(Z < 2/3) is approximately 0.7486.

Therefore, the probability that a randomly selected person is shorter than 70 inches is
approximately 74.86%.

Example 6.23: Consider a normally distributed population of test scores with a mean of 85
and a standard deviation of 10. We want to find the probability that a randomly selected
student scores above 90, P(X > 90).

Use the formula to calculate:

P(X > 90) = 1 − P(X < 90)

Now, let’s calculate it step by step:

Step 1: Calculate the z-score: z = (90 − 85) / 10 = 0.5

Step 2: Look up the z-score in the standard normal distribution table to find P(Z < 0.5).

Step 3: Calculate the complement, 1 − P(Z < 0.5), to find the probability that a student scores
above 90.

Look up the value in a z-table, then:

P(Z > 0.5) = 1 − P(Z < 0.5) = 1 − 0.6915 = 0.3085

So, the probability that a randomly selected student scores above 90 on the test is
approximately 0.3085, or 30.85%.

Example 6.24: Probability of Task Completion Time between 40 and 50 Minutes

Imagine a dataset representing task completion times that follow a normal distribution with a
mean of 45 minutes and a standard deviation of 8 minutes. We want to find the probability
that a randomly selected task takes between 40 and 50 minutes (40 < X < 50).

1. Mean μ = 45 minutes.
2. Standard deviation σ = 8 minutes.
3. Lower bound = 40 minutes.
4. Upper bound = 50 minutes.

Now, let’s calculate it step by step:

Step 1: Calculate the z-score for both bounds:

z(lower) = (40 − 45) / 8 = −0.625   and   z(upper) = (50 − 45) / 8 = 0.625

Step 2: Look up the z-scores in the standard normal distribution table to find

P(Z < - 0.625) and P(Z < 0.625).

Step 3: Subtract the cumulative probability for the lower bound from the cumulative probability
for the upper bound to find the desired probability.

For the lower bound:

P(X < 40) = P(Z < −0.625) ≈ 0.265985

For the upper bound:

P(X < 50) = P(Z < 0.625) ≈ 0.734015

Now, we can find the probability of task completion time between 40 and 50 minutes:

P(40 < X < 50) = P(X < 50) − P(X < 40)

P(40 < X < 50) = 0.734015 − 0.265985

P(40 < X < 50) = 0.468030

So, the probability that a randomly selected task takes between 40 and 50 minutes is
approximately 0.468, or about 46.8%.
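
The same probability can be obtained directly from the normal CDF with scipy.stats.norm, as in this minimal sketch:

from scipy.stats import norm

mu, sigma = 45, 8
p = norm.cdf(50, mu, sigma) - norm.cdf(40, mu, sigma)
print(round(p, 4))   # ≈ 0.4680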

Example 6.25: Normal distribution

Fig. 6.34: Normal distribution using Python code.

Fig. 6.35: Output plot for Normal distribution using Python code.

Fig. 6.36: Output statistical results for Normal distribution using Python code.

b. Uniform distribution is a continuous probability distribution where all values within


a specified range have an equal probability of occurring. It is often referred to as a
"rectangular" distribution due to its constant probability density function (PDF) over
the range.
The probability density function of a uniform distribution is defined as follows:

f(x) = 1 / (b − a),  for a ≤ x ≤ b

where:
a is the lower bound of the distribution
b is the upper bound of the distribution
The PDF of the uniform distribution is constant within the interval [a, b], and outside
this interval, it is zero.
Here are some numerical examples for uniform distribution:

Example 6.26: Uniform distribution
Suppose we have a uniform distribution on the interval [0, 10]. This means that all
values between 0 and 10 are equally likely. We can use the uniform distribution
formula to calculate the probability of getting a value between 5 and 7.

P(5 < x < 7) = (7 - 5) / (10 - 0) = 2 / 10 = 0.2

This means that there is a 20% chance of getting a value between 5 and 7 in a
uniform distribution on the interval [0, 10].

Example 6.27: Uniform distribution


Suppose we have a uniform distribution on the interval [10, 20]. This means that all
values between 10 and 20 are equally likely. We can use the uniform distribution
formula to calculate the probability of getting a value greater than 15.

P(x > 15) = (20 - 15) / (20 - 10) = 5 / 10 = 0.5

This means that there is a 50% chance of getting a value greater than 15 in a uniform
distribution on the interval [10, 20].

Example 6.28: Uniform distribution


Suppose we have a uniform distribution on the interval [-10, 10]. This means that all
values between -10 and 10 are equally likely. We can use the uniform distribution
formula to calculate the probability of getting a value between -5 and 5.

P(-5 < x < 5) = (5 - (-5)) / (10 - (-10)) = 10 / 20 = 0.5

This means that there is a 50% chance of getting a value between -5 and 5 in a
uniform distribution on the interval [-10, 10].
Uniform distributions are often used to model situations where all values within a
certain range are equally likely.

For example, we could use a uniform distribution to model the probability of getting
a certain number on a die roll or the probability of picking a certain card from a deck
of cards.

Example 6.29: Uniform distribution

Let's consider the first example: modeling the daily high temperature in a city where the
temperature can vary between 70°F and 90°F with equal probability. In this case, we have a
continuous uniform distribution over the temperature range [70°F, 90°F].

Model the daily high temperature in a city where the temperature follows a continuous uniform
distribution between 70°F and 90°F. Find the probability density function (PDF) and calculate
the probability of the temperature being within a certain range.

Solution:

Probability Density Function (PDF):

In a continuous uniform distribution, the probability density function (PDF) is constant over the
entire range and is given by:

f(x) = 1 / (b − a)

In this case, a = 70°F (lower bound) and b = 90°F (upper bound), so:

f(x) = 1 / (90 − 70) = 1/20 = 0.05

Probability of Temperature Range:

To find the probability that the temperature falls within a specific range, you can calculate the
area under the PDF curve within that range.

For example, to find the probability that the temperature is between 75°F and 85°F:

P(75 ≤ X ≤ 85) = ∫ from 75 to 85 of (1/20) dx = (85 − 75) / 20 = 0.5

So, the probability that the daily high temperature is between 75°F and 85°F is 0.5 or
50%.

Example 6.30: Uniform distribution

Fig. 6.37: Uniform distribution using Python code.

Fig. 6.38: Statistical output results for Uniform distribution using Python code

Fig. 6.39: Output graph for Uniform distribution using Python code.

c. Exponential distribution is one of the widely used continuous distributions. It is


often used to model the time elapsed between events. We will now mathematically
define the exponential distribution, and derive its mean and expected value. Then we
will develop the intuition for the distribution and discuss several interesting properties
that it has.
The probability density function (PDF) of the exponential distribution is given by:

f(x) = λ · e^(−λx),  for x ≥ 0
where x is the random variable representing the time between events, λ (lambda) is
the rate parameter, and e is the base of the natural logarithm (approximately
2.71828).
The exponential distribution has the following properties:

1. Non-negative values: The exponential distribution is defined for x ≥ 0, meaning


the time between events must be non-negative.

2. Memoryless property: The exponential distribution has the memoryless property,


which means that the probability of an event occurring in the next interval is not

dependent on how much time has already elapsed. In other words, the
distribution does not "remember" the past.
3. Exponential decay: The exponential distribution exhibits exponential decay,
meaning the probability of an event occurring decreases exponentially as time
increases.
4. Constant hazard rate: The hazard rate, which represents the instantaneous
probability of an event occurring given that it has not occurred yet, is constant for
the exponential distribution. The hazard rate is given by λ, the rate parameter.
The mean (μ) and standard deviation (σ) of the exponential distribution are both equal to 1/λ.
The variance (σ²) is equal to 1/λ².
The exponential distribution is commonly used in various fields, such as reliability
analysis, queuing theory, and survival analysis, where modeling the time to failure or
time to an event is important.

Example 6.31: Time between customer arrivals at a store

Suppose the time between customer arrivals at a store is exponentially distributed with a mean of
5 minutes. This means that the PDF of the time between arrivals, X, is given by:

f(x) = λ · e^(−λx) = 0.2 · e^(−0.2x),  where λ = 1/5 = 0.2.

To calculate the probability that a customer will arrive within 5 minutes of the previous
customer, we can use the following formula:

P(X < 5) = ∫ from 0 to 5 of 0.2 · e^(−0.2x) dx = 1 − e^(−0.2 × 5) = 1 − e^(−1) ≈ 0.632

Therefore, there is a 63.2% chance that a customer will arrive within 5 minutes of the previous
customer.

To calculate the probability that a customer will arrive after 5 minutes, we can use the following
formula:

P(X > 5) = 1 − P(X < 5) = 1 − 0.632 = 0.368

Therefore, there is a 36.8% chance that a customer will arrive after 5 minutes.

To calculate the probability that a customer will arrive after 10 minutes, we can use the
following formula:

P(X > 10) = e^(−0.2 × 10) = e^(−2) ≈ 0.135

Therefore, there is a 13.5% chance that a customer will arrive after 10 minutes.
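
A short sketch with scipy.stats.expon confirms these values; note that SciPy parameterizes the exponential by scale = 1/λ (here 5 minutes):

from scipy.stats import expon

scale = 5                          # scale = 1/λ = mean time between arrivals
print(expon.cdf(5, scale=scale))   # P(X < 5)  = 1 - e^(-1) ≈ 0.632
print(expon.sf(5, scale=scale))    # P(X > 5)  = e^(-1)     ≈ 0.368
print(expon.sf(10, scale=scale))   # P(X > 10) = e^(-2)     ≈ 0.135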

Example 6.32: Exponential distribution

Fig. 6.40: Exponential Distribution using Python code.

Fig. 6.41: Output plot for Exponential Distribution using Python code.

d. Beta distribution
The beta distribution is a continuous probability distribution defined on the interval
[0, 1]. It is characterized by two shape parameters, typically denoted as α (alpha) and
β (beta), which control the shape and behavior of the distribution.
The probability density function (PDF) of the beta distribution is given by:

f(x; α, β) = [x^(α − 1) · (1 − x)^(β − 1)] / B(α, β)

where x is the random variable, α and β are the shape parameters, and B(α, β) is the
beta function, defined as:

B(α, β) = Γ(α) · Γ(β) / Γ(α + β)

The properties of the beta distribution include:


1. Support on the interval [0, 1]: The beta distribution is defined for x between 0 and
1, inclusive. This makes it suitable for modeling proportions and probabilities.
2. Symmetry (for certain parameter values): When α = β, the beta distribution is
symmetric around x = 0.5.

3. Shape and behavior controlled by α and β: The values of α and β determine the
shape of the distribution. Higher values of α and β result in distributions that are
more peaked and concentrated around the mean.
4. Special cases: When α = β = 1, the beta distribution reduces to the uniform
distribution on [0, 1]. When α < β, the distribution is positively skewed, and
when α > β, the distribution is negatively skewed.
The cumulative distribution function (CDF) of the beta distribution does not have a
closed-form expression, but it can be computed numerically using various
approximation methods or specialized software.
The mean (μ) and variance (σ²) of the beta distribution are given by:

μ = α / (α + β),   σ² = (α · β) / [(α + β)² · (α + β + 1)]
The beta distribution is widely used in various fields, including statistics, Bayesian
inference, modeling proportions, and in machine learning applications such as beta
regression and Bayesian parameter estimation.

Example 6.33: Beta distribution

Fig. 6.42: Probability distribution function for Beta distribution using Python code.

Fig. 6.43: Output results for Probability distribution function for Beta distribution using Python code.

Example 6.34: Beta distribution

Fig. 6.44: Generating random numbers from the beta distribution using Python code.

Fig. 6.45: Output random numbers from the beta distribution using Python code.

Example 6.35: Beta distribution

Fig. 6.46: Beta distribution using Python code.

Fig. 6.47: Output plot for Beta distribution using Python code.

Example 6.36: Beta distribution

Fig. 6.48: Beta distribution using Python code.

Fig. 6.49: Output for different plots for Beta distribution using Python code.

6.4 Cumulative distribution function (CDF): The cumulative distribution function of a random
variable gives the probability that the variable takes on a value less than or equal to a given
value. It provides a complete description of the random variable's behavior.
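
As a quick illustration, here is a minimal sketch, assuming SciPy and a standard normal variable chosen purely for illustration:

# A minimal sketch (assuming SciPy): the CDF gives P(X <= x) for any x.
from scipy.stats import norm

print("P(X <= 1) for X ~ N(0, 1):", norm.cdf(1.0))   # ≈ 0.8413
print("P(X <= 0) for X ~ N(0, 1):", norm.cdf(0.0))   # = 0.5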

Example 6.37: Cumulative Distribution Function

Fig. 6.50: Cumulative Distribution Probability Python code.

Fig. 6.51: Output of Cumulative Distribution Probability of the Python code.

6.5 Expected value: The expected value, also known as the mean or average, is a measure of the
central tendency of a random variable. It represents the weighted average of all possible values,
where the weights are given by the probabilities associated with those values. Expected value, in
the context of statistics and probability theory, refers to the theoretical average outcome that can
be anticipated from a specific event or random variable. It is calculated by multiplying each
possible outcome by its respective probability and summing them up. Expected value serves as a
valuable tool for decision-making and risk assessment, allowing us to estimate the potential
outcome of an uncertain event based on its probabilities.

Expected value finds extensive applications across various fields, such as finance, economics,
and insurance. In finance, it is commonly used to assess investment opportunities by considering
the expected return and associated risks. By calculating the expected value, investors can make
informed decisions about the potential profitability of an investment and evaluate the level of
uncertainty involved. In economics, expected value aids in evaluating policy choices by
considering the potential outcomes and their probabilities. Additionally, in the field of insurance,
expected value assists in determining appropriate premiums by considering the potential losses
and their associated probabilities. By incorporating expected values, insurers can ensure their
pricing aligns with the potential risks involved. Overall, expected value provides a quantitative
measure to estimate the likely outcome of uncertain events, enabling better decision-making and
risk management.

The discrete formula for expected value is:

E(X) = Σ x · P(x)

where:

 E(X) is the expected value of the random variable X


 x is a possible outcome of X
 P(x) is the probability of outcome x occurring
 Σ means "sum of"

Example 6.38: Expected value

Suppose you have a choice of two investments:

 Investment A has a 60% chance of returning 10% and a 40% chance of returning -5%.
 Investment B has a 50% chance of returning 5% and a 50% chance of returning 0%.

The expected value of each investment can be calculated as follows:

E(A) = (0.6 * 0.1) + (0.4 * -0.05) = 0.04


E(B) = (0.5 * 0.05) + (0.5 * 0.00) = 0.025

Therefore, Investment A has a higher expected value than Investment B.
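
A minimal sketch of this calculation, assuming NumPy (the returns and probabilities are the ones given above):

# A minimal sketch (assuming NumPy) of the expected-value calculation for the
# two investments described above.
import numpy as np

returns_a, probs_a = np.array([0.10, -0.05]), np.array([0.6, 0.4])
returns_b, probs_b = np.array([0.05, 0.00]), np.array([0.5, 0.5])

print("E(A) =", np.dot(returns_a, probs_a))   # 0.04
print("E(B) =", np.dot(returns_b, probs_b))   # 0.025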

Example 6.39:

Suppose you are trying to predict the amount of rain that will fall in a given month. You have
historical data that shows that there is a 20% chance of 1 inch of rain, a 70% chance of 2 inches
of rain, and a 10% chance of 3 inches of rain.

The expected amount of rain for the month is:

E(rain) = (0.2 * 1) + (0.7 * 2) + (0.1 * 3) = 1.9 inches

This means that, on average, you can expect 1.9 inches of rain in the month.

Expected value is a useful tool for making decisions in situations where there is uncertainty. By
calculating the expected value of each possible outcome, you can get an idea of which outcome
is most likely to occur and which outcome is most likely to benefit you.

The continuous formula for expected value is:

E(X) = ∫ x · f(x) dx

where f(x) is the probability density function of X and the integral is taken over the range of X.

Example 6.40: Expected value

Let's use a probability density function (PDF) and integration to calculate the expected value of a
continuous random variable. Suppose X has the PDF f(x) = 2x for 0 ≤ x ≤ 1 (and 0 elsewhere).

To find the expected value E(X), we'll integrate the product of x and the PDF over the entire
range of X, which is from 0 to 1:

E(X) = ∫₀¹ x · 2x dx

Now, let's calculate the integral as follows:

E(X) = ∫₀¹ 2x² dx = [2x³/3]₀¹ = 2/3

Therefore, the expected value of X is 2/3.

Example 6.41: Expected Value

Fig. 6.52: Expected Value Python code.

Fig. 6.53: Output of Expected Value of Python code.

6.6 Moment generating function (MGF): The moment generating function is a useful tool for
characterizing random variables. It generates moments of the random variable and can be used to
derive various properties, such as moments, moments about the mean, and the shape of the
distribution.

The moment generating function is a significant concept in probability theory and statistics. It is
a mathematical function that uniquely characterizes a probability distribution. The MGF of a
random variable X is defined as the expected value of e^(tX):

M_X(t) = E[e^(tX)] = Σ e^(tx) P(X = x)        (discrete case)

M_X(t) = E[e^(tX)] = ∫ e^(tx) f(x) dx          (continuous case)

where t is a real-valued parameter and X is the random variable. In simpler terms, the MGF
provides a way to generate moments (expected values) of a random variable by taking the
exponential of the variable multiplied by a parameter.

The moment generating function has various applications in probability theory and statistics.
One of its primary uses is in determining the moments of a random variable. By taking the
derivatives of the MGF with respect to the parameter t and evaluating them at t=0, we can obtain
the moments of the random variable, including the mean, variance, skewness, and higher-order
moments. This information is crucial for understanding the characteristics of a probability
distribution and making statistical inferences.

The moment generating function (MGF) of a random variable is a function that summarizes all
of its moments. The moments of a random variable are the expected values of its powers, such as
the mean (first moment), variance (second moment), and skewness (third moment).

The MGF is useful for several reasons:

It can be used to easily derive the moments of a random variable. For example, the nth moment
of a random variable is the nth derivative of the MGF evaluated at t=0.

The MGF uniquely determines the distribution of a random variable. This means that if two
random variables have the same MGF, then they must have the same distribution.

The MGF can be used to find the distribution of a sum of random variables. This is done by
taking the product of the MGFs of the individual random variables.

Here is an example of how the MGF can be used to find the mean and variance of a random
variable. Let X be a random variable with MGF M(t). Then, the mean of X is given by:

E[X] = M'(0)

And the variance of X is given by:

Var[X] = M''(0) − [M'(0)]²

For a standard normal random variable X, for example, M(t) = exp(t²/2), and the mean and
variance are 0 and 1, respectively. This can be verified by taking the first and second derivatives
of the MGF and evaluating them at t = 0.
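
A minimal sketch of that verification, assuming the SymPy library for symbolic differentiation:

# A minimal sketch (assuming SymPy) verifying E[X] = M'(0) and
# Var[X] = M''(0) - M'(0)**2 for the standard normal MGF M(t) = exp(t**2/2).
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)                         # MGF of N(0, 1)

mean = sp.diff(M, t).subs(t, 0)              # first derivative at t = 0
second_moment = sp.diff(M, t, 2).subs(t, 0)  # second derivative at t = 0

print("Mean:", mean)                          # 0
print("Variance:", second_moment - mean**2)   # 1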

The MGF is a powerful tool that can be used to analyze random variables. It is often used in
probability theory, statistics, and engineering.

Here are some other benefits of using the moment generating function:

It can be used to find the moment generating function of a random variable.

It can be used to prove certain theorems in probability theory, such as the central limit theorem.

It can be used to simulate random variables.

Here are some examples of discrete moment generating functions (MGFs):

Bernoulli distribution: The MGF of a Bernoulli random variable X with parameter p is given by

M(t) = (1 − p) + p e^t

Binomial distribution: The MGF of a binomial random variable X with parameters n and p is
given by

M(t) = ((1 − p) + p e^t)^n

Poisson distribution: The MGF of a Poisson random variable X with parameter λ is given by

M(t) = exp(λ(e^t − 1))

Example 6.42: Moment Generating Function

Fig. 6.54: Moment Generating Function using Python code.

Fig. 6.55: Output results for Moment Generating Function of different distribution using Python code.

Here are some numerical examples for continuous moment generating functions (MGFs):

 The MGF of the normal distribution with mean 0 and variance 1 is given by M(t) = exp(t²/2).

For example, the MGF of a normally distributed random variable with mean 0 and variance 1
evaluated at t = 1 is equal to exp(1/2) = 1.6487212707001282.

 The MGF of the exponential distribution with parameter λ is given by M(t) = λ / (λ − t), for t < λ.

Example 6.43: Moment Generating Function.

Fig. 6.56: Moment Generating Function for continous distribution using Python code.

Fig. 6.57: Output results for continuous Moment Generating Function using Python code.

6.7 Conditional probability

Conditional probability is a fundamental concept in probability theory and statistics that
measures the probability of an event occurring given that another event has already occurred. It
quantifies the likelihood of an outcome based on some additional information about the situation.
Conditional probability is denoted by P(A|B), where "P" stands for probability, "A" is the event
we want to know the probability of, and "B" represents the event that has already occurred.

The formula for conditional probability is given by:

P(A|B) = P(A ∩ B) / P(B)

where:

P(A|B) is the conditional probability of event A given event B has occurred.

P(A ∩ B) is the probability of both events A and B occurring simultaneously (the intersection of
events A and B).

P(B) is the probability of event B occurring.

In words, the formula can be explained as follows: The conditional probability of A given B is
equal to the probability of both A and B happening divided by the probability of B happening.

Key points about conditional probability:

The conditional probability of an event A given B is always a number between 0 and 1
(0 ≤ P(A|B) ≤ 1).

If P(A|B) = 1, it means event A is certain to happen if event B has occurred.

If P(A|B) = 0, it means event A is impossible if event B has occurred.

If P(A|B) = P(A), it means that events A and B are independent; the occurrence of event B does
not affect the probability of event A.

Conditional probability plays a crucial role in many real-world applications, such as medical
diagnosis, weather forecasting, and risk assessment. It helps in updating probabilities when new
information becomes available, making predictions based on observed data, and understanding
the relationships between events.

Example 6.44: conditional probability

Consider rolling a fair six-sided die. Let event A be rolling an even number (2, 4, or 6), and
event B be rolling a number greater than 3 (4, 5, or 6). The sample space of the die is {1, 2, 3, 4,
5, 6}. The probabilities are:

P(A) = 3/6 = 1/2 (because there are three even numbers out of six possibilities).

P(B) = 3/6 = 1/2 (because there are three numbers greater than 3 out of six possibilities).

P(A ∩ B) = 2/6 = 1/3 (because there are two numbers that satisfy both A and B: 4 and 6).

Then, the conditional probability of rolling an even number given that the number is greater than
3 is:

P(A|B) = P(A ∩ B) / P(B) = (1/3) / (1/2) = 2/3.

This means that if you roll a die and get a number greater than 3, the probability of it being an
even number is 2/3.
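
A minimal sketch that checks this result by enumerating the sample space in Python:

# A minimal sketch: computing P(A|B) for the die example by enumeration.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number
B = {4, 5, 6}   # number greater than 3

p_b = len(B) / len(sample_space)
p_a_and_b = len(A & B) / len(sample_space)

print("P(A|B) =", p_a_and_b / p_b)   # 0.666... = 2/3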

6.8 Correlation

In statistics, correlation is a statistical measure that indicates the extent to which two variables
or quantities are related. It is a measure of the linear association between two variables,
meaning that it measures the strength and direction of the relationship between two variables. A
correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0
indicates no correlation, and 1 indicates a perfect positive correlation.

Types of correlation

There are two main types of correlation:

 Pearson correlation coefficient: This is the most common type of correlation coefficient,
and it is used to measure the strength and direction of the linear relationship between two
continuous variables.
 Spearman's rank correlation coefficient: This type of correlation coefficient is used to
measure the strength and direction of the monotonic relationship between two variables,
whether they are continuous or ordinal.

Interpretation of correlation coefficients

A correlation coefficient of -1 indicates a perfect negative correlation, meaning that as one
variable increases, the other variable decreases. A correlation coefficient of 0 indicates no
correlation, meaning that there is no linear relationship between the two variables. A correlation
coefficient of 1 indicates a perfect positive correlation, meaning that as one variable increases,
the other variable also increases.

Correlation and causation

It is important to note that correlation does not equal causation. Just because two variables are
correlated does not mean that one variable causes the other. For example, there is a strong
correlation between the number of ice cream sales and the number of drownings. However, this
does not mean that ice cream sales cause drownings. There is a third variable, temperature,
that causes both ice cream sales and drownings to increase.

Applications of correlation

Correlation is a widely used statistical measure in many fields, including:

 Economics: Correlation is used to study the relationship between economic variables, such
as inflation and unemployment.

 Finance: Correlation is used to study the relationship between financial variables, such
as stock prices and interest rates.
 Medicine: Correlation is used to study the relationship between medical variables, such
as smoking and lung cancer.
 Social science: Correlation is used to study the relationship between social variables,
such as education and income.

Limitations of correlation

Correlation is a useful statistical measure, but it has some limitations. Correlation does not
equal causation, and it can be affected by outliers. Additionally, correlation is only a measure of
the linear relationship between two variables. If the relationship between two variables is non-
linear, then correlation will not be a good measure of the strength of the relationship.

Here is the formula for the Pearson correlation coefficient between two variables X and Y:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

Example 6.45: For the following data find the correlation

Y x
3 1
4 2
6 3
8 4
10 5
Solution:

y     x     (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²
3     1     -2        -3.2      6.4              4          10.24
4     2     -1        -2.2      2.2              1          4.84
6     3      0        -0.2      0                0          0.04
8     4      1         1.8      1.8              1          3.24
10    5      2         3.8      7.6              4          14.44
Total                           18               10         32.8

With x̄ = 3 and ȳ = 6.2:

r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² · Σ(y − ȳ)² ] = 18 / √(10 × 32.8) ≈ 0.994

Which indicates a strong positive correlation.
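
A minimal sketch, assuming NumPy, that reproduces this coefficient:

# A minimal sketch (assuming NumPy) reproducing the Pearson correlation
# computed by hand above.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 6, 8, 10])

print("Pearson r =", np.corrcoef(x, y)[0, 1])   # ≈ 0.994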

Fig. 6.58: Python code to calculate the correlation

Fig. 6.59: Output result using Python code

Fig. 6.60: Output Plot correlation between two variables x, and y using Python code.

Example 6.46: For the following data find the correlation

Y x
10 1
7 2
6 3
4 4
1 5

Fig. 6.61: Correlation Python code.

Fig. 6.62: Output result for correlation using Python code, which indicates a strong negative correlation.

Fig. 6.63: Output Plot correlation between two variables x, and y using Python code.

Example 6.47: For the following data find the correlation

Y x
9 1
8 2
1 3
11 4
9 5

Fig. 6.64: Correlation python code.

Fig. 6.65: Output correlation result using Python code.

Fig. 6.66: Output Plot correlation between two variables x, and y using Python code with r = 0.123.

6.8.1 Spearman's rank correlation coefficient

Spearman's rank correlation coefficient, often denoted by the Greek letter rho (ρ) or as r_s, is a
nonparametric measure of rank correlation that assesses the strength and direction of association
between two ranked variables. It is closely related to Pearson's correlation coefficient, but it is based on
ranks instead of actual values. This makes it more robust to outliers and non-normality in the data.

Assumptions of Spearman's rank correlation coefficient:

Independence: The observations in the data set should be independent of each other.

No ties: There should be no ties in the ranks of the data. If there are ties, they can be handled by
averaging the ranks.

Ordinality: The data should be at least ordinal, meaning that the order of the data points is meaningful.

Calculation of Spearman's rank correlation coefficient:

Rank the data: Rank the values of each variable separately, giving the smallest value a rank of 1, the
second smallest value a rank of 2, and so on. (The direction of ranking does not affect the coefficient,
provided the same direction is used for both variables.)

Calculate the differences between the ranks: For each pair of observations, calculate the difference
between their ranks.

Square the differences between the ranks: Square each of the differences in ranks.

Sum the squared differences: Sum the squared differences between ranks.

Calculate the Spearman's rank correlation coefficient: Use the following formula to calculate the
Spearman's rank correlation coefficient:

r_s = 1 − (6 Σd²) / (n(n² − 1))

where:

r_s: is the Spearman's rank correlation coefficient

d: is the difference between the ranks for each pair of observations

n: is the number of observations

Interpretation of Spearman's rank correlation coefficient:

The Spearman's rank correlation coefficient (r_s) ranges from -1 to 1, with 0 indicating no
correlation and 1 indicating perfect positive correlation. A value of -1 indicates perfect
negative correlation. In general, a value of r_s closer to 1 indicates a stronger positive
correlation, while a value closer to -1 indicates a stronger negative correlation.

Applications of Spearman's rank correlation coefficient:

Spearman's rank correlation coefficient is a versatile measure of association that is used in a
wide variety of applications, including:

Assessing the relationship between two variables: Spearman's rank correlation coefficient
can be used to assess the strength and direction of association between two variables, even if
the variables are not normally distributed.

Comparing groups: Spearman's rank correlation coefficient can be used to compare the ranks
of two groups of observations.

Assessing ordinal data: Spearman's rank correlation coefficient is particularly useful for
analyzing ordinal data, where the order of the data points is meaningful but the actual values
are not.

Example 6.48: Spearman's Rank Correlation Coefficient

To illustrate Spearman's rank correlation coefficient, let's consider an example of the scores
of 5 students in Math’s and Science:

Student Math’s Score Science Score


A 80 75
B 70 65
C 60 55
D 50 45
E 40 35

To calculate Spearman's rank correlation coefficient, we need to rank the data for each
variable and then calculate the differences between the ranks.

Rank the data for each variable:

Student Math’s Rank Science Rank


A 5 5
B 4 4
C 3 3
D 2 2
E 1 1

Calculate the differences between the ranks:

Student   Math’s Rank   Science Rank   d   d²

A         5             5              0   0
B         4             4              0   0
C         3             3              0   0
D         2             2              0   0
E         1             1              0   0

Since there are no differences in ranks, the rank difference (d) for each pair of observations is
0.

Calculate the Spearman's rank correlation coefficient (r):

The formula for Spearman's rank correlation coefficient is:

r = 1 − (6 Σd²) / (n(n² − 1))

where Σd² is the sum of the squared rank differences, and n is the number of observations.

In this example, Σd² = 0 and n = 5, so the formula simplifies to:

r = 1 - (6 * 0) / (5 * (5^2 - 1))

= 1 - 0 / (5 * 24)

= 1 - 0 / 120

=1-0

=1

Therefore, the Spearman's rank correlation coefficient for the given data is 1, indicating a
perfect positive monotonic correlation between the Math’s and Science scores of the
students.
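
A minimal sketch, assuming SciPy, that reproduces this result:

# A minimal sketch (assuming SciPy) of Spearman's rank correlation for the
# five students above.
from scipy.stats import spearmanr

maths_scores = [80, 70, 60, 50, 40]
science_scores = [75, 65, 55, 45, 35]

rho, p_value = spearmanr(maths_scores, science_scores)
print("Spearman rho =", rho)   # 1.0: perfect positive monotonic correlation
print("p-value =", p_value)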

Example 6.49: Spearman Rank correlation

Fig. 6.67: Spearman’s Rank Correlation using Python Code.

Fig. 6.68: Spearman’s Rank Correlation using Python Code.

Example 6.50: Spearman Rank correlation

Fig. 6.69: Spearman’s Rank Correlation using Python Code.

Fig. 6.70: Spearman’s Rank Correlation using Python Code.

Example 6.51: Spearman Rank correlation Plot Using Python Code.

Fig. 6.71: Spearman’s Rank Correlation using Python Code.

Fig. 6.72: Spearman’s Rank Correlation using Python Code.

Regression analysis

Regression analysis is a statistical method used to model and analyze the relationship between a
dependent variable and one or more independent variables. It helps to understand how changes in
the independent variables are associated with changes in the dependent variable.

6.9 Linear regression

Linear regression is a fundamental statistical and machine learning technique used to model the
relationship between a dependent variable (target) and one or more independent variables
(predictors). The main goal of linear regression is to establish a linear relationship between the
dependent and independent variables, which can then be used to make predictions or understand
the influence of the predictors on the target variable.

Here are some key components and concepts of linear regression:

6.9.1 Linear Equation:

Linear regression models the relationship between the dependent variable (y) and the
independent variables (X) using a linear equation of the form:

y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

where y is the dependent variable, X₁, X₂, ..., Xₙ are the independent variables, β₀ is the intercept,
β₁, β₂, ..., βₙ are the coefficients of the independent variables, and ε is the error term.

Linear regression makes several key assumptions:

a) Linearity: The relationship between the dependent and independent variables is linear.

b) Independence: The observations are independent of each other.

c) Homoscedasticity: The variance of the errors is constant across all levels of the independent
variables.

d) Normality: The errors are normally distributed.

Estimating coefficients:

The coefficients (β₀, β₁, ..., βₙ) are estimated using a method called Ordinary Least Squares
(OLS). The OLS method aims to minimize the sum of the squared differences between the
observed values and the predicted values (i.e., the residuals).

Model evaluation:

To evaluate the performance of a linear regression model, several metrics can be used, such as:

a) Coefficient of determination (R²): Measures the proportion of variance in the dependent
variable that is predictable from the independent variables.

b) Mean Squared Error (MSE): Measures the average squared difference between the observed
and predicted values.

c) Root Mean Squared Error (RMSE): The square root of MSE, which is easier to interpret since
it is in the same units as the dependent variable.

Linear regression has a wide range of applications, such as predicting house prices, sales
forecasting, estimating the effect of marketing activities on revenue, and assessing the impact of
various factors on public health.

Keep in mind that linear regression has its limitations. It may not be suitable for modeling
nonlinear relationships or when the assumptions are violated. In such cases, other techniques like
polynomial regression, decision trees, or neural networks can be considered.

Example 6.52: simple linear regression

Let's work through a simple linear regression example with a single independent variable.

Suppose we have the following data points representing the relationship between the number of
years of experience (independent variable, X) and the corresponding annual salary (dependent
variable, y) in thousands of dollars:

Years of Experience (X): [1, 2, 3, 4, 5]

Annual Salary (y): [30, 35, 41, 45, 51]

We want to create a linear model that predicts annual salary based on years of experience. The
linear equation we want to estimate is of the form:

y = β₀ + β₁X

Calculate the mean of X and y:

x̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3

ȳ = (30 + 35 + 41 + 45 + 51) / 5 = 40.4

Calculate the coefficients β₁ and β₀:

First, let's calculate the numerator and denominator in the β₁ formula:

numerator = Σ((Xi - X_mean) * (yi - y_mean))

= (1-3)(30-40.4) + (2-3)(35-40.4) + (3-3)(41-40.4) + (4-3)(45-40.4) + (5-3)(51-40.4)

= 52

denominator = Σ((Xi - x̄)²)

= (1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)² = 10

Now we can calculate β₁ and β₀:

β₁ = numerator / denominator = 52 / 10 = 5.2

β₀ = ȳ - β₁ * x̄ = 40.4 - 5.2 * 3 = 24.8

Thus, our linear equation is:

y = 24.8 + 5.2 * X

Use the linear equation for predictions:

For example, if someone has 6 years of experience, the predicted salary would be:

y = 24.8 + 5.2 * 6 = 56 (in thousands of dollars)

With this linear equation, we can now predict the annual salary based on the number of years of
experience. Keep in mind that this is a simple example with a small dataset, and the model may
not be very accurate for real-world applications.
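
A minimal sketch, assuming NumPy, that confirms the hand-calculated coefficients:

# A minimal sketch (assuming NumPy): np.polyfit with degree 1 performs the
# same least-squares fit as the hand calculation above.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([30, 35, 41, 45, 51])

beta1, beta0 = np.polyfit(X, y, deg=1)           # slope, intercept
print("Intercept (beta0):", beta0)               # ≈ 24.8
print("Slope (beta1):", beta1)                   # ≈ 5.2
print("Predicted salary at 6 years:", beta0 + beta1 * 6)   # ≈ 56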

Example 6.53: multiple linear regression

Let's consider a multiple linear regression example with two independent variables.
Suppose we have data representing the relationship between years of experience, education level,
and annual salary:

Years of Experience (X1): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Education Level (X2): [1, 2, 2, 3, 2, 1, 3, 3, 2, 1] (1: Bachelor, 2: Master, 3: Ph.D.)

Annual Salary (y): [30, 37, 49, 57, 60, 53, 70, 80, 65, 58]

We want to create a linear model that predicts the annual salary based on years of experience and
education level. The linear equation we want to estimate is:

y = β₀ + β₁X₁ + β₂X₂

We will use the numpy and statsmodels libraries to perform multiple linear regression in Python.

This code fits a multiple linear regression model to the given data points and prints the resulting
linear equation.
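
A minimal sketch of such a fit, assuming NumPy and statsmodels, could look like this:

# A minimal sketch (assuming NumPy and statsmodels) of the multiple linear
# regression described above, fitted by ordinary least squares.
import numpy as np
import statsmodels.api as sm

X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
X2 = np.array([1, 2, 2, 3, 2, 1, 3, 3, 2, 1])
y = np.array([30, 37, 49, 57, 60, 53, 70, 80, 65, 58])

X = sm.add_constant(np.column_stack((X1, X2)))   # adds the intercept column
model = sm.OLS(y, X).fit()

b0, b1, b2 = model.params
print(f"y = {b0:.2f} + {b1:.2f} * X1 + {b2:.2f} * X2")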

After running the code, you should see the following linear equation:

Linear regression equation:

y = 16.45 + 3.61 * X1 + 9.80 * X2

Now, we can use this linear equation to predict annual salaries based on years of experience and
education level. For example, if someone has 5 years of experience and a Master's degree (X2 =
2), the predicted salary would be:

y = 16.45 + 3.61 * 5 + 9.8 * 2 = 54.1 (in thousands of dollars)

Keep in mind that this example uses a small dataset and may not be very accurate for real-world
applications. However, it illustrates the process of multiple linear regression using Python.

Example 6.54: Linear Regression

Fig. 6.73: Linear Regression Equation using Python code.

Fig.6.74: Output plot of Linear Regression Equation using Python code.

Example 6.55: Multiple Linear Regression

Fig.6.75 : multiple Linear Regression using Python code.

Fig.6.76: Output graph of Multiple Linear regression of the Python code of Fig.1.

6.9.2. Polynomial regression

Polynomial regression is a type of regression analysis that models the relationship between a
dependent variable (usually denoted as "y") and an independent variable (usually denoted as "x")
as an nth-degree polynomial. In other words, instead of fitting a straight line (as in simple linear
regression), it fits a curve to the data points.

The general equation for polynomial regression can be represented as follows:

y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ + ε

where:

y is the dependent variable (the response or outcome you want to predict).

x is the independent variable (the input or predictor).

β₀, β₁, β₂, ..., βₙ are the coefficients of the polynomial, representing the impact of each degree of
the independent variables on the dependent variable.

n is the degree of the polynomial (the highest power of x in the equation).

ε is the error term, representing the difference between the predicted and actual values of y.

The goal of polynomial regression is to find the optimal values for the coefficients (β₀, β₁, β₂, ...,
βₙ) that minimize the sum of squared errors (the difference between the predicted and actual
values) and provide the best fit to the data.

The degree of the polynomial (n) is a crucial parameter in polynomial regression. Higher-degree
polynomials can fit the training data very well, but they may suffer from overfitting, which
means they perform poorly on new, unseen data. Lower-degree polynomials are less likely to
overfit, but they may not capture the underlying relationships in the data as effectively.

To perform polynomial regression, you can use various statistical software packages or
programming languages like Python, R, or MATLAB. In these tools, you can find libraries and
functions to fit a polynomial regression model to your data, estimate the coefficients, and make
predictions.

Keep in mind that when working with polynomial regression, it's essential to evaluate the
model's performance on a separate test dataset to ensure it generalizes well to unseen data and
doesn't overfit to the training data. Techniques like cross-validation can be helpful in assessing
the model's performance.

Overall, polynomial regression is a flexible technique that allows you to capture more complex
relationships between variables, but it requires careful consideration of the polynomial degree
and potential overfitting issues.
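
A minimal sketch, assuming NumPy and purely illustrative data, of fitting and using a second-degree polynomial:

# A minimal sketch (assuming NumPy): fitting a degree-2 polynomial to
# illustrative, roughly quadratic data and using it for prediction.
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 2.9, 9.2, 19.1, 33.2, 50.8])   # illustrative values

coeffs = np.polyfit(x, y, deg=2)     # [beta2, beta1, beta0]
poly = np.poly1d(coeffs)

print("Fitted coefficients:", coeffs)
print("Prediction at x = 6:", poly(6))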

Example 6.56: polynomial regression

Fig. 6.77 : Polynomial Regression using Python code.

Fig. 6.78: Plot graph for Polynomial Regression using Python code.

6.10 Hypothesis testing

Hypothesis testing is a statistical method used to determine whether there is enough evidence to
support or reject a proposed hypothesis about a population parameter based on sample data. It
involves comparing the observed data to what would be expected under a null hypothesis, which
represents a default assumption about the population parameter. If the observed data is unlikely
to occur under the null hypothesis, the null hypothesis is rejected, and the alternative hypothesis
is supported.

The process of hypothesis testing generally involves the following steps:

a. Formulating the null and alternative hypotheses: The null hypothesis (H0) is the default
assumption about the population parameter, while the alternative hypothesis (Ha) is the
statement that contradicts the null hypothesis. For example, if we want to test whether a
new drug is effective in treating a particular disease, the null hypothesis would be that the
drug has no effect, while the alternative hypothesis would be that the drug is effective.
b. Choosing a level of significance: The level of significance (α) represents the probability
of rejecting the null hypothesis when it is actually true. The most common level of
significance is 0.05, which means that we are willing to accept a 5% chance of rejecting
the null hypothesis even if it is true.
c. Selecting a test statistic: The test statistic is a numerical value that measures the
difference between the observed data and what would be expected under the null
hypothesis. The choice of test statistic depends on the nature of the data and the
hypothesis being tested. Common test statistics include t-tests, z-tests, and chi-square
tests.
d. Computing the p-value: The p-value is the probability of obtaining a test statistic as
extreme or more extreme than the one observed, assuming that the null hypothesis is true.
A low p-value (less than the level of significance) indicates that the observed data is
unlikely to occur under the null hypothesis and suggests that the alternative hypothesis
may be true.
e. Interpreting the results: If the p-value is less than the level of significance, we reject the
null hypothesis and conclude that there is enough evidence to support the alternative
hypothesis. If the p-value is greater than the level of significance, we fail to reject the null
hypothesis and conclude that there is not enough evidence to support the alternative
hypothesis.

Hypothesis testing is widely used in many fields, including science, medicine, engineering, and
social sciences, to test theories, validate experimental results, and make decisions based on data.
It is important to note that hypothesis testing does not prove that the alternative hypothesis is
true, but rather provides evidence to support it or reject the null hypothesis.
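
As a concrete illustration of steps (a) through (e), here is a minimal sketch, assuming SciPy and an illustrative sample tested against a hypothesized population mean of 50:

# A minimal sketch (assuming SciPy) of a one-sample t-test following the
# hypothesis-testing steps described above. The data are illustrative only.
import numpy as np
from scipy import stats

sample = np.array([52.1, 48.9, 53.4, 51.2, 49.8, 54.0, 50.6, 52.7])
alpha = 0.05                                        # step b: significance level

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)   # steps c and d
print("t statistic:", t_stat)
print("p-value:", p_value)

# step e: decision
print("Reject H0" if p_value < alpha else "Fail to reject H0")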

Example 6.57: Hypothesis Test

Fig.6.79: Hypoythesis Test using Python code.

Fig.6.80: Plot output result for Hypoythesis Test using Python code.

6.10.1 p-value

The p-value is a statistical measure that helps determine whether the null hypothesis should be
rejected or accepted. In hypothesis testing, the p-value represents the probability of obtaining a
test statistic as extreme as, or more extreme than, the one observed, assuming that the null
hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.

For example, suppose we want to test whether the average height of students in a particular
school is significantly different from the national average of 68 inches. We can formulate the
null hypothesis as "the average height of students in the school is equal to 68 inches" and the
alternative hypothesis as "the average height of students in the school is different from 68
inches". We collect a sample of 50 students from the school and find that their average height is
70 inches, with a standard deviation of 3 inches. We can use a t-test to calculate the p-value of
the test statistic.

In this example, the p-value is very small (7.22e-20), which indicates strong evidence against the
null hypothesis. We can reject the null hypothesis and conclude that the average height of
students in the school is significantly different from the national average.

Note that the p-value can be interpreted as the probability of obtaining a test statistic as extreme
as, or more extreme than, the one observed, assuming that the null hypothesis is true. A p-value
less than 0.05 is generally considered statistically significant, indicating strong evidence against
the null hypothesis. However, the significance level can be set to a different value based on the
specific needs of the analysis.

Overall, the p-value is a crucial statistical measure that helps determine the strength of evidence
against the null hypothesis in hypothesis testing.

Example 6.58: p-value

Fig.6.81: p-value using Python code.

Fig.6.82: Output plot for p-value using Python code.

6.11 Model Assumptions

In the context of modeling, "Model Assumptions" refer to the fundamental assumptions or
conditions that are made when constructing a mathematical or statistical model. These
assumptions are essential because they provide a foundation for the model and shape its behavior
and predictions. Here are some key points regarding model assumptions:

a. Simplification: Models are simplifications of reality, designed to capture the essential features
of a system while disregarding certain complexities. Model assumptions help define the scope
and limitations of the model.

b. Linearity: Many models assume a linear relationship between variables, implying that the
effect of one variable on another is proportional and constant. Linear models are often simpler to
work with, but they may not capture more intricate relationships.

c. Independence: Models often assume that observations or variables are independent of each
other. This assumption implies that the behavior of one observation does not influence the
behavior of another. Violation of this assumption can lead to biased or inefficient estimates.

d. Normality: Some models assume that the variables or errors in the model follow a normal
distribution. This assumption facilitates various statistical tests and estimation techniques.
Departure from normality might affect the accuracy of the model's predictions.

e. Homoscedasticity: Homoscedasticity assumes that the variability of the errors or residuals is
constant across different levels of the predictors. Violation of this assumption, known as
heteroscedasticity, can affect the precision and reliability of the model's estimates.

f. No multicollinearity: When dealing with multiple predictors in a model, it is often assumed
that they are not highly correlated with each other. Multicollinearity can lead to unstable or
unreliable estimates of the predictor's effects.

g. Stationarity: In time series modeling, the assumption of stationarity is crucial. It assumes that
the statistical properties of the data, such as mean and variance, remain constant over time. Non-
stationarity can affect the model's ability to capture patterns and make accurate forecasts.

h. Causal assumptions: In causal modeling, assumptions about the relationships between
variables are made. These assumptions define the causal structure and help determine the
direction of causality between variables.

It's important to note that different models have different assumptions, and the appropriateness of
these assumptions depends on the specific context and data at hand. Model assumptions should
be carefully considered and validated to ensure the model's reliability and applicability to real-
world situations.

Example 6.59: Model Assumptions

Fig. 6.83: Model Assumptions using code.

Fig.6.84 : Output plot for Model Assumptions

Fig. 6.85: Output plot for Model Assumptions

Fig. 6.86: Output plot for Model Assumptions

Let's go through the output and interpretation of each part of the code.

a. Linearity: Scatter plot of y vs. X

- The scatter plot will show a visualization of the relationship between the independent variable
(X) and the dependent variable (y).

- Each data point represents an observation, with the x-coordinate corresponding to the value of
X and the y-coordinate corresponding to the value of y.

- By examining the scatter plot, you can visually assess if the relationship between X and y
appears to be linear. A roughly linear pattern suggests that the linearity assumption is reasonable.

b. Homoscedasticity: Residual plot

- The code fits an ordinary least squares (OLS) regression model using sm.OLS() and obtains
the predicted values (y_pred).

- Residuals are calculated as the difference between the actual values (y) and the predicted
values (y_pred).

- The scatter plot of the predicted values (y_pred) against the residuals visualizes the residuals'
relationship with the predicted values.

- Homoscedasticity refers to the assumption that the variance of the residuals is constant across
all levels of the independent variable.

- In the scatter plot, you should check if the spread or dispersion of the residuals appears
consistent across different predicted values. A roughly equal spread of residuals suggests
homoscedasticity.

c. Normality: Q-Q plot of residuals

- A quantile-quantile (Q-Q) plot of the residuals is created using sm.qqplot().

- The Q-Q plot compares the quantiles of the residuals against the quantiles of a theoretical
normal distribution.

- The Q-Q plot allows you to visually assess if the residuals follow a normal distribution.

- If the points on the plot closely align with the diagonal line (the line of expected normality), it
suggests that the residuals are normally distributed. Deviations from the diagonal line may
indicate departures from normality.

By examining the outputs from these code snippets, you can gain insights into the assumptions of
your linear regression model. It's important to consider these results and potentially take
appropriate actions, such as transforming variables or considering alternative modeling
approaches, if any of the assumptions are violated.
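
For readers who wish to reproduce these checks on their own data, here is a minimal sketch, assuming NumPy, matplotlib, and statsmodels, with simulated data used purely for illustration:

# A minimal sketch (assuming NumPy, matplotlib, statsmodels) of the three
# diagnostic checks described above, on simulated illustrative data.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * X + rng.normal(0, 1, 100)    # linear signal plus noise

model = sm.OLS(y, sm.add_constant(X)).fit()
y_pred = model.fittedvalues
residuals = model.resid

plt.scatter(X, y)                            # (a) linearity: y vs. X
plt.title("y vs. X")
plt.show()

plt.scatter(y_pred, residuals)               # (b) homoscedasticity
plt.axhline(0, color="red")
plt.title("Residuals vs. fitted values")
plt.show()

sm.qqplot(residuals, line="45")              # (c) normality: Q-Q plot
plt.show()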

6.12 Model Evaluation

Model evaluation is an essential aspect of machine learning and data analysis. It involves
assessing the performance and quality of a predictive model to determine its effectiveness in
making accurate predictions or classifications. The evaluation process helps us understand how
well a model generalizes to unseen data and allows us to make informed decisions about its
deployment.

There are several key aspects and techniques involved in model evaluation. Let's delve into the
details:

a. Training and Test Sets: To evaluate a model, it is crucial to split the available data into
training and test sets. The training set is used to train the model, while the test set is used to
evaluate its performance. The idea behind this separation is to simulate the real-world scenario,
where the model is presented with unseen data.

b. Evaluation Metrics: Evaluation metrics quantify the performance of a model. The choice of
metrics depends on the problem type (e.g., regression, classification) and the specific
requirements of the task. Common evaluation metrics include accuracy, precision, recall, F1-

298
score, mean squared error (MSE), and area under the receiver operating characteristic curve
(AUC-ROC).

c. Confusion Matrix: A confusion matrix is a useful tool for evaluating classification models. It
provides a detailed breakdown of correct and incorrect predictions, highlighting true positives,
true negatives, false positives, and false negatives. From the confusion matrix, various metrics
like accuracy, precision, recall, and F1-score can be derived.

d. Cross-Validation: Cross-validation is a technique to assess model performance using
different subsets of the data. It helps mitigate biases that may arise from a single train-test split.
Common methods include k-fold cross-validation, where the data is divided into k equal-sized
subsets, and leave-one-out cross-validation, where each data point serves as the test set once.

e. Overfitting and Underfitting: Overfitting occurs when a model performs exceptionally well
on the training data but fails to generalize to new, unseen data. Underfitting, on the other hand,
occurs when a model fails to capture the underlying patterns in the data and performs poorly on
both the training and test sets. Model evaluation helps identify and address these issues by
finding the right balance of model complexity.

f. Bias-Variance Tradeoff: The bias-variance tradeoff is a fundamental concept in model
evaluation. Bias refers to the error introduced by approximating a real-world problem with a
simplified model. Variance, on the other hand, refers to the model's sensitivity to fluctuations in
the training data. A good model strikes a balance between bias and variance to achieve optimal
generalization performance.

g. Performance Visualization: Visualizations such as precision-recall curves, ROC curves,
calibration plots, and learning curves provide valuable insights into a model's performance.
These plots help understand how different evaluation metrics change with varying thresholds or
sample sizes, aiding in the interpretation and comparison of models.

h. Model Comparison: Model evaluation allows for the comparison of multiple models to
identify the best-performing one. This can involve comparing evaluation metrics, conducting
statistical tests, or using resampling techniques like bootstrapping.

i. Domain-specific Evaluation: In some cases, evaluation metrics specific to the domain or
problem may be required. For example, in natural language processing, metrics like BLEU
(bilingual evaluation understudy) or ROUGE (recall-oriented understudy for gisting evaluation)
are used to evaluate machine translation or text summarization models.

Remember, model evaluation is an iterative process. It is important to continuously evaluate and
refine models as new data becomes available or as the problem domain evolves.
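
A minimal sketch, assuming scikit-learn and a synthetic dataset, of computing several of the metrics discussed above:

# A minimal sketch (assuming scikit-learn): train/test split, a simple
# classifier, and several common evaluation metrics on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))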

Example 6.60: Evaluation Model

Fig.6.87: Eveluation metrics using Python code.

Fig.6.88: Output eveluation metrics using Python code for Figure1.

6.13 t-test

The t-distribution is a probability distribution that is used in statistical hypothesis testing when
the population standard deviation is unknown. It is a bell-shaped distribution, similar to the
normal distribution, but with thicker tails. This means that the t-distribution is more spread out
than the normal distribution, and is therefore more likely to produce extreme values.

The t-distribution is defined by its degrees of freedom (df), which is the number of independent
observations in the sample minus one. The larger the df, the more closely the t-distribution
resembles the normal distribution.

The t-test is a statistical test that is used to compare the means of two groups. It is a parametric
test, which means that it assumes that the data come from a normally distributed population. If
the population standard deviation is known, then the z-test can be used instead of the t-test.
However, if the population standard deviation is unknown, then the t-test must be used.

The t-test statistic is calculated by dividing the difference between the sample means by the
standard error of the difference between the means. The standard error of the difference between
the means is calculated using the sample standard deviations and the sample sizes of the two
groups.

The t-distribution is a versatile tool that can be used in a variety of statistical applications. It is a
commonly used test in the fields of psychology, education, and medicine.

Here are some of the assumptions of the t-test:

The data must be normally distributed.

The samples must be independent.

The samples must be drawn from populations with equal variances.

If these assumptions are not met, then the results of the t-test may be unreliable.

The t-test is a statistical test used to determine if there is a significant difference between the
means of two groups. It is commonly used to compare the means of two samples to determine if
they are likely to have come from the same population or from different populations with
different means. There are three main types of t-tests:

a. Independent Samples t-test: This is used when you have two separate, unrelated groups
of data, and you want to compare their means. For example, you might compare the test

scores of students from two different schools to see if there is a significant difference in
performance.

Formula:

t = (x̄₁ − x̄₂) / (s_p · √(1/n₁ + 1/n₂))

Where:

x̄₁ and x̄₂ are the means of the two groups.

n₁ and n₂ are the sample sizes of the two groups.

s_p is the pooled standard deviation, calculated as:

s_p = √[ ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) ]

s₁² and s₂² are the sample variances of the two groups.

The resulting t value is then compared to the critical value from the t-distribution with
n₁ + n₂ − 2 degrees of freedom to determine whether the difference between the means is
statistically significant. If the calculated t value is greater than the critical value, then you can
conclude that there is a significant difference between the means of the two groups.

b. Paired Samples t-test: This is used when you have two sets of related data, often collected
from the same individuals or units at different times or under different conditions. For example,
you might compare the test scores of students before and after they take a tutoring program to
see if there is a significant difference in their performance.

Formula:

Let dᵢ represent the difference between the paired observations for each subject i, and d̄ be the
mean of these differences:

d̄ = (Σ dᵢ) / n

Where:

n is the number of paired observations.

The standard deviation of the differences, s_d, can be calculated as follows:

s_d = √[ Σ(dᵢ − d̄)² / (n − 1) ]

The formula for the paired samples t-test statistic t is:

t = d̄ / (s_d / √n)
This t value can then be compared to the critical value from the t-distribution with n-1 degrees of
freedom to determine the statistical significance of the difference between the means of paired
groups.

c. One-Sample t-test: This is used when you have a single sample and you want to
compare its mean to a known population mean or a hypothesized value. For example, you
might compare the mean height of a sample of students to the national average height to
see if there is a significant difference.

Formula:

t = (x̄ − μ₀) / (s / √n)

Where:

x̄ is the sample mean.

μ₀ is the hypothesized population mean that you are testing against.

s is the sample standard deviation.

n is the sample size.

The resulting t value is then compared to the critical value from the t-distribution with n-1
degrees of freedom to determine whether the difference between the sample mean and the
hypothesized population mean is statistically significant. If the calculated t value is greater than
the critical value, you can conclude that there is a significant difference between the sample
mean and the hypothesized population mean.

Example 6.61: Independent samples t-test

Let’s consider an example using an independent samples t-test. Suppose we want to compare the
mean test scores of students from two different classrooms, A and B. We have the following
data:

Classroom A:

Sample size (n1) = 5

Mean score (M1) = 84.6

Standard deviation (s1) = 4.775

Classroom B:

Sample size (n2) = 5

Mean score (M2) = 89.6

Standard deviation (s2) = 4.159

We can now calculate the t-statistic using the formula above. The pooled standard deviation is

s_p = √[ (4 × 4.775² + 4 × 4.159²) / 8 ] ≈ 4.478

so

t = (84.6 − 89.6) / (4.478 × √(1/5 + 1/5)) ≈ -1.766

With n₁ + n₂ − 2 = 8 degrees of freedom, the two-tailed p-value is approximately 0.115.
Since p (0.115) > α (0.05), we fail to reject the null hypothesis: there is no significant difference
in the average test scores between the two groups.
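
A minimal sketch, assuming SciPy, that reproduces this test directly from the summary statistics:

# A minimal sketch (assuming SciPy) of the independent-samples t-test above,
# computed from the summary statistics of the two classrooms.
from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=84.6, std1=4.775, nobs1=5,
    mean2=89.6, std2=4.159, nobs2=5,
    equal_var=True,
)
print("t =", t_stat)          # ≈ -1.766
print("p-value =", p_value)   # ≈ 0.115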

Example 6.62: t-test

Fig. 6.89: t-test Python code

Fig. 6.90: Output plot for t-test using Python code.

Example 6.63: t-test

Fig.6.91: t-test for two classroom using Python code.

Fig. 6.92: t-test for two classroom using Python code.

Example 6.64: t-test

Fig.6.93: t-test using Python code.

Fig.6.94: Output statistical results for t-test using Python code.

Example 6.65: t-test

Fig. 6.95: t-test using Python code.

Fig.6.96: Output statistical results for t-test using Python code.

6.14 F-test

The F-test, also known as Fisher's F-test or the variance ratio test, is a statistical test used to
compare the variances of two or more groups or samples. It is commonly used in analysis of
variance (ANOVA) and regression analysis to assess whether the variances of the groups are
equal or significantly different. The F-test is based on the F-distribution, which is a probability
distribution that arises when comparing the variances of different groups.

The F-test is often used in two main scenarios:

I. Testing Equality of Variances:

In this scenario, the F-test is used to compare the variances of two or more groups to determine if
they are statistically equal. The null hypothesis (H0) is that the variances of the groups are equal.
The alternative hypothesis (Ha) is that the variances are not equal. If the F-test produces a large
F-statistic and a small p-value, we reject the null hypothesis and conclude that the variances are
significantly different.

II. ANOVA (Analysis of Variance):

In ANOVA, the F-test is used to assess whether there are statistically significant differences
among the means of three or more groups. It compares the variability between the group means
to the variability within the groups. The null hypothesis (H0) in ANOVA is that all group means
are equal. The alternative hypothesis (Ha) is that at least one group mean is different. If the F-test
results in a large F-statistic and a small p-value, we reject the null hypothesis and conclude that
there are significant differences among the group means.

III. F-Statistic:

The F-statistic is a numerical value that measures the ratio of the variances between the groups
(treatments) and the variances within the groups (residuals). For the equality of variances test,
the F-statistic is calculated as:

F = Variance between groups / Variance within groups

For ANOVA, the F-statistic is obtained from the sum of squares between groups and the sum of
squares within groups, which are components of the variance.

IV. Degrees of Freedom:

The degrees of freedom are used to calculate the critical value of the F-distribution and the p-
value. For the equality of variances test, the degrees of freedom for the numerator (between
groups) is k - 1, and the degrees of freedom for the denominator (within groups) is N - k, where k
is the number of groups, and N is the total number of observations.

For ANOVA, the degrees of freedom for the numerator is k - 1 (where k is the number of
groups), and the degrees of freedom for the denominator is N - k (where N is the total number of
observations).

V. F Distribution and Critical Value:

The F-distribution is positively skewed and depends on two degrees of freedom: df1 (numerator)
and df2 (denominator). The critical value of the F-distribution at a given significance level
(alpha) is used to determine if the F-statistic is statistically significant. If the calculated F-statistic
is greater than the critical value, we reject the null hypothesis.

The formula is:

F = s₁² / s₂²

where s₁² and s₂² are the variances of the two samples. It can be shown that the sampling
distribution of such a ratio, appropriately called a variance ratio, is a continuous distribution
called the F distribution. This distribution depends on the two parameters ν₁ = n₁ − 1 and
ν₂ = n₂ − 1. If the calculated F statistic exceeds the critical value F_α(ν₁, ν₂), we reject the null
hypothesis at the α level of significance and accept the alternative hypothesis.

VI. p-value:

The p-value is the probability of observing the data or more extreme data under the assumption
that the null hypothesis is true. A small p-value (typically less than the chosen significance level,
such as 0.05) indicates that we can reject the null hypothesis in favor of the alternative
hypothesis.

VII. Decision:

Based on the p-value and the chosen significance level, we make a decision whether to reject or
fail to reject the null hypothesis. If the p-value is less than the significance level, we reject the
null hypothesis and conclude that there is a significant difference (for equality of variances) or
significant differences among the means (for ANOVA). If the p-value is greater than or equal to
the significance level, we fail to reject the null hypothesis and do not find significant evidence of
differences.

The F-test is widely used in various fields, including experimental research, quality control, and
regression analysis. It helps researchers determine if there are significant differences between
groups and assists in understanding the sources of variability in data.
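
A minimal sketch, assuming NumPy and SciPy and purely illustrative data, of the variance-ratio F-test described above:

# A minimal sketch (assuming NumPy and SciPy) of a two-sample variance-ratio
# F-test on illustrative data.
import numpy as np
from scipy.stats import f

group_x = np.array([78, 85, 90, 73, 88, 95, 80, 82])   # illustrative scores
group_y = np.array([72, 75, 71, 70, 74, 73, 76, 72])

s1 = np.var(group_x, ddof=1)       # sample variances
s2 = np.var(group_y, ddof=1)

F = s1 / s2
df1, df2 = len(group_x) - 1, len(group_y) - 1

# two-tailed p-value for the variance ratio
p_value = 2 * min(f.cdf(F, df1, df2), 1 - f.cdf(F, df1, df2))

print("F =", F)
print("p-value =", p_value)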

Example 6.66: F-Distribution

Fig. 6.97: Using Python Code to display F-Distribution.

Fig. 6.98: Output graph for F-distribution using Python code.

Example 6.67: F-test

Fig.6.99: Calculate F-statistic using Python code.

Fig.6.100: Output of F-statistic using Python code.

Example 6.68:

In this example, we used the f_oneway function from the scipy.stats library to perform the one-
way ANOVA. The function takes the exam scores of each group as input and returns the F-
statistic and p-value.

Interpretation of results:

F-statistic: The computed F-statistic is approximately 11.944134078212295.

p-value: The p-value is approximately 0.0013975664568248779.

Since the p-value (0.0013975664568248779) is less than the typical significance level of 0.05,
we reject the null hypothesis. Therefore, we conclude that there is a significant difference in
mean exam scores among the three groups.

Assumptions of the F-Test:

a. Independence: The observations in each group should be independent of each other.
b. Normality: The data within each group should follow a normal distribution.
c. Homogeneity of Variance: The variances of the groups should be equal.

If any of these assumptions are violated, the results of the F-test may not be valid. It is essential
to check these assumptions before interpreting the results and consider alternative methods, such
as Welch's ANOVA or Kruskal-Wallis test, if the assumptions are not met.

In conclusion, the F-test (one-way ANOVA) is a powerful statistical tool for comparing means
among three or more groups. It allows researchers to determine whether observed differences in
means are statistically significant, helping to draw conclusions about the population and identify
significant factors that contribute to variability in data.

Example 6.69: F-test

Let's consider another example of performing the F-test, this time for comparing the variances of
two different samples.

Suppose we have data from two groups of students, Group X and Group Y, and we want to
determine if there is a significant difference in the variances of their exam scores.

Here's the Python code to perform the F-test for comparing variances:

Example 6.70: F-test

Fig. 6.101: Calculate F-statistic using Python code.

Fig. 6.102: Output of F-statistic using Python code.

In this example, we manually calculated the F-statistic and the corresponding p-value for
comparing the variances of the two groups. The np.var function is used to calculate the sample
variances, and f.cdf from scipy.stats is used to calculate the cumulative distribution function
(CDF) of the F-distribution.

Interpretation of results:

F-statistic: The computed F-statistic is approximately 4.304878048780488.

p-value: The p-value is approximately 0.1864127331234633.

Since the p-value (0.1864127331234633) is greater than the typical significance level of 0.05, we
do not reject the null hypothesis. Therefore, we conclude that there is no significant difference in
the variances of exam scores between Group X and Group Y.

It's important to note that the F-test for comparing variances is sensitive to the assumption of normality. If the data is not normally distributed, the F-test may not provide accurate results, and alternatives that are more robust to non-normality, such as Levene's test, should be considered.

Also, if the sample sizes are small, the F-test may not perform well due to limited statistical power. In such cases, other methods such as Bartlett's test can be used, keeping in mind that Bartlett's test also assumes normality.

In summary, the F-test for comparing variances is a useful tool to assess whether the variability
of two samples is significantly different. It helps researchers determine whether samples can be
assumed to have equal variances, which is important in various statistical analyses and
hypothesis testing.

6.15 Chi-Square test

Chi-Square (χ²) is a statistical test used to determine if there is a significant association between
two categorical variables. It is commonly used in the field of statistics, especially in hypothesis
testing and analyzing data with categorical outcomes. The Chi-Square test assesses whether there
is a difference between the expected and observed frequencies in a contingency table.

The Chi-Square test can be applied to categorical data with two or more categories, and it can be
used to answer questions like:

Is there a significant relationship between two categorical variables?

Do the observed frequencies differ significantly from the expected frequencies?

Are the proportions in different categories significantly different from each other?

Let's go into more detail about how the Chi-Square test works:

The chi-square test can be used to test two types of hypotheses:

I. The chi-square goodness of fit test: This test is used to test whether the observed distribution
of a categorical variable is different from a hypothesized distribution. For example, you could
use a chi-square goodness of fit test to test whether the distribution of eye colors in a population
is different from the expected distribution of 50% brown eyes, 25% blue eyes, and 25% green
eyes.

II. The chi-square test of independence: This test is used to test whether two categorical
variables are independent of each other. For example, you could use a chi-square test of
independence to test whether the distribution of eye colors is independent of the distribution
of hair colors in a population.

Contingency Table:

A contingency table, also known as a cross-tabulation table, displays the frequency distribution
of two categorical variables. The rows represent one variable, and the columns represent the
other variable. Each cell in the table contains the count of observations that fall into a specific
combination of categories from both variables.

Example 6.71: Consider a study where we want to examine the relationship between smoking
habit (categories: "Smoker" and "Non-Smoker") and the development of a certain lung disease
(categories: "Disease" and "No Disease"). The contingency table might look like this:

Table 6.1: The contingency table

Disease No Disease
Smoker 30 20
Non-Smoker 10 40

Null Hypothesis (H0) and Alternative Hypothesis (Ha):

The Chi-Square test involves formulating two hypotheses:

Null Hypothesis (H0): There is no significant association between the two categorical variables.
The observed frequencies are similar to the expected frequencies.

Alternative Hypothesis (Ha): There is a significant association between the two categorical
variables. The observed frequencies are significantly different from the expected frequencies.

Expected Frequencies:

To perform the Chi-Square test, we need to calculate the expected frequencies for each cell in the
contingency table. The expected frequency is the count we would expect if there were no
association between the variables, assuming the null hypothesis is true. The formula to calculate
the expected frequency for a cell (i, j) in a contingency table with r rows and c columns is:

Expected Frequency(i, j) = (row total i * column total j) / grand total

Chi-Square Statistic:

The Chi-Square statistic is calculated by comparing the observed frequencies with the expected
frequencies for each cell in the contingency table. The formula to calculate the Chi-Square
statistic is:

χ² = Σ [(Observed Frequency(i, j) - Expected Frequency(i, j))^2 / Expected Frequency(i, j)]

where the summation (Σ) is taken over all cells in the contingency table.

Degrees of Freedom (df):

The degrees of freedom for the Chi-Square test depend on the dimensions of the contingency
table. In general, the degrees of freedom equal (r - 1) * (c - 1), where r is the number of rows and c is the number of columns; for a 2x2 contingency table this gives (2 - 1) * (2 - 1) = 1.
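As an illustration of these formulas, the following sketch applies scipy.stats.chi2_contingency (mentioned later in this section) to the contingency table of Table 6.1; the correction=False argument disables Yates' continuity correction so that the statistic matches the formula given above.

# Minimal sketch: applying the formulas above to the data in Table 6.1
from scipy.stats import chi2_contingency

observed = [[30, 20],   # Smoker:     Disease, No Disease
            [10, 40]]   # Non-Smoker: Disease, No Disease

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("Chi-Square statistic:", chi2_stat)
print("Degrees of freedom:", dof)            # (2-1)*(2-1) = 1
print("Expected frequencies:\n", expected)
print("p-value:", p_value)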

Critical Value and p-value:

Once we have the Chi-Square statistic and the degrees of freedom, we can compare the test
statistic to the Chi-Square distribution to obtain the p-value. The p-value represents the
probability of observing the data or more extreme data under the assumption of the null
hypothesis. A small p-value (typically less than the chosen significance level, such as 0.05)
indicates that we can reject the null hypothesis in favor of the alternative hypothesis.

Decision:

Finally, based on the p-value and the chosen significance level, we make a decision whether to
reject or fail to reject the null hypothesis. If the p-value is less than the significance level, we
reject the null hypothesis and conclude that there is a significant association between the two
categorical variables. If the p-value is greater than or equal to the significance level, we fail to
reject the null hypothesis and do not find significant evidence of an association.

Chi-Square test can be performed using statistical software packages or programming languages
like Python, R, or MATLAB. In Python, you can use libraries like scipy.stats.chi2_contingency
from SciPy to perform the Chi-Square test on a contingency table and calculate the Chi-Square
statistic, p-value, and expected frequencies.

The chi-square test is a relatively simple test to perform, but it is important to understand the
assumptions that are made when using it. The main assumption of the chi-square test is that the
expected values in each cell of the contingency table are large enough. This means that the
expected values should be at least 5. If the expected values are not large enough, then the chi-
square test may not be accurate.

The chi-square test is a powerful tool for testing hypotheses about categorical data. However, it
is important to use it correctly and to understand the assumptions that are made when using it.

Here are the steps involved in performing a chi-square test:

a. State the null and alternative hypotheses. The null hypothesis is the hypothesis that there
is no difference between the observed and expected data. The alternative hypothesis is the
hypothesis that there is a difference between the observed and expected data.
b. Calculate the chi-square statistic. The chi-square statistic is calculated by comparing the
observed and expected values in each cell of the contingency table.
c. Determine the p-value. The p-value is the probability of obtaining the observed chi-
square statistic if the null hypothesis is true.
d. Make a decision about the null hypothesis. If the p-value is less than the significance
level, then the null hypothesis is rejected. This means that there is sufficient evidence to
support the alternative hypothesis. If the p-value is greater than the significance level,
then the null hypothesis is not rejected. This means that there is not enough evidence to
support the alternative hypothesis.

The significance level is the probability of making a Type I error. A Type I error is rejecting the
null hypothesis when it is true. The default significance level is 0.05, which means that there is a
5% chance of making a Type I error.

Example 6.72: Chi-Square test.

Suppose you want to test whether the distribution of students' grades in a class follows an
expected grade distribution. You have 150 students, and you expect the grade distribution to be
as follows:

A: 20%

B: 30%

C: 25%

D: 15%

E: 10%

You count the actual number of students who received each grade and want to test if it matches
the expected distribution, while the observed values are:

A: 30

B: 40

C: 35

D: 20

E: 25

We can calculate the expected values as follows:

E = [0.20, 0.30, 0.25, 0.15, 0.10] × 150 = [30, 45, 37.5, 22.5, 15]

Example 6.73: Chi-Square distribution

Fig. 6.103: Using Python code to display Chi-Square distribution.

Fig. 6.104: Output graph for Chi-Square distribution using Python code.
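The code of Fig. 6.103 is not reproduced here; the sketch below shows one way to draw Chi-Square density curves, with illustrative degrees of freedom.

# Minimal sketch: plotting Chi-Square probability density functions
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0.01, 20, 500)
for k in (2, 4, 8):                          # illustrative degrees of freedom
    plt.plot(x, chi2.pdf(x, df=k), label=f"df = {k}")

plt.xlabel("Chi-Square value")
plt.ylabel("Probability density")
plt.title("Chi-Square Distribution")
plt.legend()
plt.show()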

Example 6.74: (Python code) Chi-square goodness of fit test

Fig.6.105 : Chi-Square test using Python code.

Fig.6.106: Chi-Square distribution graph output using Python code.

Fig.6.107: Statistical output for chi-square test using Python code.
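The code of Fig. 6.105 is not reproduced here; the sketch below performs the goodness-of-fit test on the grade counts of Example 6.72 using scipy.stats.chisquare.

# Minimal sketch of the goodness-of-fit test for the grade data of Example 6.72
from scipy.stats import chisquare

observed = [30, 40, 35, 20, 25]          # observed counts for grades A-E
expected = [30, 45, 37.5, 22.5, 15]      # 150 students * expected proportions

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print("Chi-Square statistic:", chi2_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject the null hypothesis: the grades do not follow the expected distribution.")
else:
    print("Fail to reject the null hypothesis.")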

Example 6.75: chi-square test of independence

Let's consider a numerical example to demonstrate the chi-square test using a hypothetical
dataset. Suppose we want to investigate whether there is a significant association between gender
and favorite color among a group of people. Here's the observed data:

Table 6.2: Observed values

          Blue   Green   Red   Total
Male       47     59      56    162
Female     33     61      44    138
Total      80    120     100    300

Using the expected frequency formula, Expected(i, j) = (row total i × column total j) / grand total:

E(Male, Blue) = (162 × 80) / 300 = 43.2
E(Male, Green) = (162 × 120) / 300 = 64.8
E(Male, Red) = (162 × 100) / 300 = 54
E(Female, Blue) = (138 × 80) / 300 = 36.8
E(Female, Green) = (138 × 120) / 300 = 55.2
E(Female, Red) = (138 × 100) / 300 = 46

Table 6.3: Expected values

          Blue   Green   Red
Male      43.2   64.8    54
Female    36.8   55.2    46

χ² = Σ [(Observed - Expected)² / Expected] ≈ 2.016

Since the p-value ≈ 0.365 is greater than α = 0.05, the null hypothesis is not rejected: there is no significant evidence of an association between gender and favorite color.

Equivalently, you could consult a chi-square distribution table with the appropriate degrees of freedom (here d.f. = (rows - 1) * (columns - 1) = 2) at your chosen significance level to find the critical value. The critical value for α = 0.05 and 2 degrees of freedom is 5.991; since the calculated χ² ≈ 2.016 does not exceed it, the same conclusion is reached. Only if the calculated χ² value were greater than the critical value would you reject the null hypothesis and conclude that there is a significant association between gender and favorite color.

Example 6.76: Chi-Square test

Fig. 6.108: Chi-square test using Python code.

Fig. 6.109: Statistical output for chi-square test using Python code.

Fig. 6.110: Output graph for Observed data and Expected Frequency Using Python code.

Fig. 6.111: Output curve for Chi-Square Distribution using Python code.
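The code of Fig. 6.108 is not reproduced here; the sketch below applies scipy.stats.chi2_contingency to the observed counts of Table 6.2 and reproduces the values derived by hand above (χ² ≈ 2.016, d.f. = 2, p ≈ 0.365).

# Minimal sketch of the independence test for Table 6.2 (gender vs. favorite color)
from scipy.stats import chi2_contingency

observed = [[47, 59, 56],   # Male:   Blue, Green, Red
            [33, 61, 44]]   # Female: Blue, Green, Red

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Chi-Square statistic:", chi2_stat)    # approximately 2.016
print("Degrees of freedom:", dof)            # (2-1)*(3-1) = 2
print("Expected frequencies:\n", expected)
print("p-value:", p_value)                   # approximately 0.365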

CHAPTER SEVEN: Case studies

Case Study #1: Exploratory Data Analysis for Retail Sales

Case Study # 2: Sentiment Analysis on Customer Reviews

Case Study # 3: Customer Churn Prediction

7.1 CASE STUDY #1 : Exploratory Data Analysis for Retail Sales

Here's a simple case study for a data science project using Python. Let's consider a scenario
where you're working for a retail company that wants to analyze their sales data to make data-
driven decisions. The goal of this case study is to perform exploratory data analysis (EDA) on
the sales data using Python.

Case Study: Exploratory Data Analysis for Retail Sales

Problem Statement: The retail company wants to gain insights from their sales data to
understand trends, patterns, and make informed business decisions.

Dataset: The dataset contains information about sales transactions, including the date of
purchase, product ID, quantity sold, price, and customer ID.

Objectives:

Load and preprocess the data.

Perform basic data exploration to understand the structure of the data.

Analyze sales trends over time.

Identify top-selling products and customers.

Visualize the data to communicate insights effectively.

Example 7.1: Case study #1:

Fig. 7.1: Exploratory Data Analysis for Retail Sales using Python code.
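The code of Fig. 7.1 is not reproduced here; the sketch below outlines the kind of analysis described in the objectives. The file name sales_data.csv and the column names (date, product_id, quantity, price, customer_id) are assumptions based on the dataset description.

# Minimal EDA sketch for the retail sales case study (assumed file and column names)
import pandas as pd
import matplotlib.pyplot as plt

# Load and preprocess
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
df = df.dropna()
df["revenue"] = df["quantity"] * df["price"]

# Basic exploration
print(df.info())
print(df.describe())

# Sales trend over time (monthly revenue)
monthly_sales = df.set_index("date")["revenue"].resample("M").sum()
monthly_sales.plot(title="Monthly Revenue")
plt.show()

# Top-selling products and top customers
top_products = df.groupby("product_id")["quantity"].sum().nlargest(10)
top_customers = df.groupby("customer_id")["revenue"].sum().nlargest(10)

top_products.plot(kind="bar", title="Top-Selling Products")
plt.show()
top_customers.plot(kind="bar", title="Top Customers by Revenue")
plt.show()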

Fig.7.2 : Output results for Exploratory Data Analysis for Retail Sales using Python code.

Fig.7.3: Output graph for selling products using Python code.

Fig.7.4: Output graph for customers using Python code.

7.2 Case Study # 2: Sentiment Analysis on Customer Reviews

Let's explore another case study involving sentiment analysis on customer reviews using Python.
In this scenario, you'll work with a dataset of customer reviews for a product and build a
sentiment analysis model to classify each review as positive, negative, or neutral.

Case Study: Sentiment Analysis on Customer Reviews

Problem Statement: The company wants to understand the sentiment of customer reviews about
their product in order to identify areas of improvement and track customer satisfaction.

Dataset: The dataset contains customer reviews along with their corresponding sentiment labels
(positive, negative, or neutral).

Objectives:

Load and preprocess the review dataset.

Perform exploratory data analysis (EDA) to understand the distribution of sentiments.

Preprocess the text data by removing stopwords and performing text normalization.

Build a sentiment analysis model using machine learning.

Evaluate the model's performance and visualize the results.

Example 7.2: Case Study #2:

Fig.7.5: Sentiment Analysis on Customer Reviews using Python code.
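The code of Fig. 7.5 is not reproduced here; the sketch below is one way to build such a classifier. The file name reviews.csv and its columns (review, sentiment) are assumptions, and TF-IDF features with logistic regression are used as a representative model rather than the book's exact pipeline.

# Minimal sketch for the sentiment-analysis case study (assumed file and column names)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("reviews.csv")              # columns assumed: "review", "sentiment"

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"], test_size=0.2, random_state=42)

# TF-IDF features with English stopwords removed and text lowercased
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

print(classification_report(y_test, model.predict(X_test_vec)))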

Fig.7.6: Output results Sentiment Analysis on Customer Reviews using Python code.

7.3 Case Study #3: Customer Churn Prediction

Let's explore a different case study involving customer churn prediction using machine learning.
In this scenario, you'll work with a dataset of customer information and their historical behavior
to build a model that predicts whether a customer will churn (leave) the company.

Case Study: Customer Churn Prediction

Problem Statement: The company wants to predict which customers are likely to churn so that
they can take proactive measures to retain them.

Dataset: The dataset contains customer information including features like contract type,
monthly charges, tenure, and whether the customer churned or not.

Objectives:

Load and preprocess the customer churn dataset.

Perform exploratory data analysis (EDA) to understand the data distribution.

Preprocess the data by encoding categorical variables and handling missing values.

Build a customer churn prediction model using a Random Forest classifier.

Evaluate the model's performance and visualize the results.

Example 7.3: Case study #3

Fig. 7.7: Customer churn prediction using Python code.
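The code of Fig. 7.7 is not reproduced here; the sketch below follows the stated objectives. The file name churn.csv and the target column churn are assumptions based on the dataset description.

# Minimal sketch for the churn-prediction case study (assumed file and column names)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # target column assumed to be "churn"
df = df.fillna(df.median(numeric_only=True))         # simple missing-value handling

X = pd.get_dummies(df.drop(columns=["churn"]))       # encode categorical variables
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))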

Fig. 7.8: Output statistical results using Python code.

Fig. 7.9: Output Confusion Matrix using Python code.

CHAPTER EIGHT: Data Science Relationships

 Relationship between data science, machine learning, and artificial intelligence

 Relationship between Data Science and Bioinformatics

8.1 Relationship between data science, machine learning, and artificial intelligence

Data science, machine learning, and artificial intelligence are interconnected fields that build
upon each other, contributing to various aspects of modern technology and problem-solving.
Let's delve into the details of their relationships and definitions:

a. Data Science:

Data science involves the extraction of insights and knowledge from large and complex datasets.
It combines expertise in various domains, including statistics, computer science, domain
knowledge, and data visualization, to make informed decisions and predictions. The key steps in
data science include data collection, cleaning, exploration, analysis, visualization, and
interpretation.

b. Machine Learning:

Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and models that allow computer systems to improve their performance on a specific
task through learning from data. Instead of explicit programming, machine learning enables
systems to adapt and improve automatically based on patterns and experiences gained from data.
Machine learning algorithms can be broadly categorized into supervised learning, unsupervised
learning, and reinforcement learning.

c. Artificial Intelligence:

Artificial intelligence (AI) is a broader concept that involves creating machines and software
capable of intelligent behavior, similar to human cognitive functions. AI encompasses a wide
range of techniques, including machine learning, natural language processing, computer vision,
robotics, and expert systems. AI systems aim to perform tasks that typically require human
intelligence, such as understanding language, recognizing patterns, making decisions, and
solving complex problems.

Relationships:

 Data Science and Machine Learning: Data science heavily relies on machine learning
techniques to extract insights from data. Machine learning algorithms are used to build
predictive models and make data-driven decisions in various domains. Data scientists
leverage machine learning algorithms to analyze patterns and trends in datasets to gain
insights and make predictions.
 Machine Learning and Artificial Intelligence: Machine learning is a subset of AI, and
it's a crucial component that enables AI systems to learn and improve from data. Machine
learning algorithms power many AI applications, including natural language processing,
image recognition, recommendation systems, and autonomous vehicles.

 Data Science and Artificial Intelligence: Data science provides the foundation for
creating intelligent AI systems. AI systems require vast amounts of data for training and
improving their performance. Data science helps AI systems collect, preprocess, and
analyze data to make accurate predictions and decisions.

In essence, data science provides the tools and methodologies to gather and process data,
machine learning empowers systems to learn from the data, and artificial intelligence
encompasses the broader goal of creating intelligent machines that can perform human-like
tasks.

It's important to note that these fields are rapidly evolving, and advancements in one field often
contribute to progress in the others. The synergy between data science, machine learning, and
artificial intelligence continues to shape the landscape of modern technology and innovation.

Example 8.1: Create and save a data.csv file

Fig. 8.1: Create and save data.csv file using Python code.
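The code of Fig. 8.1 is not reproduced here; the sketch below creates a small synthetic dataset and saves it as data.csv. The column names x and y and the linear relationship are illustrative assumptions.

# Minimal sketch: creating and saving a small dataset to data.csv
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 2, 100)    # a noisy linear relationship

pd.DataFrame({"x": x, "y": y}).to_csv("data.csv", index=False)
print("data.csv saved with", len(x), "rows")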

Example 8.2: Linear regression model

Fig. 8.2: Linear regression model using Python code.
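The code of Fig. 8.2 is not reproduced here; the sketch below fits a linear regression with scikit-learn, assuming the data.csv file and columns from the sketch of Example 8.1.

# Minimal sketch of a linear regression model with scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                 # columns assumed: "x", "y"
X = df[["x"]]
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Coefficient:", model.coef_[0], "Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))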

Example 8.3: Build CNN model

Fig. 8.3: Build CNN model using Python Code.

Fig.8.4: Output results
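The code of Fig. 8.3 is not reproduced here; the sketch below builds a small convolutional neural network in Keras, using MNIST as an illustrative dataset (the book's data and architecture may differ).

# Minimal sketch of a small CNN in Keras (MNIST used as an illustrative dataset)
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0         # add channel axis and scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
print("Test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])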

Example 8.4: Generate Sample sentiment analysis data

Fig.8.5: Create and save data.csv file Python code.

Example 8.5: Build an SVM model

Fig. 8.6: Build an SVM model using Python model.

Fig.8.7: SVM model using Python model output.
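The code of Fig. 8.6 is not reproduced here; the sketch below trains an SVM text classifier, assuming a data.csv file with text and label columns as in Example 8.4.

# Minimal sketch of an SVM text classifier for the sentiment data of Example 8.4
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

df = pd.read_csv("data.csv")                 # columns assumed: "text", "label"

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# TF-IDF vectorization followed by a linear-kernel support vector machine
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))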

8.2 Relationship between Data Science and Bioinformatics:

a. Overview:

Data science and bioinformatics are two closely intertwined fields that have a profound impact
on modern biological and medical research. They combine techniques from computer science,
statistics, and domain-specific knowledge to extract meaningful insights from large and complex
biological datasets. The relationship between data science and bioinformatics is symbiotic, with
data science providing the tools and methods necessary to analyze biological data and
bioinformatics guiding the application of these methods in the context of biological research.

b. Data-Intensive Nature of Biology:

Advancements in technology have led to the generation of massive amounts of biological data,
such as genomic sequences, proteomic profiles, medical images, and clinical records. This data
deluge has created a need for sophisticated techniques to process, analyze, and interpret this
information effectively.

c. Shared Concepts and Techniques:

 Data Preprocessing: Both fields involve the cleaning, normalization, and transformation
of raw data to ensure accurate and meaningful analysis. In bioinformatics, this includes
tasks like quality control of genomic sequences or removing noise from protein data.
 Feature Extraction: Feature extraction methods in data science, such as dimensionality
reduction and feature selection, find applications in bioinformatics for identifying
relevant features in biological data. For example, identifying important genes or proteins
related to a specific disease.

 Statistical Analysis: Statistical methods are used to identify patterns, correlations, and
significant differences in biological datasets. In bioinformatics, statistical techniques help
researchers understand the significance of genetic variations or differential gene
expression.
 Machine Learning: Both fields utilize machine learning algorithms for predictive
modeling, classification, clustering, and regression tasks. In bioinformatics, machine
learning aids in predicting protein structures, classifying diseases based on genomic
profiles, and drug discovery.
 Data Visualization: Visualizations play a crucial role in communicating complex
biological insights to researchers and clinicians. Interactive visualizations help
understand genetic relationships, expression patterns, and evolutionary trees.
d. Applications:
 Genomics: Data science techniques are instrumental in analyzing DNA sequences,
identifying genes, predicting gene functions, and understanding genetic variations
associated with diseases.
 Proteomics: Bioinformatics and data science collaborate to analyze protein structures,
interactions, and functions, leading to insights into cellular processes and drug discovery.
 Medical Diagnostics: Data-driven approaches aid in disease diagnosis, prognosis, and
treatment planning by analyzing patient data, medical images, and clinical records.
 Pharmaceuticals: Data science helps in drug discovery, designing molecular structures,
and predicting drug-target interactions.

e. Challenges:

Data Integration: Integrating data from diverse sources with varying formats is a challenge in
bioinformatics. Data science techniques enable the harmonization and integration of multi-omics
data.

Scalability: Biological datasets can be massive and complex. Data science methods need to
handle big data efficiently.

Interdisciplinary Collaboration: Collaboration between biologists, clinicians, data scientists, and domain experts is crucial to ensure meaningful analysis and interpretation.

f. Future Directions:

The relationship between data science and bioinformatics will continue to grow stronger with
advancements in machine learning, deep learning, and big data analytics. Innovative applications
in personalized medicine, precision agriculture, and synthetic biology are expected to emerge.

In essence, the relationship between data science and bioinformatics is marked by the synergistic
combination of computational methods and biological knowledge. Together, they pave the way
for breakthroughs in understanding biology, diseases, and ultimately improving human health.

Example 8.6: Relationship between data science and Bioinformatics

Fig.8.8: Relationship between data science and Bioinformatics using Python code.

Fig.8.9: Output statistical results relationship between data science and Bioinformatics using Python code.

Example 8.7: Gene Expression

Fig. 8.10: Gene Expression using Python Code.

Fig. 8.11: Gene Expression using Python Code.
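The code of Fig. 8.10 is not reproduced here; the sketch below visualizes a synthetic gene-expression matrix as a heatmap, with randomly generated values used purely for illustration.

# Minimal sketch: visualizing a synthetic gene-expression matrix as a heatmap
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
expression = rng.normal(loc=5, scale=2, size=(10, 6))   # 10 genes x 6 samples (synthetic)

plt.imshow(expression, aspect="auto", cmap="viridis")
plt.colorbar(label="Expression level")
plt.xticks(range(6), [f"Sample {i+1}" for i in range(6)])
plt.yticks(range(10), [f"Gene {i+1}" for i in range(10)])
plt.title("Gene Expression Heatmap")
plt.show()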

Example 8.8: Generate random DNA Sequences

Fig. 8.12: Generate random DNA Sequences using Python Code.

Fig. 8.13: Distribution of GC Content in DNA sequences
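The code of Fig. 8.12 is not reproduced here; the sketch below generates random DNA sequences and plots the distribution of their GC content. The sequence length and count are illustrative choices.

# Minimal sketch: generating random DNA sequences and plotting their GC content
import random
import matplotlib.pyplot as plt

def random_dna(length):
    """Return a random DNA sequence of the given length."""
    return "".join(random.choice("ACGT") for _ in range(length))

def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

sequences = [random_dna(200) for _ in range(500)]
gc_values = [gc_content(s) for s in sequences]

plt.hist(gc_values, bins=30, edgecolor="black")
plt.xlabel("GC content")
plt.ylabel("Number of sequences")
plt.title("Distribution of GC Content in Random DNA Sequences")
plt.show()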

Appendix A: FURTHER READING

 Artificial Intelligence

 Bioinformatics

 Data Science

A.1. Artificial Intelligence
 Ali A. Ibrahim, and et., (2023), “Forecasting Stock Prices with an Integrated Approach
Combining ARIMA and Machine Learning Techniques ARIMAML”, Journal of
Computer and Communications, 2023, 11, pp.: 58-70.
 Ali A. Ibrahim, and et., (2023), “Use the Power of a Genetic Algorithm to Maximize
and Minimize Cases to Solve Capacity Supplying Optimization and Travelling
Salesman in Nested Problems”, 11, pp: 24-31.
 Ali A. Ibrahim, and et., (2022), “Multi-Stage Image Compression-Decompression
System Using PCA/PCA to Enhance Wireless Transmission Security”, Journal of
Computer and Communications, Journal of Computer and Communications, 10, pp.:
87-96.
 Ali A. Ibrahim, and et., (2019), “Design & Implementation of an Optimization Loading
System in Electric by Using Genetic Algorithm”, Journal of Computer and
Communications, 7, 7, pp.: 135-146.
 Ali A. Ibrahim, and et., (2018), “The effect of Z-Score standardization on binary
input due the speed of learning in back-propagation neural network”, Iraqi Journal of
Information and Communication Technology, 1, 3, pp.: 42-48.
 Ali A. Ibrahim, and et., (2018), “Design and implementation of fingerprint
identification system based on KNN neural network”, Journal of Computer and
Communications, 6, 3, pp.: 1-18.
 Ali A. Ibrahim, and et., (2016), “Using neural networks to predict secondary
structure for protein folding”, Journal of Computer and Communications, 5, 1, pp.: 1-
8.
 Ali A. Ibrahim, and et., (2016), “Design and implementation of iris pattern recognition using wireless network system”, Journal of Computer and Communications, 4, 7, pp.: 15-21.
 Ali A. Ibrahim, and et., (2013), Design and Implementation Iris Recognition System
Using Texture Analysis, Al-Nahrain Journal for Engineering Sciences, 16, 1, pp: 98-
101.

A.2. Bioinformatics

 Ali A. Ibrahim, and et., (2020), “Proposed Genetic Profiling System Based on Gel
Electrophoresis for Forensic DNA Identification”, Indian Journal of Public Health Research &
Development, 11,2.
 Ali A. Ibrahim, and et., (2019), “CLASSIFICATION NUMBER OF ORGANISMS USING
CLUSTER ANALYSIS OF THE PEPTIDE CHAINS MULTIPLE CHYMOTRYPSIN LACTATE
DEHYDROGENASE”, Biochem. Cell. Arch, 19, 2, pp.: 4425-4429
 Ali A. Ibrahim, and et., (2019), “Beta-2-microglobulin as a marker in patients with thyroid
cancer”, Iraqi Postgraduate Med Journal, 18, 1, pp.: 18 – 22.
 Ali A. Ibrahim, and et., (2019), “Functional Analysis of Beta 2 Microglobulin Protein in
Patients with Prostate Cancer Using Bioinformatics Methods”, Indian Journal of Public
Health, 10, 3.
 Ali A. Ibrahim, et., (2018), “Sequence and Structure Analysis of CRP of Lung and Breast
Cancer Using Bioinformatics Tools and Techniques”, 11, 1, pp.: 163-174.
 Ali A. Ibrahim, and et., (2018), “C-Reactive Protein as a Marker in the Iraq Patients with
Poisoning Thyroid Gland Disease”, Engineering and Technology Journal, 36, 1 Part (B)
Scientific, University of Technology.
 Ali A. Ibrahim, and et., (2018), Detecting the concentration of C- reactive protein by HPLC,
and analysis the effecting mutations the structure and function of CRP reactive protein, of
standard sample, The First International Scientific Conference / Syndicate of Iraqi Academics
 Ali A. Ibrahim, and et., (2017), C-reactive protein as a marker for cancer and poising thyroid
gland, Engineering & Technology Journal, 35
 Ali A. Ibrahim, and et, (2014), “Using Hierarchical Cluster and Factor Analysis to Classify and
Built a phylogenetic Tree Species of ND1 Mitochondria”, 17, 1, pp: 114-122.
 Ali A. Ibrahim, and et., (2012), BIOINFORMATICS, first edition.

A.3. Data Science

 Ali A Ibrahim, and et., (2019), “Forecasting the Bank of Baghdad index using the Box-
Jenkins methodology”, Dinars Magazine, 15, pp. 441-460.
 Ali A Ibrahim, and et., (2013), “Using of Two Analyzing Methods Multidimensional Scaling
and Hierarchical Cluster for Pattern Recognition via Data Mining”, 3,1, pp:16-20.
 Ali A. Ibrahim, and et., (2011) “DESIGN A FINGERPRINT DATABASE PATTERN
RECOGNITION SYSTEM VIA CLUSTER ANALYSIS METHOD I- DESIGN OF
MATHEMATICAL MODEL”, IRAQI JOURNAL OF BIOTECHNOLOGY, 10, 2, pp: 273-283.
 Ali A. Ibrahim, (2008) “ Using the discriminatory function model for the chemical classification
of powdered milk models and knowing their conformity with the Iraqi standard specifications
through”, Journal College of Science of Al-Nahrain University, 11, 1, pp: 46-57.
 Ali A. Ibrahim, (2002), “Using the discriminatory function model for the chemical classification
of powdered milk models and knowing their conformity with the Iraqi standard specifications
through, Journal of Economic and Administrative Sciences / University of Baghdad, 9,29, pp:
114-138. (ARABIC)
 Ali A. Ibrahim, (2002), “Using protein databases and cluster analysis to compare the protein
homology regions of the Leader Peptidase enzyme and to determine the degree of genetic
affinity between”, Journal of the College of Administration and Economics / Al-Mustansiriya
University, 42, pp: 64-78. (ARABIC)
 Ali A. Ibrahim, (2000),”Using a multidimensional scale to analyze the chemical compositions of
different milk powder samples”, The twelfth scientific conference of the Iraqi Association for
Statistical Sciences, pp: 189-210, (ARABIC)
 Ali A. Ibrahim, “The use of cluster analysis in the compositional analysis of different milk
powder samples”, (2000), Scientific Journal of Tikrit University, College of Engineering.
(ARABIC)
 Ali A. Ibrahim,(1996) “Use the factor analysis method to extract the variables that determine
The suitability of powdered milk for human consumption”, (1996), Scientific Journal of Tikrit
University, College of Engineering. (ARABIC).

Bibliography

[1] Agresti Alan and Kateri Maria, 2022, “Foundations of Statistics for Data Scientists”,
CRC Press.

[2] Al-Faiz Mohammed Z., Ibrahim Ali A., Hadi Sarmad M, 2018, “The effect of Z-Score
standardization on binary input due the speed of learning in back-propagation neural
network”, Iraqi Journal of Information and Communication Technology, 1, 3, pp.: 42-48.

[3]Al-Faiz Mohammed Z., Ibrahim Ali A., Hadi Sarmad M, 2020, “Proposed Genetic
Profiling System Based on Gel Electrophoresis for Forensic DNA Identification”, Indian
Journal of Public Health Research & Development, 11,2.

[4] Broucke Seppe vanden and Baesens Bart, 2018 “Practical Web Scraping
for Data Science”, Apress.

[5] Caldarelli Guido and Chessa Alessandro, 2016, “Data Science and Complex
Networks”, OXFORD UNIVERSITY PRESS.

[6] CIELEN DAVY, et., 2016, “Introducing Data Science”, MANNING SHELTER ISLAND.

[7]COX D. R., 2006, “Principles of Statistical Inference”, Published in the United States
of America by Cambridge University Press, New York.

[8] Dietrich David and et, 2015, “Data Science & Big Data Analysis”, John Wiley & Sons,
Inc.

[9] Draghici Sorin, 2012, “Statistics and Data Analysis for Microarrays Using R and
Bioconductor”, Second Edition, CRC Press.

[10] Freund John E, 1979, “MODERN ELEMENTARY STATISTICS”, FIFTH EDITION, Prentice/Hall International, Inc., London.

[11] Grus Joel, 2019, “Data Science from Scratch”, Second Edition, O’Reilly Media.

[12] Lee Kent D. and Hubbard Steve, 2015, “Data Structures and Algorithms with Python”, Springer.

[13] Ibrahim Ali A., 1996, “Use the factor analysis method to extract the variables that
determine The suitability of powdered milk for human consumption”, (1996), Scientific
Journal of Tikrit University, College of Engineering. (ARABIC Language).

[14] Ibrahim Ali, A., 2000, “The use of cluster analysis in the compositional analysis of
different milk powder samples”, Scientific Journal of Tikrit University, College of
Engineering. (ARABIC Langue).

[15] Ibrahim Ali, 2000 ,”Using a multidimensional scale to analyze the chemical
compositions of different milk powder samples”, The twelfth scientific conference of the
Iraqi Association for Statistical Sciences, pp.: 189-210, (ARABIC Language).

[16] Ibrahim Ali, 2002, “Using protein databases and cluster analysis to compare the
protein homology regions of the Leader Peptidase enzyme and to determine the degree
of genetic affinity between”, Journal of the College of Administration and Economics /
Al-Mustansiriya University, 42, pp.: 64-78. (ARABIC Language).

[17] Ibrahim Ali, 2002, “Using the discriminatory function model for the chemical
classification of powdered milk models and knowing their conformity with the Iraqi
standard specifications through, Journal of Economic and Administrative Sciences /
University of Baghdad, 9,29, pp.: 114-138. (ARABIC Langauge).

[18] Ibrahim Ali, 2008, “Using the discriminatory function model for the chemical
classification of powdered milk models and knowing their conformity with the Iraqi
standard specifications through”, Journal College of Science of Al-Nahrain University,
11, 1, pp.: 46-57.

[19] Ibrahim Ali, 2011, “DESIGN A FINGERPRINT DATABASE PATTERN RECOGNITION SYSTEM VIA CLUSTER ANALYSIS METHOD I- DESIGN OF MATHEMATICAL MODEL”, IRAQI JOURNAL OF BIOTECHNOLOGY, 10, 2, pp: 273-283.

[20] Ibrahim Ali A. and et, 2012, BIOINFORMATICS, first edition, AL-NAHRAIN UNIVERSITY.

[21] Ibrahim Ali, and et., 2013, “Using of Two Analyzing Methods Multidimensional
Scaling and Hierarchical Cluster for Pattern Recognition via Data Mining”, 3,1, pp:16-20.

[22] Ibrahim Ali A. and et. ,2013, “Design and Implementation Iris Recognition System
Using Texture Analysis”, Al-Nahrain Journal for Engineering Sciences, 16, 1, pp: 98-
101.

[23] Ibrahim Ali A. and et, 2014, “Using Hierarchical Cluster and Factor Analysis to
Classify and Built a phylogenetic Tree Species of ND1 Mitochondria”, 17, 1, pp.: 114-
122.

[24] Ibrahim Ali A. and et, 2016, “Design and implementation of iris pattern recognition
using wireless network system” , Journal of Computer and Communications, 4, 7, pp.: 15-21.

[25] Ibrahim Ali A. and et, 2016, “Using neural networks to predict secondary structure
for protein folding”, Journal of Computer and Communications, 5, 1, pp.: 1-8.

[26] Ibrahim Ali A. and et, 2017, C-reactive protein as a marker for cancer and poising
thyroid gland, Engineering & Technology Journal, 35.

[27] Ibrahim Ali A. and et, 2018, “Design and implementation of fingerprint identification
system based on KNN neural network”, Journal of Computer and Communications, 6, 3,
pp.: 1-18.

[28] Ibrahim Ali A. and et, 2018, “Sequence and Structure Analysis of CRP of Lung and
Breast Cancer Using Bioinformatics Tools and Techniques”, 11, 1, pp.: 163-174.

[29] Ibrahim Ali A. and et, 2018, “C-Reactive Protein as a Marker in the Iraq Patients
with Poisoning Thyroid Gland Disease”, Engineering and Technology Journal, 36, 1 Part
(B) Scientific, University of Technology.

[30] Ibrahim Ali, and et., 2019, “Forecasting the Bank of Baghdad index using the Box-Jenkins methodology”, Dinars Magazine, 15, pp. 441-460.

[31] Ibrahim Ali A. and et, 2019, “Design & Implementation of an Optimization Loading
System in Electric by Using Genetic Algorithm”, Journal of Computer and
Communications, 7, 7, pp.: 135-146.

[32] Ibrahim Ali A. and et, 2019, “CLASSIFICATION NUMBER OF ORGANISMS USING
CLUSTER ANALYSIS OF THE PEPTIDE CHAINS MULTIPLE CHYMOTRYPSIN
LACTATE DEHYDROGENASE”, Biochem. Cell. Arch, 19, 2, pp.: 4425-4429.

[33] Ibrahim Ali A. and et, 2019, “Beta-2-microglobulin as a marker in patients with
thyroid cancer”, Iraqi Postgraduate Med Journal, 18, 1, pp.: 18 – 22.

[34] Ibrahim Ali A. and et,, 2019, “Functional Analysis of Beta 2 Microglobulin Protein in
Patients with Prostate Cancer Using Bioinformatics Methods”, Indian Journal of Public
Health, 10, 3.

[35] Ibrahim Ali A. and et, 2022, “Multi-Stage Image Compression-Decompression
System Using PCA/PCA to Enhance Wireless Transmission Security”, Journal of
Computer and Communications, Journal of Computer and Communications, 10, pp.: 87-
96.

[36] Ibrahim Ali A. and et., 2023, “Use the Power of a Genetic Algorithm to Maximize and
Minimize Cases to Solve Capacity Supplying Optimization and Travelling Salesman in
Nested Problems”, 11, pp: 24-31.

[37] Ibrahim Ali A. and et, 2023, “Forecasting Stock Prices with an Integrated Approach
Combining ARIMA and Machine Learning Techniques ARIMAML”, Journal of Computer
and Communications, 2023, 11, pp.: 58-70.

[38] Nylen Erik Lee and Wallisch Pascal, 2017, “NEURAL DATA SCIENCE”, ACADEMIC PRESS.

[39] Ozdemir Sinan, 2016, “Principles of Data Science”, Packt Publishing Ltd

[40] Provost Faster and Fawcett Tom, 2013, “Data Science for Business”, O’REILLY.

[41] Rastogi S. C. and et., 2008, “BIOINFORMATICS Methods and Applications”, Third
Edition, PHI Learning Private Limited.

[42] Reimann Clemens, and et., 2008, “Statistical Data Analysis Explained”, John Wiley
& Sons Ltd.

[43] Salazar Jesús Rogel, 2020, “Advanced Data Science and Analytics with Python”,
CRC Press.

[44] Sundnes Joakim, 2020, “Introduction to Scientific Programming with Python”, Springer.

[45] VanderPlas Jake, 2017, “Python Data Science Handbook”, O’Reilly Media

[46] Varga Ervin, 2019, “Practical Data Science with Python 3”, Apress.

Index

3D, 20, 48, 59, 60-62, 100, 137-142.

A
α (alpha), 304, 310, 323.
Type I error, 317.
Alternative hypothesis, 288, 291, 308-310, 315-317.
Analysis of variance, 308, 309.
Audio Data, 2.
Average, 183-185, 187, 205, 210, 214, 222, 223, 249, 250, 278, 291, 303, 304.

B
Bar chart, 20, 39, 40, 41, 43-49, 191.
Bayesian inference, 243.
Bell-shaped distribution, 301.
Beta distribution, 242-247.
Beta (Β), 242-247.
Binomial distribution, 211, 214-217, 219, 220, 226, 253.
Box Plot, 20, 110-116, 157, 159, 165, 166, 183.
Bubble, 20, 62, 63, 94-102.

C
Categorical Data, 2, 183, 184, 314, 317.
Central limit theorem, 253.
Checking for Consistency, 145, 146, 170.
Chi-square distribution, 316, 319-321, 323, 325.
Choropleth Map, 20, 102-104.
Coefficient, 259, 260, 267-270, 277-279, 283, 284.
Coefficient of Variation, 188.
Confidence interval, 73.
Continuous random variable, 208, 209, 250.
Correlation, 62, 86, 202, 259-276, 345.
Critical value, 302, 303, 309, 316, 323.
Cumulative distribution, 243, 248, 314.
Customer, 1, 2, 148, 170, 178, 211, 222, 240, 241, 326, 327, 331-336.
Customer segmentation, 1.

D
Data, i, 1-12, 16-18, 20-22, 39, 47, 49, 62, 73, 81, 90, 94, 102, 104, 110, 111, 112, 116, 120, 134, 137, 145-160, 163, 166-172, 177-195, 197-199, 203-207, 211, 232, 250, 258, 260, 262, 264, 267-270, 278, 279, 280, 283, 284, 288, 294, 297-299, 301-304, 310, 313, 314, 316, 317, 322, 324, 326, 327, 329, 330, 332, 335, 338, 339, 340, 343-346, 350, 353-355, 357.
Data Cleaning, 145, 146, 151, 152.
Data Collection, ii, 7-9, 339.
DataFrame, 2, 145-147, 154.
Data science, i, ii, 1, 8, 151, 327-340, 344-346, 350, 353, 354, 357.
Data Preprocessing, 145, 146, 151, 152, 344.
Dependent variable, 199, 277-279, 283, 297, 298.
Descriptive Statistics, 183, 191.

E
Error, 46, 71, 151, 152, 157, 158, 170, 171, 180, 277, 278, 283, 294, 299, 301, 317.
Exploratory Data, i, ii, 182, 183, 195, 326, 327, 329, 330, 332, 335.
Expected value, 205, 210, 239, 249-252, 317, 318, 322.
Experiments, 8, 207, 210.

F
F distribution, 309, 310.
Frequency distribution, 190-193, 315.
F-test, 202, 308-310, 312-314.

G
Gantt Charts, 20, 142, 143.
Geometric distribution, 225.
Geospatial Data, 2.
Generating Random Data, 7, 11, 12, 16.

H
Heatmap, 81-90.
Histogram, 20, 73-81, 159, 183, 191.
Hypothesis testing, i, 202, 205, 288, 291, 301, 314.

I
Independent variable, 199, 277-279, 283, 297, 298.
Integer Generation, 12.
Interval, 73, 158, 190-193, 209, 222, 223, 235, 236, 239, 242.
Image Data, 2.

L
Line Chart, 20, 22-39.
Linear Regression, 277-283, 298, 341.

M
Mean, 12, 152, 158, 183, 184, 187, 188, 198, 199, 205, 210, 222, 226, 230-232, 236, 239, 240, 243, 249, 251, 252, 255, 278, 279, 294, 301-304, 309, 310, 313.
Measures of Dispersion, 182, 187-190.
Measure of location, 182, 183.
Median, 110, 116, 152, 158, 159, 183-198.
Mode, 183, 184, 186, 187.
Model Assumptions, 202, 293-297.
Model Evaluation, 202, 277, 298, 299.

N
Network Diagram, 20, 122-130.
Normal Distribution, 205, 226-235, 255, 294, 298, 301, 313.
Numeric Data, 2.

O
Observed frequency, 316.
Observational, 8.

P
Parallel Coordinates Plot, 20, 116-120.
Parameter, 202, 205-207, 210, 239, 240, 242, 243, 252, 253, 255, 284, 288, 310.
Percentiles, 190, 191, 198.
Predictive modeling, 1, 345.
Pie chart, 20, 49-62.
Poisson distribution, 222-224, 253.
Population, 199, 205, 207, 230, 231, 288, 301, 303, 313, 315.
Probability distribution, 11, 202, 205-208, 210, 211, 226, 230, 235, 242, 243, 251, 252, 301, 308.
P-value, 288, 291-293, 308-310, 312-314, 316, 317, 323.

Q
Quartile, 110, 157-159, 187, 188, 190, 191, 198, 200.

R
Radar Charts, 20, 120, 121.
Random Choices, 12, 14.
Randomness, 12, 16.
Random numbers, 12, 155, 208, 244.
Random variables, 207-210, 251-253.
Range, 1, 12, 13, 21, 80, 157-159, 170, 183, 187, 188, 190, 191, 194, 195, 200, 208, 209, 226, 235-237, 250, 259, 268, 278, 339.
Raw data, 169, 194, 197, 344.
Regression analysis, 202, 277, 283, 308, 310.
Root mean square, 278.

S
Sankey, 20, 104-110.
Sample, 16, 184, 205, 207, 208, 258, 288, 291, 299, 301-304, 308, 309, 313, 314, 343.
Sampling, 12, 16, 299, 309.
Scatter Plot, 20, 62-65, 67-73, 94, 137, 138, 157, 159, 183, 297, 298.
Score, 191, 192, 231, 268, 302, 304, 312-314.
Sensor data collection, 8.
Sentiment Analysis, 326, 332-334, 343.
Standard deviation, 12, 158, 159, 183, 187, 188, 198, 199, 205, 226, 230-232, 240, 291, 301-304.
Size, 21, 63, 64, 68, 94, 100, 131, 152, 184, 194, 299, 301-304, 314.
Social media monitoring, 8.
Shuffling and Sampling, 12.
Streamgraph, 20, 134-137.
Surveys, 8.

T
t-distribution, 301-303.
Text Data, 2, 4, 5, 332.
Time Series Data, 2, 5, 6.
Treemap, 20, 90-94.
t-test, 202, 288, 291, 301-308.

U
Uniform distribution, 12, 17, 18, 235-239, 243.

V
Variable, 8, 62, 73, 94, 116, 120, 151, 152, 168, 183, 191, 199, 202-204, 207-210, 214, 222, 226, 239, 242, 248-253, 255, 259, 260, 262, 264, 266-269, 277-279, 283, 284, 294, 297, 298, 314-316, 335, 353, 355.
Variance, 183, 187, 188, 205, 210, 222, 226, 240, 243, 252, 255, 277, 278, 294, 298, 299, 301, 302, 308-310, 313, 314.
Video Data, 2.

W
Web scraping, 8.
Weight, 226, 249.
Word Cloud, 20, 131-133.

Z
z-score, 157, 158, 160, 190, 198-201, 230-232.
Contents

Chapter One: Introduction

Chapter Two: Data Collection

Chapter Three: Visual Presentation of Information

Chapter Four: Data Processing

Chapter Five: Exploratory Data Analysis

Chapter Six: Statistical Modeling with Programming Concepts

Chapter Seven: Case Studies

Chapter Eight: Data Science Relationships

Appendix: Further Reading

Index

‫‪363‬‬
This book is intended for students in general, and for graduate students and researchers in particular, in the various medical, engineering, pure-science, and humanities disciplines.

Dr. Ali Abdul Hafidh Ibrahim

Professor of Artificial Intelligence and Bioinformatics

College of Business Economics

Al-Nahrain University

‫‪364‬‬
‫علن البياًات‬
‫باستعوال لغة بايثىى هع التطبيقات‬
‫م‬
‫القدمة ‪:‬‬
‫م‬
‫السعي ‪ ،‬للحصىل على زؤي‬ ‫في عصس ثقىده ثىزة علمُة ( غير مسبىقة ) في ثىلُد البُاهات والحقدم الحكىىلىجي أصبح‬
‫مً مجمىعات البُاهات الىاسعة والعقدة ‪ ،‬زكيزة ال غنى عنها الكخساب العسفة الحدًثة ‪ .‬إذ مهد ظهىز علم البُاهات ‪-‬‬
‫كمجال دًىامُكي ومحعدد الحخصصات ‪ -‬الطسٍق لخسخير لاسالُ العحمدة على البُاهات للكفف عً لاهما‪،،‬‬
‫م‬
‫واسحخساج العلىمات ذات القُمة الكبيرة في اثخاذ قسازات مسخىيرة جسخىد إليها الىخ لاكادًمُة ‪ ،‬واملجحمعُة ‪.‬‬

‫إن هرا الكحاب ًبحث في علم البُاهات‪ ،‬موٍقدم اسحكفافات محعمقة في مىهجُاثه ‪ ،‬ومبادئه ‪ ،‬وثطبُقاثه‪ .‬وذلك مً خالل‬
‫اثباع ههج علمي صازم‪ ،‬مً خالل ثحلُل البُاهات‪ ،‬واسحعمال الحقىُات لاحصائُة وخىازشمُات الحعلم آلالي الحطىزة‬
‫لفك العالقات الخفُة‪ ،‬والحيبؤ باالثجاهات‪ ،‬وحل الفكالت العقدة‪.‬‬

‫وٍمكً القىل أن هرا الكحاب هى محاولة لجعل القساء ًجحاشون مً خالله الحضازَس العقدة في معالجة البُاهات‪،‬‬
‫والحصىز‪ ،‬والىمرحة‪ .‬التي وعحمد على الفاهُم لاساسُة ‪ ،‬وهي ‪ ( :‬السٍاضُات ‪ ،‬ولاحصاء ‪ ،‬وعلىم الكمبُىثس ) ؛ لحمكين‬
‫القساء مً اسحعمال لادوات الالشمة لعالجة مجمىعات البُاهات العقدة‪ ،‬وثحدًد مصادز الححيز‪ ،‬وضمان سالمة‬
‫الىحائج‪ .‬ومً خالل ثبني عقلُة علمُة‪ ،‬فئهىا هؤكد على إمكاهُة الحكساز ‪ ،‬وأهمُة الىهجُة الففافة في السعي لححقُق‬
‫هحائج مىثىقة جعحمد على البُاهات‪.‬‬

‫‪365‬‬
‫ْ‬
‫وقد حاءت فصىل هرا الكحاب على وفق ثصمُم دقُق لحىحُه القساء مً الفاهُم لاساسُة إلى الىهجُات الحقدمة‪،‬‬
‫والكفف عً جعقُدات ثحلُل البُاهات الاسحكفافُة‪ ،‬واخحبازالفسضُات‪ ،‬والححقق مً صحة الىماذج‪.‬‬
‫م‬
‫وال ٌسعىا في هرا القام إال أن و ْعسب عً امحىاهىا وشكسها الخالص لعلماء البُاهات ‪ ،‬ولاحصائُين ‪ ،‬والباحثين الرًً‬
‫مهد عملهم السائد الطسٍق للمىهجُات الىضحة في هرا الكحاب‪ .‬وهحً هحطلع أن ًكىن هرا الىص دلُال شامال ألولئك‬
‫الباحثين الجدد ‪ -‬في هرا املجال ‪ -‬ومصدزا قُما للممازسين مً ذوي الخبرة الرًً ٌسعىن إلى جعمُق فهمهم وصقل‬
‫مهازاتهم‪.‬‬

‫إن مً املجدي أن وفسع في السعي لفحح السؤي املخفُة داخل بحس البُاهات الهائل الري ًحُط بىا ‪ ،‬وأن وغامس بدخىل‬
‫عالم علم البُاهات معا‪ ،‬محمسكين بالفضىل فكسي والىهج العلمي‪،‬‬

‫‪366‬‬
‫ح ِم ۡن أ ۡم ِر ر ِبّی وم ۤا‬ ‫لروحِ قُ ِل ٱ ُّ‬
‫لرو ُ‬ ‫ی ۡسـَٔلُونك ع ِن ٱ ُّ‬
‫‪[ ﴾٥‬اإلسراء [‬ ‫أُوتِیتُم ِ ّمن ٱ ۡل ِع ۡل ِم ِإ اَّل ق ِلیال ۝‪٨‬‬

‫‪367‬‬
368
‫علن البياًات باستخذام لغة بايثىى هع التطبيقات‬

‫د‪ .‬علي عبذ الحافظ ابراهين‬

‫أستار الزكاء االصطٌاعي والوعلىهاتية الحيىية‬

‫كلية اقتصادصات االعوال‬

‫جاهعة الٌهريي‬

‫‪369‬‬
