Data Science Book
Al-Nahrain University
Preface
This book delves into the heart of data science, offering an in-depth exploration of its
methodologies, principles, and applications. With a rigorous and scientific approach, we embark
on a journey through the landscape of data analysis, employing robust statistical techniques and
cutting-edge machine learning algorithms to decipher hidden relationships, predict trends, and
solve intricate problems.
Through the lens of this book, readers will traverse the intricate terrain of data manipulation,
visualization, and modeling. We draw upon foundational concepts from mathematics, statistics,
and computer science to empower readers with the tools needed to wrangle complex datasets,
identify sources of bias, and ensure the integrity of results. Embracing a scientific mindset, we
emphasize reproducibility and the importance of transparent methodology in the pursuit of
credible data-driven findings.
The chapters within this book are meticulously crafted to guide readers from foundational
concepts to advanced methodologies. We unravel the intricacies of exploratory data analysis,
hypothesis testing, and model validation, while also delving into the nuances of ethical
considerations and the responsible use of data in a rapidly evolving technological landscape.
We extend our gratitude to the vast community of data scientists, statisticians, and researchers
whose groundbreaking work has paved the way for the methodologies outlined in this book. Our
aspiration is for this text to serve as both a comprehensive guide for those new to the field and a
valuable resource for experienced practitioners seeking to deepen their understanding and refine
their skills.
As we venture into the world of data science together, let us embrace curiosity, critical thinking,
and the scientific method, and let us embark on a quest to unlock the insights concealed within
the vast sea of data that surrounds us.
Ali A. Ibrahim
Prof., PhD. in Artificial Intelligence and Bioinformatics
College of Business Economics
Al-Nahrain University
Contents
Preface
Chapter 1: Introduction
Chapter 2: Data Collection
Chapter 3: Data Visualization
Chapter 4: Data Manipulation
Chapter 5: Exploratory Data Analysis
Chapter 6: Statistical Modeling with Programming Concepts
Index
CHAPTER ONE: INTRODUCTION
1.1 Introduction
Data Science is an interdisciplinary field that involves using statistical and computational
methods to extract insights and knowledge from data. It encompasses a wide range of techniques
and tools, including machine learning, data mining, statistics, and visualization. The goal of data
science is to analyze and interpret complex data sets in order to extract meaningful insights that
can inform decision-making and drive business value.
Data science is used in a variety of industries, including healthcare, finance, marketing, and
technology. It is particularly useful in areas where there are large amounts of data that need to be
analyzed in order to make informed decisions. Data scientists typically work with large data sets,
using programming languages such as Python and R to clean and analyze the data. They then use
statistical techniques and machine learning algorithms to identify patterns and relationships in
the data, and create models to predict future outcomes.
Data science has become increasingly important in recent years as more organizations have recognized the value of data-driven decision making. With the rise of big data and the growing availability of data from a variety of sources, data science now plays a central role in business strategy and decision-making.
a. Predictive modeling: A common use case for data science is predictive modeling, which
involves using historical data to build a model that can make predictions about future events.
For example, a retailer might use data science to predict which products are likely to be
popular during the upcoming holiday season, so they can stock up accordingly.
b. Customer segmentation: Data science can be used to identify groups of customers with similar
characteristics and behaviors. This can help businesses better target their marketing efforts
and tailor their products and services to the specific needs of each group.
c. Fraud detection: Data science can be used to identify patterns and anomalies in financial
transactions, which can help detect and prevent fraud.
d. Personalized recommendations: Many online retailers and streaming services use data science
to make personalized recommendations to their users based on their past behavior and
preferences.
e. Medical research: Data science is increasingly being used in medical research to analyze large
data sets of patient information in order to identify new treatments and improve patient
outcomes.
Here are some examples of different types of data:
a. Text Data: Text data refers to any data that is in written or textual form. Examples of text data
include emails, social media posts, customer reviews, news articles, and chat logs.
b. Numeric Data: Numeric data refers to data that consists of numbers. Examples of numeric data
include stock prices, sensor readings, temperature readings, and financial data.
c. Categorical Data: Categorical data refers to data that is divided into categories or groups.
Examples of categorical data include gender, age group, educational level, and job title.
d. Image Data: Image data refers to any data that is in the form of images or photographs.
Examples of image data include medical images, satellite images, and photographs.
e. Audio Data: Audio data refers to any data that is in the form of sound recordings. Examples of
audio data include music recordings, podcast episodes, and phone call recordings.
f. Video Data: Video data refers to any data that is in the form of videos or motion pictures.
Examples of video data include movies, TV shows, and security camera footage.
g. Geospatial Data: Geospatial data refers to data that is related to geographic locations.
Examples of geospatial data include maps, GPS coordinates, and weather data.
h. Time Series Data: Time series data refers to any data that is recorded over a period of time.
Examples of time series data include stock prices, weather data, and website traffic data.
See Examples 1.1 to 1.3, which include Figures 1.1 to 1.6.
Fig. 1.1: Create a data file as a list of numbers and save it using Python code.
Example 1.2: Read a file
Fig. 1.2: Read the data (text) file and print the output result using Python code.
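The code for these figures appears only as images in the original; a minimal sketch in the spirit of Figures 1.1 and 1.2 might look like the following, where the file name numbers.txt and the values are arbitrary:

numbers = [12, 7, 25, 3, 18]

# Save the list to a text file, one number per line (cf. Fig. 1.1).
with open("numbers.txt", "w") as f:
    for n in numbers:
        f.write(f"{n}\n")

# Read the file back and print the result (cf. Fig. 1.2).
with open("numbers.txt") as f:
    loaded = [int(line) for line in f]
print(loaded)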
Fig. 1.4: Create a categorical (text) data file using Python code.
Example 1.4: Reading a file.
Fig. 1.5: Read the data.txt file and print the output result using Python code.
Fig. 1.8: Output of the text-data Python code.
Fig. 1.11: Output plot for time-series data using Python code.
CHAPTER TWO: Data
Data Collection
2.1 Data Collection
Data collection is the process of gathering and acquiring data from various sources for analysis
and interpretation. The quality and accuracy of data collected are critical in ensuring that the
resulting analysis and insights are reliable and meaningful. In data science, data collection
involves identifying the relevant data sources, selecting appropriate data collection methods, and
ensuring that the data is clean and well-organized.
a. Surveys: Surveys are a popular method of data collection and involve gathering
information from individuals through questionnaires or interviews. Surveys can be
conducted online, over the phone, or in person (see Figures 2.1-2.5).
b. Experiments: Experiments involve manipulating variables to study the effects on an
outcome of interest. Experiments can be conducted in controlled laboratory settings or in
the field (see Figure 2.6).
c. Observational studies: Observational studies involve observing and recording data
without manipulating variables. Observational studies can be conducted in natural
settings or in controlled laboratory settings (see Figures 2.7-2.8).
d. Web scraping: Web scraping involves extracting data from websites using automated
tools. Web scraping is a useful method for collecting large amounts of data from online
sources.
e. Social media monitoring: Social media monitoring involves analyzing social media
platforms to gather information about trends, sentiments, and opinions. Social media
monitoring is useful for understanding public opinion.
f. Sensor data collection: Sensor data collection involves collecting data from sensors such
as GPS, accelerometers, and temperature sensors. Sensor data collection is useful for
monitoring physical environments and behaviour.
The goal of data collection is to gather high-quality data that is relevant to the research question
and analysis. Data scientists need to carefully consider the data sources, data collection methods,
and data quality when collecting data to ensure that the analysis and insights derived from the
data are reliable and meaningful. Data collection is an ongoing process, and data scientists need
to continually monitor and update their data sources and collection methods to ensure that the
data remains relevant and accurate.
The following Examples 2.1 to 2.13, together with Figures 2.1 to 2.15, present Python code for the first data collection method, the survey, as follows:
Example 2.1: Collect data
Fig. 2.1: Python code that asks four questions (name, age, country, and favorite color) in its first part and, in its second part, asks how many persons are taking the survey.
Fig. 2.2: A loop that asks the questions and collects the answers (responses) from each respondent.
Example 2.3: Save the data to a file
Fig. 2.3: A save_responses function that saves the answers to the questions, written in Python.
Fig. 2.4: Outputs from running the Python code: (a) the name of the CSV data file (survey_responses.csv), (b) the console output of the code, and (c) the contents of the output file (survey_responses.csv).
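Since the survey code itself is shown only in the figures, here is a minimal sketch of the idea behind Examples 2.1 to 2.3: a question loop plus a save_responses helper that writes survey_responses.csv. The exact wording of the questions is an assumption:

import csv

QUESTIONS = ["What is your name?", "How old are you?",
             "Which country are you from?", "What is your favorite color?"]

def save_responses(rows, filename="survey_responses.csv"):
    # Write one CSV row per respondent (cf. Fig. 2.3).
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age", "country", "color"])
        writer.writerows(rows)

n = int(input("How many persons will take this survey? "))
responses = []
for i in range(n):
    print(f"\nRespondent {i + 1}:")
    responses.append([input(q + " ") for q in QUESTIONS])

save_responses(responses)
print("Saved", len(responses), "responses to survey_responses.csv")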
Fig. 2.5: Python code that generates and prints the current temperature.
Example 2.5: Generate data
In Python, the random module provides functions for generating random data. Here are some key
concepts and functions related to generating random data:
a. Randomness and Seed: Randomness refers to the lack of predictability in generated values.
The random module uses a pseudorandom number generator (PRNG), which is an algorithm that
produces a sequence of numbers that appears random but is actually deterministic. By default,
the PRNG is initialized based on the current system time, so running the program multiple times
produces different results. However, you can set a specific seed value using the random.seed()
function to obtain the same sequence of random values each time the program is executed.
b. Uniform Distribution: The random module provides functions for generating random
numbers that follow a uniform distribution. In a uniform distribution, all values in the range have
an equal probability of occurring. For example:
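For instance, a short sketch of seeding the generator and drawing uniform values (the seed value 42 is arbitrary):

import random

random.seed(42)              # fix the seed so results are reproducible
print(random.random())       # uniform float in [0.0, 1.0)
print(random.uniform(1, 5))  # uniform float between 1.0 and 5.0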
c. Integer Generation: To generate random integers, you can use the following functions:
- random.randrange(start, stop, step) returns a random integer from the range start (inclusive) to
stop (exclusive) with the specified step.
e. Random Choices: The random.choices() function randomly selects elements from a given sequence with replacement (i.e., the same element can be chosen multiple times). For sampling without replacement (i.e., each element is chosen at most once), use random.sample().
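A compact sketch of the integer-generation and selection functions described above (the example list of colors is arbitrary):

import random

random.seed(0)
print(random.randrange(1, 10, 2))   # an odd integer from 1, 3, 5, 7, 9
print(random.randint(1, 10))        # an integer from 1 to 10, inclusive

colors = ["red", "green", "blue", "yellow"]
print(random.choices(colors, k=3))  # with replacement: repeats possible
print(random.sample(colors, k=3))   # without replacement: all distinct
random.shuffle(colors)              # shuffle the list in place
print(colors)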
These are just a few of the functions available in the random module for generating random data.
Depending on your specific needs, you can explore additional functions and techniques for
generating random values, such as random choices from custom distributions, permutations, or
combinations.
The following examples demonstrate generating random data based on randomness and seeds in Python; see Figures 2.9 to 2.16:
Example 2.6: Generate integer numbers
Fig. 2.8: Generate 5 random integers in the range between 1 and 10, with output, using Python code.
Fig. 2.9: Generate 5 random float numbers in the range between 1 and 5, with output, using Python code.
Example 2.8: Choosing elements
Example 2.9: Input numbers
Fig. 2.11: Shuffling the input list, with the output result, using Python code.
Example 2.10: Samples
Fig. 2.13: Generating random data with randomness and a seed, with output results, using Python code.
Example 2.12: Generate data
Example 2.13: Generate data
CHAPTER THREE: Data Visualization
Line Chart
Bar Chart
Pie Chart
Scatter Plot
Histogram
Heat Maps
Treemap
Bubble
Choropleth Map
Sankey
Box Plot
Radar Charts
Network Diagram
Word Cloud
Streamgraphs
3D
Gantt Charts
Data visualization
Data visualization is a rich and multifaceted field with deep theoretical underpinnings. At its core, data
visualization is about representing data visually to facilitate understanding, exploration, and
communication of information. Here are some key theoretical concepts and principles in data
visualization:
The Visual Encoding Framework: This foundational concept proposes that data attributes should
be mapped to visual properties in a way that exploits human perception effectively. Common
visual properties include position, length, angle, color, shape, and size. For example, using
position along a common scale for two data attributes allows for easy comparison.
Pre-attentive Processing: The theory of pre-attentive processing suggests that certain visual
attributes, like color or shape, can be quickly and accurately perceived by the human brain
without conscious attention. Effective data visualization leverages these attributes to highlight
important information and make patterns easily discernible.
Gestalt Principles: These principles, such as proximity, similarity, continuity, and closure, explain
how humans naturally group and perceive visual elements. Understanding Gestalt principles
helps in designing visualizations that encourage viewers to see patterns and relationships in
data.
Exploratory vs. Explanatory Visualizations: Data visualizations can serve different purposes.
Exploratory visualizations are created during the data analysis process to help researchers
understand the data themselves. Explanatory visualizations are designed for a broader audience
to communicate insights clearly and persuasively.
Visualization Taxonomies: Various taxonomies categorize different types of visualizations based
on their purposes and characteristics. For example, there are hierarchical, network, time-series,
and spatial visualizations, among others. Understanding these categories can help choose the
most appropriate visualization for a specific dataset and objective.
Ethical Considerations: As data visualization can influence perception and decision-making, it's
important to consider ethical implications. This includes issues related to bias,
misrepresentation, and privacy.
Interactivity and User Experience (UX): Interactive visualizations allow users to explore data
actively. Understanding principles of UX design, such as responsiveness, user feedback, and
usability, is crucial for creating engaging and effective data visualizations.
Data Semiotics: This emerging field explores the semiotic aspects of data visualization,
considering how symbols and signs in visualizations convey meaning. It delves into the cultural,
social, and cognitive aspects of data representation.
These theoretical foundations, among others, provide a solid framework for creating meaningful and
impactful data visualizations. Effective data visualization combines both art and science, utilizing these
theories to communicate complex information clearly and persuasively.
Data visualization encompasses a wide range of types and techniques, each suited to different purposes
and data characteristics. Here are some common types of data visualizations:
3.1 Line Charts: Line charts display data as a series of data points connected by lines. They are
excellent for showing trends and changes in data over time.
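Since the chart code in this chapter appears only in the figures, here is a minimal matplotlib sketch of a basic line chart, with arbitrary sample data:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]

plt.plot(months, sales, marker="o")   # line with 'o' markers at each point
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.grid(True)
plt.show()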
Fig.3.2: Output Line Chart from python code.
23
Example 3.2: Create a line Chart with two different lines
24
Fig. 3.4: Output of Multiple Line Chart.
25
Fig. 3.6: Time series line chart using Python code.
26
Fig. 3.8: Create simple Line Chart using Python code.
Example 3.6: Create Line Chart with Markers
Fig. 3.10: Line Chart with 'o' markers using Python code.
Fig. 3.11: Line Chart with 'o' markers plotted using Python code.
28
Example 3.7: Create Line Chart with different colors.
Fig. 3.12 : Create Line Chart with different colors using Python code.
Fig. 3.13: Create Line Chart with different colors using Python code.
29
Example 3.8: Create Line Chart with Different Line Styles
Fig. 3.14: Line Chart with Different Line Styles using Python code.
Fig. 3.15: Line Chart with Different Line Styles using Python code.
30
Example 3.9: Create Line Chart with a Grid
31
Example 3.10: Create Line Chart
Fig. 3.21: Step Line Chart using Python code.
33
Fig. 3.23: Stacked Area Chart using Python code.
34
Fig. 3.25: Logarithmic Scale Line Chart using Python code.
Fig. 3.26: Line Chart with Date on x-axis using Python code.
35
Fig. 3.27: Line Chart with Date on x-axis using Python code.
36
Fig. 3.29: Line Chart with Annotations using Python Code.
37
Fig. 3.31: Dual Y-Axis Line Chart using Python code.
Fig. 3.32: Line Chart with Shaded Region using python code.
38
Fig. 3.33: Line Chart with Shaded Region using python code.
3.2 Bar Charts: Bar charts represent data using rectangular bars of varying lengths or heights. They are effective for comparing values across categories or showing changes over time.
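A minimal matplotlib sketch of a bar chart, with arbitrary sample data:

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 36]

plt.bar(categories, values, color="steelblue")  # one bar per category
plt.title("Values by Category")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()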
Fig. 3.35: Output Bar Chart using Python code.
40
Fig. 3.37: Output Grouped Bar chart using Python code.
41
42
Fig. 3.38: Different Bar Charts examples using Python code.
43
Fig. 3.39: Basic Bar Chart using Python code.
44
Fig. 3.41: Grouped Bar Chart using Python code.
45
Fig. 3.43: Bar Chart with Error Bars using Python code.
Fig. 3.44: Bar Chart with Custom Colors using Python code.
46
Fig. 3.45: Bar Chart with Data Labels using Python code.
47
Fig. 3.47: Bar Chart with Logarithmic Scale using Python code.
48
Fig. 3.49: Horizontal Stacked Bar Chart using Python code.
3.3 Pie Charts: Pie charts represent data as a circle divided into segments, with each segment representing a proportion of the whole. They are useful for showing parts of a whole but can be less effective for precise comparisons.
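A minimal matplotlib sketch of a pie chart, with arbitrary sample data:

import matplotlib.pyplot as plt

labels = ["Rent", "Food", "Transport", "Other"]
shares = [40, 30, 15, 15]

plt.pie(shares, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Monthly Budget")
plt.axis("equal")   # keep the pie circular
plt.show()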
Fig. 3.51: Output Pie Chart from using Python code.
50
Fig. 3.53: Output of exploded Pie Chart from Figure 3.
51
Fig. 3.54: Different types for Pie Charts using Python code.
52
Fig. 3.55: Basic Pie Charts using Python code.
53
Fig. 3.57: Donut Pie Charts using Python code.
Fig. 3.58: Pie Charts with Custom Colors using Python code.
54
Fig. 3.59: Pie Charts with Shadow Effect using Python code.
Fig. 3.60: Pie Charts with Percentage Labels using Python code.
55
Fig. 3.61: Pie Charts with a Single Exploded using Python code.
Fig. 3.62: Pie Charts with Custom Start Angle using Python code.
56
Fig. 3.63: Pie Charts with Custom Labels Removed using Python code.
57
Fig. 3.65: Nested Pie Chart using Python code.
58
Example 3.24: Create different types of 3D-like Pie Charts
Fig. 3.66: Different types of 3D-like Pie Charts using Python code.
59
Fig. 3.67: 3D-like Pie Chart with Beveled Edge.
60
Fig. 3.69: 3D-like Donut Pie Chart.
61
Fig. 3.71: 3D-like Pie Chart with Rotated Angle.
3.4 Scatter Plots: Scatter plots display individual data points on a two-dimensional grid, with one variable on each axis. They are great for showing relationships and correlations between two variables.
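A minimal matplotlib sketch of a scatter plot, using synthetic data to suggest a positive correlation:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, 50)                   # hours studied
scores = 50 + 5 * hours + rng.normal(0, 5, 50)   # exam score with noise

plt.scatter(hours, scores, alpha=0.7)
plt.title("Study Hours vs. Exam Score")
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.show()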
Fig. 3.73: Output of Bubble Scatter Plot from Figure 1.
Fig. 3.74: Create Scatter Plot with color Mapping and size Variation using Python code.
63
Fig. 3.75: Output of Scatter Plot with color Mapping and size Variation using Python code.
64
Example 3.27: Create Different types Scatter Plots.
65
66
Fig.3.76: Different types for Scatter Plots using Python Code.
67
Fig. 3.77: Basic Scatter Plot using Python.
Fig. 3.78: Scatter Plot with colors and sizes using Python.
68
Fig. 3.79: Scatter Plot with Labels using Python.
69
Fig. 3.81: Scatter Plot with Custom Marks using Python.
70
Fig. 3.83: Scatter Plot with Varying Transparency using Python.
71
Fig. 3.85: Scatter Plot with Categorical Labels using Python.
72
Fig. 3.87: Scatter Plot with Trendline and Confidence Interval using Python.
3.5 Histograms: Histograms are used to represent the distribution of a single numeric variable. They group data into bins and display the frequency of data points in each bin.
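A minimal matplotlib sketch of a histogram, using synthetic normally distributed data:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(loc=70, scale=10, size=500)

plt.hist(data, bins=20, color="gray", edgecolor="black")
plt.title("Distribution of Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()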
Example 3.28: Create Histogram
74
Fig. 3.88: Histogram Chart using Python code.
75
Fig. 3.89: Basic Histogram using Python code.
Fig. 3.90: Histogram with Custom Bin Edges using Python code.
76
Fig. 3.91: Histogram with Normalized Counts using Python code.
77
Fig. 3.93: Histogram with Different Color using Python Code.
78
Fig. 3.95: Histogram with Log Scale using Python Code.
79
Fig. 3.97: Histogram with Specified Range using Python Code.
80
Fig. 3.99: Histogram with Stacked Bins using Python Code.
3.6 Heatmaps: Heatmaps use colors to represent data values in a two-dimensional matrix or grid. They are often used to visualize patterns and relationships in large datasets.
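A minimal sketch of a heatmap; matplotlib's imshow is used here, although the figures in this section may have been produced with seaborn. The matrix values are random:

import matplotlib.pyplot as plt
import numpy as np

matrix = np.random.default_rng(3).random((6, 8))   # a 6x8 grid of values

plt.imshow(matrix, cmap="viridis", aspect="auto")
plt.colorbar(label="Value")
plt.title("Heatmap of a 6x8 Matrix")
plt.show()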
Example 3.29: Create Heatmap
82
Example 3.30: Create Heatmap
83
Example 3.31: Create Heatmap
84
Fig. 3.104: Heatmap using Python Code.
85
Fig. 3.106: Heatmap with Custom Colors using Python Code.
86
Fig. 3.108: Heatmap with Hierarchical Clustering using Python Code.
87
Fig. 3.110: Heatmap with Labels Python Code.
88
Fig. 3.112: Discrete Heatmap using Python Code.
Fig. 3.113: Heatmap with Horizontal Color Bar using Python Code.
89
Fig. 3.114: Square Heatmap using Python Code.
3.7 Treemaps:
Treemaps display hierarchical data structures as nested rectangles. They are useful for showing
the hierarchical composition of data.
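A minimal sketch of a treemap; plotly.express is assumed here because later figures mention hover information, sunburst variants, and zooming, but other libraries (for example squarify) would also work. The hierarchy and values are invented:

import plotly.express as px

names   = ["Company", "Sales", "IT", "Online", "Retail", "Support", "Dev"]
parents = ["", "Company", "Company", "Sales", "Sales", "IT", "IT"]
values  = [0, 0, 0, 40, 25, 15, 20]   # leaf sizes; internal nodes sum up

fig = px.treemap(names=names, parents=parents, values=values)
fig.show()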
Example 3.32: Create Treemap
91
Fig. 3.116: TreeMap using Python Code.
Fig. 3.117: Basic TreeMap with one level using Python Code.
Fig. 3.118: Treemap with Multiple levels and color mapping using Python Code.
Fig. 3.119: Customized Treemap with hover Info using Python Code.
92
Fig. 3.120: Sunburst Treemap using Python Code.
Fig. 3.122: Treemap with Custom Color Mapping with Python Code.
Fig. 3.123: Treemap with Custom Hierarchy and Color Mapping with Python Code.
93
Fig. 3.124: Treemap using Python Code.
Fig. 3.125: Treemap with Custom template and title using Python Code.
Fig. 3.126: Treemap with zoom and Pan using Python Code.
94
Example 3.33: Create Bubble Chart
95
Fig. 3.127: Bubble Charts using Python Code.
96
Fig. 3.128: Basic Bubble Chart using Python Code.
97
Fig. 3.130: Bubble Chart with Labels using Python Code.
Fig. 3.131: Bubble Chart with Custom Marker Styles using Python Code.
98
Fig. 3.132: Bubble Chart with Transparency using Python Code.
Fig. 3.133: Bubble Chart with Log Scale using Python Code.
99
Fig. 3.134: Bubble Chart with Size Scaling using Python Code.
100
Fig. 3.136: Bubble Chart with Seaborn using python Code.
Fig. 3.137: Bubble Chart with Categorical Colors using Python Code.
101
Fig. 3.138: Bubble Chart with Trendline using Python Code.
102
Example 3.35: Create Choropleth Map
103
Example 3.37: Create Choropleth Map
104
Fig. 3.144: Basic Sankey Diagram using Python Code.
Fig. 3.145: Sankey Diagram with Custom Colors using Python Code.
Fig. 3.146: Sankey Diagram with Custom Colors using Python Code.
105
Example 3.40: Create Sankey Diagram
Fig. 3.149: Horizontal Sankey Diagram with Padding using Python Code.
106
Fig. 3.150: Horizontal Sankey Diagram with Padding using Python Code.
107
Example 3.43: Create Sankey Diagram
108
Example 3.44: Create Sankey Diagram
Fig. 3.157: Sankey Diagram with Multiple Flows using Python Code.
109
Fig. 3.158: Sankey Diagram with Multiple Flows using Python Code.
110
Fig. 3.159: Different types of Box-plot using Python Code.
Fig. 3.160: Simple Box Plot of a single dataset using Python Code.
111
Fig.3.161: Box Plot for multiple datasets using Python Code.
112
Fig.3.163: Notched Box Plot using Python Code.
113
Fig. 3.165: Box Plot with Outliers using Python Code.
Fig. 3.166: Box Plot with horizontal whiskers using Python Code.
114
Fig. 3.167: Grouped Box Plots using Python Code.
Fig. 3.168: Box Plot with notches and Custom Whisker Caps using Python Code.
115
Fig. 3.169: Box Plot with horizontal Median Line using Python Code.
Fig. 3.170: Box Plot with Custom x-axis labels using Python Code.
116
Example 3.47: Create Parallel Coordinates Plots
117
Fig. 3.171: Different types for Parallel Coordinates Plot using Python Code.
118
Fig. 3.173: Parallel Coordinates Plot using Python Code.
119
Fig. 3.175: Parallel Coordinates Plot using Python Code.
120
Example 3.48: Create a Radar Chart
121
Fig. 3.179: Radar Chart with Different Categories using Python Code.
122
Example 3.49: Create Network Diagram
123
Fig. 3.180: Different types for Network Diagrams using Python Code.
124
Fig. 3.181: Network Diagram Python code Output.
125
Fig. 3.183: Network Diagram Python code Output.
126
Fig. 3.185: Network Diagram Python code Output.
127
Fig. 3.186: Network Diagram Python code Output.
128
Fig. 3.188: Network Diagram Python code Output.
129
Fig. 3.190: Network Diagram Python code Output.
130
3.15 Word Clouds:
Word clouds represent the frequency of words or terms in a text by varying the size and color of
the words. They are often used for text analysis and summarization.
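A minimal sketch using the third-party wordcloud package (the sample text is arbitrary):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = ("data science data analysis machine learning statistics "
        "python data visualization model data")

wc = WordCloud(width=600, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")   # more frequent words appear larger
plt.axis("off")
plt.show()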
Example 3.51: Create Word Cloud
132
Fig. 3.197: Word Frequency Word Cloud using Python Code.
133
3.16 Streamgraphs:
Streamgraphs display time-series data as stacked areas, allowing you to see how different
categories contribute to a whole over time.
134
Fig. 3.203: Custom Color Streamgraph using Python Code.
135
Example 3.57: Create Streamgraphs
Fig. 3.208: Streamgraph with Custom Labels and Tooltip using Python Code.
136
Fig. 3.209: Streamgraph with Custom Labels and Tooltip using Python Code.
3.17 3D Visualizations:
These visualizations add a third dimension (depth) to the data representation, making them
suitable for complex spatial data or volumetric data.
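A minimal matplotlib sketch of a 3D scatter plot with random data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
x, y, z = rng.random(30), rng.random(30), rng.random(30)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # create 3D axes
ax.scatter(x, y, z, c=z, cmap="viridis")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
plt.show()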
Fig. 3.211: 3D Scatter Plot using Python Code.
138
Fig. 3.213: 3D Surface Plot Using Python Code.
139
Fig. 3.215: 3D Line Plot using Python Code.
140
Fig. 3.217: 3D Bar Plot Using Python Code.
141
Fig. 3.219: 3D Contour Plot Using Python Code.
142
Fig. 3.221: Basic Gantt Chart Using Python Code.
143
Fig. 3.223: Gantt Chart with Dependencies using Python Code.
144
CHAPTER FOUR: Data Manipulation
DataFrame
Data Preprocessing
Data Cleaning
Data manipulation
Data manipulation is the process of organizing, arranging, and transforming data in order to
make it more useful and informative. It is a fundamental step in data analysis, data mining, and
machine learning. This chapter covers the following topics:
DataFrame
Data Preprocessing
Data Cleaning
Checking for Consistency
4.1 DataFrame
In Python, a data frame is a two-dimensional array-like data structure provided by the pandas
library.
The pandas library provides two primary data structures: DataFrame and Series. A DataFrame is a table-like structure consisting of rows and columns, where each column can have a different data type; a Series is a single labeled column.
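A minimal sketch of creating a DataFrame (the column names and values are arbitrary):

import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [34, 29, 41],
    "city": ["Baghdad", "Basra", "Mosul"],
})
print(df)            # the full table
print(df["age"])     # a single column is a pandas Series
print(df.dtypes)     # each column keeps its own data type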
Example 4.1: Create DataFrame.
Fig. 4.1: Creating a DataFrame, with the output result, using Python code for Example 4.1.
Fig. 4.3: Before running the Python code that creates the files.
Fig. 4.5: After running the Python code.
Example 4.3: Reading a data file.
Example 4.4: Reading a data file
4.2 Data Preprocessing
Data preprocessing is the process of preparing data for analysis by cleaning, transforming, and
organizing it into a format that is suitable for analysis. The quality and accuracy of data are
critical to the success of any data science project, and data preprocessing is an essential step in
ensuring that the data is clean, complete, and consistent.
a. Data cleaning: Data cleaning involves identifying and correcting errors and inconsistencies in
the data. This includes handling missing values, dealing with outliers, and correcting errors in the
data.
b. Data transformation: Data transformation involves converting the data into a format that is
suitable for analysis. This includes scaling, normalization, and encoding categorical variables.
c. Feature engineering: Feature engineering involves creating new features from existing data to
improve the performance of machine learning models. This includes creating new variables
based on existing ones or combining multiple variables to create new ones.
d. Data reduction: Data reduction involves reducing the size of the data by eliminating redundant
or irrelevant features. This includes feature selection or dimensionality reduction techniques such
as PCA (Principal Component Analysis).
e. Data integration: Data integration involves combining data from multiple sources to create a
unified dataset. This includes merging or joining datasets that have a common variable.
f. Data splitting: Data splitting involves dividing the dataset into training, validation, and testing
sets to evaluate the performance of machine learning models.
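As a brief illustration of transformation (b) and splitting (f), here is a hedged sketch using scikit-learn; the tiny dataset is invented purely for demonstration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 250.0],
              [4.0, 400.0], [5.0, 350.0], [6.0, 500.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split into training and test sets (a validation split is omitted for brevity).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0))   # approximately zero for each feature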
The goal of data preprocessing is to prepare the data for analysis by cleaning, transforming, and
organizing it into a format that is suitable for analysis. By using these techniques, data scientists
can ensure that the data is accurate, complete, and consistent, and that it is in a format that is
suitable for analysis. Data preprocessing is an iterative process, and data scientists may need to
revisit and revise their preprocessing steps as they gain more insights from the data.
4.3 Data Cleaning
Data cleaning refers to the process of identifying and correcting or removing errors,
inconsistencies, and inaccuracies in a dataset to improve its quality and usefulness for analysis.
It is an essential step in data preprocessing before performing any data analysis or modeling.
Now, let's see how to perform data cleaning in Python using the pandas library; see the following Examples 4.5 to 4.8, including Figures 4.9 to 4.18.
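Before turning to the examples, here is a minimal pandas sketch of common cleaning steps (duplicates and missing values); the small dataset is invented for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Sara", "Omar", None],
    "age": [34, 29, 29, np.nan, 25],
})

df = df.drop_duplicates()                          # remove duplicated rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages
df = df.dropna(subset=["name"])                    # drop rows with no name
print(df)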
Example 4.5: Create a data file
Fig. 4.9: Output DataFrame with duplicated rows 1 and 4, using Python code.
Example 4.6: Read and save a data file.
Fig. 4.13: Defining the number of rows and columns.
Example 4.8: Create a data file with missing values and save it.
4.3.1 Outliers
Outliers, in the realm of statistics and data analysis, refer to data points that deviate significantly
from the overall pattern or distribution of a dataset. These observations lie at an abnormal
distance from other data points, making them stand out distinctly. The identification and
understanding of outliers are crucial in various analytical processes as they can affect the
accuracy and validity of statistical models and interpretations. Outliers are typically identified by
employing various mathematical and statistical techniques, such as the use of z-scores, box plots,
or the calculation of interquartile range (IQR). By quantifying the degree of deviation from the
norm, outliers can be objectively recognized and analyzed.
Outliers play a fundamental role in several areas where data analysis is applied. One key
application is in anomaly detection, which involves identifying abnormal or unexpected
observations. By considering outliers as potential anomalies, statistical models and algorithms
can be designed to automatically detect and flag unusual patterns or outliers in large datasets.
This has numerous practical applications, such as fraud detection in financial transactions,
network intrusion detection in cyber security, or identifying potential outliers in medical data that
might indicate unusual health conditions.
Moreover, outliers can also impact the accuracy and reliability of statistical models and
predictions
Handling outliers is an important aspect of data analysis and statistical modeling. Outliers are
data points that significantly deviate from the rest of the data, and they can have a significant
impact on the analysis results. Outliers can occur due to various reasons such as measurement
errors, data entry errors, or genuine extreme observations. Dealing with outliers requires careful
consideration to ensure that they do not unduly influence the analysis outcome. Here's an
overview of the theory and some examples of how to handle outliers using Python:
The first step in handling outliers is to identify them in the dataset. This can be done through
graphical methods (e.g., box plots, scatter plots) or statistical methods (e.g., z-score, modified z-
score, Tukey's fences).
It's important to understand whether outliers are genuine extreme observations or if they are the
result of errors. This understanding helps in deciding how to handle them appropriately.
Decide on the approach:
Depending on the nature of the outliers and the specific analysis goals, there are several
approaches to handle outliers:
a. Removal: If the outliers are due to errors or have a significant impact on the analysis,
they can be removed from the dataset. However, this should be done cautiously, as
removing outliers may affect the representativeness of the data.
b. Transformation: Applying transformations (e.g., logarithmic, square root, or reciprocal
transformations) to the data can sometimes help in reducing the impact of outliers.
c. Winsorization: Winsorization replaces the extreme values with a less extreme but still
plausible value. For example, the outliers can be replaced with the nearest data values
within a certain percentile range.
d. Binning: Binning involves grouping data into bins or intervals and replacing outlier
values with the bin boundaries or central tendency measures.
e. Robust methods: Robust statistics, such as the median and the interquartile range, are less influenced by outliers than their traditional counterparts (e.g., the mean and standard deviation).
I. Z-score method: The z-score method is a statistical technique used to identify outlier
values within a dataset. It calculates the deviation of a data point from the mean of the
dataset in terms of standard deviations. The z-score is computed by subtracting the mean
from the data point and dividing the result by the standard deviation. This standardized
value provides a measure of how many standard deviations a data point is away from the
mean. By comparing the z-score of a data point to a threshold value, outliers can be
identified (see Figure 4.19).
This method calculates the z-score of each data point, which represents the number of
standard deviations a data point is from the mean of the dataset. Any data point with a z-
score greater than a certain threshold (usually 2 or 3) is considered an outlier.
Overall, the z-score method serves as a valuable tool for outlier detection in diverse
domains, facilitating decision-making processes and ensuring data accuracy.
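A minimal sketch of the z-score method just described, using an invented dataset in which 95 is an obvious outlier and a threshold of 2:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 95])   # 95 is an obvious outlier
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]                # threshold of 2 std devs
print(outliers)                                      # -> [95]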
II. Interquartile range (IQR) method: The interquartile range (IQR) is a measure of
variability that is used to describe the spread of a dataset. It is calculated as the difference
between the upper quartile (Q3) and the lower quartile (Q1) of the data.
This method calculates the IQR of the dataset, which is the range between the first
quartile (25th percentile) and the third quartile (75th percentile). Any data point outside
the range of 1.5 times the IQR below the first quartile or above the third quartile is
considered an outlier.
To calculate the IQR, you first need to find the median of the dataset. Then, you split the
dataset into two halves: the lower half (values below the median) and the upper half
(values above the median).
Next, you find the median of each half. The lower median is called the first quartile (Q1),
and the upper median is called the third quartile (Q3). The IQR is then calculated as the
difference between Q3 and Q1:
IQR = Q3 - Q1
The IQR is a useful measure of variability because it is less sensitive to extreme values or
outliers than other measures such as the range or standard deviation. It can be used to
identify potential outliers in a dataset or to compare the variability of different datasets.
To identify potential outliers using the IQR method, you can use the following rule: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as a potential outlier.
Overall, the IQR method is a simple yet effective way to summarize the spread of a dataset and identify potential outliers.
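A minimal sketch of the IQR rule, using the same invented dataset as the z-score sketch above:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 95])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # -> [95]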
III. Visualization method: This method involves plotting the data points and visually inspecting the plot for any points that are significantly different from the rest. Box plots, scatter plots, and histograms are commonly used for this method.
This method serves as an intuitive and effective tool in outlier detection, contributing to improved decision-making and problem-solving in diverse domains; see Examples 4.9 to 4.16, including Figures 4.19 to 4.36.
Example 4.9: Create a dataset with outliers
Fig. 4.19: Determining the outlier value by z-score using Python code.
Fig. 4.20: IQR and outliers using Python code for Example 4.9.
Fig. 4.21: Statistical output and plot with IQR and outliers using Python code.
Example 4.10: Create a dataset without outliers.
Fig. 4.22: IQR without outliers using Python code for Example 4.10.
Fig. 4.23: Statistical output and plot with IQR and no outliers using Python code for Example 4.10.
Example 4.11: Visualization method for plotting outliers
Fig. 4.24: Plotting and determining the outlier value using a box plot in Python.
Example 4.12: Create a dataset with two outliers.
Fig. 4.25: Plotting and determining the outlier values using a box plot in Python.
Example 4.13: Standardized data
Example 4.15: Standardized data
Fig. 4.31: Creating and storing a LabelEncoder under the name "le".
Fig. 4.32: Creating the color data and storing it under the name "data".
4.4 Checking for Consistency
Checking for consistency in data is an essential task in data management and analysis.
It involves ensuring that the data is accurate, valid, and coherent, both within individual data sets
and across multiple data sources.
The goal is to identify and resolve any discrepancies, errors, or anomalies that may exist in the
data.
Here's a detailed overview of the theory behind checking for consistency in data:
4.4.1 Data Integrity:
Data integrity refers to the accuracy, completeness, and reliability of data. It ensures that the data
is not corrupted, modified, or tampered with in any unauthorized manner. Various techniques can
be used to ensure data integrity, such as checksums, hash functions, and error detection codes.
4.4.2 Validation Rules:
Validation rules are predefined criteria or constraints that determine the acceptable values and
formats for data. These rules help ensure that data entered into a system meets the specified
criteria. Common validation rules include data type checks (e.g., numeric, alphanumeric), range
checks, format checks (e.g., email addresses, phone numbers), and referential integrity checks
(ensuring data consistency across related tables).
4.4.3 Cross-Field Consistency:
Cross-field consistency involves checking the relationships and dependencies between different
fields within a data set. It ensures that the values in one field are consistent with the values in
related fields. For example, in a customer database, the customer's age should match their birth
date.
4.4.4 Cross-Table Consistency:
Cross-table consistency focuses on checking the consistency of data across multiple tables or
data sources. It ensures that the relationships and references between tables are maintained
correctly. For instance, in a relational database, foreign key constraints are used to enforce
referential integrity between related tables.
4.4.5 Data Profiling:
Data profiling involves analyzing and understanding the structure, content, and quality of data. It
helps identify inconsistencies, duplicates, missing values, outliers, and other data issues. Data
profiling techniques include statistical analysis, pattern recognition, and data visualization.
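A minimal pandas sketch of simple data profiling checks on an invented dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, 29, 29, np.nan, 120],        # 120 looks suspicious
    "city": ["Baghdad", "Basra", "Basra", "Mosul", "Mosul"],
})

print(df.describe())              # summary statistics for numeric columns
print(df.isnull().sum())          # missing values per column
print(df.duplicated().sum())      # number of fully duplicated rows
print(df["city"].value_counts())  # frequency of each category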
4.4.6. Data Cleansing:
Data cleansing, or data scrubbing, is the process of identifying and correcting or removing
inconsistencies, errors, and inaccuracies in the data. This may involve tasks like removing
duplicate records, filling in missing values, correcting formatting issues, and resolving conflicts
in data from different sources.
4.4.7 Error Handling:
When inconsistencies or errors are detected, it's crucial to have proper error handling
mechanisms in place. This includes logging and reporting errors, capturing details about the
nature of the inconsistency, and providing notifications to appropriate stakeholders for further
investigation and resolution.
4.4.8 Data Governance:
Data governance encompasses the policies, processes, and controls put in place to ensure data
quality, consistency, and reliability across an organization. It involves defining data standards,
roles and responsibilities, data management procedures, and enforcing data quality measures.
By applying these principles and techniques, organizations can establish robust mechanisms to check for consistency in their data, ensuring its accuracy, reliability, and usefulness for various applications and decision-making processes; see the following Examples 4.17 to 4.27, including Figures 4.37 to 4.51.
Example 4.17: Data Integrity
172
Example 4.18: Validation Rules
173
Example 4.19: Cross-field consistency
174
Example 4.20: Cross-field consistency
175
Example 4.21: Cross-Table Consistency
176
Example 4.22: Data Profiling
177
Fig. 4.46: Using Python code for data profiling with statistical output.
178
Example 4.24: Cleaned dataset.
179
Example 4.25: Error Handling
Fig. 4.49 Error result “Cannot divide by zero” using Python code.
180
Example 4.27: Data Governance
181
CHAPTER FIVE: Exploratory Data Analysis
Measure of location
Measures of Dispersion
Measures of Position
Exploratory Data Analysis
5.1 Measure of location
Measures of central tendency are statistical measures that aim to describe the central or typical
value of a dataset. They provide a single representative value that summarizes the distribution of
data points. The three commonly used measures of central tendency are the mean, median, and
mode.
Mean: The mean, also known as the arithmetic mean or average, is calculated by summing up all
the values in a dataset and dividing by the total number of values. It is sensitive to extreme
values and provides a measure of the central value around which the data points tend to cluster.
Median: The median is the middle value of a dataset when it is arranged in ascending or
descending order. If there are an odd number of values, the median is the value exactly in the
middle. If there is an even number of values, the median is the average of the two middle values.
The median is less affected by extreme values and provides a measure of the central position in
the dataset.
Mode: The mode is the value(s) that occur(s) most frequently in a dataset. Unlike the mean and
median, the mode does not require numerical data and can be used for both categorical and
numerical variables. A dataset may have one mode (unimodal), two modes (bimodal), or more
than two modes (multimodal). It can also be described as having no mode (no value occurs more
than once).
These measures of central tendency provide different perspectives on the typical value of a
dataset and can be used in different situations. The mean is commonly used when the data is
normally distributed and not influenced by extreme values. The median is useful when the data
has outliers or is skewed. The mode is often employed to describe the most frequently occurring
value or to identify the most common category in categorical data.
It is important to choose the appropriate measure of central tendency based on the characteristics
of the dataset and the purpose of analysis. Using multiple measures of central tendency can
provide a more comprehensive understanding of the dataset and its distribution.
The formula for calculating the mean (also known as the average) of a set of numbers is:
x̄ = (x₁ + x₂ + … + xₙ) / n = (Σ xᵢ) / n
where the xᵢ are the observed values (i = 1, 2, …, n) and n is the sample size.
Here are numerical Examples 5.1 and 5.2 to illustrate how to calculate the mean:
Example 5.1:
Example 5.2:
Step 3: Calculate the mean: Mean = 31 / 7 ≈ 4.4286 (rounded to four decimal places)
5.1.2 The median
The median is the middle value of a dataset when it is ordered from least to greatest. In case the
dataset has an even number of values, the median is the average of the two middlemost values.
For datasets with an odd number of values:
Arrange the dataset in ascending order. The median is the middle value.
Here are Examples 5.3 to 5.6 to illustrate how to calculate the median:
Step 2: Since there are 7 values, the median is the middle value, which is 11.
Step 2: Since there are 6 values, the median is the average of the two middle values: (3 + 4) / 2 =
3.5
Dataset: 5, 5, 5, 2, 3
Step 2: Since there are 5 values, the median is the middle value, which is 5.
Dataset: 8, 7, 5, 5, 3, 2
Step 2: Since there are 6 values, the median is the average of the two middle values: (5 + 5) / 2 =
5
So, the median of this dataset is 5.
The mode is the value that appears most frequently in a dataset. A dataset can have one mode,
multiple modes (bimodal, trimodal, etc.), or no mode (when all values occur with the same
frequency).
The mode is the value that occurs most frequently in the dataset.
To find the mode, we look for the value that appears most frequently. In this case, the value "7"
occurs three times, which is more frequent than any other value.
Dataset: 4, 1, 6, 2, 7, 3, 5
In this dataset, all the values appear exactly once, and there is no value with a higher frequency
than the others.
Example 5.9: Mean, Median, and Mode.
Fig. 5.1: Calculating the mean, median, and mode using Python code.
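A minimal sketch of what such a computation looks like with Python's statistics module (the dataset is arbitrary):

import statistics

data = [4, 7, 7, 2, 9, 7, 3]
print("Mean:", statistics.mean(data))      # 39 / 7 ≈ 5.571
print("Median:", statistics.median(data))  # sorted: 2, 3, 4, 7, 7, 7, 9 -> 7
print("Mode:", statistics.mode(data))      # 7 occurs most often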
5.2 Measures of Dispersion
Measures of dispersion, also known as measures of variability or spread, quantify the extent to
which data points deviate from the central tendency or average. They provide valuable insights
into the spread, variability, and consistency of a dataset. Some commonly used measures of
dispersion include the range, variance, standard deviation, and interquartile range.
5.2.1 Range: The range is the simplest measure of dispersion and is calculated as the difference
between the maximum and minimum values in a dataset. It gives an indication of the total spread
of the data but does not take into account the distribution of values in between.
Example 5.10: For the dataset {10, 15, 20, 25, 30}, the range would be 30 - 10 = 20.
5.2.2 Variance: Variance measures how far each data point in a dataset deviates from the mean.
It is calculated by taking the average of the squared differences between each data point and the
mean. A higher variance indicates greater variability in the data.
Variance (σ²) = Σ (xᵢ - x̄)² / n
where x̄ is the mean and n is the number of values.
Example 5.11: For the dataset {10, 15, 20, 25, 30}, if the mean is 20, the variance would be
((10-20)^2 + (15-20)^2 + (20-20)^2 + (25-20)^2 + (30-20)^2) / 5 = 50.
5.2.3 Standard Deviation: The standard deviation is the square root of the variance. It provides
a measure of dispersion that is in the same units as the original data, making it easier to interpret.
A higher standard deviation indicates greater spread or variability in the dataset.
Example 5.12: For the same dataset as above, the standard deviation would be the square root of the variance, which is √50 ≈ 7.07.
5.2.4 Interquartile Range (IQR): The interquartile range is a measure of statistical dispersion
that focuses on the middle 50% of the data. It is calculated as the difference between the third
quartile (75th percentile) and the first quartile (25th percentile). The IQR is robust to outliers and
is often used to describe the spread of skewed or non-normally distributed data; see Figure 5.2.
Fig. 5.2: Diagram of the interquartile range, spanning from the lowest value to the highest value of the dataset.
5.2.5 Coefficient of Variation (CV): CV is the ratio of the standard deviation to the mean,
expressed as a percentage. It allows comparison of dispersion between datasets with different
scales.
Example 5.13: If the mean of dataset A is 50 and the standard deviation is 10, the CV would be
(10 / 50) * 100% = 20%.
These measures of dispersion allow for a more comprehensive understanding of the data beyond
just the central tendency. They help identify the spread of values, detect outliers, assess the
variability within a dataset, and compare the variability between different datasets.
It is important to consider the appropriate measure of dispersion based on the characteristics of
the dataset and the specific research or analysis goals. Different measures of dispersion may be
more suitable for different types of data and research questions.
Fig. 5.4: The output of the Measure of Dispersion using Python code.
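A minimal NumPy sketch of these measures, using the dataset {10, 15, 20, 25, 30} from Examples 5.10 to 5.12:

import numpy as np

data = np.array([10, 15, 20, 25, 30])
data_range = data.max() - data.min()     # range = 30 - 10 = 20
variance = data.var()                    # population variance = 50.0
std_dev = data.std()                     # standard deviation ≈ 7.07
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                            # interquartile range = 25 - 15 = 10
cv = std_dev / data.mean() * 100         # coefficient of variation ≈ 35.4%
print(data_range, variance, std_dev, iqr, cv)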
5.3 Measures of Position
Measures of Position are statistical metrics that help us understand the relative position of a
specific data point within a dataset. They provide valuable insights into how an individual data
point compares to other data points in the same dataset. Some common examples of measures of
position include frequency distribution, stem and leaf, percentiles, quartiles, and z-scores.
5.3.1 Frequency Distribution
Frequency distribution is a statistical representation of data that shows the frequency or count of
each value or range of values in a dataset. It organizes data into different categories or intervals
and provides a summary of how frequently each category occurs. Frequency distribution helps to
identify patterns, central tendencies, and variations within a dataset.
Frequency distribution can be represented using various formats, including tables, histograms,
bar charts, or line graphs. These visual representations help in understanding the distributional
characteristics of the data, such as the shape, central tendency, and dispersion.
Frequency distributions are widely used in descriptive statistics to summarize data, identify
outliers, detect patterns, and make comparisons between different groups or datasets. They
provide a compact and organized way to analyze and interpret large amounts of data.
{52, 55, 60, 65, 68, 70, 72, 75, 78, 82, 85, 88, 92, 95, 98}
To create a frequency distribution table with intervals (classes), we'll group the scores into
classes and count the frequency of scores falling into each class.
Example 5.15:
Step 1: Create the intervals (classes):
50 - 59
60 - 69
70 - 79
80 - 89
90 - 99
Class Frequency
50 - 59 2
60 - 69 3
70 - 79 4
80 - 89 3
90 - 99 3
In this frequency distribution table, we've grouped the test scores into intervals (classes) and
displayed the number of scores falling into each interval. This table provides a concise and
informative representation of the data's distribution.
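A minimal pandas sketch that reproduces this frequency table from the raw scores:

import pandas as pd

scores = [52, 55, 60, 65, 68, 70, 72, 75, 78, 82, 85, 88, 92, 95, 98]
bins = [50, 60, 70, 80, 90, 100]                      # classes 50-59, 60-69, ...
classes = pd.cut(pd.Series(scores), bins=bins, right=False)
print(classes.value_counts().sort_index())            # frequencies 2, 3, 4, 3, 3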
{22, 25, 28, 32, 35, 38, 40, 42, 45, 48, 52, 55, 58, 62, 65, 68, 72, 75}
To create a frequency distribution table with intervals (classes), we'll group the ages into classes
and count the frequency of ages falling into each class.
Step 1: Create the intervals (classes):
20 - 29
30 - 39
40 - 49
50 - 59
60 - 69
70 - 79
Class Frequency
20 - 29 3
30 - 39 3
40 - 49 4
50 - 59 3
60 – 69 3
70 - 79 2
In this frequency distribution table, we've grouped the ages into intervals (classes) and displayed
the number of participants falling into each interval. This table gives us a clear overview of the
age distribution among the survey participants.
5.3.2 Stem-and-Leaf Plots
a. Components of a Stem-and-Leaf Plot:
Stem: The stem represents the leading digits or the tens-place digits of the data
values. It is typically arranged in ascending order from bottom to top. Each
unique stem corresponds to a group of data values that share the same leading
digit.
Leaf: The leaf represents the trailing digits or the units-place digits of the data
values. It is usually arranged in ascending order from left to right within each
stem group. Each leaf represents an individual data point belonging to the same
stem.
Example 5.17:
Let's create a stem-and-leaf plot for the dataset: [23, 27, 31, 36, 38, 42].
Stem Leaves
2 3 7
3 1 6 8
4 2
In this example, the stems are 2, 3, and 4, and the corresponding leaves represent the
units digits of the data points.
You can easily determine the range of the data by looking at the minimum and maximum
values of the stems and leaves.
Stem-and-leaf plots can help identify outliers. Values that are far from the main clusters of
leaves may indicate unusual or extreme data points.
Overall, stem-and-leaf plots are a valuable tool for exploratory data analysis, allowing you to
quickly grasp essential characteristics of a dataset. They are especially useful when you want
to maintain the granularity of individual data points while still visualizing the data's
distribution.
Fig. 5.6: Stem-and-leaf plot using Python code.
Fig. 5.7: Raw data for 5 examples using Python code.
Fig. 5.9: Output stem-and-leaf plots for the 5 examples using Python code.
5.3.3 Percentiles:
Percentiles divide a dataset into 100 equal parts, each representing 1% of the data. They show the
relative standing of a particular value in comparison to the rest of the data. For example, the 25th
percentile (also known as the first quartile) is the value below which 25% of the data falls, and
the 75th percentile (third quartile) is the value below which 75% of the data falls. The 50th
percentile is the median, where half of the data is above and half below.
5.3.4 Quartiles:
Quartiles are specific percentiles that divide the dataset into four equal parts, each representing
25% of the data. The three quartiles are as follows:
First quartile (Q1): The 25th percentile, separates the lowest 25% of the data from the rest.
Second quartile (Q2): The 50th percentile, same as the median, divides the data into two equal
halves.
Third quartile (Q3): The 75th percentile, separates the lowest 75% of the data from the highest
25%.
Quartiles are useful in detecting the spread and skewness of a dataset.
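A minimal NumPy sketch of computing percentiles and quartiles (the data reuses the test scores shown earlier in this chapter):

import numpy as np

data = np.array([52, 55, 60, 65, 68, 70, 72, 75, 78, 82, 85, 88, 92, 95, 98])
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("Q1 =", q1, " median =", median, " Q3 =", q3)
print("90th percentile =", np.percentile(data, 90))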
5.3.5 Z-scores:
A z-score measures how many standard deviations a data point X lies from the mean μ of the dataset, where σ is the standard deviation:
Z = (X - μ) / σ
A z-score of 0 means the data point is equal to the mean, a z-score of +1 means it is one standard
deviation above the mean, and a z-score of -2 means it is two standard deviations below the
mean, and so on.
These measures of position play a crucial role in understanding the distribution of data and
identifying potential outliers or extreme values within a dataset. They help in making
comparisons and drawing meaningful conclusions from the data at hand.
Example 5.21: IQR
Fig. 5.15: Output of the z-score Python code.
CHAPTER SIX: Statistical Modeling with Programming Concepts
6.1. Variables with programming concepts
In the context of statistics and programming, variables are containers used to store and
manipulate data. They are an essential concept in any programming language, including Python.
Here are some key details about variables:
a. Definition and Purpose: A variable is a named storage location that holds a value. It acts as a
reference to a specific memory address where the data is stored. Variables are used to store
different types of data, such as numbers, strings, boolean values, or complex objects.
c. Data Types: Variables can have different data types, which determine the kind of values they
can hold. Common data types include integers (int), floating-point numbers (float), strings (str),
booleans (bool), and more complex types like lists, dictionaries, or objects.
d. Variable Assignment: Assigning a value to a variable is done using the assignment operator
(=). It associates a value with the variable name, allowing it to be used and referenced later in the
program.
f. Variable Scope: Variables have a scope, which defines the portion of the program where the
variable is visible and accessible. The scope can be global (accessible throughout the program)
or local (limited to a specific block or function).
g. Data Mutation: Depending on the data type, variables can be mutable or immutable. Mutable
variables can be modified directly, while immutable variables cannot be changed after
assignment. For example, strings and tuples are immutable, while lists and dictionaries are
mutable.
j. Variable Scope and Lifetime: Variables have a defined scope and lifetime, which determine
when they are created, accessed, and destroyed. The scope and lifetime of a variable depend on
where it is declared and how it is used within the program.
Understanding variables and how to work with them is fundamental in programming and
statistical analysis. They enable the storage and manipulation of data, making it easier to perform
calculations, track information, and build complex systems; see Figure 6.1.
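A minimal sketch of variable assignment, data types, and mutability:

count = 10                # int
price = 19.99             # float
name = "data science"     # str
is_valid = True           # bool
scores = [88, 92, 75]     # list  (mutable: can be changed in place)
point = (3, 4)            # tuple (immutable: cannot be changed)

scores.append(100)        # allowed, because lists are mutable
print(type(count), type(price), type(name), type(is_valid))
print(scores, point)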
6.2 Parameters
In the context of probability distributions, parameters are numerical values that define the
characteristics of a distribution. These values determine the shape, location, and scale of the
distribution. Here are some key details about parameters:
a. Definition: Parameters are fixed values that are used to define and specify a particular
probability distribution. They describe important characteristics of the distribution, such as its
center, spread, skewness, and kurtosis.
b. Types of Parameters: The specific parameters and their interpretation depend on the
distribution being considered. Some common parameters include:
- Mean (μ): The mean represents the average or central value of the distribution. It
determines the center or location of the distribution.
- Standard Deviation (σ): The standard deviation measures the spread or variability of the
distribution. It indicates how much the values typically deviate from the mean.
- Variance (σ²): The variance is the square of the standard deviation. It represents the average of the squared deviations from the mean and provides a measure of dispersion.
- Shape Parameters: Some distributions have additional parameters that control the shape
of the distribution. For example, the gamma distribution has shape and scale
parameters.
d. Estimation: Estimating the parameters is a common task in statistics. Given a set of observed
data, statistical methods can be used to estimate the parameters that best fit the data to a specific
distribution. This process is known as parameter estimation.
e. Hypothesis Testing: Parameters play a crucial role in hypothesis testing. Researchers often
test hypotheses about the values of certain parameters in a population based on sample data.
Hypothesis tests help determine if the observed data provides evidence to support or reject
certain parameter values.
Understanding and specifying the correct parameters is crucial for accurately representing and
analyzing data using probability distributions. The choice of parameters affects the distribution's
behavior and enables meaningful interpretation and inference based on the distribution.
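As a minimal sketch of how parameters are specified in code, the snippet below uses scipy.stats, where the normal distribution's mean and standard deviation are passed as the loc and scale arguments (the numerical values are assumed):
from scipy import stats

mu, sigma = 10, 2                            # assumed mean and standard deviation
dist = stats.norm(loc=mu, scale=sigma)       # normal distribution defined by its parameters
print(dist.mean(), dist.std(), dist.var())   # 10.0 2.0 4.0

gamma_dist = stats.gamma(a=2.0, scale=1.5)   # gamma distribution with shape and scale parameters
print(gamma_dist.mean(), gamma_dist.var())   # 3.0 4.5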
Fig. 6.4: Output Parameters using Python code.
Probability distribution, in the field of statistics and probability theory, refers to a mathematical
function that describes the likelihood of different outcomes occurring in a given set of events or
experiments. It provides a systematic way of characterizing the uncertainty associated with these
outcomes by assigning probabilities to their occurrence. Probability distributions are defined by
their shape, parameters, and specific mathematical formulas, which allow for quantification and
analysis of random variables.
Applications: Probability distributions find extensive applications across various fields. One
common application is in risk assessment and decision-making processes. By understanding the
probability distribution of potential outcomes, decision-makers can evaluate the likelihood of
different scenarios and make informed choices. Probability distributions are also fundamental in
statistical inference, where they are used to estimate population parameters based on sample
data. In addition, they play a crucial role in modeling and simulating complex systems, such as
financial markets, weather patterns, and biological phenomena. By incorporating probability
distributions into these models, researchers can generate reliable predictions and understand the
underlying dynamics of the system.
The behavior of a random variable is described by its probability distribution. For a discrete
random variable, the probability distribution is often represented by a probability mass function
(PMF), which assigns probabilities to each possible value. For a continuous random variable, the
probability distribution is described by a probability density function (PDF), which gives the
probability of the variable falling within a certain range.
Random variables are a fundamental concept in probability theory and statistics. They are used
to model and describe uncertain or random quantities in mathematical terms. A random variable
is a variable whose possible values are outcomes of a random phenomenon. It assigns a
numerical value to each outcome of an experiment.
a. Definition: A random variable is a function that maps the outcomes of a random experiment to
numerical values. It assigns a real number to each outcome in the sample space.
To generate random numbers in Python, you can use the random module, which provides a number of functions for generating random numbers; see Example 6.4.
Example 6.4: Random variable.
b. Types of random variables: Random variables can be classified into two main types: discrete
random variables and continuous random variables.
I. Discrete random variables: These variables take on a countable set of values. For example,
the number of heads obtained when flipping a coin multiple times is a discrete random
variable, as it can only take on the values 0, 1, 2, and so on.
II. Continuous random variables: These variables can take on any value within a certain range
or interval. Examples include the height of a person, the time it takes for a car to travel a
certain distance, etc.
Discrete probability distributions are used to describe random variables that can take on a
finite or countable number of values. Some common examples of discrete probability
distributions include:
I. Bernoulli distribution is one of the simplest and most fundamental discrete probability
distributions. It models a random experiment with only two possible outcomes, often
referred to as "success" and "failure." These outcomes are typically represented as 1
(success) and 0 (failure).
The Bernoulli distribution is characterized by a single parameter, usually denoted as "p," which
represents the probability of success in a single trial. The probability of failure is then given by
(1 - p).
The probability mass function (PMF) of the Bernoulli distribution is defined as follows:
P(X = x) = p^x (1 - p)^(1 - x),   x ∈ {0, 1}
where x = 1 denotes success and x = 0 denotes failure. The outcomes are mutually exclusive and exhaustive, meaning that only one of the two outcomes can occur in a single trial.
The mean (expected value) of a Bernoulli random variable X is E(X) = p. It represents
the average probability of success in a single trial.
The variance of X is Var(X) = p * (1 - p). The variance is a measure of how spread out
the distribution is around its mean.
The Bernoulli distribution finds various applications in statistics and probability theory, as well
as in real-world scenarios. Some common applications include:
Modeling binary outcomes: It is used to model events with only two possible outcomes,
such as success/failure, yes/no, heads/tails, etc.
Bernoulli trials: It is used to represent a single trial in a sequence of independent binary
experiments, where each trial has the same probability of success.
Decision-making and classification: In machine learning and data analysis, the Bernoulli
distribution is used to model binary classification problems.
Binomial distribution: The Bernoulli distribution is the basis for the binomial distribution,
which represents the number of successes in a fixed number of independent Bernoulli
trials.
Overall, the Bernoulli distribution serves as a building block for more complex probability
distributions and provides a foundation for understanding the behavior of random events with
two possible outcomes.
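A minimal sketch using scipy.stats.bernoulli, with an assumed success probability p = 0.7, illustrates the PMF, mean, and variance described above:
from scipy.stats import bernoulli

p = 0.7                                   # assumed probability of success
print(bernoulli.pmf(1, p))                # P(X = 1) = 0.7
print(round(bernoulli.pmf(0, p), 2))      # P(X = 0) = 0.3
print(bernoulli.mean(p))                  # E(X) = p = 0.7
print(round(bernoulli.var(p), 2))         # Var(X) = p(1 - p) = 0.21
print(bernoulli.rvs(p, size=10))          # ten simulated Bernoulli trials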
For example, consider a basketball player shooting free throws: the probability of success (making the free throw) is p = 0.7 and the probability of failure (missing the free throw) is q = 1 - p = 0.3. Here are some further examples of Bernoulli trials and their probabilities:
You can use the Bernoulli distribution to calculate the probability of any event that has two
possible outcomes, such as winning a coin toss, rolling a six on a die, or making a free throw in
basketball.
As another example, consider applying for a job: the probability of success (getting a job offer) is p and the probability of failure
(not getting a job offer) is q = 1 - p. The value of p will vary depending on the person's
qualifications, experience, and the competitiveness of the job market.
Similarly, for a sales pitch, the probability of success (the customer buying the product) is p and the
probability of failure (the customer not buying the product) is q = 1 - p. The value of p will vary
depending on the product, the customer's needs, and the sales skills of the salesperson.
These are just a few examples of the many ways that the Bernoulli distribution can be used to
model real-world situations.
Example 6.10: Bernoulli distribution
Fig. 6.12: Output graph of Bernoulli distribution using Python code
II. Binomial distribution: This distribution is used to describe the probability of a certain number of successes in a fixed number of trials, where each trial has only two possible outcomes (success or failure). Its PMF is
P(X = k) = C(n, k) p^k (1 - p)^(n - k)
where:
n is the number of trials, k is the number of successes, p is the probability of success in a single trial, and C(n, k) = n! / (k!(n - k)!) is the number of ways of choosing k successes out of n trials.
Suppose you have a list of 50 email subscribers and you send them an email campaign. You
know from previous campaigns that the average open rate for your emails is 20%. This means
that the probability of any individual subscriber opening your email is 0.20. We can use the binomial distribution to calculate the probability of a certain number of subscribers opening the email. For example, the probability of exactly 10 subscribers opening it is:
P(X = 10) = C(50, 10) (0.20)^10 (0.80)^40 ≈ 0.1398
So the probability that exactly 10 people open the email is about 0.1398.
Suppose a player attempts 20 free throws, the probability of making each free throw is 0.7, and we want the probability of making exactly 15 of them:
P(X = 15) = C(20, 15) (0.7)^15 (0.3)^5 ≈ 0.178863
So the probability of making exactly 15 free throws is approximately 0.178863.
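Both results can be checked with scipy.stats.binom; a minimal sketch:
from scipy.stats import binom

print(round(binom.pmf(10, 50, 0.20), 4))  # email campaign: n = 50, p = 0.20, k = 10 -> ~0.1398
print(round(binom.pmf(15, 20, 0.70), 6))  # free throws:    n = 20, p = 0.70, k = 15 -> ~0.178863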
Example 6.13: Binomial Distribution.
Fig. 6.14: Output results for the binomial distribution using Python code.
Example 6.14: Binomial Distribution
Fig. 6.16: Output plot for the probability of exactly 10 people opening the email (0.1398) using Python code.
Fig. 6.17: Output plot for the probability of exactly 15 free throws (0.1789) using Python code.
Example 6.15: Binomial Distribution.
Fig. 6.19: Output results for the binomial distribution using Python code.
Example 6.16: Binomial Distribution
Fig. 6.21: Output plot for the probability of exactly 10 people opening the email (0.1398) using Python code.
Fig. 6.22: Output plot for the probability of exactly 15 free throws (0.1789) using Python code.
III. Poisson distribution: This distribution is used to describe the probability of a certain
number of events occurring in a fixed time interval or space, where the events occur
independently of each other and at a constant rate.
Its probability mass function (PMF) is
P(X = k) = (λ^k e^(-λ)) / k!,   k = 0, 1, 2, ...
where X is the random variable representing the number of events occurring in the fixed time interval or space, and λ is the average number of events per interval. Key properties of the Poisson distribution:
1. Events occur independently of one another and at a constant average rate.
2. The probability of more than one event occurring in an infinitesimally small interval approaches zero.
3. The mean (average) and variance of the Poisson distribution are both equal to λ.
1. Customer Arrivals: The Poisson distribution can model the number of customer arrivals at a store during a specific hour. Suppose, on average, 5 customers arrive per hour (λ = 5). The probability of exactly 3 customers arriving in the next hour (k = 3) is
P(X = 3) = (5^3 e^(-5)) / 3! ≈ 0.140
2. Defects in a Product: In manufacturing, the Poisson distribution can model the number
of defects in a batch of products. If, on average, there are 2 defects per batch (λ = 2), you
can find the probability of having exactly 1 defect in the next batch (k = 1).
3. Accidents per Day: The Poisson distribution can be used to model the number of accidents occurring at a specific location in a day. If, on average, there are 0.5 accidents per day (λ = 0.5), you can calculate the probability of having exactly 2 accidents in the next day (k = 2).
4. Call Center Calls: Call centers can use the Poisson distribution to model the number of
incoming calls per minute or per hour. If, on average, there are 10 calls per minute (λ =
10), you can find the probability of receiving exactly 15 calls in the next minute (k = 15).
These examples illustrate how the Poisson distribution can be applied in various
scenarios to model the number of events occurring within a fixed interval of time or
space, given an average rate of occurrence.
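The four scenarios above can be evaluated with scipy.stats.poisson; a minimal sketch:
from scipy.stats import poisson

print(round(poisson.pmf(3, 5), 3))      # customer arrivals: λ = 5,   k = 3  -> ~0.140
print(round(poisson.pmf(1, 2), 3))      # product defects:   λ = 2,   k = 1  -> ~0.271
print(round(poisson.pmf(2, 0.5), 3))    # accidents:         λ = 0.5, k = 2  -> ~0.076
print(round(poisson.pmf(15, 10), 3))    # call center:       λ = 10,  k = 15 -> ~0.035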
Fig. 6.24: output plot of Poisson distribution using Python code.
IV. Geometric distribution: This distribution is used to describe the probability of the first
success occurring on the nth trial, where each trial has only two possible outcomes
(success or failure).
The probability mass function (PMF) of the geometric distribution is given by:
P(X = k) = (1 - p)^(k - 1) p,   k = 1, 2, 3, ...
where P(X = k) is the probability of getting the first success on the k-th trial and p is the probability of success in a single trial.
Key characteristics of the geometric distribution are:
1. Each trial is independent and has only two possible outcomes: success or failure.
2. The distribution is memoryless, meaning the probability of success on the next trial is
the same regardless of how many trials have been conducted before.
1. Waiting for a Bus: Suppose that, each time you check the stop, the probability that the bus arrives is 0.2 (p = 0.2). You can use the geometric distribution to calculate the probability that the bus first arrives on the 5th check (k = 5):
P(X = 5) = (1 - 0.2)^4 (0.2) ≈ 0.082
2. Email Spam: In email filtering, the geometric distribution can model how many emails arrive before the first spam email. If the probability that any given email is spam is 0.1 (p = 0.1), the probability that the first spam email is the 10th email received (i.e., 9 non-spam emails arrive first) is
P(X = 10) = (1 - 0.1)^9 (0.1) ≈ 0.0387
3. Product Defects: In quality control, the geometric distribution can model the number of items produced before the first defective item is found. If the probability of producing a defective item is 0.03 (p = 0.03), the probability that the first defective item is the 6th item produced (i.e., 5 non-defective items come first) is:
P(X = 6) = (1 - 0.03)^5 (0.03) ≈ 0.02576
These examples demonstrate how the geometric distribution can be applied in various
scenarios to model the number of trials needed before the first success occurs, given the
probability of success in each trial (p).
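scipy.stats.geom uses this same parameterization (the trial on which the first success occurs), so the examples above can be verified with a short sketch:
from scipy.stats import geom

print(round(geom.pmf(5, 0.2), 3))    # bus: first arrival on the 5th check       -> ~0.082
print(round(geom.pmf(10, 0.1), 4))   # spam: first spam email is the 10th email  -> ~0.0387
print(round(geom.pmf(6, 0.03), 5))   # defects: first defective is the 6th item  -> ~0.02576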
Example 6.18: Geometric probability
a. Normal Distribution
One commonly used continuous probability distribution is the normal distribution (also
known as the Gaussian distribution). It is characterized by its bell-shaped curve and is often
used to model real-world phenomena. The normal distribution is fully defined by its mean
(μ) and standard deviation (σ). The PDF of the normal distribution is given by the
following formula:
f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²))
where:
x is the random variable
μ is the mean of the distribution
σ is the standard deviation of the distribution
e is the base of the natural logarithm (approximately 2.71828).
Example 6.19: Normal distribution
Fig. 6.29: Output plot for normal distribution using Python code.
Example 6.20: Normal Distribution
Fig. 6.31: Output plot for two normal distribution using Python code.
Fig. 6.33: Output plot for two normal distribution using Python code.
Suppose we have a normally distributed population of heights with a mean of 68 inches and a standard deviation of 3 inches. We want to find the probability that a randomly selected person is shorter than 70 inches (X = 70).
1. Mean (μ) = 68 inches
2. Standard deviation (σ) = 3 inches
3. X = 70 inches
Now, let's calculate this step by step:
Step 1: Compute the z-score: z = (X - μ) / σ = (70 - 68) / 3 = 2/3 ≈ 0.67.
Step 2: Look up the z-score in the standard normal distribution table, or use a calculator, to find the cumulative probability associated with the z-score. In this case, P(Z < 2/3) ≈ 0.7486.
Step 3: This cumulative probability is the answer.
Therefore, the probability that a randomly selected person is shorter than 70 inches is approximately 74.86%.
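The same probability can be obtained directly from scipy.stats.norm; a minimal sketch:
from scipy.stats import norm

p = norm.cdf(70, loc=68, scale=3)    # P(X < 70) for μ = 68, σ = 3
print(round(p, 4))                   # ~0.7475 (a table lookup with z ≈ 0.67 gives ~0.7486)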
For a second example, suppose test scores are normally distributed and the z-score corresponding to a score of 90 works out to z = 0.5.
Step 2: Look up the z-score in the standard normal distribution table to find P(Z < 0.5) ≈ 0.6915.
Step 3: Calculate the complement, 1 - P(Z < 0.5), to find the probability that a student scores above 90:
P(Z > 0.5) = 1 - 0.6915 = 0.3085
So, the probability that a randomly selected student scores above 90 on the test is approximately 0.3085, or 30.85%.
Example 6.24: Probability of Task Completion Time between 40 and 50 Minutes
Imagine a dataset representing task completion times that follow a normal distribution with a mean of 45 minutes and a standard deviation of 8 minutes. We want to find the probability that a randomly selected task takes between 40 and 50 minutes (40 < X < 50).
1. Mean (μ) = 45 minutes
2. Standard deviation (σ) = 8 minutes
3. Lower bound = 40 minutes
4. Upper bound = 50 minutes
Step 1: Compute the z-scores for both bounds: z_lower = (40 - 45) / 8 = -0.625 and z_upper = (50 - 45) / 8 = 0.625.
Step 2: Look up the z-scores in the standard normal distribution table to find the corresponding cumulative probabilities, P(X < 40) and P(X < 50).
Step 3: Subtract the cumulative probability for the lower bound from the cumulative probability for the upper bound to find the desired probability:
P(40 < X < 50) = P(X < 50) - P(X < 40) ≈ 0.46993
So, the probability that a randomly selected task takes between 40 and 50 minutes is approximately 0.46993, or about 46.99%.
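Again, scipy.stats.norm gives the interval probability directly; a minimal sketch:
from scipy.stats import norm

p = norm.cdf(50, loc=45, scale=8) - norm.cdf(40, loc=45, scale=8)   # P(40 < X < 50)
print(round(p, 4))    # ~0.468 (table rounding in the worked example gives ~0.4699)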
Fig. 6.34: Normal distribution using Python code.
Fig. 6.35: Output plot for Normal distribution using Python code.
Fig. 6.36: Output statistical results for Normal distribution using Python code.
b. Uniform distribution
The continuous uniform distribution assigns equal probability density to every value in an interval [a, b]. Its PDF is
f(x) = 1 / (b - a)   for a ≤ x ≤ b
where:
a is the lower bound of the distribution
b is the upper bound of the distribution
The PDF of the uniform distribution is constant within the interval [a, b], and outside this interval, it is zero.
Here are some numerical examples for uniform distribution:
Example 6.26: Uniform distribution
Suppose we have a uniform distribution on the interval [0, 10]. This means that all
values between 0 and 10 are equally likely. We can use the uniform distribution
formula to calculate the probability of getting a value between 5 and 7:
P(5 < X < 7) = (7 - 5) / (10 - 0) = 0.2
This means that there is a 20% chance of getting a value between 5 and 7 in a uniform distribution on the interval [0, 10].
This means that there is a 50% chance of getting a value greater than 15 in a uniform
distribution on the interval [10, 20].
This means that there is a 50% chance of getting a value between -5 and 5 in a
uniform distribution on the interval [-10, 10].
Uniform distributions are often used to model situations where all values within a
certain range are equally likely.
For example, we could use a uniform distribution to model the probability of getting
a certain number on a die roll or the probability of picking a certain card from a deck
of cards.
Let's consider the first example: modeling the daily high temperature in a city where the
temperature can vary between 70°F and 90°F with equal probability. In this case, we have a
continuous uniform distribution over the temperature range [70°F, 90°F].
Model the daily high temperature in a city where the temperature follows a continuous uniform
distribution between 70°F and 90°F. Find the probability density function (PDF) and calculate
the probability of the temperature being within a certain range.
Solution:
In a continuous uniform distribution, the probability density function (PDF) is constant over the entire range and is given by:
f(x) = 1 / (b - a) = 1 / (90 - 70) = 0.05   for 70 ≤ x ≤ 90
To find the probability that the temperature falls within a specific range, you can calculate the area under the PDF curve within that range.
For example, to find the probability that the temperature is between 75°F and 85°F:
P(75 ≤ X ≤ 85) = ∫[75 to 85] 0.05 dx = 0.05 × (85 - 75) = 0.5
So, the probability that the daily high temperature is between 75°F and 85°F is 0.5, or 50%.
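The same calculation with scipy.stats.uniform, which takes loc = a and scale = b - a, as a minimal sketch:
from scipy.stats import uniform

a, b = 70, 90                          # temperature range in °F
dist = uniform(loc=a, scale=b - a)
p = dist.cdf(85) - dist.cdf(75)        # P(75 < X < 85)
print(p)                               # 0.5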
Example 6.30: Uniform distribution
Fig. 6.38: Statistical output results for Uniform distribution using Python code
Fig. 6.39: Output graph for Uniform distribution using Python code.
c. Exponential distribution
The exponential distribution models the time between events that occur continuously and independently at a constant average rate. Its PDF is
f(x) = λ e^(-λx)   for x ≥ 0
where x is the random variable representing the time between events, λ (lambda) is the rate parameter, and e is the base of the natural logarithm (approximately 2.71828).
The exponential distribution has the following properties:
1. Memorylessness: The probability of an event occurring in the future is not dependent on how much time has already elapsed. In other words, the distribution does not "remember" the past.
2. Exponential decay: The exponential distribution exhibits exponential decay, meaning the probability of an event occurring decreases exponentially as time increases.
3. Constant hazard rate: The hazard rate, which represents the instantaneous probability of an event occurring given that it has not occurred yet, is constant for the exponential distribution. The hazard rate is given by λ, the rate parameter.
The mean (μ) and standard deviation (σ) of the exponential distribution are both equal to 1/λ. The variance (σ²) is equal to 1/λ².
The exponential distribution is commonly used in various fields, such as reliability
analysis, queuing theory, and survival analysis, where modeling the time to failure or
time to an event is important.
Suppose the time between customer arrivals at a store is exponentially distributed with a mean of 5 minutes. This means that the PDF of the time between arrivals, X, is given by:
f(x) = 0.2 e^(-0.2x)
where λ = 1/5 = 0.2.
To calculate the probability that a customer will arrive within 5 minutes of the previous customer, we integrate the PDF from 0 to 5:
P(X < 5) = ∫[0 to 5] 0.2 e^(-0.2x) dx = 1 - e^(-1) ≈ 0.632
Therefore, there is a 63.2% chance that a customer will arrive within 5 minutes of the previous customer.
To calculate the probability that a customer will arrive more than 5 minutes after the previous customer, we take the complement:
P(X > 5) = 1 - P(X < 5) = e^(-1) ≈ 0.368
Therefore, there is a 36.8% chance that the next customer arrives after more than 5 minutes.
Similarly, the probability that the next customer arrives after more than 10 minutes is:
P(X > 10) = e^(-0.2 × 10) = e^(-2) ≈ 0.135
Therefore, there is a 13.5% chance that the next customer arrives after more than 10 minutes.
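These three probabilities can be checked with scipy.stats.expon, whose scale argument is the mean 1/λ; a minimal sketch:
from scipy.stats import expon

dist = expon(scale=5)                  # mean time between arrivals = 5 minutes, so λ = 0.2
print(round(dist.cdf(5), 3))           # P(X < 5)  -> ~0.632
print(round(dist.sf(5), 3))            # P(X > 5)  -> ~0.368
print(round(dist.sf(10), 3))           # P(X > 10) -> ~0.135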
Fig. 6.41: Output plot for Exponential Distribution using Python code.
d. Beta distribution
The beta distribution is a continuous probability distribution defined on the interval
[0, 1]. It is characterized by two shape parameters, typically denoted as α (alpha) and
β (beta), which control the shape and behavior of the distribution.
The probability density function (PDF) of the beta distribution is given by:
f(x; α, β) = x^(α - 1) (1 - x)^(β - 1) / B(α, β),   0 ≤ x ≤ 1
where x is the random variable, α and β are the shape parameters, and B(α, β) is the beta function, defined as:
B(α, β) = ∫[0 to 1] t^(α - 1) (1 - t)^(β - 1) dt = Γ(α) Γ(β) / Γ(α + β)
3. Shape and behavior controlled by α and β: The values of α and β determine the
shape of the distribution. Higher values of α and β result in distributions that are
more peaked and concentrated around the mean.
4. Special cases: When α = β = 1, the beta distribution reduces to the uniform distribution on [0, 1]. When α = β, the distribution is symmetric about 0.5 (unimodal if α = β > 1, U-shaped if α = β < 1). More generally, α < β gives a positively skewed distribution and α > β gives a negatively skewed one.
The cumulative distribution function (CDF) of the beta distribution does not have a
closed-form expression, but it can be computed numerically using various
approximation methods or specialized software.
The mean (μ) and variance (σ²) of the beta distribution are given by:
μ = α / (α + β)
σ² = αβ / ((α + β)² (α + β + 1))
The beta distribution is widely used in various fields, including statistics, Bayesian
inference, modeling proportions, and in machine learning applications such as beta
regression and Bayesian parameter estimation.
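A short sketch with scipy.stats.beta confirms the mean and variance formulas for assumed shape parameters α = 2 and β = 5:
from scipy.stats import beta

a, b = 2, 5                              # assumed shape parameters
print(round(beta.mean(a, b), 4))         # α/(α+β) = 2/7 ≈ 0.2857
print(round(beta.var(a, b), 4))          # αβ/((α+β)²(α+β+1)) = 10/392 ≈ 0.0255
print(round(beta.cdf(0.5, a, b), 4))     # P(X ≤ 0.5) ≈ 0.8906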
Fig. 6.42: Probability distribution function for Beta distribution using Python code.
Fig. 6.43: Output results for Probability distribution function for Beta distribution using Python code.
Example 6.34: Beta distribution
Fig. 6.44: Generating random numbers from the beta distribution using Python code.
Fig. 6.45: Output random numbers from the beta distribution using Python code.
Example 6.35: Beta distribution
Fig. 6.47: Output plot for Beta distribution using Python code.
Example 6.36: Beta distribution
Fig. 6.49: Output for different plots of the Beta distribution using Python code.
6.4 Cumulative distribution function (CDF): The cumulative distribution function of a random
variable gives the probability that the variable takes on a value less than or equal to a given
value. It provides a complete description of the random variable's behavior.
6.5 Expected value: The expected value, also known as the mean or average, is a measure of the
central tendency of a random variable. It represents the weighted average of all possible values,
where the weights are given by the probabilities associated with those values. Expected value, in
the context of statistics and probability theory, refers to the theoretical average outcome that can
be anticipated from a specific event or random variable. It is calculated by multiplying each
possible outcome by its respective probability and summing them up. Expected value serves as a
valuable tool for decision-making and risk assessment, allowing us to estimate the potential
outcome of an uncertain event based on its probabilities.
Expected value finds extensive applications across various fields, such as finance, economics,
and insurance. In finance, it is commonly used to assess investment opportunities by considering
the expected return and associated risks. By calculating the expected value, investors can make
informed decisions about the potential profitability of an investment and evaluate the level of
uncertainty involved. In economics, expected value aids in evaluating policy choices by
considering the potential outcomes and their probabilities. Additionally, in the field of insurance,
expected value assists in determining appropriate premiums by considering the potential losses
and their associated probabilities. By incorporating expected values, insurers can ensure their
pricing aligns with the potential risks involved. Overall, expected value provides a quantitative
measure to estimate the likely outcome of uncertain events, enabling better decision-making and
risk management.
For a discrete random variable X, the expected value is
E(X) = Σ xᵢ P(xᵢ)
where the xᵢ are the possible outcomes and P(xᵢ) are their probabilities. For example, suppose you are choosing between two investments:
Investment A has a 60% chance of returning 10% and a 40% chance of returning -5%.
Investment B has a 50% chance of returning 5% and a 50% chance of returning 0%.
The expected value of each investment can be calculated as follows:
E(A) = 0.6 × 10% + 0.4 × (-5%) = 6% - 2% = 4%
E(B) = 0.5 × 5% + 0.5 × 0% = 2.5%
so Investment A has the higher expected return.
Example 6.39:
Suppose you are trying to predict the amount of rain that will fall in a given month. You have
historical data that shows that there is a 20% chance of 1 inch of rain, a 70% chance of 2 inches
of rain, and a 10% chance of 3 inches of rain.
The expected rainfall is E(X) = 0.2 × 1 + 0.7 × 2 + 0.1 × 3 = 1.9 inches. This means that, on average, you can expect 1.9 inches of rain in the month.
Expected value is a useful tool for making decisions in situations where there is uncertainty. By
calculating the expected value of each possible outcome, you can get an idea of which outcome
is most likely to occur and which outcome is most likely to benefit you.
Let's use a probability density function (PDF) and integration to calculate the expected value of a continuous random variable.
To find the expected value E(X), we integrate the product of x and the PDF over the entire range of X, which here is from 0 to 1:
E(X) = ∫[0 to 1] x f(x) dx
So, the expected value of the random variable X is obtained by evaluating this integral for the given PDF.
6.6 Moment generating function (MGF): The moment generating function is a useful tool for
characterizing random variables. It generates moments of the random variable and can be used to
derive various properties, such as moments, moments about the mean, and the shape of the
distribution.
The moment generating function is a significant concept in probability theory and statistics. It is
a mathematical function that uniquely characterizes a probability distribution. The MGF of a
random variable X is defined as the expected value of e^(tX):
M_X(t) = E[e^(tX)] = Σₓ e^(tx) P(X = x)   (discrete case)
M_X(t) = E[e^(tX)] = ∫ e^(tx) f(x) dx   (continuous case)
where t is a real-valued parameter and X is the random variable. In simpler terms, the MGF
provides a way to generate moments (expected values) of a random variable by taking the
exponential of the variable multiplied by a parameter.
The moment generating function has various applications in probability theory and statistics.
One of its primary uses is in determining the moments of a random variable. By taking the
derivatives of the MGF with respect to the parameter t and evaluating them at t=0, we can obtain
the moments of the random variable, including the mean, variance, skewness, and higher-order
moments. This information is crucial for understanding the characteristics of a probability
distribution and making statistical inferences.
The moment generating function (MGF) of a random variable is a function that summarizes all of its moments. The moments of a random variable are the expected values of its powers; the mean is the first moment, while quantities such as the variance and skewness are obtained from the second and third moments.
It can be used to easily derive the moments of a random variable. For example, the nth moment
of a random variable is the nth derivative of the MGF evaluated at t=0.
The MGF uniquely determines the distribution of a random variable. This means that if two
random variables have the same MGF, then they must have the same distribution.
The MGF can be used to find the distribution of a sum of random variables. This is done by
taking the product of the MGFs of the individual random variables.
Here is an example of how the MGF can be used to find the mean and variance of a random
variable. Let X be a random variable with MGF M(t). Then, the mean of X is given by:
E[X] = M'(0)
and the variance is given by
Var[X] = M''(0) - (M'(0))²
For a standard normal random variable X, for instance, the mean and variance are 0 and 1, which can be verified by taking the first and second derivatives of its MGF and evaluating them at t = 0, as in the sketch below.
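A minimal symbolic sketch with sympy, using the standard normal MGF M(t) = exp(t²/2) as the assumed example:
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)                           # MGF of the standard normal distribution

mean = sp.diff(M, t).subs(t, 0)                # M'(0)
second_moment = sp.diff(M, t, 2).subs(t, 0)    # M''(0) = E[X²]
variance = second_moment - mean**2             # Var(X) = M''(0) - (M'(0))²
print(mean, variance)                          # 0 1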
The MGF is a powerful tool that can be used to analyze random variables. It is often used in
probability theory, statistics, and engineering.
Here are some other benefits of using the moment generating function:
It can be used to prove certain theorems in probability theory, such as the central limit theorem.
Bernoulli distribution: The MGF of a Bernoulli random variable X with parameter p is given by
M_X(t) = (1 - p) + p e^t
Binomial distribution: The MGF of a binomial random variable X with parameters n and p is given by
M_X(t) = ((1 - p) + p e^t)^n
Poisson distribution: The MGF of a Poisson random variable X with parameter λ is given by
M_X(t) = exp{ λ(e^t - 1) }
Example 6.42: Moment Generating Function
Fig. 6.55: Output results for Moment Generating Function of different distribution using Python code.
Here are some numerical examples for continuous moment generating functions (MGFs):
The MGF of the normal distribution with mean 0 and variance 1 is given by
M(t) = e^(t²/2)
For example, this MGF evaluated at t = 1 is equal to exp(1/2) = 1.6487212707001282.
Example 6.43: Moment Generating Function.
Fig. 6.56: Moment Generating Function for a continuous distribution using Python code.
Fig. 6.57: Output results for continuous Moment Generating Function using Python code.
6.7 Conditional probability
The conditional probability of event A given that event B has occurred is defined (for P(B) > 0) as
P(A|B) = P(A ∩ B) / P(B)
where:
P(A ∩ B) is the probability of both events A and B occurring simultaneously (the intersection of events A and B).
P(B) is the probability of event B.
In words, the formula can be explained as follows: The conditional probability of A given B is
equal to the probability of both A and B happening divided by the probability of B happening.
If P(A|B) = 0, it means event A is impossible if event B has occurred.
If P(A|B) = P(A), it means that events A and B are independent; the occurrence of event B does
not affect the probability of event A.
Conditional probability plays a crucial role in many real-world applications, such as medical
diagnosis, weather forecasting, and risk assessment. It helps in updating probabilities when new
information becomes available, making predictions based on observed data, and understanding
the relationships between events.
Consider rolling a fair six-sided die. Let event A be rolling an even number (2, 4, or 6), and
event B be rolling a number greater than 3 (4, 5, or 6). The sample space of the die is {1, 2, 3, 4,
5, 6}. The probabilities are:
P(A) = 3/6 = 1/2 (because there are three even numbers out of six possibilities).
P(B) = 3/6 = 1/2 (because there are three numbers greater than 3 out of six possibilities).
P(A ∩ B) = 2/6 = 1/3 (because there are two numbers that satisfy both A and B: 4 and 6).
Then, the conditional probability of rolling an even number given that the number is greater than 3 is:
P(A|B) = P(A ∩ B) / P(B) = (1/3) / (1/2) = 2/3
This means that if you roll a die and get a number greater than 3, the probability of it being an even number is 2/3.
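The die example can be verified by simple enumeration; a minimal sketch:
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]
A = {n for n in sample_space if n % 2 == 0}           # even numbers
B = {n for n in sample_space if n > 3}                # numbers greater than 3

P_B = Fraction(len(B), len(sample_space))             # 1/2
P_A_and_B = Fraction(len(A & B), len(sample_space))   # 1/3
print(P_A_and_B / P_B)                                # P(A|B) = 2/3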
6.8 Correlation
In statistics, correlation is a statistical measure that indicates the extent to which two variables
or quantities are related. It is a measure of the linear association between two variables,
meaning that it measures the strength and direction of the relationship between two variables. A
correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0
indicates no correlation, and 1 indicates a perfect positive correlation.
Types of correlation
A correlation can be positive (the variables tend to increase together), negative (one tends to decrease as the other increases), or approximately zero (no linear relationship).
It is important to note that correlation does not equal causation. Just because two variables are
correlated does not mean that one variable causes the other. For example, there is a strong correlation between the number of ice cream sales and the number of drownings. However, this does not mean that ice cream sales cause drownings. There is a third variable, the temperature, that causes both ice cream sales and drownings to increase.
Applications of correlation
Finance: Correlation is used to study the relationship between financial variables, such
as stock prices and interest rates.
Medicine: Correlation is used to study the relationship between medical variables, such
as smoking and lung cancer.
Social science: Correlation is used to study the relationship between social variables,
such as education and income.
Limitations of correlation
Correlation is a useful statistical measure, but it has some limitations. Correlation does not
equal causation, and it can be affected by outliers. Additionally, correlation is only a measure of
the linear relationship between two variables. If the relationship between two variables is non-
linear, then correlation will not be a good measure of the strength of the relationship.
Here is the formula for the Pearson correlation coefficient between two variables X and Y:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
Consider the following data:
x: 1, 2, 3, 4, 5
y: 3, 4, 6, 8, 10
Solution:
With x̄ = 3 and ȳ = 6.2, the working table is:
x     y     (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)   (x - x̄)²   (y - ȳ)²
1     3     -2         -3.2       6.4               4           10.24
2     4     -1         -2.2       2.2               1           4.84
3     6      0         -0.2       0.0               0           0.04
4     8      1          1.8       1.8               1           3.24
5     10     2          3.8       7.6               4           14.44
Total                             18                10          32.8
r = 18 / √(10 × 32.8) = 18 / √328 ≈ 0.99
which indicates a very strong positive correlation between x and y.
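The same coefficient can be obtained with scipy; a minimal sketch using the data above:
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 6, 8, 10])
r, p_value = pearsonr(x, y)
print(round(r, 4))          # ~0.9938, a very strong positive correlation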
Fig. 6.58: Python code to calculate the correlation
Fig. 6.60: Output Plot correlation between two variables x, and y using Python code.
Now consider a second data set:
x: 1, 2, 3, 4, 5
y: 10, 7, 6, 4, 1
Fig. 6.61: Correlation Python code.
Fig. 6.62: Output result for correlation using Python code Which indicates a strong negative correlation.
Fig. 6.63: Output Plot correlation between two variables x, and y using Python code.
Finally, consider a data set with little linear association:
x: 1, 2, 3, 4, 5
y: 9, 8, 1, 11, 9
Fig. 6.64: Correlation python code.
Fig. 6.66: Output Plot correlation between two variables x, and y using Python code with r = 0.123.
6.8.1 Spearman's rank correlation coefficient
Spearman's rank correlation coefficient, often denoted by the Greek letter rho (ρ) or as r_s, is a
nonparametric measure of rank correlation that assesses the strength and direction of association
between two ranked variables. It is closely related to Pearson's correlation coefficient, but it is based on
ranks instead of actual values. This makes it more robust to outliers and non-normality in the data.
Independence: The observations in the data set should be independent of each other.
No ties: There should be no ties in the ranks of the data. If there are ties, they can be handled by
averaging the ranks.
Ordinality: The data should be at least ordinal, meaning that the order of the data points is meaningful.
Rank the data: Rank the values of each variable separately, giving the highest value a rank of 1, the
second highest value a rank of 2, and so on.
Calculate the differences between the ranks: For each pair of observations, calculate the difference
between their ranks.
Square the differences between the ranks: Square each of the differences in ranks.
Sum the squared differences: Sum the squared differences between ranks.
Calculate the Spearman's rank correlation coefficient: Use the following formula:
ρ = 1 - (6 Σdᵢ²) / (n(n² - 1))
where dᵢ is the difference between the ranks of each pair of observations and n is the number of observations.
6.8.2 Interpreting Spearman's rank correlation coefficient
The Spearman's rank correlation coefficient (ρ) ranges from -1 to 1, with 0 indicating no correlation and 1 indicating perfect positive correlation. A value of -1 indicates perfect negative correlation. In general, a value of ρ closer to 1 indicates a stronger positive correlation, while a value closer to -1 indicates a stronger negative correlation.
Assessing the relationship between two variables: Spearman's rank correlation coefficient
can be used to assess the strength and direction of association between two variables, even if
the variables are not normally distributed.
Comparing groups: Spearman's rank correlation coefficient can be used to compare the ranks
of two groups of observations.
Assessing ordinal data: Spearman's rank correlation coefficient is particularly useful for
analyzing ordinal data, where the order of the data points is meaningful but the actual values
are not.
To illustrate Spearman's rank correlation coefficient, let's consider an example of the scores of 5 students in Math and Science:
To calculate Spearman's rank correlation coefficient, we need to rank the data for each
variable and then calculate the differences between the ranks.
Rank the data for each variable:
Since there are no differences in ranks, the rank difference (d) for each pair of observations is
0.
ρ = 1 - (6 Σd²) / (n(n² - 1))
where Σd² is the sum of the squared rank differences and n is the number of observations.
ρ = 1 - (6 × 0) / (5 × (5² - 1))
  = 1 - 0 / 120
  = 1
Therefore, the Spearman's rank correlation coefficient for the given data is 1, indicating a perfect positive monotonic correlation between the Math and Science scores of the students.
Example 6.49: Spearman Rank correlation
Fig. 6.68: Spearman’s Rank Correlation using Python Code.
Example 6.50: Spearman Rank correlation
Fig. 6.70: Spearman’s Rank Correlation using Python Code.
Example 6.51: Spearman Rank correlation Plot Using Python Code.
Fig. 6.72: Spearman’s Rank Correlation using Python Code.
6.9 Regression analysis
Regression analysis is a statistical method used to model and analyze the relationship between a
dependent variable and one or more independent variables. It helps to understand how changes in
the independent variables are associated with changes in the dependent variable.
Linear regression is a fundamental statistical and machine learning technique used to model the
relationship between a dependent variable (target) and one or more independent variables
(predictors). The main goal of linear regression is to establish a linear relationship between the
dependent and independent variables, which can then be used to make predictions or understand
the influence of the predictors on the target variable.
Linear regression models the relationship between the dependent variable (y) and the independent variables (X) using a linear equation of the form:
y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
where y is the dependent variable, X₁, X₂, ..., Xₙ are the independent variables, β₀ is the intercept, β₁, β₂, ..., βₙ are the coefficients of the independent variables, and ε is the error term.
a) Linearity: The relationship between the dependent and independent variables is linear.
c) Homoscedasticity: The variance of the errors is constant across all levels of the independent
variables.
Estimating coefficients:
The coefficients (β₀, β₁, ..., βₙ) are estimated using a method called Ordinary Least Squares
(OLS). The OLS method aims to minimize the sum of the squared differences between the
observed values and the predicted values (i.e., the residuals).
Model evaluation:
To evaluate the performance of a linear regression model, several metrics can be used, such as:
a) Coefficient of determination (R²): Measures the proportion of variance in the dependent
variable that is predictable from the independent variables.
b) Mean Squared Error (MSE): Measures the average squared difference between the observed
and predicted values.
c) Root Mean Squared Error (RMSE): The square root of MSE, which is easier to interpret since
it is in the same units as the dependent variable.
Linear regression has a wide range of applications, such as predicting house prices, sales
forecasting, estimating the effect of marketing activities on revenue, and assessing the impact of
various factors on public health.
Keep in mind that linear regression has its limitations. It may not be suitable for modeling
nonlinear relationships or when the assumptions are violated. In such cases, other techniques like
polynomial regression, decision trees, or neural networks can be considered.
Let's work through a simple linear regression example with a single independent variable (simple linear regression).
Suppose we have the following data points representing the relationship between the number of years of experience (independent variable, X) and the corresponding annual salary (dependent variable, y) in thousands of dollars:
X: 1, 2, 3, 4, 5
y: 30, 35, 41, 45, 51
We want to create a linear model that predicts annual salary based on years of experience. The linear equation we want to estimate is of the form:
y = β₀ + β₁X
First, compute the sample means:
x̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3
ȳ = (30 + 35 + 41 + 45 + 51) / 5 = 40.4
The slope is estimated as
β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² = 52 / 10 = 5.2
and the intercept as
β₀ = ȳ - β₁x̄ = 40.4 - 5.2 × 3 = 24.8
so the fitted line is
y = 24.8 + 5.2 × X
For example, if someone has 6 years of experience, the predicted salary would be:
y = 24.8 + 5.2 × 6 = 56, i.e. about $56,000.
With this linear equation, we can now predict the annual salary based on the number of years of experience. Keep in mind that this is a simple example with a small dataset, and the model may not be very accurate for real-world applications.
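The coefficients can be confirmed with numpy's least-squares fit; a minimal sketch using the same data:
import numpy as np

X = np.array([1, 2, 3, 4, 5])            # years of experience
y = np.array([30, 35, 41, 45, 51])       # annual salary (thousands of dollars)

slope, intercept = np.polyfit(X, y, 1)   # ordinary least squares fit of a straight line
print(round(intercept, 1), round(slope, 1))    # 24.8 and 5.2
print(round(intercept + slope * 6, 1))         # predicted salary for 6 years: 56.0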
Now let's consider a multiple linear regression example with two independent variables.
Suppose we have data representing the relationship between years of experience, education level,
and annual salary:
Annual Salary (y): [30, 37, 49, 57, 60, 53, 70, 80, 65, 58]
We want to create a linear model that predicts the annual salary based on years of experience and
education level. The linear equation we want to estimate is:
y = β₀ + β₁X₁ + β₂X₂
We will use the numpy and statsmodels libraries to perform multiple linear regression in Python.
This code fits a multiple linear regression model to the given data points and prints the resulting
linear equation.
After running the code, you will obtain the fitted linear equation with estimated coefficients β₀, β₁, and β₂.
Now, we can use this linear equation to predict annual salaries based on years of experience and education level. For example, for someone with 5 years of experience and a Master's degree (X₂ = 2), the predicted salary is obtained by substituting X₁ = 5 and X₂ = 2 into the fitted equation.
Keep in mind that this example uses a small dataset and may not be very accurate for real-world
applications. However, it illustrates the process of multiple linear regression using Python.
Fig. 6.73: Linear Regression Equation using Python code.
Example 6.55: Multiple Linear Regression
Fig. 6.76: Output graph of multiple linear regression produced by the preceding Python code.
Polynomial regression is a type of regression analysis that models the relationship between a
dependent variable (usually denoted as "y") and an independent variable (usually denoted as "x")
as an nth-degree polynomial. In other words, instead of fitting a straight line (as in simple linear
regression), it fits a curve to the data points.
In general, the fitted model has the form
y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε
where:
β₀, β₁, β₂, ..., βₙ are the coefficients of the polynomial, representing the impact of each degree of
the independent variables on the dependent variable.
ε is the error term, representing the difference between the predicted and actual values of y.
The goal of polynomial regression is to find the optimal values for the coefficients (β₀, β₁, β₂, ...,
βₙ) that minimize the sum of squared errors (the difference between the predicted and actual
values) and provide the best fit to the data.
The degree of the polynomial (n) is a crucial parameter in polynomial regression. Higher-degree
polynomials can fit the training data very well, but they may suffer from overfitting, which
means they perform poorly on new, unseen data. Lower-degree polynomials are less likely to
overfit, but they may not capture the underlying relationships in the data as effectively.
To perform polynomial regression, you can use various statistical software packages or
programming languages like Python, R, or MATLAB. In these tools, you can find libraries and
functions to fit a polynomial regression model to your data, estimate the coefficients, and make
predictions.
Keep in mind that when working with polynomial regression, it's essential to evaluate the
model's performance on a separate test dataset to ensure it generalizes well to unseen data and
doesn't overfit to the training data. Techniques like cross-validation can be helpful in assessing
the model's performance.
Overall, polynomial regression is a flexible technique that allows you to capture more complex
relationships between variables, but it requires careful consideration of the polynomial degree
and potential overfitting issues.
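A minimal numpy sketch of polynomial regression, using small made-up data that follows a roughly quadratic pattern:
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 4.8, 9.7, 16.2, 24.9, 36.1])   # assumed values, roughly y ≈ x²

coeffs = np.polyfit(x, y, deg=2)    # fit a second-degree polynomial by least squares
poly = np.poly1d(coeffs)
print(poly)                         # the fitted quadratic equation
print(round(poly(7), 2))            # prediction at x = 7
Comparing fits of different degrees on held-out data is the practical way to choose the degree without overfitting.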
Example 6.56: polynomial regression
Fig. 6.77 : Polynomial Regression using Python code.
Fig. 6.78: Plot graph for Polynomial Regression using Python code.
6.10 Hypothesis testing
Hypothesis testing is a statistical method used to determine whether there is enough evidence to
support or reject a proposed hypothesis about a population parameter based on sample data. It
involves comparing the observed data to what would be expected under a null hypothesis, which
represents a default assumption about the population parameter. If the observed data is unlikely
to occur under the null hypothesis, the null hypothesis is rejected, and the alternative hypothesis
is supported.
a. Formulating the null and alternative hypotheses: The null hypothesis (H0) is the default
assumption about the population parameter, while the alternative hypothesis (Ha) is the
statement that contradicts the null hypothesis. For example, if we want to test whether a
new drug is effective in treating a particular disease, the null hypothesis would be that the
drug has no effect, while the alternative hypothesis would be that the drug is effective.
b. Choosing a level of significance: The level of significance (α) represents the probability
of rejecting the null hypothesis when it is actually true. The most common level of
significance is 0.05, which means that we are willing to accept a 5% chance of rejecting
the null hypothesis even if it is true.
c. Selecting a test statistic: The test statistic is a numerical value that measures the
difference between the observed data and what would be expected under the null
hypothesis. The choice of test statistic depends on the nature of the data and the
hypothesis being tested. Common test statistics include t-tests, z-tests, and chi-square
tests.
d. Computing the p-value: The p-value is the probability of obtaining a test statistic as
extreme or more extreme than the one observed, assuming that the null hypothesis is true.
A low p-value (less than the level of significance) indicates that the observed data is
unlikely to occur under the null hypothesis and suggests that the alternative hypothesis
may be true.
e. Interpreting the results: If the p-value is less than the level of significance, we reject the
null hypothesis and conclude that there is enough evidence to support the alternative
hypothesis. If the p-value is greater than the level of significance, we fail to reject the null
hypothesis and conclude that there is not enough evidence to support the alternative
hypothesis.
Hypothesis testing is widely used in many fields, including science, medicine, engineering, and
social sciences, to test theories, validate experimental results, and make decisions based on data.
It is important to note that hypothesis testing does not prove that the alternative hypothesis is
true, but rather provides evidence to support it or reject the null hypothesis.
Example 6.57: Hypothesis Test
Fig. 6.79: Hypothesis Test using Python code.
Fig. 6.80: Plot output result for Hypothesis Test using Python code.
6.10.1 p-value
The p-value is a statistical measure that helps determine whether the null hypothesis should be rejected. In hypothesis testing, the p-value represents the probability of obtaining a
test statistic as extreme as, or more extreme than, the one observed, assuming that the null
hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
For example, suppose we want to test whether the average height of students in a particular
school is significantly different from the national average of 68 inches. We can formulate the
null hypothesis as "the average height of students in the school is equal to 68 inches" and the
alternative hypothesis as "the average height of students in the school is different from 68
inches". We collect a sample of 50 students from the school and find that their average height is
70 inches, with a standard deviation of 3 inches. We can use a t-test to calculate the p-value of
the test statistic.
In this example, the p-value is very small (7.22e-20), which indicates strong evidence against the
null hypothesis. We can reject the null hypothesis and conclude that the average height of
students in the school is significantly different from the national average.
Note that the p-value can be interpreted as the probability of obtaining a test statistic as extreme
as, or more extreme than, the one observed, assuming that the null hypothesis is true. A p-value
less than 0.05 is generally considered statistically significant, indicating strong evidence against
the null hypothesis. However, the significance level can be set to a different value based on the
specific needs of the analysis.
Overall, the p-value is a crucial statistical measure that helps determine the strength of evidence
against the null hypothesis in hypothesis testing.
Example 6.58: p-value
Fig.6.81: p-value using Python code.
6.11 Model assumptions
a. Simplification: Models are simplifications of reality, designed to capture the essential features
of a system while disregarding certain complexities. Model assumptions help define the scope
and limitations of the model.
b. Linearity: Many models assume a linear relationship between variables, implying that the
effect of one variable on another is proportional and constant. Linear models are often simpler to
work with, but they may not capture more intricate relationships.
c. Independence: Models often assume that observations or variables are independent of each
other. This assumption implies that the behavior of one observation does not influence the
behavior of another. Violation of this assumption can lead to biased or inefficient estimates.
d. Normality: Some models assume that the variables or errors in the model follow a normal
distribution. This assumption facilitates various statistical tests and estimation techniques.
Departure from normality might affect the accuracy of the model's predictions.
g. Stationarity: In time series modeling, the assumption of stationarity is crucial. It assumes that
the statistical properties of the data, such as mean and variance, remain constant over time. Non-
stationarity can affect the model's ability to capture patterns and make accurate forecasts.
It's important to note that different models have different assumptions, and the appropriateness of
these assumptions depends on the specific context and data at hand. Model assumptions should
be carefully considered and validated to ensure the model's reliability and applicability to real-
world situations.
Example 6.59: Model Assumptions
Fig.6.84 : Output plot for Model Assumptions
Fig. 6.86: Output plot for Model Assumptions
Let's go through the output and interpretation of each part of the code.
- The scatter plot will show a visualization of the relationship between the independent variable
(X) and the dependent variable (y).
- Each data point represents an observation, with the x-coordinate corresponding to the value of
X and the y-coordinate corresponding to the value of y.
- By examining the scatter plot, you can visually assess if the relationship between X and y
appears to be linear. A roughly linear pattern suggests that the linearity assumption is reasonable.
- The code fits an ordinary least squares (OLS) regression model using sm.OLS() and obtains
the predicted values (y_pred).
- Residuals are calculated as the difference between the actual values (y) and the predicted
values (y_pred).
- The scatter plot of the predicted values (y_pred) against the residuals visualizes the residuals'
relationship with the predicted values.
- Homoscedasticity refers to the assumption that the variance of the residuals is constant across
all levels of the independent variable.
- In the scatter plot, you should check if the spread or dispersion of the residuals appears
consistent across different predicted values. A roughly equal spread of residuals suggests
homoscedasticity.
- The Q-Q plot compares the quantiles of the residuals against the quantiles of a theoretical
normal distribution.
- The Q-Q plot allows you to visually assess if the residuals follow a normal distribution.
- If the points on the plot closely align with the diagonal line (the line of expected normality), it
suggests that the residuals are normally distributed. Deviations from the diagonal line may
indicate departures from normality.
By examining the outputs from these code snippets, you can gain insights into the assumptions of
your linear regression model. It's important to consider these results and potentially take
appropriate actions, such as transforming variables or considering alternative modeling
approaches, if any of the assumptions are violated.
6.12 Model evaluation
Model evaluation is an essential aspect of machine learning and data analysis. It involves
assessing the performance and quality of a predictive model to determine its effectiveness in
making accurate predictions or classifications. The evaluation process helps us understand how
well a model generalizes to unseen data and allows us to make informed decisions about its
deployment.
There are several key aspects and techniques involved in model evaluation. Let's delve into the
details:
a. Training and Test Sets: To evaluate a model, it is crucial to split the available data into
training and test sets. The training set is used to train the model, while the test set is used to
evaluate its performance. The idea behind this separation is to simulate the real-world scenario,
where the model is presented with unseen data.
b. Evaluation Metrics: Evaluation metrics quantify the performance of a model. The choice of
metrics depends on the problem type (e.g., regression, classification) and the specific
requirements of the task. Common evaluation metrics include accuracy, precision, recall, F1-score, mean squared error (MSE), and area under the receiver operating characteristic curve
(AUC-ROC).
c. Confusion Matrix: A confusion matrix is a useful tool for evaluating classification models. It
provides a detailed breakdown of correct and incorrect predictions, highlighting true positives,
true negatives, false positives, and false negatives. From the confusion matrix, various metrics
like accuracy, precision, recall, and F1-score can be derived.
e. Overfitting and Underfitting: Overfitting occurs when a model performs exceptionally well
on the training data but fails to generalize to new, unseen data. Underfitting, on the other hand,
occurs when a model fails to capture the underlying patterns in the data and performs poorly on
both the training and test sets. Model evaluation helps identify and address these issues by
finding the right balance of model complexity.
h. Model Comparison: Model evaluation allows for the comparison of multiple models to
identify the best-performing one. This can involve comparing evaluation metrics, conducting
statistical tests, or using resampling techniques like bootstrapping.
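A compact sketch with scikit-learn ties these ideas together (train/test split, accuracy, and a confusion matrix); the synthetic dataset is assumed purely for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X, y = make_classification(n_samples=200, n_features=5, random_state=42)   # synthetic binary data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))     # precision, recall, and F1-score per class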
Example 6.60: Evaluation Model
6.13 t-test
The t-distribution is a probability distribution that is used in statistical hypothesis testing when
the population standard deviation is unknown. It is a bell-shaped distribution, similar to the
normal distribution, but with thicker tails. This means that the t-distribution is more spread out
than the normal distribution, and is therefore more likely to produce extreme values.
The t-distribution is defined by its degrees of freedom (df), which is the number of independent
observations in the sample minus one. The larger the df, the more closely the t-distribution
resembles the normal distribution.
The t-test is a statistical test that is used to compare the means of two groups. It is a parametric
test, which means that it assumes that the data come from a normally distributed population. If
the population standard deviation is known, then the z-test can be used instead of the t-test.
However, if the population standard deviation is unknown, then the t-test must be used.
The t-test statistic is calculated by dividing the difference between the sample means by the
standard error of the difference between the means. The standard error of the difference between
the means is calculated using the sample standard deviations and the sample sizes of the two
groups.
The t-distribution is a versatile tool that can be used in a variety of statistical applications, and the t-test is commonly applied in fields such as psychology, education, and medicine.
The t-test assumes that the observations are independent and that the data come from approximately normally distributed populations; the standard independent-samples form also assumes equal variances in the two groups. If these assumptions are not met, then the results of the t-test may be unreliable.
The t-test is a statistical test used to determine if there is a significant difference between the
means of two groups. It is commonly used to compare the means of two samples to determine if
they are likely to have come from the same population or from different populations with
different means. There are three main types of t-tests:
a. Independent Samples t-test: This is used when you have two separate, unrelated groups
of data, and you want to compare their means. For example, you might compare the test
scores of students from two different schools to see if there is a significant difference in
performance.
Formula:
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
where x̄₁ and x̄₂ are the sample means, s₁² and s₂² are the sample variances, and n₁ and n₂ are the sample sizes of the two groups.
The resulting t value is then compared to the critical value from the t-distribution with the appropriate degrees of freedom (n₁ + n₂ − 2 when equal variances are assumed) to determine whether the difference between the means is statistically significant. If the calculated t value is greater than the critical value, then you can conclude that there is a significant difference between the means of the two groups.
b. Paired Samples t-test: This is used when you have two sets of related data, often collected
from the same individuals or units at different times or under different conditions. For example,
you might compare the test scores of students before and after they take a tutoring program to
see if there is a significant difference in their performance.
Formula:
Let dᵢ represent the difference between the paired observations for each subject i, and let d̄ be the mean of these differences:
d̄ = (Σ dᵢ) / n
where n is the number of pairs. The standard deviation of the differences, s_d, can be calculated as follows:
s_d = √[ Σ (dᵢ − d̄)² / (n − 1) ]
The test statistic is then:
t = d̄ / (s_d / √n)
This t value can then be compared to the critical value from the t-distribution with n-1 degrees of
freedom to determine the statistical significance of the difference between the means of paired
groups.
c. One-Sample t-test: This is used when you have a single sample and you want to
compare its mean to a known population mean or a hypothesized value. For example, you
might compare the mean height of a sample of students to the national average height to
see if there is a significant difference.
Formula:
t = (x̄ − μ₀) / (s / √n)
where x̄ is the sample mean, μ₀ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
The resulting t value is then compared to the critical value from the t-distribution with n-1
degrees of freedom to determine whether the difference between the sample mean and the
hypothesized population mean is statistically significant. If the calculated t value is greater than
the critical value, you can conclude that there is a significant difference between the sample
mean and the hypothesized population mean.
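As a minimal sketch of how the paired and one-sample versions can be run in Python, the following uses scipy.stats; the score and height arrays are hypothetical, not taken from the book's examples.

# Paired and one-sample t-tests with SciPy (hypothetical data).
from scipy import stats

# Paired samples: scores of the same students before and after a tutoring program.
before = [72, 68, 75, 80, 66, 71, 78, 69]
after  = [75, 70, 78, 83, 70, 74, 80, 73]
t_paired, p_paired = stats.ttest_rel(before, after)
print("Paired t-test:     t =", round(t_paired, 3), " p =", round(p_paired, 4))

# One sample: compare a sample mean to a hypothesized population mean of 170 cm.
heights = [168, 172, 169, 175, 171, 167, 174, 170, 173, 169]
t_one, p_one = stats.ttest_1samp(heights, popmean=170)
print("One-sample t-test: t =", round(t_one, 3), " p =", round(p_one, 4))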
Example 6.61: Independent samples t-test
Let’s consider an example using an independent samples t-test. Suppose we want to compare the
mean test scores of students from two different classrooms, A and B. We have the following
data:
Classroom A:
Classroom B:
t ≈ -1.766
Since the p-value (0.115) is greater than 0.05, we fail to reject the null hypothesis: there is no significant difference in the average test scores between the two groups.
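The same comparison can be run with scipy.stats.ttest_ind. Because the classroom scores themselves are not listed above, the arrays below are placeholders, so the resulting t and p values will differ from the t ≈ -1.766 and p = 0.115 reported in Example 6.61.

# Independent-samples t-test for two classrooms (placeholder scores).
from scipy import stats

classroom_a = [78, 74, 81, 69, 85, 77, 72, 80]
classroom_b = [82, 79, 88, 75, 90, 84, 78, 86]

t_stat, p_value = stats.ttest_ind(classroom_a, classroom_b)
print("t =", round(t_stat, 3), " p =", round(p_value, 4))

if p_value > 0.05:
    print("Fail to reject H0: no significant difference in average scores.")
else:
    print("Reject H0: the average scores differ significantly.")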
Example 6.62: t-test
Example 6.63: t-test
Fig. 6.92: t-test for two classrooms using Python code.
Example 6.65: t-test
6.14 F-test
The F-test, also known as Fisher's F-test or the variance ratio test, is a statistical test used to
compare the variances of two or more groups or samples. It is commonly used in analysis of
variance (ANOVA) and regression analysis to assess whether the variances of the groups are
equal or significantly different. The F-test is based on the F-distribution, which is a probability
distribution that arises when comparing the variances of different groups.
I. Equality of Variances:
In this scenario, the F-test is used to compare the variances of two or more groups to determine if they are statistically equal. The null hypothesis (H0) is that the variances of the groups are equal. The alternative hypothesis (Ha) is that the variances are not equal. If the F-test produces a large F-statistic and a small p-value, we reject the null hypothesis and conclude that the variances are significantly different.
II. ANOVA (Analysis of Variance):
In ANOVA, the F-test is used to assess whether there are statistically significant differences
among the means of three or more groups. It compares the variability between the group means
to the variability within the groups. The null hypothesis (H0) in ANOVA is that all group means
are equal. The alternative hypothesis (Ha) is that at least one group mean is different. If the F-test
results in a large F-statistic and a small p-value, we reject the null hypothesis and conclude that
there are significant differences among the group means.
III. F-Statistic:
The F-statistic is a numerical value that measures the ratio of the variance between the groups (treatments) to the variance within the groups (residuals). For the equality of variances test, the F-statistic is calculated as:
F = s₁² / s₂²
For ANOVA, the F-statistic is obtained from the sum of squares between groups and the sum of squares within groups, which are components of the variance.
IV. Degrees of Freedom:
The degrees of freedom are used to calculate the critical value of the F-distribution and the p-value. For the two-sample equality of variances test, the degrees of freedom are n₁ − 1 for the numerator and n₂ − 1 for the denominator, where n₁ and n₂ are the sample sizes of the two groups.
For ANOVA, the degrees of freedom for the numerator is k - 1 (where k is the number of groups), and the degrees of freedom for the denominator is N - k (where N is the total number of observations).
V. The F-Distribution:
The F-distribution is positively skewed and depends on two degrees of freedom: df1 (numerator) and df2 (denominator). The critical value of the F-distribution at a given significance level (alpha) is used to determine if the F-statistic is statistically significant. If the calculated F-statistic is greater than the critical value, we reject the null hypothesis.
Here, s₁² and s₂² are the variances of the two samples. It can be shown that the sampling distribution of such a ratio, appropriately called a variance ratio, is a continuous distribution called the F distribution. This distribution depends on the two parameters n₁ − 1 and n₂ − 1; if the calculated statistic exceeds the critical value of this distribution at the chosen level of significance, we reject the null hypothesis and accept the alternative hypothesis.
VI. p-value:
The p-value is the probability of observing the data or more extreme data under the assumption
that the null hypothesis is true. A small p-value (typically less than the chosen significance level,
such as 0.05) indicates that we can reject the null hypothesis in favor of the alternative
hypothesis.
VII. Decision:
Based on the p-value and the chosen significance level, we make a decision whether to reject or
fail to reject the null hypothesis. If the p-value is less than the significance level, we reject the
null hypothesis and conclude that there is a significant difference (for equality of variances) or
significant differences among the means (for ANOVA). If the p-value is greater than or equal to
the significance level, we fail to reject the null hypothesis and do not find significant evidence of
differences.
The F-test is widely used in various fields, including experimental research, quality control, and
regression analysis. It helps researchers determine if there are significant differences between
groups and assists in understanding the sources of variability in data.
310
Example 6.66: F-Distribution
311
Fig. 6.98: Output graph for F-distribution using Python code.
Example 6.68:
In this example, we used the f_oneway function from the scipy.stats library to perform the one-
way ANOVA. The function takes the exam scores of each group as input and returns the F-
statistic and p-value.
Interpretation of results:
F-statistic: The computed F-statistic is approximately 11.944134078212295.
Since the p-value (0.0013975664568248779) is less than the typical significance level of 0.05,
we reject the null hypothesis. Therefore, we conclude that there is a significant difference in
mean exam scores among the three groups.
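A minimal sketch of the f_oneway call described above is shown below. The exam scores are hypothetical stand-ins (the book's own data appear in the omitted code figure), so the output will not reproduce the F and p values quoted here.

# One-way ANOVA with scipy.stats.f_oneway (hypothetical exam scores).
from scipy.stats import f_oneway

group1 = [85, 88, 90, 78, 84, 91]
group2 = [72, 75, 70, 78, 74, 69]
group3 = [80, 83, 79, 85, 81, 84]

f_stat, p_value = f_oneway(group1, group2, group3)
print("F-statistic:", round(f_stat, 3))
print("p-value    :", round(p_value, 5))

if p_value < 0.05:
    print("Reject H0: at least one group mean differs.")
else:
    print("Fail to reject H0: no significant difference among group means.")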
The F-test assumes that the observations are independent, that the data within each group are approximately normally distributed, and that the group variances are equal. If any of these assumptions are violated, the results of the F-test may not be valid. It is essential to check these assumptions before interpreting the results and to consider alternative methods, such as Welch's ANOVA or the Kruskal-Wallis test, if the assumptions are not met.
In conclusion, the F-test (one-way ANOVA) is a powerful statistical tool for comparing means
among three or more groups. It allows researchers to determine whether observed differences in
means are statistically significant, helping to draw conclusions about the population and identify
significant factors that contribute to variability in data.
Let's consider another example of performing the F-test, this time for comparing the variances of
two different samples.
Suppose we have data from two groups of students, Group X and Group Y, and we want to
determine if there is a significant difference in the variances of their exam scores.
Here's the Python code to perform the F-test for comparing variances:
Fig. 6.102: Output of F-statistic using Python code.
In this example, we manually calculated the F-statistic and the corresponding p-value for
comparing the variances of the two groups. The np.var function is used to calculate the sample
variances, and f.cdf from scipy.stats is used to calculate the cumulative distribution function
(CDF) of the F-distribution.
Interpretation of results:
Since the p-value (0.1864127331234633) is greater than the typical significance level of 0.05, we
do not reject the null hypothesis. Therefore, we conclude that there is no significant difference in
the variances of exam scores between Group X and Group Y.
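A minimal sketch of this manual calculation is shown below, using hypothetical score arrays for Group X and Group Y (so the p-value will differ from the 0.186 reported above).

# Manual F-test for equality of two variances (hypothetical data).
import numpy as np
from scipy.stats import f

group_x = [78, 85, 69, 91, 74, 88, 80, 77, 83, 90]
group_y = [72, 75, 71, 78, 74, 76, 73, 77, 75, 74]

# Sample variances (ddof=1 gives the unbiased estimator).
var_x = np.var(group_x, ddof=1)
var_y = np.var(group_y, ddof=1)

# By convention, put the larger variance in the numerator.
if var_x >= var_y:
    f_stat, df1, df2 = var_x / var_y, len(group_x) - 1, len(group_y) - 1
else:
    f_stat, df1, df2 = var_y / var_x, len(group_y) - 1, len(group_x) - 1

# Two-tailed p-value from the F cumulative distribution function.
p_value = min(2 * (1 - f.cdf(f_stat, df1, df2)), 1.0)
print("F =", round(f_stat, 3), " p =", round(p_value, 4))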
It's important to note that the F-test for comparing variances is sensitive to the assumption of normality. If the data are not normally distributed, the F-test may not provide accurate results, and alternative tests such as Levene's test may be more appropriate.
Also, if the sample sizes are small, the F-test may not perform well due to limited statistical power. In such cases, other methods like Bartlett's test can be used.
In summary, the F-test for comparing variances is a useful tool to assess whether the variability
of two samples is significantly different. It helps researchers determine whether samples can be
assumed to have equal variances, which is important in various statistical analyses and
hypothesis testing.
6.15 Chi-Square Test
Chi-Square (χ²) is a statistical test used to determine if there is a significant association between
two categorical variables. It is commonly used in the field of statistics, especially in hypothesis
testing and analyzing data with categorical outcomes. The Chi-Square test assesses whether there
is a difference between the expected and observed frequencies in a contingency table.
The Chi-Square test can be applied to categorical data with two or more categories, and it can be
used to answer questions like:
Are the proportions in different categories significantly different from each other?
Let's go into more detail about how the Chi-Square test works:
I. The chi-square goodness of fit test: This test is used to test whether the observed distribution
of a categorical variable is different from a hypothesized distribution. For example, you could
use a chi-square goodness of fit test to test whether the distribution of eye colors in a population
is different from the expected distribution of 50% brown eyes, 25% blue eyes, and 25% green
eyes.
II. The chi-square test of independence: This test is used to test whether two categorical
variables are independent of each other. For example, you could use a chi-square test of
independence to test whether the distribution of eye colors is independent of the distribution
of hair colors in a population.
Contingency Table:
A contingency table, also known as a cross-tabulation table, displays the frequency distribution
of two categorical variables. The rows represent one variable, and the columns represent the
other variable. Each cell in the table contains the count of observations that fall into a specific
combination of categories from both variables.
Example 6.71: consider a study where we want to examine the relationship between smoking
habit (categories: "Smoker" and "Non-Smoker") and the development of a certain lung disease
(categories: "Disease" and "No Disease"). The contingency table might look like this:
             Disease   No Disease
Smoker          30          20
Non-Smoker      10          40
Null Hypothesis (H0): There is no significant association between the two categorical variables. The observed frequencies are similar to the expected frequencies.
Alternative Hypothesis (Ha): There is a significant association between the two categorical variables. The observed frequencies are significantly different from the expected frequencies.
Expected Frequencies:
To perform the Chi-Square test, we need to calculate the expected frequencies for each cell in the
contingency table. The expected frequency is the count we would expect if there were no
association between the variables, assuming the null hypothesis is true. The formula to calculate
the expected frequency for a cell (i, j) in a contingency table with r rows and c columns is:
E(i, j) = (row i total × column j total) / grand total
Chi-Square Statistic:
The Chi-Square statistic is calculated by comparing the observed frequencies with the expected
frequencies for each cell in the contingency table. The formula to calculate the Chi-Square
statistic is:
χ² = Σ (O − E)² / E
where O is the observed frequency, E is the expected frequency, and the summation (Σ) is taken over all cells in the contingency table.
The degrees of freedom for the Chi-Square test depend on the dimensions of the contingency table: for a table with r rows and c columns, the degrees of freedom is (r - 1) * (c - 1). For a 2x2 contingency table (2 rows and 2 columns), this gives 1 degree of freedom.
Once we have the Chi-Square statistic and the degrees of freedom, we can compare the test
statistic to the Chi-Square distribution to obtain the p-value. The p-value represents the
probability of observing the data or more extreme data under the assumption of the null
hypothesis. A small p-value (typically less than the chosen significance level, such as 0.05)
indicates that we can reject the null hypothesis in favor of the alternative hypothesis.
Decision:
Finally, based on the p-value and the chosen significance level, we make a decision whether to
reject or fail to reject the null hypothesis. If the p-value is less than the significance level, we
reject the null hypothesis and conclude that there is a significant association between the two
categorical variables. If the p-value is greater than or equal to the significance level, we fail to
reject the null hypothesis and do not find significant evidence of an association.
Chi-Square test can be performed using statistical software packages or programming languages
like Python, R, or MATLAB. In Python, you can use libraries like scipy.stats.chi2_contingency
from SciPy to perform the Chi-Square test on a contingency table and calculate the Chi-Square
statistic, p-value, and expected frequencies.
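As a minimal sketch, the smoking/disease table from Example 6.71 can be tested directly with chi2_contingency:

# Chi-square test of independence for the smoking/disease table of Example 6.71.
from scipy.stats import chi2_contingency

#                Disease  No Disease
observed = [[30, 20],     # Smoker
            [10, 40]]     # Non-Smoker

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square statistic:", round(chi2, 3))
print("Degrees of freedom  :", dof)
print("p-value             :", round(p_value, 5))
print("Expected frequencies:\n", expected)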
The chi-square test is a relatively simple test to perform, but it is important to understand the
assumptions that are made when using it. The main assumption of the chi-square test is that the
expected values in each cell of the contingency table are large enough. This means that the
expected values should be at least 5. If the expected values are not large enough, then the chi-
square test may not be accurate.
The chi-square test is a powerful tool for testing hypotheses about categorical data. However, it
is important to use it correctly and to understand the assumptions that are made when using it. Carrying out the test involves the following steps:
a. State the null and alternative hypotheses. The null hypothesis is the hypothesis that there
is no difference between the observed and expected data. The alternative hypothesis is the
hypothesis that there is a difference between the observed and expected data.
b. Calculate the chi-square statistic. The chi-square statistic is calculated by comparing the
observed and expected values in each cell of the contingency table.
c. Determine the p-value. The p-value is the probability of obtaining the observed chi-
square statistic if the null hypothesis is true.
d. Make a decision about the null hypothesis. If the p-value is less than the significance
level, then the null hypothesis is rejected. This means that there is sufficient evidence to
support the alternative hypothesis. If the p-value is greater than the significance level,
then the null hypothesis is not rejected. This means that there is not enough evidence to
support the alternative hypothesis.
The significance level is the probability of making a Type I error. A Type I error is rejecting the
null hypothesis when it is true. The default significance level is 0.05, which means that there is a
5% chance of making a Type I error.
Suppose you want to test whether the distribution of students' grades in a class follows an
expected grade distribution. You have 150 students, and you expect the grade distribution to be
as follows:
A: 20%
B: 30%
C: 25%
D: 15%
E: 10%
You count the actual number of students who received each grade and want to test if it matches
the expected distribution, while the observed values are:
A: 30
B: 40
C: 35
D: 20
E: 25
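A minimal sketch of this goodness-of-fit test with scipy.stats.chisquare, using the expected percentages and observed counts listed above, is shown below.

# Chi-square goodness-of-fit test for the grade distribution (150 students).
from scipy.stats import chisquare

observed = [30, 40, 35, 20, 25]                  # A, B, C, D, E
expected_props = [0.20, 0.30, 0.25, 0.15, 0.10]
expected = [p * 150 for p in expected_props]     # [30, 45, 37.5, 22.5, 15]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print("Chi-square:", round(chi2, 3), " p-value:", round(p_value, 4))

if p_value < 0.05:
    print("Reject H0: the observed grades do not match the expected distribution.")
else:
    print("Fail to reject H0: the observed grades are consistent with the expected distribution.")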
Example 6.73: Chi-Square distribution
Fig. 6.104: Output graph for Chi-Square distribution using Python code.
Fig. 6.105: Chi-Square test using Python code.
Example 6.75: chi-square test of independence
Let's consider a numerical example to demonstrate the chi-square test using a hypothetical
dataset. Suppose we want to investigate whether there is a significant association between gender
and favorite color among a group of people. Here's the observed data:
The calculation gives expected frequencies of 43.2, 64.8, 54, and 55.2, and a test statistic of χ² = 2.016.
Since the p-value (0.364) is greater than the significance level of 0.05, the null hypothesis is not rejected.
Equivalently, you could consult a chi-square distribution table with the appropriate degrees of freedom (here d.f. = (rows - 1) * (columns - 1) = 2) at your chosen significance level to find the critical value. If the calculated χ² value were greater than the critical value, you would reject the null hypothesis and conclude that there is a significant association between gender and favorite color.
Fig. 6.108: Chi-square test using Python code.
Fig. 6.109: Statistical output for chi-square test using Python code.
Fig. 6.110: Output graph for Observed data and Expected Frequency using Python code.
Fig. 6.111: Output curve for Chi-Square Distribution using Python code.
CHAPTER SEVEN: Case studies
7.1 CASE STUDY #1 : Exploratory Data Analysis for Retail Sales
Here's a simple case study for a data science project using Python. Let's consider a scenario
where you're working for a retail company that wants to analyze their sales data to make data-
driven decisions. The goal of this case study is to perform exploratory data analysis (EDA) on
the sales data using Python.
Problem Statement: The retail company wants to gain insights from their sales data to
understand trends, patterns, and make informed business decisions.
Dataset: The dataset contains information about sales transactions, including the date of
purchase, product ID, quantity sold, price, and customer ID.
Objectives:
Example 7.1: Case study #1
Fig. 7.1: Exploratory Data Analysis for Retail Sales using Python code.
Fig. 7.2: Output results for Exploratory Data Analysis for Retail Sales using Python code.
Fig. 7.3: Output graph for selling products using Python code.
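The full analysis appears in the figures above. As a minimal, self-contained sketch of the kind of exploratory analysis involved, the following uses a small synthetic sales table with the columns described in the dataset; the real data and file are not shown in the text.

# Minimal EDA sketch for retail sales (synthetic data standing in for the real dataset).
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-17", "2023-02-02",
                            "2023-02-20", "2023-03-08", "2023-03-25"]),
    "product_id": ["P1", "P2", "P1", "P3", "P2", "P1"],
    "quantity": [3, 1, 2, 5, 4, 1],
    "price": [10.0, 25.0, 10.0, 7.5, 25.0, 10.0],
    "customer_id": ["C1", "C2", "C1", "C3", "C2", "C4"],
})

# Basic structure and summary statistics.
sales.info()
print(sales.describe())

# Revenue per transaction and total revenue per product.
sales["revenue"] = sales["quantity"] * sales["price"]
print(sales.groupby("product_id")["revenue"].sum().sort_values(ascending=False))

# Monthly sales trend.
print(sales.set_index("date").resample("M")["revenue"].sum())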
7.2 Case Study # 2: Sentiment Analysis on Customer Reviews
Let's explore another case study involving sentiment analysis on customer reviews using Python.
In this scenario, you'll work with a dataset of customer reviews for a product and build a
sentiment analysis model to classify each review as positive, negative, or neutral.
Problem Statement: The company wants to understand the sentiment of customer reviews about
their product in order to identify areas of improvement and track customer satisfaction.
Dataset: The dataset contains customer reviews along with their corresponding sentiment labels
(positive, negative, or neutral).
Objectives:
Preprocess the text data by removing stopwords and performing text normalization.
Fig. 7.5: Sentiment Analysis on Customer Reviews using Python code.
Fig. 7.6: Output results for Sentiment Analysis on Customer Reviews using Python code.
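A minimal, self-contained sketch of this kind of sentiment classifier is given below; the review texts and the choice of TF-IDF features with logistic regression are illustrative assumptions, while the case study's own code appears in the figures above.

# Toy sentiment-analysis sketch: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Great product, works perfectly and arrived fast",
    "Terrible quality, broke after two days",
    "It is okay, nothing special but does the job",
    "Absolutely love it, best purchase this year",
    "Very disappointed, would not recommend",
    "Average experience, acceptable for the price",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# TfidfVectorizer removes English stopwords and builds normalized text features.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression(max_iter=1000))
model.fit(reviews, labels)

print(model.predict(["The product is wonderful and exceeded expectations",
                     "Awful, a complete waste of money"]))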
7.3 Case Study #3: Customer Churn Prediction
Let's explore a different case study involving customer churn prediction using machine learning.
In this scenario, you'll work with a dataset of customer information and their historical behavior
to build a model that predicts whether a customer will churn (leave) the company.
Problem Statement: The company wants to predict which customers are likely to churn so that
they can take proactive measures to retain them.
Dataset: The dataset contains customer information including features like contract type,
monthly charges, tenure, and whether the customer churned or not.
Objectives:
Preprocess the data by encoding categorical variables and handling missing values.
Fig. 7.7: Customer churn prediction using Python code.
Fig. 7.8: Output statistical results using Python code.
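A minimal sketch of this churn-prediction workflow on synthetic data is shown below; the real dataset, feature set, and model appear in the figures above.

# Churn-prediction sketch: synthetic customer data, one-hot encoding, logistic regression.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    "contract_type": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
    "monthly_charges": rng.uniform(20, 120, size=n).round(2),
    "tenure": rng.integers(1, 72, size=n),
})
# Synthetic churn label: short-tenure, high-charge customers churn more often.
churn_prob = 1 / (1 + np.exp(0.08 * data["tenure"] - 0.02 * data["monthly_charges"]))
data["churned"] = (rng.uniform(size=n) < churn_prob).astype(int)

# Encode the categorical feature and split the data.
X = pd.get_dummies(data[["contract_type", "monthly_charges", "tenure"]], drop_first=True)
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))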
CHAPTER EIGHT: Data Science Relationships
8.1 Relationship between data science, machine learning, and artificial intelligence
Data science, machine learning, and artificial intelligence are interconnected fields that build
upon each other, contributing to various aspects of modern technology and problem-solving.
Let's delve into the details of their relationships and definitions:
a. Data Science:
Data science involves the extraction of insights and knowledge from large and complex datasets.
It combines expertise in various domains, including statistics, computer science, domain
knowledge, and data visualization, to make informed decisions and predictions. The key steps in
data science include data collection, cleaning, exploration, analysis, visualization, and
interpretation.
b. Machine Learning:
Machine learning is a branch of artificial intelligence in which algorithms learn patterns from data and improve their performance on a task through experience, rather than being explicitly programmed for every case. Common paradigms include supervised learning, unsupervised learning, and reinforcement learning, and typical tasks include classification, regression, and clustering.
c. Artificial Intelligence:
Artificial intelligence (AI) is a broader concept that involves creating machines and software
capable of intelligent behavior, similar to human cognitive functions. AI encompasses a wide
range of techniques, including machine learning, natural language processing, computer vision,
robotics, and expert systems. AI systems aim to perform tasks that typically require human
intelligence, such as understanding language, recognizing patterns, making decisions, and
solving complex problems.
Relationships:
Data Science and Machine Learning: Data science heavily relies on machine learning
techniques to extract insights from data. Machine learning algorithms are used to build
predictive models and make data-driven decisions in various domains. Data scientists
leverage machine learning algorithms to analyze patterns and trends in datasets to gain
insights and make predictions.
Machine Learning and Artificial Intelligence: Machine learning is a subset of AI, and
it's a crucial component that enables AI systems to learn and improve from data. Machine
learning algorithms power many AI applications, including natural language processing,
image recognition, recommendation systems, and autonomous vehicles.
Data Science and Artificial Intelligence: Data science provides the foundation for
creating intelligent AI systems. AI systems require vast amounts of data for training and
improving their performance. Data science helps AI systems collect, preprocess, and
analyze data to make accurate predictions and decisions.
In essence, data science provides the tools and methodologies to gather and process data,
machine learning empowers systems to learn from the data, and artificial intelligence
encompasses the broader goal of creating intelligent machines that can perform human-like
tasks.
It's important to note that these fields are rapidly evolving, and advancements in one field often
contribute to progress in the others. The synergy between data science, machine learning, and
artificial intelligence continues to shape the landscape of modern technology and innovation.
Example 8.1:
Fig. 8.1: Create and save data.csv file using Python code.
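The contents of data.csv are not shown in the text; as a sketch, the step in Fig. 8.1 of creating and saving a small CSV file with pandas might look like the following, where the column names and values are hypothetical.

# Create a small DataFrame and save it as data.csv (hypothetical columns and values).
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2.1, 4.0, 6.2, 7.9, 10.1],
})
df.to_csv("data.csv", index=False)
print(pd.read_csv("data.csv"))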
Example 8.2: Linear regression model
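The code for Example 8.2 is not reproduced here; a minimal sketch of fitting a linear regression model with scikit-learn on the small hypothetical data.csv created above might be:

# Fit a simple linear regression model to the data saved in data.csv.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")      # hypothetical file created in the previous sketch
X = df[["x"]]                     # feature matrix (2-D)
y = df["y"]                       # target vector

model = LinearRegression().fit(X, y)
print("Slope    :", round(model.coef_[0], 3))
print("Intercept:", round(model.intercept_, 3))
print("Predicted y at x=6:", round(model.predict(pd.DataFrame({"x": [6]}))[0], 3))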
Example 8.3: Build CNN model
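The code for Example 8.3 is omitted from the text; a minimal sketch of a small convolutional neural network, assuming TensorFlow/Keras is installed and using randomly generated images as stand-in data, might be:

# Minimal CNN sketch (random 28x28 grayscale images, 10 classes).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

X = np.random.rand(100, 28, 28, 1).astype("float32")   # stand-in images
y = np.random.randint(0, 10, size=100)                  # stand-in labels

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
model.summary()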
Example 8.4: Generate Sample sentiment analysis data
Fig. 8.7: SVM model output using Python code.
8.2 Relationship between data science and bioinformatics
a. Overview:
Data science and bioinformatics are two closely intertwined fields that have a profound impact
on modern biological and medical research. They combine techniques from computer science,
statistics, and domain-specific knowledge to extract meaningful insights from large and complex
biological datasets. The relationship between data science and bioinformatics is symbiotic, with
data science providing the tools and methods necessary to analyze biological data and
bioinformatics guiding the application of these methods in the context of biological research.
Advancements in technology have led to the generation of massive amounts of biological data,
such as genomic sequences, proteomic profiles, medical images, and clinical records. This data
deluge has created a need for sophisticated techniques to process, analyze, and interpret this
information effectively.
Data Preprocessing: Both fields involve the cleaning, normalization, and transformation
of raw data to ensure accurate and meaningful analysis. In bioinformatics, this includes
tasks like quality control of genomic sequences or removing noise from protein data.
Feature Extraction: Feature extraction methods in data science, such as dimensionality
reduction and feature selection, find applications in bioinformatics for identifying
relevant features in biological data. For example, identifying important genes or proteins
related to a specific disease.
Statistical Analysis: Statistical methods are used to identify patterns, correlations, and
significant differences in biological datasets. In bioinformatics, statistical techniques help
researchers understand the significance of genetic variations or differential gene
expression.
Machine Learning: Both fields utilize machine learning algorithms for predictive
modeling, classification, clustering, and regression tasks. In bioinformatics, machine
learning aids in predicting protein structures, classifying diseases based on genomic
profiles, and drug discovery.
Data Visualization: Visualizations play a crucial role in communicating complex biological insights to researchers and clinicians. Interactive visualizations help understand genetic relationships, expression patterns, and evolutionary trees.
d. Applications:
Genomics: Data science techniques are instrumental in analyzing DNA sequences,
identifying genes, predicting gene functions, and understanding genetic variations
associated with diseases.
Proteomics: Bioinformatics and data science collaborate to analyze protein structures,
interactions, and functions, leading to insights into cellular processes and drug discovery.
Medical Diagnostics: Data-driven approaches aid in disease diagnosis, prognosis, and
treatment planning by analyzing patient data, medical images, and clinical records.
Pharmaceuticals: Data science helps in drug discovery, designing molecular structures,
and predicting drug-target interactions.
e. Challenges:
Data Integration: Integrating data from diverse sources with varying formats is a challenge in
bioinformatics. Data science techniques enable the harmonization and integration of multi-omics
data.
Scalability: Biological datasets can be massive and complex. Data science methods need to
handle big data efficiently.
f. Future Directions:
The relationship between data science and bioinformatics will continue to grow stronger with
advancements in machine learning, deep learning, and big data analytics. Innovative applications
in personalized medicine, precision agriculture, and synthetic biology are expected to emerge.
In essence, the relationship between data science and bioinformatics is marked by the synergistic
combination of computational methods and biological knowledge. Together, they pave the way
for breakthroughs in understanding biology, diseases, and ultimately improving human health.
Example 8.6: Relationship between data science and Bioinformatics
Fig. 8.8: Relationship between data science and Bioinformatics using Python code.
Fig. 8.9: Output statistical results for the relationship between data science and Bioinformatics using Python code.
Example 8.7: Gene Expression
Fig. 8.11: Gene Expression using Python Code.
Example 8.8: Generate random DNA Sequences
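The code for Example 8.8 is not shown; a minimal sketch of generating random DNA sequences might look like this:

# Generate a few random DNA sequences from the four nucleotide bases.
import random

random.seed(42)            # reproducible output
bases = "ACGT"

def random_dna(length):
    """Return a random DNA sequence of the given length."""
    return "".join(random.choice(bases) for _ in range(length))

for i in range(1, 4):
    seq = random_dna(20)
    gc = 100 * (seq.count("G") + seq.count("C")) / len(seq)
    print(f"Sequence {i}: {seq}  (GC content: {gc:.1f}%)")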
Appendix A: FURTHER READING
Artificial Intelligence
Bioinformatics
Data Science
A.1. Artificial Intelligence
Ali A. Ibrahim, and et., (2023), “Forecasting Stock Prices with an Integrated Approach
Combining ARIMA and Machine Learning Techniques ARIMAML”, Journal of
Computer and Communications, 2023, 11, pp.: 58-70.
Ali A. Ibrahim, and et., (2023), “Use the Power of a Genetic Algorithm to Maximize
and Minimize Cases to Solve Capacity Supplying Optimization and Travelling
Salesman in Nested Problems”, 11, pp: 24-31.
Ali A. Ibrahim, and et., (2022), “Multi-Stage Image Compression-Decompression
System Using PCA/PCA to Enhance Wireless Transmission Security”, Journal of
Computer and Communications, Journal of Computer and Communications, 10, pp.:
87-96.
Ali A. Ibrahim, and et., (2019), “Design & Implementation of an Optimization Loading
System in Electric by Using Genetic Algorithm”, Journal of Computer and
Communications, 7, 7, pp.: 135-146.
Ali A. Ibrahim, and et., (2018), “The effect of Z-Score standardization on binary
input due the speed of learning in back-propagation neural network”, Iraqi Journal of
Information and Communication Technology, 1, 3, pp.: 42-48.
Ali A. Ibrahim, and et., (2018), “Design and implementation of fingerprint
identification system based on KNN neural network”, Journal of Computer and
Communications, 6, 3, pp.: 1-18.
Ali A. Ibrahim, and et., (2016), “Using neural networks to predict secondary
structure for protein folding”, Journal of Computer and Communications, 5, 1, pp.: 1-
8.
Ali A. Ibrahim, and et., (2016), “Design and implementation of iris pattern recognition using wireless network system”, Journal of Computer and Communications, 4, 7, pp.: 15-21.
Ali A. Ibrahim, and et., (2013), Design and Implementation Iris Recognition System
Using Texture Analysis, Al-Nahrain Journal for Engineering Sciences, 16, 1, pp: 98-
101.
A.2. Bioinformatics
Ali A. Ibrahim, and et., (2020), “Proposed Genetic Profiling System Based on Gel
Electrophoresis for Forensic DNA Identification”, Indian Journal of Public Health Research &
Development, 11,2.
Ali A. Ibrahim, and et., (2019), “CLASSIFICATION NUMBER OF ORGANISMS USING
CLUSTER ANALYSIS OF THE PEPTIDE CHAINS MULTIPLE CHYMOTRYPSIN LACTATE
DEHYDROGENASE”, Biochem. Cell. Arch, 19, 2, pp.: 4425-4429
Ali A. Ibrahim, and et., (2019), “Beta-2-microglobulin as a marker in patients with thyroid
cancer”, Iraqi Postgraduate Med Journal, 18, 1, pp.: 18 – 22.
Ali A. Ibrahim, and et., (2019), “Functional Analysis of Beta 2 Microglobulin Protein in
Patients with Prostate Cancer Using Bioinformatics Methods”, Indian Journal of Public
Health, 10, 3.
Ali A. Ibrahim, et., (2018), “Sequence and Structure Analysis of CRP of Lung and Breast
Cancer Using Bioinformatics Tools and Techniques”, 11, 1, pp.: 163-174.
Ali A. Ibrahim, and et., (2018), “C-Reactive Protein as a Marker in the Iraq Patients with
Poisoning Thyroid Gland Disease”, Engineering and Technology Journal, 36, 1 Part (B)
Scientific, University of Technology.
Ali A. Ibrahim, and et., (2018), Detecting the concentration of C- reactive protein by HPLC,
and analysis the effecting mutations the structure and function of CRP reactive protein, of
standard sample, The First International Scientific Conference / Syndicate of Iraqi Academics
Ali A. Ibrahim, and et., (2017), C-reactive protein as a marker for cancer and poising thyroid
gland, Engineering & Technology Journal, 35
Ali A. Ibrahim, and et, (2014), “Using Hierarchical Cluster and Factor Analysis to Classify and
Built a phylogenetic Tree Species of ND1 Mitochondria”, 17, 1, pp: 114-122.
Ali A. Ibrahim, and et., (2012), BIOINFORMATICS, first edition.
A.3. Data Science
Ali A Ibrahim, and et., (2019), “Forecasting the Bank of Baghdad index using the Box-
Jenkins methodology”, Dinars Magazine, 15, pp. 441-460.
Ali A Ibrahim, and et., (2013), “Using of Two Analyzing Methods Multidimensional Scaling
and Hierarchical Cluster for Pattern Recognition via Data Mining”, 3,1, pp:16-20.
Ali A. Ibrahim, and et., (2011) “DESIGN A FINGERPRINT DATABASE PATTERN
RECOGNITION SYSTEM VIA CLUSTER ANALYSIS METHOD I- DESIGN OF
MATHEMATICAL MODEL”, IRAQI JOURNAL OF BIOTECHNOLOGY, 10, 2, pp: 273-283.
Ali A. Ibrahim, (2008) “ Using the discriminatory function model for the chemical classification
of powdered milk models and knowing their conformity with the Iraqi standard specifications
through”, Journal College of Science of Al-Nahrain University, 11, 1, pp: 46-57.
Ali A. Ibrahim, (2002), “Using the discriminatory function model for the chemical classification
of powdered milk models and knowing their conformity with the Iraqi standard specifications
through, Journal of Economic and Administrative Sciences / University of Baghdad, 9,29, pp:
114-138. (ARABIC)
Ali A. Ibrahim, (2002), “Using protein databases and cluster analysis to compare the protein
homology regions of the Leader Peptidase enzyme and to determine the degree of genetic
affinity between”, Journal of the College of Administration and Economics / Al-Mustansiriya
University, 42, pp: 64-78. (ARABIC)
Ali A. Ibrahim, (2000),”Using a multidimensional scale to analyze the chemical compositions of
different milk powder samples”, The twelfth scientific conference of the Iraqi Association for
Statistical Sciences, pp: 189-210, (ARABIC)
Ali A. Ibrahim, “The use of cluster analysis in the compositional analysis of different milk
powder samples”, (2000), Scientific Journal of Tikrit University, College of Engineering.
(ARABIC)
Ali A. Ibrahim,(1996) “Use the factor analysis method to extract the variables that determine
The suitability of powdered milk for human consumption”, (1996), Scientific Journal of Tikrit
University, College of Engineering. (ARABIC).
Bibliography
[1] Agresti Alan and Kateri Maria, 2022, “Foundations of Statistics for Data Scientists”,
CRC Press.
[2] Al-Faiz Mohammed Z., Ibrahim Ali A., Hadi Sarmad M, 2018, “The effect of Z-Score
standardization on binary input due the speed of learning in back-propagation neural
network”, Iraqi Journal of Information and Communication Technology, 1, 3, pp.: 42-48.
[3]Al-Faiz Mohammed Z., Ibrahim Ali A., Hadi Sarmad M, 2020, “Proposed Genetic
Profiling System Based on Gel Electrophoresis for Forensic DNA Identification”, Indian
Journal of Public Health Research & Development, 11,2.
[4] Broucke Seppe vanden and Baesens Bart, 2018 “Practical Web Scraping
for Data Science”, Apress.
[5] Caldarelli Guido and Chessa Alessandro, 2016, “Data Science and Complex
Networks”, OXFORD UNIVERSITY PRESS.
[7]COX D. R., 2006, “Principles of Statistical Inference”, Published in the United States
of America by Cambridge University Press, New York.
[8] Dietrich David and et, 2015, “Data Science & Big Data Analysis”, John Wiley & Sons,
Inc.
[9] Draghici Sorin, 2012, “Statistics and Data Analysis for Microarrays Using R and
Bioconductor”, Second Edition, CRC Press.
[11] Grus Joel, 2019, “Data Science from Scratch”, Second Edition, O’Reilly Media.
[12] Hubbard Kent D. Lee • Steve, 2015, “Data Structures and Algorithms with Python”,
Springer.
[13] Ibrahim Ali A., 1996, “Use the factor analysis method to extract the variables that
determine The suitability of powdered milk for human consumption”, (1996), Scientific
Journal of Tikrit University, College of Engineering. (ARABIC Language).
[14] Ibrahim Ali, A., 2000, “The use of cluster analysis in the compositional analysis of
different milk powder samples”, Scientific Journal of Tikrit University, College of
Engineering. (ARABIC Langue).
[15] Ibrahim Ali, 2000 ,”Using a multidimensional scale to analyze the chemical
compositions of different milk powder samples”, The twelfth scientific conference of the
Iraqi Association for Statistical Sciences, pp.: 189-210, (ARABIC Language).
[16] Ibrahim Ali, 2002, “Using protein databases and cluster analysis to compare the
protein homology regions of the Leader Peptidase enzyme and to determine the degree
of genetic affinity between”, Journal of the College of Administration and Economics /
Al-Mustansiriya University, 42, pp.: 64-78. (ARABIC Language).
[17] Ibrahim Ali, 2002, “Using the discriminatory function model for the chemical
classification of powdered milk models and knowing their conformity with the Iraqi
standard specifications through, Journal of Economic and Administrative Sciences /
University of Baghdad, 9,29, pp.: 114-138. (ARABIC Langauge).
[18] Ibrahim Ali, 2008, “Using the discriminatory function model for the chemical
classification of powdered milk models and knowing their conformity with the Iraqi
standard specifications through”, Journal College of Science of Al-Nahrain University,
11, 1, pp.: 46-57.
[20] Ibrahim Ali A. and et, 2012, BIOINFORMATICS, first edition, AL-NAHRAIN UNIVERSITY.
[21] Ibrahim Ali, and et., 2013, “Using of Two Analyzing Methods Multidimensional
Scaling and Hierarchical Cluster for Pattern Recognition via Data Mining”, 3,1, pp:16-20.
[22] Ibrahim Ali A. and et. ,2013, “Design and Implementation Iris Recognition System
Using Texture Analysis”, Al-Nahrain Journal for Engineering Sciences, 16, 1, pp: 98-
101.
[23] Ibrahim Ali A. and et, 2014, “Using Hierarchical Cluster and Factor Analysis to
Classify and Built a phylogenetic Tree Species of ND1 Mitochondria”, 17, 1, pp.: 114-
122.
[24] Ibrahim Ali A. and et, 2016, “Design and implementation of iris pattern recognition
using wireless network system” , Journal of Computer and Communications, 4, 7, pp.: 15-21.
[25] Ibrahim Ali A. and et, 2016, “Using neural networks to predict secondary structure
for protein folding”, Journal of Computer and Communications, 5, 1, pp.: 1-8.
[26] Ibrahim Ali A. and et, 2017, C-reactive protein as a marker for cancer and poising
thyroid gland, Engineering & Technology Journal, 35.
[27] Ibrahim Ali A. and et, 2018, “Design and implementation of fingerprint identification
system based on KNN neural network”, Journal of Computer and Communications, 6, 3,
pp.: 1-18.
[28] Ibrahim Ali A. and et, 2018, “Sequence and Structure Analysis of CRP of Lung and
Breast Cancer Using Bioinformatics Tools and Techniques”, 11, 1, pp.: 163-174.
[29] Ibrahim Ali A. and et, 2018, “C-Reactive Protein as a Marker in the Iraq Patients
with Poisoning Thyroid Gland Disease”, Engineering and Technology Journal, 36, 1 Part
(B) Scientific, University of Technology.
[30] Ibrahim Ali, and et., 2019, “Forecasting the Bank of Baghdad index using the Box-Jenkins methodology”, Dinars Magazine, 15, pp.: 441-460.
[31] Ibrahim Ali A. and et, 2019, “Design & Implementation of an Optimization Loading
System in Electric by Using Genetic Algorithm”, Journal of Computer and
Communications, 7, 7, pp.: 135-146.
[32] Ibrahim Ali A. and et, 2019, “CLASSIFICATION NUMBER OF ORGANISMS USING
CLUSTER ANALYSIS OF THE PEPTIDE CHAINS MULTIPLE CHYMOTRYPSIN
LACTATE DEHYDROGENASE”, Biochem. Cell. Arch, 19, 2, pp.: 4425-4429.
[33] Ibrahim Ali A. and et, 2019, “Beta-2-microglobulin as a marker in patients with
thyroid cancer”, Iraqi Postgraduate Med Journal, 18, 1, pp.: 18 – 22.
[34] Ibrahim Ali A. and et,, 2019, “Functional Analysis of Beta 2 Microglobulin Protein in
Patients with Prostate Cancer Using Bioinformatics Methods”, Indian Journal of Public
Health, 10, 3.
[35] Ibrahim Ali A. and et, 2022, “Multi-Stage Image Compression-Decompression
System Using PCA/PCA to Enhance Wireless Transmission Security”, Journal of
Computer and Communications, Journal of Computer and Communications, 10, pp.: 87-
96.
[36] Ibrahim Ali A. and et, 2023, “Use the Power of a Genetic Algorithm to Maximize and
Minimize Cases to Solve Capacity Supplying Optimization and Travelling Salesman in
Nested Problems”, 11, pp: 24-31.
[37] Ibrahim Ali A. and et, 2023, “Forecasting Stock Prices with an Integrated Approach
Combining ARIMA and Machine Learning Techniques ARIMAML”, Journal of Computer
and Communications, 2023, 11, pp.: 58-70.
[38] Nylen Erik Lee and Wallisch Pascal, 2017, “NEURAL DATA SCIENCE”, ACADEMIC PRESS.
[39] Ozdemir Sinan, 2016, “Principles of Data Science”, Packt Publishing Ltd
[40] Provost Faster and Fawcett Tom, 2013, “Data Science for Business”, O’REILLY.
[41] Rastogi S. C. and et., 2008, “BIOINFORMATICS Methods and Applications”, Third
Edition, PHI Learning Private Limited.
[42] Reimann Clemens, and et., 2008, “Statistical Data Analysis Explained”, John Wiley
& Sons Ltd.
[43] Salazar Jesús Rogel, 2020, “Advanced Data Science and Analytics with Python”,
CRC Press.
[45] VanderPlas Jake, 2017, “Python Data Science Handbook”, O’Reilly Media
[46] Varga Ervin, 2019, “Practical Data Science with Python 3”, Apress.
Index
3D, 20, 48, 59, 60-62, 100, 137-142.
Central limit theorem, 253.
Checking for Consistency, 145, 146, 170.
Data Preprocessing, 145, 146, 151, 152, 344.
Data science, i, ii, 1, 8, 151, 327-340, 344-346, 350, 353, 354, 357.
Dependent variable, 199, 277-279, 283, 297, 298.
Descriptive Statistics, 183, 191.
Error, 46, 71, 151, 152, 157, 158, 170, 171, 180, 277, 278, 283, 294, 299, 301, 317.
Expected value, 205, 210, 239, 249-252, 317, 318, 322.
Exploratory Data, i, ii, 182, 183, 195, 326, 327, 329, 330, 332, 335.
F-test, 202, 308-310, 312-314.
Gantt Charts, 20, 142, 143.
Generating Random Data, 7, 11, 12, 16.
Geometric distribution, 225.
Geospatial Data, 2.
Histogram, 20, 73-81, 159, 183, 191.
Hypothesis testing, i, 202, 205, 288, 291, 301, 314.
Image Data, 2.
Independent variable, 199, 277-279, 283, 297, 298.
Integer Generation, 12.
Interval, 73, 158, 190-193, 209, 222, 223, 235, 236, 239, 242.
Mean, 12, 152, 158, 183, 184, 187, 188, 198, 199, 205, 210, 222, 226, 230-232, 236, 239, 240, 243, 249, 251, 252, 255, 278, 279, 294, 301-304, 309, 310, 313.
Measure of location, 182, 183.
Measures of Dispersion, 182, 187-190.
Median, 110, 116, 152, 158, 159, 183-198.
Mode, 183, 184, 186, 187.
Model Assumptions, 202, 293-297.
P-value, 288, 291-293, 308-310, 312-314, 316, 317, 323.
Probability distribution, 11, 202, 205-208, 210, 211, 226, 230, 235, 242, 243, 251, 252, 301, 308.
Sampling, 12, 16, 299, 309.
Scatter Plot, 20, 62-65, 67-73, 94, 137, 138, 157, 159, 183, 297, 298.
Score, 191, 192, 231, 268, 302, 304, 312-314.
Sentiment Analysis, 326, 332-334, 343.
الوحتىيات
انفصم األول :يقذيت
هزس
فِ ِ
ًخص هرا الكحاب الطلبة بصىزة عامة وطلبة الدزاسات العلُا والباحثين خاصة في مخحلف الاخحصاصات الطبُة
والهىدسُة والعلىم الصسفة والعلىم الاوساهُة.
جاهعة الٌهريي
علن البياًات
باستعوال لغة بايثىى هع التطبيقات
م
القدمة :
م
السعي ،للحصىل على زؤي في عصس ثقىده ثىزة علمُة ( غير مسبىقة ) في ثىلُد البُاهات والحقدم الحكىىلىجي أصبح
مً مجمىعات البُاهات الىاسعة والعقدة ،زكيزة ال غنى عنها الكخساب العسفة الحدًثة .إذ مهد ظهىز علم البُاهات -
كمجال دًىامُكي ومحعدد الحخصصات -الطسٍق لخسخير لاسالُ العحمدة على البُاهات للكفف عً لاهما،،
م
واسحخساج العلىمات ذات القُمة الكبيرة في اثخاذ قسازات مسخىيرة جسخىد إليها الىخ لاكادًمُة ،واملجحمعُة .
إن هرا الكحاب ًبحث في علم البُاهات ،موٍقدم اسحكفافات محعمقة في مىهجُاثه ،ومبادئه ،وثطبُقاثه .وذلك مً خالل
اثباع ههج علمي صازم ،مً خالل ثحلُل البُاهات ،واسحعمال الحقىُات لاحصائُة وخىازشمُات الحعلم آلالي الحطىزة
لفك العالقات الخفُة ،والحيبؤ باالثجاهات ،وحل الفكالت العقدة.
وٍمكً القىل أن هرا الكحاب هى محاولة لجعل القساء ًجحاشون مً خالله الحضازَس العقدة في معالجة البُاهات،
والحصىز ،والىمرحة .التي وعحمد على الفاهُم لاساسُة ،وهي ( :السٍاضُات ،ولاحصاء ،وعلىم الكمبُىثس ) ؛ لحمكين
القساء مً اسحعمال لادوات الالشمة لعالجة مجمىعات البُاهات العقدة ،وثحدًد مصادز الححيز ،وضمان سالمة
الىحائج .ومً خالل ثبني عقلُة علمُة ،فئهىا هؤكد على إمكاهُة الحكساز ،وأهمُة الىهجُة الففافة في السعي لححقُق
هحائج مىثىقة جعحمد على البُاهات.
ْ
وقد حاءت فصىل هرا الكحاب على وفق ثصمُم دقُق لحىحُه القساء مً الفاهُم لاساسُة إلى الىهجُات الحقدمة،
والكفف عً جعقُدات ثحلُل البُاهات الاسحكفافُة ،واخحبازالفسضُات ،والححقق مً صحة الىماذج.
م
وال ٌسعىا في هرا القام إال أن و ْعسب عً امحىاهىا وشكسها الخالص لعلماء البُاهات ،ولاحصائُين ،والباحثين الرًً
مهد عملهم السائد الطسٍق للمىهجُات الىضحة في هرا الكحاب .وهحً هحطلع أن ًكىن هرا الىص دلُال شامال ألولئك
الباحثين الجدد -في هرا املجال -ومصدزا قُما للممازسين مً ذوي الخبرة الرًً ٌسعىن إلى جعمُق فهمهم وصقل
مهازاتهم.
إن مً املجدي أن وفسع في السعي لفحح السؤي املخفُة داخل بحس البُاهات الهائل الري ًحُط بىا ،وأن وغامس بدخىل
عالم علم البُاهات معا ،محمسكين بالفضىل فكسي والىهج العلمي،
ح ِم ۡن أ ۡم ِر ر ِبّی وم ۤا لروحِ قُ ِل ٱ ُّ
لرو ُ ی ۡسـَٔلُونك ع ِن ٱ ُّ
[ ﴾٥اإلسراء [ أُوتِیتُم ِ ّمن ٱ ۡل ِع ۡل ِم ِإ اَّل ق ِلیال ٨
علن البياًات باستخذام لغة بايثىى هع التطبيقات
جاهعة الٌهريي