HSB3119 Theory Summary p1 Stud
Foreword
This note is prepared for HSB3119: Introduction to Data Science, which is being offered to 3rd year
students taking the Data Science and Digital Business track of the MAS degree program (Bachelor
degree in Management and Security) of Hanoi School of Business and Management, Vietnam National
University.
This course is partly theoretical and partly applied, requiring extensive practice using the programming
language Python and a little bit of SQL. Students are required to have their laptops with the necessary
software ready during all the sessions. Most examples are excluded from these notes and will be
incorporated only in the class presentations or Jupyter notebooks via Google Colab.
Images used in this note are taken from various sources (the references listed below and generic searches on google.com), in some cases without permission from the copyright owners. These notes are not for commercial purposes and are intended only for the private learning of students in this class.
References
1. Bruce, P., Bruce, A. and Gedeck, P. “Practical Statistics for Data Scientists”, O’Reilly, 2020.
2. Igual, L. and Seguí, S. “Introduction to Data Science”, Springer, 2017.
3. Shah, C. “A Hands-On Introduction to Data Science”, Cambridge University Press, 2020.
Note: Friendly references for coding Python or SQL include https://www.w3schools.com/, among others.
Software
We will mainly use Python (via Jupyter Notebook/Jupyter Lab or Google Colaboratory) to perform our
analyses. You will also be asked to create a Github account. Always have your laptop ready for class.
1. To install Jupyter Notebook/Lab, I recommend installing Anaconda (Free, download the
graphical version) Link: https://www.anaconda.com/products/distribution
2. To use Google Colab, save a copy of the Jupyter notebooks to your own drive; I will only provide read-only files from my drive. Google Colab requires an internet connection.
3. You can also use alternative software (online or offline) that can open Jupyter notebooks, like JupyterLite or Visual Studio, but the setup may differ.
Requirements
Component           Form                                        Weight
Attendance          Individual                                  10%
Activities          Individual (Jupyter notebook submission)    10%
Mini-presentation   Group                                       5%
Mini-exam           Individual                                  5%
Group Project       Individual (2%) / Group (8%)                10%
Final Examination   Individual                                  60%
Activities are often the submission of a working Jupyter notebook for the assigned laboratory. The
teachers will assign a mark at the end of the session or by submission. Each activity is worth 2% of the
grade, and so doing all 7 activities can be worth 14%.
The mini-presentation is the culmination of a series of laboratory activities which includes data cleaning,
exploration, and visualization.
The mini-exam (or long quiz) is a written exam to be done in class.
Class Rules
1. Attendance during a lecture may be checked by that day’s activity, by roll call, or by the AI face
recognition system of HSB. Students are only allowed at most 3 absences (both excused and
unexcused). If you have a valid reason to be absent for more than 3 sessions, please contact the
UPMO and the teacher. Students who fail this requirement may not be allowed to take the final
exam.
3. The teacher reserves the right to send students out of the room and/or lock students out of the room.
4. Quizzes/activities must be done within the assigned time (in class, or within the deadline if assigned
to be done at home). Late submissions will not be accepted.
5. There are plenty of opportunities to score higher in my class. Please do not ask for more.
Schedule
MWF (8:00-11:40 A.M.), Sept 6-Oct 9, 2023
Session  Day  Lecture                                          Tutorial
1        W    Introduction to Data Science; project launch     Activity 1 (Review – basic data analysis, Python basics)
2        F    Collecting Data                                  Activity 2 (Lab 1a, 1b – Python basics; Lab 1c – Data Collection)
3        M    Accessing Data from a database                   Activity 3 (Lab 2 – SQL)
4        W    Cleaning Data, Exploring Data                    (Lab 3 – Data Cleaning, Data Wrangling)
5        F    Visualizing Data                                 (Lab 4 – Data Visualization)
6        M    Lab catch-up                                     –
7        W    Publishing Data                                  (Lab 5 – Publishing in Github)
8        F    Data and Ethics                                  Mini-presentation
9        M    Intro to Machine Learning, K Nearest Neighbors   Activity 4 (Lab 6 – KNN)
10       W    Linear Regression, Multilinear Regression        Activity 5 (Lab 7 – Linear Regression)
11       F    Logistic Regression                              Activity 6 (Lab 8 – Logistic Regression)
12       M    Mini-exam                                        Project Consultation
13       W    Neural Networks                                  Activity 7 (Lab 9 – Tensorflow)
14       F    Project Presentation                             Invited guest speaker
15       M    Project Presentation                             Review
Chapter 1 Introduction to Data Science
The word data is the plural of datum, a Latin word meaning “given”, i.e., a given or known detail. Often, the data that is stored is unprocessed (when processed, people usually refer to it as information).
Types of data
There are several ways to classify data. In data analysis, classification is based on the real-world meaning of the data; however, once the data is encoded in a computer or a piece of software, another notion of “data type” appears. Alongside the data itself, it helps to know the common software tools used to work with it:
• Microsoft Excel – many traditional companies still use this, and it remains suitable for quick solutions to simple needs. In 2023, Microsoft also began rolling out Python support in Excel.
• Tableau and PowerBI for data visualization
• IBM Cognos Analytics for business intelligence needs
• Hadoop is a framework that manages large amounts of data
In this course, we will focus on the use of Python. We will also use SQL and Github.
Chapter 2 The Data Science Process – Collecting Data
Samples
It’s rarely possible to survey the entire population of your research – it would be very difficult to get a
response from every person in Brazil or every transgender woman in Vietnam. Instead, you will usually
survey a sample from the population.
The required sample size depends on how big the population is and, among other factors, on the confidence level you want (remember data analysis?). Your survey should aim to produce results that can be generalized to the whole population. That means you need to carefully define exactly who you want to draw conclusions about.
Online survey
Pros: low cost and flexibility; quick access to a large sample without constraints on time or location; the data is easy to process and analyze.
Cons: the anonymity and accessibility of online surveys give the researcher less control over who responds, which can lead to biases like self-selection bias.
Typical tools and users: Qualtrics, SurveyMonkey and Google Forms; students doing dissertation research.
Note: Using a survey does not mean it generates quantitative data. Whether it does depends on whether the questions are open- or closed-ended; open-ended questions often produce qualitative data.
Like questionnaires, interviews can be used to collect quantitative data: the researcher records each
response as a category or rating and statistically analyzes the results. But they are more commonly used to
collect qualitative data: the interviewees’ full responses are transcribed and analyzed individually to gain
a richer understanding of their opinions and feelings.
Face-to-face interview
Definition: involves asking individuals or small groups questions about a topic.
Pros: higher potential for insights.
Cons: more complicated to organize; higher social bias than with a focus group; organization costs, especially in case of no-shows.
Example question: How does social media shape body image in teenagers?

Focus group
Definition: a specific form of group interview in which interaction between participants is encouraged; the person conducting the focus group plays the role of a facilitator encouraging the discussion, rather than an interviewer asking questions.
Pros: diversity of interviewees’ profiles and enrichment of responses; cheaper than face-to-face interviews; can confirm insights obtained through other qualitative methodologies.
Cons: the speaking time of some attendees may be considerably higher than that of others, making their contribution disproportionate; lower average speaking time per participant; the moderator’s bias is hard to prevent.
Example question: How can teachers integrate social issues into science curriculums?
There are other ways to obtain data, depending on what data you need. For example, scientists use sensors to measure ocean salinity and temperature, and these measurements are all processed by computers.
Step 3: Okay, what to ask and how to ask?
Closed-ended questions give the respondent a predetermined set of answers to choose from, such as yes/no options, rating scales, or multiple-choice items; open-ended questions let respondents answer in their own words. The following summarizes common data collection methods and when to use them:
Experiment – to test a causal relationship. Manipulate variables and measure their effects on others.
Survey – to understand the general characteristics or opinions of a group of people. Distribute a list of questions to a sample online, in person, or over the phone.
Observation – to understand something in its natural setting. Measure or survey a sample without trying to affect them.
Ethnography – to study the culture of a community or organization first-hand. Join and participate in a community and record your observations and reflections.
Secondary data collection – to analyze data from populations that you can’t access first-hand. Find existing datasets that have already been collected, from sources such as government agencies or research organizations.
Secondary data
Secondary data is data obtained by a research team from other sources. For example, the General
Statistics Office of Vietnam (gso.gov.vn) regularly reports key figures about the social and economic
affairs of Vietnam. Several groups (e.g. the United Nations, the Vietnam Chamber of Commerce and Industry) also publish reports of their research projects and often keep the data accessible to everyone, although at times it must be bought. The researchers must understand the reliability and appropriateness of the data collection process used by the institution.
An API, or application programming interface, is a set of defined rules that enable different applications
to communicate with each other. It acts as an intermediary layer that processes data transfers between
systems, letting companies open their application data and functionality to external third-party developers,
business partners, and internal departments within their companies. Using an API together with Python, we can extract data from websites quickly and cost-effectively.
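As a minimal sketch (the endpoint, parameters, and field layout below are made up for illustration; the labs will use the actual APIs introduced in class), pulling JSON data from an API with the requests package could look like this:

    import requests
    import pandas as pd

    # Hypothetical endpoint and parameters -- replace with the API used in the lab
    url = "https://api.example.com/v1/weather"
    params = {"city": "Hanoi", "year": 2023}

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()          # raise an error if the request failed
    records = response.json()            # most APIs return JSON

    # Put the records (assumed to be a list of dictionaries) into a DataFrame
    df = pd.DataFrame(records)
    print(df.head())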
Even without an API, data stored on the internet can be scraped using web-scraping tools, just like what we will do in the laboratory sessions of this class. Take note that scraping requires some understanding of how HTML works (although for this course, you will be guided on how to do it).
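A similarly minimal scraping sketch with requests and BeautifulSoup (the URL and the assumption that headlines live in <h2> tags are purely illustrative) might be:

    import requests
    from bs4 import BeautifulSoup

    # Illustrative URL -- replace with the page assigned in the lab
    page = requests.get("https://example.com/news", timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")

    # Extract the text of every headline stored in <h2> tags (assumed structure)
    headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
    print(headlines[:10])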
Note: As far as we are concerned, we will not discuss null values in SQL yet. We will leave that for the
Data Cleaning chapter and hopefully in your database class.
Chapter 4 The Data Science Process – Cleaning and Wrangling Data
Preprocessing
Understanding the data
The first thing to do when obtaining a data set is to understand what it means. This includes knowing the
context of the data, what each entry is, and what each column is. By doing so, we can decide the
appropriate type of data, data type, and analyses that can be done. Of course, we can only solve a problem, or present it, if we understand what the data is.
Ideally, we also have an idea of how the data was collected and stored, so that we know how reliable it is and what limitations it may have.
For example, it is possible to have data coming from a survey of students about their marks in HS and
university. After looking at the data (or talking with the person who administered the survey), one can realize the following:
1. Each entry (row) in the table is a response from a student.
2. The columns may represent the following information: student details like ID number, name,
gender, birthday, HS math score, HS English score, University score 1, University score 2.
3. The data was collected by an online survey.
What are the implications? Well, we should understand that the scores should be numbers in general (but
still, it could be written in a Vietnamese format or an English format), that names are often in Vietnamese
(so getting the family name and first name may be tricky), etc. Some students are also often careless (even
adults can make mistakes) so sometimes, instead of writing 9.1, they type 91 or even 911. Knowing your
data means one should know when to accept 911 or not.
Formatting data
One of the common things to perform is standardizing the format of the input, especially if the data is
manually registered by individuals, in contrast to being automatically generated (e.g. transactions logged
in by computers). For instance, a person might type her name as NGUYEN CHI THANH but some would
write Nguyen Chi Thanh, or even Chi Thanh Nguyen. As someone working with data, one should handle such entries carefully, not only to make the data look nicer, but also to make it more reliable. The format must be standardized, and we will see some examples of this in the lab. Typing mistakes are also common, so people might type “Ha noi”, “Hanoi”, or “Hanoii”, and these must all be standardized. (The best way is of course to design the survey better, but an analyst must always confirm this.)
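A small pandas sketch of this kind of cleanup (the column names and values are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "name": ["NGUYEN CHI THANH", "Nguyen Chi Thanh ", "chi thanh nguyen"],
        "city": ["Ha noi", "Hanoi", "Hanoii"],
    })

    # Standardize capitalization and strip stray whitespace in names
    df["name"] = df["name"].str.strip().str.title()

    # Map common misspellings of the city to one canonical value
    df["city"] = df["city"].replace({"Ha noi": "Hanoi", "Hanoii": "Hanoi"})
    print(df)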
Duplicates and missing values
Part of the preprocessing stage is deciding what to do with missing values and duplicates.
Duplicates: Does the data expect duplicate entries, and how does it affect the table contents? Maybe a
person can order from the same shop in a span of 1 hour, but these should have different order numbers
(if the order numbers are the same, then maybe we should only process once). Often, duplicates occur
when we combine data from different sources, and so, duplicates must often be removed.
Missing values: For missing values, also called NULL values, there are generally three approaches. The first is to delete them, which is reasonable if (1) there are very many data points and losing a small percentage will not change the distribution significantly, and (2) there is no suitable way to fill them in (which is the second approach). Note that missing values can occur for three reasons: the value was truly unknown, it was known but not disclosed, or it simply does not exist. Thus, deleting an entry must be done with extreme care. The second approach is to impute an appropriate value, for example, the mean or median of that column; such a data point sits roughly in the middle of the dataset and should not strongly affect the analysis – there remain exceptions, of course. The last approach is to leave the value empty, in which case subsequent analyses must be done carefully to leave out these NULL values where appropriate.
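A pandas sketch of these options (the column names and values are assumed for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [101, 101, 102, 103],
        "amount":   [9.1, 9.1, None, 911.0],
    })

    # Drop exact duplicate rows (e.g. the same order loaded twice)
    df = df.drop_duplicates()

    # Option 1: drop rows with missing values
    dropped = df.dropna(subset=["amount"])

    # Option 2: impute with the column median
    imputed = df.copy()
    imputed["amount"] = imputed["amount"].fillna(imputed["amount"].median())

    # Option 3: keep the NULLs; most pandas statistics skip them by default
    print(df["amount"].mean())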
Data Wrangling
Descriptive statistics, histograms, and correlations
Once we have a clean set of data, it is important to check several summary statistics, not only to verify the veracity of our data but also to know its shape and possibly see some patterns. As such, it is common to look at
the distribution of the data: often shown as a bar chart (nominal and ordinal), histogram (numerical), or
scatter/line chart (for time series data). Boxplots can also be useful especially when comparing different
samples. A visual check of the distribution allows us to identify normality (which is a common
assumption), skewness, modality, or even identify which test we can or cannot use. This will also give us
an idea if we need to do some transformation for our data (e.g. logarithmic transformation).
Aside from histogram and distribution, it will be necessary to know the following terms: mean, median,
mode, standard deviation, range, variance, covariance, correlation, quartiles, percentiles, minimum,
maximum, count, and even uniqueness of the data points. An important concept, to be reviewed in
regression, is the correlation coefficient.
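A quick pandas sketch of such checks (the file and column names are assumed):

    import pandas as pd

    df = pd.read_csv("students.csv")          # assumed file name

    # Summary statistics for every numerical column
    print(df.describe())

    # A few individual statistics and the correlations between numerical columns
    print(df["hs_math_score"].median(), df["hs_math_score"].std())
    print(df.corr(numeric_only=True))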
Outliers
Outliers are data points that lie very far from the “center” of the data. As such, these data points can
occasionally strongly impact the subsequent analyses, e.g. when we talk about the mean, or when we do
regression analysis. Often, an analyst must determine whether to include these outliers or not in the
analysis. On some occasions, these outliers may actually be faulty data that were mistakenly typed. But
sometimes, because of the underlying nature of the data, we may need to perform transformations to keep
this data but reduce their outlier effect. For instance, taking the logarithm of some variables will decrease
the distance between points.
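One common way to flag outliers (not the only one) is the 1.5 × IQR rule that boxplots use; a sketch with assumed file and column names:

    import pandas as pd

    df = pd.read_csv("students.csv")           # assumed file name
    col = df["hs_math_score"]                  # assumed column

    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Rows flagged as potential outliers, to be inspected manually
    outliers = df[(col < lower) | (col > upper)]
    print(outliers)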
Selecting, sorting and filtering data
When used in real life, data must be summarized into the most important and relevant statistics or
indicators. For example, an e-commerce company may have all the data regarding all the transactions of
its customers, but no one brings all of it into the board room. The directors will simply look at, for instance, the total sales and total expenditures every month, as well as a few detailed analyses of certain products and services that may be interesting.
As a data analyst and a data scientist, you must then be able to carefully choose what data must be
included in each step of the analysis. Some data might be excluded from one part, but may be useful in
another; keeping all the data will however be distracting and may incur costs.
Selecting data involves choosing which attributes or features to keep. Sorting involves arranging them in
a particular order as needed, and filtering is removing the unrelated data entries as needed. These steps
may be done repeatedly for each new analysis or output.
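A pandas sketch of selecting, filtering, and sorting (file and column names assumed):

    import pandas as pd

    df = pd.read_csv("transactions.csv")        # assumed file name

    # Select only the attributes needed for this analysis (assumed columns)
    subset = df[["order_id", "product", "amount", "month"]]

    # Filter out unrelated entries, e.g. keep only one month
    subset = subset[subset["month"] == "2023-09"]

    # Sort by amount, largest first
    subset = subset.sort_values("amount", ascending=False)
    print(subset.head())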
Transforming data
One-hot encoding (dummy variables) for categorical data
Categorical or nominal data are difficult to handle as-is alongside the rest of the structured data, especially since many algorithms require numerical inputs. To deal with this, each attribute with 𝑘 categories can be converted into 𝑘 columns or attributes – one for each category. A data entry is marked 1 if it belongs to that category and 0 if not. This can be done using conditionals or the get_dummies command, as sketched below.
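A minimal get_dummies sketch (invented data):

    import pandas as pd

    df = pd.DataFrame({"city": ["Hanoi", "Da Nang", "Hanoi", "Hue"]})

    # One column per category, with 1/0 (True/False) membership flags
    dummies = pd.get_dummies(df["city"], prefix="city")
    df = pd.concat([df, dummies], axis=1)
    print(df)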
Transforming data
Transforming data can sometimes be necessary if the values are too big or too skewed. Such skewness will bring imbalance to the analysis (it can put too much emphasis on large values). By taking the logarithm (or some other transformation), the values may come closer to having linear relationships with other variables and may display a more normal distribution.
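A short sketch of a logarithmic transformation (invented, deliberately skewed values):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [300, 550, 800, 1200, 250000]})  # skewed values

    # log1p = log(1 + x): compresses large values and handles zeros safely
    df["log_income"] = np.log1p(df["income"])
    print(df)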
Standardizing data
To standardize data is to make the data follow certain rules, e.g. forcing the values to lie between a fixed minimum (0) and maximum (1), or giving them a standard normal distribution (zero mean and variance 1). Either way, this can easily be done in code. In this class, we will usually use the standard normal transformation:
𝑍 = (𝑋 − 𝜇) / 𝜎
so that our data points will have an average of 0 and variance (and standard deviation) of 1.
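A sketch of the z-score transformation in pandas (file and column names assumed):

    import pandas as pd

    df = pd.read_csv("students.csv")            # assumed file name
    col = "hs_math_score"                       # assumed column

    # Standard normal (z-score) transformation: subtract the mean, divide by the std
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()
    print(df[col + "_z"].describe())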
Aggregating data
Cleaned data is still not summarized. You might have learned to use pivot tables in Excel or aggregate
functions in SQL, and we can do something similar in Python. We will focus on using groupby along
with one or more aggregate functions (count, min, max, mean, median, mode, stdev, variance, etc.) to
summarize data into a new table that will be ready for visualization.
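A minimal groupby sketch (file and column names assumed):

    import pandas as pd

    df = pd.read_csv("transactions.csv")         # assumed file name

    # Number of orders, total and average sales per month, ready to be plotted
    summary = df.groupby("month")["amount"].agg(["count", "sum", "mean"])
    print(summary)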
For the laboratory, we will focus on some descriptive statistics, handling missing values, sorting and
aggregating.
Chapter 5 The Data Science Process – Visualizing Data
What is data visualization?
1. Data visualization involves looking at your data and trying to understand where it comes from
and what it could say
2. Visualization transforms data into images that effectively and accurately represent information
about the data. – Schroeder et al. The Visualization Toolkit, 2nd ed. 1998
3. Visualization – creating appropriate and informative pictures of a dataset through the iterative exploration of data. These informative pictures exploit visual properties that we notice without conscious effort.
Univariate Plots
Categorical
Bar chart and Pie chart
The most commonly used plots for categorical variables are bar plots and
pie charts. We can also simply represent them by tables, but they are less
eye-catching.
Pie chart: Graphical representation for categorical data in which a circle is
partitioned into “slices” on the basis of the proportions of each category
Bar or Pie?
This could be a question of personal taste, but here’s the rule of thumb that I use: when there are more
than 4-5 categories in the variable, I’d use a bar plot rather than a pie chart.
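A minimal matplotlib sketch of both plot types (invented counts):

    import matplotlib.pyplot as plt

    # Counts per category (illustrative data)
    labels = ["Hanoi", "Da Nang", "Hue"]
    counts = [45, 25, 10]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(labels, counts)                              # bar chart
    ax1.set_title("Bar chart")
    ax2.pie(counts, labels=labels, autopct="%1.0f%%")    # pie chart
    ax2.set_title("Pie chart")
    plt.show()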
Numerical
A picture of a numerical variable should show how many cases there are for each value, that is, its
distribution.
Histogram
A histogram is used when a variable takes numeric values. DO NOT confuse histograms with bar charts. A bar chart is a graphical representation of data that compares distinct variables. The histogram depicts the frequency distribution of a numerical variable, for which “bins” are determined to create numerical categories.
BoxPlot
The box plot is a diagram suited for showing the distribution of univariate numerical data, by visualizing the following characteristics of a dataset: first quartile, median, third quartile, and outliers.
First quartile: Value which separates the lower 25% from the upper 75% of the values of the dataset
Median: Middle value of the dataset
Third quartile: Value which separates the upper 25% from the lower 75% of the values of the dataset
Interquartile range: Range between first and third quartile
Whiskers: Lines ranging from minimum to first quartile and from third quartile to maximum (outliers
excluded)
Outliers: Values that are extremely far from the data set, represented by circles beyond the whiskers.
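A matplotlib sketch of a histogram and a box plot side by side (randomly generated scores, for illustration only):

    import matplotlib.pyplot as plt
    import numpy as np

    # Fake exam scores drawn from a normal distribution
    scores = np.random.default_rng(0).normal(loc=7.0, scale=1.5, size=200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(scores, bins=15)        # histogram: frequency per bin
    ax1.set_title("Histogram")
    ax2.boxplot(scores)              # box plot: quartiles, whiskers, outliers
    ax2.set_title("Box plot")
    plt.show()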
Bivariate Plots
Bivariate means two variables. In a bivariate plot, we explore the relationship between two variables. These are often referred to as the independent (cause, usually 𝑥) and dependent (effect, usually 𝑦) variables. We look at line graphs and scatter plots to explore the relationship between one IV and one DV.
A line chart is a visualization that displays information as data points connected by straight line segments. You can use it to extract trend and pattern insights from raw data, which is oftentimes a time series.
A scatter plot, on the other hand, uses dots to display associations and correlations present in your data. It is often drawn together with a line of regression (or line of best fit) to display the relationship between two varying sets of data, highlighting correlations and associations.
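A matplotlib sketch of a line chart and a scatter plot (invented data):

    import matplotlib.pyplot as plt
    import numpy as np

    # Illustrative data: monthly sales (time series) and pairs of scores
    months = np.arange(1, 13)
    sales = np.array([10, 12, 11, 15, 18, 17, 20, 22, 21, 25, 27, 30])

    rng = np.random.default_rng(1)
    hs_math = rng.uniform(5, 10, 50)
    uni_score = 0.8 * hs_math + rng.normal(0, 0.5, 50)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(months, sales, marker="o")      # line chart for a time series
    ax1.set_title("Line chart")
    ax2.scatter(hs_math, uni_score)          # scatter plot for two variables
    ax2.set_title("Scatter plot")
    plt.show()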
Density plot
Density plots are used to study the distribution of one or a few variables. Checking the distribution of
your variables one by one is probably the first task you should do when you get a new dataset, and it conveys a good amount of information. Density plots are continuous analogues of the histogram.
Heatmap
Heatmaps are maps that employ colors to show where values are larger or smaller, and can be superimposed on charts or even geographic maps.
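A seaborn sketch of a density plot and a correlation heatmap (this uses seaborn's built-in "tips" example dataset, which load_dataset downloads over the internet):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Small example dataset shipped with seaborn
    tips = sns.load_dataset("tips")

    # Density plot of one numerical variable
    sns.kdeplot(data=tips, x="total_bill")
    plt.show()

    # Heatmap of the correlations between numerical variables
    corr = tips.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, cmap="coolwarm")
    plt.show()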
Wordcloud
A word cloud is a visual representation of the words that appear most commonly in a set of observations. It is a great tool for drawing the audience’s attention to the most frequent words, instead of using a bar chart (which could be better, depending on the intention of the visual); the frequency of each word determines the size of its text.
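A sketch assuming the third-party wordcloud package is installed (pip install wordcloud); the text is invented:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    text = "data science python data cleaning data visualization python pandas"
    wc = WordCloud(width=600, height=300, background_color="white").generate(text)

    # Word clouds are images, so display them with imshow and hide the axes
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()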
Software
Many data analysts will be asked to know one of the following: Microsoft Excel, PowerBI, Tableau, or
visualization packages in R or Python. In Python, we highlight the packages pandas, matplotlib, seaborn,
and tensorflow as having useful plotting functions. This list is not, and will never be, exhaustive, as newer software and packages keep being developed.
Chapter 6 The Data Science Process – Publishing Data (Github)
Making a compelling story
In contrast to a Data Analyst, whose main job focuses on presenting key findings, trends, and patterns to stakeholders (usually by creating dashboards or reports for executives) and who is usually more concerned with monitoring, a Data Scientist often addresses specific questions such as:
1. What factors of the business process must be improved (and how)?
2. Are there specific segments of the customers who must be targeted?
3. How can we improve specific processes in our systems?
For more specific questions, it could be:
4. [Postal industry] How can we reduce mistakes in reading the address in a letter, and increase the
efficiency of the mailing system? (Solution: use AI to read the addresses and sort them)
5. [Banking industry] How can we reduce the risk of giving bad loans to people? (Solution: use
classification algorithms to determine which customer has higher or lower risk)
6. [Health industry] Can we use laboratory tests to suggest whether a person is likely to develop a certain disease? (Solution: use classification to determine the likelihood of developing a disease)
As such, the story developed by a data scientist can be more specific to the problem at hand. Depending
on the audience, it can be technically involved (if reporting to a senior data scientist) or a summarized
version of it. Nevertheless, a report should include the following:
- An executive summary
- An introduction to the problem
- Information on how data is collected, sorted, and cleaned
- What methodologies are employed
- The results of the processes
- Discussion and conclusion
While this outline can be comprehensive enough, making the story compelling lies in first choosing an appropriate framing of the problem, then choosing the right set of details to discuss, and finally showing how the problem is solved, highlighting how the particular proposal will bring major advantages over the current system. While disadvantages must be noted, focusing on the advantages is crucial to making the story attractive.