
HSB3119: INTRODUCTION TO DATA SCIENCE

(TỔNG QUAN VỀ KHOA HỌC DỮ LIỆU)


Emmanuel Lance Christopher VI M. Plan
Hanoi School of Business and Management – Vietnam National University
Email: emmanuelplan@hsb.edu.vn

Nguyễn Quốc Thành Hoàng


Hanoi School of Business and Management – Vietnam National University
Email: thanhhnq@hsb.edu.vn

Foreword
This note is prepared for HSB3119: Introduction to Data Science, which is offered to 3rd-year students taking the Data Science and Digital Business track of the MAS program (Bachelor's degree in Management and Security) at the Hanoi School of Business and Management, Vietnam National University.
This course is partly theoretical and partly applied, requiring extensive practice using the programming
language Python and a little bit of SQL. Students are required to have their laptops with the necessary
software ready during all the sessions. Most examples are excluded from these notes and will be
incorporated only in the class presentations or Jupyter notebooks via Google Colab.
Images used in this note are taken from various sources (the references and generic searches on google.com) and from the internet, in some cases without permission from copyright owners. These notes are not for commercial purposes and are intended only for the private learning purposes of students in this class.

References
1. Bruce, P., Bruce, A. and Gedeck, P. "Practical Statistics for Data Scientists", O'Reilly, 2020.
2. Igual, L. and Seguí, S. "Introduction to Data Science", Springer, 2017.
3. Shah, C. "A Hands-On Introduction to Data Science", Cambridge University Press, 2020.

Note: Friendly references for coding Python or SQL include https://www.w3schools.com/, among others.

Software
We will mainly use Python (via Jupyter Notebook/Jupyter Lab or Google Colaboratory) to perform our
analyses. You will also be asked to create a Github account. Always have your laptop ready for class.
1. To install Jupyter Notebook/Lab, I recommend installing Anaconda (Free, download the
graphical version) Link: https://www.anaconda.com/products/distribution

2. To use Google Colab, save a copy of the Jupyter notebooks to your own drive. I will only provide
you read-only files from my drive. Google Colab requires an internet connection.
3. You can also use alternative software (online or offline) that can open Jupyter notebooks, like
jupyterlite or Visual Studio, but the infrastructure may be different.

Requirements
Component – Form – Weight
Attendance – Individual – 10%
Activities – Individual (Jupyter notebook submission) – 10%
Mini-presentation – Group – 5%
Mini-exam – Individual – 5%
Group Project – Individual (2%) / Group (8%) – 10%
Final Examination – Individual – 60%

Activities usually consist of submitting a working Jupyter notebook for the assigned laboratory. The teachers will assign a mark at the end of the session or upon submission. Each activity is worth 2% of the grade, so doing all 7 activities can be worth 14%.
The mini-presentation is the culmination of a series of laboratory activities which includes data cleaning,
exploration, and visualization.
The mini-exam (or long quiz) is a written exam to be done in class.

Class Rules
1. Attendance during a lecture may be checked by that day’s activity, by roll call, or by the AI face
recognition system of HSB. Students are only allowed at most 3 absences (both excused and
unexcused). If you have a valid reason to be absent for more than 3 sessions, please contact the
UPMO and the teacher. Students who fail this requirement may not be allowed to take the final
exam.

2. Respect the teacher and fellow students by:


a. Speak only when acknowledged; do not interrupt the teacher unless urgent.
b. Avoid chatting with your classmates on topics unrelated to the class.
c. Switch your mobile phones to silent mode during class time / Mute yourself in an online
class.

3. The teacher reserves the right to send students out of the room and/or lock students out of the room.

4. Quizzes/activities must be done within the assigned time (in class, or within the deadline if assigned
to be done at home). Late submissions will not be accepted.

5. There are plenty of opportunities to score higher in my class. Please do not ask for more.
Schedule
MWF (8:00-11:40 A.M.), Sept 6-Oct 9, 2023
Session – Day – Lecture – Tutorial
1 – W – Introduction to Data Science; project launch – Activity 1 (Review – basic data analysis, Python basics)
2 – F – Collecting Data – Activity 2 (Lab 1a, 1b – Python basics; Lab 1c – Data Collection)
3 – M – Accessing Data from a database – Activity 3 (Lab 2 – SQL)
4 – W – Cleaning Data, Exploring Data – (Lab 3 – Data Cleaning, Data Wrangling)
5 – F – Visualizing Data – (Lab 4 – Data Visualization)
6 – M – Lab catch-up –
7 – W – Publishing Data – (Lab 5 – Publishing in Github)
8 – F – Data and Ethics – Mini-presentation
9 – M – Intro to Machine Learning, K Nearest Neighbors – Activity 4 (Lab 6 – KNN)
10 – W – Linear Regression, Multilinear Regression – Activity 5 (Lab 7 – Linear Regression)
11 – F – Logistic Regression – Activity 6 (Lab 8 – Logistic Regression)
12 – M – Mini-exam – Project Consultation
13 – W – Neural Networks – Activity 7 (Lab 9 – Tensorflow)
14 – F – Project Presentation – Invited guest speaker
15 – M – Project Presentation – Review
Chapter 1 Introduction to Data Science
The word data is the plural of the Latin word datum, meaning "given", i.e., a given or known detail. Often, the data that is stored is unprocessed (once processed, people usually refer to it as information).

The Data Science Process


Data Science actually consists of several steps, and a big company can have individuals or even teams working on each of these steps. A typical process can be broken down into 4-8 major steps, and data analysis can only happen when the correct data has been collected and cleaned. After analysis, the results can be used for interpretation, (final) visualization, and decision making. Independent data analysts often say data cleaning is the hardest part, because (big) data is often collected without a particular reason aside from knowing that it can be valuable, and analysts often have to answer questions using whatever data they have. Thus, a whole bunch of data has to be summarized into a few key figures and pictures (as we will review in this course) through cleaning and analysis.
Data Science involves several steps. In this course, we will focus on data exploration, feature engineering, prediction, and presentation. The goal is to take unprocessed data and transform it, using tools learned in the course, into a compelling story.

Terminologies in the field of data


Here, we briefly define/differentiate the terms that commonly appear in the field of data.

• Data Science vs Data Analysis vs Data Engineering vs Business Analyst


o A business analyst focuses on the business implications of the data analysis process.
o A data engineer focuses on the technical (both hardware and software) aspects of data storage, use, and management.
o A data analyst provides insights coming from the data, usually in the form of dashboards
and reports.
o A data scientist focuses on using select data to draw patterns and insights, often in
response to a business question; a data scientist often requires more technical knowledge
and tools to manipulate data.
• Data Science vs Artificial intelligence vs Machine learning
o Artificial intelligence is the science of making computers imitate human processes.
Machine learning is a subset of it, which focuses on creating models that can train
machines to perform human tasks.
• Data Science vs Big data
o Big data refers to the large amount of data that is available, and which can be used by
data analysts and scientists to find patterns, insights, and make conclusions and
predictions.

Types of data
There are several ways to classify data. In data analysis, classification is based on the real-world meaning of the data. However, once we account for how data is encoded in a computer or software, another notion of "data type" appears. Let us discuss the different types:

• Quantitative vs qualitative data


o Quantitative data consists of numbers; qualitative data often consists of words or opinions.
• Primary vs secondary data
o Primary data is gathered by the researchers themselves; secondary data is gathered by, and often derived from, other people's work.
• Structured vs semi-structured vs unstructured data
o Structured data often comes in tabular format or any format equivalent to it; unstructured
data has little format and may appear in different shapes (e.g. emails, social media
content); semi-structured data is somewhere in between these two
• Types of data (nominal, ordinal, numerical)
o Nominal data (also called categorical data) provides a label to a data point, is often described in words, and is not meant for comparison or calculation.
o Ordinal data are ranked data – often appears as a rating/ranking
o Numerical data is data that appears as a number; it can be discrete or continuous
• Data types (string, int, float, date/time, …)
o A string is a sequence of characters (letters, numbers, and symbols)
o An int or integer is a data type that stores integers (no decimal places); a double or a float
can handle decimal places
o Dates and times often appear as a date/time or datetime type. They can also be saved as a number relative to a reference (starting) datetime.
Take note that a different type of data or data type results in different ways of managing it, as the short Python sketch below illustrates.
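As a quick illustration (the table and its column names below are invented for this note), the following Python snippet shows how pandas reports a different data type for each kind of column:

import pandas as pd

# A tiny, made-up table mixing the data types discussed above
df = pd.DataFrame({
    "student_id": [101, 102, 103],                          # int
    "name": ["Nguyen Chi Thanh", "Tran An", "Le Binh"],     # string
    "hs_math_score": [9.1, 8.5, 7.75],                      # float
    "registered_on": pd.to_datetime(["2023-09-06", "2023-09-06", "2023-09-07"]),  # datetime
})

print(df.dtypes)  # int64, object (strings), float64, datetime64[ns]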

Data Science Tools


There are many tools that a data scientist must have. Often one tool is associated with (at least) one of the
data analytics steps.
The programming language R and its integrated development environment (IDE) RStudio form a famous end-to-end data analytics tool, as does Python. R has a stronger emphasis on statistical tools, whereas Python is more suitable for machine learning and rather versatile. Both are widely used in the data community.
SQL (or Structured Query Language) is widely used for data management purposes, especially when dealing with vast amounts of structured data. This will be discussed more in the dedicated chapter. The most famous database systems using SQL include SQL Server, MySQL, DB2, and PostgreSQL.
Github is a (cloud-based) platform that allows users to share their code, improve on others’ code and
share work derived from these. In this class, it will be used to store a repository and create a simple online
portfolio.
Other famous tools include:

• Microsoft Excel – many traditional companies still use it, and it is even suitable for quick solutions to simple needs. In August 2023, Python began to be rolled out in MS Excel.
• Tableau and PowerBI for data visualization
• IBM Cognos Analytics for business intelligence needs
• Hadoop is a framework that manages large amounts of data
In this course, we will focus on the use of Python. We will also use SQL and Github.
Chapter 2 The Data Science Process – Collecting Data

What Is Data Collection?


Data collection is the systematic process of gathering observations or measurements. It allows you to gain
knowledge and original insights into your research problem. Follow the steps below for an effective data
collection plan.
First, let us talk about primary data collection. Data is tagged as primary when the research team obtains
the data by themselves.
Step 1: Whom Do I Want to Research?
Populations
The target population is the specific group of people that you want to find out about. This group can be
very broad or relatively narrow. For example:

• The population of Brazil


• US college students
• Second-generation immigrants in the Netherlands
• Customers of a specific company aged 18-24
• Vietnamese transgender women over the age of 50

Samples
It’s rarely possible to survey the entire population of your research – it would be very difficult to get a
response from every person in Brazil or every transgender woman in Vietnam. Instead, you will usually
survey a sample from the population.

The sample size depends on how big the population is – and this depends, among others, on the
confidence level you want (remember data analysis?). Your survey should aim to produce results that can
be generalized to the whole population. That means you need to carefully define exactly who you want to
draw conclusions about.

Step 2: Questionnaires or Interview? Or other data sources?


A questionnaire is a list of questions distributed by mail, online, or in person, which respondents fill out themselves. An interview is where the researcher asks a set of questions by phone or in person and records the responses. Keep in mind that which type you choose depends on the sample size and location, as well as the focus of the research.
If you choose survey (questionnaire) research, you should consider the pros and cons of each method:
Paper-based survey
Pros: Easy access to a large sample; some control over who is included in the sample (e.g. residents of a specific region).
Cons: The response rate is often low, and the method is at risk of biases like self-selection bias.
Used in: A common method of gathering demographic information (for example, in a government census of the population).

Online survey
Pros: Low cost and flexibility; quick access to a large sample without constraints on time or location; the data is easy to process and analyze.
Cons: The anonymity and accessibility of online surveys mean less control over who responds, which can lead to biases like self-selection bias.
Used in: Tools such as Qualtrics, SurveyMonkey, and Google Forms; students doing dissertation research.

In-person survey
Pros: You can screen respondents to make sure only people in the target population are included in the sample, and collect time- and location-specific data (e.g. the opinions of a store's weekday customers).
Cons: The sample size will be smaller, so this method is less suitable for collecting data on broad populations and is at risk of sampling bias.
Used in: Research that focuses on a specific location, where you can distribute a written questionnaire to be completed by respondents on the spot.

Note: Using a survey does not mean it generates quantitative data. In fact, it depends on whether the questions are open-ended or closed-ended. Open-ended questions often generate qualitative data.
Like questionnaires, interviews can be used to collect quantitative data: the researcher records each
response as a category or rating and statistically analyzes the results. But they are more commonly used to
collect qualitative data: the interviewees’ full responses are transcribed and analyzed individually to gain
a richer understanding of their opinions and feelings.
Face-to-face interview
Definition: Involves asking individuals or small groups questions about a topic.
Pros: Higher potential for insights.
Cons: More complicated to organize; higher social bias than with a focus group; organization costs, especially in case of no-shows.
Example question: How does social media shape body image in teenagers?

Focus group
Definition: A specific form of group interview, where interaction between participants is encouraged. The person conducting a focus group plays the role of a facilitator encouraging the discussion, rather than an interviewer asking questions.
Pros: Diversity of interviewees' profiles and enrichment of responses; cheaper than face-to-face interviews; can confirm insights obtained through other qualitative methodologies.
Cons: The speaking time of some attendees may be considerably higher than that of others, making their contribution disproportionate; lower average speaking time per participant; the moderator's bias is hard to prevent.
Example question: How can teachers integrate social issues into science curriculums?

There are other ways to obtain data sources, but this depends on what data you need. For example,
scientists use sensors to measure ocean salinity and temperature, and these are all processed by
computers.
Step 3: Okay, what to ask and how to ask?

• The type of questions (e.g. open-ended or closed-ended)


• The content of the questions
• The phrasing of the questions
• The ordering and layout of the survey

Closed-ended questions give the respondent a predetermined set of answers to choose from. A closed-
ended question can include:

• A binary answer (e.g. yes/no or agree/disagree)


• A scale (e.g. a Likert scale with five points ranging from strongly agree to strongly disagree)
• A list of options with a single answer possible (e.g. age categories)
• A list of options with multiple answers possible (e.g. leisure interests)
Closed-ended questions are best for quantitative research. They provide you with numerical data that can
be statistically analyzed to find patterns, trends, and correlations.
Open-ended questions are best for qualitative research. This type of question has no predetermined
answers to choose from. Instead, the respondent answers in their own words. However, current AI
technologies also allow quantitative methods for qualitative data, e.g. sentiment analysis.
Open questions are most common in interviews, but you can also use them in questionnaires. They are
often useful as follow-up questions to ask for more detailed explanations of responses to the closed
questions.
The content of the survey questions
To ensure the validity (causality) and reliability (consistency) of your results, you need to carefully
consider each question in the survey. All questions should be narrowly focused with enough context for
the respondent to answer accurately. Avoid questions that are not directly relevant to the survey’s
purpose. When constructing closed-ended questions, ensure that the options cover all possibilities. If you
include a list of options that isn’t exhaustive, you can add an “other” field.
Step 4: Distribute the survey and collect responses
Before you start, create a clear plan for where, when, how, and with whom you will conduct the survey.
Determine in advance how many responses you require and how you will gain access to the sample.
When you are satisfied that you have created a strong research design suitable for answering your
research questions, you can conduct the survey through your method of choice – by mail, online, or in
person

Data collection methods


We summarize data collection methods below.

Method – When to use – How to collect data
Experiment – To test a causal relationship. – Manipulate variables and measure their effects on others.
Survey – To understand the general characteristics or opinions of a group of people. – Distribute a list of questions to a sample online, in person, or over the phone.
Interview/focus group – To gain an in-depth understanding of perceptions or opinions on a topic. – Verbally ask participants open-ended questions in individual interviews or focus group discussions.
Observation – To understand something in its natural setting. – Measure or survey a sample without trying to affect them.
Ethnography – To study the culture of a community or organization first-hand. – Join and participate in a community and record your observations and reflections.
Archival research – To understand current or historical events, conditions, or practices. – Access manuscripts, documents, or records from libraries, depositories, or the internet.
Secondary data collection – To analyze data from populations that you can't access first-hand. – Find existing datasets that have already been collected, from sources such as government agencies or research organizations.

Experimental research is primarily a quantitative method.


Interviews, focus groups, and ethnographies are qualitative methods.
Surveys, observations, archival research and secondary data collection can be quantitative or qualitative
methods.

Secondary data
Secondary data is data obtained by a research team from other sources. For example, the General
Statistics Office of Vietnam (gso.gov.vn) regularly reports key figures about the social and economic
affairs of Vietnam. Several groups (e.g. United Nations, Vietnam Chamber of Commerce and Industry)
also publish reports of their research projects, and often, keep data accessible to everyone, although at
times, it must be bought. The researchers must understand the reliability and appropriateness of the data
collection process used by the institution.
An API, or application programming interface, is a set of defined rules that enable different applications
to communicate with each other. It acts as an intermediary layer that processes data transfers between
systems, letting companies open their application data and functionality to external third-party developers,
business partners, and internal departments within their companies. Using an API and Python, we can extract data from websites quickly and cost-effectively.
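As a minimal sketch (the URL and parameters below are hypothetical; a real API has its own documentation and usually requires an access key), extracting data from an API with Python often looks like this:

import requests

# Hypothetical endpoint: replace with the API you actually have permission to use.
url = "https://api.example.com/v1/reports"
params = {"year": 2023, "format": "json"}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()     # stop early if the request failed
data = response.json()          # many APIs return JSON, which maps to Python dicts/lists

print(type(data))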
Even without an API, data stored on the internet can be scraped using web-scraping tools, just like what we will do in the laboratory sessions of this class. Take note that scraping requires some understanding of how HTML works (although for this course, you will be guided on how to do it).

Data Warehouse, Data Mart, and Data Lake


Three common terminologies appear when discussing data storage, especially when large amounts of data are available: data warehouse, data mart, and data lake. The key differences we will need to know are summarized below.

Characteristics: Data Warehouse vs Data Mart
• Scope: Warehouse – centralized, multiple subject areas integrated together. Mart – decentralized, specific subject area.
• Users: Warehouse – organization-wide. Mart – a single community or department.
• Data source: Warehouse – many sources. Mart – a single or a few sources, or a portion of data already collected in a data warehouse.
• Size: Warehouse – large, can be 100's of gigabytes to petabytes. Mart – small, generally up to 10's of gigabytes.
• Design: Warehouse – top-down. Mart – bottom-up.
• Data detail: Warehouse – complete, detailed data. Mart – may hold summarized data.

Characteristics: Data Warehouse vs Data Lake
• Data: Warehouse – relational data from transactional systems, operational databases, and line-of-business applications. Lake – all data, including structured, semi-structured, and unstructured.
• Schema: Warehouse – often designed prior to the data warehouse implementation, but can also be written at the time of analysis (schema-on-write or schema-on-read). Lake – written at the time of analysis (schema-on-read).
• Price/Performance: Warehouse – fastest query results using local storage. Lake – query results getting faster using low-cost storage and decoupling of compute and storage.
• Data quality: Warehouse – highly curated data that serves as the central version of the truth. Lake – any data that may or may not be curated (i.e. raw data).
• Users: Warehouse – business analysts, data scientists, and data developers. Lake – business analysts (using curated data), data scientists, data developers, data engineers, and data architects.
• Analytics: Warehouse – batch reporting, BI, and visualizations. Lake – machine learning, exploratory analytics, data discovery, streaming, operational analytics, big data, and profiling.
Chapter 3 The Data Science Process – Accessing and managing
data: SQL
What is SQL?
Structured Query Language (SQL, pronounced "sequel") is a language that was developed to manage relational databases. You will discuss relational databases more in your database course. Simply put, a relational database is data stored in one or more tables, where each table is related to at least one other table. Relational databases are (highly) structured because (1) of their tabular format, where each column is associated with an "attribute" of an "entity" (or data entry), and (2) because of the fixed relations between tables.
Structured Query Language (SQL) was originally developed by IBM in the 1970s. As of 2022, the latest version of the standard is SQL:2016, which has greatly improved since the original SQL-86. SQL serves as both a DDL (data definition language) and a DML (data manipulation language), and it can also control security and handle transactions. It is sometimes pronounced "sequel" because the original name was Structured English QUEry Language (SEQUEL), which was later shortened to SQL.
Each table in SQL is often (not all… but let us leave that to your database course) related to an entity
type. A table is also called a relation. Each row is also called an entry or a tuple, and refers to one specific
entity of that entity type. Often, however, we simply call the entity type as entity without ambiguity. Each
column is called an attribute.

The SELECT keyword


Because of time constraints, this course will only deal with using SQL to access and view databases that already exist. We will not modify the contents of a database. Instead, we will just query them and look at the resulting tables. We will also perform some aggregation to process the data.
Using the SELECT statement, we can
(1) Choose which columns of the table we need (and rename as needed)
(2) Filter the results using TOP/LIMIT and WHERE
a. Filter based on columns with numerical values
b. Filter based on columns with date/time values
c. Filter based on columns with string values
(3) Sort results (ORDER BY) in ascending or descending order
(4) Aggregate results using MIN, MAX, COUNT, SUM, and AVG and the GROUP BY keyword
Note: Different SQL implementations like SQL Server, MySQL, sqlite, etc. have slightly different syntax. We will use sqlite syntax for this course. See https://www.sqlite.org/lang.html or https://www.w3schools.blog/select-query-sqlite for more information.
For standalone SELECT queries, it will generally follow the following syntax, with the lines from JOIN
and below being optional:
SELECT <column names>
FROM <table name>
JOIN <table 2 name> on <join condition>
WHERE <condition/s>
GROUP BY <column name>
ORDER BY <column name> <ASC/DESC>
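To make the template concrete, here is a small, self-contained sketch that runs an sqlite query from Python; the SALES table and its values are invented for illustration only:

import sqlite3

# Build a throwaway in-memory database so the query below is fully self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SALES (SaleID INTEGER PRIMARY KEY, Product TEXT, Quantity INTEGER, Price REAL);
INSERT INTO SALES VALUES
  (1, 'Notebook', 3, 2.5),
  (2, 'Pen', 10, 0.5),
  (3, 'Notebook', 1, 2.5),
  (4, 'Marker', 4, 1.2);
""")

# SELECT with WHERE, GROUP BY, aggregates, and ORDER BY (sqlite syntax)
query = """
SELECT Product, SUM(Quantity) AS TotalQty, AVG(Price) AS AvgPrice
FROM SALES
WHERE Quantity > 0
GROUP BY Product
ORDER BY TotalQty DESC;
"""
for row in conn.execute(query):
    print(row)
conn.close()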

Primary keys, foreign keys and joining


Of importance in discussing SQL are keys and joining tables via these keys. The power of SQL lies in being able to manage large amounts of data rather efficiently (at least before scaling horizontally became necessary).
Each table has a primary key – a set of columns that, in combination, has a unique value for each row.
Often, this primary key is a single column. This primary key is often used as the identifier of that data
entry.
When working with multiple tables that are related to each other, it is often necessary to have a foreign
key. A foreign key in table B is a (set of) column(s) that points to a primary key in the other table A. A
rule here is that any value in the foreign key in table B must appear in table A.
For example, we can have a CLASS table that contains all the classes to be taught and a TEACHER table that contains information about the teachers. Each class has several attributes, such as ClassID, ClassName, Year, Semester, TeacherID, StartDate, etc. The column ClassID (with values like HSB3119, HSB3114, etc.) can be the primary key. The TEACHER table has attributes TeacherID, Name, Title, Specialty, etc., and TeacherID can be its primary key. The CLASS table then has a foreign key (the column TeacherID) that points to the TEACHER table (to its TeacherID column). Clearly, if a class X is taught by the teacher with TeacherID Y, then a row with TeacherID Y must exist in the TEACHER table.
Using the foreign key, we can JOIN the two tables CLASS and TEACHER to come up with a bigger table CLASS_FULLDETAILS with attributes ClassID, ClassName, Year, Semester, StartDate, TeacherID, Name, Title, Specialty, etc. This can be done using the JOIN command; a sketch is shown below. For simplicity, we will limit ourselves to the basic INNER and LEFT JOIN. Note that it is possible to JOIN a table with itself.
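Below is a minimal sketch of this example using sqlite in Python; the table contents are invented, and only a handful of the attributes mentioned above are included:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TEACHER (TeacherID TEXT PRIMARY KEY, Name TEXT, Title TEXT);
CREATE TABLE CLASS (
  ClassID TEXT PRIMARY KEY,
  ClassName TEXT,
  TeacherID TEXT,
  FOREIGN KEY (TeacherID) REFERENCES TEACHER(TeacherID)
);
INSERT INTO TEACHER VALUES ('T01', 'Teacher A', 'Lecturer'), ('T02', 'Teacher B', 'Professor');
INSERT INTO CLASS VALUES
  ('HSB3119', 'Introduction to Data Science', 'T01'),
  ('HSB3114', 'Another Course', 'T02');
""")

# INNER JOIN: keep only classes whose TeacherID matches a row in TEACHER
query = """
SELECT c.ClassID, c.ClassName, t.Name, t.Title
FROM CLASS AS c
JOIN TEACHER AS t ON c.TeacherID = t.TeacherID;
"""
for row in conn.execute(query):
    print(row)
conn.close()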
We will also learn how to combine tables that have the same structure using UNION, UNION ALL,
INTERSECT, and EXCEPT.

Note: As far as we are concerned, we will not discuss null values in SQL yet. We will leave that for the
Data Cleaning chapter and hopefully in your database class.
Chapter 4 The Data Science Process – Cleaning and Wrangling
Data
Preprocessing
Understanding the data
The first thing to do when obtaining a data set is to understand what it means. This includes knowing the
context of the data, what each entry is, and what each column is. By doing so, we can decide the
appropriate type of data, data type, and analyses that can be done. Of course, we can only solve a problem
or present about it if we understand what the data is.
Ideally, we also have an idea on how the data was collected and stored in order to ensure how reliable it is
and the limitations it may have.
For example, it is possible to have data coming from a survey of students about their marks in HS and
university. After looking at the data (or talking with the one who administered the survey), then one can
realize the following:
1. Each entry (row) in the table is a response from a student.
2. The columns may represent the following information: student details like ID number, name,
gender, birthday, HS math score, HS English score, University score 1, University score 2.
3. The data was collected via an online survey.
What are the implications? Well, we should understand that the scores should be numbers in general (but
still, it could be written in a Vietnamese format or an English format), that names are often in Vietnamese
(so getting the family name and first name may be tricky), etc. Some students are also often careless (even
adults can make mistakes) so sometimes, instead of writing 9.1, they type 91 or even 911. Knowing your
data means one should know when to accept 911 or not.
Formatting data
One of the common things to perform is standardizing the format of the input, especially if the data is manually registered by individuals, in contrast to being automatically generated (e.g. transactions logged by computers). For instance, a person might type her name as NGUYEN CHI THANH, but some would write Nguyen Chi Thanh, or even Chi Thanh Nguyen. As someone working with data, one should standardize such entries carefully, not only to make the data look nicer but also to make it more reliable; we will see some examples of this in the lab. Typing mistakes are also common, so people might type "Ha noi", "Hanoi", and "Hanoii", and these must all be standardized. (The best way is of course to design the survey better, but an analyst must always check.)
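A minimal pandas sketch of this kind of standardization (the column names and values are made up) could look like this:

import pandas as pd

# Hypothetical survey responses with inconsistent formatting
df = pd.DataFrame({
    "name": ["NGUYEN CHI THANH", " nguyen chi thanh", "Nguyen Chi Thanh"],
    "city": ["Ha noi", "Hanoii", "Hanoi"],
})

# Trim stray spaces and standardize capitalization
df["name"] = df["name"].str.strip().str.title()

# Map known misspellings/variants to one canonical value
df["city"] = df["city"].replace({"Ha noi": "Hanoi", "Hanoii": "Hanoi"})

print(df)

Reordering names written as "Chi Thanh Nguyen" versus "Nguyen Chi Thanh" is harder and usually needs domain knowledge or a better-designed survey.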
Duplicates and missing values
Part of the preprocessing stage is deciding what to do with missing values and duplicates.
Duplicates: Does the data expect duplicate entries, and how does it affect the table contents? Maybe a
person can order from the same shop in a span of 1 hour, but these should have different order numbers
(if the order numbers are the same, then maybe we should only process once). Often, duplicates occur
when we combine data from different sources, and so, duplicates must often be removed.
Missing values: For missing values, also called NULL values, there are generally three approaches. The first is to delete the affected entries, which is reasonable if (1) there are very many data points and losing a small percentage will not change the distribution significantly, and (2) there is no suitable way to fill in the value (the second approach below). Note that missing values can occur for three reasons: the value was truly unknown, it was known but not disclosed, or the value simply does not exist; thus, deleting an entry must be done with extreme care. The second approach is to impute an appropriate value, for example the mean or median of that column; such a data point sits roughly in the middle of the dataset and should not strongly affect the analysis – there remain exceptions, of course. The last approach is to leave the value empty, in which case the subsequent analyses must be done carefully to leave out these NULL values as appropriate.

Data Wrangling
Descriptive statistics, histograms, and correlations
Once we have a clean set of data, it is important to check several summary statistics, not only to check the
veracity of our data but to know its shape and possibly see some patterns. As such, it is common to look at
the distribution of the data: often shown as a bar chart (nominal and ordinal), histogram (numerical), or
scatter/line chart (for time series data). Boxplots can also be useful especially when comparing different
samples. A visual check of the distribution allows us to identify normality (which is a common
assumption), skewness, modality, or even identify which test we can or cannot use. This will also give us
an idea if we need to do some transformation for our data (e.g. logarithmic transformation).
Aside from histogram and distribution, it will be necessary to know the following terms: mean, median,
mode, standard deviation, range, variance, covariance, correlation, quartiles, percentiles, minimum,
maximum, count, and even uniqueness of the data points. An important concept, to be reviewed in
regression, is the correlation coefficient.
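In pandas, most of these summaries come from a few calls; the scores below are invented:

import pandas as pd

df = pd.DataFrame({
    "hs_math": [9.1, 7.5, 8.0, 6.5, 9.5],
    "uni_score": [8.8, 7.0, 8.2, 6.0, 9.0],
})

print(df.describe())          # count, mean, std, min, quartiles, max
print(df["hs_math"].mode())   # most frequent value(s)
print(df.corr())              # pairwise correlation coefficients

df["hs_math"].hist(bins=5)    # quick visual check of the distribution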
Outliers
Outliers are data points that lie very far from the “center” of the data. As such, these data points can
occasionally strongly impact the subsequent analyses, e.g. when we talk about the mean, or when we do
regression analysis. Often, an analyst must determine whether to include these outliers or not in the
analysis. On some occasions, these outliers may actually be faulty data that were mistakenly typed. But
sometimes, because of the underlying nature of the data, we may need to perform transformations to keep
this data but reduce their outlier effect. For instance, taking the logarithm of some variables will decrease
the distance between points.
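One common convention (an assumption here, not the only possible rule) flags points more than 1.5 times the interquartile range beyond the quartiles; a sketch with made-up numbers:

import pandas as pd

s = pd.Series([5, 6, 7, 6, 5, 8, 7, 95])      # hypothetical data with one extreme value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)                                # flags the value 95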
Selecting, sorting and filtering data
When used in real life, data must be summarized into the most important and relevant statistics or
indicators. For example, an e-commerce company may have all the data regarding all the transactions of
its customers, but no one brings all these into the board room. The directors will simply look at, for
instance, the total sales and total expenditures every month, as well as a few detailed analyses on certain products and services that may be interesting.
As a data analyst and a data scientist, you must then be able to carefully choose what data must be
included in each step of the analysis. Some data might be excluded from one part, but may be useful in
another; keeping all the data will however be distracting and may incur costs.
Selecting data involves choosing which attributes or features to keep. Sorting involves arranging them in
a particular order as needed, and filtering is removing the unrelated data entries as needed. These steps
may be done repeatedly for each new analysis or output.
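In pandas these three operations are one-liners; the transactions table below is invented:

import pandas as pd

df = pd.DataFrame({
    "product": ["Pen", "Notebook", "Marker", "Pen"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [120, 300, 80, 150],
})

selected = df[["product", "sales"]]                  # selecting: keep only some columns
filtered = df[df["sales"] >= 100]                    # filtering: drop unrelated rows
ordered = df.sort_values("sales", ascending=False)   # sorting: arrange by a column

print(selected)
print(filtered)
print(ordered)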
Transforming data
One-hot encoding (dummy variables) for categorical data
Categorical or nominal data is difficult to handle as-is together with standard structured data, especially since many algorithms require numerical data. To deal with this, each attribute with 𝑘 categories can be converted into 𝑘 columns or attributes – one for each category. A data entry is marked 1 if it belongs to that category and 0 if not. This can be done using conditionals or the get_dummies command.
Transforming data
Transforming data can sometimes be necessary if the values are too big or too skewed. Such skewness can bring imbalance to the analysis (it can put undue emphasis on large values). By taking the logarithm (or some other transformation), the values of the data come closer to having linear relationships with other variables and may display a more normal distribution.
Standardizing data
To standardize data is to make the data follow certain rules, e.g. having minimum (0) or maximum (1)
values, or having standard normal distributions (zero mean and variance 1). Either way, this can easily be
done by code. In this class, we will usually use the standard normal transformation:
𝑍 = (𝑋 − 𝜇) / 𝜎
so that our data points will have an average of 0 and variance (and standard deviation) of 1.
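In code, this transformation is a single line; the scores are invented, and note that pandas uses the sample standard deviation by default:

import pandas as pd

scores = pd.Series([9.1, 7.5, 8.0, 6.5, 9.5])

z = (scores - scores.mean()) / scores.std()   # Z = (X - mu) / sigma
print(z.mean().round(10), z.std())            # approximately 0 and 1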
Aggregating data
Cleaned data is still not summarized. You might have learned to use pivot tables in Excel or aggregate
functions in SQL, and we can do something similar in Python. We will focus on using groupby along
with one or more aggregate functions (count, min, max, mean, median, mode, stdev, variance, etc.) to
summarize data into a new table that will be ready for visualization.
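A minimal groupby sketch (the table is invented), analogous to an Excel pivot table or a SQL GROUP BY:

import pandas as pd

df = pd.DataFrame({
    "product": ["Pen", "Notebook", "Pen", "Marker"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [120, 300, 150, 80],
})

summary = df.groupby("product")["sales"].agg(["count", "sum", "mean"])
print(summary)   # one summarized row per product, ready for visualization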
For the laboratory, we will focus on some descriptive statistics, handling missing values, sorting and
aggregating.
Chapter 5 The Data Science Process – Visualizing Data
What is data visualization?
1. Data visualization involves looking at your data and trying to understand where it comes from
and what it could say
2. Visualization transforms data into images that effectively and accurately represent information
about the data. – Schroeder et al. The Visualization Toolkit, 2nd ed. 1998
3. Visualization – creating appropriate and informative pictures of a dataset through iterative exploration of the data. These informative pictures rely on visual properties that we notice without conscious effort.

What does visualization do?


Three types of goals for visualization:
• … to explore: nothing is known yet, and visualization is used for data exploration.
• … to analyze: there are hypotheses to be checked, and visualization is used for verification.
• … to present: "everything" is known about the data, and visualization is used to communicate results.

Univariate Plots
Categorical
Bar chart and Pie chart
The most commonly used plots for categorical variables are bar plots and
pie charts. We can also simply represent them by tables, but they are less
eye-catching.
Pie chart: Graphical representation for categorical data in which a circle is
partitioned into “slices” on the basis of the proportions of each category

Bar chart: Graphical representation for categorical data in which vertical (or sometimes horizontal) bars are used to depict the number of experimental units in each category; bars are separated by space.

Bar or Pie?
This could be a question of personal taste, but here’s the rule of thumb that I use: when there are more
than 4-5 categories in the variable, I’d use a bar plot rather than a pie chart.

Numerical
A picture of a numerical variable should show how many cases there are for each value, that is, its
distribution.
Histogram
A histogram is used when a variable takes numeric values. DO NOT confuse histograms with bar charts. A bar chart is a graphical representation of data that compares distinct variables, whereas a histogram depicts the frequency distribution of a numerical variable, for which "bins" are determined to create numerical categories.
BoxPlot
The box plot is a diagram suited for showing the distribution of univariate numerical data, by visualizing the following characteristics of a dataset: first quartile, median, third quartile, whiskers, and outliers.
First quartile: Value which separates the lower 25% from the upper 75% of the values of the dataset
Median: Middle value of the dataset
Third quartile: Value which separates the upper 25% from the lower 75% of the values of the dataset
Interquartile range: Range between first and third quartile
Whiskers: Lines ranging from minimum to first quartile and from third quartile to maximum (outliers
excluded)
Outliers: Values that are extremely far from the data set, represented by circles beyond the whiskers.
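A short matplotlib sketch of the two numerical plots above, using invented scores:

import matplotlib.pyplot as plt
import pandas as pd

scores = pd.Series([6.5, 7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 9.1, 9.5, 4.0])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].hist(scores, bins=5)        # histogram: frequency distribution over bins
axes[0].set_title("Histogram")

axes[1].boxplot(scores)             # box plot: quartiles, whiskers, and outliers
axes[1].set_title("Box plot")

plt.tight_layout()
plt.show()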

Bivariate Plots
Bivariate means two variables. In a bivariate plot, we explore the relationship between two variables. One of these is often referred to as the independent variable (cause, usually 𝑥) and the other as the dependent variable (effect, usually 𝑦). We look at line graphs and scatter plots to explore the relationship between one IV and one DV.

A line chart is a visualization that displays information as data points connected by straight line segments. You can use the chart to extract trend and pattern insights from raw data, which is oftentimes a time series.
On the other hand, a scatter plot uses dots to display associations and correlations present in your data. It can also be combined with a regression line (or line of best fit) to display the relationship between two sets of varying data, highlighting correlations and associations.
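A sketch of both bivariate plots with invented (x, y) pairs; the line of best fit here is a simple degree-1 polynomial fit:

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])                      # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])   # dependent variable

plt.scatter(x, y, label="observations")        # scatter plot of the raw pairs

slope, intercept = np.polyfit(x, y, deg=1)     # line of best fit
plt.plot(x, slope * x + intercept, label="best fit")

plt.xlabel("x (independent)")
plt.ylabel("y (dependent)")
plt.legend()
plt.show()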

Other types of plots


There are a lot of other charts that can be used to visualize data, although it is not our goal to discuss all of them. We will briefly describe a few other types.
Regardless of the choice of visualization, it must be able to
effectively convey the message of the data, especially
when used in the presenting stage.

Density plot
Density plots are used to study the distribution of one or a few variables. Checking the distribution of
your variables one by one is probably the first task you should do when you get a new dataset. It delivers
a good quantity of information. Density plots are continuous analogues of the histogram.

Heatmap
Heatmaps employ colors to show where values are larger or smaller, and are superimposed on charts or even geographic locations.

Wordcloud
A word cloud is a visual representation of the words that appear most commonly in a set of observations. It is a great tool for drawing the audience's attention to the words that appear most often, instead of using a bar chart (which could be better, depending on the intention of the visual); the numerical value of each word determines the size of its text.

Software
Many data analysts will be asked to know one of the following: Microsoft Excel, PowerBI, Tableau, or
visualization packages in R or Python. In Python, we highlight the packages pandas, matplotlib, seaborn,
and tensorflow as having useful plotting functions. This list is not and will never be exhaustive, as newer software and packages are continually developed.
Chapter 6 The Data Science Process – Publishing Data (Github)
Making a compelling story
In contrast to a Data Analyst, whose main job focuses on presenting key findings, trends, and patterns to stakeholders (usually by creating dashboards or reports for executives) and who is usually more concerned with monitoring, a Data Scientist often addresses specific questions such as:
1. What factors of the business process must be improved (and how)?
2. Are there specific segments of the customers who must be targeted?
3. How can we improve specific processes in our systems?
For more specific questions, it could be:
4. [Postal industry] How can we reduce mistakes in reading the address in a letter, and increase the
efficiency of the mailing system? (Solution: use AI to read the addresses and sort them)
5. [Banking industry] How can we reduce the risk of giving bad loans to people? (Solution: use
classification algorithms to determine which customer has higher or lower risk)
6. [Health industry] Can we use laboratory test to suggest if a person is likely to develop a certain
disease? (Solution: use classification to determine likelihood of developing a disease)
As such, the story developed by a data scientist can be more specific to the problem at hand. Depending
on the audience, it can be technically involved (if reporting to a senior data scientist) or a summarized
version of it. Nevertheless, a report should include the following:
- An executive summary
- An introduction to the problem
- Information on how data is collected, sorted, and cleaned
- What methodologies are employed
- The results of the processes
- Discussion and conclusion
While such a report can be comprehensive, making the story compelling lies in first choosing an appropriate framing of the problem, then choosing the appropriate set of details to discuss, and finally showing how the problem is solved, highlighting how the particular proposal will result in massive advantages over the current system. While disadvantages must be noted, focusing on the advantages is crucial to making the story attractive.

A repository and a webpage


The results of a data science project can be a report that is given orally (as in a meeting) or as a portfolio entry (a notebook or a website). For those catering to high-level audiences (executive
summary), it is best to keep the report to a minimum: a quick introduction and statement of the problem,
the data and method used (and why it is applicable), the results and a few key figures, conclusion. Those
catering to the fellow data scientists might need to show all the steps done, particularly emphasizing the
data cleaning, exploration, wrangling, and validation process. Comparison to other models, if any, should
also be presented.
Data scientists often share their work as Jupyter notebooks published online, or as a portfolio on their website. Jupyter notebooks allow other scientists to see the entire process and help improve each other's work by sharing thoughts, insights, and even improvements, especially in case of errors. A webpage works well to showcase your capabilities, especially when you aim to attract recruiters or collaborators.
In your project, you will be asked to do both: Your Jupyter notebook should contain all your workings
while a simple website hosted via Github should contain the key findings of your report.
