Data Analytics

Course 1

WEEK 1:
- Gap analysis: a method for examining how a process currently works, comparing the current state with an ideal or desired state.
- Data ecosystems are made up of various elements that interact with one another in order to
produce, manage, store, organize, analyze, and share data.
WEEK 2:
- Analytical skills: The qualities and characteristics associated with solving problems using facts
- A technical mindset: The analytical skill that involves breaking processes down into smaller
steps and working with them in an orderly, logical way
- Data design: The analytical skill that involves how you organize information
- Understanding context: The analytical skill that has to do with how you group things into
categories
- Data strategy: The analytical skill that involves managing the processes and tools used in data
analysis
- Analytical thinking involves identifying and defining a problem and then solving it by using data
in an organized, step-by-step manner.
+ The five key aspects of analytical thinking are visualization, strategy, problem
orientation, correlation, and big-picture and detail-oriented thinking.
 Visualization is the graphical representation of information
 With so much data available, having a strategic mindset is key to staying focused and on
track. Strategizing helps data analysts see what they want to achieve with the data and
how they can get there. Strategy also helps improve the quality and usefulness of the
data we collect. By strategizing, we know all our data is valuable and can help us
accomplish our goals.
 Problem orientation: It's all about keeping the problem top of mind throughout the
entire project
 Correlation does not equal causation. In other words, just because two pieces of data
are both trending in the same direction, that doesn't necessarily mean they are
related.
 Big-picture thinking is like looking at a complete puzzle. Detail-oriented thinking is all
about figuring out all of the aspects that will help you execute a plan; in other words, the
pieces that make up your puzzle.
- Gap analysis is used to examine and evaluate how a process currently works with the goal of
getting to where you want to be in the future.
WEEK 3:
- The data analysis life cycle:
+ Ask: Define the problem and confirm stakeholder expectations
 Defining a problem means you look at the current state and identify how it's different
from the ideal state. For instance, a sports arena might want to reduce the time fans
spend waiting in the ticket line. The obstacle is figuring out how to get the customers to
their seats more quickly
 Understand the stakeholder expectation. For instance, if your manager assigns you a
data analysis project related to business risk, it would be smart to confirm whether they
want to include all types of risks that could affect the company, or just risks related
to weather such as hurricanes and tornadoes.
+ Prepare: Collect and store data for analysis
+ Process: Clean and transform data to ensure integrity (data analysts find and eliminate
any errors and inaccuracies that can get in the way of results. This usually means cleaning data,
transforming it into a more useful format, combining two or more datasets to make information
more complete and removing outliers, which are any data points that could skew the
information)
+ Analyze: Use data analysis tools to draw conclusions
+ Share: Interpret and communicate results to others to make data-driven decisions
+ Act: Put your insights to work in order to solve the original problem
- The life cycle of data is plan, capture, manage, analyze, archive and destroy.
+ During planning, a business decides what kind of data it needs, how it will be managed
throughout its life cycle, who will be responsible for it, and the optimal outcomes.
For example, let's say an electricity provider wanted to gain insights into how to save people
energy. In the planning phase, they might decide to capture information on how much
electricity its customers use each year, what types of buildings are being powered, and what
types of devices are being powered inside of them. The electricity company would also decide
which team members will be responsible for collecting, storing, and sharing that data.
+ Capture data: data is collected from a variety of different sources and brought into the
organization. Common methods for collecting data:
 getting data from outside resources. For example, if you were doing data analysis on
weather patterns, you'd probably get data from a publicly available dataset like the
National Climatic Data Center.
 Company's own documents and files ~ database: A database is a collection of data
stored in a computer system. When you maintain a database of customer information,
ensuring data integrity, credibility, and privacy are all important concerns
+ Analyze data: In this phase, the data is used to solve problems, make great decisions, and
support business goals.
+ Archive phase. Archiving means storing data in a place where it's still available, but may not be
used again

NOTE: Be careful not to mix up or confuse the six stages of the data life cycle (Plan, Capture,
Manage, Analyze, Archive, and Destroy) with the six phases of the data analysis life cycle (Ask,
Prepare, Process, Analyze, Share, and Act). They shouldn't be used or referred to
interchangeably.
- While the data analysis process will drive your projects and help you reach your business goals,
you must understand the life cycle of your data in order to use that process. To analyze your
data well, you need to have a thorough understanding of it. Similarly, you can collect all the data
you want, but the data is only useful to you if you have a plan for analyzing it.
- The Plan and Ask phases both involve planning and asking questions, but they tackle different
subjects. The Ask phase in the data analysis process focuses on big-picture strategic thinking
about business goals. However, the Plan phase focuses on the fundamentals of the project, such
as what data you have access to, what data you need, and where you’re going to get it.
- A database is a collection of data stored in a computer system.
WEEK 4:

- A query is a request for data or information from a database. When you query databases,
you use SQL to communicate your question or request. You and the database can always
exchange information as long as you speak the same language.
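A minimal sketch of what such a query might look like, assuming a hypothetical customer_data.customer_name table:

  -- return the names of customers located in Chicago
  SELECT
    first_name,
    last_name
  FROM
    customer_data.customer_name
  WHERE
    city = 'Chicago';  -- only rows matching this condition are returned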
Course 2:
WEEK 1:
- Structured thinking is the process of recognizing the current problem or situation, organizing
available information, revealing gaps and opportunities, and identifying the options.
- Operators are symbols used in formulas, including + (addition), – (subtraction), *
(multiplication), and / (division).
- In data analytics, qualitative data measures qualities and characteristics + is subjective
- Dashboards monitor live, incoming data from multiple datasets and organize the
information into one central location. Reports are static collections of data.
- Small data involves datasets with a small number of specific metrics, focused on short,
well-defined time periods. It's effective for analyzing day-to-day decisions.
- Big data involves datasets that are larger and less specific, focused on change over a long
period of time. It's effective for analyzing more substantial decisions.
- Data analysts ask thoughtful questions to help them reach solid conclusions, consider how
to share data with others, and help team members make effective decisions.
- Stakeholders included the owner, the vice president of communications, and the director of
marketing and finance
- Data analysts work with a variety of problems. These include: making predictions,
categorizing things, spotting something unusual, identifying themes, discovering
connections, and finding patterns.
+ Making predictions: This problem type involves using data to make an informed decision
about how things may be in the future.
+ Categorizing things: This means assigning information to different groups or clusters based
on common features
+ Spotting unusual things: data analysts identify data that is different from the norm
+ Identifying themes: Identifying themes takes categorization a step further by grouping
information into broader concepts
+ Discovering connections: enables data analysts to find similar challenges faced by different
entities, and then combine data and insights to address them (a scooter company is
experiencing an issue with the wheels it gets from its wheel supplier and has to stop
production until it can get safe, quality wheels back in stock. Meanwhile, the wheel company
is encountering a problem with the rubber it uses to make wheels; it turns out its rubber
supplier could not find the right materials either. If all of these entities could
talk about the problems they're facing and share data openly, they would find a lot of similar
challenges and, better yet, be able to collaborate to find a solution)
+ Finding patterns: Data analysts use data to find patterns by using historical data to
understand what happened in the past and is therefore likely to happen again
- Effective questions follow SMART methodology:
+ Specific: Specific questions are simple, significant, and focused on a single topic or a few
closely related ideas. For example, instead of asking a closed-ended question like "Are kids
getting enough physical activity these days?", ask "What percentage of kids achieve the
recommended 60 minutes of physical activity at least five days a week?"
+ Measurable: Measurable questions can be quantified and assessed. An example of an
unmeasurable question would be, why did a recent video go viral? Instead, you could ask how
many times was our video shared on social channels the first week it was posted?
+ Action-oriented: Action-oriented questions encourage change. So rather than asking, how can
we get customers to recycle our product packaging? You could ask, what design features will
make our packaging easier to recycle?
+ Relevant
+ Time-bound: Time-bound questions specify the time to be studied
- Fairness means asking questions that make sense to everyone. Even if a data analyst suspects
people will understand abbreviations, slang, or other jargon, it’s important to write questions
with simple wording.
WEEK 2:
- Data-inspired decision-making explores different data sources to find out what they have in
common
- An algorithm is a process or set of rules to be followed for a specific task
- Quantitative data is all about the specific and objective measures of numerical facts (how
often, how many,…)
- Qualitative data describes subjective or explanatory measures of qualities and
characteristics, or things that can't be measured with numerical data (answers "why?" questions)

- 2 data presentation tools:


+ Reports: A report is a static collection of data given to stakeholders periodically.
They can be designed and sent out periodically, often on a weekly or monthly basis, as
organized and easy-to-reference information. They're quick to design and easy to use as long as
you continually maintain them ~ high-level historical data.
Reports use static data, or data that doesn't change once it's been recorded, and they reflect
data that's already been cleaned and sorted ~ pre-cleaned and sorted data.
Reports don't show live, evolving data ~ they only show data from the past.
+ Dashboards: monitor live, incoming data.

Reports are typically created and updated on a set cycle, while dashboards are usually
updated continuously or in real time.
- Data and metrics:
+ A metric is a single, quantifiable type of data that can be used for measurement (e.g., raw
data might include information about quality, prices, and so on, but to compare revenue between
salespeople you need a metric such as sales = quantity * price)
- A metric goal is a measurable goal set by a company and evaluated using metrics
- Types of dashboard:
+ Strategic: focuses on long term goals and strategies at the highest level of metrics
+ Operational: short-term performance tracking and intermediate goals
+ Analytical: consists of the datasets and the mathematics used in these sets
- Dashboards are visualizations: Visualizing data can be enormously useful for understanding
and demonstrating what the data really means.
- Dashboards identify metrics: Relevant metrics may help analysts assess company performance.
Some differences include the timeframe described in each dashboard. The operational
dashboard has a timeframe of days and weeks, while the strategic dashboard displays the entire
year. The analytic dashboard skips a specific timeframe. Instead, it identifies and tracks the
various KPIs that may be used to assess strategic and operational goals.
- Dashboards can help companies perform many helpful tasks, such as:
+ Track historical and current performance.
+ Establish both long-term and/or short-term goals.
+ Define key performance indicators or metrics.
+ Identify potential issues or points of inefficiency.
While almost every company can benefit in some way from using a dashboard, larger
companies and companies with a wider range of products or services will likely benefit more.
Companies operating in volatile, or swiftly changing markets like marketing, sales, and tech also
tend to more quickly gain insights and make data-informed decisions.
- Dashboards can provide convenient access to information and analytics and are easy to use in
collaboration. Moreover, they may be tailored to the specific needs of the businesses, like
tracking performance towards a milestone.
Using a previous example of the ice cream store, the store owner might use an operational
dashboard to track their day-to-day sales. Meanwhile, they might use a strategic dashboard to
decide whether they have enough capacity to expand their business.
- Mathematical approach: looking at a problem and logically breaking it down step-by-
step, so you can see the relationships and patterns in your data and use them to analyze your
problem.
- Small data can be useful for making day-to-day decisions
- Big data: has larger, less specific datasets covering a longer period of time
WEEK 3:
* Working with spreadsheets

- Operators: symbols that name the type of operation or calculation to be performed
(+, –, *, /)
- The #DIV/0! error happens when a formula tries to divide a value in a cell by zero or by an
empty cell
- The #ERROR! message tells us the formula can't be interpreted as it is input.
Example: =SUM(B2:B4 C6:C8) is missing the comma between the two ranges
- The #N/A error tells you that the data in your formula can't be found by the spreadsheet.
Example: a VLOOKUP searches for "abc", but the table only contains "abcd", so "abc"
is not found
- The #NAME? error can happen when a formula's name isn't recognized or understood.
Example: typing VLOOOKUP with an extra O
- The #NUM! error tells us that a formula's calculation can't be performed as specified by the data
- The #VALUE! error can indicate a problem with a formula or with referenced cells (a value used
in the formula is invalid)
- The #REF! error often comes up when cells being referenced in a formula have been
deleted

- The problem domain: the specific area of analysis that encompasses every activity affecting or
affected by the problem.
- Structured thinking is the process of recognizing the current problem or situation, organizing
available information, revealing gaps and opportunities, and identifying the options
- A statement of work is a document that clearly identifies the products and services a vendor or
contractor will provide to an organization. It includes objectives, guidelines, deliverables,
schedule, and costs.
- A scope of work is project-based and sets the expectations and boundaries of a project. A
scope of work may be included in a statement of work to help define project outcomes (an
agreed-upon outline of the work)
- Deliverables are items or tasks you will complete before you can finish the project.
- Reports notify everyone as you finalize deliverables and meet milestones.
- Milestones are significant tasks you will confirm along your timeline to help everyone know
the project is on track.
- Timelines include due dates for when deliverables, milestones, and/or reports are due.
WEEK 4:
- 3 common stakeholders group:
+ Executive team: provides strategic and operational leadership to the company. They set goals,
develop strategy, and make sure that strategy is executed effectively. These stakeholders think
about decisions at a very high level and they are looking for the headline news about your
project first. They are less interested in the details.
+ Customer-facing team: The customer-facing team includes anyone in an organization who has
some level of interaction with customers and potential customers.
+ Data science team: Organizing data within a company takes teamwork.
- Working effectively with stakeholders:
+ Discuss goals
+ Feel empowered to say no
+ Plan for the unexpected
+ Know your project
+ Start with words and visuals
+ Communicate often
PREPARING DATA
WEEK 1:
- Data is collected through interviews, observations (most often used by scientists), forms,
questionnaires, surveys, and cookies (small files stored on computers that contain information
about users)

* Data sources
- First-party data: This is data collected by an individual or group using their own resources.
Collecting first-party data is typically the preferred method because you know exactly where it
came from
- Second-party data: which is data collected by a group directly from its audience and then
sold
- Third-party data: data collected from outside sources who did not collect it directly
- Remember to consider time frame when collecting data
* Different types of data (formats)
- Nominal data is a type of qualitative data that's categorized without a set order => this data
doesn't have a sequence (yes, no, not sure)
- Ordinal data: a type of qualitative data with a set order or scale ( rank the movie from 1 to 5)
- Internal data: which is data that lives within a company's own systems
- External data: data that lives and is generated outside of an organization (It’s useful when
analysis depends on as many data sources as possible)
- Structured data: data that's organized in a certain format, such as rows and columns
- Unstructured data: data that is not organized in any easily identifiable manner (audio, video
files, emails, social media,…)
- Data model: a model that is used for organizing data elements and how they relate to one
another
+ Data elements: they're pieces of information, such as people's names, account numbers, and
addresses
* Data types:
- A data type is a specific kind of data attribute that tells what kind of value the data is. A data
type can be a number, a text or string, or a Boolean
+ Text or string: a sequence of characters and punctuation that contains textual information
(can include numbers that aren't used for calculations, such as house numbers)
+ Boolean: a data type with only two possible values: true or false

- Wide data: every data subject has a single row with multiple columns to hold the values of
various attributes of the subject
- Long data: data in which each row is one time point per subject, so each subject will have
data in multiple rows.

- Use historical data if the time frame to complete the analysis is short


WEEK 2: BIAS
* Unbiased and objective data
- Bias: preference in favor of or against a person, group of people, or thing.
- Data bias is a type of error that systematically skews results in a certain direction
- Observer bias (experimenter bias/research bias): the tendency for different people to
observe things differently (looking at the same thing but have different ideas about that)
- Interpretation bias: the tendency to always interpret ambiguous situations in a positive or
negative way (it can lead to two people seeing or hearing the exact same thing and interpreting
it in a variety of different ways, because they have different backgrounds and experiences)
- Confirmation bias: is the tendency to search for, or interpret information in a way that
confirms preexisting beliefs (they only notice things that support it, ignoring all other signals)
* How to identify good data sources
- Good data sources are ROCCC: Reliable, Original, Comprehensive, Current, and Cited
* Data ethics and privacy


- Data ethics refers to well- founded standards of right and wrong that dictate how data is
collected, shared, and used
- 6 aspects of data ethics:
+ Ownership: Individuals own the raw data they provide, not the organization that invested
time and money collecting, storing, processing, and analyzing it. Individuals have primary
control over the data's usage, how it's processed, and how it's shared.
+ Transaction transparency: the idea that all data-processing activities and algorithms should
be completely explainable and understood by the individual who provides their data
+ Consent: This is an individual's right to know explicit details about how and why their data
will be used before agreeing to provide it
+ Currency: Individuals should be aware of financial transactions resulting from the use of
their personal data and the scale of these transactions
+ Privacy: preserving a data subject's information and activity any time a data transaction
occurs
+ Openness: free access, usage, and sharing of data
 Interoperability is key to open data's success: it is the ability of data
systems and services to openly connect and share data
WEEK 3: DATABASES
* Working with databases
- Database: a collection of data stored in a computer system
- Metadata: data about data. Metadata tells you where the data comes from, when and how it
was created
- A relational database is a database that contains a series of related tables that can be
connected via their relationships
+ A primary key is an identifier that references a column in which each value is unique (meaning
no two rows can have the same primary key), and it cannot contain null or blank values. A table
might not have a primary key. A primary key may also be constructed using multiple columns of
a table; this type of primary key is called a composite key.
+ A foreign key is a field within a table that's a primary key in another table (e.g., in a car
dealership database, part ID might be the primary key of a repair parts table, while branch ID
and VIN appear in it as foreign keys).
- Normalization is a process of organizing data in a relational database. For example, creating
tables and establishing relationships between those tables. It is applied to eliminate data
redundancy, increase data integrity, and reduce complexity in a database
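A sketch of how these keys might be declared in generic SQL DDL; the table and column names are hypothetical, and some warehouses (such as BigQuery) don't enforce key constraints:

  CREATE TABLE branch (
    branch_id INT PRIMARY KEY,        -- unique, non-null identifier for each branch
    branch_name VARCHAR(50)
  );

  CREATE TABLE car (
    vin VARCHAR(17) PRIMARY KEY,      -- this table's own primary key
    branch_id INT,
    FOREIGN KEY (branch_id) REFERENCES branch(branch_id)  -- a primary key from another table
  );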
* Metadata
- Metadata helps data analysts interpret the contents of the data within a database
+ Descriptive metadata: metadata that describes a piece of data and can be used to identify it at
a later point in time (e.g., a book's title or ISBN, which can be used to look up the author and
other information about the book)
+ Structural metadata: metadata that indicates how a piece of data is organized and whether it's
part of one or more than one data collection (e.g., a book's table of contents)
+ Administrative metadata: metadata that indicates the technical source of a digital asset (e.g.,
information about a photo: when it was taken, its resolution, its file name, etc.)
- Benefits of using metadata:
+ Metadata creates a single source of truth by keeping things consistent and uniform
+ Metadata also makes data more reliable by making sure it's accurate, precise, relevant, and
timely
- A metadata repository is a database specifically created to store metadata. Metadata
repositories can be stored in a physical location, or they can be virtual.

- Using a metadata repository, a data analyst can find it easier to bring together multiple sources
of data, confirm how or when data was collected, and verify that data from an outside source is
being used appropriately.
- Metadata is stored in a single, central location and it gives the company standardized
information about all of its data
- Data governance is a process to ensure the formal management of a company’s data assets
- CSV files use plain text and are delineated by characters, such as a comma. A delineator
indicates a boundary or separation between two things. (CSV file saves data in a table format)
* Sorting and filtering
- Sorting involves arranging data into a meaningful order to make it easier to understand,
analyze, and visualize
- Filtering means showing only the data that meets specific criteria while hiding the rest
* Working with SQL
- SQL keywords are not case-sensitive: select, SELECT, and SeLect all work (capitalizing
keywords is a common readability convention).
- You can use single quotes '…' or double quotes "…" around condition values. If the value itself
contains a single quote, such as Shepherd's pie, use double quotes:
WHERE Favorite_food = "Shepherd's pie" — with single quotes, SQL would read the condition as
only Favorite_food = 'Shepherd', stopping at the apostrophe.
- Writing comments in SQL: place them after --; SQL will ignore everything on the same line
after --. For multiple comment lines, start each line with -- or wrap the whole block in /*……..*/.
- Use snake_case names for columns, e.g., total_tickets. If you don't name a calculated column,
SQL auto-generates names like f0, f1, f2, …
- Name tables in CamelCase or snake_case, e.g., TicketsByOccasion (CamelCase).
- When writing a statement, start a new line at the end of each clause or idea rather than running
everything together, which creates lines that are too long.
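A short sketch that applies these conventions, assuming a hypothetical TicketsByOccasion table:

  -- count tickets sold per occasion (single-line comment)
  /* Multi-line comments
     use this form. */
  SELECT
    occasion,
    COUNT(ticket_id) AS total_tickets   -- snake_case alias for the new column
  FROM
    TicketsByOccasion                   -- CamelCase table name
  WHERE
    event_date >= '2021-01-01'
  GROUP BY
    occasion;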
WEEK 4:
* Organizing data
- Benefits of organizing data:

- Naming conventions: consistent guidelines that describe the content, date, or version of a
file in its name
PROCESS DATA FROM DIRTY TO CLEAN
WEEK 1: Data integrity
- A strong analysis depends on the integrity of the data
- Data integrity is the accuracy, completeness, consistency and trustworthiness of data
throughout its lifecycle
- Data replication is the process of storing data in multiple locations
- Data transfer is the process of copying data from a storage device to memory or from one
computer to another
- Data manipulation is the process of changing data to make it more organized and easier to
read
- Threats to data integrity: human error, viruses, malware, hacking, and system failures
* Dealing with insufficient data
- Types of insufficient data:
+ Data from only one source
+ Data that keeps updating ~ it means that the data is still coming in and is not complete
+ Outdated data
+ Geographically-limited data
- Ways to address insufficient data:
+ Identify trends with the available data
+ Wait for more data if time allows
+ Talk with stakeholders and adjust your objective
+ Look for a new dataset
* The importance of sample size:
- Random sampling is a way of selecting a sample from a population so that every possible
type of the sample has an equal chance of being chosen
- Increase the sample size to meet specific needs of your project:
+ For a higher confidence level, use a larger sample size
+ To decrease the margin of error, use a larger sample size
+ For greater statistical significance, use a larger sample size
NOTE: You could probably accept a larger margin of error surveying how residents feel about
the new library versus surveying residents about how they would vote to fund it. For that
reason, you would most likely use a larger sample size for the voter survey.
* Testing data
- Statistical power is the probability of getting meaningful results from a test
- Hypothesis testing is a way to see if a survey or experiment has meaningful results
- If a test is statistically significant, it means the results of the test are real and not an error
caused by random chance (e.g., a statistical power of 60% means there is a 60% chance of
getting results that are real and reliable). A statistical power of 80% is considered the minimum
acceptable level for statistical significance.
- We need to consider all the factors before deciding the sample size, to make sure we get high
statistical power.
* Proxy data
- Proxy data is data used in place of the actual data when the data you need isn't available
* Determine the best sample size


- Confidence level: the probability that your sample size accurately reflects the greater
population
- Confidence level is independent of the margin of error
* The margin of error
- Margin of error is the maximum amount that the sample results are expected to differ from
those of the actual population
- To calculate margin of error, we need: population size, sample size and confidence level
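The notes don't give the formula, but a common approximation for a survey proportion can be sketched in R; the numbers below are made up, and z = 1.96 corresponds to a 95% confidence level (a large population is assumed):

  sample_size <- 500
  p <- 0.5                    # the most conservative assumed proportion
  z <- 1.96                   # z-score for a 95% confidence level
  margin_of_error <- z * sqrt(p * (1 - p) / sample_size)
  margin_of_error             # ~0.044, i.e. about +/- 4.4 percentage points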
WEEK 2: CLEANING DATA
* The necessity of cleaning data
- Dirty data is data that's incomplete, incorrect, or irrelevant to the problem you're trying
to solve
- Data engineers transform data into a useful format for analysis and give it a reliable
infrastructure. This means they develop, maintain, and test databases, data processors and
related systems
- Data warehousing specialists develop processes and procedures to effectively store and
organize data. They make sure that data is available, secure, and backed up to prevent loss
- Cleaning data helps save time and money
- Types of dirty data:
+ Duplicate data
+ Outdated data
+ Incomplete data
+ Incorrect/inaccurate data
+ Inconsistent data
- Field: a single piece of information from a row or column of a spreadsheet


- Data validation is a tool for checking the accuracy and quality of data before adding or
importing it.
* Characteristics of data
- Validity: The concept of using data integrity principles to ensure measures conform to
defined business rules or constraints
- Accuracy: The degree of conformity of a measure to a standard or a true value
- Completeness: The degree to which all required measures are known
- Consistency: The degree to which a set of measures is equivalent across systems
* Data-cleaning tools and techniques
- Before deleting something on your data set, remember to make a copy of that data set first
- Common data-cleaning pitfalls:
+ Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most
of the time the wrong spelling or common grammatical errors can be detected, but it gets
harder with things like names or addresses. For example, if you are working with a
spreadsheet table of customer data, you might come across a customer named “John” whose
name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck
probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your
analysis will have mistakes in it.
+ Forgetting to document errors: Documenting your errors can be a big time saver, as it helps
you avoid those errors in the future by showing you how you resolved them. For example, you
might find an error in a formula in your spreadsheet. You discover that some of the dates in
one of your columns haven’t been formatted correctly. If you make a note of this fix, you can
reference it the next time your formula is broken, and get a head start on troubleshooting.
Documenting your errors also helps you keep track of changes in your work, so that you can
backtrack if a fix didn’t work.
+ Not checking for misfielded values: A misfielded value happens when the values are entered
into the wrong field. These values might still be formatted correctly, which makes them
harder to catch if you aren’t careful. For example, you might have a dataset with columns for
cities and countries. These are the same type of data, so they are easy to mix up. But if you
were trying to find all of the instances of Spain in the country column, and Spain had
mistakenly been entered into the city column, you would miss key data points. Making sure
your data has been entered correctly is key to accurate, complete analysis.
+ Overlooking missing values: Missing values in your dataset can create errors and give you
inaccurate conclusions. For example, if you were trying to get the total number of sales from
the last three months, but a week of transactions were missing, your calculations would be
inaccurate. As a best practice, try to keep your data as clean as possible by maintaining
completeness and consistency.
+ Only looking at a subset of the data: It is important to think about all of the relevant data
when you are cleaning. This helps make sure you understand the whole story the data is
telling, and that you are paying attention to all possible errors. For example, if you are
working with data about bird migration patterns from different sources, but you only clean
one source, you might not realize that some of the data is being repeated. This will cause
problems in your analysis later on. If you want to avoid common errors like duplicates, each
field of your data requires equal attention.
+ Losing track of business objectives: When you are cleaning data, you might make new and
interesting discoveries about your dataset-- but you don’t want those discoveries to distract
you from the task at hand. For example, if you were working with weather data to find the
average number of rainy days in your city, you might notice some interesting patterns about
snowfall, too. That is really interesting, but it isn’t related to the question you are trying to
answer right now. Being curious is great! But try not to let it distract you from the task at
hand.
+ Not fixing the source of the error: Fixing the error itself is important. But if that error is
actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will
have to keep fixing that same error over and over again. For example, imagine you have a
team spreadsheet that tracks everyone’s progress. The table keeps breaking because different
people are entering different values. You can keep fixing all of these problems one by one, or
you can set up your table to streamline data entry so everyone is on the same page.
Addressing the source of the errors in your data will save you a lot of time in the long run.
+ Not analyzing the system prior to data cleaning: If we want to clean our data and avoid
future errors, we need to understand the root cause of your dirty data. Imagine you are an
auto mechanic. You would find the cause of the problem before you started fixing the car,
right? The same goes for data. First, you figure out where the errors come from. Maybe it is
from a data entry error, not setting up a spell check, lack of formats, or from duplicates. Then,
once you understand where bad data comes from, you can control it and keep your data
clean.
+ Not backing up your data prior to data cleaning: It is always good to be proactive and create
your data backup before you start your data clean-up. If your program crashes, or if your
changes cause a problem in your dataset, you can always go back to the saved version and
restore it. The simple procedure of backing up your data can save you hours of work-- and
most importantly, a headache.
+ Not accounting for data cleaning in your deadlines/process: All good things take time, and
that includes data cleaning. It is important to keep that in mind when going through your
process and looking at your deadlines. When you set aside time for data cleaning, it helps you
get a more accurate estimate for ETAs for stakeholders, and can help you know when to
request an adjusted ETA.
* Data-cleaning features
- Conditional formatting is a spreadsheet tool that changes how cells appear when values
meet specific conditions
- A text string is a group of characters within a cell, most often composed of letters.
- Split is a tool that divides a text string around the specified character and puts each
fragment into a new and separate cell. Split is helpful when you have more than one piece of
data in a cell and you want to separate them out
+ e.g., a single cell contains "A,B,C"; using Split (Text to Columns) converts the data into three
separate cells containing A, B, and C
- CONCATENATE is a function that joins multiple text strings into a single string
+ e.g., cell 1 contains A and cell 2 contains B => when concatenated, the result is AB
- A delimiter is a character that indicates the beginning or end of a data item
* Optimize the data-cleaning process
- COUNTIF is a function that returns the number of cells that match a specified value
- Syntax is a predetermined structure that includes all required information and its proper
placement
- LEN is a function that tells you the length of the text string by counting the number of
characters it contains
- TRIM is a function that removes leading, trailing, and repeated spaces in data
- A schema is a way of describing how something is organized
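A quick-reference sketch of these functions as they might look in Google Sheets; the cell references are hypothetical, and the text after each formula is a description, not part of the cell contents:

  =COUNTIF(A2:A100, ">100")      counts the cells in A2:A100 whose value is greater than 100
  =LEN(B2)                       returns the number of characters in B2
  =TRIM(C2)                      removes leading, trailing, and repeated spaces from C2
  =SPLIT(D2, ",")                divides the text in D2 around each comma into separate cells
  =CONCATENATE(E2, " ", F2)      joins the text in E2 and F2 into a single string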
WEEK 3: SQL AND SPREADSHEETS
* SQL queries
- In the FROM clause, reference the table together with its dataset; e.g., to select the
customer_address table from the customer_data dataset, write FROM customer_data.customer_address.
- Insert new data into an existing table: INSERT INTO
- Update existing information: UPDATE
- DROP TABLE IF EXISTS: deletes a table if it already exists, which is useful when re-running
scripts that create tables within the database
- Remove duplicates in SQL: SELECT DISTINCT
- Count characters with the LENGTH function; the part after AS creates a new column, e.g., one
named letters_in_c
- Select data that meets a certain condition: normally a WHERE clause alone is enough, but if,
for example, the country column contains both USA and US, you can use the SUBSTR function in the
WHERE clause to select the customers from the US
- TRIM function: eliminates extra spaces for consistency
- ORDER BY: sorts the results
- TYPECASTING: converts anything from one data type to another. We use the CAST function so
the database recognizes a value as, say, a float (a number that includes decimals) instead of a
text string
+ CAST can also convert a datetime to a date only
- CONCAT: adds strings together to create new text strings that can be used as unique keys
- COALESCE: can be used to return the first non-null value in a list (null values are missing
values); e.g., COALESCE(product, product_code) checks the product column first, and if product
is null it returns product_code instead
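A sketch tying several of these cleaning steps together in one query, assuming a hypothetical customer_data.customer_address table (FLOAT64 is BigQuery's float type):

  SELECT DISTINCT                                    -- drop duplicate rows
    customer_id,
    LENGTH(country) AS letters_in_country,           -- count characters
    TRIM(state) AS state_clean,                      -- remove stray spaces
    CAST(purchase_price AS FLOAT64) AS price,        -- typecast a string to a number
    COALESCE(product, product_code) AS product_info  -- fall back when product is null
  FROM
    customer_data.customer_address
  WHERE
    SUBSTR(country, 1, 2) = 'US';                    -- matches both 'US' and 'USA'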
WEEK 4: DOCUMENTING RESULTS AND THE CLEANING PROCESS
* Verifying and reporting results
- Verification is a process to confirm that a data cleaning effort was well- executed and the
resulting data is accurate and reliable.
- A changelog: a file containing a chronologically ordered list of modifications made to a project
- Verification process:
+ Going back to your original unclean data set and comparing it to what you have now. Review
the dirty data and try to identify any common problems.
- Big picture when verifying data-cleaning:
+ Consider the business problem
+ Consider the goal
+ Consider the data
- The CASE statement goes through one or more conditions and returns a value as soon as a
condition is met
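A minimal CASE sketch, modeled on the course's misspelled-name example; the table and columns are hypothetical:

  SELECT
    customer_id,
    CASE
      WHEN first_name = 'Tnoy' THEN 'Tony'   -- correct a known misspelling
      ELSE first_name                        -- otherwise keep the value as-is
    END AS cleaned_name
  FROM
    customer_data.customer_name;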

* Documenting results
- Documentation which is the process of tracking changes, additions, deletions and errors
involved in your data cleaning effort
- To see what has been changed in the data set, we can use query history in SQL and show edit
history in spreadsheet
- Common data errors:
+ Human error in data entry
+ Flawed processes
+ System issues
ANALYZE DATA TO ANSWER QUESTIONS
WEEK 1: DATA ANALYTICS BASICS
* Analysis process
- Analysis is the process used to make sense of the data collected
- The goal of analysis is to identify trends and relationships within the data
- 4 phases of analysis:
+ Organize data
+ Format and adjust data
+ Get input from others
+ Transform data
- Sorting is when you arrange data into a meaningful order to make it easier to understand,
analyze, and visualize
- Filtering is showing only the data that meets specific criteria while hiding the rest
* Sorting in spreadsheets
- Sort sheet: all of the data in a spreadsheet is sorted by the conditions of a single column, but
the related information across each row stays together.
- Sort range: only the specified cells are sorted; the other cells stay unchanged
- SORT function, e.g., =SORT(A2:D6, 2, TRUE):
+ A2:D6 is the data range
+ 2 is the number of the column to sort by
+ TRUE sorts in ascending order; FALSE sorts in descending order
- To sort by multiple conditions, add more column/order pairs, e.g., =SORT(A2:D6, 2, TRUE, 3, FALSE)
* Sorting in SQL
- ORDER BY: always the last line of the query
WEEK 2: DATA FORMATTING
* Data validation
- Data validation allows you to control what can and can’t be entered in your worksheet
- Commands:
+ Add dropdown lists with predetermined options: Data => Data validation => under Criteria,
select "List of items" and type the options you want to offer (this creates a column where each
row presents the same predefined choices to pick from)
+ Create custom checkboxes: Data => Data validation => under Criteria, select "Checkbox" and
tick "Use custom cell values"
+ Protect structured data and formulas: choose "Reject input" in Data validation
* Conditional formatting
- Conditional formatting: a spreadsheet tool that changes how cells appear when values meet
specific conditions
* Combine multiple datasets
WEEK 3: DATA AGGREGATION
* Vlookup
- Vlookup can be used to connect 2 sheets together on a matching column to populate one
single sheet
- Limitations of VLOOKUP and how to fix some of the most common problems:
+ MATCH function: a function used to locate the position of a specific lookup value
+ VLOOKUP only returns the first match it finds within the specified range, and it can only
search in columns to the right of the lookup column
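A sketch of the syntax in Google Sheets, with hypothetical references:

  =VLOOKUP("Apple", A2:C100, 3, FALSE)

This searches for "Apple" in the first (leftmost) column of A2:C100 and returns the matching value from the third column of that range; FALSE requests an exact match. Because the lookup only scans the leftmost column, any value you want returned must sit to its right.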
* Use JOINS to aggregate data in SQL
- JOIN is a SQL clause that's used to combine rows from two or more tables based on a related
column. There are 4 common JOINS: Inner, Left, Right, Outer

- Inner JOIN is a function that returns records with matching values in both tables
+ In the INNER JOIN clause, departments represents the other table we want to combine. We
specify which column in each table contains the matching join key by writing ON …
- LEFT JOIN is a function that will return all the records from the left table and only the
matching records from the right table

- RIGHT JOIN will return all records from the right table and only the matching records from
the left

- The importance of aliases: Aliases are used in SQL queries to create temporary names for a
column or table (with AS) ~ a shorter name when the existing name is too long. They can be used
in both the SELECT and FROM clauses.
- For example, in the LEFT JOIN and RIGHT JOIN queries above, in the SELECT clause the parts
written customers.XXX and sales.XXX use customers and sales as table names (or aliases), and
XXX as the column name within each table.
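A sketch of an INNER JOIN with aliases, assuming hypothetical customers and sales tables that share a customer_id column:

  SELECT
    c.customer_name,          -- c and s are table aliases
    s.sale_amount
  FROM
    customers AS c
  INNER JOIN
    sales AS s
    ON c.customer_id = s.customer_id;   -- the matching join key in each table

Swapping INNER JOIN for LEFT JOIN here would keep every row from customers even when no matching sale exists.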
* Count and count distinct
- COUNT is a query that returns the number of rows in a specified range
- COUNT DISTINCT is a query that only returns the distinct values in that range. This means
COUNT DISTINCT doesn't count repeating values
- GROUP BY: groups rows that share a value in the specified column
* Subquery
- A subquery is a SQL query that is nested inside of a larger query
- The inner query executes first so that the results can be passed on to the outer query to use

- A subquery starts with ( and closes with )


- WHERE can only be used before GROUP BY; to filter after grouping, you need a different clause
(see HAVING below)
- HAVING basically allows you to add a filter to your query instead of the underlying table
when you're working with aggregate functions
- CASE returns records with your conditions by allowing you to include if/then statements in
your query
- You can nest subqueries within SELECT, FROM, and/or WHERE clauses
- Comparison operators such as >, <, or = help you compare data in subqueries. You can also
use multiple row operators including IN, ANY, or ALL
- A subquery is called an inner query or inner select because it's nested inside a statement
called an outer query or outer select
- The innermost query executes first. Its parent query executes last so it can use the results
returned by inner queries
- Parentheses are used to mark the beginning and end of a subquery
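A sketch combining these ideas, assuming a hypothetical orders table:

  SELECT
    customer_id,
    AVG(order_total) AS avg_order
  FROM
    orders
  WHERE
    order_total > (SELECT AVG(order_total) FROM orders)  -- the inner query runs first
  GROUP BY
    customer_id
  HAVING
    AVG(order_total) > 100;   -- HAVING filters the aggregated groups, not the raw rows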
WEEK 4: DATA CALCULATIONS
* Common calculations
- When we want to sum up the product of 2 or more arrays, we can use SUMPRODUCT
- An operator is a symbol that names the type of operation or calculation to be performed in a
formula ( + , -, *, / )
* SQL calculations

- The modulo operator is represented by the percent symbol (%). It returns the remainder when
one number is divided by another (equivalent to the MOD function in spreadsheets)
- Division: /
+ A WHERE clause such as WHERE total_bags <> 0 tells SQL to exclude rows where total_bags is 0,
since we cannot divide by 0
- The EXTRACT command lets us pull one part of a given date (such as the year or month) to use
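A sketch of these calculations in one query, assuming a hypothetical avocado_data table with integer bag counts:

  SELECT
    EXTRACT(YEAR FROM date) AS sale_year,                  -- pull one part of the date
    small_bags,
    total_bags,
    (small_bags / total_bags) * 100 AS small_bags_percent, -- division
    MOD(total_bags, 2) AS remainder_example                -- remainder after dividing by 2
  FROM
    avocado_data
  WHERE
    total_bags <> 0;   -- exclude rows that would divide by zero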

- Create a temporary table with WITH: the name that follows WITH is the temporary table's name

+ In some databases, a # or ## prefix on a table name indicates a temporary table

+ SELECT INTO: this statement copies data from one table into a new table, but it doesn't add
the new table to the database

+ If lots of people will be using the same table, then the CREATE TABLE statement might be the
better option
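A minimal temporary-table sketch using WITH, loosely modeled on the course's bike-trip example; the trips table is hypothetical:

  WITH longest_used_bikes AS (       -- the temporary table's name
    SELECT
      bike_id,
      SUM(duration_minutes) AS trip_duration
    FROM
      trips
    GROUP BY
      bike_id
    ORDER BY
      trip_duration DESC
    LIMIT 100
  )
  SELECT
    bike_id
  FROM
    longest_used_bikes;              -- query the temporary table like any other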
SHARE DATA THROUGH THE ART OF VISUALIZATION
WEEK 1: VISUALIZING DATA

* Basic information about visualizing data


- Marks: basic visual objects like points, lines, and shapes. Every mark can be broken down
into four qualities:
+ Position: Where a specific mark is in space in relation to a scale or to other marks
+ Size: how big, small, long or tall a mark is
+ Shape: Whether a specific object is given a shape that communicates something about it
+ Color: what color the mark is
- Channels: visual aspects or variables that represent characteristics of the data. Channels can
be evaluated by:
+ Accuracy: Are the channels helpful in accurately estimating the values being represented?
+ Popout: How easy is it to distinguish certain values from others?
+ Grouping: How good is a channel at communicating groups that exist in the data?
- Design principles:

- Dynamic visualizations are interactive or change over time


* Design data visualizations
- The elements of art:
+ Line: can be bold or faint
+ Shape
+ Color: hue, intensity (how bright or dull the color is), value (The value is how light or dark
the color is in a visualization. Value indicates how much light is being reflected)
+ Space: the area around and between elements; a data visualization needs space to stay readable
+ Movement
- 9 basic principles of design:
+ Balance: The design of a data visualization is balanced when the key visual elements, like
color and shape, are distributed evenly.
+ Emphasis: Your data visualization should have a focal point, so that your audience knows
where to concentrate. In other words, your visualizations should emphasize the most
important data so that users recognize it first.
+ Movement: Movement can refer to the path the viewer’s eye travels as they look at a data
visualization, or literal movement created by animations
+ Pattern: You can use similar shapes and colors to create patterns in your data visualization
+ Repetition: Repeating chart types, shapes, or colors adds to the effectiveness of your
visualization.
+ Proportion: Proportion is another way that you can demonstrate the importance of certain
data. Using various colors and sizes helps demonstrate that you are calling attention to a
specific visual over others.
+ Rhythm: This refers to creating a sense of movement or flow in your visualization
+ Variety: Your visualizations should have some variety in the chart types, lines, shapes,
colors, and values you use. Variety keeps the audience engaged
+ Unity: This means that your final data visualization should be cohesive. If the visual is
disjointed or not well organized, it will be confusing and overwhelming.
- Elements for effective visuals:
+ Clear meaning: good visualizations clearly communicate their intended insight
+ Sophisticated use of contrast: helps separate the most important data from the rest
using visual context
+ Refined execution: visuals with refined execution include deep attention to detail, using
visual elements like lines, shapes, colors, value, space, and movement

- Design thinking: a process used to solve complex problems in a user-centric way


- 5 phases of the design process:
+ Empathize: think about the needs of the target audience of the data visualization, whether it's
stakeholders, team members, or the general public. Try to remove obstacles people might face
when interacting with your visualizations (consider what matters when visualizing data; for
example, some viewers have trouble distinguishing similar colors, so use clearly different
colors such as blue, red, and yellow rather than several nearly identical shades of blue)
+ Define: the define phase helps you find your audience's needs, their problems, and your
insights. This goes hand in hand with the empathize phase, as you'll use what you learned in
that phase to help you spell out exactly what your audience needs from your visualization
+ Ideate: start to generate data viz ideas, brainstorm potential data viz solutions. This might
involve creating drafts of your visualization with different color combinations or maybe
experimenting with different shapes
+ Prototype: Putting visualizations together for testing and feedback
+ Test: Showing prototype visualizations to people before stakeholders see them
- A headline is a line of words printed in large letters at the top of the visualization to
communicate what data is being presented
- A subtitle supports the headline by adding more context and description.

- Accessible visualizations:
+ Alternative text provides a textual alternative to non-text content. It allows the content and
function of the image to be accessible to those with visual or certain cognitive disabilities
+ Avoid relying solely on color to convey information; instead, distinguish values with different
textures and shapes
TABLEAU
* Basic information about tableau
- Tableau is a business intelligence and analytics platform that you can use online to help
people see, understand, and make decisions with data
- A diverging color palette displays two ranges of values, using color intensity to show the
magnitude of the number and the actual color to show which range the number is from
- A dashboard is a tool that organizes information, typically from multiple data sets, into one
central location for tracking, analysis, and simple visualization through charts, graphs, and
maps
- 3 data storytelling steps:
+ Engaging audiences: capturing and holding someone’s interest and attention
+ Create compelling visuals: Visuals should take your audience on a journey of how the data
changed over time or highlight the meaning behind the numbers.
+ Tell the story in an interesting narrative: for example, using word clouds, where words are
presented in different sizes based on how often they appear in your data set (frequent words
are shown in large text; rare words in small text)
- Story telling:
+ Setting is what’s happening and other background info.
+ The big reveal, or resolution, is how the data shows the way to solve the conflict or problem
you face.
+ The aha moment means sharing data-driven recommendations for success. To identify it,
ask: “What’s the fix moving forward?”
+ Characters are people affected by your story.
+ The plot is what creates the conflict that compels the characters to act—and what the data
analysis seeks to resolve.
PROGRAMMING
- To find out more about a function, type the function name preceded by a question mark (e.g.,
?print) in the console; the Help window then shows information about that function
- Variable: a representation of a value in R that can be stored for use later during programming
- If you want to add a comment to explain what you’re doing in R, start using # and then write
the comment
- Create variables:
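A minimal sketch of creating and using variables (the values are made up):

  # variable names start with a letter and may contain numbers and underscores
  sales_1 <- 100                     # assign 100 to sales_1
  sales_2 <- 150
  total_sales <- sales_1 + sales_2   # variables can be reused in calculations
  total_sales                        # prints 250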

- Vector is a group of data elements of the same type stored in a sequence in R


- Pipe : a tool in R for expressing a sequence of multiple operations represented with “%>%”

- Creating vectors: c()


+ Every vector has 2 key properties: type and length
 The type of a vector is determined by using the typeof() function

 Length: length()

+ Check the type of a vector with a specific function: is.logical(), is.double(), is.integer(),
is.character()

+ Name vectors:

- Creating lists: Lists are different from atomic vectors because their elements can be of any
type—like dates, data frames, vectors, matrices, and more. Lists can even contain other lists.

- Determining the structure of the lists: str()


- Naming lists: list()
- Convert a date-time to date: as_date()

- Create data frame: data.frame()
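A sketch pulling these structures together; the values are made up (c(2.5, 48.5, 101) echoes the course's example):

  vec <- c(2.5, 48.5, 101)           # create a vector with c()
  typeof(vec)                        # "double" -- the vector's type
  length(vec)                        # 3 -- the vector's length
  is.double(vec)                     # TRUE
  names(vec) <- c("a", "b", "c")     # name the vector's elements

  my_list <- list("a", 1L, 1.5, TRUE, list(2, 3))   # lists can mix types and nest
  str(my_list)                       # show the structure of the list

  df <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))   # create a data frame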

* Operators and calculations


- Operator: a symbol that names the type of operation or calculation to be performed in a
formula
- Assignment operators: used to assign values to variables and vectors
- Logical operators:
+ And: &
+ Or: | or ||
+ Not: !

- Argument: Information needed by a function in R in order to run


- Data types: An attribute that describes a piece of data based on its values, its programming
language, or the operations it can perform
- R packages include:
+ Reusable R functions
+ Documentation about the functions
+ Sample datasets
+ Tests for checking your code
- Packages offer a helpful combination of code, reusable R functions, descriptive
documentation, tests for checking operability, and sample data sets
- 8 core tidyverse packages: ggplot2 (data visualization), tidyr, readr, dplyr, tibble, purrr,
stringr, forcats
+ ggplot2: you can create a variety of data viz by applying different visual properties to the
data variables
+ Tidyr is a package used for data cleaning to make tidy data
+ Readr: use for importing data
+ Dplyr: offers a consistent set of functions that help you complete some common data
manipulation tasks
- Pipes: is a tool in R for expressing a sequence of multiple operations. In other words, it takes
the output of one statement and makes it the input of the next statement
- Nested: describes code that performs a particular function and is contained within code that
performs a broader function
+ e.g., arrange(filter(ToothGrowth, dose == 0.5), len)
- A pipe performs the same steps in sequence and returns the same result as the nested code.
The last line of a pipe does not need a pipe operator.
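The same filter-and-sort written as a pipe; ToothGrowth is a built-in R dataset, and filter() and arrange() come from dplyr:

  library(dplyr)

  filtered_tg <- ToothGrowth %>%
    filter(dose == 0.5) %>%   # step 1: keep rows where dose is 0.5
    arrange(len)              # step 2: sort by length (no pipe on the last line)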


- CRAN is an online archive with R packages and other R-related resources
- Base packages are installed and loaded by default and recommended packages are not.
- Display data in R: data("diamonds") => View(diamonds)
- Show the first 6 rows of the data: head(diamonds)
- Show the first 10 rows of the data: as_tibble(diamonds)
- Show the overall structure of the data: str(diamonds)
- Show the column names: colnames(diamonds)
- Create a new column: mutate(diamonds, carat_2 = carat * 100)
- Tibbles: tibbles are like streamlined data frames that are automatically set to pull up only the
first 10 rows of a dataset, and only as many columns as can fit on the screen
- Here package: makes referencing files easier
- Skimr package: makes summarizing data easier
- Janitor package: makes cleaning data easier
- Summary of a data frame: e.g., summary(), or skim_without_charts() from the skimr package
- Rename a column: rename(A = B) renames column B to A

- Rename all columns at once with a function: rename_with(), e.g., rename_with(penguins, tolower)
- clean_names(): ensures that there are only characters, numbers, and underscores in the names
- File names:
- Operators in R:
- Display data sorted in ascending order: penguins %>% arrange(bill_length_mm)
- Display data sorted in descending order: penguins %>% arrange(-bill_length_mm)
- Save the sorted data as a new data frame: penguins2 <- penguins %>% arrange(-bill_length_mm)
- Pipes function:
- unite(): merges columns together
+ e.g., merging first_name and last_name into one column

- separate(): splits the parts of one column into separate columns (e.g., first name and last
name)
- Mutate: add a new column
- Cleaning functions help you preview and rename data so that it’s easier to work with.
- Organizational functions help you sort, filter, and summarize your data.
- Transformational functions help you separate and combine data, as well as create new
variables.
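A sketch of these transformational functions on a made-up employee data frame:

  library(dplyr)
  library(tidyr)

  employees <- data.frame(first_name = c("Ada", "Grace"),
                          last_name  = c("Lovelace", "Hopper"),
                          salary     = c(100, 120))

  employees %>%
    unite(full_name, first_name, last_name, sep = " ") %>%   # merge two columns into one
    mutate(monthly_salary = salary / 12)                     # add a new column

  # separate() reverses unite(), e.g.:
  # separate(df, full_name, into = c("first_name", "last_name"), sep = " ")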
- Anscombe's quartet: four datasets that have nearly identical summary statistics (the same
mean, standard deviation, etc.)
NOTE: datasets can share the same mean, SD, and other summary statistics, yet show completely
different distributions when plotted on a scatterplot

- bias() function (from the SimDesign package): compares the actual outcomes with the
predicted outcomes

+ The closer the result is to 0, the less bias there is
* Ggplot2

- Aesthetic: a visual property of an object in your plot (the size, shape, or color of your data
points)
- Geom: the geometric object used to represent your data (e.g., points, bars, or lines)
- Facets: display smaller groups or subsets of your data
- Labels and annotations: customize your plot
NOTE: never move the plus sign to the start of a new line; keep it at the end of the previous line
alpha: controls transparency (how light or dark the points appear)
geom_smooth(): draws a trend line
The geom_jitter() function creates a scatter plot and then adds a small amount of
random noise to each point in the plot. Jittering helps us deal with over-plotting, which
happens when the data points in a plot overlap with each other. Jittering makes the points
easier to find.
For bar charts, R automatically counts the number of times each x value appears.
*Facets function
- Facet functions let you display smaller groups or subsets of your data

facet_wrap() lets us create a separate plot for each species.


facet_grid() will split the plot into facets vertically by the values of the first variable
and horizontally by the values of the second variable
- Annotate means to add notes to a document or diagram to explain or comment upon it
Add a title and subtitle to a scatter plot: labs(title = ..., subtitle = ...)

Change the font style and size of an annotation, e.g., with the fontface and size arguments

- Saving visualization: using export in R or ggsave()
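A sketch that pulls the pieces above together, using the palmerpenguins dataset featured in the course (the annotation text and coordinates are made up):

  library(ggplot2)
  library(palmerpenguins)

  ggplot(data = penguins) +                                  # '+' stays at the line end
    geom_point(mapping = aes(x = flipper_length_mm,
                             y = body_mass_g,
                             color = species),
               alpha = 0.7) +                                # alpha sets transparency
    facet_wrap(~species) +                                   # one panel per species
    labs(title = "Palmer Penguins: Body Mass vs. Flipper Length",
         subtitle = "Sample of three penguin species") +
    annotate("text", x = 220, y = 3500, label = "Example note",
             fontface = "italic", size = 4.5)                # font style and size

  ggsave("penguins_plot.png")   # save the most recent plot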


* R Markdown
- A file format for making dynamic documents with R
- Markdown: a syntax for formatting plain text files.
- R notebook: lets users run your code and shows the graphs and charts that visualize the code
- Markdown formating: if you want to italicize a word or phrase in Markdown, just add a single
underscore or asterisk right before and after the word. When you create a report of the
document, the Markdown formatting is no longer visible, just the word or phrase in italics.
- Create and save an R Markdown file: File => New File => R Markdown, then File => Save As
+ To render the charts, code, and text, use Knit. The Knit button creates a shareable HTML
report of the R Markdown file.
+ The more hashtags (#) in a header, the smaller the header

- R Markdown files can be converted into HTML, PDF and Word, slideshow presentations, or
dashboards.
- YAML: a language for data that translates it so it's readable
+ YAML syntax: the block between the two --- delimiters contains information about the title,
author, date, and output. R Markdown generates this information automatically, but you can
also write it yourself
- Inline code: a data analyst inserts some code directly into their R Markdown file so that they
can refer to it directly in their write-up
- Embed a link in an R Markdown file: since URLs are often long, you can convert a bare URL into
embedded text with the syntax [link text](URL)
- Embed an image: use the same syntax with a leading exclamation mark, ![caption](path-to-image);
the "!" marks it as an image, and the bracketed text serves as the caption
- Adding bullet points: use * before each item you want to add
- Code chunk: code added directly to an .Rmd file
- Delimiter: a character that indicates the beginning and ending of an item
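A minimal .Rmd skeleton illustrating the pieces above; the title, author, date, and chunk contents are made up:

  ---
  title: "Sample Report"
  author: "Data Analyst"
  date: "2021-01-01"
  output: html_document
  ---

  ## A second-level header (more #s = a smaller header)

  Some _italicized_ text and a [link](https://www.r-project.org).

  * A bullet point

  ```{r example-chunk}
  # the ```{r} and ``` lines are the delimiters of this code chunk
  summary(cars)   # cars is a built-in R dataset
  ```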
