
Course syllabus

1. Foundations: Data, Data, Everywhere


2. Ask Questions to Make Data-Driven Decisions
3. Prepare Data for Exploration (this course)
4. Process Data from Dirty to Clean
5. Analyze Data to Answer Questions
6. Share Data Through the Art of Visualization
7. Data Analysis with R Programming
8. Google Data Analytics Capstone: Complete a Case Study
Welcome to the third course in the Google Data Analytics Certificate! So far, you have been
introduced to the field of data analytics and discovered how data analysts can use their skills
to answer business questions.

As a data analyst, you need to be an expert at structuring and extracting data, and at making sure the data you are working with is reliable. To do this, it helps to develop a general idea of how data is generated and collected, since every organization structures data differently. Then, no matter what data structure you are faced with in your new role, you will feel confident working with it.

You will soon discover that when data is extracted, it isn’t perfect. It might be biased instead
of credible, or dirty instead of clean. Your goal is to learn how to analyze data for bias and
credibility and to understand what clean data means. You will also get up close and personal
with databases and even get to extract your own data from a database using spreadsheets and
SQL. The last topics covered are the basics of data organization and the process of protecting
your data. 

And you will learn how to identify different types of data that can be used to understand and
respond to a business problem. In this part of the program, you will explore different types of
data and data structures. And best of all, you will keep adding to your data analyst toolbox!
From extracting and using data, to organizing and protecting it, these key skills will come in
handy no matter what you are doing in your career as a data analyst.

Course content
Course 3 – Prepare Data for Exploration

1. Understanding data types and structures: We all generate lots of data in our
daily lives. In this part of the course, you will check out how we generate data
and how analysts decide which data to collect for analysis. You’ll also learn
about structured and unstructured data, data types, and data formats as you start
thinking about how to prepare your data for exploration.
2. Understanding bias, credibility, privacy, ethics, and access: When data
analysts work with data, they always check that the data is unbiased and credible.
In this part of the course, you will learn how to identify different types of bias in
data and how to ensure credibility in your data. You will also explore open data and the relationship between data ethics and data privacy, as well as why both matter.
3. Databases: Where data lives: When you are analyzing data, you will access
much of the data from a database. It’s where data lives. In this part of the course,
you will learn all about databases, including how to access them and extract,
filter, and sort the data they contain. You will also check out metadata to
discover the different types and how analysts use them.
4. Organizing and protecting your data: Good organization skills are a big part
of most types of work, and data analytics is no different. In this part of the
course, you will learn the best practices for organizing data and keeping it
secure. You will also learn how analysts use file naming conventions to help
them keep their work organized.
5. Engaging in the data community (optional): Having a strong online presence
can be a big help for job seekers of all kinds. In this part of the course, you will
explore how to manage your online presence. You will also discover the benefits
of networking with other data analytics professionals.
6. Completing the Course Challenge: At the end of this course, you will be able
to apply what you have learned in the Course Challenge. The Course Challenge
will ask you questions about the key concepts and then will give you an
opportunity to put them into practice as you go through two scenarios.

What to expect
This part of the program is designed to get you familiar with different data structures and
show you how to collect, apply, organize, and protect data. All of these skills will be part of
your daily tasks as an entry-level data analyst. You will work on a wide range of activities
that are similar to real-life tasks that data analysts come across on a daily basis.

This course has five modules or weeks, and each has several lessons included. Within each
lesson, you will find content such as:

 Videos of instructors teaching new concepts and demonstrating the use of tools 
 In-video questions that pop up during or at the end of a video to check your
learning
 Readings to introduce new ideas and build on the concepts from the videos
 Discussion forums to discuss, explore, and reinforce new ideas for better
learning
 Discussion prompts to promote thinking and engagement in the discussion
forums
 Hands-on activities to introduce real-world, on-the-job situations, and the tools
and tasks to complete assignments 
 Practice quizzes to prepare you for graded quizzes
 Graded quizzes to measure your progress and give you valuable feedback 
Hands-on activities provide additional opportunities to build your skills. Try to get as much out of them as possible. Assessments follow the course's approach of offering a wide variety of learning materials and activities that reinforce important skills. Graded and ungraded quizzes will help the content sink in. Ungraded practice quizzes are a chance for you to prepare for the graded quizzes. Both types of quizzes can be taken more than once.

As a quick reminder, this course is designed for all types of learners, with no degree or prior
experience required. Everyone learns differently, so the Google Data Analytics Certificate
has been designed with that in mind. Personalized deadlines are just a guide, so feel free to
work at your own pace. There is no penalty for late assignments. If you prefer, you can
extend your deadlines by returning to Overview in the navigation pane and clicking Switch
Sessions. If you already missed previous deadlines, click Reset my deadlines instead.

If you would like to review previous content or get a sneak peek of upcoming content, you
can use the navigation links at the top of this page to go to another course in the program.
When you pass all required assignments, you will be on track to earn your certificate.

Optional speed track for those experienced in data analytics


The Google Data Analytics Certificate provides instruction and feedback for learners hoping
to earn a position as an entry-level data analyst. While many learners will be brand new to the
world of data analytics, others may be familiar with the field and simply wanting to brush up
on certain skills. 

If you believe this course will be primarily a refresher for you, we recommend taking the
practice diagnostic quiz offered this week. It will enable you to determine if you should
follow the speed track, which is an opportunity to proceed to Course 4 after having taken
each of the Course 3 Weekly Challenges and the overall Course Challenge. Learners who
earn 100% on the diagnostic quiz can treat Course 3 videos, readings, and activities as
optional. Learners following the speed track are still able to earn the certificate.

Tips
 Do your best to complete all items in order. All new information builds on earlier
learning.
 Treat every task as if it is real-world experience. Have a mindset that you are
working at a company or in an organization as a data analyst. This will help you
apply what you learn in this program to the real world.
 Even though they aren’t graded, it is important to complete all practice items.
They will help you build a strong foundation as a data analyst and better prepare
you for the graded assessments.
 Take advantage of all additional resources provided.
 When you encounter useful links in the course, remember to bookmark them so
you can refer to the information later for study or review.

Selecting the right data


Following are some data-collection considerations to keep in mind for your analysis:
How the data will be collected
Decide whether you will collect the data using your own resources or receive it (and possibly purchase it) from another party. Data that you collect yourself is called first-party data.

Data sources
If you don’t collect the data using your own resources, you might get data from second-party
or third-party data providers. Second-party data is collected directly by another group and
then sold. Third-party data is sold by a provider that didn’t collect the data themselves.
Third-party data might come from a number of different sources.

Solving your business problem


Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your business problem. For example, if you are analyzing trends over time, make sure you use time series data, in other words, data that includes dates.

How much data to collect


If you are collecting your own data, make reasonable decisions about sample size. A random
sample from existing data might be fine for some projects. Other projects might need more
strategic data collection to focus on certain criteria. Each project has its own needs. 

Time frame
If you are collecting your own data, decide how long you will need to collect it, especially if
you are tracking trends over a long period of time. If you need an immediate answer, you
might not have time to collect new data. In this case, you would need to use historical data
that already exists.

Data formats in practice
When you think about the word "format," a lot of things might come to mind. Think of an
advertisement for your favorite store. You might find it in the form of a print ad, a billboard,
or even a commercial. The information is presented in the format that works best for you to
take it in. The format of a dataset is a lot like that, and choosing the right format will help you
manage and use your data in the best way possible.
Data format examples
As with most things, definitions click more easily when they are paired with real-life examples. Review each definition first, and then use the examples to lock in your understanding of each data format.

The following table highlights the differences between primary and secondary data, with examples of each.

Primary data
Definition: Collected by a researcher from first-hand sources
Examples: Data from an interview you conducted; data from a survey returned from 20 participants; data from questionnaires you got back from a group of workers

Secondary data
Definition: Gathered by other people or from other research
Examples: Data you bought from a local data analytics firm's customer profiles; demographic data collected by a university; census data gathered by the federal government
The following table highlights the differences between internal and external data, with examples of each.

Internal data
Definition: Data that lives inside a company's own systems
Examples: Wages of employees across different business units tracked by HR; sales data by store location; product inventory levels across distribution centers

External data
Definition: Data that lives outside of a company or organization
Examples: National average wages for the various positions throughout your organization; credit reports for customers of an auto dealership

The following table highlights the differences between continuous and discrete data, with examples of each.

Continuous data
Definition: Data that is measured and can have almost any numeric value
Examples: Height of kids in third grade classes (52.5 inches, 65.7 inches); runtime markers in a video; temperature

Discrete data
Definition: Data that is counted and has a limited number of values
Examples: Number of people who visit a hospital on a daily basis (10, 20, 200); a room's maximum capacity allowed; tickets sold in the current month

The following table highlights the differences between qualitative and quantitative data, with examples of each.

Qualitative
Definition: Subjective and explanatory measures of qualities and characteristics
Examples: Exercise activity most enjoyed; favorite brands of most loyal customers; fashion preferences of young adults

Quantitative
Definition: Specific and objective measures of numerical facts
Examples: Percentage of board-certified doctors who are women; population of elephants in Africa; distance from Earth to Mars

The following table highlights the differences between nominal and ordinal data, with examples of each.

Nominal
Definition: A type of qualitative data that isn't categorized with a set order
Examples: First-time customer, returning customer, regular customer; new job applicant, existing applicant, internal applicant; new listing, reduced-price listing, foreclosure

Ordinal
Definition: A type of qualitative data with a set order or scale
Examples: Movie ratings (number of stars: 1 star, 2 stars, 3 stars); ranked-choice voting selections (1st, 2nd, 3rd); income level (low income, middle income, high income)

The following table highlights the differences between structured and unstructured data, with examples of each.

Structured data
Definition: Data organized in a certain format, like rows and columns
Examples: Expense reports; tax returns; store inventory

Unstructured data
Definition: Data that isn't organized in any easily identifiable manner
Examples: Social media posts; emails; videos

The structure of data


Data is everywhere and it can be stored in lots of ways. Two general categories of data are: 

 Structured data: Organized in a certain format, such as rows and columns.


 Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you're creating structured data.
But when you use Google Earth to check out a satellite image of a restaurant location, you're
using unstructured data. 

Here's a refresher on the characteristics of structured and unstructured data:


Structured data:

 Defined data types
 Most often quantitative data
 Easy to organize
 Easy to search
 Easy to analyze
 Stored in relational databases
 Contained in rows and columns
 Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history

Unstructured data:

 Varied data types
 Most often qualitative data
 Difficult to search
 Provides more freedom for analysis
 Stored in data lakes and NoSQL databases
 Can't be put in rows and columns
 Examples: Text messages, social media comments, phone call transcriptions, various log files, images, audio, video
Structured data
As we described earlier, structured data is organized in a certain format. This makes it
easier to store and query for business needs. If the data is exported, the structure goes along
with the data.

Unstructured data
Unstructured data can’t be organized in any easily identifiable manner. And there is much
more unstructured than structured data in the world. Video and audio files, text files, social
media content, satellite imagery, presentations, PDF files, open-ended survey responses, and
websites all qualify as types of unstructured data.

The fairness issue


The lack of structure makes unstructured data difficult to search, manage, and analyze. But
recent advancements in artificial intelligence and machine learning algorithms are beginning
to change that. Now, the new challenge facing data scientists is making sure these tools are
inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily
weighted and/or represented than others. And as you're learning, an unfair dataset does not
accurately represent the population, causing skewed outcomes, low accuracy levels, and
unreliable analysis.
What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is
organized and structured.  These visual representations are called data models. You can
think of data modeling as a blueprint of a house. At any point, there might be electricians,
carpenters, and plumbers using that blueprint. Each one of these builders has a different
relationship to the blueprint, but they all need it to understand the overall structure of the
house. Data models are similar; different users might have different data needs, but the data
model gives them an understanding of the structure as a whole. 

Levels of data modeling


Each level of data modeling has a different level of detail.

1. Conceptual data modeling gives a high-level view of the data structure, such as
how data interacts across an organization. For example, a conceptual data model
may be used to define the business requirements for a new database. A
conceptual data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such as
relationships, attributes, and entities. For example, a logical data model defines
how individual records are uniquely identified in a database. But it doesn't spell
out actual names of database tables. That's the job of a physical data model.
3. Physical data modeling depicts how a database operates. A physical data model
defines all entities and attributes used; for example, it includes table names,
column names, and data types for the database.
More information can be found in this comparison of data models.
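To make the physical level concrete, here is a minimal sketch, written as generic SQL DDL with hypothetical table and column names, of the kind of detail a physical data model pins down that the conceptual and logical levels leave out:

-- Physical model detail: exact table name, column names, and data types
CREATE TABLE customer (
  customer_id  INT          NOT NULL,  -- unique identifier for each record
  first_name   VARCHAR(50),
  email        VARCHAR(100),
  created_at   DATE,
  PRIMARY KEY (customer_id)
);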
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common
methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language
(UML) diagram. ERDs are a visual way to understand the relationship between entities in the
data model. UML diagrams are very detailed diagrams that describe the structure of a system
by showing the system's entities, attributes, operations, and their relationships. As a junior
data analyst, you will need to understand that there are different data modeling techniques,
but in practice, you will probably be using your organization’s existing technique. 

You can read more about ERD, UML, and data dictionaries in this data modeling techniques
article.

Data analysis and data modeling


Data modeling can help you explore the high-level details of your data and how it is related
across the organization’s information systems. Data modeling sometimes requires data
analysis to understand how the data is put together; that way, you know how to map the data.
And finally, data models make it easier for everyone in your organization to understand and
collaborate with you on your data. This is important for you and everyone on your team!

Understanding Boolean logic


In this reading, you will explore the basics of Boolean logic and learn how to use multiple
conditions in a Boolean statement. These conditions are created with Boolean operators,
including AND, OR, and NOT. These operators are similar to mathematical operators and
can be used to create logical statements that filter your results. Data analysts use Boolean
statements to do a wide range of data analysis tasks, such as creating queries for searches and
checking for conditions when writing programming code. 
Boolean logic example
Imagine you are shopping for shoes, and are considering certain preferences:

 You will buy the shoes only if they are pink and grey
 You will buy the shoes if they are entirely pink or entirely grey, or if they are
pink and grey
 You will buy the shoes if they are grey, but not if they have any pink
Below are Venn diagrams that illustrate these preferences. AND is the center of the Venn
diagram, where two conditions overlap. OR includes either condition. NOT includes only the
part of the Venn diagram that doesn't contain the exception.

The AND operator


Your condition is "If the color of the shoe has any combination of grey and pink, you will buy them." The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say "IF (Color = 'Grey') AND (Color = 'Pink'), then buy them." The AND operator lets you stack multiple conditions.

Below is a simple truth table that outlines the Boolean logic at work in this statement. In the
Color is Grey column, there are two pairs of shoes that meet the color condition. And in the
Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND
Pink column, there is only one pair of shoes that meets both conditions. So, according to the
Boolean logic of the statement, there is only one pair marked true. In other words, there is
one pair of shoes that you can buy.

Color is Grey    Color is Pink    If Grey AND Pink, then Buy
Grey/True        Pink/True        True/Buy
Grey/True        Black/False      False/Don't buy
Red/False        Pink/True        False/Don't buy
Red/False        Green/False      False/Don't buy

The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your condition is "If the shoes are grey or pink, you will buy them." The Boolean statement would be "IF (Color = 'Grey') OR (Color = 'Pink'), then buy them." Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.

Color is Grey    Color is Pink    If Grey OR Pink, then Buy
Red/False        Black/False      False/Don't buy
Black/False      Pink/True        True/Buy
Grey/True        Green/False      True/Buy
Grey/True        Pink/True        True/Buy

The NOT operator


Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is "You will buy any grey shoe except for those with any traces of pink in them." Your Boolean statement would be "IF (Color = 'Grey') AND (Color = NOT 'Pink'), then buy them." Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.

Color is Grey    Color is Pink    Boolean Logic for NOT Pink    If Grey AND (NOT Pink), then Buy
Grey/True        Red/False        Not False = True              True/Buy
Grey/True        Black/False      Not False = True              True/Buy
Grey/True        Green/False      Not False = True              True/Buy
Grey/True        Pink/True        Not True = False              False/Don't buy

The power of multiple conditions


For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: "IF ((Color = 'Grey') OR (Color = 'Pink')) AND (Waterproof = 'True'), then buy them." Notice that you can use parentheses to group your conditions together.

Whether you are doing a search for new shoes or applying this logic to your database queries,
Boolean logic lets you create multiple conditions to filter your results. And now that you
know a little more about how Boolean logic is used, you can start using it!
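As a small sketch of how this carries over to database queries, the same conditions could appear in a SQL WHERE clause. The shoes table and its color and waterproof columns here are hypothetical:

-- Grey or pink shoes that are also waterproof
SELECT *
FROM shoes
WHERE (color = 'Grey' OR color = 'Pink')
  AND waterproof = TRUE;

-- The NOT condition from earlier: grey shoes that aren't pink
SELECT *
FROM shoes
WHERE color = 'Grey'
  AND NOT color = 'Pink';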

Additional Reading/Resources
 Learn about who pioneered Boolean logic in this historical article: Origins of
Boolean Algebra in the Logic of Classes.
 Find more information about using AND, OR, and NOT from these tips for
searching with Boolean operators.

Transforming data
What is data transformation?
In this reading, you will explore how data is transformed and the differences between wide
and long data. Data transformation is the process of changing the data’s format, structure,
or values. As a data analyst, there is a good chance you will need to transform data at some
point to make it easier for you to analyze it. 

Data transformation usually involves:

 Adding, copying, or replicating data 


 Deleting fields or records 
 Standardizing the names of variables
 Renaming, moving, or combining columns in a database
 Joining one set of data with another
 Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (CSV) file.

Why transform data?


Goals for data transformation might be: 

 Data organization: better organized data is easier to use


 Data compatibility: different applications or systems can then use the same data
 Data migration: data with matching formats can be moved from one system to
another
 Data merging: data with the same organization can be merged together
 Data enhancement: data can be displayed with more detailed fields 
 Data comparison: apples-to-apples comparisons of the data can then be made 

Data transformation example: data merging


Mario is a plumber who owns a plumbing company. After years in the business, he buys
another plumbing company. Mario wants to merge the customer information from his newly
acquired company with his own, but the other company uses a different database. So, Mario
needs to make the data compatible. To do this, he has to transform the format of the acquired
company’s data. Then, he must remove duplicate rows for customers they had in common.
When the data is compatible and together, Mario’s plumbing company will have a complete
and merged customer database.

Data transformation example: data organization (long to wide)


To make it easier to create charts, you may also need to transform long data to wide data.
Consider the following example of transforming stock prices (collected as long data) to wide
data.

Long data is data where each row contains a single data point for a particular item. In the
long data example below, individual stock prices (data points) have been collected for Apple
(AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.

Long data example: Stock prices

Wide data is data where each row contains multiple data points for the particular items
identified in the columns. 

Wide data example: Stock prices


With data transformed to wide data, you can create a chart comparing how each company's
stock changed over the same period of time.
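As a sketch of how a long-to-wide transformation like this might be done in SQL, the query below pivots a hypothetical stock_prices table (with trade_date, symbol, and price columns) so that each company becomes its own column:

-- One row per date, one column per stock symbol
SELECT
  trade_date,
  MAX(CASE WHEN symbol = 'AAPL'  THEN price END) AS aapl_price,
  MAX(CASE WHEN symbol = 'AMZN'  THEN price END) AS amzn_price,
  MAX(CASE WHEN symbol = 'GOOGL' THEN price END) AS googl_price
FROM stock_prices
GROUP BY trade_date
ORDER BY trade_date;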

You might notice that all the data included in the long format is also in the wide format. But
wide data is easier to read and understand. That is why data analysts typically transform long
data to wide data more often than they transform wide data to long data. The following table
summarizes when each format is preferred:

Wide data is preferred when:

 Creating tables and charts with a few variables about each subject
 Comparing straightforward line graphs

Long data is preferred when:

 Storing a lot of variables about each subject (for example, 60 years' worth of interest rates for each bank)
 Performing advanced statistical analysis or graphing

Data anonymization
What is data anonymization?
You have been learning about the importance of privacy in data analytics. Now, it is time to
talk about data anonymization and what types of data should be anonymized. Personally
identifiable information, or PII, is information that can be used by itself or with other data
to track down a person's identity. 

Data anonymization is the process of protecting people's private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
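As a minimal sketch of what those techniques can look like in practice, here is a BigQuery-flavored SQL query against a hypothetical customers table. The column names are assumptions for illustration only:

SELECT
  customer_id,
  TO_HEX(SHA256(email)) AS email_hash,               -- hashing: one-way, fixed-length code
  CONCAT('XXX-XX-', SUBSTR(ssn, -4)) AS ssn_masked,  -- masking: hide all but the last 4 digits
  NULL AS phone_number                               -- blanking: remove the value entirely
FROM customers;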

Your role in data anonymization


Organizations have a responsibility to protect their data and the personal information that
data might contain. As a data analyst, you might be expected to understand what data needs
to be anonymized, but you generally wouldn't be responsible for the data anonymization
itself. A rare exception might be if you work with a copy of the data for testing or
development purposes. In this case, you could be required to anonymize the data before you
work with it.

What types of data should be anonymized?


Healthcare and financial data are two of the most sensitive types of data. These industries rely
a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in
these two industries usually goes through de-identification, which is a process used to wipe
data clean of all personally identifying information.

Data anonymization is used in just about every industry. That is why it is so important for
data analysts to understand the basics. Here is a list of data that is often anonymized:

 Telephone numbers
 Names
 License plates and license numbers
 Social security numbers
 IP addresses
 Medical records
 Email addresses
 Photographs
 Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others,
we have to be very specific about what needs to be anonymized. Imagine a world where we
all had access to each other’s addresses, account numbers, and other identifiable information.
That would invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!

The open-data debate


Just like data privacy, open data is a widely debated topic in today’s world. Data analysts
think a lot about open data, and as a future data analyst, you need to understand the basics to
be successful in your new role.

What is open data?


In data analytics, open data is part of data ethics, which has to do with using data ethically.
Openness refers to free access, usage, and sharing of data. But for data to be considered
open, it has to:

 Be available and accessible to the public as a complete dataset


 Be provided under terms that allow it to be reused and redistributed
 Allow universal participation so that anyone can use, reuse, and redistribute the
data
Data can only be considered open when it meets all three of these standards. 

The open data debate: What data should be publicly available?


One of the biggest benefits of open data is that credible databases can be used more widely.
Basically, this means that all of that good data can be leveraged, shared, and combined with
other data. This could have a huge impact on scientific collaboration, research advances,
analytical capacity, and decision-making. But it is important to think about the individuals
being represented by the public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the
data. You might remember learning about this type of data earlier. For example, third parties
might collect information about visitors to a certain website. Doing this lets these third parties
create audience profiles, which helps them better understand user behavior and target them
with more effective advertising. 

Personal identifiable information (PII) is data that is reasonably likely to identify a person
and make information known about them. It is important to keep this data safe. PII can
include a person’s address, credit card information, social security number, medical records,
and more.

Everyone wants to keep personal information about themselves private. Because third-party
data is readily available, it is important to balance the openness of data with the privacy of
individuals.

Sites and resources for open data


Luckily for data analysts, there are lots of trustworthy sites and resources available for open
data. It is important to remember that even reputable data needs to be constantly evaluated,
but these websites are a useful starting point:

1. U.S. government data site: Data.gov is one of the most comprehensive data
sources in the US. This resource gives users the data and tools that they need to
do research, and even helps them develop web and mobile applications and
design data visualizations. 
2. U.S. Census Bureau: This open data source offers demographic information
from federal, state, and local governments, and commercial entities in the U.S.
too. 
3. Open Data Network: This data source has a really powerful search engine and
advanced filters. Here, you can find data on topics like finance, public safety,
infrastructure, and housing and development.
4. Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
5. Dataset Search: The Dataset Search is a search engine designed specifically for
data sets; you can use this to search for specific data sets.
Databases in data analytics
Databases enable analysts to manipulate, store, and process data. This helps them search
through data a lot more efficiently to get the best insights. 

Relational databases
A relational database is a database that contains a series of tables that can be connected to
show relationships. Basically, they allow data analysts to organize and link data based on
what the data has in common. 

In a non-relational table, you will find all of the possible variables you might be interested in
analyzing all grouped together. This can make it really hard to sort through. This is one
reason why relational databases are so common in data analysis: they simplify a lot of
analysis processes and make data easier to find and use across an entire database. 

The key to relational databases


Tables in a relational database are connected by the fields they have in common. You might
remember learning about primary and foreign keys before. As a quick refresher, a primary
key is an identifier that references a column in which each value is unique. In other words,
it's a column of a table that is used to uniquely identify each record within that table. The
value assigned to the primary key in a particular row must be unique within the entire table.
For example, if customer_id is the primary key for the customer table, no two customers will
ever have the same customer_id. 

By contrast, a foreign key is a field within a table that is a primary key in another table. A
table can have only one primary key, but it can have multiple foreign keys. These keys are
what create the relationships between tables in a relational database, which helps organize
and connect data across multiple tables in the database.

Some tables don't require a primary key. For example, a revenue table can have multiple
foreign keys and not have a primary key. A primary key may also be constructed using
multiple columns of a table. This type of primary key is called a composite key. For
example, if customer_id and location_id are two columns of a composite key for a customer
table, the values assigned to those fields in any given row must be unique within the entire
table.
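Here is a minimal sketch of these key concepts in generic SQL DDL. The table and column names are hypothetical, and note that some data warehouses (including BigQuery) don't enforce these constraints:

-- Composite primary key: the pair (customer_id, location_id) must be unique
CREATE TABLE customer (
  customer_id  INT NOT NULL,
  location_id  INT NOT NULL,
  PRIMARY KEY (customer_id, location_id)
);

-- Foreign key: links each order back to a row in the customer table
CREATE TABLE orders (
  order_id     INT NOT NULL PRIMARY KEY,
  customer_id  INT NOT NULL,
  location_id  INT NOT NULL,
  FOREIGN KEY (customer_id, location_id)
    REFERENCES customer (customer_id, location_id)
);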
SQL? You’re speaking my language 
Databases use a special language to communicate called a query language. Structured
Query Language (SQL) is a type of query language that lets data analysts communicate with
a database. So, a data analyst will use SQL to create a query to view the specific data that
they want from within the larger set. In a relational database, data analysts can write queries
to get data from the related tables. SQL is a powerful tool for working with databases —
which is why you are going to learn more about it coming up!
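For a taste of what's coming, a query is just a structured request for data. Here is a minimal sketch, assuming a hypothetical customer table with first_name, email, and city columns:

-- Return the names and emails of customers in one city, alphabetized
SELECT first_name, email
FROM customer
WHERE city = 'Chicago'
ORDER BY first_name;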

Metadata is as important as the data itself


Data analytics, by design, is a field that thrives on collecting and organizing data. In this
reading, you are going to learn about how to analyze and thoroughly understand every aspect
of your data.
Take a look at any data you find. What is it? Where did it come from? Is it useful? How do
you know? This is where metadata comes in to provide a deeper understanding of the data.
To put it simply, metadata is data about data. In database management, it provides
information about other data and helps data analysts interpret the contents of the data within a
database.

Regardless of whether you are working with a large or small quantity of data, metadata is the
mark of a knowledgeable analytics team, helping to communicate about data across the
business and making it easier to reuse data. In essence, metadata tells the who, what, when,
where, which, how, and why of data.

Elements of metadata
Before looking at metadata examples, it is important to understand what type of information
metadata typically provides.

Title and description


What is the name of the file or website you are examining? What type of content does it
contain?

Tags and categories


What is the general overview of the data that you have? Is the data indexed or described in a
specific way? 

Who created it and when


Where did the data come from, and when was it created? Is it recent, or has it existed for a
long time?

Who last modified it and when


Were any changes made to the data?  If yes, were the modifications recent?

Who can access or update it


Is this dataset public? Are special permissions needed to customize or modify the dataset?

Examples of metadata
In today’s digital world, metadata is everywhere, and it is becoming a more common practice
to provide metadata on a lot of media and information you interact with. Here are some real-
world examples of where to find metadata:
Photos
Whenever a photo is captured with a camera, metadata such as camera filename, date, time,
and geolocation are gathered and saved with it.

Emails
When an email is sent or received, there is lots of visible metadata such as subject line, the
sender, the recipient and date and time sent. There is also hidden metadata that includes
server names, IP addresses, HTML format, and software details.

Spreadsheets and documents


Spreadsheets and documents are already filled with a considerable amount of data so it is no
surprise that metadata would also accompany them. Titles, author, creation date, number of
pages, user comments as well as names of tabs, tables, and columns are all metadata that one
can find in spreadsheets and documents. 

Websites
Every web page has a number of standard metadata fields, such as tags and categories, site
creator’s name, web page title and description, time of creation and any iconography. 

Digital files
Usually, if you right click on any computer file, you will see its metadata. This could consist
of file name, file size, date of creation and modification, and type of file. 

Books
Metadata is not only digital. Every book has a number of standard metadata on the covers and
inside that will inform you of its title, author’s name, a table of contents, publisher
information, copyright description, index, and a brief description of the book’s contents.

Data as you know it


Knowing the content and context of your data, as well as how it is structured, is very valuable
in your career as a data analyst. When analyzing data, it is important to always understand the
full picture. It is not just about the data you are viewing, but how that data comes together.
Metadata ensures that you are able to find, use, preserve, and reuse data in the future.
Remember, it will be your responsibility to manage and make use of data in its entirety;
metadata is as important as the data itself.

From external source to a spreadsheet


When you work with spreadsheets, there are a few different ways to import data. This reading
covers how you can import data from external sources, specifically:

 Other spreadsheets
 CSV files
 HTML tables (in web pages)
Importing data from other spreadsheets
In a lot of cases, you might have an existing spreadsheet open and need to add additional data
from another spreadsheet.

Google Sheets
In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a
range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in.

You must allow access to the spreadsheet containing the data the first time you import the
data. The URL shown below is for syntax purposes only. Don't enter it in your own
spreadsheet. Replace it with a URL to a spreadsheet you have created so you can control
access to it by clicking the Allow access button.
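A call looks like the following; the URL and range here are placeholders only:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abcd123abcd123", "Sheet1!A1:F13")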

Refer to the Google Help Center's IMPORTRANGE page for more information about the
syntax. There is also an example of its use later in the program in Advanced functions for
speedy data cleaning.

Microsoft Excel
To import data from another spreadsheet, do the following:

Step 1: Select Data from the main menu.

Step 2: Click Get Data, select From File, and then select From Workbook.

Step 3: Browse for and select the spreadsheet file and then click Import.

Step 4: In the Navigator, select which worksheet to import.

Step 5: Click Load to import all the data in the worksheet; or click Transform Data to open
the Power Query Editor to adjust the columns and rows of data you want to import.

Step 6: If you clicked Transform Data, click Close & Load and then select one of the two
options:

 Close & Load - import the data to a new worksheet


 Close & Load to... - import the data to an existing worksheet
Importing data from CSV files

Google Sheets
Step 1: Open the File menu in your spreadsheet and select Import to open the Import file
window.

Step 2: Select Upload and then select the CSV file you want to import.
Step 3: From here, you will have a few options. For Import location, you can choose to replace the current spreadsheet, create a new spreadsheet, insert the CSV data as a new sheet, add the data to the current spreadsheet, or replace the data in a specific cell. The data will be inserted as plain text only if you uncheck the Convert text to numbers, dates, and formulas checkbox, which is selected by default. Sometimes a CSV file uses a separator like a semicolon or even a blank space instead of a comma. For Separator type, you can select Tab or Comma, or select Custom to enter another character that is being used as the separator.
Step 4: Select Import data. The data in the CSV file will be loaded into your sheet, and you
can begin using it!

Note: You can also use the IMPORTDATA function in a spreadsheet cell to import data
using the URL to a CSV file. Refer to Google Help Center's IMPORTDATA page for more
information and the syntax.
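For example, a call with a placeholder URL looks like this:

=IMPORTDATA("https://example.com/data.csv")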

Microsoft Excel
Step 1: Open a new or existing spreadsheet

Step 2: Click Data in the main menu and select the From Text/CSV option.

Step 3: Browse for and select the CSV file and then click Import.

Step 4: From here, you will have a few options. You can change the delimiter from a comma
to another character such as a semicolon. You can also turn automatic data type detection on
or off. And, finally, you can transform your data by clicking Transform Data to open the
Power Query Editor.

Step 5: In most cases, accept the default settings in the previous step and click Load to load
the data in the CSV file to the spreadsheet. The data in the CSV file will be loaded into the
spreadsheet, and you can begin working with the data.

Importing HTML tables from web pages


Importing HTML tables is a very basic method to extract or "scrape" data from public web
pages. Web scraping made easy introduces how to do this with Google Sheets or Microsoft
Excel.

Google Sheets
In Google Sheets, you can use the IMPORTHTML function. It enables you to import the
data from an HTML table (or list) on a web page.

Refer to the Google Help Center's IMPORTHTML page for more information about the syntax. The function takes the URL of the page, a query ("table" or "list"), and an index. If you are importing a list, replace "table" with "list". The index refers to the order of the tables (or lists) on a web page. It is like a pointer indicating which table on the page you want to import the data from; an index of 4, for example, points to the fourth table.

You can try this yourself! In blank worksheets, copy and paste each of the following
IMPORTHTML functions into cell A1 and watch what happens. You will actually be
importing the data from four different HTML tables in a Wikipedia article: Demographics of
India. You can compare your imported data with the tables in the article.

 =IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",1)
 =IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",2)
 =IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",3)
 =IMPORTHTML("http://en.wikipedia.org/wiki/Demographics_of_India","table",4)

Microsoft Excel
You can import data from web pages using the From Web option:

Step 1: Open a new or existing spreadsheet.

Step 2: Click Data in the main menu and select the From Web option.

Step 3: Enter the URL and click OK.

Step 4: In the Navigator, select which table to import.

Step 5: Click Load to load the data from the table into your spreadsheet.

Exploring public datasets


Open data helps create a lot of public datasets that you can access to make data-driven
decisions. Here are some resources you can use to start searching for public datasets on your
own:

 The Google Cloud Public Datasets allow data analysts access to high-demand
public datasets, and make it easy to uncover insights in the cloud. 
 The Dataset Search can help you find available datasets online with keyword
searches. 
 Kaggle has an Open Data search function that can help you find datasets to
practice with.
 Finally, BigQuery hosts 150+ public datasets you can access and use. 
Public health datasets
1. Global Health Observatory data: You can search for datasets from this page or
explore featured data collections from the World Health Organization.  
2. The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset, this
data is hosted by the Google Cloud Public Datasets and can be uploaded to
BigQuery.
3. 1000 Genomes: This is another dataset from the Google Cloud Public resources
that can be uploaded to BigQuery. 

Public climate datasets


1. National Climatic Data Center: The NCDC Quick Links page has a selection of
datasets you can explore. 
2. NOAA Public Dataset Gallery: The NOAA Public Dataset Gallery contains a
searchable collection of public datasets.

Public social-political datasets


1. UNICEF State of the World’s Children: This dataset from UNICEF includes a
collection of tables that can be downloaded.
2. CPS Labor Force Statistics: This page contains links to several available datasets
that you can explore.
3. The Stanford Open Policing Project: This dataset can be downloaded as a .CSV
file for your own use.

Using BigQuery
BigQuery is a data warehouse on Google Cloud that data analysts can use to query, filter
large datasets, aggregate results, and perform complex operations.
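For a sense of what that looks like, here is a small query you could run in the BigQuery console against one of its public datasets (the usa_names dataset). Treat it as a sketch; the datasets available through the public program can change over time:

-- Top 10 most common first names in US Social Security data, 1910-2013
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10;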
An upcoming activity is performed in BigQuery. This reading provides instructions to create
your own BigQuery account, select public datasets, and upload CSV files. At the end of this
reading, you can confirm your access to the BigQuery console before you move on to the activity.

Note: Additional getting started resources for a few other SQL database platforms are also
provided at the end of this reading if you choose to work with them instead of BigQuery.

Types of BigQuery accounts


There are two different types of accounts: sandbox and free trial. A sandbox account allows
you to practice queries and explore public datasets for free, but has additional restrictions on
top of the standard quotas and limits. If you prefer to use BigQuery with the standard limits,
you can set up a free trial account instead. More details:

 A free sandbox account doesn't ask for a method of payment. It does, however, limit you to 12 projects. It also doesn't allow you to insert new records into a database or update the field values of existing records. These data manipulation language (DML) operations aren't supported in the sandbox.
 A free trial account requires a method of payment to establish a billable
account, but offers full functionality during the trial period.
With either type of account, you can upgrade to a paid account at any time and retain all of
your existing projects. If you set up a free trial account but choose not to upgrade to a paid
account when your trial period ends, you can still set up a free sandbox account at that time.
However, projects from your trial account won't transfer to your sandbox account. It would
be like starting from scratch again.

Set up a free sandbox account for use in this program


 Follow these step-by-step instructions or watch the video, Setting up BigQuery,
including sandbox and billing options.
 For more detailed information about using the sandbox, start with the
documentation, Using the BigQuery sandbox. 
 After you set up your account, you will see the project name you created for the
account in the banner and SANDBOX at the top of your BigQuery console.

Set up a free trial account instead (if you prefer)


If you prefer not to have the sandbox limitations in BigQuery, you can set up a free trial
account for use in this program.

 Follow these step-by-step instructions or watch the video, Setting up BigQuery,


including sandbox and billing options. The free trial offers $300 in credit over the next 90 days. You won't get anywhere near that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the $300 credit (or after 90 days), your free trial will expire, and you will need to choose to upgrade to a paid account yourself to keep using Google Cloud Platform services, including BigQuery. Your method of payment will never be automatically charged after your free trial ends. If you choose to upgrade your account, you will begin to be billed for charges.
 After you set up your account, you will see My First Project in the banner and
the status of your account above the banner – your credit balance and the number
of days remaining in your trial period.

How to get to the BigQuery console


In your browser, go to console.cloud.google.com/bigquery.

Note: Going to console.cloud.google.com in your browser takes you to the main dashboard
for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the
following:

 Click the Navigation menu icon (Hamburger icon) in the banner.


 Scroll down to the BIG DATA section.
 Click BigQuery and select SQL workspace.
Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL
workspace.

(Optional) Explore a BigQuery public dataset 


You will be exploring a public dataset in an upcoming activity, so you can perform these
steps later if you prefer.

 Refer to these step-by-step instructions. 

(Optional) Upload a CSV file to BigQuery


These steps are provided so you can work with a dataset on your own at this time. You will
upload CSV files to BigQuery later in the program.

 Refer to these step-by-step instructions.

Getting started with other databases (if not using BigQuery)


It is easier to follow along with the course activities if you use BigQuery, but if you are
connecting to and practicing SQL queries on other database platforms instead of BigQuery,
here are similar getting started resources:

 Getting started with MySQL: This is a guide to setting up and using MySQL.
 Getting started with Microsoft SQL Server: This is a tutorial to get started using
SQL Server.
 Getting started with PostgreSQL: This is a tutorial to get started using
PostgreSQL.
 Getting started with SQLite: This is a quick start guide for using SQLite.

Organization guidelines
This reading summarizes best practices for file naming, organization, and storage.

Best practices for file naming conventions


Review the following file naming recommendations:

 Work out and agree on file naming conventions early on in a project to avoid
renaming files again and again.
 Align your file naming with your team's or company's existing file-naming
conventions.
 Ensure that your file names are meaningful; consider including information like
project name and anything else that will help you quickly identify (and use) the
file for the right purpose.
 Include the date and version number in file names; common formats are
YYYYMMDD for dates and v## for versions (or revisions).
 Create a text file as a sample file with content that describes (breaks down) the file naming convention and a file name that applies it; a sketch of one appears after this list.
 Avoid spaces and special characters in file names. Instead, use dashes,
underscores, or capital letters. Spaces and special characters can cause errors in
some applications.
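Putting several of these recommendations together, the contents of such a sample text file might look like this. The project name, date, and version here are hypothetical:

File naming convention: ProjectName_YYYYMMDD_v##
Example: SalesReport_20201125_v02 (a sales report dated November 25, 2020, revision 2)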

Best practices for keeping files organized


Remember these tips for staying organized as you work with files:

 Create folders and subfolders in a logical hierarchy so related files are stored
together.
 Separate ongoing from completed work so your current project files are easier to
find. Archive older files in a separate folder, or in an external storage location.
 If your files aren't automatically backed up, manually back them up often to
avoid losing important work.

Learning Log: Review file structure and naming conventions

Overview

In the previous lesson, you were introduced to file structuring and naming conventions. Now,
you’ll complete an entry in your learning log reviewing these concepts and reflecting on why
they are so important. By the time you complete this entry, you will have a stronger
understanding of how and why data analysts use file structuring and naming conventions on
the job. This will help you think critically about file structuring and naming for your own
projects in the future and keep your work more organized.

Review best practices

Before you begin thinking about what sort of naming conventions and patterns you would use
in your own projects, take a moment to review the best practices for file structure and naming
conventions. 

When creating a file structure and naming convention pattern for a project, you should
always:

 Work out your conventions early in your project. The earlier you start, the more
organized you’ll be. 
 Align file naming conventions with your team. Conventions are most useful
when everyone follows them.
 Make sure filenames are meaningful. Stick to a consistent pattern that contains
the most useful information needed.
 Keep file names short and to the point.
This includes understanding the expected structure of folders and files in a project. Where
does your data live? Your spreadsheets? Your data visualizations? Being able to navigate
your folders easily makes for a well-structured project. 

Remember, there are some stylistic choices you’ll need to make when it comes to filename
conventions. However, there are still best practices you should follow here, too:

Formatting convention                                Example
Format dates as yyyymmdd                             SalesReport20201125
Lead revision numbers with 0                         SalesReport20201125v02
Use hyphens, underscores, or capitalized letters     SalesReport_2020_11_25

Balancing security and analytics


The battle between security and data analytics
Data security means protecting data from unauthorized access or corruption by putting
safety measures in place. Usually the purpose of data security is to keep unauthorized users
from accessing or viewing sensitive data. Data analysts have to find a way to balance data
security with their actual analysis needs. This can be tricky: we want to keep our data safe
and secure, but we also want to use it as soon as possible so that we can make meaningful and
timely observations. 

In order to do this, companies need to find ways to balance their data security measures with
their data access needs.
Luckily, there are a few security measures that can help companies do just that. The two we
will talk about here are encryption and tokenization. 

Encryption uses a unique algorithm to alter data and make it unusable by users and
applications that don’t know the algorithm. This algorithm is saved as a “key” which can be
used to reverse the encryption; so if you have the key, you can still use the data in its original
form.  

Tokenization replaces the data elements you want to protect with randomly generated data
referred to as a “token.” The original data is stored in a separate location and mapped to the
tokens. To access the complete original data, the user or application needs to have permission
to use the tokenized data and the token mapping. This means that even if the tokenized data is
hacked, the original data is still safe and secure in a separate location. 
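Here is a minimal sketch of the idea in generic SQL, with hypothetical table and column names; in a real system, the vault would live in a separate, tightly restricted location:

-- The vault maps random tokens back to the protected values
CREATE TABLE token_vault (
  token        VARCHAR(36) PRIMARY KEY,
  card_number  VARCHAR(19)
);

-- Analysts work with tokenized data only
CREATE TABLE payments (
  payment_id   INT PRIMARY KEY,
  card_token   VARCHAR(36),   -- random token instead of the real card number
  amount       DECIMAL(10, 2)
);

-- Most analysis never needs the original value
SELECT card_token, SUM(amount) AS total_spent
FROM payments
GROUP BY card_token;

-- Only a user with permission to read the vault can re-identify
SELECT p.payment_id, v.card_number, p.amount
FROM payments p
JOIN token_vault v ON v.token = p.card_token;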

Encryption and tokenization are just some of the data security options out there. There are a
lot of others, like using authentication devices for AI technology. 

As a junior data analyst, you probably won’t be responsible for building out these systems. A
lot of companies have entire teams dedicated to data security or hire third party companies
that specialize in data security to create these systems. But it is important to know that all
companies have a responsibility to keep their data secure, and to understand some of the
potential systems your future employer might use.
