Week 3
Hello again. So far, you've seen how data can be gathered and analyzed to solve all kinds of
problems. Next, we're going to learn all about databases. As a refresher, a database is a
collection of data stored in a computer system, but storage is just the beginning. You'll
discover how databases make it possible to find the exact piece of information you need for
your analysis. You'll also learn how to sort data in order to zoom in on what you need to
generate insightful reports and much more. Then we'll go even deeper, and I mean really,
really deep. I'm talking about metadata. You've probably heard someone say, "Wow, that's so
meta." Usually they're talking about something referencing back to itself or being completely
self-aware. For example, if a character in a book knows she's in a book, that's meta. If you make
a documentary about making documentaries, that's also meta. And here at Google, I
constantly analyze how I analyze data. That's definitely meta.
I do that to give my work a quality check: to make sure my methods are fair and to be certain
that I'm paying attention to any biases that might affect the outcome. As an analyst, you
should do this too. Sometimes we get a little too close to our data. So stepping back and asking
ourselves if our processes make sense is key. But let's back up just a bit and define metadata.
Metadata is data about data. Like I said: deep.
Metadata is extremely important when working with databases. Think of it like a reference
guide. Without the guide all you have is a bunch of data with no context explaining what it
means. Metadata tells you where the data comes from, when and how it was created, and
what it's all about.
Up next, you'll learn how to take data from a database or another source and bring it into a
spreadsheet. You'll do this either by importing it directly or by using SQL to generate the
request. And once you have data in a spreadsheet, the possibilities are endless. Everything
we're about to cover is a very important part of the prepare phase of the data analysis process.
It's how data analysts figure out which kind of data is going to be most helpful to them. If you
have the right data, you're much more likely to be able to solve your business problems
successfully. So, ready to tap into the incredible power of databases? Let's go!
DATABASE FEATURES
Databases are essential tools for data analysts. I use them constantly. Just about all of the data
I access is stored within databases. Databases store and organize data, making it much easier
for data analysts to manage and access information. They help us get insights faster, make
data-driven decisions, and solve problems. You've already heard a bit about what databases
are and how they're used by data analysts. Now let's learn more about database features and
components.
Here's a simple database structure. It contains tables with information from a car
manufacturer. The top level includes car dealerships, product details, and repair parts. Then if
you drill down to the next level by selecting one of those tables, you'll find more specific
details about each item. This is called a relational database. A relational database is a database
that contains a series of related tables that can be connected via their relationships. For two
tables to have a relationship, one or more of the same fields must exist inside both tables. For
example, here, branch ID exists in this table and this one. If a field exists within both tables, we
can use it to connect the tables together. The branch ID field is the key to connecting these
tables.
There are two types of keys. A primary key is an identifier that references a column in
which each value is unique. You can think of it as a unique identifier for each row in a table.
For our dealership table with information about the different dealership branches, branch ID is
the primary key. Similarly, for the product details table about each car, VIN is our primary key.
As an analyst you may need to create tables. If you do decide to include a primary key, it
should be unique, meaning no two rows can have the same primary key. Also, it cannot be null
or blank.
There are also foreign keys. A foreign key is a field within a table that's a primary key
in another table. In other words, a foreign key is how one table can be connected to another.
Because our repair parts table contains information about each car part, the primary key is
part ID. Each row in our repair parts table represents one unique part. All the other keys in this
table, such as the VIN, are the foreign keys that allow the repair parts table to be connected to
the other tables. As you can see, a table can have only one primary key, but it can have multiple
foreign keys.
Understanding primary and foreign keys can be tricky, so you'll have more
opportunities to practice coming up. But as a general summary, a primary key is used to ensure
data in a specific column is unique. It uniquely identifies a record in a relational database table.
Only one primary key is allowed in a table, and it cannot contain null or blank values. And a
foreign key is a column or group of columns in a relational database table that provides a link
between the data in two tables. It refers to the field in a table that's the primary key of
another table. Lastly, it's important to note that more than one foreign key is allowed to exist
in a table.
Feel free to rewatch this video to be sure you understand primary and foreign keys
clearly. And coming up, you'll begin practicing how to access and analyze data from actual
databases. That will be a great opportunity to improve your understanding of primary and
foreign keys, database organization and how you might use databases in your future analytics
career.
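To make the key rules above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a simplified version of the car manufacturer example: the table names, column names, and sample values (branch_id, vin, part_id, 'Austin', and so on) are illustrative guesses, not the actual schema shown in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema loosely based on the dealership example.
conn.executescript("""
CREATE TABLE dealership (
    branch_id INTEGER PRIMARY KEY,           -- primary key: one row per branch
    city      TEXT NOT NULL
);
CREATE TABLE product_details (
    vin       TEXT NOT NULL PRIMARY KEY,     -- primary key: one row per car
    model     TEXT NOT NULL,
    branch_id INTEGER NOT NULL REFERENCES dealership(branch_id)  -- foreign key
);
CREATE TABLE repair_parts (
    part_id   INTEGER PRIMARY KEY,           -- primary key: one row per part
    part_name TEXT NOT NULL,
    vin       TEXT NOT NULL REFERENCES product_details(vin)      -- foreign key
);
""")

conn.execute("INSERT INTO dealership VALUES (1, 'Austin')")
conn.execute("INSERT INTO product_details VALUES ('VIN001', 'Sedan', 1)")
conn.execute("INSERT INTO repair_parts VALUES (10, 'Brake pad', 'VIN001')")


def try_insert(sql):
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A primary key value must be unique: branch 1 already exists.
duplicate_pk_ok = try_insert("INSERT INTO dealership VALUES (1, 'Dallas')")
# A primary key cannot be null.
null_pk_ok = try_insert("INSERT INTO product_details VALUES (NULL, 'Coupe', 1)")
# A foreign key must point at an existing primary key: branch 99 does not exist.
orphan_fk_ok = try_insert("INSERT INTO product_details VALUES ('VIN002', 'Truck', 99)")

# The shared key fields are what let us connect the three tables.
rows = conn.execute("""
    SELECT r.part_name, p.model, d.city
    FROM repair_parts AS r
    JOIN product_details AS p ON p.vin = r.vin
    JOIN dealership AS d ON d.branch_id = p.branch_id
""").fetchall()

print(duplicate_pk_ok, null_pk_ok, orphan_fk_ok)
print(rows)
```

In this sketch all three rejected inserts fail for the reasons described above, and the final query follows the foreign keys from a part to its car and then to its dealership branch.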
Databases enable analysts to manipulate, store, and process data. This helps them search
through data a lot more efficiently to get the best insights.
Relational databases
A relational database is a database that contains a series of tables that can be connected
to show relationships. Basically, they allow data analysts to organize and link data based
on what the data has in common.
In a non-relational table, all of the variables you might be interested in analyzing are
grouped together, which can make the data really hard to sort through. This is one
reason why relational databases are so common in data analysis: they simplify a lot of
analysis processes and make data easier to find and use across an entire database.
Tables in a relational database are connected by the fields they have in common. You
might remember learning about primary and foreign keys before. As a quick refresher, a
primary key is an identifier that references a column in which each value is unique. In
other words, it's a column of a table that is used to uniquely identify each record within that
table. The value assigned to the primary key in a particular row must be unique within the
entire table. For example, if customer_id is the primary key for the customer table, no two
customers will ever have the same customer_id.
By contrast, a foreign key is a field within a table that is a primary key in another table. A
table can have only one primary key, but it can have multiple foreign keys. These keys are
what create the relationships between tables in a relational database, which helps organize
and connect data across multiple tables in the database.
Some tables don't require a primary key. For example, a revenue table can have multiple
foreign keys and not have a primary key. A primary key may also be constructed using
multiple columns of a table. This type of primary key is called a composite key. For
example, if customer_id and location_id are two columns of a composite key for a customer
table, the combination of values assigned to those fields in any given row must be unique
within the entire table.
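As a rough illustration of a composite key, the sketch below builds a small customer table with Python's built-in sqlite3 module. The column names customer_id and location_id come from the example above; the name column and the sample rows are made up for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER NOT NULL,
        location_id INTEGER NOT NULL,
        name        TEXT,
        PRIMARY KEY (customer_id, location_id)  -- composite key: the pair is unique
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 100, 'Ana')")
# Repeating customer_id is fine as long as location_id differs.
conn.execute("INSERT INTO customer VALUES (1, 200, 'Ana')")
# Repeating location_id is fine as long as customer_id differs.
conn.execute("INSERT INTO customer VALUES (2, 100, 'Ben')")

# Repeating the whole (customer_id, location_id) pair is rejected.
try:
    conn.execute("INSERT INTO customer VALUES (1, 100, 'Duplicate')")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(pair_rejected, row_count)
```

Neither column has to be unique on its own here; only the pair of values does.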
SQL? You’re speaking my language
Databases use a special language, called a query language, to communicate. Structured
Query Language (SQL) is a type of query language that lets data analysts communicate
with a database. So, a data analyst will use SQL to create a query to view the specific data
that they want from within the larger set. In a relational database, data analysts can write
queries to get data from the related tables. SQL is a powerful tool for working with
databases — which is why you are going to learn more about it coming up!
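Here is a small sketch of what such queries can look like, run from Python with the built-in sqlite3 module. The customer and orders tables, their columns, and every value in them are hypothetical and exist only to show the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, created only for this illustration.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    city        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Ana', 'Lisbon'), (2, 'Ben', 'Porto'), (3, 'Cy', 'Lisbon');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0), (13, 3, 60.0);
""")

# Query 1: view only the specific data we want from the larger set.
lisbon_customers = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Lisbon",)
).fetchall()

# Query 2: pull data from two related tables through their shared key.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY total DESC
""").fetchall()

print(lisbon_customers)  # [('Ana',), ('Cy',)]
print(totals)            # [('Ana', 65.0), ('Cy', 60.0), ('Ben', 15.0)]
```

The first query filters one table; the second uses the customer_id field that both tables share to combine them.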
As a data analyst, you'll use data to answer questions and solve problems. When you
analyze data and draw conclusions, you are generating insights that can influence business
decisions, drive positive change, and help your stakeholders meet their goals.
Before you begin an analysis, it’s important to inspect your data to determine if it contains
the specific information you need to answer your stakeholders’ questions. In any given
dataset, it may be the case that:
The data is not there (you have sandwich data, but you need pizza data)
The data is insufficient (you have pizza data for June 1-7, but you need data
for the entire month of June)
The data is incorrect (your pizza data lists the cost of a slice as $250, which
makes you question the validity of the dataset)
Inspecting your dataset will help you pinpoint what questions are answerable and what data
is still missing. You may be able to recover this data from an external source or at least
recommend to your stakeholders that another data source be used.
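The three checks described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the inspect_sales helper, the record layout, the year 2023, the normal slice price of 3.50, and the 20.00 threshold used to flag a suspicious price.

```python
from datetime import date, timedelta


def inspect_sales(records, item, start, end, max_price):
    """Report whether the data exists, which days are missing, and which prices look wrong."""
    relevant = [r for r in records if r["item"] == item]
    expected_days = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    covered_days = {r["date"] for r in relevant}
    return {
        "has_data": len(relevant) > 0,                           # is the data there at all?
        "missing_days": sorted(expected_days - covered_days),    # is it sufficient?
        "suspect_rows": [r for r in relevant
                         if not (0 < r["price"] <= max_price)],  # is it plausible?
    }


# Hypothetical records: pizza data for June 1-7 only, with one bad price.
sales = [{"date": date(2023, 6, day), "item": "pizza slice", "price": 3.50}
         for day in range(1, 8)]
sales[3]["price"] = 250.00  # the June 4 row has an implausible price
sales.append({"date": date(2023, 6, 2), "item": "sandwich", "price": 6.00})

report = inspect_sales(sales, "pizza slice", date(2023, 6, 1), date(2023, 6, 30),
                       max_price=20.00)

print(report["has_data"])           # True
print(len(report["missing_days"]))  # 23 (June 8 through June 30)
print(report["suspect_rows"])
```

A report like this does not fix the dataset, but it shows quickly which stakeholder questions the data can support and what still has to be collected or corrected.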
In this reading, imagine you’re a data analyst inspecting spreadsheet data to determine if
it’s possible to answer your stakeholders’ questions.
The scenario
You are a data analyst working for an ice cream company. Management is interested in
improving the company's ice cream sales.
The company has been collecting data about its sales—but not a lot. The available data is
from an internal data source and is based on sales for 2019. You’ve been asked to review
the data and provide some insight into the company’s ice cream sales. Ideally,
management would like answers to the following questions: