Week 2 - The Data Engineering Ecosystem
Types of Data
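Structured data
Has a well-defined structure
Can be stored in databases in the form of rows and columns
Semi-structured data
Is partially organized and partially free-form
Contains tags and elements, or metadata, which are used to group data and organize it in a hierarchy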
Unstructured data
Does not have an easily identifiable structure
Cannot be organized in a mainstream relational database in the form of rows and columns
Semi-structured data is somewhat organized and relies on meta tags for grouping and hierarchy.
Unstructured data is not conventionally organized in rows and columns in a particular format.
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0100EN-SkillsNetwork/readings/Reading_Metadata_and_Metadata_Management.md.html?origin=www.coursera.org
Structured data is well organized, in formats that can be stored in databases.
Semi-structured data is partially organized and partially free-form.
Unstructured data cannot be organized conventionally into rows and columns.
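The three categories above can be sketched in Python. This is only an illustration: the customer records, the column names, and the helper functions are all made up for the example and do not come from the course.

```python
import json
import re

# Structured: a fixed set of columns; every row has the same fields.
# (Hypothetical schema, for illustration only.)
columns = ("customer_id", "name", "city")
rows = [(1, "Ada", "London"), (2, "Lin", "Taipei")]

# Semi-structured: keys (tags) describe the data and allow nesting,
# but individual records may differ in shape.
documents = json.loads("""
[
  {"customer_id": 1, "name": "Ada", "contacts": {"email": "ada@example.com"}},
  {"customer_id": 2, "name": "Lin", "tags": ["vip"]}
]
""")

# Unstructured: free text with no schema; values have to be pulled out
# with heuristics such as a regular expression.
note = "Ada from London called on Monday about invoice 4417."


def column_values(name):
    """Return one column from the structured rows, located by position."""
    index = columns.index(name)
    return [row[index] for row in rows]


def keys_per_document():
    """Return the sorted keys of each semi-structured record."""
    return [sorted(doc.keys()) for doc in documents]


def invoice_numbers(text):
    """Extract invoice numbers from free text with a simple pattern."""
    return re.findall(r"invoice (\d+)", text)
```

The structured rows can be read by column position because the layout never changes. The JSON records have to be inspected key by key, and the free-text note needs a pattern match before any value can be used.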
Data comes in a wide variety of file formats, such as delimited text files, spreadsheets, XML, PDF, and JSON,
each with its own benefits and limitations.
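As a rough illustration of how three of these formats differ in practice, the sketch below reads the same two records from a delimited file, JSON, and XML using only the Python standard library. The sample data and the function names are assumptions made for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same two hypothetical records in three formats.
CSV_TEXT = "id,name\n1,Ada\n2,Lin\n"
JSON_TEXT = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]'
XML_TEXT = (
    "<people>"
    "<person id='1'><name>Ada</name></person>"
    "<person id='2'><name>Lin</name></person>"
    "</people>"
)


def read_csv(text):
    """Delimited text: the header row names the fields; all values are strings."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    """JSON: values keep basic types such as numbers and strings."""
    return json.loads(text)


def read_xml(text):
    """XML: data sits in elements and attributes; values are strings."""
    root = ET.fromstring(text)
    return [
        {"id": person.get("id"), "name": person.findtext("name")}
        for person in root.iter("person")
    ]
```

One limitation shows up directly: the delimited and XML readers return the id as text, while the JSON reader returns it as a number, so type handling may differ by format.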
Data is extracted from multiple data sources, ranging from relational and non-relational databases, to APIs, web services,
data streams, social platforms, and sensor devices.
Once the data is identified and gathered from different sources, it needs to be staged in a data repository so that it can be
prepared for analysis. The type, format, and sources of data influence the type of data repository that can be used.
Data professionals need a host of languages that can help them extract, prepare, and analyse data. These can be
classified as:
Querying languages, such as SQL, used for accessing and manipulating data from databases.
Programming languages such as Python, R, and Java, for developing applications and controlling application
behavior.
Shell and scripting languages, such as Unix/Linux Shell and PowerShell, for automating repetitive operational tasks.
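The split between query languages and programming languages can be sketched with Python's built-in sqlite3 module: SQL does the data access and aggregation, while Python controls the surrounding application logic. The orders table, its columns, and the total_by_city function are hypothetical names chosen for this example.

```python
import sqlite3


def total_by_city(orders):
    """Load (city, amount) pairs into an in-memory table and sum them per city."""
    conn = sqlite3.connect(":memory:")
    try:
        # SQL defines the structure and manipulates the data...
        conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
        cursor = conn.execute(
            "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
        )
        # ...while Python decides what to do with the result.
        return cursor.fetchall()
    finally:
        conn.close()
```

Calling total_by_city with a list of (city, amount) tuples returns one (city, total) tuple per city, ordered by city name.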
Quiz
Practice Quiz
Question 1
Combine data from multiple sources into a unified view that is accessed by data consumers to query and
manipulate data
Question 2
Which of these data sources is an example of semi-structured data?
Documents
Emails
Question 3
Which one of the provided file formats is commonly used by APIs and Web Services to return data?
XML
Delimited file
JSON
XLS
Question 4
What is one example of the relational databases discussed in the video?
Spreadsheet
XML
Flat files
SQL Server
Question 5
Which of the following languages is one of the most popular querying languages in use today?
SQL
Java
Python
Graded Quiz
Question 1
There are two main types of data repositories – Transactional and Analytical. For high-volume day-to-day operational data
such as banking transactions, Transactional, or OLTP, systems are the ideal choice.
True
False
Transactional, or OLTP, systems are designed and optimized for handling high-volume transactions.
Question 2
Which of the following is an example of unstructured data?
Zipped files
XML
Spreadsheets
Question 3
Which one of these file formats is independent of software, hardware, and operating systems, and can be viewed the
same way on any device?
XML
XLSX
PDF format is independent of software, hardware, and operating systems, and can be viewed the same way on any
device.
Question 4
Which data source can return data in plain text, XML, HTML, or JSON among others?
APIs
XML
APIs can return data in a wide variety of formats such as plain text, XML, HTML, or JSON among others.
Question 5
In the data engineer’s ecosystem, languages are classified by type. What are shell and scripting languages most
commonly used for?
Manipulating data
Building apps
Querying data