Glossary
Welcome! This alphabetized glossary contains many terms used in this course. Understanding these
terms is essential when working in the industry, participating in user groups, and pursuing other
certificate programs.
Apache Airflow An open-source workflow management platform for data engineering pipelines.
Apache Beam An open-source, unified programming model for batch and streaming data
processing pipelines.
Apache HBase A non-relational database that runs on Hadoop, providing real-time access to large
data sets.
Apache Kafka An open-source software platform used to handle real-time data feeds.
Apache Storm A framework for distributed stream processing computation primarily written in the
Clojure programming language.
Apache Spark Streaming An extension of the core Spark API that allows for fault-tolerant
stream processing of live data streams with high throughput and scalability.
BeautifulSoup A Python library to get data out of HTML, XML, and other markup languages.
Big data stores A larger, more complex data set, especially from new data sources.
Big data A dynamic, large, and disparate volume of data being created by people, tools, and
machines.
Cloudant A fully managed, distributed database optimized for heavy workloads and fast-
growing web and mobile apps.
Comma-separated values (CSV) A text-formatted file that uses commas to separate values.
Conceptual data model The model, created by business stakeholders and data architects, defines the
system's scope, concepts, and rules.
CouchDB An open-source NoSQL document database that collects and stores data in JSON-
based document formats. Unlike relational databases, CouchDB uses a schema-free data model,
which simplifies record management across various computing devices, mobile phones, and web
browsers.
Customer relationship management (CRM) software Software that helps companies measure and
control their lead generation and sales pipelines.
Data abstraction The process of simplifying a set of data to represent the whole.
Data analyst A data professional who first gathers and understands the data, then analyzes and
interprets it before visualizing it and, finally, weaving it into a story.
Data analytics Focuses on extracting valuable information from data using various tools,
techniques, processes, and algorithms. It includes data analysis and the interpretation of the results,
keeping in mind specific business objectives.
Data fabric An architecture that facilitates the end-to-end integration of various data pipelines
and cloud environments through intelligent and automated systems.
Data integration The combination of technical and business processes that are used to
combine data from disparate sources into meaningful and valuable information.
Data lakes A centralized repository designed to store, process, and secure large amounts of
structured, semistructured, and unstructured data. It can store data in its native format and process
any variety of it, regardless of size limits.
Data marts Data warehouses are segmented into smaller subsets, known as data marts. These
data marts are designed to manage specific business functions, departments, or subject areas. By
doing so, data marts make it easier for a defined group of users to access specific data, enabling
them to quickly find crucial insights without wasting time searching through an entire data
warehouse.
Data modeling Creating a visual representation of either a whole information system or parts of it to
communicate connections between data points and structures.
Data repository Data sets isolated to be mined for reporting and analysis. It is also known as a data
archive or library.
Data science Process that focuses on understanding the data. This involves data analysis,
beginning with data loading, exploring, and cleaning. It creatively explores data, coming up with new
solutions and inventions.
Data source The physical or digital location where the data is held in a data table, object, or other
storage format.
Data streams The process of transmitting continuous data and feeding it into stream processing
software to derive valuable insights.
Data visualization The graphical representation of information and data. It helps users understand
trends, outliers, and patterns in data.
Data warehouses A storage architecture that pulls data from many sources into a single data
repository for sophisticated analytics and decision support.
Database as a service A cloud-computing service that allows users to access and use a cloud
database system without purchasing and setting up their own hardware, installing their own
database software, or managing the database themselves.
Database Management System (DBMS) Software to store and retrieve users' data by considering the
security of their information.
Denodo A unified virtual data layer that allows enterprise users to access data across formats,
protocols, and locations using techniques like search.
DocumentDB A NoSQL database service that supports document data structures with some
MongoDB 3.6 and 4.0 compatibility.
Enterprise resource planning (ERP) systems A type of software system that enables businesses
to automate and efficiently manage their key business processes to gain optimal performance.
Entity-relationship model (E-R model) A high-level data model created to define the data elements
and their relationships for a specific system. It provides a conceptual design for the database and
presents a simple, easy-to-understand view of the data.
Extract, transform, load (ETL) process A process that extracts, transforms, and loads data from
multiple sources into a data warehouse or other unified data repository.
Flat files A collection of data stored in a two-dimensional database. It usually contains a series of
records (or lines), where each record is a sequence of fields.
Global Positioning Systems (GPS) A radio navigation system that accurately determines
location, time, and velocity regardless of weather conditions.
Hadoop A collection of tools that provides distributed storage and processing of big data.
Hadoop Distributed File System (HDFS) A storage system for big data that runs on multiple
commodity hardware devices connected through a network. HDFS provides scalable and reliable big
data storage by partitioning files over multiple nodes.
Hierarchical model A data model in which the data are organized into a tree-like structure.
Hive A data warehouse for data query and analysis built on top of Hadoop.
Java A programming language known for its platform independence, which allows Java programs
to run on different operating systems without modification.
JavaScript object notation (JSON) An open standard file format that uses readable text to store
and transmit data objects consisting of attributes.
Logical data model Provides detailed descriptions of data elements and is utilized to create
visual representations of data entities, attributes, keys, and relationships.
Network model A database model conceived as a flexible way of representing objects and their
relationships.
NoSQL database A non-tabular database that stores data with different data storage tables
than relational tables.
Online analytical processing (OLAP) Software that is used to conduct multidimensional analysis
on large volumes of data from a data warehouse, data mart, or other centralized data store.
Online Transaction Processing (OLTP) A computerized system that allows real-time data processing
and immediate response to users' queries.
Oracle Cloud A cloud platform that offers complete cloud application suites across software as a
service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS).
Oracle database A multi-model database management system generally used for online transaction
processing (OLTP), data warehousing, and mixed workloads.
Physical data model A database-specific model that represents relational data objects (for
example, tables, columns, primary and foreign keys) and their relationships.
Platform as a service A cloud computing model that provides customers with a complete cloud
platform (hardware, software, and infrastructure) for developing, running, and managing applications
without the cost, complexity, and inflexibility that often come with building and maintaining that
platform on-premises.
PostgreSQL An open-source database that has a strong reputation for its reliability, flexibility, and
support of open technical standards.
PowerShell A cross-platform command-line shell and scripting language designed for automating
tasks and managing configurations.
Python An agile, dynamically typed, expressive, open-source programming language that supports
multiple programming philosophies, including procedural, object-oriented, and functional. Python is
a popular high-level programming language that is easily extensible through the use of third-party
packages and often allows powerful functions to be written with a few lines of code.
Radio Frequency Identification (RFID) tags Tags that use radio-frequency signals to identify and track goods.
Relational Database Service (RDS) Organizes data into rows and columns, which collectively
form a table. Data is typically structured across multiple tables, which can be joined together via a
primary key or a foreign key.
Relational model An approach to managing data using a structure and language consistent
with first-order predicate logic.
Scala A programming language designed for concise, elegant, and type-safe expression of
programming patterns. This language seamlessly integrates object-oriented and functional features.
Scrapy A free and open-source web-crawling framework written in Python and developed in
Cambuslang.
Spark A distributed data analytics framework designed to perform complex data analytics in real-
time.
Statistical Analysis System (SAS) A programming language that provides all the tools necessary to
read, write, and create system files, SAS databases, and reports.
Structured data Data that conforms to a defined structure, follows a consistent order, and is easily
accessible to people or computer programs.
Talend Open Studio A free, open-source ETL tool for data integration and big data.
Unstructured data Typically categorized as qualitative data that cannot be processed and
analyzed via conventional data tools and methods.
Velocity A tool to provide insights to the business about how well software delivery is working and
where to focus new processes, resources, or more automation.
Veracity The term "Veracity" was coined by IBM to describe the challenges of managing data from
disparate sources, which can be inconsistent and unreliable.
Web scraping A technique used to collect online content and data, which is generally saved in a
local file so it can be manipulated and analyzed as needed (see the short sketch after this glossary).
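As referenced in the web scraping entry above, a minimal sketch of pulling data out of an HTML page with the BeautifulSoup library might look like the following. The HTML snippet, heading, and link are placeholders chosen purely for illustration.

```python
from bs4 import BeautifulSoup

# Placeholder HTML; in practice the page would be fetched first (e.g. with requests)
html = "<html><body><h1>Quarterly report</h1><a href='/q1.csv'>Q1 data</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)               # heading text: "Quarterly report"
for link in soup.find_all("a"):   # every anchor tag on the page
    print(link.get("href"))       # link target: "/q1.csv"
```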
Learning Objective:
Define and distinguish between descriptive, diagnostic, predictive, and prescriptive analytics.
You have learned about the four types of analytics: descriptive analytics, diagnostic analytics,
predictive analytics, and prescriptive analytics. Now, let's look at the differences between them.
Descriptive analytics
Descriptive analytics is the first and most basic step in business intelligence. It involves summarizing
raw data using data mining and aggregation techniques, such as measures of distribution (frequency
or count), measures of central tendency (mean, median, mode), and measures of variability
(variance and standard deviation) to reveal trends. For instance, it helps you identify the customer
segment responsible for generating the highest revenue for your product.
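As a rough illustration, the summary measures named above can be computed with Python's standard library; the revenue figures and segment labels in this sketch are invented.

```python
from collections import Counter
import statistics

# Hypothetical order-level revenue figures and customer-segment labels
revenue_by_order = [120, 95, 120, 240, 180, 95, 310, 120]
segments = ["retail", "retail", "online", "online", "online", "retail", "online", "retail"]

# Measure of distribution: frequency/count of orders per segment
print(Counter(segments))

# Measures of central tendency
print(statistics.mean(revenue_by_order))
print(statistics.median(revenue_by_order))
print(statistics.mode(revenue_by_order))

# Measures of variability
print(statistics.variance(revenue_by_order))
print(statistics.stdev(revenue_by_order))
```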
Diagnostic analytics
Moving on, diagnostic analytics delves into the "why." It analyzes the data and correlates it with
other data sets to uncover the underlying reasons for trends identified through descriptive analytics.
Some techniques include probability theory, time-series analysis, filtering, and regression analysis.
For example, diagnostic analytics identifies specific features of the customer segment that contribute
to their product purchases.
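A minimal sketch of that kind of correlation and regression check, using invented numbers for a hypothetical second data set (marketing emails sent to the segment):

```python
import numpy as np

# Hypothetical figures: marketing emails sent vs. purchases by a segment
emails_sent = np.array([10, 20, 30, 40, 50, 60])
purchases = np.array([4, 9, 11, 18, 21, 26])

# Correlation between the two series (close to 1 suggests a strong relationship)
correlation = np.corrcoef(emails_sent, purchases)[0, 1]

# Simple linear regression: purchases ~= slope * emails_sent + intercept
slope, intercept = np.polyfit(emails_sent, purchases, deg=1)

print(f"correlation={correlation:.2f}, slope={slope:.2f}, intercept={intercept:.2f}")
```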
Predictive analytics
Predictive analytics uses statistical modeling, data mining, and machine learning to analyze large
volumes of data to forecast what will happen in the future. It uses historical data to calculate
upcoming trends. For example, predictive analytics can project the expected demand each customer
segment will generate in the upcoming quarter.
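As a simple illustration, the sketch below fits a linear trend to invented historical demand figures and extrapolates it one quarter ahead; real predictive models are usually far richer than this.

```python
import numpy as np

# Hypothetical demand (units sold) over the past eight quarters
quarters = np.array([1, 2, 3, 4, 5, 6, 7, 8])
demand = np.array([200, 210, 230, 236, 255, 270, 284, 300])

# Fit a simple linear trend to the historical data
slope, intercept = np.polyfit(quarters, demand, deg=1)

# Project the trend one quarter ahead (quarter 9)
forecast_next_quarter = slope * 9 + intercept
print(f"Forecasted demand for the upcoming quarter: {forecast_next_quarter:.0f} units")
```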
Prescriptive analytics
Lastly, prescriptive analytics goes beyond descriptive, diagnostic, and predictive analytics to
recommend the next course of action based on predictions. It harnesses data optimization,
simulation, and decision analysis methods using artificial intelligence or machine learning techniques
to determine the best solution considering several data points. Prescriptive analytics can recommend
the optimal price and marketing strategies based on the forecasted demand and other contributing
factors.
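A toy sketch of that idea, simulating revenue across candidate prices under an assumed (entirely made-up) price-sensitivity model and recommending the best option:

```python
# Forecasted baseline demand, e.g. taken from the predictive step above
forecast_base_demand = 300

# Candidate prices to evaluate
candidate_prices = [10, 12, 14, 16, 18, 20]

def simulate_demand(price, base=forecast_base_demand):
    """Made-up price-sensitivity model: each dollar above $10 cuts demand by 5%."""
    return base * (1 - 0.05 * (price - 10))

# Simulate revenue for each option and recommend the price that maximizes it
expected_revenue = {price: price * simulate_demand(price) for price in candidate_prices}
best_price = max(expected_revenue, key=expected_revenue.get)
print(f"Recommended price: ${best_price} "
      f"(expected revenue: {expected_revenue[best_price]:.0f})")
```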
The objectives of the four types of analytics, the techniques used in each case, and examples of their
applications are summarized in the table below.
Key objectives
Descriptive: To present what happened in an actionable format
Diagnostic: To delve into why something happened
Predictive: To forecast what will happen
Prescriptive: To recommend the next course of action
Techniques
Descriptive: Data mining and aggregation, such as measures of distribution (frequency or count), measures of central tendency (mean, median, mode), and measures of variability (variance and standard deviation)
Diagnostic: Correlation with other data sets using techniques like probability theory, time-series analysis, filtering, and regression analysis
Predictive: Statistical modeling, data mining, and machine learning
Prescriptive: Data optimization, simulation, and decision analysis using artificial intelligence or machine learning techniques
Examples
Descriptive: Identifying the customer segment that generates the highest revenue for a product
Diagnostic: Identifying the specific features of that customer segment that contribute to their purchases
Predictive: Projecting the demand each customer segment is expected to generate in the upcoming quarter
Prescriptive: Recommending the optimal price and marketing strategies based on the forecasted demand and other contributing factors
In this reading, you have learned the differences between descriptive analytics, diagnostic analytics,
predictive analytics, and prescriptive analytics.
Diagnostic analytics correlates the data with other data sets to find the reason for the trends.
Prescriptive analytics goes beyond descriptive, diagnostic, and predictive analytics to recommend the
next course of action based on various data points.
Learning Objective:
Describe the key business intelligence or BI components that make up its process.
You have learned about the key components of a business intelligence (BI) system, including data
sources and integration, data warehousing, data analysis and mining, reporting and visualization
systems, software technologies, and advanced analytics. While these form the essential architecture
of the BI system, there are other factors that contribute to the success of a BI system. Let's delve into
what comprises the BI ecosystem.
The BI ecosystem or BI environment comprises four key elements: data, people, processes, and
technologies. While creating a BI strategy, one should consider all four elements. Let's deep dive into
each of these elements and understand their role in the BI ecosystem.
Data
Data is the most important element of the ecosystem, as it is the raw material for analytics and
reporting. Data can originate from both internal and external sources. Internal data includes human
resources data, customer data, financial data, and website-related data. External data includes
publicly available sources such as market trends, customer demographics, and financial trends.
Collecting the right data is crucial to getting actionable insights that help decision-making.
People
People involved in BI include data analysts responsible for sourcing and processing data and users
who query the data to generate the required reports to make informed decisions. Each individual
involved must have the skills required for their role.
Processes
The processes in the BI system must be designed to fulfill the requirements of the business and
produce accurate results. The BI architecture typically consists of data collection, data integration
and management, data analysis, and data reporting and visualization processes.
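As a loose illustration only, those four process stages could be chained into a single pipeline as sketched below; every function name, field, and figure is a hypothetical placeholder rather than part of any specific BI product.

```python
# A minimal sketch of the four BI process stages chained into one pipeline.
# Function names, fields, and figures are hypothetical placeholders.

def collect():
    """Data collection: pull raw records from internal and external sources."""
    return [{"segment": "Online ", "revenue": 120}, {"segment": "retail", "revenue": 95}]

def integrate(records):
    """Data integration and management: clean and standardize the raw records."""
    return [{**r, "segment": r["segment"].strip().lower()} for r in records]

def analyze(records):
    """Data analysis: aggregate revenue by customer segment."""
    totals = {}
    for r in records:
        totals[r["segment"]] = totals.get(r["segment"], 0) + r["revenue"]
    return totals

def report(totals):
    """Data reporting and visualization: here, just a plain-text summary."""
    for segment, revenue in sorted(totals.items()):
        print(f"{segment}: {revenue}")

report(analyze(integrate(collect())))
```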
Technologies
Technologies run the BI processes. The technology stack for BI must be carefully chosen to align with
the purpose of the BI system and always be kept up to date. It should be capable of handling high
volumes of complex data. Components such as data mining, extract, transform, and load (ETL),
analytics, and reporting software should be seamlessly integrated to produce the intended
outcomes.
Summary
The BI ecosystem or environment comprises four key elements: data, people, processes, and
technologies.
Data from both internal and external sources form the raw material for analytics and reporting.
All the people involved, both those responsible for creating and maintaining the system and those
using it, are key to making the BI system work.
Processes need to be designed to cater to the requirements of the business and yield accurate
results.
The technologies selected must align with the BI system's purpose and must be kept up to date.
Learning Objective:
Evaluate and compare different business intelligence tools and technologies used in the BI analyst
ecosystem.
You have learned about the categories of tools used in a BI system. One of the most popular types of
tools is a dashboard and visualization tool. These tools seamlessly integrate with the broader BI
architecture to enable advanced analytics, data visualization, and reporting functions.
We'll now delve into the features, pros, and cons of four of these tools: IBM Cognos Analytics,
Tableau, Power BI, and Looker.
IBM Cognos Analytics
Features:
Advanced analytics, customization, scalable distribution, and scheduling abilities to meet business
goals
Pros:
Cons:
Tableau
Features:
User-friendly interface that allows users to create intuitive visualizations and interactive dashboards
Drag-and-drop functionality that makes it easy for users to explore data and gain insights
Advanced analytics capabilities that support complex calculations, statistical analysis, and forecasting
Built-in functions and integration with R and Python for advanced analytics
Seamless integration capabilities with various data sources, including databases, spreadsheets, cloud
services, and web connectors
Pros:
Cons:
Steep learning curve for navigating the advanced features and complex data models
Power BI
Features:
Seamless integration with Microsoft products such as Excel, SharePoint, Teams, Power Automate,
and Azure
Ability to leverage existing data sources and collaborate within the Microsoft ecosystem
Question-and-answer feature enabling interaction using natural language to get instant answers
through visualizations
Access to reports and dashboards on the go through Power BI Mobile apps for iOS and Android
devices
Collection of pre-trained machine learning models enhancing your data preparation efforts
Pros:
Cons:
Looker
Looker is a cloud-based self-service data visualization and exploration tool from Google.
Features:
Cloud-based tool
Powerful data modeling layer allowing users to define relationships between different datasets and
create reusable metrics and dimensions
Data visualization integrations into applications or websites
Granular access controls, auditing capabilities, and integration with single sign-on providers, thus
securing sensitive data and making it accessible only to authorized users
Pros:
Model customization and complex calculations through LookML, or Looker Modeling language
Detailed user-interaction history through granular access control and auditing capabilities
Cons:
Summary
In this reading, you have learned about the features, pros, and cons of four dashboard and
visualization tools.
IBM Cognos Analytics is a cloud-based AI-powered tool with self-service and geospatial capabilities.
Tableau's user-friendly interface allows users to create intuitive visualizations and interactive
dashboards.
Power BI seamlessly integrates with the Microsoft ecosystem and is popular among MS Office users.
Looker is a cloud-based tool that allows robust customization and the addition of complex
calculations.