Business Intelligence Lecture Notes
When it comes to assembling a list of key big data terms, it makes
sense to identify terms that everyone needs to know — whether they
are highly technical big data practitioners, or corporate executives
who confine their big data interests to dashboard reports. These 20
big data terms hit the mark.
Analytics
The discipline of using software-based algorithms and statistics to
uncover meaning from data.
Algorithm
A mathematical formula implemented in a software program that
performs an analysis on a dataset. The algorithm often consists of
multiple calculation steps. Its goal is to operate on data in order to
solve a particular question or problem.
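As a minimal sketch of this idea, the hypothetical Python function below applies a multi-step calculation (a simple moving average) to a small dataset; the data and names are illustrative, not from the notes:

```python
# A minimal sketch: an "algorithm" as a multi-step calculation
# applied to a dataset. Data and names are hypothetical.

def moving_average(values, window=3):
    """Return the simple moving average of a list of numbers."""
    averages = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]          # step 1: select a window of data
        averages.append(sum(chunk) / window)  # step 2: average the window
    return averages

daily_sales = [120, 135, 128, 150, 142, 160]  # hypothetical dataset
print(moving_average(daily_sales))            # four windowed averages
```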
Behavioral analytics
An analytics methodology that uses data collected about users'
behavior to understand intent and predict future actions.
Big data
Data that is not system of record data and that meets one or more
of the following criteria: it arrives in extremely large datasets that
exceed the size of system of record datasets, or it comes from diverse
sources, including but not limited to machine-generated data,
internet-generated data, computer log data, social media data, and
graphics- and voice-based data.
Business intelligence (BI)
A set of methodologies and tools that analyze, report, manage, and
deliver information that is relevant to the business, and that includes
dashboards and query/reporting tools similar to those found in
analytics. One key difference between analytics and BI is that
analytics uses statistical and mathematical data analysis that
predicts future outcomes for situations. In contrast, BI analyzes
historical data to provide insights and trends information.
Clickstream analytics
The analysis of users' online activity based on the items that users
click on a web page.
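As an illustration, a minimal Python sketch of clickstream counting might look like the following; the log records and field names are hypothetical:

```python
# A minimal clickstream-analytics sketch: count which page elements
# users click most often. The click log below is hypothetical.
from collections import Counter

clicks = [
    {"user": "u1", "element": "buy-button"},
    {"user": "u2", "element": "search-box"},
    {"user": "u1", "element": "buy-button"},
    {"user": "u3", "element": "nav-home"},
]

click_counts = Counter(event["element"] for event in clicks)
print(click_counts.most_common())  # most-clicked elements first
```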
Dashboard
A graphic report on a desktop or mobile device that gives managers
and others quick summaries of activity status. This high-level
graphic report often features a green light (all operations are
normal), a yellow alert (there is some operational impact), or a red
alert (there is an operational stoppage). This at-a-glance visibility of
events and operations enables employees to track operational status
and to drill down quickly into details whenever needed.
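The green/yellow/red logic described above can be sketched in a few lines of Python; the metric and thresholds here are hypothetical:

```python
# A minimal sketch of dashboard status logic. The error-rate
# thresholds are hypothetical examples, not fixed standards.

def status_light(error_rate):
    """Map an operational metric to a green/yellow/red status."""
    if error_rate < 0.01:   # all operations normal
        return "green"
    if error_rate < 0.05:   # some operational impact
        return "yellow"
    return "red"            # operational stoppage

print(status_light(0.002))  # green
print(status_light(0.030))  # yellow
print(status_light(0.200))  # red
```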
Data aggregation
The collection of data from multiple and diverse sources with the
intention of bringing all of this data together into a common data
repository for the purposes of reporting and analysis.
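A minimal Python sketch of aggregation, with hypothetical sources, might tag each record with where it came from and collect everything into one repository:

```python
# A minimal data-aggregation sketch: records from diverse
# (hypothetical) sources collected into one common repository.

crm_records = [{"customer": "Acme", "region": "EMEA"}]
web_records = [{"customer": "Acme", "page_views": 42}]

sources = {"crm": crm_records, "web_analytics": web_records}

repository = []  # the common data repository
for name, records in sources.items():
    for record in records:
        repository.append({"source": name, **record})

print(repository)  # every record, tagged with its source
```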
Data analyst
A person responsible for working with business users to define the
types of analytics reports the business needs, and then for capturing,
modeling, preparing, and cleaning the required data in order to
develop reports that business users can act on.
Data analytics
The science of examining data with software-based queries and
algorithms with the goal of drawing conclusions about that
information for business decision making.
Data governance
A set of data management policies and practices defined to ensure
that data availability, usability, quality, integrity, and security are
maintained.
Data mining
An analytic process where data is "mined" or explored, with the
goal of uncovering potentially meaningful data patterns or
relationships.
Data repository
A central data storage area.
Data scientist
An expert in computer science, mathematics, statistics, and/or data
visualization who develops complex algorithms and data models for
the purpose of solving highly complex problems.
ETL (extract, transform, and load)
ETL enables companies to move data from one database to another:
data is extracted from the source database, transformed into a
format the target database can use, and then loaded into the target
database. The ETL process enables companies to move data in and
out of different data storage areas to create new combinations of
data for analytics queries and reports.
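As a runnable sketch of the extract/transform/load flow, the example below uses two in-memory SQLite databases; the table layout and the cents-to-dollars transform are hypothetical:

```python
# A minimal ETL sketch using SQLite in-memory databases.
# Table and column names are hypothetical.
import sqlite3

src = sqlite3.connect(":memory:")  # the source database
dst = sqlite3.connect(":memory:")  # the target database

# Seed a hypothetical source table.
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1999), (2, 4500)])

# Extract: read the rows out of the source database.
rows = src.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: convert cents to the dollars the target schema expects.
transformed = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write the transformed rows into the target database.
dst.execute("CREATE TABLE orders_reporting (id INTEGER, amount_dollars REAL)")
dst.executemany("INSERT INTO orders_reporting VALUES (?, ?)", transformed)
dst.commit()

print(dst.execute("SELECT * FROM orders_reporting").fetchall())
```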
Hadoop
Administered by the Apache Software Foundation, Hadoop is a
batch processing software framework that enables the distributed
processing of large data sets across clusters of computers.
HANA
A software/hardware in-memory computing platform from SAP
designed to process high-volume transactions and real-time
analytics.
Legacy system
An established computer system, application, or technology that
continues to be used because of the value it provides to the
enterprise.
Map/reduce
A big data batch processing framework that breaks up a data
analysis problem into pieces that are then mapped and distributed
across multiple computers on the same network or cluster, or across
a grid of disparate and possibly geographically separated systems.
The results of the analysis performed on each piece are then
collected and combined into a distilled or "reduced" report.
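The classic illustration is a word count. The sketch below runs the map and reduce steps in a single process; on a real cluster the mapped chunks would be processed on different machines:

```python
# A minimal map/reduce sketch (word count). In a real framework the
# mapped chunks run on separate machines; here one process suffices
# to show the flow. The documents are hypothetical.
from collections import Counter
from functools import reduce

documents = ["big data big insight", "data drives decisions"]

# Map: each chunk of data produces partial counts independently.
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce: combine the partial results into one distilled report.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total.most_common())
```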
System of record (SOR) data
Data that is typically found in fixed record lengths, with at least one
field in the data record serving as a data key or access field. System
of record data makes up company transaction files, such as orders
that are entered, parts that are shipped, bills that are sent, and
records of customer names and addresses.
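A minimal sketch of reading one fixed-length record, with hypothetical field positions, looks like this:

```python
# A minimal sketch of fixed-record-length system of record data.
# Field positions are hypothetical: customer ID (the key field) in
# columns 0-5, name in columns 6-20, order amount in columns 21-28.

record = "C00042Jane Smith     00019.99"

customer_id = record[0:6]          # the data key / access field
name = record[6:21].rstrip()       # fixed-width name field
amount = float(record[21:29])      # fixed-width amount field
print(customer_id, name, amount)   # C00042 Jane Smith 19.99
```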
Three tips on how to tackle big data storage management and
safekeeping chores:
Data stores continue to be overwhelmed by big data, so why don't
data center managers get rid of excess big data that isn't of use?
The main reason is fear of missing out on possible uses of big data
analytics. There is an ever-present thought that the vice president of
marketing might one day ask for a long-term trends
analysis of product sales over the past 20 years. Companies have
made use of data that old, and you never know where new
governance and regulatory requirements might take you, so why not
hold on to the data to be safe?
There is also the very real possibility that these vast stores of data
will go unused for years and even decades, while companies continue
to steadfastly store and maintain them. Gartner (2001) refers to this
unused data as "dark" data, and defines it as "the information
assets organizations collect, process and store during regular
business activities, but generally fail to use for other purposes (for
example, analytics, business relationships and direct monetizing).
Similar to dark matter in physics, dark data often comprises most
organizations' universe of information assets. Thus, organizations
often retain dark data for compliance purposes only. Storing and
securing data typically incurs more expense (and sometimes greater
risk) than value because often organizations don't classify it or
intend to use it."
"Data is dark when we do not know it exists, when we can not find
it, when we cannot interpret it, and when we cannot share or
interface with it." Colgan, the director of information governance
solutions for Nuix, helps companies manage growing volumes of
unidentified, unstructured data that sits in their storage
repositories. "Sometimes data goes dark because we're simply too
busy to deal with it, so we push it to the side and ignore it."
So, how can you "lighten up" dark data and still ensure you retain
necessary data?
Here are three suggestions:
1: Filter data
If you are using machine- or internet-generated big data, you are
getting a lot of noise as well as useful information. Data filtration
that can isolate the information you want and eliminate the rest is
one way to purify data feeds before you end up with a lot of
unidentifiable junk in your data repositories. Vendors and tools can
help you with this data cleansing process, but they can't help if you
haven't identified the present and most likely future pieces of data
that your business will need.
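A minimal filtration sketch, with a hypothetical feed and hypothetical keep-lists, might look like this:

```python
# A minimal data-filtration sketch: keep only the events and fields
# the business has identified as useful before anything is stored.
# The feed, event names, and keep-lists are all hypothetical.

KEEP_EVENTS = {"purchase", "signup"}
KEEP_FIELDS = {"event", "user_id", "amount"}

raw_feed = [
    {"event": "heartbeat", "node": "n7"},               # machine noise
    {"event": "purchase", "user_id": "u1",
     "amount": 19.99, "debug_trace": "..."},            # useful, plus junk
]

filtered = [
    {k: v for k, v in rec.items() if k in KEEP_FIELDS}
    for rec in raw_feed
    if rec.get("event") in KEEP_EVENTS
]
print(filtered)  # only the purchase event, without the debug field
```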
2: Export data
If you are concerned about retaining information for decades for
purposes of governance or long-term trends analysis, start exporting
this data to a trusted cloud-based vendor for safekeeping. You can
bring the data back into your data center for analysis when the time
comes.
3: Define data retention policies
The hallmark of excellent data center management is to be as
aggressive in defining data retention policies with business users for
big data as you are with systems of record data.
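A minimal sketch of enforcing such a policy, with a hypothetical seven-year window, might flag expired records for review:

```python
# A minimal retention-policy sketch: flag records older than an
# agreed window. The seven-year window is hypothetical; the real
# value comes from the retention policy agreed with business users.
from datetime import datetime, timedelta

RETENTION = timedelta(days=7 * 365)

records = [
    {"id": 1, "created": datetime(2005, 3, 1)},
    {"id": 2, "created": datetime.now()},
]

cutoff = datetime.now() - RETENTION
expired = [r for r in records if r["created"] < cutoff]
print(f"{len(expired)} record(s) past retention; review for deletion")
```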