Enterprise Data Management (Midterm Reviewer)
LESSON 1
Data is a corporate asset that must be managed to maximize its value, and its management must be enabled as a
common capability to ensure that this asset is used properly.
Data are raw facts that describe the characteristics of an event or object.
Structured data has a defined length, type, and format and includes numbers, dates, or strings such as a customer address.
1. Machine-generated data is created by a machine without human intervention. Machine-generated structured data
includes sensor data, point-of-sale data, and web log data.
2. Human-generated data is data that humans generate in interaction with computers. Human-generated structured
data includes input data, clickstream data, or gaming data.
Unstructured data has no defined structure, does not follow a specified format, and is typically free-form text such as
emails, Twitter tweets, and text messages.
1. Machine-generated unstructured data, including satellite images, scientific atmospheric data, and radar data.
2. Human-generated unstructured data, including text messages, social media data, and emails.
Big data is a collection of large, complex data sets, including structured and unstructured data, which cannot be analyzed
using traditional database methods and tools.
The simple difference between data and information is that computers or machines need data and humans need
information.
Data is the raw building block that has not been shaped, processed, or analyzed, and it frequently appears disorganized
and unfriendly.
Information gives meaning and context to analyzed data, making it insightful for humans by providing context and
structure that are extremely valuable when making informed business decisions.
A report is a document containing data organized in a table, matrix, or graphical format allowing users to easily
comprehend and understand information.
A static report is created once based on data that does not change.
A variable is a data characteristic that stands for a value that changes or varies over time.
Business intelligence (BI) is information collected from multiple sources, such as suppliers, customers, competitors,
partners, and industries, and analyzed to reveal patterns, trends, and relationships for strategic decision making.
Business analytics is the scientific process of transforming data into insight for making better decisions.
Analytics
1. Descriptive analytics use techniques that describe past performance and history.
2. Predictive analytics use techniques that extract information from data and use it to predict future trends and identify
behavioral patterns.
3. Prescriptive analytics use techniques that create models indicating the best decision to make or course of action to
take.
Knowledge includes the skills, experience, and expertise, coupled with information and intelligence, that create a
person's intellectual resources.
Knowledge assets, also called intellectual capital, are the human, structural, and recorded resources available to the
organization.
What is Enterprise Data Management? The ability of an organization to effectively create, integrate, disseminate, and
manage data for all applications, processes, and entities of the enterprise that require accurate and timely delivery of data.
Enterprise data managers are most often database administrators, IT administrators, or IT project managers. They are in
charge of the process of managing your business’s entire data life cycle.
Benefits of EDM
• Improved access to organized, properly defined data through data governance and metadata management.
• Improved quality of data for decision making and operations – faster operations and faster, more accurate decisions.
• Improved reporting and analytic capabilities on both enterprise and local scales, for accurate results.
• Improved data security and privacy access according to standards and procedures applied consistently.
• Integration of data across sources according to standards and using a consistent architecture framework, for ease of
integration and access.
Components of EDM
• Data Governance – planning, oversight, and control over management of data and the use of data and data-related
resources; development and implementation of policies and decision rights over the use of data.
• Data Security and Privacy – ensuring privacy, confidentiality and appropriate access to data.
• Data Integration & Development – acquisition, extraction, transformation, movement, delivery, replication, federation,
virtualization and operational support.
• Data Analytics & Business Intelligence – managing analytical data processing and enabling access to decision support
data for reporting and analysis.
• Data Quality – defining, monitoring, and maintaining data integrity, and improving the accuracy, completeness, validity,
timeliness, and consistency of data.
LESSON 2
1. Perform Assessment – Businesses need a clear understanding of their data flows and the types of data they have in
order to craft an effective data management strategy.
2. Define Deliverables – It is important for an organization to outline what they hope to accomplish by implementing an enterprise data management strategy.
3. Determine Standards, Policies and Procedures - Standards, policies, and procedures are invaluable guideposts,
keeping data where it needs to be and helping to prevent corruption, security breaches, and loss of data.
4. Educate the stakeholders - Enterprise data management is sure to fail if the standards, policies, and procedures
surrounding it are not properly disseminated and emphasized. Additionally, EDM strategies are better positioned for
success if all of those who deal with data are on board with the project.
5. Emphasize Quality - Bad data is actually worse than no data at all. Adopting a culture of data quality will help protect
your data’s security and integrity and ultimately preserve its worth.
6. Invest in the Right People and Technology - Understanding the art of managing data isn’t everyone’s forte. It’s best to
have an in-house or consultative expert with experience establishing enterprise data management systems. Their
knowledge can help identify the right technologies to use.
DAMA-DMBOK (the DAMA Data Management Body of Knowledge) is a detailed guidebook that provides best practices and standards for specific data management functions.
• Data Governance – Ensuring all data management activities align with organizational policies, standards, and
regulations. It provides direction and control over data management processes.
• Data Architecture - Designing and managing the structure of an organization's data, ensuring it aligns with business
goals and supports data flow across systems.
• Data Modeling and Design – Creating models that define data elements, their relationships, and rules for data storage,
which support effective database design.
• Data Storage and Operations - Managing how data is stored, accessed, and maintained. This includes the physical and
technical aspects of data storage.
• Data Security – Protecting data from unauthorized access and breaches, ensuring confidentiality, integrity, and
availability of data.
• Data Integration and Interoperability - Ensuring data can be shared and used across different systems, applications,
and platforms without losing consistency.
• Document and Content Management - Managing unstructured data, including documents, images, and multimedia,
ensuring they are properly stored and retrievable.
• Reference & Master Data Management - Managing critical data that is used across various systems and ensuring
consistency, accuracy, and reliability.
• Data Warehousing & Business Intelligence - Collecting, storing, and analyzing large volumes of data to support
decision-making and business intelligence activities.
• Metadata - Managing data about data, which provides context and meaning, making it easier to manage and
understand other data assets.
• Data Quality - Ensuring that data is accurate, complete, consistent, and reliable, which is critical for effective decision-
making.
LESSON 3
A repository is a central location in which data is stored and managed. A data warehouse is a collection of information,
gathered from many different operational databases, that supports business analysis activities and decision-making tasks.
The primary purpose of a data warehouse is to combine information, more specifically strategic information, from
throughout an organization into a single repository in such a way that the people who need that information can make
decisions and undertake business analysis.
Standardization of data elements allows for greater accuracy, completeness, and consistency and increases the quality of
information in making strategic business decisions.
Data aggregation is the collection of data from various sources for the purpose of data processing. Businesses collect a
tremendous amount of transactional information as part of their routine operations.
Extraction, transformation, and loading (ETL) is a process that extracts information from internal and external
databases, transforms it using a common set of enterprise definitions, and loads it into a data warehouse. The data
warehouse then sends portions (or subsets) of the information to data marts.
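As a rough illustration, the ETL steps can be sketched in a few lines of pandas. The source file names, the column names, and the SQLite file standing in for a warehouse are all hypothetical; this is a minimal sketch rather than a real pipeline:

    import sqlite3
    import pandas as pd

    # Extract: pull raw records from two hypothetical source files.
    sales = pd.read_csv("sales_system.csv")       # internal database export (assumed)
    partners = pd.read_json("partner_feed.json")  # external feed (assumed)

    # Transform: apply a common set of enterprise definitions,
    # e.g., a shared column name and a standard date type.
    sales = sales.rename(columns={"cust": "customer_id"})
    sales["order_date"] = pd.to_datetime(sales["order_date"])
    combined = pd.concat([sales, partners], ignore_index=True)

    # Load: write the conformed rows into a warehouse table
    # (a local SQLite database stands in for the warehouse here).
    conn = sqlite3.connect("warehouse.db")
    combined.to_sql("fact_sales", conn, if_exists="append", index=False)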
Information cleansing or scrubbing is a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete
information.
In a data warehouse, information cleansing occurs first during the ETL process and again once the information is in the
data warehouse.
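For instance, typical cleansing steps map directly onto pandas operations; the customer records below are made up to show duplicated, incomplete, and inconsistent entries:

    import pandas as pd

    # Hypothetical customer extract with the usual defects.
    df = pd.DataFrame({
        "customer_id": [101, 101, 102, 103],
        "email": ["a@x.com", "a@x.com", None, "c@x.com"],
        "city": ["Manila", "Manila", "Cebu", "cebu"],
    })

    df = df.drop_duplicates()             # weed out repeated records
    df = df.dropna(subset=["email"])      # discard incomplete records
    df["city"] = df["city"].str.title()   # fix inconsistent capitalization
    print(df)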
LESSON 4
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but
effective approach to object-oriented programming.
Python has gone from a bleeding-edge or “at your own risk” scientific computing language to one of the most important
languages for data science, machine learning, and general software development in academia and industry.
NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data
structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python.
• Functions for performing element-wise computations with arrays or mathematical operations between arrays (see the sketch after this list).
• A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and
computational facilities.
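As a quick illustration of the first point, NumPy's element-wise functions apply to every element of an array at once, and its arithmetic functions operate between whole arrays:

    import numpy as np

    a = np.array([1.0, 4.0, 9.0])
    b = np.array([10.0, 20.0, 30.0])

    print(np.sqrt(a))    # element-wise computation: [1. 2. 3.]
    print(np.add(a, b))  # operation between arrays: [11. 24. 39.]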
Pandas, which emerged in 2010, provides high-level data structures and functions designed to make working with
structured or tabular data fast, easy, and expressive.
The primary objects in pandas that will be used are the DataFrame, a tabular, column-oriented data structure with both
row and column labels, and the Series, a one-dimensional labeled array object.
Pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of
spreadsheets and relational databases (such as SQL).
Matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations. It was
originally created by John D. Hunter and is now maintained by a large team of developers. It is designed for creating plots
suitable for publication.
IPython and Jupyter are designed from the ground up to maximize your productivity in both interactive computing and
software development. They encourage an execute-explore workflow instead of the typical edit-compile-run workflow of
many other programming languages.
In 2014, Fernando Pérez and the IPython team announced the Jupyter project, a broader initiative to design
language-agnostic interactive computing tools.
The IPython web notebook became the Jupyter notebook, with support now for over 40 programming languages.
The IPython system can now be used as a kernel (a programming language mode) for using Python with Jupyter.
Anaconda is a Python distribution (prebuilt and preconfigured collection of packages) that is commonly used for data
science.
NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in
Python.
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large
datasets in Python.
Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent
operations between scalar elements.
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same
type.
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of
the array.
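A small example makes both attributes concrete:

    import numpy as np

    data = np.array([[1.5, -0.1, 3.0],
                     [0.0, -3.0, 6.5]])

    print(data.shape)  # (2, 3): two rows, three columns
    print(data.dtype)  # float64: every element shares this one type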
You can also slice NumPy arrays. Slicing is used to extract a portion of the data from an array.
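For example:

    import numpy as np

    arr = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]
    print(arr[2:5])       # [2 3 4], index 2 up to but not including 5
    print(arr[:3])        # [0 1 2], from the start
    print(arr[5:])        # [5 6 7 8 9], to the end

One caveat worth remembering: basic NumPy slices are views, not copies, so assigning into a slice changes the original array.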
Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy
users call this vectorization.
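A short contrast shows the idea; the loop version is included only to show what the vectorized expressions replace:

    import numpy as np

    arr = np.array([1.0, 2.0, 3.0, 4.0])

    # Vectorized: one expression operates on the whole block of data.
    doubled = arr * 2   # [2. 4. 6. 8.]
    squared = arr ** 2  # [ 1.  4.  9. 16.]

    # The explicit for loop the expressions above replace.
    doubled_loop = np.array([x * 2 for x in arr])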
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels,
called its index.
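For example, a Series built with an explicit index can be queried by label:

    import pandas as pd

    # Values paired with an index of data labels.
    s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

    print(s["b"])    # 7, looked up by label
    print(s.index)   # Index(['d', 'b', 'a', 'c'], dtype='object')
    print(s.values)  # [ 4  7 -5  3]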
DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a
different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same
index.
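The example below (with made-up data) shows columns of three different value types, and how selecting a single column returns a Series:

    import pandas as pd

    frame = pd.DataFrame({
        "state": ["Ohio", "Ohio", "Nevada"],  # strings
        "year": [2000, 2001, 2001],           # integers
        "pop": [1.5, 1.7, 2.4],               # floats
    })

    print(frame["state"])  # a single column is a Series
    print(frame.columns)   # Index(['state', 'year', 'pop'], dtype='object')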
An aggregation function returns a single aggregated value for each group. Once the groupby object is created, several
aggregation operations can be performed on the grouped data.
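A small sketch with made-up salary data shows one aggregation per group, and then several at once via agg():

    import pandas as pd

    df = pd.DataFrame({
        "dept": ["IT", "IT", "HR", "HR"],
        "salary": [50000, 60000, 45000, 55000],
    })

    grouped = df.groupby("dept")  # create the groupby object
    print(grouped["salary"].mean())                      # one value per group
    print(grouped["salary"].agg(["min", "max", "sum"]))  # several aggregations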
The pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join
operation. The joining is performed on columns or indexes.
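A minimal example with two made-up tables, joined on their shared key column:

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Ana", "Ben", "Cara"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 3],
                           "amount": [250, 90, 430]})

    # Database-style inner join on the customer_id column.
    merged = pd.merge(customers, orders, on="customer_id", how="inner")
    print(merged)  # Ana appears twice (two orders); Ben has none and drops out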