This document provides an introduction to data warehousing and business intelligence. It discusses the purposes of reporting and analysis and how they differ. It also defines key concepts like data warehousing, business intelligence, the data lifecycle, metadata repositories, and different types of data marts. Reporting organizes data for monitoring performance while analysis explores data for insights to improve business. A data warehouse stores historical data from multiple sources to support analysis and decision making.
Introduction to Data Warehousing
and Business Intelligence
Prof. Ravi Patel
IT Department ADIT

Why Reporting and Analysis?
• Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
• Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
• Reporting translates raw data into information. Analysis transforms data and information into insights.
• Reporting helps companies to monitor their online business and be alerted when data falls outside of expected ranges. Good reporting should raise questions about the business from its end users. The goal of analysis is to answer questions by interpreting the data at a deeper level and providing actionable recommendations.
• Through the process of performing analysis you may raise additional questions, but the goal is to identify answers, or at least potential answers that can be tested.
• In summary, reporting shows you what is happening, while analysis focuses on explaining why it is happening and what you can do about it.

Data Life Cycle
• The data life cycle provides a high-level overview of the stages involved in successful management and preservation of data for use and reuse.
• Plan: a description of the data that will be compiled, and of how the data will be managed and made accessible throughout its lifetime.
• Collect: observations are made either by hand or with sensors or other instruments, and the data are placed into digital form.
• Assure: the quality of the data is assured through checks and inspections.
• Describe: data are accurately and thoroughly described using the appropriate metadata standards.
• Preserve: data are submitted to an appropriate long-term archive (e.g., a data center).
• Discover: potentially useful data are located and obtained, along with the relevant information about the data (metadata).
• Integrate: data from disparate sources are combined to form one homogeneous set of data that can be readily analyzed.
• Analyze: data are analyzed.

What is Business Intelligence?
• Business Intelligence (BI) is a set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions. It is a suite of software and services to transform data into actionable intelligence and knowledge.
• BI has a direct impact on an organization's strategic, tactical, and operational business decisions. BI supports fact-based decision making using historical data rather than assumptions and gut feeling.
• BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business.
• Business Intelligence tools often source their data from data warehouses. The reason is straightforward: a data warehouse already holds data from various production systems within an enterprise, and the data is cleansed, consolidated, conformed, and stored in one location. Because of this, BI tools are able to concentrate on analyzing the data.

BI and DW
• Business Intelligence and Data Warehouse (BI/DW) are two separate but closely linked technologies that are crucial to the success of any large or mid-size business. The insights derived from these systems are vital for an organization, as they help in revenue enhancement, cost reduction, and decision making.
• Data storage and management is an important managerial activity in any organization today and has become significant for rational decision making. A DW acts as a central repository system where an enterprise stores all its data (from one or more sources) in one place.
A DW supports reporting and data analysis on the current and historical data it stores, and hence it is considered a core component of Business Intelligence.

What is a Data Warehouse? Explain it with its Key Features.
• Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
• A data warehouse refers to a database that is maintained separately from an organization’s operational databases.
• Data warehouses support information processing by providing a solid platform of consolidated historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”
• The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.

Why Subject-oriented?
• A data warehouse is organized around major subjects, such as customer, supplier, product, and sales.
• Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
• Data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Why Integrated?
• A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records.
• Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Why Time-variant?
• Data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
• Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Why Nonvolatile?
• A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.
• Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms.
• It usually requires only two operations in data accessing: initial loading of data and access of data.

Metadata repository
• Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects.
• Metadata are created for the data names and definitions of the given warehouse.
• Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.

A metadata repository should contain the following:
• A description of the structure of the data warehouse, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
• Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it) and monitoring information (warehouse usage statistics, error reports, and audit trails).
• The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes source databases and their contents, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership information, and charging policies.

Data mart and its types
• Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization.
• A data mart contains only the data that are specific to a particular group.
• Data marts improve end-user response time by giving users access to the specific type of data they need to view most often, presented in a way that supports the collective view of a group of users.
• A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization.
• Each data mart is dedicated to a specific business function or region.
• For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Three basic types of data marts are dependent, independent, and hybrid.
• The categorization is based primarily on the data source that feeds the data mart.
• Dependent data marts draw data from a central data warehouse that has already been created.
• Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data, or both.
• Hybrid data marts can draw data from operational systems or data warehouses.

Dependent Data Marts
• A dependent data mart allows you to unite your organization's data in one data warehouse.
• This gives you the usual advantages of centralization.
• Figure illustrates a dependent data mart.

Independent Data Marts
• An independent data mart is created without the use of a central data warehouse.
• This could be desirable for smaller groups within an organization.
• Figure illustrates an independent data mart.

Hybrid Data Marts
• A hybrid data mart allows you to combine input from sources other than a data warehouse.
• This could be useful in many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization.
• Figure illustrates a hybrid data mart.

Basic elements of a Data Warehouse
• Source System
• Data Staging Area
• Presentation Server/area
• Metadata
• End User Application

Source System
• An operational system of record whose function it is to capture the transactions of the business.
• A source system is often called a "legacy system" in a mainframe environment.

Data Staging Area
• A storage area and a set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse.
• The data staging area is everything in between the source system and the data presentation server.
• It may be on a single machine or spread over several machines.
• Data staging is an intermediate storage area used for data processing during the extract, transform, and load (ETL) process. The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.

Presentation Server/area
• The target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications.
• It is on the presentation server that we insist that the data be presented and stored in a dimensional framework.
• If the presentation server is based on a relational database, then the tables will be organized as star schemas. If the presentation server is based on non-relational on-line analytic processing (OLAP) technology, then the data will still have recognizable dimensions. Most of the large data marts (greater than a few gigabytes) are implemented on relational databases.

End User Application
• A collection of tools that query, analyze, and present information targeted to support a business need.
• A minimal set of such tools would consist of an end user data access tool, a spreadsheet, a graphics package, and a user interface facility for eliciting prompts and simplifying the screen presentations to end users.

Components of a Data Warehouse
• Source data Component
• Data staging Component
• Data storage Component
• Information Delivery Component
• Metadata Component
• Management and Control Component

Information Delivery Component
• In order to provide information for decision making to the wide community of data warehouse users, the information delivery component includes different methods of information delivery.
• It provides information to one or more destinations according to a specified scheduling algorithm.
• Information delivery may be based on the time of day or on the completion of external events.

Management and Control Component
• This component of the data warehouse architecture sits on top of all the other components.
• The management and control component coordinates the services and activities within the data warehouse.
• This component controls the data transformation and the data transfer into the data warehouse storage.
• It works with the database management systems and enables data to be properly stored in the repositories. It monitors the movement of data into the staging area and from there into the data warehouse storage itself.
• The management and control component interacts with the metadata component to perform the management and control functions.
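
Example: from source system to star schema
The flow described in the slides (source system, data staging area, presentation server, end user application) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample rows, the table names (dim_product, dim_date, fact_sales), and the function names are hypothetical, and an in-memory SQLite database stands in for the presentation server. It is not a definitive warehouse design, only one small instance of cleaning and de-duplicating in a staging step and then loading a star schema.

```python
import sqlite3

# Hypothetical rows captured by a source system: (order_id, date, product, amount).
SOURCE_ROWS = [
    (1, "2023-01-05", " Widget ", "10.50"),
    (2, "2023-01-05", "gadget", "20.00"),
    (2, "2023-01-05", "gadget", "20.00"),  # duplicate record from the source
    (3, "2023-02-10", "WIDGET", "5.25"),
]


def stage(rows):
    """Staging step: normalize product names, convert types, de-duplicate on order_id."""
    seen = set()
    cleaned = []
    for order_id, date, product, amount in rows:
        if order_id in seen:
            continue  # keep only the first occurrence of each order
        seen.add(order_id)
        cleaned.append((order_id, date, product.strip().lower(), float(amount)))
    return cleaned


def load(conn, rows):
    """Load staged rows into a small star schema: one fact table, two dimensions."""
    conn.executescript(
        """
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_date (date_key TEXT PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            order_id INTEGER PRIMARY KEY,
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key TEXT REFERENCES dim_date(date_key),
            amount REAL
        );
        """
    )
    for order_id, date, product, amount in rows:
        # Insert the dimension rows only if they are not already present.
        conn.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (product,))
        product_key = conn.execute(
            "SELECT product_key FROM dim_product WHERE name = ?", (product,)
        ).fetchone()[0]
        year, month, _day = date.split("-")
        conn.execute(
            "INSERT OR IGNORE INTO dim_date VALUES (?, ?, ?)",
            (date, int(year), int(month)),
        )
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
            (order_id, product_key, date, amount),
        )
    conn.commit()


def sales_by_product(conn):
    """End-user style query: total sales per product, joined through the dimension."""
    return conn.execute(
        """
        SELECT p.name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.name
        ORDER BY p.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stands in for the presentation server
load(conn, stage(SOURCE_ROWS))
totals = sales_by_product(conn)
print(totals)
```

Keeping the cleaning logic in stage and the schema in load mirrors the separation between the data staging area and the presentation server: the fact table is written once and afterwards only read, which matches the nonvolatile property described earlier.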