Warehousing and Data Mining

Data Mining

Uploaded by

Prince Gupta
Copyright
© All Rights Reserved
CONTENTS

KOE 093 : Data Warehousing & Data Mining

ANALYSIS OF AKTU PAPERS

UNIT-1 : DATA WAREHOUSING
Overview, Definition, Data Warehousing Components, Building a Data Warehouse, Warehouse Database, Mapping the Data Warehouse to a Multiprocessor Architecture, Difference between Database System and Data Warehouse, Multi Dimensional Data Model, Data Cubes, Stars, Snow Flakes, Fact Constellations, Concept Hierarchy.

UNIT-2 : DATA WAREHOUSE PROCESS AND TECHNOLOGY
Warehousing Strategy, Warehouse Management and Support Processes, Warehouse Planning and Implementation, Hardware and Operating Systems for Data Warehousing, Client/Server Computing Model & Data Warehousing, Parallel Processors & Cluster Systems, Distributed DBMS Implementations, Warehousing Software, Warehouse Schema Design.

UNIT-3 : DATA MINING
Overview, Motivation, Definition & Functionalities, Data Processing, Form of Data Pre-processing, Data Cleaning: Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data, Data Integration and Transformation. Data Reduction: Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Discretization and Concept Hierarchy Generation, Decision Tree.

UNIT-4 : CLASSIFICATION AND CLUSTERING
Classification: Definition, Data Generalization, Analytical Characterization, Analysis of Attribute Relevance, Mining Class Comparisons, Statistical Measures in Large Databases, Statistical-Based Algorithms, Distance-Based Algorithms, Decision Tree-Based Algorithms.
Clustering: Introduction, Similarity and Distance Measures, Hierarchical and Partitional Algorithms. Hierarchical Clustering: CURE and Chameleon. Density Based Methods: DBSCAN, OPTICS. Grid Based Methods: STING, CLIQUE. Model Based Method: Statistical Approach. Association Rules: Introduction, Large Itemsets, Basic Algorithms, Parallel and Distributed Algorithms, Neural Network Approach.

UNIT-5 : DATA VISUALIZATION
Aggregation, Historical Information, Query Facility, OLAP Function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP, Data Mining Interface, Security, Backup and Recovery, Tuning Data Warehouse, Testing Data Warehouse.
Warehousing Applications and Recent Trends: Types of Warehousing Applications, Web Mining, Spatial Mining and Temporal Mining.

SHORT QUESTIONS
SOLVED PAPERS (2013-14 TO 2018-19)

Analysis of Previous AKTU Papers
* = Asked in different years

[Tables: year-wise number of questions asked per topic, by unit.]

Unit-1 topics: Data warehousing components, Building a data warehouse, Mapping the data warehouse, Difference between data warehouse and database, Data cube, Concept hierarchy.

Unit-2 topics: Warehouse strategy, Warehouse planning, Hardware of data warehouse, Client/server computing model, Distributed DBMS implementation.

Unit-3 topics: Introduction, Data processing, Data cleaning, Data reduction.

Unit-5 : Data Visualization and Overall Perspective

CONTENTS
Overview, Definition, Data Warehousing Components
Building a Data Warehouse, Warehouse Database
Mapping the Data Warehouse to a Multiprocessor Architecture
Difference between Database System and Data Warehouse, Multidimensional Data Model
Data Cube,
Stars, Snow Flakes, Fact Constellations
Concept Hierarchy

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.1. What do you mean by data warehouse? Discuss its key features with suitable example.

Answer
A data warehouse (DW) is a collection of corporate information and data derived from operational systems and external data sources. It is used to support business decisions by allowing data consolidation, analysis and reporting at different aggregate levels.

Key features of data warehouse are:
1. Subject-oriented : A data warehouse can be used to analyze a particular subject area. For example, sales, marketing, etc. can each be a particular subject.
2. Integrated : A data warehouse integrates data from multiple data sources and applications, and the information is stored in a common format.
3. Time-variant : Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months or even older from a data warehouse. A data warehouse can also hold all the addresses associated with a record, not only the latest one.
4. Non-volatile : A data warehouse is non-volatile. Existing data is not erased when new data is entered in it.
5. Data granularity : Data granularity is the level of detail at which data is stored. For example, a warehouse can answer which of all the stores recorded the maximum sales, and in which region.

Que 1.2. Describe the components of data warehouse.

Answer
Following are the various components of data warehouse:
a. Data warehouse database : The database is implemented on RDBMS technology. Due to some constraints, different approaches to the database are used:
1. RDBMS are deployed in parallel to allow scalability.
2. New index structures are used to bypass relational table scans and improve speed.
3. Multidimensional databases are used to overcome any limitations due to the relational data model.
b. ETL tools : The functionality of the sourcing, acquisition, cleanup and transformation tools, also called ETL tools, includes:
1. Removing unwanted data from operational databases.
2. Converting to common data names and definitions.
3. Establishing defaults for missing data.

Fig. 1.2.1. Information flow of a data warehouse.

c. Metadata : Metadata is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse.
d. Access tools : Access tools are divided into four main categories:
1. Query and reporting tools
2. Application development tools
3. Online analytical processing tools
4. Data mining tools
e. Data warehouse bus architecture : The data warehouse bus determines the flow of data in the warehouse. The data flow in a data warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Metaflow.

PART-2
Building a Data Warehouse, Warehouse Database.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.3. Explain the concept of building a data warehouse.

Answer
Following steps should be adopted to build a successful data warehouse:
1. Business considerations :
a. Approach : For data warehouse development, one of two approaches is used:
i. Top-down approach : In the top-down approach, the data warehouse is built first. The data marts are then created from the data warehouse.
ii. Bottom-up approach : In the bottom-up approach, the data marts are created first and then the data warehouse is built.
b. Organizational issues : Most IS organizations have expertise in developing operational systems.
2. Design considerations : There are several points related to data warehouse design:
a. Data content : The data warehouse system should not contain as much detail-level data as the operational system used to source this data.
b. Metadata : Metadata is data about data. It means it is a description of the data. Metadata helps to organize, find and understand the data.
c. Data distribution : It becomes necessary to know how the data should be divided across multiple servers and which users should get access to which type of data.
d. Tools : The tools provide the facilities for defining the transformation and cleanup rules, data movement, end-user query, reporting and data analysis.
e. Performance considerations : An ideal data warehouse system should support interactive query processing.
3. Technical considerations : A number of technical issues are to be considered when implementing and building a data warehouse environment:
a. Hardware platform : The data warehouse server has to be able to support large data volumes and complex queries.
b. The database management system that supports the warehouse database.
c. Communication infrastructure : A data warehouse user requires a large bandwidth to interact with the data warehouse and retrieve a large amount of data for analysis.
d. The hardware platform and software to support the metadata repository.
e. The systems management framework that enables centralized management and administration of the entire environment.
4. Implementation considerations : The implementation of a data warehouse requires the integration of many products:
a. Access tools : Ranking, statistical analysis, time-series analysis, artificial intelligence and information mapping are some examples of access tool types.
b. Data extraction, cleanup, transformation and migration.
c. Data placement strategies : As a data warehouse grows, there should be a way to store the data on storage media and distribute the data in the data warehouse across multiple servers.
d. Metadata : Metadata is data about data.
It means it is a description of the data. It helps to organize, find and understand the data.
e. User sophistication levels : A certain degree of sophistication is required to use the warehouse effectively.

CONCEPT OUTLINE
• Mapping the relational database to the multiprocessing hardware architecture allows a successful implementation of a data warehouse.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.4. Enumerate the steps involved in mapping the data warehouse to a multiprocessor architecture. (AKTU 2016-17, Marks 10)
OR
What is the architecture of data warehouse operations?

Answer
Steps involved in mapping the data warehouse to a multiprocessor architecture:
1. Relational database technology for data warehouse :
a. Linear speed-up : The ability to increase the number of processors to reduce response time.
b. Linear scale-up : The ability to provide the same performance on the same requests as the database size increases.
Types of parallelism :
i. Horizontal parallelism : In this, different server threads or processes handle multiple requests at the same time.
ii. Vertical parallelism : This form of parallelism decomposes the serial SQL query into lower-level operations such as scan, join, sort, etc.

[Figure: response time of a serial RDBMS compared with horizontal parallelism (data partitioning) and vertical parallelism (query partitioning).]

Data partitioning : Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.
a. Random partitioning : Includes random data striping across multiple disks on a single server.
b. Intelligent partitioning : Assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.
c. Hash partitioning : A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.
d. Key range partitioning : Rows are placed and located in the partitions according to the value of the partitioning key.
e. Schema partitioning : An entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables.
f. User defined partitioning : It allows a table to be partitioned on the basis of a user defined expression.

2. Database architectures of parallel processing : There are three DBMS software architecture styles for parallel processing:
a. Shared memory or shared-everything architecture : It has the following characteristics:
i. Multiple Processing Units (PU) share memory.
ii. It is simple to implement and provides a single system image.
iii. It implements an RDBMS on SMP (Symmetric Multiprocessor) hardware.
b. Shared disk architecture : Shared disk architecture implements the concept of shared ownership of the entire database between RDBMS servers, each of which is running on a node of a distributed memory system.
c. Shared nothing architecture : In shared nothing architecture, only one CPU is connected to a given disk. If a table or database is located on that disk, access depends on the CPU that owns it. Shared nothing systems are concerned with access to disks, not with access to memory.

3. Parallel DBMS features :
a. Scope and techniques of parallel DBMS operations
b. Optimized implementations
c. Application transparency
d. Parallel environment
e. DBMS management tools
f. Price / Performance

4. Alternative technologies : Alternative technologies for improving performance in a data warehouse environment include the following:
a. Advanced database indexing products
b. Multidimensional databases
c. Specialized RDBMS

5. Parallel DBMS vendors :
a. Oracle : Supports parallel database processing.
b. Informix : It supports full parallelism.
c. IBM : Its parallel client/server database product is DB2 Parallel Edition.
d. Sybase : It implemented its parallel DBMS functionality in a product called SYBASE MPP (Sybase + NCR).

Que 1.5. Define data warehouse. What strategies should be taken care of while designing a warehouse?

Answer
Data warehouse : Refer Q. 1.1, Page 1-2D, Unit-1.
The strategies that should be taken care of while designing a warehouse are:
1. Educate yourself : We must understand what users want, because the purpose of a data warehouse system is to provide decision-makers the accurate, timely information they need to make the right choices.
2. Determine business requirements : To determine business requirements we should understand the following:
a. Why the requestor needs a data warehouse.
b. What they are trying to accomplish: saving time in collecting data, higher quality of data, supporting certain applications, etc. We need to tie these business objectives to data sources.
c. What business rules to follow and what users and/or applications to support.
3. Make a timeline : Break up the business objectives mentioned above into two to three month incremental deliverables.
4. Choose the architecture, methodology and technology, and build the team.

PART-4
Difference between Database System and Data Warehouse, Multidimensional Data Model.

CONCEPT OUTLINE
• A database system describes processing at operational sites whereas a data warehouse describes processing at the warehouse.
• A multidimensional data model is used for the design of corporate data warehouses.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.6. What is data warehouse? How does it differ from a database?

Answer
Data warehouse : Refer Q. 1.1, Page 1-2D, Unit-1.
Difference :
S. No. | Data warehouse | Database
1. | It involves historical processing of information. | It involves day-to-day processing.
2. | It is used to analyze the business. | It is used to run the business.
3. | It focuses on information out. | It focuses on data in.

S. No. | Basis | OLAP | OLTP
1. | Abbreviation | It stands for Online Analytical Processing. | It stands for Online Transaction Processing.
2. | Use | It is used for query processing. | It is used for transaction processing.
3. | Data | It holds historical data. It stores only relevant data. | It holds current data. It stores all data.
4. | Type | It is analysis driven. | It is application driven.
5. | Source | The data comes from various OLTP sources. | It is the original source of data.
6. | Purpose | To help with planning and decision support. | To control and run business tasks.

Metadata :
1. Metadata is data about data. It means it is a description and context of the data. It helps to organize, find and understand data.
2. In a data warehouse, metadata are the data that define warehouse objects.
3. Metadata can be classified into two types:
a. Technical metadata
b. Business metadata
Importance of metadata :
1. Metadata drives data warehouse processes.
2. Metadata gives the user the meaning of each data element.
3. Metadata establishes the context for data elements.

Multidimensional data model :
1. A multidimensional data model stores data in the form of a data cube. A data cube allows data to be viewed in multiple dimensions.
2. Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model.

Fig. 1.10.2. Snowflake schema.

Fact constellations :
1. A fact constellation can have multiple fact tables that share many dimension tables.
2. This type of schema can be viewed as a collection of stars and snowflakes, and hence is called a galaxy schema or a fact constellation.
3. The main disadvantage of the fact constellation schema is its more complicated design.
Example : Let us assume that Deccan Electronics would like to have another fact table for supply and delivery. It may contain five dimensions, or keys: time, item, delivery agent, origin and destination, along with the numeric measures, namely the number of units supplied and the cost of delivery. It can be seen that both fact tables can share the same item-dimension table as well as the time-dimension table. A fact constellation schema is shown in Fig. 1.10.3.

Fig. 1.10.3. Fact constellation.

Difference :
S. No. | Star schema | Fact constellation
1. | In star schema, each dimension is represented by only one table. | In fact constellation, a dimension table can be shared by multiple fact tables.
2. | It is simple to understand and easily designed. | It is more complex and hard to design.
3. | It does not use normalization. | It uses normalization.
4. | It saves space due to a single fact table. | It does not save space due to multiple fact tables.

Que. Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester and instructor, and two measures, count and avg_grade. When at the lowest conceptual level (for example, for a given student, course, semester and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
i. Draw a snowflake schema diagram for the data warehouse.
ii. Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (for example, roll-up from semester to year) should one perform in order to list the average grade of CS courses for each student of Big University?

Answer
i. [Figure: snowflake schema. Fact table: student_id, course_id, semester_id, instructor_id, count, avg_grade. Semester dimension table: semester_id, semester, year.]
ii. Starting with the base cuboid [student, course, semester, instructor]:
1. Roll-up on course from (course_key) to major.
2. Roll-up on student from (student_key) to University.
3. Dice on course, student with department = "CS" and University = "Big University".
4. Drill-down on student from University to student name.

Concept Hierarchy.

CONCEPT OUTLINE
• A concept hierarchy can be represented as a directed acyclic graph of concepts.

Que. Describe concept hierarchy with example.

Answer
1. Concept hierarchy represents the relationship between data elements in such a way that they relate to each other as one above another or one below another.
2. An example of a concept hierarchy is the Date hierarchy, which forms the relationship Year -> Month -> Day -> Week, etc.

[Figure: Date hierarchy with Year, Month, Day and Week.]

3. There are three main types of hierarchies in data warehouse design:
a. Balanced hierarchy
b. Unbalanced hierarchy
c. Ragged hierarchy
4. Concept hierarchy reduces the data by collecting and replacing low-level concepts by higher-level concepts.

a. Determining the dimensions to be included.
b.
Determining the location to place the hierarchy of each dimension of information.
Partitioning : Refer Q. 1.4, Page 1-6D, Unit-1.

Data Warehouse Process & Technology

CONTENTS
Part-1 : Warehousing Strategy, Warehouse Management and Support Processes
Part-2 : Warehouse Planning and Implementation
Part-3 : Hardware and Operating Systems for Data Warehousing
Part-4 : Client/Server Computing Model and Data Warehousing, Parallel Processors and Cluster Systems
Part-5 : Distributed DBMS Implementations, Warehousing Software and Warehouse Schema Design

Warehousing strategy :
A warehousing strategy involves many important decisions about the elements that make up the warehouse. Its components are:
1. Preliminary data warehouse rollout plan.
2. Preliminary data warehouse architecture : It defines the overall architecture of the data warehouse.
3. Short-listed data warehouse environment and tools : Create a short list of environments and tools that can meet the warehousing needs.

Que 2.2. Explain warehouse management and support processes.

Answer
Warehouse management and support processes are designed to address the aspects of planning and managing a DW project, subject to successful implementation and extension of software.
Steps in warehouse management and support processes :
1. Define issue tracking and resolution process : It includes the following guidelines: issue description, urgency, raised by, assigned to, date opened, date closed, resolved by and resolution description.
2. Perform capacity planning : It can be done in the following forms:
i. Space required : Space requirements include indexing, summarization, metadata, and backup and recovery.
ii. Machine processing power : It should be scalable and meet the processing requirements.
iii. Network bandwidth.
3. Define warehouse purging rules : It defines the mechanism for archiving or removing older data from the data warehouse.
4. Define security management : Various steps involved in security management are:
i. Determine and evaluate IT assets
ii. Analyze risk
iii. Define security practices
iv. Implement practices
v. Monitor violations and take corresponding actions
vi. Re-evaluate IT assets and risk
5. Define backup and recovery strategy :
i. Data to be backed up : Identify the data that must be backed up on a regular basis. This gives us an indication of the regular backup size.
ii. Batch window of the warehouse : It determines the maximum allowable down time for the warehouse.
iii. Maximum acceptable time for recovery : It determines the maximum acceptable time for the warehouse data and metadata to be restored.
iv. Acceptable costs for backup and recovery : Different backup mechanisms imply different backup costs.
6. Set up collection of warehouse usage statistics : Warehouse usage statistics are collected to provide the data warehouse designer with inputs for further refining the data warehouse design and to track the general usage and acceptance of the warehouse.

Que 2.3. Write a short note on data warehouse planning.

Answer
Data warehouse planning describes the activities related to planning of the data warehouse. The different activities are:
1. Assemble and orient the team :
a. Distribute copies of the DW strategy.
b. Set up teams and specify roles.
c. Give training if required.
d. Set up milestones and check points.
2. Conduct decisional requirements analysis : It means gaining an understanding of the information needs of decision makers.
3. Conduct decisional source system audit : Survey the current source systems for the data warehouse.
4. Design logical and physical warehouse schema : It includes two schema design techniques:
a. Normalization : Normalize the attribute data so as to fall within a specified range, such as 0.0 to 1.0.
b. Dimensional modeling : This technique produces denormalized, star schema designs consisting of fact and dimension tables. A variation of the dimensional star schema also exists (i.e., snowflake schema).
5. Produce source-to-target field mapping : The source-to-target field mapping documents how fields in the operational systems are transformed into the data warehouse fields.
6. Select development and production environment and tools : It finalizes the computing environment and tool set for the rollout based on the results of the development and production environment.
7. Create prototype for this rollout : It creates a prototype of the data warehouse using the final tools and production environment.
8. Create implementation plan for this rollout : It drafts an implementation plan for the rollout.

Que 2.4. Explain all steps and guidelines for data warehouse implementation.

Answer
Steps for data warehouse implementation :
1. Requirements analysis and capacity planning : The first step in data warehousing involves defining enterprise needs, defining architecture, carrying out capacity planning and selecting the hardware and software tools.
2. Hardware integration : Once the hardware and software have been selected, they need to be put together by integrating the server, the storage devices and the client software tools.
3. Modeling : Modeling is a major step that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is complex.
4. Physical modeling : This involves designing the physical data warehouse organization, data placement, data partitioning, and deciding on access methods and indexing.
5. Sources : The data for the data warehouse is likely to come from a number of data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers or other wrappers.
6. ETL : The data from the source systems will need to go through an ETL process. The step of designing and implementing the ETL process may involve identifying a suitable ETL tool vendor and purchasing and implementing the tool.
7. Populate the data warehouse : Once the ETL tools have been agreed upon, testing the tools will be required, perhaps using a staging area.
8. User applications : For the data warehouse to be useful there must be end-user applications. This step involves designing and implementing the applications required by the end users.
9. Roll-out the warehouse and applications : Once the data warehouse has been populated and the end-user applications are tested, the warehouse system and the applications may be rolled out for the user community to use.

Guidelines for data warehouse implementation :
1. Build incrementally : Data warehouses must be built incrementally. It is recommended that a data mart first be built with one particular project in mind, and then the data warehouse can be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.
2. Need a champion : A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project.
3. Senior management support : A data warehouse project must be fully supported by the senior management. Given the resource-intensive nature of such projects and the time they take to implement, a warehouse project calls for a sustained commitment from senior management.
4. Ensure quality : Only data that has been cleaned should be loaded in the data warehouse.
5. Business plan : The financial costs (hardware, software and peopleware), expected benefits and a project plan (including an ETL plan) for a data warehouse project must be clearly outlined and understood by all stakeholders.
6. Training : A data warehouse project must not overlook data warehouse training requirements.
7. Adaptability : The project should build in adaptability so that changes may be made to the data warehouse when required. Like any system, a data warehouse will need to change as the needs of an enterprise change.
8. The project must be managed by both IT and business professionals in the enterprise.

PART-3
Hardware and Operating Systems for Data Warehousing.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.5. Explain hardware and operating systems used in data warehouse.

Answer
Hardware and operating system refer to the server platforms and operating systems that serve as the computing environment of the data warehouse.
Hardware and operating systems used in data warehouse are:
1. Parallel hardware technology :
a. Symmetric multiprocessing (SMP) systems : These systems consist of from a pair up to about 64 processors that share a common memory and operating system. Because the resources are shared, they can be managed more effectively. SMP systems make use of high-speed interconnections.

[Figure: SMP architecture, several CPUs sharing a common memory.]

b. Massively parallel processor (MPP) systems : An MPP system uses a large number of processors which communicate through a message interface. Each node has its own CPU, memory and disk subsystem.

Fig. 2.5.2. MPP architecture.

2. Clustered systems : These systems are configured with multiported disk arrays so that nodes which have direct disk access enjoy the same disk I/O rates as standalone SMP systems. Nodes which do not have direct disk access must use the high-speed cluster interconnect mechanism.

Fig. 2.5.3. Cluster of four SMP systems.
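The message-passing style of an MPP system described in Que 2.5 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names node_worker and mpp_total, the sales list and the round-robin split are hypothetical, and Python threads stand in for separate nodes (real threads share one address space, so the shared-nothing rule is kept here only by convention). Each simulated node works on its own partition and sends a partial result to a coordinator as a message.

```python
import queue
import threading


def node_worker(node_id, local_rows, outbox):
    """One simulated MPP node: reads only its own rows, replies by message."""
    partial = sum(local_rows)        # work is done on node-local data only
    outbox.put((node_id, partial))   # the message sent to the coordinator


def mpp_total(partitions):
    """Coordinator: run one worker per partition and combine the replies."""
    outbox = queue.Queue()
    workers = [
        threading.Thread(target=node_worker, args=(i, rows, outbox))
        for i, rows in enumerate(partitions)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Exactly one message per node is waiting in the queue at this point.
    partials = dict(outbox.get() for _ in partitions)
    return sum(partials.values()), partials


# Hypothetical fact rows (sales amounts), split round-robin over 4 nodes.
sales = list(range(1, 101))
partitions = [sales[i::4] for i in range(4)]
total, partials = mpp_total(partitions)
print(total, partials)
```

In an SMP system the same sum could be computed by several processors reading one shared list directly; in the MPP sketch the only thing that crosses node boundaries is the small partial result, which is why such systems can grow to a large number of processors.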
TERRE teaton extern for hrdare slstion ‘SMP Nodes ‘The following selections are recommended for hardware selection: Delivery lead time Reference sites “Availabilty of support * PART- Sie Me ind Data Warehousing, int Server Computing Mote ora oes tems, eon ee 1a Warehouse & Process Technology, 2eDicsirs) sapere mt soca eet ————~"goNGEPT OUTLINE Pree _BoNcEPT | Spice era cn eh ean ee r Seger eer etetiniet ciated ae ‘ De Pete gar teaiat prep cesta ey lel processing ism vo fragments to speed up the execution of Programs File services Desktop | Busnes pe ‘GavRA | Explain client/server architecture. ‘cient | peat ‘Aaewor Advantages of twortier architecture 1 Chienvserver architecture is a network architecture in which each 1, Interoperability Computer on the network is either a client oF a server. areenaatonn 12 Chentserver architecture works when the client sends a request to the att pases cre erer the network connection, which is then processed and a delivered othe client 4. Transparency Components of client/server architecture + 5. Security 1. Client :It is a computer which processes the request service from the Disadvantages of twotier architecture : a 1. Networktrafeis handle les efficient. 2 Server: Any computer can provide services tothe client. Fhe tent and gerveraze tightly coupled Comanmioton mllerare:Acmpeter hrongh wich len tot 2 ‘Three er architecture’ Inbethreeier areitectreamidleware ; ae ress te cient environment andthe database management int |—raquear > Bern 5s weed vironment. It i used in large environment. “application Advantages of clientserver architecture: omer Dieavaninos of elletinerverarchtecare ‘Advantages of three-tier architecture ‘Single point of feiure 1 Improve performance 2, Improve lesbilty 2 Costly tomsintain DB server BBBRT] Wt ar he tes of tonsnorver architecture? RRR Deere anata memory architecture all nthe system are directly Indentet men ae cor ee nother processor's memory. 
‘Two types of distributed memory architecture are: 1 Shared nothing architecture ‘a. Shared nothing architecture is used in tmhich each node have their own memory, inpavloutput interfaces. bb. Bach node do not shares any resources with other nodes and communicate with each other by passing messages. distributing computing in ‘storage and independent 2 Shared disk architecture : a Ashared disk architecture i 7 sa distributed computing architecture in whieh all disks are ted computing architect Data Warehousing & Data Mining 2uDCsIT6) b. Multiple processors can access all disks directly via intercommunication network and every processor has local memory. Global Shard Dink Subsystem rata aay OAS 7 en 1. Inaclusteraystem, every processor unit (PU) executes a copy of operating ‘item andthe inter PU conmanieations are performed over an opet- systeme-based interconnection. 2. Custer aystem is designed for high availability by providing shared acces to disks. 2. Cluster system de ‘avery high-speed: hundreds of PUs. scribes many characteristies of MPP system, including rsalable interconnection mechanism and support for 3 away] Fewer] PEway) sup] [sme] [SMP dst sep 668 608 Fig. 2.10.1 Distributed memory cluster PARTS: caused DBMS Implementations, Warehousing Distr and WoreKouse Schema Design. -—GeNGEPT OUTLINE | [+ Sebemain stoi description of the entire database 1. Connectivity tools: system in heterogeneous environment, For example: i. IBM: Data joiner ii Oracle Transparent gateway i, SAS:SAS/connect Sybase : Enterprise connect 2 Extraction tools: There are two pri re are two primary methods to use extracto® ‘os .,bulkextraction and change-based replication, For example: i. 
Apertus Carleton : Passport
   ii. Platinum : InfoPump
3. Transformation tools : These tools have the following features :
   i. Field splitting and consolidation
   ii. Standardization
   For example :
   i. DataFlux : Data Quality Workbench
   ii. Prism : Quality Manager
   iii. Pine Cone Systems : Content Tracker
5. Data loaders : They load data into the data warehouse.
6. Data access and retrieval tools : These tools are classified into two categories :
   i. OLAP tools : These allow users to make ad hoc queries or generate canned queries against the warehouse database.
   ii. Reporting tools : These allow users to produce canned and sophisticated reports based on warehouse data.
7. Data modeling tools : These tools allow users to prepare and maintain an information model of both the source and target databases.
   For example :
   i. Cayenne Software : Terrain
   ii. Relational Matters : Syntagma Designer
   iii. Sybase : PowerDesigner WarehouseArchitect
8. Warehouse management tools : These tools assist the warehouse administrator in the day-to-day management and administration of the warehouse.
   For example :
   i. Pine Cone Systems : Usage Tracker, Refreshment Tracker
   ii. Red Brick Systems : Enterprise Control and Coordination

Que 2.12. Discuss various warehouse schema design techniques.
Answer
Various warehouse schema design techniques are :
1. OLTP systems use normalized data structures.
2. Dimensional modeling for decisional systems : It provides a number of techniques for denormalizing the database to create schemas.
3. Star schema : Refer Q. 1.10, Page 1-14D, Unit-1.
4. Dimensional hierarchies : Each dimension will have hierarchies that imply grouping and structure.
5. Granularity of the fact table : The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps :
   a. Determine which dimensions will be included.
   b. Determine where along the hierarchy of each dimension the information will be kept.
6. Aggregates or summaries : Aggregates are the summarization of fact-related data for the purpose of improved performance. Aggregates are to be considered for use when the number of detailed records to be processed is large and/or the processing of customer queries begins to impact performance.
7. Dimensional attributes : The attribute values are used to establish the context of the facts.
8. Multiple star schemas : A data warehouse will have multiple star schemas, i.e., many fact tables.

3
Data Mining

CONTENTS
Part-1 : Overview, Motivation, Definition and Functionalities
Part-2 : Data Processing, Form of Data Pre-Processing
Part-3 : Data Cleaning : Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data
Part-4 : Data Reduction : Data Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity Reduction, Discretization and Concept Hierarchy Generation and Decision Tree

PART-1
Overview, Motivation, Definition and Functionalities.

CONCEPT OUTLINE
• Data mining is a process used by organizations to turn raw data into useful information.
• Functionalities of data mining include characterization, discrimination and outlier analysis.

Que 3.1. Explain data, information and knowledge. [AKTU 2014-15, Marks 05]
Answer
Data : Data are raw facts and figures that can be processed or stored by a computer. For example, text, numbers, symbols, etc.
Information : Information is data that has been processed into a form that gives it meaning. For example, analysis of retail point-of-sale data can provide information on which products are selling.
Knowledge : Knowledge is the understanding of rules needed to interpret information. For example, information on retail market sales can be analyzed together with promotional efforts to yield knowledge of customer behaviour.
(Figure : data is processed into information, which is built up into knowledge.)

Que 3.2. What is data mining ?
Define the major issues in data mining. [AKTU, Marks 05]
OR
Describe challenges to data mining regarding data mining methodology and user interaction issues.
Answer
Data mining : Data mining is defined as a process used to extract usable data from a larger set of any raw data.
Key features of data mining :
1. Automatic pattern predictions based on trend and behaviour analysis.
2. Prediction based on likely outcomes.
3. Creation of decision-oriented information.
4. Focus on large datasets and databases for analysis.
5. Clustering based on groups of facts not previously known.
Major issues in data mining :
1. Mining methodology and user interaction issues :
   a. Mining different kinds of knowledge in databases : Different users may be interested in different kinds of knowledge.
   b. Interactive mining of knowledge at multiple levels of abstraction : It allows users to focus the search for patterns from different angles.
   c. Incorporation of background knowledge : Background knowledge is used to guide the discovery process and to express the discovered patterns.
   d. Data mining query languages and ad hoc data mining : A data mining query language should be integrated with the data warehouse query language.
   e. Presentation and visualization of data mining results : Once the patterns are discovered, they need to be expressed in high-level languages.
   f. Handling noisy or incomplete data : Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities.
   g. Pattern evaluation : The patterns discovered should be interesting; a pattern is uninteresting if it merely represents common knowledge or lacks novelty.
2. Performance issues :
   a. Efficiency and scalability of data mining algorithms : To effectively extract information from a huge amount of data in databases, a data mining algorithm must be efficient and scalable.
   b. Parallel, distributed and incremental mining algorithms : Factors such as the huge size of databases, the wide distribution of data and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms.
3. Diverse data types issues.

Que 3.3. Explain the steps of the knowledge discovery process. [AKTU 2017-18, Marks 10]
Answer
1. Data cleaning :
   a. Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
   b. It includes :
      i. Cleaning in case of missing values.
      ii. Cleaning noisy data, where noise is a random or variance error.
      iii. Cleaning with data discrepancy detection and data transformation tools.
2. Data integration :
   a. Data integration is defined as heterogeneous data from multiple sources combined in a common source (data warehouse).
   b. It includes :
      i. Data integration using data migration tools.
      ii. Data integration using data synchronization tools.
      iii. Data integration using the ETL (Extract-Transform-Load) process.
3. Data selection :
   a. Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.
   b. It includes :
      i. Data selection using neural networks.
      ii. Data selection using decision trees.
      iii. Data selection using Naive Bayes.
      iv. Data selection using clustering, regression, etc.
4. Data transformation :
   a. In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
   b. Data transformation is a two-step process :
      i. Data mapping : Assigning elements from the source base to the destination to capture transformations.
      ii. Code generation : Creation of the actual transformation program.
5. Data mining :
   a. Data mining is defined as the application of clever techniques to extract potentially useful patterns.
   b. It includes :
      i. Transforming task-relevant data into patterns.
      ii. Deciding the purpose of the model using classification or characterization.
6. Pattern evaluation : Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given measures.
7.
Knowledge representation : Knowledge representation is defined as a technique which utilizes visualization tools to represent data mining results.

Que 3.4. How data mining systems are classified ? Describe each classification with example. [AKTU 2016-17, Marks 10]
Answer
Data mining systems can be classified according to the following :
1. Database technology
2. Statistics
3. Machine learning
Data mining systems can also be classified as :
a. Classification based on the databases mined : Database systems can be classified according to different criteria such as data models, types of data, etc. For example, if we classify databases according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
b. Classification based on the kind of knowledge mined : The data mining system is classified on the basis of functionalities such as characterization, discrimination, association analysis, classification, prediction, outlier analysis and evolution analysis. A comprehensive data mining system usually provides multiple integrated data mining functionalities.
c. Classification based on the techniques utilized : We can classify a data mining system according to the kind of techniques used, such as the degree of user interaction (autonomous systems, interactive exploratory systems, query-driven systems) or the methods of analysis employed (machine learning, statistics, visualization, pattern recognition, neural networks).
d. Classification based on the applications adapted : We can classify a data mining system according to the applications adapted. The applications include finance, telecommunications, DNA and stock markets.

Que 3.5. Explain data mining functionalities.
Answer
Following are the data mining functionalities :
1.
Data characterization : It is a summarization of the general characteristics or features of a target class of data.
2. Data discrimination : It refers to the mapping or classification of a class with some predefined group or class.
3. Association analysis : It analyses the set of items that frequently appear together in a transactional dataset.
4. Classification : In classification, data are grouped into predefined classes.
5. Prediction : It refers to predicting some unavailable data values rather than class labels.
6. Cluster analysis : Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label.
7. Outlier analysis : Outliers are data elements that cannot be grouped in a given class or cluster.
8. Evolution analysis : Evolution analysis refers to describing and modeling regularities or trends for objects whose behaviour changes over time.

Que 3.6. Describe the difference between the following approaches for the integration of data mining system with database or data warehouse systems : no coupling, loose coupling and semi-tight coupling. [AKTU 2015-16, Marks 7.5]
Answer
If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the no-coupling scheme.
Various integration schemes are as follows :
a. No coupling : In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms.
b. Loose coupling : In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository and performs data mining on that data.
c. Semi-tight coupling : In this scheme, the data mining system is linked with a database or a data warehouse system, and efficient implementations of a few data mining primitives can be provided in the database.
d. Tight coupling : In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.

PART-2
Data Processing, Form of Data Pre-processing.

CONCEPT OUTLINE
• Data processing is the conversion of data into a usable and desired form.

Que 3.7. What are the different forms of data processing ? [AKTU 2014-15, Marks 06]
Answer
Different forms of data processing are :
1. Data cleaning : Data cleaning is a process to remove the noisy data, clean the data by filling in the missing values and correct the inconsistencies in data.
2. Data integration : Data integration is a technique that combines the data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
3. Data transformation : In this step, data is transformed or consolidated. The operations include :
   c. Generalization : In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
   d. Normalization : Normalization scales attribute data so as to fall within a small specified range, such as 0.0 to 1.0. It is of two types :
      i. Min-max normalization : It is a technique that scales the data to lie between 0 and 1.
      ii. z-score normalization : It transforms the data by converting the values to a common scale with an average of zero and a standard deviation of one.
   e. Attribute/feature construction : New attributes are constructed from the given ones.
4. Data reduction : Data reduction is used to obtain a reduced representation of the data that is smaller in volume while maintaining the integrity of the original data.

Que 3.8. Data consolidation is data modeling activity.
This statement is true or not ? Justify. [AKTU 2013-14, Marks 05]
Answer
1. The statement is true, as data consolidation means transforming data into forms that are appropriate for mining by performing certain operations. The data which we obtain from different data sources is not in a suitable form to be stored in data warehouses or for performing data mining operations. So, data is modeled for further activities after performing data consolidation.
2. Data consolidation involves the following operations : Refer Q. 3.7, Page 3-8D, Unit-3.

PART-3
Data Cleaning : Missing Values, Noisy Data (Binning, Clustering, Regression, Computer and Human Inspection), Inconsistent Data.

Que 3.9. How to handle noisy data ?
Answer
Noise is a random error or variance in a measured variable. Following are the data smoothing techniques :
1. Binning : It is a technique in which we first sort the data and then partition the data into equal-frequency bins.
   For example, Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
   a. Partition into (equal-frequency) bins :
      Bin a : 4, 8, 15
      Bin b : 21, 21, 24
      Bin c : 25, 28, 34
   b. Smoothing by bin means : Each value in a bin is replaced by the mean value of the bin.
      Bin a : 9, 9, 9
      Bin b : 22, 22, 22
      Bin c : 29, 29, 29
   c. Smoothing by bin boundaries : Each bin value is replaced by the closest boundary value.
      Bin a : 4, 4, 15
      Bin b : 21, 21, 24
      Bin c : 25, 25, 34
2. Regression :
   a. Data can be smoothed by fitting the data to a regression function. Linear regression and multiple linear regression are types of regression.
   b. A regression task begins with a dataset in which the target values are known.
   c. For example, a regression model could be used to predict the value of a house based on location, number of rooms, lot size, and other factors.
3. Clustering :
   a. Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
   b. Values that fall outside of the set of clusters may be considered outliers.
   c. For example, clustering analysis can be used in areas such as market research, pattern recognition, data analysis, and image processing.
4. Combined computer and human inspection : The outliers can also be identified with the help of computer and human inspection. The outlier patterns can be informative or garbage. Humans can sort out the garbage patterns.

Que 3.10. Elaborate the different strategies for data cleaning. [AKTU 2017-18, Marks 10]
Answer
Data is cleaned through processes such as data migration, data scrubbing and data auditing.
1. Data migration :
   a. During data migration, transformation rules are specified (for example, replacing sex by gender) to clean the data.
   b. Transcription errors, incomplete information, and lack of standard formats are also addressed during data migration.
2. Data scrubbing :
   a. It involves detecting and removing errors and inconsistencies from data in order to improve the quality of data.
   b. Data scrubbing involves a complex cleaning and mapping process that is the most labor-intensive part of building a data warehouse.
   c. During the cleaning process, desired information is filtered out and its quality is maintained for the target system.
3. Data auditing :
   a. Data auditing tools make it possible to discover rules and relationships, or to signal violation of stated rules, by scanning data.
   b. It enhances the system's reliability and makes it possible to prevent, detect, and eliminate data errors, irregularities, and fraud.

Que 3.11. List the ways to handle the missing values. What do you mean by inconsistent data ?
Answer
Ways to handle missing values are :
1. Ignore the tuple : This is usually done when the class label is missing.
2. Fill in the missing value manually : This may not be feasible.
3. Use a global constant to fill in the missing attribute value.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class.
6. Use the most probable value to fill in the missing value : This may be determined with regression or decision tree induction.
Inconsistent data : Data inconsistency occurs when similar data is kept in different formats in two different files, or when matching of data must be done between files.

Que 3.12. Explain Chi-square method. Show using Chi-square test that gender and preferred reading are independent or not from the given data (given are the observed counts). [AKTU 2015-16, Marks 15]

                Male    Female   Total
  Fiction        250       200     450
  Non-Fiction     50      1000    1050
  Total          300      1200    1500

Answer
1. A correlation relationship between two categorical (discrete) attributes, A and B, can be discovered by a \(\chi^2\) (chi-square) test.
2. The \(\chi^2\) value (also known as the Pearson \(\chi^2\) statistic) is computed as
   \[ \chi^2 = \sum_{i} \sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]
   where \(o_{ij}\) is the observed frequency (i.e., actual count) of the joint event \((A_i, B_j)\) and \(e_{ij}\) is the expected frequency of \((A_i, B_j)\), which can be computed as
   \[ e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N} \]
   where \(N\) is the number of data tuples.
Numerical :
1. Suppose that a group of 1,500 people was surveyed. The gender of each person was noted. Each person was polled as to whether their preferred type of reading material was fiction or non-fiction. Thus, we have two attributes, gender and preferred reading.
2. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown, where the numbers in parentheses are the expected frequencies.

                Male        Female        Total
  Fiction       250 (90)    200 (360)       450
  Non-Fiction    50 (210)   1000 (840)     1050
  Total         300         1200           1500

3. The expected frequency for the cell (male, fiction) is
   \[ e_{11} = \frac{\mathrm{count}(\text{male}) \times \mathrm{count}(\text{fiction})}{N} = \frac{300 \times 450}{1500} = 90 \]
   and so on.
4. Using the equation for \(\chi^2\) computation, we get
   \[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93 \]
5. For this 2 × 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the \(\chi^2\) value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.

Answer
Methods for attribute subset selection are :
1. Stepwise forward selection : In this method, the best of the original attributes is determined and added to the reduced set.
   For example :
   Initial attribute set : {A1, A2, A3, A4, A5}
   Initial reduced set : {} => {A1} => {A1, A4}
   Reduced attribute set : {A2, A3, A5}
2. Stepwise backward elimination : It removes the worst attribute remaining in the set.
   For example :
   Initial attribute set : {A1, A2, A3, A4, A5}
   => {A1, A3, A4, A5} => {A1, A4, A5}
   Reduced attribute set : {A1, A5}
3. Combination of forward selection and backward elimination : This procedure selects the best attribute and removes the worst from the remaining attributes.
   For example :
   Initial attribute set : {A1, A2, A3, A4, A5}
   Reduced attribute set in stepwise forward selection : {A2, A3, A5}
   Reduced attribute set in stepwise backward elimination : {A1, A5}
   Reduced attribute set : {A1, A2, A3, A5}
4. Decision tree induction : It constructs a flowchart-like structure where the best attribute is chosen to partition the data into individual classes.
   For example :
   Initial attribute set : {A1, A2, A3, A4, A5, A6}
   Fig. 3.14.1. Decision tree.
   Reduced attribute set : {A1, A4, A6}

Que 3.15. Write a short note on dimensionality reduction.
Answer
1. In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.
2. There are two components of dimensionality reduction :
   a. Feature selection : Feature selection is a process of removing features that are not relevant or are redundant.
   b. Feature extraction : Feature extraction is a process of transforming raw data into features suitable for modeling.
3. The various methods used for dimensionality reduction include :
   a. Wavelet transform : It is a linear signal processing technique that transforms a data vector into a numerically different vector of wavelet coefficients.
   b. Principal Component Analysis (PCA) : In this, the data in a higher-dimensional space is mapped to data in a lower-dimensional space. It involves the following steps :
      i. Construct the covariance matrix of the data.
      ii. Compute the eigenvectors of this matrix.
      iii. Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.

Que 3.16. Discuss numerosity reduction in detail.
Answer
In numerosity reduction, data volume can be reduced by choosing alternative forms of data representation. The various methods used for numerosity reduction include :
a. Regression and log-linear models : These models are used to approximate the given data.
b. Histograms : Histograms use binning to approximate data distributions. They divide data into buckets and store the average (sum) for each bucket.
c. Clustering : Partition the data set into clusters based on similarity and store the cluster representation only.
d. Sampling : It allows a large data set to be represented by a much smaller random sample of the data.

Que 3.17. Distinguish between dimensionality reduction and numerosity reduction.
Answer

S. No. | Dimensionality reduction | Numerosity reduction
1. | In dimensionality reduction, data encoding or transformations are applied to obtain a reduced or compressed representation of the original data. | In numerosity reduction, data volume is reduced by choosing alternative, smaller forms of data representation.
2. | Methods for dimensionality reduction are : a. Wavelet transforms b. Principal Component Analysis (PCA) | Methods for numerosity reduction are : a. Regression and log-linear model (parametric) b. Histograms, clustering, sampling (non-parametric)
3. | It can be used for removing irrelevant and redundant attributes. | It is merely a technique for representing the original data in a smaller form.

Que 3.18. Write short note on concept hierarchy generation for numeric data.
Answer
Concept hierarchy generation methods for numeric data are :
1. Binning : It is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique.
2. Histogram analysis : Histograms partition the values for an attribute into disjoint ranges called buckets. Histogram analysis is an unsupervised discretization technique.
3. Cluster analysis : It is used to partition the data into clusters or groups.

Answer
1. Categorical data are discrete data.
2. Categorical attributes have a finite number of distinct values, with no ordering among the values.
3. There are several methods for generation of concept hierarchies for categorical data :
   a. Specification of a partial ordering of attributes explicitly at the schema level by experts : Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
   b. Specification of a portion of a hierarchy by explicit data grouping : In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify explicit groupings for a small portion of the intermediate-level data.
   c. Specification of a set of attributes but not their partial ordering : A user may specify a set of attributes forming a concept hierarchy, but omit to specify their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
   d. Specification of only a partial set of attributes : To handle partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together.

Que 3.20. What do you mean by data mining ? Differentiate between data mining technique and data mining strategy. [AKTU 2013-14, Marks 05]
Answer
Data mining : Refer Q. 3.2, Page 3-2D, Unit-3.

4
Classification and Clustering

CONTENTS
Part-1 : Classification : Definition, Data Generalization, Analytical Characterization, Attribute Relevance
...
Basic Algorithms, Parallel and Distributed Algorithms, Neural Network Approach

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Define the terms data generalization and analytical characterization with example.
Answer
Data generalization :
1. Data generalization summarizes data by replacing relatively low-level values with higher-level concepts.
2. Data generalization approaches include the data cube approach and the attribute-oriented induction approach.
3. Data generalization is a form of descriptive data mining.
4. For example, let us consider the database of XYZ electronics. Instead of examining individual customer transactions, a sales manager may prefer to view the data generalized to higher levels, such as summarized by customer groups according to region, income, etc.
Analytical characterization :
1. Analytical characterization performs attribute and dimension relevance analysis in order to filter out irrelevant or weakly relevant attributes.
2. It is performed to overcome the various limitations of class characterization.
3.
For example, an employee's birth_date, birth_month and birth_year are not relevant to the employee's salary, but experience is highly relevant to the salary of the employee.

Que 4.2. Explain data cube approach and attribute oriented induction approach. [AKTU 2014-15, Marks 05]
OR
Discuss basic approaches of data generalization.
Answer
There are two basic approaches of data generalization :
1. Data cube approach :
   a. It is also known as the OLAP approach.
   b. In this approach, computation and results are stored in the data cube.
   c. It is an efficient approach as it is helpful to make the past selling graph.
   d. It uses roll-up and drill-down operations on a data cube.
2. Attribute oriented induction :
   a. It is an online data analysis, query-oriented and generalization-based approach.
   b. In this approach, we perform generalization on the basis of the different values of each attribute within the relevant data set. After that, identical tuples are merged and their respective counts are accumulated in order to perform aggregation.
   c. The attribute oriented induction approach uses two methods :
      i. Attribute removal
      ii. Attribute generalization

PART-2
Mining Class Comparisons.

Answer
In many applications, users may not be interested in having a single class description, but need to compare two or more classes.
Steps of class comparisons are :
1. Data collection
2. Dimension relevance analysis
3. Synchronous generalization
4. Presentation of the derived comparison

Answer
1. Statistics is a component of data mining that provides the tools for dealing with data.
2. It is the science of learning from data and includes everything from collecting and organizing to analyzing and presenting data.
3. Statistics focuses on probabilistic models, specifically inference.
4. Statistics is used in data mining to manage the data and its analysis, and to automate data analysis.
5. Main areas where the statistical approach is used in data mining include :
   • Smoothing data
   • Sampling
   • Data analysis

Que 4.5. Explain various measures of central tendency.
Answer
Measures of central tendency are :
1. Mean :
   a. It is the center of the data set.
   b. Let the data set X have n values \(x_1, x_2, \ldots, x_n\). Then
      \[ \text{Mean} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
2. Median :
   a. It is the middle value of the ordered set if the number of values n is an odd number, or the average of the middle two values if n is an even number.
   b. For ordered values \(x_1 \le x_2 \le \ldots \le x_n\),
      \[ \text{Median} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases} \]
3. Mode : It is the most frequently occurring value in a data set. For moderately skewed data, Mode = 3 × Median - 2 × Mean.
4. Midrange : It is the average of the largest and smallest values of the data set.

PART-3
Statistical Measures in Large Database, Statistical-Based Algorithms, Distance-Based Algorithms.

CONCEPT OUTLINE
• Descriptive statistics used in statistical measures are :
  1. Measuring the central tendency
  2. Measuring the dispersion of data
• Distance-based algorithms are :
  1. Simple approach
  2. k-nearest neighbours

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.6. Discuss various measures of dispersion of data.
Answer
Measures of dispersion of data are :
1. Range : The range of the data set is the difference between the highest value (H) and the lowest value (L).
   Range = H - L
2. Quartiles : The first quartile, denoted by Q1, is the 25th percentile. The third quartile, denoted by Q3, is the 75th percentile. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the Interquartile Range (IQR), defined as :
   IQR = Q3 - Q1
3. Outliers : Outliers are the values lying more than 1.5 × IQR above the third quartile or below the first quartile.
4. Boxplot : Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows :
   a. Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR.
   b. The median is marked by a line within the box.
c. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
5. Standard deviation and variance: The standard deviation of a data set gives a measure of how each value in the data set varies from the mean. The standard deviation of n observations $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$ is
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
The basic properties of the standard deviation are:
a. $\sigma$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
b. $\sigma = 0$ only when there is no spread, that is, when all observations have the same value. Otherwise $\sigma > 0$.
c. The variance is the mean of the squared deviations about the mean, denoted by $\sigma^2$. The variance of n observations $x_1, x_2, \ldots, x_n$ is given by:
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Que. Draw a box-and-whisker plot for the following data set:
126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158.
Also find the outliers. [AKTU 2015-16, Marks 10]

Answer
Given: 126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158.
Since there are 25 data points, the median Q2 is the 13th value: Q2 = 146.
The first half has twelve values, so its median is the average of the middle two:
Q1 = (141 + 142)/2 = 141.5
The median of the second half is:
Q3 = (150 + 150)/2 = 150
IQR = Q3 - Q1 = 150 - 141.5 = 8.5, so 1.5 × IQR = 12.75.
Upper limit = 150 + 12.75 = 162.75, so there are no outliers at the upper end.
Lower limit = 141.5 - 12.75 = 128.75, so at the lower end an outlier is any data point less than 128.75. The only such point is 126, so 126 is an outlier.

Que. Given the following set of values {1, 3, 9, 15, 20}, determine the jackknife estimate for both the mean and the standard deviation of the mean. [AKTU, Marks 05]

Answer
Given: n = 5, $x_1 = 1$, $x_2 = 3$, $x_3 = 9$, $x_4 = 15$, $x_5 = 20$.
Mean of the full data set:
$\bar{x} = \frac{1 + 3 + 9 + 15 + 20}{5} = \frac{48}{5} = 9.6$
Mean after leaving out one value at a time:
By ignoring $x_1$: (3 + 9 + 15 + 20)/4 = 11.75
By ignoring $x_2$: (1 + 9 + 15 + 20)/4 = 11.25
By ignoring $x_3$: (1 + 3 + 15 + 20)/4 = 9.75
By ignoring $x_4$: (1 + 3 + 9 + 20)/4 = 8.25
By ignoring $x_5$: (1 + 3 + 9 + 15)/4 = 7.0
Jackknife estimate for the mean is given by:
(11.75 + 11.25 + 9.75 + 8.25 + 7.0)/5 = 48/5 = 9.6
Standard deviation of the full data set:
$\sigma = \sqrt{\frac{1}{5}\left[(1-9.6)^2 + (3-9.6)^2 + (9-9.6)^2 + (15-9.6)^2 + (20-9.6)^2\right]} = \sqrt{51.04} = 7.144$
Standard deviation after leaving out one value at a time (deviations are taken about the full mean 9.6):
By ignoring $x_1$: $\sigma_1 = \sqrt{\frac{1}{4}\left[(3-9.6)^2 + (9-9.6)^2 + (15-9.6)^2 + (20-9.6)^2\right]} = \sqrt{45.31} = 6.73$
By ignoring $x_2$: $\sigma_2 = \sqrt{\frac{1}{4}\left[(1-9.6)^2 + (9-9.6)^2 + (15-9.6)^2 + (20-9.6)^2\right]} = \sqrt{52.91} = 7.27$
By ignoring $x_3$: $\sigma_3 = \sqrt{\frac{1}{4}\left[(1-9.6)^2 + (3-9.6)^2 + (15-9.6)^2 + (20-9.6)^2\right]} = \sqrt{63.71} = 7.98$
By ignoring $x_4$: $\sigma_4 = \sqrt{\frac{1}{4}\left[(1-9.6)^2 + (3-9.6)^2 + (9-9.6)^2 + (20-9.6)^2\right]} = \sqrt{56.51} = 7.52$
By ignoring $x_5$: $\sigma_5 = \sqrt{\frac{1}{4}\left[(1-9.6)^2 + (3-9.6)^2 + (9-9.6)^2 + (15-9.6)^2\right]} = \sqrt{36.76} = 6.06$
Average of the leave-one-out standard deviations:
$\sigma_{(\cdot)} = \frac{6.73 + 7.27 + 7.98 + 7.52 + 6.06}{5} = 7.11$
Jackknife estimate for the standard deviation is given by:
$n\sigma - (n-1)\sigma_{(\cdot)} = 5(7.144) - (5-1)(7.11) = 35.72 - 28.44 = 7.28$

Que. Write short notes on:
i. Quantile plots
ii. Histograms
iii. Scatter plots
OR
Explain the various graphs for statistical class description.

Answer
Different types of graphs are:
1. Histogram: In this, we partition the data distribution of an attribute into disjoint subsets, where the width of each subset should be uniform. Each subset is represented by a rectangle whose height is equal to the count of the values in the subset.
2. Scatter plots: This graphical method is used for determining the existence of any relationship or pattern between two numerical attributes. In this method, every pair of values is considered as a pair of coordinates in an algebraic sense and plotted as a point in the plane.
3. Quantile plots: A quantile plot is a simple and effective way to look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. The mechanism used in this step is slightly different from the percentile computation.
4. Q-Q (Quantile-Quantile) plot: A quantile-quantile plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
It is a powerful visualization tool that allows the user to view whether there is a shift in going from one distribution to another.

Que. Write a short note on Bayesian classification. [AKTU]

Answer
1. Bayesian classifiers are statistical classifiers.
2. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
3. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
4. Bayesian classification is based on Bayes' theorem.
Bayes' theorem: The purpose of Bayes' theorem here is to predict the class label for a given tuple. Let X be a data tuple. In Bayesian terms, X is considered "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. There are two types of probabilities:
1. Posterior probability $P(H|X)$
2. Prior probability $P(H)$
where X is the data tuple and H is some hypothesis. According to Bayes' theorem,
$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$

Que. Write a short note on Naive Bayes classifier.

Answer
1. A Naive Bayes classifier uses probability theory to classify data. Naive Bayes is also known as simple Bayes or independence Bayes.
2. Naive Bayes is a kind of classifier which uses Bayes' theorem.
3. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class.
4. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP).
5. The Naive Bayes classifier assumes that all the features are unrelated to each other.
6. For example, a fruit may be considered to be an apple if it is red and round. Even if these features depend on each other or on the existence of other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that the fruit is an apple.

Que. Classify the tuple X = (Colour = RED, Type = SUV, Origin = DOMESTIC) using the Naive Bayes classifier. The training data is given in the following table, where the class label is STOLEN.
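The MAP rule described in the Naive Bayes note can be sketched in a few lines of Python. This is only an illustrative sketch: the six-row training set, the attribute names and the helper names (`train_naive_bayes`, `class_scores`, `predict`) are assumptions made for the example and are not the table used in the question. The code counts class frequencies and per-attribute frequencies, then compares $P(X|C)P(C)$ across classes.

```python
from collections import Counter, defaultdict

# Hypothetical training set (NOT the table from the question):
# each row is (colour, type); each label says whether the car was stolen.
ROWS = [
    ("red", "sports"),
    ("red", "sports"),
    ("red", "suv"),
    ("yellow", "suv"),
    ("yellow", "sports"),
    ("yellow", "suv"),
]
LABELS = ["yes", "yes", "no", "no", "yes", "no"]


def train_naive_bayes(rows, labels):
    """Count class frequencies and attribute-value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def class_scores(model, x):
    """Return P(x | C) * P(C) for each class C, treating attributes as independent."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(C)
        for i, value in enumerate(x):
            # conditional P(x_i | C), estimated by relative frequency
            score *= value_counts[(i, label)][value] / n_label
        scores[label] = score
    return scores


def predict(model, x):
    """Pick the class with the largest P(x | C) * P(C) (the MAP class)."""
    scores = class_scores(model, x)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train_naive_bayes(ROWS, LABELS)
    print(class_scores(model, ("red", "suv")))
    print(predict(model, ("red", "suv")))
```

In this toy data no stolen car is an SUV, so the score for "yes" collapses to zero. That is the situation the Laplacian correction is meant to avoid; the sketch deliberately leaves smoothing out to keep the plain counting visible.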
Colour | Type | Origin | Stolen
Red | Sports | Domestic | Yes
Red | Sports | Domestic | No
Red | Sports | Imported | Yes
[Remaining rows of the training table and the table of per-class attribute counts are not legible in the source.]

Answer
Therefore, the prediction is No.

Que. Using the training data in Table 1, classify the tuple X = (age = senior, income = medium, student = no, credit_rating = fair) with a Naive Bayesian classifier. Also explain the Laplacian correction. [AKTU, Marks 15]

Table 1.
Age | Income | Student | Credit rating | Class: buys_computer
youth | high | No | Fair | No
youth | high | No | Excellent | No
middle aged | high | No | Fair | Yes
senior | medium | No | Fair | Yes
senior | low | Yes | Fair | Yes
senior | low | Yes | Excellent | No
middle aged | low | Yes | Excellent | Yes
youth | medium | No | Fair | No
youth | low | Yes | Fair | Yes
senior | medium | Yes | Fair | Yes
youth | medium | Yes | Excellent | Yes
middle aged | medium | No | Excellent | Yes
middle aged | high | Yes | Fair | Yes
senior | medium | No | Excellent | No

Answer
Laplacian correction: The Laplacian correction is a technique used for avoiding zero probability values. If the training set is large, adding one to each count makes a negligible difference in the estimated probabilities while avoiding zero probability values. The counts that are added must also be added to the denominator used in the probability calculation.

Classification of X = (age = senior, income = medium, student = no, credit_rating = fair):
The prior probability of each class, $P(C_i)$, can be estimated from the training tuples:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
To compute $P(X|C_i)$, for i = 1, 2, we compute the following conditional probabilities:
P(age = senior | buys_computer = yes) = 3/9 = 0.333
P(age = senior | buys_computer = no) = 2/5 = 0.400
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = no | buys_computer = yes) = 3/9 = 0.333
P(student = no | buys_computer = no) = 4/5 = 0.800
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Using the above probabilities:
P(X | buys_computer = yes) = P(age = senior | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = no | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes) = 0.333 × 0.444 × 0.333 × 0.667 = 0.033
P(X | buys_computer = no) = P(age = senior | buys_computer = no) × P(income = medium | buys_computer = no) × P(student = no | buys_computer = no) × P(credit_rating = fair | buys_computer = no) = 0.400 × 0.400 × 0.800 × 0.400 = 0.051
Compute $P(X|C_i)P(C_i)$ for each class:
P(X | buys_computer = yes) × P(buys_computer = yes) = 0.033 × 0.643 = 0.021
P(X | buys_computer = no) × P(buys_computer = no) = 0.051 × 0.357 = 0.018
Since 0.021 > 0.018, the Bayesian classifier predicts buys_computer = yes for tuple X.

Que. Explain distance-based algorithms in detail.

Answer
1. Distance-based algorithms are non-parametric methods that can be used for classification.
2. These algorithms classify objects by the dissimilarity between them, as measured by distance functions.
3. There are two types of distance-based algorithms:
a. Simple approach: It assumes that each class is represented by its center or centroid. The new item is placed in the class with the largest similarity value.
b. k-nearest neighbours: The KNN scheme requires not only the training data but also the desired classification for each item in it. When a classification is to be made for a new item, its distance to each item in the training set must be determined. Only the K closest entries in the training set are considered. The new item is then placed in the class that contains the most items from this set of K closest items.

Algorithm:
Input:
    T    // Training data
    K    // Number of neighbours
    t    // Input tuple to classify
Output:
    c    // Class to which t is assigned
KNN algorithm:
    // Algorithm to classify tuple t using KNN
    N = ∅;
    // Find the set of neighbours, N, for t
    for each d ∈ T do
        if |N| < K, then
            N = N ∪ {d};
        else if ∃ u ∈ N such that sim(t, u) ≤ sim(t, d), then
            begin
                N = N - {u};
                N = N ∪ {d};
            end
    // Find the class for classification
    c = class to which the most u ∈ N are classified;
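The KNN scheme above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses Euclidean distance in place of the unspecified similarity function sim, it sorts the whole training set by distance instead of maintaining the neighbour set N incrementally, and the training points, labels and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D training data: (point, class label)
TRAINING = [
    ((1.0, 1.0), "A"),
    ((1.0, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 6.0), "B"),
    ((6.0, 7.0), "B"),
]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, t, k):
    """Assign t to the class that is most common among its k nearest training items."""
    # Sort training items by distance to t and keep the k closest ones
    nearest = sorted(training, key=lambda item: euclidean(item[0], t))[:k]
    # Majority vote over the labels of the k nearest neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify(TRAINING, (2.0, 2.0), k=3))
    print(knn_classify(TRAINING, (6.0, 5.0), k=3))
```

Sorting costs O(n log n) per query, which is fine for small training sets; for larger ones the incremental neighbour set of the pseudocode, or a spatial index, would avoid sorting every item.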
