As per the New Revised Syllabus (REV-2019 'C' Scheme) of Mumbai University, w.e.f. academic year 2022-23

Big Data Analytics (Code : CSC702) (Compulsory Subject)
Semester VII - Computer Engineering

Mahesh Mali, Sanjay Mate, Saprativa Bhattacharjee
TechKnowledge Publications
(Includes solved university question papers up to Dec. 2022)

Big Data Analytics (CSC702), Semester VII : Computer Engineering (Mumbai University)

Course Objectives
1. To provide an overview of the big data platforms, their use cases and the Hadoop ecosystem.
2. To introduce programming skills to build simple solutions using big data technologies such as MapReduce, scripting for NoSQL, and R.
3. To learn the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability.
4. To enable students to have skills that will help them to solve complex real-world problems for decision support.

Course Outcomes
1. Understand the building blocks of Big Data Analytics.
2. Apply fundamental enabling techniques like Hadoop and MapReduce in solving real-world problems.
3. Understand different NoSQL systems and how they handle big data.
4. Apply advanced techniques for emerging applications like stream analytics.
5. Achieve adequate perspectives of big data analytics in various applications like recommender systems, social media applications, etc.
6. Apply statistical computing techniques and graphics for analyzing big data.

Syllabus

Module 1 : Introduction to Big Data and Hadoop (02 hrs)
1.1 Introduction to Big Data - Big Data characteristics and Types of Big Data
1.2 Traditional vs. Big Data business approach
1.3 Case Study of Big Data Solutions
1.4 Concept of Hadoop; Core Hadoop Components; Hadoop Ecosystem

Module 2 : Hadoop HDFS and MapReduce (08 hrs)
2.1 Distributed File Systems : Physical Organization of Compute Nodes, Large-Scale File-System Organization.
2.2 MapReduce : The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping with Node Failures.
2.3 Algorithms Using MapReduce : Matrix-Vector Multiplication by MapReduce, Relational-Algebra Operations, Computing Selections by MapReduce, Computing Projections by MapReduce, Union, Intersection, and Difference by MapReduce.
2.4 Hadoop Limitations

Module 3 : NoSQL
Introduction to NoSQL, NoSQL Business Drivers; NoSQL Data Architecture Patterns : Key-value stores, Graph stores, Column family (Bigtable) stores, Document stores, Variations of NoSQL architectural patterns, NoSQL Case Study; NoSQL solution for big data, Understanding the types of big data problems; Analyzing big data with a shared-nothing architecture; Choosing distribution models : master-slave versus peer-to-peer; NoSQL systems to handle big data problems.

Module 4 : Mining Data Streams
The Stream Data Model : A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing; Sampling Data Techniques in a Stream; Filtering Streams : Bloom Filter with Analysis; Counting Distinct Elements in a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm, Combining Estimates, Space Requirements; Counting Ones in a Window : The Cost of Exact Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering in the DGIM Algorithm, Decaying Windows.
Module 5 : Real-Time Big Data Models
A Model for Recommendation Systems, Content-Based Recommendations, Collaborative Filtering; Case Study : Product Recommendation; Social Networks as Graphs, Clustering of Social-Network Graphs, Direct Discovery of Communities in a Social Graph.

Module 6 : Data Analytics with R
6.1 Exploring Basic Features of R, Exploring R GUI, Exploring RStudio, Handling Basic Expressions in R, Variables in R, Working with Vectors, Storing and Calculating Values in R, Creating and Using Objects, Interacting with Users, Handling Data in the R Workspace, Executing Scripts, Creating Plots, Accessing Help and Documentation in R.
6.2 Reading Datasets and Exporting Data from R, Manipulating and Processing Data in R, Using Functions instead of Scripts, Built-in Functions in R.
6.3 Data Visualization : Types, Applications.

Table of Contents

Chapter 1 : Introduction to Big Data
1.1 Introduction to Big Data Management; 1.2 Big Data; 1.3 Big Data Characteristics - Four Important V of Big Data; 1.4 Types of Big Data; 1.5 Big Data vs. Traditional Data Business Approach; 1.6 Tools used for Big Data; 1.7 Data Infrastructure Requirements; 1.8 Case Studies of Big Data Solutions.

Chapter 2 : Introduction to Hadoop
2.1 Introduction to Hadoop (2.1.1 Hadoop - Features, 2.1.2 Hadoop and Traditional RDBMS); 2.2 Hadoop System Principles; 2.3 Hadoop Physical Architecture; 2.4 Hadoop Core Components (2.4.1 HDFS, 2.4.2 MapReduce, 2.4.3 Hadoop - Limitations); 2.5 Hadoop Ecosystem; 2.6 ZooKeeper; 2.7 HBase (2.7.1 Comparison of HDFS and HBase, 2.7.2 Comparison of RDBMS and HBase, 2.7.3 HBase Architecture, 2.7.4 Region Splitting Methods, 2.7.5 Region Assignment and Load Balancing, 2.7.6 HBase Data Model); 2.8 HIVE (2.8.1 Architecture of HIVE, 2.8.2 Working of HIVE, 2.8.3 HIVE Data Models).

Chapter 3 : Hadoop HDFS and MapReduce
3.1 Distributed File Systems (3.1.1 Physical Organization of Compute Nodes, 3.1.2 Large-Scale File-System Organization); 3.2 MapReduce (3.2.1 The Map Tasks, 3.2.2 Grouping by Key, 3.2.3 The Reduce Tasks, 3.2.4 Combiners, 3.2.5 Details of MapReduce Execution, 3.2.6 Coping with Node Failures); 3.3 Algorithms using MapReduce (3.3.1 Matrix-Vector Multiplication by MapReduce, 3.3.2 Relational-Algebra Operations, 3.3.3 Computing Selections by MapReduce, 3.3.4 Computing Projections by MapReduce, 3.3.5 Union, Intersection and Difference by MapReduce); 3.4 Hadoop Limitations.

Chapter 4 : NoSQL
4.1 NoSQL (What is NoSQL?); 4.2 NoSQL Basic Concepts; 4.3 Case Study NoSQL (SQL vs NoSQL); 4.4 Business Drivers of NoSQL; 4.5 NoSQL Database Types; 4.6 Benefits of NoSQL; 4.7 Introduction to Big Data Management; 4.8 Big Data (4.8.1 Tools Used for Big Data, 4.8.2 Understanding Types of Big Data Problems); 4.9 Four Ways of NoSQL to Operate; 4.10 Analyzing Big Data with a Shared-Nothing Architecture (4.10.1 Shared Memory System, 4.10.2 Shared Disk System, 4.10.3 Shared Nothing Disk System, 4.10.4 Hierarchical System); 4.11 Choosing Distribution Models : Master-Slave versus Peer-to-Peer (4.11.1 Big Data NoSQL Solutions - Cassandra, DynamoDB).

Chapter 5 : Mining Data Streams
5.1 The Stream Data Model (5.1.1 A Data-Stream-Management System, 5.1.2 Examples of Stream Sources, 5.1.3 Stream Queries, 5.1.4 Issues in Stream Processing); 5.2 Sampling Data Techniques in a Stream; 5.3 Filtering Streams (5.3.1 Bloom Filter with Analysis); 5.4 Counting Distinct Elements in a Stream (5.4.1 Count-Distinct Problem, 5.4.2 The Flajolet-Martin Algorithm, 5.4.3 Combining Estimates, 5.4.4 Space Requirements); 5.5 Counting Frequent Items in a Stream (5.5.1 Sampling Methods for Streams, 5.5.2 Frequent Itemsets in Decaying Windows); 5.6 Counting Ones in a Window (5.6.1 The Cost of Exact Counts, 5.6.2 The DGIM Algorithm (Datar-Gionis-Indyk-Motwani), 5.6.3 Query Answering in the DGIM Algorithm, 5.6.4 Decaying Windows).

Chapter 6 : Finding Similar Items
6.1 Distance Measures (6.1.1 Euclidean Distance, 6.1.2 Jaccard Distance, 6.1.3 Cosine Distance, 6.1.4 Edit Distance, 6.1.5 Hamming Distance).

Chapter 7 : Clustering
7.1 Introduction; 7.2 CURE Algorithm (7.2.1 Overview of CURE (Clustering Using Representatives), 7.2.2 Hierarchical Clustering Algorithm, 7.2.2(A) Random Sampling and Partitioning Sample, 7.2.2(B) Eliminating Outliers and Data Labelling); 7.3 Stream-Computing (7.3.1 A Stream-Clustering Algorithm); 7.4 Initializing and Merging Buckets; 7.5 Answering Queries.

Chapter 8 : Link Analysis
8.1 Page Rank Definition (8.1.1 Importance of Page Rank, 8.1.2 Links in Page Ranking, 8.1.3 Structure of the Web, 8.1.4 Using Page Rank in a Search Engine); 8.2 Efficient Computation of Page Rank (8.2.1 Representation of the Transition Matrix, 8.2.2 Iterating Page Rank with MapReduce, 8.2.3 Use of Combiners to Aggregate the Result Vector); 8.3 Link Spam (8.3.1 Spam Farm Architecture, 8.3.2 Spam Farm Analysis, 8.3.3 Dealing with Link Spam); 8.4 Hubs and Authorities (8.4.1 Formalizing Hubs and Authorities).

Chapter 9 : Recommendation Systems
9.1 Recommendation System (9.1.1 The Utility Matrix, 9.1.2 Applications of Recommendation Systems, 9.1.3 Taxonomy for Application Recommendation System); 9.2 Content Based Recommendation (9.2.1 Item Profile, 9.2.2 Discovering Features of Documents, 9.2.3 Obtaining Item Features from Tags, 9.2.4 Representing Item Profiles, 9.2.5 User Profiles, 9.2.6 Recommending Items to Users based on Content, 9.2.7 Classification Algorithm); 9.3 Collaborative Filtering (9.3.1 Measuring Similarity, 9.3.2 Jaccard Distance, 9.3.3 Cosine Distance, 9.3.4 Rounding the Data, 9.3.5 Normalizing Ratings); 9.4 Pros and Cons in Recommendation Systems (9.4.1 Collaborative Filtering, 9.4.2 Content-based Filtering); 9.5 Case Study : Product Recommendation.

Chapter 10 : Mining Social Network Graphs
10.1 Introduction; 10.2 Social Networks as Graphs (10.2.1 Parameters Used in Graphs (Social Networks), 10.2.2 Varieties of Social Networks - Collaborative, Email and Telephone Networks); 10.3 Clustering of Social Network Graphs (10.3.1 Distance Measures for Social-Network Graphs, 10.3.2 Applying Standard Clustering Methods, 10.3.3 Betweenness, 10.3.4 The Girvan-Newman Algorithm, 10.3.5 Using Betweenness to Find Communities); 10.4 Direct Discovery of Communities (10.4.1 Bipartite Graphs, 10.4.2 Complete Bipartite Graphs); 10.5 Random Walks (10.5.1 Random Walker on a Social Network, 10.5.2 Random Walks with Restart); 10.6 Counting Triangles using MapReduce.

Chapter 11 : Data Analytics with R
11.1 Exploring Basic Features of R (Exploring R GUI, Exploring RStudio, Handling Basic Expressions in R, Variables in R, Working with Vectors in R, Storing and Calculating Values in R, Creating and Using Objects, Interacting with Users, Handling Data in the R Workspace, Executing Scripts, Creating Plots, Accessing Help and Documentation in R); 11.2 Reading Datasets and Exporting Data from R (11.2.1 Exporting Data from R); 11.3 Manipulating and Processing Data in R (11.3.1 Data Processing in R, 11.3.2 Using Functions instead of a Script, Built-in Functions in R); 11.4 Data Visualization : Types, Applications (11.4.1 Data Visualization Types, 11.4.2 Data Visualization Applications, 11.4.3 Data Visualization Tools).

Module 1

Chapter 1 : Introduction to Big Data

Syllabus : Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional vs.
Big Data business approach, Case Study of Big Data Solutions.

1.1 Introduction to Big Data Management
- We are all surrounded by huge data. People upload and download videos, audio and images from a variety of devices. Sending text and multimedia messages, updating Facebook, WhatsApp and Twitter statuses, posting comments, online shopping and online advertising all generate huge data.
- As a result, the machines processing all of this also have to generate and keep huge data. Due to this exponential growth of data, data analysis has become a necessary task for day-to-day operations.
- The term 'Big Data' means a huge volume, high velocity and a wide variety of data. This big data is increasing tremendously day by day.
- Traditional data management systems and existing tools face difficulties in processing such big data.
- R is one of the main computing tools used in statistical education and research. It is also widely used for data analysis and numerical computing in other fields of scientific research.
(Fig. 1.1.1)

1.2 Big Data
What is Big Data?
Q. Write a short note on Big Data.
- We are all surrounded by huge data. People upload and download videos, audio and images from a variety of devices. Sending text messages, multimedia messages, updating Facebook, WhatsApp or Twitter status, posting comments, online shopping and online advertising generate a huge amount of data.
- As a result, machines have to generate and keep huge data too. Due to this exponential growth of data, the analysis of that data becomes challenging and difficult.
- Big Data means huge volume, high velocity and a variety of data, and it is increasing tremendously day by day. Traditional data management systems and existing tools face difficulties in processing such Big Data.
- Big data is one of the most important technologies of the modern world. It is really critical to store and manage a collection of large datasets that cannot be processed using traditional computing techniques.
- Big Data includes huge volume, high velocity and an extensible variety of data. The data in it may be structured, semi-structured or unstructured. Big data also involves various tools, techniques and frameworks.

1.3 Big Data Characteristics - Four Important V of Big Data
Q. What are the three Vs of big data? (University Question)
Q. Describe any five characteristics of Big Data.
Big data characteristics are as follows : Volume (scale of data), Velocity (analysis of streaming data), Variety (different forms of data) and Veracity (uncertainty of data).
(Fig. 1.3.1 : Big data characteristics)

1. Volume
- A huge amount of data is generated during big data applications.
- The amount of data generated, as well as the storage volume, is very big in size; gigabytes of new data are generated every day.
(Fig. 1.3.2 : Statistics illustrating the scale of data generation - for example, petabytes of data collected every hour, most smartphone users checking an app as soon as they wake up, and users keeping their phones with them for most of the day.)
2. Velocity
- For time-critical applications, faster processing is very important, e.g. share marketing and video streaming.
- The huge amount of data that is generated and stored requires a higher speed of processing.
- The amount of digital data doubles roughly every 18 months, and this may repeat in even less time in the future.

3. Variety
- The type and nature of the data shows great variety.
(Fig. 1.3.3)

4. Veracity
- The data captured is not always in a definite format, and captured data can vary greatly in quality.
- So the accuracy of analysis depends on the veracity of the source data.

5. Additional characteristics
(a) Programmable : With big data it is possible to explore all types of data through programming logic. Because of the scale of the data, programming can be used to perform any kind of exploration.
(b) Data driven : A data-driven approach is possible for scientists, as the data collected is huge in amount.
(c) Multi-attribute : It is possible to deal with many gigabytes of data that consist of thousands of attributes, as all data operations now happen on a larger scale.
(d) Iterative : The available computing power can iterate on your models until you get them as per your own requirements.

1.4 Types of Big Data
Q. Write a short note on Types of Big Data.

1. Introduction
Big data is commonly grouped into structured data, unstructured data and semi-structured data.
(Fig. 1.4.1 : Types of big data)

2. Structured data
- Structured data is generally data that has a definite length and format.
- Like RDBMS tables, it has a fixed number of columns, and data grows by adding rows.
- Example : Structured data includes marks stored as numbers and dates, or data like words and numbers. Structured data is very simple to deal with and easy to store in a database.

Sources of structured data
The data can be generated by machines or by humans.
(i) Machine-generated data
1. Sensor data : Radio-frequency ID tags, medical devices, and Global Positioning System data.
2. Web log data : All kinds of data about server and application activity.
3. Point-of-sale data : Data associated with sales.
4. Financial data : Stock-trading data or banking transaction data.
(ii) Human-generated data
1. Input data : Survey data, response sheets and so on.
2. Click-stream data : Data generated every time you click a link on a website.
3. Gaming-related data : Every move you make in a game can be recorded.
(iii) Tools that work with structured data
(i) Data Marts (ii) RDBMS (iii) Greenplum (iv) Teradata

The sketch below contrasts structured data with the unstructured and semi-structured forms described next.
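The following Python sketch is an illustration only (it is not from the syllabus, and the record fields are made up): it shows the same sales event represented as a structured record with a fixed schema, as a semi-structured JSON document whose fields can vary, and as unstructured free text.

```python
import json

# Structured: fixed schema, every record has the same typed columns,
# so it maps directly onto an RDBMS table row.
structured_row = ("2023-07-14", "STORE-12", "P-1001", 2, 499.00)

# Semi-structured: a self-describing key/value document (JSON); records may
# add or omit fields, which is why it does not fit a rigid table schema.
semi_structured = json.dumps({
    "date": "2023-07-14",
    "store": "STORE-12",
    "items": [{"product": "P-1001", "qty": 2}],
    "coupon": "SUMMER10"   # optional field, not present in every record
})

# Unstructured: free text (or audio/video/images); structure must be
# extracted by analysis before it can be queried.
unstructured = "Customer bought two units of P-1001 at store 12 and used a coupon."

print(structured_row)
print(semi_structured)
print(unstructured)
```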
3. Unstructured data
- Unstructured data is data without a pre-defined format, like audio, video and web blog data.
- Example : Unstructured data includes the video recordings of CCTV surveillance.

Sources of unstructured data
The data can be generated by machines or by humans.
(i) Machine-generated data
1. Satellite images : This includes weather data or other data from satellites.
2. Scientific data : This includes seismic imagery, weather forecasting data, etc.
3. Photographs and video : This includes security and surveillance video, etc.
4. Radar or sonar data : This includes vehicular and oceanographic data.
(ii) Human-generated data
1. Text data : The documents, logs, survey results and e-mails in a company.
2. Social media data : Generated from social media platforms such as YouTube and Facebook.
3. Mobile data : This includes data such as text messages and location information.
4. Website content : Site data from sites like YouTube, Flickr or Instagram.
(iii) Tools used for unstructured data
1. Hadoop 2. HBase 3. Hive 4. Pig 5. Cloudera 6. MapR

4. Semi-structured data
- Along with structured and unstructured data, there is also semi-structured data.
- Semi-structured data is information that does not reside in an RDBMS.
- It may be organized in a tree pattern, which is easier to analyze in some cases.
- Examples of semi-structured data include XML documents and NoSQL databases.

5. Hybrid data
- There are systems which make use of both types of data to achieve competitive advantage.
- Structured data offers simplicity, whereas unstructured data gives a lot of information about a topic.
(Fig. 1.4.2 : Examples of structured sources (loyalty, e-commerce, demographic, POS data) and multi-structured/hybrid sources (emerging market data, third-party data, weather, currency conversion, panel data).)

1.5 Big Data vs. Traditional Data Business Approach
Q. Compare the traditional approach and the big data approach.
The modern world is generating massive volumes of data at very fast rates. As a result, big data analytics is becoming a powerful tool for businesses looking to mine valuable data for competitive advantage.
(Fig. 1.5.1 : Classic BI versus big data analytics - in classic BI the business determines what questions to ask and IT structures the data to answer those questions ("capture only what's needed"); in big data analytics IT delivers a platform for storing, refining and analyzing all data sources, and the business explores the data for questions worth asking.)

1. Traditional business intelligence
- There are many systems distributed throughout the organization.
- The traditional data warehouse and business intelligence approach required extensive data analysis work with each of these systems and extensive transfer of data.
- Traditional Business Intelligence (BI) systems offer various levels and types of analyses on structured data, but they are not designed to handle unstructured data.
- For these systems Big Data creates problems, because data flows in both structured and unstructured form. This makes them limited when it comes to delivering Big Data benefits.
- Many of the data sources are incomplete, do not use the same definitions, and are not always available.
- Saving all the data from each system to a centralized location is unfeasible.
(Fig. 1.5.2 : Traditional approach (structured and repeatable analysis: business users determine what question to ask, IT structures the data to answer it - e.g. monthly sales reports, profitability analysis, customer surveys) versus big data approach (iterative and exploratory analysis: IT delivers a platform for creative discovery, business explores what questions could be asked - e.g. brand sentiment, product strategy, maximum asset utilization, preventative care).)

2. Big data analysis
- Big data means large or complex data sets that traditional data processing applications may not be able to process efficiently.
- Big data analytics involves data analysis, data capture, search, sharing, storage, transfer, visualization, querying and information security.
- The term is generally used for predictive analytics. The efficiency of big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.
(Fig. 1.5.3)
- Cloud-based platforms can be used for the business world's big data problems.
- There can be some situations where running workloads on a traditional database is the better solution.
(Fig. 1.5.4 : Accelerating time-to-value - relative cost of a big data (MPP) solution versus a single-SKU appliance.)

3. Comparison of traditional and big data

Sr. No. | Dimension | Traditional | Big Data
1 | Data source | Mainly internal. | Both inside and outside the organization, including non-traditional sources.
2 | Data structure | Pre-defined structure. | Unstructured in nature.
3 | Data relationship | Stable, known interrelationships by default. | Unknown relationships.
4 | Data location | Centralized. | Physically highly distributed.
5 | Data analysis | After the complete build. | Intermediate analysis, as you go.
6 | Data reporting | Mostly canned, with limited and pre-defined interaction paths. | Reporting in all possible directions across the data, in real-time mode.
7 | Cost factor | Specialized high-end hardware and software. | Inexpensive commodity boxes in cluster mode.
8 | CAP theorem | Consistency - top priority. | Availability - top priority.

4. Comparison between RDBMS and Hadoop

Sr. No. | Aspect | Traditional RDBMS | Hadoop / MapReduce
1 | Data size | Gigabytes (Terabytes) | Petabytes (Exabytes)
2 | Access | Interactive and batch | Batch - NOT interactive
3 | Updates | Read/write many times | Write once, read many times
4 | Structure | Static schema | Dynamic schema
5 | Integrity | High (ACID) | Low
6 | Scaling | Non-linear | Linear
7 | Query response time | Can be near immediate | Has latency (due to batch processing)

1.6 Tools used for Big Data
Q. Explain various tools used in Big Data.
1. MapReduce : Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
2. Storage : S3, Hadoop Distributed File System
3. Servers : EC2, Google App Engine, Elastic Beanstalk, Heroku
4. NoSQL : ZooKeeper, MongoDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, CouchDB
5. Processing : R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

1.7 Data Infrastructure Requirements
(Fig. 1.7.1 : Data infrastructure requirements - acquiring data (low, predictable latency; high transaction volume; flexible data structures), organizing data (high throughput; in-place preparation; all data sources and structures) and analyzing data (deep analytics; real-time results).)
1. Acquiring data : High volumes of data and transactions are basic requirements of big data, and the infrastructure should support them. Flexible data structures should be used, and the time required for acquisition should be as small as possible.
2. Organizing data : As the data may be structured, semi-structured or unstructured, it should be organized in a fast and efficient way.
3. Analyzing data : Data analysis should be fast and efficient, and it should support distributed computing.

1.8 Case Studies of Big Data Solutions
Q. Give two examples of big data case studies. Indicate which Vs are satisfied by these case studies.
There are many cases in which big data solutions can be used effectively.

1. Healthcare and public health industry
- The computing power of big data analytics enables us to find new cures and to better understand and predict disease patterns. Entire DNA strings can be decoded in minutes.
- Smart watches can be used to predict symptoms of various diseases.
- Big data techniques are already being used to monitor babies in a specialist premature and sick baby unit. By recording and analyzing every heartbeat and breathing pattern of every baby, the unit was able to develop algorithms that can now predict infections 24 hours before any physical symptoms appear.
Sfroome ix Introduction to Hadoop Concept of Hadoop, Core Hadoop Components, Hadoop Ecosystem 2.1__Hadoop @. Whats Hadoop? How Big Data andHadoop are linked ? PU: May 17.4 Marks | [© _Write a short note on Hadoop « Hadoop is an open-source, big data storage and processing software framework, Hadoop stores and process big data in a distributed fashion on large clusters of commodity hardware. Massive data storage and faster processing are the two important aspects of Hadoop. fon] [om] Lo Hadoop Cluster Fig. 2.1.1: Hadoop cluster «As shown in Fig, 2.1.1 Hadoop cluster is a set of commodity machines networked together in one location ie. cloud. sed for Data storage and processing. From individual clients users can «These cloud machines are then us int at some remote locations from the Hadoop cluster. submit their jobs to cluster. These clients may be preset ‘ems with thousands of nodes involving huge storage capabilities. As * Hadoop run applications on syst data transfer rates among nodes are very faster. distributed file system is used by Hadoop © Asthousands of machines are there in a cluster user can get uninterrupted service and node failure is not a big issue in Hadoop even if large numberof nodes become inoperative * Hadoop uses distributed storage and transfers code to data. This code is tiny and consumes less memory also results is saved as the data is locally available. Thus interprocess communication time is saved which makes it faster processing. «The edundancy of datas important feature of Hadoop due to which node ales ae easily handled, «In Hadoop us com munication between that data. Low cost 1 Hadoop - Features ‘As Hadoop is an opensource framewor data, Hence itis not much costly. High computing power distributed computing Hadoop uses ister can be processed quickly. Clu Hadoop. Scalability Nodes can be easily added an« litle administration is required. Huge and flexible storage Massive data storage is aval unstructured data. No preprocessing is required on data Fault tolerance and data protection If any node fails partitioning the data, handles it all, use! rk, it is free. It uses © model. Due to this task can b have thousands of d removed. Failed nodes can ilable due to thousands of nodes in the data and tas! concentrate 0” Kk ase i data and operation; r can Cl e to store and process hy, ommodity hardwar .e distributed amongst different nodes a, ‘odes which gives high computing capability ; be easily detected. For all these activities ve cluster. It supports both structured a before storing it. the tasks in hand are automatically redirected to other nodes. Multiple copies of all data? ‘automatically stored. Due to this even if any node fails that data is available on some other nodes also. 2.1.2. Hadoop and Traditional RDBMS = Hadoop : 1, |Hadoop stores both structured ‘ : and i Hndoop aera RDBMS stores data in a structural way. 2. ‘SQL can be implemented on t op of Hadoop a Sat can bento p as | SQL (Structured Query Language) is used. 3. | Scaling out is not that mi wich expensive i and little administration, 4. Basic data unit is key/valu Tf pened ey/value pairs. Basic data unit is relational tabl With MapRedce we can ute scripts nd codes | With SQL = actual steps in processing the data, peas or ee eemmeten aca ee | - . engine derives it. ear | Hadoop is designed for offlit fline proc analysis of large-scale data. prosssing and RDBMS is designed for online transacti ions, Watt up We wources 1 ocala UF software 1 1, Sealing out 7 In Traditional RO! this can be easily done i. 
2.2 Hadoop System Principles
1. Scaling out : In a traditional RDBMS it is quite difficult to add more hardware and software resources, i.e. to scale up. In Hadoop this can be done easily by scaling out across more nodes.
2. Transfer code to data : In an RDBMS, data is generally moved to the code and the results are stored back. As the data is moving, there is always a security threat. In Hadoop, small pieces of code are moved to the data and executed there itself, so the data stays local. Thus Hadoop co-locates processing and storage.
3. Fault tolerance : Hadoop is designed to cope with node failures. As a large number of machines are involved, node failure is a very common problem.
4. Abstraction of complexities : Hadoop provides proper interfaces between components for proper working.
5. Data protection and consistency : Hadoop handles system-level challenges as it supports data consistency.

2.3 Hadoop Physical Architecture
Q. Explain the physical architecture of Hadoop.
- Running Hadoop means running a set of resident programs. These resident programs are also known as daemons.
- These daemons may be running on the same server or on different servers in the network.
- All these daemons have some specific functionality assigned to them. Let us see these daemons.
(Fig. 2.3.1 : Hadoop cluster topology - Secondary NameNode, NameNode and JobTracker as masters; DataNodes and TaskTrackers on the slave nodes.)

NameNode
1. The NameNode is known as the master of HDFS.
2. The DataNodes are known as the slaves of HDFS.
3. The NameNode keeps track of how files are split into blocks and which DataNodes hold them.
4. The NameNode directs DataNodes regarding low-level I/O tasks.
5. The NameNode is the only single-point-of-failure component.

DataNode
1. The DataNode is known as the slave of HDFS.
2. The client obtains block addresses from the NameNode and, using these addresses, communicates directly with the DataNode.
3. For replication of data, a DataNode may communicate with other DataNodes.
4. The DataNode continually informs the NameNode of local changes.
5. To create, move or delete blocks on its local disk, the DataNode receives instructions from the NameNode.

Secondary NameNode (SNN)
1. State monitoring of the cluster HDFS is done by the SNN.
2. Every cluster has one SNN.
3. The SNN resides on its own machine.
4. On that server, no other DataNode or TaskTracker daemon can run.
5. The SNN takes snapshots of the HDFS metadata at intervals by communicating constantly with the NameNode.

JobTracker
1. The JobTracker determines which files to process, assigns nodes to different tasks, monitors running tasks, etc.
2. Only one JobTracker daemon per Hadoop cluster is allowed.
3. The JobTracker runs on a server as a master node of the cluster.

TaskTracker
1. Individual tasks assigned by the JobTracker are executed by the TaskTracker.
2. There is a single TaskTracker per slave node.
3. A TaskTracker may handle multiple tasks in parallel by using multiple JVMs.
4. The TaskTracker constantly communicates with the JobTracker. If the TaskTracker fails to respond to the JobTracker within a specified amount of time, it is assumed that the TaskTracker has crashed and its tasks are resubmitted to other nodes in the cluster.
(Fig. 2.3.2 : JobTracker and TaskTracker interaction)

2.4 Hadoop Core Components
Q. Explain the components of core Hadoop.
Hadoop has two core components : HDFS, the Hadoop Distributed File System (storage), and MapReduce (processing).
(Fig. 2.4.1 : Hadoop core components - a MapReduce layer (JobTracker on the admin node, TaskTrackers on the cluster nodes) on top of an HDFS layer (NameNode on the admin node, DataNodes on the cluster nodes).)

2.4.1 HDFS (Hadoop Distributed File System)
Q. Describe the structure of HDFS in a Hadoop ecosystem using a diagram. (University Question)
HDFS is the file system for Hadoop. It runs on clusters of commodity hardware and has the following important characteristics :
- Highly fault-tolerant
- High throughput
- Supports applications with massive data sets
- Streaming access to file system data
- Can be built out of commodity hardware.

HDFS Architecture
- For distributed storage and distributed computation, Hadoop uses a master/slave architecture. The distributed storage system in Hadoop is called the Hadoop Distributed File System, or HDFS.
- In HDFS a file is chopped into 64 MB chunks, known as blocks, and then stored.
- As previously discussed, an HDFS cluster has a Master (NameNode) and Slave (DataNode) architecture.
- The NameNode manages the namespace of the filesystem. This namespace stores information such as the filesystem tree and the metadata for all the files and directories in that tree. For this it creates two files, the namespace image and the edit log, and keeps them updated on a consistent basis.
- A client interacts with HDFS by communicating with the NameNode and DataNodes. The user does not need to know which NameNode and DataNodes are assigned or will be assigned for a given operation.

1. NameNode
- The NameNode is known as the master of HDFS.
- The DataNode is known as the slave of HDFS.
- The NameNode keeps track of the file blocks distributed to DataNodes.
- The NameNode directs DataNodes regarding low-level I/O tasks.
- The NameNode is the only single-point-of-failure component.

2. DataNode
- The DataNode is known as the slave of HDFS.
- The client obtains block addresses from the NameNode and, using these addresses, communicates directly with the DataNode.
- For replication of data, a DataNode may communicate with other DataNodes.
- The DataNode continually informs the NameNode of local changes.
- To create, move or delete blocks on its local disk, the DataNode receives instructions from the NameNode.
(Fig. 2.4.2 : HDFS architecture - the NameNode holds the metadata (file names, replicas, block mapping) while the DataNodes hold the data blocks. A small simulation of this block bookkeeping follows.)
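To make the block-and-replica bookkeeping concrete, here is a small self-contained Python simulation (not real HDFS client code) of what the NameNode's metadata might look like for one file: the file is split into 64 MB blocks as described above, and each block is assigned to several DataNodes. The 3-way replication factor, the placement policy and the node names are assumptions made purely for illustration.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as described in the text
REPLICATION = 3                 # assumed replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def plan_blocks(file_name, file_size):
    """Simulate the NameNode metadata: block ids and replica placement."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    metadata = {}
    for b in range(n_blocks):
        # Round-robin placement, just to illustrate the idea; the real
        # NameNode also considers rack awareness and free space.
        replicas = [DATANODES[(b + r) % len(DATANODES)] for r in range(REPLICATION)]
        metadata[f"{file_name}#blk_{b}"] = replicas
    return metadata

for block, nodes in plan_blocks("/logs/clicks.log", 200 * 1024 * 1024).items():
    print(block, "->", nodes)
# A 200 MB file becomes 4 blocks, each stored on 3 different DataNodes,
# so the loss of any single DataNode does not lose data.
```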
2.4.2 MapReduce
Q. What is MapReduce? Explain how MapReduce works.
- MapReduce is a software framework. In MapReduce an application is broken down into a number of small parts, also called fragments or blocks, which can be run on any node in the cluster.
- Data processing is done by MapReduce. MapReduce scales and runs an application across different cluster machines.
- The configuration changes required for scaling and running these applications are handled by MapReduce itself.
- There are two primitives used for data processing in MapReduce, known as mappers and reducers. Mapping and reducing are the two important phases of executing an application program.
- In the mapping phase, MapReduce takes the input data, filters it, and transforms each data element with the mapper.
- In the reducing phase, the reducer processes all the outputs from the mapper, aggregates them, and then produces the final result.
- MapReduce uses lists and key/value pairs for processing data.

MapReduce core functions
1. Read input : Divides the input into small parts/blocks, which then get assigned to a Map function.
2. Function mapping : Converts the file data to smaller, intermediate key/value pairs.
3. Partition, compare and sort :
   - Partition function : Given the key and the number of reducers, it finds the correct reducer.
   - Compare function : Intermediate map outputs are sorted according to this compare function.
4. Function reducing : Intermediate values are reduced to a smaller solution and given to the output.
5. Write output : Produces the file output.
(Fig. 2.4.3 : The general MapReduce dataflow)

To understand how this works, let us count the occurrences of each word across two files (a code sketch follows, and the three phases are then traced step by step).
File 1 : "Hello Sachin Hello Sumit"
File 2 : "Goodnight Sachin Goodnight Sumit"
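The following plain-Python simulation of the word-count dataflow mirrors the phases traced below; it is an illustration only, not actual Hadoop API code (on a real cluster the job would be written against the Java MapReduce API or run through Hadoop Streaming).

```python
from collections import defaultdict

files = {
    "file1": "Hello Sachin Hello Sumit",
    "file2": "Goodnight Sachin Goodnight Sumit",
}

# Map phase: each mapper emits an intermediate (word, 1) pair per word.
def mapper(text):
    return [(word, 1) for word in text.split()]

# Combine phase: per-mapper local aggregation of the intermediate pairs.
def combine(pairs):
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

# Shuffle / group by key: gather every value emitted for the same word.
grouped = defaultdict(list)
for name, text in files.items():
    for word, count in combine(mapper(text)):
        grouped[word].append(count)

# Reduce phase: one reducer call per key sums the grouped counts.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)
# {'Hello': 2, 'Sachin': 2, 'Sumit': 2, 'Goodnight': 2}
```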
+ The pioneers in this field of data is Google, which designed scalable frameworks like MapReduce and Google File System. tive by the name Hadoop, it is a framework that allows for * Apache open source has started with the distributed processing of such large data sets across clusters of machines. g 5 i ‘ : £ 3 es: e i Daeg 5 & |(Hadoop distabuted fle system) Fig. 2.5. ladoop ecosystem 2. Ecosystem * Apache Hadoop, has 2 core projects, © Hadoop MapReduce ° Hadoop Distributed File System ; ragramming model and software for writing applications which can process llel on large clusters of computers. it creates multiple replicas of data blocks and distributes them on extremely rapid computations. © Hadoop MapReduce is @ vast amounts of data in paral © HDFS is the primary storage system, compute nodes throughout a cluster to enable reliable, + Other Hadoop-related projects are Chukwa, Hive, HBase Mahout, Sqoop and Zookeeper. > Apache hadoop ecosystem , é x Ambari 2 Provisioning, managing and monitoring hadoop clusters ih z § F | | i z 5 2 2 g é #3 z 2 % 8a ga 2 BS|[_e|| 22|| 2a|| 23 i & vamuaproamet || fe 3 & @ Distributed processing framework 8 Bi) 23 23|| $5 |[iore B31 | 8 8 || Hedoop astibutod tle system Fig. 2.5.2 2.6 ZooKeeper 1 Zookeeper is a distributed, open-source coordination service for distributed appli tions used by Hadc 2. This system is a simple set of primitives that distributed applications can build upon to implement level services for synchronization, configuration maintenance, and groups and naming. 3. This Coordination services are prone to errors such as race con ditions and deadlock. ions, 4. The main goal behind Zookeeper is to use distributed applicati Big Data Analytics (MU) Zookeeper wi 2 1 Introduc to Hadoop allows distributed processes to coor ‘dinate with each other using shared hierarchical namespace organized as a standard file system, 6. The name Space made up of of data registers called znodes, and these are similar to files and directories. 7. Zookeeper data is kept in-memory, which means it can achieve high throughput and low latency. ® 2.7__HBase © HBase is a distributed column-oriented database. * _HBase is hadoop application built on top of HDFS. + HBase is suitable for huge datasets where real-time read/write random access is required + HBase is not a relational database. Hence does not support SQL. * _Itisan open-source project and is horizontally scalable. * Cassandra, couchDB, Dynamo and MongoD8 are some other databases similar to HBase. * Data can be entered in HDFS either directly or through HBase. Consistent read and writes, Automatic failure support is provided. It can be easily integrated with JAVA. * Data is replicated across cluster. Useful when some node fails. 2.7.1 Comparison of HDFS and HBase Sr. HDFS ; HBase No, 1 | HFS is a distributed file system suitable for storing large | HBase is a database built on top of the files HOFS. 2. | HOFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables. Low latency random access. 3. _| It provides high latency batch processing. i _ ST 2.7.2. Comparison of RDBMS and HBase _ sr. RDBMS HBase No. _ 1. | RDBMS uses schema. Data is stored according | HBase is schemarless. Only column fami. to its schema. defined, : 2. | Scaling is difficult. Horizontally scalable. 3. | ROBMS is transactional No transactions are there in HBase. 7 4, | Ithas normalized data, It has de-normalized data. 
5, _ | Itis good for structured data It is good for semi-structured as well a5 sir data. 6. _| Itis row oriented database. It is column oriented database. 7 Tite is suitable for Online Transaction Process | Itis suitable for Online Analytical Processing | (our). a 2.7.3, HBase Architecture The Master performs administration, cluster management, region management, load balancing anc handling. Region Server hosts and manages servers, region splitting, read/write request handling ‘communication etc. Region contains Write Ahead Log (WAL). It may have multiple regions. Region is made up of Mem Hfiles in which data is stored. Zookeeper is required to manage all the services. Java Client APIs External APIs (Thi, Avro, REST) Ragon Sever [ Wite-Ahoad Log (WAL) } Fig. 2.7.1 : HBase database architecture 2.7.4. Region Splitting Methods L Pre splitting Regions are created first and split points are assigned at the time of table creatic : tion. Ir 0 points are to be used very carefully otherwise load distribution will be heterogeneous fe of a clusters performance. = Data Analytics (MI tics (MU) 213 Introduction to Hadoop This is by default action. It splits region when one of the stores crosses the max configured value. 3. Manual splitting Split regions which are not uniformly loaded. 2.7.5. Region Assignment and Load Balancing This information cannot be changed further as these are the standard procedures, (On startup (On startup Assignment Manager is invoked by Master. 1 2, From META the information about existing region assignments is taken by the AssignrnentManager. 3, _ Ifthe RegionServer is still online then the assignment is kept as itis. 4, If the RegionServer is not online then for region assignment the LoadBalancerFactory is invoked. The DefaultLoadBalancer will randomly assign the region to a RegionServer. 5, META is updated with this new RegionServer assignment. The RegionServer starts functioning upon 129108 opening by the RegionServer. When region server fails 1. Regions become unavailable when any RegionServer 2. The Master finds which RegionServer is failed. The region assignments done by that Region: nt as that of startup. fails. Server then becomes invalid. The same process is followed for new region assignmer Region assignment upon load balancing When there are no regions in transition, the cluster load is balanced by around, Thus redistributes the regions on the cluster. Itis configured via hbase.bal is 300000 (5 minutes). 2.7.6 HBase Data Model ta Model in HBase is made of different logical components such as Tables, RO a load balancer by moving regions lancer period. The default value yws, Column Families, «The Dat Columns, Cells and Versions. _ « ttcan handle semi-structured data that may be varied in terms of data type, size and columns: Th partitioning and distributing data across the cluster is easier. c= | Row Key Movies Shows | screen | _MovieName ticket | Time | Day o1 Harry Potter1 | 200 | 600 [Saturday ce Harry Potter 2 | 250 | 300 Sunday Fi Base data model a 1. Tables Tables are stored as a logical collection of rows in Regions 2. Rows Each row is one instance of data Each table row is identified by 2 rowkey. These rowkeys are unique always treated as a bytel}. 3. Column Families Data in a row are grouped together as Column Families. These are stored in HFiles, 4. Columns , AColumn Family is made of one or more columns. « AColumn is accessed by, column family : columnname. 
* There can be multiple Columns within a Column Family and Rows within 2 table can have \ number of Columns. 5. Cell ‘ACell stores data as a combination of rowkey, Column Family and the Column (Column Qualifier, 6. Version : On the basis of timestamp different data versions are created. By default is the number of versions 2 it can be configured to some other value as well y” 2.8 HIVE + Hive is a data warehouse infrastructure tool. It processes structured data in HDFS. Hive structures data into tables, rows, columns and partitions + Itresides on top of Hadoop. ‘+ Itis used to summarize big Data, analysis of big data. + It is suitable for Online Analytical Application Processing. It supports ad hoc querries. It has its own SQL type language called HiveQL or HQL. © SQL type scripts can be created for MapReduce operations using HIVE. Primitive datatypes like Integers, Floats, Doubles, and Strings are supported by HIVE «Associative Arrays, Lists, Structs etc. can be used, 7 + Serialize API and Deserialized API are used to store and retrieve data + HIVE's easy to scale and has faster processing, we
