Hadoop and BigInsights

Agenda

▪ Why? When? Where?

▪ Hadoop Basics
– Comparison with RDBMS

▪ Hadoop architecture
– MapReduce
– HDFS
– Hadoop Common
– Ecosystem of related projects
• Pig, Hive, Jaql
• Other projects

▪ Hadoop Distributions

▪ BigInsights

Importance of Hadoop
▪ “We believe that more than half of the world’s data will be
stored in Apache Hadoop within five years”
– Hortonworks

Hardware improvements through the years...

▪ CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz

▪ RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)

▪ Disk Capacity
– 1990 – 20MB
– 2010 – 1TB

▪ Disk Transfer Rate (speed of reads and writes) – not much improvement in the last 7-10 years, currently around 70-80 MB/sec

How long will it take to read 1TB of data?

1TB (at 80 MB/sec):
1 disk - 3.4 hours
10 disks - 20 min
100 disks - 2 min
1000 disks - 12 sec
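
The figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), data spread evenly over the disks, and all disks streaming in parallel at the same rate; the function name `scan_seconds` is made up for illustration.

```python
# Back-of-the-envelope scan time for 1 TB at 80 MB/sec per disk.
# Assumptions: decimal units, even data distribution, perfectly parallel reads.
TB = 10**12  # bytes
MB = 10**6   # bytes


def scan_seconds(total_bytes, disks, mb_per_sec=80):
    """Seconds needed to read total_bytes when `disks` disks each stream at mb_per_sec."""
    return total_bytes / (disks * mb_per_sec * MB)


if __name__ == "__main__":
    for disks in (1, 10, 100, 1000):
        secs = scan_seconds(TB, disks)
        print(f"{disks:>4} disks: {secs:>8.1f} s = {secs / 60:.1f} min = {secs / 3600:.2f} h")
```

Under these assumptions one disk needs 12,500 seconds (roughly three and a half hours) and 1000 disks need 12.5 seconds, in line with the rounded numbers on the slide.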

Parallel Data Processing is the answer!

▪ It has been with us for a while:
– GRID computing - spreads processing load
– Distributed workload - hard to manage applications, overhead on developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc. (distribute the data)

▪ Distributed computing: multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal

▪ Challenges
– Heterogeneity
– Openness
– Security
– Scalability
– Concurrency
– Fault tolerance
– Transparency

What is Hadoop?
▪ Apache open source software framework for reliable, scalable, distributed
computing of massive amounts of data
▪ Hides underlying system details and complexities from user
▪ Developed in Java

▪ Consists of 3 sub projects:

– MapReduce
– Hadoop Distributed File System a.k.a. HDFS
– Hadoop Common

▪ Supported by several Hadoop-related projects

– HBase
– Zookeeper
– Avro
– Etc.

▪ Meant for heterogeneous commodity hardware

Design principles of Hadoop

▪ New way of storing and processing the data:
– Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Relatively inexpensive hardware ($2 – 4K)

▪ Bring processing to Data!

▪ Hadoop = HDFS + MapReduce infrastructure
▪ Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
▪ Reliability provided through replication

Hadoop is not for all types of work

▪ Not to process transactions (random access)

▪ Not good when work cannot be parallelized

▪ Not good for low latency data access

▪ Not good for processing lots of small files

▪ Not good for intensive calculations with little data

Who uses Hadoop?

Hadoop / MapReduce timeline

Hadoop Open Source Projects
▪ Hadoop is supplemented by an ecosystem of open source projects
– Jaql
– Oozie

What is Apache Hadoop?

▪ Flexible, enterprise-class support for processing large volumes of
data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search
technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data

▪ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written

Two Key Aspects of Hadoop

▪ MapReduce framework
– How Hadoop understands and assigns work to the nodes
(machines)

▪ Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
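
To make the MapReduce half of this picture concrete, here is a toy, single-process word count written in the map / shuffle / reduce style. It is only a sketch of the programming model: it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for illustration. In a real cluster the map calls would run on the nodes that hold the input and the grouping step would move data over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insights"]))
```

Because each map call looks at one line only and each reduce call looks at one key only, both phases can be spread over many nodes without the program itself changing.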

What is the Hadoop Distributed File System?

▪ HDFS stores data across multiple nodes

▪ HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes

▪ The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
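
The replication idea can be sketched in a few lines. The example below splits a file into fixed-size blocks, puts each block on several distinct nodes, and then checks which blocks survive when some nodes fail. Everything here is an assumption made for illustration: the tiny block size, the replication factor of 3, the node names, and the simple round-robin placement, which stands in for (and is not) the placement policy HDFS actually uses.

```python
def split_into_blocks(file_size, block_size):
    """Return a list of (block_id, length) pairs covering file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin stand-in policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a node that has not failed."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]          # hypothetical data nodes
    blocks = split_into_blocks(300, 128)      # toy sizes, not real HDFS values
    placement = place_replicas(len(blocks), nodes)
    print(placement)
    print(readable_blocks(placement, failed_nodes={"n1", "n2"}))
```

With three replicas on distinct nodes, any two nodes can fail and every block is still readable from a surviving copy.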

(Very quick) Introduction to MapReduce

▪ Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the program

▪ Data is stored across the entire cluster (the DFS)

– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: a logical file split into blocks 1-4, with each block replicated on several nodes of the cluster]
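
"Programs are brought to the data" can be illustrated with a small greedy scheduler: for each block, prefer a node that already stores a replica and has a free task slot, and fall back to a remote node only when no such node exists. This is a sketch under stated assumptions, not Hadoop's scheduler; the function `assign_tasks`, the slot counts, and the node names are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily assign one task per block, preferring nodes that hold the block.

    block_locations: {block_id: [node, ...]} replica holders per block
    free_slots:      {node: int} available task slots per node
    Returns {block_id: (node, "local" | "remote")}.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, kind = max(local, key=lambda n: slots[n]), "local"
        else:
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node, kind = max(candidates, key=lambda n: slots[n]), "remote"
        slots[node] -= 1
        assignment[block] = (node, kind)
    return assignment


if __name__ == "__main__":
    locations = {1: ["a", "b"], 2: ["a"], 3: ["a"]}
    print(assign_tasks(locations, {"a": 2, "b": 1, "c": 1}))
```

In this run the first two tasks read their block from local disk, while the third has to pull its block over the network because node "a" is out of slots. Keeping the share of remote tasks low is what the principle is about.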

Hadoop Common
▪ Formerly known as Hadoop Core

▪ Contains common utilities and libraries that support the other Hadoop sub
projects
– File system
– Remote Procedure Call (RPC)
– Serialization

▪ E.g. file system shell

– To interact directly with HDFS files, you need to use
/bin/hdfs dfs <args>

hadoop fs -dus -h /user/hadoop/file1 hdfs://node/dir1

Hadoop – Installation Requirements

▪ Installation types:
– Single-node:
• simple operations
• local testing and debugging
– Multi-node cluster:
• production level operation
• thousands of nodes
▪ Hardware:
– Can use commodity hardware
– Best practice:
• RAM: MapReduce jobs mostly I/O bound, plan enough RAM
• CPU: high-end CPUs are often not cost-effective
• Disks: use high capacity disks as Hadoop is storage hungry
• Network: depends on workload, consider high-end network gear for large
clusters
▪ Software:
– OS:
• GNU / Linux for development and production
• Windows / Mac for development
– Java
– ssh

What’s a Hadoop Distribution?

▪ What’s a Linux Distribution?
– Linux Kernel
– Open Source Tools around Kernel
– Installer
– Administration UI
▪ Open Source Distribution Formula
– Kernel
– Core Projects around Kernel
– Value Add
• Test Components
• Installer
• Administration UI
• Apps

▪ Hadoop is becoming the kernel of a distributed operating system

▪ WebSphere WAS
– > 25 Apache Projects + Additional Open Source + installer + IBM Value Add

IBM Enriches Hadoop

▪ Scalable
– New nodes can be added on the fly
▪ Affordable
– Massively parallel computing on commodity servers
▪ Flexible
– Hadoop is schema-less, and can absorb any type of data
▪ Fault Tolerant
– Through MapReduce software framework

▪ Performance & reliability
– Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
▪ Enterprise Hardening of Hadoop
▪ Productivity Accelerators
– Web-based UIs and tools
– End-user visualization
– Analytic Accelerators
– +++
▪ Enterprise Integration
– To extend & enrich your information supply chain

BigInsights: Value Beyond Open Source
▪ IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise.
– Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data.
– BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research.
– The result is a more developer-friendly and user-friendly solution for complex, large-scale analytics.

▪ InfoSphere BigInsights allows enterprises of all sizes to cost effectively manage and analyze the massive volume, variety and velocity of data that consumers and businesses create every day.

Integrated Installation
▪ Integrated installation of supported open source and IBM components.
– Seamless process for single node and cluster environments
– Integrated installation of all selected components

▪ Post-install validation of IBM and open source components

Roll Your Own
▪ Many disparate components
▪ Manual install
▪ Leg-work required
• What components?
• Which versions?

BigInsights (Easiest)
▪ Single install
• No need to worry about components & versions
▪ Install requires very little interaction
• No extra prerequisites to download

From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

[Figure: editions arranged by breadth of capabilities against enterprise class, building up from Apache Hadoop]

Quick Start
- Free. Non-production

Standard Edition
- Spreadsheet-style tool
- Web console
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Big SQL
- Jaql
- Platform enhancements
- ...

Enterprise Edition
- Accelerators
- GPFS – FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Monitoring and alerts
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- ...

* Limited use license

Analytics and Accelerators

▪ Administration Console
– Dashboards
– Applications
– BigSheets

▪ Analytics
– Text Analytics
– Big SQL
– Big R
▪ Accelerators
– Machine Data Analytics (MDA)
– Social Data Analytics (SDA)
– Telecommunications Event Data (TEDA) – Streams only

Overview of Web Console Capabilities

▪ Manage BigInsights
– Inspect / monitor system health
– Add / drop nodes
– Start / stop services
– Launch / monitor jobs
– Explore / modify file system
– Create custom dashboards

▪ Launch applications
– Spreadsheet-like analysis tool
– Pre-built applications (IBM supplied or user developed)

▪ Publish applications

▪ Monitor cluster, applications, data, etc.

Welcome Tab – your starting point

▪ Tasks: where and how to begin performing common administrative or analytical tasks
▪ Quick links to common functions
▪ Learn more through external Web resources

Dashboards
▪ Monitor overall system, data, and application services
▪ Create your own dashboard with supplied or custom widgets

Applications
▪ Manage, execute, and link applications
– Browse available applications
– Deploy / undeploy applications
– Launch (or schedule for launch) a deployed application
– Monitor job (application) execution status
– Inspect application output
– Link or “chain” applications for sequential execution

BigSheets: Enhanced Data Discovery & Visualization Capabilities

▪ With a familiar, spreadsheet-like interface, BigSheets provides web-based analysis and visualization for BigInsights users.
▪ It helps to analyze large amounts of data in an innovative and easy-to-use way, and helps to design and manage long-running data collection jobs.

▪ Version 2.1 broadens BigSheets …
– data discovery capability on unstructured text, by providing built-in text analytics functions to extract names, addresses, organizations, email, and phone numbers …
– offers enhanced visualization (more customizations of charts & multi-series charts)

Text Analytics – Highly Accurate Analysis of Unstructured Big Data

▪ How it works
– Parses text and detects meaning with annotators
– Understands the context in which the text is analyzed
– Hundreds of pre-built annotators for names, addresses, phone numbers, among others
• Out of box support: English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
– Distills structured info from unstructured text
• Sentiment analysis
• Consumer behavior
• Illegal or suspicious activities

▪ Example of unstructured text (document, email, etc):
“Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”

[Figure: unstructured text is turned into classification and insight]

▪ Benefits
– More precise and correct answers
• 2x vs. marketplace alternatives
– 50% faster than manual method
• Used to build world-class text analysis applications
– Run faster text analysis
• 10x or more vs. marketplace alternatives
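
To give a feel for what an annotator does, the sketch below pulls phone numbers and email addresses out of free text and returns them as structured records. It is a deliberately simple regular-expression stand-in, not the BigInsights text analytics engine or its annotators; the patterns, the `annotate` function, and the record layout are assumptions made for this example, and the phone pattern only covers the `ddd-ddd-dddd` format.

```python
import re

# Illustrative patterns only; real annotators handle far more formats and context.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def annotate(text):
    """Return structured annotations (type, matched text, character span) found in text."""
    annotations = []
    for label, pattern in (("email", EMAIL), ("phone", PHONE)):
        for match in pattern.finditer(text):
            annotations.append(
                {"type": label, "text": match.group(), "span": (match.start(), match.end())}
            )
    # Order annotations by where they occur in the text.
    return sorted(annotations, key=lambda a: a["span"])


if __name__ == "__main__":
    for annotation in annotate("Call 555-123-4567 or mail jane@example.com."):
        print(annotation)
```

The output is a list of typed records with character offsets, which is the "structured info from unstructured text" idea in miniature: downstream tools can filter, join, and count these records like any other table.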

Big SQL: Native SQL Query Access for Hadoop

▪ Native SQL access to data stored in BigInsights
– ANSI SQL 92+
– Standard syntax support (joins, data types, …)

▪ Real JDBC/ODBC drivers
– Prepared statements
– Cancel support
– Database metadata API support
– Secure socket connections (SSL)

▪ Optimization
– Leveraging MapReduce parallelism
or…
– Direct access for low-latency queries

▪ Varied data sources
– HBase (including secondary indexes)
– CSV, Delimited files, Sequence files
– JSON
– Hive tables

[Figure: an application sends SQL through the JDBC / ODBC driver to the JDBC / ODBC server and Big SQL engine inside BigInsights, which reads from the data sources: Hive tables, HBase tables, CSV files]

Big Data Accelerators - Summary

▪ Software components that accelerate development and/or implementation of specific use cases or solutions on top of the Big Data platform
▪ Provide business logic, data processing, and UI/visualization, tailored for a given use case
▪ Bundled with Big Data platform components – InfoSphere BigInsights and InfoSphere Streams
▪ They are templates, not turnkey solutions!

Key Benefits
▪ Time to value
▪ Leverage best practices around implementation of a given use case.

[Figure: IBM Big Data Platform. Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) sit on top of the platform, which comprises Visualization & Discovery, Applications & Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Contextual Search, and Information Integration & Governance, across Cloud | Mobile | Security]

Accelerators – Big Data Use Cases Supported

▪ Quickly build and deploy custom applications in high-value areas

▪ IBM Accelerator for Social Data Analytics – Enhanced 360 Degree view of customer
• Out of the box implementation tuned for Cross-Industry support
• Sample applications for Customer Acquisition / Retention / Segmentation, Marketing Campaign Optimization, Lead Generation, Brand Management, Surveillance, and Up Sell

▪ IBM Accelerator for Machine Data Analytics – Operational Analysis and Security-Intelligence Extension
• Out of the box implementation tuned for Cross-Industry support of Manufacturing, Oil & Gas, Energy & Utility, Healthcare, Travel, CPG, Transportation, and Retail sectors
• Operational Efficiency Monitoring, Security Incident Investigation, Proactive Maintenance, Troubleshooting, and Outage Prevention

Questions?

