
Hadoop and BigInsights

Agenda

▪ Why? When? Where?


▪ Hadoop Basics
– Comparison with RDBMS

▪ Hadoop architecture
– MapReduce
– HDFS
– Hadoop Common
– Ecosystem of related projects
• Pig, Hive, Jaql
• Other projects

▪ Hadoop Distributions

▪ BigInsights

Importance of Hadoop
▪ “We believe that more than half of the world’s data will be
stored in Apache Hadoop within five years”
– Hortonworks
Hardware improvements through the years...

▪ CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz

▪ RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)

▪ Disk Capacity
– 1990 – 20MB
– 2010 – 1TB

▪ Disk transfer rate (speed of reads and writes) – not much improvement in the last 7-10 years, currently around 70 – 80 MB/sec

How long will it take to read 1TB of data?


1TB (at 80 MB / sec):
1 disk - 3.4 hours
10 disks - 20 min
100 disks - 2 min
1000 disks - 12 sec
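
The figures above follow from simple division. The sketch below reproduces them, assuming 1 TB = 10^12 bytes, a sustained 80 MB/sec per disk, and data spread evenly across disks that are read in parallel; the function name `read_time_seconds` is illustrative only.

```python
ONE_TB = 10**12  # bytes, assuming decimal terabytes


def read_time_seconds(data_bytes, disks, mb_per_sec=80):
    """Seconds to scan data_bytes spread evenly over `disks` drives read in parallel."""
    bytes_per_sec = mb_per_sec * 1_000_000 * disks
    return data_bytes / bytes_per_sec


for disks in (1, 10, 100, 1000):
    secs = read_time_seconds(ONE_TB, disks)
    print(f"{disks:>4} disks: {secs:>8.1f} s ({secs / 3600:.2f} h)")
```

One disk needs 12,500 seconds (about 3.5 hours), while 1000 disks need 12.5 seconds. The speed-up is linear only under the even-spread assumption.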

Parallel Data Processing is the answer!


▪ It has been with us for a while:
– GRID computing - spreads processing load
– Distributed workload - hard to manage applications, overhead on developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc. (distribute the data)

▪ Distributed computing: Multiple computers appear as one super computer, communicate with each other by message passing, operate together to achieve a common goal

▪ Challenges
– Heterogeneity
– Openness
– Security
– Scalability
– Concurrency
– Fault tolerance
– Transparency

What is Hadoop?
▪ Apache open source software framework for reliable, scalable, distributed
computing of massive amounts of data
▪ Hides underlying system details and complexities from user
▪ Developed in Java

▪ Consists of 3 sub projects:
– MapReduce
– Hadoop Distributed File System a.k.a. HDFS
– Hadoop Common

▪ Supported by several Hadoop-related projects
– HBase
– Zookeeper
– Avro
– Etc.

▪ Meant for heterogeneous commodity hardware

Design principles of Hadoop


▪ New way of storing and processing the data:
– Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Relatively inexpensive hardware ($2 – 4K)

▪ Bring processing to Data!


▪ Hadoop = HDFS + MapReduce infrastructure
▪ Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
▪ Reliability provided through replication

Hadoop is not for all types of work


▪ Not to process transactions (random access)

▪ Not good when work cannot be parallelized


▪ Not good for low latency data access

▪ Not good for processing lots of small files

▪ Not good for intensive calculations with little data

Who uses Hadoop?


Hadoop / MapReduce timeline

Hadoop Open Source Projects
▪ Hadoop is supplemented by an ecosystem of open source projects, including Jaql and Oozie

What is Apache Hadoop?


▪ Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data

▪ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written

Two Key Aspects of Hadoop


▪ MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)

▪ Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
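
The MapReduce side can be pictured with a small single-process sketch of the programming model. This is not the Hadoop API. Word counting is an illustrative choice that does not come from the slides, and the function names are hypothetical. On a real cluster the map and reduce calls would be assigned to many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here we simply loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because each map call looks at only one line and each reduce call at only one key, the work can be split across machines without the pieces needing to talk to each other.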

What is the Hadoop Distributed File System?

▪ HDFS stores data across multiple nodes

▪ HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes

▪ The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.

(Very quick) Introduction to MapReduce


▪ Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the program

▪ Data is stored across the entire cluster (the DFS)


– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: the numbered blocks (1-4) of a logical file are distributed across the cluster, with each block stored on more than one node]
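
The idea of distributing and replicating blocks can be sketched with a toy placement routine. This is a simplification under stated assumptions: round-robin placement, a replication factor of 3, and node names invented for the example. Real HDFS placement policies are more involved, and none of these function names belong to Hadoop.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
blocks = split_into_blocks(b"x" * 1000, 256)   # 4 blocks, like the figure
placement = place_blocks(len(blocks), nodes)

print(placement)
print(readable_after_failure(placement, {"node1", "node2"}))
```

With three replicas on distinct nodes, the file stays readable after any two node failures. Losing three of the four nodes can make a block unavailable.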

Hadoop Common
▪ Formerly known as Hadoop Core

▪ Contains common utilities and libraries that support the other Hadoop sub projects
– File system
– Remote Procedure Call (RPC)
– Serialization

▪ E.g. file system shell
– To interact directly with HDFS files, you need to use
/bin/hdfs dfs <args>

hadoop fs -dus -h /user/hadoop/file1 hdfs://node/dir1

Hadoop – Installation Requirements


▪ Installation types:
– Single-node:
• simple operations
• local testing and debugging
– Multi-node cluster:
• production level operation
• thousands of nodes
▪ Hardware:
– Can use commodity hardware
– Best practice:
• RAM: MapReduce jobs mostly I/O bound, plan enough RAM
• CPU: high-end CPUs are often not cost-effective
• Disks: use high capacity disks as Hadoop is storage hungry
• Network: depends on workload, consider high-end network gear for large
clusters
▪ Software:
– OS:
• GNU / Linux for development and production
• Windows / Mac for development
– Java
– ssh

What’s a Hadoop Distribution?


▪ What’s a Linux Distribution?
– Linux Kernel
– Open Source Tools around Kernel
– Installer
– Administration UI
▪ Open Source Distribution Formula
– Kernel
– Core Projects around Kernel
– Value Add
• Test Components
• Installer
• Administration UI
• Apps

Hadoop is becoming the kernel of a distributed operating system

▪ WebSphere WAS
– > 25 Apache Projects + Additional Open Source + installer + IBM Value Add

IBM Enriches Hadoop

Hadoop:
▪ Scalable
– New nodes can be added on the fly
▪ Affordable
– Massively parallel computing on commodity servers
▪ Flexible
– Hadoop is schema-less, and can absorb any type of data
▪ Fault Tolerant
– Through MapReduce software framework

IBM adds:
▪ Performance & reliability
– Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
▪ Enterprise Hardening of Hadoop
▪ Productivity Accelerators
– Web-based UIs and tools
– End-user visualization
– Analytic Accelerators
– +++
▪ Enterprise Integration
– To extend & enrich your information supply chain

BigInsights: Value Beyond Open Source
▪ IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise.
– Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data.

– BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research.

– The result is that you get a more developer and user-friendly solution for complex, large scale analytics.

▪ InfoSphere BigInsights allows enterprises of all sizes to cost effectively manage and analyze the massive volume, variety and velocity of data that consumers and businesses create every day.

Integrated Installation
▪ Integrated installation of supported open source and IBM components.
– Seamless process for single node and cluster environments
– Integrated installation of all selected components

▪ Post-install validation of IBM and open source components

Roll Your Own
▪ Many disparate components
▪ Manual install
▪ Leg-work required
• What components?
• Which versions?

BigInsights (Easiest)
▪ No need to worry about components & versions
▪ Install requires very little interaction
▪ No extra prerequisites to download
▪ Single install
From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

[Figure: BigInsights editions plotted by breadth of capabilities and enterprise class, starting from Apache Hadoop]

▪ Quick Start
– Free. Non-production
▪ Standard Edition
– Spreadsheet-style tool
– Web console
– Dashboards
– Pre-built applications
– Eclipse tooling
– RDBMS connectivity
– Big SQL
– Jaql
– Platform enhancements
– ...
▪ Enterprise Edition
– Accelerators
– GPFS – FPO
– Adaptive MapReduce
– Text analytics
– Enterprise Integration
– Monitoring and alerts
– Big R
– InfoSphere Streams*
– Watson Explorer*
– Cognos BI*
– ...

* Limited use license

Analytics and Accelerators

▪ Administration Console
– Dashboards
– Applications
– BigSheets

▪ Analytics
– Text Analytics
– BigSQL
– BigR
▪ Accelerators
– Machine Data Analytics (MDA)
– Social Data Analytics (SDA)
– Telecommunications Event Data (TEDA) – Streams only


Overview of Web Console Capabilities


▪ Manage BigInsights
– Inspect / monitor system health
– Add / drop nodes
– Start / stop services
– Launch / monitor jobs
– Explore / modify file system
– Create custom dashboards
▪ Launch applications
– Spreadsheet-like analysis tool
– Pre-built applications (IBM supplied or user developed)

▪ Publish applications

▪ Monitor cluster, applications, data, etc.

Welcome Tab – your starting point


▪ Tasks: Where and how to begin performing common administrative or analytical tasks
▪ Quick links to common functions
▪ Learn more through external Web resources

Dashboards
▪ Monitor overall system, data, and application services
▪ Create your own dashboard with supplied or custom widgets

Applications
▪ Manage, execute, and link applications
– Browse available applications
– Deploy / undeploy applications
– Launch (or schedule for launch) a deployed application
– Monitor job (application) execution status
– Inspect application output
– Link or “chain” applications for sequential execution


BigSheets: Enhanced Data Discovery & Visualization Capabilities


▪ With a familiar, spreadsheet-like interface, BigSheets provides web based analysis and visualization for BigInsights users.
▪ It helps to analyze large amounts of data in an innovative and easy to use way and helps to design and manage long running data collection jobs.

▪ Version 2.1 broadens BigSheets …
– data discovery capability on unstructured text by providing built-in text analytics functions to extract names, addresses, organizations, email, and phone numbers …
– offers enhanced visualization (more customizations of charts & multi series charts)

Text Analytics – Highly Accurate Analysis of Unstructured Big Data

▪ How it works
– Parses text and detects meaning with annotators
– Understands the context in which the text is analyzed
– Hundreds of pre-built annotators for names, addresses, phone numbers, among others
• Out of box support: English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
– Distills structured info from unstructured text
• Sentiment analysis
• Consumer behavior
• Illegal or suspicious activities

Unstructured text (document, email, etc):
"Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
→ Classification and Insight

▪ Benefits
– More precise and correct answers
• 2x vs. marketplace alternatives
– 50% faster than manual method
• Used to build world-class text analysis applications
– Run faster text analysis
• 10x or more vs. marketplace alternatives
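
The notion of an annotator, a rule that finds spans of text and labels them, can be sketched with regular expressions. This is only an illustration of the idea and not how BigInsights Text Analytics is implemented. The two patterns, the `ANNOTATORS` table, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately simple annotators: label -> pattern.
ANNOTATORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def annotate(text):
    """Return (label, matched_text, start, end) records, sorted by position."""
    records = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            records.append((label, match.group(), match.start(), match.end()))
    return sorted(records, key=lambda record: record[2])


sample = "Contact Ana at ana@example.com or 555-010-1234."
for record in annotate(sample):
    print(record)
```

Each record is a structured row (label, value, offsets) distilled from free text, which is the kind of output that later analysis can aggregate or query.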

Big SQL: Native SQL Query Access for Hadoop


▪ Native SQL access to data stored in BigInsights
– ANSI SQL 92+
– Standard syntax support (joins, data types, …)

▪ Real JDBC/ODBC drivers
– Prepared statements
– Cancel support
– Database metadata API support
– Secure socket connections (SSL)

▪ Optimization
– Leveraging MapReduce parallelism
or…
– Direct access for low-latency queries

▪ Varied data sources
– HBase (including secondary indexes)
– CSV, Delimited files, Sequence files
– JSON
– Hive tables

[Figure: an Application sends SQL through the JDBC / ODBC Driver to the JDBC / ODBC Server and Big SQL Engine inside BigInsights, which reads from Data Sources: Hive Tables, HBase tables, CSV Files]

Big Data Accelerators - Summary


▪ Software components that accelerate development and/or implementation of specific use cases or solutions on top of the Big Data platform

▪ Provide business logic, data processing, and UI/visualization, tailored for a given use case

▪ Bundled with Big Data platform components – InfoSphere BigInsights and InfoSphere Streams

▪ They are templates, not Turnkey Solutions!

Key Benefits
– Time to value
– Leverage best practices around implementation of a given use case.

[Figure: IBM Big Data Platform. Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) sit on top of the platform layers: Visualization & Discovery, Applications & Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Contextual Search, Information Integration & Governance, Cloud | Mobile | Security]

Accelerators – Big Data Use Cases Supported


▪ Quickly build and deploy custom applications in high-value areas

▪ IBM Accelerator for Social Data Analytics – Enhanced 360 Degree view of customer
• Out of the box implementation tuned for Cross-Industry support
• Sample applications for Customer Acquisition / Retention / Segmentation, Marketing Campaign Optimization, Lead Generation, Brand Management, Surveillance, and Up Sell

▪ IBM Accelerator for Machine Data Analytics – Operational Analysis and Security-Intelligence Extension
• Out of the box implementation tuned for Cross-Industry support of Manufacturing, Oil & Gas, Energy & Utility, Healthcare, Travel, CPG, Transportation, and Retail sectors
• Operational Efficiency Monitoring, Security Incident Investigation, Proactive Maintenance, Troubleshooting, and Outage Prevention
Questions?
