02 Hadoop BigInsights
Agenda
▪ Hadoop architecture
– MapReduce
– HDFS
– Hadoop Common
– Ecosystem of related projects
• Pig, Hive, Jaql
• Other projects
▪ Hadoop Distributions
▪ BigInsights
Importance of Hadoop
▪ “We believe that more than half of the world’s data will be
stored in Apache Hadoop within five years”
– Hortonworks
Hardware improvements through the years...
▪ CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz
▪ RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)
▪ Disk Capacity
– 1990 – 20MB
– 2010 – 1TB
▪ Disk Latency (speed of reads and writes) – not much improvement in the last 7-10 years; currently around 70-80 MB/sec
▪ Challenges
– Heterogeneity
– Openness
– Security
– Scalability
– Concurrency
– Fault tolerance
– Transparency
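To make the hardware figures listed earlier concrete, the short calculation below estimates how long it takes to read a full 1 TB disk at roughly 75 MB/sec, and how that changes if the same data is spread over many disks read in parallel. This is only a back-of-the-envelope sketch: decimal units, the 75 MB/sec midpoint, and the 100-disk cluster are assumptions chosen for illustration, not figures from the slides.

```python
# Slide figures: 1 TB disk capacity (2010), about 70-80 MB/sec transfer rate.
DISK_BYTES = 1 * 10**12           # 1 TB in decimal units (assumption)
RATE_BYTES_PER_SEC = 75 * 10**6   # midpoint of 70-80 MB/sec (assumption)


def scan_seconds(total_bytes, rate, disks=1):
    """Seconds to read total_bytes spread evenly over `disks` drives read in parallel."""
    return total_bytes / (rate * disks)


one_disk = scan_seconds(DISK_BYTES, RATE_BYTES_PER_SEC)
hundred_disks = scan_seconds(DISK_BYTES, RATE_BYTES_PER_SEC, disks=100)  # hypothetical cluster size

print(f"1 disk:    {one_disk / 3600:.1f} hours")        # 3.7 hours
print(f"100 disks: {hundred_disks / 60:.1f} minutes")   # 2.2 minutes
```

Capacity grew far faster than transfer speed, so scanning a single large disk takes hours; spreading the data across many disks and reading them in parallel is one way to bring that down to minutes.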
What is Hadoop?
▪ Apache open source software framework for reliable, scalable, distributed
computing on massive amounts of data
▪ Hides underlying system details and complexities from user
▪ Developed in Java
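As a rough illustration of the MapReduce programming model named in the agenda, the sketch below counts words in three phases: map, shuffle, and reduce. It is plain single-process Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big insights", "hadoop stores big data"])
print(counts)
```

On a real cluster the framework, not the programmer, distributes the map and reduce tasks across nodes and moves the grouped data between them; that is the sense in which the system details are hidden from the user.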
[Figure: diagram of Hadoop-related projects; surviving labels: Jaql, Oozie]
[Figure: a logical file divided into numbered blocks of binary data]
Hadoop Common
▪ Formerly known as Hadoop Core
▪ Contains common utilities and libraries that support the other Hadoop
subprojects
– File system
– Remote Procedure Call (RPC)
– Serialization
▪ WebSphere WAS
– 25 > Apache Projects + Additional Open Source + installer + IBM Value Add
▪ Scalable
– New nodes can be added on the fly
▪ Affordable
– Massively parallel computing on commodity servers
▪ Flexible
– Hadoop is schema-less, and can absorb any type of data
▪ Fault Tolerant
– Through MapReduce software framework
▪ Performance & reliability
– Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
▪ Enterprise Hardening of Hadoop
▪ Productivity Accelerators
– Web-based UIs and tools
– End-user visualization
– Analytic Accelerators
– +++
▪ Enterprise Integration
– To extend & enrich your information supply chain
BigInsights: Value Beyond Open Source
▪ IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise.
– Apache Hadoop is the open source software framework used to reliably
manage large volumes of structured and unstructured data.
– The result is that you get a more developer- and user-friendly solution for
complex, large-scale analytics.
Integrated Installation
▪ Integrated installation of supported open source and IBM components.
– Seamless process for single node and cluster environments
– Integrated installation of all selected components
Roll Your Own
▪ Many disparate components
▪ Manual install
▪ Leg-work required
• What components?
• Which versions?
BigInsights (Easiest)
▪ Single install
• No need to worry about components & versions
▪ Install requires very little interaction
• No extra prerequisites to download
From Getting Started to Enterprise Deployment: Different BigInsights Editions for Varying Needs
[Figure: BigInsights editions arranged by breadth of capabilities and enterprise class: Apache Hadoop, Quick Start (free, non-production), Standard Edition, Enterprise Edition]
Features listed in the figure:
- Spreadsheet-style tool
- Web console
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Big SQL
- Jaql
- Platform enhancements
- ...
- Accelerators
- GPFS – FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Monitoring and alerts
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- ...
* Limited use license
▪ Administration Console
– Dashboards
– Applications
– BigSheets
▪ Analytics
– Text Analytics
– Big SQL
– Big R
▪ Accelerators
– Machine Data Analytics (MDA)
– Social Data Analytics (SDA)
– Telecommunications Event Data (TEDA) – Streams only
▪ Publish applications
▪ Monitor cluster, applications, data, etc.
Dashboards
▪ Monitor overall system, data, and application services
▪ Create your own dashboard with supplied or custom widgets
Applications
▪ Manage, execute, and link applications
– Browse available applications
– Deploy / undeploy applications
– Launch (or schedule for launch) a deployed application
– Monitor job (application) execution status
– Inspect application output
– Link or “chain” applications for sequential execution
▪ How it works
– Parses text and detects meaning with annotators
– Understands the context in which the text is analyzed
– Hundreds of pre-built annotators for names, addresses, phone numbers, among others
• Out-of-the-box support: English, Spanish, French, German,
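To give a feel for what an annotator does, the sketch below scans text for phone numbers and records each match with a label and its character span. It is a toy written with Python's standard `re` module, not the BigInsights text analytics engine or its annotator language; the `Annotation` type, the `annotate_phones` function, and the phone-number pattern are all hypothetical and cover only a couple of simple formats.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """One labeled span of text (hypothetical structure for this sketch)."""
    label: str
    text: str
    start: int
    end: int


# Hypothetical pattern: matches forms such as 555-123-4567 or (555) 123-4567.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")


def annotate_phones(text: str) -> List[Annotation]:
    """Return one PhoneNumber annotation per match, with character offsets."""
    return [
        Annotation("PhoneNumber", m.group(), m.start(), m.end())
        for m in PHONE.finditer(text)
    ]


found = annotate_phones("Call 555-123-4567 or (555) 987-6543 after 5pm.")
for annotation in found:
    print(annotation)
```

A production annotator also has to use context, for example to tell a phone number from an order number with the same shape, which a single regular expression cannot do.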
[Figure: Big SQL architecture; labels: Application, SQL, JDBC / ODBC Driver, JDBC / ODBC Server, Big SQL Engine, Data Sources]
[Figure: labels: BigInsights, Accelerators, Systems, Cloud | Mobile | Security, Time to value]
Questions?