Big Data Notes
Big Data refers to datasets that are so large and complex that they are difficult to process, analyze, and manage with traditional data-processing techniques and tools. These datasets are generated from many different sources and grow continuously, which makes handling them a challenge but also an opportunity to derive valuable insights.
Types of Big Data
Big Data can be classified into different types based on its nature and structure. The classification has three main categories: Structured Data, Unstructured Data, and Semi-Structured Data. Examples are given with each type:
1. Structured Data
Definition: Structured Data is data organized in a defined format and structure, such as rows and columns. This data can easily be stored and managed in relational databases.
Example:
o Bank customer records (Customer Name, Account Number, Balance).
o Employee database (Employee ID, Name, Salary).
Source:
o Relational databases (SQL), spreadsheets.
2. Unstructured Data
Definition: Unstructured Data is data with no predefined format or fixed schema, which makes it harder to store and analyze with traditional tools.
Example:
o Images, videos, and audio files.
o Emails and social media posts.
Source:
o Social media platforms, multimedia files, documents.
3. Semi-Structured Data
Definition: Semi-Structured Data does not follow a strict tabular structure, but it contains tags or markers that label its elements, so some structure can still be recovered.
Example:
o JSON and XML files.
o Email (structured headers plus a free-text body).
Source:
o Web APIs, configuration files, NoSQL databases.
A small sketch contrasting structured and semi-structured records is shown below.
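A minimal sketch in plain Python contrasting the two forms (the field names and values are invented for this illustration): a structured record fits fixed columns, while semi-structured JSON documents share tags but not a fixed schema.

import json

# Structured: every record has the same fixed columns, like a table row.
# Columns: (Account Number, Customer Name, Balance)
structured_rows = [
    ("A001", "Asha", 50000.0),
    ("A002", "Ravi", 12000.5),
]

# Semi-structured: both documents carry labeled fields, but one nests an
# address while the other omits it and adds a phone list instead.
semi_structured_docs = [
    {"id": "A001", "name": "Asha", "address": {"city": "Pune"}},
    {"id": "A002", "name": "Ravi", "phones": ["+91-9000000000"]},
]

for doc in semi_structured_docs:
    print(json.dumps(doc))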
Big Data Technologies
1. Data Storage
2. Data Mining
3. Data Analytics
4. Data Visualization
Components of a Big Data Architecture
1. Data Source
Definition: A data source is the place from which data is collected or generated.
Examples: Relational databases, IoT devices, social media platforms, logs.
Purpose: It is the starting point for data collection and analysis.
2. Data Storage
Definition: Storing data in a system so that it can be retrieved and processed in the future.
Examples: HDFS (Hadoop Distributed File System), Amazon S3, Google Cloud Storage.
Purpose: To store large volumes of data for later use in analytics and processing. A small HDFS access sketch is given below.
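A minimal sketch of writing and reading HDFS files from Python, assuming the third-party HdfsCLI package (hdfs) is installed and a WebHDFS endpoint is reachable at the hypothetical address below:

from hdfs import InsecureClient  # pip install hdfs (HdfsCLI)

# The NameNode WebHDFS URL and user name are assumptions for this sketch
# (9870 is the usual default WebHDFS port).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.makedirs("/demo")

# Write a small text file into HDFS.
with client.write("/demo/hello.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello from hdfs\n")

# Read it back and list the directory contents.
with client.read("/demo/hello.txt", encoding="utf-8") as reader:
    print(reader.read())
print(client.list("/demo"))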
3. Batch Processing
Definition: Processing large volumes of collected data together in batches, typically on a schedule rather than instantly.
Examples: MapReduce, Apache Spark batch jobs.
Purpose: Efficient processing of historical or accumulated data when results are not needed immediately.
6. Stream Processing
Definition: Processing data continuously as it arrives, in real time or near real time.
Examples: Apache Kafka, Apache Flink, Spark Streaming.
Purpose: Enabling immediate analysis of and timely reaction to incoming events.
Key Features:
1. Volume: Handles massive amounts of data (terabytes or petabytes).
2. Variety: Works with structured, semi-structured, and unstructured data.
3. Velocity: Analyzes data in real time or near real time so that timely decision-making is possible.
4. Veracity: Ensures data accuracy and quality so that insights are reliable.
5. Value: Extracts meaningful insights that drive business growth and efficiency.
Introduction to Hadoop
Hadoop is an open-source framework used to store and process large datasets in a distributed manner. It is designed mainly for large-scale data storage and processing, and it ensures high availability and fault tolerance. Hadoop was developed by the Apache Software Foundation.
NameNode
It is the single master server that exists in an HDFS cluster.
Because it is a single node, it can become a single point of failure.
It manages the file system namespace, such as opening, renaming, and closing files.
This simplifies the overall system architecture.
DataNode
DataNodes are the worker nodes of HDFS; they store the actual data blocks.
They serve read and write requests from clients and perform block creation, deletion, and replication as instructed by the NameNode.
Job Tracker
The Job Tracker's role is to accept MapReduce jobs from clients and to process the data, using the NameNode to locate it.
In response, the NameNode provides the Job Tracker with the required metadata.
Task Tracker
The Task Tracker works as a slave node to the Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into play when a client application submits a MapReduce job to the Job Tracker.
In response, the Job Tracker forwards the request to the appropriate Task Trackers.
Sometimes a Task Tracker fails or times out. In such a case, that part of the job is rescheduled. A word-count sketch in the MapReduce style follows below.
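A minimal word-count sketch written as a Hadoop Streaming-compatible Python script (the file name wordcount.py is an assumption; Hadoop Streaming itself defines this stdin/stdout contract, where the map phase emits "word<TAB>1" lines and the framework sorts them by key before the reduce phase sums the counts):

#!/usr/bin/env python3
# wordcount.py - run as "python3 wordcount.py map" or "python3 wordcount.py reduce".
import sys

def mapper():
    # Emit a (word, 1) pair for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Locally the same logic can be tested with a shell pipeline that imitates the shuffle step: cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce.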
Master Node:
The master node is responsible for managing and coordinating tasks in a Hadoop cluster.
It oversees the operation of the entire system, such as managing the cluster's resources, scheduling jobs, and handling job failures.
Examples: NameNode and Job Tracker.
Slave Node:
A slave node is responsible for executing the tasks assigned to it by the master node.
It performs the actual data storage, processing, and reporting operations.
Examples: DataNode and Task Tracker.
Hadoop Components
The Hadoop architecture has some key components that handle large-scale data processing and storage efficiently. Used together, these components make Hadoop powerful and scalable.
1. HDFS (Hadoop Distributed File System)
Stores very large files as blocks distributed across the cluster, with replication for fault tolerance.
2. MapReduce
A programming model that processes data in two phases: Map (transform input into key-value pairs) and Reduce (aggregate the values per key).
3. YARN (Yet Another Resource Negotiator)
Key Features: Dynamic resource allocation, multi-tenant environment.
Components:
o ResourceManager: Manages the resources of the whole cluster.
o NodeManager: Monitors the resources of individual cluster nodes.
A small sketch of querying YARN's REST API follows below.
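A minimal sketch of checking cluster state through the ResourceManager's REST API (the host name is an assumption, 8088 is the usual default web port; the /ws/v1/cluster endpoints are part of YARN's ResourceManager REST API; requires the third-party requests package):

import requests  # pip install requests

# ResourceManager web address is an assumption for this sketch.
RM = "http://resourcemanager-host:8088"

# Overall cluster information.
info = requests.get(f"{RM}/ws/v1/cluster/info").json()
print(info["clusterInfo"]["state"])

# List submitted applications; "apps" is null when no apps exist.
apps = requests.get(f"{RM}/ws/v1/cluster/apps").json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])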
Hadoop Ecosystem
The Hadoop Ecosystem is a collection of tools and technologies that work on top of the Hadoop framework and are designed to manage large-scale data processing and storage. This ecosystem handles distributed data storage, processing, and analysis efficiently.
Key Components of Hadoop Ecosystem:
HDFS (storage), YARN (resource management), MapReduce (processing), Hive (SQL-like querying), Pig (data-flow scripting), HBase (NoSQL database), Sqoop (RDBMS import/export), Flume (log ingestion), Oozie (workflow scheduling), and Zookeeper (coordination).
What is HBase?
HBase is an open-source, distributed, column-oriented NoSQL database that runs on top of HDFS and is modeled on Google's Bigtable. Its key features:
1. Scalability:
o Supports horizontal scaling, where data capacity can easily be increased by adding more nodes.
2. Real-Time Access:
o Unlike HDFS, HBase supports real-time data read/write operations.
3. Column-Oriented Storage:
o HBase is a column-oriented database, in which data is stored in a column format instead of rows.
4. Automatic Sharding:
o Data is automatically sharded across the region servers.
5. Fault-Tolerance:
o Through its distributed nature and its use of HDFS, HBase provides high fault tolerance.
6. Supports Massive Datasets:
o It can handle structured and semi-structured data up to terabytes and petabytes in size. A small client sketch follows below.
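A minimal sketch of real-time reads and writes against HBase from Python, assuming the third-party happybase package and an HBase Thrift server running at the hypothetical host below:

import happybase  # pip install happybase; needs the HBase Thrift server

# The Thrift server address is an assumption for this sketch.
connection = happybase.Connection("hbase-thrift-host", port=9090)

# Create a table with one column family (skip if it already exists).
if b"users" not in connection.tables():
    connection.create_table("users", {"info": dict()})

table = connection.table("users")

# Put: write a row; column names are "family:qualifier".
table.put(b"user1", {b"info:name": b"Asha", b"info:city": b"Pune"})

# Get: read a single row in real time.
print(table.row(b"user1"))

# Scan: iterate over rows (HBase keeps rows sorted by key).
for key, data in table.scan(limit=10):
    print(key, data)

connection.close()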
What is NoSQL?
NoSQL (Not Only SQL) is a modern database approach that serves as an alternative to traditional relational databases (RDBMS). It is designed to store and manage unstructured, semi-structured, and structured data. NoSQL databases provide scalability, flexibility, and high performance, which makes them ideal for Big Data and real-time applications.
CRUD Operations in MongoDB
MongoDB is a popular document-oriented NoSQL database; its basic operations are Create, Read, Update, and Delete (CRUD):
1. Create Operations
Create or insert operations are used to insert or add new documents to a collection. If the collection does not exist, MongoDB creates a new collection in the database.
2. Read Operations
Read operations are used to retrieve documents from a collection, i.e., to query the collection and read a document from it.
3. Update Operations
Update operations are used to update or modify existing documents. They are performed through the methods MongoDB provides.
4. Delete Operations
Delete operations are used to delete or remove documents from a collection. They are performed through the methods MongoDB provides. A short pymongo sketch of all four operations follows below.
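A minimal CRUD sketch using the official pymongo driver, assuming a MongoDB server on localhost; the database and collection names are invented for this example:

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
users = client["notes_db"]["users"]  # created lazily on first insert

# Create: insert one document (the collection is created if missing).
users.insert_one({"name": "Asha", "city": "Pune", "balance": 50000})

# Read: query for a matching document.
print(users.find_one({"name": "Asha"}))

# Update: modify fields of the first matching document.
users.update_one({"name": "Asha"}, {"$set": {"balance": 55000}})

# Delete: remove the first matching document.
users.delete_one({"name": "Asha"})

client.close()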
ETL (Extract, Transform, Load)
1. Extract:
o Data is extracted from various sources (databases, APIs, IoT devices, etc.).
2. Transform:
o The extracted data is cleaned and transformed, e.g., aggregation, conversion, and handling of missing values.
Example: Converting JSON data into a structured format.
3. Load:
o The transformed data is loaded into a data warehouse or data lake. A small end-to-end sketch follows below.
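A minimal end-to-end ETL sketch in plain Python using only the standard library; the inline JSON input and the SQLite table stand in for a real source and warehouse and are assumptions of this example:

import json
import sqlite3

# Extract: in a real pipeline this would come from an API, database, or device.
raw = '[{"id": "C001", "balance": "50000"}, {"id": "C002", "balance": null}]'
records = json.loads(raw)

# Transform: convert types and handle missing values (default balance to 0).
rows = [(r["id"], float(r["balance"]) if r["balance"] is not None else 0.0)
        for r in records]

# Load: write the structured rows into a warehouse table (SQLite here).
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
db.executemany("INSERT OR REPLACE INTO accounts VALUES (?, ?)", rows)
db.commit()
print(db.execute("SELECT * FROM accounts").fetchall())
db.close()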
Advantages of NoSQL
1. Scalability:
o Need: A scalable database is required to handle ever-growing data.
o Benefit: NoSQL scales easily, handling more data by adding more servers.
2. Flexibility:
o Need: The data structure has to be changed quickly.
o Benefit: NoSQL provides a flexible schema, so the data can be modified quickly.
Example: Social media platforms that add new types of user data.
3. High Performance:
o Need: Real-time processing is important.
o Benefit: NoSQL offers fast data access and low-latency operations.
5. Cost Efficiency:
o Need: A low-cost database solution is required.
o Benefit: NoSQL can run on inexpensive hardware and reduces infrastructure cost.