Big Data Notes
Big Data refers to datasets that are so large and complex that they are difficult to process, analyze, and manage with traditional data-processing techniques and tools. These datasets are generated from many different sources and grow continuously, which makes handling them a challenge but also an opportunity to derive valuable insights.
Types of Big Data
Big Data can be classified into different types based on its nature and structure. The classification has three main categories: Structured Data, Unstructured Data, and Semi-Structured Data. Examples are given with each type:
1. Structured Data
Definition: Structured Data is data organized in a defined format and structure, such as rows and columns. This data can easily be stored and managed in relational databases.
Example:
o Bank customer records (Customer Name, Account Number, Balance).
o Employee database (Employee ID, Name, Salary).
Source:
o Relational databases (SQL), spreadsheets.
2. Unstructured Data
Definition: Unstructured Data is data with no predefined format or fixed schema, which makes it harder to store and analyze with traditional tools.
Example:
o Images, videos, and audio files.
o Emails and social media posts.
Source:
o Social media platforms, multimedia files, documents.
3. Semi-Structured Data
Definition: Semi-Structured Data does not follow a strict tabular structure, but it contains tags or markers that label its elements, so some structure can still be recovered.
Example:
o JSON and XML files.
o Email (structured headers plus a free-text body).
Source:
o Web APIs, configuration files, NoSQL databases.
A small sketch contrasting structured and semi-structured records is shown below.
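A minimal sketch in plain Python contrasting the two forms (the field names and values are invented for this illustration): a structured record fits fixed columns, while semi-structured JSON documents share tags but not a fixed schema.

import json

# Structured: every record has the same fixed columns, like a table row.
# Columns: (Account Number, Customer Name, Balance)
structured_rows = [
    ("A001", "Asha", 50000.0),
    ("A002", "Ravi", 12000.5),
]

# Semi-structured: both documents carry labeled fields, but one nests an
# address while the other omits it and adds a phone list instead.
semi_structured_docs = [
    {"id": "A001", "name": "Asha", "address": {"city": "Pune"}},
    {"id": "A002", "name": "Ravi", "phones": ["+91-9000000000"]},
]

for doc in semi_structured_docs:
    print(json.dumps(doc))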
Big Data Technologies
1. Data Storage
2. Data Mining
3. Data Analytics
4. Data Visualization
Components of a Big Data Architecture
1. Data Source
Definition: A data source is the place from which data is collected or generated.
Examples: Relational databases, IoT devices, social media platforms, logs.
Purpose: It is the starting point for data collection and analysis.
2. Data Storage
Definition: Storing data in a system so that it can be retrieved and processed in the future.
Examples: HDFS (Hadoop Distributed File System), Amazon S3, Google Cloud Storage.
Purpose: To store large volumes of data for later use in analytics and processing. A small HDFS access sketch is given below.
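A minimal sketch of writing and reading HDFS files from Python, assuming the third-party HdfsCLI package (hdfs) is installed and a WebHDFS endpoint is reachable at the hypothetical address below:

from hdfs import InsecureClient  # pip install hdfs (HdfsCLI)

# The NameNode WebHDFS URL and user name are assumptions for this sketch
# (9870 is the usual default WebHDFS port).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.makedirs("/demo")

# Write a small text file into HDFS.
with client.write("/demo/hello.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello from hdfs\n")

# Read it back and list the directory contents.
with client.read("/demo/hello.txt", encoding="utf-8") as reader:
    print(reader.read())
print(client.list("/demo"))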
3. Batch Processing
Definition: Processing large volumes of collected data together in batches, typically on a schedule rather than instantly.
Examples: MapReduce, Apache Spark batch jobs.
Purpose: Efficient processing of historical or accumulated data when results are not needed immediately.
6. Stream Processing
Definition: Processing data continuously as it arrives, in real time or near real time.
Examples: Apache Kafka, Apache Flink, Spark Streaming.
Purpose: Enabling immediate analysis of and timely reaction to incoming events.
Key Features:
1. Volume: Handles massive amounts of data (terabytes or petabytes).
2. Variety: Works with structured, semi-structured, and unstructured data.
3. Velocity: Analyzes data in real time or near real time so that timely decision-making is possible.
4. Veracity: Ensures data accuracy and quality so that insights are reliable.
5. Value: Extracts meaningful insights that drive business growth and efficiency.
Introduction to Hadoop
Hadoop is an open-source framework used to store and process large datasets in a distributed manner. It is designed mainly for large-scale data storage and processing, and it ensures high availability and fault tolerance. Hadoop was developed by the Apache Software Foundation.
NameNode
It is the single master server that exists in an HDFS cluster.
Because it is a single node, it can become a single point of failure.
It manages the file system namespace, such as opening, renaming, and closing files.
This simplifies the overall system architecture.
DataNode
DataNodes are the worker nodes of HDFS; they store the actual data blocks.
They serve read and write requests from clients and perform block creation, deletion, and replication as instructed by the NameNode.
Job Tracker
The Job Tracker's role is to accept MapReduce jobs from clients and to process the data, using the NameNode to locate it.
In response, the NameNode provides the Job Tracker with the required metadata.
Task Tracker
The Task Tracker works as a slave node to the Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into play when a client application submits a MapReduce job to the Job Tracker.
In response, the Job Tracker forwards the request to the appropriate Task Trackers.
Sometimes a Task Tracker fails or times out. In such a case, that part of the job is rescheduled. A word-count sketch in the MapReduce style follows below.
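A minimal word-count sketch written as a Hadoop Streaming-compatible Python script (the file name wordcount.py is an assumption; Hadoop Streaming itself defines this stdin/stdout contract, where the map phase emits "word<TAB>1" lines and the framework sorts them by key before the reduce phase sums the counts):

#!/usr/bin/env python3
# wordcount.py - run as "python3 wordcount.py map" or "python3 wordcount.py reduce".
import sys

def mapper():
    # Emit a (word, 1) pair for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Locally the same logic can be tested with a shell pipeline that imitates the shuffle step: cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce.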
Master Node:
The master node is responsible for managing and coordinating tasks in a Hadoop cluster.
It oversees the operation of the entire system, such as managing the cluster's resources, scheduling jobs, and handling job failures.
Examples: NameNode and Job Tracker.
Slave Node:
A slave node is responsible for executing the tasks assigned to it by the master node.
It performs the actual data storage, processing, and reporting operations.
Examples: DataNode and Task Tracker.
Hadoop Components
The Hadoop architecture has some key components that handle large-scale data processing and storage efficiently. Used together, these components make Hadoop powerful and scalable.
1. HDFS (Hadoop Distributed File System)
Stores very large files as blocks distributed across the cluster, with replication for fault tolerance.
2. MapReduce
A programming model that processes data in two phases: Map (transform input into key-value pairs) and Reduce (aggregate the values per key).
3. YARN (Yet Another Resource Negotiator)
Key Features: Dynamic resource allocation, multi-tenant environment.
Components:
o ResourceManager: Manages the resources of the whole cluster.
o NodeManager: Monitors the resources of individual cluster nodes.
A small sketch of querying YARN's REST API follows below.
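A minimal sketch of checking cluster state through the ResourceManager's REST API (the host name is an assumption, 8088 is the usual default web port; the /ws/v1/cluster endpoints are part of YARN's ResourceManager REST API; requires the third-party requests package):

import requests  # pip install requests

# ResourceManager web address is an assumption for this sketch.
RM = "http://resourcemanager-host:8088"

# Overall cluster information.
info = requests.get(f"{RM}/ws/v1/cluster/info").json()
print(info["clusterInfo"]["state"])

# List submitted applications; "apps" is null when no apps exist.
apps = requests.get(f"{RM}/ws/v1/cluster/apps").json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])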
Hadoop Ecosystem
The Hadoop Ecosystem is a collection of tools and technologies that work on top of the Hadoop framework and are designed to manage large-scale data processing and storage. This ecosystem handles distributed data storage, processing, and analysis efficiently.
Key Components of Hadoop Ecosystem:
HDFS (storage), YARN (resource management), MapReduce (processing), Hive (SQL-like querying), Pig (data-flow scripting), HBase (NoSQL database), Sqoop (RDBMS import/export), Flume (log ingestion), Oozie (workflow scheduling), and Zookeeper (coordination).
What is HBase?
HBase is an open-source, distributed, column-oriented NoSQL database that runs on top of HDFS and is modeled on Google's Bigtable. Its key features:
1. Scalability:
o Supports horizontal scaling, where data capacity can easily be increased by adding more nodes.
2. Real-Time Access:
o Unlike HDFS, HBase supports real-time data read/write operations.
3. Column-Oriented Storage:
o HBase is a column-oriented database, in which data is stored in a column format instead of rows.
4. Automatic Sharding:
o Data is automatically sharded across the region servers.
5. Fault-Tolerance:
o Through its distributed nature and its use of HDFS, HBase provides high fault tolerance.
6. Supports Massive Datasets:
o It can handle structured and semi-structured data up to terabytes and petabytes in size. A small client sketch follows below.
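A minimal sketch of real-time reads and writes against HBase from Python, assuming the third-party happybase package and an HBase Thrift server running at the hypothetical host below:

import happybase  # pip install happybase; needs the HBase Thrift server

# The Thrift server address is an assumption for this sketch.
connection = happybase.Connection("hbase-thrift-host", port=9090)

# Create a table with one column family (skip if it already exists).
if b"users" not in connection.tables():
    connection.create_table("users", {"info": dict()})

table = connection.table("users")

# Put: write a row; column names are "family:qualifier".
table.put(b"user1", {b"info:name": b"Asha", b"info:city": b"Pune"})

# Get: read a single row in real time.
print(table.row(b"user1"))

# Scan: iterate over rows (HBase keeps rows sorted by key).
for key, data in table.scan(limit=10):
    print(key, data)

connection.close()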
What is NoSQL?
NoSQL (Not Only SQL) is a modern database approach that serves as an alternative to traditional relational databases (RDBMS). It is designed to store and manage unstructured, semi-structured, and structured data. NoSQL databases provide scalability, flexibility, and high performance, which makes them ideal for Big Data and real-time applications.
CRUD Operations in MongoDB
MongoDB is a popular document-oriented NoSQL database; its basic operations are Create, Read, Update, and Delete (CRUD):
1. Create Operations
Create or insert operations are used to insert or add new documents to a collection. If the collection does not exist, MongoDB creates a new collection in the database.
2. Read Operations
Read operations are used to retrieve documents from a collection, i.e., to query the collection and read a document from it.
3. Update Operations
Update operations are used to update or modify existing documents. They are performed through the methods MongoDB provides.
4. Delete Operations
Delete operations are used to delete or remove documents from a collection. They are performed through the methods MongoDB provides. A short pymongo sketch of all four operations follows below.
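A minimal CRUD sketch using the official pymongo driver, assuming a MongoDB server on localhost; the database and collection names are invented for this example:

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
users = client["notes_db"]["users"]  # created lazily on first insert

# Create: insert one document (the collection is created if missing).
users.insert_one({"name": "Asha", "city": "Pune", "balance": 50000})

# Read: query for a matching document.
print(users.find_one({"name": "Asha"}))

# Update: modify fields of the first matching document.
users.update_one({"name": "Asha"}, {"$set": {"balance": 55000}})

# Delete: remove the first matching document.
users.delete_one({"name": "Asha"})

client.close()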
ETL (Extract, Transform, Load)
1. Extract:
o Data is extracted from various sources (databases, APIs, IoT devices, etc.).
2. Transform:
o The extracted data is cleaned and transformed, e.g., aggregation, conversion, and handling of missing values.
Example: Converting JSON data into a structured format.
3. Load:
o The transformed data is loaded into a data warehouse or data lake. A small end-to-end sketch follows below.
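A minimal end-to-end ETL sketch in plain Python using only the standard library; the inline JSON input and the SQLite table stand in for a real source and warehouse and are assumptions of this example:

import json
import sqlite3

# Extract: in a real pipeline this would come from an API, database, or device.
raw = '[{"id": "C001", "balance": "50000"}, {"id": "C002", "balance": null}]'
records = json.loads(raw)

# Transform: convert types and handle missing values (default balance to 0).
rows = [(r["id"], float(r["balance"]) if r["balance"] is not None else 0.0)
        for r in records]

# Load: write the structured rows into a warehouse table (SQLite here).
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
db.executemany("INSERT OR REPLACE INTO accounts VALUES (?, ?)", rows)
db.commit()
print(db.execute("SELECT * FROM accounts").fetchall())
db.close()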
Advantages of NoSQL
1. Scalability:
o Need: A scalable database is required to handle ever-growing data.
o Benefit: NoSQL scales easily, handling more data by adding more servers.
2. Flexibility:
o Need: The data structure has to be changed quickly.
o Benefit: NoSQL provides a flexible schema, so the data can be modified quickly.
Example: Social media platforms that add new types of user data.
3. High Performance:
o Need: Real-time processing is important.
o Benefit: NoSQL offers fast data access and low-latency operations.
5. Cost Efficiency:
o Need: A low-cost database solution is required.
o Benefit: NoSQL can run on inexpensive hardware and reduces infrastructure cost.