BDA_UNIT-IV
Why Hadoop?
Hadoop is better suited for big data analytics and machine learning
tasks than an RDBMS, which struggles to manage the
volume, variety, and velocity of big data.
Hadoop Overview
Hadoop is an open-source software framework designed to store and
process massive amounts of data in a distributed manner using clusters
of commodity hardware. Its primary tasks are distributed storage
(through HDFS) and distributed processing (through MapReduce).
NameNode
o EditLog: a transaction log maintained by the NameNode that records
every change made to the file system metadata.
o Single Point of Failure: the NameNode is a single point of failure;
if it goes down, the entire file system becomes inaccessible.
Client Application: The client interacts with the HDFS through the
Hadoop File System Client, which communicates with the
NameNode to manage file operations.
o For example: when a client wants to read a file, it first asks the
NameNode for the locations of the file's blocks.
DataNode
Communication:
o DataNodes send heartbeat messages to the NameNode at
regular intervals to confirm they are active and functional.
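The heartbeat mechanism can be sketched as a toy liveness monitor. This is an illustration, not Hadoop's implementation: the class name, the 10-second timeout, and the node IDs are all invented for the example (HDFS's actual default heartbeat interval is 3 seconds).

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is presumed dead (illustrative)

class NameNodeMonitor:
    """Toy sketch of how a NameNode might track DataNode liveness."""

    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of its last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        # Each heartbeat refreshes the DataNode's "last seen" timestamp.
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def live_datanodes(self, now=None):
        # A DataNode is live if it reported within the timeout window.
        now = now if now is not None else time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts <= HEARTBEAT_TIMEOUT]

monitor = NameNodeMonitor()
monitor.receive_heartbeat("dn1", now=100.0)
monitor.receive_heartbeat("dn2", now=105.0)
print(monitor.live_datanodes(now=112.0))  # dn1 timed out; prints ['dn2']
```

A real NameNode also uses these heartbeats to carry block reports and commands back to the DataNodes; the sketch models only the liveness check.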
NameNode Operations
Anatomy of File Read
1. The client calls open() on the DistributedFileSystem.
2. The DistributedFileSystem contacts the NameNode via RPC to get the
locations of the blocks of the file.
3. The client calls read() on the FSDataInputStream returned by open().
4. The stream connects to the closest DataNode holding each block in
turn and streams the data back to the client.
5. After all blocks are read, the client closes the FSDataInputStream.
Anatomy of File Write (Figure 5.19)
1. The client calls create() on the DistributedFileSystem to create a
new file.
2. The DistributedFileSystem makes an RPC call to the NameNode to
create the file in the namespace, with no blocks associated with it yet.
3. As the client writes data, it is split into packets, which are queued
and streamed to a pipeline of DataNodes allocated by the NameNode.
4. Each DataNode in the pipeline stores a packet and forwards it to the
next DataNode in the pipeline.
5. A packet is removed from the acknowledgement queue only after every
DataNode in the pipeline has acknowledged it.
6. When the client finishes writing, it calls close(), and the NameNode
is notified that the file is complete.
The output from map tasks is shuffled and sorted based on keys.
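The shuffle-and-sort step above can be illustrated with a minimal in-memory word-count sketch. This is plain Python, not the Hadoop API; the function names and the sample input lines are invented for the example.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key; sort keys so each
    # reducer sees its keys in order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {key: sum(values) for key, values in grouped}

lines = ["big data big", "data tools"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1}
```

In real Hadoop the shuffle moves map output across the network to the reducer nodes; here the grouping and sorting happen in one process, but the data flow is the same.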
JobTracker
The JobTracker is the master daemon of MapReduce: it accepts jobs from
clients, schedules their map and reduce tasks on TaskTrackers, monitors
progress, and re-runs failed tasks.
Hadoop Configuration Files
1.1. core-site.xml - core Hadoop settings, such as the default file
system URI (fs.defaultFS).
1.2. hdfs-site.xml - HDFS settings, such as the replication factor and
the NameNode/DataNode storage directories.
1.3. mapred-site.xml - MapReduce framework settings.
1.4. yarn-site.xml - YARN ResourceManager and NodeManager settings.
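As an example of the format these files share, a minimal core-site.xml might look like the sketch below. The localhost URI is an illustrative single-node value, not a required setting; `fs.defaultFS` is the standard property name.

```xml
<?xml version="1.0"?>
<!-- core-site.xml: core Hadoop settings -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- default file system URI (single-node example value) -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

The other three files use the same `<configuration>`/`<property>` structure with their own property names.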
3. Column-Oriented Storage
5. Strong Consistency
7. Fault Tolerance
o Built on HDFS, meaning data is replicated across multiple
nodes for reliability.
Row-Based vs Column-Based Query:
Table: users (with columns such as Name, Age, City)
Advantage:
If we need only "Age", HBase can scan just the "Age" column
instead of reading every entire row.
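The advantage can be illustrated with a toy comparison of the two layouts. These are plain Python dictionaries, not HBase's actual storage format, and the sample names and values are invented.

```python
# Row-oriented layout: each row stores all of its columns together.
rows = [
    {"Name": "Asha", "Age": 29, "City": "Pune"},
    {"Name": "Ravi", "Age": 34, "City": "Delhi"},
]

# Column-oriented layout: each column's values are stored together.
columns = {
    "Name": ["Asha", "Ravi"],
    "Age": [29, 34],
    "City": ["Pune", "Delhi"],
}

# Row store: reading every Age means touching every cell of every row.
ages_row_store = [row["Age"] for row in rows]
cells_touched_row = sum(len(row) for row in rows)   # 6 cells

# Column store: only the Age column is read.
ages_col_store = columns["Age"]
cells_touched_col = len(columns["Age"])             # 2 cells

print(ages_row_store, cells_touched_row)  # [29, 34] 6
print(ages_col_store, cells_touched_col)  # [29, 34] 2
```

Both layouts return the same answer, but the column store touches a third of the data; with wide tables and billions of rows that ratio is what makes column-oriented scans cheap.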
Architecture of HBase
HBase architecture has 3 main components: HMaster, Region
Server, Zookeeper.
HMaster (Master Server in HBase)
The HMaster is the main process that manages the HBase cluster. It is
responsible for:
- Assigning Regions to Region Servers.
- Performing DDL (Data Definition Language) operations such as
creating, deleting, and modifying tables.
- Monitoring all Region Servers in the cluster.
- Running background threads for tasks like load balancing and failover
handling.
- Handling automatic region splitting when a region grows beyond a
predefined size.
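Automatic region splitting can be sketched as a size-triggered split of a sorted key range. The 10-row threshold, the key format, and the function name are illustrative only; real HBase splits by on-disk store-file size, not row count.

```python
MAX_REGION_SIZE = 10  # rows per region before a split triggers (illustrative)

def maybe_split(region):
    """Split a region's sorted row keys in half once it exceeds the threshold."""
    keys = region["keys"]
    if len(keys) <= MAX_REGION_SIZE:
        return [region]  # still small enough; no split
    mid = len(keys) // 2
    # Two daughter regions, each covering half of the parent's key range.
    return [{"keys": keys[:mid]}, {"keys": keys[mid:]}]

region = {"keys": [f"row{i:02d}" for i in range(12)]}
daughters = maybe_split(region)
print(len(daughters))            # 2
print(daughters[0]["keys"][-1])  # row05
print(daughters[1]["keys"][0])   # row06
```

After a real split, the HMaster reassigns the two daughter regions to Region Servers, which is why splitting and assignment are both listed among its duties.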
Zookeeper
- It acts as a coordinator for the HBase cluster.
- It provides services such as maintaining configuration information,
naming, distributed synchronization, and server-failure notification.
HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop
that allows users to process and analyze large datasets using SQL-like
queries. Instead of writing complex MapReduce programs, users can
write Hive Query Language (HiveQL), which is then converted into
MapReduce, Tez, or Spark jobs for execution.
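To make the translation concrete, the sketch below pairs a HiveQL query with the map and reduce logic it conceptually compiles to. The `employees` rows are invented sample data, and real Hive generates far more machinery than this; the sketch shows only the shape of the compiled job.

```python
from collections import defaultdict

# The HiveQL a user writes instead of MapReduce code:
hiveql = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

employees = [("Asha", "HR"), ("Ravi", "IT"), ("Meena", "IT")]  # invented rows

# Conceptually, Hive compiles the query into a map step and a reduce step:
def map_step(rows):
    for name, dept in rows:
        yield (dept, 1)        # emit (group-by key, 1)

def reduce_step(pairs):
    counts = defaultdict(int)
    for dept, one in pairs:
        counts[dept] += one    # COUNT(*) per dept
    return dict(counts)

print(reduce_step(map_step(employees)))  # {'HR': 1, 'IT': 2}
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reducer, which is the general pattern for HiveQL aggregation queries.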
Apache Pig
Apache Pig is a high-level scripting language used for processing large
datasets in Hadoop. It provides an easy way to write complex data
transformations using a procedural scripting language called Pig Latin.
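A typical Pig Latin pipeline (LOAD, FILTER, GROUP, FOREACH ... GENERATE) can be mimicked step by step in plain Python to show its procedural style. The user data and the age threshold are invented for illustration; the Pig Latin shown in the comment is standard syntax.

```python
from itertools import groupby

# The Pig Latin a user would write (shown for comparison):
#   users  = LOAD 'users' AS (name, age);
#   adults = FILTER users BY age >= 18;
#   byage  = GROUP adults BY age;
#   counts = FOREACH byage GENERATE group, COUNT(adults);

users = [("Asha", 29), ("Ravi", 17), ("Meena", 29), ("Kiran", 18)]

adults = [u for u in users if u[1] >= 18]         # FILTER
adults.sort(key=lambda u: u[1])                   # groupby needs sorted input
counts = [(age, len(list(group)))                 # FOREACH ... GENERATE COUNT
          for age, group in groupby(adults, key=lambda u: u[1])]

print(counts)  # [(18, 1), (29, 2)]
```

Each Pig Latin statement names an intermediate relation, which is why the Python version reads naturally as one assignment per transformation step.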
Pig - Architecture
Hive vs Pig
Both Apache Hive and Apache Pig simplify Big Data processing but
serve different use cases. Here’s a comparison:
Feature     | Hive                     | Pig
Best for    | Querying structured data | Complex data transformations
Ease of use | Easier for SQL users     | Easier for programmers
4. Visualization Capabilities
Data Transformation
Descriptive Statistics
Data Visualization
Regression Metrics
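Standard regression metrics such as Mean Absolute Error and Mean Squared Error (assumed here to be the metrics this heading covers) can be computed directly; the sample values are invented for illustration.

```python
def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mae(y_true, y_pred))  # (0.5 + 0 + 2) / 3 ~ 0.833
print(mse(y_true, y_pred))  # (0.25 + 0 + 4) / 3 ~ 1.417
```

MSE penalizes large errors more heavily than MAE because the differences are squared, which is why the 2.0 miss dominates its value here.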
Classification Metrics
o Accuracy → (Correct Predictions / Total Predictions)
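The accuracy formula above translates directly into code; the label vectors below are invented for illustration.

```python
def accuracy(y_true, y_pred):
    # Accuracy = correct predictions / total predictions.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```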
Mathematical Approach: