Distributed ledger
KEY TAKEAWAYS
Distributed ledgers use the same concept of storing data in files, but instead of one
working copy of the ledger stored on a server (with backups), identical copies are
allowed to be stored on multiple machines in different geographies. The computers,
called nodes, automatically update their ledger copies and broadcast their states to
other nodes. All nodes are programmed to verify other nodes' ledgers, and the
network maintains its database.
Most of this verification is done using cryptographic techniques such as hashing data and then
comparing the results, which modern computers and networks can do very quickly.
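To make the idea of comparing ledger copies by hash concrete, here is a minimal illustrative sketch in Python (not taken from any particular ledger implementation): two nodes fingerprint their local copies with SHA-256 and agree if the digests match.
import hashlib

def ledger_fingerprint(entries):
    # Hash every entry in order; identical ledgers produce identical digests.
    h = hashlib.sha256()
    for entry in entries:
        h.update(entry.encode())
    return h.hexdigest()

node_a = ["alice pays bob 5", "bob pays carol 2"]
node_b = ["alice pays bob 5", "bob pays carol 2"]
print(ledger_fingerprint(node_a) == ledger_fingerprint(node_b))  # True -> the copies agree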
Advantages of Distributed Ledgers
While centralized ledgers are prone to cyber-attacks, distributed ledgers are
inherently harder to attack because a majority of the distributed copies would need to be
altered simultaneously for an attack to succeed. Because of their distributed nature,
these records are resistant to malicious changes by a single party. Distributed ledgers
can also allow for much more transparency than is available in centralized ledgers.
This transparency makes the audit trail much easier to follow when conducting data audits and
financial reviews, which helps reduce the possibility of fraud occurring on the
financial books of a company.
Distributed ledgers also reduce operational inefficiencies and shorten the time a
transaction takes to complete. They are automated and can therefore function
24/7. All of these factors reduce overall costs for the entities that use and operate
them.
Distributed ledger technology is being applied in areas such as:
Finance
Music and entertainment
Diamond and precious assets
Artwork
Supply chains of various commodities
While distributed ledger technology has multiple advantages, it is still at a nascent stage,
and the best ways to adopt it are still being explored. One
thing is clear, though: the future of the centuries-old centralized ledger is
decentralized.
Nakamoto Consensus
Key Takeaways
The Nakamoto Consensus is a protocol that ensures all participants in a
blockchain network agree on a single, secure version of the blockchain.
It relies on proof-of-work (PoW), block difficulty adjustment, and
decentralization to maintain network integrity and prevent tampering.
While offering benefits like security and financial inclusion, it faces
challenges such as high energy consumption and potential
centralization risks.
Introduction
The Nakamoto Consensus is a fundamental concept in the world of
cryptocurrencies, particularly Bitcoin. Named after the pseudonymous creator
of Bitcoin, Satoshi Nakamoto, this consensus mechanism revolutionized the
way decentralized networks achieve agreement without a central authority.
This article explores what the Nakamoto Consensus is, how it works, and
why it is crucial for the functioning of Bitcoin.
What Is the Nakamoto Consensus?
The Nakamoto Consensus is a protocol used by blockchain networks to
achieve agreement (consensus) on the state of the blockchain. It’s essential
for maintaining the integrity and security of peer-to-peer (P2P) networks like
Bitcoin.
Basically, the Nakamoto Consensus ensures that all participants in the
network agree on a single version of the blockchain, preventing issues such
as double-spending and ensuring that transactions are valid.
Key Components of the Nakamoto Consensus
To understand how the Nakamoto Consensus works, it’s important to grasp
its key components:
1. Proof-of-work (PoW)
Proof-of-work is the mechanism by which new blocks are added to the
blockchain. It involves solving a computationally expensive puzzle: finding a hash
that meets the network's difficulty target, which requires significant computational power.
Miners compete to solve this puzzle. The first miner to do so gets the right to add the next
block to the blockchain and receives a block reward in the form of newly minted bitcoins
plus transaction fees.
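As a rough illustration only (not Bitcoin's actual mining code, which double-hashes an 80-byte block header against a full 256-bit target), the puzzle can be sketched in Python as a search for a nonce whose hash has a required number of leading zeros:
import hashlib

def mine(block_data, difficulty):
    # Find a nonce such that SHA-256(block_data + nonce) starts with `difficulty` zero hex digits.
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block containing pending transactions", difficulty=4)
print(nonce, digest)  # finding the nonce is expensive; verifying it takes a single hash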
2. Block difficulty
The difficulty of the mathematical problems that miners need to solve is
adjusted periodically. This ensures that blocks are added at a consistent rate,
approximately every 10 minutes in the case of Bitcoin. As more miners join
the network and more computational power (hash rate) is applied, the
difficulty increases to maintain this rate.
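A simplified sketch of the retargeting rule follows (Bitcoin recalculates difficulty every 2016 blocks and clamps the change to a factor of four; the constants reflect that scheme, while the function itself is illustrative):
TARGET_BLOCK_TIME = 10 * 60       # seconds per block that the network aims for
RETARGET_INTERVAL = 2016          # blocks between difficulty adjustments

def adjust_difficulty(old_difficulty, actual_timespan):
    expected = TARGET_BLOCK_TIME * RETARGET_INTERVAL
    # If blocks arrived too quickly, the ratio is > 1 and difficulty rises (and vice versa),
    # clamped to at most a 4x change in either direction.
    ratio = max(0.25, min(4.0, expected / actual_timespan))
    return old_difficulty * ratio

# Blocks came in twice as fast as intended, so difficulty roughly doubles.
print(adjust_difficulty(1.0, (TARGET_BLOCK_TIME * RETARGET_INTERVAL) / 2))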
3. Block rewards and incentives
Miners are incentivized to participate in the network through block rewards
and transaction fees. When a miner successfully adds a block to the
blockchain, they receive a reward in the form of newly created bitcoins.
Additionally, miners collect transaction fees from the transactions included in
the block. These incentives are crucial for motivating miners to contribute
their computational power to the network.
4. Decentralization
The Nakamoto Consensus operates in a decentralized manner, meaning there
is no central authority controlling the network. Instead, consensus is achieved
through the collective effort of participants (miners) spread across the globe.
This decentralization is a core feature that ensures the network's security and
resilience.
How the Nakamoto Consensus Works
The process of achieving consensus in the Nakamoto Consensus can be
broken down into several steps:
1. Transaction broadcast
When a user wants to make a transaction, they broadcast it to the network.
This transaction is then picked up by nodes (computers) connected to the
Bitcoin network.
2. Transaction verification
Nodes verify the validity of the transaction by checking several factors, such
as whether the user has sufficient balance and whether the transaction follows
the network's rules.
3. Inclusion in a block
Verified transactions are grouped together by miners into a block. Miners
then start working on solving the PoW problem associated with that block.
4. Solving the proof-of-work
Miners compete to solve the mathematical problem (hashing) required for the
proof-of-work. This problem involves finding a hash (a string of characters)
that meets specific criteria. The process is resource-intensive and requires
significant computational power.
5. Block addition
The first miner to solve the problem broadcasts their solution to the network.
Other nodes verify the solution, and if it is correct, the new block is added to
the blockchain. This block becomes the latest entry in the chain, and all
subsequent blocks will build upon it.
6. Chain continuity
Once a block is added, miners start working on the next block, and the
process repeats. The blockchain grows over time, with each block containing
a reference (hash) to the previous block, creating a secure and tamper-
resistant chain.
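The hash-linking described above can be sketched in a few lines of Python (a toy structure for illustration, not a real client): each block stores the hash of its predecessor, so changing any earlier block invalidates every block that follows it.
import hashlib, json, time

def block_hash(block):
    # Hash the block's full contents, including the link to the previous block.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"index": 0, "prev_hash": "0" * 64, "timestamp": time.time(), "transactions": []}]

def add_block(transactions):
    prev = chain[-1]
    block = {
        "index": prev["index"] + 1,
        "prev_hash": block_hash(prev),   # tampering with any earlier block breaks this link
        "timestamp": time.time(),
        "transactions": transactions,
    }
    chain.append(block)
    return block

add_block([{"from": "alice", "to": "bob", "amount": 1}])
print(block_hash(chain[-1]))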
Security and Attack Resistance
The Nakamoto Consensus is designed to be secure and resistant to attacks
through several mechanisms:
1. Difficulty adjustment
The difficulty of the proof-of-work problem adjusts based on the total
computational power of the network. This adjustment ensures that blocks are
added at a consistent rate, preventing any single miner or group of miners
from dominating the network.
2. Majority rule
The network operates on a majority rule principle. To successfully alter the
blockchain, an attacker would need to control more than 50% of the
network's computational power, known as a 51% attack. This is highly
impractical and expensive to do on the Bitcoin network, but smaller networks
can be susceptible to such attacks.
3. Decentralization
The decentralized nature of the network makes it difficult for any single
entity to gain control. The wide distribution of miners across the globe adds
to the network's resilience.
4. Economic incentives
Miners are financially incentivized to act honestly and follow the network's
rules. Attempting to attack the network or create invalid blocks would result
in wasted resources and loss of potential rewards, discouraging malicious
behavior.
Benefits of the Nakamoto Consensus
The Nakamoto Consensus offers several significant benefits that contribute to
the success and adoption of Bitcoin:
1. Trustless environment
Participants in the network do not need to trust each other or a central
authority. The consensus mechanism ensures that all transactions are valid
and that the blockchain remains secure and tamper-proof.
2. Security
The combination of proof-of-work, difficulty adjustment, and
decentralization makes the network highly secure. The likelihood of
successful attacks is minimal, ensuring the integrity of the blockchain.
3. Transparency
The blockchain is a public ledger, meaning all transactions are visible to
anyone. This transparency adds to the trustworthiness of the system, as
anyone can verify transactions and the state of the blockchain.
4. Financial Inclusion
The decentralized nature of the Nakamoto Consensus enables anyone with
internet access to participate in the network, promoting financial inclusion.
Challenges and Criticisms
Despite its advantages, the Nakamoto Consensus is not without challenges
and criticisms:
1. Energy consumption
The proof-of-work mechanism requires significant computational power,
leading to high energy consumption. This has raised environmental concerns
and calls for more energy-efficient consensus mechanisms.
2. Centralization risk
While the network is designed to be decentralized, there is a risk of
centralization if a small number of mining pools control a large portion of the
network's computational power.
3. Scalability
The current design of the Nakamoto Consensus limits the number of
transactions that can be processed per second. As the network grows,
scalability becomes a concern, leading to the development of solutions such
as the Lightning Network to address this issue.
4. Forks
Disagreements within the community can lead to forks, where the blockchain
splits into two separate chains. This can create confusion and uncertainty, as
seen in the 2017 split between Bitcoin and Bitcoin Cash.
5. Sybil attack
A Sybil attack is a type of attack on a computer network service in which an attacker subverts
the service's reputation system by creating a large number of pseudonymous identities and uses
them to gain a disproportionately large influence. It is named after the subject of the book Sybil,
a case study of a woman diagnosed with dissociative identity disorder.[1] The name was suggested
in or before 2002 by Brian Zill at Microsoft Research.[2] The term pseudospoofing had
previously been coined by L. Detweiler on the Cypherpunks mailing list and used in the
literature on peer-to-peer systems for the same class of attacks prior to 2002, but this term did not
gain as much influence as "Sybil attack".[3]
Description
The Sybil attack in computer security is an attack wherein a reputation system is subverted by
creating multiple identities.[4] A reputation system's vulnerability to a Sybil attack depends on
how cheaply identities can be generated, the degree to which the reputation system accepts
inputs from entities that do not have a chain of trust linking them to a trusted entity, and whether
the reputation system treats all entities identically. As of 2012, evidence showed that large-scale
Sybil attacks could be carried out in a very cheap and efficient way in extant realistic systems
such as BitTorrent Mainline DHT.[5][6]
An entity on a peer-to-peer network is a piece of software that has access to local resources. An
entity advertises itself on the peer-to-peer network by presenting an identity. More than one
identity can correspond to a single entity. In other words, the mapping of identities to entities is
many to one. Entities in peer-to-peer networks use multiple identities for purposes of
redundancy, resource sharing, reliability and integrity. In peer-to-peer networks, the identity is
used as an abstraction so that a remote entity can be aware of identities without necessarily
knowing the correspondence of identities to local entities. By default, each distinct identity is
usually assumed to correspond to a distinct local entity. In reality, many identities may
correspond to the same local entity.
An adversary may present multiple identities to a peer-to-peer network in order to appear and
function as multiple distinct nodes. The adversary may thus be able to acquire a disproportionate
level of control over the network, such as by affecting voting outcomes.
In the context of (human) online communities, such multiple identities are sometimes known
as sockpuppets. The less common term inverse-Sybil attack has been used to describe an attack
in which many entities appear as a single identity.[7]
6. Smart contract
With the unique developments and advancements in the technology sector in India, especially during
the challenges posed by the rapid spread of COVID-19, the fintech sector has shown promising
results. There has been growing interest, fuelled largely by curiosity and popularity, among the citizens of
India in cryptocurrencies such as Bitcoin, Ripple, and Dogecoin, and a large number of
people have started investing a noticeable part of their time and money in these virtual currencies.
In India, the apex financial authority, the Reserve Bank of India (“RBI”), has recognised
cryptocurrency as a form of digital/virtual currency created through a
series of written computer codes based on cryptography/encryption, and thus free of any central
issuing authority per se. Cryptocurrency is supported by blockchain technology, which establishes a
peer-to-peer issuance system that uses private and public keys to allow authentication and
encryption for secure and safe transactions.
Growing Popularity Of Cryptocurrency
Being an untouched, unregulated market with a potential of over a trillion dollars, India also
witnessed a huge surge of cryptocurrency exchanges.
Witnessing the increasing popularity of cryptocurrency within a short span of a year and
the potential revenue loss to the Government of India, regulators and authorities started to take
notice. As a consequence, in 2013 the RBI issued a press release warning the public against
dealing in virtual/digital currencies.
Restrictions Imposed By RBI
In November 2017 the Government of India established a high-level Inter-Ministerial Committee to
report on various issues related to the use of virtual currency and subsequently, in July 2019, this
Committee presented its report suggesting a blanket ban on private cryptocurrencies in India.
The threat of revenue loss appeared so imminent to the RBI that, even prior to the
submission of the report of the Inter-Ministerial Committee, in April 2018 the RBI issued a
circular restricting all commercial and co-operative banks, small finance banks, payment banks and
NBFCs not only from dealing in virtual/digital currencies themselves but also instructing them to stop
providing services to all entities that deal in virtual/digital currencies.
This stalled the rise of the crypto industry in India, as exchanges required banking services for
sending and receiving money. Banking services are essential for converting money into
cryptocurrency and, in turn, for paying salaries, vendors, office space, etc. However, the situation
prevailing around cryptocurrencies and their usage completely changed on 4th March 2020, when the
Hon’ble Supreme Court of India, in a well-conceived judgment quashed the earlier ban imposed by
the RBI.
The Hon’ble Supreme Court of India chiefly examined the matter from the perspective of Article
19(1)(g) of the Indian Constitution, which talks about the freedom to practice any profession or to
carry on any occupation, trade or business, and the doctrine of proportionality.
The Apex Court noted that regulators and governments of other countries are unanimous in the
opinion that, though virtual currencies have not acquired the status of legal tender, they are
digital representations of value and are capable of functioning as a medium of exchange, a unit
of account and/or a store of value.
While the court recognized the RBI's power to take pre-emptive action, it held that the
measure was not proportionate in this case, since RBI-regulated entities had not suffered any
damage or loss, directly or indirectly, as a result of VC trading. Therefore,
among other reasons, on the grounds of proportionality, the impugned Circular dated 06-04-2018 was
set aside.
Developments In The Crypto-World
The Government of India is now considering the introduction of a new bill titled the “Cryptocurrency
and Regulation of Official Digital Currency Bill, 2021” (“New Bill”), which is similar in spirit to its
earlier versions. The New Bill seeks to ban private cryptocurrencies in India, with some
exceptions to encourage the underlying technology and trading of cryptocurrency, while providing
a framework for the creation of an official digital currency to be issued by the RBI.
The New Bill approaches the difficulty created by the lack of cryptocurrency laws by suggesting a ban
on all private cryptocurrencies in their entirety. The dichotomy in the New Bill's suggestion arises
because the RBI is still unclear about which kinds of cryptocurrency will fall under the purview of
“private cryptocurrency”.
If the New Bill imposes a complete ban on private cryptocurrencies, it would push cryptocurrency
investors into dealing in a completely unregulated market. This would undermine the aim of
introducing a law on cryptocurrency, which is to ease the process of trading and holding it in a safer
technological environment.
However, even with the introduction of a state-owned cryptocurrency monitored by the
RBI, the risk in investing in and holding cryptocurrency would remain the same.
Current Situation Of Cryptocurrency In India
Towards the end of March 2021, according to the latest amendments to the Schedule III of the
Companies Act, 2013, the Government of India instructed that from the beginning of the new
financial year, companies have to disclose their investments in cryptocurrencies.
In simple words, companies now have to disclose profit or loss on transactions involving
cryptocurrency, the amount of holding, and details about the deposits or advances from any person
trading or investing in cryptocurrency. This move has been greatly appreciated by the people dealing
in the crypto sector, as this will open the door for all Indian companies to have Crypto on their
balance sheets.
Conclusion
Based on the inference that can be drawn from the aforementioned facts and current scenario around
the world dealing with matters of cryptocurrencies, it is noticeable that there is a complete lack of
clarity concerning cryptocurrency regulation in India.
Well-structured, clear regulations dealing with crypto trading exchanges, blockchain technology,
investors, and the people employed in such sector should be made the priority given that the world of
cryptocurrency is here to stay and demands more attention.
It is notable that the Draft National Strategy on Blockchain, 2021, published by the
Ministry of Electronics and Information Technology, highlighted the benefits of cryptocurrency.
Therefore, banning a virtual currency that has created an impact in many countries would not be the
ideal thing to do for the development of our nation.
The government needs to take an effective step towards the positive regulation and enforcement of
cryptocurrency as a way forward to earn the confidence of investors and the general public in
developing the nation. The Union Finance Minister Nirmala Sitharaman announced on
16th March 2021 that there shall not be a complete ban on cryptocurrency: “we will allow a certain
amount of window for people to experiment on blockchain, bitcoins and cryptocurrency.”
It would be wiser, though, to pause, sit back, and wait for the Government to formulate clear
regulations concerning cryptocurrencies before venturing into this grey area.
The life cycle of a blockchain application typically follows a sequence of stages that help ensure
the project's development, deployment, and sustainability. Here's a breakdown of the main
phases involved:
1. Planning
Problem Definition: Identify the problem you're solving and determine how blockchain
can provide a solution.
Use Case Identification: Pinpoint the specific use cases for the blockchain, such as
finance (DeFi), supply chain, identity management, or voting systems.
2. Research
Research: Conduct extensive research on existing blockchain platforms, consensus
algorithms (e.g., Proof of Work, Proof of Stake), and tools.
3. Development
Smart Contract Development: Write the code for the blockchain’s smart contracts
using languages like Solidity (Ethereum), Vyper, or others depending on the platform.
Backend Development: Develop the off-chain components (e.g., APIs, database
integration) that will interact with the blockchain.
Frontend Development: Design and build the user interface (UI) to allow users to
interact with the blockchain via wallets or web applications.
Security Considerations: Implement robust security measures, such as encryption and
secure key management.
4. Testing
Unit Testing: Test individual smart contracts and other components for functionality and
correctness.
Integration Testing: Ensure that the blockchain and off-chain components (APIs,
databases) work together seamlessly.
Security Audits: Conduct security audits of the code, particularly for smart contracts.
This is crucial as vulnerabilities can lead to significant financial loss.
Testnet Deployment: Deploy the application on a testnet (e.g., Ethereum Rinkeby,
Binance Testnet) to simulate real-world use without risk.
5. Deployment
Mainnet Launch: Once testing is successful, deploy the application to the blockchain’s
mainnet.
Smart Contract Deployment: Deploy the smart contracts to the mainnet and ensure they
are functioning as expected.
Infrastructure Setup: Set up necessary infrastructure such as nodes, databases, and APIs
for supporting the application.
6. Monitoring and Maintenance
Transaction Monitoring: Continuously monitor the blockchain for issues like network
congestion or failed transactions.
Performance Monitoring: Track the performance of the application, checking for slow
transaction times or issues that could affect user experience.
Bug Fixes & Upgrades: Regularly update the codebase to fix bugs, improve
performance, and add new features as needed.
Security Maintenance: Continuously monitor for vulnerabilities in the blockchain or the
application and apply patches or updates accordingly.
Scaling Solutions: Implement scaling solutions if the application gains significant usage
(e.g., layer 2 solutions like Optimistic Rollups, sharding).
Cost Optimization: Optimize gas fees (on Ethereum and similar chains) or transaction
costs to ensure the application is cost-efficient for users.
User Growth: Ensure that the application can handle increased user adoption and
transaction volume as it grows.
User Education: Blockchain applications can be complex, so educating users about how
to use the platform, wallet management, and security is crucial.
User Feedback: Collect feedback from users on usability, features, and issues, which can
guide future updates or enhancements.
The life of a blockchain application is not static. It requires continuous updates, monitoring, and
scaling to adapt to user needs, regulatory changes, and technological advancements.
Machine learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn
from data without explicit programming. There are various types of machine learning, each
suited for different tasks and applications. Here's an overview of the main types of machine
learning:
1. Supervised Learning
Definition: In supervised learning, the model is trained on labeled data. This means that
the algorithm is given input-output pairs, and it learns to map inputs to the correct output.
The goal is to predict the output for new, unseen data.
Use Cases: Classification (e.g., email spam detection, disease diagnosis) and regression
(e.g., predicting house prices, stock prices).
Example Algorithms:
o Classification: Logistic Regression, Support Vector Machines (SVM), k-Nearest
Neighbors (k-NN), Decision Trees, Random Forest, Naive Bayes.
o Regression: Linear Regression, Polynomial Regression, Support Vector
Regression (SVR).
Key Concept: The model learns from past data, where the true labels (outputs) are
known.
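A brief sketch of the supervised workflow using scikit-learn (assumed to be installed; the iris dataset and logistic regression are chosen only for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: features X with known class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn the input-to-output mapping from the labeled pairs, then predict unseen inputs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))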
2. Unsupervised Learning
Definition: In unsupervised learning, the model is given data without labels (i.e., the data
doesn't have known output values). The goal is to find hidden patterns or intrinsic
structures in the data.
Use Cases: Clustering (e.g., customer segmentation, image grouping), anomaly detection
(e.g., fraud detection), dimensionality reduction (e.g., data compression).
Example Algorithms:
o Clustering: k-Means, Hierarchical Clustering, DBSCAN.
o Dimensionality Reduction: Principal Component Analysis (PCA), t-SNE,
Autoencoders.
Key Concept: The model tries to infer the structure of the data without labeled outputs.
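A minimal clustering sketch with scikit-learn (illustrative; the synthetic blobs stand in for unlabeled customer or image data):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: only features, no target values are given to the model.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-Means infers three groups purely from the structure of the data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment of the first ten points
print(kmeans.cluster_centers_)    # the discovered cluster centres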
3. Semi-Supervised Learning
Definition: Semi-supervised learning combines a small amount of labeled data with a large
amount of unlabeled data during training. It is useful when labeling data is expensive or
time-consuming, as the unlabeled data helps the model generalize beyond the few labeled examples.
Use Cases: Web page classification, speech analysis, and medical imaging, where only a small
portion of the data is labeled.
Key Concept: A small labeled set guides the learning, while the larger unlabeled set improves
the learned representation.
5. Self-Supervised Learning
Definition: Self-supervised learning is a type of unsupervised learning where the model
generates labels from the input data itself. The idea is to create auxiliary tasks to learn
useful representations of data without relying on manually labeled data.
Use Cases: Natural Language Processing (NLP), image representation learning (e.g., for
transfer learning).
Example Algorithms: Contrastive Learning, Predictive Models (e.g., BERT for text,
SimCLR for images).
Key Concept: The model generates its own labels from the data, often by predicting
parts of the data from other parts.
6. Deep Learning
Definition: Deep learning is a subset of machine learning that uses neural networks with
many layers (deep neural networks). It is particularly effective for large and complex
datasets, such as images, audio, and text.
Use Cases: Image classification (e.g., facial recognition, object detection), NLP (e.g., text
generation, translation), speech recognition, autonomous driving.
Example Algorithms: Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), Long Short-Term Memory Networks (LSTMs), Transformer Models
(e.g., GPT, BERT).
Key Concept: Deep learning models are highly flexible and capable of learning complex
features from raw data, often requiring large datasets and computational power.
7. Transfer Learning
Definition: Transfer learning involves taking a pre-trained model (usually from a large
dataset) and fine-tuning it on a new, smaller dataset for a different but related task. This
approach is especially useful when labeled data is scarce.
Use Cases: Fine-tuning models for specific applications in computer vision, NLP, and
medical diagnostics.
Example Algorithms: Pre-trained CNNs (e.g., VGG, ResNet), BERT for NLP tasks.
Key Concept: Transfer learning leverages knowledge from one domain to improve
performance in another.
8. Online Learning
Definition: Online learning is a type of machine learning where the model learns
incrementally as new data arrives, rather than being trained on a fixed dataset all at once.
This is useful in situations where data is constantly being updated or when the system
needs to adapt in real-time.
Use Cases: Financial market predictions, personalized recommendations, fraud detection
in real-time.
Example Algorithms: Stochastic Gradient Descent (SGD), Online Naive Bayes.
Key Concept: The model learns from a continuous stream of data and updates its
parameters accordingly.
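A small sketch of incremental (online) learning with scikit-learn's SGDClassifier, whose partial_fit method updates the model one mini-batch at a time (the streaming batches here are simulated for illustration):
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()            # a linear model trained with stochastic gradient descent
classes = np.array([0, 1])         # all classes must be declared on the first partial_fit call
rng = np.random.default_rng(0)

for step in range(5):
    # Simulate a new batch of streaming data: 2 features, binary label.
    X_batch = rng.normal(size=(32, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)   # incremental update, no full retrain

print(model.predict([[0.5, 0.5], [-1.0, -1.0]]))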
Summary of the Types of Machine Learning:
The Data Analytics Life Cycle
In today’s data-driven world, data analytics plays a crucial role in helping businesses make
informed decisions. The data analytics life cycle outlines the step-by-step process used to
extract actionable insights from data. Each phase of this cycle ensures that raw data is
transformed into meaningful insights for effective decision-making.
Phase 1: Discovery
Define the business problem or objective to be solved.
Identify the scope of the data analytics project.
Determine the relevant data sources (internal and external).
Collaborate with stakeholders to clarify objectives and requirements.
Assess available resources, including data, tools, and team expertise.
Phase 2: Data Preparation
Collect data from various sources like databases, APIs, and spreadsheets.
Clean the data by handling missing values, removing duplicates, and correcting
inconsistencies.
Transform and format data to make it suitable for analysis.
Ensure data is standardized and normalized where necessary to maintain consistency.
Conduct data profiling to understand data distributions, types, and patterns.
Phase 3: Model Planning
Select the variables (features) most relevant to solving the problem.
Choose appropriate algorithms (e.g., regression, clustering, classification) based on the
problem type.
Develop a roadmap for the modeling phase, outlining how the data will be used.
Create an initial hypothesis about how the model will behave with the chosen data.
Use exploratory data analysis (EDA) techniques like correlation matrices and scatter
plots to understand relationships in the data.
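As a small illustration of the exploratory step, here is a pandas sketch (with made-up numbers) that profiles a dataset and builds a correlation matrix:
import pandas as pd

# Toy dataset purely for illustration of the EDA step.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "site_visits": [110, 190, 320, 380, 520],
    "sales": [12, 22, 33, 41, 48],
})
print(df.describe())   # distributions of each variable
print(df.corr())       # pairwise correlations between candidate features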
Phase 4: Model Building
Build models using the selected algorithms and the prepared dataset.
Train models by feeding them the training dataset to allow them to learn patterns.
Use a test dataset to validate model performance and avoid overfitting.
Iterate and refine the model by adjusting parameters for better accuracy.
Use cross-validation techniques like K-fold validation to ensure robustness.
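A short scikit-learn sketch of the train/validate loop with K-fold cross-validation (dataset and algorithm chosen only as examples):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: the model is trained and validated on five different splits.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))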
Phase 5: Communication of Results
Visualize results through charts, graphs, dashboards, and other visuals to make insights
understandable.
Present key findings to stakeholders in a concise, actionable format.
Summarize the insights derived from the data and explain how they align with business
goals.
Provide recommendations on actions based on the data analysis.
Prepare a comprehensive report that includes all aspects of the data analysis process,
findings, and conclusions.
Phase 6: Operationalize
Deploy the model into production environments, integrating it into business processes.
Automate tasks like decision-making or predictions based on the model’s insights.
Continuously monitor model performance to ensure it remains accurate and relevant as
new data becomes available.
Update or retrain the model as needed to accommodate changing data trends.
Document the deployment and maintenance processes to ensure long-term usability.
Conclusion
The data analytics life cycle is a comprehensive process that guides professionals from the
initial discovery phase to the final operationalization of insights. Each phase plays a critical role
in ensuring that raw data is transformed into meaningful insights that drive informed decision-
making. Starting with understanding the business problem, moving through data preparation,
model building, and finally deploying the results, the life cycle ensures a structured approach to
data analysis.
By following this systematic process, organizations can harness the power of their data more
effectively, reduce errors, and make more accurate predictions. Additionally, continuous
monitoring and updating of the models ensure that the insights remain relevant in a rapidly
changing data landscape. Whether you’re working on small-scale data projects or large
enterprise analytics initiatives, understanding and applying the phases of the data analytics
life cycle is essential.
Architecture of Spark
What is Spark?
Apache Spark is an open-source framework that processes large amounts of unstructured,
semi-structured, and structured data for analytics. Its architecture is regarded as an
alternative to Hadoop and MapReduce architectures for big data processing. The RDD
(Resilient Distributed Dataset) and the DAG (Directed Acyclic Graph), Spark's data storage
and processing abstractions, are used to store and process data, respectively. Spark
architecture consists of four components: the Spark driver, executors, cluster manager,
and worker nodes. It uses Datasets and DataFrames as the fundamental data storage
mechanism to optimise the Spark process and big data computation.
Apache Spark, a popular cluster computing framework, was created to accelerate data
processing applications. It enables applications to run faster by utilising in-memory
cluster computing. A cluster is a collection of nodes that communicate with each other and
share data. Because of implicit data parallelism and fault tolerance, Spark can be applied
to a wide range of batch, iterative, and interactive processing demands.
Key features of Apache Spark include:
Speed: Spark performs up to 100 times faster than MapReduce for processing
large amounts of data. It is also able to divide the data into chunks in a controlled
way.
Powerful Caching: Powerful caching and disk persistence capabilities are offered
by a simple programming layer.
Deployment: Mesos, Hadoop via YARN, or Spark’s own cluster manager can all
be used to deploy it.
Real-Time: Because of its in-memory processing, it offers real-time computation
and low latency.
Polyglot: Spark supports Java, Scala, Python, and R, and you can write Spark code in any of
these languages. Spark also provides command-line shells in Scala and Python.
Spark Architecture
The Apache Spark base architecture diagram is provided in the following figure:
When the Driver Program in the Apache Spark architecture executes, it calls the real
program of an application and creates a SparkContext. SparkContext contains all of the
basic functions. The Spark Driver includes several other components, including a DAG
Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are
responsible for translating user-written code into jobs that are actually executed on the
cluster.
The Cluster Manager manages the execution of various jobs in the cluster. The Spark Driver
works in conjunction with the Cluster Manager to control the execution of these jobs, while
the Cluster Manager allocates resources for them. Once the job has been broken down into
smaller tasks, which are then distributed to worker nodes, the Spark Driver controls their
execution.
Many worker nodes can be used to process an RDD created in the SparkContext, and the
results can also be cached.
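A minimal PySpark sketch of this flow, assuming a local Spark installation (the application name and data are illustrative): the driver creates a SparkContext through a SparkSession, builds an RDD, caches it, and runs actions that the executors carry out in parallel.
from pyspark.sql import SparkSession

# The driver program: creates the SparkSession/SparkContext entry point.
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD, cache it, and reuse the cached data across two actions.
lines = sc.parallelize(["spark caches rdds in memory", "spark distributes work to executors"])
words = lines.flatMap(lambda line: line.split()).cache()
print(words.count())
print(words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())

spark.stop()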
The Spark Context receives task information from the Cluster Manager and enqueues it
on worker nodes.
The executor is in charge of carrying out these tasks. The lifespan of executors is the
same as that of the Spark application. We can increase the number of workers if we want
to improve the performance of the system; in this way, jobs can be divided into more
logical parts.
The driver process acts as the master: it coordinates the workers and oversees the tasks.
The application is split into jobs that are scheduled for execution on executors in the
cluster. The driver creates a Spark context (which acts as a gateway) to monitor the job
running in a specific cluster and to connect to the Spark cluster. In the diagram, the
driver program calls the main application and creates the Spark context, which jointly
monitors the job running in the cluster and connects to the Spark cluster. Everything is
executed through the Spark context.
Each Spark session has an entry point in the Spark context. Spark drivers include
additional components to execute jobs in clusters, as well as cluster managers. The
context acquires worker nodes to execute and store data, since Spark clusters can connect
to different types of cluster managers. When a process is executed in the cluster, the job
is divided into stages, and the stages are further divided into scheduled tasks.
An executor is responsible for executing tasks and for caching data. Executors register
with the driver program at startup and have a number of slots for running tasks
concurrently. An executor runs a task once its data is loaded and is removed when it
becomes idle; tasks run inside the executor's Java process. Executors are allocated
dynamically, being added and removed over the course of the application, and the driver
program monitors them while they execute users' tasks.
Cluster Manager
A driver program controls the execution of jobs and stores data in a cache. At the outset,
executors register with the driver. Each executor has a number of slots for running the
application concurrently, and executors read and write external data in addition to
servicing client requests. A task is executed once the executor has loaded its data, and
executors are removed when idle. Executors are allocated dynamically and are constantly
added and removed depending on how long they are in use. The driver program monitors
executors as they perform users' tasks; code is executed in the Java process when an
executor executes a user's task.
Worker Nodes
The slave nodes function as executors, processing tasks, and returning the results back to
the spark context. The master node issues tasks to the Spark context and the worker nodes
execute them. They make the process simpler by boosting the worker nodes (1 to n) to
handle as many jobs as possible in parallel by dividing the job up into sub-jobs on
multiple machines. A Spark worker monitors worker nodes to ensure that the
computation is performed simply. Each worker node handles one Spark task. In Spark, a
partition is a unit of work and is assigned to one executor for each one.
1. There are multiple executor processes for each application, which run tasks on
multiple threads over the course of the whole application. This allows applications
to be isolated both on the scheduling side (drivers can schedule tasks individually)
and the executor side (tasks from different apps can run in different JVMs).
Therefore, data must be written to an external storage system before it can be
shared across different Spark applications.
2. Spark is agnostic to the underlying cluster manager: as long as it can acquire executor
processes and these can communicate with each other, Spark can run even on a cluster
manager that also supports other applications (e.g. Mesos/YARN).
3. The driver program must listen for and accept incoming connections from its
executors throughout its lifetime (e.g., see spark.driver.port in the network config
section). Workers must be able to connect to the driver program via the network.
4. The driver is responsible for scheduling tasks on the cluster. It should be run on
the same local network as the worker nodes, preferably on the same machine. If
you want to send requests to the cluster, it’s preferable to open an RPC and have
the driver submit operations from nearby rather than running the driver far away
from the worker nodes.
Modes of Execution
You can choose from three different execution modes: cluster, client, and local.
These determine where your application's driver and resources are physically located when
you run the application.
1. Cluster mode
2. Client mode
3. Local mode
Cluster mode: Cluster mode is the most frequent way of running Spark Applications. In
cluster mode, a user delivers a pre-compiled JAR, Python script, or R script to a cluster
manager. Once the cluster manager receives the pre-compiled JAR, Python script, or R
script, the driver process is launched on a worker node inside the cluster, in addition to
the executor processes. This means that the cluster manager is in charge of all Spark
application-related processes.
Client mode: In contrast to cluster mode, in client mode the Spark driver remains on the
client machine that submitted the application, and that machine is therefore responsible
for maintaining the Spark driver process. These client machines are usually referred to as
gateway machines or edge nodes.
Local mode: Local mode runs the entire Spark Application on a single machine, as
opposed to the previous two modes, which parallelize the application across a cluster;
local mode achieves parallelism through threads on that one machine. This is a common way
to experiment with Spark, try out your applications, or iterate quickly without needing a
cluster.
In practice, we do not recommend using local mode for running production applications.
Setup Steps of Hadoop
Step 7: Set Permissions
Copy the generated public key to the authorized key file and set the proper
permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
Step 9: Switch User
Switch to the 'hadoop' user again using the following command:
su - hadoop
Step 10: Set Up Environment Variables
Next, you need to set up environment variables for Java and Hadoop in your system.
Open the '~/.bashrc' file in your preferred text editor. If you're using nano, you can
paste code with 'Ctrl+Shift+V', save with 'Ctrl+X', then 'Y', and hit 'Enter':
nano ~/.bashrc
Load the above configuration into the current environment:
source ~/.bashrc
Step 11: Configuring Hadoop
Create the namenode and datanode directories within the 'hadoop' user's home
directory using the following commands:
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
Next, edit the 'core-site.xml' file and replace the name with your system hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save and close the file. Next, edit the 'hdfs-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Save and close the file. Then, edit the 'mapred-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Finally, edit the 'yarn-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Format the HDFS Namenode as the hadoop user by running 'hdfs namenode -format'. Once the
Namenode directory is successfully formatted with the HDFS file system, you will see the
message "Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully
formatted." Start the Hadoop cluster using:
start-all.sh
You can check the status of all Hadoop services using the command:
jps
To access the Namenode, open your web browser and visit http://your-server-ip:9870.
Replace 'your-server-ip' with your actual IP address. You should see the Namenode
web interface.
To access the Resource Manager, open your web browser and visit the URL
http://your-server-ip:8088. You should see the following screen:
Also, create a directory in the Hadoop file system and put some files into it. For example,
create a '/logs' directory and copy log files from the host machine into it:
hdfs dfs -mkdir /logs
hdfs dfs -put /var/log/* /logs/
You can also verify the above files and directories in the Hadoop web interface. Go to
the web interface, click on Utilities => Browse the file system. You should see the
directories you created earlier on the following screen:
Step 15: To Stop Hadoop Services
To stop the Hadoop service, run the following command as a Hadoop user:
stop-all.sh
In summary, you've learned how to install Hadoop on Ubuntu. Now, you're ready to
unlock the potential of big data analytics. Happy exploring!
Streaming R in Hadoop
Integrating Hadoop with R involves merging the power of Hadoop's distributed data processing
capabilities with R's advanced analytics and data manipulation functions. This synergy allows
data scientists and analysts to efficiently analyze vast datasets stored in Hadoop's distributed file
system (HDFS) using R's familiar interface. By leveraging packages like "rhipe" or "rhdfs," users
can seamlessly access and process big data in R, making it easier to perform complex data
analytics, machine learning, and statistical modeling on large-scale datasets. This integration
enhances the scalability and versatility of data analysis workflows, making it a valuable tool for
handling big data challenges.
What is Hadoop?
Hadoop is a powerful open-source framework designed for distributed storage and processing of
vast amounts of data. Originally developed by Apache Software Foundation, it has become a
cornerstone of big data technology. Hadoop consists of two core components: Hadoop
Distributed File System (HDFS) and the Hadoop MapReduce programming model. HDFS is
responsible for storing data across a cluster of commodity hardware, breaking it into smaller
blocks and replicating them for fault tolerance. This distributed storage system allows Hadoop to
handle data at a massive scale.
What is R?
R is a powerful, open-source programming language and environment widely used for
statistical computing, data analysis, and graphics. Developed in the early 1990s, R has since
gained immense popularity among statisticians, data scientists, and analysts due to its versatility
and extensive library of packages. R offers an array of statistical and graphical techniques,
making it a go-to tool for tasks ranging from data manipulation and visualization to complex
statistical modeling and machine learning. Users can perform data cleansing, transformation, and
exploratory data analysis with ease, and R's graphical capabilities allow for the creation of
high-quality plots, charts, and graphs, facilitating data visualization and presentation.
Here are some of the key R packages for integrating R with Hadoop:
The rhbase package This package provides basic connectivity to the HBASE distributed
database, using the Thrift server. R programmers can browse, read, write, and modify tables
stored in HBASE from within R.
The rhdfs package The "rhdfs" package is an R package that facilitates the integration of R with
the Hadoop Distributed File System (HDFS). HDFS is the primary storage system used in
Hadoop clusters, and it's designed to store and manage very large datasets across distributed
computing nodes. The "rhdfs" package allows R users to interact with HDFS, read and write
data, and perform various file operations within an R environment.
Here are some key features and functionalities of the "rhdfs" package:
Hadoop Streaming: Hadoop Streaming is a utility that comes with Hadoop, allowing you
to use any executable program as a mapper and reducer in a MapReduce job. Instead of
writing Java code, you can use scripts or executable programs in languages like Python,
Perl, or R to perform the mapping and reducing tasks (a minimal example of this pattern
is sketched after this list).
R as a Mapper/Reducer: To use R with Hadoop Streaming, you write R scripts that serve
as the mapper and/or reducer functions. These scripts read input data from standard input
(stdin), process it using R code, and then emit output to standard output (stdout). You can
use command-line arguments to pass parameters to your R script.
Data Distribution: Hadoop takes care of distributing the input data across the cluster and
managing the parallel execution of your R scripts on different nodes. Each mapper
processes a portion of the data independently and produces intermediate key-value pairs.
Shuffling and Reducing: The intermediate key-value pairs are sorted and shuffled, and
then they are passed to the reducer (if specified) to aggregate or further process the data.
You can also use R scripts as reducers in this step.
Output: The final results are written to HDFS or another storage location, making them
available for further analysis or use.
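As a sketch of this pattern, here is a word-count mapper and reducer written in Python (the same stdin/stdout contract applies to R scripts). The file names are illustrative; the job would be submitted with the hadoop-streaming JAR, passing these scripts via the -mapper and -reducer options.
# mapper.py - reads lines from stdin, emits "word<TAB>1" pairs on stdout
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - receives the sorted pairs and sums the counts per word
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, int(value)
if current is not None:
    print(f"{current}\t{count}")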
Install R Packages: To get started, you'll need to install R packages that provide
MapReduce functionality. Two popular R packages for this purpose are "rmr2" and
"rhipe." "rmr2" is designed for use with Hadoop MapReduce, while "rhipe" works with
both Hadoop and Rhipe (R and Hadoop Integrated Programming Environment).
Set Up Your Hadoop Cluster: Ensure that you have access to a Hadoop cluster or Hadoop
distribution. You will need a running Hadoop cluster to execute MapReduce jobs.
Write Map and Reduce Functions in R: With the R packages installed, you can write your
custom Map and Reduce functions in R. These functions define how your data will be
processed. In MapReduce, the Map function processes input data and emits key-value
pairs, and the Reduce function aggregates and processes these pairs.
ZooKeeper
ZooKeeper is a distributed co-ordination service used to manage a large set of hosts. Co-ordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this
issue with its simple architecture and API. ZooKeeper allows developers to focus on core application
logic without worrying about the distributed nature of the application.
The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by
Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to
track the status of distributed data.
Before moving further, it is important that we know a thing or two about distributed applications. So,
let us start the discussion with a quick overview of distributed applications.
Distributed Application
A distributed application can run on multiple systems in a network at a given time (simultaneously),
with the systems coordinating among themselves to complete a particular task in a fast and efficient
manner. Normally, complex and time-consuming tasks that would take hours for a non-distributed
application (running on a single system) to complete can be done in minutes by a distributed
application using the computing capabilities of all the systems involved.
The time to complete the task can be further reduced by configuring the distributed application to run
on more systems. A group of systems in which a distributed application is running is called
a Cluster and each machine running in a cluster is called a Node.
A distributed application has two parts, Server and Client application. Server applications are
actually distributed and have a common interface so that clients can connect to any server in the
cluster and get the same result. Client applications are the tools to interact with a distributed
application.
ZooKeeper provides the common services that distributed applications need, such as:
Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
Configuration management − Latest and up-to-date configuration information of the system for a
joining node.
Cluster management − Joining / leaving of a node in a cluster and node status at real time.
Leader election − Electing a node as leader for coordination purpose.
Locking and synchronization service − Locking the data while modifying it. This mechanism helps
you in automatic fail recovery while connecting other distributed applications like Apache HBase.
Highly reliable data registry − Availability of data even when one or a few nodes are down.
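To illustrate a few of these services from Python, here is a small sketch using the third-party kazoo client (assumed to be installed, with a ZooKeeper server reachable at 127.0.0.1:2181; the paths and values are made up for the example):
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store a small piece of configuration in a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=4")

# Naming / cluster membership: register this node as an ephemeral sequential znode,
# which disappears automatically if the client session dies.
zk.ensure_path("/app/nodes")
zk.create("/app/nodes/node-", value=b"10.0.0.5", ephemeral=True, sequence=True)

# Watches underpin cluster management and leader election: react to membership changes.
@zk.ChildrenWatch("/app/nodes")
def on_members_change(children):
    print("live nodes:", children)

value, stat = zk.get("/app/config")
print(value.decode())

zk.stop()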
Distributed applications offer a lot of benefits, but they also pose a few complex and hard-to-crack
challenges. The ZooKeeper framework provides a complete mechanism to overcome these challenges:
race conditions and deadlocks are handled using a fail-safe synchronization approach, and data
inconsistency, another main drawback, is resolved with atomicity.
Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −