Distributed ledger.

A distributed ledger is a transaction database that is stored and synchronized across
multiple sites, institutions, or geographies. Network nodes store copies of the ledger
and communicate any changes made by users to other nodes, which update their
ledgers to match. Modern communication technology makes these changes happen in
a matter of seconds, depending on the rates at which nodes compare their states.
Distributed ledgers contrast with centralized ledgers, which are most commonly used by
businesses and governments. Centralized ledgers are more prone to cyber-attacks and
fraud, as they have a single point of failure, although they can restrict access more
tightly because fewer parties need entry.

Blockchains and directed acyclic graphs are types of distributed ledgers.

KEY TAKEAWAYS

 A distributed ledger is a transaction database that is synchronized across different sites and geographies.
 The need for a central authority to keep a check against manipulation is
eliminated by using a distributed ledger.
 Distributed ledgers reduce the risk of fraud because the nodes can be
programmed to compare their states and reject unverified changes.

Understanding Distributed Ledgers


Data is collected and entered into digital files and then stored on computers. These
files make up ledgers. Software is used to access and use this data, and access is
granted to users who require it. In the past, and in many cases still today, these ledgers
have been stored in central locations and controlled by specific users. These locations
could be a closed network with a storage system housed in a server room and maintained by
system technicians. Data was usually audited and verified by humans, who are prone
to mistakes and corruption.

Distributed ledgers use the same concept of storing data in files, but instead of one
working copy of the ledger stored on a server (with backups), identical copies are
allowed to be stored on multiple machines in different geographies. The computers,
called nodes, automatically update their ledger copies and broadcast their states to
other nodes. All nodes are programmed to verify other nodes' ledgers, and the
network maintains its database.

Most of this work is done using cryptographic techniques such as hashing data and then
comparing the results, which can be done very quickly on modern computers and networks.
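To make the idea concrete, the following minimal Python sketch (an illustration, not any particular ledger implementation) shows how two nodes might compare ledger states by hashing them; the transaction records and node names are made up for the example.

import hashlib
import json

def ledger_fingerprint(transactions):
    """Hash a ledger's transaction list so two nodes can compare states cheaply."""
    # Serialize deterministically (sorted keys) so identical ledgers hash identically.
    serialized = json.dumps(transactions, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()

# Two nodes holding identical copies produce the same fingerprint.
node_a = [{"from": "alice", "to": "bob", "amount": 5}]
node_b = [{"from": "alice", "to": "bob", "amount": 5}]
node_c = [{"from": "alice", "to": "bob", "amount": 50}]  # tampered copy

print(ledger_fingerprint(node_a) == ledger_fingerprint(node_b))  # True
print(ledger_fingerprint(node_a) == ledger_fingerprint(node_c))  # False: divergence detected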
Advantages of Distributed Ledgers
While centralized ledgers are prone to cyber-attacks, distributed ledgers are
inherently harder to attack because a majority of the distributed copies would need to be
altered simultaneously for an attack to succeed. Because of their distributed nature,
these records are resistant to malicious changes by a single party. Distributed ledgers
can also allow for much more transparency than is available in centralized ledgers.

This transparency makes it much easier to establish an audit trail when conducting data
audits and financial reviews, which helps reduce the possibility of fraud occurring on a
company's financial books.

Distributed ledgers also reduce operational inefficiencies and shorten the time a
transaction takes to complete. They are automated and, therefore, can function
24/7. All of these factors reduce overall costs for the entities that use and operate
them.

Distributed Ledger Uses


Distributed ledger technology has great potential to revolutionize the way
governments, institutions, and corporations work. It can help governments collect
tax, issue passports, and record land registries, licenses, and the outlay of Social
Security benefits, as well as voting procedures.

The technology is making waves in several industries, including:

 Finance
 Music and entertainment
 Diamond and precious assets
 Artwork
 Supply chains of various commodities

While distributed ledger technology has multiple advantages, it is still at an early stage,
and the best ways to adopt it are still being explored. One thing is clear, though: the
centuries-old centralized ledger is headed toward a decentralized future.

Nakamoto consensus.

Key Takeaways
 The Nakamoto Consensus is a protocol that ensures all participants in a
blockchain network agree on a single, secure version of the blockchain.
 It relies on proof-of-work (PoW), block difficulty adjustment, and
decentralization to maintain network integrity and prevent tampering.
 While offering benefits like security and financial inclusion, it faces
challenges such as high energy consumption and potential
centralization risks.

Introduction
The Nakamoto Consensus is a fundamental concept in the world of
cryptocurrencies, particularly Bitcoin. Named after the pseudonymous creator
of Bitcoin, Satoshi Nakamoto, this consensus mechanism revolutionized the
way decentralized networks achieve agreement without a central authority.
This article explores what the Nakamoto Consensus is, how it works, and
why it is crucial for the functioning of Bitcoin.
What Is the Nakamoto Consensus?
The Nakamoto Consensus is a protocol used by blockchain networks to
achieve agreement (consensus) on the state of the blockchain. It’s essential
for maintaining the integrity and security of peer-to-peer (P2P) networks like
Bitcoin.
Basically, the Nakamoto Consensus ensures that all participants in the
network agree on a single version of the blockchain, preventing issues such
as double-spending and ensuring that transactions are valid.
Key Components of the Nakamoto Consensus
To understand how the Nakamoto Consensus works, it’s important to grasp
its key components:
1. Proof-of-work (PoW)
Proof-of-work is the mechanism by which new blocks are added to the
blockchain. It involves solving complex mathematical problems that require
significant computational power. The so-called miners compete to solve these
problems. The first miner to do so gets the right to add the next block to the
blockchain and receive a block reward in the form of newly minted bitcoins
plus transaction fees.
2. Block difficulty
The difficulty of the mathematical problems that miners need to solve is
adjusted periodically. This ensures that blocks are added at a consistent rate,
approximately every 10 minutes in the case of Bitcoin. As more miners join
the network and more computational power (hash rate) is applied, the
difficulty increases to maintain this rate.
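As a rough illustration of how difficulty works, the toy Python sketch below searches for a nonce whose hash starts with a chosen number of zero hex digits. This is a simplification, not Bitcoin's actual implementation: Bitcoin hashes an 80-byte block header twice with SHA-256 and compares the result against a compact difficulty target, whereas here the "block data" is just a string and the difficulty is a count of leading zeros.

import hashlib

def mine(block_data, difficulty):
    """Search for a nonce whose hash has `difficulty` leading zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block #1: alice->bob 5 BTC", difficulty=4)
print(nonce, digest)
# Raising `difficulty` by one makes the search roughly 16x harder on average,
# which is the intuition behind periodic difficulty adjustment.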
3. Block rewards and incentives
Miners are incentivized to participate in the network through block rewards
and transaction fees. When a miner successfully adds a block to the
blockchain, they receive a reward in the form of newly created bitcoins.
Additionally, miners collect transaction fees from the transactions included in
the block. These incentives are crucial for motivating miners to contribute
their computational power to the network.
4. Decentralization
The Nakamoto Consensus operates in a decentralized manner, meaning there
is no central authority controlling the network. Instead, consensus is achieved
through the collective effort of participants (miners) spread across the globe.
This decentralization is a core feature that ensures the network's security and
resilience.
How the Nakamoto Consensus Works
The process of achieving consensus in the Nakamoto Consensus can be
broken down into several steps:
1. Transaction broadcast
When a user wants to make a transaction, they broadcast it to the network.
This transaction is then picked up by nodes (computers) connected to the
Bitcoin network.
2. Transaction verification
Nodes verify the validity of the transaction by checking several factors, such
as whether the user has sufficient balance and whether the transaction follows
the network's rules.
3. Inclusion in a block
Verified transactions are grouped together by miners into a block. Miners
then start working on solving the PoW problem associated with that block.
4. Solving the proof-of-work
Miners compete to solve the mathematical problem (hashing) required for the
proof-of-work. This problem involves finding a hash (a string of characters)
that meets specific criteria. The process is resource-intensive and requires
significant computational power.
5. Block addition
The first miner to solve the problem broadcasts their solution to the network.
Other nodes verify the solution, and if it is correct, the new block is added to
the blockchain. This block becomes the latest entry in the chain, and all
subsequent blocks will build upon it.
6. Chain continuity
Once a block is added, miners start working on the next block, and the
process repeats. The blockchain grows over time, with each block containing
a reference (hash) to the previous block, creating a secure and tamper-
resistant chain.
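The hash-linking described above can be sketched in a few lines of Python. This is only a conceptual illustration (real blocks also contain headers, timestamps, Merkle roots, and proof-of-work), but it shows why altering an earlier block breaks every later link.

import hashlib
import json

def block_hash(block):
    """Hash a block's contents, including its reference to the previous block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def verify_chain(chain):
    """Check that every block correctly references the hash of its predecessor."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != block_hash(prev):
            return False
    return True

genesis = {"index": 0, "transactions": [], "prev_hash": "0" * 64}
block1 = {"index": 1, "transactions": ["alice->bob 5"], "prev_hash": block_hash(genesis)}
block2 = {"index": 2, "transactions": ["bob->carol 2"], "prev_hash": block_hash(block1)}
chain = [genesis, block1, block2]

print(verify_chain(chain))            # True
genesis["transactions"].append("x")   # tamper with an early block
print(verify_chain(chain))            # False: every later link is now broken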
Security and Attack Resistance
The Nakamoto Consensus is designed to be secure and resistant to attacks
through several mechanisms:
1. Difficulty adjustment
The difficulty of the proof-of-work problem adjusts based on the total
computational power of the network. This adjustment ensures that blocks are
added at a consistent rate, preventing any single miner or group of miners
from dominating the network.
2. Majority rule
The network operates on a majority rule principle. To successfully alter the
blockchain, an attacker would need to control more than 50% of the
network's computational power, known as a 51% attack. This is highly
impractical and expensive to do on the Bitcoin network, but smaller networks
can be susceptible to such attacks.
3. Decentralization
The decentralized nature of the network makes it difficult for any single
entity to gain control. The wide distribution of miners across the globe adds
to the network's resilience.
4. Economic incentives
Miners are financially incentivized to act honestly and follow the network's
rules. Attempting to attack the network or create invalid blocks would result
in wasted resources and loss of potential rewards, discouraging malicious
behavior.
Benefits of the Nakamoto Consensus
The Nakamoto Consensus offers several significant benefits that contribute to
the success and adoption of Bitcoin:
1. Trustless environment
Participants in the network do not need to trust each other or a central
authority. The consensus mechanism ensures that all transactions are valid
and that the blockchain remains secure and tamper-proof.
2. Security
The combination of proof-of-work, difficulty adjustment, and
decentralization makes the network highly secure. The likelihood of
successful attacks is minimal, ensuring the integrity of the blockchain.
3. Transparency
The blockchain is a public ledger, meaning all transactions are visible to
anyone. This transparency adds to the trustworthiness of the system, as
anyone can verify transactions and the state of the blockchain.
4. Financial Inclusion
The decentralized nature of the Nakamoto Consensus enables anyone with
internet access to participate in the network, promoting financial inclusion.
Challenges and Criticisms
Despite its advantages, the Nakamoto Consensus is not without challenges
and criticisms:
1. Energy consumption
The proof-of-work mechanism requires significant computational power,
leading to high energy consumption. This has raised environmental concerns
and calls for more energy-efficient consensus mechanisms.
2. Centralization risk
While the network is designed to be decentralized, there is a risk of
centralization if a small number of mining pools control a large portion of the
network's computational power.
3. Scalability
The current design of the Nakamoto Consensus limits the number of
transactions that can be processed per second. As the network grows,
scalability becomes a concern, leading to the development of solutions such
as the Lightning Network to address this issue.
4. Forks
Disagreements within the community can lead to forks, where the blockchain
splits into two separate chains. This can create confusion and uncertainty, as
seen in the 2017 split between Bitcoin and Bitcoin Cash.

5 Sybil Attack.

A Sybil attack is a type of attack on a computer network service in which an attacker subverts
the service's reputation system by creating a large number of pseudonymous identities and uses
them to gain a disproportionately large influence. It is named after the subject of the book Sybil,
a case study of a woman diagnosed with dissociative identity disorder.[1] The name was suggested
in or before 2002 by Brian Zill at Microsoft Research.[2] The term pseudospoofing had
previously been coined by L. Detweiler on the Cypherpunks mailing list and used in the
literature on peer-to-peer systems for the same class of attacks prior to 2002, but this term did not
gain as much influence as "Sybil attack".[3]

Description

The Sybil attack in computer security is an attack wherein a reputation system is subverted by
creating multiple identities.[4] A reputation system's vulnerability to a Sybil attack depends on
how cheaply identities can be generated, the degree to which the reputation system accepts
inputs from entities that do not have a chain of trust linking them to a trusted entity, and whether
the reputation system treats all entities identically. As of 2012, evidence showed that large-scale
Sybil attacks could be carried out in a very cheap and efficient way in extant realistic systems
such as BitTorrent Mainline DHT.[5][6]

An entity on a peer-to-peer network is a piece of software that has access to local resources. An
entity advertises itself on the peer-to-peer network by presenting an identity. More than one
identity can correspond to a single entity. In other words, the mapping of identities to entities is
many to one. Entities in peer-to-peer networks use multiple identities for purposes of
redundancy, resource sharing, reliability and integrity. In peer-to-peer networks, the identity is
used as an abstraction so that a remote entity can be aware of identities without necessarily
knowing the correspondence of identities to local entities. By default, each distinct identity is
usually assumed to correspond to a distinct local entity. In reality, many identities may
correspond to the same local entity.
An adversary may present multiple identities to a peer-to-peer network in order to appear and
function as multiple distinct nodes. The adversary may thus be able to acquire a disproportionate
level of control over the network, such as by affecting voting outcomes.

In the context of (human) online communities, such multiple identities are sometimes known
as sockpuppets. The less common term inverse-Sybil attack has been used to describe an attack
in which many entities appear as a single identity.[7]

6 Smart contract

 A smart contract is defined as a digital agreement that is signed and stored on a
blockchain network, which executes automatically when the contract's terms and
conditions (T&C) are met. The T&C is written in blockchain-specific programming
languages such as Solidity.
 Smart contracts form the foundation of most blockchain use cases,
from non-fungible tokens (NFTs) to decentralized apps and the
metaverse.
 This article explains how smart contracts work and details their
various types. It also lists the top smart contract tools available
and the best practices that need to be followed.
Table of Contents

 What Are Smart Contracts?


 History of Smart Contracts
 How Do Smart Contracts Work?
 Types of Smart Contracts
 Top 10 Uses of Smart Contracts
 Benefits and Challenges of Smart Contracts
 Top Smart Contract Tools
 Best Practices for Using Smart Contracts

What Are Smart Contracts?


A smart contract is a digital agreement signed and stored on a blockchain
network that executes automatically when the contract’s terms and conditions
(T&C) are met; the T&C is written in blockchain-specific programming
languages like Solidity.
One can also look at smart contracts as blockchain applications that enable all
parties to carry out their part of a transaction. Apps powered by smart
contracts are frequently referred to as “decentralized applications” or
“dapps.”
While the idea of blockchain is largely perceived as Bitcoin's underlying tech
driver, it has since grown into a force to be reckoned with. Using smart
contracts, a manufacturer requiring raw materials can establish payments, and
the supplier can schedule shipments. Then, based on the contract between the
two organizations, payments can be automatically transferred to the seller
upon dispatch or delivery.
History of Smart Contracts
Nick Szabo, a U.S.-born computer scientist who developed a virtual currency
dubbed “Bit Gold” in 1998, a decade before Bitcoin was introduced, was the
first to propose smart contracts in 1994. Szabo characterized smart contracts
as digital transaction mechanisms that implement a contract’s terms.
Many predictions made by Szabo in his paper have become part of our daily lives, in ways
that predate blockchain technology. However, his idea could not be implemented at the
time because the necessary technology, primarily the distributed ledger, did not yet exist.
In 2008, Satoshi Nakamoto introduced the revolutionary blockchain technology in a
whitepaper. It provided a shared, tamper-resistant record of transactions and prevented
the same funds from being spent twice. The emergence of this and other cutting-edge
technologies acted as a stimulus for the rise of smart contracts. Five years on, the
Ethereum blockchain platform made practical use of smart contracts achievable. Ethereum
is still one of the most prevalent platforms enabling smart contract implementation.
How Do Smart Contracts Work?
Like any other contract, a smart contract is a binding agreement between two
parties. It uses code to leverage the benefits of blockchain technology, unlocking
greater efficiency, transparency, and confidentiality. The execution of smart contracts
is controlled by relatively simple "if/when…then…" statements written in code on the
blockchain.
These are the steps needed for the functioning of smart contracts.
 Agreement: The parties wanting to conduct business or exchange
products or services must agree on the arrangement's terms and
conditions. They must also determine how the smart contract will
operate, including the criteria that must be met for the agreement
to be fulfilled.
 Contract creation: Participants in a transaction may create a
smart contract in many ways, including building it themselves or
collaborating with a smart contract provider. The provisions of the
contract are coded in a programming language. During this stage,
verifying the contract’s security thoroughly is critical.
 Deployment: When the contract has been finalized, it must be
published on the blockchain. The smart contract is uploaded to the
blockchain in the same way as regular crypto transactions, with
the code inserted into the data field of the transaction. Once the
transaction has been verified, it’s deemed active on the blockchain
and cannot be reversed or amended.
 Monitoring conditions: A smart contract runs by tracking the
blockchain or a different reliable source for predetermined
conditions or prompts. These triggers can be just about anything
that can be digitally verified, like a date attained, a payment made,
etc.
 Execution: When the trigger parameters are met, the smart
contract is activated as per the “if/when…then…” statement. This
may implement only one or multiple actions, like passing funds to
a vendor or registering the buyer’s possession of an asset.
 Recording: Contract execution results are promptly published on
the blockchain. The blockchain system verifies the actions taken,
logs their completion as an exchange, and stores the concluded
agreement on the blockchain. This document is available at all
times.
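As a purely conceptual illustration of the "if/when…then…" flow described in these steps, here is a plain-Python sketch of an escrow-style agreement. It is not Solidity and is not deployed on any blockchain; the class and field names are hypothetical and exist only to show how trigger conditions lead to automatic execution and recording.

from dataclasses import dataclass, field

@dataclass
class EscrowContract:
    """Toy illustration of a smart contract's 'if/when...then' logic (not real Solidity)."""
    buyer: str
    seller: str
    price: float
    paid: bool = False
    delivered: bool = False
    log: list = field(default_factory=list)

    def record_payment(self, amount):
        # Trigger condition: payment made.
        if amount >= self.price:
            self.paid = True
            self.log.append(f"{self.buyer} paid {amount}")
        self._execute()

    def record_delivery(self):
        # Trigger condition: goods dispatched/delivered.
        self.delivered = True
        self.log.append(f"{self.seller} delivered goods")
        self._execute()

    def _execute(self):
        # If/when both conditions are met, then release funds and record the outcome.
        if self.paid and self.delivered:
            self.log.append(f"funds released to {self.seller}")

contract = EscrowContract(buyer="manufacturer", seller="supplier", price=100.0)
contract.record_payment(100.0)
contract.record_delivery()
print(contract.log)  # payment, delivery, then automatic release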
Types of Smart Contracts
When it comes to the types of smart contracts, they are classified into three
categories — legal contracts, decentralized autonomous organizations or
DAOs, and logic contracts. Here, we’ll discuss each of the three in more
detail.
1. Smart legal contract
Smart legal contracts are enforceable by law. They adhere to the structure of legal
contracts: "If this happens, then this will happen." As smart contracts
reside on blockchain and are unchangeable, judicial or legal smart contracts
offer greater transparency than traditional documents among contracting
entities.
The parties involved execute contracts with digital signatures. Smart legal
contracts may be executed autonomously if certain prerequisites are fulfilled,
for example, making a payment when a specific deadline is reached. In the
event of failure to comply, stakeholders could face severe legal
repercussions.
2. Decentralized autonomous organizations
DAOs are democratic groups governed by a smart contract that confers them
with voting rights. A DAO serves as a blockchain-governed organization
with a shared objective that is collectively controlled. No executive or
president exists. Instead, blockchain-based tenets embedded within the
contract’s code regulate how the organization functions and funds are
allocated. VitaDAO is an example of this type of smart contract, where the
technology powers a community for scientific research.
3. Application logic contracts
ALCs, or application logic contracts, consist of application-based code that
typically remains synced with various other blockchain contracts. It enables
interactions between various devices, like the Internet of Things (IoT) or
blockchain integration. Unlike the other types of smart contracts, these are
not signed between humans or organizations but between machines and other
contracts.
Top 10 Uses of Smart Contracts
The uses of smart contracts are wide and varied, spread across industries.
1. Royalty payment in media and entertainment
As they enter the industry, new artists rely on revenues from streaming
services. Smart contract apps can facilitate easier royalty payments. These
contracts can outline, for instance, the share of royalties payable to the record
company and the artist. Instantaneous handling of these payments is an
enormous advantage for everyone involved.
Smart contracts could also potentially solve the challenge of royalty
distribution in an over-the-top (OTT) content world where traditional
network agreements do not apply. This technology allows emerging artists
and lesser-known actors to get small but regular payments.
2. Decentralized finance (DeFi) applications
Using cryptocurrencies and smart contracts, DeFi apps can offer financial
services without an intermediary. DeFi is no longer limited to peer-to-peer
transactions. On DeFi platforms, smart contracts facilitate complex processes
like borrowing, lending, or derivative transactions.
3. Conversion of assets into non-fungible tokens (NFTs)
By assigning ownership and governing the transferability of digital
assets, smart contracts have made it possible to create non-fungible tokens
(NFTs). Such contracts can also be extended to include added stipulations,
like royalties, along with access rights to platforms or software. Essentially,
smart contracts make it possible to treat digital assets just like physical ones,
with real tangible value.
4. B2B data marketplaces
A data marketplace is a portal where users can buy and sell diverse datasets
or data streams from a wide range of sources. Smart contracts facilitate
the creation of dynamic and fast-evolving marketplaces that support automated
and secure transactions without the need for human intervention. Datapace is
a good example of this particular smart contract use case.
5. Supply chain management
Smart contracts may work autonomously without mediators or third parties
because they are self-executing. An organization can create smart contracts
for an entire supply chain. This would not require regular management or
auditing. Any shipments received beyond the schedule might trigger
stipulated escalation measures to guarantee seamless execution.
6. Digital identity cards
Users can store reputational data and digital assets on smart contracts to
generate a digital identification card. When smart contracts are linked to
multiple online services, other external stakeholders can learn about
individuals without divulging their true identities.
For instance, these contracts may include credit scores lenders can use to
verify loan applicants without the risk of demographic profiling or
discrimination. Similarly, candidates can share resumes without the risk of
gender bias in hiring.
7. Electoral polls
Voting could occur within a secure environment created by smart contracts,
minimizing the likelihood of voter manipulation. Due to the encryption,
every vote is ledger-protected and extremely difficult to decode.
Additionally, smart contracts might boost voter turnout. With an online
voting system driven by smart contracts, one can avoid making trips to a
polling location.
8. Real estate
Smart contracts can accelerate the handover of property ownership. Contracts
can be autonomously created and executed. After the buyer’s payment to the
vendor, for instance, the smart contract may immediately assign control over
the asset dependent on the blockchain’s payment record.
9. Healthcare data management
Smart contracts can revolutionize healthcare by making data recording more
open and efficient. For instance, they might encourage clinical trials by
guaranteeing data integrity. Hospitals can maintain accurate patient data
records and effectively manage appointments.
10. Civil law
Smart contracts can also flourish in the legal industry. They can be used to create
legally binding business and social contracts. In certain regions of North
America, governments have authorized smart contracts for digitized
agreements. For example, California can issue marriage and birth certificates as
smart contracts.
Benefits and Challenges of Smart Contracts
Like any technology, smart contracts have both pros and cons. Here are the
benefits of smart contracts first:
Benefits of smart contracts
The key reasons to use smart contracts include:
1. Single source of truth
Individuals have the same data at all times, which reduces the likelihood of
contract clause exploitation. This enhances trust and safety because contract-
related information is accessible throughout the duration of the contract.
Additionally, transactions are replicated so that all involved parties have a
copy.
2. Reduction in human effort
Smart contracts don’t need third-party verification or human oversight. This
provides participants autonomy and independence, particularly in the case of
DAO. This intrinsic characteristic of smart contracts offers additional
benefits, including cost savings and faster processes.
3. Prevention of errors
A fundamental prerequisite for any contract is that every term and condition
is recorded in explicit detail. An omission may result in serious issues in the
future, including disproportionate penalties and legal complexities.
Automated smart contracts avoid form-filling errors. This is one of its
greatest advantages.
4. Zero-trust by default
The entire framework of smart contracts is a step beyond conventional
mechanisms. This implies that there’s no need to rely on the trustworthy
conduct of other parties during a transaction. A transaction or exchange does
not necessitate faith as a fundamental component, consistent with zero-trust
security standards. Since smart contracts operate on a decentralized network,
every aspect of the network is more open, fair, and equitable, with no risk of
privilege creep.

Legal aspects of cryptocurrency exchange

With the unique developments and advancements in the technology sector in India, especially during
the challenges posed by the rapid spread of COVID-19, the fintech sector has shown promising
results. Interest in cryptocurrencies such as Bitcoin, Ripple, and Dogecoin has grown among
Indian citizens, fuelled largely by curiosity and their rising popularity, and a large number of
people have started investing a noticeable part of their time and money in these virtual currencies.
In India, the apex financial authority, the Reserve Bank of India ("RBI"), has recognised
cryptocurrency as a form of digital/virtual currency created through computer code based on
cryptography/encryption, and thus free of any central issuing authority. Cryptocurrency is supported
by blockchain technology, which establishes a peer-to-peer issuance system that uses private and
public keys for authentication and encryption, enabling safe and secure transactions.
Growing Popularity Of Cryptocurrency
As an untapped, unregulated market with a potential of over a trillion dollars, India also
witnessed a huge surge of cryptocurrency exchanges.
Witnessing the increasing popularity of cryptocurrency within the short span of a year, and the
potential revenue loss to the Government of India, the regulators and authorities started to take
notice. As a consequence, in 2013 the RBI issued a press release warning the public against
dealing in virtual/digital currencies.
Restrictions Imposed By RBI
In November 2017 the Government of India established a high-level Inter-Ministerial Committee to
report on various issues related to the use of virtual currency and subsequently, in July 2019, this
Committee presented its report suggesting a blanket ban on private cryptocurrencies in India.
The threat of revenue loss was so imminent to the RBI that, even prior to the submission of
the report by the Inter-Ministerial Committee, in April 2018 the RBI issued a circular
restricting all commercial and co-operative banks, small finance banks, payment banks and
NBFCs from dealing in virtual/digital currencies themselves, and instructing them to stop
providing services to all entities that deal in virtual/digital currencies.
This stalled the rise of the crypto industry in India, as exchanges required banking services for
sending and receiving the money. The banking service is essential for the conversion into
cryptocurrency and in turn for paying salaries, vendors, office space etc. However, the situation
prevailing around cryptocurrencies and their usage completely changed on 4th March 2020, when the
Hon’ble Supreme Court of India, in a well-conceived judgment quashed the earlier ban imposed by
the RBI.
The Hon’ble Supreme Court of India chiefly examined the matter from the perspective of Article
19(1)(g) of the Indian Constitution, which talks about the freedom to practice any profession or to
carry on any occupation, trade or business, and the doctrine of proportionality.
The Apex Court noted that there is unanimity of opinion among regulators and governments of
other countries that, though virtual currencies have not acquired the status of legal tender, they
are digital representations of value and are capable of functioning as a medium of exchange, unit
of account and/or store of value.
While the court recognized the RBI's power to take pre-emptive action, it held that the
measure failed the test of proportionality, since RBI's regulated entities had not suffered any
direct or indirect damage or loss as a result of VC trading. Therefore,
among other reasons, on the grounds of proportionality the impugned Circular dated 06-04-2018 was
set aside.
Developments In The Crypto-World
The Government of India is now considering the introduction of a new bill titled “Cryptocurrency
and Regulation of Official Digital Currency Bill, 2021” (“New Bill”) which is similar in spirit to its
earlier versions. However, the New Bill seeks to ban private cryptocurrencies in India, with some
exceptions intended to encourage the underlying technology, while providing a framework for the
creation of an official digital currency to be issued by the RBI.
The New Bill addresses the absence of cryptocurrency laws by proposing to ban all private
cryptocurrencies in their entirety. The dichotomy in the New Bill's approach arises because the
RBI is still unclear about which kinds of cryptocurrency will fall within the scope of
"private cryptocurrency".
If the New Bill imposes a complete ban on private cryptocurrencies, it shall lead the cryptocurrency
investors to invest and deal in cryptocurrency in a completely unregulated market. Further, the aim of
introducing a law related to cryptocurrency is to ease the process of trading and holding, in a safer
technological environment.
However, even with the introduction of state-owned cryptocurrency which shall be monitored by the
RBI, the risk in investment and holding of cryptocurrency shall remain the same.
Current Situation Of Cryptocurrency In India
Towards the end of March 2021, according to the latest amendments to the Schedule III of the
Companies Act, 2013, the Government of India instructed that from the beginning of the new
financial year, companies have to disclose their investments in cryptocurrencies.
In simple words, companies now have to disclose profit or loss on transactions involving
cryptocurrency, the amount of holding, and details about the deposits or advances from any person
trading or investing in cryptocurrency. This move has been greatly appreciated by the people dealing
in the crypto sector, as this will open the door for all Indian companies to have Crypto on their
balance sheets.
Conclusion
Based on the inference that can be drawn from the aforementioned facts and current scenario around
the world dealing with matters of cryptocurrencies, it is noticeable that there is a complete lack of
clarity concerning cryptocurrency regulation in India.
Well-structured, clear regulations dealing with crypto trading exchanges, blockchain technology,
investors, and the people employed in such sector should be made the priority given that the world of
cryptocurrency is here to stay and demands more attention.
It is fascinating to note that the Draft National Strategy on Blockchain, 2021, published by the
Ministry of Electronics and Information Technology, highlighted the benefits of cryptocurrency.
Banning a virtual currency that has made an impact in many countries would therefore not be
ideal for the development of our nation.
The government needs to take an effective step towards the positive regulation and enforcement of
cryptocurrency as a way forward to earn the confidence of investors and the general public in
developing the nation. The Union Finance Minister Nirmala Sitharaman announced on
16th March 2021 that there would not be a complete ban on cryptocurrency: "we will allow a certain
amount of window for people to experiment on blockchain, bitcoins and cryptocurrency."
Still, it would be wiser to pause, sit back and wait for the Government to formulate clear
regulations concerning cryptocurrencies before venturing into this grey area.

Life of blockchain application

The life cycle of a blockchain application typically follows a sequence of stages that help ensure
the project's development, deployment, and sustainability. Here's a breakdown of the main
phases involved:

1. Ideation and Conceptualization

 Problem Definition: Identify the problem you're solving and determine how blockchain
can provide a solution.
 Use Case Identification: Pinpoint the specific use cases for the blockchain, such as
finance (DeFi), supply chain, identity management, or voting systems.
 Research: Conduct extensive research on existing blockchain platforms, consensus
algorithms (e.g., Proof of Work, Proof of Stake), and tools.

2. Design and Architecture

 Blockchain Type: Decide whether to use a public, private, or consortium blockchain based on the use case.
 Platform Selection: Choose a blockchain platform (e.g., Ethereum, Hyperledger,
Binance Smart Chain, Solana).
 Smart Contract Design: Design and write smart contracts that define the rules and logic
of your application.
 System Architecture: Plan the architecture of the application, including backend
infrastructure, user interfaces, and how they interact with the blockchain.

3. Development

 Smart Contract Development: Write the code for the blockchain’s smart contracts
using languages like Solidity (Ethereum), Vyper, or others depending on the platform.
 Backend Development: Develop the off-chain components (e.g., APIs, database
integration) that will interact with the blockchain.
 Frontend Development: Design and build the user interface (UI) to allow users to
interact with the blockchain via wallets or web applications.
 Security Considerations: Implement robust security measures, such as encryption and
secure key management.
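As one possible sketch of the backend piece, the snippet below shows how a Python service might read data from an already-deployed contract using the web3.py library. The RPC endpoint, contract address, and ABI are placeholders, and method names can vary slightly between web3.py versions (v6 uses is_connected, for example), so treat this as an outline rather than a drop-in implementation.

from web3 import Web3

# Placeholder values: replace with your node's RPC endpoint,
# the deployed contract's address, and its real ABI.
RPC_URL = "https://example-rpc.invalid"
CONTRACT_ADDRESS = "0x0000000000000000000000000000000000000000"
CONTRACT_ABI = [
    {"name": "totalSupply", "inputs": [], "outputs": [{"name": "", "type": "uint256"}],
     "stateMutability": "view", "type": "function"},
]

w3 = Web3(Web3.HTTPProvider(RPC_URL))
assert w3.is_connected(), "could not reach the node"

contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)
# Read-only call; no transaction is sent and no gas is spent.
print(contract.functions.totalSupply().call())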

4. Testing

 Unit Testing: Test individual smart contracts and other components for functionality and
correctness.
 Integration Testing: Ensure that the blockchain and off-chain components (APIs,
databases) work together seamlessly.
 Security Audits: Conduct security audits of the code, particularly for smart contracts.
This is crucial as vulnerabilities can lead to significant financial loss.
 Testnet Deployment: Deploy the application on a testnet (e.g., Ethereum Rinkeby,
Binance Testnet) to simulate real-world use without risk.

5. Deployment

 Mainnet Launch: Once testing is successful, deploy the application to the blockchain’s
mainnet.
 Smart Contract Deployment: Deploy the smart contracts to the mainnet and ensure they
are functioning as expected.
 Infrastructure Setup: Set up necessary infrastructure such as nodes, databases, and APIs
for supporting the application.

6. Monitoring and Maintenance

 Transaction Monitoring: Continuously monitor the blockchain for issues like network
congestion or failed transactions.
 Performance Monitoring: Track the performance of the application, checking for slow
transaction times or issues that could affect user experience.
 Bug Fixes & Upgrades: Regularly update the codebase to fix bugs, improve
performance, and add new features as needed.
 Security Maintenance: Continuously monitor for vulnerabilities in the blockchain or the
application and apply patches or updates accordingly.

7. Scaling and Optimization

 Scaling Solutions: Implement scaling solutions if the application gains significant usage
(e.g., layer 2 solutions like Optimistic Rollups, sharding).
 Cost Optimization: Optimize gas fees (on Ethereum and similar chains) or transaction
costs to ensure the application is cost-efficient for users.
 User Growth: Ensure that the application can handle increased user adoption and
transaction volume as it grows.

8. Governance and Updates


 Decentralized Governance: If your application is decentralized, establish governance
mechanisms (e.g., DAOs) that allow users to vote on changes and upgrades.
 Upgrades & Forks: Regularly upgrade the blockchain software or smart contracts to
meet evolving requirements. In some cases, hard forks may be needed to implement
substantial changes.

9. User Adoption and Feedback

 User Education: Blockchain applications can be complex, so educating users about how
to use the platform, wallet management, and security is crucial.
 User Feedback: Collect feedback from users on usability, features, and issues, which can
guide future updates or enhancements.

10. End of Life (if applicable)

 Retirement: If the blockchain application is no longer viable or has been replaced by a better solution, consider decommissioning the system.
 Migration: For applications that need to be shut down, plan for migrating users and data
to another platform, ensuring no disruption.

The life of a blockchain application is not static. It requires continuous updates, monitoring, and
scaling to adapt to user needs, regulatory changes, and technological advancements.

12 BDA QP

13

14 Machine learning with types.

Machine learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn
from data without explicit programming. There are various types of machine learning, each
suited for different tasks and applications. Here's an overview of the main types of machine
learning:

1. Supervised Learning

 Definition: In supervised learning, the model is trained on labeled data. This means that
the algorithm is given input-output pairs, and it learns to map inputs to the correct output.
The goal is to predict the output for new, unseen data.
 Use Cases: Classification (e.g., email spam detection, disease diagnosis) and regression
(e.g., predicting house prices, stock prices).
 Example Algorithms:
o Classification: Logistic Regression, Support Vector Machines (SVM), k-Nearest
Neighbors (k-NN), Decision Trees, Random Forest, Naive Bayes.
o Regression: Linear Regression, Polynomial Regression, Support Vector
Regression (SVR).
 Key Concept: The model learns from past data, where the true labels (outputs) are
known.
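A brief scikit-learn sketch of supervised classification is shown below: it trains a logistic regression model on the labeled Iris dataset and measures accuracy on held-out data. The dataset and model choice are just illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: each flower (inputs) comes with its known species (output).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn the input-to-output mapping
predictions = model.predict(X_test)    # predict labels for unseen data
print("accuracy:", accuracy_score(y_test, predictions))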

2. Unsupervised Learning

 Definition: In unsupervised learning, the model is given data without labels (i.e., the data
doesn't have known output values). The goal is to find hidden patterns or intrinsic
structures in the data.
 Use Cases: Clustering (e.g., customer segmentation, image grouping), anomaly detection
(e.g., fraud detection), dimensionality reduction (e.g., data compression).
 Example Algorithms:
o Clustering: k-Means, Hierarchical Clustering, DBSCAN.
o Dimensionality Reduction: Principal Component Analysis (PCA), t-SNE,
Autoencoders.
 Key Concept: The model tries to infer the structure of the data without labeled outputs.
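The following short scikit-learn sketch illustrates unsupervised clustering: k-Means is given only the points (no labels) and discovers three groups on its own. The synthetic data is generated purely for illustration.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: 300 points drawn from 3 hidden groups; no labels are given to the model.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)    # the model discovers the grouping itself
print(cluster_ids[:10], kmeans.cluster_centers_.shape)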

3. Semi-Supervised Learning

 Definition: Semi-supervised learning falls between supervised and unsupervised
learning. The algorithm is trained on a small amount of labeled data and a large amount
of unlabeled data. The goal is to improve learning accuracy by leveraging both types of
data.
 Use Cases: Image recognition, speech recognition, and natural language processing
(NLP), where obtaining labeled data is expensive or time-consuming.
 Example Algorithms: Semi-Supervised Support Vector Machines, Graph-Based
Methods.
 Key Concept: The model uses both labeled and unlabeled data to enhance learning.

4. Reinforcement Learning (RL)

 Definition: Reinforcement learning is a type of learning where an agent learns how to
behave in an environment by performing actions and receiving rewards (or penalties).
The agent aims to maximize cumulative rewards over time.
 Use Cases: Robotics (e.g., teaching a robot to walk), gaming (e.g., AlphaGo, chess),
autonomous vehicles, dynamic pricing.
 Example Algorithms: Q-Learning, Deep Q-Networks (DQN), Policy Gradient Methods,
Proximal Policy Optimization (PPO).
 Key Concept: The agent learns through trial and error, with feedback from the
environment in the form of rewards or punishments.
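A minimal tabular Q-learning sketch is given below. The five-cell corridor environment and the hyperparameters are invented purely for illustration; real RL problems use far richer environments, but the update rule is the standard Q-learning one.

import random

# Toy environment: a corridor of 5 cells; the agent starts at cell 0 and
# earns a reward of +1 only when it reaches the rightmost cell (state 4).
N_STATES, ACTIONS = 5, [0, 1]          # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        action = random.choice(ACTIONS) if random.random() < epsilon else Q[state].index(max(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

# Values grow for states closer to the goal (the terminal state itself is never updated).
print([round(max(q), 2) for q in Q])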

5. Self-Supervised Learning
 Definition: Self-supervised learning is a type of unsupervised learning where the model
generates labels from the input data itself. The idea is to create auxiliary tasks to learn
useful representations of data without relying on manually labeled data.
 Use Cases: Natural Language Processing (NLP), image representation learning (e.g., for
transfer learning).
 Example Algorithms: Contrastive Learning, Predictive Models (e.g., BERT for text,
SimCLR for images).
 Key Concept: The model generates its own labels from the data, often by predicting
parts of the data from other parts.

6. Deep Learning

 Definition: Deep learning is a subset of machine learning that uses neural networks with
many layers (deep neural networks). It is particularly effective for large and complex
datasets, such as images, audio, and text.
 Use Cases: Image classification (e.g., facial recognition, object detection), NLP (e.g., text
generation, translation), speech recognition, autonomous driving.
 Example Algorithms: Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), Long Short-Term Memory Networks (LSTMs), Transformer Models
(e.g., GPT, BERT).
 Key Concept: Deep learning models are highly flexible and capable of learning complex
features from raw data, often requiring large datasets and computational power.

7. Transfer Learning

 Definition: Transfer learning involves taking a pre-trained model (usually from a large
dataset) and fine-tuning it on a new, smaller dataset for a different but related task. This
approach is especially useful when labeled data is scarce.
 Use Cases: Fine-tuning models for specific applications in computer vision, NLP, and
medical diagnostics.
 Example Algorithms: Pre-trained CNNs (e.g., VGG, ResNet), BERT for NLP tasks.
 Key Concept: Transfer learning leverages knowledge from one domain to improve
performance in another.

8. Online Learning (Incremental Learning)

 Definition: Online learning is a type of machine learning where the model learns
incrementally as new data arrives, rather than being trained on a fixed dataset all at once.
This is useful in situations where data is constantly being updated or when the system
needs to adapt in real-time.
 Use Cases: Financial market predictions, personalized recommendations, fraud detection
in real-time.
 Example Algorithms: Stochastic Gradient Descent (SGD), Online Naive Bayes.
 Key Concept: The model learns from a continuous stream of data and updates its
parameters accordingly.
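The sketch below illustrates incremental learning with scikit-learn's SGDClassifier, whose partial_fit method updates the model one mini-batch at a time. The simulated "stream" is just a synthetic dataset split into batches, and loss="log_loss" assumes a recent scikit-learn release (older versions call it "log").

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Simulate a data stream by splitting one dataset into successive mini-batches.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
classes = np.unique(y)

model = SGDClassifier(loss="log_loss", random_state=0)
for start in range(0, len(X), 300):
    X_batch, y_batch = X[start:start + 300], y[start:start + 300]
    # partial_fit updates the model incrementally; `classes` is required on the first call.
    model.partial_fit(X_batch, y_batch, classes=classes)

print("accuracy on the latest batch:", model.score(X_batch, y_batch))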
Summary of the Types of Machine Learning:

 Supervised Learning: Trains on labeled data to predict outputs. Example use cases: email spam detection, house price prediction.
 Unsupervised Learning: Finds patterns in unlabeled data. Example use cases: customer segmentation, anomaly detection.
 Semi-Supervised Learning: Combines small labeled and large unlabeled data. Example use cases: image recognition with few labeled images.
 Reinforcement Learning: Learns through trial and error with rewards. Example use cases: robotics, game AI, self-driving cars.
 Self-Supervised Learning: Generates labels from the data itself. Example use cases: NLP, self-training models.
 Deep Learning: Uses neural networks with many layers for complex tasks. Example use cases: image and speech recognition, NLP.
 Transfer Learning: Fine-tunes pre-trained models for specific tasks. Example use cases: NLP models like BERT, computer vision models.
 Online Learning: Learns incrementally from a stream of data. Example use cases: real-time fraud detection, personalized ads.

35 Data analytics project life cycle

In today’s data-driven world, data analytics plays a crucial role in helping businesses make
informed decisions. The data analytics life cycle outlines the step-by-step process used to
extract actionable insights from data. Each phase of this cycle ensures that raw data is
transformed into meaningful insights for effective decision-making.

Data Analytics Lifecycle


The data analytics life cycle consists of multiple phases that guide analysts through data
collection, processing, and deriving insights. Each phase is critical for ensuring accurate,
valuable, and actionable results.

Phase 1: Discovery
 Define the business problem or objective to be solved.
 Identify the scope of the data analytics project.
 Determine the relevant data sources (internal and external).
 Collaborate with stakeholders to clarify objectives and requirements.
 Assess available resources, including data, tools, and team expertise.
Phase 2: Data Preparation
 Collect data from various sources like databases, APIs, and spreadsheets.
 Clean the data by handling missing values, removing duplicates, and correcting
inconsistencies.
 Transform and format data to make it suitable for analysis.
 Ensure data is standardized and normalized where necessary to maintain consistency.
 Conduct data profiling to understand data distributions, types, and patterns.
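As a small illustration of this phase, the pandas sketch below cleans a made-up raw table: it drops duplicates, fixes a column's type, fills missing values, and prints quick profiling output. Column names and values are hypothetical.

import pandas as pd

# Hypothetical raw data pulled from different sources (columns are illustrative).
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, 41, None],
    "monthly_spend": ["120", "95", "95", "300", None],
})

clean = (
    raw.drop_duplicates()                                                   # remove duplicate records
       .assign(monthly_spend=lambda d: pd.to_numeric(d["monthly_spend"]))   # fix the column's type
)
clean["age"] = clean["age"].fillna(clean["age"].median())                   # handle missing values
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())

# Quick profiling: data types and simple distribution statistics.
print(clean.dtypes)
print(clean.describe())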
Phase 3: Model Planning
 Select the variables (features) most relevant to solving the problem.
 Choose appropriate algorithms (e.g., regression, clustering, classification) based on the
problem type.
 Develop a roadmap for the modeling phase, outlining how the data will be used.
 Create an initial hypothesis about how the model will behave with the chosen data.
 Use exploratory data analysis (EDA) techniques like correlation matrices and scatter
plots to understand relationships in the data.
Phase 4: Model Building
 Build models using the selected algorithms and the prepared dataset.
 Train models by feeding them the training dataset to allow them to learn patterns.
 Use a test dataset to validate model performance and avoid overfitting.
 Iterate and refine the model by adjusting parameters for better accuracy.
 Use cross-validation techniques like K-fold validation to ensure robustness.
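The scikit-learn sketch below illustrates this phase: it trains a model on a training split, uses 5-fold cross-validation to check robustness, and then evaluates on a held-out test set. The dataset and the random forest model are arbitrary choices for the example.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# K-fold cross-validation on the training data guards against overfitting to one split.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print("cross-validated accuracy:", round(scores.mean(), 3))

# Final check on the held-out test set.
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))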
Phase 5: Communication of Results
 Visualize results through charts, graphs, dashboards, and other visuals to make insights
understandable.
 Present key findings to stakeholders in a concise, actionable format.
 Summarize the insights derived from the data and explain how they align with business
goals.
 Provide recommendations on actions based on the data analysis.
 Prepare a comprehensive report that includes all aspects of the data analysis process,
findings, and conclusions.
Phase 6: Operationalize
 Deploy the model into production environments, integrating it into business processes.
 Automate tasks like decision-making or predictions based on the model’s insights.
 Continuously monitor model performance to ensure it remains accurate and relevant as
new data becomes available.
 Update or retrain the model as needed to accommodate changing data trends.
 Document the deployment and maintenance processes to ensure long-term usability.

Conclusion
The data analytics life cycle is a comprehensive process that guides professionals from the
initial discovery phase to the final operationalization of insights. Each phase plays a critical role
in ensuring that raw data is transformed into meaningful insights that drive informed decision-
making. Starting with understanding the business problem, moving through data preparation,
model building, and finally deploying the results, the life cycle ensures a structured approach to
data analysis.

By following this systematic process, organizations can harness the power of their data more
effectively, reduce errors, and make more accurate predictions. Additionally, continuous
monitoring and updating of the models ensure that the insights remain relevant in a rapidly
changing data landscape. Whether you're working on small-scale data projects or large
enterprise analytics initiatives, understanding and applying the phases of the data analytics
life cycle will help you deliver reliable, actionable results.

Architecture of Spark.

What is Spark?
Apache Spark is an open-source framework that processes large amounts of unstructured,
semi-structured, and structured data for analytics. Spark's architecture is regarded as an
alternative to Hadoop's MapReduce architecture for big data processing. The RDD and the
DAG, Spark's data storage and processing abstractions, are used to store and process data,
respectively. The Spark architecture consists of four components: the Spark driver,
executors, cluster managers, and worker nodes. It uses Datasets and DataFrames as the
fundamental data storage mechanism to optimise the Spark process and big data
computation.

Apache Spark Features

Apache Spark, a popular open-source cluster computing framework, was created to accelerate
data processing applications. It enables applications to run faster by utilising in-memory
cluster computing. A cluster is a collection of nodes that communicate with each other and
share data. Because of implicit data parallelism and fault tolerance, Spark can be applied
to a wide range of sequential and interactive processing demands.

 Speed: Spark performs up to 100 times faster than MapReduce for processing
large amounts of data. It is also able to divide the data into chunks in a controlled
way.
 Powerful Caching: Powerful caching and disk persistence capabilities are offered
by a simple programming layer.
 Deployment: Mesos, Hadoop via YARN, or Spark’s own cluster manager can all
be used to deploy it.
 Real-Time: Because of its in-memory processing, it offers real-time computation
and low latency.
 Polyglot: Spark provides APIs in Java, Scala, Python, and R, so you can write
Spark code in any of these languages. It also provides command-line shells in
Scala and Python.

Two Main Abstractions of Apache Spark

The Apache Spark architecture consists of two main abstraction layers:

Resilient Distributed Datasets (RDD):


An RDD is the key abstraction for data computation. It is an immutable, distributed data
structure that acts as an interface to the data and can be recomputed in the event of a
failure. There are two kinds of operations on RDDs: transformations and actions.
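A short PySpark sketch of the two kinds of RDD operations follows; it assumes a local Spark installation with the pyspark package available. The map and filter calls are lazy transformations, while collect and count are actions that trigger execution.

from pyspark.sql import SparkSession

# Local Spark session; in a real deployment the master would be a cluster manager URL.
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))        # create an RDD

# Transformations are lazy: they only describe a new RDD.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions trigger actual computation on the executors.
print(even_squares.collect())                 # [4, 16, 36, 64, 100]
print(even_squares.count())                   # 5

spark.stop()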

Directed Acyclic Graph (DAG):


The driver converts the program into a DAG for each job; a DAG is a sequence of
connections between nodes with no cycles. The Apache Spark ecosystem includes various
components such as the Spark Core API, Spark SQL, Spark Streaming for real-time
processing, MLlib, and GraphX. Using the Spark shell you can read large volumes of data,
and through the Spark context you can run, monitor, and cancel jobs and tasks.

Spark Architecture

Figure: Apache Spark base architecture diagram.
When the Driver Program in the Apache Spark architecture executes, it calls the real
program of an application and creates a SparkContext. SparkContext contains all of the
basic functions. The Spark Driver includes several other components, including a DAG
Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are
responsible for translating user-written code into jobs that are actually executed on the
cluster.

The Cluster Manager manages the execution of various jobs in the cluster. Spark Driver
works in conjunction with the Cluster Manager to control the execution of various other
jobs. The cluster Manager does the task of allocating resources for the job. Once the job
has been broken down into smaller jobs, which are then distributed to worker nodes,
SparkDriver will control the execution.
Many worker nodes can be used to process an RDD created in the SparkContext, and the
results can also be cached.

The Spark Context receives task information from the Cluster Manager and enqueues it
on worker nodes.

The executor is in charge of carrying out these duties. The lifespan of executors is the
same as that of the Spark Application. We can increase the number of workers if we want
to improve the performance of the system. In this way, we can divide jobs into more
coherent parts.

Spark Architecture Applications

A high-level view of the architecture of the Apache Spark application is as follows:

The Spark driver

The driver process is the master node (process): it coordinates workers and oversees the
tasks. The application is split into jobs that are scheduled to be executed on executors in
the cluster. The driver creates a Spark context, which acts as a gateway: it connects to
the Spark cluster and monitors the jobs running in it. The driver program calls the main
application, creates the Spark context, and everything is executed through that Spark
context.

Each Spark session has an entry point in the Spark context. Spark drivers include additional
components to execute jobs in clusters, and they work with cluster managers. The context
acquires worker nodes to execute tasks and store data, since Spark clusters can be connected
to different types of cluster managers. When a job runs in the cluster, it is divided into
stages, and the stages are further divided into scheduled tasks.

The Spark executors

An executor is responsible for executing tasks and caching data. Executors first register
with the driver program at startup, and each executor has a number of slots for running
tasks concurrently. An executor runs a task once its data has been loaded, and executors
are removed when they sit idle. Executors are allocated dynamically and are constantly
added and removed over the course of the application. The driver program monitors the
executors while they work, and users' tasks are executed inside the executors' Java
processes.

Cluster Manager

The cluster manager allocates resources to the application and launches the executor processes on the worker nodes (Spark can run on several kinds of cluster managers, such as the Mesos and YARN managers mentioned later in this section). The driver program negotiates with the cluster manager for these resources; executors then register with the driver, are added and removed dynamically depending on how long they are needed, and service the application's tasks, reading and writing external data as required. The driver monitors the executors as they perform users' tasks, and the code runs in the executors' Java processes.

Worker Nodes
The worker (slave) nodes host the executors: they process the tasks assigned to them and return the results to the SparkContext. The SparkContext issues tasks to the worker nodes, which execute them. Increasing the number of worker nodes (from 1 to n) lets a job be divided into sub-jobs that run in parallel on multiple machines. In Spark, a partition is the unit of work: the data is split into partitions, and each task processes one partition on an executor.

The following points are worth remembering about this design:

1. There are multiple executor processes for each application, which run tasks on
multiple threads over the course of the whole application. This allows applications
to be isolated both on the scheduling side (drivers can schedule tasks individually)
and the executor side (tasks from different apps can run in different JVMs).
Therefore, data must be written to an external storage system before it can be
shared across different Spark applications.
2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run Spark even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
3. The driver program must listen for and accept incoming connections from its
executors throughout its lifetime (e.g., see spark.driver.port in the network config
section). Workers must be able to connect to the driver program via the network.
4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you need to send requests to the cluster remotely, it is better to open an RPC to the driver and have it submit operations from nearby than to run the driver far away from the worker nodes.

Modes of Execution

You can choose from three different execution modes, listed below. The execution mode determines where your application's resources, in particular the driver process, are physically located when you run the application.

1. Cluster mode
2. Client mode
3. Local mode

Cluster mode: Cluster mode is the most frequent way of running Spark Applications. In
cluster mode, a user delivers a pre-compiled JAR, Python script, or R script to a cluster
manager. Once the cluster manager receives the pre-compiled JAR, Python script, or R
script, the driver process is launched on a worker node inside the cluster, in addition to
the executor processes. This means that the cluster manager is in charge of all Spark
application-related processes.

Client mode: In contrast to cluster mode, in client mode the Spark driver remains on the client machine that submitted the application. That machine is therefore responsible for maintaining the Spark driver process, while the cluster manager maintains the executor processes. Such client machines are usually referred to as gateway machines or edge nodes.

Local mode: Local mode runs the entire Spark Application on a single machine, as opposed to the previous two modes, which parallelize the application across the executors of a cluster; in local mode, parallelism is achieved through threads on that single machine. This is a common way to experiment with Spark, try out your applications, or iterate quickly without needing a cluster.
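
For experimentation, a local-mode session can be created directly by pointing the master at the local machine; the sketch below assumes PySpark is installed, and the application name is illustrative. For the client and cluster modes described above, the same application would instead be submitted with spark-submit, whose --deploy-mode option selects between them.

from pyspark.sql import SparkSession

# Local mode: driver and executor threads all run in this single process,
# with one worker thread per CPU core ("local[*]").
spark = (SparkSession.builder
         .appName("local-mode-example")
         .master("local[*]")
         .getOrCreate())

print(spark.range(1, 1000).count())   # 999 -- a tiny job to confirm it works

spark.stop()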

In practice, we do not recommend using local mode for running production applications.
Hadoop Setup Steps

Step 1: Install Java Development Kit


To start, you'll need to install the Java Development Kit (JDK) on your Ubuntu
system. The default Ubuntu repositories offer both Java 8 and Java 11, but it's
recommended to use Java 8 for compatibility with Hive. You can use the following
command to install it:
sudo apt update && sudo apt install openjdk-8-jdk

Step 2: Verify Java Version


Once the Java Development Kit is successfully installed, you should check the version
to ensure it's working correctly:
java -version

Output:

Step 3: Install SSH


SSH (Secure Shell) is crucial for Hadoop, as it facilitates secure communication
between nodes in the Hadoop cluster. This is essential for maintaining data integrity
and confidentiality and enabling efficient distributed data processing across the
cluster:
sudo apt install ssh

Step 4: Create the Hadoop User


You must create a user specifically for running Hadoop components. This user will
also be used to log in to Hadoop's web interface. Run the following command to
create the user and set a password:
sudo adduser hadoop

Output:

Step 5: Switch User


Switch to the newly created 'hadoop' user using the following command:
su - hadoop

Step 6: Configure SSH


Next, you should set up password-less SSH access for the 'hadoop' user to streamline
the authentication process. You'll generate an SSH keypair for this purpose. This
avoids the need to enter a password or passphrase each time you want to access the
Hadoop system:
ssh-keygen -t rsa

Output:
Step 7: Set Permissions
Copy the generated public key to the authorized key file and set the proper
permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

Step 8: SSH to the localhost


You will be asked to authenticate hosts by adding RSA keys to known hosts. Type
'yes' and hit Enter to authenticate the localhost:
ssh localhost

Output:
Step 9: Switch User
Switch to the 'hadoop' user again using the following command:
su - hadoop

Step 10: Install Hadoop


To begin, download Hadoop version 3.3.6 using the 'wget' command:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once the download is complete, extract the contents of the downloaded file using the
'tar' command. Optionally, you can rename the extracted folder to 'hadoop' for easier
configuration:
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop

Next, you need to set up environment variables for Java and Hadoop on your system.
Open the '~/.bashrc' file in your preferred text editor. If you're using 'nano', you can
paste code with 'Ctrl+Shift+V' and save with 'Ctrl+X', then 'Y', then 'Enter':
nano ~/.bashrc

Append the following lines to the file:


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


Output:
Load the above configuration into the current environment:
source ~/.bashrc

Additionally, you should configure the 'JAVA_HOME' in the 'hadoop-env.sh' file.


Edit this file with a text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the "export JAVA_HOME" line and configure it as follows:


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Output:
Step 11: Configuring Hadoop
Create the namenode and datanode directories within the 'hadoop' user's home
directory using the following commands:
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

Next, edit the 'core-site.xml' file and set the default filesystem URI (replace 'localhost' with your system hostname if needed):
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Output:

Save and close the file. Next, edit the 'hdfs-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the NameNode and DataNode directory paths as shown below:


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

Output:

Save and close the file. Then, edit the 'mapred-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>

Output:
Finally, edit the 'yarn-site.xml' file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Output:

Step 12: Start Hadoop Cluster


Before starting the Hadoop cluster, you need to format the Namenode as the 'hadoop'
user. Format the Hadoop Namenode with the following command:
hdfs namenode -format

Output:

Once the Namenode directory is successfully formatted with the HDFS file system,
you will see the message "Storage directory
/home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted." Start the
Hadoop cluster using:
start-all.sh

Output:
You can check the status of all Hadoop services using the command:
jps

Output:

Step 13: Access Hadoop Namenode and Resource Manager


First, determine your IP address by running:
ifconfig

If needed, install 'net-tools' using:


sudo apt install net-tools

To access the Namenode, open your web browser and visit http://your-server-ip:9870.
Replace 'your-server-ip' with your actual IP address. You should see the Namenode
web interface.

Output:
To access the Resource Manager, open your web browser and visit the URL
http://your-server-ip:8088. You should see the following screen:

Output:

Step 14: Verify the Hadoop Cluster


The Hadoop cluster is installed and configured. Next, we will create some directories
in the HDFS filesystem to test Hadoop. Create directories in the HDFS filesystem
using the following command:
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs

Next, run the following command to list the above directory:


hdfs dfs -ls /

You should get the following output:

Also, put some files into the Hadoop file system. For example, put log files from the
host machine into the Hadoop file system:
hdfs dfs -put /var/log/* /logs/

You can also verify the above files and directories in the Hadoop web interface. Go to
the web interface, click on Utilities => Browse the file system. You should see the
directories you created earlier on the following screen:
Step 15: To Stop Hadoop Services
To stop the Hadoop service, run the following command as a Hadoop user:
stop-all.sh

Output:

In summary, you've learned how to install Hadoop on Ubuntu. Now, you're ready to
unlock the potential of big data analytics. Happy exploring!

Streaming R in Hadoop

Integrating Hadoop with R involves merging the power of Hadoop's distributed data processing
capabilities with R's advanced analytics and data manipulation functions. This synergy allows
data scientists and analysts to efficiently analyze vast datasets stored in Hadoop's distributed file
system (HDFS) using R's familiar interface. By leveraging packages like "rhipe" or "rhdfs," users
can seamlessly access and process big data in R, making it easier to perform complex data
analytics, machine learning, and statistical modeling on large-scale datasets. This integration
enhances the scalability and versatility of data analysis workflows, making it a valuable tool for
handling big data challenges.

What is Hadoop?
Hadoop is a powerful open-source framework designed for distributed storage and processing of
vast amounts of data. Originally developed by Apache Software Foundation, it has become a
cornerstone of big data technology. Hadoop consists of two core components: Hadoop
Distributed File System (HDFS) and the Hadoop MapReduce programming model. HDFS is
responsible for storing data across a cluster of commodity hardware, breaking it into smaller
blocks and replicating them for fault tolerance. This distributed storage system allows Hadoop to
handle data at a massive scale.
What is R?
R is a powerful, open-source programming language and environment widely used for
statistical computing, data analysis, and graphics. Developed in the early 1990s, R has since
gained immense popularity among statisticians, data scientists, and analysts due to its versatility
and extensive library of packages. R offers an array of statistical and graphical techniques,
making it a go-to tool for tasks ranging from data manipulation and visualization to complex
statistical modeling and machine learning. Users can perform data cleansing, transformation, and
exploratory data analysis with ease, and R's graphical capabilities allow for the creation of high-
quality plots, charts, and graphs, facilitating data visualization and presentation.

Why Integrate R with Hadoop?


Integrating R with Hadoop offers several advantages, primarily because it allows data scientists
and analysts to harness the combined power of R's advanced analytics and Hadoop's distributed
data processing capabilities. Here are some reasons why integrating R with Hadoop is beneficial:

 Scalability: Hadoop is designed to handle massive volumes of data across distributed
clusters. By integrating R with Hadoop, you can leverage the scalability of Hadoop's
distributed file system (HDFS) to process and analyze large datasets efficiently. R alone
may struggle with big data, but Hadoop can manage it effectively.
 Parallel Processing: Hadoop's MapReduce framework enables parallel processing of data
across multiple nodes. Integrating R with Hadoop allows you to distribute R
computations across these nodes, significantly speeding up data analysis tasks. This
parallelization is especially critical for big data analytics.
 Data Variety: Hadoop can store and manage structured and unstructured data from
various sources, such as log files, social media, and sensor data. Integrating R with
Hadoop enables you to analyze and extract insights from diverse data types using R's
powerful analytics tools.
 Cost-Efficiency: Hadoop is known for its cost-effective storage and processing of large
datasets. By utilizing Hadoop's distributed storage and computational resources,
organizations can reduce infrastructure costs while benefiting from R's analytics
capabilities.
R Hadoop

The rmr package

The "rmr" package, short for "R MapReduce," enables the integration of R with the Hadoop
MapReduce framework. It is part of the RHadoop collection of packages and provides a bridge
between R and Hadoop, allowing data scientists and analysts to perform distributed data
processing and analysis using R.

Here are some key features and components of the "rmr" package:

 MapReduce Framework: The "rmr" package leverages the MapReduce programming
model, which is at the core of Hadoop's distributed data processing capabilities. It allows
users to define custom Map and Reduce functions in R to process data in parallel across a
Hadoop cluster.
 Data Manipulation: "rmr" provides functions for data manipulation, transformation, and
filtering, which can be applied to data stored in Hadoop's distributed file system (HDFS)
or other data sources.
 Integration with R: Users can write R code for both the mapping and reducing phases of
MapReduce jobs, making it easier to work with complex data analysis and statistical
tasks using R's familiar syntax.

The rhbase package

This package provides basic connectivity to the HBase distributed database, using the Thrift
server. R programmers can browse, read, write, and modify tables stored in HBase from within R.

 Connectivity: The "rhbase" package enables R programmers to establish connections to
an HBase instance through the Thrift server. This allows R to interact with and
manipulate data stored in HBase tables.
 Data Operations: With "rhbase," R users can perform various data operations on HBase
tables, including reading data, writing data, modifying records, and browsing table
structures.
 Single-Node Installation: The package is typically installed on the node that serves as the
R client. This means that you don't need to install it on every node in the Hadoop cluster;
you only need it where you intend to run R scripts that interact with HBase.

The rhdfs package

The "rhdfs" package is an R package that facilitates the integration of R with the Hadoop
Distributed File System (HDFS). HDFS is the primary storage system used in Hadoop clusters,
and it's designed to store and manage very large datasets across distributed computing nodes.
The "rhdfs" package allows R users to interact with HDFS, read and write data, and perform
various file operations within an R environment.

Here are some key features and functionalities of the "rhdfs" package:

 HDFS Connectivity: "rhdfs" provides functions to establish connections to HDFS
instances, enabling R to communicate with and access data stored in HDFS.
 File Operations: Users can perform standard file operations like reading, writing,
deleting, moving, and listing files and directories within HDFS using R functions
provided by the package.
 Data Import/Export: The package facilitates the seamless import and export of data
between R and HDFS. This is particularly useful when working with large datasets that
are stored in HDFS.

R with Hadoop Streaming


R with Hadoop Streaming is a technique that allows you to integrate the R programming
language with the Hadoop ecosystem using Hadoop's streaming API. This approach enables you
to leverage the distributed data processing capabilities of Hadoop while writing your data
processing logic in R. Here's how it works:

 Hadoop Streaming: Hadoop Streaming is a utility that comes with Hadoop, allowing you
to use any executable program as a mapper and reducer in a MapReduce job. Instead of
writing Java code, you can use scripts or executable programs in languages like Python,
Perl, or R to perform the mapping and reducing tasks.
 R as a Mapper/Reducer: To use R with Hadoop Streaming, you write R scripts that serve
as the mapper and/or reducer functions. These scripts read input data from standard input
(stdin), process it using R code, and then emit output to standard output (stdout). You can
use command-line arguments to pass parameters to your R script.
 Data Distribution: Hadoop takes care of distributing the input data across the cluster and
managing the parallel execution of your R scripts on different nodes. Each mapper
processes a portion of the data independently and produces intermediate key-value pairs.
 Shuffling and Reducing: The intermediate key-value pairs are sorted and shuffled, and
then they are passed to the reducer (if specified) to aggregate or further process the data.
You can also use R scripts as reducers in this step.
 Output: The final results are written to HDFS or another storage location, making them
available for further analysis or use.
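
To make this contract concrete, the following word-count sketch shows the kind of script Hadoop Streaming can run; it is written in Python for brevity, but an R script would follow exactly the same pattern of reading lines from stdin and writing tab-separated key-value pairs to stdout. The file name and invocation are illustrative.

#!/usr/bin/env python3
# streaming_wordcount.py -- run as "streaming_wordcount.py map" for the mapper
# and "streaming_wordcount.py reduce" for the reducer; Hadoop Streaming feeds
# each phase on stdin and collects its stdout.
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so the counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Such scripts are passed to the hadoop-streaming utility through its -mapper and -reducer options, together with -input and -output paths in HDFS, and Hadoop takes care of the distribution, shuffling, and sorting described above.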

Processing Big Data with MapReduce


Processing big data with MapReduce in R involves leveraging the MapReduce programming
model to analyze and manipulate large datasets using the R programming language. MapReduce
is a distributed data processing framework commonly associated with Hadoop, but there are R
packages and tools available that enable you to apply MapReduce principles within an R
environment.

 Install R Packages: To get started, you'll need to install R packages that provide
MapReduce functionality. Two popular R packages for this purpose are "rmr2" and
"rhipe" (R and Hadoop Integrated Programming Environment). "rmr2" is designed for use
with Hadoop MapReduce, while "rhipe" provides its own integrated programming environment on top of Hadoop.
 Set Up Your Hadoop Cluster: Ensure that you have access to a Hadoop cluster or Hadoop
distribution. You will need a running Hadoop cluster to execute MapReduce jobs.
 Write Map and Reduce Functions in R: With the R packages installed, you can write your
custom Map and Reduce functions in R. These functions define how your data will be
processed. In MapReduce, the Map function processes input data and emits key-value
pairs, and the Reduce function aggregates and processes these pairs.

Zookeeper

ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this
problem with its simple architecture and API, allowing developers to focus on core application
logic without worrying about the distributed nature of the application.

The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by
Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to
track the status of distributed data.

Before moving further, it is important that we know a thing or two about distributed applications. So,
let us start the discussion with a quick overview of distributed applications.

Distributed Application
A distributed application runs on multiple systems in a network at the same time, with the systems
coordinating among themselves to complete a particular task quickly and efficiently. Complex,
time-consuming tasks that would take hours for a non-distributed application (running on a single
system) can be completed in minutes by a distributed application that uses the computing
capabilities of all the systems involved.

The time to complete the task can be further reduced by configuring the distributed application to run
on more systems. A group of systems in which a distributed application is running is called
a Cluster and each machine running in a cluster is called a Node.
A distributed application has two parts, Server and Client application. Server applications are
actually distributed and have a common interface so that clients can connect to any server in the
cluster and get the same result. Client applications are the tools to interact with a distributed
application.

Benefits of Distributed Applications


 Reliability − Failure of a single system, or a few systems, does not make the whole system fail.
 Scalability − Performance can be increased as and when needed by adding more machines, with only
minor changes to the configuration of the application and no downtime.
 Transparency − Hides the complexity of the system and presents itself as a single entity / application.

Challenges of Distributed Applications


 Race condition − Two or more machines trying to perform a particular task, which actually needs to
be done only by a single machine at any given time. For example, shared resources should only be
modified by a single machine at any given time.
 Deadlock − Two or more operations waiting for each other to complete indefinitely.
 Inconsistency − Partial failure of data.

What is Apache ZooKeeper Meant For?


Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate between themselves
and maintain shared data with robust synchronization techniques. ZooKeeper is itself a distributed
application providing services for writing a distributed application.

The common services provided by ZooKeeper are as follows −

 Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
 Configuration management − Latest and up-to-date configuration information of the system for a
joining node.
 Cluster management − Joining / leaving of a node in a cluster and node status at real time.
 Leader election − Electing a node as leader for coordination purpose.
 Locking and synchronization service − Locking the data while modifying it. This mechanism helps
you in automatic fail recovery while connecting other distributed applications like Apache HBase.
 Highly reliable data registry − Availability of data even when one or a few nodes are down.

Distributed applications offer a lot of benefits, but they also pose a few complex and hard-to-crack
challenges. The ZooKeeper framework provides a complete mechanism to overcome them: race
conditions and deadlocks are handled using a fail-safe synchronization approach, and the other main
drawback, inconsistency of data, is resolved through atomicity.
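
As a rough sketch of how these coordination services look from application code, the example below assumes a ZooKeeper server reachable at 127.0.0.1:2181 and the third-party kazoo client library for Python; the znode paths and values are illustrative.

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (here a single local server).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: keep a piece of shared configuration in a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"batch.size=128")

# Cluster management / naming: each worker registers an ephemeral sequential
# node that disappears automatically if the worker's session dies.
zk.create("/app/workers/worker-", b"host-a",
          ephemeral=True, sequence=True, makepath=True)

# Any node can now discover the live workers and read the configuration.
print(zk.get_children("/app/workers"))
value, stat = zk.get("/app/config")
print(value.decode())

zk.stop()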

Benefits of ZooKeeper
Here are the benefits of using ZooKeeper −

 Simple distributed coordination process
 Synchronization − Mutual exclusion and cooperation between server processes. This helps
Apache HBase, for example, with configuration management.
 Ordered messages
 Serialization − Encodes data according to specific rules and ensures the application runs
consistently. This approach can be used in MapReduce to coordinate the queues that execute
running threads.
 Reliability
 Atomicity − A data transfer either succeeds or fails completely; no transaction is partial.
