BIG DATA - 25.09.2020 (19 Files Merged)
WHAT IS DATA?
DATA IS THE QUANTITIES, CHARACTERS OR SYMBOLS ON WHICH OPERATIONS ARE PERFORMED BY A COMPUTER.
WHAT IS INFORMATION?
PROCESSED DATA IS KNOWN AS INFORMATION.
EXAMPLE:
1) SOCIAL MEDIA:
STATISTICS SHOW THAT 500+ TB OF NEW DATA
GET INGESTED INTO THE DATABASES OF SOCIAL MEDIA SITES EVERY DAY.
2) SHARE MARKET:
THE NEW YORK STOCK EXCHANGE GENERATES A HUGE
AMOUNT OF NEW DATA (SAY, ABOUT 1 TB) PER DAY
THROUGH ITS DAILY TRANSACTIONS.
3) E-COMMERCE SITE:
FLIPKART, AMAZON ETC. GENERATE HUGE AMOUNTS
OF DATA.
4) AIRPLANE:
A SINGLE JET PLANE CAN GENERATE 10+ TB OF
DATA IN 30 MINUTES OF FLIGHT TIME.
TYPES OF DATA:
THREE TYPES:
1) STRUCTURED DATA
2) UNSTRUCTURED DATA
3) SEMI-STRUCTURED DATA
STRUCTURED DATA:
STRUCTURED DATA IS DATA THAT IS ALREADY STORED IN AN
ORDERED FORM. IT MAKES UP NEARLY 20% OF THE TOTAL
AMOUNT OF EXISTING DATA.
THERE ARE TWO FORMS OF STRUCTURED DATA:
1) MACHINE-GENERATED DATA =>
SENSORS, WEB LOGS ETC.
2) HUMAN-GENERATED DATA =>
NAMES, ADDRESSES ETC.
UNSTRUCTURED DATA:
UNSTRUCTURED DATA HAS NO PREDEFINED ORDER OR FORMAT (E.G. TEXT, IMAGES, AUDIO, VIDEO).
CHARACTERISTICS OF BIG DATA (THE V'S):
1) VOLUME:
THE SHEER AMOUNT OF DATA BEING GENERATED.
2) VARIETY:
DIFFERENT FORMATS OF DATA FROM VARIOUS
SOURCES (IMAGE, TEXT, PDF, AUDIO, VIDEO).
3) VELOCITY:
THE DATA IS GENERATED AT A VERY FAST
RATE. VELOCITY IS A MEASURE OF HOW FAST
THE DATA IS COMING IN.
4) VALUE:
THE ABILITY TO EXTRACT USEFUL DATA.
5) VALIDITY/VERACITY:
THE TRUSTWORTHINESS AND QUALITY OF THE DATA.
ASSIGNMENT:
FEATURES OF NOSQL DATABASES:
1) NON-RELATIONAL:
1) NOSQL DATABASES NEVER FOLLOW THE
RELATIONAL MODEL.
2) NEVER PROVIDE TABLES WITH FLAT FIXED-COLUMN
RECORDS.
3) WORK WITH SELF-CONTAINED AGGREGATES OR
BINARY LARGE OBJECTS (BLOBs).
4) DO NOT REQUIRE OBJECT-RELATIONAL MAPPING
AND DATA NORMALIZATION.
5) NO COMPLEX FEATURES LIKE QUERY
LANGUAGES, QUERY PLANNERS, REFERENTIAL-INTEGRITY
JOINS, ACID.
2) SCHEMA-FREE:
1) NOSQL DATABASES ARE EITHER SCHEMA-FREE OR
HAVE A RELAXED SCHEMA.
2) DO NOT REQUIRE ANY SORT OF DEFINITION OF THE
SCHEMA OF THE DATA.
3) OFFER HETEROGENEOUS STRUCTURES OF DATA IN
THE SAME DOMAIN.
3) SIMPLE API:
1) OFFER EASY-TO-USE INTERFACES FOR STORING AND
QUERYING THE DATA PROVIDED.
2) APIs ALLOW LOW-LEVEL DATA MANIPULATION AND
SELECTION METHODS.
3) TEXT-BASED PROTOCOLS ARE MOSTLY USED, WITH HTTP
REST AND JSON.
4) MOSTLY NO STANDARDS-BASED QUERY
LANGUAGE IS USED.
5) WEB-ENABLED DATABASES RUN AS INTERNET-FACING
SERVICES.
4) DISTRIBUTED:
1) MULTIPLE NOSQL DATABASES CAN BE EXECUTED IN
A DISTRIBUTED FASHION.
2) OFFER AUTO-SCALING AND FAILOVER
CAPABILITIES.
3) OFTEN THE ACID CONCEPT CAN BE SACRIFICED FOR
SCALABILITY AND THROUGHPUT.
4) OFTEN PROVIDE ONLY EVENTUAL CONSISTENCY.
5) SHARED-NOTHING ARCHITECTURE. THIS ENABLES
LESS COORDINATION AND HIGHER DISTRIBUTION.
TYPES OF NOSQL DATABASES:
1) KEY VALUE STORE.
2) COLUMN ORIENTED.
3) GRAPH BASED.
4) DOCUMENT ORIENTED.
ADVANTAGES:
DISADVANTAGES:
1) COMPLEX QUERIES MAY INVOLVE
MULTIPLE KEY-VALUE PAIRS, WHICH MAY DEGRADE
PERFORMANCE.
2) DATA CAN INVOLVE MANY-TO-MANY
RELATIONSHIPS, WHICH MAY COLLIDE.
EXAMPLE:
BIGTABLE, CASSANDRA, SIMPLEDB ETC.
GRAPH DATABASE:
A GRAPH DATA STRUCTURE CONSISTS OF A FINITE SET OF
NODES AND A SET OF ORDERED PAIRS OF NODES, CALLED EDGES.
1) A GRAPH DATABASE STORES THE DATA IN A GRAPH.
2) IT IS CAPABLE OF ELEGANTLY REPRESENTING ANY KIND
OF DATA IN A HIGHLY ACCESSIBLE WAY.
3) A GRAPH DB IS A COLLECTION OF NODES AND EDGES.
4) EACH NODE REPRESENTS AN ENTITY AND EACH EDGE
REPRESENTS A CONNECTION OR RELATIONSHIP BETWEEN
TWO NODES.
5) EVERY NODE AND EDGE IS DEFINED BY A UNIQUE
IDENTIFIER.
6) EACH NODE KNOWS ITS ADJACENT NODES.
7) AS THE NUMBER OF NODES INCREASES, THE COST OF A
LOCAL STEP REMAINS THE SAME.
8) INDEXES ARE USED FOR LOOKUPS.
EXAMPLES:
INFINITEGRAPH, NEO4J, ORIENTDB ETC.
RELATIONAL MODEL -> GRAPH MODEL:
TABLES -> VERTEX AND EDGE SETS
ROWS -> VERTICES
COLUMNS -> KEY/VALUE PAIRS
JOINS -> EDGES
ADVANTAGES:
1) FAST TRAVERSAL, BECAUSE RELATIONSHIPS ARE STORED AS DIRECT CONNECTIONS.
2) DATA CAN BE EASILY HANDLED.
DISADVANTAGES:
WRONG CONNECTIONS MAY LEAD TO AN INFINITE LOOP.
DOCUMENT DATABASE:
EXAMPLES:
MONGODB, COUCHDB ETC.
ADVANTAGES:
1) THIS TYPE OF FORMAT IS VERY USEFUL AND IS USED TO
STORE SEMI-STRUCTURED DATA.
2) STORAGE, RETRIEVAL AND MANAGEMENT OF DOCUMENTS IS
VERY EASY.
DISADVANTAGES:
1) HANDLING MULTIPLE DOCUMENTS IS VERY
CHALLENGING.
2) AGGREGATION OPERATIONS MAY NOT WORK
ACCURATELY.
DATE:2/11/2020
TOPIC: MONGODB
A MongoDB document is a set of field:value pairs, e.g.:
{
  Name: "xyz",
  Age: 30,
  Website: "abc.com",
  Hobbies: ["teaching", "watching tv"]
}
MONGODB DATATYPES:--
1) STRING:
THIS IS THE MOST COMMONLY USED DATATYPE. A STRING IN MONGODB MUST BE UTF-8 VALID.
2) INTEGER:
THIS TYPE IS USED TO STORE A NUMERICAL VALUE. AN INTEGER CAN BE 32-BIT
OR 64-BIT DEPENDING UPON YOUR SERVER.
3) BOOLEAN:
THIS TYPE IS USED TO STORE A BOOLEAN VALUE(TRUE/FALSE)
4) DOUBLE:
THIS TYPE IS USED TO STORE FLOATING POINT VALUES.
5) MIN/MAX KEY:
THIS TYPE IS USED TO COMPARE A VALUE AGAINST THE LOWEST AND
HIGHEST BSON ELEMENTS.
6) ARRAYS:
THIS TYPE IS USED TO STORE ARRAYS OR LISTS OF MULTIPLE VALUES IN
ONE KEY.
7) TIMESTAMP:
THIS CAN BE HANDY FOR RECORDING WHEN A DOCUMENT HAS BEEN
MODIFIED OR ADDED.
8) OBJECT:
THIS DATA TYPE IS USED FOR EMBEDDED DOCUMENTS
9) NULL:
THIS TYPE IS USED TO STORE A NULL VALUE.
10) SYMBOL:
THIS DATATYPE IS USED IDENTICALLY TO A STRING;
HOWEVER, IT IS GENERALLY RESERVED FOR LANGUAGES THAT USE A
SPECIFIC SYMBOL TYPE.
11) DATE:
THIS DATATYPE IS USED TO STORE THE CURRENT DATE OR TIME IN UNIX
TIME FORMAT. YOU CAN SPECIFY YOUR OWN DATE TIME BY CREATING AN
OBJECT OF DATE AND PASSING DAY, MONTH, YEAR INTO IT.
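E.g., in the mongo shell (a small sketch; the collection and field names are illustrative):
db.events.insertOne({ created_at: new Date() })        // current date/time
db.events.insertOne({ joined: new Date(2020, 10, 2) }) // year, month, day
Note that the JavaScript Date month argument is 0-based, so 10 means November.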
12) OBJECTID:
THIS DATATYPE IS USED TO STORE THE DOCUMENT’S ID.
13) BINARY DATA:
THIS DATATYPE IS USED TO STORE BINARY DATA.
14) CODE:
THIS DATATYPE IS USED TO STORE JAVASCRIPT CODE INTO THE DOCUMENT.
TABLE VS COLLECTION
RDBMS:
STUDENT_ID | STUDENT_NAME | AGE | COLLEGE
1001       | XYZ          | 30  | BEGINNERS BOOK
1002       | ABC          | 29  | BEGINNERS BOOK
MONGO DB:
{
  "_id": ObjectId("5ca66a95f9e57db687850dd4"),
  STUDENT_ID: 1001,
  STUDENT_NAME: "XYZ",
  AGE: 30,
  COLLEGE: "BEGINNERS BOOK"
}
{
  "_id": ObjectId("5ca66a95f9e57db687850dd5"),
  STUDENT_ID: 1002,
  STUDENT_NAME: "ABC",
  AGE: 29,
  COLLEGE: "BEGINNERS BOOK"
}
NOTES:
COLUMNS ARE REPRESENTED AS KEY-VALUE PAIRS (JSON FORMAT); ROWS
ARE REPRESENTED AS DOCUMENTS. MONGODB AUTOMATICALLY INSERTS
A UNIQUE _id (12-BYTE) FIELD IN EVERY DOCUMENT; THIS SERVES AS THE
PRIMARY KEY FOR EACH DOCUMENT.
ANOTHER THING IS THAT MONGODB SUPPORTS A DYNAMIC SCHEMA, WHICH
MEANS ONE DOCUMENT OF A COLLECTION CAN HAVE 4 FIELDS WHILE
ANOTHER DOCUMENT HAS ONLY 3 FIELDS.
THIS IS NOT POSSIBLE IN A RELATIONAL DATABASE.
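A quick sketch of the dynamic schema in the mongo shell (the student values are illustrative):
db.students.insertOne({ STUDENT_ID: 1003, STUDENT_NAME: "PQR", AGE: 28, COLLEGE: "BEGINNERS BOOK" })  // 4 fields
db.students.insertOne({ STUDENT_ID: 1004, STUDENT_NAME: "LMN", AGE: 27 })                             // only 3 fields
Both inserts succeed in the same collection because no fixed schema is enforced.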
DATABASE:
A PHYSICAL CONTAINER FOR COLLECTIONS.
COLLECTION:
A GROUP OF MONGODB DOCUMENTS; THE EQUIVALENT OF AN RDBMS TABLE.
DOCUMENT:
A SET OF KEY-VALUE PAIRS; THE EQUIVALENT OF A ROW.
C→ CREATE
R→READ
U→UPDATE
D→DELETE
CREATE OPERATION:
Create a database:
A database is created by using the "use" command.
use name_of_the_database
example:
use employee
The "show dbs" command then lists the databases:
admin 0.000GB
employee 0.000GB
local 0.000GB
db.createCollection("emp")
Output:
{ "ok" : 1 }
Note: createCollection() does not take field definitions (such as Emp_name, Emp_id, Date_of_joining or Salary); being schemaless, MongoDB defines fields only when documents are inserted.
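If per-field type checks like the ones sketched above are wanted, they can be attached as an optional validator (a sketch, assuming MongoDB 3.6+, where the $jsonSchema operator is available):
db.createCollection("emp", {
  validator: {
    $jsonSchema: {
      properties: {
        Emp_id:          { bsonType: "number" },
        Emp_name:        { bsonType: "string" },
        Date_of_joining: { bsonType: "date" },
        Salary:          { bsonType: "number" }
      }
    }
  }
})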
insertOne():
db.employee.insertOne(
{
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}
)
Output:
{
  "acknowledged": true,
  "insertedId": ObjectId("5ca66a95f9e57db68788950dd3")
}
insertMany():
db.employee.insertMany(
[
  {
    Emp_id: 2,
    Emp_name: "jhon",
    Salary: 34000
  },
  {
    Emp_id: 3,
    Emp_name: "smit",
    Salary: 45000
  },
  {
    Emp_id: 4,
    Emp_name: "bob",
    Salary: 56000
  }
]
)
Note: insertMany() takes an array of documents.
Output:
{
  "acknowledged": true,
  "insertedIds": [
    ObjectId("5ca66a95f9e57db68788950dd4"),
    ObjectId("5ca66a95f9e57db68788950dd5"),
    ObjectId("5ca66a95f9e57db68788950dd6")
  ]
}
READ OPERATION:
Read operations retrieve documents from a collection, i.e., query a
collection for documents. MongoDB provides the following methods to
read documents from a collection.
db.employee.find()
output:
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd3"), Emp_id: 1, Emp_name: "xyz", Salary: 39000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd4"), Emp_id: 2, Emp_name: "jhon", Salary: 34000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd5"), Emp_id: 3, Emp_name: "smit", Salary: 45000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd6"), Emp_id: 4, Emp_name: "bob", Salary: 56000
}
db.employee.find().pretty()
output:
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd3"),
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd4"),
  Emp_id: 2,
  Emp_name: "jhon",
  Salary: 34000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd5"),
  Emp_id: 3,
  Emp_name: "smit",
  Salary: 45000
}
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd6"),
  Emp_id: 4,
  Emp_name: "bob",
  Salary: 56000
}
db.employee.find({Emp_id:1}).pretty()
output:
{
  "_id": ObjectId("5ca66a95f9e57db68788950dd3"),
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}
Update operation:
Example:
updateOne() takes a filter and an update document; e.g., to raise the first salary found below 38000 up to 38000 (the $set document carries the change):
db.employee.updateOne(
  { Salary: { $lt: 38000 } },
  { $set: { Salary: 38000 } }
)
Delete operation:
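The delete methods mirror the insert ones; a minimal sketch against the employee collection above:
db.employee.deleteOne({ Emp_id: 4 })                 // removes the first matching document
Output:
{ "acknowledged" : true, "deletedCount" : 1 }
db.employee.deleteMany({ Salary: { $lt: 35000 } })   // removes every matching document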
1) DESIGNING:
THE DESIGN STEP IS THE FIRST STEP IN THE DATA MART PROCESS.
THIS PHASE COVERS ALL OF THE FUNCTIONS FROM
INITIATING THE REQUEST FOR A DATA MART THROUGH
GATHERING DATA ABOUT THE REQUIREMENTS AND
DEVELOPING THE LOGICAL AND PHYSICAL DESIGN OF THE
DATA MART.
THE TASKS IN THIS PHASE ARE:
GATHERING BUSINESS AND TECHNICAL
REQUIREMENTS
IDENTIFYING THE DATA SOURCES
SELECTING THE APPROPRIATE SUBSET OF DATA
DESIGNING THE LOGICAL AND PHYSICAL
ARCHITECTURE OF THE DATA MART.
2) CONSTRUCTING:
THIS STEP INVOLVES CREATING THE PHYSICAL DATABASE AND THE LOGICAL STRUCTURES FOR THE DATA MART.
3) POPULATING:
THIS STEP INCLUDES ALL OF THE TASKS RELATED TO
GETTING DATA FROM THE SOURCES, CLEANING IT
UP, MODIFYING IT TO THE RIGHT FORMAT AND LEVEL OF
DETAIL, AND MOVING IT INTO THE DATA MART.
THE TASKS IN THIS PHASE ARE:
MAPPING DATA SOURCES TO TARGET DATA
STRUCTURES
EXTRACTING DATA
CLEANSING AND TRANSFORMING THE
INFORMATION
LOADING DATA INTO THE DATA MART
CREATING AND STORING METADATA.
4) ACCESSING:
THIS STEP INVOLVES PUTTING THE DATA TO USE:
QUERYING THE DATA, ANALYZING IT, CREATING REPORTS,
CHARTS AND GRAPHS, AND PUBLISHING THEM.
The big data framework provides a structure for organizations that want to start
with big data or aim to develop their big data capabilities further. The
framework includes all organizational aspects that should be taken into
account in a big data organization. The framework is vendor
independent.
Nowadays there is probably no single traditional software tool that could
process such large volumes of data on its own.
Special big data frameworks have been created to implement and support the
functionality of such software. They help rapidly process and structure huge
chunks of real-time data.
There are many great big-data tools on the market right now:
Map reduce:
Example (word count):
Input:
deer bear river
car car river
deer car bear
The Hadoop platform executes the programs based on configuration set using
JobConf.
It describes the format of the input data for a MapReduce job.
Input location:
It specifies the location of the input data in HDFS.
Map-function:
It processes each input record and emits intermediate key-value pairs.
Reduce-function:
It reduces the set of tuples which share a key to a single tuple with a change in
the value.
We can set the options such that a specific set of key-value pairs is transferred
to a specific reduce task.
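As an illustrative sketch only (Hadoop itself would typically use Java here, but these notes use the mongo shell elsewhere), MongoDB's built-in, now-deprecated map-reduce command can run the same word-count idea over the three input lines above; the collection name "lines" is assumed:
db.lines.insertMany([
  { text: "deer bear river" },
  { text: "car car river" },
  { text: "deer car bear" }
])
db.lines.mapReduce(
  function () {                     // map: emit (word, 1) for every word in the line
    this.text.split(" ").forEach(function (w) { emit(w, 1); });
  },
  function (key, values) {          // reduce: sum the counts for one word
    return Array.sum(values);
  },
  { out: "word_counts" }
)
db.word_counts.find()  // e.g. { "_id" : "car", "value" : 3 }, { "_id" : "deer", "value" : 2 }, ...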
The Hadoop framework consists of a single master and many slaves. The master
has the job tracker and each slave has a task tracker. The master distributes the
programs and data to the slaves.
The task tracker keeps track of the tasks directed to it and relays the information
to the job tracker. The job tracker monitors all the status reports and re-initiates
the failed tasks if any.
1. FileInputFormat:
It is the base class for all file-based input formats. It specifies the input directory
where the data files are located. When we start a Hadoop job, FileInputFormat is
provided with a path containing files to read. It reads all the files and divides them
into one or more InputSplits.
2. TextInputFormat:
It is the default input format of MapReduce. TextInputFormat treats each line
of an input file as a separate record and performs no parsing. This is useful for
unformatted data or line-based records.
Key:
It is the byte offset of the beginning of the line within the file (not the whole file,
just one split), so it is unique if combined with the file name.
Value:
It is the contents of the line, excluding line terminators.
3. KeyValueTextInputFormat:
It also treats each line as a record, but breaks the line itself into a key and a value at a separator character (a tab by default).
4. SequenceFileInputFormat:
It reads sequence files. Sequence files are binary files that store sequences of
binary key-value pairs. Sequence files are block-compressed and provide direct
serialization and de-serialization of several arbitrary data types (not just text). Key
and value both are user defined.
6. SequenceFileAsBinaryInputFormat:
It is a sequence file input format with which we can extract the sequence file's
keys and values as opaque binary objects.
7. NLineInputFormat:
It is another form of text input format where the keys are the byte offset of the
line and the values are the contents of the line. With text input format and
key-value text input format, each mapper receives a variable number of lines of
input; the number depends on the split and the length of the lines. If we want our
mapper to receive a fixed number of lines of input, then we use NLineInputFormat.
8. DBInputFormat:
It is an input format that reads data from an RDBMS using JDBC. It is best for
loading relatively small datasets, perhaps for joining with large datasets from
HDFS using multiple inputs. Here the key is a LongWritable and the value is a DBWritable.
Types of Hadoop output formats:
1. TextOutputFormat:
It is the default output format. It writes records as lines of text, with the key and
the value separated by a tab character.
2. SequenceFileOutputFormat:
It is an output format which writes sequence files for its output, and it is an
intermediate format for use between MapReduce jobs: it rapidly serializes
arbitrary data types to the file, and the corresponding SequenceFileInputFormat
will deserialize the file into the same types.
3. SequenceFileAsBinaryOutputFormat:
It is another form of SequenceFileOutputFormat which writes keys and values to a
sequence file in binary format.
4. MapFileOutputFormat:
It is used to write output as map files. The keys in a map file must be added in
order, so we need to ensure that the reducer emits keys in sorted order.
5. MultipleOutputs:
It allows writing data to files whose names are derived from the output keys and
values, or in fact from an arbitrary string.
6. LazyOutputFormat:
Sometimes FileOutputFormat creates output files even if they are
empty. LazyOutputFormat is a wrapper output format which ensures that the
output file is created only when a record is emitted for a given partition.
7. DBOutputFormat:
It is an output format for writing to an RDBMS or HBase. It sends the reduce
output to a SQL table. It accepts key-value pairs where the key has a type
extending DBWritable, and it writes the output to the database with a batch SQL
query.
MULTIDIMENSIONAL OLAP IN A DATA WAREHOUSE (MOLAP):
EX:
ARCHITECTURE:
1. DATABASE SERVER
2. MOLAP SERVER
3. FRONT-END TOOLS.
DISADVANTAGES:
ESSBASE:
EXPRESS SERVER:
YELLOWFIN:
CLEAR ANALYTICS:
DISADVANTAGES:
ADVANTAGES:
DIMENSIONAL MODELING:
1. FACTS:
2. DIMENSIONS:
3. ATTRIBUTES:
4. FACT TABLE:
e.g:
BOOK SHOP
5. DIMENSIONAL TABLE:
e.g:
1. CONFORMED DIMENSION
2. OUTRIGGER DIMENSION
3. SHRUNKEN DIMENSION
4. ROLE PLAYING DIMENSION
5. DIMENSION TO DIMENSION TABLE
6. JUNK DIMENSION
7. DEGENERATE DIMENSION
8. SWAPPABLE DIMENSION
9. STEP DIMENSION
STEPS OF DIMENSIONAL MODELLING:
MULTIDIMENSIONAL SCHEMA:
STAR SCHEMA
SNOWFLAKE SCHEMA
GALAXY SCHEMA
STAR SCHEMA:
A STAR SCHEMA IS A DATA WAREHOUSE SCHEMA IN WHICH THE CENTER OF THE STAR CAN
HAVE ONE FACT TABLE AND A NUMBER OF ASSOCIATED DIMENSION TABLES. IT IS
KNOWN AS A STAR SCHEMA AS ITS STRUCTURE RESEMBLES A STAR.
THE STAR SCHEMA DATA MODEL IS THE SIMPLEST TYPE OF DATA WAREHOUSE
SCHEMA. IT IS ALSO KNOWN AS THE STAR JOIN SCHEMA AND IS OPTIMIZED FOR
QUERYING LARGE DATA SETS.
CHARACTERISTICS:
ADVANTAGES:
1. Quality data:--
Organizations add data sources to the data warehouse, so they
can be sure of their relevance and constant availability. This
provides higher data quality and data integrity for informed
decision making.
2. Promotes decision making:--
Strategic decisions are based on facts and relevant data. They
are supported by information that the organization has collected
over time. Another plus is that leaders are better informed about
data requests and can extract information according to their
specific requirements.
2. Middle Tier:
This tier holds the OLAP server, implemented using either
the ROLAP or the MOLAP model.
3. Top Tier:
This tier is the front-end client layer. This layer holds the
query tools, analysis tools and data-mining tools.
Data warehouse models:
There are three types of models:
1. Virtual warehouse:
A virtual warehouse is a set of views over operational databases.
2. Data mart:
A data mart contains a subset of organization-wide data that is
valuable to a specific group of users.
Note:
Windows-based/Unix/Linux-based servers are used to
implement data marts. They are implemented on low-cost
servers.
The implementation cycle of a data mart is measured in
short periods of time, i.e., in weeks rather than months or
years.
The life cycle of a data mart may be complex in the long
run if its planning and design are not organization
wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally
structured data warehouse.
Data marts are flexible.
3. Enterprise warehouse:
An enterprise warehouse collects all of the information about
subjects spanning the entire organization.
Operational system:--
Operational systems are the transaction-processing systems of
the organization; they act as source systems for the warehouse.
Flat files:--
Flat files are plain data files; they are another common source
feeding the warehouse.
Metadata:--
A set of data that defines and gives information about other
data.
Metadata is used in the data warehouse for a variety of purposes.
Purpose of metadata:
Metadata summarizes necessary information about data,
which can make finding and working with particular instances of
data more accessible.
e.g:
Author, date built, date changed and file size are examples
of very basic document metadata.
Metadata is used to direct a query to the most appropriate data
source.
Highly and lightly summarized data:
This area of the data warehouse stores all the predefined lightly
and highly summarized (aggregated) data generated by the
warehouse manager.
End-user access tools:
The principal purpose of a data warehouse is to provide
information to business managers for strategic decision
making. These users interact with the warehouse using
end-user access tools.
e.g. (end-user access tools):--
1. reporting & query tools
2. application development tools
3. executive information system (EIS) tools
4. online analytical processing (OLAP) tools
5. data mining tools
Hadoop eco-system:
The Hadoop ecosystem handles big data efficiently. It comprises
the various tools that are required to perform different
tasks on HADOOP.
The components are
1)HDFS
2)MAP-REDUCE
3)YARN
4)HIVE
5)PIG
6)HBASE
7)HCATALOG
8)AVRO
9)THRIFT
10) DRILL
11) MAHOUT
12) SQOOP
13) FLUME
14) AMBARI
15) ZOOKEEPER
16) OOZIE
1.HDFS:
HDFS refers to the Hadoop Distributed File System.
HDFS is a dedicated file system to store big
data with a cluster of commodity/cheaper
hardware with a streaming access pattern. It
enables data to be stored at multiple nodes in
the cluster, which ensures data security and
fault tolerance.
Components are
1) Name node (master node; it does not store the
actual data. It stores metadata, i.e., the number of
blocks, their locations, on which rack and on which
data node the data is stored, etc. It consists of files
and directories.)
2) Data node (slave node; it is responsible for
storing the actual data in HDFS. It performs read
and write operations as per the request of the client.
A replica block on a data node consists of 2 files in
the file system: the first file is for the data and the
second file is for recording the block's metadata.)
2) Map-reduce:
Data once stored in HDFS also needs to be
processed. When a query is sent to process a
dataset in HDFS, Hadoop identifies where the data
is stored (the map step), the query is broken into
multiple parts, and the results of all these parts
are combined and the overall result is sent back
to the user (the reduce step).
While data is stored in HDFS, MapReduce is used
to process that data. It is a software framework
for easily writing applications that process
vast amounts of structured and unstructured
data. MapReduce programs are parallel in nature;
this parallel processing improves the speed and
reliability of the cluster.
Features:
1. Simplicity: MapReduce jobs are easy to run.
Applications can be written in any language,
such as Java, C++ and Python.
2. Scalability: it can process petabytes of data.
3. Speed: by means of parallel processing, problems
that take days to solve are solved in hours
or minutes by MapReduce.
4. Fault tolerance: it takes care of failures. If one
copy of the data is unavailable, another machine
that has a copy of the same data can be used for
the same task.
3) YARN:
YARN (Yet Another Resource Negotiator) manages
the cluster's resources and schedules the jobs that
run on Hadoop.
4) HIVE:
It is an open-source data warehouse system
for querying and analyzing large datasets stored
in HDFS.
Hive performs three main functions: data
summarization, query and analysis.
Facebook created Hive for people who are
fluent with SQL.
Hive uses a language called HiveQL (HQL), which
is similar to SQL.
HQL automatically translates SQL-like queries
into MapReduce jobs which execute on
Hadoop. Hive is suitable for structured data.
After the data is analyzed, it is ready for the
user to access.
Main parts of Hive are:
1) Metastore: it stores the metadata.
2) Driver: manages the lifecycle of an HQL
statement.
3) Query compiler: it compiles HQL into a
directed acyclic graph (DAG).
4) Hive server: it provides a thrift interface and a
JDBC/ODBC server.
5) PIG:
Pig is a high-level language platform for
analyzing and querying huge datasets that
are stored in HDFS. Pig, as a component of
the Hadoop ecosystem, uses the Pig Latin
language, which is also very similar to SQL.
It loads the data, applies the required filters
and dumps the data in the required format.
For program execution, Pig requires a Java
runtime environment.
The compiler internally converts Pig Latin to
MapReduce; it produces a sequential set of
MapReduce jobs. It was initially developed by
Yahoo. It gives us a platform for building data
flows for extract, transform, load (ETL), processing
and analyzing huge data sets.
6) HBase:
It is a distributed database. Data is stored in
table format, and the tables can contain
billions of rows and millions of columns. It is
a scalable, distributed, NoSQL database built
on top of HDFS. It is written in Java,
and it is modeled after Google's Bigtable.
Two components:
1) HBase master
2) Region server.
DATA MODELLING IN MONGODB
DATA MODELLING OR DATA STRUCTURING REPRESENTS
THE NATURE OF THE DATA AND THE BUSINESS LOGIC TO CONTROL
THE DATA. IT ALSO ORGANIZES THE DATABASE.
THE STRUCTURE OF THE DATA IS EXPLICITLY DETERMINED BY
THE DATA MODEL.
THE DATA MODEL HELPS COMMUNICATION BETWEEN THE BUSINESS
PEOPLE, WHO REQUIRE THE COMPUTER SYSTEM, AND THE
TECHNICAL PEOPLE, WHO CAN FULFILL THEIR REQUIREMENTS.
CONCEPTUAL MODEL:
IN THIS MODEL, THE CONCEPTS OR SEMANTICS OF THE
DATABASE ARE OF CONCERN. WE DO NOT NEED
TO CARE ABOUT THE ACTUAL DATA OR META
INFORMATION HERE.
LOGICAL MODEL:
IN THIS MODEL THERE ARE DESCRIPTIONS OF TABLES,
COLUMNS, OBJECT-ORIENTED CLASSES, XML TAGS AND
DOCUMENT STRUCTURES.
PHYSICAL MODEL:
THIS MODEL CONSISTS OF THE ACTUAL PHYSICAL
STRUCTURE USED TO STORE THE DATA, LIKE THE
PARTITIONS, CPU SPACES, REPLICATION ETC., i.e. HOW THE
ACTUAL DATA IS STORED IN THE SYSTEM.
RELATION MODEL:
IN THE RELATIONAL MODEL THE INFORMATION IS STORED
IN TWO-DIMENSIONAL TABLES, AND THE RELATIONS ARE
FORMED BY STORING THE COMMON ATTRIBUTES.
THE TABLES ARE ALSO KNOWN AS RELATIONS IN THIS
MODEL.
STORAGE ENGINE:
THE STORAGE ENGINE IS THE COMPONENT OF THE
DATABASE THAT IS RESPONSIBLE FOR MANAGING HOW
DATA IS STORED, BOTH IN MEMORY AND ON
DISK.
MONGODB SUPPORTS MULTIPLE STORAGE ENGINES, AS
DIFFERENT ENGINES PERFORM BETTER FOR SPECIFIC
WORKLOADS. CHOOSING THE APPROPRIATE STORAGE
ENGINE FOR YOUR USE CASE CAN SIGNIFICANTLY IMPACT
THE PERFORMANCE OF YOUR APPLICATIONS.
1) WIRED TIGER STORAGE ENGINE(DEFAULT):
IT IS A NOSQL, OPEN-SOURCE, EXTENSIBLE PLATFORM
FOR DATA MANAGEMENT. IT IS THE DEFAULT
STORAGE ENGINE STARTING IN MONGODB 3.2. IT IS
WELL SUITED FOR MOST WORKLOADS AND IS
RECOMMENDED FOR NEW DEPLOYMENTS. IT
PROVIDES A DOCUMENT-LEVEL CONCURRENCY
MODEL, CHECKPOINTING AND
COMPRESSION, AMONG OTHER FEATURES.
EXAMPLE:
ASSUME WE ARE KEEPING THE DETAILS OF
EMPLOYEES IN THREE DIFFERENT DOCUMENTS,
NAMELY PERSONAL DETAILS, CONTACT AND
ADDRESS. YOU CAN EMBED ALL THREE
DOCUMENTS IN A SINGLE ONE:
{
  _id: " ",
  Emp_Id: "10025AE336",
  Personal_Details: {
    First_Name: "Radhika",
    Last_Name: "Sharma",
    Date_of_Birth: "1995-09-26"
  },
  Contact: {
    Email: "radhika_sharma.123@gmail.com",
    Phone: "9848022338"
  },
  Address: {
    City: "Hyderabad",
    Area: "Madapur",
    State: "Telangana"
  }
}
In the normalized (referenced) data model, the same data is instead split into separate documents that point back to the employee document through a reference field (empdocID):
Employee:
{
  _id: <ObjectId101>,
  Emp_id: "10025AE336"
}
Personal_Details:
{
  _id: <ObjectId102>,
  empdocID: "ObjectId101",
  First_Name: "Radhika",
  Last_Name: "Sharma",
  Date_of_Birth: "1995-09-26"
}
Contact:
{
  _id: <ObjectId103>,
  empdocID: "ObjectId101",
  Email: "radhika_sharma.123@gmail.com",
  Phone: "9848022338"
}
Address:
{
  _id: <ObjectId104>,
  empdocID: "ObjectId101",
  City: "Hyderabad",
  Area: "Madapur",
  State: "Telangana"
}
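With the referenced model, related documents can be stitched back together at query time. A sketch using the aggregation $lookup stage (the collection names employee and contact are assumptions, and _id and empdocID are assumed to be stored with matching types):
db.employee.aggregate([
  { $match: { Emp_id: "10025AE336" } },
  { $lookup: {
      from: "contact",             // join the contact collection
      localField: "_id",           // employee _id ...
      foreignField: "empdocID",    // ... matched against contact.empdocID
      as: "contact_info"           // joined documents land in this array field
  } }
])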
Subject name: Big Data
Topic name: Hadoop ecosystem
7) HCATALOG:
IT IS A TABLE AND STORAGE MANAGEMENT LAYER FOR
HADOOP. IT SUPPORTS THE DIFFERENT COMPONENTS AVAILABLE IN
THE HADOOP ECOSYSTEM, LIKE MAPREDUCE, HIVE AND PIG, TO EASILY
READ AND WRITE DATA FROM THE CLUSTER. IT IS A KEY
COMPONENT OF HIVE THAT ENABLES USERS TO STORE
THEIR DATA IN ANY FORMAT AND STRUCTURE. BY DEFAULT,
IT SUPPORTS THE RCFILE, CSV, JSON, SEQUENCEFILE AND ORC FILE
FORMATS. IT EXPOSES THE TABULAR DATA OF THE HIVE METASTORE TO
OTHER HADOOP APPLICATIONS.
BENEFITS:
IT ENABLES NOTIFICATIONS OF DATA AVAILABILITY.
WITH THE TABLE ABSTRACTION, HCATALOG FREES THE USER
FROM THE OVERHEAD OF DATA STORAGE.
IT PROVIDES VISIBILITY FOR DATA CLEANING AND ARCHIVING
TOOLS.
8) AVRO:
IT IS PART OF THE HADOOP ECOSYSTEM AND IS A MOST POPULAR
DATA SERIALIZATION SYSTEM. IT IS AN OPEN-SOURCE PROJECT
THAT PROVIDES DATA SERIALIZATION AND DATA EXCHANGE
SERVICES FOR HADOOP. THESE SERVICES CAN BE USED TOGETHER
OR INDEPENDENTLY. BIG DATA CAN EXCHANGE PROGRAMS
WRITTEN IN DIFFERENT LANGUAGES USING AVRO.
USING THE SERIALIZATION SERVICE, PROGRAMS CAN SERIALIZE
DATA INTO FILES OR MESSAGES. IT STORES THE DATA DEFINITION
AND THE DATA TOGETHER IN ONE MESSAGE OR FILE, MAKING IT
EASY FOR PROGRAMS TO DYNAMICALLY UNDERSTAND THE
INFORMATION STORED IN AN AVRO FILE OR MESSAGE.
FEATURES:
1) RICH DATA STRUCTURES
2) REMOTE PROCEDURE CALL
3) COMPACT, FAST, BINARY DATA FORMAT
4) CONTAINER FILE TO STORE PERSISTENT DATA.
AVRO SCHEMA:
IT RELIES ON SCHEMAS FOR
SERIALIZATION/DESERIALIZATION.
IT REQUIRES THE SCHEMA FOR DATA READS/WRITES.
WHEN AVRO DATA IS STORED IN A FILE, ITS SCHEMA IS
STORED WITH IT, SO THAT FILES MAY BE PROCESSED LATER
BY ANY PROGRAM.
DYNAMIC TYPING:
IT REFERS TO SERIALIZATION AND DESERIALIZATION
WITHOUT CODE GENERATION. IT COMPLEMENTS THE
CODE GENERATION WHICH IS AVAILABLE IN AVRO FOR
STATICALLY TYPED LANGUAGES AS AN OPTIONAL
OPTIMIZATION.
9) THRIFT:
IT IS A SOFTWARE FRAMEWORK FOR SCALABLE CROSS-LANGUAGE
SERVICE DEVELOPMENT. IT IS AN INTERFACE
DEFINITION LANGUAGE FOR RPC
COMMUNICATION. HADOOP DOES A LOT OF RPC CALLS, SO
THERE IS A POSSIBILITY OF USING THE HADOOP
ECOSYSTEM COMPONENT APACHE THRIFT FOR
PERFORMANCE OR OTHER REASONS. IT COMBINES A
SOFTWARE STACK WITH A CODE GENERATION ENGINE TO
BUILD SERVICES THAT WORK EFFICIENTLY ACROSS
DIFFERENT LANGUAGES: C++, JAVA, PYTHON, RUBY ETC. IT
ALLOWS US TO DEFINE DATA TYPES AND SERVICE
INTERFACES IN A SIMPLE DEFINITION FILE. IT IS A
LIGHTWEIGHT, LANGUAGE-INDEPENDENT SOFTWARE STACK
WITH AN ASSOCIATED CODE GENERATION MECHANISM
FOR RPC. IT PROVIDES CLEAN ABSTRACTIONS FOR DATA
TRANSPORT, DATA SERIALIZATION AND APPLICATION-LEVEL
PROCESSING. IT WAS ORIGINALLY DEVELOPED BY
FACEBOOK AND IS NOW AN OPEN-SOURCE PROJECT.
10) DRILL:
THE MAIN PURPOSE OF THIS HADOOP ECOSYSTEM
COMPONENT IS LARGE-SCALE DATA PROCESSING,
INCLUDING STRUCTURED AND SEMI-STRUCTURED DATA. IT
IS A LOW-LATENCY DISTRIBUTED QUERY ENGINE THAT IS
DESIGNED TO SCALE TO SEVERAL THOUSANDS OF NODES AND
QUERY PETABYTES OF DATA. DRILL IS THE FIRST
DISTRIBUTED SQL QUERY ENGINE THAT HAS A SCHEMA-FREE
MODEL.
APPLICATION:
A COMPANY THAT PROVIDES CONSUMER PURCHASE DATA
FOR MOBILE AND INTERNET BANKING CAN USE
DRILL TO QUICKLY PROCESS TRILLIONS OF RECORDS
AND EXECUTE QUERIES.
FEATURES:
DRILL HAS A SPECIALIZED MEMORY MANAGEMENT
SYSTEM TO ELIMINATE GARBAGE COLLECTION AND
OPTIMIZE MEMORY ALLOCATION AND USAGE. DRILL PLAYS
WELL WITH HIVE BY ALLOWING DEVELOPERS TO REUSE
THEIR EXISTING HIVE DEPLOYMENTS.
1) EXTENSIBILITY:
DRILL PROVIDES AN EXTENSIBLE ARCHITECTURE AT ALL
LAYERS, INCLUDING THE QUERY LAYER, QUERY
OPTIMIZATION AND THE CLIENT API. WE CAN EXTEND ANY
LAYER FOR THE SPECIFIC NEEDS OF AN ORGANIZATION.
2) FLEXIBILITY:
IT PROVIDES A HIERARCHICAL COLUMNAR DATA MODEL
THAT CAN REPRESENT COMPLEX, HIGHLY DYNAMIC
DATA AND ALLOWS EFFICIENT PROCESSING.
3) DYNAMIC SCHEMA DISCOVERY:
IT DOES NOT REQUIRE SCHEMA OR TYPE SPECIFICATIONS
FOR DATA IN ORDER TO START THE QUERY EXECUTION
PROCESS. INSTEAD, IT STARTS PROCESSING THE DATA IN
UNITS CALLED RECORD BATCHES AND DISCOVERS THE
SCHEMA AT PROCESSING TIME.
4) DRILL'S DECENTRALIZED METADATA:
UNLIKE OTHER SQL-ON-HADOOP TECHNOLOGIES, DRILL
DOES NOT HAVE A CENTRALIZED METADATA
REQUIREMENT. ITS USERS DO NOT NEED TO CREATE AND
MANAGE TABLES IN METADATA IN ORDER TO QUERY
DATA.
11) MAHOUT:
MAHOUT IS AN OPEN-SOURCE FRAMEWORK FOR
CREATING SCALABLE MACHINE LEARNING ALGORITHMS
AND A DATA MINING LIBRARY. ONCE DATA IS STORED IN
HDFS, IT PROVIDES THE DATA SCIENCE TOOLS
TO AUTOMATICALLY FIND MEANINGFUL PATTERNS IN
THAT BIG DATA.
14.AMBARI:
IT IS A MANAGEMENT PLATFORM FOR
PROVISIONING, MANAGING, MONITORING AND SECURING
APACHE HADOOP CLUSTERS. HADOOP MANAGEMENT GETS
SIMPLER, AS AMBARI PROVIDES A CONSISTENT, SECURE
PLATFORM FOR OPERATIONAL CONTROL.
FEATURES:
1. SIMPLIFIED INSTALLATION, CONFIGURATION AND
MANAGEMENT:
IT EASILY AND EFFICIENTLY CREATES AND MANAGES
CLUSTERS AT SCALE.
2. CENTRALIZED SECURITY SETUP:
IT REDUCES THE COMPLEXITY FOR ADMINISTRATORS TO
CONFIGURE CLUSTER SECURITY ACROSS THE ENTIRE
PLATFORM.
3. HIGHLY EXTENSIBLE AND CUSTOMIZABLE:
IT IS HIGHLY EXTENSIBLE FOR BRINGING CUSTOM
SERVICES UNDER MANAGEMENT.
4. FULL VISIBILITY INTO CLUSTER HEALTH:
IT ENSURES THAT THE CLUSTER IS HEALTHY AND
AVAILABLE WITH A HOLISTIC APPROACH TO
MONITORING.
15. ZOOKEEPER:
IT IS A CENTRALIZED SERVICE AND A HADOOP
ECOSYSTEM COMPONENT FOR MAINTAINING
CONFIGURATION INFORMATION, NAMING, PROVIDING
DISTRIBUTED SYNCHRONIZATION AND PROVIDING
GROUP SERVICES. IT MANAGES AND COORDINATES A
LARGE CLUSTER OF MACHINES.
FEATURES:
FAST:
IT IS FAST WITH WORKLOADS WHERE READS OF DATA
ARE MORE COMMON THAN WRITES. THE IDEAL
READ/WRITE RATIO IS 10:1.
ORDERED:
IT MAINTAINS A RECORD OF ALL TRANSACTIONS.
16) OOZIE:
IT IS A WORKFLOW SCHEDULER SYSTEM FOR MANAGING
APACHE HADOOP JOBS. IT COMBINES MULTIPLE JOBS
SEQUENTIALLY INTO ONE LOGICAL UNIT OF
WORK. THE OOZIE FRAMEWORK IS FULLY INTEGRATED WITH
THE HADOOP STACK, WITH YARN AS ITS ARCHITECTURAL
CENTER, AND SUPPORTS HADOOP JOBS FOR MAPREDUCE, PIG,
HIVE AND SQOOP.
DATA PREPROCESSING:
DATA CLEANING:
DATA CLEANING IS A PROCESS OF ENSURING DATA IS
CORRECT, CONSISTENT AND USABLE.
IT CLEANS THE DATA BY FILLING IN THE MISSING VALUES,
SMOOTHING NOISY DATA (TYPING ERRORS), RESOLVING
INCONSISTENCIES (NAMING CONVENTIONS) AND REMOVING
THE OUTLIERS.
SOFTWARE TOOLS:
DATA CLEANER, OPENREFINE, WINPURE, DATA LADDER ETC.
1. CONVERSION TABLE:-
2. HISTOGRAMS:-
3. TOOLS:-
EVERY DAY, MAJOR VENDORS ARE COMING OUT WITH NEW AND
BETTER TOOLS TO MANAGE BIG DATA AND THE COMPLEXITIES THAT
CAN ACCOMPANY IT.
4. ALGORITHMS:-
DATA INTEGRATION:
e.g:
OR
1) STATISTICAL ANALYSIS:--
A) DESCRIPTIVE ANALYSIS:
3) DIAGNOSTIC ANALYSIS:--
4) PREDICTIVE ANALYSIS:--
5) PRESCRIPTIVE ANALYSIS:--
2) DATA COLLECTION:
3) DATA CLEANING:
4) DATA ANALYSIS:
5) DATA INTERPRETATION:
6) DATA VISUALIZATION:
1) Xplenty:
IT IS A CLOUD-BASED ETL SOLUTION PROVIDING SIMPLE
VISUALIZED DATA PIPELINES FOR AUTOMATED DATA FLOWS
ACROSS A WIDE RANGE OF SOURCES AND
DESTINATIONS. XPLENTY'S POWERFUL ON-PLATFORM
TRANSFORMATION TOOLS ALLOW YOU TO CLEAN, NORMALIZE
AND TRANSFORM DATA WHILE ALSO ADHERING TO
COMPLIANCE BEST PRACTICES.
FEATURES:
POWERFUL, CODE-FREE ON-PLATFORM DATA
TRANSFORMATION OFFERING.
REST API CONNECTOR: PULL IN DATA FROM ANY SOURCE
THAT HAS A REST API.
DESTINATION FLEXIBILITY: SEND DATA TO
DATABASES, DATA WAREHOUSES AND SALESFORCE.
SECURITY FOCUSED: FIELD-LEVEL DATA ENCRYPTION AND
MASKING TO MEET COMPLIANCE REQUIREMENTS.
REST API: ACHIEVE ANYTHING POSSIBLE ON THE Xplenty
UI VIA THE Xplenty API.
CUSTOMER-CENTRIC COMPANY THAT LEADS WITH FIRST-CLASS
SUPPORT.
3) MICROSOFT HDINSIGHT:
FEATURES:
RELIABLE ANALYTICS WITH AN INDUSTRY-LEADING
SLA (SERVICE LEVEL AGREEMENT).
IT OFFERS ENTERPRISE-GRADE SECURITY AND MONITORING.
PROTECT DATA ASSETS AND EXTEND ON-PREMISES
SECURITY AND GOVERNANCE CONTROLS TO THE CLOUD.
HIGH-PRODUCTIVITY PLATFORM FOR DEVELOPERS AND
SCIENTISTS.
INTEGRATION WITH LEADING PRODUCTIVITY APPLICATIONS.
DEPLOY HADOOP IN THE CLOUD WITHOUT PURCHASING
NEW HARDWARE OR PAYING OTHER UP-FRONT COSTS.
4) SKYTREE:
FEATURES:
5) TALEND:
FEATURES:
FEATURES:
7) SPARK:
FEATURES:
8) PLOTLY:
PLOTLY IS ONE OF THE BIG DATA ANALYSIS TOOLS THAT LETS USERS
CREATE CHARTS AND DASHBOARDS TO SHARE ONLINE.
FEATURES:
9) APACHE SAMOA:
FEATURES:
11) ELASTICSEARCH:
FEATURES:
IT ALLOWS COMBINING MANY TYPES OF SEARCHES, SUCH AS
STRUCTURED, UNSTRUCTURED, GEO, METRIC ETC.
REAL-TIME SEARCH AND ANALYTICS FEATURES TO WORK WITH
BIG DATA BY USING ELASTICSEARCH-HADOOP.
IT GIVES AN ENHANCED EXPERIENCE WITH SECURITY,
MONITORING, REPORTING AND ML FEATURES.
12) R PROGRAMMING:
OLAP GUIDELINES:
THESE ARE ALSO KNOWN AS DR. E.F. CODD'S RULES.
DR. E.F. CODD, THE FATHER OF THE RELATIONAL
MODEL, FORMULATED A LIST OF 12 GUIDELINES AND
REQUIREMENTS AS THE BASIS FOR SELECTING AN OLAP
SYSTEM.
12 GUIDELINES ARE:
1) MULTIDIMENSIONAL CONCEPTUAL VIEW
2) TRANSPARENCY
3) ACCESSIBILITY
4) CONSISTENT REPORTING PERFORMANCE
5) CLIENT/SERVER ARCHITECTURE
6) GENERIC DIMENSIONALITY
7) DYNAMIC SPARSE MATRIX HANDLING
8) MULTI-USER SUPPORT
9) UNRESTRICTED CROSS-DIMENSIONAL OPERATIONS
10) INTUITIVE DATA MANIPULATION
11) FLEXIBLE REPORTING
12) UNLIMITED DIMENSIONS AND AGGREGATION LEVELS
2) TRANSPARENCY:
3) ACCESSIBILITY:
6) GENERIC DIMENTIONALITY:
8) MULTIUSER SUPPORT:
ADVANTAGES:
1) SCHEMALESS:
MONGODB IS A DOCUMENT DATABASE IN WHICH ONE
COLLECTION HOLDS DIFFERENT DOCUMENTS. THE NUMBER OF
FIELDS, THE CONTENT AND THE SIZE OF A DOCUMENT CAN DIFFER
FROM ONE DOCUMENT TO ANOTHER.
2) THE STRUCTURE OF A SINGLE OBJECT IS CLEAR.
3) NO COMPLEX JOINS.
4) DEEP QUERY ABILITY. MONGODB SUPPORTS DYNAMIC
QUERIES ON DOCUMENTS USING A DOCUMENT-BASED QUERY
LANGUAGE THAT IS NEARLY AS POWERFUL AS SQL.
5) EASE OF SCALE-OUT: IT IS EASY TO SCALE.
6) CONVERSION/MAPPING OF APPLICATION OBJECTS TO
DATABASE OBJECTS IS NOT NEEDED.
7) USES INTERNAL MEMORY FOR STORING THE (WINDOWED)
WORKING SET, ENABLING FASTER ACCESS TO DATA.
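For point 4, a small sketch of such a dynamic query in the mongo shell (the employee collection from the CRUD examples above is assumed):
db.employee.find(
  { Salary: { $gt: 35000 }, Emp_name: /^s/ },   // comparison operator and regex in one filter
  { Emp_name: 1, Salary: 1, _id: 0 }            // projection: return only these two fields
)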
INDEXING:
AN INDEX IS A SINGLE FIELD WITHIN THE
DOCUMENT. INDEXES ARE USED TO QUICKLY LOCATE
DATA WITHOUT HAVING TO SEARCH EVERY
DOCUMENT IN A MONGODB DATABASE.
THIS IMPROVES THE PERFORMANCE OF OPERATIONS
PERFORMED ON THE MONGODB DATABASE.
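A minimal sketch (again assuming the employee collection): createIndex() builds the index, and explain() can confirm that a query uses it:
db.employee.createIndex({ Emp_id: 1 })                      // 1 = ascending index on Emp_id
db.employee.find({ Emp_id: 3 }).explain("executionStats")   // plan shows an index scan (IXSCAN)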
HIGH AVAILABILITY:
AUTO-REPLICATION IMPROVES THE AVAILABILITY OF A
MONGODB DATABASE.
LOAD BALANCING:
HORIZONTAL SCALING (SHARDING) ALLOWS MONGODB TO
BALANCE THE LOAD.
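A sketch of the shell commands that turn on this horizontal scaling on a sharded cluster (the database name and shard key are assumptions):
sh.enableSharding("employeeDB")                           // allow sharding for the database
sh.shardCollection("employeeDB.employee", { Emp_id: 1 })  // distribute documents by Emp_id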
DATE:
THESE ARE REPRESENTED AS YYYY-MM-DD.
INTERVAL
C) STRING DATA TYPE:
iv) DATE
v) DECIMALS
vi) UNION TYPE
4) LITERALS DATA TYPES:
FLOATING-POINT TYPES, DECIMAL TYPES
5) NULL TYPE
OLAP (INFORMATION):
FEATURES OF OLAP:
OLTP (OPERATIONAL):
THE FULL FORM OF OLTP IS ONLINE TRANSACTION PROCESSING. IT IS
CHARACTERIZED BY A LARGE NUMBER OF SHORT ONLINE
TRANSACTIONS (INSERT, UPDATE, DELETE).
DIFFERENCE BETWEEN OLAP AND OLTP:
OLAP vs OLTP:
1. OLAP: SPEED DEPENDS ON THE AMOUNT OF DATA INVOLVED. OLTP: VERY FAST PROCESSING SPEED.
2. OLAP: IT HELPS IN DECISION SUPPORT, PROBLEM SOLVING AND PLANNING. OLTP: ITS AIM IS TO CONTROL AND RUN BASIC BUSINESS TASKS.
3. OLAP: DENORMALIZED DATABASE. OLTP: HIGHLY NORMALIZED DATABASE.
4. OLAP: IT DEALS WITH CONSOLIDATED DATA (HETEROGENEOUS DATA SOURCES). OLTP: IT DEALS WITH OPERATIONAL DATA, IN WHICH OLTP DATABASES ARE THE ONLY SOURCE OF THE DATA.
5. OLAP: COMPLEX QUERIES, WITH AGGREGATION. OLTP: SIMPLE AND STANDARD QUERIES.
OLAP OPERATIONS:
1. PIVOTING.
PIVOTING:
4) OTHERS:
A) WOLAP: WEB ONLINE ANALYTICAL PROCESSING. WEB OLAP
IS AN OLAP SYSTEM ACCESSIBLE VIA A WEB
BROWSER. WOLAP HAS A THREE-TIERED ARCHITECTURE. IT
CONSISTS OF THREE COMPONENTS: A CLIENT, MIDDLEWARE AND A
DATABASE SERVER.
[Diagram: the client READs from and WRITEs to the PRIMARY node, which REPLICATES the data to two SECONDARY nodes.]
A TYPICAL DIAGRAM OF MONGODB REPLICATION, IN WHICH THE CLIENT APPLICATION ALWAYS INTERACTS WITH THE
PRIMARY NODE, AND THE PRIMARY NODE THEN REPLICATES THE DATA TO THE SECONDARY NODES.
FEATURES OF A REPLICA SET:
A CLUSTER OF N NODES.
ANY ONE NODE CAN BE PRIMARY.
ALL WRITE OPERATIONS GO TO THE PRIMARY NODE.
AUTOMATIC FAILOVER
AUTOMATIC RECOVERY
CONSENSUS ELECTION OF THE PRIMARY
General syntax:
mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet "REPLICA_SET_INSTANCE_NAME"
e.g:
mongod --port 27017 --dbpath "D:\setup\mongodb\data" --replSet rs0
Syntax:
rs.add(HOST_NAME:PORT)
e.g:
rs.add("mongod1.net:27017")
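Note: the replica set has to be initiated once, from the member that should become primary, before other members are added; a minimal sequence (the host name is illustrative):
rs.initiate()                  // start the replica set on the first member
rs.add("mongod1.net:27017")    // add a secondary
rs.status()                    // verify the members and their states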
MONGODB ADMINISTRATION
There are several operational strategies for the
administration and monitoring purpose:
1) Backup strategies:
MongoDB works on large sets of important data. The
database administrator should back up that data to avoid data
loss.
2) Monitoring:
The database administrator should monitor different database
and data-store related information to improve the
performance and reduce faults.
3) Runtime configuration:
For the huge number of database settings, the administrator
should provide configuration details.
4) Import and export:
The administrator can import or export JSON data to/from
different sources in the correct order.
5) Production notes:
MongoDB works on large sets of data. The data is replicated
and it uses different shards. The administrator should care
about the production architecture and notes to control the
complete system.
BACKUP AND RESTORE IN MONGODB
BACKUP IN MONGODB:
TO CREATE A BACKUP OF A DATABASE IN MONGODB, USE THE
mongodump COMMAND.
THIS COMMAND DUMPS THE ENTIRE DATA OF OUR
DATABASE OF OUR SERVER INTO THE DUMP DIRECTORY.
SYNTAX:
mongodump
e.g:
1) mongodump --host HOST_NAME --port PORT_NUMBER
THIS COMMAND WILL BACK UP ALL DATABASES OF THE SPECIFIED
mongod INSTANCE.
E.G:
mongodump --host xyz --port 27017
TO RESTORE THE BACKED-UP DATA, USE THE mongorestore COMMAND.
Syntax:
mongorestore
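e.g. (a sketch; it assumes the dump/ directory created by the mongodump command above):
mongorestore --host xyz --port 27017 dump/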
OPTIMIZATION TECHNIQUES:
BIGDATA OPTIMIZATION:
3) DATA DISTRIBUTION:
2) INFORMATION EXTRACTION:
Advantages of No-Sql:
1. It can be used as primary or Analytic Data Source.
2. Big-Data capability
3. No single point of failure.
4. Easy replication
5. It provides fast performance and horizontal scalability
6. Can handle structured, semi-structured and unstructured
data with equal effect.
7. Object-oriented programming, which is easy to use and
flexible.
8. No-SQL databases do not need a dedicated
high-performance server.
9. Simpler to implement than an RDBMS.
10. It can serve as the primary data-source for online
applications.
11. Handles BigData which manages velocity, variety
,volume and complexity of the data.
12. Eliminates the need for a specific caching layer to
store data.
13. Offers a flexible schema design which can easily be
altered without downtime or service disruption.
Disadvantages of No-Sql:
1. No standardization rules.
2. Limited query capabilities.
3. RDBMS databases and tools are comparatively
mature.
4. It does not offer any traditional database capability,
like consistency when multiple transactions are
performed simultaneously.
5. When the volume of the data increases, it is difficult
to maintain unique values as keys become difficult.
6. Does not work as well with relational data.
7. Being open-source options, they are not so popular for
enterprises.
8. The learning curve is steep for new developers.
Difference between RDBMS and NO-SQL:
1. Storage:
3. Database structure:
4. ACID:
RDBMSs are harder to construct and obey ACID, which helps
to create consistency in the database.
NO-SQL does not support ACID for storing the data.
5. Normalization:
RDBMS supports the normalization and joining of
tables.
NO-SQL does not support normalization.
6. Open-source:
Some RDBMSs are open-source applications, others are
commercial.
NO-SQL databases are typically open-source programs.
7. Integrity constraints:
RDBMS supports integrity constraints at the
schema level.
NO-SQL databases do not enforce integrity constraints at
the schema level.
8. Development year:
RDBMS was developed in the 1970s to deal with the
issues of flat file storage.
NO-SQL developed in the late 2000s to overcome the
issues and limitations of SQL DataBase.
9. Distributed database:
10. Client-server:
RDBMS supports the client-server architecture.
No-SQL storage systems support multi-server setups; they
also support the client-server architecture.
14. Data-fetching:
In RDBMS, data fetching is rapid because of its
relational approach and database.
In NO_SQL, data-fetching is easy and flexible.
15. Example:
RDBMS: MySQL, SQL Server, Oracle etc.
NO-SQL: Apache HBase, MongoDB etc.
JSON:
JSON stands for JavaScript Object Notation.
It is a text-based, human-readable data interchange
format used for representing simple data
structures and objects in web-browser-based code. It is
additionally sometimes used in desktop and server-side
programming environments.
It is a format to store and interchange data.
Advantages:
1. Faster.
2. Structured data.
3. Readable: it is human readable and writable. It is a
lightweight, text-based data interchange format.
4. Language independent.
Example:
{
  "_id": 100,                                // the JSON document id
  "name": "Akash",                           // "name" is the key, "Akash" the value
  "Subject": ["math", "computer science"]
}
JSON document id:
The id is a 12-byte hexadecimal number which assures
the uniqueness of every document. If you do not
provide one, then MongoDB provides a unique id for
every document.
12 bytes: 4 (current timestamp) + 3 (machine
id) + 2 (process id of the MongoDB server) + 3 (simple
incremental value).
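A small sketch in the mongo shell showing those pieces in use:
var oid = ObjectId()    // e.g. ObjectId("5ca66a95f9e57db687850dd4")
oid.getTimestamp()      // the first 4 bytes decode to the creation time
oid.str.length          // 24 hex characters = 12 bytes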
BSON:
BSON is Binary JavaScript Object Notation.
MongoDB uses BSON to encode JSON
data. It is also language independent, and easy for
machines to parse and generate.
HADOOP ECOSYSTEM
Hadoop ecosystem components like
1)Hadoop Distributed File System(HDFS)
2)Map-Reduce
3)YARN
4) HIVE
5) Apache PIG
6) Apache HBase
7)HCatalog
8)Avro
9) Thrift
10)Drill
11) Apache Mahout
12) Sqoop
13) Apache Flume
14) Ambari
15) Zookeeper
16) Apache Oozie.
Microsoft Windows [Version 10.0.18362.30]
(c) 2019 Microsoft Corporation. All rights reserved.
C:\Users\Sourabh>mongo
MongoDB shell version v4.4.2
connecting to:
mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("999414a7-0f59-4ea3-a9a6-
2c671cacf27a") }
MongoDB server version: 4.4.2
---
The server generated these startup warnings when booting:
2020-12-18T14:21:23.175+05:30: Access control is not enabled for
the database. Read and write access to data and configuration is
unrestricted
---
---
Enable MongoDB's free cloud-based monitoring service, which will
then receive and display
metrics about your deployment (disk utilization, CPU, operation
statistics, etc).