BIG DATA - 25.09.2020 (19 Files Merged)

This document provides an overview of big data technology. It defines big data as large volumes of data that are growing exponentially over time. Examples of sources of big data include social media, stock markets, and e-commerce sites. The document discusses the three types of data: structured, unstructured, and semi-structured. It also covers the five V's of big data - volume, variety, velocity, value and validity. The benefits of big data analysis for organizations are also summarized.


SUBJECT NAME: BIG DATA TECHNOLOGY

WHAT IS DATA?
DATA IS THE QUANTITIES, CHARACTERS OR SYMBOLS ON WHICH OPERATIONS ARE PERFORMED BY A COMPUTER.
WHAT IS INFORMATION?
PROCESSED DATA IS KNOWN AS INFORMATION.

WHAT IS BIG DATA?


BIG DATA IS ALSO DATA BUT WITH A HUGE SIZE.
THE TERM IS USED TO DESCRIBE A COLLECTION OF DATA THAT IS
HUGE IN VOLUME AND YET GROWING EXPONENTIALLY
WITH TIME.

EXAMPLE:
1) SOCIAL MEDIA:
STATISTICS SHOW THAT 500+ TB OF NEW DATA
GET INTO THE DATABASES OF SOCIAL MEDIA SITES EVERY DAY.
2) SHARE MARKET:
THE NEW YORK STOCK EXCHANGE GENERATES A HUGE
AMOUNT OF NEW DATA (SAY, ABOUT 1 TB) PER DAY
THROUGH ITS DAILY TRANSACTIONS.
3) E-COMMERCE SITE:
FLIPKART, AMAZON ETC. GENERATE HUGE AMOUNTS
OF DATA.
4) AIRPLANE:
A SINGLE JET PLANE CAN GENERATE 10+ TB OF
DATA IN 30 MINUTES OF FLIGHT TIME.

TYPES OF DATA:
THREE TYPES:
1) STRUCTURED DATA
2) UNSTRUCTURED DATA
3) SEMI-STRUCTURED DATA

STRUCTURED DATA:
STRUCTURED DATA ARE THOSE TYPES OF DATA
WHICH ARE ALREADY STORED IN AN ORDERED
FORM. THEY MAKE UP NEARLY 20% OF THE TOTAL
AMOUNT OF EXISTING DATA.
THERE ARE TWO FORMS OF SUCH DATA:
1) MACHINE GENERATED DATA =>
SENSORS, WEB LOGS ETC.
2) HUMAN GENERATED DATA =>
NAMES, ADDRESSES ETC.

THE EXAMPLE OF STRUCTURED DATA IS A DATABASE TABLE:


EMP_NO NAME AGE
001 XYZ 35

UNSTRUCTURED DATA:

UNSTRUCTURED DATA HAVE NO CLEAR
FORMAT IN STORAGE. ABOUT 80% OF DATA ARE
UNSTRUCTURED. ALL THE SATELLITE-GENERATED
IMAGES AND SCIENTIFIC DATA IMAGES ARE MACHINE-
GENERATED UNSTRUCTURED DATA. THERE ARE
VARIOUS TYPES OF HUMAN-GENERATED
UNSTRUCTURED DATA (IMAGES, VIDEOS, SOCIAL
MEDIA DATA, PDF, TEXT DOCUMENTS ETC.).
SEMI-STRUCTURED DATA:
IT IS VERY DIFFICULT TO CATEGORIZE THIS TYPE
OF DATA. SOMETIMES IT LOOKS
STRUCTURED AND SOMETIMES UNSTRUCTURED; THAT IS
WHY THESE DATA ARE KNOWN AS
SEMI-STRUCTURED DATA. WE CANNOT STORE THIS
TYPE OF DATA USING A TRADITIONAL DATABASE
FORMAT, BUT IT CONTAINS SOME
ORGANIZATIONAL PROPERTIES.
EXAMPLES (SEE THE SKETCH BELOW):
SPREADSHEET FILES, XML, JSON, NOSQL DATABASE
DATA ITEMS
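
AS A MINIMAL ILLUSTRATIVE SKETCH (THE FIELD NAMES ARE HYPOTHETICAL), TWO JSON RECORDS IN THE SAME FILE CAN CARRY DIFFERENT FIELDS, WHICH IS WHY SEMI-STRUCTURED DATA DOES NOT FIT A FIXED TABLE SCHEMA:

// two semi-structured JSON records: same file, different fields
{ "name": "xyz", "age": 35, "skills": ["java", "sql"] }
{ "name": "abc", "email": "abc@example.com" }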

FIVE V’S IN BIG DATA:


1) VOLUME:
THE AMOUNT OF DATA WHICH WE DEAL WITH
IS OF VERY LARGE SIZE, OF THE ORDER OF PETABYTES (A HUGE AMOUNT OF
DATA).

2) VARIETY:
DIFFERENT FORMATS OF DATA FROM VARIOUS
SOURCES (IMAGE, TEXT, PDF, AUDIO, VIDEO).
3) VELOCITY:
THE DATA IS GENERATED AT A VERY FAST
RATE. VELOCITY IS A MEASURE OF HOW FAST
THE DATA IS COMING IN.
4) VALUE:
THE ABILITY TO EXTRACT USEFUL DATA.

5) VALIDITY/VERACITY:
THE TRUSTWORTHINESS AND QUALITY OF THE DATA.

BENEFITS OF BIG DATA PROCESSING:--


1) COST SAVING:
SOME TOOLS OF BIG DATA, LIKE HADOOP, CAN
BRING COST ADVANTAGES TO BUSINESS
WHEN LARGE AMOUNTS OF DATA ARE TO BE
STORED, AND THESE TOOLS ALSO HELP IN
IDENTIFYING MORE EFFICIENT WAYS OF
DOING BUSINESS.
2) TIME REDUCTION:
HIGH-SPEED TOOLS LIKE HADOOP CAN EASILY
IDENTIFY NEW SOURCES OF DATA, WHICH
HELPS BUSINESSES ANALYZE DATA
IMMEDIATELY AND MAKE QUICK DECISIONS
BASED ON THE LEARNINGS.
3) UNDERSTAND MARKET CONDITIONS:
BY ANALYZING BIG DATA YOU CAN GET A BETTER
UNDERSTANDING OF CURRENT MARKET
CONDITIONS.
4) CONTROL ONLINE REPUTATION:
BIG DATA TOOLS CAN DO SENTIMENT ANALYSIS.
THEREFORE YOU CAN GET FEEDBACK ABOUT WHO
IS SAYING WHAT ABOUT YOUR COMPANY.

ASSIGNMENT:

1) WHAT IS BIG DATA?
2) WHAT ARE THE CHARACTERISTICS OF BIG DATA?
3) HOW IS ANALYSIS OF BIG DATA USEFUL FOR
AN ORGANISATION?
4) WHAT ARE THE CHALLENGES IN HANDLING
BIG DATA?
--: NO-SQL DATABASE :--
DEFINITION:
A NO-SQL DATABASE IS A NON-RELATIONAL DATABASE
MANAGEMENT SYSTEM THAT DOES NOT REQUIRE A FIXED
SCHEMA. IT AVOIDS JOINS AND IS EASY TO SCALE.
IT IS USED AS A DISTRIBUTED DATA STORE WITH A VERY LARGE
AMOUNT OF DATA. IT IS USED FOR BIG DATA AND REAL-TIME
WEB APPLICATIONS. FOR EXAMPLE, COMPANIES LIKE
TWITTER, FACEBOOK AND GOOGLE COLLECT TERABYTES OF USER
DATA EVERY SINGLE DAY. THIS TYPE OF DATA STORE MAY
NOT REQUIRE A FIXED SCHEMA, AVOIDS JOIN OPERATIONS AND
TYPICALLY SCALES HORIZONTALLY.

THE CONCEPT OF THE NOSQL DATABASE BECAME POPULAR WITH
INTERNET GIANTS LIKE GOOGLE, FACEBOOK, AMAZON ETC.,
WHO DEAL WITH HUGE VOLUMES OF DATA. THE SYSTEM
RESPONSE TIME BECOMES SLOW WHEN WE USE AN RDBMS FOR
MASSIVE VOLUMES OF DATA.
TO RESOLVE THIS PROBLEM WE COULD SCALE UP OUR
SYSTEMS BY UPGRADING OUR EXISTING HARDWARE, BUT THIS
PROCESS IS VERY EXPENSIVE.
THE ALTERNATIVE FOR THIS ISSUE IS TO DISTRIBUTE THE DATABASE
LOAD ON MULTIPLE HOSTS WHENEVER THE LOAD
INCREASES. THIS METHOD IS KNOWN AS SCALING OUT.

WHEN SHOULD NO-SQL BE USED?


1) WHEN A HUGE AMOUNT OF DATA NEEDS TO BE STORED AND
RETRIEVED.
2) WHEN THE RELATIONSHIP BETWEEN THE DATA STORED IS NOT THAT
IMPORTANT.
3) WHEN THE DATA CHANGES OVER TIME AND IS NOT STRUCTURED.
4) WHEN SUPPORT FOR CONSTRAINTS AND JOINS IS NOT REQUIRED AT
THE DATABASE LEVEL.
5) WHEN THE DATA IS GROWING CONTINUOUSLY AND YOU NEED TO
SCALE THE DATABASE REGULARLY TO HANDLE IT.
BRIEF HISTORY OF NO-SQL DATABASES:---

 1998 - CARLO STROZZI USED THE TERM NO-SQL FOR HIS
LIGHTWEIGHT, OPEN-SOURCE DATABASE WHICH DID NOT
HAVE AN SQL INTERFACE.
 2000 - GRAPH DATABASE NEO4J IS LAUNCHED.
 2004 - GOOGLE BIGTABLE IS LAUNCHED.
 2005 - COUCHDB IS LAUNCHED.
 2007 - THE RESEARCH PAPER ON AMAZON DYNAMO IS
RELEASED.
 2008 - FACEBOOK OPEN-SOURCES THE CASSANDRA
PROJECT.
 2009 - THE TERM NO-SQL WAS REINTRODUCED.
FEATURES OF NO-SQL::--

1) NON-RELATIONAL:
1) NO-SQL DATABASES NEVER FOLLOW THE
RELATIONAL MODEL.
2) THEY NEVER PROVIDE TABLES WITH FLAT FIXED-
COLUMN RECORDS.
3) THEY WORK WITH SELF-CONTAINED AGGREGATES OR
BINARY LARGE OBJECTS.
4) THEY DO NOT REQUIRE OBJECT-RELATIONAL MAPPING
AND DATA NORMALIZATION.
5) THERE ARE NO COMPLEX FEATURES LIKE QUERY
LANGUAGES, QUERY PLANNERS, REFERENTIAL INTEGRITY,
JOINS OR ACID.

2) SCHEMA-FREE:
1) NO-SQL DATABASES ARE EITHER SCHEMA-FREE OR
HAVE A RELAXED SCHEMA.
2) THEY DO NOT REQUIRE ANY SORT OF DEFINITION OF THE
SCHEMA OF THE DATA.
3) THEY OFFER HETEROGENEOUS STRUCTURES OF DATA IN
THE SAME DOMAIN.
3) SIMPLE API:
1) OFFERS EASY-TO-USE INTERFACES FOR STORAGE AND
QUERYING OF THE DATA PROVIDED.
2) APIs ALLOW LOW-LEVEL DATA MANIPULATION AND
SELECTION METHODS.
3) TEXT-BASED PROTOCOLS ARE MOSTLY USED, WITH HTTP
REST AND JSON.
4) MOSTLY NO STANDARD-BASED QUERY LANGUAGE IS USED.
5) WEB-ENABLED DATABASES RUN AS INTERNET-
FACING SERVICES.
4) DISTRIBUTED:
1) MULTIPLE NO-SQL DATABASES CAN BE EXECUTED IN
A DISTRIBUTED FASHION.
2) THEY OFFER AUTO-SCALING AND FAIL-OVER
CAPABILITIES.
3) OFTEN THE ACID CONCEPT CAN BE SACRIFICED FOR
SCALABILITY AND THROUGHPUT.
4) OFTEN ONLY EVENTUAL CONSISTENCY IS PROVIDED.
5) SHARED-NOTHING ARCHITECTURE. THIS ENABLES
LESS COORDINATION AND HIGHER DISTRIBUTION.
TYPES OF NOSQL DATABASES:
1) KEY VALUE STORE.
2) COLUMN ORIENTED.
3) GRAPH BASED.
4) DOCUMENT ORIENTED.

KEY VALUE STORE:


1) KEY-VALUE STORES ARE THE MOST BASIC TYPE OF NOSQL
DATABASE.
2) THEY ARE DESIGNED TO HANDLE HUGE AMOUNTS OF DATA.
3) THEY ARE BASED ON AMAZON'S DYNAMO PAPER.
4) KEY-VALUE STORES ALLOW DEVELOPERS TO STORE SCHEMA-
LESS DATA.
5) IN KEY-VALUE STORAGE, THE DATABASE STORES DATA AS A
HASH TABLE WHERE EACH KEY IS UNIQUE AND THE VALUE CAN
BE A STRING, JSON, BINARY LARGE OBJECT ETC.
6) A KEY MAY BE A STRING, LIST, SET OR SORTED SET, AND
VALUES ARE STORED AGAINST THESE KEYS.
7) FOR EXAMPLE, A KEY-VALUE PAIR MIGHT CONSIST OF A
KEY LIKE "NAME" THAT IS ASSOCIATED WITH A VALUE
LIKE "ROBIN".
8) KEY-VALUE STORES CAN BE USED AS
COLLECTIONS, DICTIONARIES, ASSOCIATIVE ARRAYS ETC.
9) KEY-VALUE STORES FOLLOW THE AVAILABILITY AND
PARTITION-TOLERANCE ASPECTS OF THE CAP THEOREM.
10) KEY-VALUE STORES WOULD WORK WELL FOR
SHOPPING CART CONTENTS, COLOR SCHEMES, DEFAULT
ACCOUNT NUMBERS ETC. (SEE THE SKETCH BELOW.)

EXAMPLE: REDIS, DYNAMO DB, RIAK ETC.
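
AS A MINIMAL SKETCH OF THE IDEA (PLAIN JAVASCRIPT, NOT TIED TO ANY PARTICULAR PRODUCT), A KEY-VALUE STORE BEHAVES LIKE A HASH TABLE OF UNIQUE KEYS:

// a toy key-value store built on a hash table
const store = new Map();
store.set("name", "robin");                 // put a string value
store.set("cart:1001", ["pen", "book"]);    // the value can be a list
console.log(store.get("name"));             // get by key -> "robin"
store.delete("cart:1001");                  // delete by key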

ADVANTAGES:

1) THEY CAN HANDLE LARGE AMOUNTS OF DATA AND HEAVY
LOAD.
2) EASY RETRIEVAL OF DATA BY KEYS.

DISADVANTAGES:
1) COMPLEX QUERIES MAY ATTEMPT TO INVOLVE
MULTIPLE KEY-VALUE PAIRS, WHICH MAY DELAY
PERFORMANCE.
2) DATA CAN INVOLVE MANY-TO-MANY
RELATIONSHIPS, WHICH MAY COLLIDE.

COLUMN ORIENTED DATABASES:


1) COLUMN-ORIENTED DATABASES PRIMARILY WORK ON COLUMNS,
AND EVERY COLUMN IS TREATED INDIVIDUALLY.
2) VALUES OF A SINGLE COLUMN ARE STORED CONTIGUOUSLY.
3) A COLUMN STORE KEEPS DATA IN COLUMN-SPECIFIC FILES.
4) IN A COLUMN STORE, THE QUERY PROCESSOR WORKS ON COLUMNS
TOO.
5) ALL DATA WITHIN EACH COLUMN DATA FILE HAVE THE SAME
TYPE, WHICH MAKES IT IDEAL FOR COMPRESSION.
6) COLUMN STORES CAN IMPROVE THE PERFORMANCE OF
QUERIES AS THEY CAN ACCESS SPECIFIC COLUMN DATA.
7) HIGH PERFORMANCE ON AGGREGATION
QUERIES (COUNT, SUM, AVG, MIN, MAX), AS SHOWN IN THE SKETCH BELOW.
8) THEY WORK WELL FOR DATA WAREHOUSES, BUSINESS
INTELLIGENCE, CUSTOMER RELATIONSHIP
MANAGEMENT, LIBRARY CARD CATALOGS ETC.

EXAMPLE:
BIGTABLE, CASSANDRA, SIMPLEDB ETC.
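
AS AN ILLUSTRATIVE SKETCH (PLAIN JAVASCRIPT, HYPOTHETICAL DATA), A COLUMN STORE KEEPS EACH COLUMN IN ITS OWN CONTIGUOUS ARRAY, SO AN AGGREGATION QUERY SCANS ONLY THE ONE COLUMN IT NEEDS:

// the salary column stored contiguously, separate from other columns
const salary = [39000, 34000, 45000, 56000];
const sum = salary.reduce((a, b) => a + b, 0); // SUM scans one array only
const avg = sum / salary.length;               // AVG reuses the same scan
console.log(sum, avg);                         // 174000 43500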

GRAPH DATABASE:
A GRAPH DATA STRUCTURE CONSISTS OF A FINITE SET OF
NODES AND A SET OF ORDERED PAIRS OF NODES, CALLED EDGES.
1) A GRAPH DATABASE STORES THE DATA IN A GRAPH.
2) IT IS CAPABLE OF ELEGANTLY REPRESENTING ANY KIND
OF DATA IN A HIGHLY ACCESSIBLE WAY.
3) A GRAPH DB IS A COLLECTION OF NODES AND EDGES.
4) EACH NODE REPRESENTS AN ENTITY AND EACH EDGE
REPRESENTS A CONNECTION OR RELATIONSHIP BETWEEN
TWO NODES.
5) EVERY NODE AND EDGE IS DEFINED BY A UNIQUE
IDENTIFIER.
6) EACH NODE KNOWS ITS ADJACENT NODES (SEE THE SKETCH BELOW).
7) AS THE NUMBER OF NODES INCREASES, THE COST OF A
LOCAL STEP REMAINS THE SAME.
8) INDEXES ARE USED FOR LOOKUPS.

EXAMPLES:
INFINITEGRAPH, NEO4J, ORIENTDB ETC.
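
A MINIMAL SKETCH OF THE IDEA (PLAIN JAVASCRIPT, HYPOTHETICAL NODES): EACH NODE KNOWS ITS ADJACENT NODES, SO A LOCAL STEP IS A DIRECT LOOKUP RATHER THAN A JOIN:

// nodes with unique identifiers; edges kept as adjacency lists
const nodes = {
  a: { name: "alice", friends: ["b"] },       // edge a -> b
  b: { name: "bob",   friends: ["a", "c"] },  // edges b -> a, b -> c
  c: { name: "carol", friends: ["b"] }
};
// one local step: follow the edges out of node a
for (const id of nodes.a.friends) console.log(nodes[id].name); // "bob"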

COMPARISON BETWEEN THE CLASSIC RELATIONAL
MODEL AND THE GRAPH MODEL:

RELATIONAL MODEL:
TABLES, ROWS, COLUMNS, JOINS
GRAPH MODEL:
VERTEX AND EDGE SETS, VERTICES, KEY/VALUE
PAIRS, EDGES
ADVANTAGES:
1) FASTEST TRAVERSAL BECAUSE OF CONNECTIONS.
2) DATA CAN BE EASILY HANDLED.

DISADVANTAGES:
WRONG CONNECTIONS MAY LEAD TO INFINITE LOOP.

DOCUMENT ORIENTED DATABASE:


1) IT IS A COLLECTION OF DOCUMENTS.
2) DATA IN THIS MODEL IS STORED INSIDE DOCUMENTS.
3) A DOCUMENT IS A KEY-VALUE COLLECTION WHERE THE
KEY ALLOWS ACCESS TO ITS VALUE.
4) DOCUMENTS ARE NOT TYPICALLY FORCED TO HAVE A
SCHEMA AND ARE THEREFORE FLEXIBLE AND EASY TO
CHANGE.
5) DOCUMENTS ARE STORED INTO COLLECTIONS IN
ORDER TO GROUP DIFFERENT KINDS OF DATA.
6) DOCUMENTS CAN CONTAIN MANY DIFFERENT KEY-
VALUE PAIRS, KEY-ARRAY PAIRS OR NESTED
DOCUMENTS (SEE THE SKETCH BELOW).

EXAMPLE:
MONGODB, COUCHDB ETC.
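
AS A SMALL SKETCH (THE FIELD NAMES ARE HYPOTHETICAL), A SINGLE DOCUMENT CAN MIX PLAIN KEY-VALUE PAIRS, A KEY-ARRAY PAIR AND A NESTED DOCUMENT:

// one document combining the three kinds of pairs
{
  "name": "xyz",                                   // key-value pair
  "hobbies": ["teaching", "watching tv"],          // key-array pair
  "address": { "city": "kolkata", "pin": 700001 }  // nested document
}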
ADVANTAGES:
1) THIS TYPE OF FORMAT IS VERY USEFUL AND IS USED TO
STORE SEMI-STRUCTURED DATA.
2) STORAGE, RETRIEVAL AND MANAGING OF DOCUMENTS IS
VERY EASY.

DISADVANTAGES:
1) HANDLING MULTIPLE DOCUMENTS IS VERY
CHALLENGING.
2) AGGREGATION OPERATIONS MAY NOT WORK
ACCURATELY.

COMPARISON BETWEEN THE CLASSIC RELATIONAL MODEL
AND THE DOCUMENT MODEL:

RELATIONAL MODEL:
TABLES, ROWS, COLUMNS, JOINS

DOCUMENT-ORIENTED MODEL:

COLLECTIONS, DOCUMENTS, KEY/VALUE PAIRS; NO
JOINS AVAILABLE.
BIGDATA TECHNOLOGY

DATE:2/11/2020

TOPIC: MONGODB

THIS IS HOW A DOCUMENT (A SET OF field: value PAIRS) LOOKS IN MONGODB:--

{
  name: "xyz",
  age: 30,
  website: "abc.com",
  hobbies: ["teaching", "watching tv"]
}

MONGODB DATATYPES:--

1) STRING:

THIS IS THE MOST COMMONLY USED DATATYPE TO STORE DATA. STRINGS
IN MONGODB MUST BE UTF-8 VALID. [UNICODE TRANSFORMATION FORMAT-8 ➔ IS A
FORMAT IN THE UNICODE CODING SYSTEM THAT USES FROM ONE TO FOUR
BYTES PER CHARACTER.]

2) INTEGER:
THIS TYPE IS USED TO STORE A NUMERICAL VALUE. AN INTEGER CAN BE 32-BIT
OR 64-BIT DEPENDING UPON YOUR SERVER.

3) BOOLEAN:
THIS TYPE IS USED TO STORE A BOOLEAN VALUE(TRUE/FALSE)
4) DOUBLE:
THIS TYPE IS USED TO STORE FLOATING POINT VALUES.

5) MIN/MAX KEY:
THIS TYPE IS USED TO COMPARE A VALUE AGAINST THE LOWEST AND
HIGHEST BSON ELEMENTS.

6) ARRAYS:
THIS TYPE IS USED TO STORE ARRAYS OR LIST OR MULTIPLE VALUES INTO
ONE KEY.

7) TIMESTAMP:
THIS CAN BE HANDY FOR RECORDING WHEN A DOCUMENT HAS BEEN
MODIFIED OR ADDED.

8) OBJECT:
THIS DATA TYPE IS USED FOR EMBEDDED DOCUMENTS

9) NULL:
THIS TYPE IS USED TO STORE A NULL VALUE.

10) SYMBOL:
THIS DATATYPE IS USED IDENTICALLY TO A STRING;
HOWEVER, IT IS GENERALLY RESERVED FOR LANGUAGES THAT USE A
SPECIFIC SYMBOL TYPE.

11) DATE:
THIS DATATYPE IS USED TO STORE THE CURRENT DATE OR TIME IN UNIX
TIME FORMAT. YOU CAN SPECIFY YOUR OWN DATE TIME BY CREATING AN
OBJECT OF DATE AND PASSING DAY, MONTH, YEAR INTO IT.

12) OBJECTID:
THIS DATATYPE IS USED TO STORE THE DOCUMENT’S ID.
13) BINARY DATA:
THIS DATATYPE IS USED TO STORE BINARY DATA.

14) CODE:

THIS DATATYPE IS USED TO STORE JAVASCRIPT CODE INTO THE DOCUMENT.

15) REGULAR EXPRESSION:


THIS DATA TYPE IS USED TO STORE REGULAR EXPRESSIONS.
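
AS A HEDGED ILLUSTRATION IN THE MONGO SHELL (THE COLLECTION AND FIELD NAMES ARE HYPOTHETICAL), ONE DOCUMENT CAN COMBINE SEVERAL OF THE DATATYPES LISTED ABOVE:

// string, integer, double, boolean, array, date and null in one document
db.sample.insertOne({
  name: "xyz",             // string
  age: NumberInt(30),      // 32-bit integer
  salary: 39000.50,        // double
  active: true,            // boolean
  hobbies: ["teaching"],   // array
  joined: new Date(),      // date
  remarks: null            // null
})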

TABLE VS COLLECTION

RDBMS:
STUDENT_ID STUDENT_NAME AGE COLLEGE
1001 XYZ 30 BIGINNERS BOOK
1002 ABC 29 BIGINNERS BOOK

MONGO DB:
{
"_id" : ObjectId("5ca66a95f9e57db687850dd4"),
STUDENT_ID: 1001,
STUDENT_NAME: "XYZ",
AGE: 30,
COLLEGE: "BIGINNERS BOOK"
}
{
"_id" : ObjectId("5ca66a95f9e57db687850dd5"),
STUDENT_ID: 1002,
STUDENT_NAME: "ABC",
AGE: 29,
COLLEGE: "BIGINNERS BOOK"
}
NOTES:
COLUMNS ARE REPRESENTED AS KEY-VALUE PAIRS (JSON FORMAT) AND ROWS
ARE REPRESENTED AS DOCUMENTS. MONGODB AUTOMATICALLY INSERTS
A UNIQUE _id (12-BYTE) FIELD IN EVERY DOCUMENT; THIS SERVES AS THE
PRIMARY KEY FOR EACH DOCUMENT.
ANOTHER THING IS THAT MONGODB SUPPORTS DYNAMIC SCHEMA, WHICH
MEANS ONE DOCUMENT OF A COLLECTION CAN HAVE 4 FIELDS WHILE THE
OTHER DOCUMENTS HAVE ONLY 3 FIELDS (SEE THE SKETCH BELOW).
THIS IS NOT POSSIBLE IN A RELATIONAL DATABASE.
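
AS A SHORT SKETCH OF DYNAMIC SCHEMA IN THE MONGO SHELL (THE COLLECTION NAME IS HYPOTHETICAL), TWO DOCUMENTS OF THE SAME COLLECTION CAN CARRY A DIFFERENT NUMBER OF FIELDS:

// the first document has 4 fields, the second only 3
db.students.insertMany([
  { STUDENT_ID: 1001, STUDENT_NAME: "XYZ", AGE: 30, COLLEGE: "BIGINNERS BOOK" },
  { STUDENT_ID: 1002, STUDENT_NAME: "ABC", AGE: 29 }
])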

DATABASE:

A DATABASE IS A PHYSICAL CONTAINER FOR COLLECTIONS. EACH DATABASE
GETS ITS OWN SET OF FILES ON THE FILE SYSTEM. A SINGLE MONGODB
SERVER TYPICALLY HAS MULTIPLE DATABASES.

COLLECTION:

A COLLECTION IS A GROUP OF MONGODB DOCUMENTS. IT IS THE
EQUIVALENT OF AN RDBMS TABLE. A COLLECTION EXISTS WITHIN A SINGLE
DATABASE. COLLECTIONS DO NOT ENFORCE A SCHEMA. DOCUMENTS WITHIN
A COLLECTION CAN HAVE DIFFERENT FIELDS. TYPICALLY ALL DOCUMENTS IN
A COLLECTION ARE OF SIMILAR OR RELATED PURPOSE.

DOCUMENT:

A DOCUMENT IS A SET OF KEY-VALUE PAIRS. DOCUMENTS HAVE DYNAMIC
SCHEMA. DYNAMIC SCHEMA MEANS THAT DOCUMENTS IN THE SAME
COLLECTION DO NOT NEED TO HAVE THE SAME SET OF FIELDS OR STRUCTURE,
AND COMMON FIELDS IN A COLLECTION'S DOCUMENTS MAY HOLD DIFFERENT
TYPES OF DATA.
MONGO SHELL:

THE MONGO SHELL IS AN INTERACTIVE JAVASCRIPT INTERFACE TO MONGODB.
THE MONGO SHELL IS USED TO QUERY AND UPDATE DATA AS WELL AS TO PERFORM
ADMINISTRATIVE OPERATIONS.

MONGODB CRUD OPERATIONS:

MONGODB PROVIDES A SET OF SOME BASIC BUT MOST ESSENTIAL OPERATIONS
THAT WILL HELP US TO EASILY INTERACT WITH THE MONGODB SERVER, AND
THESE OPERATIONS ARE KNOWN AS CRUD OPERATIONS.

C→ CREATE

R→READ

U→UPDATE

D→DELETE

CREATE OPERATION:

Create or insert operations add new documents to a collection. If the collection
does not currently exist, insert operations will create the collection.

MongoDB provides the following methods to insert documents into a collection:

1) db.collection.insertOne() → it is used to insert a single document into the
collection.
2) db.collection.insertMany() → it is used to insert multiple documents into the
collection.
Create a new collection:
db.createCollection(name, options)
example (capped, autoIndexId, size and max are the standard options):
db.createCollection(<name>,
{
  capped: <boolean>,
  autoIndexId: <boolean>,
  size: <number>,
  max: <number>
}
)

Create a database:
A database is created by using the "use" command.
use name_of_the_database
example:
use employee

Display the names of the databases:

To display all databases, use "show dbs".
Example:
show dbs

admin 0.000GB
employee 0.000GB
local 0.000GB
db.createCollection("emp")
// note: a collection does not take field definitions;
// the fields come from the documents inserted into it
Output:
{ "ok" : 1 }
The name of the created collection is emp.

insertOne():

db.employee.insertOne(
{
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}
)
Output:
{
  "acknowledged" : true,
  "insertedId" : ObjectId("5ca66a95f9e57db68788950dd3")
}
insertMany():
db.employee.insertMany(
[
  {
    Emp_id: 2,
    Emp_name: "jhon",
    Salary: 34000
  },
  {
    Emp_id: 3,
    Emp_name: "smit",
    Salary: 45000
  },
  {
    Emp_id: 4,
    Emp_name: "bob",
    Salary: 56000
  }
]
)

Output:
{
  "acknowledged" : true,
  "insertedIds" : [
    ObjectId("5ca66a95f9e57db68788950dd4"),
    ObjectId("5ca66a95f9e57db68788950dd5"),
    ObjectId("5ca66a95f9e57db68788950dd6")
  ]
}
READ OPERATION:
Read operations retrieve documents from a collection, i.e. query a collection
for documents. MongoDB provides the following method to read
documents from a collection:

db.collection.find() → it is used to retrieve documents from the collection;

we can specify query filters or criteria that identify the documents to
return.

To see the entered values:

db.employee.find()

output:

{
"_id" : ObjectId("5ca66a95f9e57db68788950dd3"), Emp_id: 1, Emp_name: "xyz", Salary: 39000
}
{
"_id" : ObjectId("5ca66a95f9e57db68788950dd4"), Emp_id: 2, Emp_name: "jhon", Salary: 34000
}
{
"_id" : ObjectId("5ca66a95f9e57db68788950dd5"), Emp_id: 3, Emp_name: "smit", Salary: 45000
}
{
"_id" : ObjectId("5ca66a95f9e57db68788950dd6"), Emp_id: 4, Emp_name: "bob", Salary: 56000
}
db.employee.find().pretty()

output:

{
  "_id" : ObjectId("5ca66a95f9e57db68788950dd3"),
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}
{
  "_id" : ObjectId("5ca66a95f9e57db68788950dd4"),
  Emp_id: 2,
  Emp_name: "jhon",
  Salary: 34000
}
{
  "_id" : ObjectId("5ca66a95f9e57db68788950dd5"),
  Emp_id: 3,
  Emp_name: "smit",
  Salary: 45000
}
{
  "_id" : ObjectId("5ca66a95f9e57db68788950dd6"),
  Emp_id: 4,
  Emp_name: "bob",
  Salary: 56000
}
db.employee.find({Emp_id:1}).pretty()

output:

{
  "_id" : ObjectId("5ca66a95f9e57db68788950dd3"),
  Emp_id: 1,
  Emp_name: "xyz",
  Salary: 39000
}

Update operation:

Update operations modify existing documents in a collection. MongoDB provides
the following methods to update documents of a collection.

db.collection.updateOne() → it is used to update a single document in the
collection that satisfies the given criteria.

db.collection.updateMany() → it is used to update multiple documents in the
collection that satisfy the given criteria.

db.collection.replaceOne() → it is used to replace a single document in the
collection that satisfies the given criteria.

Example (the $set update document is an assumed completion, since updateOne
takes both a filter and an update document):

db.employee.updateOne(
  { Salary: { $lt: 38000 } },    // filter: salary less than 38000
  { $set: { Salary: 38000 } }    // update: set the salary to 38000
)
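
A hedged sketch of updateMany() along the same lines (the filter and the $set values are illustrative, reusing the employee fields from the earlier examples):

db.employee.updateMany(
  { Salary: { $lt: 40000 } },    // filter: every employee below 40000
  { $set: { Salary: 40000 } }    // update: raise them all to 40000
)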
Delete operation:

Delete operations remove documents from a collection. MongoDB provides the
following methods to delete documents of a collection.

db.collection.deleteOne() → it is used to delete a single document from the
collection that satisfies the given criteria.

db.collection.deleteMany() → it is used to delete multiple documents from the
collection that satisfy the given criteria.
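
A hedged sketch of both delete methods, reusing the employee collection from the earlier examples (the filter values are illustrative):

db.employee.deleteOne({ Emp_id: 4 })                // removes the first matching document
db.employee.deleteMany({ Salary: { $lt: 35000 } })  // removes every matching document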
DATA MART

A DATA MART IS FOCUSED ON A SINGLE FUNCTIONAL AREA OF
AN ORGANIZATION AND CONTAINS A SUBSET OF THE DATA STORED
IN A DATA WAREHOUSE.
A DATA MART IS A CONDENSED VERSION OF A DATA
WAREHOUSE. IT IS DESIGNED FOR USE BY A SPECIFIC DEPARTMENT, UNIT
OR SET OF USERS IN AN ORGANIZATION.

WHY DO WE NEED A DATA MART?

 A DATA MART HELPS TO ENHANCE USERS' RESPONSE TIME
DUE TO A REDUCTION IN THE VOLUME OF DATA.
 IT PROVIDES EASY ACCESS TO FREQUENTLY REQUESTED
DATA.
 DATA MARTS ARE SIMPLER TO IMPLEMENT WHEN
COMPARED TO A CORPORATE DATA WAREHOUSE. AT THE
SAME TIME, THE COST OF IMPLEMENTING A DATA MART IS
CERTAINLY LOWER COMPARED WITH IMPLEMENTING A FULL
DATA WAREHOUSE.
 COMPARED TO A DATA WAREHOUSE, A DATA MART IS
AGILE. IN CASE OF A CHANGE IN THE MODEL, A DATA MART
CAN BE BUILT QUICKER DUE TO ITS SMALLER SIZE.
 A DATA MART IS DEFINED BY A SINGLE SUBJECT MATTER
EXPERT (SME).
 DATA IS PARTITIONED AND ALLOWS VERY GRANULAR
ACCESS CONTROL PRIVILEGES.
 DATA CAN BE SEGMENTED AND STORED ON DIFFERENT
HARDWARE AND SOFTWARE PLATFORMS.

TYPES OF DATA MART

THERE ARE THREE MAIN TYPES OF DATA MART.

1. DEPENDENT DATA MART:

DEPENDENT DATA MARTS ARE CREATED BY DRAWING DATA
DIRECTLY FROM OPERATIONAL, EXTERNAL OR BOTH SOURCES.
A DEPENDENT DATA MART ALLOWS SOURCING AN ORGANIZATION'S
DATA FROM A SINGLE
DATA WAREHOUSE. IT IS ONE OF THE DATA MART EXAMPLES
WHICH OFFERS THE BENEFIT OF CENTRALIZATION. IF YOU NEED
TO DEVELOP ONE OR MORE PHYSICAL DATA MARTS, THEN YOU
NEED TO CONFIGURE THEM AS DEPENDENT DATA MARTS.
IT CAN BE BUILT IN TWO DIFFERENT WAYS: 1) EITHER WHERE A
USER CAN ACCESS BOTH THE DATA MART AND THE DATA
WAREHOUSE, DEPENDING ON NEED, OR 2) WHERE ACCESS IS
LIMITED ONLY TO THE DATA MART. THE SECOND APPROACH IS NOT
OPTIMAL AS IT SOMETIMES PRODUCES WHAT IS REFERRED TO AS A
DATA JUNK-YARD.

2. INDEPENDENT DATA MART

AN INDEPENDENT DATA MART IS CREATED WITHOUT THE USE
OF A CENTRAL DATA WAREHOUSE. THIS KIND OF DATA MART IS
AN IDEAL OPTION FOR SMALLER GROUPS WITHIN AN
ORGANIZATION.
AN INDEPENDENT DATA MART HAS NEITHER A RELATIONSHIP
WITH THE ENTERPRISE DATA WAREHOUSE NOR WITH ANY
OTHER DATA MART. IN THIS DATA MART, THE DATA IS INPUT
SEPARATELY AND ITS ANALYSES ARE ALSO PERFORMED
AUTONOMOUSLY.
IMPLEMENTATIONS OF INDEPENDENT DATA MARTS ARE
ANTITHETICAL TO THE MOTIVATION FOR BUILDING A DATA
WAREHOUSE: FIRST OF ALL YOU NEED A
CONSISTENT, CENTRALIZED STORE OF ENTERPRISE DATA WHICH
CAN BE ANALYZED BY MULTIPLE USERS WITH DIFFERENT
INTERESTS WHO WANT WIDELY VARYING INFORMATION.
3. HYBRID DATA MART
A HYBRID DATA MART COMBINES INPUT FROM SOURCES
APART FROM THE DATA WAREHOUSE. THIS COULD BE
HELPFUL WHEN YOU WANT AD-HOC INTEGRATION, LIKE AFTER
A NEW GROUP OR PRODUCT IS ADDED TO AN ORGANIZATION.
IT IS THE DATA MART EXAMPLE BEST SUITED FOR MULTIPLE
DATABASE ENVIRONMENTS AND FAST IMPLEMENTATION
TURN-AROUND FOR ANY ORGANIZATION. IT ALSO REQUIRES THE
LEAST DATA CLEANSING EFFORT.
IT ALSO SUPPORTS LARGE STORAGE STRUCTURES, AND IT IS
BEST SUITED FOR FLEXIBLE, SMALLER DATA-CENTRIC
APPLICATIONS.

DATA MART IMPLEMENTATION

THE SIGNIFICANT STEPS IN IMPLEMENTING A DATA MART ARE
TO DESIGN THE SCHEMA, CONSTRUCT THE PHYSICAL STORAGE,
POPULATE THE DATA MART WITH DATA FROM THE SOURCE
SYSTEMS, ACCESS IT TO MAKE INFORMED DECISIONS AND
MANAGE IT OVER TIME.

1) DESIGNING:
THE DESIGN IS THE FIRST STEP IN THE DATA MART PROCESS.
THIS PHASE COVERS ALL OF THE FUNCTIONS FROM
INITIATING THE REQUEST FOR A DATA MART THROUGH
GATHERING DATA ABOUT THE REQUIREMENTS AND
DEVELOPING THE LOGICAL AND PHYSICAL DESIGN OF THE
DATA MART.
THE FOLLOWING TASKS ARE INVOLVED:
 GATHERING BUSINESS AND TECHNICAL
REQUIREMENTS
 IDENTIFYING THE DATA SOURCES
 SELECTING THE APPROPRIATE SUBSET OF DATA
 DESIGNING THE LOGICAL AND PHYSICAL
ARCHITECTURE OF THE DATA MART.
2) CONSTRUCTING:

THIS STEP INVOLVES CREATING THE PHYSICAL DATABASE
AND THE LOGICAL STRUCTURES ASSOCIATED WITH THE DATA
MART TO PROVIDE FAST AND EFFICIENT ACCESS TO THE
DATA.
THE FOLLOWING TASKS ARE INVOLVED:
 CREATING THE PHYSICAL DATABASE AND LOGICAL
STRUCTURES SUCH AS TABLESPACES ASSOCIATED
WITH THE DATA MART.
 CREATING THE SCHEMA OBJECTS SUCH AS TABLES AND
INDEXES DESCRIBED IN THE DESIGN STEP.
 DETERMINING HOW BEST TO SET UP THE
TABLES AND ACCESS STRUCTURES.

3) POPULATING:
THIS STEP INCLUDES ALL OF THE TASKS RELATED TO
GETTING DATA FROM THE SOURCES, CLEANING IT
UP, MODIFYING IT TO THE RIGHT FORMAT AND LEVEL OF
DETAIL, AND MOVING IT INTO THE DATA MART.
THE FOLLOWING TASKS ARE INVOLVED:
 MAPPING DATA SOURCES TO TARGET DATA
STRUCTURES
 EXTRACTING DATA
 CLEANSING AND TRANSFORMING THE
INFORMATION
 LOADING DATA INTO THE DATA MART
 CREATING AND STORING METADATA.

4) ACCESSING:
THIS STEP INVOLVES PUTTING THE DATA TO USE:
QUERYING THE DATA, ANALYZING IT, CREATING REPORTS,
CHARTS AND GRAPHS, AND PUBLISHING THEM.

THE FOLLOWING TASKS ARE INVOLVED:

 SET UP AN INTERMEDIATE LAYER FOR THE FRONT-
END TOOL TO USE. THIS LAYER TRANSLATES THE
DATABASE OPERATIONS AND OBJECT NAMES INTO
BUSINESS TERMS SO THAT THE END CLIENT
CAN INTERACT WITH THE DATA MART USING WORDS
WHICH RELATE TO THE BUSINESS FUNCTIONS.
 SET UP AND MANAGE DATABASE
STRUCTURES LIKE SUMMARIZED TABLES, WHICH
HELP QUERIES SUBMITTED THROUGH THE FRONT-END
TOOLS EXECUTE RAPIDLY AND EFFICIENTLY.
5) MANAGING:

THIS STEP INVOLVES MANAGING THE DATA MART OVER
ITS LIFETIME.
THE FOLLOWING TASKS ARE INVOLVED:
 PROVIDING SECURE ACCESS TO THE DATA.
 MANAGING THE GROWTH OF THE DATA.
 OPTIMIZING THE SYSTEM FOR BETTER
PERFORMANCE.
 ENSURING THE AVAILABILITY OF DATA EVEN WITH
SYSTEM FAILURES.
ADVANTAGES OF DATA MART:

 DATA MARTS CONTAIN A SUBSET OF ORGANIZATION-
WIDE DATA. THIS DATA IS VALUABLE TO A SPECIFIC GROUP
OF PEOPLE IN AN ORGANIZATION.
 IT IS A COST-EFFECTIVE ALTERNATIVE TO A DATA
WAREHOUSE, WHICH CAN TAKE HIGH COST TO BUILD.
 IT ALLOWS FASTER ACCESS TO DATA.
 IT IS EASY TO USE AS IT IS SPECIFICALLY DESIGNED FOR
THE NEEDS OF ITS USERS. THUS A DATA MART CAN
ACCELERATE BUSINESS PROCESSES.
 ITS IMPLEMENTATION TIME IS LESS THAN THAT OF A DATA
WAREHOUSE.
 IT CONTAINS HISTORICAL DATA WHICH ENABLES THE
ANALYST TO DETERMINE DATA TRENDS.

DISADVANTAGES OF DATA MART:

 MANY A TIME, ENTERPRISES CREATE TOO MANY
DISPARATE AND UNRELATED DATA MARTS WITHOUT
MUCH BENEFIT. THEY CAN BECOME A BIG HURDLE TO
MAINTAIN.
 A DATA MART CANNOT PROVIDE COMPANY-WIDE DATA
ANALYSIS AS ITS DATA SET IS LIMITED.
DIFFERENCE BETWEEN DATA WAREHOUSE AND DATA MART:

1. DATA WAREHOUSE: A DATA WAREHOUSE IS A VAST REPOSITORY OF
INFORMATION COLLECTED FROM VARIOUS ORGANIZATIONS OR DEPARTMENTS
WITHIN A CORPORATION.
   DATA MART: A DATA MART IS ONLY A SUBTYPE OF A DATA WAREHOUSE. IT IS
AN ARCHITECTURE TO MEET THE REQUIREMENTS OF A SPECIFIC USER GROUP.
2. DATA WAREHOUSE: IT MAY HOLD MULTIPLE SUBJECT AREAS.
   DATA MART: IT HOLDS ONE SUBJECT AREA.
3. DATA WAREHOUSE: IT HOLDS VERY DETAILED INFORMATION.
   DATA MART: IT MAY HOLD MORE SUMMARIZED DATA.
4. DATA WAREHOUSE: IT WORKS TO INTEGRATE ALL DATA SOURCES.
   DATA MART: IT CONCENTRATES ON INTEGRATING DATA FROM A GIVEN
SUBJECT AREA OR SET OF SOURCE SYSTEMS.
5. DATA WAREHOUSE: IN A DATA WAREHOUSE, A FACT CONSTELLATION SCHEMA
IS USED.
   DATA MART: IN A DATA MART, STAR SCHEMA AND SNOWFLAKE SCHEMA ARE
USED.
6. DATA WAREHOUSE: IT IS A CENTRALIZED SYSTEM.
   DATA MART: IT IS A DECENTRALIZED SYSTEM.
7. DATA WAREHOUSE: IT IS DATA-ORIENTED.
   DATA MART: IT IS PROJECT-ORIENTED.
Data Flow framework:

The Big Data framework provides a structure for organizations that want to start
with Big Data or aim to develop their Big Data capabilities further. The Big Data
framework includes all organizational aspects that should be taken into
account in a Big Data organization. The Big Data framework is vendor
independent.

Nowadays there is probably no single traditional software tool that would be
able to process these large volumes of data.

Special Big Data frameworks have been created to implement and support the
functionality of such software. They help rapidly process and structure huge
chunks of real-time data.

There are many great big-data tools on the market right now:

1. Most popular like Hadoop, Hive, Spark etc.


2. Most promising like Flink and Heron
3. Most useful like Map-reduce and Presto
4. Most underrated like Samza and Kudu.

Map reduce:

Hadoop Map-reduce is a software framework for distributed
processing of large data sets on computing clusters. It is a sub-project
of the Apache Hadoop project.
Apache Hadoop is an open-source framework that allows storing
and processing big data in a distributed environment across clusters of
computers using simple programming models.
Map-reduce is the core component for data processing in the Hadoop
framework. It helps to split the input data set into a number of parts
and run a program on all the data parts in parallel at once.
The term Map-reduce refers to two separate and distinct tasks.
The first is the map operation, which takes a set of data and converts it into
another set of data, where individual elements are broken down
into tuples (key/value pairs). The reduce operation combines those data
tuples based on the key and accordingly modifies the value of the key.

Word count example:

Let's consider a few words (e.g. bear, deer, river, car) of a text document.
We want to find the number of occurrences of each word.

Solution:
Input:
deer bear river
car car river
deer car bear

Output: the number of occurrences of each word, i.e.
bear 2, car 3, deer 2, river 2 (see the sketch below).
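
As a minimal sketch of the idea (plain JavaScript, not the Hadoop API), the map step emits (word, 1) tuples and the reduce step combines the tuples that share a key by summing their values:

// map: break each input line into (word, 1) tuples
const lines = ["deer bear river", "car car river", "deer car bear"];
const pairs = lines.flatMap(line => line.split(" ").map(w => [w, 1]));
// reduce: sum the values of all tuples sharing the same key
const counts = {};
for (const [word, n] of pairs) counts[word] = (counts[word] || 0) + n;
console.log(counts); // { deer: 2, bear: 2, river: 2, car: 3 }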


Map-reduce Architecture:

Provided by the user:
1. Job configuration
2. Input format
3. Input location
4. Map function
5. Number of reduce tasks
6. Reduce function
7. Output key type
8. Output value type
9. Output format
10. Output location

Provided by the Hadoop framework:
1. Input splitting and distribution
2. Start of individual map tasks
3. Shuffle, partition/sort per map output
4. Merge sort of map outputs for each reduce task
5. Start of individual reduce tasks
6. Collection of the final output

JobConf is the framework used to provide the various parameters of a map-reduce
job to Hadoop for execution.

The Hadoop platform executes the programs based on the configuration set using
JobConf.

The parameters are the map function, reduce function, combiner, partitioning
function, and input and output formats. The partitioner controls the shuffling of
the tuples when being sent from the mapper nodes to the reducer nodes. The total
number of partitions made of the tuples is equal to the number of reduce nodes,
i.e. based on the partitioning function's output the tuples are transmitted to
different reduce nodes.
Input format:

It describes the format of the input data for a map-reduce job.

Input location:

It describes the location of the data file.

Map function:

It converts the data into key-value pairs.

Reduce function:

It reduces the set of tuples which share a key to a single tuple with a change in
the value.

The number of map and reduce nodes can also be defined.

We can set the options such that a specific set of key-value pairs is transferred
to a specific reduce task.

The Hadoop framework consists of a single master and many slaves. The master
has a job tracker and each slave has a task tracker. The master distributes the
programs and data to the slaves.

The task tracker keeps track of the tasks directed to it and relays the information
to the job tracker. The job tracker monitors all the status reports and re-initiates
failed tasks if any.

Input formats in Hadoop map-reduce:

1. FileInputFormat:

It is the base class for all file-based input formats. It specifies the input directory
where the data files are located. When we start a Hadoop job, FileInputFormat is
provided with a path containing files to read. It will read all files and divide these
files into one or more input splits.
2. TextInputFormat:

It is the default input format of map-reduce. TextInputFormat treats each line
of an input file as a separate record and performs no parsing. This is useful for
unformatted data or line-based records.

 Key:
It is the byte offset of the beginning of the line within the file (not the whole file,
just one split), so it will be unique if combined with the filename.

 Value:
It is the contents of the line, excluding line terminators.

3. KeyValueTextInputFormat:

It is similar to TextInputFormat as it also treats each line of input as a separate
record. While TextInputFormat treats the entire line as the value, the
KeyValueTextInputFormat breaks the line itself into key and value by a tab
character ("\t"). Here the key is everything up to the tab character while the value
is the remaining part of the line after the tab character.

4. SequenceFileInputFormat:

It reads sequence files. Sequence files are binary files that store sequences of
binary key-value pairs. Sequence files block-compress and provide direct
serialization and de-serialization of several arbitrary data types (not just text).
Key and value both are user-defined.

5. SequenceFileAsTextInputFormat:

Hadoop SequenceFileAsTextInputFormat is another form of sequence file
input format which converts the sequence file keys and values to text
objects; the conversion is performed on the keys and values by calling the
toString() function. This input format makes sequence files suitable input for
streaming.
6. SequenceFileAsBinaryInputFormat:

It is a sequence file input format using which we can extract the sequence file's
keys and values as opaque binary objects.

7. NLineInputFormat:

It is another form of text input format where the keys are the byte offsets of the
lines and the values are the contents of the lines. Each mapper receives a variable
number of lines of input with TextInputFormat and KeyValueTextInputFormat;
the number depends on the size of the split and the length of the lines. If we want
our mapper to receive a fixed number of lines of input, then we use
NLineInputFormat.

8. DBInputFormat:

It is an input format that reads data from an RDBMS using JDBC. It is best for
loading relatively small datasets, perhaps for joining with large datasets from
HDFS using multiple inputs. Here the keys are LongWritables and the values are
DBWritables.
Types of Hadoop output formats:

1. TextOutputFormat:

Map-reduce's default reducer output format is TextOutputFormat, which
writes (key, value) pairs on individual lines of text files; its keys and values can
be of any type since TextOutputFormat turns them into strings by calling the
toString() function.

2. SequenceFileOutputFormat:

It is an output format which writes sequence files as its output, and it is an
intermediate format used between map-reduce jobs: it rapidly serializes
arbitrary data types to the file, and the corresponding sequence file input format
will deserialize the file into the same types.

3. SequenceFileAsBinaryOutputFormat:

It is another form of sequence file output format which writes keys and values to
a sequence file in binary format.

4. MapFileOutputFormat:

It is used to write output as map files. The keys in a map file must be added in
order, so we need to ensure that the reducer emits keys in sorted order.

5. MultipleOutputs:

It allows writing data to files whose names are derived from the output keys and
values, or in fact from an arbitrary string.

6. LazyOutputFormat:

Sometimes FileOutputFormat will create output files even if they are
empty. LazyOutputFormat is a wrapper output format which ensures that the
output file will be created only when a record is emitted for a given partition.
7. DBOutputFormat:

It is an output format for writing to an RDBMS or HBase. It sends the reduce
output to a SQL table. It accepts key-value pairs where the key has a type
extending DBWritable. It writes the reduce output to the database with a batch
SQL query.
MULTIDIMENSIONAL OLAP IN A DATA WAREHOUSE (MOLAP):

MOLAP IS A CLASSICAL OLAP THAT FACILITATES DATA ANALYSIS BY USING A
MULTIDIMENSIONAL DATA CUBE. DATA IS PRE-COMPUTED, RE-SUMMARIZED AND
STORED IN A MOLAP.

MULTIDIMENSIONAL DATA ANALYSIS IS ALSO POSSIBLE IF A RELATIONAL
DATABASE IS USED.

EX:

ORACLE’S EXPRESS SERVER.

ARCHITECTURE:

MOLAP ARCHITECTURE INCLUDES THE FOLLOWING COMPONENTS:

1. DATABASE SERVER
2. MOLAP SERVER
3. FRONT –END TOOLS.

1. THE USER REQUESTS REPORTS THROUGH THE INTERFACE.
2. THE APPLICATION LOGIC LAYER OF THE MDDB RETRIEVES THE STORED
DATA FROM THE DATABASE.
3. THE APPLICATION LOGIC LAYER FORWARDS THE RESULT TO THE CLIENT
SERVER.

MOLAP ARCHITECTURE MAINLY READS PRE-COMPILED DATA. MOLAP
ARCHITECTURE HAS LIMITED CAPABILITIES TO DYNAMICALLY CREATE
AGGREGATIONS OR TO CALCULATE RESULTS THAT HAVE NOT BEEN
PRE-CALCULATED AND STORED.
ADVANTAGES:
 MOLAP CAN MANAGE, ANALYZE AND STORE CONSIDERABLE
AMOUNTS OF MULTIDIMENSIONAL DATA.
 FAST QUERY PERFORMANCE DUE TO OPTIMIZED STORAGE,
INDEXING AND CACHING.
 SMALLER SIZE OF DATA AS COMPARED TO THE RELATIONAL
DATABASE.
 AUTOMATED COMPUTATION OF HIGHER-LEVEL AGGREGATES OF THE
DATA.
 IT HELPS USERS TO ANALYZE LESS-DEFINED DATA.
 MOLAP IS EASIER FOR THE USER; THAT IS WHY IT IS A SUITABLE
MODEL FOR INEXPERIENCED USERS.
 MOLAP CUBES ARE BUILT FOR FAST DATA RETRIEVAL AND ARE
OPTIMAL FOR SLICING AND DICING OPERATIONS.
 ALL CALCULATIONS ARE PRE-CALCULATED WHEN THE CUBE IS
CREATED.

DISADVANTAGES:

 ONE MAJOR WEAKNESS OF MOLAP IS THAT IT IS LESS SCALABLE THAN
ROLAP, AS IT HANDLES ONLY A LIMITED AMOUNT OF DATA.
 MOLAP ALSO INTRODUCES DATA REDUNDANCY AS IT IS RESOURCE-
INTENSIVE.
 THE PROCESSING OF MOLAP SOLUTIONS MAY BE LENGTHY,
PARTICULARLY ON LARGE DATA VOLUMES.
 MOLAP IS NOT CAPABLE OF CONTAINING DETAILED DATA.
 THE STORAGE UTILIZATION IS VERY LOW IF THE DATA SET IS HIGHLY
SCATTERED.
 IT CAN HANDLE ONLY A LIMITED AMOUNT OF DATA; THEREFORE IT IS
IMPOSSIBLE TO INCLUDE A LARGE AMOUNT OF DATA IN THE CUBE ITSELF.
MOLAP TOOLS:

ESSBASE:

A TOOL FROM ORACLE THAT HAS A MULTIDIMENSIONAL DATABASE.

EXPRESS SERVER:

A WEB-BASED ENVIRONMENT THAT RUNS ON AN ORACLE DATABASE.

YELLOWFIN:

A BUSINESS ANALYTICS TOOL FOR CREATING REPORTS AND DASHBOARDS.

CLEAR ANALYTICS:

IT IS AN EXCEL-BASED BUSINESS SOLUTION.

SAP BUSINESS INTELLIGENCE:

A BUSINESS ANALYTICS SOLUTION FROM SAP.

RELATIONAL OLAP (ROLAP):

ROLAP HAS A 3-TIER ARCHITECTURE.

1. IT WORKS DIRECTLY WITH RELATIONAL DATABASES.
2. FACT AND DIMENSION TABLES ARE STORED AS RELATIONS.
3. NEW RELATIONS ARE CREATED TO STORE AGGREGATE INFORMATION.
ADVANTAGES:

1. IT HANDLES LARGE AMOUNTS OF DATA.
2. 2-D RELATIONAL TABLES CAN BE VIEWED IN MULTIPLE
MULTIDIMENSIONAL FORMS.
3. DATABASE SECURITY THROUGH AUTHORIZATION.
4. ANY SQL REPORTING TOOL CAN BE USED TO ACCESS THE DATA.
5. THE TIME NEEDED TO LOAD DATA IS LESS.

DISADVANTAGES:

1. DIFFICULT TO PERFORM COMPLEX DATA CALCULATIONS.
2. LONG QUERY TIME FOR LARGE DATA SIZES.
3. ADDITIONAL DEVELOPMENT TIME AND MORE CODE SUPPORT ARE NEEDED.
4. DOES NOT HAVE COMPLEX AND COMPLICATED FUNCTIONS.

HOLAP (HYBRID OLAP):

THIS SYSTEM INCLUDES THE BEST OF ROLAP AND MOLAP.

ADVANTAGES:

1. HOLAP PROVIDES THE BENEFITS OF BOTH MOLAP AND ROLAP.
2. IT PROVIDES FAST ACCESS AT ALL LEVELS OF AGGREGATION.
3. HOLAP BALANCES THE DISK SPACE REQUIREMENT, AS IT ONLY STORES THE
AGGREGATE INFORMATION ON THE OLAP SERVER AND THE DETAIL
RECORDS REMAIN IN THE RELATIONAL DATABASE. SO NO DUPLICATE COPY
OF THE DETAIL RECORDS IS MAINTAINED.
COMPARISON OF OLAP SERVERS:

1. DETAIL DATA STORAGE LOCATION - ROLAP: RELATIONAL DATABASE;
MOLAP: MULTIDIMENSIONAL DATABASE; HOLAP: RELATIONAL DATABASE.
2. AGGREGATE DATA STORAGE LOCATION - ROLAP: RELATIONAL DATABASE;
MOLAP: MULTIDIMENSIONAL DATABASE; HOLAP: MULTIDIMENSIONAL DATABASE.
3. SPACE REQUIRED - ROLAP: LARGE; MOLAP: MEDIUM; HOLAP: SMALL.
4. QUERY RESPONSE TIME - ROLAP: SLOW; MOLAP: FAST; HOLAP: MEDIUM.
5. PROCESSING TIME - ROLAP: SLOW; MOLAP: FAST; HOLAP: FAST.
6. LATENCY - ROLAP: LOW; MOLAP: HIGH; HOLAP: MEDIUM.

DIMENSIONAL MODELLING:

DIMENSIONAL MODELLING (DM) IS A DATA STRUCTURING TECHNIQUE OPTIMIZED
FOR DATA STORAGE IN A DATA WAREHOUSE.

THE PURPOSE OF DIMENSIONAL MODELLING IS TO OPTIMIZE THE DATABASE FOR
FASTER RETRIEVAL OF DATA.

A DIMENSIONAL MODEL IN A DATA WAREHOUSE IS DESIGNED TO READ,
SUMMARIZE AND ANALYZE NUMERIC INFORMATION LIKE
VALUES, BALANCES, COUNTS ETC. IN A DATA WAREHOUSE.

DATA IN A WAREHOUSE IS USUALLY MULTIDIMENSIONAL, HAVING MEASURE
ATTRIBUTES (WHICH MEASURE SOME VALUES AND CAN BE AGGREGATED UPON
THOSE VALUES, e.g. SUM(), AVG()) AND DIMENSION ATTRIBUTES (WHICH DEFINE
THE DIMENSIONS ON WHICH THE MEASURE ATTRIBUTES AND THEIR SUMMARY
FUNCTIONS WORK).
ELEMENTS OF A DIMENSIONAL DATA MODEL:

1. FACTS:

FACTS ARE THE NUMERICAL MEASURES OR QUANTITIES BY WHICH ONE CAN
ANALYZE THE RELATIONSHIP BETWEEN DIMENSIONS.

2. DIMENSIONS:

DIMENSIONS ARE COLLECTIONS OF LOGICALLY RELATED ATTRIBUTES AND ARE
VIEWED AS AXES FOR MODELLING THE DATA.

3. ATTRIBUTES:

THE ATTRIBUTES ARE THE VARIOUS CHARACTERISTICS OF A DIMENSION IN
DIMENSIONAL DATA MODELLING.

IN A LOCATION DIMENSION, THE ATTRIBUTES CAN BE
STATE, COUNTRY, ZIPCODE ETC.

ATTRIBUTES ARE USED TO SEARCH, FILTER OR CLASSIFY FACTS.

DIMENSION TABLES CONTAIN ATTRIBUTES.

4. FACT TABLE:

THE RELATION CONTAINING SUCH MULTIDIMENSIONAL DATA IS CALLED A FACT
TABLE.

e.g.:

BOOK SHOP

BID  TID  NUMBER
B1   1    25
B2   2    36
FACT TABLE

IT IS THE PRIMARY TABLE IN DIMENSIONAL MODELLING.
A FACT TABLE CONTAINS: 1) MEASUREMENTS/FACTS, 2) FOREIGN KEYS TO
DIMENSION TABLES.

5. DIMENSION TABLE:

A DIMENSION TABLE IS A TABLE ASSOCIATED WITH EACH DIMENSION AND HELPS
IN DESCRIBING THE DIMENSION FURTHER.

DIMENSION TABLES ARE JOINED TO THE FACT TABLE VIA A FOREIGN KEY (SEE THE
SKETCH BELOW).

DIMENSION TABLES ARE DE-NORMALIZED TABLES.

THE DIMENSION ATTRIBUTES ARE THE VARIOUS COLUMNS IN A DIMENSION
TABLE.

THERE IS NO SET LIMIT FOR THE NUMBER OF DIMENSIONS.

e.g.:

BID  AUTHOR_NAME  PRICE
B1   XYZ          456.00
B2   ABC          250.00
DIMENSION TABLE
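
A SMALL ILLUSTRATIVE SKETCH IN JAVASCRIPT (USING THE TWO SAMPLE TABLES ABOVE): EACH FACT-TABLE ROW CARRIES THE FOREIGN KEY BID, WHICH IS LOOKED UP IN THE DIMENSION TABLE:

// fact table: measures plus a foreign key (BID) to the dimension table
const fact = [{ BID: "B1", TID: 1, NUMBER: 25 },
              { BID: "B2", TID: 2, NUMBER: 36 }];
// dimension table keyed by BID with descriptive attributes
const dim = { B1: { AUTHOR_NAME: "XYZ", PRICE: 456.00 },
              B2: { AUTHOR_NAME: "ABC", PRICE: 250.00 } };
// join fact rows to their dimension rows via the foreign key
for (const row of fact) console.log(row.NUMBER, dim[row.BID].AUTHOR_NAME);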

TYPES OF DIMENSIONS IN A DATA WAREHOUSE:

1. CONFORMED DIMENSION
2. OUTRIGGER DIMENSION
3. SHRUNKEN DIMENSION
4. ROLE-PLAYING DIMENSION
5. DIMENSION-TO-DIMENSION TABLE
6. JUNK DIMENSION
7. DEGENERATE DIMENSION
8. SWAPPABLE DIMENSION
9. STEP DIMENSION
STEPS OF DIMENSIONAL MODELLING:

THE ACCURACY IN CREATING YOUR DIMENSIONAL MODEL DETERMINES THE
SUCCESS OF YOUR DATA WAREHOUSE IMPLEMENTATION. HERE ARE THE STEPS
TO CREATE A DIMENSIONAL MODEL:

1. IDENTIFY THE BUSINESS PROCESS
2. IDENTIFY THE GRAIN (LEVEL OF DETAIL)
3. IDENTIFY THE DIMENSIONS
4. IDENTIFY THE FACTS
5. BUILD THE SCHEMA.

RULES FOR DIMENSIONAL MODELLING:

 LOAD ATOMIC DATA INTO DIMENSIONAL STRUCTURES.
 BUILD DIMENSIONAL MODELS AROUND BUSINESS PROCESSES.
 ENSURE THAT EVERY FACT TABLE HAS AN ASSOCIATED DATE
DIMENSION TABLE.
 ENSURE THAT ALL FACTS IN A SINGLE FACT TABLE ARE AT THE SAME GRAIN
OR LEVEL OF DETAIL.
 IT IS ESSENTIAL TO STORE REPORT LABELS AND FILTER DOMAIN VALUES IN
DIMENSION TABLES.
 CONTINUOUSLY BALANCE REQUIREMENTS AND REALITIES TO DELIVER A
BUSINESS SOLUTION TO SUPPORT DECISION MAKING.

BENEFITS OF DIMENSIONAL MODELLING:

 STANDARDIZATION OF DIMENSIONS ALLOWS EASY REPORTING ACROSS
AREAS OF THE BUSINESS.
 DIMENSION TABLES STORE THE HISTORY OF THE DIMENSIONAL
INFORMATION.
 DIMENSIONAL MODELS ALSO STORE DATA IN SUCH A FASHION THAT IT IS
EASIER TO RETRIEVE THE INFORMATION FROM THE DATA ONCE IT IS
STORED IN THE DATABASE.
 COMPARED TO THE NORMALIZED MODEL, DIMENSIONAL TABLES ARE EASIER
TO UNDERSTAND.
 INFORMATION IS GROUPED INTO CLEAR AND SIMPLE BUSINESS
CATEGORIES.
 THE DIMENSIONAL MODEL IS VERY UNDERSTANDABLE BY THE BUSINESS.
 DIMENSIONAL MODELLING IN A DATA WAREHOUSE CREATES A SCHEMA
WHICH IS OPTIMIZED FOR HIGH PERFORMANCE.
 THE DIMENSIONAL MODEL ALSO HELPS TO BOOST QUERY PERFORMANCE.

MULTIDIMENSIONAL SCHEMA:

A MULTIDIMENSIONAL SCHEMA IS SPECIALLY DESIGNED TO MODEL DATA
WAREHOUSE SYSTEMS. THE SCHEMAS ARE DESIGNED TO ADDRESS THE UNIQUE
NEEDS OF VERY LARGE DATABASES DESIGNED FOR THE ANALYTICAL
PURPOSE (OLAP).

A SCHEMA IS A COLLECTION OF DATABASE OBJECTS, INCLUDING TABLES,
VIEWS, INDEXES ETC.

TYPES OF DATA WAREHOUSE SCHEMA:

STAR SCHEMA

SNOWFLAKE SCHEMA

GALAXY SCHEMA
STAR SCHEMA:

A STAR SCHEMA IN A DATA WAREHOUSE IS ONE IN WHICH THE CENTER OF THE
STAR CAN HAVE ONE FACT TABLE AND A NUMBER OF ASSOCIATED DIMENSION
TABLES. IT IS KNOWN AS A STAR SCHEMA AS ITS STRUCTURE RESEMBLES A STAR.

THE STAR SCHEMA DATA MODEL IS THE SIMPLEST TYPE OF DATA WAREHOUSE
SCHEMA. IT IS ALSO KNOWN AS THE STAR JOIN SCHEMA AND IS OPTIMIZED FOR
QUERYING LARGE DATA SETS.

CHARACTERISTICS:

 EVERY DIMENSION IN A STAR SCHEMA IS REPRESENTED WITH ONLY ONE
DIMENSION TABLE.
 THE DIMENSION TABLE SHOULD CONTAIN THE SET OF ATTRIBUTES.
 THE DIMENSION TABLE IS JOINED TO THE FACT TABLE USING A FOREIGN
KEY.
 THE DIMENSION TABLES ARE NOT JOINED TO EACH OTHER.
 THE FACT TABLE CONTAINS KEYS AND MEASURES.
 THE STAR SCHEMA IS EASY TO UNDERSTAND AND PROVIDES OPTIMAL DISK
USAGE.
 THE DIMENSION TABLES ARE NOT NORMALIZED.
 THIS SCHEMA IS WIDELY SUPPORTED BY BI TOOLS.

ADVANTAGES:

 SIMPLEST AND EASIEST


 OPTIMIZES NAVIGATION THROUGH DATABASE
 MOST SUITABLE FOR QUERY PROCESSING.
WHAT IS A DATA WAREHOUSE?

A data warehouse is a collection of corporate data aggregated from one
or several sources.

It serves as a business analytical tool, which allows analyzing and
comparing data in order to solve working issues and improve business
processes.

How does it work?

The concept first appeared in the 1980s. It was developed to support
the transfer of dataflow from operational systems to decision-making systems.

These systems required the analysis of large amounts of heterogeneous
data accumulated by companies over time.

A data warehouse works on the following principle:--

1. Data is extracted into one area from heterogeneous sources.
2. It is converted in accordance with the needs of the decision support
system.
3. It is stored in the warehouse.

Thus the system provides answers for business decisions by analyzing
all this heterogeneous data. That is why data warehousing is
primarily aimed at simplifying decision-making processes and helping
executives to get the required information based on the whole data
quickly.
Benefits of a data warehouse:--

1. Quality data:--
Organizations add data sources to the data warehouse, so they
can be sure of their relevancy and constant availability. This
provides higher data quality and data integrity for informed
decision making.
2. Promotes decision making:--
Strategic decisions are based on facts and relevant data. They
are supported by information that the organization has collected
over time. Another plus is that leaders are better informed about
data requests and can extract information according to their
specific requirements.

Data warehouse architecture:--

A data warehouse architecture is a method for defining the
overall architecture of data communication, processing and
presentation that exists for end-client computing within the
enterprise.
Each data warehouse is different, but all are characterized by
standard vital components.
To design an effective and efficient data warehouse, we need to
understand and analyze the business needs and construct a
business analysis framework.
Each person has a different view regarding the design of a data
warehouse.
These views are:
1) The top-down view:
This view allows the selection of the relevant information needed
for a data warehouse.

2) The data source view:

This view presents the information being captured, stored and
managed by the operational systems.

3) The data warehouse view:

This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.

4) The business query view:

It is the view of the data from the viewpoint of the end user.
Generally a data warehouse adopts a three-tier architecture.
Following are the three tiers of the data warehouse
architecture:
1. Bottom Tier:
The bottom tier of the architecture is the data warehouse
server. It is a relational database system. We use back-end
tools and utilities to feed data into the bottom tier.
These back-end tools and utilities perform the extract, clean,
load and refresh functions.

2. Middle Tier:

In the middle tier, we have the online analytical processing (OLAP)
server, which can be implemented in either of the following ways:

 By relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations
on multidimensional data to standard relational operations.
 By multidimensional OLAP (MOLAP), which directly implements
the multidimensional data and operations.

3. Top Tier:
This tier is the front-end client layer. This layer holds the
query tools, analyzing tools and data-mining tools.
Data warehouse models:
There are three types of models:
1. Virtual warehouse:

The view over an operational data warehouse is known as a virtual
warehouse. It is easy to build a virtual warehouse.

Building a virtual warehouse requires excess capacity on operational
database servers.

2. Data mart:

A data mart contains a subset of organization-wide data. This
subset of data is valuable to specific groups of an
organization.
e.g.:
The marketing data mart may contain data related to items,
customers and sales. Data marts are subject-oriented.

Note:
 Windows-based/Unix/Linux-based servers are used to
implement data marts. They are implemented on low-
cost servers.
 The implementation cycle of a data mart is measured in
short periods of time, i.e. in weeks rather than months or
years.
 The life cycle of a data mart may be complex in the long
run if its planning and design are not organization-
wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally
structured data warehouse.
 Data marts are flexible.

3. Enterprise warehouse:

 An enterprise warehouse collects all the information and
subjects spanning an entire organization.
 It provides us enterprise-wide data integration.
 The data is integrated from operational systems and external
information providers.
 This information can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.

Data warehouses and their architectures vary depending upon
the elements of an organization's situation.
Three common architectures are:
1. Data warehouse architecture: basic
2. Data warehouse architecture: with staging area
3. Data warehouse architecture: with staging area and data marts
Data warehouse architecture: basic

Operational system:--

An operational system is a term used in data warehousing
to refer to a system that is used to process the day-to-day
transactions of an organization.

Flat files:--

A flat file system is a system of files in which transactional data
is stored, and every file in the system must have a different
name.

Metadata:--
A set of data that defines and gives information about other
data.
Metadata is used in a data warehouse for a variety of purposes.
Purpose of metadata:
Metadata summarizes necessary information about data,
which can make finding and working with particular instances of
data more accessible.
e.g.:
Author, date built, date changed and file size are examples
of very basic document metadata.
Metadata is used to direct a query to the most appropriate data
source.
Highly and lightly summarized data:
This area of the data warehouse saves all the predefined lightly
and highly summarized (aggregated) data generated by the
warehouse manager.

End-user access tools:
The principal purpose of a data warehouse is to provide
information to business managers for strategic decision
making. These customers interact with the warehouse using
end-client access tools.
Examples of end-user access tools:--
1. Reporting and query tools
2. Application development tools
3. Executive information system tools
4. Online analytical processing tools
5. Data mining tools

Data warehouse architecture: with staging area

We must clean and process our operational information before
putting it into the warehouse.
We can do this programmatically, although most data warehouses
use a staging area (a place where data is processed before entering
the warehouse).
A staging area simplifies data cleansing and consolidation for
operational data coming from multiple sources (enterprise
data warehouses).
Data warehouse architecture: with staging area and data
marts
We may want to customize our warehouse architecture for
multiple groups within our organization.
We can do this by adding data marts.
A data mart is a segment of a data warehouse that can provide
information for reporting and analysis on a section, unit,
department or operation in the company.
Subject name: BIGDATA TECHNOLOGY
Date: 09.10.2020 Time:11.30pm -2.30pm

Hadoop eco-system:
The Hadoop ecosystem handles big data more efficiently. It
comprises various tools that are required to perform different
tasks on HADOOP.
The components are:
1) HDFS
2) MAP-REDUCE
3) YARN
4) HIVE
5) PIG
6) HBASE
7) HCATALOG
8) AVRO
9) THRIFT
10) DRILL
11) MAHOUT
12) SQOOP
13) FLUME
14) AMBARI
15) ZOOKEEPER
16) OOZIE

1. HDFS:
HDFS refers to the Hadoop Distributed File System.
HDFS is a dedicated file system to store big
data with a cluster of commodity/cheaper
hardware with a streaming access pattern. It
enables data to be stored at
multiple nodes in the cluster, which ensures data
security and fault tolerance.
Components are:
1) Name node (master node; it does not store the
actual data. It stores metadata, i.e. the number of
blocks, their locations, on which rack and which data
node the data is stored, etc. It consists of files and
directories.)
2) Data node (slave node; it is responsible for
storing the actual data in HDFS. It performs read
and write operations as per the request of the client.
A replica block on a data node consists of 2 files on
the file system: the first file is for the data and the
second file is for recording the block's metadata.)

[Diagram: a name node managing data nodes A and B
running on commodity hardware]

At start-up, each data node connects to its
corresponding name node and does handshaking.
Verification of the namespace ID and the software
version of the data node takes place by handshaking;
if any mismatch occurs, the data node goes
down automatically.
Core aspects:
1. Streaming data access
2. Large data sets
3. Simple coherency model
4. Import/export of data to and from HDFS

2) Map-reduce:
Data once stored in HDFS also needs to be
processed. Suppose a query is sent to process a
dataset in HDFS. Hadoop identifies where the data
is stored (the map process); the query is broken
into multiple parts, and the results of all these
multiple parts are combined and the overall result
is sent back to the user (the reduce process).
While data is stored in HDFS, Map-reduce is used
to process that data. It is a software framework
for easily writing applications that process
vast amounts of structured and unstructured
data. Map-reduce programs are parallel in nature;
this parallel processing improves the speed and
reliability of the cluster.

Features:
1. Simplicity: map-reduce jobs are easy to run.
Applications can be written in many languages
such as Java, C++ and Python.
2. Scalability: it can process petabytes of data.
3. Speed: by means of parallel processing, problems
that take days to solve are solved in hours
and minutes by map-reduce.
4. Fault tolerance: it takes care of failures. If one
copy of the data is unavailable, another machine
that has a copy of the same data can be used for
the same task.

3) YARN:

The full form of YARN is YET ANOTHER RESOURCE
NEGOTIATOR. It is a HADOOP ecosystem
component that provides the resource
management. It is a very important component.
It is also known as the operating system of
HADOOP, as it is responsible for managing and
monitoring workloads. It allows multiple data
processing engines, such as real-time streaming
and batch processing, to handle data stored on a
single platform.
Features:
1) Flexibility
2) Efficiency
3) Shared

YARN consists of two essential components.

1) Resource manager:
It is the main node of the processing
department.
It works at the cluster level and takes
responsibility for running the master machine.
It takes job submissions and negotiates
the first container for executing an
application.
It consists of two components: the application
manager and the scheduler.
It receives the processing requests and then
passes the parts of the requests to the
corresponding node managers, where the actual
processing takes place.
2) Node manager:
Node managers are installed on every data
node. They are responsible for the execution of
tasks on every single data node.
It is a node-level component and runs
on every slave machine.
It is responsible for monitoring resource
utilization in each container and managing
containers.
It maintains continuous communication
with the resource manager to give
updates.

4) HIVE:
It is an open-source data warehouse system
for querying and analyzing large data sets stored
in HDFS.
HIVE performs three main functions: data
summarization, query and analysis.
Facebook created HIVE for people who are
fluent with SQL.
HIVE uses a language called HiveQL (HQL), which
is similar to SQL.
HQL automatically translates SQL-like queries
into map-reduce jobs which will execute on
HADOOP. HIVE is suitable for structured data.
After the data is analyzed, it is ready for the
user to access.
The main parts of HIVE are:
1) Meta-store: it stores the metadata.
2) Driver: it manages the lifecycle of a HQL
statement.
3) Query compiler: it compiles HQL into a
directed acyclic graph (DAG).
4) Hive server: it provides a thrift interface and a
JDBC/ODBC server.

Hive is highly scalable, as it can serve both
purposes (batch processing and real-time
processing).
It supports all primitive data types of SQL.

5) PIG
PIG is a high-level language platform for
analyzing and querying huge data sets that
are stored in HDFS. PIG as a component of the
HADOOP ecosystem uses the Pig Latin
language, which is also very similar to SQL.
It loads the data, applies the required filters
and dumps the data in the required format.
For program execution, PIG requires the Java
runtime environment.
The compiler internally converts Pig Latin to
map-reduce. It produces a sequential set of
map-reduce jobs. It was initially developed by
Yahoo. It gives us a platform for building data
flows for extract, transform and load (ETL),
processing and analyzing huge data sets.

How PIG works:

In PIG, first the load command loads the
data. Then we perform various
functions (grouping, joining etc.) and at last
either dump the data on the screen or store the
data into HDFS.
Features of PIG:
1. Extensibility:
For carrying out special-purpose processing,
users can create their own functions.
2. Optimization opportunities:
It allows the system to optimize execution
automatically. This allows the user to pay
attention to semantics instead of
efficiency.
3. Handles all kinds of data:
It analyzes both structured and
unstructured data.

6) HBase:
It is a distributed database. Data is stored in
table format; the tables can contain
billions of rows and millions of columns. It is a
scalable, distributed, NoSQL database built
on top of HDFS. It is written in JAVA.
It is modeled after Google's Bigtable.

Two components:
1) HBase master
2) Region server
DATA MODELLING IN MONGODB
THE DATA MODELLING OR DATA STRUCTURING REPRESENTS
THE NATURE OF DATA AND THE BUSINESS LOGIC TO CONTROL
THE DATA. IT ALSO ORGANIZES THE DATABASE.
THE STRUCTURE OF DATA IS EXPLICITLY DETERMINED BY
THE DATA MODEL.
THE DATA MODEL HELPS TO COMMUNICATE BETWEEN BUSINESS
PEOPLE, WHO REQUIRE THE COMPUTER SYSTEM, AND THE
TECHNICAL PERSON, WHO CAN FULFILL THEIR REQUIREMENTS.

DATA MODELLING CONCEPT HAS THREE LEVELS:


1) CONCEPTUAL MODEL
2) LOGICAL MODEL
3) PHYSICAL MODEL

CONCEPTUAL MODEL:
IN THIS MODEL, THE CONCEPTS OR SEMANTICS OF THE
DATABASE MODELS ARE CONCERNED. WE DO NOT NEED
TO CARE ABOUT THE ACTUAL DATA OR META
INFORMATION HERE.

LOGICAL MODEL:
IN THIS MODEL, THERE ARE THE DESCRIPTIONS OF TABLES,
COLUMNS, OBJECT ORIENTED CLASSES, XML TAGS AND
DOCUMENT STRUCTURES.

PHYSICAL MODEL:
THIS MODEL CONSISTS OF THE ACTUAL PHYSICAL
STRUCTURE TO STORE THE DATA, LIKE THE
PARTITIONS, CPU SPACES, REPLICATION ETC., i.e. HOW THE
ACTUAL DATA CAN BE STORED IN THE SYSTEM.

TYPES OF DATA MODELLING:


1) FLAT DATA MODELLING
2) STAR MODELLING
3) HIERARCHICAL MODELLING
4) RELATIONAL MODELLING
5) OBJECT-RELATIONAL MODELLING

FLAT DATA MODELLING:


IN THE FLAT DATABASE MODELLING, ALL DATA ARE
STORED AS SINGLE ROWS OF DATA IN A TEXT FILE. THE
FIELDS ARE SEPARATED BY DELIMITERS LIKE
SPACE, COMMA, TAB ETC.
THE MAIN DISADVANTAGE OF THIS MODEL IS THAT IT IS NOT
EASILY ACCESSIBLE BY DIFFERENT SOFTWARE TO MAKE
COMPLEX RELATIONSHIPS.

STAR DATA MODELLING:


IN THE STAR DATA MODELLING, ALL DATA
TABLES (DIMENSIONS) ARE ATTACHED TO ONE
CENTRALIZED TABLE (FACT TABLE). THE FACT TABLES ARE
IN 3NF IN THIS SCHEMA, WHEREAS THE DIMENSIONAL
TABLES ARE DENORMALIZED.
STAR MODELLING IS THE SIMPLEST ARCHITECTURE,
AND IT IS THE MOST COMMONLY USED NOWADAYS.
HIERARCHICAL MODELLING:
IN THE HIERARCHICAL MODEL, THE TABLES ARE STORED
IN A TREE STRUCTURE. THERE IS A SINGLE ROOT, AND
OTHER TABLES ARE LINKED WITH THE ROOT AT
DIFFERENT LEVELS.
IN THIS MODEL, ONE CHILD NODE MUST HAVE ONLY
ONE PARENT NODE.

RELATIONAL MODEL:
IN THE RELATIONAL MODEL, THE INFORMATION IS STORED
IN TWO-DIMENSIONAL TABLES, AND THE RELATIONS ARE
FORMED BY STORING THE COMMON ATTRIBUTES.
THE TABLES ARE ALSO KNOWN AS RELATIONS IN THIS
MODEL.

OBJECT RELATIONAL MODEL:


IN THE OBJECT-RELATIONAL MODEL, THE RELATIONAL
MODEL AND THE OBJECT ORIENTED MODEL ARE
MERGED TOGETHER.
IT SUPPORTS DIFFERENT OBJECT ORIENTED DATABASE
CONCEPTS LIKE OBJECTS, CLASSES, INHERITANCE ETC.

STORAGE ENGINE:
THE STORAGE ENGINE IS THE COMPONENT OF THE
DATABASE THAT IS RESPONSIBLE FOR MANAGING HOW
DATA IS STORED, BOTH IN MEMORY AND ON
DISK.
MONGODB SUPPORTS MULTIPLE STORAGE ENGINES, AS
DIFFERENT ENGINES PERFORM BETTER FOR SPECIFIC
WORKLOADS. CHOOSING THE APPROPRIATE STORAGE
ENGINE FOR YOUR USE CASE CAN SIGNIFICANTLY IMPACT
THE PERFORMANCE OF YOUR APPLICATIONS.
1) WIREDTIGER STORAGE ENGINE (DEFAULT):
IT IS A NO-SQL, OPEN SOURCE, EXTENSIBLE PLATFORM
FOR DATA MANAGEMENT. IT IS THE DEFAULT
STORAGE ENGINE STARTING IN MONGODB 3.2. IT IS
WELL SUITED FOR MOST WORKLOADS AND IS
RECOMMENDED FOR NEW DEPLOYMENTS. IT
PROVIDES A DOCUMENT-LEVEL CONCURRENCY
MODEL, CHECKPOINTING AND
COMPRESSION, AMONG OTHER FEATURES.

2) IN-MEMORY STORAGE ENGINE:

THE IN-MEMORY STORAGE ENGINE IS AVAILABLE IN
MONGODB ENTERPRISE. RATHER THAN STORING
DOCUMENTS ON DISK, IT RETAINS THEM IN MEMORY
FOR MORE PREDICTABLE DATA LATENCIES.
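
AS A MINIMAL SKETCH (ASSUMING A LOCAL INSTALLATION WITH A
DATA DIRECTORY AT THE PATH SHOWN), THE STORAGE ENGINE CAN BE
SELECTED WITH THE --storageEngine OPTION WHEN STARTING THE SERVER:

mongod --dbpath "D:\setup\mongodb\data" --storageEngine wiredTiger
mongod --dbpath "D:\setup\mongodb\data" --storageEngine inMemory

(THE SECOND COMMAND ASSUMES MONGODB ENTERPRISE, SINCE THE
IN-MEMORY ENGINE IS ONLY AVAILABLE THERE.)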
INDEXING:
AN INDEX SUPPORTS THE EFFICIENT RESOLUTION OF
QUERIES. WITHOUT INDEXES, MONGODB MUST SCAN
EVERY DOCUMENT OF THE COLLECTION TO SELECT
THOSE DOCUMENTS THAT MATCH THE QUERY
STATEMENT. THIS SCAN IS HIGHLY INEFFICIENT AND
REQUIRES MONGODB TO PROCESS A LARGE VOLUME
OF DATA.
INDEXES ARE SPECIAL DATA STRUCTURES THAT
STORE A SMALL PORTION OF THE DATA SET IN AN
EASY-TO-TRAVERSE FORM.

THE INDEX STORES THE VALUE OF A SPECIFIC FIELD
OR SET OF FIELDS, ORDERED BY THE VALUE OF THE
FIELD AS SPECIFIED IN THE INDEX.

TO CREATE AN INDEX, USE THE createIndex() METHOD IN
MONGODB.
SYNTAX:
db.collection_name.createIndex({key:1})
here key is the name of the field on which you want to
create the index and 1 is for ascending order.
To create an index in descending order you need to use -1.

TO DROP A PARTICULAR INDEX, USE THE dropIndex() METHOD IN
MONGODB.
SYNTAX:
db.collection_name.dropIndex({key:1})
db.collection_name.dropIndexes()
TO DISPLAY ALL INDEXES IN A COLLECTION, USE THE
getIndexes() METHOD.
Syntax:
db.collection_name.getIndexes()
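
For example, assuming a hypothetical collection named students
with a field roll_no, the methods above can be used as follows:

db.students.createIndex({roll_no: 1})    // ascending index on roll_no
db.students.createIndex({roll_no: -1})   // descending index on roll_no
db.students.getIndexes()                 // list all indexes of the collection
db.students.dropIndex({roll_no: 1})      // drop the ascending index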

DATA MODEL DESIGN:


MONGODB PROVIDES TWO TYPES OF DATA MODEL;
1) EMBEDDED DATA MODEL
2) NORMALIZED DATA MODEL

BASED ON THE REQUIREMENT,YOU CAN USE EITHER


OF THE MODELS WHILE PREPARING YOUR
DOCUMENTS.

EMBEDDED DATA MODEL:


IN THIS DATA MODEL, YOU CAN HAVE ALL THE
RELATED DATA IN A SINGLE DOCUMENT,IT IS ALSO
KNOWN AS DE-NORMALIZED DATA MODEL.

EXAMPLE:
ASSUME WE ARE GETTING THE DETAILS OF
EMPLOYEES IN THREE DIFFERENT DOCUMENTS
NAMELY ,PERSONAL DETAILS,CONTACT AND
ADDRESS.YOU CAN EMBED ALL THE THREE
DOCUMENTS IN A SINGLE ONE:

{
   _id: " ",
   Emp_Id: "10025AE336",
   Personal_Details: {
      First_Name: "Radhika",
      Last_Name: "Sharma",
      Date_of_Birth: "1995-09-26"
   },
   Contact: {
      Email: "radhika_sharma.123@gmail.com",
      Phone: "9848022338"
   },
   Address: {
      City: "Hyderabad",
      Area: "Madapur",
      State: "Telangana"
   }
}
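
WITH THE EMBEDDED MODEL, ALL THE DETAILS OF AN EMPLOYEE CAN
BE FETCHED WITH A SINGLE QUERY. A MINIMAL SKETCH, ASSUMING THE
DOCUMENT ABOVE IS STORED IN A HYPOTHETICAL COLLECTION NAMED
employees:

// one read returns personal details, contact and address together
db.employees.findOne({Emp_Id: "10025AE336"})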

NORMALIZED DATA MODEL


In this model, you can refer to the sub-documents from the
original document using references.
Example:
ASSUME WE ARE GETTING THE DETAILS OF
EMPLOYEES IN THREE DIFFERENT DOCUMENTS,
NAMELY PERSONAL DETAILS, CONTACT AND
ADDRESS. YOU CAN WRITE THE DOCUMENTS USING THE
NORMALIZED DATA MODEL:

Employee:
{
   _id: <ObjectId101>,
   Emp_id: "10025AE336"
}
Personal_Details:
{
   _id: <ObjectId102>,
   empdocID: <ObjectId101>,
   First_Name: "Radhika",
   Last_Name: "Sharma",
   Date_of_Birth: "1995-09-26"
}
Contact: {
   _id: <ObjectId103>,
   empdocID: <ObjectId101>,
   Email: "radhika_sharma.123@gmail.com",
   Phone: "9848022338"
}

Address: {
   _id: <ObjectId104>,
   empdocID: <ObjectId101>,
   City: "Hyderabad",
   Area: "Madapur",
   State: "Telangana"
}
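
WITH THE NORMALIZED MODEL, THE RELATED DOCUMENTS ARE FETCHED
BY FOLLOWING THE REFERENCES. A MINIMAL SKETCH, ASSUMING THE
DOCUMENTS ABOVE ARE STORED IN HYPOTHETICAL COLLECTIONS NAMED
Employee, Personal_Details, Contact AND Address:

// first fetch the employee, then use its _id to look up the
// documents that reference it through empdocID
var emp = db.Employee.findOne({Emp_id: "10025AE336"});
db.Personal_Details.findOne({empdocID: emp._id});
db.Contact.findOne({empdocID: emp._id});
db.Address.findOne({empdocID: emp._id});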
Subject name: Big Data
Topic name: Hadoop ecosystem (continued)
7) HCATALOG:
IT IS A TABLE AND STORAGE MANAGEMENT LAYER FOR
HADOOP. IT SUPPORTS DIFFERENT COMPONENTS AVAILABLE IN
THE HADOOP ECOSYSTEM, LIKE MAPREDUCE, HIVE AND PIG, TO EASILY
READ AND WRITE DATA FROM THE CLUSTER. IT IS A KEY
COMPONENT OF HIVE THAT ENABLES THE USERS TO STORE
THEIR DATA IN ANY FORMAT AND STRUCTURE. BY DEFAULT,
IT SUPPORTS RC FILE, CSV, JSON, SEQUENCE FILE AND ORC FILE
FORMATS. IT EXPOSES THE TABULAR DATA OF THE HIVE METASTORE TO
OTHER HADOOP APPLICATIONS.
BENEFITS:
IT ENABLES NOTIFICATIONS OF DATA AVAILABILITY.
WITH THE TABLE ABSTRACTION, HCATALOG FREES THE USER
FROM THE OVERHEAD OF DATA STORAGE.
IT PROVIDES VISIBILITY FOR DATA CLEANING AND ARCHIVING
TOOLS.
8) AVRO:
IT IS PART OF THE HADOOP ECOSYSTEM AND IS A MOST POPULAR
DATA SERIALIZATION SYSTEM. IT IS AN OPEN SOURCE PROJECT
THAT PROVIDES DATA SERIALIZATION AND DATA EXCHANGE
SERVICES IN HADOOP. THESE SERVICES CAN BE USED TOGETHER
OR INDEPENDENTLY. BIG DATA CAN EXCHANGE PROGRAMS
WRITTEN IN DIFFERENT LANGUAGES USING AVRO.
USING THE SERIALIZATION SERVICE, PROGRAMS CAN SERIALIZE
DATA INTO FILES OR MESSAGES. IT STORES THE DATA DEFINITION
AND DATA TOGETHER IN ONE MESSAGE OR FILE, MAKING IT
EASY FOR PROGRAMS TO DYNAMICALLY UNDERSTAND THE
INFORMATION STORED IN AN AVRO FILE OR MESSAGE.
FEATURES:
1) RICH DATA STRUCTURES
2) REMOTE PROCEDURE CALL
3) COMPACT, FAST, BINARY DATA FORMAT
4) CONTAINER FILE TO STORE PERSISTENT DATA.

AVRO SCHEMA:
IT RELIES ON SCHEMAS FOR
SERIALIZATION/DESERIALIZATION.
IT REQUIRES THE SCHEMA FOR DATA READ/WRITE.
WHEN AVRO DATA IS STORED IN A FILE, ITS SCHEMA IS
STORED WITH IT, SO THAT FILES MAY BE PROCESSED LATER
BY ANY PROGRAM.

DYNAMIC TYPING:
IT REFERS TO SERIALIZATION AND DESERIALIZATION
WITHOUT CODE GENERATION. IT COMPLEMENTS THE
CODE GENERATION WHICH IS AVAILABLE IN AVRO FOR
STATICALLY TYPED LANGUAGES AS AN OPTIONAL
OPTIMIZATION.

9) THRIFT:
IT IS A SOFTWARE FRAMEWORK FOR SCALABLE CROSS-
LANGUAGE SERVICE DEVELOPMENT. IT IS AN INTERFACE
DEFINITION LANGUAGE FOR RPC
COMMUNICATION. HADOOP DOES A LOT OF RPC CALLS, SO
THERE IS A POSSIBILITY OF USING THE HADOOP
ECOSYSTEM COMPONENT APACHE THRIFT FOR
PERFORMANCE OR OTHER REASONS. IT COMBINES A
SOFTWARE STACK WITH A CODE GENERATION ENGINE TO
BUILD SERVICES THAT WORK EFFICIENTLY ACROSS
DIFFERENT LANGUAGES: C++, JAVA, PYTHON, RUBY ETC. IT
ALLOWS US TO DEFINE DATA TYPES AND SERVICE
INTERFACES IN A SIMPLE DEFINITION FILE. IT IS A LIGHT-
WEIGHT, LANGUAGE-INDEPENDENT SOFTWARE STACK
WITH AN ASSOCIATED CODE GENERATION MECHANISM
FOR RPC. IT PROVIDES CLEAN ABSTRACTIONS FOR DATA
TRANSPORT, DATA SERIALIZATION AND APPLICATION-
LEVEL PROCESSING. IT WAS ORIGINALLY DEVELOPED BY
FACEBOOK AND NOW IT IS AN OPEN SOURCE PROJECT.

10) DRILL:
THE MAIN PURPOSE OF THIS HADOOP ECOSYSTEM
COMPONENT IS LARGE-SCALE DATA PROCESSING,
INCLUDING STRUCTURED AND SEMI-STRUCTURED DATA. IT
IS A LOW-LATENCY DISTRIBUTED QUERY ENGINE THAT IS
DESIGNED TO SCALE TO SEVERAL THOUSAND NODES AND
QUERY PETABYTES OF DATA. DRILL IS THE FIRST
DISTRIBUTED SQL QUERY ENGINE THAT HAS A SCHEMA-
FREE MODEL.

APPLICATION:
CONSIDER A COMPANY THAT PROVIDES CONSUMER PURCHASE DATA
FOR MOBILE AND INTERNET BANKING. IN THAT CASE, IT CAN USE
DRILL TO QUICKLY PROCESS TRILLIONS OF RECORDS
AND EXECUTE QUERIES.

FEATURES:
DRILL HAS A SPECIALIZED MEMORY MANAGEMENT
SYSTEM TO ELIMINATE GARBAGE COLLECTION AND
OPTIMIZE MEMORY ALLOCATION AND USAGE. DRILL PLAYS
WELL WITH HIVE BY ALLOWING DEVELOPERS TO REUSE
THEIR EXISTING HIVE DEPLOYMENT.
1) EXTENSIBILITY:
DRILL PROVIDES AN EXTENSIBLE ARCHITECTURE AT ALL
LAYERS, INCLUDING THE QUERY LAYER, QUERY
OPTIMIZATION, AND CLIENT API. WE CAN EXTEND ANY
LAYER FOR THE SPECIFIC NEED OF AN ORGANIZATION.
2) FLEXIBILITY:
IT PROVIDES A HIERARCHICAL COLUMNAR DATA MODEL
THAT CAN REPRESENT COMPLEX, HIGHLY DYNAMIC
DATA AND ALLOWS EFFICIENT PROCESSING.
3) DYNAMIC SCHEMA DISCOVERY:
IT DOES NOT REQUIRE A SCHEMA OR TYPE SPECIFICATION
FOR DATA IN ORDER TO START THE QUERY EXECUTION
PROCESS. INSTEAD, IT STARTS PROCESSING THE DATA IN
UNITS CALLED RECORD BATCHES AND DISCOVERS THE
SCHEMA AT THE TIME OF PROCESSING.
4) DECENTRALIZED METADATA:
UNLIKE OTHER SQL-ON-HADOOP TECHNOLOGIES, DRILL
DOES NOT HAVE A CENTRALIZED METADATA
REQUIREMENT. ITS USERS DO NOT NEED TO CREATE AND
MANAGE TABLES IN METADATA IN ORDER TO QUERY
DATA.
11) MAHOUT:
MAHOUT IS AN OPEN SOURCE FRAMEWORK FOR
CREATING SCALABLE MACHINE LEARNING ALGORITHMS
AND A DATA MINING LIBRARY. ONCE DATA IS STORED IN
HADOOP HDFS, IT PROVIDES THE DATA SCIENCE TOOLS
TO AUTOMATICALLY FIND MEANINGFUL PATTERNS IN
THAT BIG DATA.

ALGORITHMS OF MAHOUT ARE:

1) CLUSTERING:
HERE IT TAKES THE ITEMS IN A PARTICULAR CLASS AND
ORGANIZES THEM INTO NATURALLY OCCURRING
GROUPS, SUCH THAT ITEMS BELONGING TO THE SAME
GROUP ARE SIMILAR TO EACH OTHER.
2) COLLABORATIVE FILTERING:
IT MINES USER BEHAVIOR AND MAKES PRODUCT
RECOMMENDATIONS.
3) CLASSIFICATION:
IT LEARNS FROM EXISTING CATEGORIZATIONS AND THEN
ASSIGNS UNCLASSIFIED ITEMS TO THE BEST
CATEGORY.
4) FREQUENT PATTERN MINING:
IT ANALYZES ITEMS IN A GROUP AND THEN
IDENTIFIES WHICH ITEMS TYPICALLY APPEAR
TOGETHER.
12) SQOOP:
IT IMPORTS DATA FROM EXTERNAL SOURCES INTO
RELATED HADOOP ECOSYSTEM COMPONENTS LIKE
HDFS, HBASE OR HIVE. IT ALSO EXPORTS DATA FROM
HADOOP TO OTHER EXTERNAL SOURCES. SQOOP
WORKS WITH RELATIONAL DATABASES SUCH AS
ORACLE, MYSQL AND TERADATA.
FEATURES:
1. IMPORT SEQUENTIAL DATASETS FROM MAINFRAME:
IT SATISFIES THE GROWING NEED TO MOVE DATA
FROM MAINFRAME TO HDFS.
2. IMPORT DIRECT TO ORC FILES:
IMPROVES COMPRESSION AND LIGHT INDEXING AND
IMPROVES QUERY PERFORMANCE.
3. PARALLEL DATA TRANSFER:
FOR FASTER PERFORMANCE AND OPTIMAL SYSTEM
UTILIZATION.
4. EFFICIENT DATA ANALYSIS:
IMPROVES EFFICIENCY OF DATA ANALYSIS BY
COMBINING STRUCTURED DATA AND
UNSTRUCTURED DATA IN A SCHEMA-ON-READ
DATA POOL.
5. FAST DATA COPIES:
FROM THE EXTERNAL SYSTEM INTO HADOOP.
13) FLUME:
FLUME EFFICIENTLY COLLECTS, AGGREGATES AND MOVES
A LARGE AMOUNT OF DATA FROM ITS ORIGIN AND
SENDS IT TO HDFS. IT IS A FAULT TOLERANT AND
RELIABLE MECHANISM. THIS HADOOP ECOSYSTEM
COMPONENT ALLOWS THE DATA TO FLOW FROM THE
SOURCE INTO THE HADOOP ENVIRONMENT. IT USES A SIMPLE,
EXTENSIBLE DATA MODEL THAT ALLOWS FOR ONLINE
ANALYTIC APPLICATIONS. WE CAN GET THE DATA FROM
MULTIPLE SERVERS IMMEDIATELY INTO HADOOP.

14. AMBARI:
IT IS A MANAGEMENT PLATFORM FOR
PROVISIONING, MANAGING, MONITORING AND SECURING
AN APACHE HADOOP CLUSTER. HADOOP MANAGEMENT GETS
SIMPLER AS AMBARI PROVIDES A CONSISTENT, SECURE
PLATFORM FOR OPERATIONAL CONTROL.
FEATURES:
1. SIMPLIFIED INSTALLATION, CONFIGURATION AND
MANAGEMENT:
IT EASILY AND EFFICIENTLY CREATES AND MANAGES
CLUSTERS AT SCALE.
2. CENTRALIZED SECURITY SETUP:
IT REDUCES THE COMPLEXITY FOR ADMINISTRATORS TO
CONFIGURE CLUSTER SECURITY ACROSS THE ENTIRE
PLATFORM.
3. HIGHLY EXTENSIBLE AND CUSTOMIZABLE:
IT IS HIGHLY EXTENSIBLE FOR BRINGING CUSTOM
SERVICES UNDER MANAGEMENT.
4. FULL VISIBILITY INTO CLUSTER HEALTH:
IT ENSURES THAT THE CLUSTER IS HEALTHY AND
AVAILABLE WITH A HOLISTIC APPROACH TO
MONITORING.

15. ZOOKEEPER:
IT IS A CENTRALIZED SERVICE AND A HADOOP
ECOSYSTEM COMPONENT FOR MAINTAINING
CONFIGURATION INFORMATION, NAMING, PROVIDING
DISTRIBUTED SYNCHRONIZATION AND PROVIDING
GROUP SERVICES. IT MANAGES AND COORDINATES A
LARGE CLUSTER OF MACHINES.
FEATURES:
FAST:
IT IS FAST WITH WORKLOADS WHERE READS OF DATA
ARE MORE COMMON THAN WRITES. THE IDEAL READ/
WRITE RATIO IS 10:1.
ORDERED:
IT MAINTAINS A RECORD OF ALL TRANSACTIONS.

16) OOZIE:
IT IS A WORKFLOW SCHEDULER SYSTEM FOR MANAGING
APACHE HADOOP JOBS. IT COMBINES MULTIPLE JOBS
SEQUENTIALLY INTO ONE LOGICAL UNIT OF
WORK. THE OOZIE FRAMEWORK IS FULLY INTEGRATED WITH THE
HADOOP STACK, WITH YARN AS ITS ARCHITECTURAL CENTER,
AND SUPPORTS HADOOP JOBS FOR MAPREDUCE, PIG,
HIVE AND SQOOP.
DATA PREPROCESSING:

• IT IS DONE TO IMPROVE THE QUALITY OF DATA IN THE DATA
WAREHOUSE.
• IT INCREASES EFFICIENCY.
• IT REMOVES NOISY DATA, INCONSISTENT DATA AND INCOMPLETE
DATA (DATA WITH MISSING VALUES).

DATA CLEANING:
• DATA CLEANING IS A PROCESS OF ENSURING THAT DATA IS
CORRECT, CONSISTENT AND USABLE.
• IT CLEANS THE DATA BY FILLING IN THE MISSING VALUES,
SMOOTHING NOISY DATA (TYPING ERRORS), RESOLVING
INCONSISTENCIES (NAMING CONVENTIONS) AND REMOVING
THE OUTLIERS.

MOST ASPECTS OF DATA CLEANING CAN BE DONE THROUGH
THE USE OF SOFTWARE TOOLS, BUT A PORTION OF IT MUST
BE DONE MANUALLY.

SOFTWARE TOOLS:
DATA CLEANER, OPENREFINE, WINPURE, DATA LADDER ETC.

BENEFITS OF DATA CLEANING:

1. IT REMOVES THE MAJOR ERRORS AND INCONSISTENCIES
THAT ARE INEVITABLE WHEN MULTIPLE SOURCES OF DATA
ARE BEING PULLED INTO ONE DATA SET.
2. USING TOOLS TO CLEAN UP DATA WILL MAKE EVERYONE
ON YOUR TEAM MORE EFFICIENT, AS YOU WILL BE ABLE
TO QUICKLY GET WHAT YOU NEED FROM THE DATA
AVAILABLE TO YOU.
3. FEWER ERRORS MEAN HAPPIER CUSTOMERS.
4. IT ALLOWS YOU TO MAP DIFFERENT DATA FUNCTIONS
AND BETTER UNDERSTAND WHAT YOUR DATA IS
INTENDED TO DO AND LEARN WHERE IT IS COMING
FROM.
5. INCREASED PRODUCTIVITY.
6. FASTER SALES CYCLE:
MARKETING DECISIONS DEPEND ON THE DATA. GIVING
YOUR MARKETING DEPARTMENT THE BEST QUALITY OF
DATA POSSIBLE MEANS BETTER AND MORE LEADS FOR
YOUR SALES TEAM TO CONVERT.

DATA CLEANING STEPS:

1) MONITOR ERRORS:-

KEEP A RECORD OF TRENDS SHOWING WHERE MOST OF YOUR ERRORS ARE
COMING FROM. THIS WILL MAKE IT A LOT EASIER TO IDENTIFY AND FIX
INCORRECT OR CORRUPT DATA. RECORDS ARE ESPECIALLY IMPORTANT
IF YOU ARE INTEGRATING OTHER SOLUTIONS WITH YOUR FLEET
MANAGEMENT SOFTWARE, SO THAT YOUR ERRORS DO NOT CLOG UP THE
WORK OF OTHER DEPARTMENTS.

2) STANDARDIZE YOUR PROCESS:-

STANDARDIZE THE POINT OF ENTRY TO HELP REDUCE THE RISK OF
DUPLICATION.

3) VALIDATE DATA ACCURACY:-

ONCE YOU HAVE CLEANED YOUR EXISTING DATABASE, VALIDATE THE
ACCURACY OF YOUR DATA. RESEARCH AND INVEST IN DATA TOOLS
THAT ALLOW YOU TO CLEAN YOUR DATA IN REAL TIME.

4) SCRUB FOR DUPLICATE DATA:-

IDENTIFY DUPLICATES TO HELP SAVE TIME WHEN ANALYZING THE
DATA. REPEATED DATA CAN BE AVOIDED BY RESEARCHING AND
INVESTING IN DIFFERENT DATA CLEANING TOOLS THAT CAN ANALYZE
RAW DATA IN BULK AND AUTOMATE THE PROCESS FOR YOU.

5) ANALYZE YOUR DATA:-

AFTER YOUR DATA HAS BEEN STANDARDIZED, VALIDATED AND
SCRUBBED FOR DUPLICATES, USE THIRD-PARTY SOURCES TO
APPEND IT. RELIABLE THIRD-PARTY SOURCES CAN CAPTURE
INFORMATION DIRECTLY FROM FIRST-PARTY SITES, THEN CLEAN AND
COMPILE THE INFORMATION FOR BUSINESS INTELLIGENCE AND
ANALYTICS.

6) COMMUNICATE WITH YOUR TEAM:-

SHARE THE NEW STANDARDIZED CLEANING PROCESS WITH YOUR TEAM
TO PROMOTE ADOPTION OF THE NEW PROTOCOL. NOW THAT YOU
HAVE SCRUBBED DOWN YOUR DATA, IT'S IMPORTANT TO KEEP IT
CLEAN. KEEPING YOUR TEAM IN THE LOOP WILL HELP YOU DEVELOP
AND STRENGTHEN CUSTOMER SEGMENTATION AND SEND MORE
TARGETED INFORMATION TO CUSTOMERS AND PROSPECTS.

DATA CLEANING TECHNIQUES:

THERE ARE A NUMBER OF TECHNIQUES THAT HAVE BEEN DEVELOPED TO
ASSIST IN CLEANING BIG DATA.

1. CONVERSION TABLES:-

WHEN CERTAIN DATA ISSUES ARE ALREADY KNOWN, THE DATA CAN BE
SORTED BY THE RELEVANT KEY AND THEN LOOKUPS CAN BE USED
IN ORDER TO MAKE THE CONVERSION.

2. HISTOGRAMS:-

THESE ALLOW FOR IDENTIFICATION OF VALUES THAT OCCUR LESS
FREQUENTLY AND MAY BE INVALID.

3. TOOLS:-

EVERY DAY MAJOR VENDORS ARE COMING OUT WITH NEW AND
BETTER TOOLS TO MANAGE BIG DATA AND THE COMPLEXITIES THAT
CAN ACCOMPANY IT.

4. ALGORITHMS:-

ALGORITHMS SUCH AS SPELL CHECK OR PHONETIC ALGORITHMS CAN BE USEFUL.


CHALLENGES OF DATA CLEANING:-

• LIMITED KNOWLEDGE ABOUT WHAT IS CAUSING
ANOMALIES, CREATING DIFFICULTIES IN CREATING THE RIGHT
TRANSFORMATIONS.
• DATA DELETION, WHERE A LOSS OF INFORMATION LEADS TO
INCOMPLETE DATA THAT CANNOT BE ACCURATELY "FILLED IN".
• ONGOING MAINTENANCE CAN BE EXPENSIVE AND TIME
CONSUMING.
• IT IS DIFFICULT TO BUILD A DATA CLEANSING GRAPH TO ASSIST
WITH THE PROCESS AHEAD OF TIME.

DATA INTEGRATION:

IT IS A PREPROCESSING METHOD THAT INVOLVES MERGING DATA
FROM DIFFERENT SOURCES (FLAT FILES, MULTI-DIMENSIONAL
DATABASES, DATA CUBES) IN ORDER TO FORM A DATA STORE LIKE A
DATA WAREHOUSE.

e.g:

1) COMPANIES COMBINE DATA FROM
SALES, MARKETING, FINANCE, FULFILLMENT, CUSTOMER SUPPORT AND
TECHNICAL SUPPORT, OR SOME COMBINATION OF THOSE ELEMENTS,
TO UNDERSTAND THE CUSTOMER JOURNEY.
2) PUBLIC ATTRACTIONS SUCH AS ZOOS COMBINE WEATHER DATA WITH
HISTORICAL ATTENDANCE DATA TO BETTER PREDICT STAFFING
REQUIREMENTS ON SPECIFIC DATES.

3) HOTELS USE WEATHER DATA AND DATA ABOUT MAJOR EVENTS TO
MORE PRECISELY ALLOCATE RESOURCES AND MAXIMIZE PROFITS
THROUGH DYNAMIC PRICING.

BENEFITS OF DATA INTEGRATION:--

• MORE EFFECTIVE COLLABORATION.
• FASTER ACCESS TO COMBINED DATA SETS THAN TRADITIONAL
METHODS SUCH AS MANUAL INTEGRATIONS.
• MORE COMPREHENSIVE VISIBILITY INTO AND ACROSS DATA
ASSETS.
• DATA SYNCING TO ENSURE THE DELIVERY OF TIMELY, ACCURATE
DATA.
• ERROR REDUCTION AS OPPOSED TO MANUAL INTEGRATIONS.
• HIGHER DATA QUALITY OVER TIME.

ISSUES IN DATA INTEGRATION:-

1) SCHEMA INTEGRATION AND OBJECT MATCHING.
COMPANY A (EMP_ID, NAME, DOB, AGE, SALARY(RS))
COMPANY B (EMP_NO, NAME, SALARY($))

2) REDUNDANCY -> UNWANTED ATTRIBUTES.

3) DETECTION AND RESOLUTION OF DATA VALUE
CONFLICTS -> CORRECTLY MODIFY VALUES.

DATA INTEGRATION IMPLEMENTATION:--

• MANUAL INTEGRATION BETWEEN SOURCE SYSTEMS.
• APPLICATION INTEGRATIONS THAT REQUIRE THE APPLICATION
PUBLISHERS TO OVERCOME THE INTEGRATION CHALLENGES OF
THEIR RESPECTIVE SYSTEMS.
• COMMON STORAGE INTEGRATION: DATA FROM DIFFERENT
SYSTEMS IS REPLICATED AND STORED IN A COMMON
INDEPENDENT SYSTEM.
• MIDDLEWARE, WHICH TRANSFERS THE DATA INTEGRATION LOGIC
FROM THE APPLICATION TO A SEPARATE MIDDLEWARE LAYER.
• VIRTUAL INTEGRATION OR UNIFORM ACCESS
INTEGRATION, WHICH PROVIDES VIEWS OF THE DATA, BUT THE DATA
REMAINS IN ITS ORIGINAL REPOSITORY.
• APIs, WHICH ARE SOFTWARE INTERMEDIARIES THAT ENABLE
APPLICATIONS TO COMMUNICATE AND SHARE DATA.
DATA ANALYSIS:

THE SYSTEMATIC APPLICATION OF STATISTICAL AND LOGICAL
TECHNIQUES TO DESCRIBE THE DATA SCOPE, MODULARIZE THE DATA
STRUCTURE, CONDENSE THE DATA REPRESENTATION
(IMAGES, TABLES, GRAPHS) AND EVALUATE STATISTICAL INCLINATIONS
AND PROBABILITY DATA TO DERIVE MEANINGFUL CONCLUSIONS IS
KNOWN AS DATA ANALYSIS.

OR

DATA ANALYSIS IS DEFINED AS A PROCESS OF CLEANING,
TRANSFORMING AND MODELING DATA TO DISCOVER USEFUL
INFORMATION FOR BUSINESS DECISION MAKING.

THE PURPOSE OF DATA ANALYSIS IS TO EXTRACT USEFUL
INFORMATION FROM DATA AND TO TAKE DECISIONS BASED UPON
THE DATA ANALYSIS.

TYPES OF DATA ANALYSIS:

THERE ARE SEVERAL TYPES OF DATA ANALYSIS TECHNIQUES THAT EXIST
BASED ON BUSINESS AND TECHNOLOGY.

HOWEVER, THE MAJOR DATA ANALYSIS METHODS ARE:--

1) TEXT ANALYSIS:--

TEXT ANALYSIS IS ALSO REFERRED TO AS DATA MINING. IT IS ONE
OF THE METHODS OF DATA ANALYSIS USED TO DISCOVER A PATTERN IN
LARGE DATA SETS USING DATABASES OR DATA MINING TOOLS.
IT IS USED TO TRANSFORM RAW DATA INTO BUSINESS
INFORMATION. BUSINESS INTELLIGENCE TOOLS ARE PRESENT IN
THE MARKET WHICH ARE USED TO TAKE STRATEGIC BUSINESS
DECISIONS. OVERALL, IT OFFERS A WAY TO EXTRACT AND EXAMINE
DATA, DERIVE PATTERNS AND FINALLY INTERPRET
THE DATA.

2) STATISTICAL ANALYSIS:--

IT SHOWS "WHAT HAPPENED?" BY USING PAST DATA IN THE FORM OF
DASHBOARDS. STATISTICAL ANALYSIS INCLUDES
COLLECTION, ANALYSIS, INTERPRETATION, PRESENTATION AND
MODELING OF THE DATA. IT ANALYZES A SET OF DATA OR A SAMPLE OF
DATA. THERE ARE TWO CATEGORIES OF THIS TYPE OF ANALYSIS:--

A) DESCRIPTIVE ANALYSIS:

THIS ANALYSIS IS DONE ON EITHER COMPLETE DATA OR A SAMPLE OF
SUMMARIZED NUMERICAL DATA. IT SHOWS THE MEAN AND
DEVIATION FOR CONTINUOUS DATA, WHEREAS PERCENTAGE AND
FREQUENCY FOR CATEGORICAL DATA.
B) INFERENTIAL ANALYSIS:

IT ANALYZES A SAMPLE FROM THE COMPLETE DATA. IN THIS TYPE OF ANALYSIS,
YOU CAN FIND DIFFERENT CONCLUSIONS FROM THE SAME DATA BY
SELECTING DIFFERENT SAMPLES.

3) DIAGNOSTIC ANALYSIS:--

DIAGNOSTIC ANALYSIS SHOWS "WHY DID IT HAPPEN?" BY
FINDING THE CAUSE FROM THE INSIGHTS FOUND IN STATISTICAL
ANALYSIS. THIS ANALYSIS IS USEFUL FOR IDENTIFYING BEHAVIOR
PATTERNS IN DATA. IF A NEW PROBLEM ARRIVES IN YOUR
BUSINESS PROCESS, THEN YOU CAN LOOK INTO THIS ANALYSIS TO
FIND SIMILAR PATTERNS FOR THAT PROBLEM. AND IT MAY HAVE
CHANCES TO USE SIMILAR PRESCRIPTIONS FOR THE NEW
PROBLEM.

4) PREDICTIVE ANALYSIS:--

PREDICTIVE ANALYSIS SHOWS "WHAT IS LIKELY TO HAPPEN?" BY
USING PREVIOUS DATA.
THE SIMPLEST EXAMPLE OF THIS ANALYSIS IS: IF LAST YEAR I
BOUGHT TWO DRESSES BASED ON MY SAVINGS, AND IF THIS YEAR
MY SALARY DOUBLES, THEN I CAN BUY FOUR
DRESSES. BUT OF COURSE IT'S NOT AS EASY AS THIS, BECAUSE YOU
HAVE TO THINK ABOUT OTHER CIRCUMSTANCES, LIKE THE CHANCE
THAT THE PRICE OF THE CLOTHES INCREASES THIS YEAR, OR MAYBE
INSTEAD OF DRESSES YOU WANT TO BUY A NEW BIKE, OR YOU NEED TO BUY
A HOUSE. SO HERE, THIS ANALYSIS MAKES PREDICTIONS ABOUT
THE FUTURE OUTCOMES BASED ON THE CURRENT OR PAST
DATA. FORECASTING IS JUST AN ESTIMATE. ITS ACCURACY IS
BASED ON HOW MUCH DETAILED INFORMATION YOU HAVE AND HOW
MUCH YOU DIG INTO IT.

5) PRESCRIPTIVE ANALYSIS:--

PRESCRIPTIVE ANALYSIS COMBINES THE INSIGHTS FROM ALL
PREVIOUS ANALYSES TO DETERMINE WHICH ACTION TO TAKE ON A
CURRENT PROBLEM OR DECISION. MOST DATA-DRIVEN
COMPANIES ARE UTILIZING PRESCRIPTIVE ANALYSIS BECAUSE
PREDICTIVE AND DESCRIPTIVE ANALYSIS ARE NOT ENOUGH TO
IMPROVE DATA PERFORMANCE. BASED ON CURRENT SITUATIONS
AND PROBLEMS, THEY ANALYZE THE DATA AND MAKE DECISIONS.

DATA ANALYSIS PROCESS:--

THE DATA ANALYSIS PROCESS IS NOTHING BUT GATHERING
INFORMATION BY USING A PROPER APPLICATION OR TOOL
WHICH ALLOWS YOU TO EXPLORE THE DATA AND FIND A
PATTERN IN IT. BASED ON THAT INFORMATION AND DATA, YOU
CAN MAKE DECISIONS, OR YOU CAN REACH ULTIMATE
CONCLUSIONS.
DATA ANALYSIS CONSISTS OF THE FOLLOWING PHASES---
1) DATA REQUIREMENT GATHERING:

FIRST OF ALL, YOU HAVE TO THINK ABOUT WHY YOU WANT TO DO
THIS DATA ANALYSIS. YOU NEED TO FIND OUT THE PURPOSE OR
AIM OF DOING THE ANALYSIS OF DATA. YOU HAVE TO DECIDE WHAT TO
ANALYZE AND HOW TO MEASURE IT; YOU HAVE TO UNDERSTAND WHY
YOU ARE INVESTIGATING AND WHAT MEASURES YOU HAVE TO USE TO
DO THIS ANALYSIS.

2) DATA COLLECTION:

AFTER REQUIREMENT GATHERING, YOU WILL GET A CLEAR IDEA
ABOUT WHAT THINGS YOU HAVE TO MEASURE AND WHAT YOUR
FINDINGS SHOULD BE. NOW IT'S TIME TO COLLECT YOUR DATA BASED ON
THE REQUIREMENTS. ONCE YOU COLLECT YOUR DATA, REMEMBER
THAT THE COLLECTED DATA MUST BE PROCESSED OR ORGANIZED FOR
ANALYSIS. AS YOU COLLECT DATA FROM VARIOUS SOURCES, YOU
MUST KEEP A LOG WITH THE COLLECTION DATE AND SOURCE
OF THE DATA.

3) DATA CLEANING:

NOW, WHATEVER DATA IS COLLECTED MAY NOT BE USEFUL OR MAY BE
IRRELEVANT TO YOUR AIM OF ANALYSIS; HENCE IT SHOULD BE
CLEANED. THE DATA WHICH IS COLLECTED MAY CONTAIN DUPLICATE
RECORDS, WHITE SPACES OR ERRORS. THE DATA SHOULD BE CLEANED
AND ERROR FREE. THIS PHASE MUST BE DONE BEFORE ANALYSIS,
BECAUSE BASED ON THE DATA CLEANING, YOUR OUTPUT OF ANALYSIS
WILL BE CLOSER TO YOUR EXPECTED OUTCOME.

4) DATA ANALYSIS:

ONCE THE DATA IS COLLECTED, CLEANED AND PROCESSED, IT IS READY
FOR ANALYSIS. AS YOU MANIPULATE DATA, YOU MAY FIND YOU HAVE
THE EXACT INFORMATION THAT YOU NEED, OR YOU MIGHT NEED TO
COLLECT MORE DATA. DURING THIS PHASE YOU CAN USE DATA
ANALYSIS TOOLS AND SOFTWARE WHICH WILL HELP YOU TO
UNDERSTAND, INTERPRET AND DERIVE CONCLUSIONS BASED ON THE
REQUIREMENTS.

5) DATA INTERPRETATION:

AFTER ANALYZING YOUR DATA, IT'S FINALLY TIME TO INTERPRET YOUR
RESULTS. YOU CAN CHOOSE THE WAY TO EXPRESS OR COMMUNICATE
YOUR DATA ANALYSIS: EITHER YOU CAN USE SIMPLE
WORDS, OR MAYBE TABLES OR CHARTS. THEN USE THE RESULTS OF
YOUR DATA ANALYSIS PROCESS TO DECIDE YOUR BEST COURSE OF
ACTION.

6) DATA VISUALIZATION:

DATA VISUALIZATION IS VERY COMMON IN YOUR DAY-TO-DAY LIFE;
VISUALIZATIONS OFTEN APPEAR IN THE FORM OF CHARTS AND GRAPHS. DATA IS
SHOWN GRAPHICALLY SO THAT IT WILL BE EASIER FOR THE HUMAN
BRAIN TO UNDERSTAND AND PROCESS IT.

DATA VISUALIZATION IS OFTEN USED TO DISCOVER UNKNOWN FACTS
AND TRENDS; BY OBSERVING RELATIONSHIPS AND COMPARING DATA
SETS, YOU CAN FIND A WAY TO DISCOVER MEANINGFUL
INFORMATION.

DATA ANALYSIS TOOLS:

DATA ANALYSIS TOOLS MAKE IT EASIER FOR USERS TO PROCESS AND
MANIPULATE DATA, ANALYZE THE RELATIONSHIPS AND CORRELATIONS
BETWEEN DATA SETS, AND THEY ALSO HELP TO IDENTIFY PATTERNS AND
TRENDS FOR INTERPRETATION.

1) Xplenty:
IT IS A CLOUD-BASED ETL SOLUTION PROVIDING SIMPLE
VISUALIZED DATA PIPELINES FOR AUTOMATED DATA FLOWS
ACROSS A WIDE RANGE OF SOURCES AND
DESTINATIONS. XPLENTY'S POWERFUL ON-PLATFORM
TRANSFORMATION TOOLS ALLOW YOU TO CLEAN, NORMALIZE
AND TRANSFORM DATA WHILE ALSO ADHERING TO
COMPLIANCE BEST PRACTICES.
FEATURES:
• POWERFUL, CODE-FREE ON-PLATFORM DATA
TRANSFORMATION OFFERING.
• REST API CONNECTOR: PULL IN DATA FROM ANY SOURCE
THAT HAS A REST API.
• DESTINATION FLEXIBILITY: SEND DATA TO
DATABASES, DATA WAREHOUSES AND SALESFORCE.
• SECURITY FOCUSED: FIELD-LEVEL DATA ENCRYPTION AND
MASKING TO MEET COMPLIANCE REQUIREMENTS.
• REST API: TO ACHIEVE ANYTHING POSSIBLE ON THE Xplenty
UI VIA THE Xplenty API.
• CUSTOMER-CENTRIC COMPANY THAT LEADS WITH FIRST-
CLASS SUPPORT.

2) MICROSOFT POWER BI:

POWER BI IS A BI [BUSINESS INTELLIGENCE] AND ANALYTICS
PLATFORM THAT SERVES TO INGEST DATA FROM VARIOUS
SOURCES, INCLUDING BIG DATA SOURCES, AND PROCESS AND CONVERT
IT INTO ACTIONABLE INSIGHTS. THE PLATFORM INCLUDES A RANGE
OF PRODUCTS (POWER BI DESKTOP, POWER BI PRO, POWER BI
PREMIUM, POWER BI MOBILE, POWER BI REPORT SERVER AND
POWER BI EMBEDDED) SUITABLE FOR DIFFERENT BI AND
ANALYTICS NEEDS.
FEATURES:
• INTEGRATED WITH 100+ ON-PREMISES AND CLOUD-BASED
DATA SOURCES.
• MULTI-LANGUAGE SUPPORT: DAX, POWER QUERY, SQL, R
AND PYTHON.
• ML, AI, BIG DATA AND STREAM ANALYSIS CAPABILITIES.
• INTERACTIVE DASHBOARDING.
• PREBUILT AND CUSTOMIZABLE VISUALS.
• WORKSPACE AND ROW-LEVEL SECURITY.

3) MICROSOFT HDINSIGHT:

AZURE HDINSIGHT IS A SPARK AND HADOOP SERVICE IN THE
CLOUD. IT PROVIDES BIG DATA CLOUD OFFERINGS IN TWO
CATEGORIES: STANDARD AND PREMIUM. IT PROVIDES AN ENTERPRISE-
SCALE CLUSTER FOR THE ORGANIZATION TO RUN THEIR BIG DATA
WORKLOADS.

FEATURES:
• RELIABLE ANALYTICS WITH AN INDUSTRY-LEADING
SLA (SERVICE LEVEL AGREEMENT).
• IT OFFERS ENTERPRISE-GRADE SECURITY AND MONITORING.
• PROTECTS DATA ASSETS AND EXTENDS ON-PREMISES
SECURITY AND GOVERNANCE CONTROLS TO THE CLOUD.
• HIGH-PRODUCTIVITY PLATFORM FOR DEVELOPERS AND
SCIENTISTS.
• INTEGRATION WITH LEADING PRODUCTIVITY APPLICATIONS.
• DEPLOY HADOOP IN THE CLOUD WITHOUT PURCHASING
NEW HARDWARE OR PAYING OTHER UP-FRONT COSTS.

4) SKYTREE:

SKYTREE IS ONE OF THE BEST BIG DATA ANALYTICS TOOLS THAT
EMPOWERS DATA SCIENTISTS TO BUILD MORE ACCURATE MODELS
FASTER. IT OFFERS ACCURATE PREDICTIVE MACHINE LEARNING MODELS
THAT ARE EASY TO USE.

FEATURES:

• HIGHLY SCALABLE ALGORITHMS.
• ARTIFICIAL INTELLIGENCE FOR DATA SCIENTISTS.
• IT ALLOWS DATA SCIENTISTS TO VISUALIZE AND UNDERSTAND
THE LOGIC BEHIND ML DECISIONS.
• SKYTREE CAN BE USED VIA THE EASY-TO-ADOPT GUI OR PROGRAMMATICALLY IN
JAVA.
• MODEL INTERPRETABILITY.
• IT IS DESIGNED TO SOLVE ROBUST PREDICTIVE PROBLEMS WITH
DATA PREPARATION CAPABILITIES.
• PROGRAMMATIC AND GUI ACCESS.

5) TALEND:

TALEND IS A BIG DATA ANALYTICS SOFTWARE THAT SIMPLIFIES AND
AUTOMATES BIG DATA INTEGRATION. ITS GRAPHICAL WIZARD
GENERATES NATIVE CODE. IT ALSO ALLOWS BIG DATA
INTEGRATION AND MASTER DATA MANAGEMENT AND CHECKS DATA
QUALITY.

FEATURES:

• ACCELERATES TIME TO VALUE FOR BIG DATA PROJECTS.
• SIMPLIFIES ETL AND ELT FOR BIG DATA.
• TALEND BIG DATA PLATFORM SIMPLIFIES USING MAPREDUCE
AND SPARK BY GENERATING NATIVE CODE.
• SMARTER DATA QUALITY WITH ML AND NATURAL LANGUAGE
PROCESSING.
• AGILE DEVOPS TO SPEED UP BIG DATA PROJECTS.
• STREAMLINES ALL THE DEVOPS PROCESSES.
6) SPLICE MACHINE:

SPLICE MACHINE IS ONE OF THE BEST BIG DATA ANALYTICS
TOOLS. ITS ARCHITECTURE IS PORTABLE ACROSS PUBLIC CLOUDS
SUCH AS AWS, AZURE AND GOOGLE.

FEATURES:

• IT IS A BIG DATA ANALYTICS SOFTWARE THAT CAN
DYNAMICALLY SCALE FROM A FEW TO THOUSANDS OF NODES TO
ENABLE APPLICATIONS AT EVERY SCALE.
• IT AUTOMATICALLY EVALUATES EVERY QUERY TO THE DISTRIBUTED
HBASE REGIONS.
• REDUCES MANAGEMENT, DEPLOYS FASTER AND REDUCES RISK.
• CONSUMES FAST STREAMING DATA; DEVELOP, TEST AND
DEPLOY ML MODELS.

7) SPARK:

IT IS ONE OF THE MOST POWERFUL OPEN SOURCE BIG DATA ANALYTICS
TOOLS. IT OFFERS OVER 80 HIGH-LEVEL OPERATORS THAT MAKE IT
EASY TO BUILD PARALLEL APPS. IT IS ONE OF THE OPEN SOURCE DATA
ANALYTICS TOOLS USED AT A WIDE RANGE OF ORGANIZATIONS TO
PROCESS LARGE DATASETS.

FEATURES:

• IT HELPS TO RUN AN APPLICATION IN A HADOOP CLUSTER UP TO 100
TIMES FASTER IN MEMORY AND 10 TIMES FASTER ON DISK.
• IT OFFERS LIGHTNING-FAST PROCESSING.
• IT SUPPORTS SOPHISTICATED ANALYTICS.
• ABILITY TO INTEGRATE WITH HADOOP AND EXISTING HADOOP
DATA.
• IT PROVIDES BUILT-IN APIs IN JAVA, SCALA AND PYTHON.

8) PLOTLY:

PLOTLY IS ONE OF THE BIG DATA ANALYSIS TOOLS THAT LETS USERS
CREATE CHARTS AND DASHBOARDS TO SHARE ONLINE.

FEATURES:

• EASILY TURN ANY DATA INTO EYE-CATCHING AND INFORMATIVE
GRAPHICS.
• IT PROVIDES AUDITED INDUSTRIES WITH FINE-GRAINED
INFORMATION ON DATA PROVENANCE.
• IT OFFERS UNLIMITED PUBLIC FILE HOSTING THROUGH ITS FREE
COMMUNITY PLAN.

9) APACHE SAMOA:

IT IS A BIG DATA ANALYTICS TOOL. IT IS ONE OF THE BIG DATA
ANALYSIS TOOLS WHICH ENABLES DEVELOPMENT OF NEW ML
ALGORITHMS. IT PROVIDES A COLLECTION OF DISTRIBUTED
ALGORITHMS FOR COMMON DATA MINING AND ML TASKS.
10) LUMIFY:

IT IS A BIG DATA FUSION, ANALYSIS AND VISUALIZATION PLATFORM.
IT IS ONE OF THE BEST TOOLS THAT HELPS USERS TO DISCOVER
CONNECTIONS AND EXPLORE RELATIONSHIPS IN THEIR DATA VIA A
SUITE OF ANALYTIC OPTIONS.

FEATURES:

• IT PROVIDES BOTH 2D AND 3D GRAPH VISUALIZATIONS WITH A
VARIETY OF AUTOMATIC LAYOUTS.
• IT PROVIDES A VARIETY OF OPTIONS FOR ANALYZING THE LINKS
BETWEEN ENTITIES ON THE GRAPH.
• IT COMES WITH SPECIFIC INGEST PROCESSING AND INTERFACE
ELEMENTS FOR TEXTUAL CONTENT, IMAGES AND VIDEOS.
• ITS SPACES FEATURE ALLOWS YOU TO ORGANIZE WORK INTO A
SET OF PROJECTS OR WORKSPACES.
• IT IS BUILT ON PROVEN, SCALABLE BIG DATA TECHNOLOGIES.

11) ELASTICSEARCH:

IT IS A JSON-BASED BIG DATA SEARCH AND ANALYTICS ENGINE. IT
IS A DISTRIBUTED RESTFUL SEARCH AND ANALYTICS ENGINE FOR
SOLVING NUMBERS OF USE CASES. IT OFFERS HORIZONTAL
SCALABILITY, MAXIMUM RELIABILITY AND EASY MANAGEMENT.

FEATURES:
• IT ALLOWS COMBINING MANY TYPES OF SEARCHES, SUCH AS
STRUCTURED, UNSTRUCTURED, GEO, METRIC ETC.
• REAL-TIME SEARCH AND ANALYTICS FEATURES TO WORK WITH
BIG DATA BY USING ELASTICSEARCH-HADOOP.
• IT GIVES AN ENHANCED EXPERIENCE WITH SECURITY,
MONITORING, REPORTING AND ML FEATURES.

12) R PROGRAMMING:

IT IS USED FOR STATISTICAL COMPUTING AND GRAPHICS.
IT IS ALSO USED FOR BIG DATA ANALYSIS. IT PROVIDES A WIDE
VARIETY OF STATISTICAL TESTS.

13) IBM SPSS MODELER:

IT IS A PREDICTIVE BIG DATA ANALYTICS PLATFORM.
IT OFFERS PREDICTIVE MODELS AND DELIVERS THEM TO
INDIVIDUALS, GROUPS, SYSTEMS AND THE ENTERPRISE. IT IS ONE OF THE
BIG DATA ANALYTICS TOOLS WHICH HAS A RANGE OF ANALYSIS
TECHNIQUES.
OLAP
THE FULL FORM OF OLAP IS ONLINE ANALYTICAL PROCESSING.
IT IS A CATEGORY OF SOFTWARE THAT ALLOWS USERS TO
ANALYZE INFORMATION FROM MULTIPLE DATABASES AT THE
SAME TIME. IT IS A TECHNOLOGY THAT ENABLES ANALYSTS TO
EXTRACT AND VIEW BUSINESS DATA FROM DIFFERENT POINTS OF
VIEW.
ANALYSTS FREQUENTLY NEED TO GROUP, AGGREGATE AND
JOIN DATA. THESE OPERATIONS IN RDBMS ARE RESOURCE-
INTENSIVE.
WITH OLAP, DATA CAN BE PRECALCULATED AND AGGREGATED,
MAKING ANALYSIS FASTER.
IT IS A CLASSIFICATION OF SOFTWARE TECHNOLOGY WHICH
AUTHORIZES ANALYSTS, MANAGERS AND EXECUTIVES TO GAIN
INSIGHT INTO INFORMATION THROUGH FAST,
CONSISTENT, INTERACTIVE ACCESS TO A WIDE VARIETY OF
POSSIBLE VIEWS OF DATA THAT HAS BEEN TRANSFORMED
FROM RAW INFORMATION TO REFLECT THE REAL DIMENSIONALITY
OF THE ENTERPRISE AS UNDERSTOOD BY THE CLIENT.
OLAP IMPLEMENTS THE MULTIDIMENSIONAL ANALYSIS OF
BUSINESS INFORMATION AND SUPPORTS THE CAPABILITY FOR
COMPLEX ESTIMATIONS, TREND ANALYSIS AND SOPHISTICATED
DATA MODELING. IT IS RAPIDLY BECOMING THE ESSENTIAL
FOUNDATION FOR INTELLIGENT SOLUTIONS COVERING
BUSINESS PERFORMANCE
MANAGEMENT, PLANNING, BUDGETING, FORECASTING, FINANCIAL
DOCUMENTING, ANALYSIS, SIMULATION MODELS, KNOWLEDGE
DISCOVERY AND DATA WAREHOUSE REPORTING. OLAP
ENABLES END-CLIENTS TO PERFORM AD HOC ANALYSIS OF
RECORDS IN MULTIPLE DIMENSIONS, PROVIDING THE INSIGHT AND
UNDERSTANDING THEY REQUIRE FOR BETTER DECISION
MAKING.
WHO USES OLAP AND WHY?
OLAP APPLICATIONS ARE USED BY A VARIETY OF THE
FUNCTIONS OF AN ORGANISATION.
1) FINANCE AND ACCOUNTING:
BUDGETING, ACTIVITY-BASED COSTING, FINANCIAL
PERFORMANCE ANALYSIS, FINANCIAL MODELING.
2) SALES AND MARKETING:
SALES ANALYSIS AND FORECASTING, MARKET RESEARCH
ANALYSIS, PROMOTION ANALYSIS, CUSTOMER
ANALYSIS, MARKET AND CUSTOMER SEGMENTATION.
3) PRODUCTION:
PRODUCTION PLANNING, DEFECT ANALYSIS.
HOW OLAP WORKS?
FUNDAMENTALLY, OLAP HAS A VERY SIMPLE CONCEPT. IT
PRECALCULATES MOST OF THE QUERIES THAT ARE
TYPICALLY VERY HARD TO EXECUTE OVER TABULAR
DATABASES, NAMELY AGGREGATION, JOINING AND
GROUPING. THESE QUERIES ARE CALCULATED DURING A
PROCESS THAT IS USUALLY CALLED BUILDING OR
PROCESSING OF THE OLAP CUBE. THIS PROCESS HAPPENS
OVERNIGHT, AND BY THE TIME END USERS GET TO WORK,
THE DATA WILL HAVE BEEN UPDATED.

OLAP GUIDELINES:
THESE ARE ALSO KNOWN AS DR. E.F. CODD'S RULES.
DR. E.F. CODD, THE FATHER OF THE RELATIONAL
MODEL, FORMULATED A LIST OF 12 GUIDELINES AND
REQUIREMENTS AS THE BASIS FOR SELECTING AN OLAP
SYSTEM.
THE 12 GUIDELINES ARE:
1) MULTIDIMENSIONAL CONCEPTUAL VIEW
2) TRANSPARENCY
3) ACCESSIBILITY
4) CONSISTENT REPORTING PERFORMANCE
5) CLIENT/SERVER ARCHITECTURE
6) GENERIC DIMENSIONALITY
7) DYNAMIC SPARSE MATRIX HANDLING
8) MULTI-USER SUPPORT
9) UNRESTRICTED CROSS-DIMENSIONAL OPERATIONS
10) INTUITIVE DATA MANIPULATION
11) FLEXIBLE REPORTING
12) UNLIMITED DIMENSIONS AND AGGREGATION LEVELS

1) MULTIDIMENSIONAL CONCEPTUAL VIEW:

THIS IS THE CENTRAL FEATURE OF AN OLAP SYSTEM. WITH
A MULTIDIMENSIONAL VIEW IT IS POSSIBLE TO CARRY OUT
METHODS LIKE SLICE AND DICE.

2) TRANSPARENCY:

MAKE THE TECHNOLOGY, THE UNDERLYING INFORMATION
REPOSITORY, THE COMPUTING OPERATIONS AND THE DISSIMILAR
NATURE OF SOURCE DATA TOTALLY TRANSPARENT TO
USERS. SUCH TRANSPARENCY HELPS TO IMPROVE THE
EFFICIENCY AND THE PRODUCTIVITY OF THE USER.

3) ACCESSIBILITY:

IT PROVIDES ACCESS ONLY TO THE DATA THAT IS ACTUALLY
REQUIRED TO PERFORM THE PARTICULAR ANALYSIS, PRESENTING A
SINGLE, COHERENT AND CONSISTENT VIEW TO THE CLIENT. THE
OLAP SYSTEM MUST MAP ITS OWN LOGICAL SCHEMA TO THE
HETEROGENEOUS PHYSICAL DATA STORES AND PERFORM ANY
NECESSARY TRANSFORMATIONS. THE OLAP OPERATIONS SHOULD
BE SITTING BETWEEN DATA SOURCES (DATA WAREHOUSE) AND
AN OLAP FRONT END.

4) CONSISTENT REPORTING PERFORMANCE:

TO MAKE SURE THAT THE USERS DO NOT FEEL ANY SIGNIFICANT
DEGRADATION IN DOCUMENTING PERFORMANCE AS THE
NUMBER OF DIMENSIONS OR THE SIZE OF THE DATABASE
INCREASES. THAT IS, THE PERFORMANCE OF OLAP SHOULD NOT
SUFFER AS THE NUMBER OF DIMENSIONS IS INCREASED. USERS
MUST OBSERVE CONSISTENT RUN TIME, RESPONSE TIME, OR
MACHINE UTILIZATION EVERY TIME A GIVEN QUERY IS RUN.

5) CLIENT/SERVER ARCHITECTURE:

MAKE THE SERVER COMPONENT OF OLAP TOOLS SUFFICIENTLY
INTELLIGENT THAT THE VARIOUS CLIENTS CAN BE ATTACHED WITH A
MINIMUM OF EFFORT AND INTEGRATION PROGRAMMING. SO
THE SERVER SHOULD BE CAPABLE OF MAPPING AND
CONSOLIDATING DATA BETWEEN DISSIMILAR DATABASES.

6) GENERIC DIMENSIONALITY:

AN OLAP METHOD SHOULD TREAT EACH DIMENSION AS
EQUIVALENT IN BOTH ITS STRUCTURE AND OPERATIONAL
CAPABILITIES. ADDITIONAL OPERATIONAL CAPABILITIES MAY BE
ALLOWED TO SELECTED DIMENSIONS, BUT SUCH ADDITIONAL
TASKS SHOULD BE GRANTABLE TO ANY DIMENSION.

7) DYNAMIC SPARSE MATRIX HANDLING:

TO ADAPT THE PHYSICAL SCHEMA TO THE SPECIFIC ANALYTICAL
MODEL BEING CREATED AND LOADED THAT OPTIMIZES SPARSE
MATRIX HANDLING. WHEN ENCOUNTERING A SPARSE MATRIX,
THE SYSTEM MUST BE ABLE TO DYNAMICALLY ASSESS THE
DISTRIBUTION OF THE INFORMATION AND ADJUST THE STORAGE
AND ACCESS TO OBTAIN AND MAINTAIN A CONSISTENT LEVEL OF
PERFORMANCE.

8) MULTI-USER SUPPORT:

OLAP TOOLS MUST PROVIDE CONCURRENT DATA ACCESS, DATA
INTEGRITY AND ACCESS SECURITY.

9) UNRESTRICTED CROSS-DIMENSIONAL OPERATIONS:

IT PROVIDES METHODS TO IDENTIFY DIMENSIONAL ORDER
AND TO PERFORM THE NECESSARY ROLL-UP AND DRILL-DOWN
METHODS WITHIN A DIMENSION OR ACROSS DIMENSIONS.

10) INTUITIVE DATA MANIPULATION:

DATA MANIPULATION FUNDAMENTALS, SUCH AS CONSOLIDATION
DIRECTION CHANGES LIKE REORIENTATION, DRILL-DOWN AND ROLL-UP,
AND OTHER MANIPULATIONS, SHOULD BE ACCOMPLISHED
NATURALLY AND PRECISELY VIA POINT-AND-CLICK AND DRAG-
AND-DROP METHODS ON THE CELLS OF THE SCIENTIFIC MODEL. IT
AVOIDS THE USE OF A MENU OR MULTIPLE TRIPS TO A USER
INTERFACE.

11) FLEXIBLE REPORTING:

IT GIVES EFFICIENCY TO THE BUSINESS CLIENTS TO
ORGANIZE COLUMNS, ROWS AND CELLS IN A MANNER THAT
FACILITATES SIMPLE MANIPULATION, ANALYSIS AND SYNTHESIS
OF DATA.

12) UNLIMITED DIMENSIONS AND AGGREGATION LEVELS:

THE NUMBER OF DATA DIMENSIONS SHOULD BE
UNLIMITED. EACH OF THESE COMMON DIMENSIONS MUST
ALLOW A PRACTICALLY UNLIMITED NUMBER OF CUSTOMER-
DEFINED AGGREGATION LEVELS WITHIN ANY GIVEN
CONSOLIDATION PATH.
MONGODB DATABASE
MONGODB IS AN OPEN SOURCE, DOCUMENT-ORIENTED
DATABASE THAT STORES THE DATA IN THE FORM OF
DOCUMENTS. IT IS AN EXAMPLE OF A NO-SQL
DATABASE. MONGODB IS WRITTEN IN THE C++ LANGUAGE.
MONGODB WAS CREATED BY ELIOT AND DWIGHT IN 2007,
WHEN THEY FACED SCALABILITY ISSUES WHILE WORKING WITH
RELATIONAL DATABASES. MONGODB WAS DESIGNED TO WORK
WITH COMMODITY SERVERS. NOW IT IS USED BY COMPANIES
OF ALL SIZES AND ACROSS ALL INDUSTRIES.
IT WAS INITIALLY DEVELOPED AS A PLATFORM AS A
SERVICE (PAAS); LATER, IN 2009, IT WAS INTRODUCED IN THE MARKET AS AN
OPEN SOURCE DATABASE SERVER THAT WAS MAINTAINED AND
SUPPORTED BY MONGODB.
VERSION 1.4 WAS RELEASED IN MARCH 2010.
MONGODB 2.4.9 WAS THE LATEST STABLE VERSION, WHICH
WAS RELEASED ON JANUARY 10, 2014.

ADVANTAGES:
1) SCHEMALESS:
MONGODB IS A DOCUMENT DATABASE IN WHICH ONE
COLLECTION HOLDS DIFFERENT DOCUMENTS. THE NUMBER OF
FIELDS, CONTENT AND SIZE OF THE DOCUMENT CAN DIFFER
FROM ONE DOCUMENT TO ANOTHER.
2) THE STRUCTURE OF A SINGLE OBJECT IS CLEAR.
3) NO COMPLEX JOINS.
4) DEEP QUERY ABILITY. MONGODB SUPPORTS DYNAMIC
QUERIES ON DOCUMENTS USING A DOCUMENT-BASED QUERY
LANGUAGE THAT'S NEARLY AS POWERFUL AS SQL.
5) EASE OF SCALE-OUT: IT IS EASY TO SCALE.
6) CONVERSION/MAPPING OF APPLICATION OBJECTS TO
DATABASE OBJECTS IS NOT NEEDED.
7) USES INTERNAL MEMORY FOR STORING FILES AND THE WORKING
SET, ENABLING FASTER ACCESS TO DATA.

WHY USE MONGODB?

• DOCUMENT-ORIENTED STORAGE: DATA IS STORED IN
JSON-STYLE DOCUMENTS.
• INDEX ON ANY ATTRIBUTE.
• REPLICATION AND HIGH AVAILABILITY.
• AUTO-SCALING.
• FAST, IN-PLACE UPDATES.
• PROFESSIONAL SUPPORT BY MONGODB.

WHERE TO USE MONGODB?

• BIG DATA
• CONTENT MANAGEMENT AND DELIVERY
• MOBILE AND SOCIAL INFRASTRUCTURE
• USER DATA MANAGEMENT
• DATA HUB

FEATURES OF MONGODB:

1) INDEXING
2) REPLICATION
3) LOAD BALANCING
4) LARGE MEDIA STORAGE
5) HORIZONTAL SCALABILITY
6) HIGH PERFORMANCE
7) AGGREGATION
8) AD HOC QUERY SUPPORT

INDEXING:
AN INDEX IS ON A SINGLE FIELD WITHIN THE
DOCUMENT. INDEXES ARE USED TO QUICKLY LOCATE
DATA WITHOUT HAVING TO SEARCH EVERY
DOCUMENT IN A MONGODB DATABASE.
THIS IMPROVES THE PERFORMANCE OF OPERATIONS
PERFORMED ON THE MONGODB DATABASE.

HIGH AVAILABILITY:
AUTO-REPLICATION IMPROVES THE AVAILABILITY OF
THE MONGODB DATABASE.

LOAD BALANCING:
HORIZONTAL SCALING ALLOWS MONGODB TO
BALANCE THE LOAD.

MONGODB PROVIDES HIGH PERFORMANCE. MOST
MONGODB OPERATIONS ARE FASTER THAN IN AN RDBMS.
MONGODB PROVIDES AUTO-REPLICATION FEATURES
THAT ALLOW IT TO QUICKLY RECOVER DATA IN CASE OF A
FAILURE.

HORIZONTAL SCALING IS POSSIBLE IN MONGODB
BECAUSE OF SHARDING. SHARDING IS PARTITIONING
DATA AND PLACING IT ON MULTIPLE MACHINES IN SUCH
A WAY THAT THE ORDER OF THE DATA IS PRESERVED.
IT MEANS ADDING MORE MACHINES TO HANDLE THE
DATA.

MAPPING RDBMS TO MONGODB:

1. A COLLECTION IN MONGODB IS EQUIVALENT TO A
TABLE IN RDBMS.
2. A DOCUMENT IN MONGODB IS EQUIVALENT TO A
ROW IN RDBMS.
3. FIELDS IN MONGODB ARE EQUIVALENT TO
COLUMNS IN RDBMS.

FIELDS ARE STORED IN DOCUMENTS, DOCUMENTS
ARE STORED IN COLLECTIONS AND COLLECTIONS
ARE STORED IN THE DATABASE.
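
FOR EXAMPLE, A SINGLE ROW OF A RELATIONAL EMPLOYEE TABLE MAPS
NATURALLY TO ONE DOCUMENT IN A MONGODB COLLECTION. A MINIMAL
SKETCH, ASSUMING A HYPOTHETICAL COLLECTION NAMED employees
AND SAMPLE FIELD VALUES:

// the collection plays the role of the table, the document the
// role of the row, and the fields the role of the columns
db.employees.insertOne({Emp_No: "101", Name: "Asha", Age: 30})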
HADOOP DATA TYPES

HADOOP DATA TYPES ARE CATEGORIZED INTO FIVE TYPES:

1) PRIMITIVE DATA TYPES:
A) NUMERIC DATA TYPES:
IT SUPPORTS BOTH INTEGRAL AND FLOATING DATA TYPES.
INTEGRAL TYPES:
TINYINT (1 BYTE), SMALLINT (2 BYTES), INT (4 BYTES), BIGINT (8 BYTES)
FLOAT TYPES:
FLOAT (4 BYTES), DOUBLE (8 BYTES), DECIMAL (17 BYTES)
B) DATE/TIME DATA TYPES:
IT SUPPORTS TIMESTAMP, DATE AND INTERVAL DATA TYPES.
TIMESTAMP TYPE:
IT USES NANOSECOND PRECISION AND IS DENOTED BY THE yyyy-mm-dd
hh:mm:ss FORMAT.

DATE:
THESE ARE REPRESENTED AS YYYY-MM-DD

INTERVAL
C) STRING DATA TYPES:

i) STRING: UNBOUNDED VARIABLE-LENGTH CHARACTER STRING. EITHER SINGLE OR
DOUBLE QUOTES CAN BE USED TO ENCLOSE CHARACTERS.

ii) VARCHAR: VARIABLE-LENGTH CHARACTER STRING. MAXIMUM LENGTH IS
SPECIFIED IN BRACES AND ALLOWED UP TO 65535 BYTES.

iii) CHAR: FIXED-LENGTH CHARACTER STRING, E.G. 'M' OR "M"

D) MISCELLANEOUS DATA TYPES:
BOOLEAN: IT STORES TRUE OR FALSE VALUES.
BINARY: IT IS AN ARRAY OF BYTES.

2) COMPLEX DATA TYPES:

A) ARRAY: AN ORDERED COLLECTION OF SIMILAR TYPES OF FIELDS THAT ARE
INDEXABLE USING ZERO-BASED INTEGERS.
SYNTAX: ARRAY<data_type>
B) MAP: UNORDERED COLLECTION OF KEY-VALUE PAIRS. KEYS ARE
PRIMITIVES; VALUES MAY BE ANY TYPE OF DATA.
SYNTAX: MAP<primitive_type, data_type>
C) STRUCT: COLLECTION OF NAMED FIELDS. THE FIELDS MAY BE OF VARIOUS
TYPES.
SYNTAX: STRUCT<col_name : data_type, ...>
D) UNION: A UNION TYPE CAN HOLD ANY DATA TYPE THAT MAY BE ONE OF THE
SPECIFIED DATA TYPES.
SYNTAX: UNIONTYPE<data_type, data_type, ...>

3) COLUMN DATA TYPES:

i) INTEGRAL TYPES: INT/INTEGER, SMALLINT, BIGINT, TINYINT
ii) STRING TYPES: VARCHAR, CHAR
iii) TIMESTAMP TYPE: "YYYY-MM-DD HH:MM:SS.fffffffff"
iv) DATE
v) DECIMALS
vi) UNION TYPE

4) LITERALS DATA TYPES:
FLOATING POINT TYPES, DECIMAL TYPES

5) NULL TYPE
OLAP (INFORMATION)

THE FULL FORM OF OLAP IS ONLINE ANALYTICAL PROCESSING. THE OLAP
TERM WAS INTRODUCED BY E.F. CODD AND IT IS A CATEGORY OF
SOFTWARE TECHNOLOGY TO ANALYZE THE COMPLEX DATA DERIVED
FROM THE DATA WAREHOUSE.

FEATURES OF OLAP:

1. FAST: SPECIALIZED DATA STORAGE FOR FAST ACCESS TO
DATA.
2. ANALYSIS:
LETS THE USER ENTER QUERIES IN AN INTERACTIVE TOOL.
3. SHARED:
MULTIPLE USERS CAN ACCESS THE DATA.
4. MULTIDIMENSIONALITY:
ATTRIBUTES: MEASURES AND DIMENSIONS, FOR EASY DECISION
MAKING.
5. INFORMATION:
REPRESENTATION: GRAPHS, RULES ETC.

OLTP (OPERATION):
THE FULL FORM OF OLTP IS ONLINE TRANSACTION PROCESSING. IT IS
CHARACTERIZED BY A LARGE NUMBER OF SHORT ONLINE
TRANSACTIONS (INSERT, UPDATE, DELETE).
DIFFERENCE BETWEEN OLAP AND OLTP:

OLAP                                   | OLTP
1. SPEED DEPENDS ON THE AMOUNT         | 1. VERY FAST PROCESSING
   OF DATA INVOLVED.                   |    SPEED.
2. IT HELPS IN DECISION                | 2. ITS AIM IS TO CONTROL AND
   SUPPORT, PROBLEM SOLVING            |    RUN BASIC BUSINESS TASKS.
   AND PLANNING.                       |
3. DENORMALIZED DATABASE.              | 3. HIGHLY NORMALIZED
                                       |    DATABASE.
4. IT DEALS WITH CONSOLIDATED          | 4. IT DEALS WITH OPERATIONAL
   DATA (HETEROGENEOUS DATA            |    DATA, IN WHICH OLTP
   SOURCES).                           |    DATABASES ARE THE ONLY
                                       |    SOURCE OF THE DATA.
5. COMPLEX QUERIES, WITH               | 5. SIMPLE AND STANDARD
   AGGREGATION ALSO.                   |    QUERIES.

OLAP OPERATIONS:

1. PIVOTING.

2. SLICE AND DICE.

3. ROLL-UP AND DRILL-DOWN.

PIVOTING:

IT IS THE TECHNIQUE OF CHANGING FROM ONE DIMENSIONAL
ORIENTATION TO ANOTHER. IT IS ALSO CALLED ROTATION.
SLICE AND DICE:

IN SLICE, CROSS TABULATION IS DONE FOR A SPECIFIC VALUE, OTHER
THAN ALL, FOR THE FIXED THIRD DIMENSION.

IN DICE, TWO OR MORE DIMENSIONS ARE FIXED.

ROLL-UP AND DRILL-DOWN:

ROLL-UP IS THE OPERATION THAT CONVERTS DATA WITH FINER
GRANULARITY TO COARSER GRANULARITY WITH THE HELP OF
AGGREGATION. FOR EXAMPLE, DAILY SALES FIGURES CAN BE ROLLED UP
INTO MONTHLY TOTALS.

DRILL-DOWN IS THE CONVERSION OF COARSER TO FINER GRANULARITY.

TYPES OF OLAP SYSTEMS:

1) ROLAP: RELATIONAL ONLINE ANALYTICAL PROCESSING. IT IS AN
EXTENDED RDBMS ALONG WITH MULTIDIMENSIONAL DATA
MAPPING TO PERFORM THE STANDARD RELATIONAL OPERATIONS.

2) MOLAP: MULTIDIMENSIONAL ONLINE ANALYTICAL
PROCESSING. IT IMPLEMENTS OPERATIONS ON MULTIDIMENSIONAL
DATA.
3) HOLAP:
HYBRID ONLINE ANALYTICAL PROCESSING. IN THE HOLAP APPROACH,
THE AGGREGATED TOTALS ARE STORED IN A MULTIDIMENSIONAL
DATABASE WHILE THE DETAILED DATA IS STORED IN THE
RELATIONAL DATABASE. THIS OFFERS BOTH THE DATA EFFICIENCY OF
THE ROLAP MODEL AND THE PERFORMANCE OF THE MOLAP
MODEL.

4) OTHERS:
A) WOLAP: WEB ONLINE ANALYTICAL PROCESSING. WEB OLAP
IS AN OLAP SYSTEM ACCESSIBLE VIA A WEB
BROWSER. WOLAP HAS A THREE-TIERED ARCHITECTURE. IT
CONSISTS OF THREE COMPONENTS: CLIENT, MIDDLEWARE AND A
DATABASE SERVER.

B) DOLAP: DESKTOP ONLINE ANALYTICAL PROCESSING. IN
DESKTOP OLAP, A USER DOWNLOADS A PART OF THE DATA
FROM THE DATABASE LOCALLY, OR ON THEIR DESKTOP, AND
THEN ANALYZES IT. DOLAP IS RELATIVELY CHEAPER TO DEPLOY,
AS IT OFFERS VERY FEW FUNCTIONALITIES COMPARED TO
OTHER OLAP SYSTEMS.

C) MOLAP: MOBILE ONLINE ANALYTICAL PROCESSING. MOBILE
OLAP HELPS USERS TO ACCESS AND ANALYZE OLAP DATA
USING THEIR MOBILE DEVICES.
D) SOLAP: SPATIAL ONLINE ANALYTICAL PROCESSING. SOLAP IS
CREATED TO FACILITATE MANAGEMENT OF BOTH SPATIAL AND
NON-SPATIAL DATA IN A GEOGRAPHIC INFORMATION
SYSTEM (GIS).
REPLICATION CONCEPT
Replication is the process of synchronizing data across multiple
servers. Replication provides redundancy and increases data
availability with multiple copies of data on different database
servers.
Replication protects a database from the loss of a single server,
and it also allows us to recover from hardware failures and
service interruptions. With additional copies of the data, you can
dedicate one to disaster recovery, reporting or backup.
Why replication?
1) To keep our data safe.
2) High (24*7) availability of data.
3) Disaster recovery.
4) No downtime for maintenance (like backups, compaction,
index rebuilds etc.)
5) Read scaling (extra copies to read from).
6) A replica set is transparent to the application.

How replication works in MongoDB:--

MongoDB achieves replication by the use of replica sets.

A replica set is a group of mongod instances that host the
same data set.
In a replica set, one node is the primary node that receives all write
operations. All other instances, the secondaries, apply
operations from the primary so that they have the same data set.
A replica set can have only one primary node.

• A replica set is a group of two or more nodes (generally
a minimum of 3 nodes is required).
• In a replica set, one node is the primary node and the
remaining nodes are secondary.
• All data replicates from the primary to the secondary nodes.
• At the time of automatic failover or maintenance, an election
is established for primary and a new primary node is
elected.
• After the recovery of the failed node, it again joins the
replica set and works as a secondary node.

              CLIENT APPLICATION (DRIVER)
                  WRITE   |   READ
                          v
                       PRIMARY
                REPLICATION   REPLICATION
                  /                  \
           SECONDARY              SECONDARY

A TYPICAL DIAGRAM OF MONGODB REPLICATION, IN WHICH THE CLIENT APPLICATION ALWAYS INTERACTS WITH THE
PRIMARY NODE, AND THE PRIMARY NODE THEN REPLICATES THE DATA TO THE SECONDARY NODES.
FEATURES OF A REPLICA SET:

• A CLUSTER OF N NODES.
• ANY ONE NODE CAN BE PRIMARY.
• ALL WRITE OPERATIONS ARE DONE ON THE PRIMARY NODE.
• AUTOMATIC FAILOVER.
• AUTOMATIC RECOVERY.
• CONSENSUS ELECTION OF THE PRIMARY.

SET UP A REPLICA SET:--

TO CONVERT TO A REPLICA SET, THE FOLLOWING STEPS ARE REQUIRED:

1) SHUT DOWN THE ALREADY RUNNING MONGODB SERVER.

2) START THE MONGODB SERVER BY SPECIFYING THE --replSet OPTION.

General syntax:
mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet "REPLICA_SET_INSTANCE_NAME"

e.g:
mongod --port 27017 --dbpath "D:\setup\mongodb\data" --replSet rs0

• it will start a mongod instance with the replica set name rs0 on the port
27017
• now start the command prompt and connect to this mongod
instance
• in the mongo client, issue the command rs.initiate() to initiate a
new replica set
• to check the replica set configuration, issue the command
rs.conf(). to check the status of the replica set, issue the
command rs.status().

Add members to a replica set:

To add members to a replica set, start mongod instances on
multiple machines. Now start a mongo client and issue the
command rs.add().

Syntax:
rs.add("HOST_NAME:PORT")

e.g:
rs.add("mongod1.net:27017")
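
A minimal sketch of initiating a three-node replica set from the
mongo client, assuming three mongod instances are already running
with --replSet "rs0" on the hypothetical hosts mongod1.net,
mongod2.net and mongod3.net:

// run on the node that should become the first primary
rs.initiate({
   _id: "rs0",
   members: [
      {_id: 0, host: "mongod1.net:27017"},
      {_id: 1, host: "mongod2.net:27017"},
      {_id: 2, host: "mongod3.net:27017"}
   ]
})
rs.status()   // verify one PRIMARY and two SECONDARY members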
MONGODB ADMINISTRATION
There are several operational strategies for the
monitoring purpose:
1) backup strategies:
MongoDB works on a large set of important data. The
database administrator should back up those data to avoid data
loss.
2) monitoring:
The database administrator should monitor different database
and data store related information to improve the
performance and reduce the faults.
3) runtime configuration:
For a huge number of database settings, the administrator
should provide configuration details.
4) import and export:
The administrator can import or export JSON data to/from
different sources in the correct order.
5) production notes:
MongoDB works on large sets of data. The data are replicated
and it uses different shards. The administrator should care
about the production architecture and notes to control the
complete system.
BACKUP AND RESTORE IN MONGODB

BACKUP IN MONGODB:
TO CREATE A BACKUP OF A DATABASE IN MONGODB, USE THE
mongodump COMMAND.
THIS COMMAND WILL DUMP THE ENTIRE DATA OF OUR
SERVER INTO THE DUMP DIRECTORY.

SYNTAX:
mongodump

e.g:
1) mongodump --host HOST_NAME --port PORT_NUMBER
THIS COMMAND WILL BACK UP ALL DATABASES OF THE SPECIFIED
mongod INSTANCE.

E.G:
mongodump --host xyz --port 27017

2) mongodump --dbpath DB_PATH --out BACKUP_DIRECTORY

THIS COMMAND WILL BACK UP ONLY THE SPECIFIED DATABASE AT THE
SPECIFIED PATH.

3) mongodump --collection COLLECTION --db DB_NAME

THIS COMMAND WILL BACK UP ONLY THE SPECIFIED COLLECTION OF
THE SPECIFIED DATABASE.
RESTORE:
TO RESTORE BACKUP DATA, MONGODB'S mongorestore COMMAND
is used.
This command restores all of the data from the backup directory.

Syntax:
mongorestore

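For example, assuming a backup was previously taken with
mongodump into the default dump directory, it can be restored to a
local server as follows:

mongorestore --host localhost --port 27017 dump/
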
OPTIMIZATION TECHNIQUES:

IN THE OPTIMIZATION OF A DESIGN, THE DESIGN OBJECTIVE COULD
BE TO MINIMIZE THE COST OF PRODUCTION OR TO MAXIMIZE
THE EFFICIENCY OF PRODUCTION.

AN OPTIMIZATION ALGORITHM IS A PROCEDURE WHICH IS
EXECUTED ITERATIVELY, COMPARING VARIOUS SOLUTIONS
TILL AN OPTIMUM OR A SATISFACTORY SOLUTION IS FOUND.

5 WAYS TO OPTIMIZE BIG DATA:

1) REMOVE LATENCY IN PROCESSING:

LATENCY IN PROCESSING OCCURS IN TRADITIONAL STORAGE
MODELS THAT MOVE SLOWLY WHEN RETRIEVING
DATA. ORGANIZATIONS CAN DECREASE PROCESSING TIME BY
MOVING AWAY FROM THOSE SLOW HARD DISKS AND
RELATIONAL DATABASES TO IN-MEMORY COMPUTING
SOFTWARE.
APACHE SPARK IS ONE POPULAR EXAMPLE OF AN IN-
MEMORY STORAGE MODEL.
2) EXPLOIT DATA IN REAL TIME:

THE GOAL OF REAL-TIME DATA IS TO DECREASE THE TIME
BETWEEN AN EVENT AND THE ACTIONABLE INSIGHT THAT
COULD COME FROM IT.
IN ORDER TO MAKE INFORMED DECISIONS, ORGANIZATIONS
SHOULD STRIVE TO MAKE THE TIME BETWEEN INSIGHT AND
BENEFIT AS SHORT AS POSSIBLE.
APACHE SPARK STREAMING HELPS ORGANIZATIONS PERFORM
REAL-TIME DATA ANALYSIS.
3) ANALYZE DATA PRIOR TO ACTING:

IT'S BETTER TO ANALYZE DATA BEFORE ACTING ON IT, AND THIS
CAN BE DONE THROUGH A COMBINATION OF BATCH AND
REAL-TIME DATA PROCESSING.
WHILE HISTORICAL DATA HAS BEEN USED TO ANALYZE
TRENDS FOR YEARS, THE AVAILABILITY OF CURRENT DATA,
BOTH IN BATCH FORM AND STREAMING, NOW ENABLES
ORGANIZATIONS TO SPOT CHANGES IN THOSE TRENDS AS
THEY OCCUR. A FULL RANGE OF UP-TO-DATE DATA GIVES
COMPANIES A BROADER AND MORE ACCURATE PERSPECTIVE.
4) TURN DATA INTO DECISIONS:

MANAGING THE VAST AMOUNT OF BIG DATA THAT EACH ORGANIZATION
HAS COULD BE IMPOSSIBLE WITHOUT BIG DATA
SOFTWARE AND SERVICE PLATFORMS.
MACHINE LEARNING TURNS THE MASSIVE AMOUNT OF DATA
INTO TRENDS, WHICH CAN BE ANALYZED AND USED FOR
HIGH-QUALITY DECISION MAKING.
ORGANIZATIONS SHOULD USE THIS TECHNOLOGY TO ITS
FULLEST IN ORDER TO FULLY OPTIMIZE THEIR BIG DATA.

5) BY USING THE LATEST TECHNOLOGY:

BIG DATA USES THE LATEST TECHNOLOGY.

BIG DATA OPTIMIZATION:

WE NEED A SYSTEM THAT WORKS ON THE FOLLOWING
PRINCIPLES:
1) SCALABILITY:

THE SYSTEM HAS TO BE EXPANDED WITH THE INCREASING
DATA. THE EXPANSION OF THE SYSTEM SHOULD NOT
IMPACT THE EXISTING SYSTEM. SO THE SYSTEM SHOULD BE
EASILY SCALABLE.
2) FAULT TOLERANCE:

A HADOOP CLUSTER CAN HAVE MULTIPLE MACHINES IN A
CLUSTER, EVEN THOUSANDS FOR HUGE BUSINESSES LIKE
YAHOO. THERE IS A GOOD CHANCE THAT SOME OF THEM FAIL AT
ONE TIME OR ANOTHER.

SUCH POSSIBILITIES NEED TO BE CONSIDERED. THE SYSTEM
SHOULD BE CAPABLE OF COPING WITH SUCH SITUATIONS
WITHOUT ANY SIGNIFICANT EFFECTS.

3) DATA DISTRIBUTION:

THE DATA DISTRIBUTION SHOULD BE DONE IN SUCH A WAY
THAT THE SAME MACHINE SHOULD PROCESS THE DATA
WHERE IT IS STORED.
IF DATA STORAGE AND PROCESSING HAPPEN ON
DIFFERENT MACHINES, IT WILL NEED EXTRA COST AND TIME
FOR DATA TRANSMISSION.
HERE HADOOP CAN SERVE AS A BUILDING BLOCK OF OUR
ANALYTICS PLATFORM, AS IT IS BY FAR ONE OF THE BEST
WAYS TO HANDLE FAST-GROWING DATA
PROCESSING, STORAGE AND ANALYSIS.
PROCESSING,STORAGE AND ANALYSIS.
CHALLENGES IN BIG DATA OPTIMIZATION:
1) PREPROCESSING:

PREPROCESSING THE DATA IS A VERY IMPORTANT, TIME-
CONSUMING AND COMPLICATED TASK, WHERE THE
NOISE IS FILTERED OUT FROM THE HUGE VOLUME OF
UNSTRUCTURED AND STRUCTURED DATA CONTINUOUSLY,
AND THE DATA IS COMPRESSED BY UNDERSTANDING
AND CAPTURING THE CONTEXT IN WHICH THE DATA HAS
BEEN GENERATED.

2) INFORMATION EXTRACTION:

EXTRACTING MEANINGFUL INFORMATION FROM A HUGE
AMOUNT OF DATA OF POOR QUALITY IS ONE OF THE
MAJOR CHALLENGES BEING FACED IN BIG DATA. SO DATA
CLEANING AND DATA QUALITY VERIFICATION ARE
CRITICAL FOR ITS ACCURACY.

3) DATA INTEGRATION, AGGREGATION AND
REPRESENTATION:

DATA COLLECTED IS NOT HOMOGENEOUS. IT MAY HAVE
DIFFERENT METADATA. THUS DATA INTEGRATION
REQUIRES HUGE HUMAN EFFORT.

IT IS DIFFICULT TO COME UP WITH AGGREGATION LOGIC
FOR HUGE-SCALE BIG DATA MANUALLY; HENCE THE
REQUIREMENT FOR NEWER AND BETTER APPROACHES
ARISES.

ALSO, DIFFERENT DATA AGGREGATION AND
REPRESENTATION STRATEGIES MAY BE NEEDED FOR
DIFFERENT DATA ANALYSIS TASKS.

4) QUERY PROCESSING AND ANALYSIS:

METHODS SUITABLE FOR BIG DATA NEED TO BE
DISCOVERED AND EVALUATED FOR EFFICIENCY SO THAT
THEY ARE ABLE TO DEAL WITH
NOISY, DYNAMIC, HETEROGENEOUS, UNSTRUCTURED AND
UNTRUSTWORTHY DATA.
SUBJECT NAME : BIGDATA

Date:5/10/2020 Time: 9.00am to 12.30pm

Advantages of No-SQL:
1. It can be used as a primary or analytic data source.
2. Big-Data capability.
3. No single point of failure.
4. Easy replication.
5. It provides fast performance and horizontal scalability.
6. Can handle structured, semi-structured and unstructured
data with equal effect.
7. Object-oriented programming which is easy to use and flexible.
8. No-SQL databases do not need a dedicated high-
performance server.
9. Simpler to implement than an RDBMS.
10. It can serve as the primary data source for online
applications.
11. Handles Big Data, which manages the velocity, variety,
volume and complexity of the data.
12. Eliminates the need for a specific caching layer to
store data.
13. Offers a flexible schema design which can easily be
altered without downtime or service disruption.

Disadvantages of No-SQL:

1. No standardization rules.
2. Limited query capabilities.
3. RDBMS databases and tools are comparatively
mature.
4. It does not offer any traditional database capabilities,
like consistency when multiple transactions are
performed simultaneously.
5. When the volume of data increases, it is difficult
to maintain unique values as keys become difficult.
6. Does not work as well with relational data.
7. Being open-source options, they are not so popular for
enterprises.
8. The learning curve is steep for new developers.
Difference between RDBMS and No-SQL:
1. Storage:
RDBMS applications store data in a table-structured
manner.
No-SQL is a non-relational database system; data is
stored in an unstructured manner.
2. No. of users:
Both RDBMS and No-SQL support multiple users.
3. Database structure:
RDBMS uses tabular structures to store data. In a table,
the header row holds the column names and the rows contain
the corresponding values.

4. ACID:
RDBMS databases are harder to construct and obey the ACID
properties, which help to maintain the consistency of the
database.
No-SQL does not support ACID for storing the data.
5. Normalization:
RDBMS supports normalization and joining of
tables.
No-SQL does not support normalization.
6. Open-source:
Some RDBMS products are open source (e.g. MySQL), while
others are commercial (e.g. Oracle).
Most No-SQL databases are open-source programs.

7. Integrity constraints:
RDBMS supports integrity constraints at the
schema level.
No-SQL databases generally do not enforce integrity
constraints at the schema level.

8. Development year:
RDBMS was developed in the 1970s to deal with the
issues of flat-file storage.
No-SQL was developed in the late 2000s to overcome the
issues and limitations of SQL databases.

9. Distributed database:
Both RDBMS and No-SQL support distributed
databases.

10. Client-server:
RDBMS supports the client-server architecture.
No-SQL storage systems support multiple servers; they
also support the client-server architecture.

11. Ideally suited for:
RDBMS deals well with large quantities of structured data.
No-SQL databases are mainly designed for Big Data and real-
time web application data.

12. Data relationship:
In RDBMS, data items are related to each other with the
help of foreign keys.
In No-SQL, related data can be stored together in a single
document (see the sketch below).
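A minimal sketch of this difference in the MongoDB shell; the
collection name orders and its fields are invented for illustration.
Where an RDBMS would split customer and order into two tables joined
by a foreign key, a document store can embed the related data:

// One document holds the order together with its related
// customer data, so no join or foreign key is needed to read it.
db.orders.insert({
    "order_no": 501,
    "items": ["pen", "notebook"],
    "customer": { "name": "Akash", "city": "Kolkata" }
})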

13. Hardware and software:
RDBMS needs specialized database software (Oracle, MySQL
etc.) and often high-end hardware.
In No-SQL, commodity hardware is used.

14. Data fetching:
In RDBMS, data fetching is rapid because of its
relational approach to the database.
In No-SQL, data fetching is easy and flexible.

15. Example:
RDBMS: MySQL, SQL Server, Oracle etc.
No-SQL: Apache HBase, MongoDB etc.

JSON:
JSON stands for JavaScript Object Notation.
It is a text-based, human-readable data interchange
format used for representing simple data structures and
objects in web-browser-based code. It is also sometimes
used in desktop and server-side programming environments.
It is a format to store and interchange data.
Advantages:
1. Faster.
2. Structured data.
3. Readable: it is human readable and writable. It is a
lightweight, text-based data interchange format.
4. Language independent.

Example:
{
    "_id": 100,                               <-- JSON document id
    "name": "Akash",                          <-- key : value pair
    "Subject": ["math", "computer science"]
}
JSON document id:
The _id is a 12-byte hexadecimal number which assures
the uniqueness of every document. If you do not provide
one, then MongoDB generates a unique id for every
document.
12 bytes = 4 (current timestamp) + 3 (machine id)
+ 2 (process id of the MongoDB server) + 3 (simple
incremental value)
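A small illustration of this in the MongoDB shell: ObjectId() builds
such a 12-byte id, and the 4 timestamp bytes can be read back. The
variable name oid is arbitrary:

var oid = ObjectId()   // the shell generates a fresh 12-byte id
oid.getTimestamp()     // returns an ISODate built from the 4 timestamp bytes
oid.str                // the 24-character hex string behind the id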

BSON:
BSON is Binary JavaScript Object Notation.
MongoDB uses BSON to encode JSON-like documents for
storage. It is also language independent, and easy for
machines to parse and generate.
Difference between JSON and BSON:

1. Type: JSON is a standard (text) file format; BSON is a
binary file format.
2. Speed: JSON is comparatively less fast; BSON is faster.
3. Space: JSON consumes comparatively less space; BSON
consumes more space.
4. Usage: JSON is used for transmission of data; BSON is
used for storage of data.
5. Encoding and decoding techniques: JSON uses no such
techniques; BSON uses faster encoding and decoding
techniques.
6. Characteristics: JSON is key-value pairs only, used for
transmission of data; BSON is lightweight, fast and
traversable.
7. Structure: JSON is a language-independent format used
for asynchronous server-browser communication; BSON
consists of ordered elements, each containing a field name
(string type), a type and a value.
8. Traversal: JSON does not skip; it skims through all the
content. BSON indexes on the relevant content and skips
the content that is not needed.
9. Parsing: JSON formats need not be parsed, as they are
already in a human-readable format; BSON needs to be
parsed, but it is easy for machines to parse and generate.
10. Creation type: JSON broadly consists of objects and
arrays, where an object is a collection of key-value pairs
and an array is an ordered list of values; the BSON binary
encoding consists of additional information (such as
lengths and types).
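As a small check of points 1 and 3, the legacy mongo shell can report
the BSON size of a document with Object.bsonsize(); the sample
document below is arbitrary:

// BSON stores extra type and length information, so the same
// document usually occupies more bytes than its JSON text.
Object.bsonsize({ "name": "Akash", "Subject": ["math", "computer science"] })
// returns the document's size in bytes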
HADOOP:
DEFINITION:
Hadoop is an open-source framework. Hadoop can easily
handle a large amount of data on a low-cost, simple
hardware cluster. It is also a scalable and fault-tolerant
framework. It is not only a storage system: data can also
be processed using this framework. It is written in Java.
Hadoop code has been contributed by Yahoo, IBM and others.
It provides parallel processing across different commodity
hardware machines simultaneously.
As it works on commodity hardware (low-end, very cheap
hardware), the cost is very low, so the Hadoop solution is
also economical.

WHY SHOULD WE USE HADOOP?

- The Hadoop solution is very popular: it has captured at
least 90% of the Big Data market.
- It has some unique features that make this solution very
popular.
- Hadoop is scalable, so we can increase the number of
commodity hardware machines easily.
- It is a fault-tolerant solution: when one node goes down,
other nodes can process the data.
- In Hadoop, data can be stored in structured, unstructured
and semi-structured formats, so it is more flexible.

Brief history of Hadoop:

Hadoop started in the year 2002 with the Apache Nutch
project. Hadoop was created by Doug Cutting, the creator
of Apache Lucene, the widely used text search library.
There are mainly two problems with Big Data:
1) Storing a huge amount of data.
2) Processing that stored data.

A traditional approach like an RDBMS is not sufficient
due to the heterogeneity of the data. So Hadoop comes as
the solution to the Big Data problem, i.e. storing and
processing Big Data with some extra capabilities.

2002 - Doug Cutting and Mike Cafarella started to work on
the Apache Nutch project.

2003 - Google released a paper on GFS (Google File
System), describing how to store large datasets.

2004 - Google released another paper, on the MapReduce
technique, describing the processing of large datasets.

2005 - Doug Cutting started to use GFS and MapReduce
concepts in Nutch.

2006 - Doug Cutting found some problems/limitations in
Nutch and joined Yahoo, along with the Nutch project.

2007 - Doug Cutting split the distributed computing parts
out of Nutch and created Hadoop.

2008 (January) - Yahoo successfully tested Hadoop on a
1000-node cluster.

2008 (July) - Yahoo released Hadoop as an open-source
project to the Apache Software Foundation.

2009 - Hadoop was successfully tested to sort a petabyte
of data in less than 17 hours.

2011 - Apache Software Foundation released Apache Hadoop
version 1.0.

2017 - Apache Hadoop version 3.0 was released.

HADOOP ECOSYSTEM
The Hadoop ecosystem includes components like:
1) Hadoop Distributed File System (HDFS)
2) MapReduce
3) YARN
4) Hive
5) Apache Pig
6) Apache HBase
7) HCatalog
8) Avro
9) Thrift
10) Drill
11) Apache Mahout
12) Sqoop
13) Apache Flume
14) Ambari
15) ZooKeeper
16) Apache Oozie
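
What follows is a practical MongoDB shell session from these notes:
starting the mongo client, inserting documents into a student
collection, and creating, listing and dropping indexes.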
Microsoft Windows [Version 10.0.18362.30]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Users\Sourabh>mongo
MongoDB shell version v4.4.2
connecting to:
mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("999414a7-0f59-4ea3-a9a6-
2c671cacf27a") }
MongoDB server version: 4.4.2
---
The server generated these startup warnings when booting:
2020-12-18T14:21:23.175+05:30: Access control is not enabled for
the database. Read and write access to data and configuration is
unrestricted
---
---
Enable MongoDB's free cloud-based monitoring service, which will
then receive and display
metrics about your deployment (disk utilization, CPU, operation
statistics, etc).

The monitoring data will be available on a MongoDB website with a


unique URL accessible to you
and anyone you share the URL with. MongoDB may use this
information to make product
improvements and to suggest MongoDB products and deployment
options to you.

To enable free monitoring, run the following command:


db.enableFreeMonitoring()
To permanently disable this reminder, run the following command:
db.disableFreeMonitoring()
---
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB
mydatabase 0.000GB
test 0.000GB
> use mytable
switched to db mytable
>
db.student.insertMany([{"name":"xyz","roll_no":01,"dept":"CST","year":1},
{"name":"abc","roll_no":02,"dept":"CST","year":1},{"name":"pqe","roll_no"
:03,"dept":"CST","year":1},{"name":efg","roll_no":01,"dept":"ETCE","year"
:1}])
uncaught exception: SyntaxError: missing } after property list :
@(shell):1:184
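(The insertMany above fails with a SyntaxError because the opening
quote is missing before efg in "name":efg". The corrected command
follows.)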
>
db.student.insertMany([{"name":"xyz","roll_no":01,"dept":"CST","year":1},
{"name":"abc","roll_no":02,"dept":"CST","year":1},{"name":"pqe","roll_no"
:03,"dept":"CST","year":1},{"name":"efg","roll_no":01,"dept":"ETCE","year
":1}])
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("5fdc76b0a810ef937e25439c"),
ObjectId("5fdc76b0a810ef937e25439d"),
ObjectId("5fdc76b0a810ef937e25439e"),
ObjectId("5fdc76b0a810ef937e25439f")
]
}
> db.student.find().forEach(printjson);
{
"_id" : ObjectId("5fdc76b0a810ef937e25439c"),
"name" : "xyz",
"roll_no" : 1,
"dept" : "CST",
"year" : 1
}
{
"_id" : ObjectId("5fdc76b0a810ef937e25439d"),
"name" : "abc",
"roll_no" : 2,
"dept" : "CST",
"year" : 1
}
{
"_id" : ObjectId("5fdc76b0a810ef937e25439e"),
"name" : "pqe",
"roll_no" : 3,
"dept" : "CST",
"year" : 1
}
{
"_id" : ObjectId("5fdc76b0a810ef937e25439f"),
"name" : "efg",
"roll_no" : 1,
"dept" : "ETCE",
"year" : 1
}
> db.student.createIndex({ "name":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
> db.student.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"name" : 1
},
"name" : "name_1"
}
]
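(In createIndex(), the value 1 builds the index in ascending order
and -1 in descending order. getIndexes() always lists the default _id
index, which MongoDB creates automatically on every collection,
alongside the user-defined indexes.)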
> db.student.createIndex({"roll_no":-1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 2,
"numIndexesAfter" : 3,
"ok" : 1
}
> db.student.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"name" : 1
},
"name" : "name_1"
},
{
"v" : 2,
"key" : {
"roll_no" : -1
},
"name" : "roll_no_-1"
}
]
>
> db.student.createIndex({"dept":1,"year":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 3,
"numIndexesAfter" : 4,
"ok" : 1
}
> db.student.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"name" : 1
},
"name" : "name_1"
},
{
"v" : 2,
"key" : {
"roll_no" : -1
},
"name" : "roll_no_-1"
},
{
"v" : 2,
"key" : {
"dept" : 1,
"year" : 1
},
"name" : "dept_1_year_1"
}
]
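(dept_1_year_1 is a compound index: entries are ordered first by dept
and then by year within each dept.)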
> db.student.dropIndex({"name":1})
{ "nIndexesWas" : 4, "ok" : 1 }
> db.student.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"roll_no" : -1
},
"name" : "roll_no_-1"
},
{
"v" : 2,
"key" : {
"dept" : 1,
"year" : 1
},
"name" : "dept_1_year_1"
}
]
> db.student.dropIndexes()
{
"nIndexesWas" : 3,
"msg" : "non-_id indexes dropped for collection",
"ok" : 1
}
> db.student.getIndexes()
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_" } ]
>
