4.2 NoSQL Databases UNIT-1
VISWANADH. MNV
UNIT-1 NoSQL, AGGREGATE DATA MODEL, Data Models
1.1 NoSQL
1.1.1 Introduction to NoSQL and Why NoSQL
SQL:
A SQL database is a digital system that stores and organizes information using tables,
columns, and rows. It allows users to easily manage, retrieve, and manipulate data
using a language called SQL (Structured Query Language). SQL databases are commonly
used in applications where structured data storage and retrieval are essential, such as
in websites, business applications, and data analysis systems.
NoSQL Databases:
A NoSQL database is a type of database system that provides a flexible way to store and
manage data, diverging from the rigid structure of traditional SQL databases. NoSQL
databases can handle large volumes of unstructured or semi-structured data more
effectively, offering different data models such as document stores, key-value pairs,
column-family stores, and graph databases. They are commonly used in applications
requiring scalability, high availability, and dynamic data schemas, such as web
applications, big data processing, and real-time analytics.
Feature                 SQL    NoSQL
Normalization           Yes    No
Eventual Consistency    No     Yes
When to Use SQL Databases
● SQL databases are suited for applications where the integrity of the data is
important.
● If your application handles critical data such as financial information, prefer a
SQL database.
● Use a relational database when you need to be sure that every query returns the
correct response and that no data is accidentally lost. In this case, you want
maximum consistency, possibly at the cost of some availability compared with
NoSQL.
● When your data is structured enough to be organized into schemas relatively
easily, a SQL database is a natural fit.
● If the database schema is well designed, normalization reduces data redundancy
by factoring out duplicated information. This also helps improve the quality of
your data.
an application might be a text search app that needs to perform search queries
to find a specific term through thousands or millions of documents.
● Given their flexibility, NoSQL databases may also be useful when developing
prototypes and MVPs. Time can be saved in not designing a schema, and
analyzing the collected data can help guide the schema design process for the
final product.
Concurrency
● Enterprise applications tend to have many people looking at the same body of
data at once, possibly modifying that data.
● Most of the time they are working on different areas of that data, but
occasionally they operate on the same bit of data. As a result, we have to worry
about coordinating these interactions to avoid such things as double booking of
hotel rooms.
● Concurrency is notoriously difficult to get right, with all sorts of errors that can
trap even the most careful programmers. Since enterprise applications can have
lots of users and other systems all working concurrently, there’s a lot of room
for bad things to happen. Relational databases help handle this by controlling all
access to their data through transactions.
● While this isn’t a cure-all (you still have to handle a transactional error when you
try to book a room that’s just gone), the transactional mechanism has worked
well to contain the complexity of concurrency.
● Transactions also play a role in error handling. With transactions, you can make a
change, and if an error occurs during the processing of the change you can roll
back the transaction to clean things up.
Example
Figure: Concurrent transactions (a few operations of transaction T1 are executed, then
T2, and then the remaining operations of T1).
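The rollback behaviour described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not part of the original notes: the table names, guest names, and the book_room helper are assumptions made for the example.

```python
import sqlite3

# In-memory database; the tables and names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (room INTEGER PRIMARY KEY, guest TEXT NOT NULL)")
conn.execute("CREATE TABLE payments (guest TEXT, amount REAL)")
conn.commit()

def book_room(conn, room, guest, amount):
    """Record the payment and the booking together; undo both if either fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO payments VALUES (?, ?)", (guest, amount))
            conn.execute("INSERT INTO bookings VALUES (?, ?)", (room, guest))
        return True
    except sqlite3.IntegrityError:
        # The room was already taken; the payment insert has been rolled back.
        return False

first = book_room(conn, 101, "Asha", 80.0)   # room is free
second = book_room(conn, 101, "Ravi", 80.0)  # same room: primary key conflict
payments = conn.execute("SELECT guest FROM payments").fetchall()
```

The second call fails on the booking insert, so its payment row is removed as well; the application still has to handle the failed booking, which is the "transactional error" mentioned above.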
Integration
● Enterprise applications live in a rich ecosystem that requires multiple
applications, written by different teams, to collaborate in order to get things
done.
● This kind of inter-application collaboration is awkward because it means pushing
the human organizational boundaries. Applications often need to use the same
data, and updates made through one application have to be visible to others.
● A common way to do this is shared database integration where multiple
applications store their data in a single database. Using a single database allows
all the applications to use each others’ data easily, while the database’s
concurrency control handles multiple applications in the same way as it handles
multiple users in a single application.
A Standard Model
● Relational databases have succeeded because they provide the core benefits we
outlined earlier in a (mostly) standard way.
● As a result, developers and database professionals can learn the basic relational
model and apply it in many projects.
● Although there are differences between different relational databases, the core
mechanisms remain the same: Different vendors’ SQL dialects are similar,
transactions operate in mostly the same way.
● For application developers, the biggest problem has been what is commonly
called the impedance mismatch: the difference between the relational model and
the in-memory data structures.
● The relational data model organizes data into a structure of tables and rows, or
more properly, relations and tuples. In the relational model, a tuple is a set of
name-value pairs and a relation is a set of tuples.
● All operations in SQL consume and return relations, which leads to the
mathematically elegant relational algebra.
● The application layer of an application is typically written in an object-oriented
language. However, the object-oriented and the relational data model don't fit
well together.
● In the object-oriented world, objects are connected via references and form an
object hierarchy or graph. In contrast, the relational model saves data in
two-dimensional tables with a row for each entry and columns for the entry’s
properties.
● If you want to store your object graph in a relational database, you have to slice
and flatten it until it fits into multiple normalized tables. From an
object-oriented point of view, this is complex and unnatural.
● Moreover, if you want to recover the objects you have to join several tables,
which can lead to complex queries and performance issues.
Consider the following example:
Figure: Comparing the object-oriented and the relational data model. These two worlds do
not fit together naturally.
The Customer Karl has references to two BankAccount objects and to two
Address objects. In the schema, there is one table for each of the three classes
(Customers, Addresses, BankAccounts), and each table is filled with the corresponding
data. Furthermore, the entries for the addresses and the bank accounts have a foreign
key pointing to the entry in the Customers table. Note that the direction of the
relationship in the relational model is the reverse of the original one, which is why
the relational model feels unnatural to the object-oriented developer. Moreover,
distributing the data over several tables gets even more complicated when
intermediate tables are necessary (for n:m relationships).
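As a rough sketch of the mismatch described above, the Python snippet below holds Karl as an object graph and then flattens it into rows for three tables. The class fields, city names, and account numbers are hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass, field

@dataclass
class Address:
    city: str

@dataclass
class BankAccount:
    number: str

@dataclass
class Customer:
    name: str
    # In memory, the customer points at its addresses and accounts.
    addresses: list = field(default_factory=list)
    accounts: list = field(default_factory=list)

def flatten(customer, customer_id):
    """Slice one object graph into rows for three tables.

    Note the reversed direction: each address and account row carries a
    foreign key (customer_id) pointing back at the customer row.
    """
    customer_rows = [(customer_id, customer.name)]
    address_rows = [(customer_id, a.city) for a in customer.addresses]
    account_rows = [(customer_id, b.number) for b in customer.accounts]
    return customer_rows, address_rows, account_rows

karl = Customer(
    "Karl",
    addresses=[Address("Berlin"), Address("Hamburg")],        # hypothetical data
    accounts=[BankAccount("ACC-1"), BankAccount("ACC-2")],    # hypothetical data
)
customer_rows, address_rows, account_rows = flatten(karl, 1)
```

Rebuilding the Customer object from these rows would require the reverse step: reading all three row sets and joining them on the customer ID.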
Application Databases
● A different approach is to treat your database as an application database which
is only directly accessed by a single application codebase that’s looked after by a
single team.
● With an application database, only the team using the application needs to know
about the database structure, which makes it much easier to maintain and
evolve the schema.
● Since the application team controls both the database and the application code,
the responsibility for database integrity can be put in the application code.
● In the 2000s, several large web properties dramatically increased in scale. This
increase was happening along many dimensions. Websites started tracking activity
and structure in a very detailed way. Large sets of data appeared: links, social
networks, activity in logs, mapping data.
● With this growth in data came a growth in users, as the biggest websites grew to
be vast estates regularly serving huge numbers of visitors.
● Coping with the increase in data and traffic required more computing resources.
To handle this kind of increase, there are two choices: scale up or scale out.
● Scaling up implies bigger machines: more processors, disk storage, and memory.
● But bigger machines get more and more expensive, not to mention that there
are real limits as your size increases.
● Cluster: The alternative is to use lots of small machines in a cluster. A cluster of
small machines can use commodity hardware and ends up being cheaper at
these kinds of scales.
● A cluster can also be more resilient: while individual machine failures are
common, the overall cluster can be built to keep going despite such failures,
providing high reliability.
● While this separates the load, all the sharding has to be controlled by the
application which has to keep track of which database server to talk to for each
bit of data. Also, we lose any querying, referential integrity, transactions, or
consistency controls that cross shards.
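To illustrate what "controlled by the application" means here, the following Python sketch routes each key to a database server by hashing it. The server names and the shard_for helper are assumptions for illustration; real deployments often use more elaborate schemes.

```python
import hashlib

# Hypothetical list of database servers, one per shard.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(key: str) -> str:
    """Pick the shard that owns a key, using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

# The application must call this before every read or write.
server_for_customer_1 = shard_for("customer:1")
server_for_customer_2 = shard_for("customer:2")
```

Because each key lives on exactly one server, a query or transaction that touches keys on different servers has no single database to coordinate it, which is the loss of cross-shard guarantees noted above.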
Key Points
● Relational databases have been a successful technology for twenty years,
providing persistence, concurrency control, and an integration mechanism.
● Application developers have been frustrated with the impedance mismatch
between the relational model and the in-memory data structures.
● There is a movement away from using databases as integration points towards
encapsulating databases within applications and integrating through services.
● The vital factor for a change in data storage was the need to support large
volumes of data by running on clusters. Relational databases are not designed to
run efficiently on clusters.
● NoSQL is an accidental neologism. There is no prescriptive definition—all you can
make is an observation of common characteristics.
● The common characteristics of NoSQL databases are
○ Not using the relational model
○ Running well on clusters
○ Open-source
○ Built for the 21st century web estates
○ Schemaless
● The most important result of the rise of NoSQL is Polyglot Persistence.
1.2 AGGREGATE DATA MODEL
1.2.1 Introduction
● Data Model: How we view and interact with database data.
● Types of Data Models
Relational Data Model:
○ Uses tables (like spreadsheets).
○ Rows represent entities; columns represent attributes.
○ Relationships are defined by linking rows from different tables.
NoSQL Data Models or Aggregate Data Models
● An aggregate is a collection of related objects that is treated as a unit. In NoSQL
databases, an aggregate is a collection of data that we interact with as a unit.
These aggregates form the boundaries for ACID operations.
● Aggregates make it easier for the database to manage data storage over clusters,
because each aggregate can reside as a whole on any one of the machines.
Whenever an aggregate is retrieved from the database, all of its data comes
along with it.
● Aggregate-oriented NoSQL databases generally do not support ACID transactions
that span multiple aggregates; atomicity is guaranteed only within a single
aggregate. OLAP operations can also be performed on such databases.
● Aggregate data models are most efficient when data transactions and
interactions take place within the same aggregate.
1.2.2 Aggregates
● Definition: An aggregate is a collection of related objects that is treated as a
unit. In NoSQL databases, aggregates form the boundaries for ACID operations.
● Aggregate: A complex record with nested structures (lists, other records).
● An aggregate-oriented database is a NoSQL database that generally does not
support ACID transactions across multiple aggregates. Its operations differ from
relational database operations. OLAP operations can also be performed on an
aggregate-oriented database.
● The efficiency of an aggregate-oriented database is high if the data transactions
and interactions take place within the same aggregate. Fields of data that are
commonly accessed together can be put in the same aggregate. Only a single
aggregate can be manipulated atomically at a time; multiple aggregates cannot
be manipulated in one atomic operation.
● NoSQL databases are commonly classified into four major data models. The first
three are aggregate-oriented; graph databases are not:
○ Key-value
○ Document
○ Column family
○ Graph-based
● This model fits a domain where we do not want the shipping and billing
addresses of an existing order to change.
Notice that a single logical address record appears three times in the data; its value
is copied wherever it is used rather than referenced by ID. The whole address can be
copied into an aggregate as needed. There is no predefined rule for drawing aggregate
boundaries; it depends solely on how you want to manipulate the data.
The data model for customer and order would look like this.
// in customers
{
"customer": {
"id": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"orders": [
{
"id":99,
"customerId":1,
"orderItems":[
{
"productId":27,
"price": 32.45,
"productName": "NoSQL Distilled"
}
],
"shippingAddress":[{"city":"Chicago"}],
"orderPayment":[
{
"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft",
"billingAddress": {"city": "Chicago"}
}]
}]
}
}
With aggregate data models, if you want to access a customer along with all of that
customer’s orders at once, designing a single aggregate is preferable. But if you want
to access a single order at a time, you should have separate aggregates for each
order. The choice is very context-specific.
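The two designs can be compared in a small Python sketch. A plain dict stands in for the database here, and the key names ("customer:1", "order:99") and helper functions are assumptions for illustration; a real system would normally pick one design rather than store both.

```python
import json

# A dict stands in for an aggregate-oriented store: key -> serialized aggregate.
store = {}

order_99 = {
    "id": 99,
    "customerId": 1,
    "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
    "shippingAddress": [{"city": "Chicago"}],
}

# Design 1: one aggregate per customer, with the orders embedded.
store["customer:1"] = json.dumps(
    {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}], "orders": [order_99]}
)

# Design 2: each order is stored as its own aggregate.
store["order:99"] = json.dumps(order_99)

def customer_with_orders(customer_id):
    """One lookup returns the customer and every embedded order."""
    return json.loads(store[f"customer:{customer_id}"])

def single_order(order_id):
    """One lookup returns just the order, without loading the customer."""
    return json.loads(store[f"order:{order_id}"])

martin = customer_with_orders(1)
order = single_order(99)
```

With the embedded design, reading one order means loading the whole customer aggregate; with separate aggregates, listing a customer's orders takes one lookup per order.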
Advantages:
● It can be used as a primary data source for online applications.
● Easy replication.
● No single point of failure.
● It provides fast performance and horizontal scalability.
● It can handle structured, semi-structured, and unstructured data with equal
effort.
Disadvantages:
● No standard rules.
● Limited query capabilities.
● Doesn’t work well with relational data.
● Not so popular in the enterprise.
● When the volume of data increases, it is difficult to maintain unique values.
pairs stored on separate records is called key-value databases and they do not have an
already defined structure.
● Caching mechanism for repeatedly accessing data or key-based design.
● The application is developed on queries that are based on keys.
Features:
● One of the simplest kinds of NoSQL data models.
● Key-value databases use simple functions for storing, getting, and removing
data.
● Key-value databases have no query language.
● Built-in redundancy makes this database more reliable.
Advantages:
● It is very easy to use. Because the database is so simple, values can be of any
type, or even of different types when required.
● Response time is fast because of this simplicity, provided that the surrounding
environment is well built and optimized.
● Key-value store databases are scalable vertically as well as horizontally.
● Built-in redundancy makes this database more reliable.
Disadvantages:
● Because key-value databases have no query language, queries cannot be ported
from one database to another.
● The key-value store is not refined: you cannot query the database without a key.
Some examples of key-value databases:
● Redis: A popular and widely used in-memory key-value database.
● Amazon DynamoDB: A key-value database widely used on AWS. It can easily
handle a large number of requests every day and also provides various security
options.
● Riak: A distributed key-value database used to develop applications.
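The "simple functions for storing, getting, and removing data" listed under Features can be sketched as a tiny in-memory class. This is an illustrative toy, not the API of Redis, DynamoDB, or Riak; the class and method names are assumptions.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque and reachable only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value of any type under the key, replacing any old value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for the key, or the default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove the key; report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False

kv = KeyValueStore()
kv.put("session:42", {"user": "abc", "cart": [27]})  # a nested value
kv.put("counter", 7)                                  # a different value type
session = kv.get("session:42")
removed = kv.delete("counter")
missing = kv.get("counter")
```

There is no way to ask this store for "all sessions whose cart contains product 27" without already knowing the keys, which is the querying limitation listed under Disadvantages.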
Document Data Model:
● The Document Data Model differs from other data models because it stores data
in JSON, BSON, or XML documents.
● In this data model, documents can be nested inside other documents, and
particular elements can be indexed to run queries faster.
● Documents are often stored and retrieved in a form close to the data objects
used in applications, which means very little translation is required to use the
data in applications. JSON is a native format that is often used to store and
query data too.
● In the document data model, each document consists of key-value pairs. Below is
an example:
{
"Name" : "abc",
"Address" : "Narsapur",
"Email" : "abc@gmail.com",
"Contact" : "12345"
}
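A minimal Python sketch of querying inside documents and indexing one element follows. The first document is the example above; the other two documents and the helper functions are hypothetical additions for illustration, and no specific document database API is implied.

```python
from collections import defaultdict

documents = [
    {"Name": "abc", "Address": "Narsapur", "Email": "abc@gmail.com", "Contact": "12345"},
    # Hypothetical extra documents; the second one has no "Contact" field.
    {"Name": "xyz", "Address": "Hyderabad", "Email": "xyz@example.com"},
    {"Name": "pqr", "Address": "Narsapur", "Email": "pqr@example.com", "Contact": "67890"},
]

def find(docs, field, value):
    """Full scan: look at the given field inside every document."""
    return [d for d in docs if d.get(field) == value]

def build_index(docs, field):
    """Map each value of the field to the positions of the documents holding it."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        if field in doc:
            index[doc[field]].append(position)
    return index

scanned = [d["Name"] for d in find(documents, "Address", "Narsapur")]

# With an index, the lookup goes straight to the matching documents.
address_index = build_index(documents, "Address")
indexed = [documents[i]["Name"] for i in address_index["Narsapur"]]
```

Both approaches return the same documents; the index trades extra storage and upkeep for avoiding a scan of every document.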
● Open formats: Documents are built in a simple way using open formats such as
XML and JSON and their variants.
● Built-in versioning: As documents grow in size they may also grow in complexity;
built-in versioning decreases conflicts.
Disadvantages:
● Weak Atomicity: It often lacks support for multi-document ACID transactions. A
change involving two collections requires two separate queries, one for each
collection, which breaks atomicity requirements.
● Consistency Check Limitations: One can search collections and documents that
are not connected to an author collection, but doing so might degrade database
performance.
● Security: Many web applications lack security, which results in the leakage of
sensitive data. This is a point of concern, and one must pay attention to web app
vulnerabilities.
Applications of Document Data Model:
● Content Management: These data models are widely used for video streaming
platforms, blogs, and similar services, because each item is stored as a single
document and the database is much easier to maintain as the service evolves
over time.
● Book Database: These data models are useful for book databases because they
let us nest data.
● Catalog: These data models are widely used for storing and reading catalog files
because they read quickly even when catalogs have thousands of attributes
stored.
● Analytics Platform: These data models are widely used in analytics platforms.
● Wide column store database
● Wide column store
● Columnar database
● Columnar store
Figure: column family containing 3 rows. Each row contains its own set of columns.
As the above diagram shows:
● A column family consists of multiple rows.
● Each row can contain a different number of columns from the other rows, and
the columns don’t have to match the columns in the other rows (i.e. they can
have different column names, data types, etc.).
● Each column is contained in its row. It doesn’t span all rows as in a relational
database. Each column contains a name/value pair, along with a timestamp.
Note that this example uses Unix/Epoch time for the timestamp.
Here’s how each row is constructed:
● Column. Each column contains a name, a value, and a timestamp.
● Name. This is the name of the name/value pair.
● Value. This is the value of the name/value pair.
● Timestamp. This provides the date and time that the data was inserted. This can
be used to determine the most recent version of data.
Some DBMSs expand on the column family concept to provide extra
functionality/storage ability. For example, Cassandra has the concept of composite
columns, which allow you to nest objects inside a column.
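The row and column structure described above can be sketched with nested Python dicts. The row keys, column names, values, and timestamps below are hypothetical, and the put and get_latest helpers are illustrative rather than any database's API.

```python
# Layout: row key -> {column name -> list of (value, timestamp) versions}
column_family = {}

def put(row_key, name, value, timestamp):
    """Add a column version to a row, creating the row or column if needed."""
    columns = column_family.setdefault(row_key, {})
    columns.setdefault(name, []).append((value, timestamp))

def get_latest(row_key, name):
    """Use the timestamp to pick the most recent version of a column."""
    versions = column_family[row_key][name]
    return max(versions, key=lambda version: version[1])[0]

# Hypothetical data with Unix/Epoch timestamps.
put("user:1", "name", "Bob", 1465676582)
put("user:1", "email", "bob@example.com", 1465676582)
put("user:1", "email", "bob@example.org", 1465676999)  # newer version
put("user:2", "name", "Britney", 1465676432)
put("user:2", "country", "Sudan", 1465676432)

latest_email = get_latest("user:1", "email")
columns_user_2 = set(column_family["user:2"])
```

Each row owns its columns, so "user:1" and "user:2" hold different column names without any shared table definition.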
● This aggregate is central to running on a cluster, as the database will ensure that
all the data for an aggregate is stored together on one node.
● The aggregate also acts as the atomic unit for updates, providing a useful, if
limited, amount of transactional control.
● Within that notion of aggregate, we have some differences. The key-value data
model treats the aggregate as an opaque whole, which means you can only do
key lookup for the whole aggregate; you cannot run a query nor retrieve a part
of the aggregate.
Key Points
● An aggregate is a collection of data that we interact with as a unit. Aggregates
form the boundaries for ACID operations with the database.
● Key-value, document, and column-family databases can all be seen as forms of
aggregate-oriented databases.
● Aggregates make it easier for the database to manage data storage over clusters.
● Aggregate-oriented databases work best when most data interaction is done
with the same aggregate; aggregate-ignorant databases are better when
interactions use data organized in many different formations.
1.3 More Details on Data Models
1.3.1 Introduction
● Aggregates are the main feature: aggregate-oriented databases model data
using aggregates.
● Although aggregates are central, NoSQL databases also offer additional data
modeling concepts that affect how data is accessed.
● Other concepts covered here are the graph data model, schemaless databases,
and materialized views.
1.3.2 Relationships
● Purpose of Aggregates:
○ Combine commonly accessed data.
○ Example: a customer and their order history.
● Different Access Needs:
○ Some applications need customer data with order history (single aggregate).
○ Others process orders individually (separate aggregates).
● Linking Separate Aggregates:
○ Use the customer ID in the order's data.
○ Read the order data to get the customer ID, then fetch the customer data.
○ The database won't know about the relationship by default.
● Database Relationship Visibility:
○ Some databases can show these links.
○ Document stores index and query aggregate content.
○ Key-value stores like Riak use metadata for links and partial retrieval.
● Handling Updates:
○ Aggregate-oriented databases: the unit of data retrieval is the aggregate.
○ Atomicity within a single aggregate only.
○ Relational databases: support multiple-record transactions with ACID
guarantees.
● Complexity in Multiple Aggregates:
○ Harder to operate across multiple aggregates in aggregate-oriented databases.
○ Relational databases struggle with complex relationships and many joins.
● Database Choice:
○ Relational databases for data with many relationships.
○ Aggregate-oriented databases can be awkward with multiple aggregates.
○ Consider other NoSQL databases for complex queries and relationships.
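The "read the order, then fetch the customer" step can be sketched in Python as two separate lookups made by the application. The dict below stands in for the database, and the key format and customer_for_order helper are assumptions for illustration.

```python
# A dict stands in for the database: key -> aggregate.
aggregates = {
    "customer:1": {"customerId": 1, "name": "Martin"},
    "order:99": {
        "orderId": 99,
        "customerId": 1,  # the link is just a value inside the order aggregate
        "orderItems": [{"productId": 27, "price": 32.45}],
    },
}

def customer_for_order(store, order_id):
    """Follow the link by hand: the store itself knows nothing about it."""
    order = store[f"order:{order_id}"]                # first read
    customer_key = f"customer:{order['customerId']}"  # link resolved in application code
    return store[customer_key]                        # second read

customer = customer_for_order(aggregates, 99)
```

The two reads are independent operations, so nothing guarantees that the customer still exists or is unchanged between them; that is the cost of atomicity being limited to a single aggregate.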
first-class elements of the data model. These data models give us a conceptual view of
the data.
These data models are based on a topological network structure. Graph theory gives us
the terms nodes, edges, and properties; here is what they mean in the graph-based
data model.
● Nodes: These are the instances of data that represent objects which are to be
tracked.
● Edges: Edges represent relationships between nodes.
● Properties: They represent information associated with nodes.
The image below shows nodes with properties, connected by relationships that are
represented as edges.
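Nodes, edges, and properties can be sketched with plain Python structures. The node names, relationship labels, and helper functions below are hypothetical and do not follow any particular graph database's API.

```python
# Nodes: identifier -> properties (information attached to the node).
nodes = {
    "anna": {"label": "Person", "city": "Chicago"},
    "barbara": {"label": "Person", "city": "Chicago"},
    "nosql_distilled": {"label": "Book"},
}

# Edges: (source node, relationship, destination node).
edges = [
    ("anna", "FRIEND", "barbara"),
    ("barbara", "LIKES", "nosql_distilled"),
    ("anna", "LIKES", "nosql_distilled"),
]

def neighbors(node, relationship):
    """Follow every outgoing edge of one relationship type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]

def liked_by_friends(person):
    """Two-step traversal: person -> friends -> things the friends like."""
    liked = []
    for friend in neighbors(person, "FRIEND"):
        for item in neighbors(friend, "LIKES"):
            if item not in liked:
                liked.append(item)
    return liked

recommendations = liked_by_friends("anna")
```

The traversal follows relationships directly instead of joining tables, which is why graph models suit data with complex relationship structures.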
Examples of Graph Data Models:
● JanusGraph: A scalable, open-source graph database system that is very helpful
in big data analytics. JanusGraph has features such as:
○ Storage: Several options are available for storing graph data, such as
Cassandra.
○ Support for transactions: It supports ACID (Atomicity, Consistency, Isolation,
and Durability) transactions and can serve thousands of concurrent users.
○ Searching options: Complex searches are available through optional support.
● Neo4j: The name is sometimes expanded as Network Exploration and
Optimization 4 Java. This graph database is written in Java with native graph
storage and processing. Neo4j has features such as:
○ Scalable: Scalable through data partitioning into pieces known as shards.
○ Higher Availability: Availability is high due to continuous backups and rolling
upgrades.
○ Query Language: Uses the programmer-friendly Cypher graph query language.
● DGraph: An open-source distributed graph database system designed for
scalability. DGraph's main features are:
○ Query Language: It uses GraphQL, a query language designed for APIs.
○ Open-source system: Support for many open standards.
Advantages of Graph Data Model:
● Structure: The structures are very agile and workable.
● Explicit Representation: The portrayal of relationships between entities is
explicit.
● Real-time Output: Queries return results in real time.
Disadvantages of Graph Data Model:
● No standard query language: The query language depends on the platform used,
so there is no single standard query language.
● Poor fit for transactional systems: Graph databases are not well suited to
transaction-based systems.
● Small User Base: The user base is small, which makes it difficult to get support
when running into a problem.
Applications of Graph Data Model:
● Graph data models are widely used in fraud detection.
● They are used in digital asset management, providing a scalable database model
to keep track of digital assets.
● They are used in network management to alert a network administrator about
problems in a network.
● They are used in context-aware services, for example to give traffic updates.
● They are used in real-time recommendation engines, which provide a better
user experience.
Schemaless vs. schema databases pros and cons
Pros:
● There is no existing “schema” for the data to be structured around.
● Can add additional fields that SQL databases can’t accommodate.
Cons:
● Though the NoSQL community is still growing at a tremendous rate, not all
troubleshooting issues have been properly documented.
● Lack of compatibility with SQL instructions.
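The ability to add fields freely, and the implicit schema that readers of the data still rely on, can be sketched in Python. The records, the loyaltyTier field, and the helper functions are hypothetical examples.

```python
# Two records in the same collection; no schema is declared anywhere.
records = [
    {"name": "Martin", "city": "Chicago"},
    {"name": "Pramod", "city": "Chicago", "loyaltyTier": "gold"},  # extra field added freely
]

def mailing_label(record):
    """Implicit schema: this code assumes every record has name and city."""
    return f"{record['name']}, {record['city']}"

def tier(record):
    """Code that reads an optional field has to cope with its absence."""
    return record.get("loyaltyTier", "none")

labels = [mailing_label(r) for r in records]
tiers = [tier(r) for r in records]
```

The store accepts both records, but mailing_label would fail on a record without a city field: the schema has moved from the database into the application code.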
Materialized views precompute and store the results of a query as a physical table in the
database. This precomputation occurs at regular intervals or can be triggered by specific
events.
1. Define the Query: Specify a query to retrieve data from source tables, including any
filtering, aggregations, or joins.
2. Populate the View: The database runs the query and stores the results as a physical
table.
Materialized views need periodic updates to reflect changes in the source data. The
refresh frequency varies based on requirements.
1. Full Refresh: Completely recomputes and replaces the materialized view. Simple but
resource-intensive.
2. Incremental Refresh: Applies only changes from the source data, more efficient for
large datasets.
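The two refresh strategies can be contrasted in a short Python sketch. The view here is a per-customer order total; the order data and function names are hypothetical, and the incremental version handles newly inserted rows only (updates and deletes would need more bookkeeping).

```python
from collections import defaultdict

def full_refresh(orders):
    """Recompute the whole view (total amount per customer) from the source."""
    view = defaultdict(float)
    for order in orders:
        view[order["customerId"]] += order["amount"]
    return dict(view)

def incremental_refresh(view, new_orders):
    """Apply only the newly inserted orders to a copy of the existing view."""
    updated = dict(view)
    for order in new_orders:
        customer = order["customerId"]
        updated[customer] = updated.get(customer, 0.0) + order["amount"]
    return updated

# Hypothetical source data.
orders = [{"customerId": 1, "amount": 10.0}, {"customerId": 2, "amount": 5.0}]
view = full_refresh(orders)

new_orders = [{"customerId": 1, "amount": 2.5}]
view_incremental = incremental_refresh(view, new_orders)
view_full = full_refresh(orders + new_orders)
```

Both strategies end with the same view; the incremental one reads only the new rows, while the full one rereads every source row.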
2. Time Series Analysis: Store precomputed data summaries (e.g., monthly or weekly)
for business intelligence and reporting.
Figure: Materialized view in a NoSQL database (MongoDB)
1. Speed: They improve query performance by allowing you to query precomputed data
instead of recalculating it each time, saving time on complex queries.
2. Simplicity: They consolidate complex queries into one table, making data
transformations and maintenance easier. This also helps reduce the data replicated in
the view.
4. Access Control: They allow you to control data access, letting users see specific data
without accessing the underlying source tables.
2. Performance Impact: Frequent updates can degrade system performance,
especially during peak periods.
4. Management: Clear refresh rules and schedules are needed, along with strategies to
handle data inconsistencies, refresh failures, and storage strain.
The following table shows key similarities and differences between tables, regular views,
cached query results, and materialized views:
Regular table ✔ ✔
Regular view ✔ ✔
Materialized view ✔ ✔ ✔ ✔ ✔ ✔
Figure 3.3. Customer is stored separately from Order.
In document stores, since we can query inside documents, removing references to
Orders from the Customer object is possible. This change allows us to not update the
Customer object when new orders are placed by the Customer.
{
"customerId": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"payment": [{"type": "debit", "ccinfo": "1000-1000-1000-1000"}]
}
{
"orderId": 99,
"customerId": 1,
"orderDate": "Nov-20-2011",
"orderItems": [{"productId": 27, "price": 32.45}],
"orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft"}],
"shippingAddress": {"city": "Chicago"}
}
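With the reference held only on the order side, a customer's orders are found by querying inside the order documents. The Python sketch below shows this with a simple filter; orders 100 and 101 and the orders_for_customer helper are hypothetical additions to the example above.

```python
customers = [{"customerId": 1, "name": "Martin"}]

orders = [
    {"orderId": 99, "customerId": 1, "orderDate": "Nov-20-2011"},
    {"orderId": 100, "customerId": 1, "orderDate": "Nov-21-2011"},  # hypothetical
    {"orderId": 101, "customerId": 2, "orderDate": "Nov-21-2011"},  # hypothetical
]

def orders_for_customer(order_docs, customer_id):
    """Query inside the order documents instead of keeping a list on the customer."""
    return [o for o in order_docs if o["customerId"] == customer_id]

martin_order_ids = [o["orderId"] for o in orders_for_customer(orders, 1)]
```

Placing a new order only adds an order document; the customer document is never touched.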
Figure 3.4. Conceptual view into a column data store
Key Points
● Aggregate-oriented databases make inter-aggregate relationships more difficult
to handle than intra-aggregate relationships.
● Graph databases organize data into node and edge graphs; they work best for
data that has complex relationship structures.
● Schemaless databases allow you to freely add fields to records, but there is
usually an implicit schema expected by users of the data.
● Aggregate-oriented databases often compute materialized views to provide data
organized differently from their primary aggregates. This is often done with
map-reduce computations.