2.1.2 Data Models

Data Models

Pravin Y Pawar

Adapted from Designing Machine Learning Systems by Chip Huyen

Data Models

• Data models are perhaps the most important part of developing software, because they have such a profound effect
o on how the software is written
o on how we think about solving the problem

• Describes how data is represented in terms of attributes


o A car can be represented by its make, model, year, color etc., as well as by its owner, license plate etc.
o Most applications are built by layering one data model on top of another

• Many different kinds of data models


o Every data model embodies assumptions about how it is going to be used
o Building software is hard enough, even when working with just one data model, and without worrying about its
inner workings

• Selecting a data representation affects


• The ways systems are built
• The problems systems can solve
Relational Model
One of the most persistent ideas in Computer Science
• Invented by Edgar Codd in 1970, but still going strong today, even getting more popular!
• Simple and Powerful
• Data is organized into relations; Each relation is a set of tuples
• A table is a visual representation of a relation; each row of the table makes up a tuple
• Relations are unordered – can shuffle order of rows or columns
• Usually stored in CSV or Parquet format
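
The "set of tuples" view above can be sketched in a few lines of Python. This is only an illustration, assuming made-up column names and rows; it shows that two relations with the same rows are equal regardless of row order, and that column order stops mattering once each value is addressed by its attribute name.

```python
# A relation modeled as a set of tuples: membership matters, row order does not.
# Column names and rows are illustrative assumptions.
COLUMNS = ("title", "author", "price")

relation_a = {
    ("Book1", "Pawar", 20),
    ("Book2", "Pravin", 30),
}
# The same rows, written in a different order.
relation_b = {
    ("Book2", "Pravin", 30),
    ("Book1", "Pawar", 20),
}
same_relation = relation_a == relation_b  # sets ignore ordering

# Addressing values by attribute name makes column order irrelevant too.
row = dict(zip(COLUMNS, ("Book1", "Pawar", 20)))
shuffled_row = {"price": 20, "title": "Book1", "author": "Pawar"}
column_order_irrelevant = row == shuffled_row

print(same_relation, column_order_irrelevant)
```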
Relational Model(2)
Normalization
• Desirable for relations to be normalized
o Can follow forms such as first normal form (1NF), second normal form (2NF) etc.
o Can reduce data redundancy and improve data integrity

• For example, consider the Books relation
o Lots of duplicate data – publisher, country
o If something changes, it needs to be reflected everywhere
o If normalized, updates need to be managed in a single place

Books (before normalization)
Title   Author   Format      Publisher    Country   Price
Book1   Pawar    Paperback   Press1       UK        20
Book1   Pawar    E-book      Press1       UK        10
Book2   Pravin   Paperback   NotedDraft   US        30
Book3   PYP      Paperback   Press1       UK        30
Book2   Pravin   Paperback   NotedDraft   US        15

Books (normalized)
Title   Author   Format      PubID   Price
Book1   Pawar    Paperback   1       20
Book1   Pawar    E-book      1       10
Book2   Pravin   Paperback   2       30
Book3   PYP      Paperback   1       30
Book2   Pravin   Paperback   2       15

Publishers
PubID   Publisher    Country
1       Press1       UK
2       NotedDraft   US


Major downside: data must be joined back together from multiple tables, which is expensive for large tables!
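
A minimal sketch of the normalized layout, using Python's built-in sqlite3 module with an in-memory database. The table and column names (books, publishers, pub_id) and the new publisher name "Press One" are assumptions made for illustration. It shows both sides of the trade-off: a publisher rename touches one row, but reading the full record back requires a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publishers (
        pub_id INTEGER PRIMARY KEY, publisher TEXT, country TEXT
    );
    CREATE TABLE books (
        title TEXT, author TEXT, format TEXT,
        pub_id INTEGER REFERENCES publishers(pub_id), price INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO publishers VALUES (?, ?, ?)",
    [(1, "Press1", "UK"), (2, "NotedDraft", "US")],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?, ?)",
    [
        ("Book1", "Pawar", "Paperback", 1, 20),
        ("Book1", "Pawar", "E-book", 1, 10),
        ("Book2", "Pravin", "Paperback", 2, 30),
        ("Book3", "PYP", "Paperback", 1, 30),
        ("Book2", "Pravin", "Paperback", 2, 15),
    ],
)

# Benefit: a publisher rename is a single-row update (hypothetical new name).
conn.execute("UPDATE publishers SET publisher = 'Press One' WHERE pub_id = 1")

# Cost: the denormalized view has to be rebuilt with a join.
rows = conn.execute(
    """
    SELECT b.title, b.format, p.publisher, p.country, b.price
    FROM books AS b
    JOIN publishers AS p ON p.pub_id = b.pub_id
    ORDER BY b.title, b.price
    """
).fetchall()
renamed = [r for r in rows if r[2] == "Press One"]
for r in rows:
    print(r)
```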
Relational Model(3)
Querying
• Databases built around relational data model are relational databases
• Language used to specify data to be retrieved from database is query language
o Most popular is SQL

• SQL
o Declarative query language – specify the output needed rather than the steps needed to get it
o Specify the pattern of data –
    o Tables from which data is needed
    o Conditions the results must meet
    o Transformations, such as joins, sort, group, aggregate
o Database figures out how to provide those results
    o How to break the query into parts
    o What methods to use to execute each part of the query
    o Order in which the parts need to be executed
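
To make the declarative versus imperative contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The table name, columns, and the price threshold of 15 are assumptions for illustration. The SQL query only states the table, the condition, and the aggregation; the hand-written loop spells out the steps that the database would otherwise plan on its own.

```python
import sqlite3

books = [
    ("Book1", "Paperback", 20),
    ("Book1", "E-book", 10),
    ("Book2", "Paperback", 30),
    ("Book3", "Paperback", 30),
    ("Book2", "Paperback", 15),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, format TEXT, price INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", books)

# Declarative: describe the result (table, condition, transformation).
declarative = conn.execute(
    """
    SELECT format, COUNT(*), MAX(price)
    FROM books
    WHERE price >= 15
    GROUP BY format
    ORDER BY format
    """
).fetchall()

# Imperative: spell out every step by hand to get the same result.
groups = {}
for _title, fmt, price in books:
    if price >= 15:
        count, highest = groups.get(fmt, (0, price))
        groups[fmt] = (count + 1, max(highest, price))
imperative = [(fmt, c, m) for fmt, (c, m) in sorted(groups.items())]

print(declarative)
print(imperative)
```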
NoSQL
Need and Origin
• Relational model generalizes a lot but still restrictive
• Needs a strict schema – schema management is painful!
• Per a 2014 Couchbase survey, frustration with schema management is the #1 reason for adoption of non-relational databases

• Latest movement against relational data model is NoSQL


o Started as hashtag for meetup to discuss non-relational databases
o Retroactively reinterpreted as Not Only SQL – many NoSQL data systems also support relational models

• Two major types – Document and Graph model


• Document model targets use cases where data comes in self-contained documents
• Graph model targets use cases where relationships between data items are common and important
NoSQL(2)
Document Model
• Built around concept of “document”
o Single continuous string, encoded as JSON, XML or binary format like BSON
o Each document has a unique key that represents that document, which can be used to retrieve it
o A collection of documents is analogous to a table in a relational database, and a document is analogous to a row
o Can convert a relation into a collection of documents
o A collection of documents is much more flexible than a table
o All rows in table must follow same schema while documents in same collection can have different schemas

• For example, a book can be represented as document as follows


{
  "Title": "Book1",
  "Author": "Pravin",
  "Publisher": "Press1",
  "Country": "US",
  "Sold as": [
    { "format": "Paperback", "Price": "20" },
    { "format": "E-books", "Price": "10" }
  ]
}
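
A rough sketch of a document collection, assuming a plain Python dict stands in for the document store. The helper names put and get, the keys "book1" and "book2", and the second document with its "Pages" field are all hypothetical. Each document is stored as a single JSON string under a unique key, and two documents in the same collection are free to have different fields.

```python
import json

collection = {}  # unique key -> document stored as one JSON string


def put(key, document):
    """Serialize the document to a single string and store it under its key."""
    collection[key] = json.dumps(document)


def get(key):
    """Retrieve a document by its key and decode it."""
    return json.loads(collection[key])


put("book1", {
    "Title": "Book1",
    "Author": "Pravin",
    "Publisher": "Press1",
    "Country": "US",
    "Sold as": [
        {"format": "Paperback", "Price": "20"},
        {"format": "E-books", "Price": "10"},
    ],
})
# A second document with a different set of fields (hypothetical example).
put("book2", {"Title": "Book2", "Author": "Pravin", "Pages": 120})

first = get("book1")
second = get("book2")
print(sorted(first), sorted(second))
```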
NoSQL(3)
Document Model – Schemaless – does not enforce a schema
• Misleading!
o The application that reads documents still assumes some kind of structure in them
o Document databases just shift the responsibility of assuming structure from the application that writes the data to the application that reads it!

• Has better locality than relational model


o In relational model, data is spread across different tables and needs to be joined when required
o In document model, all information is stored in single, self-contained place making it easier to read
o Difficult to execute joins across documents compared to across tables

• Because of the different strengths of the document and relational data models,

o It is common to use both models for different tasks in the same database system
o PostgreSQL and MySQL support them both!
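
The earlier point that "schemaless" shifts responsibility to the reader can be sketched as follows. The function name cheapest_price and the sample documents are assumptions for illustration. Nothing stops a writer from storing a document without a "Sold as" list, so the reading code has to encode its own structural assumptions and decide what to do when they do not hold.

```python
def cheapest_price(document):
    """Return the lowest price in a book document, or None if none is listed.

    The reader, not the database, assumes that "Sold as" is a list of
    offers and that each offer may carry a numeric "Price" string.
    """
    offers = document.get("Sold as", [])
    prices = [int(offer["Price"]) for offer in offers if "Price" in offer]
    return min(prices) if prices else None


complete = {
    "Title": "Book1",
    "Sold as": [
        {"format": "Paperback", "Price": "20"},
        {"format": "E-books", "Price": "10"},
    ],
}
incomplete = {"Title": "Book2"}  # accepted by the store, handled by the reader

print(cheapest_price(complete), cheapest_price(incomplete))
```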
NoSQL(4)
Graph Model
• Built around concept of “Graph”
• A graph consists of nodes and edges, where edges represent relationships between the nodes
• A database that uses graph structures to store its data is a graph database
• In a document database, the content of each document is the priority
• In a graph database, relationships between data items are the priority!
• Faster to retrieve data based on relationships, as the relationships are modeled explicitly

• For example, consider the following representation


o that may come from Stanford
o Task: find every person born in the USA
o Which one is simpler – graph or relational model?
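
Since the figure is not reproduced here, the sketch below uses invented data: the names, places, and the relation labels born_in, within, and follows are all assumptions. It stores a graph as a list of (subject, relation, object) edges and answers the "born in the USA" question by following within edges from each birthplace, which is the kind of traversal a graph model makes direct.

```python
# Edges as (subject, relation, object); all names are hypothetical.
edges = [
    ("Alice", "born_in", "California"),
    ("Bob", "born_in", "Paris"),
    ("Carol", "born_in", "USA"),
    ("California", "within", "USA"),
    ("Paris", "within", "France"),
    ("Alice", "follows", "Bob"),
]

# Map each place to the place that directly contains it.
within = {s: o for s, relation, o in edges if relation == "within"}


def is_within(place, target):
    """Follow 'within' edges upward from place until target or a dead end."""
    seen = set()
    while place is not None and place not in seen:
        if place == target:
            return True
        seen.add(place)
        place = within.get(place)
    return False


born_in_usa = sorted(
    s for s, relation, o in edges
    if relation == "born_in" and is_within(o, "USA")
)
print(born_in_usa)
```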
Structured vs Unstructured data
Structured data
• Follows a predefined data model – aka schema
• For example, Person model has two data values, age of type number and name as string
• Predefined structure makes data analysis easier

• Disadvantage – commitment to predefined schema!


o If the schema changes, all existing data must be updated retroactively, which can cause mysterious bugs in the process
o As requirements change over time, committing to a predefined schema can be restricting
o If data comes from multiple sources, there may be no control over their schemas
o Unstructured data becomes appealing!
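
The Person example above can be sketched with a Python dataclass. The validation logic and the helper parse_person are assumptions added for illustration. A fixed schema makes analysis straightforward, such as averaging ages, but any record that does not match the schema is rejected, which is the restriction the slide describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    age: int

    def __post_init__(self):
        # Enforce the predefined schema: name is a string, age is a number.
        if not isinstance(self.name, str):
            raise TypeError("name must be a string")
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise TypeError("age must be an integer")


def parse_person(record):
    """Build a Person from a raw record; fails if the record breaks the schema."""
    return Person(name=record["name"], age=record["age"])


people = [parse_person(r) for r in (
    {"name": "Asha", "age": 30},
    {"name": "Ravi", "age": 40},
)]
average_age = sum(p.age for p in people) / len(people)

try:
    parse_person({"name": "Meera", "age": "unknown"})
    rejected = False
except TypeError:
    rejected = True

print(average_age, rejected)
```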
Structured vs Unstructured data(2)
Unstructured data
• Does not adhere to predefined schema!
• Usually text, but can also be numbers, dates, images, audio
• Might still contain intrinsic patterns that help to extract meaningful structures out of the data

• Allows for more flexible storage options


o If storage follows schema, can only store data following that schema
o If storage does not follow a schema, can store any type of data!
o Can convert all data, regardless of types and formats into bytestrings and store them together
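
As a minimal sketch of the bytestring idea, assuming a plain Python list stands in for schema-free storage (the helper store and the sample values are hypothetical): text, a JSON document, a number, and a few bytes of image data are all reduced to bytes and kept side by side. The storage layer accepts everything, and interpreting each blob is left to whoever reads it later.

```python
import json

blobs = []  # schema-free storage: every item is just a bytestring


def store(data: bytes):
    """Append a raw bytestring without inspecting its contents."""
    blobs.append(data)


store("free-form note".encode("utf-8"))                 # text
store(json.dumps({"Title": "Book1"}).encode("utf-8"))   # document
store((2024).to_bytes(2, "big"))                        # number
store(bytes([0x89, 0x50, 0x4E, 0x47]))                  # start of a PNG signature

# Reading back requires knowing how each blob was encoded.
note = blobs[0].decode("utf-8")
document = json.loads(blobs[1].decode("utf-8"))
year = int.from_bytes(blobs[2], "big")

print(note, document, year, len(blobs))
```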

• Data Warehouse vs Data Lake


o Data warehouse is repository to store structured data
o Data Lake is repository to store unstructured data
o Data lakes are usually used to store raw data before processing
o Data warehouses are used to store data that has been processed into formats ready to be used
Structured vs Unstructured data(3)
Key Differences

Structured data                                 Unstructured data

Schema already defined                          Does not have to follow a schema
Easy to search and analyze                      Fast arrival
Can only handle data with a specific schema     Can handle data from any source
Schema changes will cause a lot of trouble      No need to worry about schema changes, as that is shifted
                                                to the downstream applications that use this data
Stored in data warehouses                       Stored in data lakes
Thank You!
In our next session: