Database Administration and Management
1. What is a database?
Answer: A database is a structured collection of data that is stored and managed to ensure easy
access, retrieval, and manipulation. It can be stored on a variety of systems, including cloud, on-
premises servers, or distributed systems. Databases are managed by Database Management
Systems (DBMS), such as MySQL, Oracle, PostgreSQL, and SQL Server.
2. What is the difference between a database and a DBMS?
Answer:
Database: A database is a collection of data that is stored and organized in a way that
allows for efficient access, management, and modification.
DBMS (Database Management System): A DBMS is a software system that provides
an interface for managing and manipulating databases. It helps in the creation,
maintenance, and administration of databases and ensures data integrity, security, and
consistency. Examples include Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
3. What are the different types of database models?
Answer:
Hierarchical Model: Data is organized in a tree-like structure, where each record has a
single parent (e.g., XML databases).
Network Model: Similar to the hierarchical model, but allows multiple parent records for
a child.
Relational Model: Data is organized in tables (relations) that can be linked via keys.
This is the most common database model (e.g., MySQL, SQL Server, Oracle).
Object-Oriented Model: Data is represented as objects, similar to objects in object-
oriented programming.
Document Model: Data is stored in a semi-structured format, such as JSON or BSON
(e.g., MongoDB).
Graph Model: Data is stored as nodes and edges in a graph structure (e.g., Neo4j).
4. What is a primary key?
Answer: A primary key is a unique identifier for a record in a relational database table. It
ensures that each record can be uniquely identified. A primary key must contain unique values,
and it cannot contain NULL values. A table can only have one primary key, but the primary key
can consist of one or more columns (a composite key).
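For illustration, a sketch in standard SQL (table and column names are hypothetical) showing both a single-column and a composite primary key:

```sql
-- Single-column primary key: employee_id uniquely identifies each row
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

-- Composite primary key: the pair (employee_id, project_id) must be unique
CREATE TABLE project_assignments (
    employee_id INT,
    project_id  INT,
    PRIMARY KEY (employee_id, project_id)
);
```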
5. What is a foreign key?
Answer: A foreign key is a field (or collection of fields) in one table that refers to the primary
key of another table. It is used to establish and enforce a link between the data in two tables. The
foreign key ensures referential integrity by only allowing values that are present in the referenced
primary key field.
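A minimal sketch in standard SQL (hypothetical tables):

```sql
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    name          VARCHAR(100)
);

-- employees.department_id may only hold values that exist in departments
CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    department_id INT,
    FOREIGN KEY (department_id) REFERENCES departments (department_id)
);
```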
6. What is indexing in a database?
Answer: Indexing is a data structure technique used to speed up the retrieval of rows from a
database table. An index provides quick access to rows based on the values of one or more
columns. Indexes are created on frequently searched columns to optimize query performance.
Common types of indexes include B-tree and Hash Indexes.
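As a sketch (table and column names are illustrative), an index is created on a frequently searched column:

```sql
-- B-tree index on a column that appears often in WHERE clauses
CREATE INDEX idx_employees_last_name ON employees (last_name);

-- This query can now use the index instead of scanning the whole table
SELECT * FROM employees WHERE last_name = 'Smith';
```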
7. What is the difference between a clustered and a non-clustered index?
Answer:
Clustered Index: In a clustered index, the data rows are stored in the order of the index.
A table can have only one clustered index because the data can be sorted in only one way.
The primary key is usually a clustered index.
Non-clustered Index: In a non-clustered index, the data rows are stored separately from
the index. Non-clustered indexes have pointers to the data rows, making it possible to
create multiple non-clustered indexes on a table.
9. What are the ACID properties in databases?
Answer:
Atomicity: Ensures that all operations in a transaction are completed successfully. If one
operation fails, the entire transaction is rolled back.
Consistency: Ensures that the database moves from one valid state to another,
maintaining data integrity.
Isolation: Ensures that transactions are executed independently without interference from
other concurrent transactions.
Durability: Ensures that once a transaction is committed, its changes are permanent,
even in the event of a system failure.
10. What is a transaction in a database?
Answer: A transaction is a sequence of one or more database operations (such as insert, update,
delete) that are executed as a single unit. A transaction ensures that the database remains in a
consistent state before and after its execution. Transactions follow the ACID properties to ensure
reliability and consistency.
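A classic funds-transfer sketch (hypothetical table; keyword details vary by DBMS, e.g., MySQL uses START TRANSACTION):

```sql
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- COMMIT makes both updates permanent; ROLLBACK would undo both
COMMIT;
```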
11. What is a deadlock, and how can it be handled?
Answer: A deadlock occurs when two or more transactions are unable to proceed because each
one is waiting for the other to release locks on resources. To handle deadlocks:
Deadlock Detection: The DBMS periodically checks for deadlocks and aborts one or
more transactions to resolve the issue.
Deadlock Prevention: Using techniques like locking resources in a specific order can
prevent deadlocks.
Deadlock Avoidance: The DBMS can use timestamp-based schemes such as wait-die and
wound-wait to decide whether a transaction waits or restarts, so a circular wait never forms.
12. What is a stored procedure?
Answer: A stored procedure is a set of SQL queries or commands that are stored and executed
on the database server. Stored procedures can accept input parameters, perform operations, and
return output. They help with reusability, encapsulation of logic, and performance improvements
by reducing network traffic and re-compilation of queries.
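A MySQL-style sketch (procedure, parameter, and table names are hypothetical; SQL Server and Oracle use different syntax):

```sql
DELIMITER //
CREATE PROCEDURE raise_salary(IN p_employee_id INT, IN p_amount DECIMAL(10,2))
BEGIN
    -- Apply the raise to the selected employee
    UPDATE employees
    SET salary = salary + p_amount
    WHERE employee_id = p_employee_id;
END //
DELIMITER ;

-- Execute the stored procedure with input parameters
CALL raise_salary(101, 500.00);
```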
13. What is the difference between a view and a table?
Answer:
Table: A table is a collection of data organized in rows and columns. It physically stores
data.
View: A view is a virtual table that is based on the result of a SELECT query. It does not
store data itself but presents data from one or more tables. Views are used to simplify
complex queries, provide data security, and aggregate data.
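A sketch of a view over a hypothetical employees table; the view stores the query, not the rows:

```sql
CREATE VIEW high_earners AS
SELECT employee_id, name, salary
FROM employees
WHERE salary > 100000;

-- Queried exactly like a table
SELECT name FROM high_earners;
```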
14. What is the difference between backup and recovery?
Answer:
Backup: A backup is a copy of the database and its related files that is created to protect
data from loss. Types of backups include full, incremental, and differential backups.
Recovery: Recovery is the process of restoring a database from a backup to its original or
last consistent state. Recovery techniques include point-in-time recovery and transaction
log replay.
15. What is the difference between SQL and NoSQL databases?
Answer:
SQL Databases: SQL databases are relational databases that use structured query
language (SQL) for managing and querying data. They are best suited for structured data
and enforce ACID properties (e.g., MySQL, PostgreSQL, SQL Server).
NoSQL Databases: NoSQL databases are non-relational and are designed to handle
unstructured, semi-structured, or rapidly changing data. They are more flexible in terms
of schema and can scale horizontally (e.g., MongoDB, Cassandra, Redis).
16. What is replication in databases?
Answer: Replication is the process of copying data from one database server (the master) to one
or more other servers (slaves). This ensures data redundancy, improves read performance, and
provides high availability. Types of replication include master-slave, master-master, and peer-to-
peer replication.
17. What is sharding in databases?
Answer: Sharding is the process of dividing a large database into smaller, more manageable
pieces, called shards, each stored on a separate server or node (covered in detail in question 50).
18. What is a database schema?
Answer: A schema is a logical container in a database that defines how data is organized,
including tables, views, indexes, and relationships between them. It acts as a blueprint for the
structure of the database and ensures data integrity and consistency. A schema can also define
security policies, user access rights, and permissions.
19. What are the different types of JOINs in SQL?
Answer:
INNER JOIN: Returns records that have matching values in both tables.
LEFT JOIN (or LEFT OUTER JOIN): Returns all records from the left table and
matched records from the right table. If no match is found, NULL values are returned for
columns from the right table.
RIGHT JOIN (or RIGHT OUTER JOIN): Returns all records from the right table and
matched records from the left table. If no match is found, NULL values are returned for
columns from the left table.
FULL JOIN (or FULL OUTER JOIN): Returns records when there is a match in either
left or right table. It returns NULL values when there is no match.
CROSS JOIN: Returns the Cartesian product of both tables. Every row in the first table
is combined with every row in the second table.
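Two of these join types sketched against hypothetical employees and departments tables:

```sql
-- INNER JOIN: only employees that have a matching department
SELECT e.name, d.name AS department
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

-- LEFT JOIN: every employee; department columns are NULL when unmatched
SELECT e.name, d.name AS department
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
```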
20. What is ORM (Object-Relational Mapping)?
Answer: ORM is a programming technique used to interact with relational databases using
object-oriented programming languages. It allows developers to work with database records as
objects, which simplifies database interaction. ORM tools map database tables to classes,
columns to attributes, and rows to objects. Common ORMs include Hibernate (Java), Entity
Framework (.NET), and Django ORM (Python).
21. What is a trigger in a database?
Answer: A trigger is a set of SQL statements that automatically execute in response to certain
events on a table or view, such as INSERT, UPDATE, or DELETE operations. Triggers are used
to enforce business rules, maintain data integrity, and automate tasks such as auditing or logging.
There are BEFORE triggers (executed before the operation) and AFTER triggers (executed
after the operation).
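A MySQL-style sketch (assumes a hypothetical employees_audit table for logging; other DBMSs use different trigger syntax):

```sql
CREATE TRIGGER log_salary_change
AFTER UPDATE ON employees
FOR EACH ROW
-- Record the old and new salary every time a row is updated
INSERT INTO employees_audit (employee_id, old_salary, new_salary, changed_at)
VALUES (OLD.employee_id, OLD.salary, NEW.salary, NOW());
```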
22. What is a materialized view?
Answer: A materialized view is a precomputed view that stores the result of a query in a
database. Unlike a regular view, which is computed on demand, a materialized view stores the
result physically and periodically refreshes the data. Materialized views improve query
performance by providing fast access to aggregated data but may require more storage space.
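A PostgreSQL sketch (Oracle is similar; SQL Server uses indexed views instead; table names are hypothetical):

```sql
-- The query result is stored physically
CREATE MATERIALIZED VIEW product_sales AS
SELECT product_id, SUM(amount) AS total_amount
FROM sales
GROUP BY product_id;

-- Recompute the stored result on demand
REFRESH MATERIALIZED VIEW product_sales;
```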
23. What is database partitioning?
Answer: Database partitioning is the process of dividing a large database table into smaller,
more manageable pieces, known as partitions. Each partition is stored separately, but the data is
still logically part of the same table. Partitioning helps improve query performance, ease of
backup, and manageability. There are several types of partitioning:
Range Partitioning: Data is divided based on a range of values (e.g., date ranges).
List Partitioning: Data is divided based on a specific list of values (e.g., geographic
regions).
Hash Partitioning: Data is divided using a hash function to evenly distribute rows across
partitions.
Composite Partitioning: A combination of the above methods.
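A range-partitioning sketch in PostgreSQL syntax (table names are illustrative):

```sql
CREATE TABLE orders (
    order_id   BIGINT,
    order_date DATE NOT NULL
) PARTITION BY RANGE (order_date);

-- Each partition holds one year of rows
CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```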
24. What is database clustering?
Answer: Database clustering involves the use of multiple database servers (nodes) that act as a
single unified system. Clustering provides high availability, load balancing, and fault tolerance.
In a clustered database setup, data is replicated across multiple servers to ensure continued
availability in case of hardware failure. Clustering can be active-active (multiple nodes process
queries) or active-passive (one node processes queries, others are standby).
25. What is the difference between OLTP and OLAP?
Answer:
OLTP (Online Transaction Processing): OLTP systems are designed to handle a large
number of short, transactional queries such as insert, update, and delete. These systems
are optimized for high-speed data processing with real-time data. Examples include
banking systems and online shopping.
OLAP (Online Analytical Processing): OLAP systems are designed for querying and
analyzing large volumes of historical data. They focus on complex queries involving
aggregation, summarization, and data analysis. OLAP is typically used in business
intelligence applications like data warehouses.
26. How do you prevent deadlocks in a database?
Answer: A deadlock occurs when two or more transactions are unable to proceed because they
are each waiting for the other to release a resource (e.g., a lock on a row or table). To prevent
deadlocks:
Lock Ordering: Ensure that all transactions acquire locks in the same order.
Timeouts: Use timeouts so that transactions are rolled back if they take too long to
acquire a lock.
Deadlock Detection: Most DBMS (e.g., SQL Server, Oracle) have deadlock detection
algorithms that identify deadlocks and automatically resolve them by aborting one of the
transactions.
Minimize Locking Duration: Reduce the time a transaction holds a lock by committing
it as soon as possible.
27. What is a rollback segment?
Answer: A rollback segment (or undo log) is a part of a database that stores the old
values of data that has been modified during a transaction. This ensures that in case of a system
failure or transaction rollback, the database can revert to its previous state. It helps maintain
atomicity and consistency in the database by ensuring that incomplete transactions do not
corrupt the data.
28. What is a data warehouse?
Answer: A data warehouse is a central repository for storing large amounts of structured data,
typically from multiple sources, to support decision-making and business intelligence (BI)
activities. Data warehouses are optimized for read-heavy workloads and complex queries
involving aggregation and historical analysis. Data in a warehouse is often organized using an
OLAP model.
29. What is the difference between a full backup and an incremental backup?
Answer:
Full Backup: A full backup copies all the data in the database regardless of whether it
has changed since the last backup. This is the most comprehensive backup but is time-
consuming and takes up more storage space.
Incremental Backup: An incremental backup only copies data that has changed since
the last backup (either full or incremental). This is faster and requires less storage, but
restoring from incremental backups may take longer as you need the most recent full
backup and all subsequent incremental backups.
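In SQL Server, for example, the contrast can be sketched in T-SQL (database name and paths are hypothetical; SQL Server implements incremental-style backups as differential and transaction log backups):

```sql
-- Full backup: copies the entire database
BACKUP DATABASE SalesDB TO DISK = 'D:\backups\SalesDB_full.bak';

-- Differential backup: copies only pages changed since the last full backup
BACKUP DATABASE SalesDB TO DISK = 'D:\backups\SalesDB_diff.bak' WITH DIFFERENTIAL;
```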
30. What is a transaction log, and why is it important?
Answer: A transaction log records all the changes made to the database, including inserts,
updates, and deletes. It ensures that transactions are atomic (i.e., either fully committed or rolled
back), and it enables data recovery after a system crash or failure. The transaction log also
supports point-in-time recovery, enabling you to restore the database to a specific state.
31. What are database consistency and integrity constraints?
Answer:
Database Consistency: Refers to the state where all data in the database satisfies defined
rules, constraints, and relationships. For example, when a transaction is completed, the
database must transition from one valid state to another valid state, as defined by the
ACID properties.
Integrity Constraints: Rules that ensure the accuracy and reliability of data in the
database. Common integrity constraints include:
o Primary Key: Ensures uniqueness of a record.
o Foreign Key: Maintains referential integrity between tables.
o Check Constraints: Enforces domain integrity by limiting the range of valid
values for a column.
o Not Null: Ensures that a column cannot have NULL values.
32. What is the difference between a DDL and DML in SQL?
Answer:
DDL (Data Definition Language): DDL statements are used to define and modify the
structure of database objects such as tables, views, indexes, and schemas. Examples
include CREATE, ALTER, and DROP.
DML (Data Manipulation Language): DML statements are used to query and modify
data in the database. Examples include SELECT, INSERT, UPDATE, and DELETE.
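Side by side, in PostgreSQL/MySQL-style syntax (hypothetical table):

```sql
-- DDL: defines and changes structure
CREATE TABLE employees (employee_id INT PRIMARY KEY, name VARCHAR(100));
ALTER TABLE employees ADD COLUMN salary DECIMAL(10,2);

-- DML: reads and changes the data inside that structure
INSERT INTO employees (employee_id, name, salary) VALUES (1, 'Ada', 9000);
UPDATE employees SET salary = 9500 WHERE employee_id = 1;
SELECT * FROM employees;
```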
33. What is database migration?
Answer: Database migration is the process of moving data from one database to another, often
from an older version of a database to a newer version, or from one database platform to another
(e.g., from MySQL to PostgreSQL). Migration typically involves data export/import, schema
transformation, and ensuring that applications connected to the database are compatible with the
new version.
34. What is database concurrency, and how is it managed?
Answer: Database concurrency refers to the ability of a database to allow multiple users or
transactions to access or modify the data at the same time. Proper concurrency control ensures
that simultaneous access does not result in inconsistent or corrupted data. It is managed using
locks and isolation levels.
Locks: Used to control access to database resources during a transaction. Types include
shared locks (read) and exclusive locks (write).
Isolation Levels: Define the level of visibility each transaction has on data modified by
other transactions. Common isolation levels include:
o Read Uncommitted: Allows dirty reads (transactions can read uncommitted
changes).
o Read Committed: Prevents dirty reads, but non-repeatable reads are possible.
o Repeatable Read: Prevents dirty and non-repeatable reads, but phantom reads are
possible.
o Serializable: Provides the highest level of isolation, ensuring transactions are
executed serially, one at a time.
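Setting an isolation level, sketched in SQL Server-style syntax (MySQL and PostgreSQL use SET [SESSION] TRANSACTION instead; table name is hypothetical):

```sql
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;
-- Under REPEATABLE READ, re-reading the row returns the same value
-- even if another transaction commits a change in between
SELECT balance FROM accounts WHERE account_id = 1;
COMMIT;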
35. What are the benefits and drawbacks of using a distributed database system?
Answer: A distributed database is one where data is stored across multiple physical locations,
which may be spread across different servers or geographic regions.
Benefits:
o Scalability: Distributed systems can scale horizontally by adding more nodes to
distribute the load.
o High Availability: Data replication across multiple nodes ensures that the system
remains available even if one node fails.
o Fault Tolerance: If one server or database fails, others can take over to maintain
system uptime.
Drawbacks:
o Complexity: Managing a distributed database can be more complex than
managing a single centralized database, especially in terms of synchronization,
consistency, and partitioning.
o Latency: Data retrieval may be slower if nodes are geographically dispersed.
o Data Consistency: Ensuring strong consistency can be challenging in distributed
systems, especially in the case of network partitions (CAP Theorem).
36. What is the CAP Theorem?
Answer: CAP Theorem (also known as Brewer's Theorem) states that a distributed database
system can achieve at most two out of the following three properties:
Consistency: All nodes see the same data at the same time.
Availability: Every request receives a response.
Partition Tolerance: The system continues to operate despite network partitions.
A distributed database system must make trade-offs between these properties. For example,
during a network partition a system can remain available at the cost of consistency (AP) or
preserve consistency at the cost of availability (CP).
37. What is the difference between the UPDATE and MERGE statements?
Answer:
UPDATE: The UPDATE statement modifies existing records in a table based on certain
conditions. It is typically used for changing values of specific columns.
UPDATE employees SET salary = 5000 WHERE employee_id = 101;
MERGE: The MERGE statement (also known as "upsert") allows you to update or insert
data based on whether a matching record exists. It is useful when you need to perform an
insert if no record is found or update if a record already exists.
MERGE INTO employees AS target
USING new_employees AS source
ON target.employee_id = source.employee_id
WHEN MATCHED THEN
UPDATE SET target.salary = source.salary
WHEN NOT MATCHED THEN
INSERT (employee_id, salary) VALUES (source.employee_id,
source.salary);
38. Explain the differences between In-Memory Databases and Traditional Disk-
Based Databases.
Answer:
In-Memory Databases (IMDBs): Store data entirely in RAM, which provides faster data
access and processing times compared to disk-based databases.
o Advantages: High speed, low latency, ideal for real-time applications (e.g.,
Redis, Memcached).
o Disadvantages: Data loss in case of power failure (though some support
persistence options).
Disk-Based Databases: Store data on disk (e.g., hard drive, SSD), and rely on disk I/O
for reading and writing data.
o Advantages: Persistent data storage, supports large volumes of data.
o Disadvantages: Slower compared to in-memory due to I/O operations, especially
when dealing with large datasets.
39. What is a recursive query? How can you perform one in SQL?
Answer: A recursive query is a query that references itself, commonly used to deal with
hierarchical data (such as organization charts, bill-of-materials, or tree structures). In SQL,
recursive queries are typically written using Common Table Expressions (CTEs) with the WITH
clause.
Syntax Example (PostgreSQL; SQL Server uses WITH without the RECURSIVE keyword):
WITH RECURSIVE EmployeeHierarchy AS (
SELECT employee_id, manager_id, name
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.name
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM EmployeeHierarchy;
This query recursively retrieves all employees and their managers, starting from the top-
level managers.
40. What is the difference between SQL and PL/SQL?
Answer:
SQL (Structured Query Language) is used for querying, modifying, and managing data
in a relational database.
o SQL is declarative, meaning you describe what you want (e.g., SELECT * FROM
employees).
PL/SQL (Procedural Language/SQL) is Oracle’s procedural extension to SQL that
allows you to write complex scripts, including loops, conditionals, and exception
handling.
o PL/SQL is imperative, meaning you specify how to perform operations (e.g.,
using LOOP, IF, EXCEPTION).
Key differences: SQL is a declarative, statement-at-a-time language executed by the database
engine, while PL/SQL groups SQL statements inside procedural blocks with variables, control
flow, and exception handling, and is specific to Oracle.
41. What is replication, and what are the different types of replication?
Answer: Replication is the process of copying and maintaining database objects (such as tables
or entire databases) across multiple servers to improve availability and reliability.
Types of Replication:
o Master-Slave Replication: One server is the master (writes are performed here),
and the others are slaves (they receive copies of data from the master).
o Master-Master Replication: Multiple servers act as masters, meaning each
server can perform read and write operations, and changes are propagated to other
servers.
o Peer-to-Peer Replication: Each server in the network acts as both a master and a
slave, sharing data bidirectionally.
o Synchronous Replication: Changes to data are immediately propagated to
replicas before the transaction is considered complete.
o Asynchronous Replication: Changes are propagated to replicas after the
transaction is committed.
Answer: A database snapshot is a read-only, static view of the database at a particular point in
time. It captures the state of the data at that moment, and can be used for:
Backup purposes.
Reporting and analytics.
Point-in-time recovery.
Difference from a backup: Snapshots are typically more lightweight and faster to create
compared to full backups.
44. What is a database "ghost" record and how do you handle it?
Answer: A ghost record is a record that remains in the database after it has been logically
deleted. In databases with high concurrency or complex transaction management, these records
can sometimes remain visible to certain processes even though they have been marked as
deleted. Ghost records can impact performance by causing unnecessary disk space usage and
query performance degradation.
How to handle it: Most modern DBMS have built-in mechanisms to handle ghost
records. For example, in SQL Server, the Ghost Cleanup process periodically removes
such records.
45. What is database versioning, and why is it important?
Answer: Database versioning involves tracking and managing changes to a database schema or
structure over time. This is critical for ensuring that database changes are well-documented,
reversible, and consistent across different environments (development, staging, production).
Tools: Popular tools for database versioning include Flyway, Liquibase, and Redgate.
Importance: It helps manage migrations, automate the deployment of schema changes,
and ensures smooth updates to the database without breaking functionality.
47. What is an index, and how does it improve query performance in databases?
Answer: An index is a data structure that improves the speed of data retrieval operations on a
database table. It works similarly to an index in a book, which allows you to find information
quickly without reading every page. In databases, indexes are typically created on one or more
columns of a table.
How It Works: When you query a table with an indexed column, the database doesn't
need to scan the entire table to find matching rows. Instead, it uses the index to quickly
locate the data.
Types of Indexes:
o B-tree Index: The most common type of index, optimized for equality and range
queries.
o Hash Index: Used for equality comparisons (fast lookups).
o Composite Index: An index on multiple columns, useful for queries that filter by
multiple columns.
o Full-text Index: Specialized index for searching text data (used in search engines
or text-heavy applications).
Trade-offs: While indexes improve read performance, they can slow down write
operations (INSERT, UPDATE, DELETE) because the index also needs to be updated.
48. What is a deadlock, and how do you prevent and resolve it?
Answer: A deadlock occurs when two or more transactions are stuck in a cycle, each waiting for
a resource (such as a row or a table) that the other transaction holds a lock on. This results in a
situation where none of the transactions can proceed.
How Deadlocks Happen: For example, Transaction A locks Row 1 and waits for Row 2,
while Transaction B locks Row 2 and waits for Row 1. Since neither transaction can
proceed, a deadlock is created.
Prevention:
o Lock Ordering: Ensure that all transactions acquire locks in the same order to
avoid circular dependencies.
o Timeouts: Set time limits on transactions or queries to automatically roll back
transactions that wait too long.
o Lower Isolation Levels: Use lower isolation levels like Read Committed instead
of Serializable to reduce the chances of blocking.
Resolution:
o Deadlock Detection: Most modern database management systems, like SQL
Server, automatically detect deadlocks. When a deadlock is detected, the
database typically rolls back one of the transactions (usually the one with the least
work done) to resolve the deadlock.
49. What is the Write-Ahead Log (WAL), and how does it work?
Answer: The Write-Ahead Log (WAL) is a standard technique used by many database
management systems (DBMS) to ensure data integrity and recoverability in case of a system
failure. The core idea of WAL is that before any changes (like inserts, updates, or deletes) are
made to the actual database, the changes are first written to a log file.
How it Works:
o All modifications are logged sequentially in the WAL file.
o Once the WAL entries are written to disk, the changes are applied to the database.
o This ensures that in case of a crash, the database can be restored by replaying the
WAL entries, recovering any changes that were committed but not yet written to
the database.
Advantages:
o Crash Recovery: Ensures that no committed data is lost in case of an unexpected
failure.
o Consistency: Provides a consistent database state by ensuring that only fully
committed transactions are written to the database.
o Performance: WAL enables efficient sequential writes, which are faster than
random disk writes.
50. What is sharding in a database, and when would you use it?
Answer: Sharding is the process of dividing a large database into smaller, more manageable
pieces, called shards. Each shard contains a subset of the data and is stored on a separate server
or node. Sharding is a horizontal partitioning technique, meaning data is distributed across
multiple servers rather than increasing the capacity of a single server.
How it Works:
o Each shard holds a portion of the data based on some key, such as a range of IDs
or geographical location. For example, you might shard user data by region (e.g.,
US users on one shard, EU users on another).
o Queries are distributed to the appropriate shard based on the sharding key.
When to Use Sharding:
o Scalability: When a database becomes too large to fit on a single server, sharding
allows you to scale horizontally by adding more servers.
o High Traffic: For applications with large amounts of traffic or data (e.g., social
media platforms, large e-commerce websites).
o Improved Performance: By distributing the load across multiple nodes, you
reduce the chance of any single server becoming a bottleneck.
Challenges:
o Complexity: Sharding introduces complexity in terms of data management,
querying, and ensuring consistency across shards.
o Cross-Shard Queries: Queries that need to access data from multiple shards can
be challenging to optimize and may involve additional network overhead.
51. What is an ACID transaction? Can you explain each of the ACID properties
in detail?
Answer: ACID stands for Atomicity, Consistency, Isolation, and Durability, which are the
four key properties that guarantee reliable processing of database transactions.
Atomicity: Ensures that a transaction is treated as a single unit, meaning all operations
within the transaction either fully complete or have no effect (i.e., they are "all or
nothing").
Consistency: Guarantees that a transaction brings the database from one valid state to
another, preserving all integrity constraints (e.g., foreign keys, check constraints).
Isolation: Defines how the changes made by one transaction are isolated from other
transactions. Higher isolation levels prevent dirty reads, non-repeatable reads, and
phantom reads.
Durability: Ensures that once a transaction is committed, its effects are permanent, even
in the case of a system crash.
52. Explain what a "Query Execution Plan" is and how it helps in optimizing
SQL queries.
Answer: A Query Execution Plan (or Execution Plan) is a roadmap of how the database
engine will execute a given SQL query. It shows the steps taken to retrieve the requested data,
including:
Access Methods: How the database will access the data (e.g., table scan, index scan).
Join Methods: How the database will join multiple tables (e.g., nested loops, hash join).
Sort Operations: Whether the query will require sorting the result set.
Indexes Used: The indexes that will be used to optimize query performance.
The execution plan can highlight inefficient operations like full table scans or improper
use of indexes, enabling DBAs to rewrite queries or create indexes that optimize
performance.
By examining the plan, DBAs can also identify costly operations like sorts, joins, and
aggregations that can be optimized.
Most databases, such as SQL Server, Oracle, and PostgreSQL, provide ways to view the
execution plan (e.g., EXPLAIN in PostgreSQL or EXPLAIN PLAN in Oracle).
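In PostgreSQL, for example (table name is illustrative):

```sql
-- Show the estimated plan without executing the query
EXPLAIN SELECT * FROM employees WHERE last_name = 'Smith';

-- Execute the query and report actual row counts and timings
EXPLAIN ANALYZE SELECT * FROM employees WHERE last_name = 'Smith';
```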
53. What is the difference between a clustered index and a non-clustered index?
Answer:
Clustered Index:
o A clustered index determines the physical order of data in the table. There can be
only one clustered index per table because the data can only be sorted in one
order.
o The data rows themselves are stored in the leaf nodes of the clustered index.
o It's used primarily for range queries (e.g., BETWEEN, <, >) because data is stored in
a sorted manner.
o Example: A primary key is typically implemented as a clustered index.
Non-clustered Index:
o A non-clustered index is a separate structure from the actual data. It contains a
copy of the indexed columns along with pointers to the actual data rows.
o A table can have multiple non-clustered indexes.
o Non-clustered indexes are beneficial for lookups on non-primary columns or
when you need efficient querying for specific values that are not the primary key.
Key Difference: A clustered index directly affects the storage order of the table's data, while a
non-clustered index is an additional structure separate from the data.
54. What is database normalization, and why is it important? Can you explain
the different normal forms?
Answer: Database normalization is the process of organizing data into related tables to
minimize redundancy and dependency. It is important because it:
Helps reduce data redundancy (duplicate data) and ensures data integrity.
Minimizes the risk of anomalies (e.g., update, insert, delete anomalies).
Normal Forms:
1. 1st Normal Form (1NF): Ensures that each column contains atomic values (i.e., no
repeating groups or arrays), and each record is unique.
2. 2nd Normal Form (2NF): Achieved when the table is in 1NF and all non-key attributes
are fully functionally dependent on the entire primary key (removes partial dependency).
3. 3rd Normal Form (3NF): Achieved when the table is in 2NF, and no transitive
dependencies exist (non-key attributes should not depend on other non-key attributes).
4. Boyce-Codd Normal Form (BCNF): A stricter version of 3NF where every determinant
is a candidate key.
5. 4th Normal Form (4NF): Ensures no multi-valued dependencies exist.
6. 5th Normal Form (5NF): Ensures no join dependency exists, meaning the data cannot
be further decomposed without losing information.
Trade-Offs: While normalization improves integrity, it can lead to performance issues due to
increased joins. Denormalization (a controlled process of combining tables) may be applied in
certain cases to improve performance.
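A 3NF sketch with hypothetical columns, moving the transitive dependency dept_name -> dept_location out of the employees table:

```sql
-- Before (not in 3NF): employees(employee_id, name, dept_name, dept_location)
-- dept_location depends on dept_name, not on the key employee_id

-- After (3NF): the department facts live in their own table
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    dept_name     VARCHAR(100),
    dept_location VARCHAR(100)
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    name          VARCHAR(100),
    department_id INT REFERENCES departments (department_id)
);
```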
55. What is a Write-Behind Cache, and how does it work?
Answer: A Write-Behind Cache is a caching mechanism that stores changes to data in the
cache and delays the writing of those changes to the database until a later time. This is also
referred to as Lazy Write.
How it works:
When a change is made to data, it is first written to the cache rather than immediately
writing it to the database.
The changes in the cache are periodically written back (flushed) to the database in
batches.
Benefits:
o Performance improvement: By reducing the frequency of writes to the database,
it minimizes database write overhead, especially for high-volume applications.
o Reduced Latency: Since the data is written to the cache quickly, the application
can continue processing without waiting for the database write operation to
complete.
Drawbacks:
o Data Consistency: There's a risk that the cache might become stale or
inconsistent with the database if the cache is not properly flushed.
o Risk of data loss: In case of a system crash before the cache is written to the
database, uncommitted changes could be lost.
Example: Redis can be used as a write-behind cache, storing changes temporarily before they
are persisted to a relational database.
56. What is a Bloom Filter, and when would you use it in a database?
Answer: A Bloom Filter is a probabilistic data structure used to test whether an element is a
member of a set. It provides a fast, memory-efficient way to check for membership but with the
possibility of false positives (it may incorrectly tell you that an element is present when it's not).
However, it will never return a false negative.
How it works:
The Bloom Filter uses multiple hash functions to map an element to several positions in a
bit array. If an element is inserted into the filter, its corresponding bit positions are set to
1.
To check if an element is in the set, the same hash functions are applied, and if all
corresponding bits are set to 1, the element is considered to be in the set (with a
possibility of a false positive).
When to use it:
Large datasets: When you need to check membership of a large dataset without storing
the entire set, such as when filtering queries to avoid unnecessary database access.
No false negatives: Useful when you can tolerate false positives but not false negatives
(e.g., in web caching systems, distributed databases like Cassandra or HBase).
Example: Bloom Filters are used in storage engines such as HBase and Cassandra to check
whether a row might exist in an on-disk file (HFile/SSTable) without loading the entire block
into memory, reducing disk I/O.
57. What is the "CAP Theorem" and how does it affect database design?
Answer: The CAP Theorem (also known as Brewer's Theorem) states that a distributed
database system can achieve at most two out of three properties: Consistency, Availability, and
Partition Tolerance.
Consistency: All nodes in the system see the same data at the same time.
Availability: Every request (read or write) will receive a response, whether the data is
consistent or not.
Partition Tolerance: The system will continue to function, even if network partitions
(communication breakdown between nodes) occur.