Still Stuck With Pandas? Discover DuckDB
by Sarthak Shah
Introduction
Pandas has been the go-to library for Python data manipulation for years. It’s
user-friendly, versatile, and powerful. However, as datasets grow larger,
Pandas starts showing its limitations, particularly in terms of performance
and memory usage. But what if I told you there’s a library that’s up to 20x
faster and just as easy to integrate into your Python workflows?
What is DuckDB?
DuckDB is an in-process SQL database management system tailored for
Online Analytical Processing (OLAP) workloads. It's a powerful tool for
running analytical queries on structured data, offering seamless integration
with Python's Pandas library and supporting file formats such as CSV
and Parquet.
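For instance, DuckDB can run SQL directly against a file on disk without loading it into Pandas first. Here is a minimal sketch, assuming a hypothetical Parquet file named events.parquet:
import duckdb
# Query the Parquet file in place; DuckDB only reads the data the query needs
result = duckdb.query("SELECT COUNT(*) AS n_rows FROM 'events.parquet'").fetchdf()
print(result)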
1. Blazing Speed
A vectorized engine optimized for columnar data.
2. Seamless Integration
Directly supports Pandas DataFrames.
3. Effortless CSV Import
Built-in readers such as read_csv_auto simplify importing CSV files directly into your database,
ensuring quick and efficient data loading.
4. Cross-Platform Compatibility
Runs on Windows, macOS, and Linux.
1. Installation
pip install duckdb
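A quick way to confirm the install (and see which version you got) is to import the package and print its version string:
import duckdb
# Print the installed DuckDB version
print(duckdb.__version__)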
2. Loading Data
From a CSV File:
import duckdb
# connect() with no argument creates an in-memory database
conn = duckdb.connect()
# read_csv_auto infers column names and types from the CSV file
conn.execute("CREATE TABLE cars AS SELECT * FROM read_csv_auto('cars.csv');")
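Once the table exists, you can query it like any other SQL table. A small sketch that simply counts the rows of the cars table created above:
# fetchdf() returns the query result as a Pandas DataFrame
row_count = conn.execute("SELECT COUNT(*) AS n FROM cars").fetchdf()
print(row_count)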
From a Pandas DataFrame:
import pandas as pd
import duckdb
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
# Local DataFrames can be referenced by name straight from SQL
result = duckdb.query("SELECT * FROM df WHERE A > 1").fetchdf()
print(result)
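If you prefer working through an explicit connection rather than the module-level query helper, you can register the DataFrame as a named view first (the view name pandas_df below is arbitrary):
conn = duckdb.connect()
# Expose the DataFrame to this connection under an explicit view name
conn.register("pandas_df", df)
result = conn.execute("SELECT * FROM pandas_df WHERE A > 1").fetchdf()
print(result)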
3. In-Memory Mode
For blazing-fast operations, use DuckDB’s in-memory mode:
conn = duckdb.connect(":memory:")
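If you instead pass a file path, DuckDB persists the database on disk so tables survive between sessions (my_analysis.duckdb is just an example file name):
# Persistent, on-disk database
conn = duckdb.connect("my_analysis.duckdb")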
Code Example
import pandas as pd
import numpy as np
import time
import duckdb

# Sample dataset: 100 million rows of random values
n_rows = 100_000_000
df = pd.DataFrame({
    "A": np.random.rand(n_rows),
    "B": np.random.rand(n_rows),
    "C": np.random.rand(n_rows) * 100,
})

# Pandas Operation
start_time = time.time()
result_pandas = df[df['C'] > 50]['A'].mean()
end_time = time.time()
pandas_time = end_time - start_time
print(f"Pandas Result: {result_pandas}, Time taken: {pandas_time:.6f} seconds")

# DuckDB Operation
st = time.time()
result_duckdb = duckdb.query("SELECT AVG(A) FROM df WHERE C > 50").fetchall()[0][0]
et = time.time()
duckdb_time = et - st
print(f"DuckDB Result: {result_duckdb}, Time taken: {duckdb_time:.6f} seconds")

# Performance Comparison
print(f"DuckDB is {pandas_time / duckdb_time:.2f} times faster than Pandas!")
Results
For a dataset of 100 million rows, DuckDB returns the same average in a fraction of the time Pandas needs, consistent with the speedups of up to 20x mentioned in the introduction.
Key Takeaways
DuckDB is a game-changer for large-scale data analysis in Python.
Its SQL-first approach and in-memory mode make it ideal for fast and
scalable operations.
Conclusion
DuckDB is not just a faster alternative to Pandas; it’s a powerful tool for
modern data analysis. Whether you’re working with small datasets or
handling large-scale operations, DuckDB’s speed, flexibility, and SQL-based
simplicity make it a must-have in your Python toolkit.
So, are you ready to embrace the future of data analysis and leave Pandas’
performance limitations behind? Give DuckDB a try and experience the
difference firsthand!