Abhishek Raviprasad
4 min readFeb 3, 2023

--

In this article lets see how DuckDB is a game changer in analytical world and how it helps data scientists and analysts in doing quick computations.

What is DuckDB ?

DuckDB is an open-source, relational database management system optimized for analytical workloads. It uses a columnar storage layout, vectorized query execution, and advanced query optimization techniques to provide high performance for SQL queries and analytical workloads. DuckDB supports a subset of SQL and is designed to be used as an embedded database, embedded within an application, rather than as a standalone database server.

DuckDB is faster, runs in your laptop and very much useful for analysing CSV files, Parquet files etc. This is very fast compared to traditional pandas and numpy modules.

DuckDB vs Pandas

Lets see how easy and faster DuckDB is compared to Pandas.

I have a directory full of sales data files under data directory,

Inorder to read these CSV files and list top 10 rows of the table/dataframe in pandas we need to do the following,

# with pandas
current_time = time.time()
df = pd.concat([pd.read_csv(file) for file in glob.glob('data/*.csv')])
print(f"time: {(time.time() - current_time)}")
print(df.head(10))

From the output below we can see that the above step took 0.216 seconds to read all the CSV files and print the top 10 rows of the dataframe.

Now lets use DuckDB and see how to achieve the same task,

# with duckdb
conn = duckdb.connect() # create an in-memory database
cur_time = time.time()
df = conn.execute("""
SELECT *
FROM 'data/*.csv'
LIMIT 10
""").df()
print(f"time: {(time.time() - cur_time)}")
print(df)

From the output below we can see that the above step took 0.04 seconds to read all the CSV files and print the top 10 rows of the table. This is almost 10 times faster than pandas.

How to read CSV file with additional parameters in Duckdb?

In this section let us see how to read CSV files by passing parameters like headers, delimiter etc.

df = conn.execute("""
SELECT *
FROM read_csv_auto('data/*.csv', header=True, delim=',')
""").df()
print(df)

Another example of how to create a table by reading the csv files,

df = conn.execute("""
CREATE TABLE Sales
AS SELECT * FROM read_csv_auto('data/*.csv', header=True, delim=',')
""").df()
print(df)

More examples on CSV import can be found here: https://duckdb.org/docs/data/csv

How to export a table to a CSV file ?

To export the data from a table to a CSV file, use the COPY statement.

COPY tbl TO ‘output.csv’ (HEADER, DELIMITER ‘,’);

How to read a parquet file and register it as table?

Let us see how we can read the parquet file, register it as a temp table and run few operations on it.

df = conn.execute("""
SELECT * FROM read_parquet('sample.parquet')
""").df()
# print(df)

conn.register("temp_table", df)
print(conn.execute("DESCRIBE temp_table").df())
print(conn.execute("SELECT COUNT(*) FROM temp_table").df())

Results can be seen below,

How to exclude columns with DuckDB?

We can use EXCLUDE key word in DuckDB to exclude columns while selecting the table.

df = conn.execute("""
SELECT * EXCLUDE(Product)
FROM read_csv_auto('data/*.csv', header=True, delim=',')
""").df()
print(df)

What is Column Expression in DuckDB?

Suppose you want to calculate minimum value of all the columns except some columns, the query becomes as simple as below.

df = conn.execute("""
SELECT
MIN(COLUMNS(* EXCLUDE(Product)))
FROM read_csv_auto('data/*.csv', header=True, delim=',')
""").df()
print(df)

Instead of us writing min of each and every column the syntax is very simple in DuckDB.

There are many otehr friendlier SQL functions like SELECT * REPLACE, GROUP BY ALL etc.
Refer: https://duckdb.org/2022/05/04/friendlier-sql.html

Conclusion

In conclusion, DuckDB is an SQL database management system that focuses on providing high performance for analytical queries while still supporting transactional workloads. It is designed to work efficiently on both single nodes and large-scale distributed systems, making it a versatile and scalable choice for a variety of use cases. The combination of its columnar storage, vectorized query execution, and support for advanced query optimization techniques makes DuckDB well-suited for data analysis and business intelligence applications. Additionally, its compatibility with SQL and the use of standard SQL interfaces makes it easily accessible to a wide range of users, regardless of their prior experience with databases. Overall, DuckDB is a highly performant and flexible database management system that offers many benefits for data-driven organizations.

Reference

  1. https://www.kaggle.com/datasets/knightbearr/sales-product-data
  2. https://duckdb.org/

--

--

Abhishek Raviprasad

Senior Solution Engineer at Infoworks.io, 4+ years of big data/ETL data warehouse experience building data pipelines