Building a Streamlit app on data residing in a data lakehouse using Delta Lake and DuckDB

Abhishek Raviprasad
5 min read · Apr 28, 2023


In today’s data-driven world, businesses are generating and collecting massive amounts of data from various sources. However, managing and analyzing this data is becoming increasingly challenging as it is spread across many systems and platforms. Data lakehouses have emerged as a solution to this problem, providing a unified platform to store, manage, and analyze large amounts of data.

In this article, we will explore how to build a Streamlit app on data residing in a data lakehouse using Delta Lake and DuckDB. Delta Lake is an open-source storage layer that brings reliability to data lakes, while DuckDB is a lightweight, in-process analytical SQL database that can query data stored in various formats.

We will start by introducing Delta Lake and DuckDB and how they can be used to manage and analyze data stored in a data lakehouse. Then, we will dive into the details of building a Streamlit app that queries data from the lakehouse using DuckDB and presents it interactively with Streamlit.

By the end of this article, you will have a better understanding of how to leverage Delta Lake and DuckDB to build scalable and efficient applications on top of your data lakehouse. Whether you are a data analyst or a developer, this article will give you practical insights into managing and analyzing data in a lakehouse.

Project

This project is designed to run on your local laptop/computer without the need for Spark, data catalogs, or cloud storage such as S3, GCS, or ADLS. For production use cases, however, please use Spark and cloud storage.

In this project we will use delta-rs, a native Rust library for Delta Lake with Python bindings, which removes the need for Spark to interact with data in the Delta format.

The diagram below depicts the architecture used in the project.

Install the following modules before starting the project (pandas and duckdb are also used in the code below):

  1. deltalake
  2. streamlit
  3. plotly
  4. duckdb
  5. pandas

How to ingest data?

For this project I have used the below dataset from Kaggle (https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset).
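The dataset is a collection of CSV files; the app code below relies on columns such as name, main_category, sub_category, and discount_price.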

To read the CSV files and write them to a local directory in the Delta format, we can use the following code snippet:

import glob
import os

import pandas as pd
from deltalake.writer import write_deltalake

# Read all CSV files from the data directory into a single DataFrame
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('../data_dir/', "*.csv"))))
# Or read only selected CSV files:
# df = pd.read_csv("csv_file_path")

# Append the DataFrame to the target path in Delta format
write_deltalake('../target_path', df, mode="append")

The data will now be written to the target directory in the Delta format.
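To quickly sanity-check the write, you can open the table back up with the deltalake Python API. The following is a minimal sketch, assuming the ../target_path used above:

from deltalake import DeltaTable

dt = DeltaTable('../target_path')
print(dt.version())          # latest table version (0 after the first write)
print(dt.schema())           # schema inferred from the pandas DataFrame
print(len(dt.files()))       # number of parquet files backing the table
print(dt.to_pandas().shape)  # rows and columns in the full table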

The Delta format offers several advantages for managing data in a data lakehouse:

  1. ACID Transactions: Delta supports ACID transactions, which ensure that data is consistently stored and protected in the data lakehouse. This feature makes it easier to manage complex data pipelines and ensure data integrity.
  2. Schema Enforcement: Delta enforces schema validation during write operations, ensuring that only data conforming to the specified schema is added to the data lakehouse. This helps maintain data consistency and reduces data quality issues.
  3. Time Travel: Delta supports time travel, allowing users to query the table as it existed at any earlier version or point in time. This makes it easier to track how data changes over time and to recover from bad writes (see the sketch after this list).
  4. High Performance: Delta keeps file-level statistics in its transaction log, enabling data skipping so that queries read only the files they need. This allows users to query and process data quickly for both batch and streaming workloads.
  5. Open Source: Delta is an open-source format, which means that it is freely available and can be customized to meet specific data management needs. This makes it a flexible and cost-effective option for managing data in a data lakehouse.
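As a quick illustration of time travel with the deltalake Python bindings, here is a minimal sketch, assuming the ../target_path table from earlier and that it has at least one commit:

from deltalake import DeltaTable

dt = DeltaTable('../target_path')
print(dt.version())   # current version of the table
print(dt.history())   # commit history as a list of dicts

# Load an older snapshot of the table (version 0, i.e. the first write)
dt_v0 = DeltaTable('../target_path', version=0)
df_v0 = dt_v0.to_pandas()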

Read the data

Now let's see how we can read the data from the Streamlit application and build a few quick reports.

import streamlit as st
from deltalake import DeltaTable
import plotly.express as px

# Set up Streamlit app
st.set_page_config(page_title="Delta Streamlit Example", layout="wide")

# Read the data stored in Delta format and convert it into a pandas DataFrame
st.subheader('Enter your :blue[Delta] Table Path')
path = st.text_input('', '../target_path')
dt = DeltaTable(path)

pd_table = dt.to_pandas()
# Do some basic transformations as needed
pd_table['company'] = pd_table['name'].apply(lambda x: x.split()[0])
# Strip the currency symbol and thousands separators; guard against missing (non-string) values
pd_table['selling_price'] = pd_table['discount_price'].apply(
    lambda x: float(x.replace("₹", "").replace(",", "")) if isinstance(x, str) else 0)

st.write(pd_table)
st.write(f"Number of rows: {pd_table.shape[0]}")
st.markdown("<h3></h3>", unsafe_allow_html=True)

# Plot a few basic charts
category_count = pd_table.groupby(['main_category', 'sub_category']).size().reset_index()
category_count.columns = ['main_category', 'sub_category', 'Count']
fig1 = px.bar(category_count, x='main_category', y='Count', color='sub_category')
st.plotly_chart(fig1, use_container_width=True)

category_sum = pd_table.groupby(['main_category'])['selling_price'].sum()
st.write(category_sum)

Use the code snippet below to take a custom query as input from the end user and display the results in the app:

# Use duckdb: register the pyarrow table as a temporary table to run the custom queries passed by the user
import duckdb

conn = duckdb.connect()
dataset = dt.to_pyarrow_table()
conn.register("temp_table", dataset)

st.subheader('Enter your :orange[Duck DB] Query. Table name registered is temp_table')
query = st.text_input('', '')
if query == '':
    print("no value entered")
else:
    duck_val = conn.execute(query).df()
    st.write(duck_val)
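For example, a user could type an aggregation query against temp_table into the text box. The same pattern also works outside Streamlit; here is a minimal sketch, assuming the Delta table and columns from the Kaggle dataset above:

import duckdb
from deltalake import DeltaTable

conn = duckdb.connect()
conn.register("temp_table", DeltaTable('../target_path').to_pyarrow_table())

# Count products per main category, largest categories first
result = conn.execute("""
    SELECT main_category, COUNT(*) AS products
    FROM temp_table
    GROUP BY main_category
    ORDER BY products DESC
""").df()
print(result)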

The resulting app will look like this:

That brings us to the end of this article. The main purpose of this project was to show how quickly we can build data applications on top of a lakehouse architecture without needing expertise in HTML, CSS, React, or other frontend technologies. Data engineers can focus on the data itself and use tools like Streamlit to host and deploy the frontend applications.

Please reach out with any questions or suggestions.



Abhishek Raviprasad

Senior Solution Engineer at Infoworks.io with 4+ years of big data/ETL/data warehouse experience building data pipelines.