DuckLake - Take Python and DuckDB for a swim in your data lake

Abstract

Pitch

With DuckDB and DuckLake, managing and analyzing huge data sets no longer requires a complex cloud infrastructure setup. You can now run these tasks on your own laptop, at comparable speeds. This talk will show you how.

Description

DuckDB is an embedded relational analytics (OLAP) database that can be added to a Python project with a simple uv add duckdb or pip install duckdb. It is both fast and powerful for analytical data warehouse workloads, and its SQL dialect closely follows the well-known PostgreSQL dialect. Data can be kept in memory or persisted on disk. DuckDB integrates well with Polars via zero-copy Apache Arrow data structures, making it a great choice for complex data science and engineering tasks.

DuckLake is an extension for DuckDB that adds data lake features, meaning that huge data sets can be managed using Parquet files stored on disk or in an object store such as S3. It takes a novel approach to data lakes in that the management structures are stored in a database (such as DuckDB) instead of the complex file and directory structures that many other data lake systems use. This provides great advantages for implementing smart features such as snapshots, schema evolution and time travel.

Again, installation of the extension is just a simple INSTALL ducklake command away, making this a really easy way to configure your own personal "lake house" - the ideal combination of a data warehouse with a data lake.

The talk will give a short introduction to the relevant database terminology, explain what is novel about the DuckLake approach and then showcase a typical use case for lake houses: storing historical weather data and making it available to Python applications for analytics.

Both DuckDB and DuckLake are MIT licensed.

Resources: