Fast Multi-Version ETL Pipelines in Python with Generators and functools

Track:
Data Engineering and MLOps
Type:
Talk
Level:
Intermediate
Duration:
30 minutes

Abstract

Modern data systems rarely stay unchanged. Schemas evolve, search indices are rebuilt, and for a period multiple versions of the same dataset must be kept in sync. A typical example is a zero-downtime migration, where several versions of the same data must be synchronised in parallel while the system remains live. This creates a subtle but important challenge: how to keep each version consistent without duplicating extraction work or increasing database load.

In this talk, I’ll describe a production ETL architecture built in Python that processes multiple data versions in parallel using a single streaming pipeline. The system synchronises data from PostgreSQL into OpenSearch, keeps each version independently consistent, and guarantees that no version ever moves backwards — while querying the database only once per batch. The talk is based on a real production system and explains the design decisions and tradeoffs behind it.

The design is based on generator pipelines and functional composition using the functools module. Instead of relying on threads or async frameworks, the ETL flow is expressed as a sequence of small, composable functions: page extraction, DTO normalisation, version-aware filtering, transformation, bulk loading, and dead-letter handling. The reference implementation uses Django as the ORM layer and Celery for orchestration, but the core design is not framework-specific and can be applied equally with SQLAlchemy or raw SQL.
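To make the idea concrete, here is a minimal sketch of the style of composition the talk describes: generator stages chained with `functools.reduce`. The stage names (`normalise`, `drop_deleted`) and the `compose` helper are illustrative assumptions, not the production code.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# A stage consumes a stream of records and lazily yields a stream.
Stage = Callable[[Iterable[dict]], Iterator[dict]]

def compose(*stages: Stage) -> Stage:
    """Chain generator stages left to right using functools.reduce."""
    def pipeline(rows: Iterable[dict]) -> Iterator[dict]:
        return reduce(lambda stream, stage: stage(stream), stages, iter(rows))
    return pipeline

def normalise(rows: Iterable[dict]) -> Iterator[dict]:
    # Hypothetical DTO normalisation step.
    for row in rows:
        yield {**row, "name": row["name"].strip().lower()}

def drop_deleted(rows: Iterable[dict]) -> Iterator[dict]:
    # Hypothetical filtering step.
    for row in rows:
        if not row.get("deleted"):
            yield row

etl = compose(normalise, drop_deleted)
result = list(etl([{"name": "  Ada "}, {"name": "Bob", "deleted": True}]))
# result == [{"name": "ada"}]
```

Because every stage is a plain generator function, each one can be unit-tested in isolation and the whole pipeline stays lazy and memory-efficient.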

I’ll show how this design makes it possible to:

  1. Synchronise multiple versions efficiently without duplicate database queries
  2. Process large datasets in a streaming, memory-efficient way
  3. Build extensible pipelines from protocol-defined functional stages
  4. Maintain a clear separation of concerns with strong typing and isolated tests
  5. Handle failures safely using bulk retries and dead-letter queues
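Point 1 can be sketched as a fan-out over a single extracted batch: each version keeps its own checkpoint and only advances it forward, so the database is read once while no version ever moves backwards. The names (`sync_versions`, `updated_at`, the integer checkpoints) are assumptions for illustration only.

```python
from typing import Iterator

def extract_batch(rows: list[dict]) -> Iterator[dict]:
    # Stands in for a single paged PostgreSQL query per batch.
    yield from rows

def sync_versions(rows: list[dict], checkpoints: dict[str, int]) -> dict[str, list[dict]]:
    """Fan one extracted batch out to every version.

    Each version filters by its own checkpoint, and checkpoints only
    move forward (monotonic), so no version regresses.
    """
    loaded: dict[str, list[dict]] = {v: [] for v in checkpoints}
    for row in extract_batch(rows):
        for version, last_seen in checkpoints.items():
            if row["updated_at"] > last_seen:  # version-aware filter
                loaded[version].append(row)
    for version, batch in loaded.items():
        if batch:
            # Advance the checkpoint monotonically.
            checkpoints[version] = max(r["updated_at"] for r in batch)
    return loaded

checkpoints = {"v1": 0, "v2": 5}
rows = [{"id": 1, "updated_at": 3}, {"id": 2, "updated_at": 7}]
out = sync_versions(rows, checkpoints)
# v1 receives both rows; v2 only sees the row newer than its checkpoint.
```

The key property is that the extraction loop runs once per batch regardless of how many versions are live, which is exactly the duplicate-query saving described above.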

Attendees will leave with concrete patterns for building fast and maintainable ETL pipelines in Python, and with a clearer understanding of how generators and functional composition can be used to model complex data flows — borrowing ideas from Go-style concurrency while staying entirely within the Python ecosystem. While the examples focus on ETL pipelines, the patterns discussed apply to any Python system that processes large streams of data and needs to balance performance, correctness, and extensibility.

Audience: Intermediate to advanced Python developers. Familiarity with generators and basic ETL concepts is helpful; interest in functional design patterns and backend data systems will be beneficial.