The hardest test suite I ever built - a pytest case study
- Track: Testing, Quality Assurance, Security
- Type: Talk
- Level: Advanced
- Duration: 30 minutes
Abstract
For years, this real-time video system had no tests. Every change produced unpredictable side effects. Accuracy drifted. Production incidents and escalations followed. The only “verification” was manual inspection and hope.
When I joined the project, this was the reality - and building a proper integration test suite became my first priority.
In this talk, I’ll share how I designed and evolved the hardest integration test suite of my career using pytest - and kept it readable.
The system processed live streams in production and was non-deterministic: individual detections were only 80–90% accurate. For testing, we replayed recorded scenarios to make system behavior observable and comparable across runs. But binary pass/fail assertions were not enough: a single failed event did not mean the whole system was broken, yet we needed a way to measure when it actually was.
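To make that concrete, here is a minimal sketch of a tolerance-based assertion; `replay()` and the recording names are hypothetical stand-ins, not the production API:

```python
import pytest

MIN_ACCURACY = 0.80  # assumed floor, matching the 80-90% range above

def replay(recording):
    # Hypothetical stand-in for the real replay harness: it would feed the
    # recorded stream through the pipeline and count detections.
    return 17, 20  # (detected, expected)

@pytest.mark.parametrize("recording", ["lobby_cam", "parking_lot"])
def test_recording_accuracy(recording):
    detected, expected = replay(recording)
    accuracy = detected / expected
    # Assert on the aggregate rate, not on every individual event:
    # one missed detection must not fail the run; a real drop must.
    assert accuracy >= MIN_ACCURACY, f"{recording}: {accuracy:.0%} below floor"
```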
Instead of writing one massive test, I built a layered architecture (sketched after the list):
- dual parametrization - recording scope and event scope
- orchestration in fixtures - assertions in tiny, single-purpose tests
- statistics collection during execution
- end-of-run aggregation that summarizes system accuracy
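A minimal sketch of how these layers can fit together; the scenario table, the `run_pipeline()` harness, and the event names are illustrative assumptions, not the real suite:

```python
import pytest

# Hypothetical scenario table: recording -> expected events.
SCENARIOS = {
    "lobby_cam": ["person_enters", "person_exits"],
    "parking_lot": ["car_arrives", "car_leaves"],
}

def run_pipeline(recording):
    # Stand-in for the real replay harness; returns the set of events
    # the pipeline actually detected in the recording.
    return set(SCENARIOS[recording])

def pytest_generate_tests(metafunc):
    # Dual parametrization: the outer axis is the recording, the inner
    # axis is each expected event within it. pytest expands this into
    # one tiny test per (recording, event) pair.
    if {"recording", "event"} <= set(metafunc.fixturenames):
        metafunc.parametrize(
            ("recording", "event"),
            [
                pytest.param(rec, ev, id=f"{rec}-{ev}")
                for rec, events in SCENARIOS.items()
                for ev in events
            ],
        )

@pytest.fixture(scope="session")
def replayed():
    # Orchestration lives in the fixture: each recording is replayed at
    # most once per session and the result is shared across its tests.
    cache = {}
    def _get(recording):
        if recording not in cache:
            cache[recording] = run_pipeline(recording)
        return cache[recording]
    return _get

def test_event_detected(replayed, recording, event):
    # Single-purpose assertion: was this one expected event observed?
    assert event in replayed(recording)
```

Keeping orchestration in the fixture means a failing assertion points at exactly one event in one recording, which is what keeps the suite readable as it grows.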
The result was a suite that could catch regressions from model changes and produce reproducible evidence - HTML reports, structured dumps, and a summary statistics file.
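As a sketch of that end-of-run aggregation, the `conftest.py` hooks below collect per-recording statistics during execution and dump a summary when the session ends; the grouping key and the file name are assumptions, not the original format:

```python
import json
from collections import defaultdict

import pytest

_stats = defaultdict(lambda: {"passed": 0, "failed": 0})

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    # Collect statistics while tests execute, grouped by recording.
    outcome = yield
    report = outcome.get_result()
    if report.when == "call":
        # Assumed test id layout "test_name[recording-event]".
        recording = item.name.split("[")[-1].split("-")[0]
        _stats[recording]["passed" if report.passed else "failed"] += 1

def pytest_sessionfinish(session, exitstatus):
    # Summarize per-recording accuracy once the whole run is over and
    # write it next to the HTML report (file name is illustrative).
    summary = {
        rec: {**counts, "accuracy": counts["passed"] / sum(counts.values())}
        for rec, counts in _stats.items()
    }
    with open("accuracy_summary.json", "w") as fh:
        json.dump(summary, fh, indent=2)
```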
This talk explores how far pytest can be stretched beyond unit tests - into a framework for architecting complex integration test suites.