Why Coding Agents Fail at ML (and How to Fix It)
- Track: Machine Learning, NLP and CV
- Type: Poster
- Level: Intermediate
- Duration: 60 minutes
Abstract
Over the last year, LLM-based coding agents have matured to the point where they can autonomously navigate codebases, edit files, run tests, and iterate on solutions with minimal human input. Many Python teams have started applying these agents to machine learning projects, where the volume of repetitive experimental work makes automation appealing.
However, executing code and optimizing models are not the same problem. In software engineering, success is often local and binary: a feature works or it does not, a bug is reproduced or fixed. In ML, code correctness is only a prerequisite. Progress is determined by measured model behavior across repeated experiments, and that measurement only happens after training and evaluation are complete. This distinction creates a coordination gap: coding agents can generate and modify training code, but without external structure they drift over long experiment horizons. Teams encounter recurring failure modes: multiple coupled changes in a single step make results unattributable, and agents lose context and rediscover already-tested hypotheses in slightly different wording.
This poster presents an architecture that addresses this gap by pairing coding agents with a deterministic (non-LLM) orchestrator. The orchestrator manages experimentation as a tree search: it starts from a reproducible baseline, samples hypothesis-driven modifications constrained to a single aspect per step, evaluates each modification through a fixed entrypoint that returns multiple metrics, and decides which branches to expand or prune based on statistical evidence. Each experiment runs in an isolated git worktree, so every branch has a clean file state, an independent diff, and a separate log. An anti-repetition memory tracks previously attempted hypotheses across the tree to prevent the agent from regenerating equivalent ideas. When stopping criteria are met, an integration stage combines winning branches and evaluates the result.
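As a concrete illustration, the sketch below shows how the orchestrator's core mechanics might look in Python. It is a minimal sketch, not the presented implementation: the Node layout, the evaluate.py entrypoint with its --json contract, and the word-level normalization backing the anti-repetition memory are all illustrative assumptions.

```python
import hashlib
import json
import subprocess
from dataclasses import dataclass, field

@dataclass
class Node:
    """One experiment in the search tree: its own branch, worktree, and log."""
    branch: str
    worktree: str
    hypothesis: str
    metrics: dict = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)

def normalize(hypothesis: str) -> str:
    """Crude canonical form so rephrased duplicates hash to the same key."""
    return " ".join(sorted(hypothesis.lower().split()))

class Orchestrator:
    """Deterministic (non-LLM) controller wrapped around the coding agent."""

    def __init__(self, repo: str, entrypoint: str = "evaluate.py"):
        self.repo = repo
        self.entrypoint = entrypoint
        self.tried: set[str] = set()  # anti-repetition memory, shared tree-wide

    def spawn(self, parent: Node, hypothesis: str) -> Node | None:
        """Open a child experiment, rejecting already-attempted hypotheses."""
        key = hashlib.sha256(normalize(hypothesis).encode()).hexdigest()
        if key in self.tried:  # an equivalent idea was tested elsewhere in the tree
            return None
        self.tried.add(key)
        branch = f"exp/{key[:8]}"
        worktree = f"worktrees/{key[:8]}"
        # Isolated file state: each experiment gets a fresh branch + worktree,
        # so every node has a clean diff against its parent.
        subprocess.run(
            ["git", "-C", self.repo, "worktree", "add", "-b", branch,
             worktree, parent.branch],
            check=True,
        )
        node = Node(branch=branch, worktree=worktree, hypothesis=hypothesis)
        parent.children.append(node)
        return node

    def evaluate(self, node: Node) -> dict:
        """Fixed entrypoint: the agent may edit training code freely, but
        metrics always come from the same command and the same JSON contract."""
        out = subprocess.run(
            ["python", self.entrypoint, "--json"],
            cwd=node.worktree, check=True, capture_output=True, text=True,
        )
        node.metrics = json.loads(out.stdout)
        return node.metrics

# One expansion step (agent.propose / agent.apply are hypothetical calls
# into the coding agent):
#   for hypothesis in agent.propose(best):        # one aspect per hypothesis
#       child = orch.spawn(best, hypothesis)
#       if child is not None:
#           agent.apply(child.worktree, hypothesis)  # agent edits files
#           orch.evaluate(child)                     # then expand or prune
```

The design choice this sketch is meant to highlight: the loop around these primitives is plain deterministic Python. The LLM proposes hypotheses and edits files inside a worktree, while branching, evaluation, pruning, and memory live outside it.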
The poster is aimed at ML engineers and data scientists who have experimented with (or are considering) coding agents in their workflows. Attendees will learn how to use coding agents for ML tasks and how to structure experiments as an LLM-searchable trajectory with deterministic controls rather than as a sequence of ad-hoc edits.