How to Version and Reproduce Machine Learning Experiments

Ask most teams to rerun a model they trained six months ago and reproduce its exact metrics, and you will watch confidence drain from the room. The code has moved on, the training data has been overwritten, a dependency has silently bumped a minor version, and the one engineer who remembered the magic hyperparameters has left. Reproducible machine learning is the discipline that closes this gap: the ability to take a recorded experiment and regenerate the same model, or at least an equivalent one within known tolerances, from first principles. It is not a nice-to-have for research hygiene. It is the foundation on which debugging, auditing, regulatory readiness, and safe iteration all depend.

The reason reproducibility is so hard in machine learning is that a model is not just code. It is the joint product of code, data, configuration, environment, and a chain of stochastic operations, any one of which can drift. Traditional software engineering solved the code half of this problem decades ago with version control, but ML adds three volatile inputs that Git alone cannot handle: large datasets that do not belong in a repository, non-deterministic hardware and libraries, and an explosion of experiment runs that need to be compared. This article lays out a concrete, tooling-agnostic approach to ML versioning, experiment tracking and model reproducibility that you can adopt incrementally, starting with the changes that buy you the most reliability for the least effort.

Define what reproducibility actually means for your team

Before buying tools or writing pipelines, agree on which flavour of reproducibility you are targeting, because they cost very different amounts. The strictest is bitwise reproducibility, where rerunning an experiment produces byte-identical model weights. This is usually achievable on CPU with fixed seeds and pinned libraries, but on GPU it is often impractical: floating-point addition is not associative, so parallel reductions that accumulate in a different order from run to run give slightly different results, and library kernels change between versions. A more realistic target for most teams is statistical reproducibility: rerunning the pipeline yields metrics within a small, agreed tolerance, and the same modelling decisions can be defended and traced.

There is also a distinction between reproducibility and replicability. Reproducibility means you can regenerate a result from the same recorded inputs; replicability means someone else can arrive at a comparable result from a fresh implementation of the same method. Most engineering effort should go into reproducibility first, because it is a prerequisite for trustworthy debugging and it is entirely within your control. Replicability is a research-grade concern that follows naturally once your inputs are properly versioned.

Write this target down explicitly. A one-paragraph reproducibility policy that says, for example, 'validation metrics must be reproducible within plus or minus 0.5 percent from the recorded commit, data snapshot and config' gives your team a testable definition. Without it, people argue about whether a run 'reproduced' when they are really disagreeing about tolerance.
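To make such a policy checkable, the tolerance can be written as a small predicate. This is only a sketch: the name `within_tolerance` is hypothetical, and it assumes the 0.5 percent is read as a relative tolerance against the recorded metric, which your own policy may define differently.

```python
def within_tolerance(recorded: float, rerun: float, rel_tol: float = 0.005) -> bool:
    """True when the rerun metric lies within rel_tol of the recorded metric.

    Assumption: the tolerance is relative and measured against the recorded value.
    """
    return abs(rerun - recorded) <= rel_tol * abs(recorded)


# A recorded validation accuracy of 0.900 allows reruns of roughly 0.8955 to 0.9045.
inside = within_tolerance(0.900, 0.903)   # difference 0.003 is inside the band
outside = within_tolerance(0.900, 0.910)  # difference 0.010 is outside the band
```

Stating whether the tolerance is relative or absolute in the policy itself removes one more source of argument.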

Version the four inputs, not just the code

Every reproducible run is a function of four inputs, and all four must be pinned. First is code: the model definition, preprocessing, training loop and evaluation, all committed to version control with a specific commit hash recorded for the run. Second is data: the exact snapshot of training, validation and test sets used. Third is configuration: hyperparameters, feature lists, split logic, random seeds and any thresholds. Fourth is environment: the interpreter version, library versions, hardware type and relevant driver or accelerator versions.

The classic failure is teams that version only the first and part of the third while treating data and environment as ambient. A model trained on 'the current production table' is unreproducible the moment that table changes, even if every line of code is committed. Similarly, an experiment that ran under an unpinned dependency graph can behave differently after a routine upgrade because a default changed in a preprocessing function. Reproducibility is only as strong as the least-controlled of these four inputs.

A useful mental model is that a run should be uniquely identified by a tuple of (code hash, data hash, config hash, environment hash). If you can compute and store those four fingerprints for every run, you have the addressing scheme you need. Everything else in this article is about how to capture each one cheaply and how to reconstruct a run from the tuple.
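One possible way to represent that tuple in code is sketched below. The class name `RunFingerprint`, the helper `hash_config` and the choice of SHA-256 over canonical JSON are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
from dataclasses import dataclass


def hash_config(config: dict) -> str:
    """Hash a config via canonical JSON so key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunFingerprint:
    """Hypothetical container for the four fingerprints that address a run."""
    code_hash: str    # for example a Git commit hash
    data_hash: str    # for example the hash of a data manifest
    config_hash: str  # for example hash_config(config)
    env_hash: str     # for example a container image digest

    def run_id(self) -> str:
        """Short identifier derived from all four fingerprints."""
        joined = "\n".join(
            [self.code_hash, self.data_hash, self.config_hash, self.env_hash]
        )
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]
```

Because the identifier is derived from all four fields, changing any single input yields a different run identifier.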

Treat data as a first-class versioned artefact

Data is the input most teams handle worst, because datasets are too large for a code repository and change continuously. The pattern that scales is content-addressed storage: store immutable data snapshots in object storage on a cloud platform, and keep only lightweight pointers, hashes and metadata in Git. Data-versioning tools formalise this by letting you commit a reference to a dataset version alongside your code, so checking out a commit also resolves the exact data it was trained on.

For streaming or continuously updated sources, freeze a snapshot at experiment time rather than querying live. A practical technique is to materialise the query result to an immutable, timestamped location and record its hash in the run metadata. For very large corpora used to train foundation models, full duplication is wasteful, so prefer immutable append-only stores plus a manifest of file hashes that defines the exact slice used. Either way, the invariant is the same: a run must reference a snapshot that can never silently change underneath it.
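The manifest idea can be sketched with the standard library alone. The function names here are hypothetical, and the sketch assumes the snapshot is a local directory; object storage would need its own listing and read calls.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content hash."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {p.relative_to(root).as_posix(): file_sha256(p) for p in files}


def manifest_hash(manifest: dict[str, str]) -> str:
    """Collapse the manifest into one hash that identifies the exact slice."""
    digest = hashlib.sha256()
    for rel_path in sorted(manifest):
        digest.update(f"{rel_path}\0{manifest[rel_path]}\n".encode("utf-8"))
    return digest.hexdigest()
```

Recording the output of `manifest_hash` in the run metadata gives a single value that changes whenever any file in the slice is added, removed or modified.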

Do not forget the transformations between raw data and model input. Feature engineering, tokenisation vocabularies, normalisation statistics and train/validation/test split boundaries are all part of the data lineage. Store the fitted preprocessing artefacts, such as scalers and encoders, as versioned outputs, and always compute split assignments from a seeded, deterministic function of a stable identifier rather than a random shuffle that changes when rows are added.
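A deterministic split of this kind might look like the following sketch. The function name, the `split-v1` seed string and the 80/10/10 proportions are assumptions chosen for illustration.

```python
import hashlib


def assign_split(record_id: str, seed: str = "split-v1",
                 val_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a record to a split from a hash of its stable identifier.

    The assignment depends only on the seed and the identifier, so adding
    or removing other rows never moves an existing record between splits.
    """
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Changing the seed string produces a fresh, equally reproducible assignment, so the seed belongs in the run config alongside the other hyperparameters.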

Adopt experiment tracking before it becomes unavoidable

Experiment tracking is the practice of logging every run's inputs, metrics and artefacts to a central store so runs can be compared and retrieved. The minimum you should capture per run is the code commit, the data snapshot reference, the full config, the environment fingerprint, the metrics over time, and the output model artefact with its own hash. Most experiment-tracking tools give you this with a few lines of instrumentation, and even a disciplined structured-logging convention into a database is far better than nothing.

The payoff of good ML experiment management is not just reproducibility but velocity. When every run is queryable, you can answer 'which configuration gave the best validation score last quarter' or 'what changed between the run that worked and the one that regressed' in seconds rather than through archaeology. The comparison view across runs is where teams catch subtle regressions, such as a metric improving on paper while a fairness or latency constraint quietly degrades.

Instrument tracking automatically rather than by hand. Manual logging is forgotten exactly when it matters, during a late-night debugging push. Wire the capture of commit hash, config and environment into your training entry point so that it is impossible to launch a run without recording them. The goal is that reproducibility is the default behaviour of your pipeline, not a checklist someone has to remember.
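One way to make capture unavoidable is to wrap the training entry point in a decorator, as in the sketch below. The names `tracked` and `current_commit` and the `run.json` layout are hypothetical, and a real setup would more likely send the record to an experiment-tracking store than to a local file.

```python
import functools
import json
import platform
import subprocess
import time
from pathlib import Path


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' outside a repository."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def tracked(run_dir: Path):
    """Decorator factory: the wrapped training function always leaves a record."""
    def decorator(train_fn):
        @functools.wraps(train_fn)
        def wrapper(config: dict) -> dict:
            run_dir.mkdir(parents=True, exist_ok=True)
            record = {
                "commit": current_commit(),
                "config": config,
                "python": platform.python_version(),
                "platform": platform.platform(),
                "started_at": time.time(),
            }
            record_path = run_dir / "run.json"
            # Write the inputs before training so a crashed run is still recorded.
            record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
            record["metrics"] = train_fn(config)
            record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
            return record["metrics"]
        return wrapper
    return decorator
```

Decorating the training function with `@tracked(Path("runs/some-run"))` means nobody has to remember to log anything. A stricter variant could refuse to start when the commit is unknown or the working tree has uncommitted changes.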

Control randomness and the environment

Machine learning is saturated with stochasticity: weight initialisation, data shuffling, dropout, augmentation, and non-deterministic accelerator kernels. To make runs comparable, set and record seeds for every random number generator in play, including the language runtime, the numerical libraries and the framework, and pass the seed through config so it is captured with the run. Be aware that a single global seed is often insufficient because data-loading worker processes maintain their own generator state, which must be seeded explicitly.
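The seeding pattern can be sketched with the standard library. The helper names are hypothetical; NumPy is seeded only if it happens to be installed, and a deep-learning framework would additionally need its own seeding call, which is left out here.

```python
import hashlib
import random


def derive_seed(base_seed: int, label: str) -> int:
    """Derive an independent 32-bit seed for a named consumer of randomness."""
    digest = hashlib.sha256(f"{base_seed}:{label}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def seed_everything(base_seed: int) -> None:
    """Seed the generators this sketch knows about from one recorded seed."""
    random.seed(base_seed)
    try:
        import numpy as np  # optional dependency, seeded only when available
    except ImportError:
        return
    np.random.seed(derive_seed(base_seed, "numpy"))


def worker_rng(base_seed: int, worker_id: int) -> random.Random:
    """Give each data-loading worker its own explicitly seeded generator."""
    return random.Random(derive_seed(base_seed, f"worker-{worker_id}"))
```

Deriving per-worker seeds from the base seed keeps a single value in the config while still giving every worker a distinct, repeatable stream.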

For deterministic execution on accelerators, most frameworks offer a deterministic mode that swaps non-deterministic kernels for reproducible ones, usually at some cost to throughput. This is a genuine trade-off: enable strict determinism for experiments where exact comparison matters, such as ablations or debugging a regression, and accept faster non-deterministic training for large exploratory runs where you only need statistical reproducibility. Document which mode a run used, because it changes how tightly you can expect results to match.

Pin the environment as aggressively as you pin code. Use a lockfile that captures exact transitive dependency versions, and prefer containerised training so the interpreter, system libraries and accelerator toolchain travel with the code. Record the hardware type in run metadata, since the same code can yield different numerics across accelerator generations. A container image digest is an excellent environment fingerprint because it collapses the entire software stack into a single immutable hash.
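Where a container digest is not available, a coarser fingerprint can be computed from inside the interpreter. The sketch below is an illustrative fallback with hypothetical function names; it sees only Python-level packages, not system libraries or accelerator drivers.

```python
import hashlib
import json
import platform
from importlib import metadata


def installed_packages() -> list[str]:
    """List installed distributions as sorted 'name==version' pins."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name.lower()}=={dist.version}")
    return sorted(pins)


def environment_fingerprint() -> dict:
    """Describe the Python environment and hash the description."""
    details = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": installed_packages(),
    }
    canonical = json.dumps(details, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**details, "hash": digest}
```

Storing the full package list alongside the hash is what makes a mismatch diagnosable: the hash tells you something changed, and the list tells you what.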

Make the pipeline reconstructable, not just recorded

Recording inputs is necessary but not sufficient; you also need the ability to act on that record and actually rebuild a run. This is where declarative pipelines earn their keep. Express your workflow as versioned stages with explicit inputs and outputs, so that reconstructing an old experiment is a matter of checking out the commit, resolving its data and environment, and replaying the stages. Pipelines that encode dependencies also give you free caching: unchanged stages need not rerun, which lowers the cost of reproducibility.
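The caching behaviour falls out of keying each stage on its inputs. The following sketch shows the idea with hypothetical names (`stage_key`, `run_stage`) and JSON-serialisable stage outputs; real pipeline tools track files and dependencies in far more detail.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def stage_key(name: str, params: dict, input_hashes: list[str]) -> str:
    """Identify a stage execution by its name, parameters and input hashes."""
    payload = json.dumps(
        {"stage": name, "params": params, "inputs": input_hashes}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(name: str, params: dict, input_hashes: list[str],
              compute: Callable[[dict], dict], cache_dir: Path) -> tuple[dict, bool]:
    """Run a stage, or reuse its cached output when nothing it depends on changed.

    Returns the stage output and a flag saying whether the cache was hit.
    """
    key = stage_key(name, params, input_hashes)
    out_path = cache_dir / f"{name}-{key[:16]}.json"
    if out_path.exists():
        return json.loads(out_path.read_text()), True
    result = compute(params)
    cache_dir.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(result, sort_keys=True))
    return result, False
```

Because the key covers parameters and upstream hashes, replaying an old experiment reruns only the stages whose inputs are not already in the cache.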

Package the model together with everything needed to interpret it. A bare weights file is close to useless months later if you cannot recover the preprocessing, the input schema and the config that produced it. Adopt a model-packaging convention that bundles the artefact with its metadata, dependency requirements and a reference back to the originating run. This is what turns a stored file into a reproducible, deployable model, and it is the bridge between experimentation and production serving.
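A packaging convention can be as simple as a directory holding the weights next to a metadata file. The layout and the names `package_model` and `verify_package` below are assumptions for illustration, not an established format.

```python
import hashlib
import json
import shutil
from pathlib import Path


def package_model(weights: Path, out_dir: Path, run_id: str,
                  input_schema: dict, requirements: list[str]) -> dict:
    """Bundle a weights file with the metadata needed to interpret it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    packaged = out_dir / weights.name
    shutil.copyfile(weights, packaged)
    metadata = {
        "weights_file": packaged.name,
        # read_bytes is fine for a sketch; stream the file for large models
        "weights_sha256": hashlib.sha256(packaged.read_bytes()).hexdigest(),
        "run_id": run_id,  # reference back to the originating run
        "input_schema": input_schema,
        "requirements": sorted(requirements),
    }
    (out_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2, sort_keys=True)
    )
    return metadata


def verify_package(out_dir: Path) -> bool:
    """Check that the packaged weights still match the recorded hash."""
    metadata = json.loads((out_dir / "metadata.json").read_text())
    weights_path = out_dir / metadata["weights_file"]
    actual = hashlib.sha256(weights_path.read_bytes()).hexdigest()
    return actual == metadata["weights_sha256"]
```

Fitted preprocessing artefacts would be copied into the same directory and hashed in the same way.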

Test reproducibility the way you test code. Add a periodic job that takes a recorded run, replays it from its fingerprints, and asserts the metrics fall within your agreed tolerance. This reproducibility regression test is the only reliable way to know your policy still holds, because pipelines rot silently as dependencies and infrastructure evolve. A green reproducibility check is a far stronger guarantee than a folder of good intentions.
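The periodic job reduces to replaying a run and comparing every recorded metric. In this sketch the record layout (`run_id`, `fingerprint`, `metrics`) and the `replay` callable are hypothetical stand-ins for your own tracking store and pipeline.

```python
from typing import Callable


def metric_drift(recorded: dict[str, float], replayed: dict[str, float],
                 rel_tol: float) -> dict[str, float]:
    """Return the metrics whose relative drift exceeds rel_tol.

    A metric missing from the replay counts as infinite drift.
    """
    drifted = {}
    for name, expected in recorded.items():
        if name not in replayed:
            drifted[name] = float("inf")
            continue
        scale = abs(expected) or 1.0  # use absolute drift when expected is 0
        drift = abs(replayed[name] - expected) / scale
        if drift > rel_tol:
            drifted[name] = drift
    return drifted


def assert_reproduces(record: dict,
                      replay: Callable[[dict], dict[str, float]],
                      rel_tol: float = 0.005) -> None:
    """Replay a recorded run from its fingerprints and fail on drift."""
    replayed = replay(record["fingerprint"])
    drifted = metric_drift(record["metrics"], replayed, rel_tol)
    if drifted:
        raise AssertionError(
            f"run {record['run_id']} failed to reproduce: {drifted}"
        )
```

Reporting which metrics drifted, rather than a bare pass or fail, makes it easier to tell a numerical wobble from a broken data or environment reference.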

Extend the discipline to foundation models and agents

Teams increasingly build on top of large language models rather than training from scratch, and reproducibility takes a different shape here. You may not control the underlying model weights, so the versioned inputs become the exact model identifier and version, the sampling parameters such as temperature, the full prompt templates, the retrieval context, and any tool or function definitions. Pin the served model version explicitly, because a provider silently updating a model behind the same name will change your outputs without any change on your side.
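These inputs can be pinned in one immutable object and fingerprinted like any other config. The class below is a hypothetical sketch that calls no provider API; the field names are assumptions, and the model identifier is a made-up placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationSpec:
    """Hypothetical record of everything that shapes a model call."""
    model_id: str         # exact, dated identifier rather than a floating alias
    temperature: float
    top_p: float
    max_tokens: int
    prompt_template: str
    tools: tuple = ()     # JSON-serialisable tool or function definitions

    def fingerprint(self) -> str:
        """Hash the full spec so any change yields a new fingerprint."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def render(self, **variables: str) -> str:
        """Fill the prompt template with the variables for one request."""
        return self.prompt_template.format(**variables)


spec = GenerationSpec(
    model_id="example-model-2025-01-01",  # placeholder, not a real model name
    temperature=0.0,
    top_p=1.0,
    max_tokens=256,
    prompt_template="Summarise the following text:\n{text}",
)
```

Storing `spec.fingerprint()` with each logged output ties every response to the exact model version, sampling settings and prompt that produced it.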

For systems built with an agent framework or retrieval over vector databases, the sources of variation multiply: retrieved documents change as the index is updated, and sampling introduces run-to-run variance even at low temperature. Version the index snapshot and embedding model alongside the prompts, and set sampling to its most deterministic setting when you need to compare behaviour. Log full input-output traces so that a surprising result can be replayed and inspected exactly, which is the agentic equivalent of an experiment record.
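A trace log for such a system can be an append-only file of JSON records. The field names in this sketch are assumptions chosen to mirror the inputs listed above, and the file-based store stands in for whatever tracing backend you use.

```python
import json
import time
from pathlib import Path


def log_trace(path: Path, *, spec_hash: str, index_snapshot: str,
              embedding_model: str, retrieved_ids: list[str],
              prompt: str, output: str) -> None:
    """Append one full input-output trace as a line of JSON."""
    record = {
        "logged_at": time.time(),
        "spec_hash": spec_hash,            # fingerprint of model and sampling config
        "index_snapshot": index_snapshot,  # version of the retrieval index
        "embedding_model": embedding_model,
        "retrieved_ids": retrieved_ids,    # documents that were actually retrieved
        "prompt": prompt,
        "output": output,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_traces(path: Path) -> list[dict]:
    """Read all traces back for replay or inspection."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Logging the retrieved document identifiers, not just the query, is what allows a surprising answer to be replayed after the index has moved on.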

Evaluation itself needs versioning in this world. If your quality metric comes from a model-graded evaluation, the grading model and its prompt are part of your experiment and must be pinned, or your benchmark drifts underneath you. Treat the evaluation harness with the same rigour as the system under test. Practitioners wrestling with exactly these questions can compare notes with peers, vendors and investors at the World AI Technology Expo Dubai, held 17 to 19 November 2026 at the Millennium Airport Hotel, Dubai, where reproducibility and evaluation of production AI are recurring themes.

Roll it out incrementally without stalling the team

You do not need to build the perfect platform before getting value, and attempting to will bury the effort under its own weight. Start with the highest-leverage move: automatically capture the code commit, config and environment fingerprint on every run, and store metrics centrally. This alone eliminates the most common class of 'we cannot tell what produced this model' failures and takes a day or two to instrument.

Layer in data versioning next, because it is the input most likely to betray you and the one that unlocks true end-to-end reproducibility. Then add environment pinning through containers and lockfiles, and finally introduce declarative pipelines and a reproducibility regression test once the basics are habitual. Sequencing matters: each layer is useful on its own, and stacking them in this order means you always have a working, incrementally more reliable system rather than a half-finished migration.

Finally, treat reproducibility as a cultural default rather than a tax. The teams that sustain it are the ones where launching an untracked run feels wrong, where reviewing an experiment includes checking its record, and where the cost of reproducibility has been engineered down to near zero through automation. Get there and reproducibility stops being a heroic recovery effort and becomes the quiet property that makes everything else, from debugging to audits to confident iteration, dramatically easier.

Key takeaways

A reproducible run is a function of four versioned inputs: code, data, configuration and environment. Pinning only some of them leaves the whole run unreproducible.
Decide upfront whether you need bitwise or statistical reproducibility, and write down an explicit tolerance so 'it reproduced' becomes a testable claim.
Store data as immutable, content-addressed snapshots with hashes in Git, and version preprocessing artefacts and split logic as part of data lineage.
Automate experiment tracking at the training entry point so commit, config and environment are captured by default, never by memory.
Control randomness with recorded seeds and choose deterministic accelerator modes deliberately, trading throughput for exact comparison only when it matters.
For foundation models and agents, version the served model identifier, sampling parameters, prompts, retrieval index and the evaluation grader, since providers can change behaviour silently.

Frequently asked questions

What is the difference between reproducibility and replicability?

Reproducibility means regenerating the same result from the same recorded inputs, such as the identical code, data snapshot and configuration. Replicability means an independent implementation of the same method arrives at a comparable result. Engineering teams should prioritise reproducibility first, because it is fully within their control and is a prerequisite for trustworthy debugging and auditing.

Why do GPU training runs differ even with fixed seeds?

Floating-point addition is not associative, and GPU training relies on parallel reductions whose order of operations can vary from run to run; library kernels also change between versions. Even with fixed seeds this produces tiny numerical differences that compound. Enable your framework's deterministic mode for exact comparison, or target statistical reproducibility within a tolerance for large training runs.

What should be recorded for every experiment run?

At a minimum, capture the code commit hash, the data snapshot reference, the full configuration including seeds, an environment fingerprint such as a container image digest, the metrics over time, and the output model artefact with its own hash. Wire this capture into your training entry point so it happens automatically on every run rather than depending on someone remembering to log it.

How should large datasets be versioned?

Use content-addressed storage: keep immutable data snapshots in object storage on a cloud platform and commit only lightweight pointers, hashes and metadata to Git. Data-versioning tools let a code commit resolve the exact dataset version it used. For continuously updated sources, freeze a timestamped snapshot at experiment time instead of querying live data.

How do you keep systems built on hosted foundation models reproducible?

Version the inputs you do control: the exact served model identifier and version, sampling parameters, prompt templates, retrieval context and any tool definitions. Pin the model version explicitly, because providers can update a model behind the same name and change your outputs. Also version your evaluation grader and prompts, since a model-graded metric drifts if its grading model changes.