Clockwork claims to save millions in wasted GPU compute

  • Clockwork.io is launching TorchPass, a fault tolerance product that uses live GPU migration to prevent AI training jobs from crashing and restarting when hardware fails
  • The Stanford spinout claims TorchPass completes failure recovery in under two minutes versus hours lost to traditional checkpoint restarts
  • The company has raised over $40 million and counts Nscale, DCAI, Nebius and others as customers

Anyone who has run a large AI training job knows the sinking feeling: a single GPU fails, the job crashes and hours of work vanish. The training run rolls back to the last checkpoint — which might be 30 minutes or two hours old — wasting expensive GPU time.

Clockwork.io, a Palo Alto startup spun out of Stanford University, is launching a product Tuesday designed to drastically reduce that pain. TorchPass, generally available March 11, uses what Clockwork calls “Live GPU Migration” to transparently move training workloads from failed GPUs to spares, keeping the job running rather than forcing a restart.

The financial pitch is direct. For a typical 2,048-GPU H200 deployment, Clockwork says TorchPass delivers over $6 million in annual savings by eliminating GPU hours lost to failure-driven restarts. The company puts current GPU cluster utilization at 30–50% of theoretical performance, with failures being a leading cause of the gap. This pain is particularly acute as GPU shortages drive up prices.
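The shape of that savings estimate is easy to model. The sketch below uses hypothetical failure rates, checkpoint intervals, and GPU pricing — Clockwork has not published its assumptions, so every parameter here is illustrative:

```python
# Back-of-envelope model of GPU-hours wasted by failure-driven restarts.
# All parameters are illustrative assumptions, not Clockwork's figures.

GPUS = 2048                  # cluster size from the article
GPU_HOUR_COST = 4.00         # assumed $/GPU-hour for an H200-class part
FAILURES_PER_WEEK = 3        # assumed hardware failures on a cluster this size
CHECKPOINT_INTERVAL_H = 1.0  # assumed time between checkpoints
RESTART_OVERHEAD_H = 0.5     # assumed time to detect, reload, and resume

def annual_restart_waste() -> float:
    # On average a failure discards half a checkpoint interval of progress,
    # plus the fixed restart overhead, across every GPU in the job.
    lost_hours_per_failure = CHECKPOINT_INTERVAL_H / 2 + RESTART_OVERHEAD_H
    gpu_hours_lost = lost_hours_per_failure * GPUS
    failures_per_year = FAILURES_PER_WEEK * 52
    return failures_per_year * gpu_hours_lost * GPU_HOUR_COST

print(f"${annual_restart_waste():,.0f} per year")
```

Under these toy numbers the waste is on the order of $1.3 million a year; higher failure rates, longer checkpoint intervals, or steeper GPU pricing would push the figure toward the $6 million Clockwork cites.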

Failure is not an option

The industry sees such failures as a constant. But Clockwork CEO Suresh Vasudevan disagrees. “We built TorchPass to fundamentally reject that premise,” he said in the company’s press release. “Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload.”

Clockwork was founded in 2018. Its core technology, nanosecond-level clock synchronization, was originally developed for network reconstruction research at Stanford. That precision timing underlies the company's observability layer, which gives TorchPass early warning signals — rising temperatures, ECC memory errors — that allow it to migrate workloads proactively before a GPU fails outright.
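The proactive-migration idea can be sketched as a simple decision loop over per-GPU health telemetry. This is an illustration of the general pattern, not Clockwork's implementation; the signal names and thresholds are assumptions:

```python
# Illustrative sketch of early-warning-driven migration (not Clockwork's code):
# poll health signals and flag GPUs for migration before they fail outright.

from dataclasses import dataclass

@dataclass
class GpuHealth:
    gpu_id: int
    temp_c: float              # current temperature, degrees C
    ecc_errors_per_min: float  # corrected-memory-error rate

# Assumed thresholds; a real system would tune these per hardware generation.
TEMP_LIMIT_C = 90.0
ECC_LIMIT = 5.0

def should_migrate(h: GpuHealth) -> bool:
    """Flag a GPU whose early-warning signals suggest imminent failure."""
    return h.temp_c >= TEMP_LIMIT_C or h.ecc_errors_per_min >= ECC_LIMIT

def plan_migrations(fleet: list[GpuHealth], spares: list[int]) -> list[tuple[int, int]]:
    """Pair each at-risk GPU with a spare GPU, while spares last."""
    at_risk = [h.gpu_id for h in fleet if should_migrate(h)]
    return list(zip(at_risk, spares))
```

The interesting engineering is in what this sketch leaves out: moving the workload's GPU state to the spare transparently, mid-training, without the job noticing — which is the part Clockwork brands as Live GPU Migration.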

TorchPass requires half the recovery time of checkpoint recovery in benchmarks, Dan Zheng, Clockwork co-founder and chief business officer, said in an interview with Fierce Network. And PyTorch's TorchFT, an open-source alternative from Meta, fared worse still in Clockwork's tests, taking about 2.3x longer than TorchPass.

Who’s the competition?

Clockwork’s competitive claim is that no one else offers fault tolerance across the full stack — GPU, network, and observability — as a software-only, hardware-agnostic package. The company has previously shipped network fault tolerance that automatically reroutes traffic when a network link fails; TorchPass extends that to compute failures. The company’s customers include Nscale, DCAI (Danish Centre for AI), Nebius, Uber and Wells Fargo.

As a venture-backed startup with just over $40 million raised and a product that only began shipping in September 2025, Clockwork will need to demonstrate it can scale support and integrations to keep pace with rapidly evolving GPU infrastructure — particularly next-generation tightly coupled systems like Nvidia’s GB200 NVL72, where even small failures carry outsized costs.

And Clockwork joins a field of companies looking to simplify GPU deployment and reduce costs. For example, neocloud provider Crusoe is moving beyond raw GPU power to abstract more AI infrastructure decisions.