Profine

Profine — Profile. Rewrite. Ship faster.

Profine profiles your PyTorch training code on real GPUs, rewrites it with deterministic optimizations that you can review as diffs, and reports the measured speedup before you commit to a multi-hour training run.

Three steps. No guesswork.

  1. Profile on real GPUs. Profine runs your code on real hardware to find genuine bottlenecks, not synthetic ones.
  2. Transparent rewrites. Every change is a reviewable diff: torch.compile, scaled_dot_product_attention (SDPA), fused AdamW, bf16 autocast, TF32 matmul precision, and more.
  3. Measured speedups. Profine reports the end-to-end speedup it actually achieved on your training loop — not theoretical numbers.
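
For readers who have not used the optimizations named in step 2, the sketch below shows what each one looks like when applied by hand in plain PyTorch. It is an illustration only, not Profine's output: TinyAttention, apply_rewrites, and train_step are hypothetical names, and the toy loss exists only so the step has something to minimize.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyAttention(nn.Module):
    """Hypothetical toy model, used only to show where SDPA fits."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        b, t, d = x.shape
        # Split the projection into q, k, v, each shaped (batch, heads, time, head_dim).
        qkv = self.qkv(x).view(b, t, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # SDPA replaces a hand-written softmax(q @ k^T) @ v with one fused call.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))


def apply_rewrites(model, compile_model=True):
    """Apply TF32 matmul precision, fused AdamW, and torch.compile (hypothetical helper)."""
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    # Allow TF32 for float32 matmuls on GPUs that support it.
    torch.set_float32_matmul_precision("high")
    model = model.to(device_type)
    # Request the fused AdamW kernel only on CUDA, where it is widely supported.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, fused=(device_type == "cuda")
    )
    if compile_model:
        model = torch.compile(model)
    return model, optimizer, device_type


def train_step(model, optimizer, x, device_type):
    """One training step under bf16 autocast; returns the loss as a Python float."""
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        loss = model(x).float().pow(2).mean()  # toy loss, for illustration only
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Each change is a one- or two-line edit to ordinary training code, which is why rewrites of this kind are practical to review as a diff.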

Every line, justified.

Profine produces a reviewable diff with a stated justification for each rewrite, so you can ship optimizations confidently.
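
As a rough picture of what a justified, reviewable diff can look like, the sketch below pairs a unified diff with a one-line rationale using Python's standard difflib module. The justified_diff helper, the train.py filename, and the output layout are assumptions made for illustration, not Profine's actual report format.

```python
import difflib


def justified_diff(original, rewritten, justification, filename="train.py"):
    """Return a unified diff prefixed with a one-line justification (hypothetical format)."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        rewritten.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return f"# Justification: {justification}\n" + "".join(diff)


before = "optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)\n"
after = "optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)\n"
report = justified_diff(
    before, after, "fused AdamW runs the parameter update in fewer kernel launches"
)
print(report)
```

Keeping the rationale next to the changed lines lets a reviewer accept or reject each rewrite on its own.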

Six stages. One measured pass.

A single deterministic pipeline profiles, plans, rewrites, validates, and reports — without modifying your training semantics.
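
The validate and report stages can be pictured with a small timing harness like the one below. This is a minimal sketch under stated assumptions, not Profine's implementation: median_step_time, measured_speedup, and losses_match are hypothetical helpers, and the tolerance value is arbitrary.

```python
import math
import statistics
import time


def median_step_time(step_fn, warmup=3, iters=10):
    """Median wall-clock seconds per call of step_fn, after warmup calls."""
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        # On a GPU, step_fn should synchronize (for example with
        # torch.cuda.synchronize()) before returning, or the timing is meaningless.
        step_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def measured_speedup(baseline_step, optimized_step, warmup=3, iters=10):
    """End-to-end speedup: baseline step time divided by optimized step time."""
    baseline = median_step_time(baseline_step, warmup, iters)
    optimized = median_step_time(optimized_step, warmup, iters)
    return baseline / optimized


def losses_match(baseline_losses, optimized_losses, rel_tol=1e-2):
    """Check that two loss curves agree step by step within a relative tolerance."""
    return len(baseline_losses) == len(optimized_losses) and all(
        math.isclose(a, b, rel_tol=rel_tol)
        for a, b in zip(baseline_losses, optimized_losses)
    )
```

Reporting the median over several timed steps, after a warmup, keeps one-off costs such as compilation or cache population out of the reported speedup.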

Install Profine from PyPI.

Install with pip install profine. Source: github.com/ProfineAI/profine-cli. Package: pypi.org/project/profine.

Talk to us about Profine.

Get in touch to discuss profiling and optimizing your PyTorch workloads on real GPUs.