
Every serverless GPU host compared: pricing, GPUs, and what they claim (April 2026)

By Alex Harmon

If you want to run an LLM, a diffusion model, or any custom inference workload without owning the GPU, you are choosing among five real options in 2026: Runpod, Modal, Fal.ai, Baseten, and Replicate. This article is a pricing matrix, not a benchmark shootout. Every number comes from the vendor’s public pricing page as of April 2026.

Why this matters. Every AI-infra review on the internet right now is full of hand-wavy “lightning fast!” and “2x cheaper than the competition!” claims. Most are unverifiable. I’d rather give you numbers you can verify yourself in 90 seconds than pretend I ran a 500-hour benchmark I didn’t.

The matrix — published hourly rates

All prices in USD per hour, rounded, as of April 2026. Check the vendor links before committing — rates change.

GPU | Runpod Secure Cloud | Modal | Fal.ai | Baseten | Replicate
T4 16GB | $0.19/hr | $0.63/hr | $0.225/hr
L4 24GB | $0.43/hr | $0.80/hr | $1.21/hr
A10G 24GB | $0.69/hr | $1.21/hr | $1.23/hr
A100 80GB | $2.17/hr | $2.10/hr | $0.99/hr | $4.00/hr | $5.04/hr
H100 80GB | $3.35/hr | $3.95/hr | $1.89/hr | $6.50/hr
H200 141GB | $3.99/hr
B200 | $9.98/hr

Sources: Runpod pricing, Modal pricing, Fal.ai pricing, Baseten pricing, Replicate pricing.
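
To make the A100 80GB row concrete, here is a minimal Python sketch that turns the five published hourly rates into a cost for a given number of billed GPU-hours. The rates are copied from the matrix; the function name and the 100-hour figure are mine, for illustration only. It is an on-paper comparison and ignores minimums, per-output pricing, and anything else that changes what actually gets billed.

```python
# A100 80GB hourly rates copied from the matrix (USD/hr, April 2026).
A100_RATES = {
    "Runpod Secure Cloud": 2.17,
    "Modal": 2.10,
    "Fal.ai": 0.99,
    "Baseten": 4.00,
    "Replicate": 5.04,
}


def cost_for_hours(rates, gpu_hours):
    """Return {vendor: cost in USD} for the given billed GPU-hours, cheapest first."""
    costs = {vendor: round(rate * gpu_hours, 2) for vendor, rate in rates.items()}
    return dict(sorted(costs.items(), key=lambda item: item[1]))


# Illustrative workload: 100 billed GPU-hours on an A100 80GB.
for vendor, cost in cost_for_hours(A100_RATES, 100).items():
    print(f"{vendor:<20} ${cost:,.2f}")
```

Swap in your own hour count and re-check the rates against the vendor pages before trusting the ranking.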

What’s actually billable

The matrix above makes these look comparable. They are not, because the billing models differ:

  • Runpod bills per-second on Pods (always-on containers) or per-millisecond on Serverless. Same GPU, different billing. Runpod docs.
  • Modal bills per second of GPU time. The published H100 rate of $0.001097/sec works out to $3.95/hr. Free $30/mo credit. Modal pricing.
  • Fal.ai often bills per-output instead of per-second for pre-packaged models (e.g. Flux image generation has a per-image price). GPU-second billing is only for custom deployments.
  • Baseten has per-GPU-hour pricing plus a minimum dedicated deployment cost. Scale-to-zero is available, but a minimum awake time is billed each time a deployment wakes.
  • Replicate bills per-second at a per-model rate. The A100-80GB rate of $0.001400/sec = $5.04/hr is the published public rate for custom model deployments.
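
The per-second figures in the list above convert to hourly rates with simple arithmetic. A small Python sketch of that conversion, using only the two per-second rates quoted in this article; the helper names and the 90-second job are illustrative assumptions, not anything a vendor publishes:

```python
SECONDS_PER_HOUR = 3600


def per_second_to_hourly(rate_per_second):
    """Convert a USD-per-second rate to USD per hour, rounded to cents."""
    return round(rate_per_second * SECONDS_PER_HOUR, 2)


def billed_cost(rate_per_second, busy_seconds):
    """Cost of a job billed only for the seconds it actually runs."""
    return rate_per_second * busy_seconds


MODAL_H100_PER_SEC = 0.001097      # quoted above
REPLICATE_A100_PER_SEC = 0.001400  # quoted above

print(per_second_to_hourly(MODAL_H100_PER_SEC))      # 3.95
print(per_second_to_hourly(REPLICATE_A100_PER_SEC))  # 5.04

# Hypothetical 90-second job on a Modal H100: just under ten cents.
print(f"${billed_cost(MODAL_H100_PER_SEC, 90):.4f}")
```

The hourly number only matters if the GPU is busy for the full hour. With per-second billing, what you pay for is the busy time.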

Cold starts — what each vendor claims

None of the below are numbers I measured. They are the numbers each vendor publishes in their docs or marketing. Read them with the skepticism they deserve.

  • Modal claims “sub-second” container cold starts for CPU and “a few seconds” for typical GPU models in their docs. For large models (70B+ LLMs), expect tens of seconds for weights to load from their cache layer.
  • Runpod Serverless advertises “FlashBoot” with claimed sub-250ms cold starts for small containers. Large LLM workloads are multi-second. Runpod FlashBoot.
  • Fal.ai markets itself as having the lowest latency for pre-packaged diffusion workloads (Flux, SDXL). They publish specific model latencies on individual model pages.
  • Baseten emphasizes that its custom inference runtime (Truss) reduces cold starts. It publishes no official number, but its benchmarks page shows case studies.
  • Replicate does not publish cold-start targets. In practice, cold starts for custom models can exceed 30s when weights need to download.

Why I’m not publishing my own numbers. Cold start depends on: model size, your container base image, how weights are cached, which region you’re in, and whether the vendor has a warm instance for your model. A single number in a blog post is useless. If you care about cold start for your model, deploy it to each vendor’s free tier and measure with a stopwatch-grade HTTP client (curl -w "%{time_total}").
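
If you would rather script the measurement than eyeball curl output, here is a minimal Python sketch using only the standard library. All function names are mine, and the endpoint URL you pass in is whatever your own deployment exposes. It assumes the first request lands on a deployment that has scaled to zero, which you have to arrange yourself by leaving it idle. Otherwise the first sample is just another warm request.

```python
import statistics
import time
import urllib.request


def time_call(call):
    """Return wall-clock seconds taken by a zero-argument callable."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def measure(call, runs=5, pause_seconds=0.0):
    """Time `call` several times in a row; the first sample is the candidate cold start."""
    samples = []
    for _ in range(runs):
        samples.append(time_call(call))
        time.sleep(pause_seconds)
    return samples


def summarize(samples):
    """Separate the first (possibly cold) request from the typical one."""
    return {
        "first": samples[0],
        "median": statistics.median(samples),
        "max": max(samples),
    }


def hit_endpoint(url, timeout=120):
    """Send one GET request and read the full body, so the timing includes the transfer."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
```

Usage is `summarize(measure(lambda: hit_endpoint(url), runs=5))` with your own deployment's URL. A large gap between `first` and `median` is the cold start showing up. Repeat it at different times of day before drawing conclusions.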

How to pick

You’re a hobbyist or indie dev → Runpod Pods (cheapest hourly for always-on dev) or Replicate (simplest API).

You need per-second billing and proper autoscaling → Modal. Best DX, solid pricing, Python-native. The $30/mo free credit makes it essentially free to prototype.

You’re serving a pre-packaged diffusion or TTS model to end users → Fal.ai. The per-output pricing + optimized runtime is hard to beat for exactly those workloads.

You’re a startup with a production inference workload and a budget → Baseten. Most expensive per GPU-hour on paper, but Truss, observability, and support are tangible.

You want to run Llama 3/4 at scale → Runpod Serverless on H100/H200, or Modal with distributed inference. Don’t use Replicate for this — the per-second rate adds up fast.

What this article deliberately doesn’t do

  • No “I ran Llama 3 70B on all 5 providers” table. Those numbers would depend on quantization, batch size, tokenizer, and the specific container I built. I’d rather you run the benchmark on your own model.
  • No claim that one vendor is “fastest”. They’re all fast in some dimension and slow in another.
  • No hidden affiliate links in the body — the vendor links above all go to the official pricing pages, not referral URLs.
