Every serverless GPU host compared: pricing, GPUs, and what they claim (April 2026)

By Alex Harmon · April 21, 2026

If you want to run an LLM, a diffusion model, or any custom inference workload without owning the GPU, you are choosing among five real options in 2026: Runpod, Modal, Fal.ai, Baseten, and Replicate. This article is a pricing matrix, not a benchmark shootout. Every number comes from the vendor’s public pricing page as of April 2026.

Why this matters. Every AI-infra review on the internet right now is full of hand-wavy “lightning fast!” and “2x cheaper than the competition!” claims. Most are unverifiable. I’d rather give you numbers you can verify yourself in 90 seconds than pretend I ran a 500-hour benchmark I didn’t.

The matrix — published hourly rates

All prices in USD per hour, rounded, as of April 2026. Check the vendor links before committing — rates change.

GPU	Runpod Secure Cloud	Modal	Fal.ai	Baseten	Replicate
T4 16GB	$0.19/hr	—	—	$0.63/hr	$0.225/hr
L4 24GB	$0.43/hr	$0.80/hr	—	$1.21/hr	—
A10G 24GB	$0.69/hr	—	—	$1.21/hr	$1.23/hr
A100 80GB	$2.17/hr	$2.10/hr	$0.99/hr	$4.00/hr	$5.04/hr
H100 80GB	$3.35/hr	$3.95/hr	$1.89/hr	$6.50/hr	—
H200 141GB	$3.99/hr	—	—	—	—
B200	—	—	—	$9.98/hr	—

Sources: Runpod pricing, Modal pricing, Fal.ai pricing, Baseten pricing, Replicate pricing.

What’s actually billable

The matrix above makes these look comparable. They are not, because the billing models differ:

Runpod bills per-second on Pods (always-on containers) or per-millisecond on Serverless. Same GPU, different billing. Runpod docs.
Modal bills per second of GPU time. The H100 rate of $0.001097/sec works out to about $3.95/hr. Every account gets a free $30/mo credit. Modal pricing.
Fal.ai often bills per-output instead of per-second for pre-packaged models (e.g. Flux image generation has a per-image price). GPU-second billing is only for custom deployments.
Baseten has per-GPU-hour pricing plus a minimum dedicated deployment cost. Scale-to-zero is available, but minimum awake times are billed.
Replicate bills per-second at a per-model rate. The A100-80GB rate of $0.001400/sec = $5.04/hr is the published public rate for custom model deployments.
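
To sanity-check the per-second figures above against the hourly matrix, the conversion is just multiplication by 3,600. A minimal sketch, using only the two per-second rates quoted in this section. The function names and the 90-second example job are mine, not anything a vendor publishes:

```python
SECONDS_PER_HOUR = 3600

# Per-second rates quoted in the list above (USD, April 2026).
MODAL_H100_PER_SEC = 0.001097
REPLICATE_A100_80GB_PER_SEC = 0.001400


def per_second_to_hourly(rate_per_second: float) -> float:
    """Convert a per-second GPU rate to the equivalent hourly rate."""
    return rate_per_second * SECONDS_PER_HOUR


def run_cost(rate_per_second: float, billed_seconds: float) -> float:
    """Cost of a run under pure per-second billing (no minimums, no idle time)."""
    return rate_per_second * billed_seconds


print(f"Modal H100:          ${per_second_to_hourly(MODAL_H100_PER_SEC):.2f}/hr")
print(f"Replicate A100 80GB: ${per_second_to_hourly(REPLICATE_A100_80GB_PER_SEC):.2f}/hr")

# Hypothetical 90-second job, chosen only to illustrate the arithmetic.
print(f"90 s on Modal H100:  ${run_cost(MODAL_H100_PER_SEC, 90):.4f}")
```

This only holds for pure per-second billing. Per-output pricing (Fal.ai) and billed minimum awake times (Baseten) need their own model, and I have not tried to guess at those here.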

Cold starts — what each vendor claims

None of the numbers below are ones I measured. They are what each vendor publishes in its docs or marketing. Read them with the skepticism they deserve.

Modal claims “sub-second” container cold starts for CPU and “a few seconds” for typical GPU models in their docs. For large models (70B+ LLMs), expect tens of seconds for weights to load from their cache layer.
Runpod Serverless advertises “FlashBoot” with claimed sub-250ms cold starts for small containers. Large LLM workloads are multi-second. Runpod FlashBoot.
Fal.ai markets itself as having the lowest latency for pre-packaged diffusion workloads (Flux, SDXL). They publish specific model latencies on individual model pages.
Baseten emphasizes that its custom inference runtime (Truss) reduces cold starts. It publishes no official number, but its benchmarks page shows case studies.
Replicate does not publish cold-start targets. In practice, cold starts for custom models can exceed 30s when weights need to download.

Why I’m not publishing my own numbers. Cold start depends on: model size, your container base image, how weights are cached, which region you’re in, and whether the vendor has a warm instance for your model. A single number in a blog post is useless. If you care about cold start for your model, deploy it to each vendor’s free tier and measure with a stopwatch-grade HTTP client (curl -w "%{time_total}").
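
If you’d rather script that than eyeball curl output, here is a rough Python equivalent using only the standard library. It is a sketch under assumptions: the endpoint URL is whatever you pass on the command line (I’m not assuming any vendor’s URL format), the endpoint is assumed to answer a plain GET (most real inference endpoints want a POST with an auth header, so adapt the request), and “cold” simply means the first request after you’ve let the deployment idle long enough to scale to zero.

```python
import statistics
import sys
import time
import urllib.request


def time_request(url: str, timeout: float = 300.0) -> float:
    """Wall-clock seconds to send a GET and read the whole response body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def cold_vs_warm(timings):
    """Treat the first timing as the cold request and the rest as warm ones."""
    if len(timings) < 2:
        raise ValueError("need one cold timing and at least one warm timing")
    cold, warm = timings[0], timings[1:]
    warm_median = statistics.median(warm)
    return {
        "cold_s": cold,
        "warm_median_s": warm_median,
        "cold_overhead_s": cold - warm_median,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Your own deployment URL; nothing vendor-specific is assumed here.
    endpoint = sys.argv[1]
    samples = [time_request(endpoint) for _ in range(5)]
    print(cold_vs_warm(samples))
```

Run it once after the deployment has been idle, then compare the first request against the median of the rest. One run is still one data point, so repeat it across regions and times of day before drawing conclusions.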

How to pick

You’re a hobbyist or indie dev → Runpod Pods (cheapest hourly for always-on dev) or Replicate (simplest API).

You need per-second billing and proper autoscaling → Modal. Best DX, solid pricing, Python-native. The $30/mo free credit makes it essentially free to prototype.

You’re serving a pre-packaged diffusion or TTS model to end users → Fal.ai. The combination of per-output pricing and an optimized runtime is hard to beat for exactly those workloads.

You’re a startup with a production inference workload and a budget → Baseten. Among the most expensive per GPU-hour on paper (only Replicate’s listed A10G and A100 rates are higher), but Truss, observability, and support are tangible benefits.

You want to run Llama 3/4 at scale → Runpod Serverless on H100/H200, or Modal with distributed inference. Don’t use Replicate for this — the per-second rate adds up fast.
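
To put a number on “adds up fast”, multiply the published hourly rate by the GPU-hours you expect to burn. A rough sketch using the A100 80GB row of the matrix, since that is the only large-GPU row where Replicate has a listed rate. The 300 GPU-hours figure is a made-up workload, the Runpod number is the Secure Cloud rate rather than a Serverless one, and the math ignores the billing-model differences covered earlier:

```python
# Published A100 80GB hourly rates from the matrix in this article (USD/hr, April 2026).
A100_80GB_HOURLY = {
    "Runpod Secure Cloud": 2.17,
    "Modal": 2.10,
    "Fal.ai": 0.99,
    "Baseten": 4.00,
    "Replicate": 5.04,
}


def monthly_cost(hourly_rate: float, gpu_hours: float) -> float:
    """Billed GPU-hours times the hourly rate, rounded to cents."""
    return round(hourly_rate * gpu_hours, 2)


def cost_table(rates: dict, gpu_hours: float) -> dict:
    """Map each host to its cost for the given GPU-hours, cheapest first."""
    costs = {host: monthly_cost(rate, gpu_hours) for host, rate in rates.items()}
    return dict(sorted(costs.items(), key=lambda item: item[1]))


GPU_HOURS = 300  # hypothetical monthly usage, not a measured workload

for host, cost in cost_table(A100_80GB_HOURLY, GPU_HOURS).items():
    print(f"{host:<20} ${cost:,.2f}")
```

Swap in your own GPU-hours and the row for the GPU you actually need. The ranking is only as good as the published rates, so recheck them before you commit.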

What this article deliberately doesn’t do

No “I ran Llama 3 70B on all 5 providers” table. Those numbers would depend on quantization, batch size, tokenizer, and the specific container I built. I’d rather you run the benchmark on your own model.
No claim that one vendor is “fastest”. They’re all fast in some dimension and slow in another.
No hidden affiliate links in the body — the vendor links above all go to the official pricing pages, not referral URLs.

Sources

Runpod pricing — https://www.runpod.io/pricing
Runpod Serverless docs — https://docs.runpod.io/serverless/pricing
Modal pricing — https://modal.com/pricing
Modal cold-start docs — https://modal.com/docs/guide/cold-start
Fal.ai pricing — https://fal.ai/pricing
Baseten pricing — https://www.baseten.co/pricing/
Replicate pricing — https://replicate.com/pricing