CerebriumAI · milo157 · Jun 11, 2026 · Jun 11, 2026 · Jun 11, 2026 · Jun 22, 2026
@@ -42,6 +42,13 @@
                   "hardware/cpu-and-memory"
                 ]
               },
+              {
+                "group": "Performance",
+                "pages": [
+                  "performance/faster-cold-starts",
+                  "performance/checkpointing"
+                ]
+              },
               {
                 "group": "Scaling apps",
                 "pages": [

@@ -1,19 +1,21 @@
 ---
-title: "Memory Checkpointing"
-description: "Faster cold starts with memory checkpointing"
+title: "Memory and GPU Checkpointing (Beta)"
+description: "Radically reduce container cold starts by skipping initialization work"
 ---
 
 ## Introduction
 
-Memory checkpointing takes a snapshot of a container's GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
+Memory checkpointing takes a snapshot of a container’s CPU memory and GPU memory, and uses it to speed up the startup of future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
 
-For example, Machine Learning and LLM frameworks load massive model weights and compile various CUDA kernels at container start time, taking many minutes. Loading a checkpoint that already contains the compiled CUDA kernels can skip this delay entirely.
+This is useful for both CPU-only and GPU workloads. For CPU applications, checkpointing can preserve expensive initialization work such as imports, dependency loading, configuration setup, and in-memory state. For GPU applications, it can also preserve model weights, CUDA state, and compiled kernels.
 
-Cerebrium has native checkpointing and restore functionality built in to the platform.
+For example, ML and LLM frameworks often load large model weights and compile CUDA kernels at container start time, which can take many seconds or minutes. Loading from a checkpoint that already contains this initialized state can skip most of that delay.
+
+Because this feature is still in beta, please report any issues to the team via our [Discord Community](https://discord.gg/ATj6USmeE2) or by [email](mailto:support@cerebrium.ai).
 
 ## How To Use
 
-Checkpointing is available on our v2 runtime environment. Add the following to your `cerebrium.toml` to upgrade.
+Checkpointing is available to customers in early beta. Add the following to your `cerebrium.toml` to use it.
 
 ```
 [cerebrium.experimental]
@@ -33,7 +35,9 @@ If successful subsequent containers will be restored from this created checkpoin
 
 A checkpoint is tightly coupled to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
 
-### Example
+You can find several implementations in our [Examples repository on GitHub](https://github.com/CerebriumAI/examples).
+
+### vLLM Example
 
 ```python
 from vllm import AsyncLLMEngine
@@ -53,8 +57,9 @@ engine = AsyncLLMEngine.from_engine_args(engine_args)
 engine.sleep(level=1)
 # Trigger checkpoint
 try:
-    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint", method="POST")
-except http.client.RemoteDisconnect:
+    req = urllib.request.Request("http://169.254.169.253:8234/checkpoint", method="POST")
+    urllib.request.urlopen(req, timeout=300)
+except http.client.RemoteDisconnected:
     # The TCP connection is dropped on restore, which raises RemoteDisconnected
     pass
 
@@ -72,7 +77,7 @@ engine.wake_up()
 
 **Ephemeral Filesystem:** Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
 
-**Provider Availablity:** Checkpointing is only available on the AWS provider. More coming soon.
+**Provider Availability:** Checkpointing is only available on the <b>AWS provider</b>. More coming soon.
 
 ## Platform specific recommendations
 

@@ -0,0 +1,137 @@
+---
+title: Faster Cold Start Performance
+description: Reduce queueing delay and initialization time on new containers
+---
+
+Cold starts happen when Cerebrium must boot a new container to serve a request. That adds latency in two phases:
+
+1. **Queueing** - No warm container is available, so the request waits until a new one starts and becomes ready.
+2. **Initialization** - The new container runs startup work before it accepts traffic: importing dependencies, loading model weights into GPU memory, compiling CUDA kernels.
+
+Use metrics and request logs from your Cerebrium dashboard to see which phase dominates and where to focus first.
+Most production workloads benefit from both: reducing initialization time, then configuring scaling to keep warm capacity available.
+
+## Reducing initialization time
+
+Most cold-start time in ML workloads comes from loading model weights into GPU memory. For large models, a standard Hugging Face load can take 40+ seconds even when reading from storage at ~2 GB/s.
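
As a rough back-of-the-envelope check on the figure above, load time is bounded below by weight size divided by read throughput. The sketch below assumes 2 bytes per parameter (16-bit weights) and the ~2 GB/s read speed quoted above. `estimate_load_seconds` is an illustrative helper, not part of any Cerebrium API.

```python
def estimate_load_seconds(params_billions: float, bytes_per_param: int = 2,
                          read_gb_per_s: float = 2.0) -> float:
    """Lower bound on the time needed to read model weights from storage."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    size_gb = params_billions * bytes_per_param
    return size_gb / read_gb_per_s


# A 40B-parameter model with 16-bit weights is about 80 GB on disk.
print(estimate_load_seconds(40))  # 40.0
```

Real load times are usually higher than this bound, since deserialization and host-to-GPU copies add overhead on top of the raw read.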
+
+Work through the techniques below in order. Each step targets a different part of startup and can be combined with the others.
+
+### Store model weights on persistent storage
+
+Store model weights on [persistent storage](/storage/managing-files) at `/persistent-storage` rather than baking them into the container image.
+
+Cerebrium caches reads from persistent storage within each region. After weights are loaded once, future cold starts can reuse the cached copy and load faster.
+
+This is usually the best default for large models. Baking weights into the container increases image size, which means Cerebrium has to pull and restore a larger image before your application can start initialization.
+
+Only include weights in the container image when they are small enough that the image remains lightweight.
+
+<Note>
+  Increasing CPU core count can parallelise reads from storage and improve
+  pull-through times for large files. Multiple cores process different parts
+  simultaneously, reducing overall transfer time.
+</Note>
+
+### Run initialization at module scope
+
+Move as much initialization work as possible out of the request path and into module scope so it runs once at container start, before the container accepts traffic.
+
+```python
+# Runs once at container start, not on every request.
+# load_model and load_tokenizer stand in for your framework's loading calls.
+model = load_model("/persistent-storage/models/my-model/")
+tokenizer = load_tokenizer("/persistent-storage/models/my-model/")
+
+def predict(prompt: str):
+    return model.generate(prompt)
+```
+
+For multiple independent models or weight files, load them concurrently rather than sequentially. Use `ThreadPoolExecutor` or similar patterns to read files in parallel and take full advantage of storage bandwidth.
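
A minimal sketch of the concurrent-loading pattern described above, using only the standard library. `read_weights` and `load_all` are hypothetical names, and reading raw bytes is a stand-in for whatever loader your framework provides.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_weights(path: Path) -> bytes:
    """Stand-in for a real loader; replace with your framework's load call."""
    return path.read_bytes()


def load_all(paths: list[Path], max_workers: int = 4) -> dict[str, bytes]:
    """Read several files concurrently and return them keyed by file name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths
        blobs = list(pool.map(read_weights, paths))
    return {path.name: blob for path, blob in zip(paths, blobs)}
```

Calling `load_all` at module scope with the paths of your weight files keeps the reads in the startup phase. Threads suit this case because file reads are I/O-bound and release the GIL while waiting on storage.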
+
+### Load weights directly to GPU
+
+Standard PyTorch and Hugging Face loading paths copy weights through CPU memory. Libraries that stream weights directly from disk to GPU reduce this overhead. Use one of these when model loading remains the bottleneck after moving work to module scope.
+
+#### Tensorizer
+
+[Tensorizer](https://github.com/coreweave/tensorizer) serialises model weights into a format optimised for fast transfer and loads them directly into GPU memory in a single step. It works with Cerebrium persistent storage at nearly 2 GB/s read speed. For large models (20B+ parameters), loading time typically decreases by 30–50%, with greater improvements on larger models.
+
+Tensorizer works with Transformers, Diffusers, scikit-learn, or custom PyTorch modules. The only requirement is the ability to initialise an empty model before the deserializer restores weights into it.
+
+#### FlashPack
+
+[FlashPack](https://github.com/fal-ai/flashpack) loads PyTorch tensors from disk to GPU at high throughput without requiring GPUDirect Storage. Convert a model once, store the `.flashpack` file on persistent storage, then load directly into GPU memory on startup.
+
+FlashPack also provides integration mixins for Transformers and Diffusers models. See the [FlashPack repository](https://github.com/fal-ai/flashpack) for conversion and loading patterns.
+
+### Restore from a checkpoint
+
+When initialization includes work that is identical on every container start (compiled CUDA kernels, large weight loads, framework setup), [memory checkpointing](/performance/checkpointing) captures CPU and GPU memory state after initialization and restores it on future cold starts.
+
+Checkpointing skips repeated initialization work entirely. A container restored from a checkpoint resumes from the point where the checkpoint trigger was sent, with model weights and compiled kernels already in memory.
+
+Use checkpointing when Tensorizer or FlashPack still leave multi-minute startup times, or when compiled kernels dominate initialization. Enable checkpointing in `cerebrium.toml` and trigger it after initialization completes. See the [Memory Checkpointing guide](/performance/checkpointing) for configuration, trigger endpoints, and framework-specific recommendations.
+
+## Reduce queueing with scaling
+
+When initialization is already optimised, keep warm containers available so requests do not wait for new ones to boot. See [Scaling Apps](/scaling/scaling-apps) for the full parameter reference.
+
+Use these scaling options based on traffic pattern:
+
+| Goal                                               | Parameter        | When to use                                                                    |
+| -------------------------------------------------- | ---------------- | ------------------------------------------------------------------------------ |
+| Eliminate cold starts from scaling to zero         | `min_replicas`   | Latency-sensitive production workloads that cannot tolerate startup delay      |
+| Handle bursty traffic without waiting for scale-up | `scaling_buffer` | Traffic arrives in bursts where one request is followed by several more        |
+| Keep containers warm through brief dips            | `cooldown`       | Steady workloads with occasional gaps before traffic returns                   |
+| Maintain headroom before autoscaler adds replicas  | `scaling_target` | Workloads using `concurrency_utilization` that need spare capacity per replica |
+
+### Keep containers warm with `min_replicas`
+
+Set `min_replicas` to maintain a floor of running instances at all times. This eliminates cold starts from scaling to zero but increases cost while idle.
+
+```toml
+[cerebrium.scaling]
+min_replicas = 1
+```
+
+Use `min_replicas = 1` or higher for latency-sensitive production workloads that cannot tolerate cold-start delay.
+
+### Buffer capacity with `scaling_buffer`
+
+`scaling_buffer` provisions extra idle replicas above what the scaling metric recommends. This helps with bursty traffic - when one request arrives, additional warm containers are already available for the requests that follow.
+
+```toml
+[cerebrium.scaling]
+min_replicas = 0
+max_replicas = 10
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 100
+scaling_buffer = 3
+```
+
+`scaling_buffer` is available with `concurrency_utilization` and `requests_per_second` metrics.
+
+### Tune cooldown for traffic patterns
+
+The `cooldown` parameter sets how long reduced concurrency must persist before a container scales down. A longer cooldown keeps containers warm through brief traffic dips and reduces cold starts when traffic returns quickly.
+
+```toml
+[cerebrium.scaling]
+cooldown = 600  # Keep containers warm for 10 minutes after traffic drops
+```
+
+Match cooldown to traffic patterns. Steady workloads with occasional gaps benefit from longer cooldowns. Highly intermittent workloads may accept shorter cooldowns to reduce idle cost.
+
+### Leave headroom with `scaling_target`
+
+With `concurrency_utilization`, set `scaling_target` below 100 to maintain excess capacity before the autoscaler adds replicas. For example, `scaling_target = 70` with `replica_concurrency = 1` keeps containers at 70% utilisation, leaving room for new requests without waiting for a scale-up event.
+
+```toml
+[cerebrium.scaling]
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 70
+```
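
To make the headroom effect concrete, the sketch below uses a simplified model of a utilisation-based autoscaler: enough replicas that in-flight requests stay at or below `scaling_target` percent of total concurrency, plus `scaling_buffer` idle replicas on top. This model is an assumption for illustration only, not Cerebrium's actual autoscaling algorithm, and `replicas_needed` is a hypothetical helper.

```python
def replicas_needed(in_flight: int, replica_concurrency: int = 1,
                    scaling_target: int = 100, scaling_buffer: int = 0) -> int:
    """Simplified model: replicas required to keep utilisation at or below target."""
    # Integer ceiling division avoids floating-point rounding surprises
    base = -(-(in_flight * 100) // (replica_concurrency * scaling_target))
    return base + scaling_buffer


print(replicas_needed(7, scaling_target=100))  # 7
print(replicas_needed(7, scaling_target=70))   # 10
```

Under this model, lowering the target from 100 to 70 keeps three extra replicas running for the same seven in-flight requests, which is the spare capacity that absorbs new requests without a scale-up event.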
+
+All scaling strategies trade cost for latency. Monitor cold start frequency and request latency in the dashboard to find the right balance.