diff --git a/docs.json b/docs.json
index 263f77b..4adab4e 100644
--- a/docs.json
+++ b/docs.json
@@ -42,6 +42,13 @@
           "hardware/cpu-and-memory"
         ]
       },
+      {
+        "group": "Performance",
+        "pages": [
+          "performance/faster-cold-starts",
+          "performance/checkpointing"
+        ]
+      },
       {
         "group": "Scaling apps",
         "pages": [
diff --git a/other-topics/faster-cold-starts.mdx b/other-topics/faster-cold-starts.mdx
deleted file mode 100644
index b06716d..0000000
--- a/other-topics/faster-cold-starts.mdx
+++ /dev/null
@@ -1,116 +0,0 @@
----
-title: Faster Cold Starts
-description: Decrease the time it takes start your application
----
-
-## Container vs Storage Volume for Model Loading
-
-Two main options exist for storing model weights:
-
-1. **Inside the Container**: Packaging model weights directly in your container image
-   - Pros:
-     - Faster initial startup as weights are already in the container
-     - No need to download or transfer weights from external storage
-   - Cons:
-     - Much larger container size, leading to longer deployment times
-     - Less flexibility to update model weights without rebuilding container
-
-2. **Storage Volume**: Storing weights in a persistent storage volume
-   - Pros:
-     - Smaller container sizes and faster deployments
-     - Easy to update model weights without rebuilding container
-   - Cons:
-     - Initial cold start includes time to load weights from storage
-     - Requires managing separate storage infrastructure
-
-Storing model weights in a storage volume works best for most applications. For smaller models requiring minimal cold start times, container storage may be more appropriate.
-
-
-  Increasing core counts can parallelize downloads, improving pull-through times
-  for large images. This benefit becomes particularly notable when handling
-  large files from the storage layer, as multiple cores process different parts
-  simultaneously, reducing overall download time.
-
-
-## Loading Models from Storage Volume Faster
-
-One of the biggest factors in model startup time is loading the model from storage into GPU memory. For example, in larger models of 20B+ parameters, it can take over 40 seconds to load using a normal Hugging Face load, even with 2GB/s transfer speeds from persistent storage.
-
-The underlying hardware is optimized for fast model loading, but several additional techniques can further reduce cold-start times.
-
-### Tensorizer (recommended)
-
-[Tensorizer](https://github.com/coreweave/tensorizer) is a library that loads models from storage into GPU memory in a single step. Initially built for S3, it also works with Cerebrium's persistent storage (nearly 2GB/s read speed). For large models (20B+ parameters), loading time decreases by 30–50%, with even greater improvements for larger models. See the [GitHub page](https://github.com/coreweave/tensorizer) for details on the underlying methods.
-
-The following section covers using **Tensorizer** to load a model from storage directly into GPU memory in a single step.
-
-### Installation
-
-Add the following to your `[cerebrium.dependencies.pip]` in your `cerebrium.toml` file to install Tensorizer in your deployment:
-
-```txt
-tensorizer = ">=2.7.0"
-```
-
-### Usage
-
-To use **Tensorizer**, you need to first serialise your model and save it to your persistent-storage.
-
-```python
-from tensorizer import TensorSerializer
-def serialize_model(model, save_path):
-    """Serialize the model and save the weights to the save_path."""
-    try:
-        serializer = TensorSerializer(save_path)
-        start = time.time()
-        serializer.write_module(model)
-        end = time.time()
-        print(f"Serializing model took {end - start} seconds", file=sys.stderr)
-        serializer.close()
-        return True
-    except Exception as e:
-        print("Serialization failed with error:", e, file=sys.stderr)
-        return False
-```
-
-This will convert your model to a protocol buffer serialised format that is optimised for faster transfer speeds and fast loading into GPU memory.
-
-On the next deployment start, load the serialised model from storage into GPU memory in a single step:
-
-```python
-
-from tensorizer import TensorDeserializer
-from tensorizer.utils import no_init_or_tensor
-def deserialize_saved_model(model_path, model_id, plaid=True):
-    """Deserialize the model from the model_path and load into GPU memory."""
-
-    # create a config object that we can use to init an empty model
-    config = AutoConfig.from_pretrained(model_id)
-
-    # Initialize empty model without loading weights into GPU
-    print("Initializing empty model", file=sys.stderr)
-    start = time.time()
-    with no_init_or_tensor():
-        # Load empty model from config
-        model = AutoModelForCausalLM.from_config(config)
-    end_init = time.time() - start
-
-    # Create deserializer object
-    # Note: plaid_mode enables faster deserialization but isn't safe for training
-    deserializer = TensorDeserializer(model_path, plaid_mode=True)
-
-    # Deserialize model directly into GPU (zero-copy)
-    print("Loading model", file=sys.stderr)
-    start = time.time()
-    deserializer.load_into_module(model)
-    end = time.time()
-    deserializer.close()
-
-    # Report timings
-    print(f"Initializing empty model took {end_init} seconds", file=sys.stderr)
-    print(f"\nDeserializing model took {end - start} seconds\n", file=sys.stderr)
-
-    return model
-```
-
-Tensorizer works with any model type — Transformers, Diffusers, scikit-learn, or custom PyTorch. The only requirement is the ability to initialize an empty model. The deserializer restores weights into that empty model.
diff --git a/other-topics/checkpointing.mdx b/performance/checkpointing.mdx
similarity index 68%
rename from other-topics/checkpointing.mdx
rename to performance/checkpointing.mdx
index 894ae5d..674eab2 100644
--- a/other-topics/checkpointing.mdx
+++ b/performance/checkpointing.mdx
@@ -1,19 +1,21 @@
 ---
-title: "Memory Checkpointing"
-description: "Faster cold starts with memory checkpointing"
+title: "Memory and GPU Checkpointing (Beta)"
+description: "Radically reduce container cold starts by skipping initialization work"
 ---
 
 ## Introduction
 
-Memory checkpointing takes a snapshot of a container's GPU and CPU memory and uses that to speed up the startup of all future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
+Memory checkpointing takes a snapshot of a container’s CPU memory and GPU memory, and uses it to speed up the startup of future containers. Applications that perform a large amount of work at container start time benefit the most from this process.
 
-For example, Machine Learning and LLM frameworks load massive model weights and compile various CUDA kernels at container start time, taking many minutes. Loading a checkpoint that already contains the compiled CUDA kernels can skip this delay entirely.
+This is useful for both CPU-only and GPU workloads. For CPU applications, checkpointing can preserve expensive initialization work such as imports, dependency loading, configuration setup, and in-memory state. For GPU applications, it can also preserve model weights, CUDA state, and compiled kernels.
 
-Cerebrium has native checkpointing and restore functionality built in to the platform.
+For example, ML and LLM frameworks often load large model weights and compile CUDA kernels at container start time, which can take many seconds or minutes. Loading from a checkpoint that already contains this initialized state can skip most of that delay.
+
+Since this feature is still in beta, please report all issues to the team via our [Discord Community](https://discord.gg/ATj6USmeE2) or via [Email](mailto:support@cerebrium.ai).
 
 ## How To Use
 
-Checkpointing is available on our v2 runtime environment. Add the following to your `cerebrium.toml` to upgrade.
+Checkpointing is available in early beta to our customer base. Add the following to your `cerebrium.toml` to use it.
 
 ```
 [cerebrium.experimental]
@@ -33,7 +35,9 @@ If successful subsequent containers will be restored from this created checkpoin
 
 A checkpoint is tightly coupled to a single deployment. To disable restoring from checkpoints simply remove the POST request and redeploy your application.
 
-### Example
+You can find several implementations in our [Examples repository on GitHub](https://github.com/CerebriumAI/examples).
+
+### vLLM Example
 
 ```python
 from vllm import AsyncLLMEngine
@@ -53,8 +57,9 @@ engine = AsyncLLMEngine.from_engine_args(engine_args)
 engine.sleep(level=1)
 # Trigger checkpoint
 try:
-    urllib.request.urlopen("http://169.254.169.253:8234/checkpoint", method="POST")
-except http.client.RemoteDisconnect:
+    req = urllib.request.Request("http://169.254.169.253:8234/checkpoint", method="POST")
+    urllib.request.urlopen(req, timeout=300)
+except http.client.RemoteDisconnected:
     # TCP connections disconnect on restore and throw remote
     pass
 
@@ -72,7 +77,7 @@ engine.wake_up()
 
 **Ephemeral Filesystem:** Any files written to disk before the checkpoint will not be copied to the restored container. Only memory is checkpointed.
 
-**Provider Availablity:** Checkpointing is only available on the AWS provider. More coming soon.
+**Provider Availability:** Checkpointing is only available on the AWS provider. More coming soon.
 
 ## Platform specific recommendations
 
diff --git a/performance/faster-cold-starts.mdx b/performance/faster-cold-starts.mdx
new file mode 100644
index 0000000..e66c9c7
--- /dev/null
+++ b/performance/faster-cold-starts.mdx
@@ -0,0 +1,137 @@
+---
+title: Faster Cold Start Performance
+description: Reduce queueing delay and initialization time on new containers
+---
+
+Cold starts happen when Cerebrium must boot a new container to serve a request. That adds latency in two phases:
+
+1. **Queueing** - No warm container is available, so the request waits until a new one starts and becomes ready.
+2. **Initialization** - The new container runs startup work before it accepts traffic: importing dependencies, loading model weights into GPU memory, compiling CUDA kernels.
+
+Use metrics and request logs from your Cerebrium dashboard to see which phase dominates and how you can improve it.
+Most production workloads benefit from both: reduce initialization time first, then configure scaling to keep warm capacity available.
+
+## Reducing initialization time
+
+Most cold-start time in ML workloads comes from loading model weights into GPU memory. For large models, a standard Hugging Face load can take 40+ seconds even when reading from storage at ~2 GB/s.
+
+Work through the techniques below in order. Each step targets a different part of startup and can be combined with the others.
+
+### Store model weights on persistent storage
+
+Store model weights on [persistent storage](/storage/managing-files) at `/persistent-storage` rather than baking them into the container image.
+
+Cerebrium caches reads from persistent storage within each region. After weights are loaded once, future cold starts can reuse the cached copy and load faster.
+
+This is usually the best default for large models. Baking weights into the container increases image size, which means Cerebrium has to pull and restore a larger image before your application can start initialization.
+
+Only include weights in the container image when they are small enough that the image remains lightweight.
+
+
+  Increasing CPU core count can parallelise reads from storage and improve
+  pull-through times for large files. Multiple cores process different parts
+  simultaneously, reducing overall transfer time.
+
+### Run initialization at module scope
+
+Move as much initialization work as possible out of the request path and into module scope so it runs once at container start, before the container accepts traffic.
+
+```python
+# Runs once at container start, not on every request
+model = load_model("/persistent-storage/models/my-model/")
+tokenizer = load_tokenizer("/persistent-storage/models/my-model/")
+
+def predict(prompt: str):
+    return model.generate(prompt)
+```
+
+For multiple independent models or weight files, load them concurrently rather than sequentially. Use `ThreadPoolExecutor` or similar patterns to read files in parallel and take full advantage of storage bandwidth.
+
+### Load weights directly to GPU
+
+Standard PyTorch and Hugging Face loading paths copy weights through CPU memory. Libraries that stream weights directly from disk to GPU reduce this overhead. Use one of these when model loading remains the bottleneck after moving work to module scope.
+
+#### Tensorizer
+
+[Tensorizer](https://github.com/coreweave/tensorizer) serialises model weights into a format optimised for fast transfer and loads them directly into GPU memory in a single step. It works with Cerebrium persistent storage at nearly 2 GB/s read speed. For large models (20B+ parameters), loading time typically decreases by 30–50%, with greater improvements on larger models.
+
+Tensorizer works with Transformers, Diffusers, scikit-learn, or custom PyTorch modules. The only requirement is the ability to initialise an empty model before the deserializer restores weights into it.
+
+#### FlashPack
+
+[FlashPack](https://github.com/fal-ai/flashpack) loads PyTorch tensors from disk to GPU at high throughput without requiring GPUDirect Storage. Convert a model once, store the `.flashpack` file on persistent storage, then load directly into GPU memory on startup.
+
+FlashPack also provides integration mixins for Transformers and Diffusers models. See the [FlashPack repository](https://github.com/fal-ai/flashpack) for conversion and loading patterns.
+
+### Restore from a checkpoint
+
+When initialization includes work that does not change between deployments - compiled CUDA kernels, large weight loads, framework setup - [memory checkpointing](/performance/checkpointing) captures CPU and GPU memory state after initialization and restores it on future cold starts.
+
+Checkpointing skips repeated initialization work entirely. A container restored from a checkpoint resumes from the point where the checkpoint trigger was sent, with model weights and compiled kernels already in memory.
+
+Use checkpointing when Tensorizer or FlashPack still leave multi-minute startup times, or when compiled kernels dominate initialization. Enable checkpointing in `cerebrium.toml` and trigger it after initialization completes. See the [Memory Checkpointing guide](/performance/checkpointing) for configuration, trigger endpoints, and framework-specific recommendations.
+
+## Reduce queueing with scaling
+
+When initialization is already optimised, keep warm containers available so requests do not wait for new ones to boot. See [Scaling Apps](/scaling/scaling-apps) for the full parameter reference.
+
+Use these scaling options based on traffic pattern:
+
+| Goal                                               | Parameter        | When to use                                                                    |
+| -------------------------------------------------- | ---------------- | ------------------------------------------------------------------------------ |
+| Eliminate cold starts from scaling to zero         | `min_replicas`   | Latency-sensitive production workloads that cannot tolerate startup delay      |
+| Handle bursty traffic without waiting for scale-up | `scaling_buffer` | Traffic arrives in bursts where one request is followed by several more        |
+| Keep containers warm through brief dips            | `cooldown`       | Steady workloads with occasional gaps before traffic returns                   |
+| Maintain headroom before autoscaler adds replicas  | `scaling_target` | Workloads using `concurrency_utilization` that need spare capacity per replica |
+
+### Keep containers warm with `min_replicas`
+
+Set `min_replicas` to maintain a floor of running instances at all times. This eliminates cold starts from scaling to zero but increases cost while idle.
+
+```toml
+[cerebrium.scaling]
+min_replicas = 1
+```
+
+Use `min_replicas = 1` or higher for latency-sensitive production workloads that cannot tolerate cold-start delay.
+
+### Buffer capacity with `scaling_buffer`
+
+`scaling_buffer` provisions extra idle replicas above what the scaling metric recommends. This helps with bursty traffic - when one request arrives, additional warm containers are already available for the requests that follow.
+
+```toml
+[cerebrium.scaling]
+min_replicas = 0
+max_replicas = 10
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 100
+scaling_buffer = 3
+```
+
+`scaling_buffer` is available with `concurrency_utilization` and `requests_per_second` metrics.
+
+### Tune cooldown for traffic patterns
+
+The `cooldown` parameter sets how long reduced concurrency must persist before a container scales down. A longer cooldown keeps containers warm through brief traffic dips and reduces cold starts when traffic returns quickly.
+
+```toml
+[cerebrium.scaling]
+cooldown = 600 # Keep containers warm for 10 minutes after traffic drops
+```
+
+Match cooldown to traffic patterns. Steady workloads with occasional gaps benefit from longer cooldowns. Highly intermittent workloads may accept shorter cooldowns to reduce idle cost.
+
+### Leave headroom with `scaling_target`
+
+With `concurrency_utilization`, set `scaling_target` below 100 to maintain excess capacity before the autoscaler adds replicas. For example, `scaling_target = 70` with `replica_concurrency = 1` keeps containers at 70% utilisation, leaving room for new requests without waiting for a scale-up event.
+
+```toml
+[cerebrium.scaling]
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 70
+```
+
+All scaling strategies trade cost for latency. Monitor cold start frequency and request latency in the dashboard to find the right balance.
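
The new `performance/faster-cold-starts.mdx` page recommends loading independent weight files concurrently with `ThreadPoolExecutor` but does not show the pattern. A minimal sketch of that advice could look like the following. It is an illustration under stated assumptions, not Cerebrium's or any framework's API: `read_file` and `load_all` are hypothetical helper names, and reading raw bytes stands in for whatever loader the application actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: Path) -> bytes:
    """Read one file fully into memory (hypothetical stand-in for a real weight loader)."""
    return path.read_bytes()


def load_all(paths: list[Path], max_workers: int = 4) -> dict[str, bytes]:
    """Read several independent files in parallel and return them keyed by file name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so they line up with `paths`.
        blobs = list(pool.map(read_file, paths))
    return {path.name: blob for path, blob in zip(paths, blobs)}
```

Threads suit this case because file reads are I/O-bound, so the reads overlap instead of running back to back. Calling `load_all` at module scope keeps the work out of the request path, as the page recommends.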