- Introduction
- Generating and Executing Jobs with OmniHub
- Sweeping Jobs and Application Arguments
- Processing Metrics and Results
- Known Issues
- Developer Corner
- Contact
This repository provides utilities (scripts, tools, and container images) to execute and analyze AI/ML workloads on AMD systems, evaluating performance across system scales, model granularities, and AMD performance tools. Tested with local ML models and pre-built Docker and Apptainer images, the tools enable flexible mix-and-match configurations via simple command line arguments. We use the Hugging Face and vLLM APIs in our examples, with support for additional frameworks available upon request (contact us).
ML Models Sample List
- Llama (v2-v4, 3B-405B)
- DeepSeek-R1
- AMD-OLMo (1B)
- Mistral (7B-24B)
- Qwen (v2.5-7B)
- NVLM (72B)
- more can be added on request (contact us)
See here for more details on the available ML models on Frontier and the HPC Fund clusters.
Frameworks
- Hugging Face: inference and finetuning
- vLLM: inference
- PyTorch: training and inference
- SGLang: coming soon...
Systems
- HPC Fund: multi-node, multi-GPU; Apptainer
- Frontier (OLCF): multi-node, multi-GPU; Apptainer
Tools Sample List
- rocprof/rocprof-compute
- PyTorch Profiler
- omnitrace
- omnistat
- more can be added on request (contact us).
Check the list of supported tools for more details about tools and their execution modes.
Our test clusters use SLURM for job management. OmniHub provides the tool omnihub-generate-job
to automatically create SLURM job scripts that fit your target environment. Using this tool, researchers can easily
generate and execute jobs with different combinations of ML models, container platforms, number of nodes, and
performance tools. For non-SLURM systems, the tool can serve as a reference to modify launch commands as needed.
In its most basic form, SLURM job script generation and job execution work as follows:
git clone https://github.com/AMDResearch/omnihub.git $HOME/omnihub
cd $HOME/omnihub
./omnihub-generate-job --omnihub-dir $HOME/omnihub \
--app-config applications/hf-infer/config-example.yaml \
--output hf-infer.slurm
sbatch hf-infer.slurmWhere --omnihub-dir points to your working copy of the OmniHub repository in
the cluster and --app-config points to the path to the application configuration file relative to the OmniHub directory.
Refer to this document for more examples of using the omnihub-generate-job tool.
Generated SLURM scripts run scripts/sanity-check.sh on the allocated nodes before the main workload. This checks ROCm availability (rocminfo), GPU compute-mode, and PyTorch NCCL (broadcast / all_reduce under torchrun via scripts/sanity_torch_dist.py). If the sanity step fails with errors such as NCCL connection refused or traffic directed at link-local addresses (for example 169.254.x.x), fix cluster networking first—e.g. NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME and correct MASTER_ADDR reachability—before debugging application code.
| Flag | Options | Description |
|---|---|---|
omnihub-dir |
OmniHub working directory. | |
app-config |
Relative path to the app config. | |
cluster |
hpcfund, frontier |
Cluster name. |
partition |
Partition or job queue name. | |
num-nodes |
Number of nodes to allocate. | |
platform |
apptainer, docker |
Container platform to use. |
runner |
manual, torchrun |
Distributed runner for multi-node. |
tools |
List of tools | List of profiling tools. |
time-limit |
SLURM job time limit. | |
tasks-per-node |
Override Slurm tasks per node. | |
image |
Custom container image path. |
| Tool | Description |
|---|---|
rocprof-compute |
Collect all performance counters. |
omnistat |
Low-overhead system metrics, sampled at 1s intervals. |
omnistat-rocprofiler-pmc1 |
Low-overhead performance counter collection, sampled at 1s intervals, 1st set of PMCs. |
omnistat-rocprofiler-pmc2 |
Low-overhead performance counter collection, sampled at 1s intervals, 2nd set of PMCs. |
omnitrace |
Application tracing. |
pytorch-stats |
Collects detailed statistics of PyTorch operations. |
pytorch-trace |
PyTorch execution traces compatible with TensorBoard. |
rocprofv1-stats |
Kernel execution stats (to be deprecated soon). |
rocprofv2-pmc |
Profiling with performance counters (configuration). |
rocprofv3-stats |
Kernel execution stats. |
rccl-info |
Collects statistics of RCCL collective calls |
Explore example applications that demonstrate the usage of various ML models and configurations:
- Hugging Face Inference: LLM inference using the Hugging Face API.
- Hugging Face Finetuning: Finetune LLM models.
- vLLM Inference: Inference via the vLLM API.
- vLLM Latency Benchmark: Measure response times.
- vLLM Throughput Benchmark: Test concurrent processing.
- PyTorch Training: Train a simple CNN with PyTorch.
- PyTorch Inference: Run inference on a simple CNN.
Optional Note: To review vLLM benchmark scripts, launch an OmniHub container and navigate to
/app/vllm/benchmarks.For Apptainer, run:
apptainer shell /path/to/omnihub-image.sif cd /app/vllm/benchmarks/See docs/images.md for cluster-specific image paths.
All examples include YAML configuration files specifying the main entrypoint, tensor parallel size, and other settings. A snippet of the Hugging Face inference configuration file is provided below.
# Specifies the main script to run the application. The main function in the
# application may be decorated with `omnihub.entrypoint` and other functions may be
# decorated with `omnihub.tools.profile` to enable detailed profiling.
entrypoint: applications/hf-infer/infer.py
# Provides details for loading the model. The script will check if the model exists in
# `OMNIHUB_MODELS_DIR` before automatically downloading from Hugging Face.
ModelArguments:
pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-InstructThe OmniHub sweep tool (omnihub-sweep) automates job generation and submission. It creates different
job setups based on CLI flags and generates application configurations from templates.
Application templates are based off application config files and let you list multiple values for certain fields. The
sweep tool then produces every possible combination of those fields. For example, the following template that sweeps the
input and output lengths for the vLLM latency benchmark will generate 4 configuration files with different combinations
of input_len/output_len: 32/128, 32/256, 64/128, and 64/256.
entrypoint: /app/vllm/benchmarks/benchmark_latency.py
model: meta-llama/Llama-3.1-8B-Instruct
tensor_parallel_size: 1
input_len:
- 32
- 64
output_len:
- 128
- 256
batch_size: 8
# more configuration params go here...Note: For creating a template for vLLM throughput benchmark or any example app, simply start with the sample config file in the app directory.
The omnihub-sweep CLI also allows sweeping over job-related options, including:
partitions: cluster partitions.num-nodes: number of nodes.tools: profiling tools to enable (toolscan be set multiple times to enable different sets of tools).
For example, using the previously listed template stored in a file named
vllm-latency-template.yaml, the following omnihub-sweep will generate 4
configuration files, and then it will submit jobs for all of them in 2
partitions using 2 different sets of tools:
mkdir sweep-vllm
./omnihub-sweep --omnihub-dir $PWD --sweep-dir ./sweep-vllm \
--template vllm-latency-template.yaml \
--partitions mi2104x mi2508x \
--tools omnistat --tools omnistat rocprofv3-statsStarting a new sweep
.. Generated configurations: 4
Number of jobs in this sweep: 16
Submitting job: mi2104x/1/omnistat/config-00001.yaml
Submitting job: mi2104x/1/omnistat/config-00003.yaml
Submitting job: mi2104x/1/omnistat/config-00002.yaml
Submitting job: mi2104x/1/omnistat/config-00000.yaml
Submitting job: mi2104x/1/omnistat,rocprofv3-stats/config-00001.yaml
Submitting job: mi2104x/1/omnistat,rocprofv3-stats/config-00003.yaml
Submitting job: mi2104x/1/omnistat,rocprofv3-stats/config-00002.yaml
Submitting job: mi2104x/1/omnistat,rocprofv3-stats/config-00000.yaml
Submitting job: mi2508x/1/omnistat/config-00001.yaml
Submitting job: mi2508x/1/omnistat/config-00003.yaml
Submitting job: mi2508x/1/omnistat/config-00002.yaml
Submitting job: mi2508x/1/omnistat/config-00000.yaml
Submitting job: mi2508x/1/omnistat,rocprofv3-stats/config-00001.yaml
Submitting job: mi2508x/1/omnistat,rocprofv3-stats/config-00003.yaml
Submitting job: mi2508x/1/omnistat,rocprofv3-stats/config-00002.yaml
Submitting job: mi2508x/1/omnistat,rocprofv3-stats/config-00000.yamlTo test sweeps and job generation without submitting jobs to the cluster, use
the --dry-run flag.
After running OmniHub-generated jobs, you can process the results to create a summary in a standardized format
(omnihub-process). The summary includes details from various sources:
- Job configuration options
- Application configuration options
- Application metrics (if available)
- Default monitor metrics
- Omnistat report metrics
To process and index the results, use:
./omnihub-process --results-dir /path/to/results/omnihub -j 4
./omnihub-index --results-dir /path/to/results/omnihub --output indexNote:
omnihub-indexrequires thatomnihub-processhas completed successfully and that processed data is present in each job directory. If no processed data is found,omnihub-indexwill exit with an error message. Please ensure all jobs have been processed before running the index step.
This will generate an index.csv file in the top directory of the repository. The CSV uses two header rows and can be
loaded in Pandas as follows:
import pandas
df = pandas.read_csv("index.csv", header=[0,1], index_col=0)rocprofiler-computedoes not work with multi-node runs.- Tools like
pytorch-trace,omnitrace, androcprofv3-statstend to generate many GBs trace data per rank and you may run out of disk space very soon.
If you want to contribute to OmniHub, make sure you read this document for developer pre-requisites.