Openchat hangs when running a model from a docker container

Hi, I wrote the following Dockerfile to install the dependencies and run an OpenChat model. However, the container hangs on startup.

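```
# CUDA 12.4 development image on Ubuntu 22.04
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

# Python tooling, then torch and ochat
RUN apt-get update && apt-get install python3-pip -y && apt-get clean
RUN pip3 install packaging torch && pip3 install ochat && pip3 cache purge

# git and a pinned flash_attn release
RUN apt-get install git -y
RUN pip3 install flash_attn==2.5.8

# Shell-form entrypoint: $model and $port are expanded from the container's environment at startup
ENTRYPOINT python3 -m ochat.serving.openai_api_server --model $model --host 0.0.0.0 --port $port
```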

The following log is emitted, after which the container hangs and I can't even stop it with `sudo docker stop`.

```
INFO 03-10 13:58:32 __init__.py:207] Automatically detected platform cuda.
2025-03-10 13:58:33,122	WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 13:58:33,250	INFO worker.py:1821 -- Started a local Ray instance.
INFO 03-10 13:58:40 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-10 13:58:40 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='openchat/openchat-3.5-0106-gemma', speculative_config=None, tokenizer='openchat/openchat-3.5-0106-gemma', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openchat/openchat-3.5-0106-gemma, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-10 13:58:43 cuda.py:229] Using Flash Attention backend.
```
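
The Ray warning in the log above reports that `/dev/shm` has only 67100672 bytes available. To confirm what the container actually sees, a check like the following could be run inside it. This is a minimal sketch using only the Python standard library; `shm_report` and `format_mib` are hypothetical helper names, not part of ochat, vLLM, or Ray, and whether the small `/dev/shm` is related to the hang is not established here.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return total and free bytes for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    return {"total_bytes": usage.total, "free_bytes": usage.free}


def format_mib(num_bytes):
    """Format a byte count as mebibytes with two decimals."""
    return f"{num_bytes / 2**20:.2f} MiB"


if __name__ == "__main__":
    # /dev/shm is assumed to exist (it does in a typical Linux container).
    if os.path.isdir("/dev/shm"):
        report = shm_report()
        print("total:", format_mib(report["total_bytes"]))
        print("free: ", format_mib(report["free_bytes"]))
    else:
        print("/dev/shm not found")
```

The value from the log, 67100672 bytes, formats as just under 64 MiB, far below the size the warning asks for.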

If I run this container without the `--gpus` flag, it fails on startup with the following error:

```
INFO 03-10 14:07:35 __init__.py:211] No platform detected, vLLM is running on UnspecifiedPlatform
openchat.json: 100% 484/484 [00:00<00:00, 9.62MB/s]
2025-03-10 14:07:36,435	WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 14:07:36,562	INFO worker.py:1821 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ochat/serving/openai_api_server.py", line 373, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 639, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1126, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1660, in __init__
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type
```
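
The "No platform detected" line above suggests that no GPU is visible inside the container in this run. A rough way to check that before starting the server is sketched below. It is a heuristic under stated assumptions: it only looks for NVIDIA device nodes under `/dev` and for `nvidia-smi` on the `PATH`, and `gpu_status` is a hypothetical helper name, not something ochat or vLLM provides.

```python
import glob
import shutil


def gpu_status(dev_glob="/dev/nvidia*"):
    """Collect simple hints about whether an NVIDIA GPU is exposed to this process."""
    return {
        # Device nodes such as /dev/nvidia0 normally appear only when a GPU is passed through.
        "device_nodes": sorted(glob.glob(dev_glob)),
        # Path to nvidia-smi if it is on PATH, otherwise None.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    status = gpu_status()
    print("device nodes:", status["device_nodes"] or "none found")
    print("nvidia-smi:  ", status["nvidia_smi"] or "not on PATH")
```

If both come back empty, the `Failed to infer device type` error is the expected outcome rather than a separate bug.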

Please help me fix this.
