Hi, I've made the following Dockerfile to install dependencies and run an OpenChat model. However, the container hangs on startup.
from nvidia/cuda:12.4.0-devel-ubuntu22.04
run apt-get update && apt-get install python3-pip -y && apt-get clean
run pip3 install packaging torch && pip3 install ochat && pip3 cache purge
run apt-get install git -y
run pip3 install flash_attn==2.5.8
entrypoint python3 -m ochat.serving.openai_api_server --model $model --host 0.0.0.0 --port $port
The following log is emitted, after which the container hangs and I can't even stop it with sudo docker stop.
INFO 03-10 13:58:32 __init__.py:207] Automatically detected platform cuda.
2025-03-10 13:58:33,122 WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 13:58:33,250 INFO worker.py:1821 -- Started a local Ray instance.
INFO 03-10 13:58:40 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-10 13:58:40 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='openchat/openchat-3.5-0106-gemma', speculative_config=None, tokenizer='openchat/openchat-3.5-0106-gemma', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openchat/openchat-3.5-0106-gemma, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-10 13:58:43 cuda.py:229] Using Flash Attention backend.
And if I run this container without GPUs, it fails on startup with the following error:
INFO 03-10 14:07:35 __init__.py:211] No platform detected, vLLM is running on UnspecifiedPlatform
openchat.json: 100% 484/484 [00:00<00:00, 9.62MB/s]
2025-03-10 14:07:36,435 WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67104768 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 14:07:36,562 INFO worker.py:1821 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/ochat/serving/openai_api_server.py", line 373, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 639, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1126, in create_engine_config
    device_config = DeviceConfig(device=self.device)
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1660, in __init__
    raise RuntimeError("Failed to infer device type")
RuntimeError: Failed to infer device type
Please help me fix this.