[vLLM] Stuck at "Waiting for output from MQLLMEngine"
#72 · opened by slemouellic
Hello, :)
I'm trying to run vLLM in a Docker container with the Mistral-Small-3.1-24B-Instruct-2503 model on 2 GPUs, but startup hangs indefinitely, repeating this message:
DEBUG [client.py:193] Waiting for output from MQLLMEngine.
Setup
- GPUs: 2 (A100 80GB)
- Docker image: vllm/vllm-openai:v0.8.4
Docker command
docker run -d \
--name vllm_service \
--gpus '"device=0,1"' \
--shm-size=64g \
--env-file .env \
-p 4445:4445 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--network bridge \
vllm/vllm-openai:v0.8.4 \
--host 0.0.0.0 \
--model "mistralai/Mistral-Small-3.1-24B-Instruct-2503" \
--gpu-memory-utilization 0.95 \
--dtype float16 \
--port 4445 \
--tensor-parallel-size 2 \
--enforce-eager \
--max-model-len 30800 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--limit_mm_per_prompt "image=10"
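For completeness, here is a quick sanity check of what the container actually sees, in case it helps reproduce the issue (a sketch only: it assumes the container from the run command above is up, and the torch one-liner is purely illustrative):

# Sketch: confirm both GPUs are visible inside the container
docker exec vllm_service nvidia-smi
# Sketch: report device count, name, and compute capability as seen by PyTorch
docker exec vllm_service python3 -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"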
Environment variables
NVIDIA_VISIBLE_DEVICES=all
NCCL_MAX_SOCKET_BUFFERS=16384
PYTHONUNBUFFERED=1
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True
NCCL_DEBUG=INFO
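Most of these are NCCL-related; for reference, the interconnect between the two GPUs can be inspected on the host like this (just a sketch, output not included here):

# Sketch: show how GPU 0 and GPU 1 are connected (NVLink, PIX, PHB, ...)
nvidia-smi topo -m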
Logs
NCCL initializes properly:
- Ranks and GPUs are assigned
- All channels and rings are connected
- No errors in the logs
The process loops forever with:
DEBUG [client.py:193] Waiting for output from MQLLMEngine.
The server never becomes ready.
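By "ready" I mean the OpenAI-compatible HTTP API answering requests. A minimal readiness check, assuming the standard /health route on the mapped port 4445, would look roughly like this:

# Sketch: wait until the vLLM OpenAI server answers its health endpoint
until curl -sf http://localhost:4445/health; do sleep 10; done
# Sketch: once the server is up, this should list the served model
curl -s http://localhost:4445/v1/models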
I’ve tried a bunch of different things (see environment variables above), but nothing seems to work...
Any idea what might be causing this or how to fix it?
Thanks in advance!
More logs:
INFO 04-22 02:26:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-22 02:26:30 [config.py:2832] Downcasting torch.float32 to torch.float16.
INFO 04-22 02:26:46 [config.py:689] This model supports multiple tasks: {'classify', 'score', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 04-22 02:26:48 [arg_utils.py:1731] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 04-22 02:26:49 [config.py:1713] Defaulting to use mp for distributed inference
WARNING 04-22 02:26:49 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-22 02:26:49 [api_server.py:246] Started engine process with PID 107
INFO 04-22 02:26:53 [__init__.py:239] Automatically detected platform cuda.
INFO 04-22 02:26:54 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='mistralai/Mistral-Small-3.1-24B-Instruct-2503', speculative_config=None, tokenizer='mistralai/Mistral-Small-3.1-24B-Instruct-2503', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=30800, download_dir=None, load_format=LoadFormat.MISTRAL, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=mistralai/Mistral-Small-3.1-24B-Instruct-2503, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING 04-22 02:26:56 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 30 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-22 02:26:57 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-22 02:26:57 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:26:59 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=145) INFO 04-22 02:27:02 [cuda.py:289] Using XFormers backend.
INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [utils.py:993] Found nccl from library libnccl.so.2
fe249c782f4f:107:107 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
(VllmWorkerProcess pid=145) INFO 04-22 02:27:03 [pynccl.py:69] vLLM is using nccl==2.21.5
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:107:107 [0] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
fe249c782f4f:145:145 [1] NCCL INFO cudaDriverVersion 12040
fe249c782f4f:145:145 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fe249c782f4f:145:145 [1] NCCL INFO NET/Plugin: Using internal network plugin.
fe249c782f4f:107:107 [0] NCCL INFO NET/IB : No device found.
fe249c782f4f:107:107 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:107:107 [0] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:107:107 [0] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO NET/IB : No device found.
fe249c782f4f:145:145 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
fe249c782f4f:145:145 [1] NCCL INFO Using non-device net plugin version 0
fe249c782f4f:145:145 [1] NCCL INFO Using network Socket
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init START
fe249c782f4f:107:107 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
fe249c782f4f:145:145 [1] NCCL INFO comm 0x135efc30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO comm 0x3f5d86f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
fe249c782f4f:107:107 [0] NCCL INFO Channel 00/02 : 0 1
fe249c782f4f:145:145 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
fe249c782f4f:107:107 [0] NCCL INFO Channel 01/02 : 0 1
fe249c782f4f:145:145 [1] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:107:107 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
fe249c782f4f:107:107 [0] NCCL INFO P2P Chunksize set to 131072
fe249c782f4f:145:145 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:107:107 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
fe249c782f4f:145:145 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:107:107 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
fe249c782f4f:145:145 [1] NCCL INFO Connected all rings
fe249c782f4f:107:107 [0] NCCL INFO Connected all rings
fe249c782f4f:145:145 [1] NCCL INFO Connected all trees
fe249c782f4f:107:107 [0] NCCL INFO Connected all trees
fe249c782f4f:145:145 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:145:145 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
fe249c782f4f:107:107 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
fe249c782f4f:107:107 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
fe249c782f4f:145:145 [1] NCCL INFO ncclCommInitRank comm 0x135efc30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 80 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
fe249c782f4f:107:107 [0] NCCL INFO ncclCommInitRank comm 0x3f5d86f0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 70 commId 0x3f6e4e56e2d8d67a - Init COMPLETE
INFO 04-22 02:27:04 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=145) WARNING 04-22 02:27:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 04-22 02:27:17 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_7c6fd185'), local_subscribe_addr='ipc:///tmp/dd3a6d06-f420-4cc7-8b9d-d6a3fd1115a3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-22 02:27:17 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [model_runner.py:1110] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [config.py:3466] cudagraph sizes specified by model runner [] is overridden by config []
INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
(VllmWorkerProcess pid=145) INFO 04-22 02:27:17 [weight_utils.py:265] Using model weights format ['consolidated*.safetensors', '*.pt']
DEBUG 04-22 02:27:20 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:30 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:40 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:27:51 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:01 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:11 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 04-22 02:28:21 [client.py:193] Waiting for output from MQLLMEngine.
...
Same issue here.