Environment
- Device: Macmini m4
- Python: 3.x
- Mode: Option B (MLX + 9B)
- Model:
mlx-community/Qwen3.5-9B-MLX-4bit
Description
When using mlx_engine.py as the backend (Option B), agent.py detects the model correctly and intent classification works, but no response is displayed for regular chat messages.
The MLX server logs show all requests returning 200, so the server is processing requests successfully — the issue is in the response format.
Steps to Reproduce
-
Start MLX engine:
python3 mlx/mlx_engine.py
-
In another terminal, start agent:
-
Type any message (e.g., "hello") and press Enter
-
The spinner shows "classifying" → "thinking", then returns to the prompt with no output
Expected Behavior
The model's response should be displayed, just like when using llama-server (Option A).
Actual Behavior
- The prompt returns with no visible response
- The MLX engine logs show successful
200 responses:
"POST /v1/chat/completions HTTP/1.1" 200 -
"POST /v1/chat/completions HTTP/1.1" 200 -
"GET /props HTTP/1.1" 200 -
🍎 mac code
claude code, but it runs on your Mac for free
model Qwen3.5-9b-MLX local
tools search · fetch · exec · files
cost $0.00/hr Apple M4 Metal · localhost:8000
─────────────────────────────────────────────────────────
type / to see all commands
auto ? > hello
auto ? >
Root Cause Analysis
After reading the code, I believe the issue is that mlx_engine.py does not support streaming responses.
In agent.py, chat responses go through stream_llm() (line 525), which sends "stream": true and expects Server-Sent Events (SSE) format:
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]
However, mlx_engine.py's _handle_chat() (line 246) always returns a single JSON response, ignoring the stream parameter. When agent.py tries to parse the response as SSE, it gets nothing.
Non-streaming calls (like classify_intent() via llm_call()) work fine because they don't send "stream": true.
Suggested Fix
Add SSE streaming support to mlx_engine.py when stream=True is requested, using mlx_lm.stream_generate() to yield tokens incrementally in the SSE format that agent.py expects.
Environment
mlx-community/Qwen3.5-9B-MLX-4bitDescription
When using
mlx_engine.pyas the backend (Option B),agent.pydetects the model correctly and intent classification works, but no response is displayed for regular chat messages.The MLX server logs show all requests returning
200, so the server is processing requests successfully — the issue is in the response format.Steps to Reproduce
Start MLX engine:
In another terminal, start agent:
Type any message (e.g., "hello") and press Enter
The spinner shows "classifying" → "thinking", then returns to the prompt with no output
Expected Behavior
The model's response should be displayed, just like when using
llama-server(Option A).Actual Behavior
200responses:Root Cause Analysis
After reading the code, I believe the issue is that
mlx_engine.pydoes not support streaming responses.In
agent.py, chat responses go throughstream_llm()(line 525), which sends"stream": trueand expects Server-Sent Events (SSE) format:However,
mlx_engine.py's_handle_chat()(line 246) always returns a single JSON response, ignoring thestreamparameter. Whenagent.pytries to parse the response as SSE, it gets nothing.Non-streaming calls (like
classify_intent()viallm_call()) work fine because they don't send"stream": true.Suggested Fix
Add SSE streaming support to
mlx_engine.pywhenstream=Trueis requested, usingmlx_lm.stream_generate()to yield tokens incrementally in the SSE format thatagent.pyexpects.