Skip to content

MLX engine: agent.py shows no response in chat mode (missing streaming support) #2

@Turcy76

Description

@Turcy76

Environment

  • Device: Macmini m4
  • Python: 3.x
  • Mode: Option B (MLX + 9B)
  • Model: mlx-community/Qwen3.5-9B-MLX-4bit

Description

When using mlx_engine.py as the backend (Option B), agent.py detects the model correctly and intent classification works, but no response is displayed for regular chat messages.

The MLX server logs show all requests returning 200, so the server is processing requests successfully — the issue is in the response format.

Steps to Reproduce

  1. Start MLX engine:

    python3 mlx/mlx_engine.py
  2. In another terminal, start agent:

    python3 agent.py
  3. Type any message (e.g., "hello") and press Enter

  4. The spinner shows "classifying" → "thinking", then returns to the prompt with no output

Expected Behavior

The model's response should be displayed, just like when using llama-server (Option A).

Actual Behavior

  • The prompt returns with no visible response
  • The MLX engine logs show successful 200 responses:
"POST /v1/chat/completions HTTP/1.1" 200 -
"POST /v1/chat/completions HTTP/1.1" 200 -
"GET /props HTTP/1.1" 200 -

  🍎 mac code
  claude code, but it runs on your Mac for free

  model  Qwen3.5-9b-MLX  local
  tools  search · fetch · exec · files
  cost   $0.00/hr  Apple M4 Metal · localhost:8000

─────────────────────────────────────────────────────────
  type / to see all commands

  auto ? > hello




  auto ? > 

Root Cause Analysis

After reading the code, I believe the issue is that mlx_engine.py does not support streaming responses.

In agent.py, chat responses go through stream_llm() (line 525), which sends "stream": true and expects Server-Sent Events (SSE) format:

data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]

However, mlx_engine.py's _handle_chat() (line 246) always returns a single JSON response, ignoring the stream parameter. When agent.py tries to parse the response as SSE, it gets nothing.

Non-streaming calls (like classify_intent() via llm_call()) work fine because they don't send "stream": true.

Suggested Fix

Add SSE streaming support to mlx_engine.py when stream=True is requested, using mlx_lm.stream_generate() to yield tokens incrementally in the SSE format that agent.py expects.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions