[Bug] Web server has ~10x worse performance than cli #1515

### Git commit

sd-master-06accf2-bin-win-cuda12-x64

### Operating System & Version

Windows 11

### GGML backends

CUDA

### Command-line arguments used

.\sd-cli.exe -m  "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors" -p "a lovely cat holding a sign say 'hidream o1 cpp'" --cfg-scale 1.0  -v -H 1024 -W 1024

### Steps to reproduce

1. Run the CLI with the same parameters as the SD server.
2. Generate an image with both.
3. Compare the generation speed.

### What you expected to happen

The same generation speed from the CLI and the server.

### What actually happened

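The server is roughly 6x slower: 11.49 s/it (232.61 s of sampling) versus 1.74 s/it (38.29 s) with the CLI, for the same 20 steps at 1024x1024.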
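
For reference, the slowdown can be computed directly from the timings printed in the two logs in this report. This is a minimal sketch in Python; the variable and function names are mine, and only the four numbers come from the logs.

```python
# Timings copied from the sd-cli.exe and sd-server.exe logs in this report.
cli_s_per_it, server_s_per_it = 1.74, 11.49        # from the "20/20 - ...s/it" lines
cli_sampling_s, server_sampling_s = 38.29, 232.61  # from "sampling completed, taking ...s"


def slowdown(fast: float, slow: float) -> float:
    """Return how many times slower `slow` is than `fast`."""
    return slow / fast


print(f"per-iteration slowdown:  {slowdown(cli_s_per_it, server_s_per_it):.1f}x")
print(f"total sampling slowdown: {slowdown(cli_sampling_s, server_sampling_s):.1f}x")
```

The two ratios differ slightly, presumably because the total sampling time includes a few seconds of per-run overhead on top of the 20 iterations.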

### Logs / error messages / stack trace

sd-cli.exe

PS C:\Users\copyhere2\Downloads\sd-master-06accf2-bin-win-cuda12-x64> .\sd-cli.exe -m  "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors" -p "a lovely cat holding a sign say 'hidream o1 cpp'" --cfg-scale 1.0  -v -H 1024 -W 1024
[DEBUG] main.cpp:597  - version: stable-diffusion.cpp version unknown, commit 06accf2
[DEBUG] main.cpp:598  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:599  - SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  image_path: "",
  metadata_format: "text",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false,
  metadata_raw: false,
  metadata_brief: false,
  metadata_all: false
}
[DEBUG] main.cpp:600  - SDContextParams {
  n_threads: 10,
  model_path: "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "",
  high_noise_diffusion_model_path: "",
  embeddings_connectors_path: "",
  vae_path: "",
  audio_vae_path: "",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  hires_upscalers_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: false,
  max_vram: 0,
  backend: "",
  params_backend: "",
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:601  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "a lovely cat holding a sign say 'hidream o1 cpp'",
  negative_prompt: "",
  clip_skip: -1,
  width: 1024,
  height: 1024,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=inf, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
  hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, upscale_tile_size: 128 },
  vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0 },
}
[INFO ] ggml_extend.hpp:63   - ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
[INFO ] ggml_extend.hpp:63   -   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
[DEBUG] ggml_extend_backend.cpp:311  - Found 2 backend devices:
[DEBUG] ggml_extend_backend.cpp:314  - #0: CUDA0
[DEBUG] ggml_extend_backend.cpp:314  - #1: CPU
[DEBUG] ggml_extend_backend.cpp:291  - Initializing backend: CUDA0
[INFO ] stable-diffusion.cpp:249  - loading model from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors'
[INFO ] model.cpp:219  - load I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:358  - Version: HiDream O1
[INFO ] stable-diffusion.cpp:386  - Weight type stat:                     bf16: 758
[INFO ] stable-diffusion.cpp:387  - Conditioner weight type stat:
[INFO ] stable-diffusion.cpp:388  - Diffusion model weight type stat:
[INFO ] stable-diffusion.cpp:389  - VAE weight type stat:
[DEBUG] stable-diffusion.cpp:391  - ggml tensor size = 400 bytes
[DEBUG] qwen2_tokenizer.cpp:14   - merges size 151387
[DEBUG] qwen2_tokenizer.cpp:39   - vocab size: 151674
[INFO ] stable-diffusion.cpp:739  - using FakeVAE
[DEBUG] stable-diffusion.cpp:880  - loading weights
[DEBUG] ggml_extend.hpp:2688 - hidream_o1_vision params backend buffer size =  875.61 MB(VRAM) (333 tensors)
[DEBUG] ggml_extend.hpp:2688 - hidream_o1 params backend buffer size =  15695.23 MB(VRAM) (407 tensors)
[DEBUG] ggml_extend.hpp:2671 - fake_vae skipping params allocation (no tensors)
[INFO ] model.cpp:799  - NOT using mmap for 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:810  - model files processing completed in 0.00s
[DEBUG] model.cpp:909  - using 10 threads for model loading
[DEBUG] model.cpp:925  - loading tensors from I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors
  |==================================================| 758/758 - 1.95GB/s
[INFO ] model.cpp:1143 - loading tensors completed, taking 7.69s (read: 4.42s, memcpy: 0.00s, convert: 0.05s, copy_to_backend: 1.16s)
[DEBUG] stable-diffusion.cpp:971  - finished loaded file
[INFO ] stable-diffusion.cpp:1053 - total params memory size = 16570.85MB (VRAM 16570.85MB, RAM 0.00MB): text_encoders 875.61MB(VRAM), diffusion_model 15695.23MB(VRAM), vae 0.00MB(N/A), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1130 - running in FLOW mode
[INFO ] stable-diffusion.cpp:3894 - generate_image 1024x1024
[INFO ] denoiser.hpp:637  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3214 - sampling using Euler method
[DEBUG] bpe_tokenizer.cpp:207  - split prompt "<|im_start|>user
a lovely cat holding a sign say 'hidream o1 cpp'<|im_end|>
<|im_start|>assistant
<|boi_token|><|tms_token|>" to tokens ["<|im_start|>", "user", "Ċ", "a", "Ġlovely", "Ġcat", "Ġholding", "Ġa", "Ġsign", "Ġsay", "Ġ'", "hid", "ream", "Ġo", "1", "Ġcpp", "'", "<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", "<|boi_token|>", "<|tms_token|>", ]
[INFO ] stable-diffusion.cpp:3695 - get_learned_condition completed, taking 0.00s
[INFO ] stable-diffusion.cpp:3928 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - hidream_o1 compute buffer size: 208.23 MB(VRAM)
  |==================================================| 20/20 - 1.74s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 38.29s
[INFO ] stable-diffusion.cpp:3982 - generating 1 latent images completed, taking 38.47s
[INFO ] stable-diffusion.cpp:3719 - decoding 1 latents
[DEBUG] vae.hpp:209  - computing vae decode graph completed, taking 0.01s
[INFO ] stable-diffusion.cpp:3735 - latent 1 decoded, taking 0.01s
[INFO ] stable-diffusion.cpp:3739 - decode_first_stage completed, taking 0.01s
[INFO ] stable-diffusion.cpp:4125 - generate_image completed in 38.60s
[INFO ] main.cpp:462  - save result image 0 to 'output.png' (success)
[INFO ] main.cpp:534  - 1/1 images saved

sd-server.exe

PS C:\Users\copyhere2\Downloads\sd-master-06accf2-bin-win-cuda12-x64> .\sd-server.exe -m  "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors"
[INFO ] ggml_extend.hpp:63   - ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
[INFO ] ggml_extend.hpp:63   -   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
[INFO ] stable-diffusion.cpp:249  - loading model from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors'
[INFO ] model.cpp:219  - load I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:358  - Version: HiDream O1
[INFO ] stable-diffusion.cpp:386  - Weight type stat:                     bf16: 758
[INFO ] stable-diffusion.cpp:387  - Conditioner weight type stat:
[INFO ] stable-diffusion.cpp:388  - Diffusion model weight type stat:
[INFO ] stable-diffusion.cpp:389  - VAE weight type stat:
[INFO ] stable-diffusion.cpp:739  - using FakeVAE
[INFO ] model.cpp:799  - NOT using mmap for 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:810  - model files processing completed in 0.00s
  |==================================================| 758/758 - 2.82GB/s
[INFO ] model.cpp:1143 - loading tensors completed, taking 5.33s (read: 3.35s, memcpy: 0.00s, convert: 0.11s, copy_to_backend: 1.54s)
[INFO ] stable-diffusion.cpp:1053 - total params memory size = 16570.85MB (VRAM 16570.85MB, RAM 0.00MB): text_encoders 875.61MB(VRAM), diffusion_model 15695.23MB(VRAM), vae 0.00MB(N/A), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1130 - running in FLOW mode
[INFO ] main.cpp:148  - listening on: http://127.0.0.1:1234
[INFO ] stable-diffusion.cpp:3894 - generate_image 1024x1024
[INFO ] denoiser.hpp:637  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3214 - sampling using Euler method
[INFO ] stable-diffusion.cpp:3695 - get_learned_condition completed, taking 0.00s
[INFO ] stable-diffusion.cpp:3928 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 11.49s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 232.61s
[INFO ] stable-diffusion.cpp:3982 - generating 1 latent images completed, taking 232.61s
[INFO ] stable-diffusion.cpp:3719 - decoding 1 latents
[INFO ] stable-diffusion.cpp:3735 - latent 1 decoded, taking 0.01s
[INFO ] stable-diffusion.cpp:3739 - decode_first_stage completed, taking 0.01s
[INFO ] stable-diffusion.cpp:4125 - generate_image completed in 232.63s
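
The timing lines in the two logs above can be pulled out mechanically, which may help when comparing further runs. The following is a rough sketch, not part of stable-diffusion.cpp: the regular expressions only assume the progress-bar and "sampling completed" line formats shown above, and `extract_timings` is a hypothetical helper name.

```python
import re

# Patterns for the progress-bar line and the "sampling completed" line
# as they appear in the logs in this report.
S_PER_IT = re.compile(r"(\d+)/(\d+) - ([\d.]+)s/it")
SAMPLING = re.compile(r"sampling completed, taking ([\d.]+)s")


def extract_timings(log_text: str) -> dict:
    """Pull step count, seconds per iteration, and sampling time from a log."""
    timings = {}
    m = S_PER_IT.search(log_text)
    if m:
        timings["steps"] = int(m.group(2))
        timings["s_per_it"] = float(m.group(3))
    m = SAMPLING.search(log_text)
    if m:
        timings["sampling_s"] = float(m.group(1))
    return timings


# Excerpts copied from the two logs above.
cli_log = """
  |==================================================| 758/758 - 1.95GB/s
  |==================================================| 20/20 - 1.74s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 38.29s
"""
server_log = """
  |==================================================| 758/758 - 2.82GB/s
  |==================================================| 20/20 - 11.49s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 232.61s
"""

cli = extract_timings(cli_log)
server = extract_timings(server_log)
print(cli)
print(server)
print(f"slowdown: {server['s_per_it'] / cli['s_per_it']:.1f}x per iteration")
```

The model-loading progress line (`758/758 - 1.95GB/s`) is skipped because it does not end in `s/it`.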

### Additional context / environment details

My guess is that the API call overrides some default setting that the CLI uses when it is run with the minimal command line above.
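
One way to test this guess would be to diff the parameter dumps from both programs. The sketch below parses `key: value` lines in the style of the `SDContextParams` / `SDGenerationParams` blocks that `sd-cli.exe -v` prints and reports the keys whose values differ. It assumes `sd-server.exe` can print a comparable dump with verbose logging, which I have not confirmed. `parse_params` and `diff_params` are hypothetical helper names, and the server dump in the example is made up for illustration, not taken from a real log.

```python
import re

# Matches indented "key: value," lines as printed in the params dumps.
PARAM_LINE = re.compile(r"^\s+(\w+): (.*?),?$")


def parse_params(dump: str) -> dict:
    """Collect key/value pairs (as strings) from a params dump."""
    params = {}
    for line in dump.splitlines():
        m = PARAM_LINE.match(line)
        if m:
            params[m.group(1)] = m.group(2)
    return params


def diff_params(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(a.keys() | b.keys())
        if a.get(k) != b.get(k)
    }


# Lines copied from the CLI dump in this report.
cli_dump = """
  n_threads: 10,
  offload_params_to_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
"""
# HYPOTHETICAL server dump: the differing value is invented to show the output shape.
server_dump = """
  n_threads: 10,
  offload_params_to_cpu: true,
  flash_attn: false,
  diffusion_flash_attn: false,
"""

print(diff_params(parse_params(cli_dump), parse_params(server_dump)))
```

Any key that shows up in the diff of two real dumps would be a candidate for the setting that the API request changes.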
