[Bug] Web server has ~10x worse performance than cli #1515

@mpulukkinen

Description

Git commit

sd-master-06accf2-bin-win-cuda12-x64

Operating System & Version

Windows 11

GGML backends

CUDA

Command-line arguments used

.\sd-cli.exe -m "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors" -p "a lovely cat holding a sign say 'hidream o1 cpp'" --cfg-scale 1.0 -v -H 1024 -W 1024

Steps to reproduce

Start the CLI with the same parameters as the SD server...
Generate an image with both
Compare the generation speed
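
The report does not include the API request that was sent to the server. As a rough sketch of how the server side of the comparison could be pinned down, the snippet below builds (but does not send) a request that spells out the same settings as the CLI run. The endpoint path and the JSON field names are assumptions and are not taken from this report, so check them against the server's documentation before relying on them. Only the host and port come from the server log.

```python
import json
import urllib.request

# ASSUMPTION: the route and the JSON field names are illustrative only;
# the report does not say which API route or payload was used.
HYPOTHETICAL_ENDPOINT = "http://127.0.0.1:1234/sdapi/v1/txt2img"


def build_request(endpoint: str = HYPOTHETICAL_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the CLI arguments."""
    payload = {
        "prompt": "a lovely cat holding a sign say 'hidream o1 cpp'",
        "width": 1024,
        "height": 1024,
        "cfg_scale": 1.0,
        "steps": 20,  # the CLI log reports sample_steps: 20
        "seed": 42,   # the CLI log reports seed: 42
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request()
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req, timeout=600)
```

Stating every setting explicitly in the request makes it easier to rule out a difference in defaults between the two front ends.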

What you expected to happen

The same generation speed from the CLI and the server

What actually happened

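Almost 10x slower generation with the server. In the logs below the server runs at 11.49s/it versus 1.74s/it for the CLI, about 6.6x slower in that particular run.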

Logs / error messages / stack trace

sd-cli.exe

PS C:\Users\copyhere2\Downloads\sd-master-06accf2-bin-win-cuda12-x64> .\sd-cli.exe -m "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors" -p "a lovely cat holding a sign say 'hidream o1 cpp'" --cfg-scale 1.0 -v -H 1024 -W 1024
[DEBUG] main.cpp:597 - version: stable-diffusion.cpp version unknown, commit 06accf2
[DEBUG] main.cpp:598 - System Info:
SSE3 = 1 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | VSX = 0 |
[DEBUG] main.cpp:599 - SDCliParams {
mode: img_gen,
output_path: "output.png",
image_path: "",
metadata_format: "text",
verbose: true,
color: false,
canny_preprocess: false,
convert_name: false,
preview_method: none,
preview_interval: 1,
preview_path: "preview.png",
preview_fps: 16,
taesd_preview: false,
preview_noisy: false,
metadata_raw: false,
metadata_brief: false,
metadata_all: false
}
[DEBUG] main.cpp:600 - SDContextParams {
n_threads: 10,
model_path: "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors",
clip_l_path: "",
clip_g_path: "",
clip_vision_path: "",
t5xxl_path: "",
llm_path: "",
llm_vision_path: "",
diffusion_model_path: "",
high_noise_diffusion_model_path: "",
embeddings_connectors_path: "",
vae_path: "",
audio_vae_path: "",
taesd_path: "",
esrgan_path: "",
control_net_path: "",
embedding_dir: "",
embeddings: {
}
wtype: NONE,
tensor_type_rules: "",
lora_model_dir: ".",
hires_upscalers_dir: "",
photo_maker_path: "",
rng_type: cuda,
sampler_rng_type: NONE,
offload_params_to_cpu: false,
max_vram: 0,
backend: "",
params_backend: "",
enable_mmap: false,
control_net_cpu: false,
clip_on_cpu: false,
vae_on_cpu: false,
flash_attn: false,
diffusion_flash_attn: false,
diffusion_conv_direct: false,
vae_conv_direct: false,
circular: false,
circular_x: false,
circular_y: false,
chroma_use_dit_mask: true,
qwen_image_zero_cond_t: false,
chroma_use_t5_mask: false,
chroma_t5_mask_pad: 1,
prediction: NONE,
lora_apply_mode: auto,
force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:601 - SDGenerationParams {
loras: "{
}",
high_noise_loras: "{
}",
prompt: "a lovely cat holding a sign say 'hidream o1 cpp'",
negative_prompt: "",
clip_skip: -1,
width: 1024,
height: 1024,
batch_count: 1,
init_image_path: "",
end_image_path: "",
mask_image_path: "",
control_image_path: "",
ref_image_paths: [],
control_video_path: "",
auto_resize_ref_image: true,
increase_ref_index: false,
pm_id_images_dir: "",
pm_id_embed_path: "",
pm_style_strength: 20,
skip_layers: [7, 8, 9],
sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
high_noise_skip_layers: [7, 8, 9],
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
custom_sigmas: [],
cache_mode: "",
cache_option: "",
cache: disabled (threshold=inf, start=0.15, end=0.95),
moe_boundary: 0.875,
video_frames: 1,
fps: 16,
vace_strength: 1,
strength: 0.75,
control_strength: 0.9,
seed: 42,
upscale_repeats: 1,
upscale_tile_size: 128,
hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, upscale_tile_size: 128 },
vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0 },
}
[INFO ] ggml_extend.hpp:63 - ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
[INFO ] ggml_extend.hpp:63 - Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
[DEBUG] ggml_extend_backend.cpp:311 - Found 2 backend devices:
[DEBUG] ggml_extend_backend.cpp:314 - #0: CUDA0
[DEBUG] ggml_extend_backend.cpp:314 - #1: CPU
[DEBUG] ggml_extend_backend.cpp:291 - Initializing backend: CUDA0
[INFO ] stable-diffusion.cpp:249 - loading model from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors'
[INFO ] model.cpp:219 - load I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors using safetensors format
[DEBUG] model.cpp:294 - init from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:358 - Version: HiDream O1
[INFO ] stable-diffusion.cpp:386 - Weight type stat: bf16: 758
[INFO ] stable-diffusion.cpp:387 - Conditioner weight type stat:
[INFO ] stable-diffusion.cpp:388 - Diffusion model weight type stat:
[INFO ] stable-diffusion.cpp:389 - VAE weight type stat:
[DEBUG] stable-diffusion.cpp:391 - ggml tensor size = 400 bytes
[DEBUG] qwen2_tokenizer.cpp:14 - merges size 151387
[DEBUG] qwen2_tokenizer.cpp:39 - vocab size: 151674
[INFO ] stable-diffusion.cpp:739 - using FakeVAE
[DEBUG] stable-diffusion.cpp:880 - loading weights
[DEBUG] ggml_extend.hpp:2688 - hidream_o1_vision params backend buffer size = 875.61 MB(VRAM) (333 tensors)
[DEBUG] ggml_extend.hpp:2688 - hidream_o1 params backend buffer size = 15695.23 MB(VRAM) (407 tensors)
[DEBUG] ggml_extend.hpp:2671 - fake_vae skipping params allocation (no tensors)
[INFO ] model.cpp:799 - NOT using mmap for 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:810 - model files processing completed in 0.00s
[DEBUG] model.cpp:909 - using 10 threads for model loading
[DEBUG] model.cpp:925 - loading tensors from I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors
|==================================================| 758/758 - 1.95GB/s
[INFO ] model.cpp:1143 - loading tensors completed, taking 7.69s (read: 4.42s, memcpy: 0.00s, convert: 0.05s, copy_to_backend: 1.16s)
[DEBUG] stable-diffusion.cpp:971 - finished loaded file
[INFO ] stable-diffusion.cpp:1053 - total params memory size = 16570.85MB (VRAM 16570.85MB, RAM 0.00MB): text_encoders 875.61MB(VRAM), diffusion_model 15695.23MB(VRAM), vae 0.00MB(N/A), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1130 - running in FLOW mode
[INFO ] stable-diffusion.cpp:3894 - generate_image 1024x1024
[INFO ] denoiser.hpp:637 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3214 - sampling using Euler method
[DEBUG] bpe_tokenizer.cpp:207 - split prompt "<|im_start|>user
a lovely cat holding a sign say 'hidream o1 cpp'<|im_end|>
<|im_start|>assistant
<|boi_token|><|tms_token|>" to tokens ["<|im_start|>", "user", "Ċ", "a", "Ġlovely", "Ġcat", "Ġholding", "Ġa", "Ġsign", "Ġsay", "Ġ'", "hid", "ream", "Ġo", "1", "Ġcpp", "'", "<|im_end|>", "Ċ", "<|im_start|>", "assistant", "Ċ", "<|boi_token|>", "<|tms_token|>", ]
[INFO ] stable-diffusion.cpp:3695 - get_learned_condition completed, taking 0.00s
[INFO ] stable-diffusion.cpp:3928 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - hidream_o1 compute buffer size: 208.23 MB(VRAM)
|==================================================| 20/20 - 1.74s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 38.29s
[INFO ] stable-diffusion.cpp:3982 - generating 1 latent images completed, taking 38.47s
[INFO ] stable-diffusion.cpp:3719 - decoding 1 latents
[DEBUG] vae.hpp:209 - computing vae decode graph completed, taking 0.01s
[INFO ] stable-diffusion.cpp:3735 - latent 1 decoded, taking 0.01s
[INFO ] stable-diffusion.cpp:3739 - decode_first_stage completed, taking 0.01s
[INFO ] stable-diffusion.cpp:4125 - generate_image completed in 38.60s
[INFO ] main.cpp:462 - save result image 0 to 'output.png' (success)
[INFO ] main.cpp:534 - 1/1 images saved
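
One thing the log above allows checking is the memory budget. The sketch below adds up the buffer sizes that the log reports and compares them with the reported device VRAM. It assumes that the log's "MB" and "MiB" denote the same unit, which the report does not confirm. It does not establish a cause for the slowdown: the server log reports the same parameter sizes, so this only shows how little headroom either run has.

```python
# Figures copied from the CLI log above (the server log reports the same param sizes).
DEVICE_VRAM_MIB = 16375.00       # "Total VRAM: 16375 MiB"
VISION_PARAMS_MB = 875.61        # "hidream_o1_vision params backend buffer size"
DIFFUSION_PARAMS_MB = 15695.23   # "hidream_o1 params backend buffer size"
COMPUTE_BUFFER_MB = 208.23       # "hidream_o1 compute buffer size" (CLI debug log only)


def vram_headroom(device_mib: float, *allocations_mb: float) -> float:
    """Return device memory minus the sum of the given allocations.

    ASSUMPTION: the log's "MB" and "MiB" denote the same unit.
    """
    return device_mib - sum(allocations_mb)


headroom = vram_headroom(
    DEVICE_VRAM_MIB, VISION_PARAMS_MB, DIFFUSION_PARAMS_MB, COMPUTE_BUFFER_MB
)
# A negative value means the reported buffers exceed the reported device VRAM.
print(f"headroom: {headroom:.2f} MB")
```

Under these assumptions the result is negative, so both runs appear to need more memory than the card reports.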

SD server

PS C:\Users\copyhere2\Downloads\sd-master-06accf2-bin-win-cuda12-x64> .\sd-server.exe -m "I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors"
[INFO ] ggml_extend.hpp:63 - ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
[INFO ] ggml_extend.hpp:63 - Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
[INFO ] stable-diffusion.cpp:249 - loading model from 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors'
[INFO ] model.cpp:219 - load I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:358 - Version: HiDream O1
[INFO ] stable-diffusion.cpp:386 - Weight type stat: bf16: 758
[INFO ] stable-diffusion.cpp:387 - Conditioner weight type stat:
[INFO ] stable-diffusion.cpp:388 - Diffusion model weight type stat:
[INFO ] stable-diffusion.cpp:389 - VAE weight type stat:
[INFO ] stable-diffusion.cpp:739 - using FakeVAE
[INFO ] model.cpp:799 - NOT using mmap for 'I:\LvsSdModels\hiDream\hidream_o1_image_bf16.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:810 - model files processing completed in 0.00s
|==================================================| 758/758 - 2.82GB/s
[INFO ] model.cpp:1143 - loading tensors completed, taking 5.33s (read: 3.35s, memcpy: 0.00s, convert: 0.11s, copy_to_backend: 1.54s)
[INFO ] stable-diffusion.cpp:1053 - total params memory size = 16570.85MB (VRAM 16570.85MB, RAM 0.00MB): text_encoders 875.61MB(VRAM), diffusion_model 15695.23MB(VRAM), vae 0.00MB(N/A), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1130 - running in FLOW mode
[INFO ] main.cpp:148 - listening on: http://127.0.0.1:1234
[INFO ] stable-diffusion.cpp:3894 - generate_image 1024x1024
[INFO ] denoiser.hpp:637 - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3214 - sampling using Euler method
[INFO ] stable-diffusion.cpp:3695 - get_learned_condition completed, taking 0.00s
[INFO ] stable-diffusion.cpp:3928 - generating image: 1/1 - seed 42
|==================================================| 20/20 - 11.49s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 232.61s
[INFO ] stable-diffusion.cpp:3982 - generating 1 latent images completed, taking 232.61s
[INFO ] stable-diffusion.cpp:3719 - decoding 1 latents
[INFO ] stable-diffusion.cpp:3735 - latent 1 decoded, taking 0.01s
[INFO ] stable-diffusion.cpp:3739 - decode_first_stage completed, taking 0.01s
[INFO ] stable-diffusion.cpp:4125 - generate_image completed in 232.63s
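
To put a number on the difference, the timing lines from the two logs can be compared directly. The sketch below copies the relevant lines from the logs above and computes the server/CLI ratio. The helper name `parse_timings` is made up for this illustration.

```python
import re

# Timing lines copied verbatim from the two logs above.
CLI_LOG = """\
|==================================================| 20/20 - 1.74s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 38.29s
"""
SERVER_LOG = """\
|==================================================| 20/20 - 11.49s/it
[INFO ] stable-diffusion.cpp:3962 - sampling completed, taking 232.61s
"""


def parse_timings(log: str) -> dict:
    """Extract seconds per iteration and total sampling time from a log excerpt."""
    per_it = re.search(r"\d+/\d+ - ([\d.]+)s/it", log)
    total = re.search(r"sampling completed, taking ([\d.]+)s", log)
    if per_it is None or total is None:
        raise ValueError("timing lines not found in log")
    return {"s_per_it": float(per_it.group(1)), "sampling_s": float(total.group(1))}


cli = parse_timings(CLI_LOG)
server = parse_timings(SERVER_LOG)
ratio_per_it = server["s_per_it"] / cli["s_per_it"]
ratio_total = server["sampling_s"] / cli["sampling_s"]
print(f"per-iteration slowdown: {ratio_per_it:.1f}x")
print(f"total sampling slowdown: {ratio_total:.1f}x")
```

For this pair of runs both ratios come out a little above 6x.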

Additional context / environment details

My guess is that the API call overrides some default setting that the CLI uses when it is run with this minimal command line.
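
One way to test this guess would be to capture the parameter dump from both programs and diff them. The sketch below parses `key: value` lines like the ones in the CLI's `SDContextParams` dump and reports the keys whose values differ. It assumes the server can print a comparable dump (for example with the same `-v` flag as the CLI, which this report does not confirm). The server-side excerpt here is a made-up placeholder used only to exercise the diff, because the report contains no server-side dump.

```python
# Excerpt copied from the CLI's SDContextParams dump in the log above.
CLI_DUMP = """\
n_threads: 10,
offload_params_to_cpu: false,
flash_attn: false,
diffusion_flash_attn: false,
"""

# HYPOTHETICAL placeholder: invented values, not real server output.
SERVER_DUMP_HYPOTHETICAL = """\
n_threads: 4,
offload_params_to_cpu: false,
flash_attn: false,
diffusion_flash_attn: false,
"""


def parse_params(dump: str) -> dict:
    """Parse 'key: value,' lines into a dict, ignoring lines of any other shape."""
    params = {}
    for line in dump.splitlines():
        line = line.strip().rstrip(",")
        key, sep, value = line.partition(": ")
        if sep and key.replace("_", "").isalnum():
            params[key] = value
    return params


def diff_params(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    changed = diff_params(parse_params(CLI_DUMP), parse_params(SERVER_DUMP_HYPOTHETICAL))
    for key, (cli_value, server_value) in changed.items():
        print(f"{key}: cli={cli_value} server={server_value}")
```

Any key that shows up in the diff would be a candidate for the overridden default.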
