[FEATURE] Add GGUF Model Support via llama.cpp
User Story: As a user with a VRAM-limited GPU (e.g., 8GB-12GB), I want to be able to use quantized GGUF Vision-Language Models so that I can run the application's core features with reasonable performance and without running out of memory.
Problem: The current application relies exclusively on the transformers library, which loads unquantized PyTorch/SafeTensors models. These models are very large and require significant VRAM (16GB+ recommended), making the application slow or unusable for a large portion of the potential user base.
Proposed Solution: Integrate the llama-cpp-python library as an alternative backend for model loading and inference.
-
Modify vlm_profiles.py to include new loader and generation functions specifically for GGUF models.
-
The UI should allow users to select GGUF models, potentially by pointing to a local file.
-
The application will need to detect the model type and route it to the correct backend (transformers or llama.cpp).
Goal: Dramatically lower the VRAM and system RAM requirements, making PlotCaption accessible and performant for a much wider range of hardware. This is the top priority for improving accessibility.
Source: This feature was suggested and proven to be viable by user willdone on Reddit, who successfully patched in a Q8_0 GGUF for testing.
[FEATURE] Add GGUF Model Support via llama.cpp
User Story: As a user with a VRAM-limited GPU (e.g., 8GB-12GB), I want to be able to use quantized GGUF Vision-Language Models so that I can run the application's core features with reasonable performance and without running out of memory.
Problem: The current application relies exclusively on the
transformerslibrary, which loads unquantized PyTorch/SafeTensors models. These models are very large and require significant VRAM (16GB+ recommended), making the application slow or unusable for a large portion of the potential user base.Proposed Solution: Integrate the
llama-cpp-pythonlibrary as an alternative backend for model loading and inference.Modify
vlm_profiles.pyto include new loader and generation functions specifically for GGUF models.The UI should allow users to select GGUF models, potentially by pointing to a local file.
The application will need to detect the model type and route it to the correct backend (
transformersorllama.cpp).Goal: Dramatically lower the VRAM and system RAM requirements, making PlotCaption accessible and performant for a much wider range of hardware. This is the top priority for improving accessibility.
Source: This feature was suggested and proven to be viable by user
willdoneon Reddit, who successfully patched in a Q8_0 GGUF for testing.