Modify llama tokenizer, llama modeling, qwen2 modeling.#677
Modify llama tokenizer, llama modeling, qwen2 modeling.#677ZhangJinghe-AI wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR introduces Llama-3 tokenization support with configurable model selection, refactors quantization observation paths in both Llama and Qwen2 attention modules to shift from input to output QDQ, and integrates interactive prompt handling into the AOT example runner. ChangesLlama Tokenization and Integration
Qwen2 Quantization Refactoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (1 warning, 2 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (1)
281-304:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winThe new observer arguments are currently a no-op.
rotate_half(...)never reads itsx_observerparameter, so swapping inq_proj_output_qdq.fake_quant.activation_post_processandk_proj_output_qdq.fake_quant.activation_post_processon Line 286 and Line 298 does not change runtime behavior. If this path is meant to use the output observer, wire it up insiderotate_half; otherwise remove the unused parameter so the refactor is not misleading.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around lines 281 - 304, rotate_half currently ignores the x_observer argument, so the new q_proj_output_qdq.fake_quant.activation_post_process and k_proj_output_qdq.fake_quant.activation_post_process passed from the q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused x_observer parameter from rotate_half and from all calls in the q_rope/key_rope code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites, k_rope_mul_1_output_qdq call sites, q_proj_output_qdq.fake_quant.activation_post_process, k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside rotate_half so the observer is actually applied to the tensor (e.g., call the provided activation_post_process/observer on the input or output within rotate_half) and keep the current call sites; update the rotate_half signature and all its call sites accordingly to match the chosen approach.
🧹 Nitpick comments (1)
mllm/models/llama/tokenization_llama.hpp (1)
98-101: ⚡ Quick winDocument the public tokenizer API surface.
LlamaMessageand multiple publicLlamaTokenizermethods are exported without clear API comments (expected inputs/outputs/error behavior). Please add concise API docs to reduce integration ambiguity.As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
Also applies to: 104-237
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mllm/models/llama/tokenization_llama.hpp` around lines 98 - 101, Add concise public API documentation for the exported symbols: add a doc comment above the LlamaMessage struct describing its purpose and the meaning/expected format of role and content, and add doc comments for each public LlamaTokenizer method (e.g., any Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public functions found in the file) that specify purpose, parameter expectations (types, valid ranges, ownership/constness), return value semantics, and possible error/exception behavior (including when methods may fail or return empty/invalid results). Follow the project's docstyle (brief one-line summary + short param/return/error bullets), mention thread-safety if relevant, and ensure comments are placed immediately above the declaration so callers see the API contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/llama_qnn_aot/aot_run.cpp`:
- Around line 65-69: The prompt loop currently always sends input to
tokenizer.convertMessage; add an early-exit check after reading prompt_text that
returns or breaks when the user enters "exit" or "quit" (case-insensitive and
trimming surrounding whitespace), and also handle EOF (std::cin.eof())
similarly; implement this check immediately before calling
tokenizer.convertMessage so prompt_text is not processed when the user intends
to quit.
- Around line 41-44: The comment above the tokenizer instantiation is stale —
remove or update the lines that mention "Using Qwen3 tokenizer as a placeholder"
and instead document that LlamaTokenizer is being used; specifically edit the
comment that precedes the creation of
mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it correctly
reflects the use of LlamaTokenizer (or notes to implement a Llama3Tokenizer if
different) and remove any leftover references to Qwen3 to avoid confusion.
In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 5-11: The header tokenization_llama.hpp uses std::iswspace
(references at tokenization functions around lines using iswspace) but doesn't
include the wide-char ctype header; add an explicit `#include` <cwctype> to
tokenization_llama.hpp so std::iswspace is guaranteed to be declared (update the
top-of-file includes alongside <string>, <vector>, etc.), then rebuild to
confirm portability on platforms that don't provide it via transitive includes.
- Around line 121-127: The getSystemPromptPrefix() function uses std::localtime
(unsafe for concurrent calls from applyChatTemplate()); replace it with a
thread-safe alternative — on POSIX call std::localtime_r(&t, &tm_) or on Windows
std::localtime_s(&tm_, &t) (or use std::chrono to format the current date) so
that tm_ is populated safely per-thread; update the declaration/initialization
of tm_ accordingly and keep the existing formatting with std::put_time and
oss.str() untouched.
In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py`:
- Around line 253-259: The calibration path must use the same chat formatting as
inference: modify the calibrate(...) implementation (referencing calibrate and
its use of calibration samples) to build messages = [{"role": "user", "content":
sample}] and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.
---
Outside diff comments:
In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py`:
- Around line 281-304: rotate_half currently ignores the x_observer argument, so
the new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.
---
Nitpick comments:
In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 98-101: Add concise public API documentation for the exported
symbols: add a doc comment above the LlamaMessage struct describing its purpose
and the meaning/expected format of role and content, and add doc comments for
each public LlamaTokenizer method (e.g., any
Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public functions
found in the file) that specify purpose, parameter expectations (types, valid
ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c19ed8ee-bc79-4e4d-8767-91e2ceb8db20
📒 Files selected for processing (7)
examples/llama_qnn_aot/aot_run.cppmllm/backends/qnn/aot_rt/QnnAOTConfig.hppmllm/backends/qnn/aot_rt/QnnAOTRuntime.cppmllm/models/llama/tokenization_llama.hpppymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.pypymllm/mobile/backends/qualcomm/transformers/llama/runner.pypymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py
| // Note: Using Qwen3 tokenizer as a placeholder. | ||
| // For production use, you should implement a Llama3Tokenizer or use | ||
| // the appropriate tokenizer for your model. | ||
| auto tokenizer = mllm::models::llama::TinyLlamaTokenizer(tokenizer_path.get()); | ||
| auto tokenizer = mllm::models::llama::LlamaTokenizer(tokenizer_path.get()); |
There was a problem hiding this comment.
Remove stale tokenizer comment.
Lines 41–43 still say Qwen3 tokenizer placeholder, but Line 44 instantiates LlamaTokenizer. Please update/remove this comment to avoid maintenance confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/llama_qnn_aot/aot_run.cpp` around lines 41 - 44, The comment above
the tokenizer instantiation is stale — remove or update the lines that mention
"Using Qwen3 tokenizer as a placeholder" and instead document that
LlamaTokenizer is being used; specifically edit the comment that precedes the
creation of mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it
correctly reflects the use of LlamaTokenizer (or notes to implement a
Llama3Tokenizer if different) and remove any leftover references to Qwen3 to
avoid confusion.
| fmt::print("💬 Prompt text (or 'exit/quit'): "); | ||
| std::getline(std::cin, prompt_text); | ||
|
|
||
| auto input_tensor = tokenizer.convertMessage({{.role = "user", .content = prompt_text}}); | ||
|
|
There was a problem hiding this comment.
Implement the advertised exit/quit path.
The prompt text says exit/quit (Line 65), but input is always sent to generation. Add an early return when the user enters exit or quit.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/llama_qnn_aot/aot_run.cpp` around lines 65 - 69, The prompt loop
currently always sends input to tokenizer.convertMessage; add an early-exit
check after reading prompt_text that returns or breaks when the user enters
"exit" or "quit" (case-insensitive and trimming surrounding whitespace), and
also handle EOF (std::cin.eof()) similarly; implement this check immediately
before calling tokenizer.convertMessage so prompt_text is not processed when the
user intends to quit.
| #include <vector> | ||
| #include <string> | ||
| #include <ctime> | ||
| #include <iomanip> | ||
| #include <sstream> | ||
| #include <unordered_map> | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n mllm/models/llama/tokenization_llama.hpp | head -150Repository: UbiquitousLearning/mllm
Length of output: 6309
Add the missing wide-char ctype header for std::iswspace.
std::iswspace is used at lines 62, 63, 75, and 77 but <cwctype> is not explicitly included. While the code may compile on some platforms due to transitive includes, this violates portability best practices and C++ idioms. All headers for functions used must be explicitly included.
Proposed fix
`#include` <vector>
`#include` <string>
+#include <cwctype>
`#include` <ctime>
`#include` <iomanip>
`#include` <sstream>
`#include` <unordered_map>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #include <vector> | |
| #include <string> | |
| #include <ctime> | |
| #include <iomanip> | |
| #include <sstream> | |
| #include <unordered_map> | |
| `#include` <vector> | |
| `#include` <string> | |
| `#include` <cwctype> | |
| `#include` <ctime> | |
| `#include` <iomanip> | |
| `#include` <sstream> | |
| `#include` <unordered_map> | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mllm/models/llama/tokenization_llama.hpp` around lines 5 - 11, The header
tokenization_llama.hpp uses std::iswspace (references at tokenization functions
around lines using iswspace) but doesn't include the wide-char ctype header; add
an explicit `#include` <cwctype> to tokenization_llama.hpp so std::iswspace is
guaranteed to be declared (update the top-of-file includes alongside <string>,
<vector>, etc.), then rebuild to confirm portability on platforms that don't
provide it via transitive includes.
| std::string getSystemPromptPrefix() { | ||
| std::time_t t = std::time(nullptr); | ||
| std::tm tm_ = *std::localtime(&t); | ||
| std::ostringstream oss; | ||
| oss << std::put_time(&tm_, "%d %b %Y"); | ||
| return "Cutting Knowledge Date: December 2023\nToday Date: " + oss.str() + "\n\n"; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat mllm/models/llama/tokenization_llama.hpp | head -150 | tail -50Repository: UbiquitousLearning/mllm
Length of output: 2162
🏁 Script executed:
rg "getSystemPromptPrefix" --type hpp --type cpp -B 2 -A 2Repository: UbiquitousLearning/mllm
Length of output: 95
🏁 Script executed:
# Check if there are any threading or concurrency patterns in the codebase
rg "std::thread|std::mutex|pthread|`#pragma` omp" mllm/models/llama/ | head -20Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Look for other usages of std::localtime or similar time functions
rg "std::localtime|localtime_r|localtime_s|std::chrono" mllm/models/llama/ --type hpp --type cppRepository: UbiquitousLearning/mllm
Length of output: 95
🏁 Script executed:
rg "getSystemPromptPrefix" --max-count 10Repository: UbiquitousLearning/mllm
Length of output: 266
🏁 Script executed:
rg "applyChatTemplate" --max-count 10Repository: UbiquitousLearning/mllm
Length of output: 1690
🏁 Script executed:
# Check if there's any concurrent usage or threading in the codebase
rg "std::thread|std::async|std::mutex|concurrent" --max-count 20Repository: UbiquitousLearning/mllm
Length of output: 8668
🏁 Script executed:
# Check what C++ version is being used
find . -name "CMakeLists.txt" -o -name "*.cmake" | head -5 | xargs grep -l "cxx_standard\|CXX_STANDARD" 2>/dev/nullRepository: UbiquitousLearning/mllm
Length of output: 165
🏁 Script executed:
# Look at the full tokenization_llama.hpp file to understand context better
wc -l mllm/models/llama/tokenization_llama.hppRepository: UbiquitousLearning/mllm
Length of output: 112
🏁 Script executed:
cat mllm/CMakeLists.txt | grep -i "cxx_standard\|CXX_STANDARD" -A 2 -B 2Repository: UbiquitousLearning/mllm
Length of output: 49
🏁 Script executed:
# Check if there are any service/server implementations that might call applyChatTemplate concurrently
rg "LlamaTokenizer|applyChatTemplate" mllm/ -A 3 -B 3 | head -50Repository: UbiquitousLearning/mllm
Length of output: 4198
🏁 Script executed:
# Check the Service.hpp to understand threading model
head -100 mllm/engine/service/Service.hppRepository: UbiquitousLearning/mllm
Length of output: 2221
🌐 Web query:
std::localtime thread safety C++ std::chrono alternative
💡 Result:
The C-style function std::localtime is not thread-safe because it returns a pointer to a shared, statically allocated std::tm object, which can be overwritten by concurrent calls from other threads [1][2][3]. The modern, thread-safe alternative in C++ is the C++20 library, which provides robust support for time zones and calendar conversions [3][4][5]. Recommended C++20 Approach: Use std::chrono::zoned_time to handle local time conversions in a thread-safe manner [3][4][6]. Example: #include #include void print_local_time { auto now = std::chrono::system_clock::now; // zoned_time is thread-safe and handles time zone conversions auto local_now = std::chrono::zoned_time{std::chrono::current_zone, now}; // Use std::format (C++20) for thread-safe, type-safe formatting std::cout << std::format("{:%F %T %Z}", local_now) << std::endl; } Legacy Alternatives (Pre-C++20): If you cannot use C++20, you must use platform-specific thread-safe functions or manual synchronization [1][7][8]: 1. POSIX (Linux/macOS): Use localtime_r(const std::time_t*, std::tm*), which writes the result into a user-provided buffer [1][2][3]. 2. Windows (MSVC): Use localtime_s(std::tm*, const std::time_t*), which also uses a user-provided buffer [1][3][8]. 3. Manual Synchronization: Wrap calls to std::localtime in a std::mutex to prevent concurrent access [1][7][4]. Note: Avoid std::ctime and std::gmtime for the same reasons, as they also rely on shared static buffers [2][8][4].
Citations:
- 1: https://stackoverflow.com/questions/38034033/c-localtime-this-function-or-variable-may-be-unsafe/38034148
- 2: https://stackoverflow.com/questions/25618702/why-is-there-no-c11-threadsafe-alternative-to-stdlocaltime-and-stdgmtime
- 3: https://runebook.dev/en/articles/cpp/chrono/c/localtime
- 4: https://stackoverflow.com/questions/48106291/will-getting-the-current-date-time-be-thread-safe-in-c20
- 5: https://www.youtube.com/watch?v=Dq7rqjatxz8
- 6: https://en.cppreference.com/cpp/chrono/zoned_time
- 7: https://cplusplus.com/forum/beginner/94342/
- 8: https://kjellkod.wordpress.com/2013/01/22/exploring-c11-part-2-localtime-and-time-again/
Use thread-safe time functions instead of std::localtime.
std::localtime is not thread-safe as it returns a pointer to a shared static buffer. Since getSystemPromptPrefix() is called from applyChatTemplate(), which is invoked by concurrent worker threads in the Service implementation, this creates a race condition. Replace with std::localtime_r (POSIX) or std::localtime_s (Windows), or use std::chrono for the modern C++ approach.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mllm/models/llama/tokenization_llama.hpp` around lines 121 - 127, The
getSystemPromptPrefix() function uses std::localtime (unsafe for concurrent
calls from applyChatTemplate()); replace it with a thread-safe alternative — on
POSIX call std::localtime_r(&t, &tm_) or on Windows std::localtime_s(&tm_, &t)
(or use std::chrono to format the current date) so that tm_ is populated safely
per-thread; update the declaration/initialization of tm_ accordingly and keep
the existing formatting with std::put_time and oss.str() untouched.
| def infer(self, prompt: str): | ||
| messages = [{"role": "user", "content": prompt}] | ||
| prompt = self.tokenizer.apply_chat_template( | ||
| messages, | ||
| tokenize=False, | ||
| add_generation_prompt=True, | ||
| ) |
There was a problem hiding this comment.
Align calibration preprocessing with inference chat formatting.
infer() now uses apply_chat_template(...) (Line 255), but calibrate() still feeds raw text (Line 324). This distribution mismatch can produce inaccurate activation statistics for deployment. Use the same chat-template path in calibration samples.
Also applies to: 322-330
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py` around lines
253 - 259, The calibration path must use the same chat formatting as inference:
modify the calibrate(...) implementation (referencing calibrate and its use of
calibration samples) to build messages = [{"role": "user", "content": sample}]
and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.
Fix implementation errors in LLaMA C++ tokenizer and runtime.
Correct issues in the LLaMA and Qwen2 modeling files in pymllm.
Additionally, improve the LLaMA calibration process for better performance.
Summary by CodeRabbit
New Features
Improvements