Modify llama tokenizer, llama modeling, qwen2 modeling. by ZhangJinghe-AI · Pull Request #677 · UbiquitousLearning/mllm

ZhangJinghe-AI · 2026-05-19T11:57:02Z

Fix implementation errors in LLaMA C++ tokenizer and runtime.
Correct issues in the LLaMA and Qwen2 modeling files in pymllm.
Additionally, improve the LLaMA calibration process for better performance.

Summary by CodeRabbit

New Features
- Added Llama-3 tokenization support with chat template formatting and special token handling
- Introduced interactive prompt mode for demos
Improvements
- Dynamic end-of-sequence token configuration now adapts based on model type
- Enhanced quantization parameter wiring for improved model accuracy
- Updated inference pipeline to use chat template preprocessing

coderabbitai · 2026-05-19T11:57:16Z

📝 Walkthrough

Walkthrough

This PR introduces Llama-3 tokenization support with configurable model selection, refactors quantization observation paths in both Llama and Qwen2 attention modules to shift from input to output QDQ, and integrates interactive prompt handling into the AOT example runner.

Changes

Llama Tokenization and Integration

Layer / File(s)	Summary
Llama tokenizer implementation `mllm/models/llama/tokenization_llama.hpp`	Llama-3 tokenizer with custom regex splitting for pattern-based tokenization, BPE byte-level encoding, special token trie, chat template formatting with dynamic system prompts, and token/message conversion utilities.
Model type configuration and EOS selection `mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp`, `mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp`	Config struct adds `type` field (default `"qwen3"`); runtime conditionally populates EOS token IDs based on model type (Llama3, Qwen2, or Qwen3).
AOT example with interactive prompt `examples/llama_qnn_aot/aot_run.cpp`	Removes hardcoded "hello" input and sequence-length CLI options; instantiates LlamaTokenizer with model type set to Llama3; implements interactive stdin prompt reading with streaming token callback.
Llama quantization path and runner adjustments `pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py`, `pymllm/mobile/backends/qualcomm/transformers/llama/runner.py`	LlamaAttention removes input QDQ modules and shifts RoPE observation to output QDQ activation post-process; runner applies chat templates before tokenization and switches calibration to direct forward pass.

Qwen2 Quantization Refactoring

Layer / File(s)	Summary
Qwen2 attention QDQ path restructuring `pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py`	Removes k_proj_input_qdq and v_proj_input_qdq from initialization; routes key and value projections directly through output QDQ; updates RoPE concat observer and rotation paths to observe output QDQ activation post-process instead of input QDQ.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

liang1232018
chenghuaWang
oreomaker

Poem

🐰 A Llama hops in, tokenized and neat,
With BPE byte-dances, a rhythmic beat,
Chat templates wrapped in system's embrace,
Quantization flows smooth to their place,
From input to output, the path finds grace! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title is vague and overly broad, listing multiple file modifications without clarifying the main purpose or improvement.	Revise the title to focus on the primary objective, such as 'Fix LLaMA tokenizer and runtime implementation' or 'Improve LLaMA tokenization and Qwen2 modeling for quantization'.
Description check	❓ Inconclusive	The description provides high-level objectives but lacks specific details about what was fixed, why changes were made, or how to validate the fixes.	Expand the description with specifics: what implementation errors were fixed, what issues in modeling files were corrected, and how the calibration process was improved.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (1)
281-304: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The new observer arguments are currently a no-op.

rotate_half(...) never reads its x_observer parameter, so swapping in q_proj_output_qdq.fake_quant.activation_post_process and k_proj_output_qdq.fake_quant.activation_post_process on Line 286 and Line 298 does not change runtime behavior. If this path is meant to use the output observer, wire it up inside rotate_half; otherwise remove the unused parameter so the refactor is not misleading.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 281 - 304, rotate_half currently ignores the x_observer argument, so the
new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.

🧹 Nitpick comments (1)

mllm/models/llama/tokenization_llama.hpp (1)
98-101: ⚡ Quick win

Document the public tokenizer API surface.

LlamaMessage and multiple public LlamaTokenizer methods are exported without clear API comments (expected inputs/outputs/error behavior). Please add concise API docs to reduce integration ambiguity.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 104-237
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 98 - 101, Add concise
public API documentation for the exported symbols: add a doc comment above the
LlamaMessage struct describing its purpose and the meaning/expected format of
role and content, and add doc comments for each public LlamaTokenizer method
(e.g., any Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public
functions found in the file) that specify purpose, parameter expectations
(types, valid ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llama_qnn_aot/aot_run.cpp`:
- Around line 65-69: The prompt loop currently always sends input to
tokenizer.convertMessage; add an early-exit check after reading prompt_text that
returns or breaks when the user enters "exit" or "quit" (case-insensitive and
trimming surrounding whitespace), and also handle EOF (std::cin.eof())
similarly; implement this check immediately before calling
tokenizer.convertMessage so prompt_text is not processed when the user intends
to quit.
- Around line 41-44: The comment above the tokenizer instantiation is stale —
remove or update the lines that mention "Using Qwen3 tokenizer as a placeholder"
and instead document that LlamaTokenizer is being used; specifically edit the
comment that precedes the creation of
mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it correctly
reflects the use of LlamaTokenizer (or notes to implement a Llama3Tokenizer if
different) and remove any leftover references to Qwen3 to avoid confusion.

In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 5-11: The header tokenization_llama.hpp uses std::iswspace
(references at tokenization functions around lines using iswspace) but doesn't
include the wide-char ctype header; add an explicit `#include` <cwctype> to
tokenization_llama.hpp so std::iswspace is guaranteed to be declared (update the
top-of-file includes alongside <string>, <vector>, etc.), then rebuild to
confirm portability on platforms that don't provide it via transitive includes.
- Around line 121-127: The getSystemPromptPrefix() function uses std::localtime
(unsafe for concurrent calls from applyChatTemplate()); replace it with a
thread-safe alternative — on POSIX call std::localtime_r(&t, &tm_) or on Windows
std::localtime_s(&tm_, &t) (or use std::chrono to format the current date) so
that tm_ is populated safely per-thread; update the declaration/initialization
of tm_ accordingly and keep the existing formatting with std::put_time and
oss.str() untouched.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py`:
- Around line 253-259: The calibration path must use the same chat formatting as
inference: modify the calibrate(...) implementation (referencing calibrate and
its use of calibration samples) to build messages = [{"role": "user", "content":
sample}] and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.

---

Outside diff comments:
In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py`:
- Around line 281-304: rotate_half currently ignores the x_observer argument, so
the new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.

---

Nitpick comments:
In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 98-101: Add concise public API documentation for the exported
symbols: add a doc comment above the LlamaMessage struct describing its purpose
and the meaning/expected format of role and content, and add doc comments for
each public LlamaTokenizer method (e.g., any
Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public functions
found in the file) that specify purpose, parameter expectations (types, valid
ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c19ed8ee-bc79-4e4d-8767-91e2ceb8db20

📥 Commits

Reviewing files that changed from the base of the PR and between 729ca4c and 95d86fc.

📒 Files selected for processing (7)

examples/llama_qnn_aot/aot_run.cpp
mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
mllm/models/llama/tokenization_llama.hpp
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py

coderabbitai · 2026-05-19T12:02:43Z

  // Note: Using Qwen3 tokenizer as a placeholder.
  // For production use, you should implement a Llama3Tokenizer or use
  // the appropriate tokenizer for your model.
-  auto tokenizer = mllm::models::llama::TinyLlamaTokenizer(tokenizer_path.get());
+  auto tokenizer = mllm::models::llama::LlamaTokenizer(tokenizer_path.get());


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove stale tokenizer comment.

Lines 41–43 still say Qwen3 tokenizer placeholder, but Line 44 instantiates LlamaTokenizer. Please update/remove this comment to avoid maintenance confusion.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llama_qnn_aot/aot_run.cpp` around lines 41 - 44, The comment above the tokenizer instantiation is stale — remove or update the lines that mention "Using Qwen3 tokenizer as a placeholder" and instead document that LlamaTokenizer is being used; specifically edit the comment that precedes the creation of mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it correctly reflects the use of LlamaTokenizer (or notes to implement a Llama3Tokenizer if different) and remove any leftover references to Qwen3 to avoid confusion.

coderabbitai · 2026-05-19T12:02:43Z

+  fmt::print("💬 Prompt text (or 'exit/quit'): ");
+  std::getline(std::cin, prompt_text);
+
+  auto input_tensor = tokenizer.convertMessage({{.role = "user", .content = prompt_text}});



⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Implement the advertised exit/quit path.

The prompt text says exit/quit (Line 65), but input is always sent to generation. Add an early return when the user enters exit or quit.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llama_qnn_aot/aot_run.cpp` around lines 65 - 69, The prompt loop currently always sends input to tokenizer.convertMessage; add an early-exit check after reading prompt_text that returns or breaks when the user enters "exit" or "quit" (case-insensitive and trimming surrounding whitespace), and also handle EOF (std::cin.eof()) similarly; implement this check immediately before calling tokenizer.convertMessage so prompt_text is not processed when the user intends to quit.

coderabbitai · 2026-05-19T12:02:43Z

+#include <vector>
+#include <string>
+#include <ctime>
+#include <iomanip>
+#include <sstream>
+#include <unordered_map>
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n mllm/models/llama/tokenization_llama.hpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 6309

Add the missing wide-char ctype header for std::iswspace.

std::iswspace is used at lines 62, 63, 75, and 77 but <cwctype> is not explicitly included. While the code may compile on some platforms due to transitive includes, this violates portability best practices and C++ idioms. All headers for functions used must be explicitly included.

Proposed fix

`#include` <vector> `#include` <string> +#include <cwctype> `#include` <ctime> `#include` <iomanip> `#include` <sstream> `#include` <unordered_map>

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#include <vector>

#include <string>

#include <ctime>

#include <iomanip>

#include <sstream>

#include <unordered_map>

`#include` <vector>

`#include` <string>

`#include` <cwctype>

`#include` <ctime>

`#include` <iomanip>

`#include` <sstream>

`#include` <unordered_map>

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mllm/models/llama/tokenization_llama.hpp` around lines 5 - 11, The header tokenization_llama.hpp uses std::iswspace (references at tokenization functions around lines using iswspace) but doesn't include the wide-char ctype header; add an explicit `#include` <cwctype> to tokenization_llama.hpp so std::iswspace is guaranteed to be declared (update the top-of-file includes alongside <string>, <vector>, etc.), then rebuild to confirm portability on platforms that don't provide it via transitive includes.

coderabbitai · 2026-05-19T12:02:43Z

+  std::string getSystemPromptPrefix() {
+    std::time_t t = std::time(nullptr);
+    std::tm tm_ = *std::localtime(&t);
+    std::ostringstream oss;
+    oss << std::put_time(&tm_, "%d %b %Y");
+    return "Cutting Knowledge Date: December 2023\nToday Date: " + oss.str() + "\n\n";
+  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat mllm/models/llama/tokenization_llama.hpp | head -150 | tail -50

Repository: UbiquitousLearning/mllm

Length of output: 2162

🏁 Script executed:

rg "getSystemPromptPrefix" --type hpp --type cpp -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 95

🏁 Script executed:

# Check if there are any threading or concurrency patterns in the codebase rg "std::thread|std::mutex|pthread|`#pragma` omp" mllm/models/llama/ | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Look for other usages of std::localtime or similar time functions rg "std::localtime|localtime_r|localtime_s|std::chrono" mllm/models/llama/ --type hpp --type cpp

Repository: UbiquitousLearning/mllm

Length of output: 95

🏁 Script executed:

rg "getSystemPromptPrefix" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 266

🏁 Script executed:

rg "applyChatTemplate" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 1690

🏁 Script executed:

# Check if there's any concurrent usage or threading in the codebase rg "std::thread|std::async|std::mutex|concurrent" --max-count 20

Repository: UbiquitousLearning/mllm

Length of output: 8668

🏁 Script executed:

# Check what C++ version is being used find . -name "CMakeLists.txt" -o -name "*.cmake" | head -5 | xargs grep -l "cxx_standard\|CXX_STANDARD" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 165

🏁 Script executed:

# Look at the full tokenization_llama.hpp file to understand context better wc -l mllm/models/llama/tokenization_llama.hpp

Repository: UbiquitousLearning/mllm

Length of output: 112

🏁 Script executed:

cat mllm/CMakeLists.txt | grep -i "cxx_standard\|CXX_STANDARD" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 49

🏁 Script executed:

# Check if there are any service/server implementations that might call applyChatTemplate concurrently rg "LlamaTokenizer|applyChatTemplate" mllm/ -A 3 -B 3 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 4198

🏁 Script executed:

# Check the Service.hpp to understand threading model head -100 mllm/engine/service/Service.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2221

🌐 Web query:

std::localtime thread safety C++ std::chrono alternative

💡 Result:

The C-style function std::localtime is not thread-safe because it returns a pointer to a shared, statically allocated std::tm object, which can be overwritten by concurrent calls from other threads [1][2][3]. The modern, thread-safe alternative in C++ is the C++20 library, which provides robust support for time zones and calendar conversions [3][4][5]. Recommended C++20 Approach: Use std::chrono::zoned_time to handle local time conversions in a thread-safe manner [3][4][6]. Example: #include #include void print_local_time { auto now = std::chrono::system_clock::now; // zoned_time is thread-safe and handles time zone conversions auto local_now = std::chrono::zoned_time{std::chrono::current_zone, now}; // Use std::format (C++20) for thread-safe, type-safe formatting std::cout << std::format("{:%F %T %Z}", local_now) << std::endl; } Legacy Alternatives (Pre-C++20): If you cannot use C++20, you must use platform-specific thread-safe functions or manual synchronization [1][7][8]: 1. POSIX (Linux/macOS): Use localtime_r(const std::time_t*, std::tm*), which writes the result into a user-provided buffer [1][2][3]. 2. Windows (MSVC): Use localtime_s(std::tm*, const std::time_t*), which also uses a user-provided buffer [1][3][8]. 3. Manual Synchronization: Wrap calls to std::localtime in a std::mutex to prevent concurrent access [1][7][4]. Note: Avoid std::ctime and std::gmtime for the same reasons, as they also rely on shared static buffers [2][8][4].

Citations:

1: https://stackoverflow.com/questions/38034033/c-localtime-this-function-or-variable-may-be-unsafe/38034148

2: https://stackoverflow.com/questions/25618702/why-is-there-no-c11-threadsafe-alternative-to-stdlocaltime-and-stdgmtime

3: https://runebook.dev/en/articles/cpp/chrono/c/localtime

4: https://stackoverflow.com/questions/48106291/will-getting-the-current-date-time-be-thread-safe-in-c20

5: https://www.youtube.com/watch?v=Dq7rqjatxz8

6: https://en.cppreference.com/cpp/chrono/zoned_time

7: https://cplusplus.com/forum/beginner/94342/

8: https://kjellkod.wordpress.com/2013/01/22/exploring-c11-part-2-localtime-and-time-again/

Use thread-safe time functions instead of std::localtime.

std::localtime is not thread-safe as it returns a pointer to a shared static buffer. Since getSystemPromptPrefix() is called from applyChatTemplate(), which is invoked by concurrent worker threads in the Service implementation, this creates a race condition. Replace with std::localtime_r (POSIX) or std::localtime_s (Windows), or use std::chrono for the modern C++ approach.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mllm/models/llama/tokenization_llama.hpp` around lines 121 - 127, The getSystemPromptPrefix() function uses std::localtime (unsafe for concurrent calls from applyChatTemplate()); replace it with a thread-safe alternative — on POSIX call std::localtime_r(&t, &tm_) or on Windows std::localtime_s(&tm_, &t) (or use std::chrono to format the current date) so that tm_ is populated safely per-thread; update the declaration/initialization of tm_ accordingly and keep the existing formatting with std::put_time and oss.str() untouched.

coderabbitai · 2026-05-19T12:02:43Z

    def infer(self, prompt: str):
+        messages = [{"role": "user", "content": prompt}]
+        prompt = self.tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align calibration preprocessing with inference chat formatting.

infer() now uses apply_chat_template(...) (Line 255), but calibrate() still feeds raw text (Line 324). This distribution mismatch can produce inaccurate activation statistics for deployment. Use the same chat-template path in calibration samples.

Also applies to: 322-330

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py` around lines 253 - 259, The calibration path must use the same chat formatting as inference: modify the calibrate(...) implementation (referencing calibrate and its use of calibration samples) to build messages = [{"role": "user", "content": sample}] and call self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) for each sample (the same call/invocation used in infer()), and use the returned formatted prompt when computing activation statistics so calibration and infer share identical preprocessing.

Modify llama tokenizer and pymllm.

95d86fc

ZhangJinghe-AI requested review from chenghuaWang, liang1232018 and oreomaker as code owners May 19, 2026 11:57

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify llama tokenizer, llama modeling, qwen2 modeling.#677

Modify llama tokenizer, llama modeling, qwen2 modeling.#677
ZhangJinghe-AI wants to merge 1 commit into
UbiquitousLearning:mainfrom
ZhangJinghe-AI:modify_llama_tokenizer

ZhangJinghe-AI commented May 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning, 2 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ZhangJinghe-AI commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning, 2 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ZhangJinghe-AI commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading