Skip to content

Modify llama tokenizer, llama modeling, qwen2 modeling.#677

Open
ZhangJinghe-AI wants to merge 1 commit into
UbiquitousLearning:mainfrom
ZhangJinghe-AI:modify_llama_tokenizer
Open

Modify llama tokenizer, llama modeling, qwen2 modeling.#677
ZhangJinghe-AI wants to merge 1 commit into
UbiquitousLearning:mainfrom
ZhangJinghe-AI:modify_llama_tokenizer

Conversation

@ZhangJinghe-AI
Copy link
Copy Markdown

@ZhangJinghe-AI ZhangJinghe-AI commented May 19, 2026

Fix implementation errors in LLaMA C++ tokenizer and runtime.
Correct issues in the LLaMA and Qwen2 modeling files in pymllm.
Additionally, improve the LLaMA calibration process for better performance.

Summary by CodeRabbit

  • New Features

    • Added Llama-3 tokenization support with chat template formatting and special token handling
    • Introduced interactive prompt mode for demos
  • Improvements

    • Dynamic end-of-sequence token configuration now adapts based on model type
    • Enhanced quantization parameter wiring for improved model accuracy
    • Updated inference pipeline to use chat template preprocessing

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

Walkthrough

This PR introduces Llama-3 tokenization support with configurable model selection, refactors quantization observation paths in both Llama and Qwen2 attention modules to shift from input to output QDQ, and integrates interactive prompt handling into the AOT example runner.

Changes

Llama Tokenization and Integration

Layer / File(s) Summary
Llama tokenizer implementation
mllm/models/llama/tokenization_llama.hpp
Llama-3 tokenizer with custom regex splitting for pattern-based tokenization, BPE byte-level encoding, special token trie, chat template formatting with dynamic system prompts, and token/message conversion utilities.
Model type configuration and EOS selection
mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp, mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
Config struct adds type field (default "qwen3"); runtime conditionally populates EOS token IDs based on model type (Llama3, Qwen2, or Qwen3).
AOT example with interactive prompt
examples/llama_qnn_aot/aot_run.cpp
Removes hardcoded "hello" input and sequence-length CLI options; instantiates LlamaTokenizer with model type set to Llama3; implements interactive stdin prompt reading with streaming token callback.
Llama quantization path and runner adjustments
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py, pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
LlamaAttention removes input QDQ modules and shifts RoPE observation to output QDQ activation post-process; runner applies chat templates before tokenization and switches calibration to direct forward pass.

Qwen2 Quantization Refactoring

Layer / File(s) Summary
Qwen2 attention QDQ path restructuring
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py
Removes k_proj_input_qdq and v_proj_input_qdq from initialization; routes key and value projections directly through output QDQ; updates RoPE concat observer and rotation paths to observe output QDQ activation post-process instead of input QDQ.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • liang1232018
  • chenghuaWang
  • oreomaker

Poem

🐰 A Llama hops in, tokenized and neat,
With BPE byte-dances, a rhythmic beat,
Chat templates wrapped in system's embrace,
Quantization flows smooth to their place,
From input to output, the path finds grace! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is vague and overly broad, listing multiple file modifications without clarifying the main purpose or improvement. Revise the title to focus on the primary objective, such as 'Fix LLaMA tokenizer and runtime implementation' or 'Improve LLaMA tokenization and Qwen2 modeling for quantization'.
Description check ❓ Inconclusive The description provides high-level objectives but lacks specific details about what was fixed, why changes were made, or how to validate the fixes. Expand the description with specifics: what implementation errors were fixed, what issues in modeling files were corrected, and how the calibration process was improved.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (1)

281-304: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The new observer arguments are currently a no-op.

rotate_half(...) never reads its x_observer parameter, so swapping in q_proj_output_qdq.fake_quant.activation_post_process and k_proj_output_qdq.fake_quant.activation_post_process on Line 286 and Line 298 does not change runtime behavior. If this path is meant to use the output observer, wire it up inside rotate_half; otherwise remove the unused parameter so the refactor is not misleading.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 281 - 304, rotate_half currently ignores the x_observer argument, so the
new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.
🧹 Nitpick comments (1)
mllm/models/llama/tokenization_llama.hpp (1)

98-101: ⚡ Quick win

Document the public tokenizer API surface.

LlamaMessage and multiple public LlamaTokenizer methods are exported without clear API comments (expected inputs/outputs/error behavior). Please add concise API docs to reduce integration ambiguity.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 104-237

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 98 - 101, Add concise
public API documentation for the exported symbols: add a doc comment above the
LlamaMessage struct describing its purpose and the meaning/expected format of
role and content, and add doc comments for each public LlamaTokenizer method
(e.g., any Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public
functions found in the file) that specify purpose, parameter expectations
(types, valid ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llama_qnn_aot/aot_run.cpp`:
- Around line 65-69: The prompt loop currently always sends input to
tokenizer.convertMessage; add an early-exit check after reading prompt_text that
returns or breaks when the user enters "exit" or "quit" (case-insensitive and
trimming surrounding whitespace), and also handle EOF (std::cin.eof())
similarly; implement this check immediately before calling
tokenizer.convertMessage so prompt_text is not processed when the user intends
to quit.
- Around line 41-44: The comment above the tokenizer instantiation is stale —
remove or update the lines that mention "Using Qwen3 tokenizer as a placeholder"
and instead document that LlamaTokenizer is being used; specifically edit the
comment that precedes the creation of
mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it correctly
reflects the use of LlamaTokenizer (or notes to implement a Llama3Tokenizer if
different) and remove any leftover references to Qwen3 to avoid confusion.

In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 5-11: The header tokenization_llama.hpp uses std::iswspace
(references at tokenization functions around lines using iswspace) but doesn't
include the wide-char ctype header; add an explicit `#include` <cwctype> to
tokenization_llama.hpp so std::iswspace is guaranteed to be declared (update the
top-of-file includes alongside <string>, <vector>, etc.), then rebuild to
confirm portability on platforms that don't provide it via transitive includes.
- Around line 121-127: The getSystemPromptPrefix() function uses std::localtime
(unsafe for concurrent calls from applyChatTemplate()); replace it with a
thread-safe alternative — on POSIX call std::localtime_r(&t, &tm_) or on Windows
std::localtime_s(&tm_, &t) (or use std::chrono to format the current date) so
that tm_ is populated safely per-thread; update the declaration/initialization
of tm_ accordingly and keep the existing formatting with std::put_time and
oss.str() untouched.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py`:
- Around line 253-259: The calibration path must use the same chat formatting as
inference: modify the calibrate(...) implementation (referencing calibrate and
its use of calibration samples) to build messages = [{"role": "user", "content":
sample}] and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.

---

Outside diff comments:
In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py`:
- Around line 281-304: rotate_half currently ignores the x_observer argument, so
the new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.

---

Nitpick comments:
In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 98-101: Add concise public API documentation for the exported
symbols: add a doc comment above the LlamaMessage struct describing its purpose
and the meaning/expected format of role and content, and add doc comments for
each public LlamaTokenizer method (e.g., any
Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public functions
found in the file) that specify purpose, parameter expectations (types, valid
ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c19ed8ee-bc79-4e4d-8767-91e2ceb8db20

📥 Commits

Reviewing files that changed from the base of the PR and between 729ca4c and 95d86fc.

📒 Files selected for processing (7)
  • examples/llama_qnn_aot/aot_run.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/models/llama/tokenization_llama.hpp
  • pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
  • pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
  • pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py

Comment on lines 41 to +44
// Note: Using Qwen3 tokenizer as a placeholder.
// For production use, you should implement a Llama3Tokenizer or use
// the appropriate tokenizer for your model.
auto tokenizer = mllm::models::llama::TinyLlamaTokenizer(tokenizer_path.get());
auto tokenizer = mllm::models::llama::LlamaTokenizer(tokenizer_path.get());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove stale tokenizer comment.

Lines 41–43 still say Qwen3 tokenizer placeholder, but Line 44 instantiates LlamaTokenizer. Please update/remove this comment to avoid maintenance confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llama_qnn_aot/aot_run.cpp` around lines 41 - 44, The comment above
the tokenizer instantiation is stale — remove or update the lines that mention
"Using Qwen3 tokenizer as a placeholder" and instead document that
LlamaTokenizer is being used; specifically edit the comment that precedes the
creation of mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it
correctly reflects the use of LlamaTokenizer (or notes to implement a
Llama3Tokenizer if different) and remove any leftover references to Qwen3 to
avoid confusion.

Comment on lines +65 to 69
fmt::print("💬 Prompt text (or 'exit/quit'): ");
std::getline(std::cin, prompt_text);

auto input_tensor = tokenizer.convertMessage({{.role = "user", .content = prompt_text}});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Implement the advertised exit/quit path.

The prompt text says exit/quit (Line 65), but input is always sent to generation. Add an early return when the user enters exit or quit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llama_qnn_aot/aot_run.cpp` around lines 65 - 69, The prompt loop
currently always sends input to tokenizer.convertMessage; add an early-exit
check after reading prompt_text that returns or breaks when the user enters
"exit" or "quit" (case-insensitive and trimming surrounding whitespace), and
also handle EOF (std::cin.eof()) similarly; implement this check immediately
before calling tokenizer.convertMessage so prompt_text is not processed when the
user intends to quit.

Comment on lines +5 to +11
#include <vector>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <unordered_map>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n mllm/models/llama/tokenization_llama.hpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 6309


Add the missing wide-char ctype header for std::iswspace.

std::iswspace is used at lines 62, 63, 75, and 77 but <cwctype> is not explicitly included. While the code may compile on some platforms due to transitive includes, this violates portability best practices and C++ idioms. All headers for functions used must be explicitly included.

Proposed fix
 `#include` <vector>
 `#include` <string>
+#include <cwctype>
 `#include` <ctime>
 `#include` <iomanip>
 `#include` <sstream>
 `#include` <unordered_map>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <vector>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <unordered_map>
`#include` <vector>
`#include` <string>
`#include` <cwctype>
`#include` <ctime>
`#include` <iomanip>
`#include` <sstream>
`#include` <unordered_map>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 5 - 11, The header
tokenization_llama.hpp uses std::iswspace (references at tokenization functions
around lines using iswspace) but doesn't include the wide-char ctype header; add
an explicit `#include` <cwctype> to tokenization_llama.hpp so std::iswspace is
guaranteed to be declared (update the top-of-file includes alongside <string>,
<vector>, etc.), then rebuild to confirm portability on platforms that don't
provide it via transitive includes.

Comment on lines +121 to +127
std::string getSystemPromptPrefix() {
std::time_t t = std::time(nullptr);
std::tm tm_ = *std::localtime(&t);
std::ostringstream oss;
oss << std::put_time(&tm_, "%d %b %Y");
return "Cutting Knowledge Date: December 2023\nToday Date: " + oss.str() + "\n\n";
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat mllm/models/llama/tokenization_llama.hpp | head -150 | tail -50

Repository: UbiquitousLearning/mllm

Length of output: 2162


🏁 Script executed:

rg "getSystemPromptPrefix" --type hpp --type cpp -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 95


🏁 Script executed:

# Check if there are any threading or concurrency patterns in the codebase
rg "std::thread|std::mutex|pthread|`#pragma` omp" mllm/models/llama/ | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Look for other usages of std::localtime or similar time functions
rg "std::localtime|localtime_r|localtime_s|std::chrono" mllm/models/llama/ --type hpp --type cpp

Repository: UbiquitousLearning/mllm

Length of output: 95


🏁 Script executed:

rg "getSystemPromptPrefix" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 266


🏁 Script executed:

rg "applyChatTemplate" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 1690


🏁 Script executed:

# Check if there's any concurrent usage or threading in the codebase
rg "std::thread|std::async|std::mutex|concurrent" --max-count 20

Repository: UbiquitousLearning/mllm

Length of output: 8668


🏁 Script executed:

# Check what C++ version is being used
find . -name "CMakeLists.txt" -o -name "*.cmake" | head -5 | xargs grep -l "cxx_standard\|CXX_STANDARD" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 165


🏁 Script executed:

# Look at the full tokenization_llama.hpp file to understand context better
wc -l mllm/models/llama/tokenization_llama.hpp

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

cat mllm/CMakeLists.txt | grep -i "cxx_standard\|CXX_STANDARD" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there are any service/server implementations that might call applyChatTemplate concurrently
rg "LlamaTokenizer|applyChatTemplate" mllm/ -A 3 -B 3 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 4198


🏁 Script executed:

# Check the Service.hpp to understand threading model
head -100 mllm/engine/service/Service.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2221


🌐 Web query:

std::localtime thread safety C++ std::chrono alternative

💡 Result:

The C-style function std::localtime is not thread-safe because it returns a pointer to a shared, statically allocated std::tm object, which can be overwritten by concurrent calls from other threads [1][2][3]. The modern, thread-safe alternative in C++ is the C++20 library, which provides robust support for time zones and calendar conversions [3][4][5]. Recommended C++20 Approach: Use std::chrono::zoned_time to handle local time conversions in a thread-safe manner [3][4][6]. Example: #include #include void print_local_time { auto now = std::chrono::system_clock::now; // zoned_time is thread-safe and handles time zone conversions auto local_now = std::chrono::zoned_time{std::chrono::current_zone, now}; // Use std::format (C++20) for thread-safe, type-safe formatting std::cout << std::format("{:%F %T %Z}", local_now) << std::endl; } Legacy Alternatives (Pre-C++20): If you cannot use C++20, you must use platform-specific thread-safe functions or manual synchronization [1][7][8]: 1. POSIX (Linux/macOS): Use localtime_r(const std::time_t*, std::tm*), which writes the result into a user-provided buffer [1][2][3]. 2. Windows (MSVC): Use localtime_s(std::tm*, const std::time_t*), which also uses a user-provided buffer [1][3][8]. 3. Manual Synchronization: Wrap calls to std::localtime in a std::mutex to prevent concurrent access [1][7][4]. Note: Avoid std::ctime and std::gmtime for the same reasons, as they also rely on shared static buffers [2][8][4].

Citations:


Use thread-safe time functions instead of std::localtime.

std::localtime is not thread-safe as it returns a pointer to a shared static buffer. Since getSystemPromptPrefix() is called from applyChatTemplate(), which is invoked by concurrent worker threads in the Service implementation, this creates a race condition. Replace with std::localtime_r (POSIX) or std::localtime_s (Windows), or use std::chrono for the modern C++ approach.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 121 - 127, The
getSystemPromptPrefix() function uses std::localtime (unsafe for concurrent
calls from applyChatTemplate()); replace it with a thread-safe alternative — on
POSIX call std::localtime_r(&t, &tm_) or on Windows std::localtime_s(&tm_, &t)
(or use std::chrono to format the current date) so that tm_ is populated safely
per-thread; update the declaration/initialization of tm_ accordingly and keep
the existing formatting with std::put_time and oss.str() untouched.

Comment on lines 253 to +259
def infer(self, prompt: str):
messages = [{"role": "user", "content": prompt}]
prompt = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align calibration preprocessing with inference chat formatting.

infer() now uses apply_chat_template(...) (Line 255), but calibrate() still feeds raw text (Line 324). This distribution mismatch can produce inaccurate activation statistics for deployment. Use the same chat-template path in calibration samples.

Also applies to: 322-330

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py` around lines
253 - 259, The calibration path must use the same chat formatting as inference:
modify the calibrate(...) implementation (referencing calibrate and its use of
calibration samples) to build messages = [{"role": "user", "content": sample}]
and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant