
[WIP] Generic quantization support for PEFT methods#3117

Draft
BenjaminBossan wants to merge 16 commits into huggingface:main from BenjaminBossan:refactor-quantization-support

Conversation

@BenjaminBossan
Member

@BenjaminBossan BenjaminBossan commented Mar 25, 2026

Problem

Right now, if a new PEFT method wants to add support for quantized layers, it requires a significant amount of work. Notably, the method needs to implement dedicated layer classes for each quantization method (e.g. one class for bnb 4bit, one for bnb 8bit, one for AWQ, ...). These classes typically are >90% boilerplate and the actual difference between implementations of these classes is minimal.

The result of that is that, at the moment, most PEFT methods don't support any, or only very few, quantization methods, even though the amount of actual logic required to support these methods is relatively small.

Suggested solution

This PR is a suggestion of how to solve the issue. If this approach is accepted, with a few extra lines, we should be able to support all quantization methods in all PEFT methods. The general approach is to add an attribute to each PEFT layer, self.quantization_backend, which supports these methods:

  • get_base_weight
  • set_base_weight

When the PEFT layers use these methods to access and write to the base layer weight, and if the weight is quantized, the new classes will deal with that correctly. This means that we no longer need a dedicated layer class to deal with quantized layers, the normal layer class will do. E.g. for MiSS, the normal miss.Linear class can deal with bnb layers, there is no need to add a miss/bnb.py module with dedicated layers.
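As a rough sketch of the idea (the accessor names mirror the description above, but the minimal classes below are hypothetical stand-ins, not PEFT's actual implementation), a merge step written against the two backend accessors looks the same whether the base weight is quantized or not:

```python
class DummyBackend:
    """Stand-in for a real quantization backend. Here the 'weight' is a
    plain list; a bnb/torchao backend would dequantize and requantize
    inside these two methods instead."""

    def __init__(self, weight):
        self._weight = list(weight)

    def get_base_weight(self):
        # A real backend would dequantize the stored weight here.
        return list(self._weight)

    def set_base_weight(self, new_weight):
        # A real backend would requantize before storing here.
        self._weight = list(new_weight)


class TinyPeftLayer:
    """Toy PEFT layer: merge() only ever talks to the backend."""

    def __init__(self, backend, delta):
        self.quantization_backend = backend
        self.delta = delta  # the adapter's weight update

    def merge(self):
        # No isinstance checks on the base layer: the backend hides
        # whether the weight is quantized.
        w = self.quantization_backend.get_base_weight()
        merged = [wi + di for wi, di in zip(w, self.delta)]
        self.quantization_backend.set_base_weight(merged)


layer = TinyPeftLayer(DummyBackend([1.0, 2.0]), delta=[0.5, -0.5])
layer.merge()
print(layer.quantization_backend.get_base_weight())  # [1.5, 1.5]
```

Swapping `DummyBackend` for a quantization-aware backend would leave `TinyPeftLayer.merge` untouched, which is the point of the abstraction.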

A few rewrites in the existing PEFT methods are required to support this new quantization backend class, but the amount of total code needed for that is considerably smaller than adding new classes for each quantization method.

Furthermore, these quantization backend classes are agnostic with regard to the PEFT method. Therefore, with M PEFT methods and N quantization methods, we no longer need MxN implementations to support quantization but only M+N.
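The M+N scaling can be illustrated with a registry plus a resolver: each backend is written once (N classes), each PEFT layer resolves one at init time (M classes), and no per-combination subclass exists. All names here (`resolve_quantization_backend`, the toy layer and backend classes) are assumptions for illustration, not PEFT's actual API:

```python
BACKEND_REGISTRY = {}  # base-layer type name -> backend class (N entries)


def register_backend(base_layer_type_name):
    """Class decorator registering a backend for one base-layer type."""
    def decorator(cls):
        BACKEND_REGISTRY[base_layer_type_name] = cls
        return cls
    return decorator


def resolve_quantization_backend(base_layer):
    # Shared by every PEFT method: one lookup replaces M x N subclasses.
    return BACKEND_REGISTRY[type(base_layer).__name__](base_layer)


@register_backend("PlainLinear")
class PlainBackend:
    def __init__(self, layer):
        self.layer = layer


@register_backend("FakeQuantLinear")
class FakeQuantBackend:
    def __init__(self, layer):
        self.layer = layer


class PlainLinear: ...
class FakeQuantLinear: ...


# Any PEFT layer class can now wrap either base layer unchanged.
print(type(resolve_quantization_backend(FakeQuantLinear())).__name__)
# FakeQuantBackend
```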

Migration

For LoRA, we have already implemented the layer classes for each supported quantization method. For the sake of consistency, it could still make sense to migrate LoRA to the new approach if it's accepted. This needs to be accompanied by detailed regression testing to ensure that everything keeps working. I would suggest deprecating and removing only abandoned quantization methods (perhaps for a v1.0 release).

The bigger issue, however, is that packages that depend on PEFT may break with this change. As an example, if they detect quantized layers via isinstance checks, those would break as all layers would just be normal lora.Linear, lora.Conv2d etc. The approach here would most likely involve deprecating the import of these classes. I think it's also possible to "cheat" isinstance and pretend like there is inheritance when there isn't but I'd like to avoid that.
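For reference, the isinstance "cheat" mentioned above is possible via a metaclass that overrides `__instancecheck__`; the sketch below is hypothetical (class and attribute names are illustrative), and the PR explicitly prefers to avoid this route:

```python
class _PretendsQuantized(type):
    # isinstance(obj, SomeClass) consults type(SomeClass).__instancecheck__,
    # so a metaclass can report a relationship that is not real inheritance.
    def __instancecheck__(cls, obj):
        return getattr(obj, "quantization_backend", None) is not None


class Linear8bitLt(metaclass=_PretendsQuantized):
    """Deprecated stand-in: isinstance checks keep passing for any layer
    that carries a quantization backend, without actual subclassing."""


class PlainLoraLinear:
    quantization_backend = object()  # pretend a backend was attached


print(isinstance(PlainLoraLinear(), Linear8bitLt))  # True
print(isinstance(object(), Linear8bitLt))           # False
```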

Anyway, this is out of scope of this PR and will be addressed in the future.

Scope

Updating all PEFT methods is too much for a single PR. This PR focuses on only three PEFT methods for now:

  • MiSS: A pretty normal PEFT method, representative of many other PEFT methods.
  • BOFT: Also pretty normal, but requires slight rewrite of the forward step. Similar changes may be required for other methods too.
  • VeRA: Already supports bnb but with this PR, specific bnb layers are no longer needed.

Right now, if a new PEFT method wants to add support for quantized
layers, it requires a significant amount of work. Notably, the method
needs to implement dedicated layer classes for each quantization
method (e.g. one class for bnb 4bit, one for bnb 8bit, one for AWQ,
...).

The result of that is that, at the moment, most PEFT methods don't
support any, or only very few, quantization methods, even though the
amount of actual logic required to support these methods is quite
contained.

This PR is a suggestion of how to solve the issue. If this approach is
accepted, with a few extra lines, we should be able to support all
quantization methods in all PEFT methods.

The PR is not in a finished state, more to follow. Right now, only VeRA
and MiSS have been updated as a POC.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan changed the title [WIP] Generic quantization support for PEFT [WIP] Generic quantization support for PEFT methods Mar 26, 2026
no reason why it would be nn.Linear instead of nn.Module like the other
PEFT methods
+ some docstring cleanups

Copilot AI left a comment


Pull request overview

This PR introduces a generic “quantization backend” abstraction so PEFT tuner layers can support multiple quantization frameworks without needing per-backend layer subclasses, and wires it into VeRA and MiSS as an initial proof-of-concept.

Changes:

  • Add QuantizationBackend implementations + backend resolution (resolve_quantization_backend) and a helper to surface backend info in module repr.
  • Extend BaseTunerLayer with get_base_weight / set_base_weight to centralize dequantize/requantize handling for merge/unmerge.
  • Surface quantization backend info in get_layer_status() / get_model_status() and add tests (including a new quantization matrix test file).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/test_tuners_utils.py: Adds coverage for layer/model status reporting of quantization_backend.
  • tests/test_quantization.py: Adds a PEFT-method × quant-backend matrix test suite (bnb/torchao loaders and core behavioral checks).
  • src/peft/utils/quantization_utils.py: Introduces quantization backend classes, backend resolution logic, and repr helper.
  • src/peft/utils/__init__.py: Exports the new quantization helpers from peft.utils.
  • src/peft/tuners/vera/model.py: Updates VeRA injection to remove bnb-specific module creation and to forward torchao merge metadata.
  • src/peft/tuners/vera/layer.py: Hooks VeRA layers into the new backend mechanism for merge/unmerge and forward safety cloning.
  • src/peft/tuners/vera/bnb.py: Removes VeRA's dedicated bitsandbytes layer implementations (intended to be superseded by generic backend support).
  • src/peft/tuners/tuners_utils.py: Adds quantization_backend attribute and centralized base-weight getters/setters to BaseTunerLayer.
  • src/peft/tuners/miss/model.py: Forwards torchao merge metadata during MiSS injection.
  • src/peft/tuners/miss/layer.py: Hooks MiSS layers into the new backend mechanism for merge/unmerge and forward safety cloning.
  • src/peft/peft_model.py: Extends tuner status dataclasses + status functions to report quantization backend consistency.
Comments suppressed due to low confidence (1)

src/peft/tuners/vera/model.py:259

  • This PR removes the dedicated VeRA bitsandbytes layer implementations, but peft.tuners.vera still has lazy attribute resolution for Linear8bitLt / Linear4bit via from .bnb import ... (see src/peft/tuners/vera/__init__.py). With vera/bnb.py deleted, those imports will raise at runtime and existing tests/imports that reference peft.tuners.vera.Linear8bitLt will break. Please update the VeRA package exports to match the new generic quantization approach (either provide compatible aliases or remove the lazy attributes).
    @staticmethod
    def _create_new_module(vera_config, vera_A, vera_B, adapter_name, target, **kwargs):
        bias = kwargs.pop("bias", False)

        if isinstance(target, BaseTunerLayer):
            target_base_layer = target.get_base_layer()
        else:
            target_base_layer = target

        if isinstance(target_base_layer, torch.nn.Linear):
            if kwargs["fan_in_fan_out"]:
                warnings.warn(
                    "fan_in_fan_out is set to True but the target module is `torch.nn.Linear`. "
                    "Setting fan_in_fan_out to False."
                )
                kwargs["fan_in_fan_out"] = vera_config.fan_in_fan_out = False
        elif isinstance(target_base_layer, Conv1D):
            kwargs["is_target_conv_1d_layer"] = True
            if not kwargs["fan_in_fan_out"]:
                warnings.warn(
                    "fan_in_fan_out is set to False but the target module is `Conv1D`. Setting fan_in_fan_out to True."
                )
                kwargs["fan_in_fan_out"] = vera_config.fan_in_fan_out = True
        else:
            raise ValueError(
                f"Target module {target} is not supported. Currently, only the following modules are supported: "
                "`torch.nn.Linear`, `transformers.pytorch_utils.Conv1D`."
            )


Comment thread tests/test_quantization.py
rotated_weight = torch.transpose(rotated_weight, 0, 1)

scaled_rotated_weight = rotated_weight * boft_scale
x_rotated = x @ boft_rotation
Member Author


This is a reformulation of the forward path of BOFT that avoids using the base layer weight directly. This was necessary because calling torch.mm(boft_rotation, orig_weight) can fail with quantized weights. Instead, we should make a forward pass and let the quantized layer handle the details. I ran the BOFT tests with the old and the new implementation and added an assert that they are identical (up to precision).

Regarding runtime, I checked the MetaMath benchmark and got 147 sec for 250 steps (116 sec for 1 eval run) using main branch, and 138 sec (108 sec) using the code from this branch. So the new code seems to be on par or possibly slightly faster than the old one.
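The identity behind the reformulation can be checked in miniature: folding a rotation into the weight and multiplying it with the input gives the same result as rotating the input first and then applying the unmodified (possibly quantized) base layer. The pure-Python 2x2 sketch below simplifies the convention relative to BOFT's actual block-diagonal butterfly factors and scaling:

```python
def matmul(a, b):
    # naive matrix product of nested lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


x = [[1.0, 2.0]]                  # one input row
w = [[0.5, -1.0], [2.0, 3.0]]     # base weight, shape (out, in)
r = [[0.0, 1.0], [-1.0, 0.0]]     # an orthogonal rotation

# Old path: materialize the rotated weight, then do the linear forward.
old = matmul(x, transpose(matmul(w, transpose(r))))   # x @ (W R^T)^T
# New path: rotate the input, then call the untouched base layer.
new = matmul(matmul(x, r), transpose(w))              # (x R) @ W^T

print(old, new)  # both are [[-2.0, -1.0]]
```

Since the new path never reads the base weight directly, a quantized base layer can handle its own forward pass, which is what makes the rewrite quantization-friendly.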

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@BenjaminBossan
Member Author

not stale

@BenjaminBossan
Member Author

@Joluck It would be great if you could check the MiSS change.
@zqiu24 It would be great if you could check the BOFT change.
