Hi, thanks for the great work.
I have a question about the implementation of Ingredient 1: MSE-Optimized Grids described in the MR-GPTQ paper.
In the paper, the objective is written as:
$$
\min_{s_T, s_{G_1}, \dots, s_{G_k}} \sum_i \left\lVert \hat{X}_i - X_i \right\rVert_2^2
$$
The paper says:
> We solve this by using alternating optimization over the block scales and the per-tensor scale, respectively.
Could you clarify how this alternating optimization is performed in practice?
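
To make the question concrete, here is a minimal sketch of how I currently imagine such an alternating scheme could look. It is only my guess and is not taken from the paper or the released code. The assumptions are mine: block scales are stored as integer codes in 1..255 (a stand-in for a real low-precision scale format), each block's effective scale is `s_t * code`, both updates use a small candidate search (`ALPHAS`, `BETAS`), and all function names are hypothetical.

```python
import numpy as np

# FP4 (E2M1) magnitudes, mirrored to build a signed grid.
_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-_POS[:0:-1], _POS])

# Assumed candidate sets for the two searches (not from the paper).
ALPHAS = np.linspace(0.6, 1.0, 9)    # shrink factors for block scales
BETAS = np.linspace(0.8, 1.2, 21)    # multipliers for the tensor scale


def dequant(x, scale):
    """Round x / scale to the nearest grid point, then rescale."""
    idx = np.abs((x / scale)[..., None] - GRID).argmin(axis=-1)
    return GRID[idx] * scale


def block_err(x, s_t, codes):
    """Squared error per block for effective scales s_t * codes."""
    return ((dequant(x, s_t * codes[:, None]) - x) ** 2).sum(axis=1)


def alternating_fit(x, n_iters=5):
    """x has shape (num_blocks, block_size).

    Returns (s_t, codes, history), where history holds the total
    squared error after initialization and after each iteration.
    """
    absmax = np.abs(x).max(axis=1)
    # Abs-max initialization: the largest block maps to code 255.
    s_t = max(absmax.max() / (GRID.max() * 255.0), 1e-12)
    codes = np.clip(np.round(absmax / (GRID.max() * s_t)), 1, 255)
    history = [float(block_err(x, s_t, codes).sum())]

    for _ in range(n_iters):
        # Step 1: fix s_t, update each block's code independently.
        best = block_err(x, s_t, codes)
        for a in ALPHAS:
            cand = np.clip(np.round(a * absmax / (GRID.max() * s_t)), 1, 255)
            err = block_err(x, s_t, cand)
            better = err < best
            codes = np.where(better, cand, codes)
            best = np.where(better, err, best)

        # Step 2: fix the codes, search over the per-tensor scale.
        base, best_s = s_t, s_t
        cur = float(block_err(x, s_t, codes).sum())
        for b in BETAS:
            e = float(block_err(x, base * b, codes).sum())
            if e < cur:
                cur, best_s = e, base * b
        s_t = best_s
        history.append(cur)

    return s_t, codes, history
```

Each step only accepts a candidate that lowers the error, so the total error should not increase from one iteration to the next. Is this roughly what is done, or are the updates closed-form, and how many alternations are used?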