JAX + TPU: training loss is abnormal with AQT int8 #71

I used the aqt_einsum function in my code to quantize only the QK score computation, and then trained the model. However, after a certain number of training steps (for example, 200 steps) the loss drops very slowly, and the curve is quite different from the loss curve of the same model trained in bfloat16. Am I missing something? For example, does the backward pass need some additional handling?
P.S. I am training the model with jax==0.4.23 on a TPU v5p-8.

Alternatively, is there a training example for AQT int8 in Pax?
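To make the backward-pass question concrete, here is a minimal sketch in plain JAX of int8 fake quantization applied only to the QK score einsum. It is not the AQT API and does not claim to reproduce what aqt_einsum does internally. The names `fake_quant_int8`, `qk_scores`, and `toy_loss`, the tensor shapes, and the per-row symmetric scaling are all assumptions made for illustration. The point it shows: rounding has zero gradient almost everywhere, so a quantized forward pass needs something like a straight-through estimator for gradients to reach q and k. Comparing the resulting gradients against the unquantized ones is one way to check whether the backward pass is behaving as expected.

```python
import jax
import jax.numpy as jnp


def fake_quant_int8(x, axis):
    """Symmetric int8 fake quantization with a straight-through gradient.

    Hypothetical helper for illustration only; not part of AQT.
    """
    # Scale so the largest magnitude along `axis` maps to 127.
    max_abs = jnp.max(jnp.abs(x), axis=axis, keepdims=True)
    scale = jnp.where(max_abs > 0, max_abs / 127.0, 1.0)
    # Round to the int8 grid, then map back to the float range.
    q = jnp.clip(jnp.round(x / scale), -127, 127) * scale
    # Straight-through estimator: the forward value is q, while the
    # gradient with respect to x is the identity.
    return x + jax.lax.stop_gradient(q - x)


def qk_scores(q, k, quantize):
    """Attention logits; q and k are [batch, seq, heads, head_dim]."""
    if quantize:
        q = fake_quant_int8(q, axis=-1)
        k = fake_quant_int8(k, axis=-1)
    return jnp.einsum("bqhd,bkhd->bhqk", q, k)


def toy_loss(q, k, quantize):
    # Stand-in scalar loss so that gradients can be compared.
    return jnp.sum(qk_scores(q, k, quantize) ** 2)


key_q, key_k = jax.random.split(jax.random.PRNGKey(0))
q = jax.random.normal(key_q, (2, 4, 2, 8), dtype=jnp.float32)
k = jax.random.normal(key_k, (2, 4, 2, 8), dtype=jnp.float32)

grad_fp = jax.grad(toy_loss)(q, k, False)  # float reference gradient
grad_q8 = jax.grad(toy_loss)(q, k, True)   # gradient through fake quant

# Relative difference between the quantized-path and float gradients.
rel_err = jnp.linalg.norm(grad_q8 - grad_fp) / jnp.linalg.norm(grad_fp)
print("relative gradient error:", rel_err)
```

The same comparison could be run on the real model by taking gradients of one training step with and without the quantized einsum. A large or growing relative error would point at the backward path rather than at the forward quantization.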
