
GPTQ or bitsandbytes: Which Quantization Technique to Use for LLMs

Large language model quantization for affordable fine-tuning and inference on your computer

As large language models (LLMs) have grown bigger, with more and more parameters, new methods to reduce their memory usage have also been proposed.

One of the most effective methods to reduce the model size in memory is quantization. You can see quantization as a compression technique for LLMs. In practice, the main goal of quantization is to lower the precision of the LLM's weights, typically from 16-bit to 8-bit, 4-bit, or even 3-bit, with minimal performance degradation.
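To make this concrete, here is a minimal sketch of the idea behind weight quantization: rounding 16-bit weights to 8-bit integers plus a scaling factor. This only illustrates the principle, not how GPTQ or bitsandbytes actually quantize; `absmax_quantize` and `dequantize` are hypothetical helpers written for this example.

```python
import torch

def absmax_quantize(weights: torch.Tensor):
    # Symmetric (absmax) quantization: map the largest magnitude to the int8 range.
    scale = weights.abs().max() / 127
    q_weights = torch.round(weights / scale).to(torch.int8)  # 1 byte per weight
    return q_weights, scale

def dequantize(q_weights: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate float16 weights for computation.
    return q_weights.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)  # a toy weight matrix
q_w, s = absmax_quantize(w)
print(f"{w.nelement() * w.element_size() / 1e6:.1f} MB -> "
      f"{q_w.nelement() * q_w.element_size() / 1e6:.1f} MB")
```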

There are two popular quantization methods for LLMs: GPTQ and bitsandbytes.

In this article, I discuss the main differences between these two approaches. They both have their own advantages and disadvantages that make them suitable for different use cases. I present a comparison of their memory usage and inference speed using Llama 2. I also discuss their performance based on experiments from previous work.

Note: If you want to know more about quantization, I recommend reading this excellent introduction by Maxime Labonne:

GPTQ (Frantar et al., 2023) was first applied to models that are ready to deploy. In other words, once the model is fully fine-tuned, GPTQ is applied to reduce its size.

GPTQ can lower the weight precision to 4-bit or 3-bit. In practice, GPTQ is mainly used for 4-bit quantization: 3-bit quantization has been shown to be very unstable (Dettmers and Zettlemoyer, 2023).
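For reference, here is roughly what 4-bit GPTQ quantization looks like through the GPTQ integration in Hugging Face transformers (which relies on optimum and auto-gptq under the hood). Treat it as a sketch: the model name, calibration dataset, and output directory are placeholders you would adapt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization; "c4" is used here as the calibration dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# The model is quantized while it is being loaded
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Save the quantized weights for later inference
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")
```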

GPTQ quantizes without loading the entire model into memory. Instead, it loads and quantizes the LLM module by module. Quantization also requires a small sample of data for calibration, which can take more than…