Hugging Face Quanto


In the world of deep learning, model size and inference speed are crucial factors. While larger models often offer higher accuracy, they come with a hefty price tag: increased memory consumption and slower processing times. Quantization is a powerful technique designed to address these challenges.

What is Quantization?

Quantization converts the floating-point numbers (typically 32-bit) that represent a neural network's weights and activations into lower-bit integers (e.g., 8-bit or even 4-bit). Think of it like compressing a high-resolution image: you lose some detail, but the file size shrinks significantly.
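To make the idea concrete, here is a minimal, self-contained sketch of per-tensor linear quantization to int8. This is a conceptual illustration only, not Quanto's internal implementation: the float values are mapped to 8-bit integers with a single scale factor and then mapped back, leaving a small rounding error.

import torch

# Conceptual sketch of symmetric per-tensor quantization (not Quanto's code).
x = torch.randn(4)                       # original float32 values
scale = x.abs().max() / 127              # one scale for the whole tensor
q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)  # 8-bit integers
x_hat = q.float() * scale                # dequantized approximation
print(q)                                 # compact int8 representation
print((x - x_hat).abs().max())           # the small "detail" that was lost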

Why is Quantization Beneficial?

  • Reduced Model Size: Lower-bit integers drastically reduce the memory footprint of your model, making it easier to deploy on resource-constrained devices like mobile phones or embedded systems.
  • Faster Inference: Integer operations are generally faster than floating-point operations. This leads to quicker predictions and a more responsive user experience.
  • Lower Power Consumption: Fewer computations translate to lower power consumption, which is especially important for battery-powered devices.

The Trade-off:

While quantization offers significant advantages, there is a potential trade-off in accuracy. Using lower-precision numbers can introduce some information loss, which might lead to a slight decrease in model performance. However, with careful techniques and advancements in quantization methods, this impact can often be minimized, so quantization remains a compelling optimization strategy.

In the following sections, we will explore how the Quanto library simplifies the process of quantizing your PyTorch models, helping you achieve smaller, faster, and more efficient models with minimal accuracy loss.

Under the hood, Quanto leverages the linear quantization algorithm. While this is a foundational quantization technique, it yields impressive results. For a deeper dive into performance metrics, take a look at the perplexity benchmark results for the Llama-3.1-8B model and other SOTA models (check the benchmark here). This benchmark highlights the effectiveness of Quanto's linear quantization in achieving significant model compression without compromising accuracy.


Installation

Getting started with Quanto is easy; just use pip:

pip install quanto accelerate transformers

Helper Functions

We'll define a few helper functions to assist us in analyzing our models.

Inspect Model Tensors: The named_module_tensors function iterates through the named parameters and buffers of a module. It handles both regular tensors and quantized tensors (which may have _data and _scale attributes).

import torch

def named_module_tensors(module, recurse=False):
    # Yield (name, tensor) pairs for the module's parameters.
    # Quantized tensors store their integer data and scale separately,
    # so they are reported as `<name>._data` and `<name>._scale`.
    for name, val in module.named_parameters(recurse=recurse):
        if hasattr(val, "_data") or hasattr(val, "_scale"):
            if hasattr(val, "_data"):
                yield name + "._data", val._data
            if hasattr(val, "_scale"):
                yield name + "._scale", val._scale
        else:
            yield name, val

    # Buffers (e.g., running statistics) are yielded as-is.
    for named_buffer in module.named_buffers(recurse=recurse):
        yield named_buffer
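As a quick, purely illustrative sanity check, you can run the helper on a small torch.nn.Linear layer and confirm it yields the expected parameter tensors:

# Illustrative check on a tiny layer (not part of the Quanto workflow itself).
layer = torch.nn.Linear(4, 4)
for name, tensor in named_module_tensors(layer):
    print(name, tuple(tensor.shape), tensor.dtype)
# weight (4, 4) torch.float32
# bias (4,) torch.float32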

Calculate Data Type Size: The dtype_byte_size function determines the size (in bytes) occupied by a single element of a given PyTorch data type. For example, torch.float32 returns 4 bytes.

def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8
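A quick illustrative check of the helper on a few common dtypes:

# Each print shows the per-element size in bytes.
print(dtype_byte_size(torch.float32))  # 4
print(dtype_byte_size(torch.float16))  # 2
print(dtype_byte_size(torch.int8))     # 1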

Compute Module Sizes: The compute_module_sizes function calculates the size of each submodule within a model, taking into account the sizes of its parameters and buffers.

This function uses the previous two helpers to calculate the total size of each submodule within a model. It iterates through all tensors, computes each tensor's size from its data type and number of elements, and accumulates these sizes for every submodule.

def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
      size = tensor.numel() * dtype_byte_size(tensor.dtype)
      name_parts = name.split(".")
      for idx in range(len(name_parts) + 1):
        module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes
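As an illustrative check, a torch.nn.Linear(4, 4) layer has 16 weight values and 4 bias values in float32, so the helper should report 20 * 4 = 80 bytes for the module as a whole:

# The empty-string key holds the accumulated size of the whole module.
sizes = compute_module_sizes(torch.nn.Linear(4, 4))
print(sizes[""])        # 80 bytes in total
print(sizes["weight"])  # 64
print(sizes["bias"])    # 16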

These helper functions will be useful as we proceed with quantizing our models and analyzing the impact on model size.


Loading a Pre-trained Model

Let's start by loading a pre-trained language model from the Hugging Face Transformers library. We'll use the Qwen/Qwen2-0.5B model for this example.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

Testing the Original Model

Let's quickly test the loaded model to make sure it's working as expected. We'll provide it with a simple prompt and see how it continues the story:

text = "Once upon a time, there was a"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Once upon a time, there was a man named John. He was a very good man

As you can see, the model continues the story as expected.

Analyzing the Original Model Size

Before we quantize the model, let's determine its current size. We can use our compute_module_sizes helper function for this:

module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

Output:

The model size is 3.58674688 GB

As you can see, the original model occupies approximately 3.59 GB. This is a significant amount of memory, especially if we're considering deploying the model on resource-constrained devices.

Quantizing the Model with Quanto

Now comes the core of our optimization process: quantizing the model with Quanto. We'll use the quantize and freeze functions for this purpose:

from quanto import quantize, freeze

quantize(model, weights=torch.int8, activations=None)
freeze(model)

Explanation:

  • quantize(model, weights=torch.int8, activations=None): This line initiates the quantization process. Specifying torch.int8 for the weights argument means the model's weights are quantized to 8-bit integers. Setting activations to None means we only quantize the weights, not the activations.

  • freeze(model): This line freezes the quantized model, preparing it for inference. Freezing converts the dynamically quantized weights into statically quantized weights, leading to potential performance improvements.

  • Weights Quantization: We're focusing on weights quantization in this example as it typically provides the most significant size reduction with minimal impact on accuracy.

  • Activation Quantization: While we're not quantizing activations in this initial step, you can experiment with activation quantization as well, potentially achieving further size reduction. Activation quantization usually requires more careful calibration to maintain accuracy; a hedged sketch follows after this list.
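If you do want to try activation quantization, the call pattern is the same, but activations need a short calibration pass over representative inputs so their ranges can be estimated. The sketch below assumes Quanto exposes a Calibration context manager for this, as described in its documentation; the exact name and arguments may differ between versions, so check the docs for your installed release.

from quanto import quantize, freeze, Calibration  # Calibration assumed; verify against your Quanto version

quantize(model, weights=torch.int8, activations=torch.int8)

# Run a few representative prompts so activation ranges can be observed.
with Calibration():
    calib_inputs = tokenizer("Once upon a time, there was a", return_tensors="pt")
    model(**calib_inputs)

freeze(model)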

Analyzing the Quantized Model Size

Now that we've quantized the model, let's see how much its size has been reduced. We can use the same compute_module_sizes helper function as before:

module_sizes = compute_module_sizes(model)
print(f"The model size is {module_sizes[''] * 1e-9} GB")

Output:

The model size is 2.651227464 GB

Impressive Size Reduction!

The quantized model now occupies approximately 2.65 GB. This is a significant reduction from the original 3.59 GB: roughly a 26% decrease, achieved simply by quantizing the weights to 8-bit integers.
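For reference, the reduction can be computed directly from the two sizes printed above:

# Sizes taken from the outputs earlier in this section.
original_gb = 3.58674688
quantized_gb = 2.651227464
print(f"Size reduction: {(original_gb - quantized_gb) / original_gb:.1%}")  # ~26.1%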


Testing the Quantized Model

Let's test our quantized model with the same prompt we used earlier to see if the generated text is still coherent and meaningful:

text = "Once upon a time, there was a"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Once upon a time, there was a man who was very rich. He had a lot

Maintaining Coherence

The quantized model generates a slightly different continuation of the story, but it remains coherent and grammatically correct. This suggests that the quantization process hasn't significantly impacted the model's ability to generate meaningful text.