Quantization Granularity
Quantization granularity refers to the level at which quantization parameters (scale and zero-point) are shared when continuous values are mapped to a discrete form. In the context of deep learning, it determines how finely the weights and activations of a neural network are quantized: one set of parameters per tensor, per channel, or per group of elements.
Types of Quantization Granularity
- Per-Tensor Quantization:
  - A single scale and zero-point are used for all elements in a tensor.
  - Simplest, but may not be optimal for tensors with diverse distributions.
- Per-Channel Quantization:
  - A separate scale and zero-point are used for each channel of a tensor.
  - More flexible than per-tensor quantization, especially for tensors whose distributions differ across channels (see the sketch after this list).
- Per-Group Quantization:
  - A group of elements within a tensor is quantized using a single scale and zero-point.
  - Intermediate between per-tensor and per-channel quantization, offering a balance between accuracy and memory efficiency.
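To make the difference concrete, here is a minimal PyTorch sketch (the tensor values and variable names are our own illustration, not from the notebook) comparing per-tensor and per-channel symmetric int8 scales for the same weight matrix:

```python
import torch

w = torch.tensor([[ 0.5, -1.0,  2.0],
                  [ 0.1,  0.2, -0.1]])  # two rows (channels) with very different ranges

# Per-tensor: one scale for the whole matrix (symmetric int8 range [-127, 127])
per_tensor_scale = w.abs().max() / 127
q_per_tensor = torch.round(w / per_tensor_scale).to(torch.int8)

# Per-channel: one scale per output row, so the small-range row keeps more precision
per_channel_scales = w.abs().max(dim=-1, keepdim=True).values / 127
q_per_channel = torch.round(w / per_channel_scales).to(torch.int8)

print(q_per_tensor)   # row 2 collapses to tiny values: [6, 13, -6]
print(q_per_channel)  # row 2 uses the full int8 range: [64, 127, -64]
```

The small-range row is nearly wiped out under the shared per-tensor scale, which is exactly why per-channel quantization helps when channel distributions diverge.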
8-Bit Quantizer
- Will quantize any model in 8-bit precision.
- This quantizer is modality agnostic, meaning we can apply it to any model: vision, audio, text, and even multimodal.
- Will use per-channel linear quantization.
- Will create a W8A16LinearLayer class to store 8-bit weights and scales (a sketch follows this list).
- Will replace all torch.nn.Linear layers with W8A16LinearLayer.
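A minimal sketch of what such a layer could look like. This is our approximation, assuming symmetric per-channel quantization of the weights and a matmul in the activation dtype; the notebook's actual class may differ in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class W8A16LinearLayer(nn.Module):
    """8-bit weights, 16-bit activations: int8 weight matrix plus one scale per output channel."""
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # Weights and scales are buffers, not Parameters, so they are
        # excluded from gradient tracking (inference-only quantization).
        self.register_buffer("int8_weights",
                             torch.randint(-128, 127, (out_features, in_features),
                                           dtype=torch.int8))
        self.register_buffer("scales", torch.randn(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    def quantize(self, weights):
        # Per-channel symmetric quantization: one scale per output row.
        w_fp32 = weights.clone().to(torch.float32)
        scales = w_fp32.abs().max(dim=-1).values / 127
        scales = scales.to(weights.dtype)
        self.int8_weights = torch.round(w_fp32 / scales.unsqueeze(1)).to(torch.int8)
        self.scales = scales

    def forward(self, x):
        # Dequantize on the fly: cast int8 weights to the activation dtype,
        # run the matmul, then apply the per-channel scales.
        w = self.int8_weights.to(x.dtype)
        out = F.linear(x, w) * self.scales
        if self.bias is not None:
            out = out + self.bias
        return out
```

Casting the int8 weights up at run time keeps storage low while letting the matmul run in the activation dtype, hence W8A16: 8-bit weights, 16-bit activations.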
Applying 8-Bit Quantization
Using the Salesforce/codegen-350M-mono model from Hugging Face.
- This is a language model that has been fine-tuned on code.
- It has only 350 million parameters.
- Let's use transformers to load the model and its tokenizer and run some generation.
- For the complete code, refer to the notebook.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
Before Quantization
- Parameters (printing the model shows the torch.nn.Linear layers that will be replaced):

```
CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (mlp): CodeGenMLP(
          (fc_in): Linear(in_features=1024, out_features=4096, bias=True)
          (fc_out): Linear(in_features=4096, out_features=1024, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
```
- Memory Footprint:

```python
print("Footprint of the model in MBs: ", model.get_memory_footprint()/1e+6)
```

Output: `Footprint of the model in MBs:  797.310976`
- Text generated:

```python
# build the text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# since the model is trained and fine-tuned on code, ask it to complete a function
print(pipe("def hello_world():", max_new_tokens=20, do_sample=False))
```

Output:

```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'def hello_world():\n print("Hello World")\n\n# hello_world()\n\n# def hello_'}]
```
Quantizing
- Replace the linear layers with the target class and quantize the model: TargetClass -> W8A16LinearLayer (a sketch of this helper follows the call below).
- We are not quantizing lm_head because the model is autoregressive:
  - it uses the output from the previous iteration to produce the output of the next iteration, so quantization error in the head would compound across generation steps.
```python
replace_linear_with_target_and_quantize(model, W8A16LinearLayer, ["lm_head"])
```
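The notebook defines this helper; a minimal sketch of how it could be implemented, reusing the W8A16LinearLayer sketch above (our approximation, not the notebook's exact code):

```python
import torch.nn as nn

def replace_linear_with_target_and_quantize(module, target_class, module_names_to_exclude):
    # Walk the module tree; swap every nn.Linear (except excluded names,
    # e.g. "lm_head") for the quantized layer, quantizing the original weights.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in module_names_to_exclude:
            old_bias = child.bias
            new_module = target_class(child.in_features, child.out_features,
                                      bias=old_bias is not None,
                                      dtype=child.weight.dtype)
            new_module.quantize(child.weight)  # fill int8 weights + per-channel scales
            if old_bias is not None:
                new_module.bias = old_bias
            setattr(module, name, new_module)
        else:
            # Recurse into container modules (blocks, attention, MLP, ...)
            replace_linear_with_target_and_quantize(child, target_class,
                                                    module_names_to_exclude)
```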
After Quantization
- Parameters (printing pipe.model now shows W8A16LinearLayer in place of the inner Linear layers, while lm_head is untouched as requested):

```
CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): W8A16LinearLayer()
          (out_proj): W8A16LinearLayer()
        )
        (mlp): CodeGenMLP(
          (fc_in): W8A16LinearLayer()
          (fc_out): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
```
- Memory Footprint:

```python
print("Footprint of the quantized model in MBs: ", pipe.model.get_memory_footprint()/1e+6)
```

Output: `Footprint of the quantized model in MBs:  546.021376`
- Text Generated:
  - When asked to complete `def hello_world():` (the output matches the pre-quantization model):

```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def hello_world():
    print("Hello World")

# hello_world()

# def hello_
```

  - When asked to define a function which returns the sum of the first 5 natural numbers:

```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def sum_of_natural_numbers(5):
    """Return sum of all numbers from 1 to 5."""
    sum = 0
    for i in range(1, 6):
        sum += i
    return sum

# print(sum_of_natural_numbers(5))
```