2.6M param Model

  • This model has 2.6 million parameters.

  • For the complete code implementation, refer to the notebook.
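
The parameter count can be checked directly from the model object. A minimal sketch, assuming `model` is the `BigramLanguageModel` instantiated as in the notebook:

    # Count every trainable tensor element registered as a parameter.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params:,} parameters")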

Before Quantization

  • Model:

    BigramLanguageModel(
      (token_embedding_table): Embedding(65, 192)
      (position_embedding_table): Embedding(32, 192)
      (blocks): Sequential(
        (0): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (1): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (2): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (3): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (4): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (5): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): Linear(in_features=192, out_features=32, bias=False)
                (query): Linear(in_features=192, out_features=32, bias=False)
                (value): Linear(in_features=192, out_features=32, bias=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): Linear(in_features=192, out_features=192, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): Linear(in_features=192, out_features=768, bias=True)
              (1): ReLU()
              (2): Linear(in_features=768, out_features=192, bias=True)
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
      )
      (ln_f): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
      (lm_head): Linear(in_features=192, out_features=65, bias=True)
    )
    

  • Model size:

    Model size: 10.43 MB
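
The reported size can be reproduced by summing the bytes occupied by the model's parameters and registered buffers (buffers such as attention masks, if any are registered, also count toward the total). A minimal sketch, assuming `model` is the instantiated `BigramLanguageModel` from the notebook:

    def model_size_mb(model):
        # Total storage of all parameters plus registered (non-trainable) buffers.
        size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
        size_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
        return size_bytes / 1024 ** 2

    print(f"Model size: {model_size_mb(model):.2f} MB")

The same helper can be reused on the quantized model below.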
    

After Quantization

  • Quantized Model:

    BigramLanguageModel(
      (token_embedding_table): Embedding(65, 192)
      (position_embedding_table): Embedding(32, 192)
      (blocks): Sequential(
        (0): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (1): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (2): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (3): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (4): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
        (5): Block(
          (sa): MultiHeadAttention(
            (heads): ModuleList(
              (0-5): 6 x Head(
                (key): W8A16LinearLayer()
                (query): W8A16LinearLayer()
                (value): W8A16LinearLayer()
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (proj): W8A16LinearLayer()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (ffwd): FeedFoward(
            (net): Sequential(
              (0): W8A16LinearLayer()
              (1): ReLU()
              (2): W8A16LinearLayer()
              (3): Dropout(p=0.1, inplace=False)
            )
          )
          (ln1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
          (ln2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
        )
      )
      (ln_f): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
      (lm_head): Linear(in_features=192, out_features=65, bias=True)
    )
    

  • Model size:

    Model size: 2.88 MB
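
The drop from 10.43 MB to 2.88 MB (roughly 3.6x) comes from storing each quantized Linear layer's weights in int8 (1 byte per value) instead of float32 (4 bytes), while the embeddings, LayerNorms, biases, per-channel scales, and the lm_head stay in full precision, as the printout above shows. Below is a minimal sketch of what such a W8A16LinearLayer and the module-replacement step could look like; the helper name replace_linear_with_w8a16 and the internal details (buffer names, dtypes, exclusion list) are illustrative assumptions, and the notebook's actual implementation may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class W8A16LinearLayer(nn.Module):
        """Linear layer with int8 weights and float activations (W8A16-style)."""

        def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
            super().__init__()
            # int8 weights and per-output-channel scales are stored as buffers,
            # since they are not trained after quantization.
            self.register_buffer("int8_weights",
                                 torch.zeros((out_features, in_features), dtype=torch.int8))
            self.register_buffer("scales", torch.zeros(out_features, dtype=dtype))
            if bias:
                self.register_buffer("bias", torch.zeros((1, out_features), dtype=dtype))
            else:
                self.bias = None

        def quantize(self, weights):
            # Symmetric (absmax) per-output-channel quantization of a float weight matrix.
            w_fp32 = weights.detach().clone().to(torch.float32)
            scales = w_fp32.abs().max(dim=-1).values / 127
            self.int8_weights = torch.round(w_fp32 / scales.unsqueeze(1)).to(torch.int8)
            self.scales = scales.to(weights.dtype)

        def forward(self, x):
            # Dequantize on the fly: cast the int8 weights to the activation dtype, rescale.
            out = F.linear(x, self.int8_weights.to(x.dtype)) * self.scales
            if self.bias is not None:
                out = out + self.bias
            return out

    def replace_linear_with_w8a16(module, exclude=("lm_head",)):
        # Recursively swap nn.Linear children for W8A16LinearLayer,
        # leaving excluded layers (here the lm_head) in full precision.
        for name, child in module.named_children():
            if isinstance(child, nn.Linear) and name not in exclude:
                q = W8A16LinearLayer(child.in_features, child.out_features,
                                     bias=child.bias is not None,
                                     dtype=child.weight.dtype)
                q.quantize(child.weight)
                if child.bias is not None:
                    q.bias = child.bias.data.reshape(1, -1)
                setattr(module, name, q)
            else:
                replace_linear_with_w8a16(child, exclude)

Running replace_linear_with_w8a16(model) and printing the model should produce a structure like the one above (with lm_head left as a regular Linear), and re-running the size helper from the previous section then reports the smaller figure.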
    

Papers on Quantization Methods

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale @arxiv.org

    • Proposed an 8-bit quantization method with no performance degradation that decomposes the underlying matrix multiplication into two parts: outlier features are multiplied in float16, the remaining features in int8 (a rough sketch of this decomposition follows this list).

  • QLoRA: Efficient Finetuning of Quantized LLMs @arxiv.org

    • Makes LLMs more accessible by quantizing them to 4-bit precision while still allowing fine-tuning via low-rank adapters on top of the frozen, quantized model.

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models @arxiv.org

    • Proposed to pre-calibrate the model so that quantization is not thrown off by the large activation outliers that appear in large models.
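
To make the LLM.int8() idea concrete, here is a toy, illustrative sketch of the mixed-precision decomposition. It is not the paper's optimized kernel; the function name, the column-wise outlier criterion, and the simple absmax scaling are simplifying assumptions.

    import torch

    def llm_int8_style_matmul(x, w, threshold=6.0):
        # x: (tokens, in_features) float activations; w: (out_features, in_features) float weights.
        # Feature columns whose activation magnitude exceeds the threshold are treated as outliers.
        outliers = x.abs().max(dim=0).values > threshold
        regular = ~outliers

        # Outlier columns: multiply in higher precision (float16/float32).
        out_hi = x[:, outliers] @ w[:, outliers].T

        # Regular columns: absmax-quantize both operands to int8, multiply, then rescale back.
        x_r, w_r = x[:, regular], w[:, regular]
        sx = x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # per-token scale
        sw = w_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # per-output-channel scale
        xq = (x_r / sx).round().clamp(-127, 127).to(torch.int8)
        wq = (w_r / sw).round().clamp(-127, 127).to(torch.int8)
        out_lo = (xq.to(torch.int32) @ wq.to(torch.int32).T).to(x.dtype) * (sx * sw.T)

        return out_hi + out_lo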

Open source methods

These methods are designed to make LLMs smaller and faster, while minimizing performance degradation.

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees
  • AQLM: Extreme Compression of Large Language Models via Additive Quantization
