Deep Learning Notes
software

I spent some time diving into what LLMs actually are with the book Build a Large Language Model (From Scratch). I got through a good chunk of the book and the exercises in this repo, but didn’t finish it.

These are a few of the terms I needed to clarify for myself, so I thought I’d share them. From what I understand, they’re pretty basic.

Batch sizes

A small batch size requires less memory during training but leads to noisier model updates. A larger batch smooths the updates out at the cost of more memory, so the batch size is a trade-off and a hyperparameter to experiment with when training LLMs.
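
To see the "noisier updates" part in numbers, here's a rough plain-Python sketch. It isn't real training: the per-example gradients are made-up random values centered on 1.0, and `batch_gradient_estimates` is a hypothetical helper that just averages them batch by batch.

```python
import random
import statistics

random.seed(0)

# Made-up per-example "gradients": noisy values centered on a true gradient of 1.0.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(4096)]

def batch_gradient_estimates(grads, batch_size):
    """Average the per-example gradients in consecutive batches."""
    return [
        statistics.fmean(grads[i:i + batch_size])
        for i in range(0, len(grads) - batch_size + 1, batch_size)
    ]

small = batch_gradient_estimates(per_example_grads, batch_size=4)
large = batch_gradient_estimates(per_example_grads, batch_size=256)

# Spread of the batch-level estimates: a larger spread means noisier updates.
small_noise = statistics.stdev(small)
large_noise = statistics.stdev(large)
print(f"batch_size=4:   {len(small)} updates, noise {small_noise:.3f}")
print(f"batch_size=256: {len(large)} updates, noise {large_noise:.3f}")
```

The small batches produce many more updates, but each one is a shakier estimate of the true gradient.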

Stride

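When working with LLMs, the stride is how many positions the sliding input window moves between one training sample and the next. You increase it, up to the window size, to avoid overlap between batches. This is because more overlap could lead to increased overfitting.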
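
Here's a small sketch of how the stride changes the samples, assuming a sliding-window setup where each target is the input shifted by one token. The function name `sliding_windows` and the stand-in token IDs are my own choices for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; the target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive inputs share all but one token.
overlapping = sliding_windows(token_ids, max_length=4, stride=1)

# stride == max_length: no token appears in two inputs.
non_overlapping = sliding_windows(token_ids, max_length=4, stride=4)

for inputs, targets in non_overlapping:
    print(inputs, "->", targets)
```

With a stride of 1 the same tokens show up in many samples, while a stride equal to the window length uses each token as an input only once.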

Self-attention

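The higher the dot product between two elements, the higher their similarity and the attention score between them. The self-attention mechanism is also called scaled dot-product attention, because the dot products are divided by the square root of the key dimension before being normalized with softmax.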
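
To make the scoring concrete, here's a minimal plain-Python sketch of scaled dot-product attention. It's a simplification: the toy embeddings are made up, and the inputs are used directly as queries, keys, and values, whereas a real attention layer would first project them with trainable weight matrices. All function names are just for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(queries, keys, values):
    d_k = len(keys[0])
    context, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Context vector: weighted sum of the value vectors.
        context.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return context, all_weights

# Toy 2-dimensional embeddings for three tokens (made-up numbers).
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
context, weights = scaled_dot_product_attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

The first token attends more to the second token (similar vector, larger dot product) than to the third one.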

Causal attention

This is a specialized form of self-attention, also known as masked attention. It restricts the model to only consider the previous and current inputs in a sequence, instead of allowing access to the entire input like regular self-attention does.
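
One common way to get this behavior is to set the scores for future positions to negative infinity before the softmax, so they end up with a weight of zero. Here's a plain-Python sketch of that idea. The scores are made up and assumed to be already scaled, and `causal_attention_weights` is a hypothetical name.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions with -inf, then apply softmax row by row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only look at positions 0..i.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up attention scores for a 3-token sequence.
scores = [
    [0.5, 1.2, 0.3],
    [0.1, 0.8, 2.0],
    [1.0, 0.4, 0.6],
]
weights = causal_attention_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

Everything above the diagonal comes out as zero, and each row still sums to 1 over the positions the token is allowed to see.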

Dropout

This is a deep learning technique that ignores randomly selected hidden layer units during training to prevent overfitting. This way the model doesn’t become overly reliant on any specific set of hidden layer units. Dropout is only used during training and is disabled afterwards.

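It’s common to apply the dropout mask after computing the attention weights. The dropout rate is the probability that any given attention weight is zeroed out during training, so a rate of 0.5 drops about half of them. The weights that remain are scaled up by 1 / (1 - rate) to compensate.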
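
Here's a minimal plain-Python sketch of dropout on a row of attention weights. It assumes the usual "inverted dropout" variant, where the surviving values are scaled up by 1 / (1 - rate). The `dropout` function and its `training` flag are illustrative and not taken from any library.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate` and rescale the survivors.

    Assumes 0 <= rate < 1. Outside of training the values pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(123)
attn_weights = [0.1, 0.2, 0.3, 0.4]

train_out = dropout(attn_weights, rate=0.5, rng=rng)          # some entries zeroed
eval_out = dropout(attn_weights, rate=0.5, training=False)    # unchanged

print(train_out)
print(eval_out)
```

Which entries get dropped changes on every call during training, which is what keeps the model from leaning too hard on any single weight.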

Registering a buffer in the CausalAttention class

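We do this to prevent device mismatch errors. The buffer here is the causal mask: it isn’t a trainable parameter, but because it’s registered on the module, it’s automatically moved to the correct device (CPU or GPU) along with the model when training LLMs.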
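
Here's a stripped-down PyTorch sketch of the idea. `MaskHolder` is a hypothetical stand-in that only stores and applies a causal mask, not the full CausalAttention class, and it assumes PyTorch is installed.

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):
    """Hypothetical minimal module that only stores and applies a causal mask."""

    def __init__(self, context_length):
        super().__init__()
        # Ones above the diagonal mark the "future" positions to hide.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, attn_scores):
        num_tokens = attn_scores.shape[-1]
        return attn_scores.masked_fill(
            self.mask.bool()[:num_tokens, :num_tokens], float("-inf")
        )

module = MaskHolder(context_length=4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
module.to(device)  # the buffer moves together with the module

print(module.mask.device)
print(len(list(module.parameters())))  # buffers are not trainable parameters
```

If the mask were a plain tensor attribute instead, `module.to(device)` would leave it behind on the CPU and you'd have to move it yourself.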

ReLU activation function

ReLU stands for rectified linear unit. It turns negative inputs into 0 and passes positive inputs through unchanged, so a layer only outputs non-negative values.
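
Written out, ReLU is just max(0, x). A tiny plain-Python sketch with made-up inputs:

```python
def relu(x):
    """Rectified linear unit: negatives become 0, positives pass through."""
    return max(0.0, x)

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```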