Transformers revolutionized NLP by replacing recurrence with attention mechanisms. Each word in a sentence attends to every other word, assigning weights to represent relevance.
The key components are self-attention, multi-head attention, and positional encoding, which together enable deep contextual understanding of language.
Word Embedding
Word embeddings convert words into numerical vectors that capture their meanings and relationships in a continuous vector space.
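A minimal sketch of this idea in PyTorch, using nn.Embedding as a learnable lookup table; the toy vocabulary and vector size below are illustrative assumptions, not from the original text:

```python
import torch
import torch.nn as nn

# Toy vocabulary: maps each word to an integer index (illustrative only).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_dim = 8  # size of each word vector (assumed for the example)

# An embedding layer is a lookup table of learnable vectors, one per word.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# Convert a sentence to indices, then to dense vectors.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
indices = torch.tensor([vocab[w] for w in sentence])
vectors = embedding(indices)   # shape: (6, 8)
print(vectors.shape)           # torch.Size([6, 8])
```

During training, these vectors are updated so that words used in similar contexts end up close together in the vector space.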
Positional Encoding
Since Transformers process input all at once (not sequentially), positional encoding is added to provide information about the position of each word in the sequence.
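A sketch of the sinusoidal positional encoding used in the original Transformer paper, added directly to the word embeddings; the sequence length and model dimension are arbitrary choices for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# Positional encodings are simply added to the word embeddings.
seq_len, d_model = 6, 8                 # assumed sizes for the example
word_embeddings = torch.randn(seq_len, d_model)
x = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)                          # torch.Size([6, 8])
```

Because each position gets a distinct pattern of sines and cosines, the model can tell "cat sat" from "sat cat" even though it sees all words at once.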
Self-Attention
Self-attention lets each word in a sentence focus on every word in the sequence, including itself, to understand context. It computes weights that decide how much attention each word pays to the others.
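A bare-bones illustration of this weighting using plain dot products between word vectors; the learned query/key/value projections are covered in the next sections and are omitted here, and the vectors are random placeholders:

```python
import torch
import torch.nn.functional as F

# Pretend these are embeddings for a 4-word sentence (random for illustration).
x = torch.randn(4, 8)                  # (sequence length, embedding dim)

# Each word's similarity to every word, itself included.
scores = x @ x.T                       # (4, 4) raw attention scores

# Softmax turns each row of scores into weights that sum to 1.
weights = F.softmax(scores, dim=-1)    # (4, 4)

# Each word's new representation is a weighted mix of all word vectors.
context = weights @ x                  # (4, 8)
print(weights.sum(dim=-1))             # each row sums to 1.0
```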
Residual Connection
Residual connections add the original input back to the output of a layer. This helps preserve information and improves training in deep networks.
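A small sketch of a residual (skip) connection wrapped around one sub-layer, with layer normalization as commonly used in Transformer blocks; the linear sub-layer and sizes here are stand-in assumptions for the example:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sub-layer so that its input is added back to its output."""
    def __init__(self, d_model: int):
        super().__init__()
        self.sublayer = nn.Linear(d_model, d_model)  # stand-in for attention/FFN
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + sublayer(x): the original input is preserved in the output,
        # which keeps information and gradients flowing in deep stacks.
        return self.norm(x + self.sublayer(x))

block = ResidualBlock(d_model=8)
out = block(torch.randn(6, 8))          # (sequence length 6, model dim 8)
print(out.shape)                        # torch.Size([6, 8])
```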
Query, Key, and Value
In self-attention, each word generates a query, a key, and a value vector. Attention scores are computed by comparing each query with all keys, turned into weights with a softmax, and then used to take a weighted sum of the values.
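A compact sketch of single-head scaled dot-product attention with learned query, key, and value projections; the dimensions are illustrative, and a full Transformer would use multiple heads in parallel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)  # produces queries
        self.k_proj = nn.Linear(d_model, d_model)  # produces keys
        self.v_proj = nn.Linear(d_model, d_model)  # produces values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Compare every query with every key, scaled by sqrt(d_model).
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        weights = F.softmax(scores, dim=-1)   # attention weights per word
        return weights @ v                    # weighted sum of the values

attn = SelfAttention(d_model=8)
out = attn(torch.randn(6, 8))           # 6 words, 8-dimensional vectors
print(out.shape)                        # torch.Size([6, 8])
```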
Encoder & Decoder
The encoder processes the input sequence and creates context-aware representations. The decoder uses these representations to generate output (like a translation).
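A minimal sketch of this data flow using PyTorch's built-in nn.Transformer module; the tensor sizes, layer counts, and batch_first setting are assumptions chosen only to keep the example small:

```python
import torch
import torch.nn as nn

# A small encoder-decoder Transformer (sizes chosen only for the example).
model = nn.Transformer(d_model=32, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, 32)   # source sequence: batch 1, 10 words, dim 32
tgt = torch.randn(1, 7, 32)    # target sequence generated so far: 7 words

# The encoder builds context-aware representations of src; the decoder
# attends to them while producing representations for the target sequence.
out = model(src, tgt)
print(out.shape)               # torch.Size([1, 7, 32])
```

In a real translation model, src and tgt would come from embedded and positionally encoded word indices, and the decoder output would be projected onto the target vocabulary to predict the next word.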