Transformer (Part 3: Transformer Architecture)

Encoder & Decoder

The Transformer consists of two main parts, an encoder and a decoder, which are connected through cross-attention.

  • Encoder: Processes the input sequence using multiple layers of self-attention and feed-forward networks.
  • Decoder: Generates the target sequence, attending to its own previous outputs through self-attention and to the encoder’s output through cross-attention.

The Transformer architecture is illustrated in Figure 1 of the original paper: https://arxiv.org/pdf/1706.03762
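
To make the encoder-decoder connection concrete, here is a minimal sketch using PyTorch’s built-in nn.Transformer. The hyperparameters are the paper’s base-model defaults, and the random tensors are stand-ins for already-embedded source and target sequences; both choices are illustrative assumptions rather than code from this post.

```python
# Minimal encoder-decoder sketch with PyTorch's nn.Transformer.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    batch_first=True,
)

src = torch.randn(2, 20, 512)  # source embeddings: (batch, src_len, d_model)
tgt = torch.randn(2, 15, 512)  # target embeddings: (batch, tgt_len, d_model)

# The encoder self-attends over `src`; each decoder layer self-attends over
# `tgt` and then cross-attends to the encoder output (the "memory").
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 15, 512])
```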

Transformer (Part 2: Multi-Head Attention)

Before the Transformer, sequence models such as RNNs and LSTMs struggled to capture long-range dependencies and processed tokens sequentially, which limited parallelization. Self-attention was introduced as an alternative: it computes attention over all positions at once, enabling parallel computation while still capturing long-range dependencies.

However, a single-head self-attention mechanism has a limitation: each position produces only one set of attention weights, so a single head tends to capture just one type of relationship or pattern in the data.

Multi-Head Attention overcomes this by using multiple attention heads that capture different aspects of the input, improving the model’s expressiveness.
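
The sketch below illustrates the idea with a minimal multi-head self-attention module in PyTorch; the module name, dimensions, and head count are illustrative assumptions rather than code from this post.

```python
# Minimal multi-head self-attention: project Q/K/V, split into heads,
# apply scaled dot-product attention per head, then concatenate and project.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Reshape projections into (batch, heads, seq_len, d_head).
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed independently for each head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ v

        # Concatenate heads and project back to d_model.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

x = torch.randn(2, 10, 512)                # (batch, sequence length, model dimension)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 10, 512])
```

Each head works in its own lower-dimensional subspace (d_head = d_model / num_heads), which is what lets different heads specialize in different relationships at no extra overall cost.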

Transformer (Part 1: Word Embedding)

Word Embedding is one of the most fundamental techniques in Natural Language Processing (NLP). It represents words as dense, continuous vectors in a shared vector space, capturing semantic relationships between them.

Why Do We Need Word Embeddings?

Before word embeddings, one common way to represent words was One-Hot Encoding. In this approach, each word is a sparse vector whose length equals the vocabulary size, with a single 1 at that word’s index and 0s everywhere else.

For example, if our vocabulary has 10,000 words, each word is encoded as a 10,000-dimensional vector:
$$
\text{dog} = [0, 1, 0, 0, \dots, 0]
$$
However, this method has significant drawbacks:

  1. High dimensionality – A large vocabulary results in enormous vectors.
  2. No semantic similarity – “dog” and “cat” are conceptually related, but their one-hot vectors are orthogonal, so the encoding captures no notion of similarity between them.

Word embeddings solve these issues by learning low-dimensional, dense representations that encode semantic relationships between words.
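
As a quick illustration, the sketch below contrasts one-hot vectors with a learned embedding lookup table in PyTorch; the vocabulary indices, vocabulary size, and embedding dimension are illustrative assumptions.

```python
# One-hot vectors vs. a learned embedding table.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 10_000, 300
dog_id, cat_id = 42, 43  # hypothetical vocabulary indices

# One-hot: a 10,000-dimensional sparse vector with a single 1.
dog_onehot = F.one_hot(torch.tensor(dog_id), num_classes=vocab_size).float()
cat_onehot = F.one_hot(torch.tensor(cat_id), num_classes=vocab_size).float()
print(torch.cosine_similarity(dog_onehot, cat_onehot, dim=0))  # tensor(0.) — no similarity

# Embedding: a trainable (vocab_size x embed_dim) table of dense vectors.
embedding = nn.Embedding(vocab_size, embed_dim)
dog_vec = embedding(torch.tensor(dog_id))  # 300-dimensional dense vector
cat_vec = embedding(torch.tensor(cat_id))
# After training (e.g. with word2vec/GloVe-style objectives), related words
# end up close together; here the table is still randomly initialized.
print(torch.cosine_similarity(dog_vec, cat_vec, dim=0))
```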

https://corpling.hypotheses.org/495
