Transformer (Part 2: Multi-Head Attention)

Before the Transformer, sequence models such as RNNs and LSTMs processed tokens one at a time, which made it hard to capture long-range dependencies and prevented efficient parallelization. Self-Attention was introduced as an alternative: it relates every position to every other position in a single step, enabling parallel computation while still capturing long-range dependencies.

However, a single-head Self-Attention mechanism has a limitation:
each position computes only one set of attention weights, so different kinds of relationships (for example, syntactic structure and positional proximity) get blended into a single pattern rather than being modeled separately.

Multi-Head Attention overcomes this by running several attention heads in parallel, each with its own learned projections of the queries, keys, and values. Each head can attend to a different aspect of the input; the heads' outputs are concatenated and passed through a final linear projection, which improves the model’s expressiveness without substantially increasing compute.
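
To make this concrete, here is a minimal sketch of multi-head self-attention in PyTorch. The class name, layer names (`w_q`, `w_k`, `w_v`, `w_o`), and the sizes in the usage example are illustrative choices, not something from the original post; masking and dropout are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no masking, no dropout)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, values, plus an output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        # Project, then split the model dimension into heads:
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently for every head.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)   # (batch, num_heads, seq_len, seq_len)
        context = weights @ v                 # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads back together and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(context)

# Example usage with made-up sizes: batch of 2, sequence of 10 tokens, d_model = 64.
x = torch.randn(2, 10, 64)
mha = MultiHeadAttention(d_model=64, num_heads=8)
print(mha(x).shape)  # torch.Size([2, 10, 64])
```

Note that each head works in a reduced dimension (`d_model // num_heads`), so the total cost is comparable to a single head operating on the full model dimension; the extra expressiveness comes from the separate projections, not from extra width.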
