Building a Simplified Self-Attention Mechanism in Python
Introduction
Self-attention is one of the fundamental mechanisms behind Transformer models, the backbone of modern NLP architectures like BERT and GPT. In this tutorial, we’ll break down the self-attention mechanism into simple, digestible pieces and implement it from scratch in Python using NumPy. By the end, you’ll have a clear understanding of how self-attention works and why it’s so powerful.
Step 1: Generating Word Embeddings
To represent words numerically, we’ll create simple embeddings. While real-world embeddings are often pre-trained vectors (e.g., GloVe or Word2Vec), for simplicity, we’ll generate random embeddings for each word in our sentence.
import numpy as np

# Function to build simple embeddings
def build_embeddings(word):
    return np.random.rand(4)

# Example sentence
sentence = "In this tutorial I will show you how to build embeddings and the self attention mechanism."

# Splitting the sentence into words
words = sentence.split()

# Generating embeddings for each word
embeddings = [build_embeddings(word) for word in words]
Here, we use a small embedding size of 4 for simplicity. In practice, embedding sizes can range from 50 to 1,024 dimensions.
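One detail worth noting: build_embeddings above ignores its argument and draws a fresh random vector on every call, whereas real embeddings are fixed per word. A small optional tweak, using a plain dictionary as a cache (introduced here purely for illustration), keeps the vector consistent if the same word ever appears more than once:
# Cache so that repeated words always get the same embedding vector
embedding_cache = {}

def build_embeddings(word, dim=4):
    if word not in embedding_cache:
        embedding_cache[word] = np.random.rand(dim)
    return embedding_cache[word]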
Step 2: Implementing a Simplified Softmax Function
The softmax function converts raw scores into probabilities, making it an essential part of the self-attention mechanism.
# Simplified softmax function
def softmax(x):
    # Subtracting the max score improves numerical stability without changing the result
    x = x - np.max(x, axis=-1, keepdims=True)
    # Normalizing along the last axis so that each row of scores sums to 1
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)
This function normalizes the scores along the last axis, so that each row of attention weights sums to 1 and can be read as a probability distribution over the words in the sentence.
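For example, applying it to a small vector of raw scores shows how they turn into weights that sum to 1:
# Quick check: raw scores become weights that sum to 1
example_scores = np.array([2.0, 1.0, 0.1])
example_weights = softmax(example_scores)
print(example_weights)        # roughly [0.66, 0.24, 0.10]
print(example_weights.sum())  # 1.0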
Step 3: The Self-Attention Mechanism
Self-attention works by determining how much focus (or attention) each word should give to every other word in the sentence. Let’s implement it step by step:
3.1: Building Query (Q), Key (K), and Value (V)
In Transformers, each word is mapped to three vectors: Query (Q), Key (K), and Value (V). Here, for simplicity, we’ll use the embeddings for all three.
3.2: Calculating Attention Scores
We compute the attention scores using the dot product between Q and the transpose of K.
3.3: Applying Softmax
We use the softmax function to normalize these scores, producing attention weights.
3.4: Weighted Sum of Values
Finally, the attention weights are used to compute a weighted sum of the Value (V) vectors.
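Putting these four sub-steps together, the whole computation can be written compactly as weighted_values = softmax(Q · Kᵀ) · V, with the softmax applied row by row. The implementation below follows exactly this recipe.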
# Simplified self-attention mechanism
def self_attention(embeddings):
    # Building Q, K, V
    Q = np.array(embeddings)
    K = np.array(embeddings)
    V = np.array(embeddings)

    # Calculating the attention scores
    scores = np.dot(Q, K.T)

    # Applying the softmax function to get the attention weights
    attention_weights = softmax(scores)

    # Computing the weighted sum of the value vectors
    weighted_values = np.dot(attention_weights, V)

    return attention_weights, weighted_values

# Running self-attention
attention_weights, weighted_values = self_attention(embeddings)
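In full Transformers, the dot products are additionally divided by the square root of the key dimension before the softmax, which keeps the scores from growing too large as the embedding size increases. A minimal scaled variant of the function above, using the same NumPy setup, could look like this:
# Scaled variant: dividing by sqrt(d_k) keeps the softmax from saturating
def scaled_self_attention(embeddings):
    Q = np.array(embeddings)
    K = np.array(embeddings)
    V = np.array(embeddings)
    d_k = K.shape[-1]  # Dimension of the key vectors (4 in this tutorial)
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    attention_weights = softmax(scores)
    weighted_values = np.dot(attention_weights, V)
    return attention_weights, weighted_values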
Step 4: Interpreting the Attention Weights
Let’s analyze the attention weights and see which words influence each other the most.
# Displaying the attention weights
print("Original sentence: ", sentence)
for i, word in enumerate(words):
    print("For the word", word)
    top_3 = sorted(range(len(words)),
                   key=lambda j: attention_weights[i][j],
                   reverse=True)[:3]
    print(top_3)
    for j in top_3:
        print(f"Attention for: {words[j]}: {attention_weights[i][j]}")
Here, for each word, we list the three words it attends to most strongly, according to its row of attention weights.
Step 5: Explaining Attention Relationships
To make our model’s behavior more interpretable, we identify which word each word “attends” to the most (excluding itself).
# Explanation of attention relationships
print("Explanation")
for i, word in enumerate(words):
    sorted_indices = np.argsort(-attention_weights[i])
    second_max_attention = sorted_indices[1]  # Skipping the first entry (the word attending to itself)
    print(
        f"The word '{word}' attends most to '{words[second_max_attention]}'"
    )
This simple analysis provides insights into the relationships captured by the self-attention mechanism.
Conclusion
In this tutorial, we implemented a simplified version of the self-attention mechanism, demonstrating its key components and functionality. While this example skips some real-world complexities (e.g., learned Q, K, and V projections, scaling, and positional encodings), it provides a solid foundation for understanding the inner workings of self-attention.
Feel free to experiment with different sentences or extend this implementation to include additional concepts like multi-head attention or positional encodings.
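As one possible starting point for that kind of experiment, here is a rough sketch of multi-head attention built on top of the functions above. The random projection matrices (W_q, W_k, W_v) stand in for the learned weights a real Transformer would train, so the outputs are only illustrative:
# Rough multi-head sketch: each head works on its own projected slice of the embeddings
def multi_head_attention(embeddings, num_heads=2):
    X = np.array(embeddings)
    d_model = X.shape[1]
    d_head = d_model // num_heads  # Assumes d_model is divisible by num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Random projections stand in for learned weight matrices
        W_q = np.random.rand(d_model, d_head)
        W_k = np.random.rand(d_model, d_head)
        W_v = np.random.rand(d_model, d_head)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = softmax(scores)
        head_outputs.append(weights @ V)
    # Concatenating the heads restores the original embedding dimension
    return np.concatenate(head_outputs, axis=-1)

multi_head_output = multi_head_attention(embeddings)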
Happy coding!