Building a Simplified Self-Attention Mechanism in Python
Introduction
Self-attention is one of the fundamental mechanisms behind Transformer models, the backbone of modern NLP architectures like BERT and GPT. In this tutorial, we’ll break down the self-attention mechanism into simple, digestible pieces and implement it from scratch in Python using NumPy. By the end, you’ll have a clear understanding of how self-attention works and why it’s so powerful.
Step 1: Generating Word Embeddings
To represent words numerically, we’ll create simple embeddings. While real-world embeddings are often pre-trained vectors (e.g., GloVe or Word2Vec), for simplicity, we’ll generate random embeddings for each word in our sentence.
import numpy as np

# Function to build simple embeddings
def build_embeddings(word):
    return np.random.rand(4)

# Example sentence
sentence = "In this tutorial I will show you how to build embeddings and the self attention mechanism."

# Splitting the sentence into words
words = sentence.split()

# Generating embeddings for each word
embeddings = [build_embeddings(word) for word in words]
Here, we use a small embedding size of 4 for simplicity. In practice, embedding sizes can range from 50 to 1,024 dimensions.
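One detail worth noting: build_embeddings above ignores its argument and draws a fresh random vector on every call, whereas real embeddings are fixed per word. A small optional tweak, using a plain dictionary as a cache (introduced here purely for illustration), keeps the vector consistent if the same word ever appears more than once:
# Cache so that repeated words always get the same embedding vector
embedding_cache = {}

def build_embeddings(word, dim=4):
    if word not in embedding_cache:
        embedding_cache[word] = np.random.rand(dim)
    return embedding_cache[word]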
Step 2: Implementing a Simplified Softmax Function
The softmax function converts raw scores into probabilities, making it an essential part of the self-attention mechanism.
# Simplified softmax function
def softmax(x):
    # Subtracting the max score improves numerical stability without changing the result
    x = x - np.max(x, axis=-1, keepdims=True)
    # Normalizing along the last axis so that each row of scores sums to 1
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)
This function normalizes the scores along the last axis, so that each row of attention weights sums to 1 and can be read as a probability distribution over the words in the sentence.
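For example, applying it to a small vector of raw scores shows how they turn into weights that sum to 1:
# Quick check: raw scores become weights that sum to 1
example_scores = np.array([2.0, 1.0, 0.1])
example_weights = softmax(example_scores)
print(example_weights)        # roughly [0.66, 0.24, 0.10]
print(example_weights.sum())  # 1.0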
Step 3: The Self-Attention Mechanism
Self-attention works by determining how much focus (or attention) each word should give to every other word in the sentence. Let’s implement it step by step:
3.1: Building Query (Q), Key (K), and Value (V)
In Transformers, each word is mapped to three vectors: Query (Q), Key (K), and Value (V). Here, for simplicity, we’ll use the embeddings for all three.
3.2: Calculating Attention Scores
We compute the attention scores using the dot product between Q and the transpose of K.
3.3: Applying Softmax
We use the softmax function to normalize these scores, producing attention weights.
3.4: Weighted Sum of Values
Finally, the attention weights are used to compute a weighted sum of the Value (V) vectors.
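Putting these four sub-steps together, the whole computation can be written compactly as weighted_values = softmax(Q · Kᵀ) · V, with the softmax applied row by row. The implementation below follows exactly this recipe.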
# Simplified self-attention mechanism
def self_attention(embeddings):
    # Building Q, K, V
    Q = np.array(embeddings)
    K = np.array(embeddings)
    V = np.array(embeddings)

    # Calculating the attention scores
    scores = np.dot(Q, K.T)

    # Applying the softmax function to get the attention weights
    attention_weights = softmax(scores)

    # Computing the weighted sum of the value vectors
    weighted_values = np.dot(attention_weights, V)

    return attention_weights, weighted_values

# Running self-attention
attention_weights, weighted_values = self_attention(embeddings)
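In full Transformers, the dot products are additionally divided by the square root of the key dimension before the softmax, which keeps the scores from growing too large as the embedding size increases. A minimal scaled variant of the function above, using the same NumPy setup, could look like this:
# Scaled variant: dividing by sqrt(d_k) keeps the softmax from saturating
def scaled_self_attention(embeddings):
    Q = np.array(embeddings)
    K = np.array(embeddings)
    V = np.array(embeddings)
    d_k = K.shape[-1]  # Dimension of the key vectors (4 in this tutorial)
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    attention_weights = softmax(scores)
    weighted_values = np.dot(attention_weights, V)
    return attention_weights, weighted_values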
Step 4: Interpreting the Attention Weights
Let’s analyze the attention weights and see which words influence each other the most.
# Displaying the attention weights
print("Original sentence: ", sentence)
for i, word in enumerate(words):
    print("For the word", word)
    top_3 = sorted(range(len(words)),
                   key=lambda j: attention_weights[i][j],
                   reverse=True)[:3]
    print(top_3)
    for j in top_3:
        print(f"Attention for: {words[j]}: {attention_weights[i][j]}")
Here, for each word, we list the three words it attends to most strongly, according to its row of attention weights.
Step 5: Explaining Attention Relationships
To make our model’s behavior more interpretable, we identify which word each word “attends” to the most (excluding itself).
# Explanation of attention relationships
print("Explanation")
for i, word in enumerate(words):
    sorted_indices = np.argsort(-attention_weights[i])
    second_max_attention = sorted_indices[1]  # Skipping the first entry (the word attending to itself)
    print(
        f"The word '{word}' attends most to '{words[second_max_attention]}'"
    )
This simple analysis provides insights into the relationships captured by the self-attention mechanism.
Conclusion
In this tutorial, we implemented a simplified version of the self-attention mechanism, demonstrating its key components and functionality. While this example skips some real-world complexities (e.g., learned Q, K, and V projections, scaling, and positional encodings), it provides a solid foundation for understanding the inner workings of self-attention.
Feel free to experiment with different sentences or extend this implementation to include additional concepts like multi-head attention or positional encodings.
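As one possible starting point for that kind of experiment, here is a rough sketch of multi-head attention built on top of the functions above. The random projection matrices (W_q, W_k, W_v) stand in for the learned weights a real Transformer would train, so the outputs are only illustrative:
# Rough multi-head sketch: each head works on its own projected slice of the embeddings
def multi_head_attention(embeddings, num_heads=2):
    X = np.array(embeddings)
    d_model = X.shape[1]
    d_head = d_model // num_heads  # Assumes d_model is divisible by num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Random projections stand in for learned weight matrices
        W_q = np.random.rand(d_model, d_head)
        W_k = np.random.rand(d_model, d_head)
        W_v = np.random.rand(d_model, d_head)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_head)
        weights = softmax(scores)
        head_outputs.append(weights @ V)
    # Concatenating the heads restores the original embedding dimension
    return np.concatenate(head_outputs, axis=-1)

multi_head_output = multi_head_attention(embeddings)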
Happy coding!