
🧠 Transformer

Transformer Architecture

📖 Definition

The Transformer is a deep learning architecture based on the self-attention mechanism, first introduced by a Google team in the 2017 paper "Attention Is All You Need". It fundamentally changed the field of natural language processing and is the foundational architecture behind modern large language models such as GPT and BERT.

⚙️ How It Works

Core Components

- Self-Attention: lets every position in the sequence attend directly to every other position, weighted by relevance.
- Multi-Head Attention: runs several attention operations in parallel so that different heads can capture different kinds of relationships.
- Positional Encoding: injects word-order information, since attention by itself is order-agnostic (see the sketch below).
- Feed-Forward Network: a position-wise two-layer MLP applied independently to each token.
- Residual Connections and Layer Normalization: stabilize training and make deep stacking of layers possible.
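Of these, positional encoding is the easiest to show in isolation. Below is a minimal sketch of the sinusoidal encoding from the original paper; the helper name and parameters are illustrative and assume an even d_model:

# Minimal sketch: sinusoidal positional encoding (illustrative helper)
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # position indices as a column vector: (max_len, 1)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i / d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe  # (max_len, d_model); added to the token embeddings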

Workflow

Through self-attention, the Transformer processes an entire sequence in parallel. This removes the step-by-step sequential dependency of RNNs and enables efficient parallel computation.
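To make this parallelism concrete, the following minimal sketch uses PyTorch's built-in nn.TransformerEncoderLayer (chosen here for illustration; the from-scratch example below implements the attention step itself). All positions are processed in a single forward pass rather than one token at a time:

# Minimal sketch: an entire batch of sequences in one forward pass
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
src = torch.rand(2, 10, 64)   # 2 sequences, 10 tokens each, 64-dim embeddings
out = encoder_layer(src)      # all 10 positions attend to each other at once
print(out.shape)              # torch.Size([2, 10, 64])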

💡 Application Scenarios

- Machine translation: the task the architecture was originally designed for.
- Text generation and dialogue: decoder-only models such as GPT.
- Language understanding: classification, named entity recognition, and question answering with encoder models such as BERT.
- Beyond NLP: Vision Transformers (ViT) for images, and Transformer-based models for speech and code.
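As a quick taste, a pretrained Transformer can be applied to one of these tasks in a few lines. This is a minimal sketch assuming the Hugging Face transformers library is installed; it is not part of the original entry:

# Minimal sketch: sentiment analysis with a pretrained Transformer
# (assumes: pip install transformers)
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained model
print(classifier("Transformers changed NLP forever."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]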

💻 Code Example

# A simplified multi-head Self-Attention implemented with PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Projections for values, keys, queries, plus the final output projection
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]  # batch size
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Project, then split the embedding dimension into `heads` heads
        values = self.values(values).view(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).view(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).view(N, query_len, self.heads, self.head_dim)

        # Attention scores: query/key dot products -> (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # Masked positions get a large negative score so softmax assigns them ~0 weight
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(head_dim) (the per-head key dimension), as in the original paper
        attention = F.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Weighted sum of values, then concatenate the heads back together
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).contiguous()
        out = out.view(N, query_len, self.heads * self.head_dim)

        return self.fc_out(out)
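A quick usage check (shapes chosen for illustration): in plain self-attention, the values, keys, and queries are all the same input tensor.

# Example usage of the SelfAttention module above
x = torch.rand(2, 5, 256)                      # batch of 2, 5 tokens, embed_size 256
attn = SelfAttention(embed_size=256, heads=8)
out = attn(x, x, x, mask=None)                 # self-attention: V, K, Q are the same
print(out.shape)                               # torch.Size([2, 5, 256])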
