Learn LLM by Debugging

Preface#

It's been a while since I last updated this blog. The past few months have been busy, and even when I wasn't busy I never quite got around to writing. It feels like I did a lot, yet when I stop and think about it, it also feels like I did nothing at all. I had wanted to write this post for a long time; the idea came from the Bilibili video *强烈推荐新人入门 LLM 的方法——逐行调试一个小模型* (a strongly recommended way for newcomers to get into LLMs: step through a small model line by line), but I kept putting it off until now.

If you are new to LLMs, you may have read plenty of papers and studied the theory of machine learning, deep learning, and transformers. You can probably follow an LLM architecture diagram and understand how a transformer works, yet still feel lost the moment you face the code. In PyTorch, a convolution layer, a fully connected layer, and even an entire Transformer encoder are all just Modules, and an LLM is nothing more than these modules composed together. People like to say that writing the code is just stacking building blocks, and that is true, but it is still hard for a newcomer to internalize. The architecture diagram in a paper is abstract and static, while the modules in code are concrete and dynamic; a paper is mainly telling a story and presenting a method, and it neither will nor can describe the architecture in full detail. On top of that, the traditional programming mindset is "do A, then B, then C", whereas the deep learning mindset is "define a structure, feed it data, and let it learn on its own". That level of abstraction, plus all the tensor shape juggling, can leave a newcomer completely disoriented when reading or writing code.
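To make the "building blocks" idea concrete, here is a toy sketch of my own (not from the video, and not GPT-2 specific): a tiny "language model" assembled entirely from standard PyTorch modules, the same way GPT-2 composes embeddings, attention blocks, and a linear head.

import torch
import torch.nn as nn

# A toy model built only from off-the-shelf Modules; the point is the composition, not the quality.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=100, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)   # token id -> vector
        self.mixer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)       # vector -> score per vocab token

    def forward(self, ids):                             # ids: (batch, seq_len)
        x = self.embed(ids)                             # (batch, seq_len, hidden)
        x = self.mixer(x)                               # same shape; tokens now "see" each other
        return self.head(x)                             # (batch, seq_len, vocab_size)

logits = ToyLM()(torch.randint(0, 100, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 100])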

The mainstream learning path online is to walk you through writing a simple transformer by hand, i.e. construction. Construction is certainly valuable: you come away with a clear code structure and the core logic. But it is usually time-consuming, and the result is often disconnected from real implementations. The Bilibili video instead offers a deconstruction approach: learn the actual code by stepping through it in a debugger and analyzing it. It reminds me of what one of my teachers called the agile path to systems skills: use reverse engineering, debugging, guessing, and verification to quickly grasp how a system works. The story he told most often was about changing a single byte and drastically altering the behavior of the VC compiler. That mindset left a real mark on me: in most cases I would rather improve something that already exists than reinvent the wheel, unless the existing thing is truly awful. Back to the point: for someone who already has the theory and wants to start writing code, this deconstruction approach is a huge advantage. We already understand the model's macro architecture; all that is left is to follow the data through the code and verify it.

Preparation#

The environment setup below is just my personal preference; adapt it to your own environment. For convenience, we debug a very small instance of a classic model, GPT-2, which runs fine on a CPU.

  • Install uv: Installing uv

    # Windows
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
    # Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
  • Create a virtual environment: Using Python environments | Using a virtual environment

    uv venv --python 3.12
  • Install dependencies: transformers/installation

    uv pip install torch --index-url https://download.pytorch.org/whl/cpu
    uv pip install transformers
  • Editor: VS Code, with the script below (Cursor also works, and makes it convenient to ask an AI about the source code)

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    # GPT2 model config
    mini_config = GPT2Config(
        vocab_size=100,             # vocabulary size
        n_positions=64,             # maximum sequence length
        n_embd=64,                  # embedding dimension (hidden size)
        n_layer=2,                  # number of transformer blocks
        n_head=4,                   # number of attention heads
        n_inner=128,                # FFN inner dimension
        activation_function="gelu", # activation function
        resid_pdrop=0.1,            # dropout rate for residual connections
        embd_pdrop=0.1,             # dropout rate for embedding
        attn_pdrop=0.1,             # dropout rate for attention
    )

    # load the model
    model = GPT2LMHeadModel(mini_config)
    print("Model Structure:")
    print(model)
    print("#Parameters:")
    print(sum(p.numel() for p in model.parameters()))

    # generate random input
    batch_size = 2
    seq_len = 16
    input_ids = torch.randint(0, 100, (batch_size, seq_len))
    print("Input Shape:", input_ids.shape)

    # forward
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    print("Logits Shape:", outputs.logits.shape)
    print("Loss:", outputs.loss)
  • Debugger configuration

    .vscode/launch.json
    {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Python Debugger: Current File",
                "type": "debugpy",
                "request": "launch",
                "program": "${file}",
                "console": "integratedTerminal",
                "justMyCode": false // important: lets the debugger step into library code
            }
        ]
    }

Debugging and Deconstruction#

The script defines a very small GPT-2 model with randomly initialized weights. The model does not need to be "useful"; we only care about its structure and how data flows through it. Set a breakpoint at outputs = model(input_ids, labels=input_ids) and start the debugger. From there it is just debugging, nothing special: follow along with the GPT-2 architecture diagram, or with the model structure printed by the script. Dive into whichever part you care about, and when you hit code you do not understand, ask an expert (Gemini, Claude, GPT). If you plan to debug it yourself, you can stop reading here; what follows is just my own notes (with some AI-generated content).

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(100, 64)
    (wpe): Embedding(64, 64)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-1): 2 x GPT2Block(
        (ln_1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=192, nx=64)
          (c_proj): Conv1D(nf=64, nx=64)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=128, nx=64)
          (c_proj): Conv1D(nf=64, nx=128)
          (act): GELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=64, out_features=100, bias=False)
)

GPT2LMHeadModel is the class in Transformers for using GPT-2 as a language model (fine-tuning/generation). It takes the base GPT2Model and adds a linear output layer (the LM head) that maps the final hidden states to logits over the vocabulary, which is what causal language modeling needs for tasks like continuation, dialogue, or code generation.

  • Input: input_ids (plus optional attention_mask, etc.)
  • Output: logits (shape [batch_size, seq_len, vocab_size]), plus an optional loss

Embedding Layers#

(transformer): GPT2Model(
  (wte): Embedding(100, 64)
  (wpe): Embedding(64, 64)
)
  • wte: Embedding(100, 64)

    • Role: word/token embedding
    • Each token id ∈ [0, 99] is mapped to a 64-dimensional vector
    • Parameter count: 100 × 64 = 6400
  • wpe: Embedding(64, 64)

    • Role: positional embedding
    • The maximum sequence length is 64, so each position 0~63 gets its own 64-dimensional vector
    • Parameter count: 64 × 64 = 4096
  • What the input stage does (conceptually):

    • For an input sequence input_ids (shape [batch, seq_len]):

      1. wte gives the token vectors: [batch, seq_len, 64]
      2. wpe gives the position vectors: [batch, seq_len, 64]
      3. The two are added: hidden = wte(input_ids) + wpe(positions), as sketched below
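A minimal sketch of that input stage, using the same sizes as the mini config (the names here are just for illustration, not the transformers implementation):

import torch
import torch.nn as nn

wte = nn.Embedding(100, 64)                  # token embedding (vocab_size=100, n_embd=64)
wpe = nn.Embedding(64, 64)                   # position embedding (n_positions=64)

input_ids = torch.randint(0, 100, (2, 16))   # (batch=2, seq_len=16)
positions = torch.arange(16).unsqueeze(0)    # (1, seq_len), broadcast over the batch

hidden = wte(input_ids) + wpe(positions)     # (2, 16, 64): token vector + position vector
print(hidden.shape)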

Dropout and Layer Stacking#

(drop): Dropout(p=0.1)
(h): ModuleList(
  (0-1): 2 x GPT2Block(...)
)
(ln_f): LayerNorm((64,), eps=1e-05)
  • drop: Dropout(p=0.1)

    • One dropout applied right after the embeddings, to reduce overfitting
  • h: ModuleList(2 x GPT2Block)

    • 2 Transformer blocks, each with the standard GPT‑2 structure:

      • a self-attention sub-layer
      • a feed-forward (MLP) sub-layer
      • each with its own LayerNorm and residual connection
  • ln_f: LayerNorm(64)

    • The final LayerNorm after the whole Transformer stack
    • The output is still [batch, seq_len, 64]

GPT2Model Code#

# GPT2Model = TokenEmbedding + PositionEmbedding (+ TokenTypeEmbedding)
#           → Dropout
#           → num_layers × GPT2Block
#           → Final LayerNorm
#           → outputs last_hidden_state (plus optional past_key_values / attentions / hidden_states)
class GPT2Model(GPT2PreTrainedModel):
    _supports_param_buffer_assignment = False

    def __init__(self, config):
        super().__init__(config)
        self.embed_dim = config.hidden_size  # hidden size
        self.wte = nn.Embedding(config.vocab_size, self.embed_dim)  # word/token embedding
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)  # position embedding
        self.drop = nn.Dropout(config.embd_pdrop)  # dropout layer
        self.h = nn.ModuleList([GPT2Block(config, layer_idx=i) for i in range(config.num_hidden_layers)])  # stacked transformer blocks
        self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)  # final LayerNorm before the output
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> Union[tuple, BaseModelOutputWithPastAndCrossAttentions]:
        r"""
        input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):
            `input_ids_length` = `sequence_length` if `past_key_values` is `None` else
            `past_key_values.get_seq_length()` (`sequence_length` of input past key value states). Indices of input
            sequence tokens in the vocabulary.

            If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as
            `input_ids`.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        """
        # resolve output options
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # validate inputs
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
            # Flatten multi-dimensional inputs into the standard batched format:
            #   [batch, segments, seq_len]
            #   [batch, n_sentences, seq_len]
            #   [batch, n_variants, seq_len]
            # all become [N, seq_len], which is what the Embedding layer expects.
            input_shape = input_ids.size()
            input_ids = input_ids.view(-1, input_shape[-1])
            batch_size = input_ids.shape[0]
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
            batch_size = inputs_embeds.shape[0]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if token_type_ids is not None:
            token_type_ids = token_type_ids.view(-1, input_shape[-1])

        # gradient checkpointing is incompatible with the KV cache
        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # based on pattern from src/transformers/models/whisper/modeling_whisper.py::WhisperDecoder
        # initialize the KV cache
        if use_cache:
            if past_key_values is None:
                past_key_values = DynamicCache(config=self.config)
            if self.config.add_cross_attention and not isinstance(past_key_values, EncoderDecoderCache):
                past_key_values = EncoderDecoderCache(past_key_values, DynamicCache(config=self.config))

        # token embeddings
        if inputs_embeds is None:
            inputs_embeds = self.wte(input_ids)  # (N, seq_len) -> (N, seq_len, hidden_size)

        # build absolute position ids
        if cache_position is None:
            # number of tokens already seen
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position = torch.arange(
                past_seen_tokens,
                past_seen_tokens + inputs_embeds.shape[1],  # inputs_embeds.shape[1] is the sequence length
                device=inputs_embeds.device,
            )  # (seq_len), e.g. tensor([0, 1, 2, 3, 4, 5])
        if position_ids is None:
            # Why unsqueeze(0)?
            # The model expects (batch_size, seq_len),
            # while cache_position is (seq_len).
            position_ids = cache_position.unsqueeze(0)  # add a batch dimension: (1, seq_len)

        # position embeddings
        position_embeds = self.wpe(position_ids)  # (1, seq_len, hidden_size)
        # (N, seq_len, hidden_size) + (1, seq_len, hidden_size), broadcast over the batch
        hidden_states = inputs_embeds + position_embeds.to(inputs_embeds.device)

        # Attention mask.
        # ._update_causal_mask() and ._prepare_4d_causal_attention_mask_with_cache_position() copied from LlamaModel
        if attention_mask is not None and attention_mask.ndim < 4:
            attention_mask = attention_mask.view(batch_size, -1)
        # build the final causal attention mask
        causal_mask = create_causal_mask(
            config=self.config,
            input_embeds=inputs_embeds,
            attention_mask=attention_mask,
            cache_position=cache_position,
            past_key_values=past_key_values,
            position_ids=position_ids,
        )

        # cross-attention handling
        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
        _use_sdpa = self._attn_implementation == "sdpa" and output_attentions is False and head_mask is None
        if self.config.add_cross_attention and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
            if _use_sdpa:
                encoder_attention_mask = _prepare_4d_attention_mask_for_sdpa(
                    mask=encoder_attention_mask, dtype=inputs_embeds.dtype, tgt_len=input_shape[-1]
                )
            elif self._attn_implementation != "flash_attention_2":
                encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)
        else:
            encoder_attention_mask = None

        # Normalize the head-level masking argument into a unified shape, used inside each block to mask out attention heads.
        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # head_mask has shape n_layer x batch x n_heads x N x N
        head_mask = self.get_head_mask(head_mask, self.config.n_layer)

        # token type embeddings
        if token_type_ids is not None:
            token_type_embeds = self.wte(token_type_ids)  # (batch_size, seq_len) -> (batch_size, seq_len, hidden_size)
            hidden_states = hidden_states + token_type_embeds

        # dropout
        hidden_states = self.drop(hidden_states)

        # record the output shape; used at the end to restore the batch/seq dimensions via view
        output_shape = (-1,) + input_shape[1:] + (hidden_states.size(-1),)  # (-1, sample_shape, hidden_size)

        # initialize to () if that output is requested, otherwise None; same for the others
        all_self_attentions = () if output_attentions else None
        all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
        all_hidden_states = () if output_hidden_states else None

        # model body: run through each GPT2Block in turn
        for i, block in enumerate(self.h):
            # Model parallel
            if self.model_parallel:
                torch.cuda.set_device(hidden_states.device)
                if isinstance(head_mask, torch.Tensor):
                    head_mask = head_mask.to(hidden_states.device)
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)  # collect intermediate hidden states

            # call a single block's forward
            outputs = block(
                hidden_states,  # (N, seq_len, hidden_size)
                past_key_values if not (self.gradient_checkpointing and self.training) else None,
                cache_position,
                causal_mask,
                head_mask[i],
                encoder_hidden_states,  # as a positional argument for gradient checkpointing
                encoder_attention_mask=encoder_attention_mask,
                use_cache=use_cache,
                output_attentions=output_attentions,
                **kwargs,
            )

            hidden_states = outputs[0]  # take the hidden states

            # collect attention weights
            if output_attentions:
                all_self_attentions = all_self_attentions + (outputs[1],)
                if self.config.add_cross_attention:
                    all_cross_attentions = all_cross_attentions + (outputs[2],)

            # With model parallelism, use self.device_map to check whether this is the last layer on the current
            # device; if so, move hidden_states to the next device.
            if self.model_parallel:
                for k, v in self.device_map.items():
                    if i == v[-1] and "cuda:" + str(k) != self.last_device:
                        hidden_states = hidden_states.to("cuda:" + str(k + 1))

        # final LayerNorm before the output
        hidden_states = self.ln_f(hidden_states)
        # restore the output shape (-1, sample_shape, hidden_size)
        hidden_states = hidden_states.view(output_shape)
        # collect the final hidden states
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        # cache output
        past_key_values = past_key_values if use_cache else None

        if not return_dict:
            # return the non-None entries of [hidden_states, past_key_values, all_hidden_states,
            # all_self_attentions, all_cross_attentions] in order
            return tuple(
                v
                for v in [hidden_states, past_key_values, all_hidden_states, all_self_attentions, all_cross_attentions]
                if v is not None
            )

        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=hidden_states,
            past_key_values=past_key_values,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
            cross_attentions=all_cross_attentions,
        )

GPT2Block#

GPT2Block(
  (ln_1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D(nf=192, nx=64)
    (c_proj): Conv1D(nf=64, nx=64)
    (attn_dropout): Dropout(p=0.1)
    (resid_dropout): Dropout(p=0.1)
  )
  (ln_2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D(nf=128, nx=64)
    (c_proj): Conv1D(nf=64, nx=128)
    (act): GELUActivation()
    (dropout): Dropout(p=0.1)
  )
)

Pre-LayerNorm: ln_1#

  • GPT‑2 uses a Pre‑LN structure

    • x_norm = ln_1(x)
    • the normalized result is then fed into the self-attention layer

Self-Attention Layer: attn: GPT2Attention#

  • c_attn: Conv1D(nf=192, nx=64)

    • This is GPT‑2's idiosyncratic way of writing what is essentially a linear layer

      • input dimension nx = 64
      • output dimension nf = 192
    • 192 = 3 × 64, i.e. the Q, K, and V projections concatenated:

      • hidden(64) → [Q(64), K(64), V(64)] merged into 192 dimensions, then split apart in the code
  • Where do the multiple heads come from?

    • The head count does not show up in the printed structure, but in general:

      • hidden_size = num_heads × head_dim
      • here n_head=4, so 64 = 4 × 16
    • The implementation reshapes Q, K, V into [batch, num_heads, seq_len, head_dim] to do multi-head attention, as sketched below

  • c_proj: Conv1D(nf=64, nx=64)

    • Maps the multi-head attention output back to 64 dimensions
    • Corresponds to W_o in the standard Transformer
  • attn_dropout / resid_dropout

    • attn_dropout: dropout on the attention weights
    • resid_dropout: dropout on the attention output (before it is added back to the residual)
  • Residual connection

    • The actual computation is roughly:

      • x = x + resid_dropout(attn(ln_1(x)))
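A sketch of how that single 192-dimensional c_attn output gets split into Q/K/V and reshaped into heads. Shapes follow the mini config (hidden_size=64, n_head=4, head_dim=16); this is illustrative, not the exact transformers implementation, and the causal mask is omitted for brevity.

import torch

batch, seq_len, hidden, n_head = 2, 16, 64, 4
head_dim = hidden // n_head                        # 16

qkv = torch.randn(batch, seq_len, 3 * hidden)      # what c_attn produces: (2, 16, 192)
q, k, v = qkv.split(hidden, dim=-1)                # three (2, 16, 64) tensors

def to_heads(t):
    # (batch, seq_len, hidden) -> (batch, n_head, seq_len, head_dim)
    return t.view(batch, seq_len, n_head, head_dim).transpose(1, 2)

q, k, v = to_heads(q), to_heads(k), to_heads(v)
scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (2, 4, 16, 16) attention scores per head
out = scores.softmax(dim=-1) @ v                   # (2, 4, 16, 16) attention weights times values
out = out.transpose(1, 2).reshape(batch, seq_len, hidden)  # merge heads: (2, 16, 64); c_proj then maps 64 -> 64
print(out.shape)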

Second LayerNorm: ln_2#

  • Applies another LayerNorm to the result of attention + residual, before feeding it into the MLP:

    • x_norm = ln_2(x)

Feed-Forward Network: GPT2MLP#

  • c_fc: Conv1D(nf=128, nx=64)

    • Linear layer: 64 → 128, i.e. the hidden dimension is expanded 2×
  • act: GELU

    • Activation function; GPT‑2 uses GELU
  • c_proj: Conv1D(nf=64, nx=128)

    • Projects back from 128 → 64
  • dropout

    • Dropout on the MLP output
  • Residual connection

    • Same pattern as the attention part:

      • x = x + dropout(mlp(ln_2(x)))

Summary: each GPT2Block = LN → Self-Attn → residual + LN → MLP → residual. This is the standard GPT‑2 block, just with a tiny parameter budget.
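GPT2MLP itself is not printed above: it is just the two Conv1D (linear) layers plus GELU and dropout. Here is a shape-equivalent sketch using nn.Linear instead of GPT-2's Conv1D wrapper (MiniGPT2MLP is my own name, not a transformers class):

import torch
import torch.nn as nn

class MiniGPT2MLP(nn.Module):
    # Shape-equivalent to GPT2MLP with n_embd=64, n_inner=128.
    def __init__(self, hidden=64, inner=128, pdrop=0.1):
        super().__init__()
        self.c_fc = nn.Linear(hidden, inner)    # 64 -> 128 (GPT-2 uses Conv1D, essentially a transposed Linear)
        self.act = nn.GELU()
        self.c_proj = nn.Linear(inner, hidden)  # 128 -> 64
        self.dropout = nn.Dropout(pdrop)

    def forward(self, x):                       # x: (batch, seq_len, 64)
        return self.dropout(self.c_proj(self.act(self.c_fc(x))))

x = torch.randn(2, 16, 64)
print(MiniGPT2MLP()(x).shape)                   # torch.Size([2, 16, 64])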

Code#

# GPT2Block = LN → Self-Attn → residual + LN → MLP → residual
class GPT2Block(GradientCheckpointingLayer):
    def __init__(self, config, layer_idx=None):
        super().__init__()
        hidden_size = config.hidden_size  # representation size of each token
        inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size  # MLP hidden dimension

        # LayerNorm before self-attention (pre-LN architecture)
        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        # self-attention layer
        self.attn = GPT2Attention(config=config, layer_idx=layer_idx)
        # second LayerNorm, before the MLP
        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        # optional cross-attention
        if config.add_cross_attention:
            self.crossattention = GPT2Attention(config=config, is_cross_attention=True, layer_idx=layer_idx)
            self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)

        # feed-forward MLP
        self.mlp = GPT2MLP(inner_dim, config)

    @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
    def forward(
        self,
        hidden_states: Optional[tuple[torch.FloatTensor]],  # (N, seq_len, hidden_size)
        past_key_values: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = False,
        output_attentions: Optional[bool] = False,
        **kwargs,
    ) -> Union[tuple[torch.Tensor], Optional[tuple[torch.Tensor, tuple[torch.FloatTensor, ...]]]]:
        residual = hidden_states  # save the residual
        hidden_states = self.ln_1(hidden_states)  # pre-LayerNorm
        # attn_output: (N, seq_len, hidden_size), each token's representation after self-attention
        #              re-weights and aggregates the sequence; this is the real output passed to later layers.
        # self_attn_weights: (N, num_heads, seq_len, seq_len), only useful for debugging.
        attn_output, self_attn_weights = self.attn(  # self-attention, using the KV cache, masks, etc.
            hidden_states,
            past_key_values=past_key_values,
            cache_position=cache_position,
            attention_mask=attention_mask,
            head_mask=head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            **kwargs,
        )
        hidden_states = attn_output + residual  # residual connection

        # optional cross-attention
        if encoder_hidden_states is not None:
            # add one self-attention block for cross-attention
            if not hasattr(self, "crossattention"):
                raise ValueError(
                    f"If `encoder_hidden_states` are passed, {self} has to be instantiated with "
                    "cross-attention layers by setting `config.add_cross_attention=True`"
                )
            residual = hidden_states
            hidden_states = self.ln_cross_attn(hidden_states)
            cross_attn_output, cross_attn_weights = self.crossattention(
                hidden_states,
                past_key_values=past_key_values,
                attention_mask=attention_mask,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
            )
            hidden_states = residual + cross_attn_output  # residual connection

        # feed-forward MLP + residual
        residual = hidden_states  # save the residual
        hidden_states = self.ln_2(hidden_states)  # pre-LayerNorm
        feed_forward_hidden_states = self.mlp(hidden_states)  # two linear layers + activation (GELU here)
        hidden_states = residual + feed_forward_hidden_states  # residual connection

        # assemble the outputs
        outputs = (hidden_states,)  # this block's output
        if output_attentions:
            outputs += (self_attn_weights,)
            if encoder_hidden_states is not None:
                outputs += (cross_attn_weights,)

        return outputs

Final Output Layer: lm_head#

(lm_head): Linear(in_features=64, out_features=100, bias=False)
  • Input: the transformer output [batch, seq_len, 64]

  • Linear map: each position's 64-dimensional vector → 100-dimensional logits (one per vocabulary token)

  • Output shape: [batch, seq_len, 100]

  • Weight tying is usually enabled in the config

    • lm_head.weight and wte.weight share the same tensor, saving parameters; you can check this as shown below
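A quick way to verify the sharing is to inspect it in the debugger, or run a couple of lines like these against the model built in the setup script above (GPT2Config enables tie_word_embeddings by default):

# When tying is enabled, lm_head.weight and transformer.wte.weight are the same Parameter,
# so the 100 x 64 matrix is stored only once.
print(model.lm_head.weight.shape)                             # torch.Size([100, 64])
print(model.lm_head.weight is model.transformer.wte.weight)   # True when tie_word_embeddings=True
print(model.config.tie_word_embeddings)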

Putting It Together#

Flow: run the GPT‑2 Transformer to get the sequence hidden states → (optionally) keep only the last few time steps → pass them through lm_head to get logits over the vocabulary → (optionally) compute the language-modeling loss from labels → return either a tuple or a structured output depending on return_dict.

# GPT2LMHeadModel = GPT2Model → take hidden_states → Linear (lm_head) → logits (+ optional loss)
class GPT2LMHeadModel(GPT2PreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.transformer = GPT2Model(config)  # the GPT-2 body
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)  # task head
        # Initialize weights and apply final processing
        self.post_init()

    def forward(self, ..., **kwargs) -> Union[tuple, CausalLMOutputWithCrossAttentions]:
        r"""
        input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):
            `input_ids_length` = `sequence_length` if `past_key_values` is `None` else
            `past_key_values.get_seq_length()` (`sequence_length` of input past key value states). Indices of input
            sequence tokens in the vocabulary.

            If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as
            `input_ids`.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        labels (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*):
            Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
            `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
            are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.transformer(  # the GPT-2 body
            input_ids,  # (batch_size, input_ids_length)
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            cache_position=cache_position,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        # (batch_size, seq_len, hidden_size); indexed with [0] because the return value may not be a dict
        hidden_states = transformer_outputs[0]

        # slice the hidden states before feeding them into lm_head
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])  # (batch_size, kept_seq_len, vocab_size)

        loss = None  # by default (no labels) no loss is computed
        if labels is not None:
            # flatten the token dimension
            loss = self.loss_function(
                logits,
                labels,
                vocab_size=self.config.vocab_size,
                **kwargs,
            )

        if not return_dict:
            output = (logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return CausalLMOutputWithCrossAttentions(
            loss=loss,
            logits=logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
            cross_attentions=transformer_outputs.cross_attentions,
        )
  • input_ids

    tensor([[51, 93, 69, 67, 67, 64, 14, 69, 28, 48, 95, 52, 0, 43, 75, 20],
    [38, 46, 94, 7, 13, 65, 12, 77, 1, 29, 93, 14, 71, 98, 64, 81]])

    (batch_size, seq_len), i.e. (nth sample in batch, nth token in sequence)

  • transformer_outputs

    BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 0.1692, 1.6322, 0.2560, ..., 0.3967, 0.6844, -0.0801],
    [-0.3589, 0.6584, 0.1836, ..., 0.9718, 0.5690, -0.7071],
    [ 0.6927, -0.4106, 0.5513, ..., -0.2038, 0.9638, 0.3804],
    ...,
    [-1.2075, -0.9752, -2.2019, ..., -2.1254, -0.2321, 0.9713],
    [ 0.9386, -0.2055, 0.5741, ..., -0.6105, 1.7194, -0.8643],
    [-1.1196, 1.3400, -0.4640, ..., 1.0994, 0.7589, -1.0194]],
    [[-0.1582, -0.3705, -1.2539, ..., 2.6838, -1.0564, 0.3948],
    [-0.7398, 0.4874, -0.1554, ..., 0.3360, -0.9706, -1.1178],
    [ 0.8644, -1.5481, 0.7424, ..., -0.0362, 0.1190, -0.2096],
    ...,
    [-2.2812, -1.0691, -1.9750, ..., -1.6648, 0.2447, -0.9316],
    [ 1.3596, -0.1944, 0.4538, ..., 1.5388, 1.9213, -1.1869],
    [-0.9911, 2.3999, -0.3334, ..., -0.5131, 0.7391, -0.2217]]]), past_key_values=DynamicCache(layers=[DynamicLayer, DynamicLayer]), hidden_states=None, attentions=None, cross_attentions=None)
    • This is a BaseModelOutputWithPastAndCrossAttentions, essentially a tuple with named fields, containing:

      • last_hidden_state: the last layer's hidden states, (batch_size, seq_len, hidden_size), i.e. (nth sample in batch, nth token in sequence, token hidden state). (hidden_size = config.n_embd)
      • past_key_values: the KV cache, storing the K/V tensors of tokens already processed. Next time you only feed the new token plus past_key_values, and the model does not have to recompute attention over all previous tokens, which is much faster.
      • hidden_states: the hidden states of every layer (None by default); when requested it is a tuple of length num_layers + 1, containing the embedding output plus each layer's hidden states
      • attentions: each layer's self-attention weights (None here)
      • cross_attentions: each layer's cross-attention weights when there is an encoder (None here)
  • hidden_states

    Simply taken from transformer_outputs; shape (batch_size, seq_len, hidden_size)

  • slice_indices

    slice(-logits_to_keep, None) # equivalent to the [-logits_to_keep:] slice
    slice(0, None, None)         # slice(start, stop, step)

    This decides which positions along the time dimension (the sequence-length axis) of hidden_states are kept for computing logits.

    • logits_to_keep = 0 → slice(0, None) → keep all time steps;
    • logits_to_keep = 5 → slice(-5, None) → keep only the last 5 time steps.
  • logits

    torch.Size([2, 16, 100])
    tensor([[[ 0.0545, -0.1060, -0.0190, ..., 0.0495, -0.2154, 0.0839],
    [-0.0738, -0.1690, 0.1306, ..., 0.1128, 0.1745, -0.0701],
    [-0.1737, -0.2598, 0.0013, ..., 0.0105, 0.2533, 0.0665],
    ...,
    [ 0.0656, 0.0591, -0.0530, ..., -0.3790, 0.0711, 0.1401],
    [ 0.0471, 0.0646, 0.1361, ..., -0.0941, 0.0735, 0.0579],
    [ 0.2808, 0.3964, -0.1211, ..., -0.0119, 0.2546, 0.2442]],
    [[ 0.0470, -0.0757, -0.0323, ..., 0.2119, -0.1910, 0.0868],
    [ 0.2607, 0.1443, -0.1541, ..., 0.2334, 0.1907, 0.2107],
    [ 0.0506, 0.0645, 0.0126, ..., -0.0995, 0.3498, 0.3652],
    ...,
    [ 0.1784, -0.0449, 0.1409, ..., -0.1468, 0.6051, 0.0009],
    [-0.1163, 0.1026, 0.0557, ..., -0.0361, 0.0837, 0.2053],
    [-0.0091, 0.1724, 0.0056, ..., -0.0435, 0.2960, 0.1318]]])

    The prediction score over the vocabulary at each kept time step. Shape (batch_size, kept_seq_len, vocab_size), i.e. (nth sample in batch, nth token in kept seq, logit for each vocabulary token)

  • loss

    The default loss_type is ForCausalLMLoss:

    def ForCausalLMLoss(
        logits,  # (batch_size, seq_len, vocab_size)
        labels,  # (batch_size, seq_len)
        vocab_size: int,  # vocabulary size
        num_items_in_batch: Optional[torch.Tensor] = None,
        ignore_index: int = -100,  # label value to ignore (no loss computed for it)
        shift_labels: Optional[torch.Tensor] = None,  # pre-computed shifted labels
        **kwargs,
    ) -> torch.Tensor:
        # Upcast logits to float32 to avoid precision problems when computing cross-entropy in float16.
        logits = logits.float()
        # Shift the labels so that position 0 predicts token 1, position 1 predicts token 2, ...
        if shift_labels is None:
            # Shift so that tokens < n predict n
            labels = nn.functional.pad(labels, (0, 1), value=ignore_index)  # pad one slot of ignore_index on the right of the last dim
            shift_labels = labels[..., 1:].contiguous()  # drop the first label; contiguous() keeps the memory layout contiguous
        # Flatten the token dimension
        logits = logits.view(-1, vocab_size)  # (batch * seq_len, vocab_size)
        shift_labels = shift_labels.view(-1)  # (batch * seq_len)
        shift_labels = shift_labels.to(logits.device)
        loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
        # source: [N, C] target: [N]
        return loss  # tensor(float)
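To close, a tiny worked example of the label shift, to make the indexing concrete (the values are arbitrary, and the last line just shows that the shifted labels plug straight into a standard cross-entropy):

import torch
import torch.nn.functional as F

labels = torch.tensor([[51, 93, 69, 67]])                 # (batch=1, seq_len=4), e.g. the input_ids themselves
padded = F.pad(labels, (0, 1), value=-100)                # [[51, 93, 69, 67, -100]]
shift_labels = padded[..., 1:]                            # [[93, 69, 67, -100]]
# So the logits at position 0 are scored against token 93, position 1 against 69, position 2 against 67,
# and the last position (which has no "next token") is ignored via -100.

logits = torch.randn(1, 4, 100)                           # (batch, seq_len, vocab_size)
loss = F.cross_entropy(logits.view(-1, 100), shift_labels.reshape(-1), ignore_index=-100)
print(loss)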