
GPT: Paper Notes & Code Walkthrough

2023-08-26 08:06:34 Category: Internet+ Author: axdmin


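GPT is a language model proposed by OpenAI. It came out after Google's "Attention Is All You Need" and builds its architecture on the attention-based Transformer. Training has two stages: (1) unsupervised pre-training and (2) supervised fine-tuning. The same two-stage recipe shows up in many later landmark NLP models, such as BERT and RoBERTa.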

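Unsupervised pre-training: the model is autoregressive and processes a sequence from left to right, for example P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2). The objective is to maximize P(Seq), that is, to do maximum likelihood estimation of the model parameters.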
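To make the left-to-right factorization concrete, here is a minimal sketch that sums the per-token log-probabilities of a sequence. The conditional table `COND` and the function name `sequence_log_prob` are hypothetical and exist only for illustration. A real model would compute each conditional with a network rather than look it up.

```python
import math

# Hypothetical toy conditionals, keyed by the prefix seen so far.
COND = {
    (): {"a": 0.5, "b": 0.5},
    ("a",): {"a": 0.2, "b": 0.8},
    ("a", "b"): {"a": 0.9, "b": 0.1},
}

def sequence_log_prob(tokens, cond=COND):
    """Sum log P(w_t | w_1..w_{t-1}) from left to right."""
    total = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])
        total += math.log(cond[prefix][token])
    return total

# P(a, b, a) = P(a) * P(b | a) * P(a | a, b) = 0.5 * 0.8 * 0.9
log_p = sequence_log_prob(["a", "b", "a"])
```

Maximum likelihood training adjusts the parameters so that this sum is as large as possible over the training corpus.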

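Model structure: We is the token embedding matrix and Wp is the position embedding matrix. The first equation in the paper's model definition gives the initial hidden representation h0, which is then passed through n Transformer layers up to the last one. One confusing point is that the paper never defines l; presumably l is just the layer index i.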
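The structure described above can be sketched in a few lines, assuming h0 is the sum of a token embedding and a position embedding at each position and that l simply counts layers. The tables `WE` and `WP` and the identity "layers" are hypothetical stand-ins, not the real model.

```python
def embed(token_ids, we, wp):
    """h0[t] = We[token_ids[t]] + Wp[t], one vector per position."""
    return [
        [e + p for e, p in zip(we[tok], wp[pos])]
        for pos, tok in enumerate(token_ids)
    ]

def run_layers(h0, layers):
    """h_l = layer_l(h_{l-1}) for l = 1..n; returns h_n."""
    h = h0
    for layer in layers:
        h = layer(h)
    return h

# Hypothetical tiny tables: 3 tokens, 2 positions, embedding width 2.
WE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WP = [[0.1, 0.1], [0.2, 0.2]]

h0 = embed([2, 0], WE, WP)
identity_layers = [lambda h: h] * 3   # stand-ins for transformer blocks
hn = run_layers(h0, identity_layers)
```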

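Supervised fine-tuning:

The paper also finds that L3, the objective that adds the language-modeling loss to the fine-tuning loss as an auxiliary term, speeds up convergence and improves generalization.

Code implementation

The well-known Hugging Face team has implemented GPT in PyTorch: https://github.com/huggingface/transformers/blob/master/src/transformers/models/openai/modeling_openai.py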

First, look at OpenAIGPTModel. The core of its forward pass is this loop:

    all_attentions = () if output_attentions else None
    all_hidden_states = () if output_hidden_states else None
    for i, block in enumerate(self.h):
        # Record the input of every layer when requested
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = block(hidden_states, attention_mask, head_mask[i], output_attentions=output_attentions)
        # The first output is the new hidden state fed to the next layer
        hidden_states = outputs[0]
        if output_attentions:
            all_attentions = all_attentions + (outputs[1],)

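self.h is the stack of Transformer layers. On each iteration the current hidden states are passed into a block, and the new hidden states are taken from outputs[0]. Next, what is a block? It corresponds exactly to the blue region in the paper's architecture diagram.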

    class Block(nn.Module):
        def __init__(self, n_ctx, config, scale=False):
            super().__init__()
            nx = config.n_embd
            self.attn = Attention(nx, n_ctx, config, scale)
            self.ln_1 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)
            self.mlp = MLP(4 * nx, config)
            self.ln_2 = nn.LayerNorm(nx, eps=config.layer_norm_epsilon)

        def forward(self, x, attention_mask=None, head_mask=None, output_attentions=False):
            attn_outputs = self.attn(
                x,
                attention_mask=attention_mask,
                head_mask=head_mask,
                output_attentions=output_attentions,
            )
            a = attn_outputs[0]

            n = self.ln_1(x + a)
            m = self.mlp(n)
            h = self.ln_2(n + m)

            outputs = [h] + attn_outputs[1:]
            return outputs
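The data flow of Block.forward above (residual add, then layer normalization, done twice) can be followed on a single vector with the sketch below. It is an illustration only: `layer_norm` here has no learned scale or shift, and `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real sublayers.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def add(u, v):
    """Element-wise sum, used for the residual connections."""
    return [a + b for a, b in zip(u, v)]

def post_ln_block(x, attn, mlp):
    """Same order of operations as Block.forward, for one position vector."""
    a = attn(x)
    n = layer_norm(add(x, a))      # residual around attention, then LayerNorm
    m = mlp(n)
    return layer_norm(add(n, m))   # residual around the MLP, then LayerNorm

# Hypothetical stand-ins for the real sublayers.
toy_attn = lambda v: [0.5 * x for x in v]
toy_mlp = lambda v: [x * x for x in v]

h = post_ln_block([1.0, 2.0, 4.0, 8.0], toy_attn, toy_mlp)
```

Because normalization comes after each residual add, the block output always has roughly zero mean and unit variance before any learned scale and shift are applied.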

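Here x is the hidden states from the previous layer. It is added to the attention output and layer-normalized, and the result is fed to the MLP. The MLP here is not a single linear layer but a small custom network, which is the feed-forward layer described in the paper. The code is as follows.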

    class MLP(nn.Module):
        def __init__(self, n_state, config):  # in MLP: n_state=3072 (4 * n_embd)
            super().__init__()
            nx = config.n_embd
            self.c_fc = Conv1D(n_state, nx)
            self.c_proj = Conv1D(nx, n_state)
            self.act = ACT_FNS[config.afn]
            self.dropout = nn.Dropout(config.resid_pdrop)

        def forward(self, x):
            h = self.act(self.c_fc(x))
            h2 = self.c_proj(h)
            return self.dropout(h2)
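To make the shapes in the MLP above concrete, here is a minimal sketch of a position-wise feed-forward layer. It assumes that Conv1D acts as a plain linear map on the last dimension and that the activation is GELU; dropout is left out. The helper names and the constant weights are hypothetical.

```python
import math

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    """y = x @ weight + bias, with weight stored as [in_features][out_features]."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(x, w_fc, b_fc, w_proj, b_proj):
    h = [gelu(v) for v in linear(x, w_fc, b_fc)]   # n_embd -> 4 * n_embd
    return linear(h, w_proj, b_proj)               # 4 * n_embd -> n_embd

n_embd = 2
n_state = 4 * n_embd
w_fc = [[0.1] * n_state for _ in range(n_embd)]
b_fc = [0.0] * n_state
w_proj = [[0.1] * n_embd for _ in range(n_state)]
b_proj = [0.0] * n_embd

hidden = [gelu(v) for v in linear([1.0, 2.0], w_fc, b_fc)]
out = feed_forward([1.0, 2.0], w_fc, b_fc, w_proj, b_proj)
```

The intermediate vector `hidden` has 4 * n_embd entries and `out` is back to n_embd, which matches the n_state=3072 comment for n_embd=768.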

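Reading the code, each hidden vector is first expanded to four times its width, passed through the activation, and then projected back to the original dimension, with dropout applied at the end. At this point the model is mostly clear. However, I could not find where the pre-training code lives in the Hugging Face repository, so I went to the official TensorFlow code.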

    def model(X, M, Y, train=False, reuse=False):
        with tf.variable_scope('model', reuse=reuse):
            # One table holds token, special-token and position embeddings
            we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd], initializer=tf.random_normal_initializer(stddev=0.02))
            we = dropout(we, embd_pdrop, train)

            X = tf.reshape(X, [-1, n_ctx, 2])
            M = tf.reshape(M, [-1, n_ctx])

            h = embed(X, we)
            for layer in range(n_layer):
                h = block(h, 'h%d'%layer, train=train, scale=True)

            # Drop the last position: it has no next token to predict
            lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
            lm_logits = tf.matmul(lm_h, we, transpose_b=True)
            # Labels are the input ids shifted left by one
            lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
            lm_losses = tf.reshape(lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
            # Masked mean over the real (non-padding) target positions
            lm_losses = tf.reduce_sum(lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

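Here X holds the input ids and h corresponds to the hidden states in the PyTorch code. lm_losses is computed with sparse_softmax_cross_entropy_with_logits. lm_h is the hidden states at positions 0:-1, while the labels passed to sparse_softmax_cross_entropy_with_logits are X[1:]. This confirms that tokens 1 to t-1 are used to predict token t.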
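The shift between lm_h and the labels can be reproduced on one sequence with the sketch below. It is an illustration under simplifying assumptions (no batching, logits passed in directly), and the function name `shifted_lm_loss` is hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one score vector."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_z for v in logits]

def shifted_lm_loss(logits, token_ids, mask):
    """Masked mean of -log P(token_ids[t+1] | positions 0..t).

    logits[t] is the score vector produced at position t. The last position
    has no next token, so it is never used, like h[:, :-1] in the TF code.
    """
    total, count = 0.0, 0.0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]   # labels are the ids shifted left by one
        weight = mask[t + 1]        # padding targets get weight 0
        total += -log_softmax(logits[t])[target] * weight
        count += weight
    return total / count

# Hypothetical example: vocabulary of 3, three real tokens plus one pad.
ids = [0, 2, 1, 0]
mask = [1.0, 1.0, 1.0, 0.0]
uniform = [[0.0, 0.0, 0.0]] * 4
loss = shifted_lm_loss(uniform, ids, mask)   # uniform logits give log(3)
```

With uniform logits every target has probability 1/3, so the loss equals log(3) regardless of the ids, and the padded target at the end does not contribute.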

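Open questions

In the end a few points still confuse me:
1. Why is the MLP the feed-forward layer?
2. What is ctx?
3. Why are the final hidden states multiplied by we again?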
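On question 3, one reading consistent with tf.matmul(lm_h, we, transpose_b=True) is that the output layer reuses the embedding table: each hidden vector is scored against every embedding row, and the softmax over those scores gives the next-token distribution. The sketch below only illustrates that reading; the table `WE` and the function name `tied_logits` are hypothetical.

```python
def tied_logits(h, we):
    """Score one hidden vector against every embedding row: logits = we @ h."""
    return [sum(a * b for a, b in zip(row, h)) for row in we]

# Hypothetical 3-token embedding table of width 2.
WE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

logits = tied_logits([2.0, 0.5], WE)
best = max(range(len(logits)), key=logits.__getitem__)   # index of the top-scoring token
```

Under this reading, no separate output projection matrix is needed, because the same table maps ids to vectors on the way in and vectors to scores on the way out.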
