GPT-3 Applications (a GPT-3 Summary): GPT-3, Fresh Out of the Oven


GPT-3 has been getting a lot of attention lately. Based on the GPT-3 paper, this post collects some notes on GPT-3 and on the other few-shot learning methods the paper mentions.

GPT-3: Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly (dynamic) reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.

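Task-Agnostic Meta-Learning, i.e. task-unbiased meta-learning (the following is excerpted from an article by Qi Guojun (齐国君), one of the paper's authors)

Training algorithms based on gradient descent have two hyperparameters that cannot be learned in the traditional machine learning framework: 1) the initial model parameters; 2) the step size of each update.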


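The goal is to learn an initialization that suits many tasks, so that for these training tasks and for the broader set of future tasks they represent, updating the model from this initialization yields a new model faster and better. “Faster” here means that with only a small number of training samples and a few gradient descent steps, we can expect to obtain a model suited to the new task (i.e., few-shot learning).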
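A minimal sketch of learning a shared initialization, assuming a toy family of tasks (fitting y = a·x with a different slope a per task) and a first-order approximation of the meta-gradient. The task family, the function names, and the step sizes are hypothetical, and this is not the authors' implementation; it also learns only the initialization, not the step size.

```python
import random


def grad(w, xs, a):
    """Gradient of the mean squared error of y = w*x against targets a*x."""
    return sum(2.0 * (w * x - a * x) * x for x in xs) / len(xs)


def adapt(w0, xs, a, alpha=0.1, steps=1):
    """Inner loop: a few gradient steps on one task, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= alpha * grad(w, xs, a)
    return w


def meta_train(num_iters=2000, k_shot=5, alpha=0.1, beta=0.05, seed=0):
    """Outer loop: adjust the shared initialization w0 across sampled tasks."""
    rng = random.Random(seed)
    w0 = 0.0
    for _ in range(num_iters):
        a = rng.uniform(1.0, 3.0)  # hypothetical task: a slope in [1, 3]
        support = [rng.uniform(-1.0, 1.0) for _ in range(k_shot)]
        query = [rng.uniform(-1.0, 1.0) for _ in range(k_shot)]
        w_task = adapt(w0, support, a, alpha)
        # First-order approximation: reuse the query gradient at the adapted
        # weight as the gradient for w0 (second-order terms are dropped).
        w0 -= beta * grad(w_task, query, a)
    return w0


learned_init = meta_train()
```

In this toy setting the learned initialization settles near the middle of the slope range, so a single gradient step on a handful of samples already moves the model most of the way toward any new slope.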


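In this situation, the meta-learner is biased across different tasks. To solve this problem, the authors propose a task-agnostic, unbiased meta-learning method: by adding a regularization term to the initial model, they make it treat different tasks equally.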

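Specifically, for a classification task, one can directly maximize the entropy of the initial model over the different classes (Entropy Maximization) to make it unbiased across tasks. For general tasks, such as regression or reinforcement learning, one can usually define and optimize the task through a loss function or a reward function.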
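The entropy term for the classification case can be sketched as follows. This is an illustration under assumptions: the function names and the weight `lam` are hypothetical, and only the part that rewards high entropy in the initial model's predictions is shown, not the full training procedure from the paper.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; largest when all classes are equally likely."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(task_loss, init_logits, lam=0.1):
    """Objective to minimize: task loss minus lam times the entropy of the
    initial model's predictions, so a less biased initialization scores better."""
    return task_loss - lam * entropy(softmax(init_logits))


uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
```

An initial model that spreads its probability evenly over the classes has the highest entropy and therefore the lowest penalty, which is what “treating tasks equally” amounts to in this sketch.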

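If we regard the negative loss or the reward as the income given to each task, we can characterize the meta-learner's bias across tasks using the methods economics uses to measure income inequality. For example, we can use the widely used …

… such as symmetry, scale invariance, non-negativity, the transfer principle, and so on. By minimizing the inequality measure, we can obtain a meta-learner that is unbiased across tasks.

The problem with this method, in the words of the GPT-3 paper: ‘this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples.’ This leads to the following problems: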

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. (That is, fine-tuning still requires a fairly large dataset, but many tasks cannot supply data for fine-tuning.)

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions.

For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model.
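As a rough illustration of the three settings, the sketch below assembles a prompt from an instruction, up to K demonstrations, and a query. The function name `build_prompt`, the `=>` separator, and the example pairs are assumptions made for illustration, not an API from the paper, and no model is called.

```python
def build_prompt(instruction, demonstrations, query, k):
    """Assemble a text prompt with k in-context demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and larger k
    a few-shot prompt. No gradient update is involved: the demonstrations
    are only text placed before the query.
    """
    lines = [instruction]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model would be asked to continue this line
    return "\n".join(lines)


# Hypothetical demonstrations for an English-to-French task.
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt("Translate English to French:", demos, "cheese", k=0)
one_shot = build_prompt("Translate English to French:", demos, "cheese", k=1)
few_shot = build_prompt("Translate English to French:", demos, "cheese", k=len(demos))
```

The only thing that changes between the three conditions is how many demonstration lines sit between the instruction and the query.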

As the figure above shows: Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.

We also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

On the advantages and disadvantages of fine-tuning: The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution.


The architecture of GPT-3: We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
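To make “alternating dense and locally banded sparse attention patterns” more concrete, here is a small sketch of the two causal masks. The window size, the choice of a dense layer first, and the function names are assumptions for illustration; the quoted passage does not specify them.

```python
def dense_causal_mask(n):
    """Position i may attend to every position j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def banded_causal_mask(n, window):
    """Position i may attend only to the last `window` positions up to i."""
    return [[1 if 0 <= i - j < window else 0 for j in range(n)] for i in range(n)]


def layer_masks(num_layers, n, window):
    """Alternate the two patterns across layers (dense first is an assumption)."""
    return [
        dense_causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, window)
        for layer in range(num_layers)
    ]


masks = layer_masks(num_layers=4, n=4, window=2)
for row in masks[1]:
    print(row)  # the banded pattern of the second layer
```

In the banded mask each row has at most `window` ones, so the cost of that layer grows with the window size rather than with the full sequence length.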

The above is Jay Alammar's introduction to GPT-3. Take a trained model as an example:

The model is presented with an example. We only show it the features and ask it to predict the next word.

The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.

Repeat millions of times:
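The predict, measure, update, repeat cycle described above can be sketched with a toy next-word model. This is a deliberately tiny stand-in, assuming a five-word corpus and a table of scores instead of a transformer; all names and values are hypothetical.

```python
import math

corpus = "a robot must obey orders".split()  # hypothetical toy corpus
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# One row of scores per previous word, one column per candidate next word.
weights = [[0.0] * len(vocab) for _ in vocab]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(epochs=50, lr=0.5):
    """Predict the next word, measure the error, update, then repeat."""
    for _ in range(epochs):
        for prev, nxt in zip(corpus, corpus[1:]):
            row = weights[index[prev]]
            probs = softmax(row)
            for j in range(len(vocab)):
                target = 1.0 if j == index[nxt] else 0.0
                # Gradient of the cross-entropy loss with respect to each score.
                row[j] -= lr * (probs[j] - target)


def predict_next(word):
    """Return the highest-scoring next word for the given previous word."""
    row = weights[index[word]]
    return vocab[row.index(max(row))]


train()
print(predict_next("robot"))
```

Before training every score is zero, so the prediction is arbitrary; each update nudges the score of the observed next word up and the others down.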

How does a system process the word “robotics” and produce “A”?

High-level steps:
1) Convert the word to a vector (list of numbers) representing the word
2) Compute prediction
3) Convert resulting vector to word
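The three steps can be traced in a toy example. The two-dimensional vectors and the single weight matrix below are hand-picked, hypothetical numbers chosen so that “robotics” maps to “A”; a real model learns these values and stacks many layers.

```python
# Hypothetical 2-dimensional word vectors, hand-picked for illustration.
embeddings = {
    "robotics": [1.0, 0.0],
    "A": [0.0, 1.0],
    "robot": [-1.0, 0.0],
}

# Hypothetical weights of a single layer.
layer = [[0.0, 1.0],
         [1.0, 0.0]]


def to_vector(word):
    """Step 1: convert the word to a vector."""
    return embeddings[word]


def compute_prediction(vector):
    """Step 2: transform the vector (here, one matrix multiplication)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in layer]


def to_word(vector):
    """Step 3: pick the word whose vector best matches the result."""
    def score(word):
        return sum(a * b for a, b in zip(embeddings[word], vector))
    return max(embeddings, key=score)


def predict(word):
    return to_word(compute_prediction(to_vector(word)))


print(predict("robotics"))
```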

See all these layers? This is the “depth” in “deep learning”. Each of these layers has its own 1.8B parameters to make its calculations. That is where the “magic” happens. This is a high-level view of that process:
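A back-of-the-envelope check of the 1.8B figure, assuming the configuration reported for the 175B model in the GPT-3 paper (96 layers, hidden size 12288) and the common approximation of 12·d² weights per transformer layer (4·d² for attention, 8·d² for the feed-forward block). Biases, layer norms, and embeddings are ignored, so the total comes out slightly under 175B.

```python
n_layers = 96      # assumed from the GPT-3 paper's 175B configuration
d_model = 12288    # assumed hidden size of the same configuration

attention_params = 4 * d_model ** 2      # query, key, value, output projections
feed_forward_params = 8 * d_model ** 2   # two matrices with a 4x wider hidden layer
per_layer = attention_params + feed_forward_params
total = n_layers * per_layer

print(f"per layer: {per_layer / 1e9:.2f}B, all layers: {total / 1e9:.1f}B")
```

Under these assumptions each layer holds roughly 1.8 billion weights, which matches the figure quoted above.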



2023-05-22 Column: 科技派
