GPT-3 applications (a GPT-3 summary): GPT-3, fresh out of the oven
GPT-3 has been getting a lot of attention lately. Based on the GPT-3 paper, this post summarizes GPT-3 along with the other few-shot learning methods the paper mentions.
GPT-3: Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly (dynamic) reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
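Task-Agnostic Meta-Learning (the following is excerpted from an article by Guo-Jun Qi (齐国君), one of the paper's authors)
Gradient-descent-based training has two hyperparameters that cannot be learned within the traditional machine learning framework: 1) the initial model parameters; 2) the step size of each update.
Model parameters are usually initialized randomly. Because most deep learning models are non-convex, how well a model learns depends heavily on the particular random initialization, and a good set of initial parameters can make a very large difference. One important use of meta-learning is to learn an initialization that suits many tasks, so that for these training tasks and the many future tasks they represent, updating the model from this initialization yields a new model faster and better. "Faster" here means that with only a small number of training samples and a few gradient descent steps, we can expect to obtain a model suited to the new task (i.e., few-shot learning).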
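The idea of learning an initialization can be made concrete with a toy sketch. The example below is not the paper's algorithm; it is a minimal MAML-style illustration under strong assumptions: each task has a scalar quadratic loss, adaptation is a single gradient step, and the names `task_optima`, `inner_lr`, and `meta_lr` are hypothetical.

```python
# Toy setting (an assumption for illustration): task t has loss
# L_t(theta) = 0.5 * (theta - c_t)**2, so its optimum is theta = c_t.
task_optima = [-2.0, 0.5, 4.0]  # hypothetical task targets c_t
inner_lr = 0.1                  # step size of each task-level update
meta_lr = 0.5                   # step size for updating the initialization


def adapt(theta, c, steps=1):
    """Run a few gradient descent steps on one task, starting from theta."""
    for _ in range(steps):
        theta = theta - inner_lr * (theta - c)  # dL_t/dtheta = theta - c
    return theta


def meta_gradient(theta, c):
    """Gradient of the post-adaptation loss L_t(adapt(theta)) w.r.t. theta.

    With one inner step, adapt(theta) - c = (1 - inner_lr) * (theta - c),
    so the derivative is (1 - inner_lr)**2 * (theta - c).
    """
    return (1.0 - inner_lr) ** 2 * (theta - c)


def learn_initialization(theta0=10.0, iterations=200):
    """Gradient descent on the average post-adaptation loss over all tasks."""
    theta = theta0
    for _ in range(iterations):
        grads = [meta_gradient(theta, c) for c in task_optima]
        theta = theta - meta_lr * sum(grads) / len(grads)
    return theta


theta_init = learn_initialization()
```

In this toy case the learned initialization settles at the mean of the task optima, the point from which one gradient step gets closest to all tasks on average. Real meta-learners work with neural network parameters and sampled task batches, but the two-level structure (adapt per task, then update the shared starting point) is the same.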
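Classic meta-learning methods overlook an important issue when learning an optimal initial model across multiple tasks: how to ensure that the learned initial model is unbiased across all tasks. A likely scenario is that the initial model works better for some tasks and is not particularly effective for others.
In that case the meta-learner is biased across tasks. To address this, the authors propose a task-agnostic, unbiased meta-learning method: by adding a regularization term to the initial model, they make it treat different tasks even-handedly.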
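Specifically, for a classification task, one can directly maximize the entropy of the initial model's predictions over the different classes (entropy maximization) to make it unbiased across tasks. For general tasks, such as regression or reinforcement learning, the task is usually defined and optimized through a loss function or a reward function.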
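For the classification case, a minimal sketch of an entropy regularizer might look like the following. It is an illustration rather than the paper's implementation: the function `regularized_loss` and its `weight` parameter are hypothetical, and the "initial model" is represented only by a list of class logits.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss, initial_logits, weight=0.1):
    """Task loss minus a weighted entropy bonus (hypothetical form).

    Minimizing this value rewards initial models whose predictions are
    spread evenly over the classes, i.e. that favor no class up front.
    """
    return task_loss - weight * entropy(softmax(initial_logits))


uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = entropy(softmax([5.0, 0.0, 0.0, 0.0]))
```

Entropy is largest when the prediction is uniform over the classes, so with the same task loss an initial model that already prefers one class is penalized relative to one that does not.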
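If we regard the negative loss or the reward as the income given to each task, we can use measures of income inequality from economics to characterize the meta-learner's bias across tasks. For example, the widely used Gini coefficient can measure the bias of meta-learning across tasks; other options include the GE (generalized entropy) index and the Theil index. These inequality measures have different characteristics and can focus on tasks within a particular range of loss or reward (income). They also satisfy several properties that make them well suited as inequality measures.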
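A small sketch of two such measures, using their standard textbook definitions and assuming non-negative per-task "incomes" (for instance rewards); the sample values are made up for illustration:

```python
import math


def gini(incomes):
    """Gini coefficient: mean absolute difference over twice the mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    if mean == 0:
        return 0.0  # all incomes are zero, so there is no inequality
    total_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_diff / (2 * n * n * mean)


def theil(incomes):
    """Theil index: average of (x/mean) * ln(x/mean), with 0 * ln(0) = 0."""
    n = len(incomes)
    mean = sum(incomes) / n
    if mean == 0:
        return 0.0
    return sum((x / mean) * math.log(x / mean) for x in incomes if x > 0) / n


balanced = [1.0, 1.0, 1.0, 1.0]  # every task does equally well
skewed = [0.0, 0.0, 0.0, 1.0]    # one task takes everything

gini_balanced = gini(balanced)
gini_skewed = gini(skewed)
```

Both measures are zero when every task receives the same income and grow as the incomes become more uneven, which is what makes them usable as a penalty on a biased meta-learner.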
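These properties include symmetry, scale invariance, non-negativity, and the transfer principle. By minimizing the inequality measure, we obtain a meta-learner that is unbiased across tasks. As for the problem with this kind of approach, the GPT-3 paper says (there, "this method" refers to pre-training followed by task-specific fine-tuning): "this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples." This leads to the following problems: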
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. (That is, fine-tuning still needs a fairly large dataset, but many tasks cannot supply data for fine-tuning.)
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions.
For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model.
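The three conditions differ only in how many demonstrations are placed in the prompt. A minimal sketch, assuming a simple "source => target" text layout like the translation example in the GPT-3 paper; the helper `build_prompt` is hypothetical and no model is called:

```python
def build_prompt(instruction, demonstrations, query, k):
    """Assemble an in-context prompt with k demonstrations.

    k = 0 gives zero-shot, k = 1 one-shot, larger k few-shot.
    No gradient update is involved; the task is specified purely as text.
    """
    lines = [instruction]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is asked to continue from here
    return "\n".join(lines)


instruction = "Translate English to French:"
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush girafe", "girafe peluche"),
]

zero_shot = build_prompt(instruction, demonstrations, "cheese", k=0)
one_shot = build_prompt(instruction, demonstrations, "cheese", k=1)
few_shot = build_prompt(instruction, demonstrations, "cheese", k=3)
```

In practice the upper bound on `k` is set by the model's context window, which is why the paper quotes a typical range of 10 to 100 demonstrations.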
From the figure above: Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
We also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
On the advantages and disadvantages of fine-tuning: The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution.
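The paper specifically notes that GPT-3 itself could be fine-tuned, and that this is one of its directions for future work. Why separate one-shot from few-shot and zero-shot? Because one-shot most closely matches the way tasks are communicated to humans.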
GPT-3's architecture: We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
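To make "alternating dense and locally banded sparse attention" more tangible, here is a rough sketch of the two mask shapes. The quoted passage does not give the band width or which layers use which pattern, so `window` and the even/odd rule in `layer_mask` are assumptions for illustration only.

```python
import numpy as np


def dense_causal_mask(n):
    """Each position may attend to itself and every earlier position."""
    return np.tril(np.ones((n, n), dtype=bool))


def banded_causal_mask(n, window):
    """Each position may attend only to the last `window` positions (itself included)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (i - j < window)


def layer_mask(layer_index, n, window):
    """Alternate patterns by layer parity (an assumed rule, not from the paper)."""
    if layer_index % 2 == 0:
        return dense_causal_mask(n)
    return banded_causal_mask(n, window)


dense = layer_mask(0, n=6, window=2)
banded = layer_mask(1, n=6, window=2)
```

A `True` entry at row `i`, column `j` means position `i` is allowed to attend to position `j`. The banded pattern keeps far fewer entries, which is where the savings of sparse attention come from.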
The material here is from Jay Alammar's introduction to GPT-3. Taking a trained model as an example:
The model is presented with an example. We only show it the features and ask it to predict the next word.
The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.
Repeat millions of times:
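The predict, measure the error, update cycle described above can be sketched with a deliberately tiny next-word model. This is only an illustration of the loop: the corpus is made up, and the "model" is a table of scores per word pair rather than a transformer.

```python
import numpy as np

# Hypothetical toy corpus; every word has exactly one successor.
corpus = "a robot must obey the orders given it".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
# Training examples: (current word, next word) as vocabulary indices.
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

W = np.zeros((len(vocab), len(vocab)))  # W[i, j]: score of word j following word i
learning_rate = 0.5


def predict_probs(i):
    """Softmax over the scores for what follows word i."""
    logits = W[i]
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()


def average_loss():
    """Mean cross-entropy of the true next word over all training pairs."""
    return float(np.mean([-np.log(predict_probs(x)[y]) for x, y in pairs]))


loss_before = average_loss()
for _ in range(200):                      # "repeat" (far fewer than millions here)
    for x, y in pairs:
        probs = predict_probs(x)          # 1. predict the next word
        error = probs.copy()
        error[y] -= 1.0                   # 2. prediction minus the correct answer
        W[x] -= learning_rate * error     # 3. update so the next prediction is better
loss_after = average_loss()


def predict_next(word):
    """Return the most likely next word under the trained table."""
    return vocab[int(np.argmax(predict_probs(index[word])))]
```

The quantity `error` is the gradient of the cross-entropy loss with respect to the scores, so step 3 is an ordinary gradient descent update.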
How does a system process the word “robotics” and produce “A”?
High-level steps:
1. Convert the word to a vector (list of numbers) representing the word
2. Compute prediction
3. Convert resulting vector to word
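These three steps can be sketched in a few lines. Everything below is a stand-in: the vocabulary is hypothetical, the weights are random and untrained, and a single `tanh` layer replaces the real stack of transformer layers, so the word that comes out is arbitrary.

```python
import numpy as np

vocab = ["a", "robot", "must", "obey", "robotics"]  # hypothetical tiny vocabulary
d_model = 4                                         # vector size (assumed)
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), d_model))  # one vector per word (untrained)
hidden_weights = rng.normal(size=(d_model, d_model))


def next_word(word):
    # 1. Convert the word to a vector (list of numbers) representing the word.
    vector = embedding[vocab.index(word)]
    # 2. Compute the prediction (a single layer stands in for the deep stack).
    hidden = np.tanh(vector @ hidden_weights)
    # 3. Convert the resulting vector to a word: score it against every
    #    word vector and pick the best match.
    scores = hidden @ embedding.T
    return vocab[int(np.argmax(scores))]


predicted = next_word("robotics")
```

Reusing the embedding table in step 3 is a common design choice, assumed here for brevity; a separate output matrix would work the same way.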
See all these layers? This is the “depth” in “deep learning”. Each of these layers has its own 1.8B parameters to make its calculations. That is where the “magic” happens. This is a high-level view of that process:
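The 1.8B-per-layer figure can be sanity-checked with a little arithmetic. The sketch below assumes the 175B configuration reported in the GPT-3 paper (96 layers, model width 12288), neither of which appears in the text above, and uses the common approximation of 12 * d_model^2 weights per transformer layer, ignoring biases, layer norms, and embeddings.

```python
d_model = 12288  # model width (assumed from the GPT-3 paper's 175B configuration)
n_layers = 96    # number of transformer layers (same assumption)

# Attention: query, key, value and output projections, each d_model x d_model.
attention_params = 4 * d_model * d_model
# Feed-forward block: two matrices between d_model and 4 * d_model.
mlp_params = 2 * d_model * (4 * d_model)

per_layer = attention_params + mlp_params  # roughly 1.8 billion
total = n_layers * per_layer               # roughly 174 billion
```

The small gap to the headline 175 billion comes from the parts this approximation leaves out.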