Chatbot AI ChatGPT: A detailed explanation of ChatGPT/InstructGPT

 


Preface: The GPT series is a line of pre-training papers from OpenAI. GPT stands for Generative Pre-Trained Transformer. As the name suggests, the goal of GPT is to obtain a general-purpose text model by using the Transformer as the base model together with pre-training techniques.

The models with published papers so far include the text pre-trained GPT-1, GPT-2, and GPT-3, as well as the image pre-trained iGPT. GPT-4, which has not yet been released, is rumored to be a multimodal model. The recently popular ChatGPT and InstructGPT [1], announced earlier this year, are sister models released as a warm-up before GPT-4; they are sometimes also referred to as GPT-3.5.

ChatGPT and InstructGPT are identical in model structure and training method: both use instruction learning and Reinforcement Learning from Human Feedback (RLHF) to guide training. The only difference lies in how the data is collected.

So to understand ChatGPT, we must first understand InstructGPT. (Further reading: GPT-4 core technology exploration.) 1. Background Knowledge. Before introducing ChatGPT/InstructGPT, let's first introduce the basic algorithms they rely on. 1.1 The GPT series.

The three text pre-trained generations GPT-1 [2], GPT-2 [3], and GPT-3 [4] all use the Transformer as their core structure (Figure 1). They differ in hyperparameters such as the number of layers and the word vector length; the details are shown in Table 1.

Figure 1: Model structure of the GPT series (where Trm is a Transformer block)

Table 1: Release time, parameter count, and pre-training data volume of the GPT generations

| Model | Release Time | Layers | Heads | Vector Length | Parameters | Pre-training Data |
|-------|--------------|--------|-------|---------------|------------|-------------------|
| GPT-1 | June 2018 | 12 | 12 | 768 | 117 million | ~5 GB |
| GPT-2 | February 2019 | 48 | - | 1600 | 1.5 billion | 40 GB |
| GPT-3 | May 2020 | 96 | 96 | 12288 | 175 billion | 45 TB |

GPT-1 was born a few months before BERT. Both adopt the Transformer as their core structure; the difference is that GPT-1 constructs its pre-training task in a left-to-right generative manner and thereby obtains a general pre-trained model which, like BERT, can be fine-tuned for downstream tasks. GPT-1 achieved SOTA on 9 NLP tasks at the time, but the model size and the amount of data it used were relatively small, which prompted the birth of GPT-2.

Compared to GPT-1, GPT-2 did not make major changes to the model structure; it simply used a model with more parameters and more training data (Table 1). GPT-2's most important contribution is the idea that "all supervised learning tasks are subsets of an unsupervised language model", an idea that is also the precursor of Prompt Learning.

When GPT-2 was released it caused quite a stir: the news it generated was convincing enough to fool most humans, to the point of being indistinguishable from real articles. It was even called "the most dangerous weapon in the AI world" at the time, and many portal websites banned the use of GPT-2-generated news. When GPT-3 was proposed, beyond its performance far surpassing GPT-2, what sparked even more discussion was its 175 billion parameters.

Besides completing common NLP tasks, researchers unexpectedly found that GPT-3 also performs quite well at writing code in languages such as SQL and JavaScript and at carrying out simple mathematical operations. GPT-3's training uses in-context learning, which is a form of meta-learning; the core idea of meta-learning is to use a small amount of data to find a suitable initialization so that the model can quickly fit on a limited dataset and still obtain good results.
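To make in-context learning concrete, here is a minimal sketch of a few-shot prompt: the task is demonstrated entirely inside the input, and no gradient update is performed. The `complete` function is a hypothetical placeholder, not a specific API.

```python
# A minimal sketch of in-context (few-shot) learning with a GPT-3-style model.
# The task is taught purely through examples embedded in the prompt;
# the model's weights are never updated.

few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""


def complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to a text-completion model."""
    raise NotImplementedError


# completion = complete(few_shot_prompt)  # expected to continue with "fromage"
```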

From the above analysis we can see that, in terms of performance, GPT-3 has two goals: to improve the model's performance on common NLP tasks, and to enhance its generalization ability on other, atypical NLP tasks (such as code writing and mathematical operations). In addition, ever since pre-trained models were born, one heavily criticized issue has been their bias.

Because pre-trained models are trained on massive amounts of data with extremely large parameter counts, compared with expert systems fully controlled by hand-written rules, a pre-trained model is like a black box. No one can guarantee that it will not generate dangerous content such as racial or gender discrimination, since its tens of GB or even tens of TB of training data almost certainly contain similar samples.

This is the motivation behind InstructGPT and ChatGPT. The paper summarizes their optimization objectives with "3H": Helpful, Honest, and Harmless. OpenAI's GPT-series models are not open source, but a trial website is provided, and readers who have access can try them out on their own.

1.2 Instruction Learning and Prompt Learning. Instruction learning is an idea proposed in 2021 by Quoc V. Le's team at Google in a paper titled "Finetuned Language Models Are Zero-Shot Learners" [5].

Both instruction learning and prompt learning aim to tap into the knowledge the language model already possesses. The difference is that a prompt stimulates the model's completion ability, such as generating the second half of a sentence from the first half or filling in a cloze test, whereas an instruction stimulates the model's understanding ability by giving a more explicit command so that the model takes the correct action.

We can understand these two different learning methods through the following examples. Prompt learning: "I bought this necklace for my girlfriend and she really likes it; this necklace is so ____." Instruction learning: "Determine the sentiment of this sentence: I bought this necklace for my girlfriend and she really likes it." The advantage of instruction learning is that, after fine-tuning on multiple tasks, it can also perform zero-shot on other tasks, whereas prompt learning is targeted at a single task and its generalization ability is not as good as that of instruction learning. We can understand fine-tuning, prompt learning, and instruction learning through Figure 2.
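The difference between the two paradigms is easiest to see in the shape of the model input. The snippet below is only an illustrative sketch of the two formats described above; the example strings are made up.

```python
# Prompt learning: the task is phrased as a completion / cloze problem.
prompt_style_input = (
    "I bought this necklace for my girlfriend and she really likes it. "
    "This necklace is so ____."
)

# Instruction learning: the task is stated as an explicit command.
instruction_style_input = (
    "Determine the sentiment of the following sentence "
    "(positive / neutral / negative): "
    "I bought this necklace for my girlfriend and she really likes it."
)
```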

Figure 2: Similarities and differences among model fine-tuning, prompt learning, and instruction learning. 1.3 Reinforcement Learning with Human Feedback. A trained model is not fully controllable; it can be seen as a fit to the distribution of its training set. Fed back into a generative model, the distribution of the training data is therefore the most important factor affecting the quality of the generated content.

Sometimes we want the model to be influenced not only by the training data but also to be controllable by humans, so as to ensure the usefulness, truthfulness, and harmlessness of the generated data. The paper repeatedly mentions the alignment problem, which can be understood as aligning the model's output with the output humans prefer. What humans prefer includes not only the fluency and grammatical correctness of the generated content, but also its usefulness, truthfulness, and harmlessness.

We know that reinforcement learning guides model training through a reward mechanism; the reward can be seen as playing the role that the loss function plays in traditional model training. The computation of a reward is more flexible and diverse than that of a loss function (for AlphaGo, the reward is whether the game is won), at the cost that the reward is non-differentiable and therefore cannot be used directly for backpropagation.

The idea of reinforcement learning is to fit the loss function by sampling rewards in large quantities, and thereby train the model. Likewise, human feedback is non-differentiable, so human feedback can also be used as the reward of reinforcement learning, and reinforcement learning from human feedback emerged accordingly. RLHF can be traced back to "Deep Reinforcement Learning from Human Preferences", published by Google in 2017

[6], which used human annotations as feedback to improve the performance of reinforcement learning on simulated robotics and Atari games.
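Because the reward itself cannot be backpropagated through, policy-gradient methods turn sampled rewards into a differentiable surrogate by weighting the log-probabilities of the sampled actions. Below is a minimal PyTorch-style sketch of that general idea (a REINFORCE-style loss), not the specific algorithm used in the paper.

```python
import torch


def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate loss.

    log_probs: log pi(a_t | s_t) for the sampled actions, shape (T,)
    rewards:   returns obtained for those actions, shape (T,)

    The rewards are treated as constants; gradients flow only through the
    log-probabilities, which is how a non-differentiable reward can still
    drive gradient-based training.
    """
    return -(log_probs * rewards.detach()).mean()
```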

Figure 3: The basic principle of reinforcement learning with human feedback. InstructGPT/ChatGPT also uses a classic reinforcement learning algorithm: Proximal Policy Optimization (PPO), proposed by OpenAI

[7]. PPO is a new type of policy gradient algorithm. Policy gradient algorithms are very sensitive to the step size, yet an appropriate step size is hard to choose: if the new and old policies change too much between updates, learning suffers. PPO proposes a new objective function that enables small-batch updates over multiple training steps, solving the problem of determining the step size in policy gradient algorithms.
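The key trick in PPO [7] is a clipped surrogate objective that keeps each update small. A minimal PyTorch-style sketch of that objective (hyperparameter names are generic, not those used for InstructGPT):

```python
import torch


def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper.

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps the new policy
    close to the old one, which sidesteps the step-size problem of vanilla
    policy gradient methods.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```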

In fact, TRPO was also proposed to address this same problem, but compared with TRPO, PPO is easier to solve. 2. The principles of InstructGPT/ChatGPT. With the background knowledge above, it becomes much easier to understand InstructGPT/ChatGPT.

Simply put, both InstructGPT and ChatGPT adopt the GPT-3 network structure and construct training samples via instruction learning to train a reward model (RM) that predicts the quality of the generated content; the reward model's scores then guide the training of the reinforcement learning model. The training process of InstructGPT/ChatGPT is shown in Figure 4.

Figure 4: The training pipeline of InstructGPT: (1) supervised fine-tuning (SFT); (2) reward model (RM) training; (3) reinforcement learning against the reward model via PPO. From Figure 4 we can see that the training of InstructGPT/ChatGPT is divided into three steps, where the reward model of step 2 and the reinforcement-learning-tuned SFT model of step 3 can be optimized iteratively.

1) Perform supervised fine-tuning (SFT) of GPT-3 on the collected SFT dataset; 2) collect human-annotated comparison data and train a reward model (RM); 3) use the RM as the optimization objective of reinforcement learning and fine-tune the SFT model with the PPO algorithm.
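Putting the three steps together, the pipeline can be summarized by the sketch below. Every function is a hypothetical placeholder named only for illustration; none of them corresponds to a released implementation or API.

```python
# High-level sketch of the three-step InstructGPT/ChatGPT training pipeline.
# All functions are hypothetical placeholders, named only for illustration.


def supervised_finetune(model, demonstrations):
    """Step 1: fine-tune the pre-trained GPT-3 model on prompt/response pairs."""
    raise NotImplementedError


def train_reward_model(sft_model, ranked_outputs):
    """Step 2: fit a scalar reward model to human preference rankings."""
    raise NotImplementedError


def ppo_finetune(sft_model, reward_model, prompts):
    """Step 3: optimize the SFT policy against the reward model with PPO."""
    raise NotImplementedError


def train_instructgpt(gpt3, sft_data, rm_data, ppo_prompts):
    sft_model = supervised_finetune(gpt3, sft_data)              # step 1
    reward_model = train_reward_model(sft_model, rm_data)        # step 2
    policy = ppo_finetune(sft_model, reward_model, ppo_prompts)  # step 3
    return policy                                                # steps 2 and 3 can be iterated
```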

Following Figure 4, we now introduce the dataset collection and model training of InstructGPT/ChatGPT. 2.1 Dataset collection. As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into three steps, and the data required by each step differs somewhat; we introduce them separately below.

2.1.1 SFT dataset. The SFT dataset is used to train the supervised model of step 1, i.e., to fine-tune GPT-3 on the newly collected data, following GPT-3's own training method. Because GPT-3 is a generative model driven by prompts, the SFT dataset consists of prompt-response pairs.

Part of the SFT data comes from users of OpenAI's Playground, and the other part comes from 40 labelers hired by OpenAI, who received training for this annotation work. For this dataset, the labelers' job was to write instructions themselves, and the instructions were required to satisfy the following three points:

Simple task: the labeler writes any simple task, while ensuring the diversity of tasks; Few-shot task: the labeler writes an instruction together with multiple query-response pairs for that instruction; User-related: use cases are taken from the API interface, and the labeler writes instructions based on these use cases.
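As a concrete, made-up illustration of what one SFT sample (a prompt plus a human-written response) might look like; the actual format used by OpenAI is not public:

```python
# A made-up example of one SFT training sample (prompt + human-written response).
# The real dataset format is not public; this is purely illustrative.
sft_sample = {
    "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
    "response": (
        "People built a big rocket, flew it to the moon, walked around, "
        "and then came back home to tell everyone about it."
    ),
}
```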

2.1.2 RM dataset. The RM dataset is used to train the reward model of step 2. We need to set a reward objective for the training of InstructGPT/ChatGPT; this reward objective does not have to be differentiable, but it must align as comprehensively and truthfully as possible with the content we want the model to generate. Naturally, we can provide this reward through human annotation: by giving lower scores to generated content that involves bias, we encourage the model not to generate content that humans dislike.

The approach of InstructGPT/ChatGPT is to first have the model generate a batch of candidate texts and then have labelers rank the candidates by the quality of the generated data. 2.1.3 PPO dataset. The PPO data of InstructGPT is not annotated; it comes entirely from users of the GPT-3 API.

It contains different kinds of generation tasks provided by different users, the largest proportions being generation tasks (45.6%), QA (12.4%), brainstorming (11.2%), dialogue (8.4%), and so on. 2.1.4 Data analysis. Because InstructGPT/ChatGPT is fine-tuned on top of GPT-3 and involves human annotation, the total amount of data is not large; Table 2 shows the sources of the three datasets and their sizes.

Table 2: Data distribution of InstructGPT. Appendix A of the paper discusses the data distribution in more detail; here I list several factors that may affect the model's performance: over 96% of the data is in English, while the other 20 languages such as Chinese, French, and Spanish together account for less than 4%. This may mean that InstructGPT/ChatGPT can generate text in other languages, but the quality is presumably far below that of English;

there are 9 types of prompts in total, and the vast majority are generation tasks, which may leave some task types uncovered by the model; the 40 outsourced labelers come from the United States and Southeast Asia, a relatively concentrated and rather small group. The goal of InstructGPT/ChatGPT is to train a pre-trained model with correct values, and the values it learns are a combination of the values of these 40 labelers.

This relatively narrow distribution may produce discrimination and bias issues that people in other regions care about more. In addition, the ChatGPT blog mentions that ChatGPT and InstructGPT are trained in the same way and differ only in how the data is collected, but no further details have been given about those differences in data collection.

Considering that ChatGPT is used only for dialogue, I suspect the differences in data collection lie in two aspects: 1. the proportion of dialogue tasks was increased; 2. the prompts were converted into a Q&A format. Of course, this is only a guess; a more accurate description will have to wait until more detailed material such as ChatGPT's paper and source code is released.

2.2 Training tasks. We have just seen that InstructGPT/ChatGPT is trained in three steps, and these three steps involve three models: SFT, RM, and PPO, which we describe in detail below. 2.2.1 Supervised fine-tuning (SFT). The training in this step is the same as for GPT-3, and the authors found that letting the model overfit slightly helps the subsequent two steps of training.

2.2.2 Reward model (RM). Because the data for training the RM consists of rankings of generated results made by labelers, the RM can be viewed as a regression model. Structurally, the RM is the SFT-trained model with its final embedding layer removed; its input is a prompt and a response, and its output is a scalar reward.
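A minimal PyTorch-style sketch of such a reward model: a language-model backbone whose final hidden state is projected to a single scalar. The `backbone` module, its output shape, and the choice of reading the reward from the last token are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Sketch of an RM: an LM backbone (its output embedding layer removed)
    topped with a linear head that emits one scalar reward per sequence.

    `backbone` is assumed to map token ids of shape (batch, seq_len) to hidden
    states of shape (batch, seq_len, hidden_dim).
    """

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)                 # (batch, seq_len, hidden_dim)
        last_hidden = hidden[:, -1, :]                    # read the reward from the final token
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,)
```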

Specifically, for each prompt, InstructGPT/ChatGPT randomly generates $K$ outputs ($4 \leq K \leq 9$) and then shows the outputs to each labeler in pairs, i.e., each prompt yields $C_K^2$ displayed comparisons, from which the labeler selects the better output of each pair. During training, InstructGPT/ChatGPT treats the $C_K^2$ response pairs of each prompt as a single batch. This prompt-wise batching is less prone to overfitting than the traditional sample-wise batching, because with this scheme each prompt is fed into the model exactly once.

The loss function of the reward model is given by Equation (1). Its objective is to maximize the gap between the responses the labeler prefers and those the labeler does not prefer:

$$\operatorname{loss}(\theta)=-\frac{1}{\binom{K}{2}} E_{\left(x, y_w, y_l\right) \sim D}\left[\log \left(\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)\right)\right] \tag{1}$$

where $r_\theta(x, y)$ is the reward given by the reward model with parameters $\theta$ for prompt $x$ and response $y$, $y_w$ is the response the labeler prefers, $y_l$ is the response the labeler does not prefer, and $D$ is the entire training dataset.
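A minimal PyTorch-style sketch of the pairwise loss in Equation (1), assuming the rewards for the preferred and dispreferred responses have already been computed by a reward model such as the one sketched above:

```python
import torch
import torch.nn.functional as F


def reward_model_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss of Eq. (1).

    r_preferred: r_theta(x, y_w) for the labeler-preferred responses, shape (N,)
    r_rejected:  r_theta(x, y_l) for the dispreferred responses, shape (N,)
    where N is the number of comparison pairs (C(K, 2) pairs per prompt).

    -log(sigmoid(r_w - r_l)) is averaged over all pairs; averaging over the
    pairs of one prompt plays the role of the 1/C(K,2) factor in Eq. (1).
    """
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```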

2.2.3 Reinforcement learning model (PPO). Reinforcement learning and pre-trained models are two of the hottest AI directions of the past two years. Many researchers used to argue that reinforcement learning was not well suited to pre-trained models, because it is hard to build a reward mechanism on top of a model's output content. InstructGPT/ChatGPT counter-intuitively made it work: by incorporating human annotation, it brings reinforcement learning into pre-trained language models, which is the biggest innovation of this algorithm. As shown in Table 2, PPO's training set comes entirely from the API, and the reward model obtained in step 2 is used to guide the continued training of the SFT model.

Reinforcement learning is often very hard to train, and InstructGPT/ChatGPT ran into two problems during training. Problem 1: as the model is updated, the data produced by the reinforcement learning model drifts further and further from the data used to train the reward model. The authors' solution is to add a KL penalty term to the loss function,

$$\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)$$

to ensure that the output of the PPO model does not deviate too much from the output of the SFT model. Problem 2: training with the PPO objective alone causes a large drop in performance on general NLP tasks. The authors' solution is to add a general language modeling objective to the training target,

$$\gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]$$

which is called PPO-ptx in the paper. In summary, the PPO training objective is given by Equation (2):

$$\text{objective}(\phi)=E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+\gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right] \tag{2}$$
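A minimal sketch of the objective in Equation (2), written as a loss to be minimized. All tensors are assumed to be pre-computed log-probabilities and rewards; `beta` and `gamma` correspond to the coefficients in the equation.

```python
import torch


def ppo_ptx_objective(reward: torch.Tensor,
                      logp_rl: torch.Tensor,
                      logp_sft: torch.Tensor,
                      logp_pretrain: torch.Tensor,
                      beta: float,
                      gamma: float) -> torch.Tensor:
    """Negative of Eq. (2), so it can be minimized directly.

    reward:        r_theta(x, y) from the reward model, shape (batch,)
    logp_rl:       log pi_RL(y|x) of the current policy on RLHF samples, shape (batch,)
    logp_sft:      log pi_SFT(y|x) of the frozen SFT model on the same samples, shape (batch,)
    logp_pretrain: log pi_RL(x) of the current policy on pretraining data, shape (batch,)
    """
    kl_penalty = beta * (logp_rl - logp_sft)  # keeps the policy close to the SFT model
    rl_term = (reward - kl_penalty).mean()    # first expectation in Eq. (2)
    ptx_term = gamma * logp_pretrain.mean()   # PPO-ptx term: preserve general LM ability
    return -(rl_term + ptx_term)
```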

3. Performance analysis of InstructGPT/ChatGPT. It is undeniable that InstructGPT/ChatGPT works very well; in particular, the introduction of human annotation greatly improves the "correctness" of the model's values and the "truthfulness" of its behavior relative to human behavior patterns.

So, based solely on the technical approach and training method of InstructGPT/ChatGPT, which improvements can we expect it to bring? 3.1 Advantages. InstructGPT/ChatGPT is more truthful than GPT-3: this is easy to understand, because GPT-3 already has very strong generalization and generation abilities, and on top of that InstructGPT/ChatGPT has different labelers write prompts and rank the generated results while fine-tuning on top of GPT-3; as a result, when training the reward model, more truthful data receives a higher reward.

The authors also compared them with GPT-3 on the TruthfulQA dataset, and the experiments show that even the small 1.3-billion-parameter PPO-ptx model outperforms GPT-3. InstructGPT/ChatGPT is slightly better than GPT-3 in terms of harmlessness

: the reasoning is the same as above. However, the authors found that InstructGPT shows no obvious improvement on datasets for discrimination and bias. This is because GPT-3 is already a very good model, so the probability of it generating problematic samples containing harmful, discriminatory, or biased content is already low; data collected and annotated by only 40 labelers is likely insufficient to optimize the model adequately in these respects, so the improvement is small or imperceptible.

InstructGPT/ChatGPT has strong coding ability: GPT-3 itself already has strong coding ability, the API built on GPT-3 has accumulated a large amount of code, and some OpenAI staff also took part in the data collection.

With a large amount of coding-related data plus human annotation, it is no surprise that the resulting InstructGPT/ChatGPT has very strong coding ability. 3.2 Disadvantages. InstructGPT/ChatGPT reduces the model's performance on general NLP tasks

: we discussed this point when covering PPO training; although modifying the loss function alleviates it, the problem is not completely solved. InstructGPT/ChatGPT sometimes gives absurd outputs: although InstructGPT/ChatGPT uses human feedback, the available human resources are limited.

What influences the model most is still the supervised language modeling task; humans only play a corrective role. So it is quite possible that, limited by the amount of corrective data, or misled by the supervised objective (which considers only the model's output, not what humans actually want), the model generates untruthful content. It is like a student: even with a teacher's guidance, there is no guarantee the student will master every point.

The model is very sensitive to instructions: this can also be attributed to an insufficient amount of labeler-annotated data. The instruction is the only clue from which the model produces output, so if the number and variety of instructions seen during training are insufficient, the model may exhibit this problem. The model over-interprets simple concepts: this may be because labelers, when comparing generated content, tend to give longer outputs higher rewards.

Harmful instructions may elicit harmful replies: for example, InstructGPT/ChatGPT will produce an action plan in response to a user's request for an "AI plan to destroy humanity" (Figure 5). This is because InstructGPT/ChatGPT assumes that the instructions written by labelers are reasonable and reflect correct values, and it does not make a more careful judgment about instructions given by users, which leads the model to answer arbitrary inputs.

Although the subsequent reward model may assign such outputs low reward values, when generating text the model must consider not only its values but also how well the generated content matches the instruction, so it is still possible for it to occasionally produce outputs with problematic values.

Figure 5: The "plan to destroy humanity" written by ChatGPT. 3.3 Future work. Having analyzed InstructGPT/ChatGPT's technical approach and its problems, we can also see from which angles InstructGPT/ChatGPT could be improved.

Cheaper and more effective human annotation: InstructGPT/ChatGPT hired a 40-person annotation team, but judging from the model's performance this team is not enough. Finding ways for humans to provide more effective feedback, and combining human and model capabilities organically and cleverly, is very important.

The model's ability to generalize over and correct instructions: since the instruction is the only clue from which the model produces output, the model depends on it heavily. Improving the model's ability to generalize over instructions and to correct erroneous instructions is a very important way to improve the user experience; it would not only give the model broader application scenarios but also make it more "intelligent".

Avoiding the performance drop on general tasks: this may require designing a more reasonable way to use human feedback, or a more advanced model structure. As discussed, many of InstructGPT/ChatGPT's problems could be solved by providing more labeler-annotated data, but that would cause an even heavier performance drop on general NLP tasks, so a scheme is needed to balance the 3H quality of the generated results against performance on general NLP tasks.

3.4 Hot questions about InstructGPT/ChatGPT. Will the emergence of ChatGPT put junior programmers out of work? Judging from ChatGPT's principles and the generated content that has surfaced online, much of the code ChatGPT generates runs correctly. But a programmer's job is not just writing code; more important is finding solutions to problems.

So ChatGPT will not replace programmers, especially senior programmers. On the contrary, like many existing code generation tools, it will become a very useful tool for programmers writing code. Stack Overflow announced a temporary rule banning ChatGPT. ChatGPT is essentially still a text generation model; compared with generating code, it is even better at generating text that passes for the real thing.

Moreover, the code or solutions produced by a text generation model are not guaranteed to run or to solve the problem, yet its convincing text can mislead many people searching for answers to the same question. To maintain the quality of the forum, it is entirely reasonable for Stack Overflow to ban ChatGPT.

The chatbot ChatGPT, under prompting, wrote a "plan to destroy humanity" and even provided code; what issues in AI development deserve attention? ChatGPT's "plan to destroy humanity" is content it forcibly fit from massive data in response to an unforeseen instruction. Although this content looks realistic and is fluently expressed, it only shows that ChatGPT has very strong generation ability; it does not mean that ChatGPT harbors any intention of destroying humanity.

After all, it is merely a text generation model, not a decision-making model. 4. Summary. As with many algorithms when they are newly born, ChatGPT has attracted broad attention in the industry and prompted people to reflect on AI thanks to its usefulness, truthfulness, and harmlessness. But after reading through its algorithmic principles, we find that it is not as terrifying as the industry hype suggests.

On the contrary, we can learn many valuable things from its technical approach. InstructGPT/ChatGPT's most important contribution to the AI community is the clever combination of reinforcement learning and pre-trained models, using human feedback to improve the model's usefulness, truthfulness, and harmlessness. ChatGPT also further raises the cost of large models: previously the competition was only over data volume and model scale, but now it even includes the expense of hired outsourced labelers, putting such work even further out of reach for individual researchers.

Appendix: Through cooperation with @人民邮电出版社 (Posts & Telecom Press), most of the content of this column has, after repeated proofreading and typesetting, been published as the books 《深度学习高手笔记——卷1:基础算法》 and 《深度学习高手笔记——卷2:前沿应用》. After more than ten rounds of revision by the author and the publisher's professional reviewers, the richness of the content, the precision of the algorithm explanations, and the fluency of the text have been greatly improved.

Volume 1 is already available on multiple platforms; you are welcome to purchase it via the link above for reference.

References

[1] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). https://arxiv.org/pdf/2203.02155.pdf

[2] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8), p.9. https://life-extension.github.io/2020/05/27/GPT%E6%8A%80%E6%9C%AF%E5%88%9D%E6%8E%A2/language-models.pdf

[4] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[5] Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021). https://arxiv.org/pdf/2109.01652.pdf

[6] Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017). https://arxiv.org/pdf/1706.03741.pdf

[7] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). https://arxiv.org/pdf/1707.06347.pdf
