GPT-4's "Ultimate Secret": 1.8 Trillion Parameters

 

Welcome to my blog! Today I have prepared an exciting article for everyone:

As is well known, OpenAI is not "open". Especially after the release of GPT-4, the entire OpenAI team has kept almost all of its details under wraps. Now, an article titled "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE" has surfaced.

It exposes everything about GPT-4, from model architecture and training to cost. Has GPT-4 been "open-sourced" once more? The article gives a detailed and very specific account of GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token counts, cost, and Mixture of Experts (MoE) setup.

The author also "digs deep" into the trade-offs OpenAI faced when choosing between different technical routes. Frankly, the most interesting part of GPT-4 is understanding why OpenAI made certain architectural decisions.

It is worth noting that Dylan Patel was also behind the leak of Google's internal document ("We Have No Moat, and Neither Does OpenAI").

DeepMind CEO Demis Hassabis recently confirmed the authenticity of that leaked Google document in a media interview. Given that the informant is Dylan Patel, this GPT-4 "big reveal" gains a fair amount of credibility. The article opens by arguing that the reason OpenAI is not open is not to protect humanity from destruction by AI, but for a different reason:

The large model they built is replicable, and in the future, internet giants and AI startups in both China and the United States will be able to build models that rival or even surpass GPT-4. OpenAI's most durable moat is that they have feedback from real users, the industry's top engineering talent, and the lead that comes with being the first mover.

Wallstreetcn (华尔街见闻) has compiled the main revelations about GPT-4. 1.8 trillion parameters and the model framework: the article states that GPT-4 contains a total of 1.8 trillion parameters across 120 layers, while GPT-3 has only about 175 billion parameters, making GPT-4 more than 10 times the scale of GPT-3.

OpenAI keeps costs under control by using a Mixture of Experts (MoE) model. GPT-4 has 16 expert models, each with roughly 111 billion MLP parameters, and two of them are routed to on each forward pass. The routing algorithm OpenAI uses for GPT-4 is actually quite simple.

There are also roughly 55 billion shared parameters for the attention mechanism. Each forward-pass inference (generating one token) uses only about 280 billion parameters and 560 TFLOPs, compared with the roughly 1.8 trillion parameters and about 3,700 TFLOPs per forward pass a purely dense model would require.
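
To make that arithmetic concrete, here is a minimal back-of-envelope sketch using the figures above; the 2-FLOPs-per-active-parameter rule of thumb is our own simplifying assumption:

```python
# Back-of-envelope estimate of GPT-4's active parameters and FLOPs per token,
# based on the leaked figures quoted above (all inputs are assumptions).

EXPERTS_TOTAL = 16            # total MLP experts
EXPERTS_ACTIVE = 2            # experts routed to per forward pass
MLP_PARAMS_PER_EXPERT = 111e9
SHARED_ATTN_PARAMS = 55e9     # attention parameters shared across experts
FLOPS_PER_PARAM = 2           # ~2 FLOPs (multiply + add) per active parameter per token

total_params = EXPERTS_TOTAL * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS
active_params = EXPERTS_ACTIVE * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS

print(f"total parameters  : {total_params / 1e12:.2f} T")  # ~1.8 T
print(f"active per token  : {active_params / 1e9:.0f} B")  # ~277 B, i.e. the ~280 B figure
print(f"TFLOPs per token  : {active_params * FLOPS_PER_PARAM / 1e12:.0f}")  # ~550, vs. ~560 quoted
print(f"dense TFLOPs/token: {total_params * FLOPS_PER_PARAM / 1e12:.0f}")   # ~3,660, vs. ~3,700 quoted
```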

Dataset composition: OpenAI trained GPT-4 on 13 trillion tokens. Because high-quality tokens are scarce, this dataset spans multiple epochs: 2 epochs for text-based data and 4 epochs for code-based data.

During pre-training, GPT-4 used a context length (seqlen) of 8k; the 32k version was fine-tuned on top of the pre-trained 8k version. The batch size was ramped up gradually over a few days on the cluster, with OpenAI's final batch size reaching 60 million tokens. However, since not every expert sees all tokens, this works out to only about 7.5 million tokens per expert per batch.

Real batch size: divide this number by the sequence length (seq len) to get the real batch size. Parallelism strategy: parallelizing across all of their A100 GPUs is critical. OpenAI uses 8-way tensor parallelism, as this is the limit of NVLink.

In addition, OpenAI is said to use 15-way pipeline parallelism. In theory, considering data communication and compute time, 15 pipeline stages is on the high side, but once the KV cache and overhead are added, the architecture makes theoretical sense if most of OpenAI's GPUs are 40GB A100s.

The author admits, however, that he does not fully understand how OpenAI avoids huge "bubbles" in every batch with such high pipeline parallelism, as shown in the figure below. Most likely OpenAI simply absorbed this cost.

Training cost: this single training run cost OpenAI about $63 million. Training GPT-4 took roughly 2.15e25 FLOPs on approximately 25,000 A100s for 90 to 100 days, at a utilization of only 32% to 36%. The very large number of failures, which required restarting from earlier checkpoints, is part of the reason for the extremely low utilization.
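
Those utilization and cost figures can be sanity-checked with simple arithmetic; the A100 peak throughput below is the published BF16 tensor-core spec, and the ~$1 per A100-hour price is the assumption quoted later in the text:

```python
# Rough sanity check of the leaked training numbers (inputs are the leaked or assumed figures).

TOTAL_FLOPS = 2.15e25          # claimed training compute
NUM_GPUS = 25_000              # A100s
DAYS = 95                      # midpoint of the claimed 90-100 days
A100_PEAK_FLOPS = 312e12       # A100 peak BF16 tensor-core FLOP/s
PRICE_PER_GPU_HOUR = 1.0       # assumed ~$1 per A100-hour in the cloud

seconds = DAYS * 24 * 3600
mfu = TOTAL_FLOPS / (NUM_GPUS * A100_PEAK_FLOPS * seconds)
gpu_hours = NUM_GPUS * DAYS * 24

print(f"implied utilization: {mfu:.0%}")        # ~34%, inside the claimed 32-36% range
print(f"GPU-hours          : {gpu_hours:,.0f}") # ~57 million A100-hours
print(f"approx. cost       : ${gpu_hours * PRICE_PER_GPU_HOUR / 1e6:.0f}M")  # ~$57M, same ballpark as $63M
```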

Another reason is that all-reduce across so many GPUs is extremely expensive. If OpenAI's cloud cost is roughly $1 per A100-hour, then under those conditions the cost of this training run alone is about $63 million, not counting all the experiments, failed runs, and other costs such as data collection, RLHF, and labor.

Taking those factors into account, the actual cost is much higher. Today, however, pre-training could be done on about 8,192 H100s at $2 per H100-hour in only about 55 days, at a cost of roughly $21.5 million. Trade-offs of the Mixture-of-Experts model: MoE is a great way to reduce the number of parameters used during inference while increasing the total parameter count.

If OpenAI were really chasing optimal performance, they would have needed to train on twice as many tokens. There are many reasons to use relatively few experts; one reason OpenAI chose 16 is that more experts struggle to generalize across many tasks and are harder to converge.

GPT-4 inference cost: compared with the 175-billion-parameter Davinci model, GPT-4 costs three times as much, even though its feed-forward parameters grow by only 1.6x. This is mainly because GPT-4 needs larger clusters and achieves lower utilization. According to the author, inference runs on clusters of 128 A100 GPUs.

At 8k sequence length, GPT-4 costs $0.0049 per 1,000 tokens on 128 A100s, while inference on 128 H100s at the same 8k sequence length costs $0.0021 per 1,000 tokens. Note that this assumes fairly high utilization and a consistently high batch size.

But clearly, OpenAI's utilization is sometimes very low. Multi-Query Attention: like other big players, OpenAI also uses MQA. In short, only one attention (KV) head is needed, which greatly reduces the memory footprint of the KV cache.

Even so, GPT-4 with a 32k context definitely cannot run on 40GB A100s, and the 8k version is capped in its maximum batch size. Continuous batching: OpenAI implements variable batch sizes and continuous batching, which allows some slack on maximum latency while optimizing inference cost.

Speculative decoding: OpenAI uses speculative decoding in GPT-4 inference. The basic idea is to use a smaller, faster draft model to decode several tokens ahead of time and then feed them to the oracle model as a single batch. If OpenAI uses speculative decoding, they probably only apply it to sequences of roughly 4 tokens.

Visual multimodality: it is a vision encoder separate from the text encoder, with cross-attention between the two, an architecture similar to Flamingo. This adds more parameters on top of GPT-4's 1.8 trillion. The multimodal capability was obtained by fine-tuning with roughly another 2 trillion tokens after the text-only pre-training.

OpenAI reportedly wanted to train the vision model from scratch, but because it was not mature enough, they fell back to fine-tuning from the text-trained model. The next-generation model, GPT-5, will supposedly be trained on vision from scratch and will be able to generate images and even audio on its own. Below is the full text, translated by Newin via GPT:

OpenAI keeps the GPT-4 architecture closed not because of some risk to humanity, but because what they have built is replicable. We expect companies such as Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, and Baidu to have models as capable as GPT-4, or even more capable, in the near term.

Don't get us wrong: OpenAI has amazing engineering, and what they built is incredible, but the solution they arrived at is not magic. It is an elegant solution with many complex trade-offs, and scaling up is only part of the battle. OpenAI's most durable competitive advantage is that they have the most real-world usage, leading engineering talent, and the ability to keep outpacing other companies with future models.

We have gathered a large amount of information on GPT-4 from many sources, and today we want to share it. This includes the model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token counts, layer counts, parallelism strategy, multimodal vision encoder, the thought process behind the various engineering trade-offs, the unique techniques implemented, and how they alleviated some of the biggest bottlenecks related to inference on such a massive model.

The most interesting part of GPT-4 is understanding why they made certain architectural decisions. In addition, we will outline the cost of training and inference for GPT-4 on A100s, and how it scales with H100s for next-generation model architectures. First, the problem statement: from GPT-3 to GPT-4, OpenAI wanted to scale up 100x, but the problem is cost.

Dense transformer models will not scale further. The dense transformer is the model architecture used by OpenAI's GPT-3, Google's PaLM, Meta's LLaMA, TII's Falcon, MosaicML's MPT, and others.

We can easily list more than 50 companies training LLMs with this same architecture. It is a decent architecture, but it is flawed for scaling. Before GPT-4's release, we discussed the relationship between training cost and the impending AI brick wall, and there we revealed OpenAI's high-level approach to GPT-4's architecture and its training cost relative to various existing models.

Over the past six months we have realized that training cost is irrelevant. On the surface it sounds crazy to spend tens of millions or even billions of dollars of compute time to train a model, but for these companies it is a trivial expense. It is effectively a fixed capital expenditure, and spending more to scale up consistently delivers better results.

The only limiting factor is scaling that compute on a timescale at which humans can get feedback and modify the architecture. Multiple companies, including Google, Meta, and OpenAI/Microsoft, will train models on supercomputers worth more than a billion dollars.

Meta burns $16 billion a year on the "Metaverse", Google wastes $10 billion a year on assorted projects, Amazon has lost more than $50 billion on Alexa, and cryptocurrency has squandered more than $100 billion on worthless things. These companies, and society at large, can and will spend over a hundred billion dollars.

These massive models can then be turned into products in many ways, and the effort will be replicated across multiple countries and companies. It is a new space race. Unlike the waste of the past, today's AI has tangible value, and in the short term that value will come from human assistants and autonomous agents. The more important issue in scaling AI, however, is inference.

The goal is to decouple training compute from inference compute. That is why it makes sense to train well beyond Chinchilla-optimal for any model that will be deployed, and why sparse model architectures are used: during inference, not every parameter is activated. The real challenge is that the cost of scaling these models out to users and agents is far too high.

Inference costs several times more than training, and that is the target of OpenAI's innovations in model architecture and infrastructure. Inference for large models is a multi-variable problem, and for dense models, model size is fatal. We have discussed this in detail with respect to edge computing, but the problem statement for the data center is very similar.

In short, devices can never have enough memory bandwidth to reach the throughput levels large language models demand. Even if the bandwidth were sufficient, utilization of the hardware's compute resources on edge devices would be very low. In the data center and the cloud, utilization is everything. Half of the reason Nvidia is praised for its software excellence is that, over a GPU's whole lifecycle, Nvidia keeps shipping low-level updates that raise FLOPS utilization by moving data more intelligently within a chip, between chips, and to and from memory.

In most current use cases, the goal of LLM inference is to run as a real-time assistant, which means it must achieve enough throughput for users to actually use it. The average human reads about 250 words per minute, while some people reach as high as 1,000 words per minute. That means you need to output at least 8.33 tokens per second, and closer to 33.33 tokens per second to cover all cases.
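
The arithmetic behind those targets, assuming the roughly 2-tokens-per-word conversion that the 8.33 and 33.33 figures imply (real tokenizers vary):

```python
# Converting human reading speed into a token-throughput target.
TOKENS_PER_WORD = 2.0   # assumption implied by the quoted figures

for words_per_minute in (250, 1000):
    tokens_per_second = words_per_minute * TOKENS_PER_WORD / 60
    print(f"{words_per_minute:>4} wpm -> {tokens_per_second:.2f} tokens/s")
# 250 wpm -> 8.33 tokens/s, 1000 wpm -> 33.33 tokens/s
```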

Because of memory-bandwidth requirements, a dense model with a trillion or more parameters mathematically cannot achieve this throughput on the latest Nvidia H100 GPU server. Every generated token requires every parameter to be loaded from memory onto the chip; that generated token is then appended to the prompt to generate the next token.

In addition, streaming the KV cache for the attention mechanism requires extra bandwidth.

The chart assumes efficiency equivalent to parameter reads; in reality, because not every operation can be fused, the attention mechanism needs extra memory bandwidth, and there is hardware overhead, the total overhead is larger even with "optimizations" such as Nvidia's FasterTransformer library. The chart above shows the memory bandwidth required to serve an LLM at high enough throughput for a single user.

It shows that even with 8 H100s, it is impossible to serve a 1-trillion-parameter dense model at 33.33 tokens per second. Moreover, at 20 tokens per second, FLOPS utilization on 8 H100s would still be under 5%, making inference extremely expensive. In effect, today's 8-way tensor-parallel H100 systems hit an inference limit at roughly 300 billion feed-forward parameters.
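
A back-of-envelope version of that claim, assuming FP16 weights that must all be streamed from HBM once per generated token at batch size 1 (the bandwidth figure is the H100 SXM spec; the rest are the numbers quoted above):

```python
# Why a 1-trillion-parameter dense model cannot hit 33.33 tokens/s on one 8x H100 node.

DENSE_PARAMS = 1e12
BYTES_PER_PARAM = 2            # FP16/BF16 weights
H100_HBM_BW = 3.35e12          # ~3.35 TB/s per H100 SXM
GPUS_PER_NODE = 8

bytes_per_token = DENSE_PARAMS * BYTES_PER_PARAM   # 2 TB of weights read per token
node_bandwidth = H100_HBM_BW * GPUS_PER_NODE       # ~26.8 TB/s aggregate
ceiling = node_bandwidth / bytes_per_token          # bandwidth-bound upper limit at batch size 1

print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")  # ~13.4 tokens/s, well below 33.33
```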

Yet OpenAI is achieving human reading speed on A100s, with a model of more than a trillion parameters, and offering it widely at a low price of only $0.06 per 1,000 tokens. That is possible because the model is sparse, i.e. not every parameter is used. What follows covers GPT-4's model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token counts, layer counts, parallelism strategy, multimodal vision encoder, the thought process behind the different engineering trade-offs, the unique techniques implemented, and how they mitigate some of the biggest bottlenecks of inference for such a massive model.

1. GPT-4 Model Architecture

GPT-4 is more than 10 times the size of GPT-3. As far as we know, it has roughly 1.8 trillion parameters spread across 120 layers, whereas GPT-3 has about 175 billion parameters. OpenAI managed to keep costs reasonable by using a Mixture of Experts (MoE) model.

If you are not familiar with MoE, read our article from six months ago on the broader GPT-4 architecture and training cost. OpenAI uses 16 experts in the model, each with about 111 billion MLP parameters, and 2 of these experts are routed to on each forward pass. While the literature talks about advanced routing algorithms for deciding which experts each token goes to, OpenAI's routing algorithm for the current GPT-4 model is said to be fairly simple.
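
For readers unfamiliar with MoE routing, here is a toy top-2 router showing what "fairly simple" routing can look like; this is a generic textbook sketch, not OpenAI's actual algorithm:

```python
# Toy top-2 MoE router: a linear gate scores the experts and each token is sent
# to the two highest-scoring ones, weighted by the softmax of the gate scores.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, num_tokens = 16, 64, 4

gate_w = rng.normal(size=(d_model, num_experts))    # gating weights
tokens = rng.normal(size=(num_tokens, d_model))     # token hidden states

logits = tokens @ gate_w                             # (tokens, experts)
top2 = np.argsort(logits, axis=-1)[:, :-3:-1]        # indices of the 2 best experts, best first
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)           # softmax over the 16 experts

for t in range(num_tokens):
    chosen = top2[t].tolist()
    weights = [round(float(probs[t, e]), 3) for e in chosen]
    print(f"token {t}: experts {chosen}, gate weights {weights}")
```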

In addition, about 55 billion parameters are shared for the attention mechanism. Each forward-pass inference (generating one token) uses only about 280 billion parameters and 560 TFLOPs, in contrast with the roughly 1.8 trillion parameters and 3,700 TFLOPs a purely dense model would need per forward pass.

2. Dataset Composition

OpenAI trained GPT-4 on roughly 13 trillion tokens. That is plausible considering that CommonCrawl, as used for RefinedWeb, contains about 5 trillion high-quality tokens. For reference, DeepMind's Chinchilla and Google's PaLM were trained on roughly 1.4 trillion and 0.78 trillion tokens respectively.

PaLM 2 is even said to have been trained on about 5 trillion tokens. This dataset does not contain 13 trillion unique tokens; rather, because high-quality tokens are scarce, it spans multiple epochs: 2 epochs for text data and 4 epochs for code data. Interestingly, that falls well short of Chinchilla-optimal, which would call for training the model on double the token count.

This points to a lack of easily accessible tokens on the web. There are 1,000 times as many high-quality text tokens out there, and even more audio and visual tokens, but acquiring them is not as simple as scraping web pages. They have millions of rows of instruction fine-tuning data from Scale AI and from internal sources, but unfortunately we could not find much about their reinforcement learning data.

The context length in the pre-training phase was 8k; the 32k token-length version was fine-tuned on top of the pre-trained 8k base. The batch size was ramped up gradually over a few days, and by the end OpenAI was using a batch size of 60 million! Of course, since not every expert sees all tokens, that really amounts to just 7.5 million tokens per expert per batch.
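
A quick check of how that 60-million-token batch breaks down; the 8k sequence length and the 2-of-16 expert routing are the figures quoted earlier, the arithmetic is ours:

```python
# How the 60M-token batch breaks down.
BATCH_TOKENS = 60e6
SEQ_LEN = 8192                        # 8k pre-training context
EXPERTS_TOTAL, EXPERTS_ACTIVE = 16, 2

sequences_per_batch = BATCH_TOKENS / SEQ_LEN                        # the "real" batch size
tokens_per_expert = BATCH_TOKENS * EXPERTS_ACTIVE / EXPERTS_TOTAL   # each token hits 2 of 16 experts

print(f"sequences per batch: {sequences_per_batch:,.0f}")      # ~7,300 sequences
print(f"tokens per expert  : {tokens_per_expert / 1e6:.1f}M")  # 7.5M, matching the claim
```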

3. Parallelism Strategy

The strategy for parallelizing across all of their A100 GPUs matters a great deal. They use 8-way tensor parallelism, since that is the limit of NVLink. On top of that, we hear they use 15-way pipeline parallelism. From the standpoint of compute time and data communication, that is theoretically too many pipeline stages, but it makes sense if they are constrained by memory capacity.

With pure pipeline + tensor parallelism, each GPU needs about 30GB (FP16) for the parameters alone. Once the KV cache and overhead are added, this makes sense in theory if most of OpenAI's GPUs are 40GB A100s. They may have used ZeRO stage 1, block-level FSDP, or hybrid sharded data parallelism.
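
A rough check of that 30GB-per-GPU figure, assuming the weights are split evenly across the 8 x 15 = 120-way tensor/pipeline grid (an idealized assumption; real splits are never perfectly even):

```python
# Per-GPU parameter memory under 8-way tensor parallel x 15-way pipeline parallel.
TOTAL_PARAMS = 1.8e12
BYTES_PER_PARAM = 2          # FP16
TENSOR_PARALLEL = 8
PIPELINE_PARALLEL = 15

shards = TENSOR_PARALLEL * PIPELINE_PARALLEL          # 120 shards, conveniently matching 120 layers
per_gpu_bytes = TOTAL_PARAMS * BYTES_PER_PARAM / shards

print(f"weights per GPU: {per_gpu_bytes / 2**30:.0f} GiB")  # ~28 GiB, i.e. ~30GB before KV cache and overhead
```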

As for why they did not use full-model FSDP, the reason is probably the higher communication overhead. Although OpenAI has high-speed networking between most nodes, that is not true of all of them; we believe at least some clusters are connected with much lower bandwidth than others. We do not understand how they avoid huge bubbles in each batch with such high pipeline parallelism.

Most likely they simply ate that cost.

4. Training Cost

OpenAI trained GPT-4 on roughly 25,000 A100s over 90 to 100 days at about 32% to 36% MFU (model FLOPS utilization). This extremely low utilization is partly due to the huge number of failures that forced restarts from checkpoints, and the bubbles mentioned above are very costly.

Another reason is that all-reduce across this many GPUs is extremely expensive. If our guess is right, the cluster is really a collection of many smaller clusters with fairly weak networking between them, i.e. non-blocking 800G/1.6T connectivity within the different parts of the cluster, but those parts linked to each other at only 200G/400G.

If their cost in the cloud was about $1 per A100-hour, the cost of this training run alone would be about $63 million. That does not count all the experiments, failed runs, and other costs such as data collection, RLHF, and staff. Because of those factors, the real cost is much higher.

It also assumes that someone else buys the chips, networking, and data center, absorbs the capital expenditure, and rents it to you. Today, pre-training could be done with about 8,192 H100s at $2 per hour in roughly 55 days, for about $21.5 million. Note that we believe nine companies will have more H100s than that by the end of this year.

Not all of them will throw all of those chips at a single training run, but the ones that do will have much larger models. Meta will have more than 100,000 H100s by the end of the year, though a significant share will be spread across its data centers for inference. Its largest single cluster will still exceed 25,000 H100s.

By the end of this year, many companies will have enough compute to train a model of GPT-4's scale.

5. MoE Trade-offs

MoE is a great way to cut the number of parameters used at inference time while raising the total parameter count, which is needed to encode more information per training token, since getting enough high-quality tokens is very hard.

If OpenAI were really trying to be Chinchilla-optimal, they would have had to train on twice as many tokens. That said, OpenAI made multiple trade-offs. For example, MoE is very hard to handle at inference time, because not every part of the model is used for every generated token.

That means some parts may sit idle while others are in use when serving users, which hurts utilization badly. Researchers have shown that 64 to 128 experts achieve lower loss than 16 experts, but that is a pure research result. There are multiple reasons to use fewer experts.

One reason OpenAI chose 16 experts is that more experts struggle to generalize across many tasks; more experts can also make convergence harder. For such a large training run, OpenAI chose to be more conservative with the number of experts. Fewer experts also helps their inference infrastructure.

Adopting a mixture-of-experts inference architecture involves all kinds of difficult trade-offs. Before getting into the trade-offs OpenAI faced and the choices they made, let's start from the basic trade-offs of LLM inference.

6. Inference Trade-offs

As an aside, before we start: everyone we have talked to at LLM companies thinks Nvidia's FasterTransformer inference library is quite bad, and TensorRT is even worse.

The inability to take Nvidia's templates and modify them means people have to build their own solutions from scratch. If you work at Nvidia and are reading this, you need to fix this quickly, or the default choice will become open tools, to which third-party hardware support can be added far more easily.

A wave of huge models is coming. If there is no software advantage in inference and kernels still have to be written by hand, AMD's MI300 and other hardware will find a much larger market. For large language model inference there are three main trade-offs, played out between batch size (the number of concurrent users served) and the number of chips used.

Latency - The model must respond with reasonable latency. People do not want to wait several seconds before output starts streaming into the chat application. Prefill (input tokens) and decode (output tokens) take different amounts of time to process.

Throughput - The model must output a certain number of tokens per second; roughly 30 tokens per second is what human use requires. Lower and higher throughputs are acceptable for various other uses.

Utilization - The hardware running the model must achieve high utilization, or the cost will be too high. Higher latency and lower throughput can be used to group more user requests together and achieve higher utilization, but that makes things harder. LLM inference is entirely about balancing two main factors: memory bandwidth and compute.

In the most oversimplified terms, every parameter must be read, and 2 FLOPs are associated with it. Most chips' ratios (for example, the H100 SXM has only 3 TB/s of memory bandwidth but 2,000 TFLOP/s of FP8) are therefore completely unbalanced for inference at batch size 1.

If only one user is served, at batch size 1, the memory bandwidth needed for every token generation dominates the inference time; compute time is almost nil. To scale a large language model to many users efficiently, the batch size must exceed 4, so that many users amortize the cost of the parameter reads. At a batch size of 256 or 512, for example, there are 512 or 1,024 FLOPs for every byte of memory read.

That ratio is much closer to the H100's ratio of FLOPS to memory bandwidth. It helps achieve higher utilization, but at the cost of higher latency. Many people see memory capacity as a major bottleneck of LLM inference, because large models need multiple chips and larger memory capacity means fewer chips are needed; in reality, though, it is better to use more chips than strictly necessary, so that latency drops, throughput rises, and larger batch sizes can be used for ever higher utilization.
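
A sketch of that balance, using the H100 numbers quoted above (3 TB/s, 2,000 FP8 TFLOP/s) together with the simplifying assumptions of 2 FLOPs per parameter and 1-byte FP8 weights:

```python
# Arithmetic intensity (FLOPs per byte of weights read) vs. batch size, compared
# against the H100's compute-to-bandwidth ratio.
H100_FP8_FLOPS = 2000e12       # ~2,000 TFLOP/s FP8
H100_HBM_BW = 3e12             # ~3 TB/s, the figure used in the text
chip_ratio = H100_FP8_FLOPS / H100_HBM_BW   # ~667 FLOPs available per byte moved

for batch in (1, 4, 64, 256, 512):
    flops_per_byte = 2 * batch  # each weight byte is reused across every sequence in the batch
    regime = "memory-bound" if flops_per_byte < chip_ratio else "compute-bound"
    print(f"batch {batch:>3}: {flops_per_byte:>4} FLOPs/byte ({regime})")
```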

Google showed these trade-offs in their PaLM inference paper. Note, however, that that was for a dense model like PaLM, not a sparse model like GPT-4. If an application demands the lowest possible latency, we have to throw more chips at it and partition the model into as many pieces as possible.

Smaller batch sizes usually give lower latency, but smaller batches also mean worse utilization and therefore a higher total cost per token (in chip-seconds or dollars). If an application needs offline inference and latency is not a concern, the main goal becomes maximizing per-chip throughput (i.e. minimizing total cost per token).

Increasing the batch size is the most efficient lever, because larger batches generally achieve better utilization, and some partitioning strategies that are inefficient at small batch sizes become efficient as the batch grows. More chips and larger batch sizes are cheapest, because they raise utilization, but this also introduces a third variable: networking time.

Some ways of splitting the model across chips are better for latency but trade off against utilization. Memory time and non-attention compute time both scale proportionally with model size and inversely with chip count. For a given partition layout, however, the time needed for chip-to-chip communication falls off more slowly (or not at all), so as the chip count grows it becomes an ever more important bottleneck.

While we will only touch on it briefly today, note that the KV cache's memory requirements balloon as batch size and sequence length grow. If an application needs to generate text with long attention contexts, inference time rises sharply. For a 500B+ model with multi-head attention, the attention KV cache gets very large: at batch size 512 and context length 2048 it totals 3TB, three times the size of the model's parameters.

That KV cache has to be loaded from off-chip memory onto the chip, during which the chip's compute cores sit essentially idle. Long sequence lengths are especially punishing on memory bandwidth and memory capacity. OpenAI's 16k-sequence-length GPT-3.5 Turbo and 32k-sequence-length GPT-4 are far more expensive because memory constraints prevent them from using larger batch sizes.
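
To show how such numbers arise, here is a generic KV-cache sizing formula with a hypothetical 500B-class multi-head-attention shape plugged in; the layer and head counts below are illustrative assumptions, not the article's model, and the comparison also previews why multi-query attention (discussed later) helps so much:

```python
# Generic KV-cache sizing: 2x (keys and values), per layer, per KV head, per position.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val

LAYERS, HEADS, HEAD_DIM = 118, 144, 128     # hypothetical ~500B-class dense shape
BATCH, SEQ = 512, 2048

mha = kv_cache_bytes(LAYERS, kv_heads=HEADS, head_dim=HEAD_DIM, seq_len=SEQ, batch=BATCH)
mqa = kv_cache_bytes(LAYERS, kv_heads=1,     head_dim=HEAD_DIM, seq_len=SEQ, batch=BATCH)

print(f"MHA KV cache: {mha / 1e12:.1f} TB")   # several terabytes at this batch/context
print(f"MQA KV cache: {mqa / 1e9:.1f} GB")    # shrinks by the KV-head count (144x here)
```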

Lower batch sizes mean lower hardware utilization. Moreover, the KV cache grows as sequence length increases. The KV cache cannot be shared between users, so it requires separate memory reads, further bottlenecking memory bandwidth.

7. GPT-4 Inference Trade-offs and Infrastructure

All of the above is hard enough for GPT-4 inference, but the architecture being a Mixture of Experts (MoE) introduces a whole new set of difficulties.

The forward pass for each generated token can be routed to a different set of experts. This plays havoc with the throughput/latency/utilization trade-off achieved at larger batch sizes. OpenAI's GPT-4 has 16 experts, with 2 per forward pass. That means that at a batch size of 8, each expert's parameter reads may effectively happen at a batch size of only 1.

Worse, one expert might get a batch size of 8 while the others get 4, 1, or 0. With every token generated, the routing algorithm sends the forward pass in a different direction, causing significant variation in token-to-token latency and in per-expert batch sizes. Inference infrastructure is one of the main reasons OpenAI went with a smaller number of experts.

Had they chosen more experts, memory bandwidth would have been an even bigger inference bottleneck. OpenAI's inference clusters regularly hit batch sizes of 4k+, which means that even with optimal load balancing across experts, each expert sees a batch size of only about 500. Achieving that requires a very large amount of usage.

Our understanding is that OpenAI runs inference on clusters of 128 GPUs, and has several such clusters across multiple data centers and geographies. Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130B parameters, i.e. under 30GB per GPU in FP16 and under 15GB in FP8/int8.

That allows inference to run on 40GB A100s, provided the KV cache size across all batches does not grow too large. Individual layers containing the various experts are not split across different nodes, because that would make network traffic too irregular, and recomputing the KV cache between every token generation would be far too costly.

For any future MoE scaling and conditional routing, how to handle routing of the KV cache is one of the biggest difficulties. The model has 120 layers, so dividing them evenly across 15 nodes would be simple, but because the first node has to handle data loading and embedding, it makes sense to place fewer layers on the head node of the inference cluster.

We have also heard some rumors about speculative decoding for inference, which we discuss later, though we are not sure whether to believe them; that would also explain why the head node needs to hold fewer layers.

8. GPT-4 Inference Cost

Compared with the 175B-parameter Davinci model, GPT-4 costs 3 times as much, even though its feed-forward parameters grow by only 1.6x.

That is mainly because GPT-4 requires larger clusters and achieves lower utilization. We estimate that GPT-4 inference at 8k sequence length costs $0.0049 per 1,000 tokens on 128 A100s, and $0.0021 per 1,000 tokens on 128 H100s.
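
Working backwards from those per-token prices gives a feel for the implied cluster throughput. The hourly GPU prices below are our own assumptions (roughly $1 per A100-hour and $2 per H100-hour, the rates mentioned earlier for training), used only for a ballpark:

```python
# What cluster throughput the quoted inference costs imply, under assumed GPU rental prices.
CLUSTER_GPUS = 128

for name, price_per_gpu_hr, cost_per_1k_tokens in [
    ("A100", 1.0, 0.0049),
    ("H100", 2.0, 0.0021),
]:
    cluster_cost_per_hr = CLUSTER_GPUS * price_per_gpu_hr
    tokens_per_hr = cluster_cost_per_hr / cost_per_1k_tokens * 1000
    print(f"{name}: ~{tokens_per_hr / 3600:,.0f} tokens/s across the cluster "
          f"to break even at ${cost_per_1k_tokens} per 1k tokens")
```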

Note that we are assuming fairly high utilization and consistently high batch sizes. That may be a wrong assumption, because OpenAI's utilization is clearly sometimes very low. We assume OpenAI spins clusters down during off-peak hours and repurposes those nodes to resume training of smaller test models from checkpoints, trying out all sorts of new techniques.

That helps keep inference costs down. If OpenAI does not do this, their utilization is lower and our cost estimates more than double.

9. Multi-Query Attention

MQA is something other companies are using too, but we want to point out that OpenAI uses it as well. Long story short, with only one KV head, the memory capacity needed for the KV cache is drastically reduced.

Even so, GPT-4 at 32k sequence length definitely cannot run on 40GB A100s, and the 8k version is constrained in its maximum batch size. Without MQA, the 8k version's maximum batch size would be so limited that it would not be economically viable.

10. Continuous Batching

OpenAI implements variable batch sizes and continuous batching. This allows some slack on maximum latency while optimizing inference cost. If you are not familiar with the concept, this article by AnyScale is worth a read.

11. On Speculative Decoding

We have heard from some reliable sources that OpenAI uses speculative decoding in GPT-4 inference.

We are not sure whether to fully believe it. The widespread variation in token-to-token latency, and the difference between simple retrieval tasks and more complex tasks, suggest it is possible, but there are too many variables to be certain. Just in case, we will reuse some text here from "Accelerating LLM Inference with Staged Speculative Decoding", lightly modified and with some commentary added.

Using an LLM generally happens in two phases. First comes prefill, where the prompt is run through the model to produce the KV cache and the logits (the probability distribution over possible output tokens) of the first output. This phase is usually fast because the entire prompt can be processed in parallel. The second phase is decode: a token is chosen from the output logits and fed back into the model, which produces the logits for the next token.

This is repeated until the desired number of tokens has been generated. Because decoding must happen sequentially, streaming the weights through the compute units every time to generate a single token, the arithmetic intensity (i.e. FLOPs of compute per byte of memory bandwidth) of this second phase is very low at small batch sizes. Decoding is therefore usually the most expensive part of autoregressive generation.

That is why input tokens are so much cheaper than output tokens in OpenAI's API calls. The basic idea of speculative decoding is to use a smaller, faster draft model to decode several tokens in advance and then feed them to the oracle model as a single batch. If the draft model's predictions are right, i.e. the larger model agrees with them, several tokens can be decoded per batch, saving considerable memory bandwidth and time for every token.

If, however, the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm naturally falls back to standard token-by-token decoding. Speculative decoding may also be paired with a rejection-sampling scheme so that sampling still follows the original distribution. Note that it only helps in small-batch settings where bandwidth is the bottleneck: speculative decoding trades compute for bandwidth.
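
As a concrete illustration of the loop just described, here is a toy, greedy sketch of speculative decoding; the draft and oracle "models" are stand-in functions over a tiny vocabulary, and the rejection-sampling refinement mentioned above is deliberately omitted:

```python
# Toy speculative decoding: a cheap draft model proposes K tokens, the oracle
# keeps them only while it agrees, and falls back to its own token on the first
# disagreement. Purely illustrative; not OpenAI's implementation.
import numpy as np

VOCAB, K = 8, 4   # vocabulary size, draft tokens proposed per round

def logits_for(context, model_id):
    """Stand-in for a model forward pass: deterministic, context-dependent logits."""
    rng = np.random.default_rng((abs(hash(tuple(context))) * 31 + model_id) % (2**32))
    return rng.normal(size=VOCAB)

def draft_next(context):
    return int(np.argmax(logits_for(context, 0)))

def oracle_next(context):
    # The "oracle" mostly agrees with the draft model, but not always.
    return int(np.argmax(logits_for(context, 0) + 0.5 * logits_for(context, 1)))

def speculative_step(context):
    # 1) The draft model greedily proposes K tokens ahead.
    draft_ctx, proposed = list(context), []
    for _ in range(K):
        tok = draft_next(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)
    # 2) The oracle checks the proposals. In a real system this is one batched
    #    forward pass over all K positions; here we simply score them in turn.
    accepted, oracle_ctx = [], list(context)
    for tok in proposed:
        oracle_tok = oracle_next(oracle_ctx)
        if oracle_tok == tok:            # agreement: the draft token comes "for free"
            accepted.append(tok)
            oracle_ctx.append(tok)
        else:                            # disagreement: keep the oracle's token and
            accepted.append(oracle_tok)  # discard the rest of the draft batch
            break
    return accepted

print("tokens accepted this round:", speculative_step([0, 3, 5]))
```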

Speculative decoding is attractive as a performance optimization for two key reasons. First, it does not degrade model quality at all. Second, the gains it offers are generally orthogonal to other methods, because its performance comes from converting sequential execution into parallel execution. Current speculative methods predict a single sequence per batch, however, and that does not scale well to large batch sizes or to poorly aligned draft models.

Intuitively, the probability that the two models agree on a long contiguous sequence falls off exponentially, which means the returns of speculative decoding diminish rapidly as arithmetic intensity scales up. We believe that if OpenAI uses speculative decoding, they probably only use it for sequences of about 4 tokens. As an aside, the whole conspiracy theory that GPT-4's quality has degraded may simply be because they let the oracle model accept lower-probability sequences from the speculative decoding model.

One more note: some people speculate that Bard uses speculative decoding, because Google waits for a full sequence to finish generating before sending it to users, but we do not believe that speculation is true.

12. On Vision Multimodality

The vision multimodal capability is the least impressive part of GPT-4, at least compared with leading research.

Of course, no one has commercialized multimodal LLM research yet. It is a vision encoder separate from the text encoder, with cross-attention between them; we hear the architecture is similar to Flamingo. This adds more parameters on top of GPT-4's 1.8T. After the text-only pre-training, it was further fine-tuned on roughly another 2 trillion tokens.

For the vision model, OpenAI originally wanted to train it from scratch, but the approach was not mature enough, so they chose to start from text to de-risk it. The next model, GPT-5, will reportedly be trained on vision from scratch and will be able to generate images on its own; it will also be able to handle audio. One of the main purposes of this vision capability is to let autonomous agents read web pages and transcribe what is in images and videos.

Part of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampled frames, with Whisper run over them to get transcripts. One interesting thing about all this over-optimization for LLMs is that the cost profile of a vision model differs from that of a text model.

As we described in our "Amazon Cloud Crisis" article, the cost in a text model is very low, whereas in a vision model the IO for data loading is roughly 150 times higher: about 600 bytes per token rather than text's 4. A lot of research on image compression is under way. This matters a great deal for hardware vendors who are optimizing their hardware around LLM use cases and ratios over the next two years.
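
That 150x figure follows directly from the bytes-per-token numbers quoted above:

```python
# Data-loading IO per token: vision vs. text.
TEXT_BYTES_PER_TOKEN = 4
VISION_BYTES_PER_TOKEN = 600
print(f"vision vs. text IO per token: {VISION_BYTES_PER_TOKEN / TEXT_BYTES_PER_TOKEN:.0f}x")  # 150x
```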

They may find themselves in a world where every model has strong vision and audio capabilities, and find that their architectures are poorly suited to it. Overall, architectures will certainly evolve beyond today's simplified text-based dense and/or MoE models. This article originally appeared on Wallstreetcn (华尔街见闻).

That is one of my observations and reflections on the world; I hope it sparks some thoughts of your own. If you enjoyed it, remember to follow, bookmark, and like!
