Application of distributed techniques in large language model training and inference

【Author】 ZHENG Weimin

【Institution】 Department of Computer Science and Technology, Tsinghua University

【Abstract】 In recent years, artificial intelligence has been widely applied in many fields, and the "pre-training and fine-tuning" of large language models (LLMs) has become its latest paradigm. Distributed techniques are present at every stage of the LLM lifecycle and support its development. In the data acquisition stage, the file system SuperFS was developed to address the storage of massive numbers of small files while meeting the requirements of low latency and scalability. In the data preprocessing stage, the efficient big-data processing engine Chukonu was developed to address the high overhead of reading data from distributed file systems. In the model training stage, a distributed checkpoint strategy was proposed to address the poor read/write performance of checkpoint files, greatly improving their read and write speed. In the model inference stage, the high-throughput inference scheme FastDecode and the LLM inference architecture Mooncake were developed to address the challenges that the KVCache poses to storage systems. These applications of distributed techniques enable LLMs to fully utilize computing resources, accelerate training, and benefit the development of the field of artificial intelligence.
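The abstract's distributed checkpoint strategy can be illustrated with a minimal sketch: instead of gathering the full model state on one worker and writing a single huge file, each worker persists only its own parameter shard, so writes proceed in parallel and aggregate bandwidth scales with the number of workers. The function names (`save_sharded`, `load_sharded`) and the thread-based parallelism are illustrative assumptions, not details from the paper.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_shard(ckpt_dir, rank, shard):
    # Each rank writes its own shard file independently of the others.
    path = os.path.join(ckpt_dir, f"shard_{rank}.pkl")
    with open(path, "wb") as f:
        pickle.dump(shard, f)
    return path

def save_sharded(ckpt_dir, shards):
    # All shards are written concurrently rather than serialized through
    # a single writer, which is the core of the distributed strategy.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return list(pool.map(lambda rs: save_shard(ckpt_dir, *rs),
                             enumerate(shards)))

def load_sharded(paths):
    # Restore by reading every shard back in rank order.
    state = []
    for path in sorted(paths):
        with open(path, "rb") as f:
            state.append(pickle.load(f))
    return state

if __name__ == "__main__":
    shards = [{"layer": i, "weights": [float(i)] * 4} for i in range(4)]
    with tempfile.TemporaryDirectory() as d:
        paths = save_sharded(d, shards)
        restored = load_sharded(paths)
    assert restored == shards
```

In a real training job each shard would be a partition of the optimizer and model state held by one data- or model-parallel rank; the same pattern applies, with local or parallel file-system paths replacing the temporary directory.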
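A back-of-envelope calculation shows why the KVCache challenges the storage system, motivating schemes like FastDecode and Mooncake. Assuming a LLaMA-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16) — illustrative numbers, not figures from the paper — the cache grows linearly with context length:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # Factor of 2 accounts for storing both the key and the value tensors
    # at every layer for every cached token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

per_token = kv_cache_bytes(32, 32, 128, 1)
assert per_token == 512 * 1024  # 0.5 MiB of cache per generated token

full_ctx = kv_cache_bytes(32, 32, 128, 4096)
assert full_ctx == 2 ** 31      # 2 GiB for a single 4K-token request
```

At 2 GiB per 4K-token request, a few dozen concurrent requests exhaust a GPU's memory, which is why offloading and disaggregating the KVCache across a distributed storage tier becomes attractive.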

【Fund】 National Natural Science Foundation of China (No.U23A6007)
  • 【CLC Number】 TP311.13;TP18
  • 【Downloads】 131