Research on a Data Service Platform for Large Language Model Applications

【Authors】 JU Weigang; WANG Peng; WANG Jia

【Affiliations】 ZTE Corporation; Southeast University

【Abstract】 The effectiveness of large language model applications depends on high-quality data. When training datasets and retrieval-augmented knowledge are built from raw corpora, end-to-end data management and processing become critically important. Current data services suffer from poor data processing quality, which degrades the performance of large language model applications, as well as from inefficient data preparation and high implementation complexity and cost. To address these issues, the article proposes a collaborative data service scheme for large language models that coordinates the processing of raw corpora, datasets, and knowledge. Building on automated processing driven by visual operator orchestration and on a unified cross-platform computing scheduling framework, the authors design and implement an end-to-end data service platform that meets the differing data requirements of various large language model applications. The platform improves data quality, processing efficiency, and flexibility, reduces cost, and significantly enhances the effectiveness of large model applications, showing strong generality and broad application prospects.
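
The abstract does not describe how the platform composes its operators, so the following is only a minimal sketch of the general idea of operator orchestration: small, independent processing steps chained into a pipeline over raw text records. Every name here (`Pipeline`, `then`, `strip_whitespace`, `drop_empty`, `deduplicate`) is hypothetical and chosen for illustration; none of it is taken from the paper or from the platform it describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

# Hypothetical operator type: a function from a stream of records to a stream.
Operator = Callable[[Iterable[str]], Iterable[str]]


def strip_whitespace(records: Iterable[str]) -> Iterator[str]:
    """Collapse runs of whitespace and trim both ends of each record."""
    for record in records:
        yield " ".join(record.split())


def drop_empty(records: Iterable[str]) -> Iterator[str]:
    """Discard records that are empty strings."""
    for record in records:
        if record:
            yield record


def deduplicate(records: Iterable[str]) -> Iterator[str]:
    """Keep only the first occurrence of each record, preserving order."""
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record


@dataclass
class Pipeline:
    """An ordered chain of operators (illustrative, not the paper's design)."""
    operators: List[Operator] = field(default_factory=list)

    def then(self, op: Operator) -> "Pipeline":
        # Return a new pipeline so existing pipelines are never mutated.
        return Pipeline(self.operators + [op])

    def run(self, records: Iterable[str]) -> List[str]:
        # Feed the output of each operator into the next one.
        for op in self.operators:
            records = op(records)
        return list(records)


raw_corpus = ["  hello   world ", "", "hello world", "data  service"]
pipeline = Pipeline().then(strip_whitespace).then(drop_empty).then(deduplicate)
cleaned = pipeline.run(raw_corpus)
print(cleaned)
```

Because each operator only consumes and produces a stream of records, operators can be reordered or swapped without touching the others, which is the property a visual orchestration tool would rely on.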

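【Funding】 National Natural Science Foundation of China; project title: Research on Several Key Issues in Continual Knowledge Extraction; grant number: 62376057
【Source】 无线互联科技 (Wireless Internet Science and Technology), 2025, Issue 02
【Classification No.】 TP311.13; TP18
【Downloads】 38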