Research on Distributed Storage Management Method for Manufacturing Big Data
【Author】 Wang Min;
【Advisor】 Peng Zhiyong;
【Author information】 Wuhan University, Computer Software and Theory, 2017, Master's thesis
【Abstract】 The rise of Germany's Industry 4.0 has brought the manufacturing industry into the big data era. Throughout a product's life cycle, from design and manufacturing to maintenance, large volumes of structured, semi-structured, and unstructured data are produced; these data are multimodal, high-throughput, and strongly correlated. As a key element of the new generation of information technology, manufacturing big data is gradually becoming the core of the industrial revolution and an important factor in realizing intelligent production, so how to store and manage it has become an active research topic. Big data is generally managed with distributed storage. Although many distributed storage schemes and some industrial big data management platforms already exist, applying them to manufacturing big data has several shortcomings: (1) data management is scattered, so information sharing requires frequent communication between personnel; (2) support for managing complex association relationships is insufficient; (3) existing management systems are general-purpose and therefore lack support for the characteristics specific to manufacturing big data. To store and manage manufacturing big data efficiently and address these shortcomings, this thesis designs and implements a distributed storage engine for manufacturing big data. It uses the Object Deputy Database to store and manage metadata, uses HDFS as the file system for distributed data storage, and optimizes small-file storage and the replica mechanism according to the metadata and the association relationships between data. The main work is as follows: (1) Using the deputy relationship between source classes and deputy classes, we propose a metadata management method for unstructured manufacturing data based on the object deputy model, which models the metadata, the composition and constraint relations between entities, and the correspondence between entities and data. (2) Manufacturing big data contains a massive number of small files, and HDFS suffers from problems such as wasted storage space when storing them, so we optimize small-file storage by clustering files according to both the association relationships between them and the storage space utilization after merging, and storing each group of small files as a single cluster file. (3) Because access to manufacturing data is time-sensitive, we improve the HDFS replica management mechanism: based on a file's historical access frequency and the system's storage space usage, we compute the file's current replica requirement and adjust the number of replicas dynamically; when a replica must be added, we select the best storage node according to the working state of the nodes, the network overhead of copying the replica, and the read efficiency of the relevant users. Finally, we deploy the storage engine in a real environment and verify the scheme in terms of both function and performance. The experimental results show that the functionality is correct and complete and that the proposed method is effective in performance, significantly improving the read efficiency of the system.
【Key words】 Manufacturing Big Data; Distributed Storage; Object Deputy Model; Association; Storage Optimization;
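The abstract describes the small-file optimization only at a high level and does not give the clustering algorithm. The following Python sketch illustrates one possible reading under stated assumptions: files that share a relation key are grouped first, and each group is then packed into cluster files with a first-fit-decreasing heuristic so that each cluster file stays within a capacity limit. The names `SmallFile`, `entity`, and `cluster_small_files`, the choice of first-fit decreasing, and the use of 128 MB (the default HDFS block size in Hadoop 2.x and later) as the capacity are all illustrative assumptions, not the thesis's actual method.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed capacity of one cluster file: the default HDFS block size (128 MB).
BLOCK_SIZE = 128 * 1024 * 1024


@dataclass(frozen=True)
class SmallFile:
    name: str
    size: int    # size in bytes
    entity: str  # hypothetical relation key, e.g. the part the file belongs to


def cluster_small_files(files, capacity=BLOCK_SIZE):
    """Group files by relation key, then pack each group first-fit decreasing.

    Returns a list of (entity, used_bytes, members) tuples, one per cluster file.
    A file larger than `capacity` ends up alone in its own cluster file.
    """
    by_entity = defaultdict(list)
    for f in files:
        by_entity[f.entity].append(f)

    clusters = []
    for entity in sorted(by_entity):
        bins = []  # each bin is [used_bytes, [member files]]
        # Placing the largest files first tends to leave less unused space.
        for f in sorted(by_entity[entity], key=lambda x: x.size, reverse=True):
            for b in bins:
                if b[0] + f.size <= capacity:
                    b[0] += f.size
                    b[1].append(f)
                    break
            else:
                # No existing cluster file has room: open a new one.
                bins.append([f.size, [f]])
        clusters.extend((entity, used, members) for used, members in bins)
    return clusters


if __name__ == "__main__":
    demo = [
        SmallFile("a.stp", 60, "gear"),
        SmallFile("b.log", 50, "gear"),
        SmallFile("c.png", 40, "gear"),
        SmallFile("d.stp", 30, "shaft"),
    ]
    for entity, used, members in cluster_small_files(demo, capacity=100):
        print(entity, used, [m.name for m in members])
```

Grouping before packing keeps related files in the same cluster file, at the cost of some unused space compared with packing all files together; the thesis weighs these two factors jointly, which this sketch only approximates.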