基于Web挖掘的中医知识发现研究
On the Research of TCM Knowledge Discovery System Based on Web Mining
【Author】 赵翔 (Zhao Xiang);
【Advisor】 于剑 (Yu Jian);
【Author information】 Beijing Jiaotong University, Computer Science and Technology, 2010, Master's
【Abstract】 Traditional Chinese medicine (TCM) is a traditional branch of the life sciences with distinctly Chinese characteristics. Over more than 2,500 years of practice it has developed distinctive approaches to disease diagnosis, treatment, and the use of prescriptions and herbs, shown notable clinical efficacy, and accumulated a large body of knowledge and data. The Internet holds rich medical information resources whose total volume is still growing rapidly, so extracting useful medical knowledge from this mass of data matters for TCM informatization and for clinical diagnosis and treatment. Web mining is an effective way to address this problem: it applies the theory and methods of data mining to discover latent, valuable knowledge in large collections of semi-structured Web documents, and it has become an important research direction in recent years. With the aim of supporting TCM informatization and clinical research, this thesis uses Web page classification and information extraction to design and implement a Web-mining-based TCM knowledge discovery system. The main research contents are as follows: (1) Web page classification. Automatic classification of Chinese Web pages has become a focus of Web mining research; its techniques include text representation, term weighting, feature selection, and classification algorithms. This thesis represents text with Chinese-character features and classifies pages with a maximum entropy classifier in order to obtain pages related to medicine. (2) Named entity recognition, a key technique in information extraction that plays an important role in information retrieval, machine translation, and automatic summarization. This thesis introduces three statistical named entity recognition methods, discusses how the conditional random field (CRF) model differs from the other models, and uses a CRF to extract disease names from Web pages. (3) Implementation of the key modules of the system: Web page collection, Web page preprocessing, Web page classification, medical term recognition, and relation building.
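Item (1) of the abstract pairs character-level features with a maximum entropy classifier. The sketch below illustrates that general idea only and is not the thesis's implementation: the feature set (character unigrams and bigrams), the toy training sentences, the labels `medical` and `other`, the function names, and the SGD hyperparameters are all assumptions made for illustration. A maximum entropy classifier has the same form as multinomial logistic regression, which is what the code trains.

```python
import math
from collections import defaultdict

def char_features(text):
    """Character unigram and bigram counts; no word segmentation is needed."""
    chars = [c for c in text if not c.isspace()]
    feats = defaultdict(float)
    for c in chars:
        feats["u:" + c] += 1.0
    for a, b in zip(chars, chars[1:]):
        feats["b:" + a + b] += 1.0
    return feats

def predict_proba(weights, labels, feats):
    """Softmax over per-label linear scores (the maximum entropy model form)."""
    scores = {y: sum(weights[y].get(f, 0.0) * v for f, v in feats.items())
              for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(samples, epochs=100, lr=0.05, l2=0.01):
    """Fit weights by stochastic gradient ascent on the L2-penalized log-likelihood."""
    labels = sorted({y for _, y in samples})
    weights = {y: defaultdict(float) for y in labels}
    data = [(char_features(t), y) for t, y in samples]
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Gradient per feature: (observed - expected) * feature value.
                err = (1.0 if y == gold else 0.0) - probs[y]
                for f, v in feats.items():
                    weights[y][f] += lr * (err * v - l2 * weights[y][f])
    return weights, labels

def classify(weights, labels, text):
    probs = predict_proba(weights, labels, char_features(text))
    return max(probs, key=probs.get)

# Hypothetical toy data, invented for this sketch (not from the thesis).
TRAIN = [
    ("感冒发热咳嗽的中医治疗", "medical"),
    ("黄芪当归入药可补气血", "medical"),
    ("失眠多梦可用针灸调理", "medical"),
    ("股市今日大幅上涨", "other"),
    ("球队昨晚赢得比赛", "other"),
    ("新款手机下月发布", "other"),
]

weights, labels = train(TRAIN)
print(classify(weights, labels, "咳嗽发热如何治疗"))
print(classify(weights, labels, "手机股市行情"))
```

Character features sidestep Chinese word segmentation errors at the cost of a larger, noisier feature space, which is one plausible reason to prefer them for Web text; the record does not state the author's reasoning.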
【Key words】 TCM Knowledge Discovery; Web Page Classification; Information Extraction; Correlation Analysis; Mutual Information;
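The keywords list correlation analysis and mutual information, but this record does not say how the relation building module uses them. One common reading, offered here purely as an assumption, is to score disease and herb terms that co-occur on the same page with pointwise mutual information (PMI). The function name `pmi_table`, the term lists, and the toy pages below are hypothetical; the sentences are invented for illustration and are not clinical claims.

```python
import math
from collections import Counter
from itertools import product

def pmi_table(pages, diseases, herbs):
    """PMI for each (disease, herb) pair that co-occurs on at least one page.

    Probabilities are estimated as the fraction of pages containing a term
    (or both terms of a pair); substring matching stands in for the
    term recognition step.
    """
    n = len(pages)
    single = Counter()
    joint = Counter()
    for text in pages:
        found_d = {d for d in diseases if d in text}
        found_h = {h for h in herbs if h in text}
        single.update(found_d | found_h)
        joint.update(product(found_d, found_h))
    scores = {}
    for (d, h), c in joint.items():
        p_joint = c / n
        p_d, p_h = single[d] / n, single[h] / n
        scores[(d, h)] = math.log2(p_joint / (p_d * p_h))
    return scores

# Hypothetical toy corpus, invented for this sketch.
PAGES = [
    "感冒初起可用桂枝",
    "感冒发热常配桂枝",
    "失眠多梦可用酸枣仁",
    "失眠日久亦用酸枣仁",
    "感冒兼失眠者酌加酸枣仁",
    "桂枝性温",
]
DISEASES = ["感冒", "失眠"]
HERBS = ["桂枝", "酸枣仁"]

scores = pmi_table(PAGES, DISEASES, HERBS)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

A positive score means the pair appears together more often than independent occurrence would predict; pairs that never co-occur are simply absent from the result rather than assigned negative infinity.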
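- 【Online publication contributor】 Beijing Jiaotong University 【Online publication year/issue】 2011, Issue 05
- 【Classification number】 TP393.09
- 【Citation count】 5
- 【Download count】 314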