Disease Characterization Extraction Method Based on Text Word Feature Weighting LDA
【Abstract】 Extracting disease characterization words from electronic medical record (EMR) text, whose structure is complex and varied, is a key step in EMR text research and application. The LDA (latent Dirichlet allocation) model can extract information from text effectively, but standard LDA and its improved variants are poorly targeted and have low precision when extracting disease characterization words. We therefore propose the FW-LDA (feature weighting LDA) model, which targets the data characteristics of Chinese EMR text: it introduces word feature weighting on top of the standard LDA model to reduce the co-occurrence frequency of task-irrelevant words and thereby extract disease characterization words in a targeted way. By analyzing the characteristics of cardiovascular disease data, we derive weighting formulas for part-of-speech, word-length, and word-meaning features, build the corresponding task-focused and non-task-focused external semantic lexicons, and verify experimentally how strongly word feature weighting affects the extraction task. Compared with the LDA model, FW-LDA shows significantly improved topic coherence when the number of topics is below 30, and over the topic-number range [5, 65] its average precision for disease characterization word extraction improves by 48.5%.
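The abstract names three word features (part of speech, word length, word meaning) but does not give the weighting formulas, so the following is only a minimal sketch of how such weights might be combined into weighted term counts before topic modeling. Every name and number here (`POS_WEIGHT`, `length_weight`, `semantic_weight`, the `boost` and `damp` factors, the multiplicative combination) is a hypothetical placeholder, not the formulation used in the paper.

```python
from collections import Counter

# Hypothetical part-of-speech weights (n = noun, v = verb, a = adjective).
# The paper's actual values are not given in the abstract.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}
DEFAULT_POS_WEIGHT = 0.1


def length_weight(word, cap=4):
    """Assumed rule: weight grows with word length, saturating at `cap` characters."""
    return min(len(word), cap) / cap


def semantic_weight(word, task_lexicon, non_task_lexicon, boost=2.0, damp=0.2):
    """Assumed rule: boost words in the task-focused lexicon, damp non-task words."""
    if word in task_lexicon:
        return boost
    if word in non_task_lexicon:
        return damp
    return 1.0


def word_weight(word, pos, task_lexicon, non_task_lexicon):
    """Combine the three feature weights multiplicatively (an assumption)."""
    return (
        POS_WEIGHT.get(pos, DEFAULT_POS_WEIGHT)
        * length_weight(word)
        * semantic_weight(word, task_lexicon, non_task_lexicon)
    )


def weighted_counts(tagged_tokens, task_lexicon, non_task_lexicon):
    """Turn (word, pos) pairs into weighted pseudo-counts for one document."""
    counts = Counter()
    for word, pos in tagged_tokens:
        counts[word] += word_weight(word, pos, task_lexicon, non_task_lexicon)
    return dict(counts)


# Toy document: "chest pain" (task word), "patient" (non-task word), "appear".
tokens = [("胸痛", "n"), ("患者", "n"), ("出现", "v"), ("胸痛", "n")]
weights = weighted_counts(tokens, task_lexicon={"胸痛"}, non_task_lexicon={"患者"})
```

In a sketch like this, the weighted pseudo-counts would stand in for raw term frequencies when fitting the topic model, so that task-relevant words contribute more to co-occurrence statistics and task-irrelevant words contribute less.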
【Keywords】 electronic medical record; disease characterization; word features; weight; LDA model
- 【Source】 Computer Technology and Development (计算机技术与发展), 2022, Issue 05
- 【Classification No.】 R319; TP391.1
- 【Downloads】 164