
Automatic Extraction of Spoken Words by the Spoken Language Measurement (基于口语度的口语词语自动提取研究)


【Authors】 Hou Min, Zhang Yuqiang, He Wei, Zou Yu, Teng Yonglin

【Affiliation】 Broadcasting Media Language Research Center, National Language Resources Monitoring and Research Center, Communication University of China, Beijing 100024

【Abstract】 The biggest obstacles to automatically extracting spoken-language words are the difficulty of obtaining spoken corpora and the fuzzy boundary of what counts as a spoken word. This paper exploits the fact that broadcast (radio and television) corpora contain both written-style and spoken-style language, and proposes a model for computing a word's spoken degree (口语度). The model is based on logistic regression and uses the generality rate of a word's spatial distribution as the covariate. By measuring how differently a word is distributed across the written-style and spoken-style corpora, it quantifies the word's spoken degree and thereby supports automatic extraction of spoken words. Experiments on a corpus of about 11 million characters show a precision of 85% for words that occur in both spoken and written language and 76.5% for words that occur only in spoken language, with an average precision of 79.3%.
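The abstract does not give the model's formula, so the following Python sketch only illustrates the general idea of scoring a word with a logistic function over distribution features. Everything in it is an assumption made for illustration: the document-frequency stand-in for the "generality rate", the function names `distribution_rate`, `spoken_degree`, and `extract_spoken_words`, the coefficient values, and the 0.5 threshold. None of these come from the paper, where the coefficients would be fitted on labeled data.

```python
import math
from typing import Iterable, List, Sequence


def distribution_rate(word: str, documents: Sequence[Iterable[str]]) -> float:
    """Fraction of documents that contain `word`.

    Assumption: a simple stand-in for the paper's "generality rate of
    spatial distribution", whose exact definition is not in the abstract.
    """
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if word in set(doc))
    return hits / len(documents)


def spoken_degree(
    word: str,
    spoken_docs: Sequence[Iterable[str]],
    written_docs: Sequence[Iterable[str]],
    beta0: float = 0.0,          # hypothetical intercept
    beta_spoken: float = 4.0,    # hypothetical weight for the spoken-corpus rate
    beta_written: float = -4.0,  # hypothetical weight for the written-corpus rate
) -> float:
    """Logistic score in (0, 1); higher means more spoken-like."""
    x_spoken = distribution_rate(word, spoken_docs)
    x_written = distribution_rate(word, written_docs)
    z = beta0 + beta_spoken * x_spoken + beta_written * x_written
    return 1.0 / (1.0 + math.exp(-z))


def extract_spoken_words(
    vocabulary: Iterable[str],
    spoken_docs: Sequence[Iterable[str]],
    written_docs: Sequence[Iterable[str]],
    threshold: float = 0.5,      # hypothetical cut-off
) -> List[str]:
    """Keep the words whose spoken degree is strictly above the threshold."""
    return [
        w for w in vocabulary
        if spoken_degree(w, spoken_docs, written_docs) > threshold
    ]
```

With these illustrative coefficients, a word spread evenly across both corpora scores exactly 0.5 and is not extracted, while a word found only in the spoken-style documents scores close to 1. In practice the weights would be estimated by fitting a logistic regression to words already labeled as spoken or written.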

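  • 【Proceedings】 Frontier Advances in Computational Linguistics Research in China (2007-2009) (中国计算机语言学研究前沿进展)
  • 【Conference】 The 10th China National Conference on Computational Linguistics (第十届全国计算语言学学术会议)
  • 【Conference date】 2009-07-24
  • 【Conference location】 Yantai, Shandong, China
  • 【Classification number】 H03
  • 【Organizer】 Chinese Information Processing Society of China (中国中文信息学会)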