基于大规模语料库的新闻领域新词挖掘
Approach to Mining New Words in Journalism Based on Large Corpus
【Author】 Cheng Tao^1, Shi Shuicai^1, Sun Yujie^2, Lv Xueqiang^1 (1. Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101; 2. College of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034)
【Institutions】 Chinese Information Processing Research Center, Beijing Information Science & Technology University; College of Information Science and Engineering, Dalian Polytechnic University
【摘要】 Taking the real People's Daily corpus as the processing object, this paper proposes a method for mining new words in the news domain based on a large-scale corpus. First, string frequency statistics and substring reduction are performed on a large timestamped corpus to build a historical vocabulary. The target news corpus is then compared against this historical vocabulary to generate candidate new words. Finally, the candidates are filtered according to the word-formation rules of news-domain new words and the formation features of garbage strings, yielding the new words. The algorithm was implemented in a prototype system and tested, and the results show that it is effective.
【Abstract】 An approach to mining new words in journalism based on a large-scale corpus is proposed in this article, using the People's Daily corpus as the processing object. First, string frequency statistics and statistical substring reduction are performed on a timestamped corpus to obtain a history vocabulary. Candidate new words in the target corpus are then acquired by comparison against this history vocabulary. Finally, the candidates are filtered according to word-formation rules for new words and the characteristics of garbage strings, so that the new words can be identified. The method was implemented and tested, and the results show that the proposed algorithm is promising.
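The pipeline described in the abstract (string frequency statistics, statistical substring reduction, and comparison against a history vocabulary) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function names, the `max_n`/`min_freq` parameters, and the simplified reduction rule (drop a string if a longer string containing it has the same frequency) are assumptions made for the sketch, and the final rule-based garbage-string filtering step is omitted.

```python
from collections import Counter

def ngram_counts(text, max_n=4, min_freq=3):
    """String frequency statistics: count character n-grams up to max_n,
    keeping only those that occur at least min_freq times."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: f for s, f in counts.items() if f >= min_freq}

def substring_reduction(counts):
    """Statistical substring reduction (simplified): drop a string when some
    longer kept string contains it with the same frequency, since the
    substring then carries no extra information."""
    kept = {}
    for s in sorted(counts, key=len, reverse=True):  # longest strings first
        if any(s in t and counts[t] == counts[s] for t in kept):
            continue
        kept[s] = counts[s]
    return kept

def candidate_new_words(target_text, history_vocab, max_n=4, min_freq=3):
    """Strings frequent in the target corpus but absent from the history
    vocabulary become candidate new words."""
    reduced = substring_reduction(ngram_counts(target_text, max_n, min_freq))
    return {s: f for s, f in reduced.items() if s not in history_vocab}
```

For example, in a toy corpus where "新冠病毒" repeats three times, the substrings "新冠" and "病毒" are reduced away because they never occur outside the longer string, and "新冠病毒" survives as a candidate if it is not in the history vocabulary.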
【Key words】 History Vocabulary; New Words; String Frequency Statistics; Statistical Substring Reduction;
- 【Proceedings】 Proceedings of the Third National Conference on Information Retrieval and Content Security
- 【Conference】 The Third National Conference on Information Retrieval and Content Security
- 【Date】 2007-11
- 【Venue】 Suzhou, Jiangsu, China
- 【CLC Number】 TP391.1
- 【Organizer】 Technical Committee on Information Retrieval and Content Security, Chinese Information Processing Society of China