An Eight-word-position Tag for Tibetan Word Segmentation via BiLSTM_CRF
【Abstract】 Tibetan word segmentation is a fundamental task in Tibetan natural language processing, and its quality affects downstream applications such as automatic summarization, text classification, and search engines. Tibetan word segmentation methods based on word-position tagging usually use a four-word-position tag set. To extract feature information more comprehensively and capture deeper semantic information, this paper proposes an eight-word-position tag set and combines it with a BiLSTM_CRF model, yielding a BiLSTM_CRF Tibetan word segmentation method based on eight-word-position tags. Experimental results show that the method segments well, reaching 95.07% precision, 95.57% recall, and a 95.32% F1 score on the test dataset.
【Key words】 NLP; Tibetan word segmentation; BiLSTM_CRF; eight-word-position tag
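For readers unfamiliar with word-position tagging, the sketch below illustrates the conventional four-tag baseline that the abstract mentions, using the widely used B/M/E/S labels (begin, middle, end, single). It is only an illustration under stated assumptions: the abstract does not define the eight-word-position tag set, so it is not reproduced here; the function names and the placeholder syllables `s1` to `s6` are hypothetical; and treating the syllable as the tagging unit is an assumption. The last helper computes F1 as the harmonic mean of precision and recall, which is consistent with the three figures reported above.

```python
from typing import List


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Map segmented words (each a list of units) to B/M/E/S position tags."""
    tags: List[str] = []
    for units in words:
        n = len(units)
        if n == 0:
            continue  # skip empty words rather than emit a tag for nothing
        if n == 1:
            tags.append("S")  # single-unit word
        else:
            tags.extend(["B"] + ["M"] * (n - 2) + ["E"])
    return tags


def tags_to_words(units: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from per-unit tags; a word closes on an E or S tag."""
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    words: List[List[str]] = []
    current: List[str] = []
    for unit, tag in zip(units, tags):
        current.append(unit)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical placeholder units standing in for Tibetan syllables.
gold = [["s1", "s2", "s3"], ["s4"], ["s5", "s6"]]
tags = words_to_tags(gold)
flat = [u for w in gold for u in w]
restored = tags_to_words(flat, tags)
reported_f1 = round(f1_score(95.07, 95.57), 2)
```

In a tagging-based segmenter, a sequence model such as BiLSTM_CRF would predict the tag sequence, and a decoder like `tags_to_words` would turn the predictions back into words.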
- 【Source】 Journal of Chinese Information Processing (中文信息学报), 2024, Issue 10
- 【Classification No.】 TP391.1