Research and Implementation of Chinese SMS Content Filtering Technology on Mobile Platforms

【Author】 Chen Xin (陈欣)

【Advisor】 Liu Naiqi (刘乃琦)

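【Author Information】 University of Electronic Science and Technology of China, Information and Communication Engineering, 2008, Master's thesis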

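【Abstract (translated from the Chinese)】 SMS filtering technology for Chinese is urgently needed in the Chinese mobile market. At present, Chinese SMS filtering on mobile platforms relies mainly on blacklist filtering and keyword filtering. This thesis presents a new filtering technique that differs from the current mainstream approaches: a content filtering technique that is easy to implement on mobile devices, draws on the content features of Chinese short messages, and is built on rule-base filtering. It improves filtering precision and junk-message recall while lowering the rate at which normal messages are misclassified. SMS content filtering is a form of text classification. Among the many text classification techniques in wide use, maximum entropy and decision trees are applied extensively to content filtering as representative statistics-based and rule-based algorithms, respectively. This thesis compares both experimentally with the proposed lightweight rule-base content filtering technique in order to verify whether that technique meets practical requirements. The proposed technique consists of two parts. The first part is rule matching, the first stage of SMS content filtering. Keyword rule matching is the core of this stage and requires a Chinese multi-pattern string matching algorithm. The classic string matching algorithms known internationally, including multi-pattern algorithms such as AC and WM, were designed for English strings. This thesis proposes UIAC, a multi-pattern string matching algorithm for Chinese. Alongside UIAC, other rule matching methods are used, based on features such as message length, punctuation in the text, phone numbers, and URLs. This stage also performs processing such as Chinese encoding conversion on the handset platform. Its output is an intermediate vector file. The second part is filtering, the second stage of SMS filtering. This thesis proposes a lightweight rule-base filtering algorithm that is better suited than the two classic algorithms, maximum entropy and decision tree, to implementation on resource-constrained mobile devices. For comparison, the experimental rule-matching stage produced, in addition to the intermediate vector file for lightweight rule-base filtering, intermediate vector files for maximum entropy and for the decision tree, which were processed with a maximum entropy model and a decision tree model, respectively. The precision, recall, and normal-message false alarm rate of the lightweight rule base were then compared with those of the other two algorithms. The experiments used 1,000 messages, 500 normal and 500 junk. The lightweight rule base, maximum entropy, and decision tree were each tested, and the results of the three algorithms were compared. The results show that the lightweight rule base performs close to the other two methods, improves substantially on the normal-message false alarm rate, and is easier to implement on a handset platform.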
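
The rule-matching stage described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's UIAC algorithm, whose details are not given here; it substitutes a plain AC (Aho-Corasick) automaton that walks Unicode characters rather than bytes, and combines keyword hits with the surface features the abstract lists (length, punctuation, phone numbers, URLs). The names `KeywordMatcher` and `extract_features`, the punctuation set, the regular expressions, and the dictionary layout of the feature vector are all assumptions made for illustration.

```python
import re
from collections import deque


class KeywordMatcher:
    """Plain Aho-Corasick automaton over Unicode characters (not UIAC)."""

    def __init__(self, keywords):
        # Per node: outgoing edges, failure link, keywords ending at the node.
        self.edges = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in keywords:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.edges[node]:
                self.edges.append({})
                self.fail.append(0)
                self.out.append([])
                self.edges[node][ch] = len(self.edges) - 1
            node = self.edges[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first order; depth-1 nodes keep their failure link at root.
        queue = deque(self.edges[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.edges[node].items():
                f = self.fail[node]
                while f and ch not in self.edges[f]:
                    f = self.fail[f]
                self.fail[child] = self.edges[f].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Return (start_index, keyword) for every keyword occurrence."""
        hits, node = [], 0
        for i, ch in enumerate(text):
            while node and ch not in self.edges[node]:
                node = self.fail[node]
            node = self.edges[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits


# Hypothetical surface-feature rules; the thesis's actual rules are not given.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\d{7,}")  # assumption: any run of 7 or more digits
PUNCT = set("!！?？,，.。;；:：、~…")


def extract_features(text, matcher):
    """Build one feature record (a stand-in for the intermediate vector)."""
    return {
        "length": len(text),
        "keyword_hits": len(matcher.find(text)),
        "punctuation": sum(ch in PUNCT for ch in text),
        "has_phone": bool(PHONE_RE.search(text)),
        "has_url": bool(URL_RE.search(text)),
    }
```

Because Python strings are sequences of Unicode code points, each Chinese character is a single transition in the automaton, which sidesteps the byte-level misalignment that can occur when an English-oriented matcher runs over multi-byte encodings. Calling `extract_features(message, KeywordMatcher(keyword_list))` on each message would yield one record per message for a downstream filter.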

【Abstract】 Chinese-oriented SMS filtering technology is needed in today's Chinese mobile market. Mature SMS filtering technologies exist for English, but Chinese SMS filtering still relies mainly on blacklist and keyword filtering. The system proposed in this thesis implements Simple Rules Filtering, which combines SMS content features to improve precision and recall and to reduce the normal SMS false alarm rate. SMS content filtering is a type of text categorization. Two of the most widely used techniques applied to content filtering are Maximum Entropy and Decision Tree, and this thesis uses both in a comparative filtering test against the newly introduced Chinese SMS content filtering technique. The technique is divided into two parts. The first part, Rules Matching, is the first phase of SMS content filtering. Its most important component is Keyword Rule Matching, which requires a Chinese multi-pattern matching algorithm. However, classic algorithms such as AC and WM were designed for English content, so this thesis introduces UIAC, a new Chinese-oriented multi-pattern matching algorithm. Together with UIAC, other rules are used to extract content features of Chinese SMS: message length, phone numbers, punctuation, URLs, and so on. Chinese encoding conversion is also performed in this phase. The output of this phase is an intermediate vector file. The second part, filtering, is the second phase of SMS filtering. This thesis introduces the Simple Rules Filtering algorithm, which is easier to implement on resource-limited mobile platforms than Maximum Entropy and Decision Tree. For comparison, the Rules Matching phase of the test produces three intermediate vector files: a Simple Rules Filtering vector file, a Maximum Entropy vector file, and a Decision Tree vector file. The last two are processed by a Maximum Entropy model and a Decision Tree model, respectively. The precision, recall, and normal SMS false alarm rate of the three algorithms are then compared. The test uses 1,000 short messages, 500 normal and 500 junk, as input to all three algorithms. The results show that the Simple Rules algorithm performs close to the other two algorithms, with an advantage in normal SMS false alarm rate and in ease of implementation.
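
The three measures used in the comparison can be made concrete with a short Python sketch. The abstract does not define them, so the sketch assumes the standard confusion-matrix definitions with junk as the positive class: precision = TP / (TP + FP), recall = TP / (TP + FN), and normal SMS false alarm rate = FP / (FP + TN). The function name `evaluate` and the boolean label encoding are hypothetical, and the toy data in the usage note is invented purely for illustration; it is not taken from the thesis's experiments.

```python
def evaluate(labels, predictions):
    """Compute precision, recall, and normal-message false alarm rate.

    labels, predictions: equal-length sequences of booleans,
    where True means "junk" (assumed encoding).
    """
    tp = fp = tn = fn = 0
    for is_junk, flagged in zip(labels, predictions):
        if is_junk and flagged:
            tp += 1          # junk correctly filtered
        elif is_junk:
            fn += 1          # junk that slipped through
        elif flagged:
            fp += 1          # normal message wrongly filtered
        else:
            tn += 1          # normal message correctly delivered
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

For example, with three junk and two normal messages, `evaluate([True, True, True, False, False], [True, True, False, True, False])` counts two junk messages caught, one missed, and one normal message wrongly flagged, giving precision 2/3, recall 2/3, and a false alarm rate of 1/2. On a balanced test set such as the 500/500 split described above, the false alarm rate is simply the number of wrongly flagged normal messages divided by 500.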
