基于会话的海量短信息过滤技术研究
Study on Session-Based Filtering of Massive Short Messages
【Author】 Zhang Peng (张鹏);
【Advisor】 Li Xiaoguang (李晓光);
【Author information】 Liaoning University, Computer Software and Theory, 2011, Master's
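【Abstract (translated from the Chinese)】 As short messages are used ever more widely, the corresponding filtering techniques will inevitably receive growing attention. Short message filtering must quickly identify the target messages so that they can be processed further. Most current short message filtering targets spam, with the main aim of blocking illegal or meaningless messages. Improving efficiency and accuracy has always been the goal of information filtering research; current work on short message filtering concentrates on accuracy, and little of it addresses efficiency. This thesis studies the short message filtering process and the characteristics of short message data in depth, and on that basis does the following work. In contrast to current filtering systems, which process short messages one at a time, it proposes a session-based short message filtering technique: the message set is partitioned into session sets so that multiple content-related messages are grouped together, and a feature vector is extracted from each session set. The thesis gives the basis for session set partitioning, the partitioning algorithm, and the feature extraction algorithm. To filter quickly, it builds an index over the keywords of the session set feature vectors and filters through that index. The index has two levels: the first level is an inverted index that maps feature-vector keywords to session set identifiers, so that a keyword leads directly to session sets; the second level is a forward index from session sets to short messages, used to locate the individual messages of a session set. The thesis gives the index structure and its construction algorithm. For the problem of updating the user template vector during filtering, it proposes a method for updating and maintaining the user template based on feedback from the filtering results, together with methods for discovering new features and retiring old ones, and it gives an evaluation method for feature words along with the structures and algorithms used to maintain them. Finally, experiments on the proposed methods verify the performance of the word-index-based short message filtering method and the user template update method.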
【Abstract】 As short messages become more widely used, the corresponding filtering technology is receiving more and more attention. Short message filtering must quickly identify the target messages within a massive message set so that they can be processed further. Most current short message filtering methods are aimed at spam removal; their main purpose is to block illegal and meaningless messages. Improving the efficiency and accuracy of filtering has always been the objective of this research, but current work on short message filtering focuses on accuracy, with little study of how to improve efficiency. This thesis examines the short message filtering process and the characteristics of short message data in depth, and on that basis does the following work. Current filtering methods are mainly based on keyword matching and handle messages one by one. To improve filtering efficiency, we analyze the relations between messages, propose dividing the message set into sessions, and extract a feature vector for each session. The thesis gives the basis and method for session division and the feature extraction process. To filter quickly, we use the keywords of the session feature vectors to build an index and filter with it. The index consists of two sub-indexes: the first, an inverted index, is composed of keywords and session IDs; the second, a forward index, is composed of sessions and message IDs. The thesis gives the index structure and its construction algorithm. To solve the problem of updating the user template, the thesis proposes a method for updating and maintaining the user template based on feedback from the filtering results, together with methods for discovering new features and discarding old ones. It also gives an evaluation method for feature words and the structures and algorithms for maintaining them. Finally, we report experiments that test the proposed methods and validate the performance of the word-index-based short message filtering method and the user template updating method. The proposed method effectively improves the response speed of the filter and can also update the keywords in the user template vector.
【Key words】 short message filtering; session division; feature extraction; fast-filtering model; feedback learning;
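
The two-level index described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's construction algorithm, which the abstract does not spell out; the class name `TwoLevelIndex`, the method names, the `min_hits` threshold, and the sample data are all hypothetical. It only assumes what the abstract states: a first-level inverted index from keywords to session identifiers, and a second-level forward index from sessions to message identifiers.

```python
from collections import defaultdict


class TwoLevelIndex:
    """Hypothetical sketch of a keyword -> session -> message index."""

    def __init__(self):
        # Level 1 (inverted): feature keyword -> set of session ids.
        self.keyword_to_sessions = defaultdict(set)
        # Level 2 (forward): session id -> list of message ids.
        self.session_to_messages = {}

    def add_session(self, session_id, keywords, message_ids):
        """Register one session with its feature keywords and messages."""
        for keyword in keywords:
            self.keyword_to_sessions[keyword].add(session_id)
        self.session_to_messages[session_id] = list(message_ids)

    def filter(self, template_keywords, min_hits=1):
        """Return message ids of sessions sharing at least `min_hits`
        keywords with the user template (an assumed matching rule)."""
        hits = defaultdict(int)
        for keyword in set(template_keywords):
            for session_id in self.keyword_to_sessions.get(keyword, ()):
                hits[session_id] += 1
        matched = []
        for session_id in sorted(hits):
            if hits[session_id] >= min_hits:
                matched.extend(self.session_to_messages[session_id])
        return matched


# Illustrative data only; not taken from the thesis.
index = TwoLevelIndex()
index.add_session("s1", {"loan", "rate"}, [101, 102, 103])
index.add_session("s2", {"dinner", "tonight"}, [201, 202])
index.add_session("s3", {"loan", "prize"}, [301])

loose = index.filter({"loan", "prize"})
strict = index.filter({"loan", "prize"}, min_hits=2)
```

Because matching happens once per session rather than once per message, a template keyword touches only the sessions listed under it in the first level, and individual messages are read only for sessions that pass the threshold.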