Research on Mobile Customer Churn Based on Data Mining
【Author】 Liu Guangyuan (刘光远);
【Advisor】 Yuan Senmiao (苑森淼);
【Author information】 Jilin University, Communication and Information Systems, 2007, PhD
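【Summary】 As the telecom market gradually opens up, competition among telecom operators is intensifying, and the customer churn caused by that competition is becoming a leading factor affecting operators' business performance. Telecom operators have domestically leading data warehouse systems, which make churn analysis based on data mining feasible. As an important component of the business analysis system, a churn prediction system builds a churn prediction model so that the operator can identify customers who are about to churn and take timely measures to reduce churn. Churn prediction research is therefore extremely important for telecom operators seeking to lower operating costs and improve business performance. Based on the current situation of telecom operators, this dissertation comprehensively analyzes data mining technology and its application in telecom operators, proposes churn prediction algorithms based on evolutionary computation and sequential patterns, and builds churn prediction analysis models. The main research contents are: 1. To address problems in existing related research, solutions are given to the main issues that arise when applying data mining to churn prediction, including discretization of continuous data and attribute selection; 2. Building on an analysis of why evolutionary computation suits optimization problems, a churn prediction algorithm based on evolutionary computation is proposed, a churn analysis model based on evolutionary computation is built, and comparative experiments are carried out; 3. For customers' historical data and short-term occasional data, sequential pattern mining is combined with decision trees to form an integrated chain-tree classifier (CTC), a churn analysis model is built, and comparative experiments are carried out; 4. To address churn caused by competitors' marketing policies, a churn prediction model based on competitive strategy is proposed, and comparative experiments are carried out.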
【Abstract】 With the opening and growth of the telecommunications industry and the arrival of 3G, competition among telecommunications companies is becoming fierce. The database systems these companies already operate make it possible to apply data mining to the problem. Retaining customers is their core issue. Research from Harvard University indicates that a 5% drop in customer consumption can affect investors' confidence in a company's earnings. Because attracting new customers costs more, keeping current customers is important for telecommunications companies. Churn that occurs without warning is a particular headache and is hard for a company to control: once customers decide to leave, it is difficult to persuade them to stay, even with better plans. Data mining significantly improves a company's ability to predict and control churn. Using data mining tools, a company can build models from customers' personal information, calling histories, and churn records. With the churn predictions obtained from these models, sales staff can take more proactive and better-targeted measures to keep current customers than before. Churn prediction plays a major role in the business analysis system of a telecommunications company. With a churn prediction model, the company can identify the customers most likely to churn and find effective ways to retain them.
Therefore, research on churn prediction matters to telecommunications companies that want to reduce operating costs and increase earnings. Based on the current situation of the telecommunications industry and an analysis of the theory, technology, and application of data mining, the dissertation proposes churn prediction algorithms based on evolutionary computation and sequential pattern (chain) mining, and builds churn prediction models. The main work and contributions are as follows:
1. It analyzes data mining technology, including the data mining process, data mining methods, and the application of data mining in the telecommunications industry. The applications covered include prediction methods, analysis of customer characteristics, identification of major customers, churn prediction, and identification of user groups.
2. It analyzes the shortcomings of current research and proposes methods, such as discretization of continuous data and feature reduction, for applying data mining to churn prediction. Evolutionary computation requires continuous data to be discretized. In preprocessing customer information, some attributes are numeric; they need to be discretized, grouped, and converted into categorical data. Discretization of continuous data preprocesses numeric data into categorical data that follows the same distribution as the original. It is important to the whole prediction process when the ECCA algorithm of the dissertation is applied. Building on the K-means algorithm, the dissertation proposes a self-organizing distribution algorithm to discretize continuous data. It addresses the problem in K-means that the chosen value of K can affect the resulting grouping. The self-organizing distribution algorithm consists of m neurons, where m is the maximum number of groups, and takes the feature values as its input. During self-organization, the neuron weights are updated repeatedly and move toward the actual categories of the feature values; discretization finishes when no neuron is updated any more. The data warehouses of the telecommunications industry contain a very large number of features. Some features are closely correlated, and redundant features remain in the discretized database. To make the algorithm more efficient, feature selection is needed to obtain a minimal set of features that preserves the properties of the original set. For high-dimensional data, the time spent on data mining and data analysis is an exponential function of the data's dimension, so appropriate feature selection is needed to reduce the dimension while preserving the properties of the original data. The dissertation proposes the χ² statistic as the measure of correlation among features. The significance level α for the independence test is taken from the χ² table. For a subset of features, two lists, List_{f,c} and List_{f,s}, are built based on α. List_{f,c} lists the features in descending order of their correlation with the class. List_{f,s} lists the features in descending order of their correlation with a reference feature. By selecting features according to the difference between a given feature's positions in the two lists (its "potential difference"), the dissertation proposes a feature selection algorithm, FSBPD. The algorithm removes redundant features that do not help the decision while preserving the properties of the original data. Finally, the dissertation analyzes the theory behind the algorithm and reports experimental results, which show that FSBPD selects features well.
3. It reviews the methodology of evolutionary computation. Evolutionary computation simulates the mechanism of survival of the fittest in biological evolution and the rules by which genetic information is transmitted. The dissertation introduces the major branches and the mathematical background of evolutionary computation. Because evolutionary computation is well suited to optimization problems, the dissertation proposes the ECCA model for churn prediction on that basis. The ECCA model searches from a population rather than from a single point, which helps it reach a global rather than a local optimum, and it includes the basic processes of evolutionary computation. The quality of the first-layer rules strongly affects the predictions of the whole model. Therefore, starting from first-layer rules obtained by probabilistic induction and conventional genetic computation, the dissertation incorporates background knowledge: it divides the features into two distinct categories, creates the first-layer rules within each category without crossover, and then creates crossover rules between the categories. From the second layer onward, crossover is no longer restricted to within a category, so that latent rules can be found. This is repeated until no new, valuable rules are created. Once all rules have been created, the ECCA model encodes them into a single expression. The experimental results show that the ECCA model predicts better than C4.5, with better overall classification results.
4. For customers' historical and short-term data, the dissertation proposes a chain (sequential pattern) mining algorithm and combines it with a decision tree to form a combined chain-tree classifier (CTC). It builds a model to predict customer churn, runs simulations, and compares the experimental results.
5. For churn caused by competitors' new policies, the dissertation proposes a prediction model based on competition. It compares the effect of different calling plans within the company and across competitors, predicts churn, and compares the experimental results.
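The self-organizing discretization described in item 2 can be illustrated with a small sketch. It is not the dissertation's algorithm: the abstract only states that m neurons (the maximum number of groups) compete for the feature values and that their weights are updated until nothing changes. The even initialization, the decaying learning rate, the stopping tolerance, the rule that drops neurons which never win, and the name `self_organizing_discretize` are all assumptions made for illustration.

```python
from typing import List, Tuple


def self_organizing_discretize(
    values: List[float],
    m: int,
    lr0: float = 0.5,
    tol: float = 1e-3,
    max_epochs: int = 100,
) -> Tuple[List[int], List[float]]:
    """Discretize 1-D values using at most m competing neurons (a sketch)."""
    if m < 1 or not values:
        raise ValueError("need m >= 1 and at least one value")
    lo, hi = min(values), max(values)
    if m == 1 or lo == hi:
        return [0] * len(values), [sum(values) / len(values)]

    # Assumption: spread the m neuron weights evenly over the observed range.
    weights = [lo + (hi - lo) * i / (m - 1) for i in range(m)]

    def winner(x: float) -> int:
        # The neuron whose weight is closest to x wins the competition.
        return min(range(m), key=lambda i: abs(x - weights[i]))

    for epoch in range(max_epochs):
        lr = lr0 / (epoch + 1)  # assumption: decaying learning rate
        before = weights[:]
        for x in values:
            w = winner(x)
            weights[w] += lr * (x - weights[w])  # only the winner moves
        if max(abs(a - b) for a, b in zip(before, weights)) < tol:
            break  # no neuron was updated noticeably in this epoch

    # Neurons that never win are dropped, so the number of groups is
    # found by the procedure (at most m) instead of being fixed like K.
    winners = [winner(x) for x in values]
    used = sorted(set(winners), key=lambda i: weights[i])
    relabel = {old: new for new, old in enumerate(used)}
    return [relabel[w] for w in winners], [weights[i] for i in used]


if __name__ == "__main__":
    # Hypothetical monthly call minutes forming three natural groups.
    minutes = [1, 2, 3, 49, 50, 51, 98, 99, 100]
    labels, centers = self_organizing_discretize(minutes, m=6)
    print(labels, centers)
```

With m set to 6, only three neurons end up winning any value in this toy input, which mirrors the stated motivation: the caller supplies an upper bound on the number of groups rather than the exact K.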
【Key words】 Data Mining; Prediction of Churn; Sequential Pattern Mining; Evolutionary Computation; Feature Selection; Classification;
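The χ² correlation measure that the abstract uses for feature selection can be sketched as follows. This shows only the Pearson χ² statistic for two discretized columns and a descending ranking of features by their correlation with the class, in the spirit of the class-relevance list; it does not reproduce FSBPD, whose selection rule is not detailed in the abstract. The function names and the toy feature names (`plan`, `noise`) are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi_square(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic of the contingency table of xs vs ys."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    observed = Counter(zip(xs, ys))  # joint counts; missing cells count as 0
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


def rank_by_class_relevance(
    features: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> List[str]:
    """Feature names in descending order of chi-square with the class."""
    scores = {name: chi_square(col, labels) for name, col in features.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    churned = [1, 1, 1, 1, 0, 0, 0, 0]
    features = {
        "noise": ["x", "y", "x", "y", "x", "y", "x", "y"],
        "plan": ["a", "a", "a", "a", "b", "b", "b", "b"],
    }
    print(rank_by_class_relevance(features, churned))
```

A statistic would then be compared with the χ² table value for the chosen significance level; for a 2×2 table (one degree of freedom) at α = 0.05 that critical value is 3.841.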