Analyzing and Predicting Customer Churn in the Telecommunications Industry Using Data Mining
(Original title: 利用数据挖掘实现电信业的客户流失预测分析)
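【Author】 Wang Ping (王平);
【Advisor】 Huang Qing (黄庆);
【Author information】 Southwest Jiaotong University, Computer Application Technology, 2003, Master's thesis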
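【Abstract (translated from the Chinese)】 Frequent customer churn is a serious problem for telecom operators, and it will worsen as foreign operators enter the market. To reduce or prevent churn, this thesis presents a practical solution: use data mining to build a churn prediction model, use the model to identify customers who are about to leave the network, and then take targeted retention measures based on those customers' calling characteristics and service preferences. The thesis analyzes the four steps of building the churn prediction model: problem definition, data preprocessing, model building, and model optimization and evaluation. Problem definition states the problem to be solved and the goals to be achieved. Data preprocessing covers sample selection, noise removal, data transformation and, in particular, attribute selection. During attribute selection, attributes with little influence on classification are removed according to the Fisher function, correlated attributes are merged using Pearson's correlation coefficient, and singular value decomposition is used to reduce the dimensionality of the attribute vector space. Model building determines whether the predictions have practical value, and the thesis studies it from two angles: customer segmentation and churn prediction. Customer segmentation is the basis for prediction: it supplies the classifier with groups of users that share common characteristics, so that prediction can be carried out separately on each group. To reduce the computational cost of adjusting cluster centers, the thesis gives an improved k-means algorithm for finding groups of users with similar characteristics. Churn prediction uses a decision tree classifier. After describing the tree building, cost computation and pruning involved in decision tree algorithms, the thesis gives pruning algorithms that add constraints during tree building and after tree building. The pruning algorithm that sets a size limit during tree building computes the minimum cost of the incomplete tree to obtain an upper bound on the cost of the optimal tree, and prunes nodes based on this bound and the computed actual cost of each node. The prediction model applies the pruning algorithm that adds the size constraint during tree building. The thesis also explains how to find the best splitting criterion and how to determine split points in decision tree classification: the gini index is used as the splitting criterion, and the CAIM algorithm discretizes continuous attributes when split points are determined. Model optimization uses cross-validation and boosting, and the results of the prediction analysis are given at the end.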
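As an illustration of the attribute-filtering step mentioned in the abstract, the sketch below ranks attributes by a Fisher-style discriminant ratio and keeps those above a cutoff. The abstract does not give the formula, so the common two-class form, (mean_0 - mean_1)^2 / (var_0 + var_1), is an assumption here, as are the function names, the attribute names, the 0/1 label coding and the threshold.

```python
from statistics import fmean, pvariance


def fisher_ratio(values, labels):
    """Two-class Fisher ratio for one attribute (assumed form):
    squared gap between class means over the sum of class variances."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    gap = (fmean(class0) - fmean(class1)) ** 2
    spread = pvariance(class0) + pvariance(class1)
    if spread == 0:
        # Both classes are constant: perfectly separable or useless.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def select_attributes(columns, labels, threshold):
    """Keep attributes whose ratio reaches the (hypothetical) threshold."""
    return [name for name, values in columns.items()
            if fisher_ratio(values, labels) >= threshold]


# Hypothetical toy data: 0 = stayed, 1 = churned.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "monthly_minutes": [10, 12, 11, 30, 32, 31],
    "handset_age": [5, 7, 6, 6, 5, 7],
}
kept = select_attributes(columns, labels, threshold=1.0)
```

On this toy data the class means of `handset_age` coincide, so its ratio is zero and only `monthly_minutes` is kept.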
【Abstract】 Frequent customer churn is a serious problem in the mobile telecommunications market, and it will worsen as foreign telecom companies enter. To combat the high cost of churn, the thesis gives a feasible solution: first, build a churn prediction model using data mining technology; then, use the model to analyze why customers churn and which customers are most likely to churn in the future; finally, design better-targeted campaigns, based on a summary of customers' calling behavior and preferences, to increase retention. The thesis discusses how to build the model in four stages: business question definition, data preparation, model building, and model optimization and evaluation. The first stage explains the questions the model will answer and the goals it pursues. The second stage addresses how to select the dataset, minimize noise, normalize values and, especially, select attributes. Three techniques reduce the number of attributes: deleting attributes irrelevant to the task using Fisher's Discriminant Ratio; merging correlated attributes according to Pearson's Correlation Coefficient; and reducing the dimensionality of the attribute vector by Singular Value Decomposition. The third stage, model building, involves customer clustering and churn prediction. The purpose of clustering is to obtain groups of customers with common calling behavior; the prediction model is then built on these clusters. A modified k-means method that greatly reduces computational cost is proposed to cluster similar customers. Churn prediction uses decision tree algorithms. After a brief overview of the tree-building and tree-pruning algorithms of traditional decision trees, the thesis describes in detail how to push constraints into the tree-building phase and the tree-pruning phase. By computing the cost of the cheapest size-constrained subtree of the partial tree (an upper bound on the cost of the final optimal tree) and lower bounds on the cost of subtrees of varying sizes rooted at nodes of the partial tree, the algorithms can identify and prune nodes that cannot belong to the optimal constrained subtree. The method that pushes size constraints into the tree-building phase is applied in the prediction system. When splitting tree nodes, the gini index is used as the splitting criterion and the CAIM measure is used to transform continuous attributes into discrete ones. To improve accuracy, boosting is used to combine classifiers by voting. Finally, the experimental results are explained.
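To make the splitting criterion concrete, here is a minimal sketch of choosing a split point by weighted gini index. It is not the thesis's procedure: the thesis discretizes continuous attributes with CAIM, whereas this sketch simply scans every observed value as a candidate threshold. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted gini impurity after splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Exhaustive scan over observed values (a stand-in for CAIM)."""
    # Drop the largest value: splitting there leaves the right side empty.
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None  # attribute is constant, nothing to split on
    return min(candidates, key=lambda t: gini_split(values, labels, t))


# Hypothetical toy data: dropped calls per month vs. outcome.
dropped_calls = [1, 2, 3, 10, 11, 12]
outcome = ["stay", "stay", "stay", "churn", "churn", "churn"]
split = best_threshold(dropped_calls, outcome)
```

A lower weighted gini means purer child nodes; on this toy data the split at 3 separates the two classes completely.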
【Key words】 customer churn; decision trees; clustering; size constraints; attribute extraction; correlation analysis; CAIM algorithm; boosting;
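- 【Online publication contributor】 Southwest Jiaotong University 【Online publication year/issue】 2004, Issue 02
- 【Classification number】 TP399
- 【Times cited】 14
- 【Downloads】 560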