基于抽样矩阵的汽车客户分群及离群点分析

Customer Segmentation and Outlier Detection of Auto Client Based on Sampling Matrix

【Author】 Wang Haiyan (王海燕)

【Advisor】 Liu Zhi (刘智)

【Author information】 Dalian Maritime University, Computer Science and Technology, 2012, Master's thesis

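【Abstract, translated from the Chinese】 Industries across the market attach ever greater importance to customer relationships: only by fully understanding customer needs can a company provide the right products and services more accurately and maximize profit. The most basic element of this is customer segmentation in data mining, which draws on cluster analysis and outlier analysis; applied together, they give the deepest and most comprehensive understanding of customers. This is of far-reaching significance for customer relationship management in commercial markets. This thesis takes automobile customer data as a representative example and applies an improved density-based clustering algorithm, DBSCAN, together with a distance-based outlier detection method, to segment the customers and detect anomalies. The focus is on improving DBSCAN and distance-based outlier detection by simplifying the steps for determining their parameters. The specific work includes: (1) Suitability of the chosen algorithms: the overall approach exploits the common principle underlying the two algorithms and tries to combine them to analyze the data jointly. Because there is no best clustering algorithm, only the most suitable one, the clustering scheme must be chosen according to the characteristics of the data, so the first step is to ensure that the chosen algorithm is the one best suited to this data set. The experimental results show that DBSCAN is indeed the most suitable algorithm for data sets with these characteristics; because distance-based outlier detection shares its principle, it is taken to suit this type of data set as well and is not elaborated further. (2) Determining the required parameters from sampled data: to save time and space while preserving clustering quality, the thesis proposes determining the parameters on a sampled subset of the data and then performing clustering and outlier analysis on all the data with those parameters. This requires choosing a suitable sampling method and ensuring the accuracy of the parameters. Experiments show that data drawn by systematic sampling has the distribution closest to that of the full data set, and that the parameters obtained are essentially the same as those computed from all the data. (3) Determining the parameters of distance-based outlier detection from the DBSCAN parameters already obtained: the starting point is "density-reachability", the basic condition for forming clusters in DBSCAN, which an outlier should violate. On this criterion the thesis proposes a simple idea for deriving the parameters of the outlier detection method from the DBSCAN parameters. The experimental results show that the approach works well on the automobile data set, and validation on two labeled UCI data sets also gives a high detection rate.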

【Abstract】 Many large companies in various industries are now increasingly focused on customer relationships. To provide products and services accurately and to maximize profit, a company must fully understand customer requirements. The most basic element of this is customer segmentation in data mining, which uses cluster analysis and outlier detection to learn how customer needs can best be met; this is of profound significance for customer relationship management in commercial markets. This thesis takes automobile customer data, whose features are typical, and performs customer segmentation and outlier detection with an improved DBSCAN clustering algorithm and distance-based outlier detection. It studies how to simplify parameter determination in order to improve both density-based DBSCAN clustering and distance-based outlier detection. The main research work covers the following aspects: (1) Fitness of the chosen algorithms: overall, the two algorithms are combined on the basis of their common principle and then used to analyze the data sets. Because no clustering algorithm is best, only most appropriate, the method should be selected according to the properties of the data, so the first step is to make sure the chosen algorithm is the most appropriate one. The experimental results show that DBSCAN is the most suitable method; given its similarity in principle to DBSCAN, distance-based outlier detection is likewise taken to suit the automobile customer data set and is not discussed further. (2) Determining the parameters from a sample of the data: to save time and space while ensuring clustering quality, the thesis proposes extracting part of the data to determine the parameters and then running cluster analysis and outlier detection on all the data. This requires selecting a suitable sampling method and then ensuring the accuracy of the parameters. The experimental results show that the distribution of data drawn by systematic sampling is the most similar to the distribution of all the data, and that the resulting parameters are essentially the same as those obtained by the original method. (3) Determining the parameters of distance-based outlier detection from DBSCAN: density-reachability, the basic condition for forming clusters in DBSCAN, is taken as the starting point; in other words, an outlier must not satisfy this condition. On this basis, the thesis proposes that distance-based outlier detection take its parameters from those of DBSCAN, which simplifies the calculation. The experimental results show that the outlier detection result is good, and that the detection rate on two class-labeled UCI data sets is high.
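
Points (2) and (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the DBSCAN parameters are computed from the sample, nor how exactly they map to the outlier-detection parameters. The sketch therefore takes `eps` and `min_pts` as given, hypothetical values, and assumes one plausible mapping: a point is flagged when fewer than `min_pts` points (itself included) lie within distance `eps`, which is the condition under which it cannot be a DBSCAN core point. The function names and the toy data are likewise assumptions made for illustration.

```python
import math


def systematic_sample(records, step, start=0):
    """Systematic sampling: take every `step`-th record from index `start`."""
    return records[start::step]


def distance_based_outliers(points, eps, min_pts):
    """Return indices of points with fewer than `min_pts` points
    (the point itself included) within distance `eps`.

    Assumed mapping from DBSCAN parameters, not taken from the thesis.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours < min_pts:
            outliers.append(i)
    return outliers


# Toy data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# Every second record; parameters would be estimated on this subset.
sample = systematic_sample(points, step=2)

# Hypothetical parameter values, applied to the full data set.
flagged = distance_based_outliers(points, eps=0.5, min_pts=3)
```

Under this assumed rule, DBSCAN border points can also be flagged, because they are not core points either; a stricter variant would additionally require that no core point lie within `eps` of the candidate.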

  • 【Classification number】TP311.13
  • 【Times cited】1
  • 【Downloads】151