Researchers have long combined analytical tools to build predictive models. Tools of the same type can be combined (ensemble models, committees), as can tools of different types (hybrid models). Hybrid models are used in areas such as customer relationship management (CRM), web usage mining, medical sciences, petroleum geology, and anomaly detection in computer networks. Our hybrid model is a sequential combination of cluster analysis and decision trees. In the first step, objects were grouped into clusters using the k-means algorithm. In the second step, a decision tree model was built with an additional independent variable indicating the cluster to which each object belonged. The analysis was based on 14 data sets collected from publicly accessible repositories. Model performance was assessed with measures derived from the confusion matrix, including accuracy, precision, recall, F-measure, and the lift in the first and second deciles. We examined the relationship between the number of clusters and the quality of the hybrid predictive models. To our knowledge, no similar study has been conducted. Our research demonstrates that in some cases building hybrid models can improve predictive performance. The models with the highest performance measures required a relatively large number of clusters (from 9 to 15).
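The two-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `kmeans` and `add_cluster_feature`, the random choice of starting centroids, the Euclidean distance, and the toy data are all hypothetical, and the decision-tree step is left to whichever tree learner is available.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means on tuples of numbers.

    Returns (centroids, labels). Starting centroids are k randomly
    chosen objects, which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


def add_cluster_feature(rows, labels):
    """Append the cluster label to each row as a new independent variable.

    The label is a nominal code, so a tree learner should treat it as
    a categorical predictor rather than as a quantity.
    """
    return [tuple(row) + (lab,) for row, lab in zip(rows, labels)]


# Toy data (hypothetical): two numeric predictors per object.
rows = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Step 1: group the objects into clusters.
_, cluster_labels = kmeans(rows, k=2)

# Step 2 (input only): rows extended with the cluster-membership
# variable, ready to be passed to a decision-tree learner together
# with the dependent variable.
augmented_rows = add_cluster_feature(rows, cluster_labels)
```

Repeating step 1 for a range of `k` values and refitting the tree each time is one way to study how the number of clusters relates to model quality.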
Keywords: hybrid predictive model, k-means algorithm, decision trees
JEL Classification: C10, C18, C52