The selection, weighting and transformation of variables are essential phases of the modelling process. Two approaches can improve a model’s accuracy: variable selection and variable transformation. For the latter, symbolic data analysis offers two techniques: principal component analysis (PCA) and spectral clustering. In both cases, the starting point is a set of symbolic variables which, after transformation, yield either classical variables (single numeric values) or symbolic variables that can be used in various models. The paper presents and compares PCA and spectral clustering for symbolic data applied to the problem of variable transformation. Artificial data sets with a known cluster structure were used to compare both single and ensemble clustering approaches. The results suggest that spectral clustering outperforms PCA for both single and ensemble models.
Keywords: symbolic data analysis, ensemble learning, spectral clustering, principal component analysis
JEL: C63, C87, C90
Ardabili, S., Mosavi, A., & Várkonyi-Kóczy, A. R. (2019). Advances in Machine Learning Modeling Reviewing Hybrid and Ensemble Methods. In A. R. Várkonyi-Kóczy (Ed.), Engineering for Sustainable Future. Selected papers of the 18th International Conference on Global Research and Education Inter-Academia – 2019 (pp. 215–227). Springer. https://doi.org/10.1007/978-3-030-36841-8_21.
Bock, H.-H., & Diday, E. (2000). Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer. https://doi.org/10.1007/978-3-642-57155-8.
Billard, L., & Diday, E. (2006). Symbolic Data Analysis. Conceptual Statistics and Data Mining. Wiley. https://doi.org/10.1002/9780470090183.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655.
de Carvalho, F. A. T., Lechevallier, Y., & de Melo, F. M. (2012). Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recognition, 45(1), 447–464. https://doi.org/10.1016/j.patcog.2011.05.016.
Cazes, P., Chouakria, A., Diday, E., & Schektman, Y. (1997). Extensions de l’Analyse en Composantes Principales à des données de type intervalle [Extensions of principal component analysis to interval-type data]. Revue de Statistique Appliquée, 45(3), 5–24. http://eudml.org/doc/106421.
Chen, M., Wang, H., & Qin, Z. (2015). Principal component analysis for probabilistic symbolic data: a more generic and accurate algorithm. Advances in Data Analysis and Classification, 9, 59–79. https://doi.org/10.1007/s11634-014-0178-2.
Chouakria, A., Diday, E., & Cazes, P. (2000). Vertices Principal Components Analysis With an Improved Factorial Representation. In A. Rizzi, M. Vichi & H.-H. Bock (Eds.), Advances in Data Science and Classification (pp. 397–402). Springer. https://doi.org/10.1007/978-3-642-72253-0_54.
Dudek, A. (2020). Silhouette Index as Clustering Evaluation Tool. In K. Jajuga, J. Batóg & M. Walesiak (Eds.), Classification and Data Analysis. SKAD 2019. Studies in Classification, Data Analysis, and Knowledge Organization. Springer. https://doi.org/10.1007/978-3-030-52348-0_2.
Douzal-Chouakria, A., Billard, L., & Diday, E. (2011). Principal component analysis for interval-valued observations. Statistical Analysis and Data Mining, 4(2), 229–246. https://doi.org/10.1002/sam.10118.
Dudoit, S., & Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9), 1090–1099. https://doi.org/10.1093/bioinformatics/btg038.
D’Urso, P., & Giordani, P. (2004). A least squares approach to principal component analysis for interval valued data. Chemometrics and Intelligent Laboratory Systems, 70(2), 179–192. https://doi.org/10.1016/j.chemolab.2003.11.005.
Fred, A. L. N., & Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850. https://doi.org/10.1109/TPAMI.2005.113.
Gatnar, E. (2008). Podejście wielomodelowe w zagadnieniach dyskryminacji i regresji [The multiple-model approach to discrimination and regression problems]. Wydawnictwo Naukowe PWN.
Gioia, F., & Lauro, C. N. (2006). Principal component analysis on interval data. Computational Statistics, 21(2), 343–363. https://doi.org/10.1007/s00180-006-0267-6.
Gnanadesikan, R., Kettenring, J. R., & Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis. Journal of Classification, 12, 113–136. https://doi.org/10.1007/BF01202271.
Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157–1182. https://doi.org/10.1162/153244303322753616.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate Data Analysis (7th edition). Prentice Hall.
Hornik, K. (2005). A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12), 65–72. https://doi.org/10.18637/jss.v014.i12.
Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining, 4(2), 184–198. https://doi.org/10.1002/sam.10111.
Karatzoglou, A. (2006). Kernel Methods. Software, Algorithms and Applications [Doctoral dissertation, Vienna University of Technology]. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-14467.
Krzanowski, W. J. (2000). Principles of Multivariate Analysis. A User’s Perspective. Oxford University Press. https://doi.org/10.1093/oso/9780198507086.001.0001.
Kuncheva, L. I. (2014). Combining Pattern Classifiers. Methods and Algorithms (2nd edition). John Wiley and Sons.
Lauro, N. C., Verde, R., & Palumbo, F. (2000). Factorial Methods with Cohesion Constraints on Symbolic Objects. In H. A. L. Kiers, J.-P. Rasson, P. J. F. Groenen & M. Schader (Eds.), Data Analysis, Classification, and Related Methods (pp. 381–386). Springer. https://doi.org/10.1007/978-3-642-59789-3_60.
Leisch, F. (1999). Bagged clustering (SFB Working Papers No. 51). https://doi.org/10.57938/9b129f95-b53b-44ce-a129-5b7a1168d832.
Le-Rademacher, J., & Billard, L. (2012). Symbolic covariance principal component analysis and visualization for interval-valued data. Journal of Computational and Graphical Statistics, 21(2), 413–432. https://doi.org/10.1080/10618600.2012.679895.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. https://doi.org/10.1007/s11222-007-9033-z.
Meyer, D., Leisch, F., & Hornik, K. (2003). The support vector machine under test. Neurocomputing, 55(1–2), 169–186. https://doi.org/10.1016/S0925-2312(03)00431-4.
Nagabhushan, P., Chidananda Gowda, K., & Diday, E. (1995). Dimensionality reduction of symbolic data. Pattern Recognition Letters, 16(2), 219–223. https://doi.org/10.1016/0167-8655(94)00085-H.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On Spectral Clustering: Analysis and an algorithm. In T. Dietterich, S. Becker & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (pp. 849–856). MIT Press.
Oliveira, M. R., Vilela, M., Pacheco, A., Valadas, R., & Salvador, P. (2017). Extracting Information from Interval Data Using Symbolic Principal Component Analysis. Austrian Journal of Statistics, 46(3–4), 79–87. https://doi.org/10.17713/ajs.v46i3-4.673.
Palumbo, F., & Lauro, C. N. (2003). A PCA for interval-valued data based on midpoints and radii. In H. Yanai, A. Okada, K. Shigemasu, Y. Kano & J. J. Meulman (Eds.), New Developments in Psychometrics (pp. 641–648). Springer. https://doi.org/10.1007/978-4-431-66996-8_74.
Polikar, R. (2012). Ensemble learning. In C. Zhang & Y. Ma (Eds.), Ensemble Machine Learning (pp. 1–34). Springer. https://doi.org/10.1007/978-1-4419-9326-7_1.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850. https://doi.org/10.1080/01621459.1971.10482356.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4). https://doi.org/10.1002/widm.1249.
Shi, J., & Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905. https://doi.org/10.1109/34.868688.
Su, E. C.-Y., & Wu, H.-M. (2024). Dimension reduction and visualization of multiple time series data: a symbolic data analysis approach. Computational Statistics, 39(4), 1937–1969. https://doi.org/10.1007/s00180-023-01440-7.
Tomašev, N., & Radovanović, M. (2016). Clustering Evaluation in High-Dimensional Data. In M. Celebi & K. Aydin (Eds.), Unsupervised Learning Algorithms (pp. 71–107). Springer. https://doi.org/10.1007/978-3-319-24211-8_4.
Tsai, C. F., & Chen, M. L. (2010). Credit rating by hybrid machine learning techniques. Applied Soft Computing, 10(2), 374–380. https://doi.org/10.1016/j.asoc.2009.08.003.
Walesiak, M., & Dudek, A. (2024). Package ‘clusterSim’: Searching for Optimal Clustering Procedure for a Data Set. R package version 0.51-5. https://cran.r-project.org/web/packages/clusterSim/clusterSim.pdf.
Wang, H., Guan, R., & Wu, J. (2012). CIPCA: complete-information-based principal component analysis for interval-valued data. Neurocomputing, 86, 158–169. https://doi.org/10.1016/j.neucom.2012.01.018.
Wolpert, D. H., & Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. https://doi.org/10.1109/4235.585893.
Zhou, Z. H. (2012). Ensemble Methods. Foundations and Algorithms. CRC Press. https://tjzhifei.github.io/links/EMFA.pdf.
Zhou, Z. H. (2021). Machine Learning. Springer. https://doi.org/10.1007/978-981-15-1967-3.
Zuccolotto, P. (2006). Principal components of sample estimates: an approach through symbolic data analysis. Statistical Methods & Applications, 16(2), 173–192. https://doi.org/10.1007/s10260-006-0024-6.