Comparative Analysis of Selected Methods for Estimating the Prediction Error of Classifier

Sergiusz Herman

doi:10.5604/01.3001.0014.1225

Uniwersytet Ekonomiczny w Poznaniu, Wydział Informatyki i Gospodarki Elektronicznej, Katedra Ekonometrii
Przegląd Statystyczny. Statistical Review, vol. 63 (2016), no. 4, pages 449-463
Published online: 31 December 2016

ABSTRACT

Classification is a procedure that assigns the studied objects (here, companies) to specific populations on the basis of their attributes. Its essential component is the classifier. The main measure of a classifier's quality is its predictive ability, measured by the true error rate. Because no sufficiently large, independent test set is available, this error must be estimated from the available learning set.
The aim of this article is to review and compare selected methods for estimating the prediction error of a classifier constructed with linear discriminant analysis. The study examines whether the results depend on the sample size and on the method used to select variables for the model. The empirical research uses the problem of predicting the bankruptcy of joint-stock companies in Poland as an example.
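
As a rough illustration of the kinds of estimators compared, the sketch below computes holdout, k-fold cross-validation and a simplified .632 bootstrap estimate of the error rate of a two-class linear discriminant. It is not the article's procedure: the data are synthetic rather than the Polish company data, the function names (`fit_lda`, `holdout`, `k_fold`, `bootstrap_632`) are hypothetical, and the bootstrap variant averages the out-of-bag error over replicates instead of over observations.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class linear discriminant; returns weights w and threshold c."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance, lightly regularised for stability.
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    S += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    # Midpoint threshold shifted by the log ratio of class frequencies.
    c = w @ (m0 + m1) / 2 - np.log(len(X1) / len(X0))
    return w, c

def error_rate(model, X, y):
    """Share of observations misclassified by the fitted discriminant."""
    w, c = model
    return float(np.mean((X @ w > c).astype(int) != y))

def holdout(X, y, rng, test_frac=0.3):
    """Single random split: fit on the training part, test on the rest."""
    idx = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    return error_rate(fit_lda(X[train], y[train]), X[test], y[test])

def k_fold(X, y, rng, k=10):
    """k-fold cross-validation: each observation is tested exactly once."""
    folds = np.array_split(rng.permutation(len(y)), k)
    wrong = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit_lda(X[train], y[train])
        wrong += error_rate(model, X[test], y[test]) * len(test)
    return wrong / len(y)

def bootstrap_632(X, y, rng, B=50):
    """Simplified .632 estimate: 0.368 * resubstitution + 0.632 * out-of-bag."""
    n = len(y)
    resub = error_rate(fit_lda(X, y), X, y)
    oob_errors = []
    for _ in range(B):
        boot = rng.integers(0, n, size=n)        # sample with replacement
        out = np.setdiff1d(np.arange(n), boot)   # observations left out
        if len(np.unique(y[boot])) < 2 or len(out) == 0:
            continue  # skip degenerate resamples
        model = fit_lda(X[boot], y[boot])
        oob_errors.append(error_rate(model, X[out], y[out]))
    return 0.368 * resub + 0.632 * float(np.mean(oob_errors))

# Synthetic learning set (an assumption): 60 objects, 3 attributes, 2 classes.
rng = np.random.default_rng(0)
n = 60
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 3)) + y[:, None] * 1.0

estimates = {
    "resubstitution": error_rate(fit_lda(X, y), X, y),
    "holdout": holdout(X, y, rng),
    "10-fold CV": k_fold(X, y, rng),
    ".632 bootstrap": bootstrap_632(X, y, rng),
}
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

Repeating the run with different seeds or sample sizes gives a feel for how the variability of each estimator changes on a small learning set.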

KEYWORDS

prediction error, cross-validation, holdout method, bootstrapping, corporate bankruptcy, classification
