Model selection in unsupervised learning is a long standing problem. Here we looked at the specific case of non-negative matrix factorization (NMF) and compared different ways to select its rank. We also propose a new method based on generalizatbility of a solution on unseen data.
The paper is published in IEEE IJCNN 2019.
Rank Selection in Non-negative Matrix Factorization: systematic comparison and a new MAD metric
Abstract Non-Negative Matrix Factorization (NMF) is a powerful dimensionality reduction and factorization method that provides a part-based representation of the data. In the absence of a priori knowledge about the latent dimensionality of the data, it is necessary to select a rank of the reduced representation. Several rank selection methods have been proposed, but no consensus exists on when a method is suitable to use. In this work, we propose a new metric for rank selection based on imputation cross-validation, and we systematically compare it against six other metrics while assessing the effects of data properties. Using synthetic datasets with different properties, our work critically evidences that most methods fail to identify the true rank. We show that properties of the data heavily impact the ability of different methods. Imputation-based metrics, including our new MADimput, provided the best accuracy irrespective of the data type, but no solution worked perfectly in all circumstances. One should therefore carefully assess characteristics of their dataset in order to identify the most suitable metric for rank selection.