Health Research and Policy

Abstract

DATE: May 23, 2013
TIME: 1:15 - 3:00 pm
LOCATION: Medical School Office Building, Rm x303
TITLE: Handling Missing Values in Exploratory Multivariate Data Analysis: from PCA to Multi-blocks Principal Components Methods
SPEAKER: Julie Josse
Associate Professor, Applied Mathematics Dept, Agrocampus Ouest, Rennes, France

Missing values are ubiquitous in statistical practice. They are problematic because most statistical methods cannot be applied directly to incomplete data.

In this talk, we focus on handling missing values in multiple factor analysis (MFA), a principal component method for exploring and visualising data in which the individuals are described by several groups of (continuous and/or categorical) variables. The aims of MFA are to study the similarities between individuals, to study the links between variables, and to relate these two studies. In addition, MFA studies the links between groups and compares the information brought by each group. Because of the group structure, the pattern of missing values can itself be structured, with entire rows missing in some data sets.

Since MFA, like many principal component methods, can be presented as a PCA on matrices with specific row and column weights, we discuss in greater detail how to deal with missing values in PCA. A common approach consists of ignoring the missing values by minimizing the loss function over the non-missing entries only. This can be achieved by the iterative PCA algorithm, in which the missing values are imputed iteratively during the estimation of the axes and components. We point out the overfitting problem of this algorithm and suggest a regularized PCA algorithm to overcome it. Then, to assess how much confidence should be given to the results, we give insight into the variance of the parameters using a nonparametric multiple imputation procedure. Finally, we discuss the extension to multiple correspondence analysis (MCA) for categorical variables and to MFA for groups of variables. The suggested methods can also be used to complete matrices with continuous, categorical or mixed variables. We illustrate the methodology on simulations, on a wine data set from sensory analysis, and on a genomics data set.
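
To make the iterative PCA idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration only and not the missMDA implementation: the function name, its parameters, the stopping rule and the simple singular-value shrinkage used when `regularized=True` are all assumptions made for this example. Missing cells are first filled with column means, then repeatedly replaced by the values fitted by a rank-`n_components` PCA until the imputations stabilise.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, regularized=False,
                         max_iter=1000, tol=1e-8):
    """Impute NaN entries of X with an iterative PCA (illustrative sketch).

    Hypothetical helper, not an API of FactoMineR or missMDA.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 0: start from mean imputation, using observed values only.
    col_means = np.nanmean(X, axis=0)
    X_hat = np.where(missing, col_means, X)

    for _ in range(max_iter):
        # Step 1: PCA of the completed matrix (centering + truncated SVD).
        mean = X_hat.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_hat - mean, full_matrices=False)
        s_k = s[:n_components].copy()

        if regularized and n_components < len(s):
            # Assumed shrinkage: estimate the noise level from the discarded
            # singular values and shrink the retained ones accordingly, so
            # that imputed values are pulled toward the column means.
            sigma2 = np.mean(s[n_components:] ** 2)
            s_k = np.clip((s_k ** 2 - sigma2) / np.maximum(s_k, 1e-12),
                          0.0, None)

        # Step 2: low-rank reconstruction from the retained components.
        fitted = (U[:, :n_components] * s_k) @ Vt[:n_components] + mean

        # Step 3: keep observed cells, overwrite only the missing ones.
        X_new = np.where(missing, fitted, X)

        # Stop when the imputed cells no longer change.
        change = np.sum((X_new - X_hat) ** 2)
        X_hat = X_new
        if change < tol:
            break

    return X_hat
```

Calling `iterative_pca_impute(data, n_components=2)` on an array containing `np.nan` returns a completed array in which the observed cells are unchanged. Because the unregularized fit depends only on the observed entries at convergence, it can follow the noise in them too closely when many values are missing; the `regularized=True` branch is one simple way to dampen that effect.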

The methods are implemented in the packages FactoMineR and missMDA.

Suggested reading:
Overview of some results and new challenges in dealing with missing values in principal components methods: Josse, J & Husson, F. (2013). Handling missing values in exploratory multivariate data analysis methods. Journal de la SFdS. 153 (2), pp. 79-99.

Introduction to the problem of missing data: Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods. 7, pp. 147–177.

Multiple factor analysis applied to the genomics data: de Tayrac, M., Lê, S., Aubry, M., Mosser, J. & Husson, F. (2009). Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics. 10:32.
