Multilevel Homogeneity Analysis

George Michailides
Ph.D., 1996
Advisor: Jan de Leeuw

Multivariate data arise in many different fields of research in the physical, social and life sciences. Depending on the nature of the data at hand, the goal of multivariate analysis techniques is to examine either the interdependence of a set of variables among themselves, or the dependence of a set of variables on the remaining variables. The Gifi system is a collection of multivariate techniques for categorical data, focusing primarily on the problem of interdependence. At the heart of the system is the method of optimal scaling, which aims at analyzing categorical data as numerical data. This method works by assigning numbers to the categories, thus introducing a transformation of the variables; these numbers are optimal with respect to some well-defined criterion. The transformations can be chosen so that they preserve the measurement level of the variables (nominal, ordinal or numerical). Two other main aspects of the Gifi system are the implementation of the optimal scaling of the variables through alternating least squares algorithms and the emphasis placed on the geometrical representation of the solution. It is also worth noting that all the classical multivariate techniques can be derived as special cases of the system.
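As an illustration of the optimal scaling idea described above, the following is a minimal sketch of single-group homogeneity analysis fitted by alternating least squares. It is not the implementation developed in this study. The function names (`indicator`, `normalize`, `homals`), the integer coding of categories as 0..k-1, the use of a QR decomposition for the normalization step, and the small data matrix are all assumptions made for this example. The sketch also assumes that every category is observed at least once and that there are no missing values.

```python
import numpy as np

def indicator(codes):
    """Indicator matrix G (n x k) for one variable coded 0..k-1 (assumed coding)."""
    codes = np.asarray(codes)
    k = codes.max() + 1
    G = np.zeros((codes.size, k))
    G[np.arange(codes.size), codes] = 1.0
    return G

def normalize(X):
    """Center the columns and orthonormalize so that X'X = n * I."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Q, _ = np.linalg.qr(X)  # reduced QR: Q is n x p with orthonormal columns
    return np.sqrt(n) * Q

def quantify(X, Gs):
    """Category quantifications: centroid of the object scores in each category."""
    return [(G.T @ X) / G.sum(axis=0)[:, None] for G in Gs]

def loss(X, Gs, Ys):
    """Average squared distance between object scores and their category points."""
    return sum(np.sum((X - G @ Y) ** 2) for G, Y in zip(Gs, Ys)) / len(Gs)

def homals(data, p=2, n_iter=200, seed=0):
    """Alternate between category quantifications Y_j and object scores X."""
    data = np.asarray(data)
    n, m = data.shape
    Gs = [indicator(data[:, j]) for j in range(m)]
    rng = np.random.default_rng(seed)
    X = normalize(rng.standard_normal((n, p)))  # random normalized start
    losses = []
    for _ in range(n_iter):
        Ys = quantify(X, Gs)                    # update quantifications for fixed X
        losses.append(loss(X, Gs, Ys))
        # update object scores for fixed quantifications, then renormalize
        X = normalize(sum(G @ Y for G, Y in zip(Gs, Ys)) / m)
    Ys = quantify(X, Gs)
    losses.append(loss(X, Gs, Ys))
    return X, Ys, losses

# Hypothetical data: 10 objects, 3 categorical variables with 3, 2 and 3 categories.
data = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
    [1, 1, 2], [2, 0, 2], [2, 1, 2], [2, 1, 0], [0, 0, 0],
])
X, Ys, losses = homals(data)
```

Here `X` holds the object scores and each element of `Ys` holds the quantifications of one variable's categories, so both can be plotted in the same p-dimensional space. The normalization X'X = nI rules out the trivial solution in which all objects and categories collapse into a single point.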

Multivariate data often have a hierarchical structure. For example, students can be naturally grouped (clustered) by schools, patients by hospitals, firms by industries, households by income, individuals by gender, occupation, level of education or socio-economic status, and lab tests by time. However, almost all multivariate analysis techniques are essentially one-group methods. The multilevel structure in the data is ignored at analysis time; that is, grouping variables do not take part in the original analysis. It is sometimes introduced at a later stage, when the results of the technique are examined broken down by background variables such as gender or occupation. The purpose of this study is to extend the basic techniques of the Gifi system, namely homogeneity analysis and nonlinear principal components analysis, to the multilevel data framework, where the grouping (clustering) of the individuals (or objects) is taken explicitly into account by the techniques at analysis time. However, the presence of many groups with a small number of objects within each group has two major drawbacks: first, many parameters need to be estimated, which makes the solutions unstable, and second, general patterns and trends are hard to detect. These two facts call for models that borrow strength from the multilevel nature of the data, incorporate prior knowledge and improve the stability of the solution. Two families of such models are introduced in this study, and their properties and implementation are presented. As it turns out, some interesting connections with multimode factor analysis and hierarchical linear models are established.

This study introduces a multilevel framework in which most of the techniques in the Gifi system can be naturally cast, and in which numerous new models can be explored. Multivariate data structures are very rich in content, and the present work represents only a first step towards a potentially very fruitful research program. An outline of the remaining chapters is given next. The second chapter contains a general overview of the Gifi system of nonlinear multivariate analysis. It presents the basic technique of homogeneity analysis and its extensions and generalizations, such as nonlinear principal components analysis and K-set homogeneity analysis. It provides the reader with a brief account of the historical development of optimal scaling systems, as well as a rigorous formulation of various multivariate techniques through a unifying framework of meet and join loss functions.

The third chapter extends homogeneity analysis and nonlinear principal components analysis to a multilevel framework. It discusses the need for models that take into consideration the hierarchical structure of the data and can simultaneously express how one variable is related to the other variables across all groups of objects, and how one group differs from another. It outlines two families of such models: the first is based on imposing constraints across the various groups on the category quantifications and the object scores, and the second on modeling the category quantifications. The fourth chapter contains an extensive analysis of the first family of models; their solutions are derived, their properties discussed and their computer implementations presented. Similarly, the fifth chapter deals with the second family, which builds a multivariate regression model for the cluster category quantifications. Moreover, relations with individual differences scaling models (INDSCAL) are indicated, and extensions that borrow ideas from the hierarchical linear models literature are discussed. The sixth chapter addresses stability issues both for the single-group case and for the multigroup framework. Particular emphasis is placed on replication stability techniques such as the bootstrap. The final chapter, called “Possibilities and Prospects”, briefly discusses some other potentially useful models, such as an additive model or a multimode factor model for the category quantifications, and points towards directions for future research, such as extensions to the K-set case. A data set from the National Education Longitudinal Study of 1988 is used throughout this study to demonstrate the techniques.