A Multiplatform Look at Multivariate Analysis
Stepping through a simple example using three popular commercial platforms
Multivariate statistical analysis is used when more than one observed variable is under study and the relationships between these variables need to be examined. Most data collected by researchers is multivariate, although in some cases it is desirable to isolate single variables for analysis.
There are a number of procedures in this family, and they include both descriptive and inferential methods. As with univariate tests, inference is based on comparing error variation with model variation. However, in this case, the sums of squares are contained in matrices rather than in single numbers. Historically, these techniques were most used in psychology and biology, but usage has spread to most branches of science as well as education, business and law.
The family of tests includes:
• multivariate analysis of variance (MANOVA)
• discriminant analysis (DA)
• principal components analysis (PCA)
• partial least squares (PLS)
• factor analysis
• correspondence analysis.
There are others, but these are the most often encountered. These procedures allow the researcher to examine the variables simultaneously, to assess both their joint actions and the effects of each variable on the others. This is done in a way that protects us from inflating the Type I error rate (chance of getting a false positive) and handles correlations in the data. As with most statistics, those practitioners wishing to delve below the surface in multivariate methods need a good grounding in matrix algebra. For the rest, we will step through a simple example using three popular commercial platforms (JMP 8, Minitab 15 and SYSTAT 12) to
• see what the mathematics does for us in a practical manner
• compare platform outputs and format
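The Type I inflation mentioned above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the six cholesterol columns are each tested separately at alpha = 0.05 and that those tests are independent. Repeated measures on the same subjects are correlated, so the real figure would differ, but the direction of the problem is the same.

```python
# Family-wise Type I error rate for k separate tests at level alpha,
# assuming (for illustration only) that the tests are independent.
def familywise_error(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


alpha = 0.05
for k in (1, 3, 6):
    print(f"{k} separate tests: {familywise_error(alpha, k):.3f}")
```

With six separate tests, the chance of at least one false positive rises to roughly one in four, which is the kind of inflation a single multivariate test is meant to avoid.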
For the medical example used, we will employ the following multivariate procedures: MANOVA, clustering, DA and PCA. Due to the length of the manuscript that would result if all of the screenshots (graphics, numerical outputs and dialog boxes) were included, it was necessary to store these results in electronic format. To see how each platform presents the output, please go to www.ScientificComputing.com/MA.
The data comes from an example in JMP, but is easily exported and analyzed in the other two platforms. From the JMP 8 manual:1 ‘Groups of five subjects belong to one of four treatment groups called A, B, Control and Placebo. Cholesterol was measured in the morning and again in the afternoon once a month for three months (the data are fictional). In this example, the response columns are arranged chronologically with time-of-day within month. The columns could have been ordered with AM in the first three columns followed by three columns of PM measures.’
The clinical researcher wishes to determine differences (and relationships) among and between treatment groups to determine whether
• the treatments were effective, i.e., experimental groups had significantly lower cholesterol than controls and placebos
• the treatments were equally effective
• there was any “placebo effect,” and
• there were effects with time and, if so, what they were.
This is a contrived but obviously very important example because, in the pharmaceutical world, a potentially lucrative new drug must demonstrate effectiveness beyond what is already on the market. Table 1 displays the data set, with columns ordered by month (Trt denotes treatment group).
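For readers who think in code, the layout described in the JMP quotation (one row per subject, six response columns ordered AM/PM within month) can be sketched as follows. The column names are taken from the SYSTAT statement that appears later in this article; the two subjects and their cholesterol values are invented placeholders, not the JMP sample data, and the helper `to_long` is hypothetical.

```python
# Column names follow the SYSTAT MODEL statement quoted later in this article.
COLUMNS = ["APRIL_AM", "APRIL_PM", "MAY_AM", "MAY_PM", "JUNE_AM", "JUNE_PM"]

# Two illustrative subjects; the values are invented placeholders,
# NOT the JMP sample data.
wide_rows = [
    {"subject": 1, "treatment": "A",
     "APRIL_AM": 200, "APRIL_PM": 205, "MAY_AM": 190,
     "MAY_PM": 194, "JUNE_AM": 180, "JUNE_PM": 183},
    {"subject": 2, "treatment": "Placebo",
     "APRIL_AM": 210, "APRIL_PM": 212, "MAY_AM": 209,
     "MAY_PM": 213, "JUNE_AM": 211, "JUNE_PM": 214},
]


def to_long(rows):
    """Return one record per subject, month and time of day."""
    long_rows = []
    for row in rows:
        for column in COLUMNS:
            month, time_of_day = column.split("_")
            long_rows.append({
                "subject": row["subject"],
                "treatment": row["treatment"],
                "month": month,
                "time_of_day": time_of_day,
                "cholesterol": row[column],
            })
    return long_rows


for record in to_long(wide_rows)[:3]:
    print(record)
```

The wide form is what a MANOVA dialog expects as its Ys; the long form makes the two within-subject factors (month and time of day) explicit.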
• JMP 8
This software allows multivariate analysis in two menu areas: Fit Model and Multivariate Methods. We will begin by fitting the data with a MANOVA model. This is done, as we
• have repeated measures (RM) data that can be easily handled in MANOVA
• wish to examine both between subject and within subject effects
• may have an arbitrary correlation structure, as we have not tested for this or designed for a specific structure
To fit, we select Fit Model and specify the time/dates as the Ys, treatment as the Model effect and MANOVA as the fitting personality.
The parameter estimates will give us the overall means for each period (intercept) and the offset for each treatment and control. We also get, on the same screen, partial correlation and covariance matrices that are the correlation and covariance structures of the residuals (errors) adjusted for the time effects. JMP also will display the E and H (error and hypothesis) matrices, but this goes beyond what we want for this introductory discussion.
In our case, we note that JMP offers the choice of constructing tests for linear combinations across the responses. This is useful, as we wish to use the repeated measures model, and we can get a sphericity test by checking the ‘Univariate Tests Also’ box. This tells us whether the unadjusted univariate F-tests for within-subject effects are appropriate; if not, we use an adjusted test or fall back on the multivariate tests. In many cases, all we would do is use the repeated measures/univariate test choices. However, in this case, we have a compound structure, with each treatment group measured twice at each time point and repeated at monthly intervals. So, in the Response Specification box, we choose compound as the response and fill in the resultant dialog.
Actually, the analysis is given for Month and Day also. However, for space considerations, only the interaction (Month*Day) is shown. Here we see that an appropriate F-test is done for the intercept (this test is usually not of interest, as we are more interested in treatment effects versus controls) and the appropriate multivariate tests (Wilks’ lambda, Pillai’s trace, Hotelling-Lawley and Roy’s maximum root) are done for all else. These tests employ the matrices previously mentioned, and in most circumstances it is not necessary to run all of them. Some are more powerful than others under certain conditions, and none is the preferred method in all cases. In general, we use Pillai’s trace but, in cases of large deviation from the null hypothesis, we would prefer Roy’s maximum root. These tests are all exact, i.e., they reject the null hypothesis with the set alpha value. But their differences lie in the mathematical nature of the multivariate space with which we are now dealing. Also note that, although in this example the p-values for these tests are all the same (and, in many cases if they are not the same, will lead to the same conclusion), this is not always the case.
As we calculated these same significance levels for the tests of month and day, we conclude that there is a day-to-day and month-to-month difference in cholesterol level among the treatment groups, and there is a significant interaction between month and day. JMP also allows us to set up contrasts, so we can test treatment A vs. B, control versus placebo, mean of A and B versus mean of control and placebo, or any combination of these. They confirm what we see on the graphic, i.e., the controls and placebo have little effect on the cholesterol levels over time, but both treatments do.
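One way to picture a compound response is as a set of linear combinations of the six columns, built from a month part and a time-of-day part. The sketch below uses NumPy's Kronecker product to build such combinations. The particular contrast coding (successive month differences, AM minus PM) is an assumption chosen for readability and is not claimed to be the coding JMP uses internally.

```python
import numpy as np

# Response columns are assumed to be ordered AM, PM within each of 3 months.
month_contrasts = np.array([[1, -1, 0],
                            [0, 1, -1]])    # successive month differences
day_contrast = np.array([[1, -1]])           # AM minus PM
month_mean = np.full((1, 3), 1 / 3)          # average over months
day_mean = np.full((1, 2), 1 / 2)            # average over time of day

# Each row defines one linear combination of the six response columns.
M_month = np.kron(month_contrasts, day_mean)             # 2 x 6
M_day = np.kron(month_mean, day_contrast)                # 1 x 6
M_interaction = np.kron(month_contrasts, day_contrast)   # 2 x 6

print("Month:\n", M_month)
print("Day:\n", M_day)
print("Month*Day:\n", M_interaction)
```

Every row sums to zero, so each combination measures a difference across the within-subject factors rather than an overall level.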
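To show how the four statistics relate to the hypothesis (H) and error (E) matrices, here is a minimal NumPy sketch for a one-way layout. The data are synthetic (four groups of five subjects, six responses, random values with an arbitrary group shift) and stand in for the cholesterol table only in shape. Roy's statistic is taken here as the largest eigenvalue of E⁻¹H; some texts define it as that eigenvalue divided by one plus itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 groups x 5 subjects, 6 responses (NOT the JMP data).
groups = np.repeat(np.arange(4), 5)
Y = rng.normal(size=(20, 6)) + 0.8 * groups[:, None]

grand_mean = Y.mean(axis=0)
p = Y.shape[1]
H = np.zeros((p, p))   # hypothesis (between-group) SSCP matrix
E = np.zeros((p, p))   # error (within-group) SSCP matrix
for g in np.unique(groups):
    Yg = Y[groups == g]
    d = (Yg.mean(axis=0) - grand_mean)[:, None]
    H += len(Yg) * (d @ d.T)
    R = Yg - Yg.mean(axis=0)
    E += R.T @ R

# Eigenvalues of E^-1 H drive all four statistics.
eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
eig = np.clip(eig, 0.0, None)   # remove tiny negative round-off

wilks = np.prod(1.0 / (1.0 + eig))
pillai = np.sum(eig / (1.0 + eig))
hotelling_lawley = np.sum(eig)
roy = eig.max()

print(f"Wilks' lambda    {wilks:.4f}")
print(f"Pillai's trace   {pillai:.4f}")
print(f"Hotelling-Lawley {hotelling_lawley:.4f}")
print(f"Roy's max root   {roy:.4f}")
```

Because all four are functions of the same eigenvalues, they coincide when only one eigenvalue is nonzero, which is why the p-values can agree exactly for some effects.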
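The three contrasts named above can be written as weight vectors over the four groups. In this sketch the group order, the helper `contrast_value` and the group means are all illustrative assumptions; the means are invented numbers, not results from the JMP analysis.

```python
# Group order assumed for the weight vectors below.
GROUPS = ["A", "B", "Control", "Placebo"]

CONTRASTS = {
    "A vs B": [1, -1, 0, 0],
    "Control vs Placebo": [0, 0, 1, -1],
    "Treatments vs Control+Placebo": [0.5, 0.5, -0.5, -0.5],
}


def contrast_value(weights, group_means):
    """Weighted sum of group means; zero means 'no difference'."""
    return sum(w * group_means[g] for w, g in zip(weights, GROUPS))


# Invented group means, for illustration only.
example_means = {"A": 180.0, "B": 190.0, "Control": 210.0, "Placebo": 208.0}

for name, weights in CONTRASTS.items():
    print(f"{name}: {contrast_value(weights, example_means):+.1f}")
```

Each weight vector sums to zero, and the three are mutually orthogonal, so together they split the three degrees of freedom among four groups into separate questions.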
To further investigate our data, we may choose to cluster it and examine similarities and differences among the groups. Clustering will illuminate these similarities when the variables share some underlying properties, but it will not tell us explicitly what these properties are. As such, it is normally used as an exploratory technique. JMP offers a dendrogram that highlights the degrees of similarity as we proceed up the branches, and it provides a variety of mathematical methods to implement the clustering. Some are more useful under certain circumstances, but there is still a bit of “art” to it. By selecting Analyze/Multivariate Methods/Cluster, using the default settings (Hierarchical, as we have a small data set, and Ward, as a usually “safe” first choice) and asking JMP to color the groups, we produce the desired graphic.
It seems the method had no trouble telling the treatments from the placebo and controls, and differentiating the two treatments. It also tells us that the control seems to have some undefined properties that render it looking like the placebo in this procedure.
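For those curious about what Ward's method does at each step, the following is a deliberately simple NumPy sketch: at every stage it merges the two clusters whose union adds the least to the total within-cluster sum of squares. It is a teaching version run on synthetic points, not a reproduction of JMP's implementation, which may standardize the data and handle ties differently.

```python
import numpy as np


def ward_clusters(X, n_clusters):
    """Agglomerate rows of X until n_clusters remain, using Ward's criterion."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = X[clusters[a]].mean(axis=0)
                cb = X[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Synthetic, well-separated points (NOT the cholesterol data).
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(5, 2)) for c in centers])

clusters = ward_clusters(X, 3)
print(clusters)
```

Recording the merge cost at each step would give the heights drawn on a dendrogram.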
Note that Ward’s method was called usually “safe.” As with all procedures, it has mathematical constraints and is most useful when these constraints are met. It also has its own biases (making each cluster have equal numbers of points) and sensitivities (outliers). Now, let’s examine other, more rigorous ways of separating the groups.
Discriminant analysis is very useful in placing groups of unknown properties into groups of known composition. Here, we are predicting a categorical variable by using many continuous variables. DA does its work by constructing linear combinations of these continuous variables that are used to separate the categorical variables. Distances are calculated from each point to each group’s multivariate mean.
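The distance calculation just described can be sketched in a few lines. This assumes a linear discriminant with a pooled within-group covariance matrix and equal prior probabilities, so a point is simply assigned to the group whose mean is nearest in Mahalanobis distance. The function names and the two-group synthetic data are illustrative only.

```python
import numpy as np


def fit_group_stats(X, y):
    """Group means and the inverse of the pooled within-group covariance."""
    labels = np.unique(y)
    means = {str(g): X[y == g].mean(axis=0) for g in labels}
    scatter = sum((X[y == g] - means[str(g)]).T @ (X[y == g] - means[str(g)])
                  for g in labels)
    pooled_cov = scatter / (len(X) - len(labels))
    return means, np.linalg.inv(pooled_cov)


def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance from x to a group mean."""
    diff = x - mean
    return float(diff @ cov_inv @ diff)


def classify(x, means, cov_inv):
    """Assign x to the nearest group mean (equal priors assumed)."""
    distances = {g: mahalanobis_sq(x, m, cov_inv) for g, m in means.items()}
    return min(distances, key=distances.get), distances


# Synthetic two-group data (NOT the cholesterol data).
rng = np.random.default_rng(3)
y = np.array(["A"] * 5 + ["B"] * 5)
X = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 10.0])

means, cov_inv = fit_group_stats(X, y)
label, distances = classify(np.array([9.0, 11.0]), means, cov_inv)
print(label, distances)
```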
As an example, if we had data on different cholesterol-lowering drugs that worked by different mechanisms, we might want to analyze them with drugs of known action by DA to see which group they most closely resemble. In our case, we merely see if they are separated from the placebos and controls, by how much space, and further confirm any placebo/control overlap. To do this, we merely select Analyze/Multivariate Methods/Discriminant, putting the month and days in the Y roles and the treatments in the X.
It appears, once again, that there are differences between the two treatments, between the treatments and control/placebos and that the controls overlap the placebos in some way. The actual numerics (Discriminant Scores) give us a more precise handle on these and again confirm what we had seen before.
Our last methodology, PCA, is the most mathematically convoluted and, unfortunately, one of the techniques most often offered in statistical software for separation analysis. The JMP 8 manual states “The purpose of principal component analysis is to derive a small number of independent linear combinations (principal components) of a set of variables that capture as much of the variability in the original variables as possible.” This makes it sound quite simple. However, in reality, we are constructing eigenvectors and employing orthogonal matrix transforms that can get quite messy.
Usually, the first three principal components can effect a good separation and are easily visualized with 3-D graphics. The problem is that, if there is no good separation with these three, we must look to the later components (there are as many components as original variables) to find one. Unfortunately, this takes place in higher dimensions that are not easily visualized and are even harder to interpret. Even if the first three are successful (which fortunately happens quite often), understanding the underlying mathematics is quite a bit more challenging than with the other procedures.
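The eigenvector construction is less forbidding when written out. The sketch below performs PCA on the correlation matrix of a synthetic 20 x 6 data set built from one shared latent factor plus noise; the data, the choice of correlations rather than covariances, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 20 subjects x 6 correlated columns (NOT the JMP data).
latent = rng.normal(size=(20, 1))
X = latent @ np.ones((1, 6)) + 0.3 * rng.normal(size=(20, 6))

# Standardize, then take eigenvectors of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)     # returned in ascending order

order = np.argsort(eigvals)[::-1]           # largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # proportion of variance per PC
scores = Z @ eigvecs                        # subject coordinates on each PC

print("Cumulative proportion:", np.round(np.cumsum(explained), 3))
```

The cumulative proportions are the same quantities a scree plot displays, and the first two or three columns of `scores` are what a score plot or 3-D plot would show.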
In JMP, we will use Analyze/Multivariate Methods/Principal Components, and use the time points as the Ys. The output tells us that the first two PCs will account for 97.976 percent of the variation in the sample, and with the third PC, over 99 percent. An optional scree plot in JMP (not shown) verifies that using three PCs is appropriate. It appears that the placebos and controls (green) share similarities as do the A and B treatments (red). One of the B treatment points (blue) is distinctly colored, indicating a possible dissimilarity with the group.
In summary, the analyst may wish to do a preliminary clustering on the data followed by the MANOVA. If group separation is needed, DA or PCA also may be performed.
• Minitab 15
Minitab does not do the compound RM multivariate analysis, but it will do what it calls ‘Balanced’ and ‘General’ MANOVAs. The former comes from balanced designs, and the latter will analyze results from both balanced and unbalanced designs. There are features in the dialog boxes that allow for specification of covariates and custom designs.
A simple, non-RM MANOVA may be run by choosing Stat/ANOVA/General MANOVA and modeling the treatments using the timed data as responses. Minitab produces univariate graphics analyzing the distribution of error versus fit and order, as well as generating histograms and normal plots. If selected in the dialog, it also will do an eigenanalysis and generate the hypothesis and error matrices, as well as the sum of squares and cross product matrix. The results of the simple general MANOVA are shown at www.ScientificComputing.com/MA.
Although the results follow JMP, it is important to note that we are not testing the same hypotheses in the same way. The test has not taken into account the layered nature of the data so we are not testing the day, month and interaction effects separately.
By choosing Stat/Multivariate/Cluster Observations, and selecting treatments in the customize option, we see that Minitab also separates the A and B treatments and, as JMP does, intercalates the Controls and Placebos. It is interesting to note that we need to specify three clusters to get the coloring. If four or more are specified, Minitab will generate an error message. At this point, we could do descriptive statistics in either platform, asking for means and confidence intervals for these groups to examine the overlaps.
For the DA in Minitab, we use Stat/Multivariate/Discriminant Analysis. In the resultant dialog box, we specify treatment as the Groups, the time points as the predictors, Linear as the Discriminant Function and ask for fitting from cross validation, and request a complete classification summary in the options button. (We rapidly see that, in JMP, most of the work comes in selecting choices under the red arrows and, in Minitab, specifying options in the dialog boxes). In any event, Minitab gives a complete report but no canonical plot.
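Cross validation in this setting is usually of the leave-one-out kind: each observation is set aside in turn, the discriminant rule is rebuilt from the rest, and the held-out observation is then classified. The sketch below assumes that scheme together with a pooled-covariance linear rule and equal priors; it runs on synthetic data and is not a description of Minitab's internal code.

```python
import numpy as np


def nearest_group(x, X, y):
    """Classify x by Mahalanobis distance to group means (pooled covariance)."""
    labels = np.unique(y)
    means = [X[y == g].mean(axis=0) for g in labels]
    scatter = sum((X[y == g] - m).T @ (X[y == g] - m)
                  for g, m in zip(labels, means))
    cov_inv = np.linalg.inv(scatter / (len(X) - len(labels)))
    d2 = [(x - m) @ cov_inv @ (x - m) for m in means]
    return labels[int(np.argmin(d2))]


def loo_accuracy(X, y):
    """Proportion correctly classified when each row is held out in turn."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        hits += int(nearest_group(X[i], X[keep], y[keep]) == y[i])
    return hits / len(X)


# Synthetic, well-separated groups (NOT the cholesterol data).
rng = np.random.default_rng(4)
y = np.array(["A"] * 5 + ["B"] * 5)
X = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 10.0])

print(f"Leave-one-out proportion correct: {loo_accuracy(X, y):.2f}")
```

Because the held-out point never contributes to the rule that classifies it, this proportion is a more honest estimate of performance than the ordinary resubstitution summary.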
As Minitab generates much output (you can get even more than displayed at www.ScientificComputing.com/MA!), you may choose to see less in the output window by more judicious selection in the dialog box.
Now, by selecting Stat/Multivariate/Principal Components, and in the resulting dialog box choosing the desired plots, we can get a comprehensive PCA report. Again, the scree plot tells us that three components are adequate for the separation and that most of the variability is loaded on these, but Minitab will not automatically provide the nice 3-D plot to show this. The points in the biplot and score plots can be “brushed” with a graph tool to see how the groups are divided in the two-dimensional graphics.
• SYSTAT 12
SYSTAT is known to be heavy on statistics and, for many areas, we need to be very careful of what we request, as we can get buried in the resultant output. SYSTAT is like Minitab in that the menu is test-driven and, therefore, for most tests it is easy to find the choice from the dropdown menus.
The data is easily imported from Excel by using File/Open/Data and drilling to the Excel file. To begin with a clustering, select Analyze/Cluster Analysis/Hierarchical and select the variables, linkage type and Distance measurement from the dialog box. Under the options tab, the cluster color type, Validity and maximum group number were specified to get the dendrogram and numeric output. Presumably, one can get the clusters labeled and colored but, even after e-mailing the help desk, it was difficult to discern how this is done.
To obtain the RM MANOVA, the Analyze/MANOVA/Estimate Model choices were used and, in the dialog box, the following specifications were made: Dependent and Independent variables, Include constant, and Type III adjusted sums of squares. On the Repeated Measures Tab, the repeated measures box was checked and month and day only were chosen, as it was not clear if interaction terms could be included. (They were calculated, however.)
Please note, this is a very small sampling of the output, as SYSTAT will produce all of the matrices used for the calculations, as well as the least squares means and null hypotheses contrasts. With the multivariate tests, it appears that SYSTAT does not perform Roy’s Maximum Root. Suffice it to say that, for this example, SYSTAT leads the analyst to the same conclusions as JMP and Minitab.
To perform the Discriminant Analysis, we choose Analyze/Discriminant Analysis/Classical (as we want linear discriminants) and select treatment as the grouping variable, the time data as the predictors, and ‘all groups equal’ under Priors. On the Statistics Tab, we specify the tests and descriptors to obtain the required output.
Again, we get far more than appears at www.ScientificComputing.com/MA. However, what is presented gives the core output and a flavor of the format. One interesting feature is the scatterplot matrix (Splom) format of the Canonical Plot, allowing the user to choose the best comparators, which would be Factors (1) versus (2) here.
To do a PCA in SYSTAT took a little more work, as there is no menu choice as such, but the tests are done under General Linear Models and Factor Analysis. As your editor could not discern how to specify the proper tests under the menu-driven elements, he defaulted to the following code:
MODEL APRIL_AM APRIL_PM MAY_AM MAY_PM JUNE_AM JUNE_PM, = CONSTANT + TREATMENT$
To view the output, please go to www.ScientificComputing.com/MA. Would you believe it was truncated for the article? From this output, it is obvious that SYSTAT is oriented to the mathematically rigorous!
In summary, multivariate statistics are highly useful in situations with compound data and, although handled somewhat differently by commercial software packages, ultimately they lead to the same, or very similar, conclusions. For those wishing to delve a bit deeper into multivariate statistics, the text by Everitt and Dunn2 is an excellent introduction and that of Rencher3 is suitable for advanced undergraduates.
1. JMP 8: Statistics and Graphics Guide, Vol. 1 and 2. SAS Institute, Inc, Cary, NC. 2008. (p. 467)
2. Applied Multivariate Data Analysis, 2nd ed. Everitt, Brian S. and Graham Dunn. Oxford University Press, New York. 2001.
3. Methods of Multivariate Analysis, 2nd ed. Alvin C. Rencher. John Wiley & Sons, New York. 2002.
John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.