Avoiding Mean Square Error Bias in Designed Experiments

##### To avoid a biased MSE term, statistically valid model-reduction techniques should be used and the experiment should be properly randomized.
James A. Colton

The error in a designed experiment, σ², is the natural variation in the response when one of the experimental combinations is replicated. One important challenge in a designed experiment is obtaining an unbiased estimate of σ². Too often, experimenters do not realize the impact that data collection and analysis assumptions have on the estimate of σ². If the estimate is biased, tests of the effects in the analysis will be adversely affected. This article discusses the causes and consequences of bias in the mean square error (MSE) term and provides suggestions for detecting and correcting MSE bias.

##### Introduction
A replicated experiment runs each factor combination more than one time and calculates an error estimate (σ²) at each experimental combination. These estimates are pooled to obtain an overall estimate of σ². An unreplicated experiment obtains an estimate by assuming that higher-order interactions are "noise" or by using a modern technique such as Lenth's pseudo standard error [1].
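The two routes to an error estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own software: the function names and all data values are hypothetical, the pooling uses the usual degrees-of-freedom weighting, and the pseudo standard error follows the two-step median rule from Lenth's paper [1].

```python
import statistics

def pooled_variance(groups):
    """Pool the sample variances of replicated factor combinations.

    Each group holds the replicate responses for one combination.
    Each variance is weighted by its degrees of freedom (n_i - 1).
    """
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator

def lenth_pse(effects):
    """Lenth's pseudo standard error for an unreplicated two-level design."""
    abs_effects = [abs(e) for e in effects]
    s0 = 1.5 * statistics.median(abs_effects)
    # Trim effects that look active before taking the second median.
    trimmed = [a for a in abs_effects if a < 2.5 * s0]
    return 1.5 * statistics.median(trimmed)

# Hypothetical replicated data: three combinations, two replicates each.
replicated = [[10.0, 12.0], [20.0, 21.0], [15.0, 19.0]]
sigma2_hat = pooled_variance(replicated)

# Hypothetical effect estimates from an unreplicated design.
pse = lenth_pse([0.1, -0.2, 0.3, -0.4, 10.0])
```

In the hypothetical effects list, the single large effect is trimmed before the second median is taken, so it does not inflate the noise estimate.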

The ANOVA table refers to the estimate of σ² as the mean square error (MSE), which is based on the following:

1. replication (pure) error
2. terms removed from the model (lack of fit error)
3. a combination of (1) and (2).

##### Causes of MSE Bias
In this article, the term bias refers to the difference between the estimated value of MSE and the expected or true value. Positive bias occurs when the estimated value is higher than the true value and negative bias occurs when the estimated value is lower than the true value.

Bias can occur in the pure-error component or the lack-of-fit component of MSE. Typically, pure-error bias is the result of unsound data-collection procedures while lack-of-fit (LOF) error bias is the result of an inappropriate model-reduction approach.

Pure-Error Bias
Pure-error bias arises when the variation between replicates does not represent the natural variation in the process. If replicates are collected at the same time and/or under the same conditions, they will likely be exposed to the same or similar environmental conditions, raw materials, operators, and machinery. As a result, you can expect a negative bias in the pure-error component when non-randomized replicates (sometimes referred to as repeats) are incorrectly treated as replicates. You also can expect a negative bias in the pure-error component when multiple measurements from the same part are incorrectly treated as replicates.
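The negative bias from treating repeats as replicates can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the response is the sum of a run-to-run setup shift and a within-run measurement error; the standard deviations, sample sizes, and function name are hypothetical and do not come from the article.

```python
import random
import statistics

def mean_variances(n_groups=2000, n_per=3, sd_setup=2.0, sd_meas=1.0, seed=7):
    """Compare the average sample variance of repeats and true replicates.

    Assumed model: response = setup shift + measurement error.
    True replicate variance is sd_setup**2 + sd_meas**2.
    """
    rng = random.Random(seed)
    repeat_vars, replicate_vars = [], []
    for _ in range(n_groups):
        # Repeats: one setup, so every observation shares the same shift.
        shift = rng.gauss(0.0, sd_setup)
        repeats = [shift + rng.gauss(0.0, sd_meas) for _ in range(n_per)]
        # Replicates: every observation gets its own independent setup.
        replicates = [rng.gauss(0.0, sd_setup) + rng.gauss(0.0, sd_meas)
                      for _ in range(n_per)]
        repeat_vars.append(statistics.variance(repeats))
        replicate_vars.append(statistics.variance(replicates))
    return statistics.mean(repeat_vars), statistics.mean(replicate_vars)

repeat_estimate, replicate_estimate = mean_variances()
```

Under these assumptions the repeats capture only the measurement variance, while the replicates capture the full setup-plus-measurement variance, so an MSE built from repeats understates σ².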

A positive bias in the pure-error component can occur when one replicate of an experiment is collected under different conditions than a second replicate, and a blocking variable is not used to account for the differences between replicates. In this case, an additional source of variation is included in the pure-error component, resulting in a pure-error component that has a positive bias.

Lack-of-Fit Error Bias
LOF-error bias arises when terms are removed from the model incorrectly. A negative bias in LOF error can occur when insignificant terms in an orthogonal experimental design (a design in which the effects of any factor are balanced across the effects of the other factors) are removed from the model one at a time, starting with the term with the largest p-value, or equivalently, the smallest absolute effect. With this approach, the analyst is hand-picking the smallest possible measure of σ², placing it in the error, and using it as an estimate of σ² in the next stage of the analysis. In this situation, you can expect a negative bias in the LOF component, which can be especially dangerous if the LOF error is the only component of variation in the MSE (such as in unreplicated designs). This danger can be demonstrated via simulation.

Consider an 8-factor, 16-run resolution IV designed experiment with a random response obtained from a data simulator. Chart 1 shows the Pareto-effects chart before (top) and after (bottom) removing the term with the smallest absolute effect. With all terms in the model, the chart correctly indicates that none of the effects are significant. After the term with the smallest absolute effect is removed, the remaining 14 effects are compared to it and 13 are found to be significant, which results in 13 Type I errors (a Type I error occurs when an effect is incorrectly deemed significant).
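A simulation in the same spirit can be sketched as follows. It is not the data simulator used for Chart 1. As an assumption, the 15 estimable effect columns of the 16-run design are represented by the 15 contrast columns of a full 2⁴ factorial, the response is pure standard-normal noise, and the function names are hypothetical. With one error degree of freedom taken from the dropped term, the standard error of each remaining effect equals the absolute value of the dropped effect, so each t statistic is simply the ratio of an effect to the smallest effect.

```python
import itertools
import random

T_CRIT_1DF = 12.706  # two-sided 5% critical value of Student's t with 1 df

def contrast_columns():
    """Return 15 mutually orthogonal +/-1 columns for a saturated 16-run design."""
    base = list(itertools.product([-1, 1], repeat=4))  # 16 runs, 4 basic factors
    columns = []
    for size in range(1, 5):
        for subset in itertools.combinations(range(4), size):
            column = []
            for run in base:
                product = 1
                for j in subset:
                    product *= run[j]
                column.append(product)
            columns.append(column)
    return columns

def false_positives(rng, columns):
    """Count effects declared significant after dropping the smallest one."""
    y = [rng.gauss(0.0, 1.0) for _ in range(16)]  # no real effects at all
    effects = sorted(abs(sum(c * v for c, v in zip(col, y))) / 8.0
                     for col in columns)
    smallest, rest = effects[0], effects[1:]
    # MSE = SS of dropped term on 1 df, so SE(effect) = smallest.
    return sum(1 for e in rest if e > T_CRIT_1DF * smallest)

def average_false_positives(n_sims=2000, seed=1):
    rng = random.Random(seed)
    columns = contrast_columns()
    total = sum(false_positives(rng, columns) for _ in range(n_sims))
    return total / n_sims

avg = average_false_positives()
```

If the test were honest, about 14 × 0.05 = 0.7 of the remaining effects would be flagged per experiment on average. Under the assumptions above, the average count from this sketch should be several times larger.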

Another inappropriate model-reduction technique can result in a positive bias in the LOF-error component. This occurs when two or more non-orthogonal terms are eliminated from the model together in the same step. One of these terms actually may be significant if left in the model without the other term. The MSE is then higher than expected because the error for this term is included in the MSE when it should not be. This commonly happens in response surface designs when removing quadratic terms that are non-orthogonal.

##### Consequences of bias
If the MSE is biased, then all the tests for significant effects have either an inflated Type I error rate or an inflated Type II error rate (a Type II error occurs when an effect is incorrectly deemed not significant). An MSE with a negative bias results in an inflated Type I error rate in the tests for significant effects (Table 1). As a result, process settings may be unnecessarily adjusted and future experiments may contain unimportant factors, which may increase experiment costs. An MSE with a positive bias results in an inflated Type II error rate in the tests for significant effects. As a result, the analysis may exclude factors or interactions that influence the response.

Table 1: The standard error (SE) for each coefficient in the model is based on the MSE. The magnitude of each coefficient is compared to the SE to generate a t statistic. The t statistic is converted into a p-value that provides a measure of statistical significance.
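The chain from MSE to t statistic can be made concrete with a short calculation. The sketch below assumes a two-level orthogonal design with factors coded as -1 and +1, for which the standard error of a coefficient is sqrt(MSE / N). The coefficient, run count, and MSE values are hypothetical.

```python
import math

def t_statistic(coefficient, mse, n_runs):
    """t = coefficient / SE, with SE = sqrt(MSE / N) for a +/-1 coded orthogonal design."""
    standard_error = math.sqrt(mse / n_runs)
    return coefficient / standard_error

coef, n_runs = 1.5, 16  # hypothetical coefficient and number of runs

t_unbiased = t_statistic(coef, mse=4.0, n_runs=n_runs)   # assumed true MSE
t_negative = t_statistic(coef, mse=1.0, n_runs=n_runs)   # MSE biased low
t_positive = t_statistic(coef, mse=16.0, n_runs=n_runs)  # MSE biased high
```

With the same coefficient, an MSE that is too small doubles the t statistic in this example (pushing toward a Type I error), while an MSE that is too large halves it (pushing toward a Type II error).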

##### Signs and remedies
Listed in Table 2 are four indicators that should raise suspicions that the MSE is biased. Specific cutoffs for p-values are not given in Table 2 because formal tests for most forms of MSE bias have not been developed and, in some cases, depend on how many terms are in the model. Future research into formal p-value cut-offs would be beneficial. If one of the indicators of MSE bias from Table 2 is present, Table 3 can be used to identify the possible cause and to recommend a remedy.


##### Summary
It is important to carefully inspect the data-collection and model-reduction processes for signs of bias in MSE before the results are used to reach any conclusions. To avoid a biased MSE term, statistically valid model-reduction techniques should be used and the experiment should be properly randomized.

##### Appendix
The lack-of-fit test requires both a pure-error component and a LOF-error component. The null hypothesis states that the terms removed from the model to form the LOF error are null effects (effects that do not influence the response and can be used to estimate noise). The alternative states that at least one term removed from the model is an active effect (an effect that influences the response and should not be used to estimate noise), which implies that the MSE has a positive bias. In MINITAB, the lack-of-fit test is printed by default when both a pure-error and LOF-error component exist. In the example in Table 4, the F-test for the lack-of-fit error component has a p-value of 0.032, which might indicate a bias in the MSE.
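The F statistic behind the lack-of-fit test can be sketched as the ratio of the LOF mean square to the pure-error mean square. The example below is a simplified stand-in, not the Table 4 analysis: it assumes a straight-line model in a single predictor with replicated x values, and both the data and the function name are hypothetical. In a designed experiment, the reduced model would take the place of the straight line. Converting F to a p-value requires the F distribution, which is not computed here.

```python
from collections import defaultdict

def lack_of_fit_f(x, y):
    """F = (SS_LOF / df_LOF) / (SS_PE / df_PE) for a straight-line fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares from the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    # Pure error: variation of replicates around their own group mean.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_pe = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                for g in groups.values())
    df_pe = n - len(groups)

    # Lack of fit: what remains of the residual after removing pure error.
    ss_lof = sse - ss_pe
    df_lof = len(groups) - 2  # distinct x levels minus 2 line parameters
    return (ss_lof / df_lof) / (ss_pe / df_pe)

# Hypothetical curved data fitted with a straight line.
f_stat = lack_of_fit_f([1, 1, 2, 2, 3, 3], [1.0, 1.2, 4.0, 4.2, 9.0, 9.2])
```

A large F, as in this deliberately curved data set, suggests that the terms left out of the model are not pure noise.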

##### References
[1] R.V. Lenth (1989). "Quick and Easy Analysis of Unreplicated Factorials," Technometrics, 31, 469-473.

James A. Colton is Technical Training Specialist at Minitab Inc. He may be contacted at sceditor@scimag.com.

Chart 1: Pareto chart of effects for an 8-factor, 16-run experiment before (top) and after (bottom) eliminating the smallest effect. Any effect extending beyond the vertical red line is significant at the 0.05 α-level.