Item bias is a major threat to measurement validity. This study examined the Type I error rate (robustness) and power of Mantel-Haenszel (MH) procedures for detecting differential item functioning (DIF) while varying the magnitude of DIF, the type of test score (matching criterion) purification (single-stage, two-stage, and iterative), test length, and sample size. Data were simulated under the one-parameter logistic (1PL) model. In the 20% DIF item conditions, the two MH procedures are robust and have sufficient power, but in the 40% DIF item conditions, robustness is violated and power is insufficient. The influence of test length on power is rather modest. Test score purification, by contrast, improves power, and its effect is much larger in the 40% DIF item conditions than in the 20% DIF item conditions.
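To illustrate the kind of analysis summarized above, the following is a minimal sketch of simulating 1PL responses and computing the MH statistic for one studied item. It is not the study's actual design: the function names (`simulate_1pl`, `mantel_haenszel`), the sample size of 1,000 per group, the 20 items, the item difficulties, and the 0.6-logit DIF shift are all assumptions chosen for illustration. No purification is applied here; the matching criterion is the raw total score, including the studied item.

```python
import numpy as np


def simulate_1pl(theta, b, rng):
    """Simulate 0/1 responses under the 1PL model.

    P(correct) = 1 / (1 + exp(-(theta - b))) for each person-item pair.
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)


def mantel_haenszel(item, total, is_focal):
    """Return (MH chi-square with continuity correction, common odds ratio).

    item     : 0/1 responses to the studied item
    total    : matching criterion (here, the total test score)
    is_focal : boolean array, True for focal-group examinees
    """
    sum_a = sum_ea = sum_var = num = den = 0.0
    for k in np.unique(total):
        stratum = total == k
        ref = stratum & ~is_focal
        foc = stratum & is_focal
        n_r, n_f = int(ref.sum()), int(foc.sum())
        if n_r == 0 or n_f == 0:
            continue  # a stratum with only one group carries no information
        n = n_r + n_f
        a = int(item[ref].sum())  # reference group, correct
        b = n_r - a               # reference group, incorrect
        c = int(item[foc].sum())  # focal group, correct
        d = n_f - c               # focal group, incorrect
        m1, m0 = a + c, b + d     # column totals: correct, incorrect
        sum_a += a
        sum_ea += n_r * m1 / n                            # E(a) under no DIF
        sum_var += n_r * n_f * m1 * m0 / (n**2 * (n - 1))  # Var(a) under no DIF
        num += a * d / n
        den += b * c / n
    if sum_var == 0 or den == 0:
        return float("nan"), float("nan")
    chi2 = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var
    alpha = num / den
    return chi2, alpha


# Hypothetical conditions (assumed for illustration only).
rng = np.random.default_rng(0)
n_per_group, n_items = 1000, 20
b_ref = np.linspace(-2.0, 2.0, n_items)
b_foc = b_ref.copy()
b_foc[0] += 0.6  # item 0 is harder for the focal group (uniform DIF)

theta_ref = rng.standard_normal(n_per_group)
theta_foc = rng.standard_normal(n_per_group)
responses = np.vstack([
    simulate_1pl(theta_ref, b_ref, rng),
    simulate_1pl(theta_foc, b_foc, rng),
])
is_focal = np.repeat([False, True], n_per_group)
total = responses.sum(axis=1)

chi2, alpha = mantel_haenszel(responses[:, 0], total, is_focal)
print(f"item 0: MH chi-square = {chi2:.2f}, common odds ratio = {alpha:.2f}")
```

The MH chi-square is referred to a chi-square distribution with one degree of freedom. A purification step would recompute `total` after dropping items flagged in an earlier pass and then rerun `mantel_haenszel`; that step is omitted from this sketch.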