Comparison of Item Response Theory Test Equating Methods for Mixed Format Tests

Author:

Year: 2016, Volume 8, Issue 2

Abstract

This study investigates the performance of test equating methods extended to mixed-format tests within the framework of Item Response Theory (IRT). To this end, a simulation study was conducted to compare the equating errors of the mean/mean, mean/sigma, robust mean/sigma, Haebara, and Stocking-Lord methods under different conditions. Using 40-item tests and samples of 1000 participants, the effects of anchor length (10%, 20%, and 30% of the test) and ability distribution (normal, negatively skewed, and positively skewed) were examined under the common-item nonequivalent groups design. Dichotomous responses were simulated with the three-parameter logistic model and polytomous responses with the generalized partial credit model. The results revealed that the robust mean/sigma method generally produced the highest equating errors. Across all conditions, the smallest equating error occurred with the Stocking-Lord method in the case of positively skewed groups and a long anchor test (30%). Moreover, groups with similar ability distributions (normal-normal, negatively skewed-negatively skewed, and positively skewed-positively skewed) produced smaller equating errors than groups with different ability distributions (negatively skewed-normal, positively skewed-normal, and positively skewed-negatively skewed).
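The five methods compared here all estimate the linear transformation (A, B) that places new-form item parameter estimates on the old-form scale. A sketch of the standard formulation (our notation, following Kolen & Brennan, 2004, cited in the references) is given below, with I the new-form scale, J the old-form scale, and V the set of anchor items:

```latex
% Linear linking from the new-form scale (I) to the old-form scale (J):
\[
\theta_J = A\,\theta_I + B, \qquad
b_{Jj} = A\,b_{Ij} + B, \qquad
a_{Jj} = a_{Ij}/A .
\]

% Moment methods, computed over the common (anchor) item estimates:
% mean/sigma (Marco, 1977) and mean/mean (Loyd & Hoover, 1980).
\[
A_{\mathrm{m/s}} = \frac{\sigma(\hat{b}_J)}{\sigma(\hat{b}_I)}, \qquad
A_{\mathrm{m/m}} = \frac{\mu(\hat{a}_I)}{\mu(\hat{a}_J)}, \qquad
B = \mu(\hat{b}_J) - A\,\mu(\hat{b}_I).
\]

% Characteristic-curve methods, minimized over ability points \theta_i and
% anchor items j in V; P*_j is the response probability after transforming
% the parameters. Haebara (1980) matches item characteristic curves:
\[
\mathrm{Hdiff}(A,B) = \sum_{i}\sum_{j\in V}
  \bigl[\hat{P}_j(\theta_i) - P^{*}_j(\theta_i; A, B)\bigr]^2 .
\]
% Stocking and Lord (1983) match the test characteristic curve:
\[
\mathrm{SLdiff}(A,B) = \sum_{i}
  \Bigl[\sum_{j\in V}\hat{P}_j(\theta_i) - \sum_{j\in V}P^{*}_j(\theta_i; A, B)\Bigr]^2 .
\]
```

The robust mean/sigma method (Linn, Levine, Hastings & Wardrop, 1981) computes the same moments after downweighting outlying difficulty estimates. A minimal numeric sketch of the two moment methods follows, using illustrative placeholder anchor-item estimates rather than the study's simulated data:

```python
# Minimal sketch (not the study's code) of the moment-based linking methods,
# following the formulation in Kolen & Brennan (2004). The parameter arrays
# below are illustrative placeholders for anchor-item estimates.
import numpy as np

b_old = np.array([-1.2, -0.4, 0.3, 1.1])   # anchor difficulties, old-form scale
b_new = np.array([-0.9, -0.1, 0.6, 1.4])   # anchor difficulties, new-form scale
a_old = np.array([0.8, 1.1, 1.3, 0.9])     # anchor discriminations, old-form scale
a_new = np.array([0.7, 1.0, 1.2, 0.8])     # anchor discriminations, new-form scale

# Mean/sigma (Marco, 1977): match mean and SD of the anchor difficulties.
A_ms = b_old.std() / b_new.std()
B_ms = b_old.mean() - A_ms * b_new.mean()

# Mean/mean (Loyd & Hoover, 1980): match the mean discriminations and difficulties.
A_mm = a_new.mean() / a_old.mean()
B_mm = b_old.mean() - A_mm * b_new.mean()

# Place new-form parameters on the old-form scale:
# theta* = A*theta + B,  b* = A*b + B,  a* = a/A.
b_rescaled = A_ms * b_new + B_ms
a_rescaled = a_new / A_ms

print(f"mean/sigma: A = {A_ms:.3f}, B = {B_ms:.3f}")
print(f"mean/mean:  A = {A_mm:.3f}, B = {B_mm:.3f}")
```

The Haebara and Stocking-Lord coefficients have no closed form and are found by numerically minimizing the losses above over a grid of ability points.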

Keywords

References


  • Anastasi, A. (1988). Psychological testing. New York: Macmillan.

  • Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington, DC: American Council on Education.

  • Baker, F. B. & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28(2), 147-162. Retrieved November 13, 2013, from http://www.jstor.org/stable/1434796.

  • Bastari, B. (2000). Linking multiple-choice and constructed-response items to a common proficiency scale. Unpublished doctoral dissertation, University of Massachusetts.

  • Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444. doi: 10.1177/014662168200600405

  • Cao, Y. (2008). Mixed-format test equating: Effects of test dimensionality and common-item sets. Unpublished doctoral dissertation, University of Maryland.

  • Chon, K. H., Lee, W. C. & Ansley, T. N. (2007). Assessing IRT model-data fit for mixed format tests. CASMA Research Report No. 26, November 2007.

  • Cohen, A. S. & Kim, S. (1998). An investigation of linking methods under the graded response model. Applied Psychological Measurement, 22(2), 116-130. doi: 10.1177/01466216980222002

  • Cook, L. L. & Eignor, D. R. (1991). IRT equating methods. Educational Measurement: Issues and Practice, 10(3), 37-45. doi: 10.1111/j.1745-3992.1991.tb00207.x

  • Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. USA: Harcourt Brace Jovanovich College.

  • Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3-17. doi:10.1207/s15324818ame0301_2

  • Dorans, N. J., Moses, T. P. & Eignor, D. R. (2010). Principles and practices of test score equating (ETS Research Report No. 41). Princeton, NJ: Educational Testing Service.

  • Fitzpatrick, A. R. & Yen, W. M. (2001). The effects of test length and sample size on the reliability and equating of tests composed of constructed-response items. Applied Measurement in Education, 14(1), 31-57. doi:10.1207/S15324818AME1401_04

  • Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144-149.

  • Hambleton, R. K., Swaminathan, H. & Rogers, H. J. (1991). Fundamentals of item response theory. USA: Sage.

  • Han, K. T. (2008). Impact of item parameter drift on test equating and proficiency estimates. Unpublished doctoral dissertation, University of Massachusetts Amherst.

  • Han, K. T. & Hambleton, R. K. (2007). User's manual: WinGen (Center for Educational Assessment Report No. 642). Amherst, MA: University of Massachusetts, Center for Educational Assessment.

  • Harris, D. J. & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6(3), 195-240. doi:10.1207/s15324818ame0603_3

  • Harwell, M., Stone, C. A., Hsu, T. C. & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101-125. doi: 10.1177/014662169602000201

  • He, Y. (2011). Evaluating equating properties for mixed-format tests. Unpublished doctoral dissertation, University of Iowa.

  • Hills, J. R., Subhiyah, R. G. & Hirsch, T. M. (1988). Equating minimum-competency tests: Comparisons of methods. Journal of Educational Measurement, 25(3), 221-231. Retrieved November 12, 2013, from http://www.jstor.org/stable/1434501.

  • Holland, P. W., von Davier, A. A., Sinharay, S. & Han, N. (2006). Testing the untestable assumptions of the chain and poststratification equating methods for the NEAT design. Research Report, June 2006. Princeton, NJ: Educational Testing Service.

  • Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educational Measurement, 40(1), 1-15. doi: 10.1111/j.1745-3984.2003.tb01093.x

  • Kang, T. & Petersen, N. S. (2009). Linking item parameters to a base scale. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA, April 2009.

  • Karaca, E. (2008). Ölçme ve değerlendirmede temel kavramlar [Basic concepts in measurement and evaluation]. In S. Erkan & M. Gömleksiz (Eds.), Eğitimde ölçme ve değerlendirme [Measurement and evaluation in education] (pp. 1-36). Ankara: Nobel.

  • Kilmen, S. (2010). Madde Tepki Kuramı’na dayalı test eşitleme yöntemlerinden kestirilen eşitleme hatalarının örneklem büyüklüğü ve yetenek dağılımına göre karşılaştırılması [A comparison of equating errors estimated from test equating methods based on Item Response Theory with respect to sample size and ability distribution]. Unpublished doctoral dissertation, Ankara University.

  • Kim, S. & Hanson, B. A. (2002). Test equating under the multiple-choice model. Applied Psychological Measurement, 26(3), 255-270. doi: 10.1177/0146621602026003002

  • Kim, S. & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19(4), 357-381. doi:10.1207/s15324818ame1904_7

  • Kim, S. & Lee, W. (2004). IRT scale linking methods for mixed-format tests (ACT Research Report 2004-5). Iowa City, IA: ACT, Inc.

  • Kolen, M. J. (1981). Comparison of traditional and Item Response Theory methods for equating tests. Journal of Educational Measurement, 18(1), 1-11. doi: 10.1111/j.1745-3984.1981.tb00838.x

  • Kolen, M. J. (1985). Standard errors of Tucker equating. Applied Psychological Measurement, 9(2), 209-223. doi: 10.1177/014662168500900209.

  • Kolen, M. J. (1988). An NCME instructional module on traditional equating methodology. Instructional Topics in Educational Measurement, Winter 1988.

  • Kolen, M. J. & Brennan, R. L. (2004). Test equating, scaling and linking (2nd ed.). USA: Springer.

  • Kubiszyn, T. & Borich, G. D. (2013). Educational testing and measurement: Classroom application and practice. New Jersey: Wiley.

  • Linn, R. L., Levine, M. V., Hastings, C. N. & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5(2), 159-173. doi: 10.1177/014662168100500202

  • Loyd, B. H. & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179-193. Retrieved November 12, 2013, from http://www.jstor.org/stable/1434833.

  • Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139-160. doi: 10.1111/j.1745-3984.1977.tb00033.x

  • Mbella, K. K. (2012). Data collection design for equivalent groups equating: Using a matrix stratification framework for mixed-format assessment. Unpublished doctoral dissertation, The University of North Carolina.

  • Meng, Y. (2012). Comparison of Kernel equating and item response theory equating methods. Unpublished doctoral dissertation, University of Massachusetts.

  • Mohandas, R. (1996). Test equating, problems and solutions: Equating English test forms for the Indonesian junior secondary school final examination administered in 1994. Unpublished doctoral dissertation, Flinders University of South Australia.

  • Montgomery, M. (2012). Investigation of IRT parameter recovery and classification accuracy in mixed format. Paper presented at the annual meeting of the National Council on Measurement in Education (University of Kansas, April, 2012).

  • Nitko, A. J. (2004). Educational assessment of students. New Jersey: Pearson.

  • Paek, I. & Young, M. J. (2005). Investigation of student growth recovery in a fixed-item linking procedure with a fixed-person prior distribution for mixed-format test data. Applied Measurement in Education, 18(2), 199-215. doi: 10.1207/s15324818ame1802_4

  • Petersen, N. S., Kolen, M. J. & Hoover, H. D. (1993). Scaling, norming and equating. In R. L. Linn (Ed.), Educational measurement (pp. 221-262). USA: The Oryx Press.

  • Sinharay, S. & Holland, P. W. (2007). Is it necessary to make anchor tests mini-versions of the tests being equated or can some restrictions be relaxed? Journal of Educational Measurement, 44(3), 249-275. doi: 10.1111/j.1745-3984.2007.00037.x

  • Skaggs, G. (1990). To match or not to match samples on ability for equating: A discussion of five articles. Applied Measurement in Education, 3(1), 105-113. doi: 10.1207/s15324818ame0301_8

  • Spence, P. D. (1996). The effect of multidimensionality on unidimensional equating with item response theory. Unpublished doctoral dissertation, University of Florida.

  • Stocking, M. L. & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201-210. doi: 10.1177/014662168300700208

  • Tate, R. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37(4), 329-346. doi: 10.1111/j.1745-3984.2000.tb01090.x.

  • Tian, F. (2011). A comparison of equating/linking using the Stocking-Lord method and concurrent calibration with mixed-format tests in the non-equivalent groups common-item design under IRT. Unpublished doctoral dissertation, Boston College.

  • von Davier, A. A. & Wilson, C. (2007). IRT true-score test equating: A guide through assumptions and applications. Educational and Psychological Measurement, 67(6), 940-957. doi: 10.1177/0013164407301543

  • Woldbeck, T. (1998). Basic concepts in modern methods of test equating. Paper presented at the annual meeting of the Southwest Psychological Association (New Orleans, April 11, 1998).
