Multiple-choice items (MCIs) are commonly used in high-stakes testing and classroom assessment because they yield reliable scores. However, recent literature has shown that item-writing guidelines are repeatedly violated when MCIs are created, which can also threaten reliability and validity. A further threat to validity arises when items favor certain groups even though those groups have the same underlying ability; this is called differential item functioning (DIF). This empirical study compares item parameters for MCIs with negatively worded stems and for complex MCIs, two commonly used formats that violate item-writing guidelines, and investigates gender-related DIF associated with these formats. The results showed that DIF detection methods flagged two complex MCIs as favoring male students, attributed to the item format and to male students' greater tendency to take risks when answering MCIs.
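The abstract does not state which DIF detection procedure was used, but a common screening approach for dichotomous items is binary logistic regression with the total score as the matching criterion. The sketch below is a minimal illustration of that generic procedure, not the authors' actual analysis; the column names ("item_01", the 0/1 gender indicator) and the use of McFadden's pseudo-R-squared as the effect-size measure are assumptions made for the example.

# Minimal sketch of logistic-regression DIF screening (uniform + nonuniform DIF),
# illustrative only; column names and effect-size choice are hypothetical.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def dif_logistic(responses: pd.DataFrame, group: pd.Series, item: str):
    """Likelihood-ratio DIF test for one dichotomous item.

    responses : 0/1 item-score matrix (examinees x items)
    group     : 0/1 focal-group indicator (e.g., 1 = male)
    item      : column name of the studied item
    """
    y = responses[item]
    total = responses.drop(columns=item).sum(axis=1)   # matching criterion

    # Nested models: ability only -> + group -> + group x ability interaction
    X1 = sm.add_constant(pd.DataFrame({"total": total}))
    X2 = sm.add_constant(pd.DataFrame({"total": total, "group": group}))
    X3 = sm.add_constant(pd.DataFrame({"total": total, "group": group,
                                       "inter": total * group}))

    m1 = sm.Logit(y, X1).fit(disp=0)
    m2 = sm.Logit(y, X2).fit(disp=0)
    m3 = sm.Logit(y, X3).fit(disp=0)

    lr = 2 * (m3.llf - m1.llf)              # joint test of uniform + nonuniform DIF
    p = chi2.sf(lr, df=2)
    delta_r2 = m3.prsquared - m1.prsquared  # pseudo-R2 difference as effect size
    return {"item": item, "LR(2df)": lr, "p": p, "delta_R2": delta_r2}

# Example: screen every item and flag those with small p and a non-trivial effect size.
# results = [dif_logistic(scores, gender, col) for col in scores.columns]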
@article{kaplan2019gender,
  title   = {Examining Gender Bias in Multiple Choice Item Formats Violating Item-Writing Guidelines},
  author  = {Mehmet Kaplan and Erkan Hasan Atalmış},
  year    = {2019},
  journal = {International Online Journal of Educational Sciences},
  doi     = {10.15345/iojes.2019.01.015}
}