Multiple-choice items (MCIs) are commonly used in high-stakes testing and classroom assessment because they yield reliable scores. However, recent literature has shown that item-writing guidelines are repeatedly violated when MCIs are created, which can also threaten reliability and validity. A further threat to validity arises when items favor certain groups even though those groups have the same underlying ability; this is called differential item functioning (DIF). This empirical study compares item parameters for MCIs with negatively worded stems and for complex MCIs, two commonly used formats that violate item-writing guidelines, and investigates gender-related DIF associated with these formats. The results showed that DIF detection methods flagged two complex MCIs as favoring male students, attributed to the item format and to male students' greater tendency to take risks when answering MCIs.
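The abstract does not state which DIF detection procedure was used, but a common screening approach for dichotomous items is binary logistic regression with the total score as the matching criterion. The sketch below is a minimal illustration of that generic procedure, not the authors' actual analysis; the column names ("item_01", the 0/1 gender indicator) and the use of McFadden's pseudo-R-squared as the effect-size measure are assumptions made for the example.

# Minimal sketch of logistic-regression DIF screening (uniform + nonuniform DIF),
# illustrative only; column names and effect-size choice are hypothetical.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def dif_logistic(responses: pd.DataFrame, group: pd.Series, item: str):
    """Likelihood-ratio DIF test for one dichotomous item.

    responses : 0/1 item-score matrix (examinees x items)
    group     : 0/1 focal-group indicator (e.g., 1 = male)
    item      : column name of the studied item
    """
    y = responses[item]
    total = responses.drop(columns=item).sum(axis=1)   # matching criterion

    # Nested models: ability only -> + group -> + group x ability interaction
    X1 = sm.add_constant(pd.DataFrame({"total": total}))
    X2 = sm.add_constant(pd.DataFrame({"total": total, "group": group}))
    X3 = sm.add_constant(pd.DataFrame({"total": total, "group": group,
                                       "inter": total * group}))

    m1 = sm.Logit(y, X1).fit(disp=0)
    m2 = sm.Logit(y, X2).fit(disp=0)
    m3 = sm.Logit(y, X3).fit(disp=0)

    lr = 2 * (m3.llf - m1.llf)              # joint test of uniform + nonuniform DIF
    p = chi2.sf(lr, df=2)
    delta_r2 = m3.prsquared - m1.prsquared  # pseudo-R2 difference as effect size
    return {"item": item, "LR(2df)": lr, "p": p, "delta_R2": delta_r2}

# Example: screen every item and flag those with small p and a non-trivial effect size.
# results = [dif_logistic(scores, gender, col) for col in scores.columns]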
@article{kaplan2019gender,
  title   = {Examining Gender Bias in Multiple Choice Item Formats Violating Item-Writing Guidelines},
  author  = {Mehmet Kaplan and Erkan Hasan Atalmış},
  year    = {2019},
  journal = {International Online Journal of Educational Sciences},
  doi     = {10.15345/iojes.2019.01.015}
}