Examining Rater Reliability When Using an Analytical Rubric for Oral Presentation Assessments
Abstract
The assessment of English speaking in EFL environments is inherently subjective, influenced by factors beyond linguistic ability, including the choice of assessment criteria and even the type of rubric used. For classroom assessment of English speaking tasks, the recommended rubric type is the analytical rubric. Driven by three aims, this study analyzes the scores and comments from two raters on 28 video-recorded oral presentations by Thai engineering students, assessed with a detailed analytical rubric covering content, delivery, and visuals. First, it investigates rater reliability by comparing the raters' scores using the intraclass correlation coefficient (ICC) and ANOVA. Second, applying generalizability theory (G-theory), it examines the correlations between the scores to understand how the different assessment dimensions relate to one another and contribute to a comprehensive evaluation of speaking proficiency. Third, it performs a thematic analysis of the raters' comments to gain a deeper understanding of their rationale. The findings suggest that increasing the number of raters improves the reliability of the ratings, although diminishing returns appear above a certain threshold. Several key themes also emerged in relation to the rubric criteria. The study highlights the crucial role of detailed analytical rubrics and of cooperation sessions between raters in improving the reliability of oral EFL assessments.
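To make the reliability reasoning concrete, the sketch below shows, under stated assumptions, how ICC estimates of the kind used in this study can be derived from a two-way ANOVA decomposition, and how a Spearman-Brown projection reproduces the diminishing-returns pattern the abstract describes. The scores are simulated stand-ins (28 presentations, 2 raters), not the study's data, and `icc_two_way` is a hypothetical helper written for illustration, not code from the study.

```python
import numpy as np

# Simulated stand-in scores: 28 presentations (rows) rated by 2 raters
# (columns) on a 0-100 scale. Illustrative only; not the study's data.
rng = np.random.default_rng(seed=42)
true_ability = rng.normal(loc=70, scale=10, size=28)
scores = np.column_stack(
    [true_ability + rng.normal(0, 5, size=28) for _ in range(2)]
)

def icc_two_way(X):
    """ICC(2,1) and ICC(2,k): two-way random-effects model,
    absolute agreement (Shrout & Fleiss conventions)."""
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)          # per-presentation means
    col_means = X.mean(axis=0)          # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between raters
    resid = X - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))         # residual error
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

icc1, icck = icc_two_way(scores)
print(f"ICC(2,1), single rater:   {icc1:.3f}")
print(f"ICC(2,k), rater average:  {icck:.3f}")

# Spearman-Brown projection: reliability of the mean of m raters,
# given single-rater reliability icc1. Gains flatten as m grows,
# matching the diminishing returns noted in the abstract.
for m in (1, 2, 3, 5, 10):
    rel_m = m * icc1 / (1 + (m - 1) * icc1)
    print(f"projected reliability with {m:2d} raters: {rel_m:.3f}")
```

ICC(2,1) estimates the reliability of a single rater's scores, while ICC(2,k) estimates the reliability of the k-rater average; the Spearman-Brown projection makes explicit why adding raters raises reliability only up to a point.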