Examining Rater Reliability When Using an Analytical Rubric for Oral Presentation Assessments

Sasithorn Limgomolvilas
Patsawut Sukserm

Abstract

The assessment of English speaking in EFL environments is inherently subjective and influenced by factors beyond linguistic ability, including the choice of assessment criteria and even the rubric type. In classroom assessment, the analytical rubric is the type recommended for English speaking tasks. Driven by three aims, this study analyzes the scores and comments of two raters on 28 video-recorded oral presentations by Thai engineering students, using a detailed analytical rubric covering content, delivery, and visuals. First, it investigates rater reliability by comparing the raters' scores using the intraclass correlation coefficient (ICC) and ANOVA. Second, applying generalizability theory (G-theory), it examines the correlations between scores to understand the relationships among the assessment dimensions and how each contributes to a comprehensive evaluation of speaking proficiency. Third, it performs a thematic analysis of the raters' comments to gain a deeper understanding of their rationale. The findings suggest that a higher number of raters increases the reliability of the ratings, although diminishing returns are observed above a certain threshold. Several key themes also emerged in relation to the criteria. The study highlights the crucial role of detailed analytical rubrics and cooperation sessions between raters in improving the reliability of oral EFL assessments.
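To illustrate the reliability finding, the sketch below shows how a G-theory decision study projects reliability as raters are added. It is not the authors' analysis: it estimates person, rater, and residual variance components from a small, fabricated persons-by-raters score matrix via two-way ANOVA, then computes the dependability coefficient (phi) for an increasing number of raters.

```python
# Minimal G-theory D-study sketch (illustrative only; the score
# matrix below is fabricated and is NOT data from this study).
import numpy as np

scores = np.array([  # rows = presenters, columns = raters
    [3.5, 4.0],
    [2.5, 3.0],
    [4.0, 4.5],
    [3.0, 3.0],
    [2.0, 2.5],
    [4.5, 4.0],
])
n, k = scores.shape
grand = scores.mean()

# Two-way ANOVA sums of squares (persons x raters, one score per cell)
ss_person = k * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_rater = n * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_resid = np.sum((scores - grand) ** 2) - ss_person - ss_rater

ms_person = ss_person / (n - 1)
ms_rater = ss_rater / (k - 1)
ms_resid = ss_resid / ((n - 1) * (k - 1))

# Variance components for persons, raters, and residual error
var_p = max((ms_person - ms_resid) / k, 0.0)
var_r = max((ms_rater - ms_resid) / n, 0.0)
var_e = ms_resid

# D-study: dependability (phi) as the number of raters m grows
for m in range(1, 7):
    phi = var_p / (var_p + (var_r + var_e) / m)
    print(f"raters = {m}: phi = {phi:.3f}")
```

With the fabricated data above, phi rises sharply from one rater to two and then flattens, which is the diminishing-returns pattern the abstract describes; the m = 1 case is algebraically equivalent to ICC(2,1) for a single rater.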

Article Details

How to Cite
Limgomolvilas, S., & Sukserm, P. (2025). Examining Rater Reliability When Using an Analytical Rubric for Oral Presentation Assessments. LEARN Journal: Language Education and Acquisition Research Network, 18(1), 110–134. https://doi.org/10.70730/JQGY9980
Section
Research Articles
Author Biographies

Sasithorn Limgomolvilas, Chulalongkorn University Language Institute, Thailand

A full-time lecturer at Chulalongkorn University Language Institute (CULI). She holds a Bachelor's degree in Education from Chulalongkorn University and a Master's degree in Teaching English to Speakers of Other Languages (TESOL) from San Francisco State University. She later taught at CULI and earned a Ph.D. in English as an International Language (EIL) from Chulalongkorn University. Her primary research interests are ESP and language assessment.

Patsawut Sukserm, Chulalongkorn University Language Institute, Thailand

A full-time English lecturer at Chulalongkorn University Language Institute, Thailand. He holds a Ph.D. in English as an International Language from the Graduate School, Chulalongkorn University. He also holds a B.A. in English (first-class honors) from Ramkhamhaeng University, a B.S. in Statistics, and an M.A. in English as an International Language from Chulalongkorn University. His research interests include quantitative research, language testing and assessment, and English language teaching.
