The Use of ASR-Equipped Software in the Teaching of Suprasegmental Features of Pronunciation
A Critical Review
DOI: https://doi.org/10.1558/cj.19033

Keywords: language teaching; automatic speech recognition (ASR) tools; computer-assisted pronunciation training; suprasegmentals; pronunciation instruction; automatic speech recognition

Abstract
Technology has paved the way for new modalities in language learning, teaching, and assessment. However, a great deal of work remains to develop such tools for oral communication, specifically tools that address suprasegmental features in pronunciation instruction. This critical literature review therefore examines how researchers have tried to create computer-assisted pronunciation training tools that use automatic speech recognition (ASR) systems to aid language learners in the perception and production of suprasegmental features. We examined 30 texts published between 1990 and 2020 to explore how technologies have been, and are currently being, used to help learners develop their proficiency with suprasegmental features. Our thematic analysis shows that a persistent gap exists between the ASR-equipped software available to participants in research studies and what is available to university and classroom teachers and students. Additionally, development appears to be concentrated in speech software for language assessment, whereas the translation of these tools into instructional tools for individualized learning seems to be almost non-existent. Moving forward, we recommend that more commercial ASR-based pronunciation systems be made publicly available, building on the technologies already developed, or in development, for oral proficiency assessment.
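The review itself reports no implementations, but the intonation feedback that CAPT tools of this kind provide ultimately rests on pitch (F0) tracking over the learner's speech. As a minimal, self-contained illustration of that underlying idea, and not the actual method of any system reviewed here, the following sketch estimates F0 by autocorrelation on a synthetic tone (the function name and parameter values are hypothetical choices for the example):

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one audio frame
    via the autocorrelation method: the lag of the strongest
    self-similarity peak corresponds to one pitch period."""
    frame = frame - frame.mean()                      # remove DC offset
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                          # smallest lag = highest pitch
    lag_max = int(sr / fmin)                          # largest lag = lowest pitch
    lag = lag_min + np.argmax(corr[lag_min:lag_max])  # best lag in the voice range
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                                # one second of samples
tone = np.sin(2 * np.pi * 120 * t)                    # synthetic 120 Hz "voice"
print(round(estimate_f0(tone[:1024], sr)))            # → 120
```

Running such an estimator frame by frame yields the pitch contour that an ASR-equipped CAPT system could compare against a reference rendering of the same utterance; production tools use more robust trackers, but the measurement target is the same.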
References
References marked with an asterisk indicate studies included in the review.
*Al-Qudah, F. Z. M. (2012). Improving English pronunciation through computer-assisted programs in Jordanian universities. Journal of College Teaching & Learning (TLC), 9(3), 201–208. https://doi.org/10.19030/tlc.v9i3.7085
*Anderson-Hsieh, J. (1992). Using electronic visual feedback to teach suprasegmentals. System, 20(1), 51–62. https://doi.org/10.1016/0346-251X(92)90007-P
Anderson-Hsieh, J., Johnson, R., & Koehler, K. (1992). The relationship between native speaker judgments of nonnative pronunciation and deviance in segmentals, prosody, and syllable structure. Language Learning, 42(4), 529–555. https://doi.org/10.1111/j.1467-1770.1992.tb01043.x
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85–100. https://doi.org/10.1016/j.specom.2013.07.008
Chapelle, C. A., & Chung, Y. R. (2010). The promise of NLP and speech processing technologies in language assessment. Language Testing, 27(3), 301–315. https://doi.org/10.1177/0265532210364405
*Chen, L., Zechner, K., Yoon, S.-Y., Evanini, K., Wang, X., Loukina, A., Tao, J., Davis, L., Lee, C. M., Ma, M., Mundkowsky, R., Lu, C., Leong, C. W., & Gyawali, B. (2018). Automated scoring of nonnative speech using the SpeechRaterSM v. 5.0 Engine. ETS Research Report Series, 2018(1), 1–31. https://doi.org/10.1002/ets2.12198
Chun, D. M. (1989). Teaching tone and intonation with microcomputers. CALICO Journal, 7(1), 21–46. https://doi.org/10.1558/cj.v7i1.21-46
*Cox, T., & Davies, R. (2012). Using automated speech recognition technology with elicited oral response testing. CALICO Journal, 29(4), 601–618. https://doi.org/10.11139/cj.29.4.601-618
*Cucchiarini, C., Strik, H., & Boves, L. (1997). Automatic evaluation of Dutch pronunciation by using speech recognition technology. In 1997 IEEE workshop on automatic speech recognition and understanding proceedings (pp. 622–629). New York: IEEE.
*Delmonte, R. (2000). SLIM prosodic automatic tools for self-learning instruction. Speech Communication, 30(1), 145–166. https://doi.org/10.1016/S0167-6393(99)00043-6
*Delmonte, R. (2002). Feedback generation and linguistic knowledge in “SLIM” automatic tutor. ReCALL, 14(2), 209–234. https://doi.org/10.1017/S0958344002000320
Derwing, T. M., Munro, M. J., & Wiebe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning, 48(3), 393–410. https://doi.org/10.1111/0023-8333.00047
*Ding, S., Liberatore, C., Sonsaat, S., Lučić, I., Silpachai, A., Zhao, G., Chukharev-Hudilainen, E., Levis, J., & Gutierrez-Osuna, R. (2019). Golden speaker builder—an interactive tool for pronunciation training. Speech Communication, 115, 51–66. https://doi.org/10.1016/j.specom.2019.10.005
Dixon, D. H. (2018). Use of technology in teaching pronunciation skills. In J. I. Liontas (Ed.), The TESOL encyclopedia of English language teaching (pp. 1–7). Hoboken: Wiley. https://doi.org/10.1002/9781118784235.eelt0692
*Evanini, K., & Wang, X. (2013). Automated speech scoring for nonnative middle school students with multiple task types. In Proceedings of Interspeech (pp. 2435–2439). 14th Annual Conference of the ISCA, Lyon. http://evanini.com/papers/evaniniWang2013toefljr.pdf; https://doi.org/10.21437/Interspeech.2013-566
*Fergadiotis, G., Gorman, K., & Bedrick, S. (2016). Algorithmic classification of five characteristic types of paraphasias. American Journal of Speech-Language Pathology, 25, S776–S787. https://doi.org/10.1044/2016_AJSLP-15-0147
*Holland, M., Kaplan, J., & Sabol, M. (1999). Preliminary tests of language learning in a speech-interactive graphics microworld. CALICO Journal, 16(3), 339–359. https://doi.org/10.1558/cj.v16i3.339-359
Johnson, D. O., & Kang, O. (2016). Automatic detection of Brazil’s prosodic tone unit. In Proceedings of speech prosody (pp. 287–291). Boston: ISCA. https://doi.org/10.21437/SpeechProsody.2016-59
*Johnson, W. L., & Valente, A. (2009). Tactical language and culture training systems: Using AI to teach foreign languages and cultures. AI Magazine, 30(2), 72. https://doi.org/10.1609/aimag.v30i2.2240
*Kang, O., & Johnson, D. (2018). The roles of suprasegmental features in predicting English oral proficiency with an automated system. Language Assessment Quarterly, 15(2), 150–168. https://doi.org/10.1080/15434303.2018.1451531
Kang, O., Rubin, D. O. N., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Modern Language Journal, 94(4), 554–566. https://doi.org/10.1111/j.1540-4781.2010.01091.x
*Komatsu, T., Utsunomiya, A., Suzuki, K., Ueda, K., Hiraki, K., & Oka, N. (2005). Experiments toward a mutual adaptive speech interface that adopts the cognitive features humans use for communication and induces and exploits users’ adaptations. International Journal of Human-Computer Interaction, 18(3), 243–268. https://doi.org/10.1207/s15327590ijhc1803_1
Lee, J., Jang, J., & Plonsky, L. (2015). The effectiveness of second language pronunciation instruction: A meta-analysis. Applied Linguistics, 36(3), 345–366. https://doi.org/10.1093/applin/amu040
Levis, J. (2007). Computer technology in teaching and researching pronunciation. Annual Review of Applied Linguistics, 27, 184–202. https://doi.org/10.1017/S0267190508070098
Levis, J. (2016). Research into practice: How research appears in pronunciation teaching materials. Language Teaching, 49(3), 423–437. https://doi.org/10.1017/S0261444816000045
*Liu, Y., Chawla, N. V., Harper, M. P., Shriberg, E., & Stolcke, A. (2006). A study in machine learning from imbalanced data for sentence boundary detection in speech. Computer Speech and Language, 20(4), 468–494. https://doi.org/10.1016/j.csl.2005.06.002
*Mansour, S. (2014). Generation of suprasegmental information for speech using a recurrent neural network and binary gravitational search algorithm for feature selection. Applied Intelligence, 40, 772–790. https://doi.org/10.1007/s10489-013-0505-x
*Masmoudi, A., Bougares, F., Ellouze, M., Estève, Y., & Belguith, L. (2018). Automatic speech recognition system for Tunisian dialect. Language Resources and Evaluation, 52(1), 249–267. https://doi.org/10.1007/s10579-017-9402-y
McCrocklin, S. M. (2016). Pronunciation learner autonomy: The potential of automatic speech recognition. System, 57, 25–42. https://doi.org/10.1016/j.system.2015.12.013
*Ming, Y., Ruan, Q., & Gao, G. (2013). A Mandarin edutainment system integrated virtual learning environments. Speech Communication, 55(1), 71–83. https://doi.org/10.1016/j.specom.2012.06.007
Mora, J., & Levkina, M. (2017). Task-based pronunciation teaching and research: Key issues and future directions. Studies in Second Language Acquisition, 39, 381–399. https://doi.org/10.1017/S0272263117000183
Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy–technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5), 441–467. https://doi.org/10.1076/call.15.5.441.13473
Pearson Education, Inc. (2015). Versant English test. https://www.versanttest.com/products/english.jsp
Pennington, M. (1999). Computer-aided pronunciation pedagogy: Promise, limitations, directions. Computer Assisted Language Learning, 12(5), 427–440. https://doi.org/10.1076/call.12.5.427.5693
Probst, K., Ke, Y., & Eskenazi, M. (2002). Enhancing foreign language tutors—in search of the golden speaker. Speech Communication, 37(3–4), 423–441. https://doi.org/10.1016/S0167-6393(01)00009-7
Saito, K. (2012). Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi-experimental intervention studies. TESOL Quarterly, 46(4), 842–854. https://doi.org/10.1002/tesq.67
Saito, K., & Plonsky, L. (2019). Effects of second language pronunciation teaching revisited: A proposed measurement framework and meta-analysis. Language Learning, 69(3), 652–708. https://doi.org/10.1111/lang.12345
*Scherrer, Y., Samardzic, T., & Glaser, E. (2019). Digitising Swiss German: How to process and study a polycentric spoken language. Language Resources & Evaluation, 53, 735–769. https://doi.org/10.1007/s10579-019-09457-5
*Setter, J., & Jenkins, J. (2005). State-of-the-art review article. Language Teaching, 38(1), 1–17. https://doi.org/10.1017/S026144480500251X
*Shahin, I. M. A. (2012). Speaker identification investigation and analysis in unbiased and biased emotional talking environments. International Journal of Speech Technology, 15(3), 325–334. https://doi.org/10.1007/s10772-012-9156-2
*Shahin, I. M. A. (2013). Gender-dependent emotion recognition based on HMMs and SPHMMs. International Journal of Speech Technology, 16(2), 133–141. https://doi.org/10.1007/s10772-012-9170-4
*Shahin, I., & Nassif, A. B. (2018). Three-stage speaker verification architecture in emotional talking environments. International Journal of Speech Technology, 21(4), 915–930. https://doi.org/10.1007/s10772-018-9543-4
*Soonklang, T., Damper, R., & Marchand, Y. (2008). Multilingual pronunciation by analogy. Natural Language Engineering, 14(4), 527–546. https://doi.org/10.1017/S1351324908004737
Surface, E., & Dierdorff, E. (2007). Special operations language training software measurement of effectiveness study: Tactical Iraqi study final report. Tampa, FL: U.S. Army Special Operations Forces Language Office.
*Tamburini, F., & Caini, C. (2005). An automatic system for detecting prosodic prominence in American English continuous speech. International Journal of Speech Technology, 8, 33–44. https://doi.org/10.1007/s10772-005-4760-z
Tanaka, R. (2000). Automatic speech recognition and language learning. Journal of Wayo Women’s University, 40, 53–62.
Taylor, J., & Kochem, T. (2020). Access and empowerment in digital language learning, maintenance, and revival: A critical literature review. Diaspora, Indigenous, and Minority Education, 1–12. https://doi.org/10.1080/15595692.2020.1765769
Thomson, R. I., & Derwing, T. M. (2015). The effectiveness of L2 pronunciation instruction: A narrative review. Applied Linguistics, 36(3), 326–344. https://doi.org/10.1093/applin/amu076
Van Compernolle, D. (2001). Recognizing speech of goats, wolves, sheep and ... nonnatives. Speech Communication, 35(1–2), 71–79. https://doi.org/10.1016/S0167-6393(00)00096-0
*Vojtech, J. M., Noordzij, J. P., Cler, G. J., & Stepp, C. E. (2019). The effects of modulating fundamental frequency and speech rate on the intelligibility, communication efficiency, and perceived naturalness of synthetic speech. American Journal of Speech-Language Pathology, 28, 875–886. https://doi.org/10.1044/2019_AJSLP-MSC18-18-0052
*Walker, N., Trofimovich, P., Cedergren, H., & Gatbonton, E. (2011). Using ASR technology in language training for specific purposes: A perspective from Quebec, Canada. CALICO Journal, 28(3), 721–743. https://doi.org/10.11139/cj.28.3.721-743
*Wang, F., Sahli, H., Gao, J., Jiang, D., & Verhelst, W. (2015). Relevance units machine based dimensional and continuous speech emotion prediction. Multimedia Tools and Applications, 74, 9983–10000. https://doi.org/10.1007/s11042-014-2319-1
*Ward, M. (2015). I’m a useful NLP tool—get me out of here. In F. Helm, L. Bradley, M. Guarda, & S. Thouësny (Eds.), Critical CALL—proceedings of the 2015 EUROCALL Conference, Padova, Italy (pp. 553–557). Dublin: Research-publishing.net. https://doi.org/10.14705/rpnet.2015.000392
*Witt, S. M., & Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30(2–3), 95–108. https://doi.org/10.1016/S0167-6393(99)00044-8