When we communicate by speaking or writing, each of us does so in a unique way. The language we speak, whether Dutch, Vietnamese, or Tagalog; the words we choose to ask a question or describe a situation; the dialect that colors our speech; and the sound of our voice are only some of the aspects that differ from person to person. Today’s language technology needs to handle this individual variation: tools that have become omnipresent in our daily lives, such as voice assistants, chatbots, and translation engines, are only accessible to us if the language models behind them can cope with our unique way of speaking.
As an example, consider an elderly woman from Scotland who would like to use a voice assistant in her home to set a timer, switch off the light, or call her daughter. Many speech-to-text models still struggle to recognize the voices of women or of elderly speakers. Beyond the acoustic qualities of her voice, the fact that she speaks a variety of English for which fewer resources are available than for Standard American or British English makes it harder for her to make herself understood by the machine. But why is it so difficult for many current language technology solutions to handle linguistic variation?
Many of the most commonly used models for processing language data require training a large statistical model on example data. Well-known examples include the large language models released by OpenAI (such as GPT-3 or the soon-to-appear GPT-4), which are trained on vast amounts of text data from different sources available on the internet. Similarly, models built to transform speech into text require training data in the form of recorded speech and corresponding text transcriptions.
Crucially, the performance of these kinds of models depends on the data used to train them. This means they perform best on the kind of language present in the texts or recordings that served as training data. However, training data with enough variability to account for linguistic diversity can be hard to come by. To address this shortcoming, models can be improved by deliberately including more diverse training data; for speech-to-text models, this means including recordings from speakers of different ages and genders and with a variety of accents and dialects. Models can also be designed to handle multiple languages at once, making them more flexible and better able to adapt to input that differs from the training data. These are only some of the ways current language technology can be improved to better grasp the structure of different language varieties.
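A natural first step toward more inclusive models is to measure where they fall short. As a minimal sketch (in Python, with invented example data and group labels), the snippet below computes the word error rate, a standard speech-recognition metric, separately for each speaker group, so that a gap for, say, Scottish English becomes visible instead of being averaged away:

```python
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical evaluation set: (speaker group, reference transcript, model output).
samples = [
    ("Scottish English", "set a timer for ten minutes", "set a timer for ten minutes"),
    ("Scottish English", "switch off the light", "switch of the night"),
    ("Standard American", "call my daughter", "call my daughter"),
]

scores = defaultdict(list)
for group, reference, hypothesis in samples:
    scores[group].append(word_error_rate(reference, hypothesis))

for group, values in scores.items():
    print(f"{group}: average WER = {sum(values) / len(values):.2f}")
```

Once such gaps are quantified per group, they point directly to the kinds of recordings that are missing from the training data.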
In addition to variation between speakers, language is constantly evolving: words for new concepts enter a language (such as “whataboutism” or “doomscrolling”), some words become more common while others fall out of use, and even the grammar of a language changes over time. For this reason, language models benefit from continuous evaluation of how well they handle current language use and, where necessary, from being updated accordingly.
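To illustrate what such monitoring could look like, here is another minimal sketch (again in Python, with invented toy corpora rather than real data): it compares relative word frequencies between an older and a newer text sample, so that vocabulary gaining ground, like “doomscrolling”, rises to the top:

```python
from collections import Counter

def relative_frequencies(corpus: list[str]) -> Counter:
    """Per-word frequency, normalized by the total number of words."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    total = sum(counts.values())
    return Counter({word: count / total for word, count in counts.items()})

# Hypothetical corpora from two time periods.
corpus_2015 = ["the news cycle moves fast", "scrolling through the news"]
corpus_2022 = ["doomscrolling through the news again", "whataboutism in the news cycle"]

old, new = relative_frequencies(corpus_2015), relative_frequencies(corpus_2022)

# Words whose relative frequency rose the most; brand-new words score highest.
drift = sorted(new, key=lambda w: new[w] - old.get(w, 0.0), reverse=True)
print("Rising vocabulary:", drift[:3])
```

A rising word that the model handles poorly is a signal that the training data, or the model itself, is due for a refresh.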
Recognizing that language varies between speakers is the first step toward making language technology accessible to everyone. It enables interacting with technology in a natural, conversational way: in your very own way of speaking. Would you like to learn more about how we address linguistic diversity in our Ally platform? Don’t hesitate to contact us at result@y.digital!