Advancing Inclusion with AI and Real-Time Captioning

By: Alex Kozlov

Communication Access Real-Time Translation (CART) services transcribe words into text or captions as they are spoken during a classroom lecture, business meeting, public speech, sports event, or arts performance. For people who are deaf or hard-of-hearing, aren’t fluent in English, or have auditory processing disorders, CART services are essential to enable basic understanding and participation. In addition, by providing written documentation of an event in real time, CART enhances accuracy and information retention.

Traditionally, CART services have been delivered by highly skilled stenographers. Besides learning to type up to 260 words per minute, a CART transcriber often needs specialized training in fields like law or medicine to accurately capture arcane terminology. Because these skills are rare and in demand, CART services tend to be expensive and hard to find.

Today, advances in artificial intelligence (AI) and speech recognition are redefining existing approaches to real-time written translation. By combining human skills with smart software, these innovations offer the potential to significantly expand the availability of CART services. Easier access to CART services, meanwhile, promises to create new opportunities to improve communication and enhance inclusion for people with unique learning styles.

ASR: transcription and context

Automatic speech recognition (ASR) software recognizes speech on two fronts. At the acoustic level, an ASR application “listens” to the sound of a spoken word and produces text corresponding to that sound. Although the task sounds straightforward, translating sounds into written words poses a variety of challenges. These include understanding accents, jargon, and vocal inflections, as well as filtering out background noise. In recent years, ASR has made significant progress in understanding accents, reducing the need to train an application on an individual’s voice, and lessening sensitivity to the surrounding environment.

In addition to recognizing sounds, ASR applications deploy natural language processing models that analyze the broader meaning of combinations of words. This contextual framework helps the program, among other things, determine proper spelling and usage. For example, in the statement, “I like my steak medium-rare,” the words “medium-rare” provide a context suggesting that the statement relates to “steak” as a food, rather than a “stake” in an organization, or a “stake” driven into the ground. By the same logic, the program recognizes that “stake in the ground” refers to a wooden or metal post rather than a slab of meat. Contextual analysis can go a step further and determine that “stake in the ground” is likely an idiomatic expression rather than a literal statement. Based on that determination, the program can more accurately predict the context of the rest of the discussion.
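The steak/stake example can be sketched in a few lines of Python. This is a toy illustration only, not how any production ASR system works: the cue-word lists, the `pick_spelling` function, and the `___` placeholder are all hypothetical, and a real application would rely on a trained language model rather than hand-written word lists.

```python
# Toy sketch: choose between homophones by counting context cue words.
# CONTEXT_CUES is a hypothetical, hand-written list; a real ASR system
# would learn these associations from large amounts of text.
CONTEXT_CUES = {
    "steak": {"medium-rare", "grilled", "dinner", "cooked"},
    "stake": {"ground", "company", "organization", "ownership"},
}


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [word.strip(".,!?\"'").lower() for word in text.split()]


def pick_spelling(context_text, candidates=("steak", "stake")):
    """Return the candidate whose cue words overlap most with the context.

    Ties (including no cues at all) fall back to the first candidate.
    """
    words = set(tokenize(context_text))
    scores = {c: len(words & CONTEXT_CUES[c]) for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


if __name__ == "__main__":
    print(pick_spelling("I like my ___ medium-rare"))
    print(pick_spelling("He put a ___ in the ground"))
```

In this sketch, the surrounding words do the work that the acoustic signal cannot: both spellings sound identical, so only the context tips the choice one way or the other.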

Offline vs. real time

Today’s ASR tools are quite adept at transcribing audio recordings after the fact. Readily available, affordable tools allow a user to upload an audio file and receive an accurate transcription in as little as five to ten minutes. However, because the transcription is done offline, these tools have the luxury of analyzing the entire discussion before producing any text. Being able to look both backward and forward allows the program to identify and review the overall context of the discussion, and as a result provide a much more accurate outcome. A CART application, meanwhile, faces the much more difficult task of conducting the contextual analysis on the fly. This means the application must assess the context of each word and sentence as it’s spoken, as well as predict the context of words that have not yet been spoken.
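The difference between the two modes can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not a description of any real CART product: the `AMBIGUOUS` token, the `CUES` lists, the `DEFAULT_GUESS` fallback, and both `transcribe_*` functions are hypothetical names invented for this example.

```python
# Toy sketch: the same ambiguous word resolved offline (full context)
# versus in real time (only the words heard so far).
AMBIGUOUS = "<steak|stake>"  # hypothetical marker for a homophone
CUES = {
    "steak": {"medium-rare", "grilled", "dinner"},
    "stake": {"ground", "company", "ownership"},
}
DEFAULT_GUESS = "stake"  # arbitrary fallback when no cue is available


def resolve(context_words):
    """Pick the spelling with the most cue words in the given context."""
    context = set(context_words)
    scores = {spelling: len(context & cues) for spelling, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_GUESS


def transcribe_offline(words):
    """Offline: every ambiguous word can see the entire utterance."""
    return [resolve(words) if w == AMBIGUOUS else w for w in words]


def transcribe_streaming(words):
    """Real time: each ambiguous word sees only what came before it."""
    output = []
    for i, w in enumerate(words):
        heard_so_far = words[:i]
        output.append(resolve(heard_so_far) if w == AMBIGUOUS else w)
    return output


if __name__ == "__main__":
    utterance = ["i", "like", "my", AMBIGUOUS, "medium-rare"]
    print(" ".join(transcribe_offline(utterance)))    # i like my steak medium-rare
    print(" ".join(transcribe_streaming(utterance)))  # i like my stake medium-rare
```

Here the offline pass gets the word right because “medium-rare” is already available, while the streaming pass has to commit before the deciding cue arrives and falls back on a guess. A real-time system must either predict what is coming or revise its output after the fact.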

Human in the loop

To address the challenges of real-time speech recognition, researchers are developing end-to-end “transformer” models that apply deep learning techniques to streamline the task of contextualizing words, sentences, and paragraphs. Rather than processing

