Modeling and improving text stability in live captions – Google Research Blog

Automatic speech recognition (ASR) technology has made conversations more accessible with live captions in remote conferencing software, mobile applications, and head-worn displays. However, to maintain real-time responsiveness, live caption systems typically display interim predictions that are updated as new utterances are received. This can cause text instability (a “flicker” where previously displayed text is updated, shown in the captions on the left in the video below), which can impair users’ reading experience due to distraction, fatigue, and difficulty following the conversation.

In “Modeling and Improving Text Stability in Live Captions”, presented at ACM CHI 2023, we formalize this problem of text stability through a few key contributions. First, we quantify the text instability using a vision-based flicker metric based on luminance difference and the discrete Fourier transform. Second, we introduce a stability algorithm that stabilizes the rendering of live captions via tokenized alignment, semantic merging, and smooth animation. Finally, we conducted a user study (N=123) to understand viewers’ experience with live captioning. Our statistical analysis demonstrates a strong correlation between our proposed flicker metric and viewers’ experience. Moreover, it shows that our proposed stabilization techniques significantly improve viewers’ experience (e.g., the captions on the right in the video above).

Metric

Inspired by previous work, we propose a flicker-based metric to quantify text stability and objectively evaluate the performance of live captioning systems. Specifically, our goal is to quantify the flicker in a grayscale live caption video. We achieve this by comparing the difference in luminance between individual frames (frames in the figures below) that constitute the video. Large visual changes in luminance are obvious (e.g., the addition of the word “shiny” in the figure on the bottom), but subtle changes (e.g., an update from “… this gold. Good..” to “… this. Gold is good”) may be difficult for readers to discern. However, converting the change in luminance into its constituent frequencies exposes both the obvious and the subtle changes.

Thus, for each pair of contiguous frames, we convert the difference in luminance into its constituent frequencies using the discrete Fourier transform. We then sum over the low and high frequencies to quantify the flicker in this pair. Finally, we average over all of the frame pairs to get a per-video flicker score.

As an illustration, we can see below that two identical frames (top) yield a flicker of 0, while two non-identical frames (bottom) yield a non-zero flicker. It is worth noting that higher values of the metric indicate high flicker in the video and thus a worse user experience than lower values of the metric.
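As a rough illustration of this computation, the minimal sketch below scores a sequence of grayscale frames. The exact normalization and frequency weighting of the published metric are not specified in the post, so treat this as a sketch of the idea rather than the paper’s implementation.

```python
import numpy as np

def flicker_metric(frames):
    """Per-video flicker score: take the luminance difference between each pair
    of contiguous grayscale frames, convert it to its constituent frequencies
    with a discrete Fourier transform, sum over the frequencies, and average
    over all frame pairs. Normalization details are assumptions."""
    scores = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = curr.astype(np.float64) - prev.astype(np.float64)
        spectrum = np.fft.fft2(diff)            # constituent spatial frequencies
        scores.append(np.abs(spectrum).sum())   # sum over low and high frequencies
    return float(np.mean(scores)) if scores else 0.0

# Two identical frames yield a flicker of 0; non-identical frames yield a non-zero value.
frame_a = np.zeros((72, 128))
frame_b = frame_a.copy()
frame_b[30:40, 20:60] = 255.0                   # e.g., a newly rendered word
print(flicker_metric([frame_a, frame_a]))       # 0.0
print(flicker_metric([frame_a, frame_b]))       # > 0
```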

Stability algorithm

To improve the stability of live captions, we propose an algorithm that takes as input the already rendered sequence of tokens (e.g., “Previous” in the figure below) and the new sequence of ASR predictions, and outputs an updated stabilized text (e.g., “Updated text (with stabilization)” below). It considers both the natural language understanding (NLU) aspect and the ergonomic aspect (display, layout, etc.) of the user experience when deciding when and how to produce a stable updated text. Specifically, our algorithm performs tokenized alignment, semantic merging, and smooth animation to achieve this goal. In what follows, a token is defined as a word or punctuation mark produced by ASR.

Our algorithm addresses the challenge of producing stabilized updated text by first identifying three classes of changes (highlighted in red, green, and blue below; a sketch of how these classes can be detected follows the list):

  1. Red: Addition of tokens to the end of previously rendered captions (e.g., “How about”).

  2. Green: Addition / deletion of tokens in the middle of already rendered captions.

    1. B1: Addition of tokens (e.g., “I” and “friends”). These may or may not affect the overall comprehension of the captions, but may lead to layout changes. Such layout changes are not desired in live captions, as they cause significant jitter and a poorer user experience. Here “I” does not add to the comprehension but “friends” does. Thus, it is important to balance updates with stability, especially for B1-type tokens.

    2. B2: Removal of tokens, e.g., “in” is removed in the updated sentence.

  3. Blue: Re-captioning of tokens: this includes token edits that may or may not affect the overall comprehension of the captions.

  • C1: Proper nouns, e.g., “disney land” is updated to “Disneyland”.

  • C2: Grammatical shorthands, e.g., “it’s” is updated to “It was”.
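The sketch below illustrates how such change classes might be detected by aligning the previously rendered tokens against the new ASR tokens. The paper uses a Needleman-Wunsch variant (described in the next section); this sketch substitutes Python’s difflib.SequenceMatcher as a simpler stand-in, and the class labels and example strings are only illustrative.

```python
import difflib

def classify_changes(rendered_tokens, new_tokens):
    """Align the previously rendered tokens with the new ASR tokens and bucket
    the differences into the three classes described above. difflib is used as
    a stand-in for the Needleman-Wunsch-style alignment in the paper."""
    changes = []
    matcher = difflib.SequenceMatcher(a=rendered_tokens, b=new_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        if op == "insert" and i1 == len(rendered_tokens):
            changes.append(("A: append", new_tokens[j1:j2]))           # new tokens at the end
        elif op == "insert":
            changes.append(("B1: mid-insertion", new_tokens[j1:j2]))   # may reflow the layout
        elif op == "delete":
            changes.append(("B2: mid-deletion", rendered_tokens[i1:i2]))
        else:  # "replace"
            changes.append(("C: re-captioning", (rendered_tokens[i1:i2], new_tokens[j1:j2])))
    return changes

rendered = "disney land is in california".split()
updated  = "Disneyland is california I think How about".split()
for change in classify_changes(rendered, updated):
    print(change)
```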

Alignment, merging, and smoothing

To maximize text stability, our goal is to align the old sequence with the new sequence using updates that make minimal changes to the existing layout while ensuring accurate and meaningful captions. To achieve this, we leverage a variant of the Needleman-Wunsch algorithm with dynamic programming to merge the two sequences depending on the class of tokens as defined above:

  • Case A tokens: We immediately add case A tokens, inserting line breaks as needed to fit the updated captions.

  • Case B tokens: Our preliminary studies showed that users preferred stability over accuracy for previously displayed captions. Thus, we only update case B tokens if the updates do not break the existing line layout.

  • Case C tokens: We compare the semantic similarity of case C tokens by transforming the original and updated sentences into sentence embeddings, measuring their dot product, and updating them only if they are semantically different (i.e., the similarity falls below a threshold); a sketch of this check follows the list.
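A minimal sketch of the case C check, assuming a placeholder bag-of-words embedding and an illustrative 0.9 threshold (the post does not give the actual embedding model or threshold value):

```python
import numpy as np

def embed(sentence, vocab):
    """Stand-in sentence embedding: a normalized bag-of-words vector. The paper
    uses learned sentence embeddings; this placeholder only keeps the sketch
    self-contained and runnable."""
    counts = np.array([sentence.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

def should_recaption(original, updated, threshold=0.9):
    """Apply a case C edit only if the two sentences are semantically different,
    i.e., the dot product of their embeddings falls below the threshold.
    The 0.9 threshold is an assumption, not the paper's value."""
    vocab = sorted(set(original.lower().split()) | set(updated.lower().split()))
    similarity = float(np.dot(embed(original, vocab), embed(updated, vocab)))
    return similarity < threshold

print(should_recaption("it is good", "it is good"))        # False: identical, keep the rendered text
print(should_recaption("it is good", "how about friends")) # True: different enough to re-caption
```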

Finally, we leverage animations to reduce visual jitter. We implement smooth scrolling and fading of newly added tokens to further stabilize the overall layout of the live captions.
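For illustration only, the fade-in of a newly added token could be as simple as a linear opacity ramp; the actual easing curve and duration used in the system are not described in the post.

```python
def fade_in_alpha(frame_index, fade_frames=15):
    """Linear fade-in for a newly added token: opacity ramps from 0 to 1 over
    `fade_frames` frames. The easing curve and duration are assumptions."""
    return min(1.0, max(0.0, frame_index / fade_frames))

# Newly appended (case A) tokens could be drawn with this alpha so they appear
# gradually instead of popping in, reducing visual jitter.
print([round(fade_in_alpha(i), 2) for i in range(0, 20, 5)])
```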

User evaluation

We conducted a user study with 123 participants to (1) examine the correlation of our proposed flicker metric with viewers’ experience of the live captions, and (2) assess the effectiveness of our stabilization techniques.

We manually selected 20 videos from YouTube to obtain broad coverage of topics, including video conferences, documentaries, academic talks, tutorials, news, comedy, and more. For each video, we selected a 30-second clip with at least 90% speech.

We prepared four types of renderings of live captions to compare:

  1. Raw ASR: raw speech-to-text results from a speech-to-text API.

  2. Raw ASR + thresholding: only display an interim speech-to-text result if its confidence score is higher than 0.85 (see the short sketch after this list).

  3. Stabilized captions: captions using our algorithm described above with alignment and merging.

  4. Stabilized and smooth captions: stabilized captions with smooth animation (scrolling + fading) to assess whether a softened display experience helps improve the user experience.
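A minimal sketch of the second baseline, assuming the speech-to-text API yields (text, confidence) pairs for interim results:

```python
def display_interim(results, confidence_threshold=0.85):
    """Baseline 2 from the study: display an interim speech-to-text result only
    if its confidence score is higher than 0.85. `results` is assumed to be a
    list of (text, confidence) pairs from a speech-to-text API."""
    for text, confidence in results:
        if confidence > confidence_threshold:
            yield text   # replaces the currently displayed caption

interim = [("how about", 0.62), ("how about disney", 0.88), ("How about Disneyland", 0.97)]
print(list(display_interim(interim)))   # the low-confidence interim result is skipped
```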

We collected user ratings by asking the participants to watch the recorded live captions and rate their assessments of comfort, distraction, ease of reading, ease of following the video, fatigue, and whether the captions impaired their experience.

Correlation between flicker metric and user experience

We calculated Spearman’s coefficient between the flicker metric and each of the behavioral measurements (values range from -1 to 1, where negative values indicate a negative relationship between the two variables, positive values indicate a positive relationship, and 0 indicates no relationship). Shown below, our study demonstrates statistically significant (p < 0.001) correlations between our flicker metric and users’ ratings. The absolute values of the coefficients are around 0.3, indicating a moderate relationship.
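For readers who want to run this kind of analysis on their own data, a Spearman correlation can be computed with SciPy; the numbers below are made up for illustration and are not the study’s data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-video flicker scores and mean distraction ratings (illustrative only).
flicker_scores     = np.array([0.8, 1.2, 2.5, 3.1, 4.0, 5.6, 6.2, 7.9])
distraction_rating = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0])

rho, p_value = stats.spearmanr(flicker_scores, distraction_rating)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.4f}")
```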

Stabilization of live captions

Our proposed approach (stabilized smooth captions) received consistently better ratings, significant as measured by the Mann-Whitney U test (p < 0.01 in the figure below), on five of the six aforementioned survey statements. That is, users considered the stabilized captions with smoothing to be more comfortable and easier to read, while feeling less distraction, fatigue, and impairment to their experience than with the other types of rendering.
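Similarly, a Mann-Whitney U comparison between two rendering conditions can be run with SciPy; again, the ratings below are illustrative, not the study’s data.

```python
import numpy as np
from scipy import stats

# Hypothetical comfort ratings for two rendering conditions (illustrative only).
raw_asr           = np.array([2, 3, 2, 4, 3, 2, 3, 2, 3, 4])
stabilized_smooth = np.array([4, 5, 4, 4, 5, 3, 5, 4, 4, 5])

u_stat, p_value = stats.mannwhitneyu(stabilized_smooth, raw_asr, alternative="greater")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")
```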

Conclusion and future directions

Text instability in live captioning significantly impairs users’ reading experience. This work proposes a vision-based metric to model caption stability that correlates with users’ experience in a statistically significant way, and an algorithm to stabilize the rendering of live captions. Our proposed solution can potentially be integrated into existing ASR systems to enhance the usability of live captions for a variety of users, including those with translation needs and those with hearing accessibility needs.

Our work represents a substantial step towards measuring and improving text stability. It can be extended to include language-based metrics that focus on the consistency of the words and phrases used in live captions over time. These metrics could provide a reflection of user discomfort as it relates to language comprehension and understanding in real-world scenarios. We are also interested in conducting eye-tracking studies (e.g., videos shown below) to track viewers’ gaze patterns, such as eye fixations and saccades, allowing us to better understand which types of errors are most distracting and how to improve text stability for them.

By improving text stability in live captions, we can create more effective communication tools and improve how people connect in everyday conversations in familiar or, through translation, unfamiliar languages.

Acknowledgements

This work is a collaboration across multiple teams at Google. Key contributors include Xingyu “Bruce” Liu, Jun Zhang, Leonardo Ferrer, Susan Xu, Vikas Bahirwani, Boris Smus, Alex Olwal, and Ruofei Du. We would like to extend our thanks to our colleagues who provided assistance, including Nishtha Bhatia, Max Spear, and Darcy Philippon. We would also like to thank Lin Li, Evan Parker, and the CHI 2023 reviewers.