Making videos accessible with AI-generated audio descriptions

Audio description (AD), an additional narration track that conveys essential visual information in media, is imperative for improving video accessibility for the more than 12 million Americans who are blind or visually impaired. Over 100,000 videos are uploaded every day, and the vast majority are never audio described. Current methods for producing audio descriptions are both slow and expensive.
Potential Funding: $1.7T
the problem
Nature and Context

For people in the visually impaired community, many videos are completely inaccessible. For instance, people with visual impairments find it especially difficult to follow scenic montages, cooking videos with ingredients shown on-screen, and how-to videos with visual demonstrations. As society increasingly relies on video media, a trend that predates the pandemic and has since accelerated, the visually impaired community faces an ever-steeper challenge in navigating the online world.

One way to make a video accessible is through audio description, which narrates the essential content of a video as it plays. While effective, audio descriptions are currently slow and expensive to produce because they rely on manual processes. The typical turnaround time often exceeds two weeks, since the process depends on professional script writing and voice acting, and it costs upwards of $10 per minute of video content. Because the process is so resource-intensive, the availability of audio description is controlled by video producers and creators, few of whom make their videos accessible. The visually impaired community has long voiced its demand for widespread audio description.

Symptoms and Causes

The U.S. Centers for Disease Control and Prevention (CDC) reports the following:

  • Over 12 million people aged 40 and over in the United States have vision impairment, including 1 million who are blind, 3 million who have vision impairment after correction, and 8 million who have vision impairment due to uncorrected refractive error.

  • Over 4.2 million Americans aged 40 years and older suffer from uncorrectable vision impairment; this number is predicted to more than double by 2050 to 8.96 million due to the increasing epidemics of diabetes and other chronic diseases and our rapidly aging U.S. population.

  • Approximately 6.8% of children younger than 18 years in the United States have a diagnosed eye and vision condition. Nearly 3% of children younger than 18 years are blind or visually impaired, defined as having trouble seeing even when wearing glasses or contact lenses.

the impact
Negative Effects
  • The annual economic impact of major vision problems among the adult population 40 years and older is more than $145 billion.

  • Vision loss causes a substantial social and economic toll for millions of people including significant suffering, disability, loss of productivity, and diminished quality of life.

Success Metrics
  • Achieve 100% availability of audio descriptions for videos

  • Develop a fully AI-driven approach to generate audio descriptions for any open domain video

who benefits from solving this problem
Organization Types

Many organizations are mandated by the FCC and ADA to provide a certain degree of audio description:

  • Federally-funded educational institutions

  • Broadcast networks

  • Government organizations

  • Enterprise organizations (training videos)

While not as regulated as the types of organizations mentioned above, these organizations produce video content that is commonly viewed by people with visual impairments:

  • Streaming platforms

  • Video on demand platforms

  • Social media

  • Nonprofit and Government Organizations that help individuals with vision impairment.

  • Entertainment, education, and information platforms such as Netflix, Hulu, MSN News, YouTube, and Comcast, which want to make their content accessible to the over 8 million U.S. adults with some form of visual impairment.

  • Educational Institutions, from K-12 to Higher Education

financial insights
Current Funding

As a team of students from the University of Washington, we have received a number of mini-grants from campus organizations that support our mission.

The team seeks $25K in funding to continue evolving the product, participate in the X4Impact Virtual Accelerator, and operate this venture full-time while continuing graduate and postgraduate education.

Ideas Description

VerbalEyes improves the audio description process through a scalable AI-driven product that will make audio description easy, fast, and accessible.

Our audio description pipeline automates the process through a few key techniques:

  1. scene selection - identifying the frames that capture a key moment in a video

  2. video captioning - describing the objects and actions that occur in a scene

  3. text-to-speech - synthesizing the descriptions into human-like speech

  4. description scheduling - aligning the descriptions to create a coherent narration while avoiding overlap with existing dialogue.
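Of the four steps, description scheduling is the most algorithmic. As a minimal sketch, assuming dialogue segments come from speech detection and each generated description has an estimated spoken duration, a greedy placement into dialogue gaps might look like this (all names and data shapes here are illustrative, not our production code):

```python
def schedule_descriptions(dialogue, descriptions):
    """Greedily place each description in the earliest dialogue gap
    that can hold it, so narration never overlaps speech.

    dialogue:     sorted list of (start, end) seconds of spoken segments
    descriptions: list of (scene_time, duration) for generated narrations
    Returns a list of (start_time, description_index) placements.
    """
    # Build the silent gaps between dialogue segments,
    # plus one open-ended gap after the last segment.
    gaps = []
    cursor = 0.0
    for start, end in dialogue:
        if start > cursor:
            gaps.append([cursor, start])
        cursor = max(cursor, end)
    gaps.append([cursor, float("inf")])

    placements = []
    for idx, (scene_time, duration) in enumerate(descriptions):
        for gap in gaps:
            # Earliest start inside this gap at or after the scene moment.
            start = max(gap[0], scene_time)
            if start + duration <= gap[1]:
                placements.append((start, idx))
                gap[0] = start + duration  # shrink the gap for later descriptions
                break
    return placements
```

A real scheduler would also need a policy for descriptions that fit no gap, for example pausing the video (extended audio description) or shortening the generated text.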

Altogether, our technology can generate an audio description for an input video within minutes. This audio description can then be played alongside the original video to make it accessible for people with visual impairments. During our user research, we demoed our AI-generated audio descriptions for members of the visually impaired community, who affirmed that they helped them understand the content of videos that would otherwise have been inaccessible.

We plan to provide this as an API so that it can be easily integrated into existing video players.
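As an illustration only, integrating with such an API might look like the sketch below. The endpoint URL, field names, and voice parameter are assumptions for the example, since the interface is not yet published:

```python
import json
import urllib.request

# Hypothetical endpoint; the real API is not yet published.
API_URL = "https://api.verbaleyes.example/v1/describe"

def build_request(video_url, voice="en-US-neutral"):
    """Assemble a POST request submitting a video for description.

    A video player integration would send this request, poll until the
    job completes, then play the returned narration track alongside
    the original video.
    """
    payload = {"video_url": video_url, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```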

Data Sources

Centers for Disease Control and Prevention - Fast Facts about Eye Disorders

Contributors to this Page
  1. Daniel Zhu

  2. VerbalEyes

  3. Team at Giving Tech Labs
