Challenge
Making videos accessible with AI-generated audio descriptions
For people with visual impairments, many videos are largely inaccessible. Scenic montages, cooking videos that show ingredients only on-screen, and how-to videos that rely on visual demonstrations are especially difficult to follow. Video has become an ever larger share of online media, a trend that predates the pandemic and has since accelerated, leaving the visually impaired community facing a steep and growing challenge in navigating the online world.
One way to make a video accessible is through audio description, which narrates the essential visual content of a video as it plays. While effective, audio descriptions are currently slow and expensive to produce because they rely on manual processes: professional script writing and voice acting push the typical turnaround time past two weeks, and the cost exceeds $10 per minute of video content. Because the process is so resource-intensive, the availability of audio description is controlled by video producers and creators, few of whom make their videos accessible. The visually impaired community has long voiced its demand for widespread audio description.
The US Centers for Disease Control and Prevention (CDC) reports the following:
Over 12 million people 40 years and over in the United States have vision impairment, including 1 million who are blind, 3 million who have vision impairment after correction, and 8 million who have vision impairment due to uncorrected refractive error.
Over 4.2 million Americans aged 40 years and older suffer from uncorrectable vision impairment; this number is predicted to more than double by 2050 to 8.96 million due to the increasing epidemics of diabetes and other chronic diseases and our rapidly aging U.S. population.
Approximately 6.8% of children younger than 18 years in the United States have a diagnosed eye and vision condition. Nearly 3% of children younger than 18 years are blind or visually impaired, defined as having trouble seeing even when wearing glasses or contact lenses.
The annual economic impact of major vision problems among the adult population 40 years and older is more than $145 billion.
Vision loss causes a substantial social and economic toll for millions of people including significant suffering, disability, loss of productivity, and diminished quality of life.
Achieve 100% availability of audio descriptions for videos
Develop a fully AI-driven approach to generate audio descriptions for any open domain video
Many organizations are required by FCC regulations and the ADA to provide a certain degree of audio description:
Federally-funded educational institutions
Broadcast networks
Government organizations
Enterprise organizations (training videos)
While less regulated than the organizations above, the following also produce video content that is commonly viewed by people with visual impairments:
Streaming platforms
Video on demand platforms
Social media
Nonprofit and Government Organizations that help individuals with vision impairment.
Entertainment, Education, and Information platforms such as Netflix, Hulu, MSN News, YouTube, and Comcast, which want to make their content accessible to the over 8M adults in the US with some form of visual impairment.
Educational Institutions, from K-12 to Higher Education
As a team of students from the University of Washington, we have received a number of mini-grants from organizations on campus that support our mission:
UW CREATE Mini-grant - $700
UW Research Computing Club Cloud Credit Program - $500 (in cloud credits)
The team seeks $25K in funding to continue to evolve the product, participate in the X4Impact Virtual Accelerator, and operate this venture full-time while continuing graduate and post-graduate education.
VerbalEyes improves the audio description process through a scalable AI-driven product that will make audio description easy, fast, and accessible.
Our audio description pipeline automates the process through a few key techniques:
scene selection - identifying the frames that capture a key moment in a video
video captioning - describing the objects and actions that occur in a scene
text-to-speech - synthesizing the descriptions into human-like speech
description scheduling - aligning the descriptions to create a coherent narration while avoiding overlap with existing dialogue
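As a rough illustration of how these stages fit together, here is a minimal Python sketch. All function names and data shapes are hypothetical, the model-driven stages (frame scoring, captioning, text-to-speech) are replaced with simple stand-ins, and the scheduling rule is simplified to dropping any description that would overlap existing dialogue:

```python
from dataclasses import dataclass


@dataclass
class Description:
    start: float  # seconds into the video where the narration begins
    text: str


def select_scenes(frame_scores, threshold=0.8):
    """Scene selection stand-in: keep timestamps whose frames score above
    a threshold. A real system would score frames with a learned model."""
    return [t for t, score in frame_scores if score >= threshold]


def caption_scene(timestamp):
    """Video captioning stand-in: a real system would run a captioning
    model over the frames around this timestamp."""
    return f"Scene at {timestamp:.0f}s"


def synthesize(text):
    """Text-to-speech stand-in: a real system would call a TTS model and
    return audio samples; here we just return the text as bytes."""
    return text.encode("utf-8")


def schedule(descriptions, dialogue_spans, narration_length=2.0):
    """Description scheduling stand-in: drop any description whose
    narration window would overlap a span of existing dialogue."""
    def overlaps(start):
        return any(s < start + narration_length and start < e
                   for s, e in dialogue_spans)
    return [d for d in descriptions if not overlaps(d.start)]


def build_audio_description(frame_scores, dialogue_spans):
    """Run the four stages end to end for one video."""
    scenes = select_scenes(frame_scores)
    descriptions = [Description(t, caption_scene(t)) for t in scenes]
    return schedule(descriptions, dialogue_spans)
```

In this toy setup, a video with key frames at 1s and 10s and dialogue from 9.5s to 12s yields a single scheduled description at 1s; the 10s description is dropped because narrating it would talk over the dialogue.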
Altogether, our technology can generate an audio description for an input video within minutes. The generated track can then be played alongside the original video to make it accessible to people with visual impairments. During our user research, we demoed AI-generated audio descriptions to members of the visually impaired community, who affirmed that they helped them understand the content of videos that would otherwise have been inaccessible.
We plan to provide this as an API so that it can be easily integrated into existing video players.
Daniel Zhu - dzhu99@cs.washington.edu
VerbalEyes - verbaleyes.team@gmail.com
Team at Giving Tech Labs