Transcribing video content is an essential part of ensuring accessibility, but anyone who’s tried it knows that it’s not straightforward. While accessibility is a non-negotiable from both a compliance and values perspective, creating quality subtitles can be incredibly time-consuming and frustrating.
In 2025, you may have noticed we’ve started experimenting with live subtitles during our Library All Staff Meetings. It’s part of our ongoing effort to make content inclusive, but there’s still room for improvement.
Why video transcription matters
When it comes to digital content, subtitles aren’t just a nice-to-have; they’re vital for accessibility. Transcriptions help people with hearing impairments, support non-native English speakers, and make videos accessible in environments where audio isn’t an option.
Despite its importance, transcribing videos can feel like an overwhelming task. Library All Staff Meetings routinely run for over an hour, and our popular lunchtime lecture series at the John Rylands Library often presents similar challenges. Manually typing out each word or correcting automatic captions quickly becomes an exhausting exercise, taking up hours of valuable staff time.
Limitations of automatic subtitles on YouTube
YouTube’s automatic subtitles are a convenient shortcut, but they’re far from perfect. Yes, they’re quick and free, but they come with several drawbacks:
- Accuracy: YouTube captures everything, including ‘ums’, ‘ahs’, laughter, background noise, and random non-verbal sounds.
- Punctuation: It doesn’t handle punctuation well, often creating sentences that never seem to end, transforming speech into a single, lengthy stream of consciousness.
- Names and international terms: YouTube often struggles with names, technical terminology, or words from other languages, leading to misinterpretations and confusion.
In short, automatic subtitles from YouTube tick the box for basic accessibility, but they fall short when it comes to genuinely usable, inclusive content.

Discovering OpenAI’s Whisper
In July 2025, I attended the IDEAL EDI conference in Toronto, where I stumbled across something promising: OpenAI’s ‘Whisper’ technology.
During one of the presentations, a speaker described exactly the challenges we were facing at the Library: poor subtitle quality, excessive manual labour, and inaccuracies that reduced the usefulness of captions.
Whisper is a state-of-the-art speech recognition model developed by OpenAI, designed specifically to handle complex transcription tasks. Unlike conventional auto-captioning systems, Whisper uses advanced neural networks trained on diverse datasets, enabling it to manage the quirks of natural language far better than traditional methods.
Specifically, Whisper excels at:
- Handling names and international vocabulary effortlessly.
- Recognising punctuation, producing text that’s structured and readable.
- Automatically filtering out non-verbal noises, producing clean, focused subtitles.
These benefits made Whisper an appealing solution for our Library, promising both improved accessibility and usability.
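For anyone curious about what this looks like in practice, here’s a minimal sketch using the open-source whisper Python package. The model size and file name are purely illustrative, not our exact setup:

```python
# Minimal sketch of transcribing an audio file with the open-source
# "openai-whisper" package (pip install openai-whisper).
# Model size and file name below are illustrative examples.
import whisper

# Load one of the pretrained Whisper models ("tiny" through "large");
# larger models are slower but more accurate.
model = whisper.load_model("medium")

# Transcribe a local audio file; Whisper adds punctuation and ignores
# most non-verbal noise on its own.
result = model.transcribe("all_staff_meeting.mp3")

print(result["text"])
```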
Bringing Whisper back to Manchester
Inspired by what I’d seen, I returned to Manchester eager to test Whisper myself. While the technology itself is fairly accessible, getting it up and running requires some advanced technical knowledge, something I quickly realised when my initial attempts to install it on my University laptop were unsuccessful.
Fortunately, Pete Morris from our Digital Development Team stepped in. Pete suggested leveraging our powerful local AI hardware (affectionately known as our “AI rig”), already set up for R&D purposes using open-source models like Meta’s LLaMA. With Pete’s help, Whisper was successfully deployed on the rig, initially as a basic proof-of-concept. The test run worked perfectly, but the setup had one limitation: the rig could only process audio files up to ten minutes in length.
That ten-minute cap was a significant barrier for our longer Library All Staff Meetings, which usually run to an hour or more and are uploaded to YouTube.

Developing a full solution
Following our discussions, Pete generously took on the task of creating a fully functioning end-to-end transcription solution. His approach was clear and simple: create an interface where we could enter a YouTube URL and receive an accurate transcript in about a minute.
In practice, Pete programmed the AI rig to:
- Access the YouTube video automatically.
- Download and extract the audio from the video.
- Run the extracted audio through Whisper’s transcription model.
- Generate a neatly formatted, accurate transcript.
This user-friendly interface transformed our workflow for longer-form videos. Instead of laboriously typing or correcting auto-generated subtitles, I now simply copy and paste the clean, accurate transcript produced by Whisper back into YouTube. YouTube then synchronises the subtitles automatically, aligning them perfectly to the video.
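Pete’s actual code isn’t reproduced here, but in rough outline the pipeline looks something like the sketch below, assuming the yt-dlp downloader and the open-source whisper package. The function name, model size, and URL are placeholders for illustration only:

```python
# Rough sketch of a YouTube-to-transcript pipeline, assuming yt-dlp and
# the open-source whisper package. Illustrative only; not Pete's exact
# implementation.
import yt_dlp
import whisper


def transcribe_youtube(url: str, out_path: str = "transcript.txt") -> None:
    # 1. Download the best available audio stream from the video.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": "audio.%(ext)s",
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_file = ydl.prepare_filename(info)

    # 2. Run the extracted audio through Whisper's transcription model.
    model = whisper.load_model("medium")
    result = model.transcribe(audio_file)

    # 3. Write a plain-text transcript that can be pasted back into
    #    YouTube, which then syncs the captions to the video itself.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(result["text"])


transcribe_youtube("https://www.youtube.com/watch?v=EXAMPLE")
```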
Saving time, improving inclusion
Implementing Whisper at The University of Manchester Library hasn’t just saved time; it’s changed the way we approach accessibility. The efficiency of this process means subtitles are available quickly, significantly reducing manual effort and allowing us to focus on other critical communications activities.
More importantly, Whisper’s accuracy enhances digital inclusion, helping ensure that everyone can fully engage with our video content, whether they’re attending our Library All Staff Meetings or viewing any of our digital communications materials.
The success we’ve had so far clearly demonstrates the potential of AI to enhance accessibility and inclusivity, making our communications clearer, more accessible, and ultimately more effective.
Here’s to continued innovation and inclusion at the Library – one subtitle at a time. 😀
Kristian Scott
(Senior Digital Communications Developer, The University of Manchester Library)

