Transcribe Videos And Make Them Searchable With Koemei

David J. Hill

Jun 29, 2012

Wouldn't it be great to be able to search videos for what people are actually saying instead of relying on tags or descriptions? Koemei (pronounced "co-may") aims to do just that with its cloud-based speech recognition software, which rapidly transcribes video and audio, even if people have accents or more than one person is speaking.

The startup is also targeting the large quantities of media content from videoconferencing, webcasting, and classroom lectures being produced in business, government, and educational institutions. By indexing the transcripts, it aims to make those video libraries easily searchable.
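To make the indexing idea concrete, here is a minimal sketch of how timestamped transcripts could be turned into a searchable index. It is not Koemei's implementation, which the article does not describe; the function names, the video IDs, and the sample transcript segments are all hypothetical, and the approach shown is a plain inverted index (word to locations).

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def build_index(transcripts):
    """Map each lowercase word to a set of (video_id, start_seconds) pairs."""
    index = defaultdict(set)
    for video_id, segments in transcripts.items():
        for start_seconds, text in segments:
            for word in WORD_PATTERN.findall(text.lower()):
                index[word].add((video_id, start_seconds))
    return index


def search(index, query):
    """Return sorted hits whose transcript segment contains every query word."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical transcripts: video ID -> list of (start time in seconds, spoken text).
transcripts = {
    "lecture-01": [
        (0, "Welcome to the course on speech recognition"),
        (42, "Accents make recognition harder"),
    ],
    "webcast-07": [
        (15, "Quarterly results and a speech by the CEO"),
    ],
}

index = build_index(transcripts)
print(search(index, "speech recognition"))
```

Because each hit carries a start time, a search result can point at the moment in the video where the words were spoken rather than at the video as a whole.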

According to the company, it currently takes about an hour for its system to automatically transcribe one hour of media at a cost of about $0.09 a minute, much cheaper than manual transcription, though user pricing for the service has not been announced.
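As a rough illustration of those figures, the sketch below works out the cost and processing time for a given amount of media. The rate of 9 cents per minute and the one-to-one processing ratio come from the company's stated numbers above; the function names are hypothetical, and since user pricing has not been announced, the result is only an estimate of the quoted cost, not a price.

```python
CENTS_PER_MINUTE = 9      # "$0.09 a minute", kept in whole cents to avoid float rounding
PROCESSING_RATIO = 1.0    # about one hour of processing per hour of media


def estimate(media_minutes):
    """Return (cost in cents, processing time in minutes) for whole minutes of media."""
    cost_cents = media_minutes * CENTS_PER_MINUTE
    processing_minutes = media_minutes * PROCESSING_RATIO
    return cost_cents, processing_minutes


def format_dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


cost_cents, processing_minutes = estimate(60)  # one hour of media
print(format_dollars(cost_cents), processing_minutes)
```

At the quoted rate, one hour of media works out to $5.40.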

Building on 8 years of research, the startup comes out of the Idiap Research Institute in Switzerland and is in the process of raising $1.5 million in a Series A round on top of the angel funding it has received since its 2010 launch. In terms of its market, Koemei offers a service that competes with the likes of Google Voice (which has many notable fails in its transcription capacities), Nuance and its popular Dragon software, and sites like Amazon Mechanical Turk and Fiverr, where you can pay people to transcribe or caption videos for you. In a press release, Koemei claims that almost $16 billion is spent annually on transcription by human transcriptionists, who number 120,000 in the US alone.

However, Koemei is offering more than just a transcription service: it's an automated transcription platform. Speech recognition is improving, but it clearly isn't quite where it needs to be yet, which is why the site allows users to edit transcripts when the software makes errors. Furthermore, captioned videos can be published directly to video hosting sites like YouTube and Vimeo. The company is optimistic, stating that it could have nearly $45 million in sales by 2014.

Currently, the site is taking sign-ups for a trial and says that users will get 10 hours of video transcribed for free.

But here's the thing: Google is already in the process of automatically adding captions to all YouTube videos, and at no cost. Admittedly, finding videos with captions is hit or miss (especially because of the massive amount uploaded daily), and currently the transcriptions are shoddy at times. Google recognizes the demand for video captions and works with the Described and Captioned Media Program, an organization that advocates for equal media access for the deaf and hard of hearing, to identify YouTube caption editors like CaptionTube and Subtitle Workshop.

So whether Koemei's speech recognition technology is superior to Google's will have to be demonstrated, but as far as Koemei's platform goes, it wouldn't be hard for the Google devs to redesign the YouTube upload page to incorporate a caption editor or crowdsource YouTube captioning a la Wikipedia's content. On the other hand, it may be easier for Google just to acquire Koemei and embed its tech instead. Regardless, with as much effort as Google is putting into YouTube lately, such as pouring $200 million into building 100 channels, it's clear that Koemei's service is on Google's radar.

Accurate speech recognition is one of the most important technological challenges today and emergent technologies like Apple's Siri and Google Glass take it for granted that computers will actually understand what you are saying. But, as anyone who has called automated calling centers knows, voice commands are relatively easy to discern and nothing like the diverse language, complicated speech patterns, and speech rate in many videos. Add to that people speaking in accents, dialects, and second languages, and the problem becomes much more difficult.

The world of search is getting a lot of attention lately, from proof-of-concept projects like the descriptive camera that provides a description of what's in an image to the recent announcement of Knowledge Graph, which looks to make Google Search smarter by identifying people, places and things as objects and defining the relationships between them. In the world of video search, we recently profiled a new video surveillance system that allows for rapid searching of faces in videos, easily pulling up every clip in which the person is recognized at a rate of 36 million images per second.

But this technology is vitally important as the quantity of Internet video is enormous. Speaking at TechCrunch Disrupt, Koemei's CEO Temitope Ola, formerly of Silicon Graphics, said, "There are over 7 billion online videos today and only 1 percent of those are [sic] captioned."

And it's only going to get worse. The exponential growth of YouTube alone means that 72 hours of content is currently uploaded to the site every minute, and without video transcripts, much of it is found in relatively inefficient ways. With the surge in smartphones and other mobile devices around the world that enable anyone to record audio and video easily, much of what is being captured through this content remains separate from the searchable web without reliable transcription in place.

Additionally, with telecommuting becoming the norm in some countries and the demand for online education featuring video lectures skyrocketing, we need transcription solutions rapidly. Whether Koemei can position itself against all of its competition remains to be seen, but it's clear that the big guns are getting pulled out for online audio and video transcription, and in the near future, it may be hard to imagine watching a video without being able to read the transcript.

To get an overview of the service, check out this presentation that highlights all of the service's features:

[Media: TechCrunch, YouTube]

[Sources: Koemei, TechCrunch, Yahoo]

David J. Hill

David started writing for Singularity Hub in 2011 and served as editor-in-chief of the site from 2014 to 2017 and SU vice president of faculty, content, and curriculum from 2017 to 2019. His interests cover digital education, publishing, and media, but he'll always be a chemist at heart.
