CAPTCHA spam filters are deciphering ancient texts

Internet users translating a million OCR-defeating words a day


2 October 2007 16:40 GMT / By Amy-Mae Elliott

We've all had to use one. They're the spam filters that make you type a distorted word to prove you are human and not a spam-bot.

What you may not know is that with Carnegie Mellon's "reCAPTCHA" version you are actually helping to decipher and digitise old documents.

Words that can't be "read" properly by OCR software, taken from old books and manuscripts held by a non-profit organisation called the Internet Archive, are being used in these applications.

A user is presented with two words, one of which is known and one of which is not. If the user correctly identifies the known word, it's assumed they got the unknown word right too. The results are sent back to the research team.
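The two-word check described above can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual code: the function names, the `votes` store, and the idea of tallying several users' guesses for the same unknown word are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical store: guesses collected so far for each unknown word image.
votes = defaultdict(Counter)


def normalise(text):
    """Ignore case and surrounding whitespace when comparing words."""
    return text.strip().lower()


def submit(known_answer, unknown_id, typed_known, typed_unknown):
    """Return True if the user passes the check.

    The user passes by typing the known word correctly. Only then is
    their reading of the unknown word trusted and recorded.
    """
    if normalise(typed_known) != normalise(known_answer):
        return False
    votes[unknown_id][normalise(typed_unknown)] += 1
    return True


def best_guess(unknown_id):
    """Return the most common reading recorded for an unknown word, if any."""
    counts = votes.get(unknown_id)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Here a user who types the known word correctly has their reading of the unknown word counted, while a user who gets the known word wrong contributes nothing.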

The team's OCR software typically has problems with about one in 10 words. Those words are then used in reCAPTCHA and digitised by the people who type them, saving tens of thousands of man-hours.

More than a million words are being deciphered each day, but there are still 100 million books to go, a job estimated to take about 400 years to complete.

So, every time you decipher a reCAPTCHA, another word from an old book or manuscript is digitised for the Internet Archive, a fact which may make the process a little less annoying in future...