Dr. Augustine Fou's Online Scrapbook: Can CAPTCHAs solve book-digitizing?

from Boing Boing by Cory Doctorow

Cory Doctorow: Here's an interesting proposal to replace the text in CAPTCHAs (those boxes where you type distorted words) with text that has stymied the optical character recognition software used to digitize old public domain books.

It's a clever hack, but there's one thing I don't understand. CAPTCHAs are supposed to contain a word known to the computer. You key it in and the computer confirms that you're a human being by comparing your entry to what the computer knows the CAPTCHA to be.

But if CAPTCHAs contain text unknown to the computer -- and any text that stymies OCR software is, by definition, unknown to the computer -- then what's to stop you from entering anything in the CAPTCHA box and gaining entry?

Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project.
“I think it’s a brilliant idea — using the Internet to correct OCR mistakes,” said Brewster Kahle, director of the Internet Archive, in a statement. “This is an example of why having open collections in the public domain is important. People are working together to build a good, open system.”

Link (via /.)

Update: Alex sez, "the system works by having two words displayed. One is computer generated (hence the computer knows what it is) and the other is a scan from a book to be solved by the human (you do not know which is which). You enter both words; if you get the computer-generated one correct, the system knows you're a human and lets you in. It can then also assume you entered the other, non-generated word correctly and can use it."
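
Here is a rough sketch of the two-word check Alex describes, in Python. It is only an illustration, not the project's actual code: the names `Challenge`, `make_challenge`, `check`, and `votes` are made up for this example, and so is the idea of tallying typed answers per scanned word so that several visitors can be compared later.

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Challenge:
    control_index: int   # display position (0 or 1) of the known word; kept server-side
    control_answer: str  # the word the computer generated and therefore knows
    scan_id: str         # hypothetical identifier for the scanned word OCR could not read


# Tally of what visitors typed for each scanned word (an assumption of this sketch).
votes = defaultdict(Counter)


def make_challenge(control_answer, scan_id, rng=random):
    # Randomize the position so the visitor cannot tell which word is which.
    return Challenge(rng.randrange(2), control_answer, scan_id)


def check(challenge, answers):
    """answers holds the two typed words in display order."""
    typed_control = answers[challenge.control_index].strip().lower()
    if typed_control != challenge.control_answer.lower():
        return False  # failed the known word: treat as not human, record nothing
    # Passed the known word, so assume the other word was typed in good faith.
    typed_unknown = answers[1 - challenge.control_index].strip().lower()
    votes[challenge.scan_id][typed_unknown] += 1
    return True
```

This also answers the question above: typing anything at all fails, because the known word still has to match, and a guess for the scanned word only counts after the known word has been typed correctly.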

Friday, May 25, 2007
