CMU's 'Million Books' on the Web project makes slow, steady progress
Friday, May 21, 2004

"Ethics of Sex Acts" is the most downloaded book on this Web site, but before you get the wrong idea, consider that "Early Jazz" and "A Brief History of Mathematics" rank two and three.

And consider that the sexual ethics tome, written by Rene Guyon, was last printed in 1958.

The three books are among thousands currently accessible on the Web through the Million Book Project. An international effort spearheaded by Raj Reddy and colleagues at Carnegie Mellon University, its goal is to make a million books available free-to-read on the Web by 2007.

Most, like Guyon's, are older books no longer protected by copyright, or are out of print.

About 30 leaders of the project from CMU, India and China are meeting at the university this week for a workshop and "to keep the momentum going," said Reddy, who has made it one of his pet projects since stepping down as dean of the School of Computer Science in 1998.

The idea is to help address the disparity in the size and accessibility of library collections around the world, while also preserving cultural treasures that otherwise might be lost to decay, or seldom read in a world where information increasingly is digitized.

Scanning so many books and managing a large, searchable database is also a technical challenge for information science.

The National Science Foundation has provided three grants totaling more than $3.5 million for the project. Much of that money has been spent on computers and scanners. The partners in China and India, including the Indian Institute of Science, the University of Pune, Nanjing University and Peking University, shoulder the chore of scanning book texts into the database.

About 80,000 books have been digitized thus far, with 10,000 pages scanned every day during two shifts, or about two 300-page books every hour, Reddy said. Almost 30 scanning centers are now operating in India and China and another is in Egypt. More are being established in Australia and in Europe, as the project continues to expand into new countries.
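The scanning rate Reddy cites can be checked with a quick calculation. The sketch below is illustrative only: the eight-hour shift length is an assumption, not a figure from the project, and the function name is made up for this example.

```python
def books_per_hour(pages_per_day, pages_per_book, shifts_per_day, hours_per_shift):
    """Convert a daily page count into books scanned per working hour."""
    books_per_day = pages_per_day / pages_per_book
    working_hours = shifts_per_day * hours_per_shift
    return books_per_day / working_hours

# 10,000 pages a day, 300-page books and two shifts come from the article;
# the 8-hour shift is an assumed value.
rate = books_per_hour(10_000, 300, 2, 8)
print(f"{rate:.2f} books per hour")  # prints "2.08 books per hour"
```

Under that assumption, 10,000 pages a day works out to about 33 books a day, or roughly two books an hour across 16 working hours, which matches the figure quoted.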

After a year and a half of scanning, the project still seems to be far from its goal of a million books, Reddy acknowledged, but the pace is picking up. It took about nine months to complete just the first 1,000 books. By the end of this year, 100,000 books will be digitized and the one-millionth book could be added within two years, he added.

"We're trying to make the scanning less onerous," Reddy said, noting that the repetitive work can wear on the project staff.

But scanning represents only about 10 percent of the work. Most of the effort is spent on finding, selecting and shipping the books, not to mention reviewing and cataloguing the digitized versions. The texts are in a wide variety of languages, including Sanskrit, Urdu, Arabic, French and Hindi.

Reddy said he hopes that one million books will just be the beginning of the digitization effort.

"A million books is only about 1 percent of all the books in the world," he noted.

The Million Book Project: www.archive.org/texts/collection.php?collection=millionbooks

First published on May 21, 2004 at 12:00 am
Post-Gazette science editor Byron Spice can be reached at bspice@post-gazette.com or 412-263-1578.