1.5 million books and counting at CMU
Wednesday, November 28, 2007

The ultimate goal of the Million Book Project is to digitize every book that has ever been published.

It's only 1 percent of the way toward that target -- but that still makes for some impressive numbers.

The project, based at Carnegie Mellon University and working with partners in China, India and Egypt, has now scanned 1.5 million books, which can be viewed for free at the project's Web site, ulib.org.

The project's founders, led by former Carnegie Mellon computer science dean Raj Reddy, hope that one day every book will be available in whatever language the reader wants.

For now, though, books on the site are only available in their original languages.

Because the project has relied heavily on scanning centers in China and India to save money, 976,000 of the books scanned so far are in Chinese, nearly 100,000 are in various Indian languages, and 40,000 are in Arabic, a special contribution of the Bibliotheca Alexandrina library in Alexandria, Egypt.

About a quarter of the total -- 366,000 -- are in English, said Michael Shamos, a Carnegie Mellon computer science professor who is the project's director.

So far, virtually none of the books is in a modern European language other than English because libraries in those nations haven't been willing to cooperate with the project, Dr. Shamos said, and none is in Latin or classical Greek.

But that doesn't deter him.

"If it's a wacky, obscure collection so far, it doesn't matter," he said, "because for us, it's just a stepping stone to the next 10 million books."

Visitors to the Web site can search for books by title, author, language, country, subject or year written. The project currently has 18 books written between 1000 and 1100 A.D., for instance, and 7,746 in Sanskrit.

The project began in the late 1990s with Dr. Reddy's efforts to create a universal library, but shifted into higher gear in 2002 after receiving a $3.5 million seed grant from the National Science Foundation.

In the beginning, it faced daunting technical problems. "We began by deciding to digitize 1,000 books," Dr. Shamos said, "and it took us a lot longer than we expected."

Eventually, the system became faster and more foolproof.

Because of the age and rarity of many texts, workers at the scanning centers have to turn the pages by hand, but once they have done that, the pages are photographed and optical character recognition software converts the letters into digital form.

At this point, the 28 Chinese centers, 21 Indian centers and the one in Egypt are scanning 7,000 books a day.

Another early obstacle was resistance from libraries. Some librarians feared that digitization of books meant that someday, "nobody will come to my library," while others were reluctant to lend books to be shipped overseas for scanning.

Eventually, several libraries, including Carnegie Mellon's and the Carnegie Library of Pittsburgh, agreed to participate.

Asked how the Million Book Project fits in with book scanning being done at Google, Yahoo, Microsoft and Amazon, Dr. Shamos called those projects "fellow travelers."

Those companies hope to make money by selling Web versions of books or offering ads on sites that readers will visit, Dr. Shamos said, whereas the Carnegie Mellon project is aimed at offering free books over the Internet.

The other major hurdle facing both the Million Book Project and the companies is the restrictions imposed by various copyright laws.

About half the project's books still have copyright protection, and so, just as Google currently does, the project provides only selected pages of those volumes.

In the long run, Dr. Shamos believes, copyright laws should change to protect only creative works written for personal entertainment, not informational texts.

"If I'm a farmer in India and I need to find a better way to grow my crops, I shouldn't have to pay for that information," said Dr. Shamos, who also is a copyright lawyer.

"Our goal is to erase socioeconomic differences in access to books. We are sure that there are brilliant people in the developing world whose knowledge is now stunted because they don't have access to all the books you and I do."

Mark Roth can be reached at mroth@post-gazette.com or at 412-263-1130.
First published on November 28, 2007 at 12:00 am