Academic libraries pave a new path away from Google

By Angela Gunn | Published October 14, 2008, 6:39 PM

What's bigger than Google? The vision of librarians, according to the academic institutions banding together to create HathiTrust -- a "universal library" built in part on Google's scanning efforts.

HathiTrust (pronounced haw-TEE -- it's the Hindi word for elephant, that animal that famously lives long and never forgets) launched Tuesday. It's a project of the member universities of the Committee on Institutional Cooperation (CIC) and the University of California system. The CIC has been working with Google since last year to digitize books held in libraries at member schools; the UC system signed on with Google in 2006, and the University of Michigan's MBooks (now folded into HathiTrust) has been underway since the school announced affiliation with the Google Books Library Project during its launch in 2004.

In fact, HathiTrust's initial store of content will be built on digital copies of the scans Google made for the Google Books Library Project. In the course of that endeavor, each partner library received a copy of whatever material they offered to the system for, as they say, "ingestion." Those copies were free for schools to use in any appropriate way. HathiTrust is thus able to start life with 78 terabytes (738.8 million pages, 1,713 tons, 25 miles) of content; about 20% of that is available to the general public.

If Google's done it already, why build HathiTrust? Google's library project is, as Google puts it, "an enhanced card catalog" for everybody's use. But librarians familiar with the formation of HathiTrust have expressed concern that the growing collection of full texts lacks the kind of professional curation that a research-class archive requires -- and that universities, not companies, are the better long-term caretakers of information and scholarship. In addition, some of the materials each university holds were never appropriate for Google scanning; those materials will eventually be digitized as needed and introduced directly to the HathiTrust system, with no version necessarily delivered to Google.

The entire HathiTrust project is licensed under a Creative Commons agreement, with provision made for the approximately 80% of material that still falls under copyright restrictions.

The project's been in the works since March, when the CIC reached an agreement on what was then known as the Shared Digital Repository. A number of information professionals at schools such as the University of Wisconsin and Indiana University spent the summer building mirror sites, figuring out large-scale search options, and (in one school's case) untangling certain problems with Google's previous scans of some of the library's holdings.

At present the partners are preparing for the two-step process of moving their content into the HathiTrust system. First, each institution must prepare accurate bibliographic records to provide metadata for its "digital objects." Once that's done, the content itself can come in -- via Google, if the school prefers, or by non-Google mechanisms currently under development. That content will consist for now of page-image files plus associated optical character recognition files and metafiles.

HathiTrust is available for searching by the interested public, though there's no grand unified search interface yet. The University of Michigan and the University of Chicago both currently offer search.

Comments

"78 terabytes (738.8 million pages, 1,713 tons, 25 miles)"

After all these years it had to be Angela Gunn who taught me how much a terabyte weighs ...

|

I loved those numbers, but alas I get no credit -- they're direct from the HathiTrust site itself. Twenty-five miles! Suddenly your average data center seems so... I don't know, compact?

|

Aw, c'mon, my grandmother is a former librarian! (And more self-aggrandizing than Google? How's THAT work? :-D )

It's an interesting situation. I don't know of many librarians (other than old coots like Michael Gorman) who don't think there's at least potential good in wider availability for materials that otherwise languish in the stacks. OTOH, research and all the processes that go with that -- it's not just a matter of Googling stuff 'til your brain or dissertation or whatever is full. I think I'm pretty good at the Internets, but when the research gets tough I find that deferring to the superior skills of trained librarians pays off hugely. They simply know more places to dig, and more sophisticated ways of wielding the shovels.

The part of this story that most interests me, though -- and I do wish I'd gotten a callback on this point -- is the business of "other" materials, the stuff that never reached Google. I can imagine that each school would have certain items, ephemera and extremely arcane donated archives and whatnot, that might not have been right to send to Google. How is this system more appropriate? Raises all sorts of questions about permissions and data management. I hope to learn more soon.

|

I was only elaborating about librarians in general, not Google... Yes, they are even worse there. But modern librarians are looking for a home and find that they have to push their way into I.T. areas that they know little about.
As for research, I've found that I can usually find what I need faster than most of the librarians that I work with, and I work with a few. I have had to help them many times with how to really use the systems, as they're mostly based on our network applications.

|

Hi lazarus98 -- wow. Not for nothing, but that sounds like an epic usability fail. Must be really frustrating for all involved. But that's not a reflection on their research skills; it's a reflection on the system you've all been given to work with...

My situation may differ from yours -- in my off-work time (yes, I have some) I'm researching certain history-of-music topics. Archive.org and such have been fantastic re digitizing actual recordings still extant, but beyond that, a *lot* of source material just hasn't made it online. Librarians have been wonderful at finding resources for me, whether those are obscure databases (some available only through my local library system) or books that will never, thanks to our ridiculous copyright laws, reach the interwebs. (You can't imagine how fervently I hoped for better results re orphan works during the latest congressional go-round. Needless to say, like most researchers I am not happy with the results.)

|

Leave it to a librarian to think they know more than most anyone. They are so self-aggrandizing.

|

Yeah, how dare those in library science think they can organize a more open system than Google.

Maybe Google can make a play for grabbing up the libraries like they have with their phony 'free the airwaves' grab for whitespace in their attempt to seize the spectrum from myriad other non-commercial players so they can push their product line...

Personally, if they can get the majority of libraries to cooperate, I think it's a great idea - especially if they are willing to apply the resource organization and cross-cataloging to the resources.

|
