Monday, August 23, 2010

Don't Google This!

In February 2010, this blog revealed "How Google Books Does It". Now this blog reveals how Google Books hasn't been doing it (correctly). What the right Google-hand gives, the left takes away. If this is a (fair & balanced) metadata mess, so be it.

[x The Cronk Review]
Google's Book Search: A Disaster For Scholars
By Geoffrey Nunberg

[Images of some of the errors discussed in this article can be found here (PDF).]

Whether the Google books settlement passes muster with the U.S. District Court and the Justice Department, Google's book search is clearly on track to becoming the world's largest digital library. No less important, it is also almost certain to be the last one. Google's five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else—Elsevier, Unesco, Wal-Mart. But it's safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google's servers today, augmented by the millions of titles published in the interim.

That realization lends a particular urgency to the concerns that people have voiced about the settlement (about pricing, access, and privacy, among other things). But for scholars, it raises another, equally basic question: What assurances do we have that Google will do this right?

Doing it right depends on what exactly "it" is. Google has been something of a shape-shifter in describing the project. The company likes to refer to Google's book search as a "library," but it generally talks about books as just another kind of information resource to be incorporated into Greater Google. As Sergey Brin, co-founder of Google, puts it: "We just feel this is part of our core mission. There is fantastic information in books. Often when I do a search, what is in a book is miles ahead of what I find on a Web site."

Seen in that light, the quality of Google's book search will be measured by how well it supports the familiar activity that we have come to think of as "googling," in tribute to the company's specialty: entering a string of keywords in an effort to locate specific information, like the dates of the Franco-Prussian War. For those purposes, we don't really care about metadata: the whos, whats, wheres, and whens provided by a library catalog. It's enough just to find a chunk of a book that answers our needs and barrel into it sideways.

But we're sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you're looking for a particular edition of Leaves of Grass and simply punch in, "I contain multitudes," that's what you'll get. For those purposes, you want to be able to come in via the book's metadata, the same way you do if you're trying to assemble all the French editions of Rousseau's Social Contract published before 1800 or books of Victorian sermons that talk about profanity.

Or you may be interested in books simply as records of the language as it was used in various periods or genres. Not surprisingly, that's what gets linguists and assorted wordinistas adrenalized at the thought of all the big historical corpora that are coming online. But it also raises alluring possibilities for social, political, and intellectual historians and for all the strains of literary philology, old and new. With the vast collection of published books at hand, you can track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase "gentle reader."

But to pose those questions, you need reliable metadata about dates and categories, which is why it's so disappointing that the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 18 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google's book search, but these errors are endemic. A search on "Internet" in books published before 1950 produces 527 results; "Medicare" for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. "Charles Dickens" turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors? A search on books published before 1920 mentioning "candy bar" turns up 66 hits, of which 46—70 percent—are misdated. I don't think that's representative of the overall proportion of metadata errors, though they are much more common in older works than for the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.
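The birth-year test and the candy-bar tally described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Record layout, the function names, and the second sample record are hypothetical, and nothing here reflects how Google's book search or any library catalog is actually queried.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical catalog record; the field names are assumptions."""
    title: str
    claimed_year: int       # publication year shown in the search result
    earliest_possible: int  # e.g. the birth year of the book's subject


def find_anachronisms(records):
    """Return records dated earlier than they could plausibly have appeared."""
    return [r for r in records if r.claimed_year < r.earliest_possible]


def percent_misdated(misdated, total):
    """Share of misdated hits in a sample, rounded to a whole percent."""
    return round(100 * misdated / total)


# The first record uses the Drucker case from the article (dated 1905,
# four years before his birth); the second is an invented control.
sample = [
    Record("A book on Peter F. Drucker", 1905, 1909),
    Record("An invented, plausibly dated control", 1950, 1909),
]

flagged = find_anachronisms(sample)
print([r.title for r in flagged])  # only the Drucker record is flagged
print(percent_misdated(46, 66))    # 70, the candy-bar figure above
```

A check like this only catches dates that are impossible, not dates that are merely wrong, so it gives a lower bound on the error rate rather than an estimate of it.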

Google acknowledges the incorrect dates but says they came from the providers. It's true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's own doing. A lot of them arise from uneven efforts to automatically extract a publication date from a scanned text. A 1901 history of bookplates from the Harvard University Library is correctly dated in the library's catalog. Google's incorrect date of 1574 for the volume is drawn from an Elizabethan armorial bookplate displayed on the frontispiece. An 1890 guidebook called London of To-Day is correctly dated in the Harvard catalog, but Google assigns it a date of 1774, which is taken from a front-matter advertisement for a shirt-and-hosiery manufacturer that boasts it was established in that year.
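To see how this kind of error can arise, consider a deliberately naive extractor. The sketch below is an assumption made for illustration, not a description of Google's method, which the article does not document: the regular expression, the function names, and the sample front-matter string are all invented, with the string loosely modeled on the London of To-Day case.

```python
import re

# Four-digit years from 1500 through 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def naive_year(front_matter):
    """Hypothetical naive rule: take the earliest year found in the text."""
    years = [int(y) for y in YEAR.findall(front_matter)]
    return min(years) if years else None


def choose_year(front_matter, catalog_year=None):
    """Prefer a library catalog date when one exists; else fall back."""
    if catalog_year is not None:
        return catalog_year
    return naive_year(front_matter)


# Invented front matter: an advertisement precedes the real imprint date.
text = "Shirt and hosiery makers. Established 1774. LONDON OF TO-DAY. 1890."

print(naive_year(text))                      # 1774, taken from the advertisement
print(choose_year(text, catalog_year=1890))  # 1890, taken from the catalog
```

The point of the second function is the one the article makes: where a catalog record exists, it is a safer source for the date than anything scraped from the page image.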

Then there are the classification errors, which taken together can make for a kind of absurdist poetry. H.L. Mencken's The American Language is classified as Family & Relationships. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as Antiques and Collectibles (a 1930 English edition of Flaubert's novel is classified under Physicians, which I suppose makes a bit more sense). An edition of Moby Dick is labeled Computers; The Cat Lover's Book of Fascinating Facts falls under Technology & Engineering. And a catalog of copyright entries from the Library of Congress is listed under Drama (for a moment I wondered if maybe that one was just Google's little joke).

You can see how pervasive those misclassifications are when you look at all the labels assigned to a single famous work. Of the first 10 results for Tristram Shandy, four are classified as Fiction, four as Family & Relationships, one as Biography & Autobiography, and one is not classified. Other editions of the novel are classified as Literary Collections, History, and Music. The first 10 hits for Leaves of Grass are variously classified as Poetry, Juvenile Nonfiction, Fiction, Literary Criticism, Biography & Autobiography, and, mystifyingly, Counterfeits and Counterfeiting. And various editions of Jane Eyre are classified as History, Governesses, Love Stories, Architecture, and Antiques & Collectibles (as in, "Reader, I marketed him.").

Here, too, Google has blamed the errors on the libraries and publishers who provided the books. But the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And BISAC classifications weren't in wide use before the last decade or two, so only Google can be responsible for their misapplications on numerous books published earlier than that: the 1919 edition of Robinson Crusoe assigned to Crafts & Hobbies or the 1907 edition of Sir Thomas Browne's Hydriotaphia: Urne-Buriall, which has been assigned to Gardening.

Google's fine algorithmic hand is also evident in a lot of classifications of recent works. The 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture, and the Body (misdated 1899) is assigned to Health & Fitness—not a labeling you could imagine coming from its publisher, the University of California Press, but one a classifier might come up with on the basis of the title, like the Religion tag that Google assigns to a 2001 biography of Mae West that's subtitled An Icon in Black and White or the Health & Fitness label on a 1962 number of the medievalist journal Speculum.

But even when it gets the BISAC categories roughly right, the more important question is why Google would want to use those headings in the first place. People from Google have told me they weren't included at the publishers' request, and it may be that someone thought they'd be helpful for ad placement. (The ad placement on Google's book search right now is often comical, as when a search for Leaves of Grass brings up ads for plant and sod retailers, though that's strictly Google's problem, and one, you'd imagine, that they're already on top of.) But it's a disastrous choice for the book search. The BISAC scheme is well suited for a chain bookstore or a small public library, where consumers or patrons browse for books on the shelves. But it's of little use when you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC Juvenile Nonfiction subject heading has almost 300 subheadings, like New Baby, Skateboarding, and Deer, Moose, and Caribou. By contrast, the Poetry subject heading has just 20 subheadings. That means that Bambi and Bullwinkle get a full shelf to themselves, while Leopardi, Schiller, and Verlaine have to scrunch together in the single subheading reserved for Poetry/Continental European. In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore.

Such examples don't exhaust Google's metadata errors by any means. In addition to the occasionally quizzical renamings of works (Moby Dick: or the White Wall), there are a number of mismatches of titles and texts. Click on the link for the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voice of the Heart, while the link on a misdated number of Dickens's Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James." More mysterious is the entry for a book called The Mosaic Navigator: The Essential Guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. The only connection I can come up with is that Jones was the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word "mosaic," though the details of the process leave me baffled.

For the present, then, scholars will have to put on hold their visions of tracking the 19th-century fortunes of liberalism or quantifying the shift of "United States" from a plural to singular noun phrase over the first century of the republic: The metadata simply aren't up to it. It's true that Google is aware of a lot of these problems and they've pledged to fix them. (Indeed, since I presented some of these errors at a conference last week, Google has already rushed to correct many of them.) But it isn't clear whether they plan to go about this in the same way they're addressing the scanning errors that riddle the texts, correcting them as (and if) they're reported. That isn't adequate here: There are simply too many errors. And while Google's machine classification system will certainly improve, extracting metadata mechanically isn't sufficient for scholarly purposes. After first seeming indifferent, Google decided it did want to acquire the library records for scanned books along with the scans themselves, but as of now the company hasn't licensed them for display or use—hence, presumably, those stabs at automatically recovering publication dates from the scanned texts.

Some of the slack may be picked up by other organizations such as the Internet Archive or HathiTrust, a consortium of participating libraries that is planning to make available several million of the public-domain books from their collections that Google scanned, along with their bibliographic records. But for now those sources can only provide access to books in the public domain, about 15 percent of the scanned collections; only Google will have the right to display the orphan works published since 1923.

In any case, none of that should relieve Google of the responsibility of making its collections an adequate resource for scholarly research. That means, at a minimum, licensing the catalogs of the Library of Congress and OCLC Online Computer Library Center and incorporating them into the search engine so that users can get accurate results when they search on various combinations of dates, keywords, subject headings, and the like. ("Adequate" means a lot more than that, as well, from improving the quality of scanning to improving Google's very flaky hit-count algorithms and rationalizing the resulting rankings, which now make no sense at all and often lead with inferior or shoddy editions of classic works.) Whether or not a guarantee of quality is a contractual obligation, it's implicit in the project itself. Google has, justifiably, described its book-scanning program as a public good. But as Pamela Samuelson, a director of the Center for Law & Technology at the University of California at Berkeley, has said, every great public good implies a great public trust.

I'm actually more optimistic than some of my colleagues who have criticized the settlement. Not that I'm counting on selfless public-spiritedness to motivate Google to invest the time and resources in getting this right. But I have the sense that a lot of the initial problems are due to Google's slightly clueless fumbling as it tried to master a domain that turned out to be a lot more complex than the company first realized. It's clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google's great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren't simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching.

That makes for a steep learning curve, all the more so because of Google's haste to complete the project so that potential competitors would be confronted with a fait accompli. But whether or not the needs of scholars are a priority, the company doesn't want Google's book search to become a running scholarly joke. And it may be responsive to pressure from its university library partners—who weren't particularly attentive to questions of quality when they signed on with Google—particularly if they are urged (or if necessary, prodded) to make noise about shoddy metadata by the scholars whose interests they represent. If recent history teaches us anything, it's that Google is a very quick study. Ω

[Geoffrey Nunberg (BA, Columbia; MA, Penn; PhD, CUNY) is an adjunct full professor at the School of Information at the University of California at Berkeley. Until 2001, he was a principal scientist at the Xerox Palo Alto Research Center, working on the development of linguistic technologies. His most recent book is The Years of Talking Dangerously (2009).]

Copyright © 2010 The Chronicle of Higher Education

Creative Commons License
Sapper's (Fair & Balanced) Rants & Raves by Neil Sapper is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. Based on a work at sapper.blogspot.com. Permissions beyond the scope of this license may be available here.

Copyright © 2010 Sapper's (Fair & Balanced) Rants & Raves