
The de-dupe problem

If you collect a lot of stuff in electronic form, the issue of how many copies to maintain becomes a concern. Ideally, you’d only store one copy of each photograph, album, document, or whatever. Metadata can then be used to find the same item through various routes such as file path, search words, or description.
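One common way to get to "one copy, many routes" is to key each item by a hash of its contents and keep the paths and search words as separate metadata pointing at that key. The sketch below is only an illustration of that idea under simple assumptions: the class name `SingleCopyStore` and its methods are made up for this example, everything lives in memory, and two items count as "the same" only if they are byte-for-byte identical.

```python
import hashlib


class SingleCopyStore:
    """Hypothetical in-memory store: one blob per unique content, many routes to it."""

    def __init__(self):
        self.blobs = {}  # content digest -> the single stored copy
        self.paths = {}  # file path -> content digest
        self.tags = {}   # lowercased search word -> set of content digests

    def add(self, path, data, tags=()):
        """Register an item; identical bytes are stored only once."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # keeps the first copy, ignores repeats
        self.paths[path] = digest
        for tag in tags:
            self.tags.setdefault(tag.lower(), set()).add(digest)
        return digest

    def by_path(self, path):
        """Look up an item by any path it was filed under."""
        return self.blobs[self.paths[path]]

    def by_tag(self, tag):
        """Look up items by search word (case-insensitive)."""
        return [self.blobs[d] for d in sorted(self.tags.get(tag.lower(), ()))]
```

Adding the same photo under two paths with different search words leaves one stored blob, reachable by either path or either word. Near-duplicates, such as a resized photo or a re-encoded album, would hash differently, and that is where the problem gets harder.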

Google encounters this problem in spades in its project to scan the world’s books. The essay “Books of the world, stand up and be counted! All 129,864,880 of you.” describes the problem. The difficulties run the gamut from the basic question of what should count as a book to the many classification and identification schemes used to manage book collections.

So what does Google do? In the essay’s words: “We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.”
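To get a feel for what reducing duplication within each provider might involve, here is a minimal sketch. It is not Google's method, which the quote does not spell out. It assumes each record is a plain dict with `provider`, `title`, and `author` fields (names invented for this example) and treats two records from the same provider as duplicates when their titles and authors match after lowercasing and stripping punctuation.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedupe_within_provider(records):
    """Keep the first record seen for each (provider, title, author) key.

    Records are assumed to be dicts with 'provider', 'title' and 'author'
    keys; this field layout is hypothetical.
    """
    seen = {}
    for rec in records:
        key = (rec["provider"], normalize(rec["title"]), normalize(rec["author"]))
        seen.setdefault(key, rec)  # later duplicates are dropped
    return list(seen.values())
```

A rule this crude only catches differences in case and punctuation. Variant spellings, translated titles, and multi-volume sets would need more judgment, which is part of why the real problem is hard.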

Those 600 million records get boiled down further in various ways, such as weeding out errors and April Fools’ jokes (a turkey probe that was given an ISBN, for example), to the point where Google figures there are about 130 million books in the world.

Take a gander through the essay. Things that seem simple, like creating a catalog of all the world’s books, can – and usually do – have complexities you might not imagine.
