2012 DLF Forum

No tempest in my teapot: Analysis of Crowdsourced Data and User Experiences at the California Digital Newspaper Collection

Session Type: Presentation/Panel

Session Description:

The National Library of Australia has enabled OCR text correction for its digitized historical newspapers since the collection first went online in mid-2008. By the end of that year, volunteers from around the globe were correcting newspaper text in Australia’s Trove collection at a rate of more than 300,000 lines per month. This rate has since risen to approximately 2,400,000 lines per month (June 2012). The top producer alone has corrected more than 2,250,000 lines.

At the California Digital Newspaper Collection (CDNC), user OCR text correction began in August 2011. As of July 1, 2012, 636 registered volunteers have corrected approximately 370,000 lines of newspaper text, with the most dedicated user correcting more than 150,000 lines.

What motivates Trove and CDNC volunteers? Who are they? In this session we look at:

1. Demographics: What are the ages, professions, and genders of text correctors, and how did they learn about CDNC text correction?
2. Experiences: What do text correctors like about newspapers and about text correction?
3. Motivation: What makes users correct text?
4. Quality: OCR of newspaper text is often of very poor quality. What is the quality of the crowdsourced corrected text?
5. Preferred data: What types of newspaper stories do text correctors prefer?
6. Economics: What is the estimated economic value of corrected text? What are the costs of providing a text correction infrastructure?

Finally, we look at the differences and similarities between the Trove and CDNC collections and their volunteers, and between these and non-newspaper library or archival collections that have used crowdsourcing.

Session Leaders:

Brian Geiger, California Digital Newspaper Collection, University of California Riverside
Frederick Zarndt, IFLA Newspapers Section
