MAKE: Capturing Data Locked Away In History Books


Lots of history books contain structured data: tables, graphs, appendices. In most cases these data derive from databases created, compiled, and/or arranged by the author. In few cases are these databases made easily available for reuse by readers. Rather, in most cases the data are hard to reuse because a) they are available only in print and b) they are published under copyright.

The UK Text and Data Mining Copyright Exception (hereafter ‘the TDM Exception’) states that:

The new copyright exception allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work […] This exception only permits the making of copies for the purpose of text and data mining for non-commercial research


I infer from this four things:

  1. ‘any copyright material’ includes books published in print form.
  2. ‘the right to read’ includes books held in a library to which I subscribe.
  3. ‘make copies’ includes both digitisation and transcription.
  4. ‘researchers’ includes teams who intend to ‘make copies’ for future – as yet unspecified – ‘non-commercial research’.

This proposed session will:

  • Estimate our collective ‘right to read’ and capture this as a document that offers non-legal advice to other UK-based historians about what they can and can’t do with printed material under the TDM Exception.
  • ‘Make copies’ of structured data found in a small selection of history books (topic to be determined). This will be achieved both by hand-transcription and by Optical Character Recognition (OCR) software (for example Tesseract or ContentMine software).
  • Test the capabilities of OCR software for capturing data tables and publish our findings as a short guidance document.
  • Combine that data and determine where the data can be stored for subsequent reuse by ‘the researchers’ who made copies.
  • Use that combined data for preliminary historical research that demonstrates the value of the TDM Exception.
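As a rough illustration of how the OCR test might be scored, here is a minimal Python sketch. It assumes the OCR software has already produced plain text in which table columns are separated by two or more spaces (an assumption about the output, not something any particular tool guarantees), and it compares that output cell by cell against a hand-transcription of the same table. The function names and the sample table are hypothetical, invented purely for illustration.

```python
import csv
import io
import re


def parse_ocr_table(ocr_text):
    """Split OCR plain-text output into rows of cells.

    Assumes columns are separated by runs of two or more spaces,
    which is an assumption about the OCR output, not a guarantee.
    """
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def cell_accuracy(ocr_rows, hand_rows):
    """Return the share of hand-transcribed cells that the OCR matched exactly."""
    total = 0
    matched = 0
    for r, hand_row in enumerate(hand_rows):
        for c, hand_cell in enumerate(hand_row):
            total += 1
            try:
                if ocr_rows[r][c] == hand_cell:
                    matched += 1
            except IndexError:
                pass  # OCR dropped this row or cell; counts as a miss
    return matched / total if total else 0.0


def to_csv(rows):
    """Serialise rows as CSV text so the table can be stored and reused."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical example data, invented purely for illustration.
HAND_ROWS = [
    ["Year", "Parish", "Baptisms"],
    ["1801", "St Giles", "112"],
    ["1811", "St Giles", "130"],
]
OCR_TEXT = """
Year   Parish     Baptisms
1801   St Giles   112
18II   St Giles   130
"""

ocr_rows = parse_ocr_table(OCR_TEXT)
print(f"Cell accuracy: {cell_accuracy(ocr_rows, HAND_ROWS):.2f}")
print(to_csv(ocr_rows))
```

Exact cell matching is deliberately strict: a single misread character (here ‘18II’ for ‘1811’) counts the whole cell as wrong, which seems a reasonable starting point when the cells are numbers intended for later analysis. A more forgiving measure, such as character-level edit distance, could be swapped in if the guidance document needs finer detail.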

I am looking for people to work on this with me. No technical aptitude is either required or preferred, though the project will work better with a balanced team. I anticipate that some preparatory work will be needed in advance of the workshop (for example, to install OCR software and check usability/suitability).

Please post thoughts, suggestions, and/or your willingness to get involved below!

Categories: Session: Make

About James Baker

James Baker is a Lecturer in Digital History and Archives at the University of Sussex (and the awesome Sussex Humanities Lab). He is a historian of long-eighteenth-century Britain and of contemporary archiving. He is a Software Sustainability Institute Fellow and holds degrees from the University of Southampton and latterly the University of Kent, where in 2010 he completed his doctoral research on the late-Georgian satirical artist-engraver Isaac Cruikshank. As an eighteenth centuryist, his research interests include satirical art, the making and selling of printed objects, urban protest, and corpus analysis. His near-contemporary historical interests include the curation of personal digital archives, the critical examination of forensic software and captures, the use of born-digital archives in historical research, and scribing and archiving in the age of the hard disk. Prior to joining Sussex, James held positions as Digital Curator at the British Library and Postdoctoral Fellow with the Paul Mellon Centre for Studies of British Art. He is a member of the Arts and Humanities Research Council Peer Review College, a convenor of the Institute of Historical Research Digital History seminar, and a member of the History Lab Plus Advisory Board.