Lots of history books contain structured data: tables, graphs, appendices. In most cases these data derive from databases created, compiled, and/or arranged by the author. In few cases are these databases made easily available for reuse by readers. Rather, in most cases the data is hard to reuse because a) it is available only in print and b) it is published under copyright.
The UK Text and Data Copyright Exception (hereafter ‘the TDM Exception’) states that:
The new copyright exception allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work […] This exception only permits the making of copies for the purpose of text and data mining for non-commercial research
I infer from this four things:
- ‘any copyright material’ includes books published in print form.
- ‘the right to read’ includes books held in a library to which I subscribe.
- ‘make copies’ includes both digitisation and transcription.
- ‘researchers’ includes teams who intend to ‘make copies’ for future – as yet specified – ‘non-commercial research’.
This proposed session will:
- Estimate our collective ‘right to read’ and capture this as a document that offers non-legal advice to other UK-based historians about what they can and can’t do with printed material under the TDM Exception.
- ‘make copies’ of structured data found in a small selection of history books (topic to be determined). This will be achieved both by hand-transcription and by Optical Character Recognition (OCR) software (for example Tesseract or ContentMine software).
- Test the capabilities of OCR software for capturing data tables and publish a our findings as a short guidance document.
- Combine that data and determine where the data can be stored for subsequent reuse by ‘the researchers’ who made copies.
- Use that combined data for preliminary historical research that demonstrates the value of the TDM Exception.
I am looking for people to work on this with me. No technical aptitude is either required or preferred, though the project will work better with a balanced team. I anticipate that some preparatory work will be needed in advance of the workshop (for example, to install OCR software and check usability/suitability)
Please post thoughts, suggestions, and/or your willingness to get involved below!