This and THATCamp Sussex Humanities Lab 2017

This and THATCamp Sussex Humanities Lab: debrief

James Baker — Thu, 27 Jul 2017 08:13:34 +0000

Earlier this month I had the pleasure of hosting the second annual This and THATCamp Sussex Humanities Lab: an unconference event on the THATCamp (The Humanities and Technology Camp) model. A diverse group of people from humanities, library, archives, and law backgrounds attended to work on the theme ‘Rules, Regulations, Resistance’ and a prominent part of our work together was around what – legally, ethically, morally – is possible when text and data mining (both in the UK and in other legal jurisdictions).

The event was lively, energetic, and productive. In this post, I want to focus on a session I proposed: ‘Capturing Data Locked Away in History Books’. My motivation for this session was framed by a simple problem: there is lots of tabular data locked away in history books but what can we do to get it out? And so there are two aspects to this problem: what is allowed and what is technologically possible.

On the legal side, we turned – as we had for much of the event – to the UK Text and Data Mining Exception (2014) (see section 29A or this guidance for more info). From this we inferred that the following was relevant to what we were trying to achieve:

Printed books are ‘works’ and those works are subject to copyright (even if they are out of print).
We have ‘lawful access’ to those books if we buy them or can get them from a library.
When we have lawful access to books we can text and data mine those books.
That text and data mining can only be on a non-commercial basis.
Text and data mining in the context of research at a university probably constitutes non-commercial research in most cases (unless of course we are making lots of money from it!).
We cannot share the data we are text and data mining (either with a research group or with peer reviewers).
We can share outputs from the text and data mining that are facts (so, something like word counts).
We can write about what we have done.
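
To make the ‘facts’ point concrete, here is a minimal sketch of the kind of derived output we had in mind: word counts computed from a mined text. It assumes the text is already available as a plain-text string; the function name `word_counts` and the sample sentence are invented for illustration and are not taken from any of the books we looked at.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Return lower-cased word frequencies for a plain-text string.

    Hypothetical helper: a 'word' here is simply a run of ASCII letters,
    optionally followed by an apostrophe and more letters.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


# Invented sample text, standing in for a lawfully accessed page.
sample = "The mill closed. The workers left the mill."
counts = word_counts(sample)

# Only the aggregate counts are kept, not the text they were derived from.
for word, n in counts.most_common(2):
    print(word, n)
```

The counts say how often each word occurs without reproducing the passage itself, which is why we treated them as the sort of output that could be shared.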

During the day, we were introduced to the excellent ‘Legal Information Platform’ developed by CLARIN and aimed at Digital Humanities researchers. This resource has lots of excellent advice on text and data mining and will be updated as the law develops (which it will!).

Shortly after the event, I also found the Jisc guide The text and data mining copyright exception: benefits and implications for UK higher education. This is not only another useful resource (if UK specific) but also one that contradicts our understanding during the camp about the ability to share the data we are text and data mining. John Kelly – the author of the guide – writes:

NOTE: Within the context of research projects involving groups of people across institutions, sharing access to a lawfully mined copy is likely to be acceptable as long as each member of the group has lawful access to original content being mined.

Recommendation: Any TDM undertaken by research groups should ensure that all individuals have lawful access to the original work either through their own institution or via registration at the institution where the mining takes place.

The point is then, read the various guides, make an estimation of the risks involved, and seek legal advice if you are unsure. In many cases your university library will be able to help.

On the technology side, we started by taking photographs (using a nice camera and a smartphone) of pages in history books that contained tables. We then tested a range of Optical Character Recognition (OCR) software to see if it was able to recognise tables and the characters within those tables. We looked at Tesseract (open source OCR software), ABBYY Finereader, Google Drive, and various conversion websites (including Convertio, the Google Vision API, Awesome OCR, and Tabular). We also observed that – for those with member access – the EU-funded IMPACT project contains a demonstrator platform for testing various OCR software/services against various types of textual data.

What we found was that, broadly speaking, online conversion tools are poor at converting tables in history books into data. The exception was Convertio which – we think – was using layout recognition packages for Tesseract to output surprisingly accurate representations of the data tables. On Tesseract, we found that the core installation (for example, via Docker here or here) doesn’t come with layout recognition packages installed (which you need for any non-linear text) and that it isn’t good at handling warped images. This means that, where possible, scanning on a flatbed – which not everyone has – is better than using a phone or a camera. ABBYY Finereader 12.1.6 – a commercial package (now up to version 14) – turned out to be the winner, confirming similar findings reported by Katrina Navickas during her work on the Political Meetings Mapper. Although Finereader didn’t get all the values in the tables right, we put that down to poor quality images (we only tested it using mobile phone pictures). What Finereader did very well was recognise a) that there was a table on the page and b) the layout of that table, and output that as structured html or xml (export formats Tesseract can handle as well).
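
Once a tool has produced structured html, getting from there to reusable data is a small step. The following is a minimal sketch, not what we ran on the day: it assumes the OCR export contains a plain `<table>` with `<tr>`, `<th>`, and `<td>` elements, and the class `TableExtractor`, the function `html_table_to_csv`, and the sample table values are all invented for illustration. It uses only the Python standard library.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html: str) -> str:
    """Convert the table rows found in `html` to CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Invented sample, standing in for an OCR tool's html export.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Meetings</th></tr>
  <tr><td>1817</td><td>12</td></tr>
  <tr><td>1819</td><td>30</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Real exports are likely to be messier (merged cells, nested tables, footnote markers), so treat this as a starting point and check the result against the printed page.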

Together, we figured out what to do, what we think the law allowed us to do, and which tools offered the greatest potential for further work. Lots of tabular data is still locked up in history books and – in the UK context at least – the Text and Data Mining Exception doesn’t appear to offer the prospect of getting that data out, combining that data, and sharing that data. But I certainly have a better sense of the legal and technological landscape in which I work than I did going in. Which is the point of hosting a THATCamp.

TDM across Europe: the state of play

James Baker — Tue, 27 Jun 2017 15:49:08 +0000

In order to frame our text and data mining work, Erik Ketzan (Birkbeck) will introduce some current European legal developments that pertain to text and data mining. These will include the CLARIN Legal Information Platform, the status of the TDM exception in Germany, and the European Commission’s Proposal for the Directive on Copyright in the Digital Single Market.

Talk. Social network data: What can I do with it ?

Geraldine Castel — Thu, 15 Jun 2017 12:32:21 +0000

I’ve been working on online political campaigning in France and the UK for some years now, and like many of us, had to start paying attention to what was happening on social networks at election time. To try and enable post facto analysis of data collected on Facebook and Twitter, I worked with computer scientists to build a searchable database from Facebook posts and tweets from a selection of candidates to the 2014 elections to the European Parliament. Along the way, we encountered technical difficulties but also ethical and legal ones.

From what information we could gather, databases seem to be officially regulated by two main regimes: copyright laws regarding their ownership and exploitation, and personal data protection laws, applicable when databases contain information on an identifiable subject which is the case here with political candidates. However, finding out precisely how those regimes apply in the case of data collected on social networks was a major challenge.

This session would therefore aim to discuss such issues with researchers involved in similar projects or considering doing so to address questions like the following :

Who owns the data published on social networks ? Companies like Facebook ? Individuals who share contents ? Both, as Facebook users, for example, grant the company a non-exclusive cession of copyrights?
Does it depend on the privacy settings chosen by users ? On the platform used ? On the type of data ?
Which legislation applies ? That of the country where the platform is operated ? That of the individuals who share contents ? That of the researcher using the data ?
What about potential solutions to enable researchers to use the data legally ? For instance anonymizing datasets ? Getting ‘prior consent’ ? Buying the data ?

Aaaand, that’s a lot of questions !! Maybe several sessions ? Or just one for a specific question ?

Best,

3d printing and our rights in the makerspace

James Baker — Thu, 08 Jun 2017 13:19:04 +0000

The Sussex Humanities Lab has a 3d printer.

Thingiverse is the main place to get digital files for printing. Most of the content on Thingiverse is open licensed, meaning that – in most cases – the uploader of the digital file claims copyright and the right to apply Creative Commons licenses (or similar) to their work.

A quick survey of Thingiverse suggests that uploaders may not have the right to apply open licenses to some of the content they upload. Borderline cases include:

– This file www.thingiverse.com/thing:116411 is a design plan for blocks compatible with Duplo.
– This file www.thingiverse.com/thing:2285505 is a design plan for terrain made from in-copyright Google Maps data at touchterrain.geol.iastate.edu/
– The file www.thingiverse.com/thing:912478 is a model of the Eiffel Tower

Given the academic interest in 3d printing this session will introduce 3d printing and discuss the potential future legal challenges of the technology.

EU Law, Technological Protection Measures, and Mining Material that is Under Copyright

Martin Paul Eve — Fri, 31 Mar 2017 15:57:04 +0000

The current situation around DRM and UK copyright legislation makes it difficult for researchers to obtain legal copies of texts and other in-copyright material for research purposes.

While UK law is supposed to provide protections to allow researchers to take advantage of exceptions for research, it remains a criminal offence to strip the DRM off e-texts, for instance. Section 296ZE of the Copyright, Designs and Patents Act states that a complaint to the Secretary of State is the correct way to appeal this if the rightsholders refuse to provide a copy that is suitable for digital research purposes.

In this session, we would like to gather like-minded people together to discuss what could be done and which targets might prove best for such an appeal.

MAKE: Capturing Data Locked Away In History Books

James Baker — Wed, 01 Feb 2017 10:25:47 +0000

Lots of history books contain structured data: tables, graphs, appendices. In most cases these data derive from databases created, compiled, and/or arranged by the author. In few cases are these databases made easily available for reuse by readers. Rather, in most cases the data is hard to reuse because a) it is available only in print and b) it is published under copyright.

The UK Text and Data Mining Copyright Exception (hereafter ‘the TDM Exception’) states that:

The new copyright exception allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work […] This exception only permits the making of copies for the purpose of text and data mining for non-commercial research

I infer from this four things:

‘any copyright material’ includes books published in print form.
‘the right to read’ includes books held in a library to which I subscribe.
‘make copies’ includes both digitisation and transcription.
‘researchers’ includes teams who intend to ‘make copies’ for future – as yet unspecified – ‘non-commercial research’.

This proposed session will:

Estimate our collective ‘right to read’ and capture this as a document that offers non-legal advice to other UK-based historians about what they can and can’t do with printed material under the TDM Exception.
‘make copies’ of structured data found in a small selection of history books (topic to be determined). This will be achieved both by hand-transcription and by Optical Character Recognition (OCR) software (for example Tesseract or ContentMine software).
Test the capabilities of OCR software for capturing data tables and publish our findings as a short guidance document.
Combine that data and determine where the data can be stored for subsequent reuse by ‘the researchers’ who made copies.
Use that combined data for preliminary historical research that demonstrates the value of the TDM Exception.
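
One way the OCR test could be scored is by comparing each tool's output against the hand-transcription of the same table. The sketch below is only an illustration under stated assumptions: both versions are held as lists of rows of cell strings, and the function names `cell_accuracy` and `character_similarity`, along with the sample values, are hypothetical. It relies on `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def cell_accuracy(reference, ocr):
    """Share of reference cells that the OCR output reproduces exactly."""
    total = sum(len(row) for row in reference)
    matches = 0
    for r, ref_row in enumerate(reference):
        ocr_row = ocr[r] if r < len(ocr) else []
        for c, ref_cell in enumerate(ref_row):
            # A missing OCR cell counts as a mismatch.
            ocr_cell = ocr_row[c] if c < len(ocr_row) else ""
            if ref_cell.strip() == ocr_cell.strip():
                matches += 1
    return matches / total if total else 1.0


def flatten(table):
    """Join a table into one string: tabs between cells, newlines between rows."""
    return "\n".join("\t".join(row) for row in table)


def character_similarity(reference, ocr):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, flatten(reference), flatten(ocr)).ratio()


# Invented sample: the OCR has misread the digit 1 as a lower-case L.
hand = [["Year", "Meetings"], ["1817", "12"], ["1819", "30"]]
ocr = [["Year", "Meetings"], ["1817", "l2"], ["1819", "30"]]

accuracy = cell_accuracy(hand, ocr)
similarity = character_similarity(hand, ocr)
print(f"cells correct: {accuracy:.2f}, character similarity: {similarity:.2f}")
```

Cell accuracy is the stricter measure, since a single wrong digit makes a value unusable, while character similarity gives a rough sense of how much correction a transcriber would still have to do.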

I am looking for people to work on this with me. No technical aptitude is either required or preferred, though the project will work better with a balanced team. I anticipate that some preparatory work will be needed in advance of the workshop (for example, to install OCR software and check usability/suitability).

Please post thoughts, suggestions, and/or your willingness to get involved below!

This and THATCamp Sussex Humanities Lab 2017

James Baker — Thu, 27 Sep 2012 20:59:40 +0000

The second annual This and THATCamp Sussex Humanities Lab takes place on 4-5 July 2017 at the University of Sussex. It brings together humanists, technologists, educators, and learners to share, build, and make together around the theme of “Rules, Regulations, Resistance”.

Whether building websites, mining data, assembling information, or sharing creative outputs, humanists will often encounter laws and regulations. These encounters raise questions the answers to which are not always straightforward, can change over time and between places, and create conflict. These include:

What are my rights as a researcher, educator, or practitioner?
Is what I am doing legal?
Can I share what I find?
Who is constraining me? And why?
How do I effect change?

The event will focus on hands-on sessions that explore the humanities, technology, rules, regulations, and resistance. Any proposal on this theme is welcome, including those on the study of rules, regulations, and resistance (contemporary or historical) using information technology. We are particularly interested in proposals:

That offer discussion points and provocations, project updates, demonstrations.
For one-day or two-day projects that aim to produce things (performances, documents, objects, code, training materials).
From individuals with an idea looking for people to work with them on it.
That provide opportunities for remote participation.
That seek to test the UK Text and Data Mining Exception.

As participants, you will pick on the first day when, where, and whether the sessions proposed take place.

The event is free to attend. If you are interested in joining us or proposing a session, please register at thisand.thatcamp.org/register/. Please note that spaces are limited so registration is vital. If you need help getting to us or if your project has hardware requirements, let us know and we’ll see what we can do to support you.