Speech corpora of individuals with communication disorders (CSD) are gems in the realm of language resources. Because they are costly to obtain and hard to share owing to the personal nature of the data, researchers often collect the data themselves. However, much time could be saved if it were possible to reuse existing data and respect GDPR requirements at the same time, if only one knew how.
The DELAD initiative was brought to life specifically to help researchers share datasets of pathological speech, and in October 2020, experts from DELAD and SSHOC Tasks 5.4 and 6.5 organised a webinar to demonstrate best practice in obtaining, processing, and sharing CSD.
Henk van den Heuvel (CLST, Radboud University) opened the webinar with a presentation of DELAD. The word means "shared" in Swedish, and the DELAD initiative works towards facilitating the exchange and investigation of CSD (corpora of speech of individuals with communication disorders) in compliance with the GDPR.
DELAD strives to connect to existing research infrastructures. To this end, it cooperates closely with the CLARIN Knowledge Centre for Atypical Communication Expertise (ACE), which collaborates with the CLARIN Data Centre at the Max Planck Institute for Psycholinguistics (The Language Archive) and with Talkbank for the storage of sensitive data. DELAD also organises annual workshops on the ethical, legal and technical aspects of working with CSD. These cover everything from collecting, formatting, processing and sharing CSD to ensuring access to such data by collaborating with existing research infrastructures and providing a quality inventory of relevant datasets.
Nicola Bessell (Department of Speech and Hearing Sciences, University College Cork) highlighted the ethical and GDPR considerations when collecting corpora of speech disorders. She underlined that the GDPR stipulates that processing of health-related data is only allowed for research purposes, while archiving of such data must be in the public interest in order to be legal.
In order to ensure GDPR-compliant use of CSD data, researchers and other users of such data must obtain consent from data owners. This can be done via consent forms which need to address the following aspects:
Among other things, the discussion produced a useful remark on the use of clinical data for research purposes: since the GDPR does not classify such use as repurposing, it is fully allowed.
Also of interest in this regard will be the recent SSHOC webinar on the topic of the DARIAH ELDAH Consent Form Wizard – a tool that provides standardised consent form templates, enabling any user to quickly and easily obtain legal consent valid in all of the European Union.
Paul Trilsbeek (The Language Archive, Max Planck Institute for Psycholinguistics) presented the GDPR-compliant way in which data is stored and made accessible at The Language Archive. He put special emphasis on the problems of anonymisation, a process that can often invalidate the data for many research purposes. Paul Trilsbeek further stressed the legal agreements needed for archiving and sharing personal data, which are in essence of two types:
He underlined the need for thorough examination of licenses used since many existing licenses are “perpetual” and may therefore be in conflict with the GDPR under certain conditions.
Paul Trilsbeek also elaborated on the technical and logistical measures implemented at TLA in order to ensure “data protection by design and by default” as stated in the GDPR. These include up-to-date systems and software, secure transport of data (HTTPS) and an elaborate system of access policies and authorisation. At TLA, all archived copies also reside within the EU at trusted data centres within the Max Planck Society, which is another important aspect of ensuring data security.
The next speaker, Libby Bishop (GESIS - Leibniz Institute for Social Sciences), presented remote secure access, an innovative method of accessing CSD that is now being explored in the SSHOC project. This method brings the user to the data rather than the data to the user. The data stays on the hosting server, and the user performs analyses with the tools available at that remote end. The user can therefore download only aggregated analysis results, not the data itself. This ensures a higher level of data security while still enabling easy data reuse by researchers.
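The principle of releasing only aggregated results can be illustrated with a minimal Python sketch. It is not the SSHOC implementation: the function name `aggregate_speech_rate`, the record fields and the minimum group size of 5 are all assumptions made for illustration. The analysis runs where the records are stored and returns only per-group summaries, withholding any group that is too small to report safely.

```python
from statistics import mean

# Hypothetical disclosure threshold: groups smaller than this are withheld.
MIN_GROUP_SIZE = 5


def aggregate_speech_rate(records, min_group_size=MIN_GROUP_SIZE):
    """Summarise speech rate per group without exposing individual records.

    `records` is a list of dicts with the (assumed) keys "group" and
    "speech_rate". The return value maps each group to a dict holding the
    number of speakers and the mean speech rate, or to None when the group
    is smaller than `min_group_size`.
    """
    by_group = {}
    for record in records:
        by_group.setdefault(record["group"], []).append(record["speech_rate"])

    summary = {}
    for group, values in by_group.items():
        if len(values) < min_group_size:
            # Too few speakers: release nothing for this group.
            summary[group] = None
        else:
            summary[group] = {"n": len(values), "mean": round(mean(values), 2)}
    return summary
```

In a real remote-access setup, a function of this kind would run on the data host, and only its return value would be made available for download.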
Two examples
Libby Bishop first shared some insights into a decade-old data collection project called CAVA (Human Communication Audio-Video Archive), which includes data covering a wide range of disorders and is hosted at UCL. She addressed the legal and technical issues related to sustaining and possibly expanding such a collection. The main concern raised was that there is currently no reliable path to a sustainable infrastructure. Open cloud-based solutions, such as those (that will be) provided by SSHOC/EOSC, offer a promising way forward, but it remains to be seen whether this will really be the winning option.
The final contribution was given by Katarzyna Klessa (Adam Mickiewicz University) on a very recent curation project which includes legacy data from Polish children with hearing impairment.
Katarzyna Klessa specifically highlighted the legal basis for sharing the data and the interoperability issues raised by obsolete data formats. The CLARIN Knowledge Centre for Atypical Communication Expertise helped make this data accessible via a new sharing model in which all metadata and information on the dataset can be found at Talkbank, whereas the audio data is stored on European servers only, more specifically at The Language Archive. This is a novel and promising example of data storage and access that opens up new possibilities for European researchers, since it uses a well-established data centre in the USA for hosting the landing page and part of the CSD, whilst keeping the most sensitive data on European servers.
There were many other useful insights shared during the webinar presentations and discussion, so we invite you to watch the recording and view the presentation slides.
In addition, the DELAD initiative is organising a virtual workshop on 27 and 28 January 2021. The provisional programme is already online.
If you would like to join this workshop or get actively involved, send an email to h.vandnheuvel@let.ru.nl by November 25th with the following information:
Article written by Henk van den Heuvel and Kristina Pahor de Maiti