The Multilingual Corpus of Survey Questionnaires (MCSQ): How Social Scientists can benefit from corpus linguistics

Date:

23 June 2021

Authors

Lidun Hareide (Møreforsking Research Institute, Ålesund, Norway), Olga Kushch (Research and Expertise Centre for Survey Methodology, Universitat Pompeu Fabra)

What is a corpus?

A corpus is a searchable database of naturally occurring text sampled to be representative of a specific population of text. By naturally occurring, we mean text used in real-life situations. In addition, a corpus may function as a repository, enabling the preservation of and access to data for posterity. The MCSQ, the very first publicly available multilingual corpus of international survey texts, performs both these functions. It enables the storage, search and comparison of information from international social surveys in 8 languages (e.g. French) and 29 of their language varieties (e.g. Swiss French).

The MCSQ is an open-access, open-source research and training resource. It is FAIR (Findable, Accessible, Interoperable and Reusable) by design. The current version, named Mileva Marić-Einstein, is compiled from the European Social Survey (ESS), the European Values Study (EVS) and the Survey of Health, Ageing and Retirement in Europe (SHARE) in the British English source language and their translations into eight languages: Catalan, Czech, French, German, Norwegian (Bokmål), Portuguese, Spanish and Russian. The current version comprises 3.5 million words and approximately 650,000 sentences. Nearly 80% of the corpus is sentence aligned, meaning that the source text sentences in English are linked to their translations in the different languages.
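To make the idea of sentence alignment concrete, the following is a minimal sketch of how aligned segments could be represented and how an alignment rate could be computed. It is not the MCSQ's actual data model: the `Segment` class, its field names, the language labels and the example sentences are all assumptions made for illustration, and the rate is computed here over translated segments only, which may differ from how the 80% figure above was derived.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for the English source text; not taken from the MCSQ itself.
SOURCE_LANGUAGE = "ENG_SOURCE"


@dataclass(frozen=True)
class Segment:
    """One sentence in the corpus (illustrative structure only)."""
    segment_id: str
    language: str
    text: str
    # Identifier of the aligned English source sentence, or None if unaligned.
    source_id: Optional[str] = None


def alignment_rate(segments):
    """Share of translated segments that are linked to a source sentence."""
    translated = [s for s in segments if s.language != SOURCE_LANGUAGE]
    if not translated:
        return 0.0
    aligned = sum(1 for s in translated if s.source_id is not None)
    return aligned / len(translated)


def translations_of(segments, source_id):
    """Map each language variety to the sentence aligned with source_id."""
    return {s.language: s.text for s in segments if s.source_id == source_id}


# Invented example sentences, not actual survey text.
SEGMENTS = [
    Segment("en-1", SOURCE_LANGUAGE, "How happy are you?"),
    Segment("fr-1", "FRE_CH", "Dans quelle mesure êtes-vous heureux ?", source_id="en-1"),
    Segment("de-1", "GER_DE", "Wie glücklich sind Sie?", source_id="en-1"),
    Segment("es-1", "SPA_ES", "Instrucción para el entrevistador"),  # no aligned source
]

print(alignment_rate(SEGMENTS))
print(translations_of(SEGMENTS, "en-1"))
```

In this toy data, two of the three translated segments carry a link to the English sentence, so looking up the source identifier returns its French and German counterparts.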

What use does the MCSQ have for social scientists?

The MCSQ constitutes an important new tool for social scientists. It has the potential to revolutionize survey research and design by allowing researchers both to consult previous versions of surveys and to harmonize the same question across languages and language varieties. The Ask the Same Question (ASQ) method is at the heart of survey design. In the ASQ method, the translated questions in a multilingual survey are supposed to be functionally equivalent for the purpose of statistical analysis, that is, the data should be statistically comparable across linguistic groups. The concepts to be measured and the answer options are kept the same across languages, and cultural adaptation is only implemented to facilitate fluency and the use of locally appropriate terminology. Previously, no tool was available for ensuring that the same question is asked, and the MCSQ was compiled to fill this gap.

Our preliminary studies of the MCSQ (Zavala-Rojas, Sorato, Hareide & Hofland, forthcoming) demonstrate that, in order to fulfil the ASQ goal of statistical comparability, a larger degree of standardization in survey translation across languages, language varieties and cultures is needed. In addition, we have shown that greater attention to the translatability of the source language text is necessary in order to avoid problematic structures, such as idioms or fixed expressions, that are open to interpretation and may hamper comparability.

As we have noted, the MCSQ functions both as a repository of previous survey rounds and as a tool for the systematic analysis of previous errors and successes. Before the MCSQ was compiled, no method was in place for systematically tracing translation decisions in multilingual surveys. The corpus allows for the retrieval and preservation of source and translated questionnaires, and provides textual data for survey translation activities, research and training. It facilitates the visualization and statistical analysis of previous translation decisions across languages. It also makes it possible to assess comparatively how a term or a collection of terms has been translated across different languages and in different contexts, and to analyze retrospectively whether a given decision was appropriate for communicating the intended source text message. Source-language terms that have proven problematic may be avoided in new rounds; consequently, the MCSQ also allows translation analysis to be integrated into the design of the source questionnaire, as suggested by Harkness (2003). In addition, the corpus provides valuable training material for the highly specialized field of survey design and translation. By serving as a tool for improving best practice during both the design and translation phases of survey questionnaires, the MCSQ allows for a more scientifically refined TRAPD (Translation, Review, Adjudication, Pretesting and Documentation) methodology of the kind that Harkness, the creator of TRAPD, envisioned for the future.
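The comparative assessment described above can be pictured with a small sketch that collects, for each language variety, the distinct renderings of source sentences containing a given term across survey rounds. This is only an illustration under stated assumptions: the record layout, the language labels, the function names and the example sentences are invented here and do not reflect the MCSQ's own schema or interface.

```python
from collections import defaultdict

# Invented aligned records, not actual survey text:
# (source sentence, language variety, survey round, translation)
ALIGNED = [
    ("How satisfied are you with your life?", "SPA_ES", 1,
     "¿Cuán satisfecho está con su vida?"),
    ("How satisfied are you with your life?", "SPA_ES", 2,
     "¿En qué medida está satisfecho con su vida?"),
    ("How satisfied are you with your life?", "FRE_FR", 1,
     "Dans quelle mesure êtes-vous satisfait de votre vie ?"),
    ("How satisfied are you with your life?", "FRE_FR", 2,
     "Dans quelle mesure êtes-vous satisfait de votre vie ?"),
    ("How often do you meet friends?", "SPA_ES", 1,
     "¿Con qué frecuencia se reúne con amigos?"),
]


def renderings_by_language(records, term):
    """Distinct translations, per language, of source sentences containing term."""
    found = defaultdict(set)
    for source, language, _round, translation in records:
        if term.lower() in source.lower():
            found[language].add(translation)
    return {language: sorted(texts) for language, texts in found.items()}


def languages_with_variation(records, term):
    """Languages in which the matching source sentences were rendered in more than one way."""
    renderings = renderings_by_language(records, term)
    return sorted(lang for lang, texts in renderings.items() if len(texts) > 1)


print(renderings_by_language(ALIGNED, "satisfied"))
print(languages_with_variation(ALIGNED, "satisfied"))
```

A language that shows more than one rendering for the same source sentence is a natural starting point for a retrospective review of the translation decision.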

In addition, the MCSQ provides valuable corpus resources in the highly specialized domain of surveys for minority languages such as Catalan, as well as for the 30 language varieties represented, thus constituting an important tool for cross-linguistic comparisons of specialized language use, as well as for Translation Studies. In line with its open-source, open-access principles, the corpus is openly accessible in CSV format, which can be made compatible with CAT tools.

Dissemination of the MCSQ

The MCSQ was presented during an online awareness-raising webinar held on 6 April 2021. About 55 participants attended the event, which was organized by the Research and Expertise Centre for Survey Methodology at Universitat Pompeu Fabra and supported by SSHOC. The webinar started with an introduction of the speaker, followed by a presentation of the corpus and a lively Q&A session.

The event was aimed at academics and practitioners from the fields of Sociology, Communication Sciences, Lexicology, Linguistics and Translation Studies. Participants received an overview of the corpus compilation process and its applications from one of the corpus developers.

Outcomes & Further Development

The MCSQ is an important digital resource that enables linguistic analysis of survey text. Researchers interested in studying questionnaire data can now search the MCSQ for a specific term in one language or language variety, and retrieve the aligned sentences containing that term in any of the languages they choose.
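The kind of search described above can be sketched against a CSV export as follows. This is a minimal illustration rather than the MCSQ interface: the column names (`item_id`, `language`, `text`), the language labels and the sample sentences are assumptions invented for the example, and the real CSV files may be organized differently.

```python
import csv
import io

# Invented sample in a hypothetical column layout; not actual MCSQ data.
SAMPLE_CSV = """item_id,language,text
Q1,ENG_SOURCE,How much do you trust the police?
Q1,FRE_CH,Quelle confiance faites-vous à la police ?
Q1,GER_DE,Wie sehr vertrauen Sie der Polizei?
Q2,ENG_SOURCE,How happy are you?
Q2,FRE_CH,Dans quelle mesure êtes-vous heureux ?
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def find_parallel(rows, term, search_language, target_languages):
    """Find items whose text in search_language contains term, then
    return the sentences of those items in the requested target languages."""
    matching_items = {
        row["item_id"]
        for row in rows
        if row["language"] == search_language
        and term.lower() in row["text"].lower()
    }
    return [
        (row["item_id"], row["language"], row["text"])
        for row in rows
        if row["item_id"] in matching_items
        and row["language"] in target_languages
    ]


rows = load_rows(SAMPLE_CSV)
for item_id, language, text in find_parallel(
    rows, "trust", "ENG_SOURCE", {"FRE_CH", "GER_DE"}
):
    print(item_id, language, text)
```

The lookup runs in two passes: the first identifies the items that match in the search language, and the second collects the sentences sharing those item identifiers in the target languages. Plain substring matching is used here for brevity; a real search would more likely rely on tokenization or lemmatization.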

The MCSQ and its interface were developed entirely with open-source tools; no proprietary tools were used in any of the steps. All the code for the corpus compilation steps, from preprocessing to storage and data annotation, is publicly available in a GitHub repository.

Aims for future work:

New survey items
New interface functionalities
New linguistic annotations

LINKS TO WEBINAR MATERIALS

Announcement

Presentation

SSHOC Deliverable 4.3 Survey specific parallel corpora

MCSQ Version v2.0 (Mileva Marić-Einstein)

MCSQ website

References

Harkness, J. A. (2003). Questionnaire translation. In J. A. Harkness, F. J. R. van de Vijver, & P. P. Mohler (Eds.), Cross-cultural survey methods (pp. 35–56). Wiley & Sons.

Zavala-Rojas, D., Sorato, D., Hareide, L., & Hofland, K. (forthcoming). The Multilingual Corpus of Survey Questionnaires: a tool for refining survey translation. Meta: Journal des traducteurs.
