Corpus InterLangue (CIL)

1 Corpus Documentation

1.1 Aim of the corpus

The CIL is a collection of spoken and written productions of learners of English and French as second languages (L2). The aim of this corpus is to serve as a source of learner data for evidence-based research in second language acquisition, more specifically in the study of the behavior of learners’ interlanguage.

1.2 Overall structure of the corpus

The corpus is divided into two main parts: one made up of productions of L2 French learners (Français langue étrangère, hereafter corpus FLE) and one made up of productions of L2 English learners (Anglais langue étrangère, hereafter corpus ALE). L2 French learners are native speakers of Arabic, Mandarin Chinese, English or Spanish. L2 English learners are all native speakers of French.

As of November 2020, the corpus contains productions of 115 learners, divided as follows:

1.2.1 Speaker profiles

Learners were selected based on their native language (L1) to favor the study of potential interference phenomena (i.e. negative transfer) in their L2 productions, as well as comparative analyses of developmental features shared by learners with different L1s.

Most of the data for the corpus FLE were collected at the University of Rennes 2 in France, from students of the Centre International Rennais d'Études de Français pour Étrangers (CIREFE) language center. Part of the data from L1 Arabic learners was collected in Syria.

Learners of the corpus ALE are all French and learned English through the French school system.

The L2 proficiency of learners in both groups ranges from B1 to C1 on the Common European Framework of Reference for Languages (CEFR). Learners are not language specialists, i.e. the target language is not, and has not been, the main focus of their academic career. Most learners have completed higher education studies. Their ages range from 17 to 61 years.

1.2.2 Data collection protocol

Every year, data are collected by first-year Master's students of the LIDILE research team. Subjects are recruited in France. The recordings are transcribed with ELAN. The structure and annotation of the files are then verified by a lecturer of the research team who specialises in linguistics.

The same protocol is followed for L2 French and L2 English learners, adapted to their respective target languages. Learners are asked to complete three tasks:

a. A semi-guided interview (15-25 minutes), aimed at obtaining spontaneous speech samples from learners. The interview is not scripted, but investigators are instructed to ask questions which elicit four different types of speech productions from the learner: 1. A description of themselves, 2. Talking about past events, 3. Talking about future plans, 4. Arguing for or against a given topic.

b. A read aloud task of a page-long text (1-2 minutes), aimed at obtaining a controlled speech sample suitable for comparative phonetic analyses.

c. A writing task elicited by the read aloud task, aimed at obtaining a handwritten production of a 1 or 2 page-long text.

Prior to the beginning of all recording sessions, learners are asked to read and sign a consent form and to fill in a metadata questionnaire.

Master's students save each learner's metadata in a University of Rennes 2 Moodle database.

1.2.3 Data transcription and annotation

1.2.3.1 Oral data

Oral conversational data are transcribed following the transcription protocol described in Arbach (2015). The ELAN file structure includes two annotation tiers of the same type, identified as “utterance”. In the XML scheme, the learner tier is identified as ALE and the interviewer tier(s) as ENQ*.

Extract from the ELAN XML structure:

<ANNOTATION>

<ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts41" TIME_SLOT_REF2="ts42">

<ANNOTATION_VALUE>ok i heard you mention about your studies can you tell us more about it?</ANNOTATION_VALUE>

</ALIGNABLE_ANNOTATION>

</ANNOTATION>
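The tier structure described above can be read with a standard XML parser. The following is a minimal sketch, not the project's own tooling: the sample document is a hypothetical, heavily shortened EAF file, the tier name ENQ1 is one assumed instance of the ENQ* pattern, the learner's reply is invented for illustration, and every time slot is assumed to carry a TIME_VALUE in milliseconds.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened EAF content. Real files carry more header data,
# and the second utterance below is invented for illustration only.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts41" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts42" TIME_VALUE="5400"/>
    <TIME_SLOT TIME_SLOT_ID="ts43" TIME_VALUE="5600"/>
    <TIME_SLOT TIME_SLOT_ID="ts44" TIME_VALUE="9000"/>
  </TIME_ORDER>
  <TIER TIER_ID="ENQ1" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts41" TIME_SLOT_REF2="ts42">
        <ANNOTATION_VALUE>ok i heard you mention about your studies can you tell us more about it?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ALE" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts43" TIME_SLOT_REF2="ts44">
        <ANNOTATION_VALUE>yes i study biology</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_utterances(eaf_xml):
    """Return all time-aligned utterances of an EAF document, ordered by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot identifier to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    utterances = []
    for tier in root.iter("TIER"):
        for annotation in tier.iter("ALIGNABLE_ANNOTATION"):
            utterances.append({
                "tier": tier.get("TIER_ID"),
                "start_ms": slots[annotation.get("TIME_SLOT_REF1")],
                "end_ms": slots[annotation.get("TIME_SLOT_REF2")],
                "text": (annotation.findtext("ANNOTATION_VALUE") or "").strip(),
            })
    return sorted(utterances, key=lambda item: item["start_ms"])


utterances = read_utterances(EAF_SAMPLE)
# Keep only the learner tier, identified as ALE in the XML scheme.
learner_turns = [item for item in utterances if item["tier"] == "ALE"]
```

Filtering on the tier identifier separates learner speech from interviewer speech, which is usually the first step before any interlanguage analysis.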

Oral reading data are not transcribed.

1.2.3.2 Written data

Written dialogue data are transcribed following a scheme derived from the TEI P3 guidelines (Sperberg-McQueen & Burnard, 1999). The annotation tagset below is a simplified version of https://tei-c.org/Vault/GL/P3/PH.htm. Note: for historical reasons in the compilation, some texts do not follow the scheme; for these, only the raw text is available.

1. Unclear word:

1.a the word or part of the word is unclear and you have an idea of what it is:

1.b the word is totally unclear, you have no idea what it is:

use <gap>.

E.g. the transcriber believes that four letters cannot be read at all because of the damage: <gap reason='rubbing' extent='4'> </gap>

where reason is illegible or rubbing.

2. Abbreviations:

An abbreviation is marked up as follows.

example: <abbr title="Monsieur"> M. </abbr> Carter said that

<dialog char_id="2"> <abbr name="Antoine"> A </abbr> : Oui, ça va, et toi ? Beaucoup temp sans savoir de toi, t’est trés different </dialog>

Note: This tag is NOT for contractions in English, e.g. I'm.

3. Foreign word:

French words are indicated by <foreign lang="fr"> (before the word) and </foreign> (after the word).

example: <foreign lang="fr"> Monsieur </foreign> Carter

4. Deleted word, letter or sign:

The <del> tag marks a character, word or passage that has been deleted. Do not insert a space before the tag.

example: almost every subject<del>s</del> (no space)

example with a deleted apostrophe:

Qu<del>’</del>est-ce ...

5. Dialogue

<dialog>

<sp>

<speaker>Wife of your bussum</speaker>

<p> Oh! I don’t want to interrupt you dear. I only want some

money for Baby’s socks — and to know whether you will have the mutton cold or hashed.

</p>

</sp>

</dialog>

6. Paragraph
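The tagset above lends itself to simple automatic processing. The sketch below is only an illustration under stated assumptions: it uses regular expressions rather than a TEI parser, the function names final_text and deleted_segments are hypothetical, and the "[...]" marker chosen for unreadable gaps is an arbitrary convention. It recovers the text as the learner finally left it and lists what was crossed out.

```python
import re

# Deleted material is dropped together with its content.
DEL_RE = re.compile(r"<del>.*?</del>", re.DOTALL)
# Unreadable spans are replaced by a visible marker.
GAP_RE = re.compile(r"<gap\b[^>]*>.*?</gap>", re.DOTALL)
# Any remaining tag (abbr, foreign, dialog, sp, speaker, p) is unwrapped.
TAG_RE = re.compile(r"</?[A-Za-z_][^>]*>")


def final_text(transcription, gap_marker="[...]"):
    """Return the text as finally written, without markup or deleted material."""
    text = DEL_RE.sub("", transcription)
    text = GAP_RE.sub(gap_marker, text)
    text = TAG_RE.sub("", text)
    # Collapse the padding spaces that surround tag content in the transcriptions.
    return re.sub(r"\s+", " ", text).strip()


def deleted_segments(transcription):
    """Return every character sequence the learner crossed out, in order."""
    return re.findall(r"<del>(.*?)</del>", transcription, re.DOTALL)


plain = final_text('<foreign lang="fr"> Monsieur </foreign> Carter')
crossed_out = deleted_segments("almost every subject<del>s</del>")
```

Keeping the deleted segments separately preserves evidence of self-correction, which would be lost if the markup were simply stripped.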

1.3 Technical properties

1.3.1 File formatting

Audio files for the interview and read-aloud tasks were recorded with Roland Edirol R-09HR voice recorders and saved as 16-bit mono WAV files at a 44.1 kHz sample rate.
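The audio properties stated above can be checked with Python's standard wave module. This is a minimal sketch rather than a project tool: the function names are hypothetical, and the demonstration builds a one-second silent file in memory instead of reading a corpus recording.

```python
import io
import wave

# Format stated in the documentation: mono, 16-bit (2 bytes), 44.1 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 44100}


def wav_properties(source):
    """Read the basic properties of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_corpus_format(properties):
    """Tell whether the properties match the documented recording format."""
    return all(properties[key] == value for key, value in EXPECTED.items())


# Demonstration on one second of silence written to an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 44100)
buffer.seek(0)

properties = wav_properties(buffer)
is_conformant = matches_corpus_format(properties)
```

The duration returned by wav_properties could also be compared with the duration_conv and duration_read values recorded in the metadata.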

Interview audio files were transcribed and time-aligned with CLAN/ELAN in CHA/EAF format.

Handwritten productions of the writing task were scanned or photographed and saved in PDF format. They are subsequently transcribed by Master's students and saved as UTF-8 text files.

1.3.2 Naming conventions

In order to facilitate data organization and recognition, files were named following specific conventions:

1.3.3 Metadata

From the start of corpus collection in 2009, metadata were obtained via a paper questionnaire filled in by the learner. As of November 2020, the metadata questionnaire is administered online through an internal Moodle database module. The information requested from learners includes:

· Gender

· Year of birth

· Country of birth

· Native language (L1)

· Regional variety of L1

· Current country of residence

· Previous countries of residence

· Level of education

· Occupation

· Number of years studying the L2

· Self-assessed L2 oral and written proficiency (according to CEFR levels)

· Other languages spoken (L3)

· Self-assessed L3 oral and written proficiency (according to CEFR levels)

· Detailed CEFR assessment for both oral and written productions

Metadata are provided in an <ID>_meta.csv file for each learner. The file contains two lines: the first holds the variable names and the second the variable values. The values follow controlled-vocabulary conventions.

The variable names are (in order of columns):

id_learner n_years_L2 sex birth_country previous_country1_time previous_country1 previous_country2_time previous_country2 previous_country3_time previous_country3 current_country education l1 l1_variety l2 l2_autoevaluation_oral l2_autoevaluation_written l3 l3_autoevaluation_oral l3_autoevaluation_written date_recording place_recording birth_year occupation duration_conv duration_read
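A two-line metadata file of this kind can be loaded into a dictionary keyed by variable name. The sketch below rests on assumptions that the documentation does not settle: the field delimiter is taken to be a comma, the function name read_meta is hypothetical, and the sample values are invented (only the ID string is the example quoted in this documentation).

```python
import csv
import io

# Column order as listed in the documentation.
COLUMNS = (
    "id_learner n_years_L2 sex birth_country "
    "previous_country1_time previous_country1 "
    "previous_country2_time previous_country2 "
    "previous_country3_time previous_country3 "
    "current_country education l1 l1_variety l2 "
    "l2_autoevaluation_oral l2_autoevaluation_written "
    "l3 l3_autoevaluation_oral l3_autoevaluation_written "
    "date_recording place_recording birth_year occupation "
    "duration_conv duration_read"
).split()


def read_meta(stream, delimiter=","):
    """Read one <ID>_meta.csv stream: a header line followed by one value line."""
    rows = list(csv.reader(stream, delimiter=delimiter))
    if len(rows) != 2:
        raise ValueError("expected exactly two lines: names and values")
    header, values = rows
    if header != COLUMNS:
        raise ValueError("unexpected variable names or column order")
    if len(values) != len(header):
        raise ValueError("the value line does not match the header length")
    return dict(zip(header, values))


# Invented sample values; unspecified variables are left empty.
sample_values = {
    "id_learner": "fra_ca_de_90_f_15",
    "sex": "female",
    "l1": "fra",
    "l2": "eng",
}
sample = (
    ",".join(COLUMNS) + "\n"
    + ",".join(sample_values.get(name, "") for name in COLUMNS) + "\n"
)

meta = read_meta(io.StringIO(sample))
```

Checking the header against the documented column order catches files produced before a change of questionnaire, at the cost of rejecting any file with extra columns.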

The id_learner value corresponds to the naming format described in Section 1.3.2. The values of the other variables follow the conventions listed below.

Variable type     Values
sex               male / female
countries         full names in English, e.g. "morocco"
education         open
languages         ISO 639-2 code: https://www.loc.gov/standards/iso639-2/php/code_list.php
autoevaluation    Common European Framework of Reference (CEFR) levels
dates             DD/MM/YYYY or YYYY
place             full names of cities
occupation        ILO (BIT) ISCO taxonomy: https://www.ilo.org/public/french/bureau/stat/isco/docs/resol08.pdf
time              HH:MM:SS
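The value conventions above can be turned into simple automatic checks. The following sketch is illustrative only: the function names are hypothetical, empty values are assumed to mean "not provided" and are skipped, CEFR levels are assumed to be written in upper case (A1 to C2), and language codes are checked for their three-letter shape only, not against the ISO 639-2 code list.

```python
import re
from datetime import datetime

CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def is_valid_date(value):
    """Accept YYYY, or DD/MM/YYYY when it is a real calendar date."""
    if re.fullmatch(r"\d{4}", value):
        return True
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):
        return False
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def is_valid_duration(value):
    """Accept HH:MM:SS with minutes and seconds below 60."""
    return re.fullmatch(r"\d{2}:[0-5]\d:[0-5]\d", value) is not None


def is_valid_language(value):
    """Check the three-letter shape of an ISO 639-2 code (no list lookup)."""
    return re.fullmatch(r"[a-z]{3}", value) is not None


def is_cefr_level(value):
    return value in CEFR_LEVELS


CHECKS = {
    "sex": lambda value: value in {"male", "female"},
    "l1": is_valid_language,
    "l2": is_valid_language,
    "l3": is_valid_language,
    "l2_autoevaluation_oral": is_cefr_level,
    "l2_autoevaluation_written": is_cefr_level,
    "l3_autoevaluation_oral": is_cefr_level,
    "l3_autoevaluation_written": is_cefr_level,
    "date_recording": is_valid_date,
    "duration_conv": is_valid_duration,
    "duration_read": is_valid_duration,
}


def invalid_fields(meta):
    """Return the sorted names of non-empty variables that break a convention."""
    return sorted(
        name for name, check in CHECKS.items()
        if meta.get(name) and not check(meta[name])
    )


# Invented record with two deliberate errors (impossible date, short duration).
problems = invalid_fields({
    "sex": "female",
    "l1": "fra",
    "l2": "eng",
    "l2_autoevaluation_oral": "B2",
    "date_recording": "31/02/2020",
    "duration_conv": "00:21:07",
    "duration_read": "1:30",
})
```

Open variables such as education and place are left unchecked, since the documentation defines no closed vocabulary for them.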

2 Data protection, storage and access

2.1 Protection and storage

Data protection relies on pseudonymization as described in the documentation. Metadata collected by master's students are stored in a Rennes 2 Moodle database. Access is restricted to the students in charge of the tasks. These metadata are then pseudonymized.

The data include no direct identification of persons. Only the first two letters of the first name and of the family name are retained, together with the sex and the year of birth. An example ID is fra_ca_de_90_f_15. This protects learners from identification.

Raw data are stored on two hard drives of the research team. Only two researchers (Thomas Gaillat and Leonardo Contreras) have access to these drives. These drives are backed up on a weekly basis.

Pseudonymized data are accessible on the Nakala.fr database. Each data item has a persistent URL. Nakala is a large computer infrastructure that provides a backup service.

2.2 Public access with Huma-Num Nakala

Data are available from the Huma-Num Nakala repository (https://www.nakala.fr/). This repository is an interoperable and secure service for depositing and sharing all types of data (e.g. text files, audio, video or images). It mainly provides the following services:

· assignment of a PID (Persistent IDentifier), making data and metadata citable;

· permanent data access;

· dissemination of metadata through a Triple Store and OAI-PMH;

· a dedicated search engine;

· customized presentation with NAKALA Press.

See documentation at https://documentation.huma-num.fr/humanum-en/

Nakala offers a set of APIs allowing queries and data management.
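Since metadata are disseminated through OAI-PMH, they can in principle be harvested with standard protocol requests. The sketch below only builds request URLs from the verbs defined by the OAI-PMH standard; it does not contact any server, the function name is hypothetical, and BASE_URL is a placeholder, not the actual Nakala endpoint, which should be taken from the Huma-Num documentation.

```python
from urllib.parse import urlencode

# The six request verbs defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}

# Placeholder endpoint, assumed for illustration only.
BASE_URL = "https://repository.example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Request all records in Dublin Core, the format every OAI-PMH repository supports.
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

The resulting URL can be fetched with any HTTP client; large result sets are returned in pages linked by the resumptionToken argument of the protocol.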

3 References

Arbach, N. (2015). Constitution d’un corpus oral de FLE : enjeux théoriques et méthodologiques [PhD thesis, Université Rennes 2]. https://tel.archives-ouvertes.fr/tel-01147632

Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1999). Guidelines for Electronic Text Encoding and Interchange (TEI P3). Text Encoding Initiative, Chicago and Oxford. The Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), the Association for Literary and Linguistic Computing (ALLC). https://tei-c.org/Vault/GL/P3/index.htm

4 Consent form

(To be completed on the laboratory's letterhead, in two copies to be signed by the respondent. The investigator will give one copy to the respondent and the other to the project leader.)

This form is intended to collect your consent to the collection of data concerning you, within the framework of the Corpus InterLangue project led by the Lidile EA 3874 research team of the University of Rennes 2.

By signing the consent form, you certify:

· that you have read and understood the information provided in the information notice,

· that your questions have been answered satisfactorily,

· that you have been informed that you are free to withdraw your consent or to withdraw from this research at any time, without prejudice.

Participant information

Last name:

First name:

Address:

To be completed by the participant

I have read and understood the information provided in the information sheet and I freely agree to take part in this research.

Yes

No

I agree to my statements being recorded and used by the Corpus InterLangue project team.

Yes

No

I agree to my statements being disseminated at scientific conferences and seminars or in any other form of promotion of the Corpus InterLangue project.

Yes

No

Last name, first name:

Date:

Signature:

5 Information notice

Data controller

The information collected about you will be processed within the framework of the Corpus InterLangue project led by Thomas Gaillat (thomas.gaillat@univ-rennes2.fr), Associate Professor (Maître de Conférences) at the University of Rennes 2, affiliated with the Lidile EA 3874 research unit.

Purpose of the project and nature of the data collected

The purpose of the processing is to study the language of learners of English or French as a second language. We expect you to take part in an interview during which we will ask you general questions about your life and your experience of learning English / French. The interview will last between twenty and thirty minutes. The information collected during this interview is recorded. In this context, we will also ask you to write a fictional dialogue.

You will also have to fill in a form designed to obtain information about your educational, professional and linguistic background.

Legal basis for the processing

The legal basis for the processing is the consent of the participants. Your participation in the Corpus InterLangue project is entirely free and voluntary. You are free to withdraw or to end your participation in this project at any time. Such withdrawal will have no consequences.

Confidentiality

The Corpus InterLangue project makes the following commitments:

Your identity will be concealed by means of a code in all written material produced on the basis of your statements (interview reports, observation notes, analysis notes exchanged between researchers, publications, etc.). No other information that could reveal your identity will be kept: your last name and first names, as well as any other personal information mentioned during the interview that could help to identify you, will be completely anonymized.

Only the project leaders hold the correspondence table that links your identity to the random number assigned in the various files.

Data transfer and storage

The data collected are stored and kept on computer servers located in France, within the framework of the TGIR Huma-Num, which is subject to French privacy protection rules. Your personal data are kept in an active database for the duration of the project. The pseudonymized data will be publicly accessible to the scientific community.

Dissemination

The results of this research will be disseminated anonymously at professional and scientific conferences and in professional and academic journals. They may also be the subject of doctoral or postdoctoral research.

Rights of data subjects

You may ask questions about this project at any time by contacting the project leader by e-mail (thomas.gaillat@univ-rennes2.fr). You may access and obtain a copy of the data concerning you, object to the processing of these data, and have them rectified or erased. You also have the right to restrict the processing of your data. You may exercise these rights by contacting the project leader by e-mail or by post at the following address: Lidile – EA 3874. Université Rennes 2, UFR Langues. Place du Recteur Henri Le Moal, CS 24307 - 35043 Rennes.

If, after contacting us, you consider that your data protection rights (Informatique et Libertés) are not respected, you may lodge a complaint with the CNIL online or by post: CNIL, 3 Place de Fontenoy, TSA 80715 - 75334 Paris Cedex 07 (https://www.cnil.fr/).
