Extending the CzeSL corpus

1. Briefing Polish students on the text collection and explaining the purpose of the project

2. Filling in metadata questionnaires

  1. About the author
  • Person ID, e.g. TOU_H305
  • Gender
  • Age
  • Age category: 6–11; 12–15; 16–
  • First language: two-character code according to ISO 639-1
  • First language group: Indo-European (IE), non-Indo-European (nIE), Slavic (S)
  • Knowledge of other foreign languages: ISO code
  • Proficiency in Czech at the time of writing: A1; A1+; A2; A2+; B1; B2; C1; C2
  • Knowledge of Czech in the family: mother; father; partner; siblings; other; nobody
  • Length of stay in Czechia in years: –1; 1; 2; 2–
  • Completed or current Czech language courses: individual; commercial; self-taught; university; abroad; primary school; secondary school; other
  • Intensity of Czech language tuition in hours per week: –3; 5–15; 15–
  • Textbooks used: Basic Czech (BC), Communicative Czech (CC), Čeština pro ekonomy (CE), Chcete mluvit česky? (CMC), Čeština pro cizince (CpC), Easy Czech Elementary (ECE), New Czech Step by Step (NCSS), other
  • Bilingual: yes; no

  2. About the text
  • Text ID, e.g. TOU_H305_442
  • Date when the text was collected (YYYY-MM-DD)
  • Medium of the text: manuscript; PC
  • Time limit for writing the text in minutes: 10; 15; 20; 30; 40; 45; 60; other; no
  • Additional help during writing the text: dictionary; student’s book; other; no
  • Text written as part of an exam: interim; final; no
  • Limit in words
  • Title of the text, e.g. The event that changed my life
  • Topic type: general; specific
  • Activity before writing the text: practice; discussion; visual; vocabulary; other; no
  • Ability to choose the topic: selection from many; assigned topic; any; other
  • Genre: any; assigned
  • Actual text type: informative; descriptive; opinion; short story
  • Actual number of words
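
The closed value lists in the questionnaire lend themselves to an automatic check before the data is passed on. The following is a minimal sketch in Python, not the project's actual tooling: the class and field names are hypothetical, only a handful of fields are covered, and the rule that a text ID starts with the author's person ID is an assumption read off the two example IDs above.

```python
import re
from dataclasses import dataclass

# Allowed values copied from the questionnaire; all identifiers are hypothetical.
PROFICIENCY = {"A1", "A1+", "A2", "A2+", "B1", "B2", "C1", "C2"}
LANGUAGE_GROUPS = {"IE", "nIE", "S"}
MEDIA = {"manuscript", "PC"}


@dataclass
class Author:
    person_id: str       # e.g. "TOU_H305"
    first_language: str  # two-letter ISO 639-1 code
    language_group: str  # IE, nIE or S
    proficiency: str     # level of Czech at the time of writing


@dataclass
class Text:
    text_id: str    # e.g. "TOU_H305_442"
    collected: str  # YYYY-MM-DD
    medium: str     # manuscript or PC


def validate(author: Author, text: Text) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if not re.fullmatch(r"[a-z]{2}", author.first_language):
        problems.append("first language is not a two-letter lowercase code")
    if author.language_group not in LANGUAGE_GROUPS:
        problems.append("unknown first language group")
    if author.proficiency not in PROFICIENCY:
        problems.append("unknown proficiency level")
    if text.medium not in MEDIA:
        problems.append("unknown medium")
    # Checks the shape of the date only, not whether the day exists.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text.collected):
        problems.append("collection date is not in YYYY-MM-DD form")
    # Assumption: a text ID is the person ID followed by "_" and a suffix.
    if not text.text_id.startswith(author.person_id + "_"):
        problems.append("text ID does not start with the person ID")
    return problems
```

Calling `validate(Author("TOU_H305", "pl", "S", "B1"), Text("TOU_H305_442", "2019-05-14", "manuscript"))` returns an empty list; the language, level and date in this call are made-up illustration values.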

3. Collecting students’ essays

  1. 10 doc files
  2. 109 manuscripts

4. Digitization of files

  1. Transcription of manuscripts
  2. Entering metadata into a spreadsheet file
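
As an illustration of step 2, the sketch below writes one metadata row per text to a CSV file, used here as a stand-in for the spreadsheet. The column names and the helper `write_metadata` are hypothetical, since the real spreadsheet layout is not given in this outline, and the word count in the sample row is invented.

```python
import csv
import io

# Hypothetical column names; the actual spreadsheet layout is not specified here.
COLUMNS = ["text_id", "person_id", "medium", "word_count"]


def write_metadata(rows, out):
    """Write a header line and one CSV row per text to the open file object `out`."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Refuse incomplete rows instead of silently writing empty cells.
        missing = [c for c in COLUMNS if c not in row]
        if missing:
            raise ValueError(f"{row.get('text_id', '?')}: missing {missing}")
        writer.writerow(row)


# Sample row with made-up values, written to an in-memory buffer.
buffer = io.StringIO()
write_metadata(
    [{"text_id": "TOU_H305_442", "person_id": "TOU_H305",
      "medium": "manuscript", "word_count": 120}],
    buffer,
)
print(buffer.getvalue())
```

Rejecting rows with missing columns at entry time keeps gaps in the questionnaire visible before the data reaches the annotation steps.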

5. Release of an extended version of CzeSL-SGT with automatic annotation in KonText (the Czech team):

  1. Automatic error annotation (suggested corrections from a spell/grammar checker, error type identifier)
  2. Automatic linguistic annotation (tags and lemmas for the original and the corrected form)
  3. Adding metadata

6. Preparation of tasks for Polish students of Czech based on CzeSL-SGT (the Polish team, work in progress)

7. Release of an extended version of CzeSL with manual error annotation in TEITOK and KonText (the Czech team, to do)

  1. Manual multi-level correction in TEITOK, using existing manual annotation in feat (the Czech team, work in progress)
  2. Automatic linguistic annotation

CzeSL: http://utkl.ff.cuni.cz/learncorp

KonText: https://kontext.korpus.cz

TEITOK: http://teitok.corpuswiki.org