
Compiling a Spoken Chinese Corpus of Situated Discourse

Gu Yueguo, The Institute of Linguistics, The Chinese Academy of Social Sciences


Presentation Transcript


  1. Compiling a Spoken Chinese Corpus of Situated Discourse Gu Yueguo The Institute of Linguistics The Chinese Academy of Social Sciences

  2. Corpora Overview • Spoken Chinese Corpora: a corpus of situated discourse; a corpus of major dialects; a corpus of speech • Written Chinese Corpora: a corpus of contemporary written Chinese; a corpus of Pre-Qing written Chinese

  3. Main headings: Components of the compiling process • Real world discourse: what is it? • Recording • Encoding • Transcription (a) • Transcription (b) • Mark-up • Tagging • Application

  4. [Diagram of the compiling process: (0) 'real world' spoken discourse; (1) Recording; (2a) Character transcription; (2b) Transcription for a special purpose; (3) Mark-up; (4) Coding; (5) Application]

  5. (0) Discourse in the Real World

  6. Spoken Chinese, by degree of preparation and number of speakers
     • No preparation: single speaker, e.g. talk to oneself; two or more speakers, *e.g. everyday talks
     • Topics pre-set with no preparation: single speaker, e.g. narrate a personal story; two or more speakers, *e.g. sports saloon
     • Topics pre-set with no written preparation: single speaker, e.g. oral exam; two or more speakers, *e.g. press interview
     • Talking based on a written script: single speaker, e.g. soliloquy, 1-person cross talk; two or more speakers, e.g. acting, cross talk
     • Reading a written script: single speaker, e.g. news reading, reading practice; two or more speakers, e.g. collective reciting

  7. Real world situated discourse
     • (1) It is situated to an actual social situation;
     • (2) It is situated to actual users;
     • (3) It is situated to an inter-subjective world of discourse;
     • (4) It is situated to actual goals;
     • (5) It is situated to a spatial and temporal setting;
     • (6) It is situated to the cognitive capacity of actual users;
     • (7) It is situated to performance contingencies of actual users who are engaged in spontaneous talking with little pre-planning.

  8. [Figure: workplace diagram. Labels: Mon, Tues, Wedn, Thurs, Fri; Building X, Building Y, Building Z; Academic; ZWF; F-staff, clerks, colleagues (Colleague 1, Colleague 2, other colleagues), staff meeting, phone calls, visitors, students, project teams 1 to 3.]

  9. [Figure: diagram of settings outside the workplace. Labels: Sat, Sun, Mon-Fri; Academic; conference organizers, senior managers, sports playmates, hotel staff, research center staff, wife, son, neighbours; summer resort, swimming pool, kindergarten, residential building, markets.]

  10. Talking and Doing Interwoven in the Real World
      • Talking is the task, e.g. meeting, seminar (it is task-oriented, task-goal-directed, segmented on the basis of the goal-attaining process. Note that turn-taking rules are based on this type of talking-task relation)
      • Talking is the main constitutive part of the task, e.g. some classroom discourse, doctor-patient discourse (it is task-oriented, task-goal-directed, segmented on the basis of the goal-attaining process)
      • Talking is a constitutive part of the task, e.g. giving instructions from time to time (task performance is dominant, talking tends to be fragmented)
      • Talking and task run in conflicting parallel, the achievement of the latter serving as a means to the goal of the former, e.g. business dinner (business table talk) (Note that segmenting this kind of talk can be based on the task)
      • Talking is an embedded social part of the task, e.g. talking over a meal (talking has no specific goal to reach)
      • Talking is a decorative part of the task, e.g. talking accompanying tea-making
      • Talking is a hindrance to the task, e.g. talking during a written exam
      • Talking and task are independent of each other

  11. Micro performance analysis of five-minute activities
      Columns on the slide: spatial-temporal span; relation between acts; doing; relation between doing and talking; talking.
      • 00:00-1:15. Relation between acts: parallel and independent. Doing: X helps himself to noodles; Y sorts out the things on the table. Relation between doing and talking: conflictive; parallel and independent. Talking: X and Y gossip.
      • 1:27-2:06. Relation between acts: parallel and independent. Doing: X sorts out the bowl and the chopsticks; Y switches on the computer. Relation between doing and talking: parallel and relevant. Talking: X and Y talk about the journal editing.
      • 2:11-3:06. Relation between acts: parallel and independent. Doing: X sorts out the things on the table; Y continues to sort out the things. Relation between doing and talking: parallel and independent. Talking: Y talks to X about a politician.
      • 3:19-4:25. Relation between acts: parallel and independent. Doing: X starts to reinstall his computer; Y starts to do the layout on the computer. Relation between doing and talking: parallel and relevant. Talking: X talks to Y about the journal layout.
      • 4:34-4:40. Relation between acts: parallel and independent. Doing: X continues reinstalling; Y continues doing the layout. Relation between doing and talking: parallel and relevant. Talking: X continues talking to Y about the journal editing.

  12. Sampling: Whose job? • Sinclair (1991:13) writes: • The specification of a corpus --- the types and proportions of material in it --- is hardly a job for linguists at all, but more appropriate to the sociology of culture. The stance of the linguist should be a readiness to describe and analyse any instances of language placed before him or her. In the infancy of the discipline of corpus linguistics, the linguists have to do the text selection as well; when the impact of the work is felt more widely, it may be possible to hand over this responsibility to language-oriented social scientists.

  13. The standard variety approach: It is arguable that Putonghua should be chosen as the target language, ruling the other dialects out of the picture. There are at least two major reasons for doing so. First, Putonghua serves as the standard language used by the media and in education. Second, other spoken corpora have also adopted the standard variety.

  14. Criticisms of the standard variety approach: The approach is subject to serious criticisms relating to the preservation of the naturalness of language use. The standard variety is given its identity before the corpus is compiled. The corpus cannot be used to represent its naturalness, nor be used to establish or demonstrate its identity. … what the compilers believe Putonghua looks like. Subjective judgment is also involved in sampling Putonghua speakers by filtering non-standard speakers out. … Unless they are 'commissioned' to talk among themselves, the activities the standard and non-standard interactants are engaged in have to be properly filtered as well.

  15. The sampling: The workplace approach. It is true that situated discourses are unlimited in number. However, the types of social situations in which they are situated can in theory be exhaustively listed. According to the Beijing Yellow Book 1999, there are 67,783 social work units, which we divide into 6 major categories and 31 sub-categories.

  16. 6 major categories of social work units
      Code | Category | Units | Share
      01 | Government, Parties and Other Social Bodies | 4823 | 7.12%
      02 | Economic organizations | 53838 | 79.43%
      03 | Education, research and arts | 6840 | 10.09%
      04 | Health, sports, and social welfare | 1365 | 2.01%
      05 | Public welfare | 890 | 1.46%
      06 | Military | 27 | 0.04%
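The shares on slide 16 can be recomputed from the unit counts. The short Python sketch below does this; the dictionary keys and the helper name `share` are introduced here for illustration only. Under this calculation the six counts add up to the 67,783 units cited on slide 15, and five of the six shares agree with the slide. Category 05 comes out at about 1.31% rather than the 1.46% shown, so that figure may be worth checking against the original source.

```python
# Unit counts as listed on slide 16 (labels shortened for the example).
counts = {
    "01 Government, parties and other social bodies": 4823,
    "02 Economic organizations": 53838,
    "03 Education, research and arts": 6840,
    "04 Health, sports and social welfare": 1365,
    "05 Public welfare": 890,
    "06 Military": 27,
}

# Sum of all categories; should match the total quoted from the Yellow Book.
total = sum(counts.values())


def share(n, whole):
    """Return n as a percentage of whole, rounded to two decimals."""
    return round(100 * n / whole, 2)


shares = {name: share(n, total) for name, n in counts.items()}

print(f"total units: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```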

  17. Recordings by descriptive title
      No. | Descriptive title | No. of MP3 files | Total size (bytes)
      1 | accident mediation 1 | 5 | 23,369,326
      2 | accident mediation 2 | 8 | 30,944,114
      3 | Administrative meetings | 107 | 561,000,000
      4 | assessment meeting | 6 | 68,500,000
      5 | auction | 30 | 158,000,000
      6 | bfsu meeting | 14 | 66,200,000
      7 | Birthday celebration | 10 | 43,100,000
      8 | btvu seminar | 26 | 138,000,000
      9 | bus talk | 60 | 294,392,298
      10 | business negotiation 1 | 27 | 143,285,178
      11 | business negotiation 2 | 26 | 140,260,744
      12 | business negotiation 3 | 54 | 284,761,458
      13 | business negotiation 4 | 9 | 44,767,134

  18. Recordings by descriptive title (continued)
      No. | Descriptive title | No. of MP3 files | Total size (bytes)
      14 | child discourse | 163 | 1,115,063,560
      15 | Chinese and Korean first contact | 7 | 34,708,716
      16 | Chinese New Year celebration | 11 | 126,323,484
      17 | Classmates get-together | 14 | 73,063,728
      18 | Classroom discourse: teaching Chinese to Koreans | 125 | 574,000,000
      19 | commercial house key-handling procedure | 16 | 84,512,806
      20 | community talks | 322 | 1,734,865,326
      21 | end year celebration | 17 | 78,310,716
      22 | fortune telling | 33 | 390,741,362
      23 | Gu Yueguo a week record | 248 | 1,235,679,186
      24 | house allocation meeting | 44 | 239,388,838
      25 | house decoration team talks | 36 | 181,660,952
      26 | Jiangsu TVU review meeting | 11 | 49,675,918
      27 | kindergarten meeting | 28 | 146,741,690
      28 | Lan Baochun family talks | 22 | 285,975,640

  19. Recordings by descriptive title (continued)
      No. | Descriptive title | No. of MP3 files | Total size (bytes)
      29 | lawsuit | 93 | 508,628,422
      30 | lovers conversation | 11 | 59,845,160
      31 | medical discourse | 156 | 764,274,198
      32 | ministry education meeting | 99 | 522,992,404
      33 | office talk ministry of communication | 114 | 577,889,242
      34 | peasant family | 73 | 373,917,094
      35 | Peking Univ ceremony | 7 | 46,894,312
      36 | play mah-jong | 28 | 145,754,884
      37 | private conversation | 77 | 401,858,424
      38 | Radio Communication interviews | 24 | 919,456,512
      39 | sell and buy | 296 | 1,150,000,000
      40 | seventy-eighty yrs old peasant talks | 22 | 125,624,138
      41 | street market shopping | 37 | 190,887,972
      42 | student dormitory talks | 66 | 345,920,582
      43 | table talks | 89 | 529,995,698
      44 | visit blood donors | 14 | 71,655,104
      45 | Zhu Rongji press conference | 20 | 97,984,672
      Total (1 second = 15.6503 KB) | 2705 | 15,180,870,992 (= 970,005.11 seconds / 269.44 hours)
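The total on slide 19 converts file size into recording time at the stated rate of 15.6503 KB per second. The Python sketch below reproduces that conversion. It assumes that "KB" here means 1000 bytes and that the sizes are in bytes; the slide does not say so, but these assumptions reproduce its totals. The function names are hypothetical.

```python
# Assumption: the slide's "KB" means 1000 bytes, which reproduces its totals.
BYTES_PER_KB = 1000
BYTES_PER_SECOND = 15.6503 * BYTES_PER_KB  # MP3 data rate given on slide 19


def duration_seconds(size_bytes):
    """Estimate recording length in seconds from MP3 file size in bytes."""
    return size_bytes / BYTES_PER_SECOND


def duration_hours(size_bytes):
    """Estimate recording length in hours from MP3 file size in bytes."""
    return duration_seconds(size_bytes) / 3600


total_bytes = 15_180_870_992        # grand total from slide 19
child_discourse = 1_115_063_560     # row 14, as one per-category example

print(f"corpus: {duration_seconds(total_bytes):.2f} s, "
      f"{duration_hours(total_bytes):.2f} h")
print(f"child discourse: {duration_hours(child_discourse):.2f} h")
```

Applied to the grand total, this gives roughly 970,005 seconds, or about 269.4 hours, in line with the slide. The same function can be applied to any single row to estimate how much of the corpus each situation type accounts for.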

  20. (1) Recording

  21. Recording • Who does the recording? • What role does the person assume while recording? • What is the quality of the recording? • In what manner is the recording to be made? • How are the ethics of recording to be properly taken care of? • What details are to be noted while recording? • How are the recordings to be kept safe?

  22. What role does the person assume while recording?
      • The recording person as a legitimate observer: s/he is allowed by the authority to take a non-active part in the activity and record the talk. S/he is an outsider. The party is aware of her or his presence and of her or his purpose in being there.
      • The recording person as a genuine participant: s/he is an insider.
      • The recording person as a surreptitious observer: s/he is a member of the public, and her or his presence draws no particular attention from anyone else.

  23. In what manner is the recording to be made? • With the approval of all the participants • With the approval of the key participant • With the approval of the unit authority • Open recording which can be noticed by anyone • Surreptitiously

  24. Recording Record Card
      Name of recorder: ________________ Sex: ______________ Occupation: _______________________
      Recording start date: _____ (year) ____ (month) ____ (day)
      Recording end date: _____ (year) ____ (month) ____ (day)
      Recording start time: morning _____ o'clock / afternoon _____ o'clock / evening _____ o'clock
      Recording end time: morning _____ o'clock / afternoon _____ o'clock / evening _____ o'clock
      Place of talk: _____ Province _____ City ____ County ____ Township _____ Village
      Work unit: ______________________________________________
      Setting of talk: e.g. office, friend's home, restaurant, meeting room, supermarket, on a train, workshop, at home, shopping mall, hospital, courtroom, hotel, on the street, at an evening party, ___________
      Where were you while this side of the tape was being recorded? 1. ________________ 2. __________________ 3. ___________________
      Manner of recording: open / secret / secret at first, then open / some people knew and agreed / all knew and agreed
      Please fill in the table below with information about the speakers on this side of the tape (the more detail the better):
      Table columns: name; occupation, professional title, post; age; sex; education level; accent; relationship to you and to the other speakers
      Purpose and occasion of the talk: _______________________________________________________________
      Reminder: when this side has finished recording, check whether the tape needs to be turned over!
      (The following is to be filled in by corpus staff)
      Original sound wave file name: _____________________ Chinese character transcription file name: ____________________________
      CD number of the original sound wave file: ______________ Segmented sound wave file name: __________________________
      Classification folder name: ______________________ Other: ______________________________________
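The fields of the record card on slide 24 could be held in a small data structure once the cards are digitized. The Python sketch below is one possible layout, not the project's actual format. All class names, field names and the English labels for the recording manners are assumptions introduced here, and the values in the usage example are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical English labels for the five recording manners on the card.
RECORDING_MANNERS = (
    "open",
    "secret",
    "secret_then_open",
    "some_knew_and_agreed",
    "all_knew_and_agreed",
)


@dataclass
class Speaker:
    """One row of the speaker table on the card (field names are assumptions)."""
    name: str
    occupation: str = ""
    age: Optional[int] = None
    sex: str = ""
    education: str = ""
    accent: str = ""
    relationship: str = ""  # relation to the recorder and to other speakers


@dataclass
class RecordingCard:
    """Metadata for one side of a tape, mirroring the card's main fields."""
    recorder_name: str
    start_date: str
    end_date: str
    place: str
    setting: str
    manner: str
    purpose: str = ""
    speakers: List[Speaker] = field(default_factory=list)

    def __post_init__(self):
        # Reject manners that are not among the five options on the card.
        if self.manner not in RECORDING_MANNERS:
            raise ValueError(f"unknown recording manner: {self.manner!r}")


# Usage with made-up values, for illustration only.
card = RecordingCard(
    recorder_name="Recorder A",
    start_date="unknown",
    end_date="unknown",
    place="unknown",
    setting="office",
    manner="all_knew_and_agreed",
)
card.speakers.append(Speaker(name="Speaker 1", accent="Putonghua"))
print(card)
```

Validating the recording manner at construction time keeps the ethics-related field restricted to the options the card itself offers.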

  25. How are the recordings to be kept safe? The recordings on the 74-minute MiniDiscs are all converted into WAV files by using the recording function of the sound card. The format is 16 bits, stereo, 44,100 Hz. The WAV files are then stored on 640 MB recordable compact discs. They are further backed up by being converted into MP3 format (to economize on storage space) and saved again on separate 640 MB recordable compact discs. Furthermore, all the MP3 files are stored on a portable 20 GB USB hard disk.
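The storage saving from the MP3 backup can be estimated from the figures given. The Python sketch below computes the data rate of uncompressed 16-bit stereo audio at 44,100 Hz and compares it with the MP3 rate of 15.6503 KB per second quoted with the totals on slide 19. It assumes 1 KB = 1000 bytes and ignores the small WAV file header; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100   # from slide 25
BITS_PER_SAMPLE = 16      # from slide 25
CHANNELS = 2              # stereo, from slide 25

# Assumption: the MP3 rate on slide 19 uses 1 KB = 1000 bytes.
MP3_BYTES_PER_SECOND = 15.6503 * 1000


def pcm_bytes_per_second(rate_hz, bits, channels):
    """Data rate of uncompressed PCM audio in bytes per second."""
    return rate_hz * (bits // 8) * channels


def pcm_size_bytes(minutes, rate_hz, bits, channels):
    """Approximate size of an uncompressed recording (header ignored)."""
    return minutes * 60 * pcm_bytes_per_second(rate_hz, bits, channels)


wav_rate = pcm_bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
full_disk = pcm_size_bytes(74, SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
ratio = wav_rate / MP3_BYTES_PER_SECOND

print(f"WAV rate: {wav_rate} bytes/s")
print(f"74 minutes of WAV: {full_disk} bytes")
print(f"WAV is about {ratio:.1f} times the size of the MP3 copy")
```

Under these assumptions a full 74-minute disk comes to about 783 million bytes of uncompressed audio, which is more than 640 MB. A full disk would therefore presumably span more than one compact disc. The slide does not say how the WAV files were split.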

  26. (2) Transcription

  27. The encoding process • Transcription in Chinese characters • Transcription in Pinyin/IPA symbols • Transcription by using Praat • Mark-up by XML • Tagging

  28. Issues in segmentation Segmenting sound streams into orthographic and phonetic linear units is the first major concern of the present project. It proves to be theoretically significant and practically difficult. The only natural unit boundaries are speaker-turns (turn defined in terms of the speaker’s presence of phonation). The other units either larger or smaller than turns tend to be more like theoretical constructs than otherwise.

  29. Basic unit? Acoustically speaking, a spontaneous talk is a sequence of strings of sounds uttered by two or more speakers. Prosodic or intonational units seem to be natural segments of the sequence. They are treated as basic units of talk and seem to have the same status as the sentence does in written text. The weaknesses of such segmentation are (1) segments larger than intonational units are assumed to be the mere stacking of these basic units, which is untrue and hence misleading; and (2) talk is treated as a self-contained product waiting to be sliced into intonational units, thus ignoring the dynamic aspect of talk and its intrinsic relation with the social activities at large.

  30. Multiple level segmentation 1 The first-level segment: the activity boundary (segmenting talk from other social activities) • Schedule boundary, e.g. a two-hour meeting, classroom discourse • Visit boundary, e.g. a patient's visit to a doctor • Case boundary, e.g. an accident settlement • Appointment boundary, e.g. • Business boundary, e.g. buying something

  31. Multiple level segmentation 2 The second-level segment: goal-oriented segmentation (segmenting talk into goal-attaining chunks) • The segmentation is made on the basis of the goal-attaining process (goal-attainment structure) • E.g. opening, negotiating, closing of a meeting • E.g. examine-diagnose-prescribe-recommend • The presentation of a speaker

  32. Multiple level segmentation 3 The third-level segment: turn-oriented segmentation • (segmenting goal-attaining chunks into turn-taking chunks) • The segmentation is made on the basis of turn-boundary

  33. Multiple level segmentation 4 • The fourth-level segment: functional units • (segmenting turn-taking chunks into functional units) • The segmentation is made on the basis of functional markers or clues. • A meaningful cluster with a clear forward function • A meaningful cluster with a clear backward function • A meaningful cluster with a clear downward function • A meaningful cluster having a clear cognitive function: planning or searching for words

  34. Multiple level segmentation 5 The fifth level segment: linear character and phonetic units
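The five levels of segmentation on slides 30 to 34 nest inside one another, which fits the XML mark-up mentioned on slide 27. The Python sketch below builds one such nested record with the standard library. The element and attribute names (`activity`, `goal_chunk`, `turn`, `unit`, `chars`, `pinyin`) are hypothetical and are not the project's actual schema. The sample utterance is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_example():
    """Build a tiny five-level segment tree (all tag names are assumptions)."""
    # Level 1: activity boundary, e.g. a patient's visit to a doctor.
    activity = ET.Element("activity", {"boundary": "visit"})
    # Level 2: goal-attaining chunk within the activity.
    chunk = ET.SubElement(activity, "goal_chunk", {"stage": "opening"})
    # Level 3: one speaker turn within the chunk.
    turn = ET.SubElement(chunk, "turn", {"speaker": "A"})
    # Level 4: functional unit within the turn.
    unit = ET.SubElement(turn, "unit", {"function": "forward"})
    # Level 5: linear character and phonetic units (made-up sample text).
    chars = ET.SubElement(unit, "chars")
    chars.text = "你好"
    pinyin = ET.SubElement(unit, "pinyin")
    pinyin.text = "ni3 hao3"
    return activity


def depth(elem):
    """Return the nesting depth of an element tree, counting the root as 1."""
    children = list(elem)
    return 1 + (max(depth(child) for child in children) if children else 0)


root = build_example()
print(ET.tostring(root, encoding="unicode"))
print("levels:", depth(root))
```

Keeping each level as its own element means a query can stop at whatever grain is needed, for example all turns within a goal-attaining chunk, without touching the character-level transcription.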

  35. [Figure. Labels: Trajectories of life path; Internalized language out of life path trajectories; Natural growth and development of language]

  36. [Figure. Labels: Trajectories of life path; Internalized language out of life path trajectories] • Linguistic theory as reconstruction • as modeling • as description • as standardization
