Towards a solution for the sharing of phonological data

Yvan Rose (Memorial University of Newfoundland) • Brian MacWhinney (Carnegie Mellon University)

Map of presentation • Context: no specialized tool to facilitate research in phonological development • A preliminary attempt: ChildPhon • A more promising solution: Phon • Current state of the Phon project • Developments in the foreseeable future • Potential • Publicly available cross-linguistic database • Proposal

Context (until recently) • CHILDES tools (focus on CLAN) • A number of tools for multimedia data storage and analysis • Mostly deal with morphological and syntactic aspects of development • Not easily extensible • What about phonology? • No CHILDES tool adapted for phonology • Data sharing and broad-based investigations are challenging

A first attempt • ChildPhon (Rose 2003) • Analytical (relational) database for child language data • Designed within FileMaker Pro • Main features • Interface for double-blind transcriptions • Automatic functions based on phonetic transcriptions: • Syllabification of transcribed forms • Detection of common processes observed in child language (e.g. onset cluster reduction)

Problems with ChildPhon • No support for Unicode fonts • No cross-platform compatibility (Macintosh-only) • Not compatible with CHILDES / TalkBank • No data exchange functions • Automatic parses limited, not customizable • Multimedia capabilities are minimal (at best) • Requires use of proprietary software and font • Algorithms are ‘destructive’ • Statistical functions are minimal • No web implementation • In sum: good idea, bad implementation

Phon: a more promising solution • Interdisciplinary project (first of its kind between Linguistics and Computer Science at Memorial University of Newfoundland) • Software designers and programmers: Rodrigue Byrne, Gregory Hedlund, Philip O'Brien, Yvan Rose, Harold Wareham • Financial support: • Faculty of Arts, Memorial University • Social Sciences and Humanities Research Council of Canada (SSHRC) • Canada Foundation for Innovation (CFI) • National Science Foundation (NSF)

Phon: Overview • Software underpinnings: • Programmed in Java, Unicode font encoding • Cross-platform compatible (Mac, Windows, …) • XML data storage structure • Compatible with TalkBank schema • User management system • Extended multimedia capabilities • More flexible automatic algorithms • Specialized query language • Offers a complete solution for data sharing
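To make the combination of XML storage and Unicode transcriptions concrete, here is a minimal Python sketch of how one record might be serialized and read back. The element and attribute names (`record`, `orthography`, `ipa`, `form`) are hypothetical and do not reproduce the TalkBank schema or Phon's actual file format.

```python
import xml.etree.ElementTree as ET

def build_record(record_id, orthography, target_ipa, actual_ipa):
    """Build one hypothetical record element holding target and actual IPA forms."""
    record = ET.Element("record", {"id": str(record_id)})
    ET.SubElement(record, "orthography").text = orthography
    for form, text in (("target", target_ipa), ("actual", actual_ipa)):
        ipa = ET.SubElement(record, "ipa", {"form": form})
        ipa.text = text
    return record

# Serialize to a Unicode string, then parse it back to show the round trip.
record = build_record(1, "train", "tɹeɪn", "teɪn")
xml_text = ET.tostring(record, encoding="unicode")
parsed = ET.fromstring(xml_text)
```

Because the IPA symbols are stored as ordinary Unicode text, no special font is needed to read the file back on another platform.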

Phon: usability • Intuitive graphical user interface • Helpful wizards (e.g. project creation, queries) • Record navigator • Custom selection of data fields • General / record-by-record • Intuitive query language • Standard terminology • Built-in queries (modifiable by user) • Query memorization and saving

Phon: main functions • User management • Media segmentation • Phonetic transcription • Transcription merging (selection of ‘final’ transcriptions for analysis) • Phrase segmentation and alignment (further segmentation according to research needs) • Syllable alignment (alignment of syllables of target and actual forms) • Database query

User management • Secure login • User tasks / privileges management

Media segmentation • Generally similar to CLAN • Hit the space bar to define a speech segment • Default segment length user-defined • Useful for working on small speech segments • Segment editing: • Change numerical value • ‘Stretch’ the time segment by sliding pointer • Controls shown: Export sound clip, Play • Note (Yvan Rose): Replace yellow line in segment “timebar” with a waveform.

Transcription: general interface • Interface components: Media window, Session info (drawer), Transcription window, Media controls

Transcription • Built-in IPA character map • Symbol ‘categories’ • Access to sound segment • Interface for double-blind transcriptions • Tied with user management functions • Note (Yvan Rose): Link adult transcription to an electronic IPA dictionary. Need to develop a transcription system for sounds that can’t be transcribed easily: • Ability to assign a feature set to a dummy character • Ability to use the forward slash to assign two competing symbols to a given sound (e.g. p/b would imply that voicing cannot be transcribed accurately; the alternants will be considered as one consonant by the syllabifier and query interpreter).
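The proposed slash notation can be sketched as follows. This is a minimal Python illustration, not Phon's implementation; the function name `segments` is hypothetical, and the sketch assumes one character per segment, which ignores diacritics and digraphs.

```python
def segments(transcription):
    """Split a transcription into segments, keeping 'p/b'-style alternants together.

    Each segment is a frozenset of the symbols competing for that position.
    """
    result = []
    i = 0
    while i < len(transcription):
        # A slash between two symbols marks two competing transcriptions of one sound.
        if i + 2 < len(transcription) and transcription[i + 1] == "/":
            result.append(frozenset((transcription[i], transcription[i + 2])))
            i += 3
        else:
            result.append(frozenset((transcription[i],)))
            i += 1
    return result

segs = segments("p/bat")
```

Here `p/bat` yields three segments rather than five, so a syllabifier or query interpreter working on this list would see a single initial consonant.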

Transcription merging • Comparison of ‘competing’ transcriptions • Direct access to media segment • Selection of most accurate transcription • Further refinement of selected transcription • Note (Yvan Rose): People requested an algorithm that would enable a comparison of transcriptions based on specific parameters (e.g. voicing). This algorithm could build on the feature sets associated with each segment transcribed.
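A feature-based comparison of two competing transcriptions could look roughly like this. It is a sketch under stated assumptions: the `FEATURES` table is a deliberately tiny, hypothetical feature inventory, the function name `differences` is invented for illustration, and the two transcriptions are assumed to have the same number of segments.

```python
# Hypothetical, deliberately tiny feature table: symbol -> feature set.
FEATURES = {
    "p": {"labial", "stop"},
    "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop"},
    "d": {"coronal", "stop", "voiced"},
    "a": {"vowel", "voiced"},
}

def differences(trans_a, trans_b, feature):
    """Return positions where two equal-length transcriptions disagree on one feature."""
    if len(trans_a) != len(trans_b):
        raise ValueError("this sketch only compares transcriptions of equal length")
    return [
        i
        for i, (a, b) in enumerate(zip(trans_a, trans_b))
        if (feature in FEATURES[a]) != (feature in FEATURES[b])
    ]

# Two transcribers heard the same word differently.
voicing_conflicts = differences("pat", "bad", "voiced")
place_conflicts = differences("pat", "bad", "labial")
```

In this example the two transcriptions disagree on voicing at the first and last positions but agree on place, which is the kind of parameter-specific report the note asks for.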

Phrase alignment • Further segmentation of the utterances • Useful for research on phonological domains • A simple mouse click sets and resets the domain boundaries • Note (Yvan Rose): Several people requested different levels of segmentation. This includes morpho-syntactic levels of segmentation, as well as various levels of the prosodic hierarchy. Also: add PLAY button in the interface of this module.

Syllabification algorithm • Refined labeling of each syllabic position • Each label is a valid object for query • [Figure: syllabified transcription of ‘constraints’ (k ø n s t r e I n t s), with syllable positions labeled O, N and R]

Syllabification algorithm • Parameters of syllabification are user-definable • [Figure labels: Syllable constituents, Timing tier] • Note (Yvan Rose): The parameters will be revised thoroughly. To add (among others): word-final codas, list of exceptional clusters. Also add, to complement stress attraction, an option of ambisyllabic syllabification of intervocalic consonants in Strong-Weak syllable juncture. In addition to this, we also need a way to manually assign a syllabification to each consonant which cannot be accounted for by the automatic algorithm.
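The idea of a syllabifier driven by user-definable parameters can be sketched with a simple onset-maximization procedure. This is an illustrative Python sketch, not Phon's algorithm: the `VOWELS` set, the `LEGAL_ONSETS` parameter and the labels O (onset), N (nucleus) and C (coda) are assumptions, every vowel is treated as its own nucleus, and the example uses a simplified transcription of ‘constraints’ without a diphthong.

```python
VOWELS = set("aeiouəɪ")
# Hypothetical user-definable parameter: clusters allowed as complex onsets.
LEGAL_ONSETS = {"tr", "st", "str", "pl", "kl"}

def syllabify(word, legal_onsets=LEGAL_ONSETS):
    """Label each segment O (onset), N (nucleus) or C (coda) by onset maximization."""
    nuclei = [i for i, seg in enumerate(word) if seg in VOWELS]
    if not nuclei:
        raise ValueError("this sketch needs at least one vowel")
    labels = ["?"] * len(word)
    for i in nuclei:
        labels[i] = "N"
    # Word-initial consonants are onsets; word-final consonants are codas.
    for i in range(nuclei[0]):
        labels[i] = "O"
    for i in range(nuclei[-1] + 1, len(word)):
        labels[i] = "C"
    # Between two nuclei, give the next syllable the longest legal onset.
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = word[left + 1:right]
        split = len(cluster)
        for start in range(len(cluster)):
            candidate = cluster[start:]
            if len(candidate) == 1 or candidate in legal_onsets:
                split = start
                break
        for k in range(len(cluster)):
            labels[left + 1 + k] = "C" if k < split else "O"
    return list(zip(word, labels))

labeled = syllabify("kənstrents")
```

Changing `legal_onsets` changes the parse (for example, removing `str` and `tr` would push more consonants into the coda), which is the sense in which the parameters are user-definable. Each `(segment, label)` pair is then something a query could select on.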

Syllable alignment • Automatic alignment of syllables • Manual modifications
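Automatic alignment of target and actual syllables can be approximated with a small dynamic-programming alignment. The sketch below is not the algorithm used in Phon; the `similarity` score (number of shared symbols) and the function names are hypothetical, and syllables left unmatched are paired with `None`, which is where manual modification would come in.

```python
def similarity(syl_a, syl_b):
    """Count symbols shared by two syllables (a crude, hypothetical score)."""
    return len(set(syl_a) & set(syl_b))

def align(target, actual):
    """Align two syllable lists, pairing unmatched syllables with None."""
    n, m = len(target), len(actual)
    # score[i][j]: best total similarity when aligning target[i:] with actual[j:].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            score[i][j] = max(
                score[i + 1][j + 1] + similarity(target[i], actual[j]),
                score[i + 1][j],
                score[i][j + 1],
            )
    # Walk forward through the table to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if score[i][j] == score[i + 1][j + 1] + similarity(target[i], actual[j]):
            pairs.append((target[i], actual[j]))
            i += 1
            j += 1
        elif score[i][j] == score[i + 1][j]:
            pairs.append((target[i], None))
            i += 1
        else:
            pairs.append((None, actual[j]))
            j += 1
    pairs.extend((t, None) for t in target[i:])
    pairs.extend((None, a) for a in actual[j:])
    return pairs

# Target with three syllables, child production with the first syllable truncated.
aligned = align(["bə", "na", "nə"], ["na", "nə"])
```

In this example the truncated initial syllable is left unpaired and the remaining two syllables are matched one to one.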

Query language • Quick and accurate queries on large amounts of data • Language features • Uses terms familiar to phonologists to compose queries • Syllable constituents: onset, nucleus, … • Stressed vs. unstressed syllables • Custom predicates • History of recent queries • Ability to save queries

Query language components • Selectors (e.g. Onset(Syllable x)) • Predicates (e.g. Branching(Onset(Syllable x))) • Boolean connectives • Example:
    let corpusName = "TestCorpus",
    let corpus = Corpus(corpusName),
    let records = Records(corpus)
    foreach r in records
        foreach p in Phrases(r)
            foreach s in Syllables(p)
                Branching(Onset(TargetSyllable(s))) AND NOT Branching(Onset(ActualSyllable(s)))
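For readers more familiar with general-purpose languages, the logic of the example query (a branching onset in the target syllable that is not branching in the actual syllable) can be paraphrased in Python. This is only a sketch: the `Syllable` class, the `branching` predicate and the `onset_reductions` function are hypothetical stand-ins, not part of the Phon query language.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    onset: str
    nucleus: str
    coda: str = ""

def branching(constituent):
    """A constituent branches when it dominates more than one segment."""
    return len(constituent) > 1

def onset_reductions(aligned_syllables):
    """Keep (target, actual) pairs whose branching target onset does not branch in actual."""
    return [
        (target, actual)
        for target, actual in aligned_syllables
        if branching(target.onset) and not branching(actual.onset)
    ]

pairs = [
    (Syllable("tr", "e", "n"), Syllable("t", "e", "n")),  # cluster reduced
    (Syllable("pl", "e"), Syllable("pl", "e")),           # cluster preserved
    (Syllable("b", "a", "t"), Syllable("b", "a", "t")),   # no cluster in target
]
matches = onset_reductions(pairs)
```

Only the first pair is selected, mirroring the selector, predicate and Boolean connective structure of the query.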

Query tree structure • Example: branching onset reduction in 2nd syllable • [Figure: a Record with the syllable trees of its TargetPhrase and ActualPhrase (Syllable; Onset; Rhyme; Nucleus; Coda). branching(onset(pos(TargetPhrase, 2))) evaluates to TRUE and branching(onset(pos(ActualPhrase, 2))) evaluates to FALSE, so TRUE AND NOT FALSE yields a MATCH.]

Query results • View in application • Use to generate textual reports • Recording session (e.g. to exemplify a given process) • Time slice (e.g. to exemplify a stage of acquisition) • Entire database (to exemplify a learning curve) • Export • As Unicode file • As ASCII file (modulo font conversion limitations)
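The difference between the two export formats can be illustrated with a short Python sketch. The `ASCII_MAP` substitutions are hypothetical (loosely SAMPA-like) and are not Phon's conversion table; any symbol without a mapping is replaced by a fallback character, which is one way to picture the font conversion limitations mentioned above.

```python
# Hypothetical IPA-to-ASCII substitutions (loosely SAMPA-like); not Phon's actual table.
ASCII_MAP = {"ə": "@", "ɪ": "I", "ʃ": "S", "ŋ": "N"}

def to_unicode_bytes(lines):
    """Encode report lines as UTF-8, preserving every IPA symbol."""
    return "\n".join(lines).encode("utf-8")

def to_ascii_bytes(lines, fallback="?"):
    """Encode report lines as ASCII, replacing unmapped symbols with a fallback."""
    text = "\n".join(lines)
    converted = "".join(
        ch if ch.isascii() else ASCII_MAP.get(ch, fallback) for ch in text
    )
    return converted.encode("ascii")

report = ["fɪʃ", "bəˈnana"]
unicode_export = to_unicode_bytes(report)
ascii_export = to_ascii_bytes(report)
```

The Unicode export round-trips without loss, while the ASCII export loses the stress mark in the second form because the table has no entry for it.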

Enhancements (short term) • Improvement of syllable alignment algorithm (building on Kondrak’s 2003 algorithm) • Import function • ChildPhon files (including font translator; almost done!) • CHAT files • Incorporation of user-defined fields • Incorporation of statistical functions • Chart report generator • Ability to select various chart formats • Bar graphs (for proportions within and across sessions) • Line graphs (for learning curves)

Enhancements (longer term) • Interoperability with Praat • Export to Praat (similar to CLAN function) • Interface to accommodate acoustic measurement data • Web-based interface • Data sharing at a distance • Easy query of corpora on CHILDES database • Further automation • Automatic detection of pre-identified processes • Note (Yvan Rose): Include function to extract phonetic inventories per session/stage/… Get examples of ‘canned’ analyses in literature on clinical phonology.

Development timeline • End of fall of 2004 • Completion of current development phase • Release of testing (Beta) version • Winter of 2005 • Bug fixes • Improvement of functionality and user interface (including short-term enhancements) • Website creation (http://www.phon.ca/) • Completion of technical documentation • Notes to programmers • User guide • Summer of 2005 • Release of Phon 1.0 as open-source freeware

Potential • Standard for data sharing • Large-scale investigations • Cross-linguistic investigations • Enhancement to CHILDES • Elaboration of a database fulfilling the needs of acquisitionists focussing on phonology and related issues • Investigation of interface issues (e.g. between morpho-syntax and phonology)

How to realize this potential • Team of researchers specializing in: • Early acquisition (including babbling) • Segmental development • Prosodic development • Phonological disorders • Second language acquisition • … • Feedback on software development project • Data contribution • Existing corpora in digital format • Conversion of printed corpora • Identification of corpora (printed, with or without audio files) • Setting of conventions for data conversion

Our proposal • Constitution of a research team to develop a phonological component of CHILDES • Database • Supporting software • Elaboration, with the research team, of a grant application to support: • Database elaboration • Software development • Periodical meetings • Workshops • …

Concretely • Feedback on software project • Software needs for various types of research: let us know what you need • Implementation: let us know how you want it to work • Contribution to grant application • Kinds of research the new database would enable: let us know what you would like to do • Impacts of this research (e.g. theoretical, clinical, …) • Supporting letters • Contribution to the public database • Sharing of existing / future corpora • Establishment of conventions to format older corpora

Special thanks • The ‘Phon’ team at Memorial: • Rodrigue Byrne • Harold Wareham • Gregory Hedlund • Philip O’Brien • For his great help with the TalkBank XML schema: Franklin Chen (Carnegie Mellon University) • For their useful feedback on an early version of this software: Heather Goad (McGill), Paula Fikkert (Nijmegen), Clara Levelt (Leiden), Katherine Demuth (Brown), Mark Johnson (Brown), Carrie Dyck (Memorial), Phil Branigan (Memorial), Brian MacWhinney (Carnegie Mellon), Bryan Gick (UBC), Sophie Wauquier-Gravelines (Nantes), Sharon Inkelas (UC Berkeley), Conxita Lleó, Sonia Frota (Lisbon), Maria João Freitas (Lisbon), Ronald Sprouse (UC Berkeley), Joe Pater (UMass, Amherst), John Archibald (Calgary), Éliane Lebel (Memorial); hoping that no one was forgotten…
