HLT in South Africa: Yesterday, Today and Tomorrow. Justus Roux, Stellenbosch University Centre for Language and Speech Technology. Workshop: HLT Collaboration, 23-26 November 2008
AIM • Brunfelsia Latifolia • Focus on official government policy development on HLT in South Africa • Role players in policy making • Wish list regarding future planning and policies
YESTERDAY 1999 / 2000: • First initiative by Pan South African Language Board (PanSALB) and the Department of Arts, Culture, Science and Technology (DACST) towards setting up a “Human Language Technology Project” • Joint Steering Committee: DACST, PanSALB, Universities: Stellenbosch, Pretoria, UNISA, Bloemfontein, ICOMTEK (CSIR), private translation company • Task to develop a Strategic Plan for HLT development in South Africa
YESTERDAY Thinking at that time very much influenced by • European Model for ‘Language Engineering’ and FP5 funding for HLT in Europe • Recognition of particular realities in SA • Academic & technical realities – limited – training and reskilling programmes – technology transfer • Financial realities – co-operation to be sought from Government, Academia, Private sector • Political realities – official language situation > development of National Lexicographic Units (NLUs)
YESTERDAY September 2000 – Report – The development of Human Language Technologies in South Africa – Strategic Planning. Three steps • Step 1 Create a SA model for HLT development and implementation • Component 1: Applied research and capacity building (Specialised courses at tertiary institutions, short informal courses) • Component 2: Production of language resources – standards – “Regulatory forum” • Component 3: Developing enabling technologies – support to innovative projects – funding from Innovation Fund of DACST • Component 4: Conscious steps to develop HLT industry
YESTERDAY Step 2 Creation of a legal framework to ensure systematic acquisition of government resources • Amendment of the Legal Deposit Act (1997) Step 3 Development of physical infrastructure to manage the implementation of the model (NB Role of the NLUs as integral part) • Virtual National Language and Speech Resource Centre • Virtual National Electronic Language and Speech Data Network • Regulatory Forum for Human Language Technologies
YESTERDAY • Strategic plan was accepted (by DACST) and on 8 November 2001 a Ministerial Advisory Panel on HLT was inaugurated with the task to focus on the viability of the establishment of a “virtual national electronic language and speech network” • 8 members – three of whom are at this meeting • Report delivered to the Minister in September 2002
YESTERDAY Recommendation #1 A virtual HLT Centre to be established with a hub and spoke / nodes configuration (Accepted)
[Figure: Structure of National Resource Centre for HLT (Virtual Centre: hub and connected nodes). Managerial Hub: coordination of node activities, data acquisition, data enhancement, data management & backup, training. Nodes: Centre X (Zulu, Ndebele); Uni A (Xhosa, Swati); Uni B (Venda, Tsonga); Uni C (N Sotho, Tswana); Centre Y (SA Eng, Afrikaans); Uni D (N Sotho, Sign Lang); NLU (?) / Lang (?). LE = Language experts]
YESTERDAY Recommendation #2 (Not accepted) Establishment of an interim Implementation Secretariat for a period of one year. Instead, an HLT Steering Committee was appointed to oversee implementation within a period of five years. Recommendation #3 (Accepted – not implemented) HLT development should take place in co-operation with the Presidential National Commission on Information Society and Development. Recommendation #4 (Not accepted – not necessary) Amendment of the Legal Deposit Act (1997)
YESTERDAY 2002 Department of Science and Technology (DST) – National Research and Development Strategy – reference to ICT / HLT (Handout) 2003 • National Language Policy Framework (NLPF) approved by Cabinet (February) – specific reference to HLT in Section 3 (3.3) • The development of an official HLT Strategy as one of the implementation mechanisms of the NLPF is suggested - Section 4 (4.8) (Refer “TODAY”) • Establishment of an HLT Unit within the National Language Service (NLS) • HLT Steering Committee appointed to oversee implementation of an HLT Resource Centre within a period of five years in collaboration with the HLT Unit of the NLS (2003-2007)
YESTERDAY 2004 Department of Trade and Industry Report Benchmarking of Technology – Trends and Technology Developments Emphasis on the important role of HLT within the economic sector in South Africa.
[Figure: Summary of technologies with potential high impact on the ICT sector (SA Dept of Trade and Industry Report 2004: 10). Axes: potential impact on industry; South Africa's ability to respond. Scale labels: Limited, High, Low. Items plotted: Mobile, Wireless, HLT, Pervasive, OSS, Grid computing, Telemedicine, Geomatics, RFID, Manufacturing (CAD, Robotics)]
YESTERDAY 2005 • Establishment of Meraka Institute with HLT Research Group Initiative of Department of Science and Technology (DST) • National Workshop on HLT (May 2005 – CSIR Conference Centre) – Roadmapping • Main issues and recommendations are in handout. • During this period several workshops and conference tracks were held: • PRASA annual conferences • ALASA SIG on Language and Speech Technology Development • ALASA International Conferences (special track) • Roadmapping workshop with State IT Agency (SITA) – Steven Krauwer (BLARKS)
TODAY Progress of Steering Committee to set up Resource Centre in collaboration with NLS (HLT Unit) (1) • Draft HLT National Strategy document developed and submitted (Detail Dr Jokweni) • Great amount of work, but little progress • The Steering Committee had a strained working relationship with previous Chief Director of NLS, hence two instances of disagreement: • Unilateral call by DAC (NLS) (2005) for tenders as management agent for the envisaged National Resource Centre – failure – no funds available • Unilateral call for development proposals by DAC (2006) – Steering Committee was not involved (amount distributed to successful applicants – outputs imminent)
TODAY Progress of Steering Committee to set up Resource Centre in collaboration with NLS (HLT Unit) (2) • The Steering Committee has a good working relationship with the new Chief Director and staff of the NLS • Submissions for funding submitted
[Figure: Research role players in South Africa. Bodies listed: Universities, NLS, PanSALB, DAC, Meraka Institute, DST, SABS TC 37, International Standards Organisation (ISO TC 37). Language resources (universities: languages, linguistics; dedicated R&D centres): text corpora, spoken corpora, dictionaries, lexicons, grammars, terminology banks. Enabling technologies (universities: engineering, computer science; dedicated R&D centres): speech recognition, speech generation, morphological analysis, POS tagging, syntactic analysis, semantic analysis. Link between the two: standardise formats & protocols; research]
TOMORROW Wish list - Planning and policy • Restructuring of the HLT Steering Committee: Real role players are needed to contribute to the debate (Request to the Minister through NLS / DAC) • Establishment of the HLT Resource Centre as a priority. • Render support services to HLT community • Source of job creation • Co-ordinated academic training at national level • Standard curricula over and above specialised curricula • Staff exchange programme (national & international) • Recognition of modules across accredited institutions • Applied research conducted in accordance with national priorities set by, for example, a body of experts from user sectors. (Roadmaps, annually updated.) • Blue sky research within HLT remains imperative, also from a funding perspective.
TOMORROW • National funding procedures for HLT research and training should be transparent and equitable • Task for a Select Committee of National and International Experts (?) • Address the particular interest in HLT research and training within Africa: imminent projects – Algeria, Morocco, Kenya, Nigeria and Gabon. • Possibility of international funding, e.g. Association of African Universities (AAU) staff & student exchange programme • Hopefully more insights to be gained from this workshop, not only with respect to international co-operation, but also regarding the positioning of HLT activities in South Africa.
THANK YOU JC Roux jcr@sun.ac.za
Importance of a National Resource Centre for HLT • Acquiring, enhancing and managing text and speech data for HLT applications: • Extremely costly • Extremely time consuming • Requires skilled language experts • Therefore: Need to develop reusable resources • General practice worldwide: • ELSNET (Europe), LDC (USA), (Japan)
Functions of a National Resource Centre for HLT • Constitutes one of the integral components for effective HLT product development in all official languages of SA. • Will interact with all other role players in the field to expedite service delivery in HLT applications. • It will serve as a depository of raw and enhanced reusable text and speech resources of all SA languages for use by different communities / institutions for language related purposes, e.g. NLUs, Terminology development sections, translation services, education etc. • It will serve as a language archive to document language and speech phenomena of the official languages of SA over a period of time as part of cultural heritage. (SA lost its ‘Sound Archive’)
Tasks of a National Resource Centre for HLT Data acquisition • Text data • Different types / genres • Official / Formal (announcements, legislation) • Informal (magazines etc) • Literary (novels, drama etc) • Sources: • Printed media: News agencies, Publishers • Government services (all levels, including Hansard)
Tasks of a National Resource Centre for HLT Data acquisition • Speech data • Different types • Read speech • Spontaneous speech • Different domains & conditions • Sport, news, interviews / noisy environments • Different transmission modes • Telephone speech: mobile, fixed lines • Recorded speech (microphone) • Different subjects • Male, Female, young, old, impaired • Sources: • SABC archives • Own initiatives (!)
Tasks of a National Resource Centre for HLT Data enhancement Text • Development and application of • Tokenisers (word identification) • Parts of speech taggers (nouns, verbs, adverbs etc) • Morphological analysers (composition of words) • Syntactic parsers (composition of phrases / sentences) (With tools to be developed in collaboration with experts from Technology Component) • Creation of machine readable lexicons (XML format)
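As an illustration of the first tool in the list above (a tokeniser for word identification), here is a minimal Python sketch. It is not taken from the presentation: the regular expression, the function name `tokenise` and the sample sentence are assumptions chosen for illustration, and a production tokeniser for any of the official languages would need language-specific rules.

```python
import re

# Hypothetical pattern (an assumption, not from the slides): a token is either
# a run of word characters, optionally joined by internal apostrophes or
# hyphens, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenise(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    for token in tokenise("Ngithi ukujabula manje."):
        print(token)
```

The output of such a step would feed the later stages named on the slide (part-of-speech tagging, morphological analysis, syntactic parsing), each of which needs word boundaries to be fixed first.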
A partial XML entry for the noun -ntu, class 1-2, is as follows:
<Entry>
  <Head>
    <Stem>ntu</Stem>
  </Head>
  <Body>
    <Tone>3.2.9</Tone>
    <MSI>
      <POS>
        <Noun>
          <Noun-features>
            <Class-pf-s>umu</Class-pf-s>
            <Class-pf-p>aba</Class-pf-p>
            <Class-no>1-2</Class-no>
            <Label>n</Label>
            <Dim>
              <Form>umntwana</Form>
              <Sense>baby, small child</Sense>
            </Dim>
            <Loc>
              <Form>kumuntu</Form>
Source: Bosch SE, Pretorius L & Jones J. Towards machine-readable lexicons for South African Bantu Languages. Nordic Journal of African Studies 16 (2): 131-145 (2007)
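To show why a machine-readable format matters, the following Python sketch reads an entry of this shape with the standard library's `xml.etree.ElementTree`. The slide shows only a partial entry, so the closing tags in `ENTRY_XML` are an assumption added here to make the fragment well-formed; the function name `read_noun_entry` and the idea of building surface forms by simply prefixing the stem are also illustrative assumptions, not part of the cited lexicon design.

```python
import xml.etree.ElementTree as ET

# The element names and values come from the slide; the closing tags after
# <Loc> are assumed, because the slide truncates the entry at that point.
ENTRY_XML = """
<Entry>
  <Head><Stem>ntu</Stem></Head>
  <Body>
    <Tone>3.2.9</Tone>
    <MSI><POS><Noun><Noun-features>
      <Class-pf-s>umu</Class-pf-s>
      <Class-pf-p>aba</Class-pf-p>
      <Class-no>1-2</Class-no>
      <Label>n</Label>
      <Dim><Form>umntwana</Form><Sense>baby, small child</Sense></Dim>
      <Loc><Form>kumuntu</Form></Loc>
    </Noun-features></Noun></POS></MSI>
  </Body>
</Entry>
"""


def read_noun_entry(xml_text):
    """Extract the stem, noun class and a few derived forms from one entry."""
    entry = ET.fromstring(xml_text)
    stem = entry.findtext("./Head/Stem")
    features = entry.find("./Body/MSI/POS/Noun/Noun-features")
    return {
        "stem": stem,
        "class": features.findtext("Class-no"),
        # Simplifying assumption: surface form = class prefix + stem.
        "singular": features.findtext("Class-pf-s") + stem,
        "plural": features.findtext("Class-pf-p") + stem,
        "diminutive": features.findtext("./Dim/Form"),
        "locative": features.findtext("./Loc/Form"),
    }


if __name__ == "__main__":
    for key, value in read_noun_entry(ENTRY_XML).items():
        print(f"{key}: {value}")
```

Because every field is tagged, the same entry can serve a dictionary editor, a morphological analyser or a terminology bank without re-keying the data.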
Tasks of a National Resource Centre for HLT Data enhancement (2) Speech • Orthographic transcriptions of speech (S to T) • Phonetic transcription and annotation of speech • Sound like utterances • Fluent speech • Repetitions, false starts etc • Non sound like utterances • Background noise • Lip smacks etc • Supportive software programmes (e.g. Praat)
[Figure: annotated speech sample “Ukuja(bula)”. Speaker One – Ngithi ukujabula manje. Segment labels: u k u]
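The annotation layers described above (an orthographic transcription, phonetic segments, and labelled non-speech events such as background noise and lip smacks) can be pictured as a simple time-aligned record. The Python sketch below is only an illustration: the class names, field names and all time stamps are hypothetical and do not reflect Praat's file format or the Centre's actual annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One time-aligned label on the recording."""
    start: float      # seconds (hypothetical values below)
    end: float        # seconds
    label: str        # phone symbol or event name
    is_speech: bool   # False for noise, lip smacks and similar events


@dataclass
class AnnotatedUtterance:
    """An orthographic transcription plus its time-aligned segments."""
    speaker: str
    orthographic: str
    segments: List[Segment] = field(default_factory=list)

    def speech_segments(self):
        return [s for s in self.segments if s.is_speech]

    def non_speech_labels(self):
        return [s.label for s in self.segments if not s.is_speech]


# Example built from the sample on the slide; the timings are invented.
utterance = AnnotatedUtterance(
    speaker="Speaker One",
    orthographic="Ngithi ukujabula manje",
    segments=[
        Segment(0.0, 0.25, "lip smack", False),
        Segment(0.25, 0.5, "u", True),
        Segment(0.5, 0.75, "k", True),
        Segment(0.75, 1.0, "u", True),
        Segment(1.0, 1.5, "background noise", False),
    ],
)

if __name__ == "__main__":
    print([s.label for s in utterance.speech_segments()])
    print(utterance.non_speech_labels())
```

Keeping speech and non-speech events in the same time-aligned list lets a recogniser be trained on the speech segments while the noise labels remain available for robustness experiments.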
Tasks of a National Resource Centre for HLT Data management & Software development • Determine data needs in collaboration with HLT Unit in NLS for government applications • Acquire the data with the assistance of language specialists at different nodes of the Centre • Solicit development of appropriate software • Manage, back-up, distribute data to users • Commercialise resources: private sector developers
Tasks of a National Resource Centre for HLT Training and Consultation • Identify training needs and potential trainers • Develop non-formal training curricula for the reskilling of interested language practitioners • Organise HLT training workshops at different venues in the country encouraging language bodies to participate • Create awareness of HLT potential in collaboration with the HLT Unit of NLS
Relationships Seatla se sengwe se tlhapiswa ke se sengwe (The one hand washes the other) • No infringements on current lexicographic or terminological activities - Different foci • Complementary activities: • Raw or enhanced data to be supplied to NLUs / PanSALB / NLS • NLUs could contribute to the National depository • Win-win situation for the sake of technological development of our languages
Concluding remarks • Attempt to speed up activities in the development of HLT applications to provide services in a language of choice. • To provide new resources and tools for lexicographic and terminological development. • To provide a new range of job opportunities for graduates in African languages • Keep South Africa abreast with new developments in the Information Society and avoid the marginalisation of the indigenous languages.