
Application of Audio and Video Processing Methods for Language Research


  1. Application of Audio and Video Processing Methods for Language Research Przemyslaw Lenkiewicz, Peter Wittenburg Oliver Schreer, Stefano Masneri Daniel Schneider, Sebastian Tschöpel Max Planck Institute for Psycholinguistics Fraunhofer-Heinrich Hertz Institute Fraunhofer IAIS Institute

  2. AVATecH – Advancing Video and Audio Technology in Humanities Research

  3. Max Planck Institute for Psycholinguistics Fraunhofer-Heinrich Hertz Institute Fraunhofer IAIS Institute AVATecH

  4. Annotations – the basis of research analysis

  5. Annotations – challenges • Annotations are of many different types, and almost all are created manually • Recording quality and conditions vary – mostly poor • Many different languages – mostly minority languages • Annotation takes anywhere between 10 and 100 times the duration of the media

  6. Manual Annotation Gap We have around 200 TB of data at MPI, in particular digitized audio/video recordings, brain images, hand-tracking data, etc. Increasingly, new data is neither described nor annotated, and with the switch to lossless mJPEG2000, HD video and brain imaging, the gap between unannotated data and organized, annotated data keeps growing.

  7. AVATecH Main Goals • Reduce the time necessary for annotation. • Develop communication interfaces and human-machine interfaces. • Develop A/V processing algorithms.

  8. Recognizers • Small applications executed from ELAN • Each has a specific purpose: recognizing specific things in the media • They usually create annotations or visualizations for you • They aim at tasks that are trivial but time-consuming

  9. Recognizers

  10. Audio recognizers • Audio segmentation • Autonomously splits the audio stream into homogeneous segments • Approach: model-free clustering with the help of the Bayesian Information Criterion
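
The BIC-based change detection behind such segmentation can be sketched as follows (a minimal illustration assuming Gaussian-distributed feature vectors; the function names and toy data are ours, not from the AVATecH tools):

```python
import numpy as np

def delta_bic(X, t, penalty=1.0):
    """Delta-BIC for a candidate change point t in a feature sequence X (N x d).
    Positive values favour splitting X into X[:t] and X[t:]."""
    N, d = X.shape

    def logdet_cov(Y):
        # a small ridge keeps the covariance non-singular for short segments
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(Y.shape[1])
        return np.linalg.slogdet(cov)[1]

    # model-complexity penalty: d mean + d(d+1)/2 covariance parameters
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * t * logdet_cov(X[:t])
            - 0.5 * (N - t) * logdet_cov(X[t:])
            - penalty * P)

# toy recording: 200 frames of one "acoustic condition", then 200 of another
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])
scores = [delta_bic(X, t) for t in range(50, 350)]
change = 50 + int(np.argmax(scores))   # estimated change point, near frame 200
```

The candidate split with the largest positive Delta-BIC is taken as a segment boundary; sliding this test over the recording yields the homogeneous segments.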

  11. Audio recognizers • Audio segmentation: Goals • Find coherent parts in a recording • Detect speaker changes • Detect environment changes • Detect utterances • Preprocessing step for speaker ID, clustering

  12. Speech/Non-speech detection • Detects whether a segment contains speech or not • Approach: offline training of Gaussian Mixture Models for speech and non-speech; each segment is assigned the model with the highest likelihood • Integrates a further user-driven feedback mechanism
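
The maximum-likelihood decision between the two models can be sketched like this (a toy illustration using single diagonal Gaussians in place of full GMMs; the feature values and names are invented for the example):

```python
import numpy as np

def fit_diag_gauss(X):
    # a single diagonal Gaussian as a stand-in for a trained GMM
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(X, model):
    mu, var = model
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)))

def classify_segment(seg, speech_model, nonspeech_model):
    # assign the segment to whichever model explains it better
    return ("speech"
            if log_likelihood(seg, speech_model) > log_likelihood(seg, nonspeech_model)
            else "non-speech")

# offline "training" on invented MFCC-like features
rng = np.random.default_rng(1)
sp = fit_diag_gauss(rng.normal(2.0, 1.0, (500, 4)))
ns = fit_diag_gauss(rng.normal(-2.0, 1.0, (500, 4)))
label = classify_segment(rng.normal(2.0, 1.0, (50, 4)), sp, ns)
```

A real system would use multi-component mixtures and could fold the slide's user feedback in by retraining the models on corrected segments.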

  13. Local Speaker Clustering • Joins and labels segments according to the underlying speaker • Approach: iterative calculation of the Bayesian Information Criterion (BIC) and BIC-dependent merging of speech-segment combinations • Baseline tested on single documents with mediocre results – robustification needed
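
The BIC-dependent merging can be illustrated with a greedy agglomerative sketch (our own minimal version, modelling each segment with one full-covariance Gaussian; not the project's implementation):

```python
import numpy as np

def merge_delta_bic(A, B, penalty=1.0):
    """Delta-BIC for merging two speech segments A and B (frames x dims).
    Negative values favour one shared Gaussian, i.e. the same speaker."""
    def half_n_logdet(Y):
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(Y.shape[1])
        return 0.5 * Y.shape[0] * np.linalg.slogdet(cov)[1]

    d = A.shape[1]
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(A.shape[0] + B.shape[0])
    return (half_n_logdet(np.vstack([A, B]))
            - half_n_logdet(A) - half_n_logdet(B) - penalty * P)

def cluster_segments(segments, penalty=1.0):
    # greedily merge the pair with the most negative Delta-BIC
    # until no further merge is favoured
    segs = list(segments)
    while len(segs) > 1:
        best, i, j = min((merge_delta_bic(segs[a], segs[b], penalty), a, b)
                         for a in range(len(segs))
                         for b in range(a + 1, len(segs)))
        if best >= 0:
            break
        segs[i] = np.vstack([segs[i], segs[j]])
        del segs[j]
    return segs

# toy data: two segments from one "speaker", one from another
rng = np.random.default_rng(2)
clusters = cluster_segments([rng.normal(0.0, 1.0, (100, 2)),
                             rng.normal(0.0, 1.0, (100, 2)),
                             rng.normal(5.0, 1.0, (100, 2))])
```

The two same-speaker segments merge (their Delta-BIC is negative), while merging across speakers is rejected, leaving one cluster per speaker.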

  14. Speaker Identification • Identifies well-known speakers from given speech segments • Approach: based on adapted Gaussian Mixture Models and probability functions • Currently developing a fast, iterative training workflow to train a speaker model for detection
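
The "adapted GMM" idea can be sketched with mean-only MAP adaptation of a background model towards one speaker's data (a minimal illustration with invented toy features; equal component weights are assumed):

```python
import numpy as np

def map_adapt(ubm_means, ubm_vars, X, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM ("UBM")
    towards speaker data X, as in GMM-UBM speaker identification."""
    K, _ = ubm_means.shape
    # component posteriors under the UBM (equal component weights assumed)
    logp = np.stack([-0.5 * np.sum(np.log(2 * np.pi * ubm_vars[k])
                                   + (X - ubm_means[k]) ** 2 / ubm_vars[k], axis=1)
                     for k in range(K)])
    logp -= logp.max(axis=0, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=0, keepdims=True)
    n = post.sum(axis=1)                            # soft frame counts
    Ex = post @ X / np.maximum(n, 1e-9)[:, None]    # per-component data means
    alpha = (n / (n + relevance))[:, None]          # data-vs-prior weight
    return alpha * Ex + (1 - alpha) * ubm_means

# toy 1-D "UBM" with two components, adapted towards a shifted speaker
ubm_means = np.array([[0.0], [5.0]])
ubm_vars = np.ones((2, 1))
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1.0, 0.5, (200, 1)),
               rng.normal(6.0, 0.5, (200, 1))])
adapted = map_adapt(ubm_means, ubm_vars, X)   # means move towards 1 and 6
```

At identification time, a test segment would be scored against each adapted speaker model and the background model, with the best-scoring speaker winning.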

  15. Language Independent Alignment • Accurate alignment between speech and text in a multilingual context.

  16. Query by example • Find further occurrences of a spoken word or phrase in the audio, given a single spoken example.

  17. EXAMPLE RECOGNIZERS

  18. Detect how many persons are in the video, detect who is speaking and when, create the appropriate number of tiers and annotations for all of them, and align their speech with the transcription from a text file.

  20. Video recognizers

  21. Shot detection/keyframe extraction

  22. Skin color estimation

  23. Skin color estimation

  24. Hand/Head Detection & Tracking

  25. We can calculate • Boundaries of the gesture space • Speed and acceleration of hand movement • Segmentation of the recording into units: • Stroke • Hold • Retreat • Hand movement relative to the body • Which parts of speech overlap with gestures
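
Speed and acceleration can be derived from the tracked hand positions by simple finite differences (a minimal sketch; the trajectory, frame rate, and hold threshold are invented for the example):

```python
import numpy as np

def kinematics(positions, fps=25.0):
    """Frame-wise speed and acceleration of a tracked hand trajectory.
    positions: (N, 2) pixel coordinates, one row per video frame."""
    dt = 1.0 / fps
    vel = np.diff(positions, axis=0) / dt     # (N-1, 2) velocity vectors
    speed = np.linalg.norm(vel, axis=1)       # scalar speed per frame
    accel = np.diff(speed) / dt               # change of speed per frame
    return speed, accel

def hold_frames(speed, threshold=5.0):
    # frames whose speed stays below the threshold count as a "hold"
    return speed < threshold

# toy trajectory: the hand moves right for three frames, then holds still
positions = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0], [30.0, 0.0],
                      [30.0, 0.0], [30.0, 0.0], [30.0, 0.0]])
speed, accel = kinematics(positions)
holds = hold_frames(speed)   # [False, False, False, True, True, True]
```

Strokes, holds, and retreats can then be segmented by thresholding these curves, and the resulting intervals overlaid with the speech tiers.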

  26. Hand/Head Detection & Tracking • Demo (ellipses video)

  27. Thanks • Przemek.Lenkiewicz@mpi.nl • www.mpi.nl/avatech
