Large Vocabulary Continuous Speech Recognition (LVCSR) Automatic Speech Recognition, Spring 2016
Large Vocabulary Continuous Speech Recognition
Sub-word Speech Units
HMM-Based Sub-word Speech Units
Training of Sub-word Units
Training Procedure
Errors and performance evaluation in PLU recognition • Substitution errors (s) • Deletion errors (d) • Insertion errors (i) • Performance evaluation: if the total number of PLUs is N, we define: • Correctness rate: (N − s − d) / N • Accuracy rate: (N − s − d − i) / N
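A minimal sketch of how these rates are computed in practice, assuming s, d, and i come from a minimum-edit-distance alignment of the reference and hypothesized PLU strings (the function and the example strings below are illustrative, not from any specific toolkit):

```python
def align_counts(ref, hyp):
    """Return (s, d, i) counts from a minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        e, s, d, ins = dp[i - 1][0]
        dp[i][0] = (e + 1, s, d + 1, ins)            # leading deletions
    for j in range(1, m + 1):
        e, s, d, ins = dp[0][j - 1]
        dp[0][j] = (e + 1, s, d, ins + 1)            # leading insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            e, s, d, ins = dp[i - 1][j - 1]
            cand_sub = (e + cost, s + cost, d, ins)  # match / substitution
            e, s, d, ins = dp[i - 1][j]
            cand_del = (e + 1, s, d + 1, ins)        # deletion
            e, s, d, ins = dp[i][j - 1]
            cand_ins = (e + 1, s, d, ins + 1)        # insertion
            dp[i][j] = min(cand_sub, cand_del, cand_ins)
    _, s, d, ins = dp[n][m]
    return s, d, ins

ref = "sil b ae t sil".split()                      # reference PLU string
hyp = "sil b ae sil sil".split()                    # recognizer output (made up)
s, d, i = align_counts(ref, hyp)
N = len(ref)
print("correctness:", (N - s - d) / N)              # 0.8
print("accuracy:   ", (N - s - d - i) / N)          # 0.8
```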
Language Models for LVCSR Word Pair Model: Specify which word pairs are valid
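As a toy illustration of a word-pair constraint (the vocabulary and allowed pairs below are made up), a sentence is accepted only if every adjacent word pair is licensed:

```python
# Toy word-pair grammar: for each word, the set of words allowed to follow it.
word_pairs = {
    "<s>":   {"show", "list"},
    "show":  {"ships", "all"},
    "list":  {"ships"},
    "all":   {"ships"},
    "ships": {"</s>"},
}

def valid_sentence(words):
    """Check a word sequence against the word-pair grammar."""
    seq = ["<s>"] + words + ["</s>"]
    return all(b in word_pairs.get(a, set()) for a, b in zip(seq, seq[1:]))

print(valid_sentence("show all ships".split()))   # True
print(valid_sentence("ships show".split()))       # False
```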
Statistical Language Modeling
Perplexity of the Language Model
Entropy of the source: $H = -\lim_{Q\to\infty}\frac{1}{Q}\sum_{w_1,\ldots,w_Q}P(w_1,\ldots,w_Q)\,\log P(w_1,\ldots,w_Q)$
Assuming independent generation of words, $P(w_1,\ldots,w_Q)=\prod_{i=1}^{Q}P(w_i)$; then H reduces to the first-order entropy of the source: $H = -\sum_{w}P(w)\log P(w)$
If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out: $H = -\lim_{Q\to\infty}\frac{1}{Q}\log P(w_1,w_2,\ldots,w_Q)$

Perplexity of the Language Model
We often compute H from a finite but sufficiently large Q: $\hat{H} = -\frac{1}{Q}\log P(w_1,w_2,\ldots,w_Q)$
H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source. If an N-gram language model $P_N(W)$ is used, an estimate of H is: $\hat{H}_N = -\frac{1}{Q}\log P_N(w_1,w_2,\ldots,w_Q)$

Perplexity of the Language Model
In general $\hat{H}_N \ge H$, and if the source is ergodic, $\lim_{N\to\infty}\hat{H}_N = H$. Perplexity is defined as: $PP = 2^{H}$
Example: (a) B = 8, (b) B = 4
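A short sketch of computing perplexity for a toy unigram model (the vocabulary, probabilities, and test sentence below are made-up illustrations):

```python
import math

# Toy unigram language model; words and probabilities are made up.
model = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.1, "on": 0.1}
words = "the cat sat on the mat".split()

Q = len(words)
log_prob = sum(math.log2(model[w]) for w in words)  # log2 P(w1..wQ) under independence
H = -log_prob / Q                                   # estimated entropy (bits/word)
PP = 2 ** H                                         # perplexity
print(f"H = {H:.3f} bits/word, perplexity = {PP:.2f}")
```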
Overall recognition system based on sub-word units
Naval Resource (Battleship) Management Task: 991-word vocabulary; NG (no grammar): perplexity = 991
Word pair grammar: we can partition the vocabulary into four non-overlapping sets of words; the overall FSN allows recognition of sentences of the corresponding form
WP (word pair) grammar: perplexity = 60
FSN based on the partitioning scheme: 995 real arcs and 18 null arcs
WB (word bigram) grammar: perplexity = 20
Control of word insertion/word deletion rate: in the structure discussed so far, there is no control over sentence length. We therefore introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc
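A minimal sketch of how such a penalty acts (the penalty value and scores below are illustrative; in a real decoder the penalty is applied on word-end transitions inside the Viterbi search):

```python
# Illustrative word-insertion penalty: a fixed negative log-likelihood amount
# added whenever a decoding path crosses a word-end arc. A more negative
# penalty discourages paths that hypothesize many short words.
WORD_INSERTION_PENALTY = -20.0   # tunable; the value here is made up

def path_score(acoustic_loglik, n_word_ends):
    """Total Viterbi path score with the penalty applied per word end."""
    return acoustic_loglik + n_word_ends * WORD_INSERTION_PENALTY

# Two competing paths with equal acoustic score: the one hypothesizing
# three words is penalized more than the one hypothesizing one word.
print(path_score(-1500.0, 1))   # -1520.0
print(path_score(-1500.0, 3))   # -1560.0
```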
State Tying
Problem definition • In HMM-based speech recognition, the performance of the system depends critically on how well state output distributions are modeled, and on how well the model parameters are learned • The existing tradeoff: • using a simple model whose parameters can be estimated easily • using a complex model for the density of the data, so that the HMM is a good statistical model • Some common simple densities for the output distribution: • Gaussian model: however, it is a poor model for the distribution of cepstral features • Gaussian mixture model: an effective and simple model
Problem definition • The parameters required to specify a mixture of K Gaussians include K mean vectors, K covariance matrices, and K mixture weights • A recognizer with tens (or hundreds) of thousands of HMM states will require hundreds of thousands (or millions) of parameters to specify all state output densities • Most training corpora cannot provide sufficient training data to learn all these parameters effectively • Parameters for the state output densities of sub-word units that are never seen in the training data can never be learned at all • The key problem: maintaining the balance between model complexity and available training data
Training the HMM • To train the HMM for a sub-word unit, data from all instances of the unit in the training corpus are used to estimate the parameters • This process could be: • Context-independent • Context-dependent
Context-independent parameter training • In a context-independent model, samples of a unit are gathered from different locations in the corpus • The effects of neighboring units are ignored • Gather data from separate instances, assign data to states, aggregate the data for each state, and find the statistical parameters of each aggregate
Context-dependent parameter training • Phoneme pronunciation depends on its environment (allophones, co-articulation) • Context-based grouping of observations results in finer, context-dependent (CD) models • Triphones: the simplest and most widely used CD model • Context: a window of length three
What to use? Word models are best when the vocabulary is small (e.g. digits). CI phoneme-based models are rarely used. Where accuracy is of prime importance, triphone models are usually used. If a reduced memory footprint and speed are important, e.g. in embedded recognizers, diphone models are often used. Higher-order N-phone models are rarely used
Triphones To build the HMM for a word, we simply concatenate the HMMs for individual triphones in it
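A small sketch of this expansion, using the p(left,right) triphone notation from the next slide; padding word edges with silence is an assumption made here for simplicity (real systems use cross-word context or word-boundary markers):

```python
# Expand a word's phoneme string into triphone labels p(left,right); the word
# HMM is then the concatenation of the HMMs of these triphones. Padding the
# word edges with "sil" is an illustrative choice.
def to_triphones(phones, left_edge="sil", right_edge="sil"):
    out = []
    for k, p in enumerate(phones):
        left = phones[k - 1] if k > 0 else left_edge
        right = phones[k + 1] if k < len(phones) - 1 else right_edge
        out.append(f"{p}({left},{right})")
    return out

# GUT = G AX T  ->  ['G(sil,AX)', 'AX(G,T)', 'T(AX,sil)']
print(to_triphones(["G", "AX", "T"]))
```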
Triphones • Triphones at word boundaries are dependent on neighboring words • Cross-word triphones: context spanning word boundaries, important for accurate modeling • A triphone in the middle of a word sounds different from the same triphone at word boundaries • e.g. the word-internal triphone AX(G,T) from GUT: G AX T • vs. the cross-word triphone AX(G,T) in BIG ATTEMPT • This results in significant complication of the HMM for the language (through which we find the best path for recognition), resulting in larger HMMs and slower search
Problems with triphones • Parameters: very large numbers for very large vocabulary recognition • Number of phones: about 50 • Number of CD phones: possibly 50³ = 125,000, but not all of them occur (phonotactic constraints); in practice, about 60,000 • Number of HMM parameters, with 16 mixture components and 39-dimensional feature vectors: 60,000 × 3 × (39 × 16 × 2 + 16) ≈ 230M • Data sparsity: some triphones, particularly cross-word triphones, do not appear in the sample
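A quick check of that arithmetic (diagonal covariances assumed: 39 means plus 39 variances per Gaussian, plus 16 mixture weights per state):

```python
# Parameter count for untied triphone HMMs with diagonal-covariance mixtures:
# per state, 16 Gaussians x (39 means + 39 variances) + 16 mixture weights.
n_triphones = 60_000
states_per_model = 3
dim, n_mix = 39, 16

per_state = dim * n_mix * 2 + n_mix          # 1264 parameters per state
total = n_triphones * states_per_model * per_state
print(f"{total:,}")                          # 227,520,000, i.e. roughly 230M
```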
Solution • Parameter sharing: cluster parameters with similar characteristics ('parameter tying'), e.g. by clustering HMM states • Parameter sharing is a technique by which several similar HMM states share a common set of HMM parameters • Since the shared HMM parameters are now trained using the data from all the similar states, more data are available to train each HMM parameter
Parameter sharing types • Continuous density HMMs • Individual states may share the same mixture distributions
Parameter sharing types Semi-continuous density HMMs: all states may share the same Gaussians, but with different mixture weights
Parameter sharing types Semi-continuous density HMMs: all states share the same Gaussians with state-specific mixture weights; similar states may additionally share (tie) the weights as well
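A minimal sketch of a semi-continuous (tied-mixture) output density, where every state scores the same shared Gaussian codebook and differs only in its weights (shapes and values below are illustrative):

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
K, dim = 4, 3                           # shared codebook size, feature dimension
means = rng.normal(size=(K, dim))       # shared across ALL states
vars_ = np.ones((K, dim))

def state_log_likelihood(x, weights):
    """b_j(x) = sum_k c_jk N(x; mu_k, Sigma_k), computed in the log domain."""
    log_comps = np.array([log_gauss(x, means[k], vars_[k]) for k in range(K)])
    return np.logaddexp.reduce(np.log(weights) + log_comps)

x = rng.normal(size=dim)
w_state1 = np.array([0.7, 0.1, 0.1, 0.1])      # state-specific weights
w_state2 = np.array([0.25, 0.25, 0.25, 0.25])  # another state, same Gaussians
print(state_log_likelihood(x, w_state1), state_log_likelihood(x, w_state2))
```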
Two techniques • Data-driven clustering • Group HMM states together based on the similarity of their distributions, until all groups have sufficient data • The densities used for grouping are poorly estimated in the first place • Has no estimates for unseen sub-word units • Places no restrictions on HMM topologies, etc. • Decision trees • Clustering based on expert-specified rules; the selection of rules is data-driven • Based on externally provided rules; very robust if the rules are good • Provides a mechanism for estimating unseen sub-word units • Restricts HMM topologies
Decision Tree • Basic principle: • Recursively partition a data set to maximize a pre-specified objective function • The actual objective function used depends on the specific decision tree algorithm • The objective is to separate the data into increasingly "pure" subsets, such that most of the data in any subset belongs to a single class • In our case the "classes" are HMM states • Most commonly used tools for induction of decision trees: • CART (classification and regression trees) • C4.5
Decision Tree in our problem • Algorithm: • Initially, group together all triphones for the same phoneme • Split the group according to decision tree questions based on the left or right phonetic context • All triphones (HMM states) at the same leaf are clustered (tied) • Advantage: even unseen triphones are assigned to a cluster and thus a model • Questions: which DT questions? Which splitting criterion? • Examples of predefined binary questions: • is the phoneme to the left an /l/? • is the phoneme to the right a nasal?
Clustering context-dependent phones
Splitting Criterion • Criterion: the best question is the one that maximizes the sample likelihood after splitting • The parent set $O_1$ has a distribution $P_1(x)$; the total log likelihood of all observations in $O_1$ under the distribution of $O_1$ is $L(O_1)=\sum_{x\in O_1}\log P_1(x)$ • The child set $O_2$ has a distribution $P_2(x)$; the total log likelihood of all observations in $O_2$ under the distribution of $O_2$ is $L(O_2)=\sum_{x\in O_2}\log P_2(x)$ • The child set $O_3$ has a distribution $P_3(x)$; the total log likelihood of all observations in $O_3$ under the distribution of $O_3$ is $L(O_3)=\sum_{x\in O_3}\log P_3(x)$ • The total increase in set-conditioned log likelihood of the observations due to partitioning $O_1$ into $O_2$ and $O_3$ is $\Delta L = L(O_2)+L(O_3)-L(O_1)$ • Partition $O_1$ such that the increase in log likelihood is maximized • Recursively perform this partitioning on each of the subsets to form a tree
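A compact sketch of this criterion, assuming each set is modeled by a single diagonal-covariance Gaussian estimated from its own data (the data and candidate questions below are made up):

```python
# Likelihood-gain split selection for state clustering. Each candidate
# "question" is a boolean mask partitioning the parent's observations.
import numpy as np

def gauss_loglik(X):
    """Total log likelihood of the rows of X under a diagonal Gaussian fit to X."""
    mean, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var)

def best_split(X, questions):
    """Pick the question maximizing Delta L = L(O2) + L(O3) - L(O1)."""
    parent = gauss_loglik(X)
    best, best_gain = None, -np.inf
    for name, mask in questions.items():
        if mask.all() or not mask.any():       # skip degenerate splits
            continue
        gain = gauss_loglik(X[mask]) + gauss_loglik(X[~mask]) - parent
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
questions = {
    "left-context-is-nasal?": np.arange(100) < 50,   # separates the two clusters
    "random-question?": rng.random(100) < 0.5,
}
print(best_split(X, questions))   # the separating question yields the larger gain
```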