Presentation of the paper "Effective Approaches to Attention-based Neural Machine Translation".
Effective Approaches to Attention-based Neural Machine Translation {F2019313020, F2019313001}@UMT.EDU.PK
Introduction • The attentional mechanism was developed to improve Neural Machine Translation (NMT) by selectively focusing on parts of the source sentence. • The paper examines two classes of attention-based mechanisms. • Global approach -> attends to all source words. • Local approach -> attends to a subset of source words. • Local attention achieved a gain of 5.0 BLEU points over non-attentional systems. • This paper builds on 15 other research papers. • Further research has built on these papers; the current BLEU score achieved is 59, compared with the 29 BLEU points achieved in this paper.
Neural Machine Translation • The ultimate goal of any NMT model is to take a sentence in one language as input and return that sentence translated into a different language as output. • An NMT system is a large neural network that is trained end to end and can generalize well to very long word sequences. • It does not have to store gigantic phrase tables and language models. • NMT has a small memory footprint. • Explanation of NMT: https://towardsdatascience.com/neural-machine-translation-15ecf6b0b
Neural Machine Translation • An NMT model directly models the conditional probability p(y|x) of translating a source sentence x_1, …, x_n to a target sentence y_1, …, y_m. • It consists of an encoder, which computes a representation s for each source sentence, and a decoder, which generates one target word at a time.
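The one-word-at-a-time generation described on this slide corresponds to a factorization of the conditional probability. A sketch of that decomposition in LaTeX, where m is the target length and y_{<j} denotes the target words before position j:

```latex
\log p(y \mid x) = \sum_{j=1}^{m} \log p\left(y_j \mid y_{<j}, s\right)
```

Each term is the probability of one target word given the words already generated and the source representation s.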
Neural Machine Translation • A natural choice for modeling this word-by-word decomposition on the decoder side is an RNN architecture. • Papers published in 2013, 2014 and 2015 differ in which RNN architectures are used for the decoder and in how the encoder computes the source sentence representation s. • Luong (co-author of the paper) used stacked RNN layers with LSTM units for both the encoder and the decoder.
Probability of decoding • The probability of decoding each word y_j is p(y_j | y_{<j}, s) = softmax(g(h_j)). • g is the transformation function that outputs a vocabulary-sized vector. One can also provide g with other inputs, such as the currently predicted word y_j, as proposed by Bahdanau et al. (2015). • h_j is the RNN hidden unit, abstractly computed as h_j = f(h_{j-1}, s). • The function f computes the current hidden state from the previous one and can be a vanilla RNN unit, a GRU or an LSTM unit. • In these models the source representation s is used only once, to initialize the decoder hidden state.
Proposed NMT Model and Attention • A stacking LSTM architecture is used for the proposed NMT systems. • The LSTM unit used is the one defined in Zaremba et al. (2015). • Here the source representation s implies a set of source hidden states. • The set s is consulted throughout the entire translation process. • This approach is referred to as an attention mechanism.
Attention-based Models • Classified into two broad categories: global and local. • The classes differ in whether attention is placed on all source positions or on only a few. • Common to both types, at each time step t in the decoding phase: • Both approaches take as input the hidden state h_t at the top layer of a stacking LSTM. • The goal is to derive a context vector c_t that helps predict the current target word y_t. • Information from both vectors, h_t and c_t, is combined to produce an attentional hidden state.
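The combination step shared by both attention types can be sketched in NumPy. This is a minimal illustration, assuming the form used in the paper (concatenate c_t and h_t, apply a learned matrix and tanh, then a softmax layer over the vocabulary). The names `W_c`, `W_s` and `attentional_step`, the dimensions and the random values are illustrative, not taken from the slides.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attentional_step(h_t, c_t, W_c, W_s):
    """Combine the decoder state h_t and context vector c_t, then predict y_t."""
    concat = np.concatenate([c_t, h_t])   # [c_t; h_t], shape (2d,)
    h_tilde = np.tanh(W_c @ concat)       # attentional hidden state, shape (d,)
    p_y = softmax(W_s @ h_tilde)          # distribution over the vocabulary, shape (V,)
    return h_tilde, p_y

# Toy sizes (assumptions): hidden size d = 4, vocabulary size V = 10.
rng = np.random.default_rng(0)
d, V = 4, 10
h_t = rng.normal(size=d)
c_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(V, d))
h_tilde, p_y = attentional_step(h_t, c_t, W_c, W_s)
```

In a trained model `W_c` and `W_s` would be learned parameters; here they are random so that only the shapes and the data flow are meaningful.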
Global Attention • At each time step t, the model infers a variable-length alignment weight vector a_t. • a_t is based on the current target state h_t and all the source states h̄_s. • The global context vector c_t is computed as a weighted average, • weighted according to a_t, over all the source states. • Drawback: global attention has to attend to all the words on the source side for each target word.
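A minimal NumPy sketch of one global attention step follows. It assumes a dot-product score between h_t and each source state, which is only one possible scoring function. The function name `global_attention` and the toy vectors are illustrative.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def global_attention(h_t, H_s):
    """h_t: target state, shape (d,). H_s: all source states, shape (S, d)."""
    scores = H_s @ h_t        # one dot-product score per source position, shape (S,)
    a_t = softmax(scores)     # alignment weights; length varies with the source length S
    c_t = a_t @ H_s           # weighted average of all source states, shape (d,)
    return a_t, c_t

# Toy example (assumption): 3 source positions, hidden size 2.
H_s = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
h_t = np.array([1.0, 0.0])
a_t, c_t = global_attention(h_t, H_s)
```

The cost noted on the slide is visible here: `scores` has one entry per source word, so every decoding step touches the entire source sentence.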
Local Attention • The local attention model focuses on only a small subset of the source positions per target word. • The model first predicts a single aligned position p_t for the current target word. • A window centered on the source position p_t is used to compute c_t, the weighted average of the source hidden states in the window. • The weights a_t are inferred from the current target state h_t and the source states h̄_s in the window.
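The windowed computation can be sketched in NumPy as below. This is a simplified illustration with several assumptions: the aligned position `p_t` is passed in as an integer rather than predicted by the model, the half-width `D` is a hypothetical parameter name, scores are plain dot products, and no extra weighting around `p_t` is applied.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def local_attention(h_t, H_s, p_t, D):
    """Attend only to source positions in [p_t - D, p_t + D], clipped to the sentence."""
    S = H_s.shape[0]
    lo = max(0, p_t - D)
    hi = min(S, p_t + D + 1)
    window = H_s[lo:hi]              # source states inside the window
    w = softmax(window @ h_t)        # weights over the window only
    a_t = np.zeros(S)
    a_t[lo:hi] = w                   # positions outside the window get zero weight
    c_t = w @ window                 # weighted average of the windowed states
    return a_t, c_t

# Toy example (assumption): 6 source positions with one-hot states.
H_s = np.eye(6)
h_t = np.ones(6)
a_t, c_t = local_attention(h_t, H_s, p_t=2, D=1)
```

Unlike the global variant, the work per decoding step depends on the window size rather than on the source length.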
Training Details • Models were trained on the WMT'14 training data: • 4.5M sentence pairs (116M English words, 110M German words). • The vocabulary is limited to the 50K most frequent words. • Words not in this shortlisted vocabulary were converted to a universal token. • Sentence pairs longer than 50 words were filtered out, and mini-batches were shuffled.
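The preprocessing steps listed on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the helper names (`build_vocab`, `replace_rare`, `filter_pairs`) are hypothetical, the universal token is written `<unk>`, sentences are assumed to be already tokenized, and the toy limits (length 5, vocabulary size 2) stand in for 50 words and 50K entries. Mini-batch shuffling is omitted.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the universal token

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_rare(sentence, vocab):
    """Map every out-of-vocabulary token to the universal token."""
    return [tok if tok in vocab else UNK for tok in sentence]

def filter_pairs(pairs, max_len):
    """Drop sentence pairs in which either side is longer than max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_len and len(tgt) <= max_len]

# Toy data (assumption): two tokenized English-German pairs.
pairs = [
    ("the cat sat".split(), "die Katze saß".split()),
    ("the dog sat on the mat".split(), "der Hund saß auf der Matte".split()),
]
kept = filter_pairs(pairs, max_len=5)
src_vocab = build_vocab([src for src, _ in kept], max_size=2)
encoded = [replace_rare(src, src_vocab) for src, _ in kept]
```

With the real settings, `max_len` would be 50 and `max_size` 50,000, with a separate vocabulary built for each language.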