Outline for today's presentation
• We will see how RNNs and CNNs compare on a variety of sequence modelling tasks
• Then, we will go through a new approach to sequence modelling that has become state of the art
• Finally, we will look at a few augmented RNN models
RNNs vs CNNs
Empirical Evaluation of Generic Networks for Sequence Modelling
Let's say you are given a sequence modelling task such as text classification or music note prediction, and you are asked to develop a simple model. What would your baseline model be based on: RNNs or CNNs?
Recent Trend in Sequence Modelling
• Sequence modelling is widely considered to be RNNs' "home turf"
• Recent research has shown otherwise:
• Speech synthesis – WaveNet uses dilated convolutions for synthesis
• Char-to-char machine translation – ByteNet uses an encoder-decoder architecture with dilated convolutions; tested on an English-German dataset
• Word-to-word machine translation – hybrid CNN-LSTM models on English-Romanian and English-French datasets
• Character-level language modelling – ByteNet on the WikiText dataset
• Word-level language modelling – Gated CNNs on the WikiText dataset
Temporal Convolutional Network (TCN)
• A model that uses best practices in convolutional network design
• Properties of the TCN:
• Causal – there is no information leakage from the future to the past
• Memory – it can look very far into the past for prediction/synthesis
• Input – it can take a sequence of arbitrary length, with proper tuning to the particular task
• Simple – it uses no gating mechanism, no complex stacking mechanism, and each layer's output has the same length as its input
• Components of the TCN:
• 1-D dilated convolutions
• Residual connections
TCN - Dilated Convolutions
[Figure: 1-D convolutions vs 1-D dilated convolutions; source: WaveNet]
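A minimal sketch (not the authors' reference code) of a causal, dilated 1-D convolution in PyTorch: left-padding by (kernel_size - 1) * dilation and trimming the same amount on the right keeps the layer causal while preserving the sequence length. The class name CausalConv1d is ours, for illustration only.

```python
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t never depends on inputs after t."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # amount of padding needed on the left
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time). Conv1d pads symmetrically, so drop the
        # extra frames on the right to keep the output causal and length-preserving.
        out = self.conv(x)
        return out[:, :, :-self.pad] if self.pad > 0 else out


# Usage: a length-32 sequence stays length-32 after the causal dilated convolution.
x = torch.randn(4, 8, 32)                          # (batch, channels, time)
y = CausalConv1d(8, 16, kernel_size=3, dilation=2)(x)
print(y.shape)                                     # torch.Size([4, 16, 32])
```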
TCN - Residual Connections
• Layers learn a modification of the identity mapping rather than the full transformation
• Residual connections have been shown to be very useful for very deep networks
[Figures: residual block of the TCN; example of a residual connection in the TCN]
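A minimal sketch of a residual block along these lines, assuming PyTorch and the CausalConv1d module from the previous snippet; it is an illustration of the idea, not the paper's exact block. Two dilated causal convolutions with ReLU and dropout form the transformation branch, and a 1x1 convolution matches channel counts on the identity branch when needed.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, dropout=0.2):
        super().__init__()
        # Transformation branch: two causal dilated convolutions with ReLU + dropout.
        self.branch = nn.Sequential(
            CausalConv1d(in_channels, out_channels, kernel_size, dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
            CausalConv1d(out_channels, out_channels, kernel_size, dilation),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        # 1x1 convolution so the skip connection has matching channel counts.
        self.downsample = (nn.Conv1d(in_channels, out_channels, 1)
                           if in_channels != out_channels else nn.Identity())

    def forward(self, x):
        # The block learns a modification of the identity mapping:
        # output = activation(branch(x) + skip(x)).
        return torch.relu(self.branch(x) + self.downsample(x))
```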
TCN - Weight Normalization
• Shortcomings of Batch Normalization:
• It needs two passes over the input – one to compute the batch statistics and one to normalize
• It takes a significant amount of time to compute for each batch
• It depends on the batch size – not very useful when the batch is small
• It cannot be used when we are training in an online setting
• Weight Normalization instead reparameterizes the weights directly, without using batch statistics
• The main aim is to decouple the magnitude and the direction of each weight vector
• It has been shown to be faster than Batch Norm
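A short sketch of weight normalization using PyTorch's built-in helper. It reparameterizes each weight tensor as w = g * v / ||v||, so the magnitude (g) and the direction (v / ||v||) are learned separately; no batch statistics are needed, so it also works with very small batches or online training.

```python
import torch
import torch.nn as nn

# Wrap a convolution so its weight is split into magnitude and direction parameters.
conv = nn.utils.weight_norm(nn.Conv1d(8, 16, kernel_size=3))
print(sorted(name for name, _ in conv.named_parameters()))
# ['bias', 'weight_g', 'weight_v'] -> the single 'weight' tensor is replaced by
# a magnitude parameter (weight_g) and a direction parameter (weight_v)
```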
TCN Advantages/Disadvantages
[+] Parallelism – each layer of a convolutional network can be parallelized
[+] Receptive field size – can easily be increased by increasing the filter length, the dilation factor, or the depth
[+] Stable gradients – uses residual connections and dropout
[+] Storage (train) – the memory footprint during training is lower than for RNNs
[+] Sequence length – can easily be adapted to variable input lengths
[-] Storage (test) – during testing it requires more memory, since the whole effective history must be kept rather than a single hidden state
Experimental Setup
• The TCN filter size, dilation factor, and number of layers are chosen so that the receptive field covers the entire input sequence (see the sketch after this list)
• The vanilla RNN/LSTM/GRU hidden sizes and layer counts are chosen to have roughly the same number of parameters as the TCN
• For both model families, the following hyperparameter search was used:
• Gradient clipping – [0.3, 1]
• Dropout – [0, 0.5]
• Optimizers – SGD/RMSProp/AdaGrad/Adam
• Weight initialization – Gaussian
• Exponential dilation (for the TCN)
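A small helper (an illustration, not from the paper) for checking that a stack of dilated causal convolutions covers a given sequence length. It assumes one convolution per level and exponentially growing dilations 1, 2, 4, ...; with kernel size k the receptive field is then 1 + (k - 1) * sum(dilations).

```python
def receptive_field(kernel_size, num_levels):
    """Receptive field of a stack of dilated causal convolutions (one per level)."""
    dilations = [2 ** i for i in range(num_levels)]
    return 1 + (kernel_size - 1) * sum(dilations)


# Example: smallest depth whose receptive field covers T = 784
# (one Sequential-MNIST image flattened into a pixel sequence).
T, k = 784, 7
levels = 1
while receptive_field(k, levels) < T:
    levels += 1
print(levels, receptive_field(k, levels))  # 8 levels -> receptive field of 1531
```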
Datasets
• Adding Problem:
• Serves as a stress test for sequence models
• Consists of an input of length n and depth 2; the first dimension holds values drawn uniformly from [0, 1], and the second dimension has 1s at exactly two positions (0 elsewhere)
• The goal is to output the sum of the two marked values (a data-generation sketch follows after the dataset list)
• Sequential MNIST and P-MNIST:
• Tests the ability to remember the distant past
• Each 28x28 MNIST digit image is flattened into a 784x1 pixel sequence for digit classification
• P-MNIST applies a fixed random permutation to the pixel order
• Copy Memory:
• Tests the memory capacity of the model
• Consists of an input of length n+20, with the first 10 values randomly selected from the digits 1 to 8 and the final entries set to the delimiter digit 9, with everything else 0
• The goal is to copy the first 10 values into the last ten placeholder positions
• Polyphonic Music:
• Consists of sequences in which each element is an 88-dimensional binary vector indicating which of the 88 piano keys are active
• The goal is to predict the vector at the next time step
• Penn Treebank (PTB):
• A small language modelling dataset used at both word and character level
• Consists of 5,059K characters (char-level) or 888K words (word-level) for training
• WikiText-103:
• Consists of 28K Wikipedia articles for word-level language modelling
• Consists of 103M words for training
• LAMBADA:
• Tests the ability to capture long and broad contexts
• Consists of 10K passages extracted from novels and serves as a question-answering-style dataset
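A minimal NumPy sketch of the adding-problem data described above, included to make the task format concrete; it illustrates the task, not the paper's exact generator, and the function name is ours.

```python
import numpy as np


def adding_problem_batch(batch_size, seq_len, seed=None):
    rng = np.random.default_rng(seed)
    # First input dimension: values drawn uniformly from [0, 1].
    values = rng.uniform(0.0, 1.0, size=(batch_size, seq_len))
    # Second input dimension: exactly two marker positions set to 1.
    markers = np.zeros((batch_size, seq_len))
    for i in range(batch_size):
        a, b = rng.choice(seq_len, size=2, replace=False)
        markers[i, a] = markers[i, b] = 1.0
    x = np.stack([values, markers], axis=-1)   # shape (batch, seq_len, 2)
    y = (values * markers).sum(axis=1)         # target: sum of the two marked values
    return x, y


x, y = adding_problem_batch(batch_size=4, seq_len=200)
print(x.shape, y.shape)   # (4, 200, 2) (4,)
```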
Inferences
• Inferences are made in the following categories:
1. Memory
• The copy memory task was designed to check the propagation of information over long distances
• TCNs achieve almost 100% accuracy, whereas RNNs fail at higher sequence lengths
• The LAMBADA dataset was designed to test local and broader contexts
• TCNs again outperform all of their recurrent counterparts
2. Convergence
• In almost all of the tasks, TCNs converged faster than RNNs
• The extent of parallelism possible is one explanation
• The authors conclude that, given comparable research attention, TCNs can outperform state-of-the-art RNN models
To build a simple sequence modelling network, what would you choose: RNNs or CNNs?