I. Introduction
In its simplest form, a standard one-dimensional recurrent neural network (RNN) can be defined as

$$x(t) = \phi\left(W_1 \begin{bmatrix} u(t) \\ x(t-1) \\ 1 \end{bmatrix}\right), \qquad y(t) = W_2 \begin{bmatrix} x(t) \\ 1 \end{bmatrix}, \tag{1}$$

where $t$ is a discrete time index, $\phi$ an elementwise sigmoid function, $u$ the input sequence, $x$ the hidden state sequence, $y$ the output sequence, and $W_1$ and $W_2$ are two weight matrices with proper dimensions. RNN is a powerful tool for sequence modeling, and its gradient can be conveniently calculated, e.g., via backpropagation through time (BPTT) [1]. Unfortunately, learning RNN turns out to be extremely difficult when it is used to solve problems requiring long term memories [2, 3, 4]. Exploding and vanishing gradients, especially the latter, are suspected to be the causes. Hence long short-term memory (LSTM) and its variants [4, 5] were invented to overcome the vanishing gradient issue, mainly by the use of forgetting gates. However, as a specially modified model, LSTM may still fail to solve certain problems that are not suited to its architecture, e.g., finding the XOR relationship between two binary symbols with a long lag. Furthermore, as a more complicated model, it does not necessarily outperform the standard RNN model on certain natural problems, as reported in [2, 3]. Another way to address vanishing gradients is to directly penalize RNN connections that encourage vanishing gradients [2]. However, as an ad hoc method, its impact on the convergence and performance of RNN training is unclear. One important discovery made in [3] is that RNN requiring long term memories can be trained using Hessian-free optimization, a conjugate gradient (CG) method tailored for neural network training, which uses a backpropagation-like procedure for curvature matrix-vector product evaluation [6]. However, due to its use of line search, Hessian-free optimization requires a large minibatch size for gradient and cost function evaluations, making it computationally demanding for problems with large training sample sizes.

Recently, a preconditioned stochastic gradient descent (PSGD) algorithm was proposed in [7]. It is a simple and general procedure to upgrade a stochastic gradient descent (SGD) algorithm to a second-order algorithm by exploiting the curvature information extracted exclusively from noisy stochastic gradients. It is virtually tuning free, and applies equally well to both convex and nonconvex problems, a striking difference from many optimization algorithms, including the Hessian-free one, which assume positive definite Hessian matrices at least for their derivations. Naturally, we are curious about its performance on RNN training, especially on those challenging pathological synthetic problems, since they are effectively impossible for SGD [4, 3]. Our results suggest that although the issue of exploding and vanishing gradients arises naturally in RNN, efficient learning is still possible when the gradients are properly preconditioned. Experimental results on the MNIST handwritten digit recognition task suggest that preconditioning also helps to improve convergence even when no long term memory is required.
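As a hedged illustration of the recurrence in (1), the following NumPy sketch runs a forward pass of a plain RNN. It assumes that the input, the previous hidden state, and a constant 1 (acting as the bias input) are stacked into one vector before multiplication by the first weight matrix. The function name `rnn_forward`, the use of `tanh` as the sigmoid-shaped nonlinearity, and all sizes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rnn_forward(W1, W2, inputs, x0=None):
    """Run a plain RNN over `inputs` (shape [T, Nu]); return outputs (shape [T, Ny])."""
    Nx = W1.shape[0]
    x = np.zeros(Nx) if x0 is None else x0
    outputs = []
    for u in inputs:
        # Hidden state update: nonlinearity applied to W1 * [input; previous state; 1].
        x = np.tanh(W1 @ np.concatenate([u, x, [1.0]]))
        # Linear readout: W2 * [state; 1].
        outputs.append(W2 @ np.concatenate([x, [1.0]]))
    return np.array(outputs)

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
Nu, Nx, Ny, T = 2, 4, 3, 5
W1 = 0.1 * rng.standard_normal((Nx, Nu + Nx + 1))
W2 = 0.1 * rng.standard_normal((Ny, Nx + 1))
y = rnn_forward(W1, W2, rng.standard_normal((T, Nu)))
```

Absorbing the biases into the weight matrices through the appended constant keeps the model at exactly two weight matrices, which matches the description in the text.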
II. PSGD and RNN
II-A. PSGD
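We briefly summarize the PSGD theory in [7]. Let us consider the minimization of the cost function

$$f(\theta) = E_z[\ell(\theta, z)], \tag{2}$$

where $\theta$ is a parameter vector to be optimized, $z$ is a random vector, $\ell$ is a loss function, and $E_z$ takes expectation over $z$. At the $k$th iteration of PSGD, we evaluate two stochastic gradients over the same randomly drawn samples: the original gradient $\hat{g}_k$ at point $\theta = \theta_k$, and a perturbed gradient $\tilde{g}_k$ at point $\theta = \theta_k + \delta\theta_k$, where $\delta\theta_k$ is a tiny random vector. By introducing the gradient perturbation $\delta\hat{g}_k = \tilde{g}_k - \hat{g}_k$, a positive definite preconditioner, $P_k$, can be pursued by minimizing the criterion

$$E_{\delta\theta}\left[\delta\hat{g}_k^T P_k\, \delta\hat{g}_k + \delta\theta_k^T P_k^{-1} \delta\theta_k\right], \tag{3}$$

where $E_{\delta\theta}$ takes expectation over the random vector $\delta\theta_k$. Under mild conditions, such a $P_k$ exists and is unique [7]. As a result, the PSGD learning rule is written as

$$\theta_{k+1} = \theta_k - \mu P_k \hat{g}_k, \tag{4}$$

where $\mu$ is a normalized step size. The preconditioner can be conveniently estimated using stochastic relative (natural) gradient descent with minibatch size .

The rationale of PSGD is that minimizing criterion (3) leads to a preconditioner that scales the gradient such that the amplitude of $P_k\,\delta\hat{g}_k$ approximately matches that of $\delta\theta_k$, as $P_k E[\delta\hat{g}_k \delta\hat{g}_k^T] P_k = E[\delta\theta_k \delta\theta_k^T]$. When gradient noise vanishes, we have $P_k E[\delta g_k \delta g_k^T] P_k = E[\delta\theta_k \delta\theta_k^T]$, where $\delta g_k$ is the noise-free gradient perturbation, a relationship comparable to $H_k^{-1} \delta g_k \delta g_k^T H_k^{-1} = \delta\theta_k \delta\theta_k^T$, where $H_k$ is the Hessian at $\theta_k$. Hence PSGD can be regarded as a stochastic version of the deterministic Newton method when $P_k$ converges to $H_k^{-1}$ and $\mu = 1$. But unlike the Newton method, PSGD applies equally well to nonconvex optimization, since $P_k$ can be chosen to be positive definite even when $H_k$ is indefinite.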
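To make the scale-matching argument concrete, here is a small sketch under stated assumptions. Setting the derivative of criterion (3) with respect to the preconditioner to zero gives the condition P A P = B, where A is the second moment of the gradient perturbation and B is that of the parameter perturbation. The code solves this condition in closed form for a toy quadratic cost with an indefinite Hessian and no gradient noise. The helper names, the toy Hessian, and the perturbation variance are hypothetical; this is not the stochastic relative gradient procedure of [7], which estimates the preconditioner iteratively.

```python
import numpy as np

def spd_sqrt(M):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def fit_preconditioner(A, B):
    """Solve P A P = B for a symmetric positive definite P (A and B are SPD)."""
    A_half = spd_sqrt(A)
    A_inv_half = np.linalg.inv(A_half)
    return A_inv_half @ spd_sqrt(A_half @ B @ A_half) @ A_inv_half

# Toy indefinite Hessian: one steep, one nearly flat negative, one moderate direction.
H = np.diag([100.0, -0.01, 2.0])
sigma2 = 1e-6                    # variance of the tiny parameter perturbation
B = sigma2 * np.eye(3)           # E[dtheta dtheta^T] for dtheta ~ N(0, sigma2 * I)
A = H @ B @ H.T                  # E[dg dg^T] with noise-free dg = H dtheta
P = fit_preconditioner(A, B)
```

In this noise-free case the result is the inverse of the absolute-value Hessian: the steep direction is scaled by 0.01 (damping a large gradient), the nearly flat direction by 100 (amplifying a small gradient), and the preconditioner stays positive definite although the Hessian is not.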
In the context of RNN training, the preconditioner damps exploding gradients and amplifies vanishing gradients by trying to match the scales of the preconditioned gradient perturbation $P\,\delta\hat{g}$ and the parameter perturbation $\delta\theta$. In this way, a single preconditioner addresses both the exploding and vanishing gradient issues in learning RNN, whereas conventionally several different strategies are developed and combined to fix these two issues, e.g., gradient clipping, penalty terms that discourage vanishing gradients, and forgetting gates.
II-B. Application to RNN Training
II-B1. Dense preconditioner
It is straightforward to apply PSGD to RNN training by stacking all the elements in $W_1$ and $W_2$ to form a single coefficient vector $\theta$. The resultant preconditioner has no sparsity. Hence, such a brute-force solution is practical only for small scale problems with up to thousands of parameters to learn.
II-B2. Preconditioner with sparse structures
For large scale problems, it is necessary to enforce certain sparse structures on the preconditioner so that it can be stored and manipulated on computers. Supposing the dimensions of $u$, $x$ and $y$ are $N_u$, $N_x$ and $N_y$ respectively, one example is to enforce $P$ to have the form

$$P = (P_2 \otimes P_1) \oplus (P_4 \otimes P_3), \tag{5}$$

where the dimensions of the positive definite matrices $P_1$, $P_2$, $P_3$, and $P_4$ are $N_x$, $N_u + N_x + 1$, $N_y$, and $N_x + 1$ respectively, and $\otimes$ and $\oplus$ denote Kronecker product and direct sum respectively. Algorithms for learning these $P_1$, $P_2$, $P_3$, and $P_4$ are detailed in [7] as well. We mainly study the performance of PSGD with a sparse preconditioner due to its better scalability with respect to problem size.
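The practical benefit of a Kronecker product preconditioner can be sketched as follows. For a weight matrix with gradient G, applying a Kronecker product of two symmetric factors to the column-stacked gradient is equivalent to multiplying G by one small matrix on the left and one on the right, so the large preconditioner never needs to be formed. The sizes, the helper `random_spd`, and the column-major stacking convention are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive definite matrix (illustrative only)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

rows, cols = 4, 7                        # hypothetical weight matrix shape
G = rng.standard_normal((rows, cols))    # gradient with the same shape
P1, P2 = random_spd(rows), random_spd(cols)

def vec(M):
    """Column-major vectorization (stack the columns of M)."""
    return M.flatten(order="F")

# Dense route: build the (rows*cols) x (rows*cols) Kronecker product explicitly.
dense = np.kron(P2, P1) @ vec(G)

# Structured route: two small matrix products, no large matrix formed.
structured = vec(P1 @ G @ P2)

dense_entries = (rows * cols) ** 2           # storage for the full preconditioner
structured_entries = rows ** 2 + cols ** 2   # storage for the two factors
```

Both routes give the same preconditioned gradient, but the structured one stores 65 numbers instead of 784 in this toy case, and the gap widens rapidly with layer size.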
III. Experimental Results
We consider a real-world handwritten digit recognition problem [8], and a set of pathological synthetic problems originally proposed in [4] and restudied in [3, 2]. Details of these pathological synthetic problems can be found in [4] and the supplement of [3]. For continuous problems (outputs are continuous), mean squared error (MSE) loss is used, and for discrete problems (outputs are discrete), cross entropy loss is used. The same parameter settings as in [7] are used for PSGD, and no problem-specific hand tweaking is made. Specifically, the preconditioner is initialized to the identity matrix, and then updated using stochastic relative gradient descent with minibatch size , step size and sampling elementwise, where is the accuracy in double precision. The recurrent matrix of RNN is initialized to a random orthogonal matrix such that neither the exploding nor the vanishing gradient issue is severe at the beginning, loosely comparable to setting large initial biases in the forgetting gates of LSTM [5]. Other nonrecurrent weights are initialized elementwise to small random numbers drawn from a normal distribution. Minibatch size and step size are used for RNN training. Program code written in Matlab and supplemental materials revealing more detailed experimental results can be found at https://sites.google.com/site/lixilinx/home/psgd.

III-A. Experiment 1: PSGD vs. SGD
We consider the addition problem from [4], where an RNN is trained to predict the sum of a pair of marked, but randomly located, continuous random numbers in a sequence. For SGD, clipped stochastic gradient with clipping threshold is used to address the exploding gradient issue. SGD seldom succeeds on this problem when the sequence length is no less than . To make the problem easier, sequences with length uniformly distributed in range are used for training, in the hope that SGD can learn the desired patterns from shorter sequences and then generalize them to longer ones. Fig. 1 shows the learning curves of three algorithms using the same initial guess and step size: SGD, PSGD with a sparse preconditioner, and PSGD with a dense preconditioner. PSGD with a dense preconditioner converges the fastest. The sparse preconditioner also helps considerably, despite its simplicity. SGD converges the slowest.

III-B. Experiment 2: Performance on Pathological Synthetic Problems
We consider the four groups of pathological synthetic problems in [4]. The first group includes the addition, multiplication, and XOR problems; the second group includes the 2-bit and 3-bit temporal order problems; the third group only has the random permutation problem; and the fourth group comprises the 5-bit and 20-bit noiseless memorization problems. In total, there are eight problems. In the addition and multiplication problems, RNN needs to memorize continuous random numbers with certain precision for many steps. In the 2-bit and 3-bit temporal order problems, RNN needs to memorize two and three widely separated binary bits and their order, respectively. The XOR problem challenges both RNN and LSTM training since this problem cannot be decomposed into smaller ones. In the random permutation problem, RNN is taught to predict random unpredictable symbols, except the one at the end of the sequence, leading to extremely noisy gradients. In contrast, all symbols in the 5-bit and 20-bit memorization problems, except the information-carrying bits, can be trivially predicted but are not task related, thus diluting the importance of task-related gradient components.
We follow the experimental configurations in [3, 4] so that the results can be compared. The results reported in [2] could be biased because according to the descriptions in [2], for most problems, RNN is trained on sequences with length uniformly distributed in range . This considerably facilitates the training since RNN has chances to learn the desired patterns from short sequences and then to generalize them to long ones, as shown in Experiment 1. We follow the configurations in [3, 4] to ensure that there is no short time lag training exemplar to facilitate learning.
Among these eight problems, the 5-bit memorization problem is special in that it only has 32 distinct input sequences. Hence we set its minibatch size to 32. Then the gradient is exact, no longer stochastic. PSGD applies to deterministic optimization as well, but extra care needs to be taken to prevent an ill-conditioned Hessian from arising, since PSGD is essentially a second-order optimization algorithm. Note that the cross entropy loss is invariant to the sum of elements in $y$. Thus $W_2$ only needs to have $(N_y - 1)(N_x + 1)$ degrees of freedom. Its extra degrees of freedom cause a singular Hessian all over the parameter space. We remove those extra degrees of freedom in $W_2$ by constraining all its columns to have zero sum. We would like to point out that gradient noise in the stochastic gradient naturally regularizes the preconditioner estimation, as shown in [7]. Hence we have no need to remove those extra degrees of freedom in $W_2$ for the other five discrete problems.
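The redundancy argument above can be checked numerically. The sketch below assumes a softmax cross entropy loss applied to the linear output layer, and shows that subtracting each column's mean from the output weight matrix (so that every column sums to zero) shifts all outputs by the same constant and therefore leaves the loss unchanged. The function `cross_entropy`, the sizes, and the random data are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross entropy for one example, computed in a numerically stable way."""
    shifted = logits - logits.max()
    return np.log(np.exp(shifted).sum()) - shifted[label]

rng = np.random.default_rng(0)
Ny, Nx = 4, 6                                   # hypothetical output and hidden sizes
W2 = rng.standard_normal((Ny, Nx + 1))
h = np.append(rng.standard_normal(Nx), 1.0)     # hidden state plus constant bias input

# Remove the redundant degrees of freedom: make every column sum to zero.
W2_zero_sum = W2 - W2.mean(axis=0, keepdims=True)

loss_before = cross_entropy(W2 @ h, label=2)
loss_after = cross_entropy(W2_zero_sum @ h, label=2)
```

Since the loss cannot distinguish the two weight matrices, the directions removed by the projection carry zero curvature, which is the source of the singular Hessian mentioned in the text.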
Only PSGD with a sparse preconditioner is tested. For each problem, four sequence lengths, 30, 50, 100 and 200, are considered. For each problem with each sequence length, five independent runs starting from different random initial guesses are carried out. A run is said to have failed when it does not converge within the maximum allowed number of iterations, which is set to for PSGD. Table I summarizes the failure rate results. Note that RNN training may take a long time. Hence, we have not finished all five runs for a few test cases due to limited resources.
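TABLE I: Failure rates of PSGD (failed runs / completed runs) for each problem and sequence length

| Problem | 30 | 50 | 100 | 200 |
|---|---|---|---|---|
| Addition | 0/5 | 0/5 | 0/5 | 2/5 |
| Multiplication | 0/5 | 0/5 | 0/5 | 0/5 |
| XOR | 0/5 | 0/5 | 3/5 | 1/1 |
| 2-bit temporal order | 0/5 | 0/5 | 0/5 | 3/4 |
| 3-bit temporal order | 0/5 | 0/5 | 0/5 | 2/3 |
| Random permutation | 0/5 | 0/5 | 0/5 | 0/5 |
| 5-bit memorization | 0/5 | 0/5 | 0/5 | 0/5 |
| 20-bit memorization | 0/5 | 0/5 | 0/5 | 0/5 |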
We compare our results with the ones reported in [3]. Since only a few runs are carried out, neither the results here nor the ones in [3] have statistical significance. Thus we compare the maximum sequence length that an algorithm can handle without failure. This criterion favors the results reported in [3] because, for each problem with each sequence length, only four runs are done there, while PSGD has five runs. These results are summarized in Table II. From Table II, we observe that PSGD outperforms Hessian-free optimization with Tikhonov damping on the multiplication, XOR, 2-bit temporal order, 5-bit memorization, and 20-bit memorization problems. PSGD outperforms Hessian-free optimization with structural damping on the multiplication, 2-bit temporal order, random permutation, and 20-bit memorization problems. Overall, PSGD outperforms Hessian-free optimization with either Tikhonov damping or structural damping, and its performance is no worse than the best achieved by the two versions of Hessian-free optimization.
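TABLE II: Maximum sequence length handled without failure by Hessian-free (HF) optimization with Tikhonov damping, HF with structural damping, and PSGD

| Problem | HF, Tikhonov | HF, structural | PSGD |
|---|---|---|---|
| Addition | 100 | 100 | 100 |
| Multiplication | 100 | 100 | 200 |
| XOR | 30 | 50 | 50 |
| 2-bit temporal order | 50 | 50 | 100 |
| 3-bit temporal order | 100 | 100 | 100 |
| Random permutation | 200 | 100 | 200 |
| 5-bit memorization | - | 200 | 200 |
| 20-bit memorization | - | 100 | 200 |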
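The comparison criterion can be reproduced from the failure counts in Table I. The sketch below assumes one reading of "maximum sequence length handled without failure": the longest tested length such that it and all shorter tested lengths have zero failed runs. The bit-width prefixes in the row labels and the helper name are assumptions made for this illustration.

```python
lengths = [30, 50, 100, 200]

# Number of failed runs per sequence length, copied from Table I.
failures = {
    "Addition": [0, 0, 0, 2],
    "Multiplication": [0, 0, 0, 0],
    "XOR": [0, 0, 3, 1],
    "2-bit temporal order": [0, 0, 0, 3],
    "3-bit temporal order": [0, 0, 0, 2],
    "Random permutation": [0, 0, 0, 0],
    "5-bit memorization": [0, 0, 0, 0],
    "20-bit memorization": [0, 0, 0, 0],
}

def max_length_without_failure(fail_counts):
    """Longest tested length with zero failures at it and at every shorter length."""
    best = None
    for length, fails in zip(lengths, fail_counts):
        if fails > 0:
            break
        best = length
    return best

psgd_column = {name: max_length_without_failure(f) for name, f in failures.items()}
```

Under this reading, the computed values agree with the PSGD column of Table II.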
III-C. Experiment 3: MNIST Handwritten Digit Recognition
Not every practical RNN learning problem is as pathological as the synthetic problems studied above. Still, PSGD can offer nontrivial advantages over SGD, such as faster and better convergence, even when no long term memory is required. Here, the classic MNIST handwritten digit recognition task is considered [8]. The original images are zero padded to ones. Fig. 2 shows the architecture of a small but deep two-dimensional RNN used to recognize the zero padded images. No long term memory is required, as either dimension only requires eight steps of backpropagation.

Both SGD and PSGD start from the same random initial guess, and use the same step size and minibatch size. PSGD uses a layerwise Kronecker product preconditioner. No preprocessing, pretraining, or artificially distorted version of the original training samples is used. Fig. 3 plots the test error rate convergence curves. Here, the test error rate is the ratio of the number of misclassified testing samples to the total number of testing samples. From Fig. 3, one observes that PSGD always converges faster and better than SGD. It is interesting to compare the test error rates here with those listed in [8] achieved by convolutional neural networks without using distorted versions of the original training samples. Here, SGD and PSGD converge to test error rates and , respectively. They are comparable to the ones listed in [8] achieved using convolutional neural networks without and with pretraining, respectively.

IV. Conclusions and Discussions
Preconditioned stochastic gradient descent (PSGD) is a general and simple learning algorithm that requires little tuning effort. We have tested PSGD on eight pathological synthetic recurrent neural network (RNN) training problems. Although stochastic gradient descent (SGD) may fail badly on these problems, PSGD works quite well on them, and can even outperform Hessian-free optimization, a significantly more complicated algorithm than both SGD and PSGD. While SGD is workable for many practical problems that do not require long term memory, PSGD still provides nontrivial advantages over it, such as faster and better convergence, as demonstrated in the MNIST handwritten digit recognition example.
Unlike many traditional second-order optimization algorithms, which assume a positive definite Hessian, PSGD is designed for both convex and nonconvex optimization. This might explain its superior performance even though its implementation is only slightly more complicated than SGD. PSGD works well with small minibatch sizes, which reduces computational complexity, due to its inherent ability to damp gradient noise naturally, while many off-the-shelf algorithms require a large minibatch size for accurate gradient and cost function evaluations to facilitate line search. Furthermore, PSGD is easier to use since its step size is normalized, saving the trouble of step size selection by either hand tweaking or step size searching algorithms. Its preconditioner can take flexible forms, providing room to trade off performance against complexity. These properties make PSGD an attractive alternative to SGD and many other stochastic optimization algorithms.
References
[1] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
[2] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” arXiv:1211.5063, 2012.
[3] J. Martens and I. Sutskever, “Learning recurrent neural networks with Hessian-free optimization,” in Proc. of the 28th ICML, 2011.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: a search space odyssey,” arXiv:1503.04069, 2015.
[6] N. Schraudolph, “Fast curvature matrix-vector products for second-order gradient descent,” Neural Computation, vol. 14, no. 7, pp. 1723–1738, 2002.
[7] X.-L. Li, “Preconditioned stochastic gradient descent,” arXiv:1512.04202, 2015.
[8] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database.” Retrieved from http://yann.lecun.com/exdb/mnist/.