Feed forward pass

f_t = sigmoid(h_{t-1} * W_f + x_t * U_f)

i_t = sigmoid(h_{t-1} * W_i + x_t * U_i)

g_t = tanh(h_{t-1} * W_g + x_t * U_g)

o_t = sigmoid(h_{t-1} * W_o + x_t * U_o)

C_t = C_{t-1} ⋅ f_t + i_t ⋅ g_t

h_t = y_t = tanh(C_t) ⋅ o_t
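The six equations of the feed forward pass can be sketched in NumPy as follows. This is only a minimal illustration: the row-vector layout, the dictionaries `W` and `U` keyed by gate name, the sizes `n_in` and `n_hid`, and the random initialisation are all assumptions made for the example. Like the equations, the sketch has no bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward_step(x_t, h_prev, C_prev, W, U):
    """One LSTM step; W and U are dicts keyed by 'f', 'i', 'g', 'o' (assumed layout)."""
    f_t = sigmoid(h_prev @ W["f"] + x_t @ U["f"])   # forget gate
    i_t = sigmoid(h_prev @ W["i"] + x_t @ U["i"])   # input gate
    g_t = np.tanh(h_prev @ W["g"] + x_t @ U["g"])   # candidate values
    o_t = sigmoid(h_prev @ W["o"] + x_t @ U["o"])   # output gate
    C_t = C_prev * f_t + i_t * g_t                  # new cell state
    h_t = np.tanh(C_t) * o_t                        # new hidden state (also the output y_t)
    return h_t, C_t, (f_t, i_t, g_t, o_t)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "figo"}
U = {k: rng.normal(scale=0.1, size=(n_in, n_hid)) for k in "figo"}

x_t = rng.normal(size=(1, n_in))
h_prev = np.zeros((1, n_hid))
C_prev = np.zeros((1, n_hid))

h_t, C_t, gates = lstm_forward_step(x_t, h_prev, C_prev, W, U)
print(h_t.shape, C_t.shape)
```

The gate activations are returned alongside h_t and C_t because the back propagation pass needs them later.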

Back propagation pass

To perform BPTT with an LSTM unit, we have the error coming from the top layer (δ1), from the future cell state (δ4), and from the future hidden state (δ2). We have also stored the states computed at each step of the feed forward pass. The errors coming from the future (δ2 and δ4) are simply set to zero if they have not been calculated yet, as is the case at the last time step. By convention, ⋅ denotes point-wise multiplication, while * denotes matrix multiplication.
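To make the two multiplication symbols concrete, here is a small NumPy illustration. The arrays are made up for the example. In NumPy the point-wise product is written `*` and the matrix product is written `@`, which is the opposite of what the symbol `*` means in the equations here.

```python
import numpy as np

a = np.array([[1.0, 2.0]])            # row vector, shape (1, 2)
b = np.array([[3.0, 4.0]])            # row vector, shape (1, 2)
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])       # matrix, shape (2, 3)

pointwise = a * b   # the "⋅" of the equations: element by element, shape (1, 2)
matmul = a @ M      # the "*" of the equations: matrix product, shape (1, 3)

print(pointwise)
print(matmul)
```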

The rules on how to back-propagate come from this post.

δ3 = δ1 + δ2

δ5 = δ3 ⋅ 6 = δ3 ⋅ o_t

δ6 = δ3 ⋅ 5 = δ3 ⋅ tanh(C_t)

δ7 = δ5 ⋅ f′(5) = δ5 ⋅ tanh′(tanh(C_t))

δ8 = δ7 + δ4

δ9 = δ8 ⋅ 10 = δ8 ⋅ i_t

δ10 = δ8 ⋅ 9 = δ8 ⋅ g_t

δ11 = δ8 ⋅ 12 = δ8 ⋅ f_t

δ12 = δ8 ⋅ 11 = δ8 ⋅ C_{t-1}

δ13 = δ6 ⋅ f′(6) = δ6 ⋅ sigmoid′(o_t)

δ14 = δ9 ⋅ f′(9) = δ9 ⋅ tanh′(g_t)

δ15 = δ10 ⋅ f′(10) = δ10 ⋅ sigmoid′(i_t)

δ16 = δ12 ⋅ f′(12) = δ12 ⋅ sigmoid′(f_t)
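In these formulas the derivative is applied to the stored activation (o_t, g_t, i_t, f_t, or tanh(C_t)) rather than to the pre-activation. One reading, assumed here, is that f′ is the derivative written in terms of the function's output: sigmoid′(y) = y ⋅ (1 − y) and tanh′(y) = 1 − y². The sketch below uses hypothetical helper names and checks both identities against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid_from_output(y):
    """Derivative of sigmoid, given y = sigmoid(z)."""
    return y * (1.0 - y)

def dtanh_from_output(y):
    """Derivative of tanh, given y = tanh(z)."""
    return 1.0 - y ** 2

# Compare with central finite differences on a few sample points.
z = np.linspace(-2.0, 2.0, 5)
eps = 1e-6
num_dsig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_dtanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.max(np.abs(dsigmoid_from_output(sigmoid(z)) - num_dsig)))
print(np.max(np.abs(dtanh_from_output(np.tanh(z)) - num_dtanh)))
```

Writing the derivatives this way means the backward pass needs only the activations already stored during the feed forward pass.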

δ17 = δ13 * U_o^T

δ19 = δ14 * U_g^T

δ21 = δ15 * U_i^T

δ23 = δ16 * U_f^T

δ18 = δ13 * W_o^T

δ20 = δ14 * W_g^T

δ22 = δ15 * W_i^T

δ24 = δ16 * W_f^T

δ25 = δ18 + δ20 + δ22 + δ24

δ26 = δ17 + δ19 + δ21 + δ23

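The errors δ11, δ25 and δ26 are passed on to the neighbouring units: δ11 is the error on C_{t-1} and δ25 the error on h_{t-1}, so they serve as δ4 and δ2 for the previous time step, while δ26 is the error on x_t and goes to the layer below. Once all those errors are available, it is possible to calculate the weight update.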
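The whole chain from δ1, δ2 and δ4 down to δ11, δ25 and δ26 can be sketched for a single time step and checked numerically. The code is an illustration under assumptions, not a reference implementation: the function names, the dict layout of `W` and `U`, and the random test values are made up. It adds δ7 and δ4 to form δ8, since gradients arriving at the same node sum, and it pairs each gate error with its own weights (δ15 with W_i and U_i, δ16 with W_f and U_f).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, h_prev, C_prev, W, U):
    f = sigmoid(h_prev @ W["f"] + x @ U["f"])
    i = sigmoid(h_prev @ W["i"] + x @ U["i"])
    g = np.tanh(h_prev @ W["g"] + x @ U["g"])
    o = sigmoid(h_prev @ W["o"] + x @ U["o"])
    C = C_prev * f + i * g
    h = np.tanh(C) * o
    return h, C, (f, i, g, o)

def backward(d1, d2, d4, C, C_prev, gates, W, U):
    f, i, g, o = gates
    tC = np.tanh(C)
    d3 = d1 + d2                      # total error on h_t
    d5 = d3 * o                       # error on tanh(C_t)
    d6 = d3 * tC                      # error on o_t
    d7 = d5 * (1.0 - tC ** 2)         # through tanh
    d8 = d7 + d4                      # total error on C_t
    d9, d10 = d8 * i, d8 * g          # errors on g_t and i_t
    d11, d12 = d8 * f, d8 * C_prev    # errors on C_{t-1} and f_t
    d13 = d6 * o * (1.0 - o)          # pre-activation error, output gate
    d14 = d9 * (1.0 - g ** 2)         # pre-activation error, candidate
    d15 = d10 * i * (1.0 - i)         # pre-activation error, input gate
    d16 = d12 * f * (1.0 - f)         # pre-activation error, forget gate
    d25 = d13 @ W["o"].T + d14 @ W["g"].T + d15 @ W["i"].T + d16 @ W["f"].T
    d26 = d13 @ U["o"].T + d14 @ U["g"].T + d15 @ U["i"].T + d16 @ U["f"].T
    return d11, d25, d26, (d13, d14, d15, d16)

def numeric_grad(fn, v, eps=1e-6):
    """Central finite-difference gradient of the scalar function fn at v."""
    grad = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        vp, vm = v.copy(), v.copy()
        vp[idx] += eps
        vm[idx] -= eps
        grad[idx] = (fn(vp) - fn(vm)) / (2 * eps)
    return grad

# Hypothetical sizes and random values, for the check only.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.5, size=(n_hid, n_hid)) for k in "figo"}
U = {k: rng.normal(scale=0.5, size=(n_in, n_hid)) for k in "figo"}
x = rng.normal(size=(1, n_in))
h_prev = rng.normal(size=(1, n_hid))
C_prev = rng.normal(size=(1, n_hid))
d1 = rng.normal(size=(1, n_hid))      # error from the top layer
d2 = np.zeros((1, n_hid))             # no future hidden-state error
d4 = rng.normal(size=(1, n_hid))      # error from the future cell

def loss(x_, h_, C_):
    # Scalar whose gradients with respect to h_t and C_t are d1 and d4.
    h, C, _ = forward(x_, h_, C_, W, U)
    return np.sum(h * d1) + np.sum(C * d4)

h, C, gates = forward(x, h_prev, C_prev, W, U)
d11, d25, d26, _ = backward(d1, d2, d4, C, C_prev, gates, W, U)

err_x = np.max(np.abs(d26 - numeric_grad(lambda v: loss(v, h_prev, C_prev), x)))
err_h = np.max(np.abs(d25 - numeric_grad(lambda v: loss(x, v, C_prev), h_prev)))
err_C = np.max(np.abs(d11 - numeric_grad(lambda v: loss(x, h_prev, v), C_prev)))
print(err_x, err_h, err_C)
```

If the analytic errors match the finite-difference estimates to within numerical noise, the step is consistent with the forward equations.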

δW_f = δW_f + h_{t-1}^T * δ16

δU_f = δU_f + x_t^T * δ16

δW_i = δW_i + h_{t-1}^T * δ15

δU_i = δU_i + x_t^T * δ15

δW_g = δW_g + h_{t-1}^T * δ14

δU_g = δU_g + x_t^T * δ14

δW_o = δW_o + h_{t-1}^T * δ13

δU_o = δU_o + x_t^T * δ13
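Each update is an outer product of a stored input (h_{t-1} or x_t) with a gate error, accumulated over the time steps. The sketch below shows this for the forget gate only. The arrays `h_prevs`, `xs` and `d16s` are random placeholders standing in for the stored states and for the δ16 computed at each step, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 4, 5              # hypothetical sizes and sequence length

# Placeholder values standing in for quantities saved during the two passes.
h_prevs = rng.normal(size=(T, 1, n_hid))   # h_{t-1} for each step
xs = rng.normal(size=(T, 1, n_in))         # x_t for each step
d16s = rng.normal(size=(T, 1, n_hid))      # δ16 for each step

dW_f = np.zeros((n_hid, n_hid))
dU_f = np.zeros((n_in, n_hid))

# BPTT walks the sequence backwards; the accumulation itself is a plain sum.
for t in reversed(range(T)):
    dW_f += h_prevs[t].T @ d16s[t]    # δW_f = δW_f + h_{t-1}^T * δ16
    dU_f += xs[t].T @ d16s[t]         # δU_f = δU_f + x_t^T * δ16

print(dW_f.shape, dU_f.shape)
```

The other three gates follow the same pattern, with δ15, δ14 and δ13 in place of δ16.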
