Feed forward pass

r_t = sigmoid(h_{t-1} * W_r + x_t * U_r)

z_t = sigmoid(h_{t-1} * W_z + x_t * U_z)

g_t = tanh((h_{t-1} ⋅ r_t) * W_g + x_t * U_g)

h_t = y_t = h_{t-1} ⋅ (1 − z_t) + (z_t ⋅ g_t)
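As a minimal sketch, the four equations above can be written in NumPy as follows. The function name `gru_forward`, the dictionary of weights and the layer sizes are assumptions made for illustration. Inputs are treated as row vectors, so `@` plays the role of * and `*` the role of ⋅, and the candidate term is written as (h_{t-1} ⋅ r_t) * W_g so that its shapes match those of the two gates.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_forward(x_t, h_prev, p):
    """One GRU step; x_t and h_prev are row vectors (or batches of rows)."""
    r_t = sigmoid(h_prev @ p["W_r"] + x_t @ p["U_r"])          # reset gate
    z_t = sigmoid(h_prev @ p["W_z"] + x_t @ p["U_z"])          # update gate
    g_t = np.tanh((h_prev * r_t) @ p["W_g"] + x_t @ p["U_g"])  # candidate state
    h_t = h_prev * (1.0 - z_t) + z_t * g_t                     # new state, also the output y_t
    return h_t, (r_t, z_t, g_t)

# Hypothetical sizes: batch of 2, 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {f"W_{k}": rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "rzg"}
params.update({f"U_{k}": rng.normal(scale=0.1, size=(n_in, n_hid)) for k in "rzg"})

x = rng.normal(size=(2, n_in))
h0 = np.zeros((2, n_hid))
h1, cache = gru_forward(x, h0, params)
print(h1.shape)
```

The returned tuple holds r_t, z_t and g_t, which are the states that need to be stored for the back propagation pass.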

Back propagation pass

To perform BPTT with a GRU unit, we have the error coming from the top layer (δ1) and the error coming from the future hidden state (δ2). We have also stored, during the feed forward pass, the states at each time step. The error from the future hidden state is simply set to zero if it has not been calculated yet. By convention, ⋅ denotes pointwise multiplication, while * denotes matrix multiplication.
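To make the two symbols concrete, here is a small NumPy illustration. The arrays are made-up values, and mapping ⋅ to `*` and * to `@` is an assumption about how one would implement the notation, not something the formulas prescribe.

```python
import numpy as np

h = np.array([[1.0, 2.0]])        # row vector, shape (1, 2)
r = np.array([[0.5, 0.1]])        # gate values, same shape as h
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])   # weight matrix, shape (2, 3)

pointwise = h * r                 # the text's "⋅": element by element, shape (1, 2)
matmul = h @ W                    # the text's "*": matrix product, shape (1, 3)
print(pointwise, matmul)
```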

The rules on how to back-propagate come from this post.

δ3 = δ1 + δ2

δ4 = (1 − z_t) ⋅ δ3

δ5 = δ3 ⋅ h_{t-1}

δ6 = −δ5

δ7 = δ3 ⋅ g_t

δ8 = δ3 ⋅ z_t

δ9 = δ6 + δ7

δ10 = δ8 ⋅ tanh′(g_t)

δ11 = δ9 ⋅ sigmoid′(z_t)

δ12 = δ10 * W_g^T

δ13 = δ10 * U_g^T

δ14 = δ11 * W_z^T

δ15 = δ11 * U_z^T

δ16 = δ12 ⋅ h_{t-1}

δ17 = δ12 ⋅ r_t

δ18 = δ16 ⋅ sigmoid′(r_t)

δ19 = δ17 + δ4

δ20 = δ18 * W_r^T

δ21 = δ18 * U_r^T

δ22 = δ20 + δ14

δ23 = δ19 + δ22

δ24 = δ13 + δ15 + δ21
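The chain of errors can be checked numerically. The sketch below derives each δ directly from the feed forward equations, with W applied to h_{t-1} and U applied to x_t, and compares the two outgoing errors with central finite differences. The helper names (`forward`, `backward`, `numeric_grad`), the sizes and the random values are assumptions for illustration, and tanh′ and sigmoid′ are taken to be expressed through the stored outputs, i.e. 1 − g_t² and s ⋅ (1 − s).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, h_prev, p):
    r = sigmoid(h_prev @ p["W_r"] + x @ p["U_r"])
    z = sigmoid(h_prev @ p["W_z"] + x @ p["U_z"])
    g = np.tanh((h_prev * r) @ p["W_g"] + x @ p["U_g"])
    return h_prev * (1.0 - z) + z * g, r, z, g

def backward(d1, d2, h_prev, r, z, g, p):
    d3 = d1 + d2                 # error on h_t
    d4 = (1.0 - z) * d3          # to h_{t-1}, direct path
    d5 = d3 * h_prev             # to (1 - z_t)
    d6 = -d5                     # to z_t through (1 - z_t)
    d7 = d3 * g                  # to z_t through z_t ⋅ g_t
    d8 = d3 * z                  # to g_t
    d9 = d6 + d7                 # total error on z_t
    d10 = d8 * (1.0 - g ** 2)    # through tanh
    d11 = d9 * z * (1.0 - z)     # through the z_t sigmoid
    d12 = d10 @ p["W_g"].T       # to (h_{t-1} ⋅ r_t)
    d13 = d10 @ p["U_g"].T       # to x_t through g_t
    d14 = d11 @ p["W_z"].T       # to h_{t-1} through z_t
    d15 = d11 @ p["U_z"].T       # to x_t through z_t
    d16 = d12 * h_prev           # to r_t
    d17 = d12 * r                # to h_{t-1} through g_t
    d18 = d16 * r * (1.0 - r)    # through the r_t sigmoid
    d19 = d17 + d4
    d20 = d18 @ p["W_r"].T       # to h_{t-1} through r_t
    d21 = d18 @ p["U_r"].T       # to x_t through r_t
    d22 = d20 + d14
    d23 = d19 + d22              # total error on h_{t-1}
    d24 = d13 + d15 + d21        # total error on x_t
    return d23, d24

def numeric_grad(loss, a, eps=1e-6):
    """Central finite differences of loss() with respect to array a (perturbed in place)."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        up = loss()
        a[idx] = old - eps
        down = loss()
        a[idx] = old
        grad[idx] = (up - down) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {f"W_{k}": rng.normal(scale=0.5, size=(n_hid, n_hid)) for k in "rzg"}
p.update({f"U_{k}": rng.normal(scale=0.5, size=(n_in, n_hid)) for k in "rzg"})
x = rng.normal(size=(1, n_in))
h_prev = rng.normal(size=(1, n_hid))
d1 = rng.normal(size=(1, n_hid))   # error from the top layer
d2 = np.zeros((1, n_hid))          # no future step yet, so set to zero

h, r, z, g = forward(x, h_prev, p)
d23, d24 = backward(d1, d2, h_prev, r, z, g, p)

def loss():
    # Scalar whose gradient with respect to h_t is exactly d1 + d2.
    return np.sum(forward(x, h_prev, p)[0] * (d1 + d2))

num_h = numeric_grad(loss, h_prev)
num_x = numeric_grad(loss, x)
print(np.max(np.abs(d23 - num_h)), np.max(np.abs(d24 - num_x)))
```

Both printed differences should be close to zero if the chain is consistent with the forward pass. A finite-difference check of this kind is a cheap way to catch sign or index slips in a hand-derived backward pass.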

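The errors δ23 and δ24 are passed on: δ23 is the error with respect to h_{t-1}, used at the previous time step, and δ24 is the error with respect to x_t, used by the layer below. Once all those errors are available, the weight updates can be calculated.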

δW_r = δW_r + h_{t-1}^T * δ18

δU_r = δU_r + x_t^T * δ18

δW_z = δW_z + h_{t-1}^T * δ11

δU_z = δU_z + x_t^T * δ11

δW_g = δW_g + (h_{t-1} ⋅ r_t)^T * δ10

δU_g = δU_g + x_t^T * δ10
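The accumulation of the weight gradients can be sketched in the same way. In the code below each weight matrix receives the error taken at the input of its own non-linearity: δ18 for the r_t weights, δ11 for the z_t weights and δ10 for the g_t weights. The three errors are recomputed in compact form from the forward equations. The function `accumulate`, the `grads` dictionary and the sizes are hypothetical names and values chosen for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, h_prev, p):
    r = sigmoid(h_prev @ p["W_r"] + x @ p["U_r"])
    z = sigmoid(h_prev @ p["W_z"] + x @ p["U_z"])
    g = np.tanh((h_prev * r) @ p["W_g"] + x @ p["U_g"])
    return h_prev * (1.0 - z) + z * g, r, z, g

def accumulate(grads, d3, x, h_prev, r, z, g, p):
    d10 = d3 * z * (1.0 - g ** 2)                      # error at the input of tanh
    d11 = d3 * (g - h_prev) * z * (1.0 - z)            # error at the input of the z_t sigmoid
    d18 = (d10 @ p["W_g"].T) * h_prev * r * (1.0 - r)  # error at the input of the r_t sigmoid
    grads["W_r"] += h_prev.T @ d18
    grads["U_r"] += x.T @ d18
    grads["W_z"] += h_prev.T @ d11
    grads["U_z"] += x.T @ d11
    grads["W_g"] += (h_prev * r).T @ d10
    grads["U_g"] += x.T @ d10
    return grads

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
p = {f"W_{k}": rng.normal(scale=0.5, size=(n_hid, n_hid)) for k in "rzg"}
p.update({f"U_{k}": rng.normal(scale=0.5, size=(n_in, n_hid)) for k in "rzg"})
x = rng.normal(size=(2, n_in))
h_prev = rng.normal(size=(2, n_hid))
d3 = rng.normal(size=(2, n_hid))   # error arriving on h_t

h, r, z, g = forward(x, h_prev, p)
zeros = {k: np.zeros_like(v) for k, v in p.items()}
grads = accumulate(zeros, d3, x, h_prev, r, z, g, p)

def loss():
    # Scalar whose gradient with respect to h_t is exactly d3.
    return np.sum(forward(x, h_prev, p)[0] * d3)

# Compare every weight gradient with a central finite difference.
eps = 1e-6
max_err = 0.0
for name, w in p.items():
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = loss()
        w[idx] = old - eps
        down = loss()
        w[idx] = old
        max_err = max(max_err, abs((up - down) / (2.0 * eps) - grads[name][idx]))
print(max_err)
```

During BPTT the same `grads` dictionary would be passed to every time step, so that the contributions add up before the weights are changed.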
