Home

The problem

  • Suppose we have two sets of data $X$ and $Y$, where each consists of an ordered stream of datapoints, $(x_1, x_2, \ldots, x_t, \ldots)$ and $(y_1, y_2, \ldots, y_s, \ldots)$ respectively, for example consecutive frames from a video.
  • The goal is to learn a mapping from the domain $X$ to the domain $Y$.
  • A naive approach would treat each set of sequences as an unordered set of all the elements of those sequences, disregarding the ordering within them.
  • For example, we would treat a set of video clips as a set of the individual frames in those clips.
  • It turns out that taking advantage of the temporal ordering yields much better results.

The model

  • Recurrent temporal predictor $P$:
    • Takes as input the sequence $x_1, \ldots, x_t$ and predicts the next element $x_{t+1}$ conditioned on the previous elements: $$x_{t+1} = P_X(x_{1:t})$$
    • Loss function $$L_\tau(P_X) = \sum_t\lVert x_{t+1} - P_X(x_{1:t})\rVert^2$$
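As a sketch, the recurrent loss can be written framework-agnostically; here `P` is any callable predictor, and the fixed history window `k` (how many past frames `P` actually consumes) is an assumption, standing in for the full prefix $x_{1:t}$:

```python
import numpy as np

def recurrent_loss(P, xs, k=2):
    """L_tau: sum of squared errors between each frame and the
    predictor's output on the preceding k frames."""
    total = 0.0
    for t in range(k, len(xs)):
        pred = P(xs[t - k:t])          # stands in for P(x_{1:t})
        total += float(np.sum((xs[t] - pred) ** 2))
    return total

# Toy check: linear extrapolation from the last two frames predicts
# linearly increasing frames exactly, so the loss is 0.
P = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((4, 4), float(t)) for t in range(6)]
print(recurrent_loss(P, xs))  # 0.0
```

A predictor that just repeats the last frame would be off by one everywhere here, which the loss penalizes.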
  • Based on this, a recycle loss is defined: $$L_r(G_X, G_Y, P_Y) = \sum_t \lVert x_{t+1} - G_X(P_Y(G_Y(x_{1:t}))) \rVert^2$$
  • Here the predictor takes as input a sequence of generated samples $G_Y(x_1),...,G_Y(x_t)$ to predict the next one.
  • The generator $G_Y$ maps from $X$ to $Y$.
  • The predicted samples are then mapped back to $X$ via $G_X$ and the loss between these and the elements of the original sequence is minimised.
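A minimal sketch of this recycle loss, with `G_X`, `G_Y`, `P_Y` as arbitrary callables; as above, the fixed window `k` is an assumption standing in for the full prefix:

```python
import numpy as np

def recycle_loss(G_X, G_Y, P_Y, xs, k=2):
    """L_r(G_X, G_Y, P_Y): map the x's into the Y domain, predict the
    next generated sample there, map it back with G_X, and compare it
    to the true next frame."""
    fake_ys = [G_Y(x) for x in xs]
    total = 0.0
    for t in range(k, len(xs)):
        y_next = P_Y(fake_ys[t - k:t])     # predicted next generated sample
        total += float(np.sum((xs[t] - G_X(y_next)) ** 2))
    return total

# Toy check: with G_X the exact inverse of G_Y and a perfect predictor,
# the recycle path reproduces the original sequence and the loss is 0.
G_Y = lambda x: 2.0 * x
G_X = lambda y: y / 2.0
P_Y = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((4, 4), float(t)) for t in range(6)]
print(recycle_loss(G_X, G_Y, P_Y, xs))  # 0.0
```

If $G_X$ fails to invert $G_Y$ the loop no longer closes, so the loss becomes positive — this is what pushes the two mappings to be consistent.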
  • The complete loss function: $$\underset{\text{generator loss for $G_X$}}{L_g(G_X, D_X)} + \underset{\text{generator loss for $G_Y$}}{L_g(G_Y, D_Y)} \\ + \underset{\text{recycle loss for $Y \longrightarrow X$}}{\lambda_{rx} L_r(G_X, G_Y, P_Y)} + \underset{\text{recycle loss for $X \longrightarrow Y$}}{\lambda_{ry} L_r(G_Y, G_X, P_X)} \\ + \underset{\text{recurrent loss for $X$}}{\lambda_{\tau x} L_\tau(P_X)} + \underset{\text{recurrent loss for $Y$}}{\lambda_{\tau y} L_\tau(P_Y)}$$
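The complete objective is then just a weighted sum of the six terms; a trivial sketch (the default $\lambda = 10$ follows the implementation notes below):

```python
def total_loss(L_g_x, L_g_y, L_r_yx, L_r_xy, L_tau_x, L_tau_y,
               lam_rx=10.0, lam_ry=10.0, lam_tx=10.0, lam_ty=10.0):
    """Weighted sum of the two generator losses, the two recycle
    losses, and the two recurrent losses."""
    return (L_g_x + L_g_y
            + lam_rx * L_r_yx + lam_ry * L_r_xy
            + lam_tx * L_tau_x + lam_ty * L_tau_y)

print(total_loss(1, 1, 1, 1, 1, 1))  # 42.0
```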

Generating sequences

  • A naive approach is to generate the video frame by frame, where $y_t = G_Y(x_t)$.
  • However, we could also incorporate temporal information by using $P_Y$ to smooth the output:

    $$y_t = f(G_Y(x_t),P_Y(G_Y(x_{1:t-1})))$$

  • Here $f$ could be simple averaging:

    $$y_t = \frac{G_Y(x_t) + P_Y(G_Y(x_{1:t-1}))}{2}$$

  • It could also be a non-linear function and possibly one that is learned.
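Putting the above together, a sketch of sequence generation with simple averaging as $f$; falling back to the direct translation for the first frames, before enough history exists, is my assumption:

```python
import numpy as np

def generate_sequence(G_Y, P_Y, xs, k=2):
    """Translate frame by frame, averaging the direct translation
    G_Y(x_t) with the temporal prediction P_Y over the generated
    history (the last k fake frames)."""
    ys = []
    for t, x in enumerate(xs):
        y_direct = G_Y(x)
        if t >= k:
            y_pred = P_Y([G_Y(x_prev) for x_prev in xs[t - k:t]])
            ys.append((y_direct + y_pred) / 2)   # f = simple averaging
        else:
            ys.append(y_direct)                  # not enough history yet
    return ys

# Toy check with the same linear stand-ins as above.
G_Y = lambda x: 2.0 * x
P_Y = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((2, 2), float(t)) for t in range(5)]
ys = generate_sequence(G_Y, P_Y, xs)
```

With a learned, non-linear $f$, the averaging line would be replaced by a call to that module over the pair of candidates.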

Implementation details

  • Spatial translation model uses CycleGAN
  • Temporal prediction model uses Pix2Pix
  • Discriminator is a $70 \times 70$ PatchGAN
  • Same network architecture for $G_X$ and $G_Y$
  • Input size is $256 \times 256$
  • Temporal predictors
    • U-Net architecture
    • Input is the last two frames (does this mean $P_X(x_{1:t}) = P_X(x_{t-1}, x_t)$?)
  • All the loss weights are set to $\lambda = 10$