Home

The problem

  • Suppose we have two sets of data $X$ and $Y$, where each consists of an ordered stream of datapoints, $(x_1, x_2, \ldots, x_t, \ldots)$ and $(y_1, y_2, \ldots, y_s, \ldots)$ respectively, for example consecutive frames from a video.
  • The goal is to learn a mapping from the domain $X$ to the domain $Y$.
  • A naive approach would treat each set of sequences as an unordered set of all the elements of those sequences, disregarding the ordering within them.
  • For example, we would treat a set of video clips as a set of the individual frames in those clips.
  • It turns out that taking advantage of the temporal ordering yields much better results.

The model

  • Recurrent temporal predictor $P$:
    • Takes as input the sequence $x_1, \ldots, x_t$ and predicts the next element $x_{t+1}$ conditioned on the previous elements: $$x_{t+1} = P_X(x_{1:t})$$
    • Loss function $$L_\tau(P_X) = \sum_t\lVert x_{t+1} - P_X(x_{1:t})\rVert^2$$
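As a sketch, the recurrent loss can be written framework-agnostically; here `P` is any callable predictor, and the fixed history window `k` (how many past frames `P` actually consumes) is an assumption, standing in for the full prefix $x_{1:t}$:

```python
import numpy as np

def recurrent_loss(P, xs, k=2):
    """L_tau: sum of squared errors between each frame and the
    predictor's output on the preceding k frames."""
    total = 0.0
    for t in range(k, len(xs)):
        pred = P(xs[t - k:t])          # stands in for P(x_{1:t})
        total += float(np.sum((xs[t] - pred) ** 2))
    return total

# Toy check: linear extrapolation from the last two frames predicts
# linearly increasing frames exactly, so the loss is 0.
P = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((4, 4), float(t)) for t in range(6)]
print(recurrent_loss(P, xs))  # 0.0
```

A predictor that just repeats the last frame would be off by one everywhere here, which the loss penalizes.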
  • Based on this, a recycle loss is defined: $$L_r(G_X, G_Y, P_Y) = \sum_t \lVert x_{t+1} - G_X(P_Y(G_Y(x_{1:t}))) \rVert^2$$
  • Here the predictor takes as input a sequence of generated samples $G_Y(x_1),...,G_Y(x_t)$ to predict the next one.
  • The generator $G_Y$ maps from $X$ to $Y$.
  • The predicted samples are then mapped back to $X$ via $G_X$ and the loss between these and the elements of the original sequence is minimised.
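A minimal sketch of this recycle loss, with `G_X`, `G_Y`, `P_Y` as arbitrary callables; as above, the fixed window `k` is an assumption standing in for the full prefix:

```python
import numpy as np

def recycle_loss(G_X, G_Y, P_Y, xs, k=2):
    """L_r(G_X, G_Y, P_Y): map the x's into the Y domain, predict the
    next generated sample there, map it back with G_X, and compare it
    to the true next frame."""
    fake_ys = [G_Y(x) for x in xs]
    total = 0.0
    for t in range(k, len(xs)):
        y_next = P_Y(fake_ys[t - k:t])     # predicted next generated sample
        total += float(np.sum((xs[t] - G_X(y_next)) ** 2))
    return total

# Toy check: with G_X the exact inverse of G_Y and a perfect predictor,
# the recycle path reproduces the original sequence and the loss is 0.
G_Y = lambda x: 2.0 * x
G_X = lambda y: y / 2.0
P_Y = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((4, 4), float(t)) for t in range(6)]
print(recycle_loss(G_X, G_Y, P_Y, xs))  # 0.0
```

If $G_X$ fails to invert $G_Y$ the loop no longer closes, so the loss becomes positive — this is what pushes the two mappings to be consistent.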
  • The complete loss function: $$\underset{\text{generator loss for $G_X$}}{L_g(G_X, D_X)} + \underset{\text{generator loss for $G_Y$}}{L_g(G_Y, D_Y)} \\ + \underset{\text{recycle loss for $Y \longrightarrow X$}}{\lambda_{rx} L_r(G_X, G_Y, P_Y)} + \underset{\text{recycle loss for $X \longrightarrow Y$}}{\lambda_{ry} L_r(G_Y, G_X, P_X)} \\ + \underset{\text{recurrent loss for $X$}}{\lambda_{\tau x} L_\tau(P_X)} + \underset{\text{recurrent loss for $Y$}}{\lambda_{\tau y} L_\tau(P_Y)}$$
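The complete objective is then just a weighted sum of the six terms; a trivial sketch (the default $\lambda = 10$ follows the implementation notes below):

```python
def total_loss(L_g_x, L_g_y, L_r_yx, L_r_xy, L_tau_x, L_tau_y,
               lam_rx=10.0, lam_ry=10.0, lam_tx=10.0, lam_ty=10.0):
    """Weighted sum of the two generator losses, the two recycle
    losses, and the two recurrent losses."""
    return (L_g_x + L_g_y
            + lam_rx * L_r_yx + lam_ry * L_r_xy
            + lam_tx * L_tau_x + lam_ty * L_tau_y)

print(total_loss(1, 1, 1, 1, 1, 1))  # 42.0
```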

Generating sequences

  • A naive approach is to generate the video frame by frame, where $y_t = G_Y(x_t)$.
  • However, we could also incorporate temporal information by using $P_Y$ to smooth the output:

    $$y_t = f(G_Y(x_t),P_Y(G_Y(x_{1:t-1})))$$

  • Here $f$ could be simple averaging:

    $$y_t = \frac{G_Y(x_t) + P_Y(G_Y(x_{1:t-1}))}{2}$$

  • It could also be a non-linear function and possibly one that is learned.
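Putting the above together, a sketch of sequence generation with simple averaging as $f$; falling back to the direct translation for the first frames, before enough history exists, is my assumption:

```python
import numpy as np

def generate_sequence(G_Y, P_Y, xs, k=2):
    """Translate frame by frame, averaging the direct translation
    G_Y(x_t) with the temporal prediction P_Y over the generated
    history (the last k fake frames)."""
    ys = []
    for t, x in enumerate(xs):
        y_direct = G_Y(x)
        if t >= k:
            y_pred = P_Y([G_Y(x_prev) for x_prev in xs[t - k:t]])
            ys.append((y_direct + y_pred) / 2)   # f = simple averaging
        else:
            ys.append(y_direct)                  # not enough history yet
    return ys

# Toy check with the same linear stand-ins as above.
G_Y = lambda x: 2.0 * x
P_Y = lambda hist: 2 * hist[-1] - hist[-2]
xs = [np.full((2, 2), float(t)) for t in range(5)]
ys = generate_sequence(G_Y, P_Y, xs)
```

With a learned, non-linear $f$, the averaging line would be replaced by a call to that module over the pair of candidates.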

Implementation details

  • Spatial translation model uses CycleGAN
  • Temporal prediction model uses Pix2Pix
  • Discriminator is a $70 \times 70$ PatchGAN
  • Same network architecture for $G_X$ and $G_Y$
  • Input size is $256 \times 256$
  • Temporal predictors
    • U-Net architecture
    • Input is the last two frames (does this mean $P_X(x_{1:t}) = P_X(x_{t-1}, x_t)$?)
  • All the loss weights are set to $\lambda = 10$