
FNN-VAE for noisy time series forecasting

This post didn't end up quite the way I'd imagined. A quick follow-up on the recent Time series prediction with FNN-LSTM, it was supposed to demonstrate how noisy time series (so common in practice) could profit from a change in architecture: Instead of FNN-LSTM, an LSTM autoencoder regularized by false nearest neighbors (FNN) loss, use FNN-VAE, a variational autoencoder constrained by the same. However, FNN-VAE did not seem to handle noise better than FNN-LSTM. No plot, no post, then?

However – this isn't a scientific study, with hypothesis and experimental setup all preregistered; all that really matters is whether there's something useful to report. And it looks like there is.

Firstly, FNN-VAE, while on par performance-wise with FNN-LSTM, is far superior in that other meaning of "performance": Training goes a lot faster for FNN-VAE.

Secondly, while we don't see much difference between FNN-LSTM and FNN-VAE, we do see a clear impact of using FNN loss. Adding in FNN loss strongly reduces mean squared error with respect to the underlying (denoised) series – especially in the case of VAE, but for LSTM as well. This is of particular interest with VAE, as it comes with a regularizer out of the box – namely, Kullback-Leibler (KL) divergence.

Of course, we don't claim that similar results will always be obtained on other noisy series; nor did we tune any of the models "to death." For what could be the intent of such a post but to show our readers interesting (and promising) ideas to pursue in their own experimentation?

The context

This post is the third in a mini-series.

In Deep attractors: Where deep learning meets chaos, we explained, with a substantial detour into chaos theory, the idea of FNN loss, introduced in (Gilpin 2020). Please consult that first post for theoretical background and intuitions behind the technique.

The following post, Time series prediction with FNN-LSTM, showed how to use an LSTM autoencoder, constrained by FNN loss, for forecasting (as opposed to reconstructing an attractor). The results were striking: In multi-step prediction (12-120 steps, with that number varying by dataset), the short-term forecasts were drastically improved by adding in FNN regularization. See that second post for experimental setup and results on four very different, non-synthetic datasets.

Today, we show how to replace the LSTM autoencoder with a – convolutional – VAE. In light of the experimentation results already hinted at above, it is completely plausible that the "variational" part is not even that important here – that a convolutional autoencoder with just MSE loss would have performed just as well on these data. In fact, to find out, it is enough to remove the call to reparameterize() and multiply the KL component of the loss by 0. (We leave running this to the reader, to keep the post at reasonable length; a brief sketch of the modification follows the training routines below.)

One last piece of context, in case you haven't read the two previous posts and would like to jump in here directly. We're doing time series forecasting; so why this talk of autoencoders? Shouldn't we just be comparing an LSTM (or some other kind of RNN, for that matter) to a convnet? In fact, the necessity of a latent representation is due to the very idea of FNN: The latent code is supposed to reflect the true attractor of a dynamical system. That is, if the attractor of the underlying system is roughly two-dimensional, we hope to find that just two of the latent variables have considerable variance. (This reasoning is explained in lots of detail in the previous posts.)

FNN-VAE

So, let's start with the code for our new model.

The encoder takes the time series, of format batch_size x num_timesteps x num_features just like in the LSTM case, and produces a flat, 10-dimensional output: the latent code, which FNN loss is computed on.

library(tensorflow)
library(keras)
library(tfdatasets)
library(tfautograph)
library(reticulate)
library(purrr)

vae_encoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_1d(kernel_size = 3,
                                filters = 16,
                                strides = 2)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d(kernel_size = 7,
                                filters = 32,
                                strides = 2)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d(kernel_size = 9,
                                filters = 64,
                                strides = 2)
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d(
      kernel_size = 9,
      filters = n_latent,
      strides = 2,
      activation = "linear" 
    )
    self$batchnorm4 <- layer_batch_normalization()
    self$flat <- layer_flatten()
    
    function (x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4() %>%
        self$flat()
    }
  })
}
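As a quick sanity check – this snippet is ours, not part of the original model code – we can pass a dummy batch through the encoder and confirm the shapes: with the default "valid" padding, the four strided convolutions shrink the time axis from 120 to 59, 27, 10 and finally 1 step, so flattening leaves exactly n_latent values per sample.

# sketch: verify that the encoder maps (batch, 120, 1) to (batch, n_latent)
enc_check <- vae_encoder_model(n_timesteps = 120, n_features = 1, n_latent = 10)
enc_check(k_random_normal(shape = c(8L, 120L, 1L)))$shape # expected: (8, 10)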

The decoder starts from this – flat – representation and decompresses it into a time series. In both encoder and decoder, the (de-)conv layers' parameters are chosen to handle a sequence length (num_timesteps) of 120, which is what we'll use for prediction below.

vae_decoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$reshape <- layer_reshape(target_shape = c(1, n_latent))
    self$conv1 <- layer_conv_1d_transpose(kernel_size = 15,
                                          filters = 64,
                                          strides = 3)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d_transpose(kernel_size = 11,
                                          filters = 32,
                                          strides = 3)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d_transpose(
      kernel_size = 9,
      filters = 16,
      strides = 2,
      output_padding = 1
    )
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d_transpose(
      kernel_size = 7,
      filters = 1,
      strides = 1,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()
    
    function (x, mask = NULL) {
      x %>%
        self$reshape() %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4()
    }
  })
}
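A matching check for the decoder (again, our own sketch): starting from the flat latent code, the transposed convolutions expand the time axis from 1 to 15, 53, 114 and finally 120 steps – back to the sequence length we want to forecast.

# sketch: verify that the decoder maps (batch, n_latent) back to (batch, 120, 1)
dec_check <- vae_decoder_model(n_timesteps = 120, n_features = 1, n_latent = 10)
dec_check(k_random_normal(shape = c(8L, 10L)))$shape # expected: (8, 120, 1)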

Note that although we called these constructors vae_encoder_model() and vae_decoder_model(), there is nothing variational about these models per se; they are really just an encoder and a decoder, respectively. Metamorphosis into a VAE will happen in the training procedure; in fact, the only two things that will make this a VAE are the reparameterization of the latent layer and the added-in KL loss.

Speaking of training, these are the routines we'll call. The function to compute FNN loss, loss_false_nn(), can be found in both of the aforementioned predecessor posts; we kindly ask the reader to copy it from one of those places.

# to reparameterize encoder output before calling decoder
reparameterize <- function(mean, logvar = 0) {
  eps <- k_random_normal(shape = n_latent)
  eps * k_exp(logvar * 0.5) + mean
}

# loss has 3 components: NLL, KL, and FNN
# otherwise, this is just normal TF2-style training
train_step_vae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    z <- reparameterize(code)
    prediction <- decoder(z)
    
    l_mse <- mse_loss(batch[[2]], prediction)
    # see loss_false_nn in the two previous posts
    l_fnn <- loss_false_nn(code)
    # KL divergence to a standard normal
    l_kl <- -0.5 * k_mean(1 - k_square(z))
    # overall loss is a weighted sum of all 3 components
    loss <- l_mse + fnn_weight * l_fnn + kl_weight * l_kl
  })
  
  encoder_gradients <-
    tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <-
    tape$gradient(loss, decoder$trainable_variables)
  
  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))
  
  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
  train_kl(l_kl)
}

# wrap it all in autograph
training_loop_vae <- tf_function(autograph(function(ds_train) {
  
  for (batch in ds_train) {
    train_step_vae(batch) 
  }
  
  tf$print("Loss: ", train_loss$outcome())
  tf$print("MSE: ", train_mse$outcome())
  tf$print("FNN loss: ", train_fnn$outcome())
  tf$print("KL loss: ", train_kl$outcome())
  
  train_loss$reset_states()
  train_mse$reset_states()
  train_fnn$reset_states()
  train_kl$reset_states()
  
}))
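And here, purely for reference, is a sketch of the ablation suggested in the introduction – not something we ran for this post, just the minimal modification: skip reparameterize() and drop the KL term (equivalently, multiply it by 0), turning FNN-VAE into a plain, FNN-regularized convolutional autoencoder.

# sketch only: non-variational variant of train_step_vae
train_step_ae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    # no sampling step: the decoder sees the latent code directly
    prediction <- decoder(code)
    l_mse <- mse_loss(batch[[2]], prediction)
    l_fnn <- loss_false_nn(code)
    # KL component dropped; only MSE and FNN loss remain
    loss <- l_mse + fnn_weight * l_fnn
  })
  encoder_gradients <- tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <- tape$gradient(loss, decoder$trainable_variables)
  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))
  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
}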

To finish up the model part, here is the actual training code. This is nearly identical to what we did for FNN-LSTM before.

n_latent <- 10L
n_features <- 1

encoder <- vae_encoder_model(n_timesteps,
                         n_features,
                         n_latent)

decoder <- vae_decoder_model(n_timesteps,
                         n_features,
                         n_latent)
mse_loss <-
  tf$keras$losses$MeanSquaredError(reduction = tf$keras$losses$Reduction$SUM)

train_loss <- tf$keras$metrics$Mean(name = 'train_loss')
train_fnn <- tf$keras$metrics$Mean(name = 'train_fnn')
train_mse <-  tf$keras$metrics$Mean(name = 'train_mse')
train_kl <-  tf$keras$metrics$Mean(name = 'train_kl')

fnn_multiplier <- 1 # default value used in nearly all cases (see text)
fnn_weight <- fnn_multiplier * nrow(x_train)/batch_size

kl_weight <- 1

optimizer <- optimizer_adam(lr = 1e-3)

for (epoch in 1:100) {
  cat("Epoch: ", epoch, " -----------n")
  training_loop_vae(ds_train)
 
  test_batch <- as_iterator(ds_test) %>% iter_next()
  encoded <- encoder(test_batch[[1]][1:1000])
  test_var <- tf$math$reduce_variance(encoded, axis = 0L)
  print(test_var %>% as.numeric() %>% round(5))
}
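Forecast generation is not spelled out in this post; one straightforward way (a sketch, and an assumption on our part) is to pass the test inputs through the trained encoder and decoder, skipping the sampling step at prediction time.

# sketch: deterministic 120-step forecasts for the whole test set
test_batch <- as_iterator(ds_test) %>% iter_next()
predictions <- decoder(encoder(test_batch[[1]])) %>% as.array()
dim(predictions) # num_test_windows x 120 x 1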

Experimental setup and data

The idea was to add white noise to a deterministic series. This time, the Roessler system was chosen, mainly for the prettiness of its attractor, apparent even in its two-dimensional projections:

Figure 1: Roessler attractor, two-dimensional projections.

Like we did for the Lorenz system in the first part of this series, we use deSolve to generate data from the Roessler equations.

library(deSolve)
library(tibble) # for as_tibble() below

parameters <- c(a = .2,
                b = .2,
                c = 5.7)

initial_state <-
  c(x = 1,
    y = 1,
    z = 1.05)

roessler <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dx <- -y - z
    dy <- x + a * y
    dz = b + z * (x - c)
    
    list(c(dx, dy, dz))
  })
}

times <- seq(0, 2500, length.out = 20000)

roessler_ts <-
  ode(
    y = initial_state,
    times = times,
    func = roessler,
    parms = parameters,
    method = "lsoda"
  ) %>% unclass() %>% as_tibble()

n <- 10000
roessler <- roessler_ts$x[1:n]

roessler <- scale(roessler)

Then, noise is added, to the desired degree, by drawing from a normal distribution centered at zero, with standard deviations varying between 1 and 2.5.

# add noise
noise <- 1 # also used 1.5, 2, 2.5
roessler <- roessler + rnorm(10000, mean = 0, sd = noise)

Here you can compare the effects of adding no noise (top), standard-deviation-1 noise (middle), and standard-deviation-2.5 Gaussian noise (bottom):

Figure 2: Roessler series with added noise. Top: none. Middle: SD = 1. Bottom: SD = 2.5.

Otherwise, preprocessing proceeds as in the previous posts. In the upcoming results section, we will compare forecasts not just to the "real," after-noise-addition test split of the data, but also to the underlying Roessler system – that is, the thing we are really interested in. (It's just that in the real world, we can't do that check.) This second test set is prepared for forecasting just like the other one; to avoid duplication we don't reproduce the full code, but a brief sketch follows the preprocessing below.

n_timesteps <- 120
batch_size <- 32

gen_timesteps <- function(x, n_timesteps) {
  do.call(rbind,
          purrr::map(seq_along(x),
                     function(i) {
                       start <- i
                       end <- i + n_timesteps - 1
                       out <- x[start:end]
                       out
                     })
  ) %>%
    na.omit()
}
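To see what gen_timesteps() produces, here is a tiny, made-up example (ours, not from the original post): each row is a sliding window of length n_timesteps, and windows that would run past the end of the series are dropped by na.omit().

# sketch: sliding windows of length 3 over the series 1:6
gen_timesteps(1:6, 3)
# a 4 x 3 matrix with rows (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)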

train <- gen_timesteps(roessler[1:(n/2)], 2 * n_timesteps)
test <- gen_timesteps(roessler[(n/2):n], 2 * n_timesteps)

dim(train) <- c(dim(train), 1)
dim(test) <- c(dim(test), 1)

x_train <- train[ , 1:n_timesteps, , drop = FALSE]
y_train <- train[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_train <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(nrow(x_train)) %>%
  dataset_batch(batch_size)

x_test <- test[ , 1:n_timesteps, , drop = FALSE]
y_test <- test[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_test <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(nrow(x_test))
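The noise-free targets could be prepared analogously; purely as a sketch, and assuming a copy of the scaled series (here called roessler_clean, a name of our choosing) was set aside before the rnorm() step, it might look like this.

# sketch: windowed targets from the un-noised series, used for evaluation only
test_clean <- gen_timesteps(roessler_clean[(n/2):n], 2 * n_timesteps)
dim(test_clean) <- c(dim(test_clean), 1)
y_test_clean <- test_clean[ , (n_timesteps + 1):(2 * n_timesteps), , drop = FALSE]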

Results

The LSTM used for comparison with the VAE described above is identical to the architecture employed in the previous post. While with the VAE, an fnn_multiplier of 1 yielded sufficient regularization for all noise levels, some more experimentation was needed for the LSTM: At noise levels 2 and 2.5, that multiplier was set to 5.

As a result, in all cases, there was one latent variable with high variance and a second one of minor importance. For all others, variance was close to 0.

"In all cases" here means: in all cases where FNN regularization was used. As already hinted at in the introduction, the main regularizing factor providing robustness to noise here seems to be FNN loss, not KL divergence. So for all noise levels, in addition to the FNN-regularized LSTM and VAE models we also tested their non-constrained counterparts.

Low noise

Seeing how all models did great on the original deterministic series, a noise level of 1 can almost be treated as a baseline. Here you see sixteen 120-timestep predictions from both regularized models, FNN-VAE (dark blue) and FNN-LSTM (orange). The noisy test data, both input (x, 120 steps) and output (y, 120 steps), are displayed in (blue-ish) grey. In green, also spanning the whole sequence, we have the original Roessler data, the way they would look had no noise been added.

Figure 3: Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: predictions from FNN-LSTM. Dark blue: predictions from FNN-VAE.

Despite the noise, forecasts from both models look excellent. Is this due to the FNN regularizer?

Comparing with forecasts from their unregularized counterparts, we have to admit these don't look any worse. (For better comparability, the sixteen sequences to forecast were initially picked at random, but used to test all models and conditions.)

Figure 4: Roessler series with added Gaussian noise of standard deviation 1. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: predictions from unregularized LSTM. Dark blue: predictions from unregularized VAE.

What happens when we start to add noise?

Substantial noise

Between noise levels 1.5 and 2, something changed, or at least became noticeable from visual inspection. Let's jump directly to the highest level used, though: 2.5.

Here, first, are predictions obtained from the unregularized models.

Figure 5: Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: predictions from unregularized LSTM. Dark blue: predictions from unregularized VAE.

Both LSTM and VAE get "distracted" a bit too much by the noise, the latter to an even higher degree. This leads to cases where predictions strongly "overshoot" the underlying non-noisy rhythm. This is not surprising, of course: They were trained on the noisy version; predicting fluctuations is what they learned.

Do we see the same with the FNN models?

Figure 6: Roessler series with added Gaussian noise of standard deviation 2.5. Grey: actual (noisy) test data. Green: underlying Roessler system. Orange: predictions from FNN-LSTM. Dark blue: predictions from FNN-VAE.

Interestingly, we now see a much better fit to the underlying Roessler system! Especially the VAE model, FNN-VAE, surprises with a whole new smoothness of predictions; but FNN-LSTM produces much smoother forecasts as well.

"Smooth, fitting the system…" – by now you may be wondering: when are we going to come up with more quantitative assertions? If quantitative implies "mean squared error" (MSE), and if MSE is taken to be some divergence between forecasts and the true target from the test set, the answer is that this MSE does not differ much between any of the four architectures. Put differently, it is mostly a function of noise level.

However, we could argue that what we are really interested in is how well a model forecasts the underlying process. And there, we see differences.
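Concretely, and sticking with the hypothetical predictions and y_test_clean objects from the sketches above, the two flavors of MSE amount to something like:

# sketch: MSE against the noisy targets vs. against the underlying system
mse_vs_noisy      <- mean((predictions - y_test)^2)
mse_vs_underlying <- mean((predictions - y_test_clean)^2)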

In the following plot, we contrast the MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). The rows reflect noise levels (1, 1.5, 2, 2.5); the columns represent MSE in relation to the noisy ("real") target (left) on the one hand, and in relation to the underlying system (right) on the other. For better visibility of the effect, MSEs have been normalized as fractions of the maximum MSE in a category.

So, if we want to predict signal plus noise (left), it is not extremely important whether we use FNN or not. But if we want to predict the signal only (right), FNN loss becomes increasingly effective with increasing noise in the data. This effect is far stronger for VAE vs. FNN-VAE than for LSTM vs. FNN-LSTM: The distance between the grey line (VAE) and the dark blue one (FNN-VAE) becomes larger and larger as we add more noise.

Figure 7: Normalized MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green: FNN-LSTM). Rows are noise levels (1, 1.5, 2, 2.5); columns are MSE as related to the real target (left) and the underlying system (right).

Summing up

Our experiments show that when noise is likely to obscure measurements from an underlying deterministic system, FNN regularization can strongly improve forecasts. This is the case especially for convolutional VAEs, and probably for convolutional autoencoders in general. And if an FNN-constrained VAE performs as well as an LSTM for time series prediction, there is a strong incentive to use the convolutional model: It trains significantly faster.

With that, we conclude our mini-series on FNN-regularized models. As always, we'd love to hear from you if you were able to make use of this in your own work!

Thanks for reading!

Gilpin, William. 2020. "Deep Reconstruction of Strange Attractors from Time Series." https://arxiv.org/abs/2002.05909.
