Simple audio classification with torch
This article translates Daniel Falbel's 'Simple Audio Classification' article from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here, we focus on a popular dataset, the audio loader and the spectrogram transformer. An interesting side product is the parallel between torch and tensorflow, showing sometimes the differences, sometimes the similarities between them.
Downloading and Importing
torchaudio has the speechcommand_dataset built in. It filters out background_noise by default and lets us choose between versions v0.01 and v0.02.
library(torch)
library(torchaudio)

# set an existing folder here to cache the dataset
DATASETS_PATH <- "~/datasets/"

# 1.4GB download
df <- speechcommand_dataset(
  root = DATASETS_PATH,
  url = "speech_commands_v0.01",
  download = TRUE
)

# folder that is filtered out: _background_noise_
df$EXCEPT_FOLDER
# [1] "_background_noise_"

# number of audio files
length(df)
# [1] 64721

# a sample
sample <- df[1]

sample$waveform[, 1:10]
torch_tensor
0.0001 *
 0.9155  0.3052  1.8311  1.8311 -0.3052  0.3052  2.4414  0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]
sample$sample_rate
# 16000
sample$label
# bed

plot(sample$waveform[1], type = "l", col = "royalblue", main = sample$label)
Figure 1: A sample waveform for a 'bed'.
Classes
df$classes
 [1] "bed"    "bird"   "cat"    "dog"    "down"   "eight"  "five"
 [8] "four"   "go"     "happy"  "house"  "left"   "marvin" "nine"
[15] "no"     "off"    "on"     "one"    "right"  "seven"  "sheila"
[22] "six"    "stop"   "three"  "tree"   "two"    "up"     "wow"
[29] "yes"    "zero"
Generator Dataloader
torch::dataloader has the same task as data_generator defined in the original article. It is responsible for preparing batches (including shuffling, padding, one-hot encoding, etc.) and for taking care of parallelism and device I/O orchestration.
In torch we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.
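The text uses train_subset and test_subset without showing how they are built. A minimal sketch, assuming a random 70/30 split (the proportion and the seed are illustrative assumptions, not values stated in the text): draw the training indices in base R, then wrap them with torch::dataset_subset(). The subset calls are left as comments so the block runs without downloading the dataset.

```r
# Number of audio files, as reported by length(df)
n <- 64721

# Assumption: 70% of the indices go to training; the seed is arbitrary
set.seed(6)
id_train <- sample(n, size = floor(0.7 * n))
id_test <- setdiff(seq_len(n), id_train)

# With the dataset loaded, the subsets would then be created like this:
# train_subset <- torch::dataset_subset(df, id_train)
# test_subset  <- torch::dataset_subset(df, id_test)

length(id_train)
length(id_test)
```

A 70% training share gives 45,304 samples, which at a batch size of 128 means 354 batches per epoch. That matches the 354/354 counter in the training log, so the assumption looks plausible, though it is not confirmed by the text.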
At this point, dataloader(train_subset) would not work because the samples are not padded. So we need to build our own collate_fn() with the padding strategy.
I suggest using the following approach when implementing the collate_fn():
1. Begin with collate_fn <- function(batch) browser().
2. Instantiate the dataloader with the collate_fn().
3. Create an environment by calling enumerate(dataloader) so you can ask to retrieve a batch from the dataloader.
4. Run environment[[1]][[1]]. Now you should be sent inside collate_fn() with access to the batch input object.
5. Build the logic.
# placeholder collate_fn that drops into the debugger
collate_fn <- function(batch) {
  browser()
}

ds_train <- dataloader(
  train_subset,
  batch_size = 32,
  shuffle = TRUE,
  collate_fn = collate_fn
)

ds_train_env <- enumerate(ds_train)
ds_train_env[[1]][[1]]
The final collate_fn() pads the waveform to length 16001 and then stacks everything up together. At this point there are no spectrograms yet. We are going to make the spectrogram transformation a part of the model architecture.
pad_sequence <- function(batch) {
  # Make all tensors in a batch the same length by padding with zeros
  # (lapply always returns a list, which is what pad_sequence expects)
  batch <- lapply(batch, function(x) x$t())
  batch <- torch::nn_utils_rnn_pad_sequence(batch, batch_first = TRUE, padding_value = 0.)
  return(batch$permute(c(1, 3, 2)))
}

# Final collate_fn
collate_fn <- function(batch) {
  # Input structure:
  # list of 32 lists: list(waveform, sample_rate, label, speaker_id, utterance_number)
  # Transpose it
  batch <- purrr::transpose(batch)
  tensors <- batch$waveform
  targets <- batch$label_index

  # Group the list of tensors into a batched tensor
  tensors <- pad_sequence(tensors)

  # target encoding
  targets <- torch::torch_stack(targets)

  # tensors has shape (batch_size, 1, 16001)
  list(tensors = tensors, targets = targets)
}
The batch structure is:
batch[[1]]: waveforms – tensor with dimension (32, 1, 16001)
batch[[2]]: targets – tensor with dimension (32, 1)
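To see the padding step in isolation, here is a small sketch with two fake waveforms of different lengths. The tensors and their sizes are made up for illustration; the calls mirror the ones used in pad_sequence().

```r
library(torch)

# Two fake mono waveforms of different lengths, shaped (channels, time)
batch <- list(torch_ones(1, 3), torch_ones(1, 5))

# Padding works along the first dimension, so put time first: (time, channels)
transposed <- lapply(batch, function(x) x$t())

# Zero-pad to the longest sequence: result is (batch, time, channels)
padded <- nn_utils_rnn_pad_sequence(transposed, batch_first = TRUE, padding_value = 0)

# Back to (batch, channels, time), the layout the model expects
out <- padded$permute(c(1, 3, 2))

out$shape
as.numeric(out[1, 1, ])
```

The shorter waveform ends up with trailing zeros, and both share the length of the longest one in the batch.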
Also, torchaudio comes with 3 loaders, av_loader, tuneR_loader, and audiofile_loader, with more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs based on audio format (mp3 or wav). There is no perfect world yet: tuneR_loader is best for mp3, audiofile_loader is best for wav, but neither of them has the option of partially loading a sample from an audio file without bringing all the data into memory first.
For a given audio backend we need to pass it to each worker through the worker_init_fn() argument.
ds_train <- dataloader(
  train_subset,
  batch_size = 128,
  shuffle = TRUE,
  collate_fn = collate_fn,
  num_workers = 16,
  worker_init_fn = function(.) {torchaudio::set_audio_backend("audiofile_loader")},
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)

ds_test <- dataloader(
  test_subset,
  batch_size = 64,
  shuffle = FALSE,
  collate_fn = collate_fn,
  num_workers = 8,
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)
Model definition
Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As referenced by the original article, the model is based on this architecture for MNIST from this tutorial, and I'll call it 'DanielNN'.
dan_nn <- torch::nn_module(
  "DanielNN",

  initialize = function(
    window_size_ms = 30,
    window_stride_ms = 10
  ) {

    # spectrogram spec
    window_size <- as.integer(16000*window_size_ms/1000)
    stride <- as.integer(16000*window_stride_ms/1000)
    fft_size <- as.integer(2^trunc(log(window_size, 2) + 1))
    n_chunks <- length(seq(0, 16000, stride))

    self$spectrogram <- torchaudio::transform_spectrogram(
      n_fft = fft_size,
      win_length = window_size,
      hop_length = stride,
      normalized = TRUE,
      power = 2
    )

    # convs 2D
    self$conv1 <- torch::nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = c(3,3))
    self$conv2 <- torch::nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = c(3,3))
    self$conv3 <- torch::nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = c(3,3))
    self$conv4 <- torch::nn_conv2d(in_channels = 128, out_channels = 256, kernel_size = c(3,3))

    # denses
    self$dense1 <- torch::nn_linear(in_features = 14336, out_features = 128)
    self$dense2 <- torch::nn_linear(in_features = 128, out_features = 30)
  },

  forward = function(x) {
    x %>% # (64, 1, 16001)
      self$spectrogram() %>% # (64, 1, 257, 101)
      torch::torch_add(0.01) %>%
      torch::torch_log() %>%
      self$conv1() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%

      self$conv2() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%

      self$conv3() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%

      self$conv4() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%

      torch::nnf_dropout(p = 0.25) %>%
      torch::torch_flatten(start_dim = 2) %>%

      self$dense1() %>%
      torch::nnf_relu() %>%
      torch::nnf_dropout(p = 0.5) %>%
      self$dense2()
  }
)

model <- dan_nn()

device <- torch::torch_device(if(torch::cuda_is_available()) "cuda" else "cpu")
model$to(device = device)

print(model)
An `nn_module` containing 2,226,846 parameters.

── Modules ──────────────────────────────────────────────────────
● spectrogram: <Spectrogram> #0 parameters
● conv1: <nn_conv2d> #320 parameters
● conv2: <nn_conv2d> #18,496 parameters
● conv3: <nn_conv2d> #73,856 parameters
● conv4: <nn_conv2d> #295,168 parameters
● dense1: <nn_linear> #1,835,136 parameters
● dense2: <nn_linear> #3,870 parameters
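The hard-coded in_features = 14336 and the parameter total can be checked with plain arithmetic. The sketch below is base R only and assumes a one-sided, centre-padded STFT, which is consistent with the (64, 1, 257, 101) shape comment in the model code. The helper names conv_pool, conv_params and dense_params are made up for this illustration.

```r
# Spectrogram settings, as computed in initialize()
sample_rate <- 16000
window_size <- as.integer(sample_rate * 30 / 1000)           # 480 samples
stride      <- as.integer(sample_rate * 10 / 1000)           # 160 samples
fft_size    <- as.integer(2^trunc(log(window_size, 2) + 1))  # 512

# Assumption: one-sided, centre-padded STFT on a 16001-sample waveform
n_freq   <- fft_size %/% 2 + 1     # frequency bins
n_frames <- 16001 %/% stride + 1   # time frames

# Each stage: 3x3 convolution without padding (-2), then 2x2 max pooling (floor of half)
conv_pool <- function(size) (size - 2) %/% 2
h <- n_freq
w <- n_frames
for (i in 1:4) {
  h <- conv_pool(h)
  w <- conv_pool(w)
}
flat_features <- 256 * h * w

# Parameter counts: weights plus biases
conv_params  <- function(c_in, c_out) c_in * c_out * 3 * 3 + c_out
dense_params <- function(n_in, n_out) n_in * n_out + n_out
total_params <- conv_params(1, 32) + conv_params(32, 64) + conv_params(64, 128) +
  conv_params(128, 256) + dense_params(flat_features, 128) + dense_params(128, 30)

c(n_freq = n_freq, n_frames = n_frames,
  flat_features = flat_features, total_params = total_params)
```

The spatial size shrinks from 257 x 101 to 14 x 4 after the four conv/pool stages, so the flattened vector has 256 * 14 * 4 = 14336 entries. Changing window_size_ms or window_stride_ms would change this number, and in_features of dense1 would have to be updated to match.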
Model fitting
Unlike in tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, optimizer strategy and evaluation metrics explicitly in the training loop.
loss_criterion <- torch::nn_cross_entropy_loss()
optimizer <- torch::optim_adadelta(model$parameters, rho = 0.95, eps = 1e-7)
metrics <- list(acc = yardstick::accuracy_vec)
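nn_cross_entropy_loss() works on raw scores (logits), which is why dense2 is not followed by a softmax. As an illustration only, here is a base R sketch of the same quantity: a log-softmax over each row followed by the mean negative log-probability of the true class. The cross_entropy() helper is hypothetical and not part of torch.

```r
# Hypothetical helper: mean cross-entropy from raw scores (logits).
# logits: matrix (batch, classes); targets: 1-based class indices.
cross_entropy <- function(logits, targets) {
  # log-softmax per row, shifted by the row maximum for numerical stability
  shifted <- logits - apply(logits, 1, max)
  log_probs <- shifted - log(rowSums(exp(shifted)))
  # pick the log-probability of the true class in each row
  picked <- log_probs[cbind(seq_along(targets), targets)]
  -mean(picked)
}

# A model that scores all 30 classes equally, i.e. a uniform guess
logits <- matrix(0, nrow = 4, ncol = 30)
targets <- c(1, 7, 12, 30)

cross_entropy(logits, targets)
log(30)
```

Both printed values agree: a uniform guess over 30 classes costs log(30), roughly 3.4, which is a useful reference point when reading the loss values during training.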
Training loop
library(glue)
library(progress)

# convert class indices (tensor) to an R factor of class names
pred_to_r <- function(x) {
  classes <- factor(df$classes)
  classes[as.numeric(x$to(device = "cpu"))]
}

set_progress_bar <- function(total) {
  progress_bar$new(
    total = total, clear = FALSE, width = 70,
    format = ":current/:total [:bar] - :elapsed - loss: :loss - acc: :acc"
  )
}
epochs <- 20
losses <- c()
accs <- c()

for(epoch in seq_len(epochs)) {
  pb <- set_progress_bar(length(ds_train))
  pb$message(glue("Epoch {epoch}/{epochs}"))
  coro::loop(for(batch in ds_train) {
    optimizer$zero_grad()
    predictions <- model(batch[[1]]$to(device = device))
    targets <- batch[[2]]$to(device = device)
    loss <- loss_criterion(predictions, targets)
    loss$backward()
    optimizer$step()

    # eval reports
    prediction_r <- pred_to_r(predictions$argmax(dim = 2))
    targets_r <- pred_to_r(targets)
    acc <- metrics$acc(targets_r, prediction_r)
    accs <- c(accs, acc)
    loss_r <- as.numeric(loss$item())
    losses <- c(losses, loss_r)
    pb$tick(tokens = list(loss = round(mean(losses), 4), acc = round(mean(accs), 4)))
  })
}

# test
predictions_r <- c()
targets_r <- c()
coro::loop(for(batch_test in ds_test) {
  predictions <- model(batch_test[[1]]$to(device = device))
  targets <- batch_test[[2]]$to(device = device)
  predictions_r <- c(predictions_r, pred_to_r(predictions$argmax(dim = 2)))
  targets_r <- c(targets_r, pred_to_r(targets))
})
val_acc <- metrics$acc(factor(targets_r, levels = 1:30), factor(predictions_r, levels = 1:30))
cat(glue("val_acc: {val_acc}\n\n"))
Epoch 1/20
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] - 1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20
354/354 [=========================] - 1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20
354/354 [============================] - 1m - loss: 1.62 - acc: 0.519
Epoch 4/20
354/354 [=========================] - 1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20
354/354 [==========================] - 1m - loss: 1.2334 - acc: 0.633
Epoch 6/20
354/354 [=========================] - 1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20
354/354 [=========================] - 1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20
354/354 [=========================] - 1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20
354/354 [=========================] - 1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20
354/354 [=========================] - 1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20
354/354 [=========================] - 1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20
354/354 [=========================] - 1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20
354/354 [=========================] - 1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20
354/354 [=========================] - 1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20
354/354 [=========================] - 1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20
354/354 [=========================] - 1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20
354/354 [=========================] - 1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20
354/354 [=========================] - 1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20
354/354 [==========================] - 1m - loss: 0.5582 - acc: 0.832
Epoch 20/20
354/354 [=========================] - 1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493
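One detail helps when reading this log: the training loop never resets losses and accs between epochs, so the loss and acc shown on each line are means over every batch seen since epoch 1, not over the current epoch alone. The sketch below uses made-up per-batch accuracies (not taken from the run above) to show the difference between the two ways of averaging.

```r
# Toy per-batch accuracies for 2 epochs of 3 batches each (made-up numbers)
acc_by_batch <- c(0.2, 0.3, 0.4, 0.7, 0.8, 0.9)
epoch <- rep(1:2, each = 3)

# What the loop reports: mean over every batch seen so far
running <- cumsum(acc_by_batch) / seq_along(acc_by_batch)

# Alternative: mean within each epoch only
per_epoch <- tapply(acc_by_batch, epoch, mean)

running[c(3, 6)]  # value shown at the end of epoch 1 and epoch 2
per_epoch
```

Because early, weaker epochs stay in the average, the cumulative figure lags behind the model's current accuracy. This may be part of why the final training acc (0.8374) sits below val_acc. Resetting losses and accs at the start of each epoch would give per-epoch figures instead.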
Making predictions
We already have all predictions calculated for test_subset, so let's recreate the alluvial plot from the original article.
library(dplyr)
library(alluvial)

df_validation <- data.frame(
  pred_class = df$classes[predictions_r],
  class = df$classes[targets_r]
)

x <- df_validation %>%
  mutate(correct = pred_class == class) %>%
  count(pred_class, class, correct)

alluvial(
  x %>% select(class, pred_class),
  freq = x$n,
  col = ifelse(x$correct, "lightblue", "red"),
  border = ifelse(x$correct, "lightblue", "red"),
  alpha = 0.6,
  hide = x$n < 20
)
Figure 2: Model performance: true labels <–> predicted labels.
Model accuracy is 87.7%, somewhat worse than the tensorflow version from the original post. Nevertheless, all conclusions from the original post still hold.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
Citation
For attribution, please cite this work as
Damiani (2021, Feb. 4). Posit AI Blog: Simple audio classification with torch. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/
BibTeX citation
@misc{athossimpleaudioclassification,
  author = {Damiani, Athos},
  title = {Posit AI Blog: Simple audio classification with torch},
  url = {https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/},
  year = {2021}
}