
Posit AI Blog: Word Embeddings with Keras

Introduction

Word embedding is a technique used to map words of a vocabulary to dense vectors of real numbers where semantically similar words are mapped to nearby points. Representing words in this vector space helps algorithms achieve better performance in natural language processing tasks like syntactic parsing and sentiment analysis by grouping similar words. For example, we expect that in the embedding space “cats” and “dogs” are mapped to nearby points since they are both animals, mammals, pets, etc.

In this tutorial we will implement the skip-gram model created by Mikolov et al. in R using the keras package. The skip-gram model is a flavor of word2vec, a class of computationally-efficient predictive models for learning word embeddings from raw text. We won't address theoretical details about embeddings and the skip-gram model here. If you want more details you can read the paper linked above. The TensorFlow Vector Representation of Words tutorial includes additional details, as does the Deep Learning With R notebook about embeddings.

There are other ways to create vector representations of words. For example, GloVe Embeddings are implemented in the text2vec package by Dmitriy Selivanov. There's also a tidy approach described in Julia Silge's blog post Word Vectors with Tidy Data Principles.

Getting the Data

We will use the Amazon Fine Foods Reviews dataset. This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and narrative text.

The data can be downloaded (~116MB) by running:

download.file("https://snap.stanford.edu/knowledge/finefoods.txt.gz", "finefoods.txt.gz")

We will now load the plain text reviews into R.
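The loading code is not shown in this post; below is a minimal sketch assuming the raw SNAP file format, in which each review's text appears on a line prefixed with "review/text:" (the readr/stringr calls are illustrative, not the original code):

library(readr)
library(stringr)

# Read the compressed file and keep only the review text lines
lines <- read_lines("finefoods.txt.gz")
reviews <- str_subset(lines, "^review/text:")
reviews <- str_replace_all(reviews, "^review/text: ", "")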

Let's take a look at some of the reviews we have in the dataset.

[1] "I've purchased a number of of the Vitality canned pet food merchandise ...
[2] "Product arrived labeled as Jumbo Salted Peanuts...the peanuts ... 

Preprocessing

We'll begin with some text pre-processing using a keras text_tokenizer(). The tokenizer will be responsible for transforming each review into a sequence of integer tokens (which will subsequently be used as input into the skip-gram model).

library(keras)
tokenizer <- text_tokenizer(num_words = 20000)
tokenizer %>% fit_text_tokenizer(reviews)

Note that the tokenizer object is modified in place by the call to fit_text_tokenizer(). An integer token will be assigned to each of the 20,000 most common words (the other words will be assigned to token 0).
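As a quick sanity check (purely illustrative; the exact ids depend on your data), you can inspect the fitted vocabulary:

# word_index maps each word to its integer id, ordered from most to least frequent
head(tokenizer$word_index, 3)

# number of reviews seen while fitting the tokenizer
tokenizer$document_count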

Skip-Gram Model

In the skip-gram model we will use each word as input to a log-linear classifier with a projection layer, then predict words within a certain range before and after this word. It would be very computationally expensive to output a probability distribution over the whole vocabulary for each target word we feed into the model. Instead, we are going to use negative sampling, meaning we will sample some words that don't appear in the context and train a binary classifier to predict whether the context word we passed really came from the context or not.

In more practical terms, for the skip-gram model we will input a 1d integer vector of target word tokens and a 1d integer vector of sampled context word tokens. We will generate a prediction of 1 if the sampled word really appeared in the context and 0 if it didn't.

We will now define a generator function to yield batches for model training.

library(reticulate)
library(purrr)
skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) {
  gen <- texts_to_sequences_generator(tokenizer, sample(text))
  function() {
    skip <- generator_next(gen) %>%
      skipgrams(
        vocabulary_size = tokenizer$num_words, 
        window_size = window_size, 
        negative_samples = 1
      )
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    list(x, y)
  }
}

A generator function is a function that returns a different value each time it is called (generator functions are often used to provide streaming or dynamic data for training models). Our generator function will receive a vector of texts, a tokenizer, and the arguments for the skip-gram (the size of the window around each target word we examine and how many negative samples we want to draw for each target word).

Now let's start defining the keras model. We will use the Keras functional API.

embedding_size <- 128  # Dimension of the embedding vector.
skip_window <- 5       # How many words to consider left and right.
num_sampled <- 1       # Number of negative examples to sample for each word.

We will first write placeholders for the inputs using the layer_input function.

input_target <- layer_input(shape = 1)
input_context <- layer_input(shape = 1)

Now let's define the embedding matrix. The embedding is a matrix with dimensions (vocabulary, embedding_size) that acts as a lookup table for the word vectors.

embedding <- layer_embedding(
  input_dim = tokenizer$num_words + 1, 
  output_dim = embedding_size, 
  input_length = 1, 
  name = "embedding"
)

target_vector <- input_target %>% 
  embedding() %>% 
  layer_flatten()

context_vector <- input_context %>%
  embedding() %>%
  layer_flatten()

The next step is to define how the target_vector will be related to the context_vector in order to make our network output 1 when the context word really appeared in the context and 0 otherwise. We want target_vector to be similar to the context_vector if they appeared in the same context. A typical measure of similarity is the cosine similarity: given two vectors A and B, it is the dot product of A and B normalized by their magnitudes. Since we don't need the similarity to be normalized inside the network, we will only calculate the dot product and then output a dense layer with sigmoid activation.

dot_product <- layer_dot(list(target_vector, context_vector), axes = 1)
output <- layer_dense(dot_product, units = 1, activation = "sigmoid")

Now we will create the model and compile it.

model <- keras_model(list(input_target, input_context), output)
model %>% compile(loss = "binary_crossentropy", optimizer = "adam")

We can see the full definition of the model by calling summary():
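summary(model)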

_________________________________________________________________________________________
Layer (type)                 Output Shape       Param #    Connected to                  
=========================================================================================
input_1 (InputLayer)         (None, 1)          0                                        
_________________________________________________________________________________________
input_2 (InputLayer)         (None, 1)          0                                        
_________________________________________________________________________________________
embedding (Embedding)        (None, 1, 128)     2560128    input_1[0][0]                 
                                                           input_2[0][0]                 
_________________________________________________________________________________________
flatten_1 (Flatten)          (None, 128)        0          embedding[0][0]               
_________________________________________________________________________________________
flatten_2 (Flatten)          (None, 128)        0          embedding[1][0]               
_________________________________________________________________________________________
dot_1 (Dot)                  (None, 1)          0          flatten_1[0][0]               
                                                           flatten_2[0][0]               
_________________________________________________________________________________________
dense_1 (Dense)              (None, 1)          2          dot_1[0][0]                   
=========================================================================================
Total params: 2,560,130
Trainable params: 2,560,130
Non-trainable params: 0
_________________________________________________________________________________________

Model Training

We will fit the model using the fit_generator() function. We need to specify the number of training steps as well as the number of epochs we want to train for. We will train for 100,000 steps per epoch for 5 epochs. This is quite slow (~1000 seconds per epoch on a modern GPU). Note that you may get reasonable results with just one epoch of training.

model %>%
  fit_generator(
    skipgrams_generator(reviews, tokenizer, skip_window, num_sampled), 
    steps_per_epoch = 100000, epochs = 5
    )
Epoch 1/5
100000/100000 [==============================] - 1092s - loss: 0.3749      
Epoch 2/5
100000/100000 [==============================] - 1094s - loss: 0.3548     
Epoch 3/5
100000/100000 [==============================] - 1053s - loss: 0.3630     
Epoch 4/5
100000/100000 [==============================] - 1020s - loss: 0.3737     
Epoch 5/5
100000/100000 [==============================] - 1017s - loss: 0.3823 

We can now extract the embedding matrix from the model by using the get_weights() function. We also add row.names to our embedding matrix so we can easily find where each word is.
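The extraction code itself is not reproduced in this post; a minimal sketch of what it could look like is below (treating the embedding weights as the first element of get_weights() and labeling token 0 as "UNK" are assumptions):

library(dplyr)

# The embedding layer weights: (vocabulary + 1) rows by embedding_size columns
embedding_matrix <- get_weights(model)[[1]]

# Word -> id lookup restricted to the words the tokenizer kept
words <- tibble(
  word = names(tokenizer$word_index),
  id   = as.integer(unlist(tokenizer$word_index))
) %>%
  filter(id <= tokenizer$num_words) %>%
  arrange(id)

# Row 1 corresponds to the padding/unknown token, the remaining rows to the vocabulary
row.names(embedding_matrix) <- c("UNK", words$word)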

Understanding the Embeddings

We can now find words that are close to each other in the embedding. We will use the cosine similarity, since this is what we trained the model to minimize.

library(text2vec)

find_similar_words <- function(word, embedding_matrix, n = 5) {
  similarities <- embedding_matrix[word, , drop = FALSE] %>%
    sim2(embedding_matrix, y = ., method = "cosine")
  
  similarities[,1] %>% sort(decreasing = TRUE) %>% head(n)
}
find_similar_words("2", embedding_matrix)
        2         4         3       two         6 
1.0000000 0.9830254 0.9777042 0.9765668 0.9722549 
find_similar_words("little", embedding_matrix)
   little       bit       few     small     treat 
1.0000000 0.9501037 0.9478287 0.9309829 0.9286966 
find_similar_words("scrumptious", embedding_matrix)
scrumptious     tasty great   superb     yummy 
1.0000000 0.9632145 0.9619508 0.9617954 0.9529505 
find_similar_words("cats", embedding_matrix)
     cats      dogs      kids       cat       dog 
1.0000000 0.9844937 0.9743756 0.9676026 0.9624494 

The t-SNE algorithm can be used to visualize the embeddings. Because of time constraints we will only use it with the first 500 words. To understand more about the t-SNE method see the article How to Use t-SNE Effectively.
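The plotting code is omitted above; the following is a minimal sketch using the Rtsne and ggplot2 packages (the 500-word cut-off follows the text, while the perplexity value and the geom_text styling are assumptions):

library(Rtsne)
library(ggplot2)
library(dplyr)

# Drop the "UNK" row and keep (roughly) the first 500 words of the vocabulary
tsne <- Rtsne(embedding_matrix[2:500, ], perplexity = 50, pca = FALSE)

tsne_plot <- tsne$Y %>%
  as.data.frame() %>%
  mutate(word = row.names(embedding_matrix)[2:500]) %>%
  ggplot(aes(x = V1, y = V2, label = word)) +
  geom_text(size = 3)

tsne_plot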

This plot may look like a mess, but if you zoom into the small groups you end up seeing some nice patterns. Try, for example, to find a group of web related words like http, href, etc. Another group that may be easy to spot is the pronouns group: she, he, her, etc.

