LLaMA in R with Keras and TensorFlow

OpenAI's chatGPT has awakened a collective awareness of what Large Language Models (LLMs) are capable of. With that awakening comes a daily march of LLM news: new products, new features, new models, new capabilities, (and new worries). It seems we're in the early stages of a Cambrian explosion of LLMs and LLM powered tools; it's not yet clear how LLMs will impact and influence our professional and personal lives, but it seems clear that they will, in some way.

Since LLMs are here to stay, it's worthwhile to take some time to understand how these models work from a first-principles perspective. Starting with the mechanics can help foster durable intuitions that will inform our usage of these models now and in the future. (Especially if the future is one where LLMs are a staple of the data scientist's toolbox, as common as an lm() function call).

And what better way is there to learn than by doing. So with that preamble, in this post we'll walk through an implementation of an LLM, LLaMA (Touvron et al. 2023) specifically, in TensorFlow and Keras, with the goal being to develop understanding first, capability second.

Why LLaMA? With the sheer volume of LLM related content and news out there, it can seem daunting to know where to get started. Almost weekly it seems there is a new model announced. Browsing some hubs of LLM activity (HuggingFace, TFHub, reddit, HackerNews) muddies the waters even more. How to pick a specific model?

Of the many LLM-related news items in the past months, one that stands head-and-shoulders above the crowd is the release of LLaMA, a modern, foundational LLM made available to the public by Meta AI in February 2023. On common benchmarks, LLaMA outperforms OpenAI's GPT-3, while being substantially smaller (though still large).

LLaMA is a great starting place because it is a simple and modern architecture, has excellent performance on benchmarks, and is open. The model architecture has had just a few new ideas incorporated into it since the original Transformer architecture first described in "Attention Is All You Need" published from Google (Vaswani et al. 2017). Four different sizes of LLaMA were released: 7 billion and 13 billion parameter models trained on 1 trillion tokens, and 33 billion and 65 billion parameter models trained on 1.4 trillion tokens. This is an enormous amount of training data these models have seen–the largest 65B model has been trained on approximately the "Chinchilla compute-optimal" (Hoffmann et al. 2022) number of tokens, while the smaller LLaMAs are substantially beyond that optimum. In this blog post we'll focus on the smallest, 7B parameter LLaMA model, which you can comfortably load locally and run on CPU with only 64Gb of RAM.

While not strictly necessary, to follow along locally, you'll probably want to acquire the pre-trained LLaMA weights one way or another. Note, the weights do come with their own license, which you can preview here.

So, without further ado, let's get started.

Setup

First, we'll want to install the required R and Python packages, and configure a virtual environment:

remotes::install_github(c("rstudio/reticulate",
                          "rstudio/tensorflow",
                          "rstudio/keras"))
# reticulate::install_python("3.10:latest")                          
reticulate::virtualenv_create("./.venv", version = "3.10:latest")
tensorflow::install_tensorflow(envname = "./.venv", version = "release",
                               extra_packages = "tensorflow-text")

With that out of the way, let's load some packages and prepare our R session:

library(purrr)
library(envir)

library(tensorflow)
library(tfautograph)
library(keras)

use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

attach_eval({
  import_from(glue, glue)
  import_from(jsonlite, read_json)
  import_from(withr, with_dir, with_options)
  import_from(keras$layers, Dense)
  np <- reticulate::import("numpy", convert = FALSE)

  seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})

If you've acquired the pre-trained weights, it'll be convenient to convert them from the torch checkpoint format to something that's more framework agnostic (you only need to do this once, of course):

# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
  pretrained_weights <- torch$load("consolidated.00.pth",
                                   map_location = "cpu")
  for (name in names(pretrained_weights)) {
    filename <- sprintf("%s.npy", name)
    array <- pretrained_weights[[name]]$numpy()
    np$save(filename, array)
    message(glue(
      "wrote: '{basename(filename)}' with shape: {array$shape}"))
  }
})

We'll also define a helper function so we can avoid having to retype the full path to our weights:

weights_path <- function(filename) normalizePath(file.path(
  "~/github/facebookresearch/llama/weights/LLaMA/",
  glue(filename, .envir = parent.frame())), mustWork = TRUE)

And load the model configuration parameters specific to the 7B LLaMA, which we'll use to build the model.

params <- read_json(weights_path("7B/params.json"))
str(params)
List of 6
 $ dim        : int 4096
 $ multiple_of: int 256
 $ n_heads    : int 32
 $ n_layers   : int 32
 $ norm_eps   : num 1e-06
 $ vocab_size : int -1

Tokenizer

The first component of LLaMA is the tokenizer, which converts text to a sequence of integers. The LLaMA model uses the SentencePiece tokenizer from Google. SentencePiece is available as a TensorFlow graph operation through tf_text.SentencepieceTokenizer, and also as a Keras layer in keras_nlp.tokenizers.SentencepieceTokenizer. By choice of a coin flip, we'll use the lower-level tf_text interface.

tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
  tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
  add_bos = TRUE, add_eos = FALSE,
)

Let’s test it out with a prompt:

prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)
tf.Tensor([    1   450  1900   982   304 13978   367   267], shape=(8), dtype=int32)
prompt |> tokenizer$tokenize() |> tokenizer$detokenize()
tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)

Let's define a show_tokens() helper function and play with the tokenizer a little.

show_tokens <- function(what) {
  if (is.character(what))
    token_ids <- what |> tokenizer$tokenize() |> as.integer()
  else
    token_ids <- as.integer(what)

  tokens <- token_ids |>
    map_chr(\(id) tokenizer$id_to_string(id) |> as.character())

  names(tokens) <- token_ids
  tokens
}

show_tokens(prompt)
        1       450      1900       982       304     13978       367       267
       ""     "The"    "best"     "way"      "to" "attract"      "be"      "es"

Note that "bees" is two tokens. Not every token corresponds to a word. For example, one non-word token we can reliably expect to show up in a tokenizer trained on a corpus of English text is "ing." However, when the "ing" token shows up will not always follow your intuitions, because common words get their own token id, even if they can be decomposed into multiple tokens.

show_tokens("ing")
    1  2348
   "" "ing"
show_tokens("working")
        1      1985
       "" "working"
show_tokens("flexing")
     1   8525    292
    "" "flex"  "ing"
show_tokens("wonking")
     1   2113   9292
    ""  "won" "king"
Another thing to note about the tokenizer is that each token sequence starts with token id 1. This is a special beginning-of-sequence token that we requested be added when we loaded the tokenizer with add_bos = TRUE. There are two other such special tokens that we will encounter later: an end-of-sequence special token with id 2, and an unknown-token with id 0.

as.character(tokenizer$id_to_string(0L))
[1] "<unk>"
as.character(tokenizer$id_to_string(1L))
[1] "<s>"
as.character(tokenizer$id_to_string(2L))
[1] "</s>"
show_tokens(c(1L, 0L, 2L))
    1     0     2
   "" " ⁇ "    ""

Overall, there are 32,000 tokens.

as.integer(tokenizer$vocab_size())
[1] 32000

One last observation is that the more frequently encountered tokens are assigned lower ids.

show_tokens(seq(50, len = 10))
 50  51  52  53  54  55  56  57  58  59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"
show_tokens(seq(100, len = 10))
100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
show_tokens(seq(1000, len = 10))
   1000    1001    1002    1003    1004    1005    1006    1007    1008    1009
  "ied"    "ER"  "stat"   "fig"    "me"   "von" "inter"  "roid"  "ater" "their"
show_tokens(seq(10000, len = 10))
   10000    10001    10002    10003    10004    10005    10006    10007
   "ång"  "citep"    "Ill"   "rank" "sender"   "beim"    "рак" "compat"
   10008    10009
"occurs"  "diese"
show_tokens(seq(20000, len = 10))
    20000     20001     20002     20003     20004     20005     20006     20007
  "admit" "Comment"     "стя"    "Vien"      "ці"  "permut"     "cgi"    "crít"
    20008     20009
"Console"    "ctic"
show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))
31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
  "ὀ"  "げ"  "べ"  "边"  "还"  "黃"  "왕"  "收"  "弘"  "给"

Moving on, the next step after tokenization is embedding. An embedding layer is effectively a dictionary lookup that converts an integer (token id) to a 1-d float array. For this we can use the standard keras Embedding layer.

tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)

tok_embeddings(3L) |> str()
<tf.Tensor: shape=(4096), dtype=float32, numpy=…>
prompt |> # "The best way to attract bees"
  tokenizer$tokenize() |>
  tok_embeddings() |>
  str()
<tf.Tensor: shape=(8, 4096), dtype=float32, numpy=…>
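
The "dictionary lookup" framing can be checked directly (a quick aside of ours, not from the original implementation): the embedding vector for token id 3 is simply row 4 (1-based) of the layer's weight matrix.

w <- tok_embeddings$get_weights()[[1]]  # the (32000, 4096) embedding matrix
max(abs(w[4, ] - as.array(tok_embeddings(3L))))
[1] 0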

TransformerBlock

Once it's tokenized and embedded, the input then passes through the bulk of the model, a sequence of repeating TransformerBlock layers. The 7B model has 32 of these TransformerBlock layers, while the 65B model has 80 of them.

weights_path("7B/params.json")  |> read_json() |> _$n_layers
[1] 32
weights_path("65B/params.json") |> read_json() |> _$n_layers
[1] 80

Here is what the transformer block looks like:

TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {
    # norm and attention
    x2 <- x |>
      self$attention_norm() |>
      self$attention()

    x <- x + x2 # add residual

    # norm and swiglu
    x2 <- x %>%
      self$feed_forward_norm() %>%
      self$feed_forward()

    x <- x + x2 # residual again

    x
  }
}

While there is not a lot of code, there are a lot of ideas packed in there. This block forms the main trunk of the model, so it's worth taking the time to go through it slowly.

We implement the TransformerBlock as a subclassed keras.layers.Layer. This gives us some niceties like the ability to compose with other Keras layers, but these are mostly irrelevant to the purpose of this blog post; we could just as easily implement this as, for example, a vanilla R6 class. Our TransformerBlock class has two methods: initialize, called when we first create the block, and call, called when we run the forward pass of the block.

In initialize, we create 4 layers: an Attention layer, a FeedForward layer, and 2 RMSNorm layers. We'll take a close look at each of these soon, but even before we do so, we can see how they fit together by looking at the TransformerBlock$call() method.

The call method has a few simple ideas. In no particular order, the first one to observe is the composition pattern of adding residuals.

x2 <- x |> ...
x <- x + x2 # add residual x to x2

This is a common pattern that helps with model training, and especially to help with the vanishing gradient problem. It's a skip-connection in the otherwise linear sequence of matrix transformations. It reinjects information (during the forward pass), and gradients (during back propagation), back into the trunk. You can think of these residual connections as freeing the learnable layers in-between (the ... in the pseudo code) from the burden of having to "pass-through" or "preserve" information in x, allowing the weights to instead focus on learning transformations that are, (in corporate vernacular), value-adding.

The next composition pattern to note is the repeated usage of a normalization layer:

x2 <- x |> norm() |> ...
x <- x + x2

There are many kinds of normalization layers, but to slightly over-generalize, they can all be thought of as a stabilizer that helps with training. Like their deep-learning cousins the regularizers, their main function is to keep values passing through in a sensible range–in the ball park of (-1, 1), typically. We'll take a closer look at RMSNorm soon.

Stripped of two tricks that are mostly there to help the model train, residuals and normalization, the core of the TransformerBlock is just this:

x |> attention() |> feed_forward()

In a moment we'll see that feed_forward is a slightly fancier variation of a conventional sequence of Dense layers. Before we get there we can safely skip ahead to distill the following intuition: a TransformerBlock is basically an Attention layer followed by a few (fancy) dense layers, with some simple composition patterns (tricks) that help with training. Attention is the heart of the model: it's the most interesting, and also the most involved.

With the framing in place, let's go through and take a closer look at RMSNorm, FeedForward, and then with the foundation in place, we'll turn our attention to Attention.

RMSNorm

RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local function that will load
    # the pretrained-weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <- if (is.null(block_id)) {
      "ones"
    } else if (block_id >= 0) {
      \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)
    } else if (block_id == -1) {
      # load weights for the final output normalization layer, which is not
      # part of a TransformerBlock
      \(...) weights_path("7B/norm.weight.npy") |>
        np$load() |> np$expand_dims(0:1)
    }

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>% # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>% # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}

RMSNorm() has a single trainable tensor w. In the forward pass, each value in the input is multiplied by the reciprocal-root-mean-square of all the values in the feature axis and by w. Certainly a mouthful, but just a simple sequence of arithmetic transformations in the end, designed for the express purpose of adjusting the range of values passing through.

Let’s kick the tires on it:

norm <- RMSNorm()
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
norm(m)
tf.Tensor(
[[0.         1.4142132 ]
 [0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)
norm(m*10)
tf.Tensor(
[[0.         1.4142137 ]
 [0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)
norm(m*100)
tf.Tensor(
[[0.        1.4142137]
 [0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
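
Note that the output barely changes when the input is scaled; RMSNorm standardizes values by their root-mean-square. As a quick sanity check of ours (not from the original implementation), the first result can be reproduced in base R, since with the default "ones" initializer w is simply 1:

# divide each row by the root mean square of that row
rms_norm_base <- function(row, eps = 1e-6) row / sqrt(mean(row^2) + eps)
t(apply(m, 1, rms_norm_base)) # matches norm(m) above, up to float32 precision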

FeedForward

Next up is FeedForward()

FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    if(!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if(is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)

    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }

}

FeedForward consists of three Dense layers. initialize does some simple arithmetic, munging on the input value hidden_dim to ensure the size is a performant multiple of 256, and build is mostly boilerplate for creating the layers and loading the weights.
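
For the 7B model we can trace through that arithmetic by hand (a brief aside of ours) to see where the hidden dimension actually lands:

hidden_dim <- 4L * params$dim                           # 16384, as passed in by TransformerBlock
hidden_dim <- as.integer(hidden_dim * (2/3))            # 10922
hidden_dim <- (hidden_dim + 256L - 1L) %/% 256L * 256L  # round up to a multiple of 256
hidden_dim
[1] 11008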

The novelty of FeedForward() is in the call() method, where rather than composing the Dense layers in a conventional sequential model with, say, ReLU activations in between and maybe some dropout, the layers are composed to form a "SwiGLU" unit. The publication by Shazeer (2020) of SwiGLU and other variations on GLU is an exemplar of the kinds of explorations and improvements around the Transformer architecture since its initial publication in 2017; a steady accretion of improvements that has brought us to today. The FeedForward$call() is just a single SwiGLU followed by a linear projection. In its essence, it's a clever composition of three (learned) linear projections, an element-wise multiplication, and a silu() activation function.
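
In case silu() is unfamiliar: it is simply x * sigmoid(x) (also known as "swish"). A quick check of that identity, using base R's plogis() as the sigmoid (our own aside, not part of the model code):

x <- seq(-3, 3, by = 0.5)
silu_base <- x * plogis(x)                     # plogis() is the logistic sigmoid
max(abs(silu_base - as.array(tf$nn$silu(x))))  # effectively zero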

Perhaps the most surprising observation to make here is the relative dearth of activation functions, and even non-linearities, not just in FeedForward, but overall. The silu() in this feedforward, the reciprocal-root-mean-square in RMSNorm(), and a softmax() in Attention() are the only non-linear transformations in the whole sequence of TransformerBlocks. Everything else is a linear transformation!

Attention

Finally, let's turn our attention to Attention().

Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
                      # scores (bsz, n_heads, seqlen, seqlen)
                      # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v   # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape==input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}

Attention in LLaMA is similar but not identical to the Attention described in the original Transformer paper (and available as a keras builtin under keras$layers$MultiHeadAttention()). The core novelty is the addition of the apply_rotary_embedding() function, which we'll describe shortly. The additional novelty is balanced by the simplicity from the fact that the layer is performing self-attention—we don't need to pass in different query, key, and value tensors (or reason about what that means), since the same input serves all three roles. Note that the conventional MultiHeadAttention() layer is covered quite thoroughly in the 2nd Edition of Deep Learning with R, including a full implementation of attention in base R.

To develop an understanding of the mechanics in a layer like this, it's helpful to temporarily unsee some of the minutiae that can act as a fog obscuring the essence of the operation. In this instance, if we temporarily strip out the transpose()s and reshape()s (as clever and important as they are), this is what's left:

call <- function(x) {
  # take x and make three learned linear projections of it
  q <- self$wq(x)
  k <- self$wk(x)
  v <- self$wv(x)

  # rotate q and k to inject position information, then cross q and k
  # to calculate an attention score for each token pairing
  scores <- (apply_rotary_embedding(q) %*% apply_rotary_embedding(k)) |>
    normalize_scores()

  # adjust the 3rd projection with the attention scores
  output <- scores %*% v

  self$wo(output) # one more learned linear projection for good luck
}

Returning to the transpose()s and reshape()s, you can observe that their purpose is to make it so that the attention calculations are performed across n_heads independent subspaces, rather than in a single larger space. The same reasoning drives this decision as that driving usage of depthwise-separable convolutions in image models. Empirically, for a fixed compute budget, factoring features into independent subspaces performs better than doing the same core operations in a single larger feature space. As with all things, there is a balance to strike between n_heads (the number of subspaces) and head_dim (the size of each subspace). The LLaMA authors have struck the balance like this at the various model sizes:

lapply(c("7B", "13B", "30B", "65B"), \(size) {
  p <- read_json(weights_path("{size}/params.json"))
  with(p, list(llama_size = size,
               n_heads = n_heads,
               head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()
# A tibble: 4 × 3
  llama_size n_heads head_dim
  <chr>        <int>    <int>
1 7B              32      128
2 13B             40      128
3 30B             52      128
4 65B             64      128

Next let's turn our attention to the causal attention mask.

make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}

The mask is a strictly upper triangular matrix filled with -Inf values. Adding the mask to the attention scores prevents the model from being able to "look ahead" and see the attention score for a token pairing it hasn't seen yet at a particular position in the sequence. This need for a mask is best thought of as a vestige from training, an apparatus that the model needed to learn with and now it can't function without. During training, gradients are calculated for predictions from all token positions in a sequence, including predictions of tokens where the correct answer is right there, as the very next token in the same sequence. The mask prevents the model from being able to cheat and look ahead into the future, something it won't be able to do once we're running it for inference.

make_mask(seqlen = 5L)
tf.Tensor(
[[[[  0. -inf -inf -inf -inf]
   [  0.   0. -inf -inf -inf]
   [  0.   0.   0. -inf -inf]
   [  0.   0.   0.   0. -inf]
   [  0.   0.   0.   0.   0.]]]], shape=(1, 1, 5, 5), dtype=float32)

Rotary Position Embedding

Next let's turn our attention to apply_rotary_embedding(). This core innovation was published by Su et al. (2022) in the paper titled "RoFormer: Enhanced Transformer with Rotary Position Embedding".

Some context:

  • The bare Attention() mechanism doesn't leave any possibility for a token's position in a sequence to affect the attention scores, since only token-pairs are scored. Attention treats its input like a bag-of-tokens.

  • The position of a token in a sequence is clearly important, and the attention layer should have access to that information.

  • The absolute position of a token in a sequence is less important than the relative position between tokens. (Especially so for long sequences).

Which leads us into the complex plane. If we imagine the features as complex numbers, we can rotate them, and we can calculate angles between them. From the RoFormer paper:

Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by amount of angle multiples of its position index and thus interprets the intuition behind Rotary Position Embedding

Expanding slightly: the rotation matrix is designed so that subsequently, after rotating our q and k token sequence embeddings the same way, the angle between token features is a function of the relative distance between those tokens in the token sequence. The relative angle between two tokens is invariant to the absolute position of those tokens in the full sequence.

In short, the rotation injects positional information. The meaning or interpretability of that positional information, or how it is meant to be used, or even extracted from the result of q %*% k, is left to the model to learn.
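
To make the invariance claim concrete, here is a tiny base-R sketch of ours (not part of the model): rotating two arbitrary unit-modulus "features" by angles proportional to their positions leaves the angle between them dependent only on the distance between those positions.

z <- complex(modulus = 1, argument = c(0.3, 1.2))  # two made-up "features"
rotate <- function(z, position) z * complex(modulus = 1, argument = position)

# angle between the features at positions 3 and 7 ...
Arg(rotate(z[1], 3) * Conj(rotate(z[2], 7)))
[1] 1.383185
# ... is identical to the angle at positions 103 and 107 (same relative distance)
Arg(rotate(z[1], 103) * Conj(rotate(z[2], 107)))
[1] 1.383185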

Here is the code:

apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()

}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
  }

view_as_complex <- function(x) {
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f);  xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}

As you can see, to imagine the embedding features as existing in the complex plane, we merely treat adjacent pairs of floats in the underlying array as the real and imaginary part of a complex number. We rotate the embeddings in the complex plane, then go back to imagining the features as existing in the real plane. Again, the job of interpreting the meaning of the features after rotation is left to the model to learn.
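
A quick round-trip check of ours confirms that view_as_real() exactly inverts view_as_complex() (no arithmetic happens, only a reinterpretation of the layout):

x <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(x == (x |> view_as_complex() |> view_as_real()))
tf.Tensor(True, shape=(), dtype=bool)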

We can quickly confirm that the rotary embeddings only rotate features and don't scale them:

near <- function (x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))
tf.Tensor(True, shape=(), dtype=bool)

There is one more trick to observe before moving on: because of some of the mathematical properties of the rotation matrix, it's possible to avoid doing a full complex multiply operation and still arrive at the same result. Also, since the rotation matrix never changes, it makes sense to only compute it once and cache it, like so:

precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L, # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads)  # head_size
)

apply_rotary_embedding_faster <- function(x) {

  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}
rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
    apply_rotary_embedding_faster(rand))
tf.Tensor(True, shape=(), dtype=bool)
apply_rotary_embedding <- apply_rotary_embedding_faster

Finally, note that the rotary positional embeddings are applied within each Attention layer. This is different from the original Transformer implementation, where a positional embedding was only added once at the head of the model. Similar to residual connections, you can think of the presence of these repeated injections of positional information as relieving the remaining trainable layers from the burden of allocating some of their weights to the task of "passing through" or "preserving" the positional information for later layers.

Positional embeddings are a rich subject that also comes up in other deep learning architectures, like denoising diffusion (Falbel and Keydana 2023), so time spent understanding them better is time well spent. For the purposes of this blog post we've covered the points needed and we'll move on to tying all the pieces together. To go deeper and develop a more mathematically informed understanding of RoPE, two excellent starting points are:

  • The original paper by Su et al. (2022)

  • The blog post "Rotary Embeddings: A Relative Revolution" by Biderman et al. (2021)

Tying it all together

With Tokenizer, Embedding, TransformerBlock (RMSNorm, Attention, FeedForward, and apply_rotary_embedding) all covered, it's time to tie all the pieces together into a Transformer model. We could do this using %py_class% like with the other layers above, but it's just as easy to move over to using the Keras functional API at this point.

layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings()  # instantiated earlier in the blog-post

for (block_id in seq_len0(params$n_layers)) {
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)
}


# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)

The input to the model is tokenized text and the output is the (unnormalized) probabilities for each token in tokenizer$vocab_size() being the next token in the sequence.

next_token_probs <- prompt %>%
  tokenizer$tokenize() %>%
  llama()

next_token_probs
tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00  1.3200411e+01 ...  4.8804146e-01
  -1.3277926e+00  9.9985600e-03]], shape=(1, 32000), dtype=float32)

Sampling strategies for selecting a token from the token logits is a rich topic, (also covered thoroughly in the Deep Learning with R book), but this blog post is long enough already. So for now, let's just take the argmax().

sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")

(next_token <- sampler(next_token_probs))
tf.Tensor([304], shape=(1), dtype=int32)
tokenizer$detokenize(next_token) |> as.character()
[1] "to"
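
As an aside (not used below), here is a minimal sketch of a temperature-based sampler, in case you later want something less deterministic than the argmax():

# assumes `logits` has shape (1, vocab_size), like next_token_probs above
sampler_temperature <- function(logits, temperature = 0.7) {
  tf$random$categorical(logits / temperature, num_samples = 1L,
                        dtype = "int32") |>
    tf$squeeze(axis = -1L) # shape (1), matching sampler() above
}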

Let's run it for a few tokens and let LLaMA finish the sentence:

prompt_tokens <- tokenizer$tokenize("The best way to attract bees")

for (i in 1:20) {

  next_token_probs <- prompt_tokens |> llama()
  next_token <- sampler(next_token_probs)

  prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }

  # end of sentence
  if (as.logical(next_token == tokenizer$string_to_id(".")))
    break
}

prompt_tokens |>
  tokenizer$detokenize() |>
  as.character() |>
  strwrap(60) |> writeLines()
The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.

Wrapping up

In this blog post we've walked through the LLaMA architecture implemented in R TensorFlow, including how to load pretrained weights, and then run the model to generate a sentence. Note, much of the code in this blog post is tailored for didactic purposes. While the implementation of the LLaMA architecture covered in this blog post is appropriate for training, there are a few modifications you'll want to make before doing a lot of text generation. Those include things like:

  • In the Attention layer, caching the k and v tensors. Then, after the first forward pass with the initial prompt, only feeding the model the one new token from the sampler(), rather than feeding the model all the tokens of the full prompt on each forward pass.

  • Only generating the causal mask make_mask() and rotary_matrix slices once per forward pass, instead of within each Attention call.

  • Updating the TransformerBlock to be cache-aware and to pass through the appropriate arguments to Attention()

  • Wrapping all the additional book-keeping logic in a custom TransformerDecoder() class.

The changes required to implement these optimizations for inference balloon the code size and are mostly about book-keeping, so we won't go through them in this blog post. However, you can find a fuller implementation of LLaMA in R Tensorflow, including a cache-aware generate() method that only feeds the model one token at a time during the main inference loop, (and compiles to XLA!), here.

That's all for now. Thanks for reading and happy travels to all exploring this exciting LLM terrain!

Biderman, Stella, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. 2021. “Rotary Embeddings: A Relative Revolution.” blog.eleuther.ai/rotary-embeddings/.

Falbel, Daniel, and Sigrid Keydana. 2023. "Posit AI Blog: De-noising Diffusion with Torch." https://blogs.rstudio.com/tensorflow/posts/2023-04-13-denoising-diffusion/.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. "Training Compute-Optimal Large Language Models." https://arxiv.org/abs/2203.15556.

Shazeer, Noam. 2020. "GLU Variants Improve Transformer." https://arxiv.org/abs/2002.05202.

Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022. "RoFormer: Enhanced Transformer with Rotary Position Embedding." https://arxiv.org/abs/2104.09864.

Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. "LLaMA: Open and Efficient Foundation Language Models." https://doi.org/10.48550/ARXIV.2302.13971.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." https://arxiv.org/abs/1706.03762.
