
Posit AI Blog: Hugging Face Integrations

We’re happy to announce that the first releases of hfhub and tok are now on CRAN. hfhub is an R interface to the Hugging Face Hub, allowing users to download and cache files from the Hugging Face Hub, while tok implements R bindings for the Hugging Face tokenizers library.

Hugging Face rapidly became the platform to build, share, and collaborate on deep learning applications, and we hope these integrations will help R users get started with Hugging Face tools as well as build novel applications.

We have also previously announced the safetensors package, which allows reading and writing files in the safetensors format.
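As a quick illustration, here is a minimal sketch of a save-and-load round trip (assuming torch tensors and the safetensors package’s safe_save_file() and safe_load_file() functions):

library(torch)
library(safetensors)

# Save a named list of torch tensors to a safetensors file ...
tensors <- list(weight = torch_randn(2, 3), bias = torch_zeros(3))
tmp <- tempfile(fileext = ".safetensors")
safe_save_file(tensors, tmp)

# ... and read it back as a named list of tensors.
restored <- safe_load_file(tmp)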

hfhub

hfhub is an R interface to the Hugging Face Hub. hfhub currently implements a single functionality: downloading files from Hub repositories. Model Hub repositories are mostly used to store pre-trained model weights together with any other metadata necessary to load the model, such as the hyperparameter configurations and the tokenizer vocabulary.

Downloaded files are cached using the same layout as the Python library, so cached files can be shared between the R and Python implementations, making it easier and faster to switch between languages.
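For instance, you could point both languages at a shared cache directory (a sketch, assuming hfhub honors the same HUGGINGFACE_HUB_CACHE environment variable as the Python library):

# Assumption: hfhub reads HUGGINGFACE_HUB_CACHE, like the Python library.
Sys.setenv(HUGGINGFACE_HUB_CACHE = "~/shared-hf-cache")
path <- hfhub::hub_download("gpt2", "config.json")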

We already use hfhub in the minhub package and in the ‘GPT-2 from scratch with torch’ blog post to download pre-trained weights from the Hugging Face Hub.

You can use hub_download() to download any file from a Hugging Face Hub repository by specifying the repository id and the path to the file that you want to download. If the file is already in the cache, the function returns the file path immediately; otherwise the file is downloaded, cached, and then its path is returned.

path <- hfhub::hub_download("gpt2", "model.safetensors")
path
#> /Users/dfalbel/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors
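Since the downloaded file is in the safetensors format, it can then be read with the safetensors package; a minimal sketch (the tensor names you get back depend on the model):

# Read the cached GPT-2 weights into a named list of tensors.
weights <- safetensors::safe_load_file(path)
head(names(weights))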

tok

Tokenizers are responsible for converting raw text into the sequence of integers that is often used as the input for NLP models, making them a critical component of NLP pipelines. If you want a higher-level overview of NLP pipelines, you might want to read our previous blog post ‘What are Large Language Models? What are they not?’.

When using a pre-trained model (whether for inference or for fine-tuning) it’s very important that you use the exact same tokenization process that was used during its training, and the Hugging Face team has done an amazing job making sure its algorithms match the tokenization strategies used by most LLMs.

tok provides R bindings to the tokenizers library. The tokenizers library is itself implemented in Rust for performance, and our bindings use the extendr project to help interface with R. Using tok we can tokenize text the exact same way most NLP models do, making it easier to load pre-trained models in R as well as to share our models with the broader NLP community.

tok can be installed from CRAN, and its usage is currently limited to loading tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT-2 model with:

tokenizer <- tok::tokenizer$from_pretrained("gpt2")
ids <- tokenizer$encode("Hello world! You can use tokenizers from R")$ids
ids
#> [1] 15496   995     0   921   460   779 11241 11341   422   371
tokenizer$decode(ids)
#> [1] "Hello world! You can use tokenizers from R"

Spaces

Remember that you can already host Shiny apps (for R and Python) on Hugging Face Spaces. As an example, we have built a Shiny app that uses the following components (a rough sketch of how they fit together appears after the list):

  • torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)

  • hfhub to download and cache pre-trained weights from the StableLM repository

  • tok to tokenize and pre-process text as input for the torch model. tok also uses hfhub to download the tokenizer’s vocabulary.
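In rough strokes, the generation path composes these pieces as follows (a simplified sketch: the repository id and weights file name are illustrative, and the GPT-NeoX model itself is what the app implements with torch):

library(hfhub)
library(tok)

repo <- "stabilityai/stablelm-tuned-alpha-7b"  # illustrative repository id

# tok downloads the tokenizer vocabulary (via hfhub) and encodes the prompt
tokenizer <- tok::tokenizer$from_pretrained(repo)
ids <- tokenizer$encode("What can you tell me about R?")$ids

# hfhub downloads and caches the pre-trained weights
# (file name is illustrative; large repositories often shard their weights)
weights_path <- hfhub::hub_download(repo, "model.safetensors")

# The torch GPT-NeoX implementation then loads the weights, consumes `ids`,
# and generates new token ids, which are turned back into text with
# tokenizer$decode().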

The app is hosted on this Space. It currently runs on CPU, but you can easily switch the Docker image if you want to run it on a GPU for faster inference.

The app source code is also open source and can be found in the Space’s Files tab.

Looking forward

These are the very early days of hfhub and tok and there’s still a lot of work to do and functionality to implement. We hope to get community help to prioritize work, so if there’s a feature that you’re missing, please open an issue in the GitHub repositories.


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/

BibTeX citation

@misc{hugging-face-integrations,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: Hugging Face Integrations},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/},
  year = {2023}
}