Hugging Face Transformers

Language: Python

ML/AI

Hugging Face released Transformers in 2018 to simplify the use of transformer-based models like BERT, GPT, RoBERTa, and T5. The library provides easy access to pre-trained models and tokenizers, enabling developers and researchers to leverage powerful NLP models without extensive training or setup.

Transformers is a Python library developed by Hugging Face that provides state-of-the-art pre-trained models for Natural Language Processing (NLP) tasks such as text classification, translation, summarization, question answering, and more.

Installation

pip: pip install transformers
conda: conda install -c conda-forge transformers

Usage

Transformers provides pre-trained models that can be used directly for inference or fine-tuned on custom datasets. It supports PyTorch, TensorFlow, and JAX backends. The library also includes tokenizers, pipelines, and trainer APIs for streamlined workflows.
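
The multi-backend support means the same checkpoint can be loaded with either the PyTorch or the TensorFlow model classes, provided the corresponding framework is installed. A minimal sketch (the JAX/Flax classes follow the same pattern):

from transformers import AutoModel, TFAutoModel

# PyTorch model class (requires torch to be installed)
pt_model = AutoModel.from_pretrained('bert-base-uncased')

# TensorFlow model class for the same checkpoint (requires tensorflow)
tf_model = TFAutoModel.from_pretrained('bert-base-uncased')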

Text classification with pipeline

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love using Hugging Face Transformers!')
print(result)

Uses a pre-trained sentiment analysis model via a simple pipeline to classify the sentiment of text.
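
Pipelines also accept an explicit model identifier and a list of inputs; the checkpoint below is just an example of a publicly available sentiment model.

from transformers import pipeline

# Pin an explicit checkpoint instead of relying on the pipeline default
classifier = pipeline('sentiment-analysis',
                      model='distilbert-base-uncased-finetuned-sst-2-english')

# A list of texts returns one result per input
results = classifier(['I love this library!', 'This error message is confusing.'])
print(results)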

Tokenizing text

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer('Hello, Hugging Face!', return_tensors='pt')
print(inputs)

Uses a BERT tokenizer to convert text into token IDs suitable for model input.
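
To show what the token IDs feed into, the encoded inputs can be passed directly to a matching model; this minimal sketch loads the base BERT encoder and inspects the output shape.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('Hello, Hugging Face!', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One hidden-state vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)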

Fine-tuning a pre-trained model

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# train_dataset and eval_dataset are placeholders for tokenized datasets
# (see the dataset preparation sketch after this example)
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

Shows fine-tuning a pre-trained BERT model for a custom classification task using the Trainer API.
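
The snippet above assumes train_dataset and eval_dataset already exist. One common way to build them is with the separate datasets package; the sketch below uses the public imdb dataset and small subsets purely for illustration.

from datasets import load_dataset

raw = load_dataset('imdb')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# Tokenize every example; the Trainer drops unused columns automatically
tokenized = raw.map(tokenize, batched=True)

# Small subsets keep the example fast; use the full splits for real training
train_dataset = tokenized['train'].shuffle(seed=42).select(range(1000))
eval_dataset = tokenized['test'].shuffle(seed=42).select(range(500))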

Question answering

from transformers import pipeline
qa_pipeline = pipeline('question-answering')
context = 'Hugging Face is creating a Transformers library.'
question = 'Who is creating Transformers?'
result = qa_pipeline(question=question, context=context)
print(result)

Uses a pre-trained question-answering model to find answers in a given context.

Text generation

from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
result = generator('Once upon a time', max_length=50)
print(result)

Generates text continuations using a GPT-2 model.
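
Generation can be steered with sampling parameters that the pipeline forwards to the underlying generate method; the values below are arbitrary examples rather than recommended defaults.

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

# Sample two continuations with a moderate temperature
results = generator('Once upon a time',
                    max_length=50,
                    do_sample=True,
                    temperature=0.7,
                    num_return_sequences=2)
for r in results:
    print(r['generated_text'])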

Summarization

from transformers import pipeline
summarizer = pipeline('summarization')
text = 'Hugging Face Transformers provides thousands of pre-trained models to perform tasks on texts, images, and audio.'
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary)

Uses a summarization pipeline to produce a condensed version of the input text. Summarization is most useful on longer passages; very short inputs may trigger a warning that the input is shorter than max_length.

Error Handling

OSError: Model name 'xyz' was not found: Verify that the model identifier exists on the Hugging Face Hub and is spelled correctly; private or gated models also require authentication.
RuntimeError: CUDA out of memory: Reduce the batch size or move computations to the CPU if GPU memory is insufficient (see the sketch after this list).
ValueError: Shape mismatch: Ensure the tokenized inputs match the shape the model expects, for example by applying padding and truncation consistently.
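
For the CUDA out-of-memory case, a minimal sketch of the two usual remedies: run the pipeline on the CPU, or lower the per-device batch size when fine-tuning.

from transformers import pipeline, TrainingArguments

# device=-1 forces the pipeline onto the CPU; device=0 would select the first GPU
classifier = pipeline('sentiment-analysis', device=-1)

# A smaller batch size reduces peak GPU memory during training
training_args = TrainingArguments(output_dir='./results', per_device_train_batch_size=2)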

Best Practices

Use pipelines for quick inference without dealing with tokenization and model objects directly.

Fine-tune pre-trained models on custom datasets for better performance on specific tasks.

Leverage GPU acceleration for large models and batch inference.

Use the `AutoModel` and `AutoTokenizer` classes to easily switch between architectures.

Keep track of model versions to ensure reproducibility, for example by pinning a specific revision as sketched below.
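
One way to pin a model version is the revision argument of from_pretrained, which accepts a branch name, tag, or commit hash from the Hub; the hash below is a placeholder, not a real commit.

from transformers import AutoModel, AutoTokenizer

# Pin tokenizer and weights to a specific Hub revision for reproducibility
revision = 'abc123def456'  # placeholder commit hash; take a real one from the model page
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', revision=revision)
model = AutoModel.from_pretrained('bert-base-uncased', revision=revision)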