Paraphrase with Transformer Models like T5, BART, Pegasus - Ultimate Guide


Introduction

Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.

With the rise of artificial intelligence (AI), transformer models have emerged as powerful tools for automating and enhancing the paraphrasing process.

Understanding Paraphrasing

According to Oxford, to paraphrase is "to express the meaning of (something written or spoken) using different words, especially to achieve greater clarity".

Consider the following example:

Original sentence: "The cat is sitting on the mat."
Paraphrased sentence: "The mat has a cat sitting on it."

Although the two sentences are constructed differently, they carry the same meaning and context. This is paraphrasing.

What's inside

In this article, we will explore the world of effective & intelligent paraphrasing with transformer models. We'll dive into the underlying concepts of transformers and their advantages over conventional methods.

Additionally, we'll discuss popular transformer models such as BART, T5, and Pegasus, which can be used or fine-tuned for paraphrasing tasks.

By the end of this article, you'll have a comprehensive understanding of how transformer models are revolutionizing paraphrasing, and empowering individuals and industries with their transformative capabilities.

And more importantly, how you can build a nifty paraphraser for yourself.

Let's embark on this journey to unlock the power of AI in effective paraphrasing! 🚀

NOTE: This article focuses on applications rather than theory. Refer to this article to understand how transformers work internally.

Transformer Models for Paraphrasing

In the realm of paraphrasing, transformer models offer significant advantages over traditional approaches.

Unlike previous methods that relied heavily on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers employ self-attention mechanisms.

This enables them to focus on relevant words and phrases, facilitating a deeper understanding of the underlying semantics.

With their ability to capture long-range dependencies and contextual information through attention mechanisms, transformers have revolutionized various language-related tasks, including paraphrasing.
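To make the idea of self-attention a little more concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the token vectors are made-up numbers, and the queries, keys, and values are all the same vectors, whereas real transformers learn separate projections for each.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: attention-weighted mix of all token vectors
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only)
tokens = ["cat", "mat", "sitting"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of weights shows how strongly one token attends to every other token, regardless of how far apart they are in the sentence.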

Several popular transformer models can be applied to paraphrasing tasks.

All these transformers can be found on the Hugging Face Hub. Let's explore:

1. BART (Bidirectional and Auto-Regressive Transformers)

BART is a transformer model from Facebook AI.

It is pre-trained as a denoising autoencoder: the input text is corrupted and the model learns to reconstruct the original.

After fine-tuning, BART performs well on a range of text generation tasks, and it can be fine-tuned for paraphrasing.

Source: https://huggingface.co/facebook/bart-base
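As a rough illustration of what a denoising objective looks like, the sketch below corrupts a sentence by replacing a span of tokens with a single mask token. This is a toy version of the text-infilling idea only: the helper `mask_span` and the `<mask>` string are hypothetical names for this example, not BART's actual noising code, which works on subword tokens and samples span lengths randomly.

```python
def mask_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start + length] with a single mask token."""
    return tokens[:start] + [mask_token] + tokens[start + length:]

tokens = "The cat is sitting on the mat .".split()

# The model sees the corrupted text and is trained to output the original
corrupted = mask_span(tokens, start=2, length=2)
print("corrupted input:", " ".join(corrupted))
print("training target:", " ".join(tokens))
```

A model trained only this way learns to restore its input, which is worth keeping in mind when we look at the output of the base checkpoint later.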

2. T5 (Text-To-Text Transfer Transformer)

T5, developed by Google Research, is a versatile transformer model pre-trained using a text-to-text framework.

While its primary focus is on a wide range of NLP tasks, including translation and summarization, T5 can also be fine-tuned for paraphrasing.

Source: https://huggingface.co/t5-base
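In the text-to-text framework, the task is selected by a plain-text prefix on the input. The sketch below only formats such inputs and does not call the model. The "summarize" and "translate English to German" prefixes come from T5's original training setup; the "paraphrase" prefix is an assumption for illustration and would only be meaningful for a checkpoint fine-tuned with it.

```python
def to_text_to_text(task_prefix, text):
    """Format an input as '<task prefix>: <text>'."""
    return f"{task_prefix.strip().rstrip(':')}: {text.strip()}"

sentence = "In the end, we only regret the chances we didn't take."

# Prefixes used in T5's original multi-task training
print(to_text_to_text("summarize", sentence))
print(to_text_to_text("translate English to German", sentence))

# Hypothetical prefix: needs a model fine-tuned on paraphrase pairs
print(to_text_to_text("paraphrase", sentence))
```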

3. Pegasus Paraphrase

Pegasus Paraphrase is a Pegasus model fine-tuned specifically for paraphrasing.

Built upon the Pegasus architecture (originally built for text summarization), it leverages the power of transformer models to generate accurate and contextually appropriate paraphrases.

Source: https://huggingface.co/tuner007/pegasus_paraphrase

Paraphrasing with Transformers

Now let us look at how to paraphrase content with these transformers and compare their outputs.

Let's first paraphrase a sentence and then extend that to paraphrase long-form content, which is our main goal.

Paraphrasing a Sentence

Let us paraphrase a few random sentences from modern English literature.

"She was a storm, not the kind you run from, but the kind you chase." - R.H. Sin, Whiskey Words & a Shovel III

"She wasn't looking for a knight, she was looking for a sword." - Atticus

"In the end, we only regret the chances we didn't take." - Unknown

"I dreamt I am running on sand in the night" - Yours truly ;)

"Long long ago, there lived a king and a queen. For a long time, they had no children." - Random text on the internet

"I am typing the best article on paraphrasing with Transformers." - You know who!

BART

Here is the code to paraphrase the above sentences with BART.

# imports
from transformers import BartTokenizer, BartForConditionalGeneration

# Load pre-trained BART model and tokenizer
model_name = 'facebook/bart-base'
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

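# Set up input sentences (all six quotes listed above)
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]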

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

Running the above code, we get the following output.

Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind that you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking at a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: In the end, we only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. For a long time, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am typing the best article on paraphrasing with Transformers.

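BART barely changes the input: four of the six sentences come back verbatim, and the other two differ by a single word. This is expected, since facebook/bart-base is the pre-trained denoising checkpoint and has not been fine-tuned for paraphrasing. Let's try the next transformer.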

T5 (Text-to-Text Transfer Transformer)

Here is the code to paraphrase the above sentences with T5.

# imports
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 Base model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)
model = T5ForConditionalGeneration.from_pretrained("t5-base")

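# Set up input sentences (all six quotes listed above)
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]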

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

And here's the output.

Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking for a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. Long long ago, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: Today I am typing the best article on paraphrasing with Transformers.

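As we can see, T5's output differs slightly from BART's, but it is not a significant improvement, and on the fourth sentence it degenerates into repetition. This is not surprising: t5-base has not been fine-tuned for paraphrasing, and our input carries no task prefix.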

Pegasus Paraphrase

Finally, let's go over the code for Pegasus Paraphrase.

# imports
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# load pre-trained Pegasus Paraphrase model and tokenizer
tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")

# input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

Here's the output.

Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She was looking for a sword, not a knight.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I ran on the sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: They had no children for a long time.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am writing the best article on the subject.

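We can observe a significant improvement in the output with Pegasus Paraphrase: five of the six sentences are genuinely reworded. Note that it also tends to shorten its input, and in the king-and-queen example it drops the first sentence entirely.

Comparing the output of all three transformer models on these sentences, Pegasus Paraphrase is the clear winner.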
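If you'd like to put a rough number on "how much did the model actually change", one option is a word-level similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on one of the sentences above; the helper name `copy_ratio` is my own, and the score is only a crude proxy, since it says nothing about whether the meaning was preserved.

```python
from difflib import SequenceMatcher

def copy_ratio(original, paraphrase):
    """Word-level similarity in [0, 1]; 1.0 means a verbatim copy."""
    return SequenceMatcher(
        None, original.lower().split(), paraphrase.lower().split()
    ).ratio()

original = "She wasn't looking for a knight, she was looking for a sword."

# Outputs copied from the three runs above
outputs = {
    "BART": "She wasn't looking at a knight, she was looking for a sword.",
    "T5": "She wasn't looking for a knight, she was looking for a sword.",
    "Pegasus Paraphrase": "She was looking for a sword, not a knight.",
}

for name, paraphrase in outputs.items():
    print(f"{name}: {copy_ratio(original, paraphrase):.2f}")
```

A lower score means the output shares fewer word sequences with the original, so a good paraphraser should land well below 1.0 while still reading as the same statement.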

Paraphrasing a Paragraph

With our testing out of the way, we've finalized Pegasus Paraphrase as our choice of transformer for this task.

Now let's see how we can paraphrase paragraphs and longer texts with it.

There are three main ways to paraphrase whole paragraphs.

1. Adjusting the input length

Pegasus Paraphrase accepts only a limited number of input tokens. If the input paragraph exceeds this limit, it may be truncated, leading to incomplete paraphrasing.

To stay within the limit, we split the longer text into smaller chunks, run them through the model individually, and then combine the paraphrased results.

2. Use a sliding window approach

Here we take a fixed-sized window and slide it over the input paragraph, generating paraphrases for each window. This way, we ensure that the entire paragraph is covered, albeit with overlapping segments.
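Here is a minimal sketch of the windowing step on its own, using words instead of model tokens to keep it dependency-free. The function name and the `size` and `step` parameters are assumptions for illustration; each window would then be passed to the paraphrasing model.

```python
def sliding_windows(words, size, step):
    """Return overlapping windows of `size` words, advancing `step` words each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + size])
        # Stop once the current window reaches the end of the text
        if start + size >= len(words):
            break
        start += step
    return windows

text = "Long long ago, there lived a king and a queen. For a long time, they had no children."
for window in sliding_windows(text.split(), size=8, step=4):
    print(" ".join(window))
```

Because neighbouring windows overlap, the paraphrased windows will also repeat content, so the outputs need to be merged or de-duplicated afterwards.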

3. Increase the beam width

Beam search is a decoding algorithm that keeps the several highest-scoring candidate sequences at each step instead of committing to the single most likely next token.

The paragraph example later in this article uses a beam width of 4 (num_beams=4). We can try increasing the beam width to explore more candidates and potentially improve the quality of paraphrased outputs for longer texts.
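To see why the beam width matters, here is a toy beam search over a hand-written table of next-word probabilities. Everything in it is hypothetical: the `NEXT` table stands in for a real language model, and the numbers are chosen so that the greedy choice at the first step is not part of the best overall sequence.

```python
import math

# Hypothetical next-word probabilities for a toy "language model"
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "mat": 0.45},
    "a": {"cat": 0.9, "mat": 0.1},
    "cat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=5):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished, carry over
                continue
            for word, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

for width in (1, 2):
    score, tokens = beam_search(width)[0]
    print(width, " ".join(tokens), round(math.exp(score), 2))
```

With a width of 1 the search commits to "the" and never discovers that starting with "a" leads to a more probable sentence; with a width of 2 it does. A wider beam costs more compute per step, so it is a trade-off rather than a free improvement.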

If none of these approaches gives us satisfactory results, we can look at fine-tuning the model, but that's for a different discussion.

In my research and experimentation, I've found that 'Adjusting the input length' gives us the best output. So let's go ahead and implement that.

For a view on challenges with other methods, take a look at the experimentation notebook here.

{insert link to notebook}

Let's paraphrase a paragraph from 'The Hound of the Baskervilles', one of the most popular Sherlock Holmes stories by Sir Arthur Conan Doyle.

"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."

# imports
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load the Pegasus Paraphrase model and tokenizer
model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# function to paraphrase long texts by adjusting the input length
def paraphrase_paragraph(text):

    # Split the text into sentences
    sentences = text.split(".")
    paraphrases = []

    for sentence in sentences:
        # Clean up sentences

        # remove extra whitespace
        sentence = sentence.strip()

        # filter out empty sentences
        if len(sentence) == 0:
            continue

        # Tokenize the sentence
        inputs = tokenizer.encode_plus(sentence, return_tensors="pt", truncation=True, max_length=512)

        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        # paraphrase
        paraphrase = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=4,
            max_length=100,
            early_stopping=True
        )[0]
        paraphrased_text = tokenizer.decode(paraphrase, skip_special_tokens=True)

        paraphrases.append(paraphrased_text)

    # Combine the paraphrases
    combined_paraphrase = " ".join(paraphrases)

    return combined_paraphrase

# Example usage
text = "As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."
paraphrase = paraphrase_paragraph(text)
print(paraphrase)

Here we've split the paragraph into sentences, paraphrased each sentence, and then combined the individual outputs back into a paragraph.
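One caveat: splitting with text.split(".") throws away the full stop and also cuts inside decimals such as 3.14. A slightly more careful splitter, sketched below with the standard `re` module, splits only on whitespace that follows sentence-ending punctuation. The function name `split_sentences` is my own, and it is still a heuristic: abbreviations such as "R.H. Sin" will be split in the wrong place.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "Long long ago, there lived a king and a queen. For a long time, they had no children."
for sentence in split_sentences(sample):
    print(sentence)
```

You could swap this in for the text.split(".") line in paraphrase_paragraph, though the paraphrased output may then differ from the one shown below.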

And below is the output.

As Sir Henry and I sat at breakfast, the sunlight flooded in through the high windows, causing watery patches of color from the coats of arms. The dark panelling glowed like bronze in the golden rays, and it was hard to see that it was the chamber which had struck such a gloom into our souls the evening before. The evening before, Sir Henry's nerves were still handled and he came to breakfast, his cheeks flushed from the excitement of the early chase.

Concluding thoughts

Throughout this article, we have explored the world of effective paraphrasing with transformer models. We also saw how to build a paraphraser with transformer models from Hugging Face.

Transformer models have brought about a paradigm shift in paraphrasing, empowering individuals and industries with their transformative capabilities. By harnessing the power of transformer models, we can unlock new possibilities in effective communication, content creation, academic writing, and language translation.

As the field of transformer-based paraphrasing continues to evolve, there are exciting opportunities for further exploration and adoption of these technologies.

Researchers and practitioners are encouraged to delve deeper into fine-tuning strategies, data augmentation techniques, and evaluation methodologies to advance the state-of-the-art in paraphrase generation.

Additionally, the ethical implications of using transformer models for paraphrasing should be considered. Careful attention should be given to biases and fairness to ensure equitable and responsible deployment of these technologies.

Let me know your thoughts and any feedback in the comments.

Until next time ... Ciao!
