In the world of data analysis and machine learning, data comes in all shapes and sizes.

**Categorical data** is one of the most common forms of data that you will encounter in your data science journey. It represents discrete, distinct categories or labels, and it's an essential part of many real-world datasets.

In this article, we will discuss the best techniques to encode categorical features in great detail along with their code implementations. We will also discuss the best practices and how to select the right encoding technique.

The objective of this article is to serve as a ready reference for whenever you wish to encode categorical features in your dataset.

Many machine learning algorithms require numerical input.

Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.

Categorical features are encoded based on their type and function. They can be broadly divided into two categories: **Nominal** and **Ordinal**.

### Nominal Categorical Features

Nominal features are those where the categories have no inherent order or ranking.

For example, the colors of cars (red, blue, green) are nominal because there's no natural order to them.

### Ordinal Categorical Features

Ordinal features are those where the categories have a meaningful order or rank.

Think of education levels (high school, bachelor's, master's, Ph.D.), for which there is a clear ranking.


Categorical data brings its own set of challenges when it comes to data analysis and machine learning. Here are some key challenges:

**Numerical Requirement**: Many machine learning algorithms require numerical input. Categorical data, being non-numeric, needs to be transformed into a numerical format for these algorithms to work.

**Curse of Dimensionality**: One-hot encoding, a common technique, can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.

**Multicollinearity**: In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.

**Data Sparsity**: When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.

The encoding techniques we will discuss today are listed below:

Label Encoding

One-hot Encoding

Binary Encoding

Ordinal Encoding

Frequency Encoding or Count Encoding

Target Encoding or Mean Encoding

Feature Hashing or Hashing Trick

Let us discuss each in detail.

## Label Encoding

Label encoding is one of the fundamental techniques for converting categorical data into a numerical format. It assigns numbers in increasing order to the labels in an ordinal categorical feature.

It is a simple yet effective method that assigns a unique integer to each category in a feature.

Imagine a feature 'Size' that has the following labels: 'Small', 'Medium', and 'Large'. This is an ordinal categorical feature as there is an inherent order in the labels.

We can encode these labels as follows:

Small 0

Medium 1

Large 2

Let us look at the code implementation for Label Encoding.

```python
# necessary imports
from sklearn.preprocessing import LabelEncoder

# Sample data
data = ["Small", "Medium", "Large", "Medium", "Small"]
print(data)  # Output: ['Small', 'Medium', 'Large', 'Medium', 'Small']

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  # Output: [2 1 0 1 2]
```

Label encoding is a suitable choice for:

Ordinal data or features with a clear and meaningful order.

Situations where you want to avoid increasing the dimensionality of the dataset.
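One caveat worth noting: scikit-learn's `LabelEncoder` assigns integers in *alphabetical* order of the labels, not by their intended rank, so 'Large' gets 0 and 'Small' gets 2. A small sketch showing the learned mapping and how `inverse_transform` recovers the original labels:

```python
from sklearn.preprocessing import LabelEncoder

data = ["Small", "Medium", "Large", "Medium", "Small"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(data)

# classes_ holds the labels in sorted (alphabetical) order,
# so the integer mapping is Large -> 0, Medium -> 1, Small -> 2
print(list(label_encoder.classes_))  # ['Large', 'Medium', 'Small']

# inverse_transform maps the integers back to the original labels
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))  # ['Small', 'Medium', 'Large', 'Medium', 'Small']
```

If you need a specific rank order (Small < Medium < Large), define the mapping explicitly instead, as shown in the Ordinal Encoding section later in this article.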

## One-Hot Encoding or Dummy Encoding

One-hot encoding, also popularly known as dummy encoding, is a widely used technique for converting categorical data into a numerical format.

It's particularly suitable for nominal categorical features, where the categories have no inherent order or ranking.

One-hot encoding transforms each label (or category) in a categorical feature into a binary column.

Each binary column corresponds to a specific category and indicates the presence (1) or absence (0) of that category in the original feature.

For example, consider a categorical feature "Color" with three labels: "Red," "Green," and "Blue." One-hot encoding would create three binary columns like this:

"Red" [1, 0, 0]

"Green" [0, 1, 0]

"Blue" [0, 0, 1]

Let us look at the code implementation for One-Hot Encoding.

```python
# necessary imports
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Perform one-hot encoding
encoded_data = pd.get_dummies(data, columns=['Color'])
```

The output will look like below:

The primary advantage of one-hot encoding is that it maintains the distinctiveness of labels and prevents any unintended ordinality.

Each label becomes a separate feature, and the presence or absence of a category is explicitly represented.

One-hot encoding is an appropriate choice when:

Dealing with nominal data with no meaningful order among labels.

Maintaining the distinction between categories (or labels) is crucial, and no ordinality must be introduced.

It handles absence explicitly: the absence of a category simply results in all zeros across the one-hot encoded columns.

#### Dummy Variable Trap 💡

Be aware of the "dummy variable trap," where multicollinearity can occur if one column can be predicted from the others.

To avoid this, you can safely drop one of the one-hot encoded columns, reducing the dimensionality by one. You can pass `drop_first=True` to the `get_dummies` function as shown below.

```python
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Perform one-hot encoding, dropping the first dummy column
encoded_data = pd.get_dummies(data, columns=['Color'], drop_first=True)
```

Output:

#### Curse of Dimensionality

One-hot encoding can lead to a high number of new columns (dimensions) in your dataset, which can increase computational complexity and storage requirements.

#### Multicollinearity

In one-hot encoding, the newly created binary columns can be correlated, which can be problematic for some models that assume independence between features.

#### Data Sparsity

When one-hot encoding is used, it can lead to sparse matrices, where most of the entries are zero. This can be memory-inefficient and affect model performance.
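To mitigate the sparsity issue, pandas can store one-hot columns as sparse arrays. A quick sketch (the `City` column and category names here are made up for illustration):

```python
import pandas as pd

# A made-up high-cardinality feature: 50 distinct cities, 100 rows each
cities = [f"City_{i}" for i in range(50)] * 100
data = pd.DataFrame({'City': cities})

# sparse=True stores the dummy columns as pandas SparseArrays,
# so the zero entries take almost no memory
sparse_encoded = pd.get_dummies(data, columns=['City'], sparse=True)
dense_encoded = pd.get_dummies(data, columns=['City'])

print(sparse_encoded.memory_usage(deep=True).sum())
print(dense_encoded.memory_usage(deep=True).sum())
```

The memory savings grow with cardinality: the sparser the dummy matrix, the bigger the gap between the two numbers.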

## Binary Encoding

Binary encoding is a versatile technique for encoding categorical features, especially when dealing with high-cardinality data.

It combines the benefits of one-hot and label encoding while reducing dimensionality.

Binary encoding works by converting each category into binary code and representing it as a sequence of binary digits (**0**s and **1**s).

Each binary digit is then placed in a separate column, effectively creating a set of binary columns for each category.

The encoding process is as follows:

Assign a unique integer to each category, similar to label encoding.

Convert the integer to binary code.

Create a set of binary columns to represent the binary code.

For example, consider a categorical feature "Country" with categories "USA," "Canada," and "UK."

Binary encoding would involve assigning unique integers to each country (e.g., "USA" -> 1, "Canada" -> 2, "UK" -> 3) and then converting these integers to binary code. The binary digits (0s and 1s) are then placed in separate binary columns:

"USA" 1 001 [0, 0, 1]

"Canada" 2 010 [0, 1, 0]

"UK" 3 011 [0, 1, 1]
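To make the three steps concrete, here is a minimal pure-pandas sketch of that table (the integer assignment and the 3-bit width are illustrative; the `category_encoders` implementation used elsewhere in this article handles all of this internally):

```python
import pandas as pd

data = pd.DataFrame({'Country': ['USA', 'Canada', 'UK', 'USA', 'UK']})

# Step 1: assign a unique integer to each category
mapping = {'USA': 1, 'Canada': 2, 'UK': 3}
codes = data['Country'].map(mapping)

# Step 2: convert each integer to a fixed-width binary string
n_bits = 3
bits = codes.apply(lambda c: format(c, f'0{n_bits}b'))

# Step 3: one column per binary digit
for i in range(n_bits):
    data[f'Country_{i}'] = bits.str[i].astype(int)

print(data)
```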

Let us go through an example in Python.

```python
# necessary imports
import category_encoders as ce
import pandas as pd

# Sample data
data = pd.DataFrame({'Country': ['USA', 'Canada', 'UK', 'USA', 'UK']})

# Initialize the binary encoder
encoder = ce.BinaryEncoder(cols=['Country'])

# Fit and transform the data
encoded_data = encoder.fit_transform(data)
```

The output is below:

It combines the advantages of both one-hot encoding and label encoding, efficiently converting categorical data into a binary format.

It is memory efficient and mitigates the curse of dimensionality.

Finally, it is easy to implement and interpret.

Binary encoding is a suitable choice when:

Dealing with high-cardinality categorical features (features with a large number of unique categories).

You want to reduce the dimensionality compared to one-hot encoding, especially for features with many unique categories.

## Ordinal Encoding

As the name suggests, Ordinal Encoding encodes the categories in an ordinal feature by mapping them to integer values in ascending order of rank.

The process of ordinal encoding involves mapping each category to a unique integer, typically based on their order or rank.

Consider an ordinal feature "Education Level" with categories: "High School," "Associate's Degree," "Bachelor's Degree," "Master's Degree," and "PhD".

Ordinal encoding will assign integer values as follows:

"High School" 0

"Associate's Degree" 1

"Bachelor's Degree" 2

"Master's Degree" 3

"PhD" 4

These integer values reflect the ordinal relationship between the education levels.

Here's how we implement Ordinal Encoding in Python.

```python
# necessary imports
import category_encoders as ce
import pandas as pd

# Sample data
data = pd.DataFrame({"Education Level": ["High School", "Bachelor's Degree",
                                         "Master's Degree", "PhD",
                                         "Associate's Degree"]})

# Define the ordinal encoding mapping
education_mapping = {
    'High School': 0,
    "Associate's Degree": 1,
    "Bachelor's Degree": 2,
    "Master's Degree": 3,
    'PhD': 4,
}

# Perform ordinal encoding
encoder = ce.OrdinalEncoder(mapping=[{'col': 'Education Level',
                                      'mapping': education_mapping}])
encoded_data = encoder.fit_transform(data)
```

Output:

It captures and preserves the ordinal relationships between categories, which can be valuable for certain types of analyses.

It reduces the dimensionality of the dataset compared to one-hot encoding.

It provides a numerical representation of the data, making it suitable for many machine learning algorithms.

Ordinal encoding is an appropriate choice when:

Dealing with categorical features that exhibit a clear and meaningful order or ranking.

Preserving the ordinal relationship among categories is essential for your analysis or model.

You want to convert the data into a numerical format while maintaining the inherent order of the categories.

## Frequency Encoding or Count Encoding

Frequency encoding, also known as count encoding, is a technique that encodes categorical features based on the frequency of each category in the dataset.

This method assigns each category a numerical value representing how often it occurs. It's a straightforward approach that can be effective in certain scenarios.

Categories that appear more frequently receive higher values, while less common categories receive lower values. This provides a numerical representation of the categories based on their prevalence.

The process involves mapping each category to its frequency or count within the dataset.

Consider a categorical feature "City" with categories "New York," "Los Angeles," "Chicago," and "San Francisco." If "New York" appears 50 times, "Los Angeles" 30 times, "Chicago" 20 times, and "San Francisco" 10 times, frequency encoding will assign values as follows:

"New York" 50

"Los Angeles" 30

"Chicago" 20

"San Francisco" 10

💡 NOTE

Frequency or Count Encoding is especially effective where the frequency of categories in a feature has a significant impact.

It should not be applied to ordinal categorical features.

The implementation here is pretty straightforward.

```python
# imports
import pandas as pd

# Sample data
data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                              'New York', 'Los Angeles', 'Chicago',
                              'Chicago', 'New York', 'New York']})

# Frequency encoding
frequency_encoding = data['City'].value_counts().to_dict()
data['Encoded_City'] = data['City'].map(frequency_encoding)
```

Output below:

Frequency encoding offers the following advantages:

It encodes categorical data in a straightforward and interpretable way, preserving the count information.

Particularly useful when the frequency of categories is a relevant feature for the problem you're solving.

It reduces dimensionality compared to one-hot encoding, which can be beneficial in high-cardinality scenarios.

Frequency encoding is an appropriate choice when:

Analyzing categorical features where the frequency of each category is relevant information for your model.

Reducing the dimensionality of the dataset compared to one-hot encoding while preserving the information about category frequency.
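A closely related variant, sketched below, uses relative frequencies (proportions) instead of raw counts, which keeps the encoded values on the same 0-1 scale regardless of dataset size:

```python
import pandas as pd

data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                              'New York', 'Los Angeles', 'Chicago',
                              'Chicago', 'New York', 'New York']})

# value_counts(normalize=True) returns proportions instead of raw counts
rel_freq = data['City'].value_counts(normalize=True).to_dict()
data['Encoded_City'] = data['City'].map(rel_freq)
print(data)
```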

## Target Encoding or Mean Encoding

Target encoding, also known as Mean Encoding, is a powerful technique used to encode categorical features when the target variable is categorical.

It assigns a numerical value to each category based on the mean of the target variable within that category.

Target encoding is particularly useful in classification problems. It captures how likely each category is to result in the target variable taking a specific value.

The process of target encoding involves mapping each category to the mean of the target variable for data points within that category. This encoding method provides a direct relationship between the categorical feature and the target variable.

Consider a categorical feature "Region" with categories "North," "South," "East," and "West." If we're dealing with a binary classification problem where the target variable is "Churn" (0 for no churn, 1 for churn), target encoding might assign values as follows:

"North" Mean of "Churn" for data points in the "North" category

"South" Mean of "Churn" for data points in the "South" category

"East" Mean of "Churn" for data points in the "East" category

"West" Mean of "Churn" for data points in the "West" category

Here's a Python code example for target encoding using the `category_encoders`

library:

```python
# imports
import category_encoders as ce
import pandas as pd

# Sample data
data = pd.DataFrame({'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
                     'Churn': [0, 1, 0, 1, 0, 1]})

# Perform target encoding
encoder = ce.TargetEncoder(cols=['Region'])
encoded_data = encoder.fit_transform(data, data['Churn'])
```

Output is shared below:

When using target encoding, consider the following best practices:

Be cautious about potential data leakage, as the mean of the target variable is used in the encoding process. Ensure you're not using information from the test or validation set when encoding.

Use cross-validation or other techniques to prevent overfitting and improve the robustness of target encoding.
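One common way to apply the cross-validation advice above is out-of-fold target encoding: each row is encoded using category means computed without that row's own fold. The data below is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import KFold

data = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'] * 5,
    'Churn':  [0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
               0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})

global_mean = data['Churn'].mean()
data['Region_te'] = 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(data):
    # Category means computed on the training fold only...
    fold_means = data.iloc[train_idx].groupby('Region')['Churn'].mean()
    # ...applied to the held-out fold, so no row is encoded using its
    # own target value (fall back to the global mean for categories
    # unseen in the training fold)
    data.loc[data.index[val_idx], 'Region_te'] = (
        data.iloc[val_idx]['Region'].map(fold_means).fillna(global_mean).values
    )

print(data.head())
```

This keeps each row's encoding free of its own target value, which is exactly the leakage the best practices above warn against.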

Target encoding offers several advantages:

It captures the relationship between the categorical feature and the target variable, making it useful in classification problems.

It provides a direct and interpretable way to encode categorical features.

It reduces dimensionality compared to one-hot encoding while preserving valuable information about category-specific behavior.

Target encoding is an appropriate choice when:

Working with categorical features and a categorical target variable in classification problems.

You want to capture the relationship between the categorical feature and the target variable, helping the model make predictions based on category-specific behavior.

## Feature Hashing or Hashing Trick

A rather under-appreciated encoding technique, Feature Hashing, also known as the Hashing Trick, is a method used to encode high-cardinality categorical features efficiently.

It works by applying a hash function to the categorical data, reducing the dimensionality of the feature while still providing a numerical representation.

💡 Feature hashing is particularly useful when dealing with large datasets with many unique categories.

The feature hashing process involves applying a hash function to the categorical data, which maps each category to a fixed number of numerical columns.

The hash function distributes the categories across these columns; because the number of columns is fixed, distinct categories may collide into the same column, trading a small amount of information loss for a large reduction in dimensionality.

Let's implement this in Python.

```python
import category_encoders as ce
import pandas as pd

# Sample data
data = pd.DataFrame({'Product Category': ['A', 'B', 'C', 'A', 'C',
                                          'D', 'E', 'D', 'C', 'A']})

# Perform feature hashing with three columns
encoder = ce.HashingEncoder(cols=['Product Category'], n_components=3)
encoded_data = encoder.fit_transform(data)
```

Output is shared below:

Feature hashing is an appropriate choice when:

Dealing with high-cardinality categorical features that have too many unique categories to handle using one-hot encoding or other techniques.

Reducing the dimensionality of the dataset while retaining the essential information from the categorical feature.

Memory and computational resources are limited, making it challenging to work with a high number of binary columns.
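One practical caveat to keep in mind is hash collisions: when there are more categories than hash columns, distinct categories inevitably end up sharing a column, which loses some information. A small sketch using scikit-learn's `FeatureHasher` (an alternative to `category_encoders`' `HashingEncoder`) makes this visible:

```python
from sklearn.feature_extraction import FeatureHasher

# Five categories hashed into only four columns: by the pigeonhole
# principle, at least two categories must collide
hasher = FeatureHasher(n_features=4, input_type='string')
categories = [['A'], ['B'], ['C'], ['D'], ['E']]
hashed = hasher.transform(categories).toarray()

for cat, row in zip(categories, hashed):
    print(cat[0], row)
```

Each category lands in exactly one column (with a value of +1 or -1, since `FeatureHasher` uses a signed hash). Because collisions are irreversible, choose the number of hash columns large enough that the collision rate is acceptable for your feature's cardinality.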


Here, we conclude the most useful encoding techniques for categorical variables for your data science and machine learning tasks.

Encoding data features is a crucial step in any machine learning pipeline and I hope that this article serves as a ready reference for all your upcoming projects.

Each technique has its strengths and is best suited for specific scenarios. Make sure to refer to the **"When to Use"** section for each encoding technique to apply the right feature encoding technique to your dataset.

The reference code is compiled for you in the notebook below.

Hope you enjoyed this!

Feel free to reach out for any queries or feedback below or on my socials.


**Linear Regression** is one of the most popular statistical models and machine learning algorithms, often considered the holy grail in the world of Data Science and Machine Learning.

It is one of the first (if not the first) algorithms that is taught in ML schools and courses alike.

However, one of the most important aspects that a lot of tutorials skip is that Linear Regression cannot be applied to all datasets alike. There are certain mandates that a dataset and its distribution must follow for Linear Regression to be successfully modeled to it.

These are popularly also known as the **Assumptions of Linear Regression**.

💡 Assumptions of Linear Regression model is a favorite interview question for the Data Scientist and Machine Learning Engineer positions.

In this article, we will not only list the different assumptions of a linear regression model but also discuss why they are so, and the rationale behind each of them.

The prerequisite for this discussion is a good understanding of the Linear Regression algorithm itself.

So let's go! 🚀

We know that the Linear Regression model aims at establishing the **best-fit line** between the dependent and independent features of a dataset as shown below.

*Figure:* `y = 3 + 5x + np.random.rand(100, 1)`

The Linear Regression model is defined as follows:

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon$$
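As a quick sketch, we can fit scikit-learn's `LinearRegression` to data generated by the same process as the figure above (`y = 3 + 5x + noise`) and check that the fitted coefficients are close to the true ones (the uniform noise shifts the intercept up by about 0.5 on average):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.random((100, 1))
# Same generating process as the figure: y = 3 + 5x + uniform noise
y = 3 + 5 * x + rng.random((100, 1))

model = LinearRegression().fit(x, y)
print(model.intercept_)  # close to 3.5 (3 plus the noise mean of 0.5)
print(model.coef_)       # close to 5
```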

Now, let us discuss the assumptions of the Linear Regression model.

The assumptions of Linear Regression are as follows:

**Linearity**

**Homoscedasticity or Constant Error Variance**

**Independent Error Terms or No Autocorrelation**

**Normality of Residuals**

**No or Negligible Multi-collinearity**

**Exogeneity**

💡 NOTE

Different sources and textbooks might list a different number of assumptions of a linear regression model. And they are all correct.

However,

the 6 assumptions that we will discuss today cover all of them. Many textbooks break individual assumptions into several separate ones and can therefore list about 10 different assumptions.

These assumptions can be understood as guidelines: a dataset that follows them is highly suitable for a Linear Regression model.

Alright! Let's discuss each of these assumptions in detail.

The first assumption, **Linearity**, essentially means that **there must be a linear relationship between the dependent and the independent features** of a dataset.

And this is fairly intuitive as the best-fit line of a linear regression model is a straight line, which is most suitable for linear data distribution.

Compare the two different distributions below:

**Data is linearly distributed***Figure:*`y = 3 + 5x + np.random.rand(100, 1)`

**Data is non-linearly distributed***Figure:*`y = 3 + 50x^2 + np.random.rand(100, 1)`

Comparing the two distributions, we can clearly see that the linear regression model is a better fit for the linear distribution.

Well, one way is to plot the data and detect it visually. However, in real-world scenarios, it may not be so simple to detect linearity in data.

The **Likelihood Ratio (LR) Test** is a good test for establishing linearity.

The second assumption of linear regression is **Homoscedasticity**.

It means that the residuals (or error terms) should have constant variance along the axis, in other words, the error terms must be evenly spread across the axis as shown below.

*Figure: The residuals for a linearly distributed dataset have constant variance.*

There are instances where the residuals are not evenly spread along the axis, and this condition is known as **Heteroscedasticity**. A few examples are shown below.

*Figure: Homoscedasticity vs Heteroscedasticity [**Source**]*

When there is Heteroscedasticity in the data, the standard errors cannot be relied upon, which violates the assumptions of Linear Regression.

Apart from visually detecting it, there are statistical tests for determining Heteroscedasticity, the popular ones are:

**Goldfeld-Quandt test**

**Breusch-Pagan test**

There are certain ways to remove Heteroscedasticity from your data, some of them are:

**White's standard errors**: Heteroscedasticity-robust standard errors that remain valid when the spread of the residuals varies; the downside is that the confidence in the coefficients of the independent features also decreases.

**Weighted least squares**: Updating the weights of independent features in the Linear Regression equation. This is a trial-and-error method that may lead to Homoscedasticity.

**Log transformations**: Many times a curved distribution can be converted into a linear distribution (i.e., a straight line) by simply applying the log function to it. Other transformations may work as well.

The third assumption states that each residual term is not related to the residual terms occurring before or after it. A good example of this is shown below.

*Figure: The residuals for a linearly distributed dataset are independent of each other.*

💡 NOTE

Autocorrelation is the relation of the data series with itself, where the error term of the next data record is related to the residual of the previous data record.

It is most often found in time-series data and not so prevalent in regular cross-sectional datasets. An example of a time-series distribution is shown below.

*Figure: Autocorrelation in time series data helps forecast future outcomes.*

Therefore, it is not something that you may encounter very often, however, if you do it is a violation of the assumptions of linear regression.

With autocorrelation in the data, the standard error of the output becomes unreliable.

There are a few tests for detecting autocorrelation in a dataset. Here are a few:

ACF & PACF plots

Durbin-Watson test
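The Durbin-Watson statistic is easy to compute with statsmodels: values near 2 indicate no autocorrelation, while values well below 2 indicate positive autocorrelation. A sketch on synthetic residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Independent residuals: statistic should land near 2
independent = rng.normal(size=500)

# Positively autocorrelated residuals (an AR(1) process):
# statistic should land well below 2
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_ar1 = durbin_watson(ar1)
print(dw_independent)
print(dw_ar1)
```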

This assumption states that the residuals (or errors) of the model must be normally distributed.

If the normality of errors is violated and the number of records is small, then the standard errors in output are affected. That impacts the best-fit line of the model.

💡 NOTE

This assumption generally is considered a weak assumption for Linear Regression models and slight (or greater) violations can be neglected while modeling. This is particularly true for large datasets.

There are multiple visual and statistical tests for detecting normality in error terms. Some of the popular ones are:

**Histogram**

*Figure: Residuals are normally distributed [**Source**]*

**Q-Q Plot**

*Figure: Q-Q plot for normally distributed errors [**Source**]*

**Shapiro-Wilk test**

**Kolmogorov-Smirnov test**

**Anderson-Darling test**
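As a quick sketch of the Shapiro-Wilk test with SciPy (null hypothesis: the sample is normally distributed, so a small p-value indicates non-normality), applied to one normal and one deliberately skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=300)       # normally distributed residuals
skewed_resid = rng.exponential(size=300)  # right-skewed residuals

pvals = []
for resid in (normal_resid, skewed_resid):
    stat, pval = stats.shapiro(resid)
    pvals.append(pval)
    print(f"Shapiro-Wilk p-value: {pval:.4g}")
```

The skewed sample should produce a tiny p-value, while the normal sample typically does not reject the null.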

As mentioned above, this is a weak assumption and can be neglected in many cases as well.

However, some ways to bring normality in residuals are:

Mathematical transformations like log transformations etc.

Standardization or normalization of the dataset

Adding more data reduces the need for normally distributed error terms

Multi-collinearity occurs when 2 or more features of a dataset are internally correlated with each other.

Consider a house price dataset with multiple variables about the property and price being the target variable. There is a high chance that the features 'floor area' and 'land dimensions' are highly correlated since the area is a direct multiple of individual dimensions.

Now, this is a problem for the regression model since what it effectively is trying to do is isolate the individual effects of each feature on the target variable. This is represented by the weights of each feature as shown below.

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon$$

Therefore, it is highly recommended to verify that there is no collinearity between individual features within a dataset.

It disturbs the best-fit line by impacting the individual coefficients of the variables, which then become unreliable.

There are a few ways to detect multicollinearity:

Calculating the correlation between each pair of features in the dataset.

Variance Inflation Factor (VIF)

There are also ways to handle multicollinearity:

Simply removing one of the correlated variables.

Merging them into a single feature can prevent multicollinearity.

**CAUTION!** Merging correlated features into a single feature will only work if the new feature has a real-world interpretation or impacts the target variable in the same way.

Exogeneity or no omitted variable bias is the final assumption on our list.

But let's first understand what omitted variable bias actually is.

If a variable that impacts the target variable has been omitted from the model, then there is omitted variable bias, or Endogeneity, in the model.

For example, consider the following model.

$$UsedCarPrice_i = \beta_0 + \beta_1(DistanceTravelled)_i + \epsilon_i$$

Here, the price of a used car is determined by the distance it has already covered. However, the year of manufacture impacts both the target variable (Y), the price of the used car, and the X variable, distance travelled: the older the car, the more likely it is to have travelled greater distances.

This is a clear case of omitted variable bias and it is undesirable for accurate modeling.

💡 NOTE

Exogeneity in a model tells us that all features that impact the target variable (Y) are part of the model features (X) and no other external feature can be further included.

So this was our discussion on the Assumptions of Linear Regression. This is one of the favorite questions of Data Scientist interviewers and now you know how to ace it!

Here is a quick summary of the same.

**Linearity**: There must be a linear relationship between the dependent and independent variables.

**Homoscedasticity or Constant Error Variance**: The variance of the errors is constant across all levels of the independent variables.

**Independent Error Terms or No Autocorrelation**: There is no correlation between the errors of the variables.

**Normality of Residuals**: The residuals or errors follow a normal distribution.

**No Multicollinearity**: There exists no correlation between the different independent variables.

**Exogeneity (No Endogeneity)**: There must be no relationship between the independent variables and the errors.

Keep this list handy when you prepare for your interviews.

Hope you enjoyed this! Feel free to leave your feedback and queries below.

Paraphrasing is a fundamental skill in effective communication. Whether you're a student, content creator, or professional writer, being able to rephrase information while preserving its essence is crucial.

With the rise of artificial intelligence (AI), transformer models have emerged as powerful tools for automating and enhancing the paraphrasing process.

As per Oxford, **Paraphrasing** means *"to express the meaning of (something written or spoken) using different words, especially to achieve greater clarity"*.

Let's look at the below example:

Original sentence: "The cat is sitting on the mat."

Paraphrased sentence: "The mat has a cat sitting on it."

Both sentences, while constructed differently, have a similar meaning and context. This is paraphrasing.

In this article, we will explore the world of effective & intelligent paraphrasing with transformer models. We'll dive into the underlying concepts of transformers and their advantages over conventional methods.

Additionally, we'll discuss popular transformer models such as BART, T5, and Pegasus that have been specifically designed for paraphrasing tasks.

By the end of this article, you'll have a comprehensive understanding of how transformer models are revolutionizing paraphrasing, and empowering individuals and industries with their transformative capabilities.

And more importantly, how you can build a nifty transformer for yourself.

Let's embark on this journey to unlock the power of AI in effective paraphrasing! 🚀

**NOTE**: This article is more focused on the applications and not theory, refer to this article to understand how transformers work internally.

In the realm of paraphrasing, transformer models offer significant advantages over traditional approaches.

Unlike previous methods that relied heavily on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers employ self-attention mechanisms.

This enables them to focus on relevant words and phrases, facilitating a deeper understanding of the underlying semantics.

With their ability to capture long-range dependencies and contextual information through attention mechanisms, transformers have revolutionized various language-related tasks, including paraphrasing.

Several popular transformer models have been specifically developed for paraphrasing tasks.

All these transformers can be found in the Huggingface Library. Let's explore:

BART is a powerful transformer model by Facebook AI.

It has been trained using denoising autoencoder objectives and is renowned for its ability to generate high-quality paraphrases.

BART has been trained extensively on large-scale datasets and excels in various NLP tasks, especially paraphrasing.

Source: https://huggingface.co/facebook/bart-base

T5, developed by Google Research, is a versatile transformer model pre-trained using a text-to-text framework.

While its primary focus is on a wide range of NLP tasks, including translation and summarization, T5 can also be fine-tuned for paraphrasing.

Source: https://huggingface.co/t5-base

Pegasus Paraphrase is specifically trained for executing paraphrasing tasks.

Built upon the Pegasus architecture (originally built for text summarization), it leverages the power of transformer models to generate accurate and contextually appropriate paraphrases.

Source: https://huggingface.co/tuner007/pegasus_paraphrase

Now let us look at how to paraphrase content with these special transformers and also compare their outputs.

Let's first paraphrase a sentence and then extend that to paraphrase long-form content, which is our main goal.

Let us paraphrase a few random sentences from modern English literature.

"She was a storm, not the kind you run from, but the kind you chase." - R.H. Sin, Whiskey Words & a Shovel III

"She wasn't looking for a knight, she was looking for a sword." - Atticus

"In the end, we only regret the chances we didn't take." - Unknown

"I dreamt I am running on sand in the night" - Yours truly ;)

"Long long ago, there lived a king and a queen. For a long time, they had no children." - Random text on the internet

"I am typing the best article on paraphrasing with Transformers." - You know who!

Here is the code to paraphrase the above random English sentences with BART.

```python
# imports
from transformers import BartTokenizer, BartForConditionalGeneration

# Load pre-trained BART model and tokenizer
model_name = 'facebook/bart-base'
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Set up input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()
```

Running the above code, we get the following output.

```
Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind that you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking at a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: In the end, we only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. For a long time, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am typing the best article on paraphrasing with Transformers.
```

We see that BART is not super effective at paraphrasing sentences. Let's try the next transformer.

Here is the code to paraphrase the above random English sentences with T5.

```python
# imports
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 Base model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Set up input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()
```

And here's the output.

```
Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking for a knight, she was looking for a sword.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We only regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night. I dreamt I am running on sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: Long long ago, there lived a king and a queen. Long long ago, they had no children.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: Today I am typing the best article on paraphrasing with Transformers.
```

As we can see, T5's output is a little different from BART's, but shows no significant improvement.

Finally, let's go over the code for Pegasus Paraphrase.

```python
# imports
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# load pre-trained Pegasus Paraphrase model and tokenizer
tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")

# input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers."
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(input_ids, num_beams=5, max_length=100, early_stopping=True)

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()
```

Here's the output.

```
Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She was looking for a sword, not a knight.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I ran on the sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: They had no children for a long time.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am writing the best article on the subject.
```

We can observe a significant improvement in the output with Pegasus Paraphrase.

Comparing the output of all three transformer models, we can definitively declare Pegasus Paraphrase as the winner.

With our testing out of the way, we've finalized Pegasus Paraphrase as our choice of transformer for this task.

Now let's see how we can paraphrase paragraphs and long chunks of texts with it.

Theoretically, there are three main ways to paraphrase whole paragraphs.

By default, the maximum input length for Pegasus Paraphrase is set to a certain number of tokens. If the input paragraph exceeds this limit, it might be truncated, leading to incomplete paraphrasing.

**1. Adjusting the input length**: Here we split the longer text into smaller chunks, run them through the model individually, and combine the paraphrased results afterward.

**2. Sliding window**: Here we take a fixed-size window and slide it over the input paragraph, generating paraphrases for each window. This way, we ensure that the entire paragraph is covered, albeit with overlapping segments.
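To make the sliding-window idea concrete, here's a small illustrative sketch (not from the original article) of how a fixed-size window of sentences could be slid over a paragraph with a one-sentence overlap; the function name and parameters are made up:

```python
# Illustrative sketch: slide a fixed-size window over a list of sentences.
# With window_size=3 and stride=2, consecutive windows overlap by one
# sentence, so each chunk keeps some context from its neighbour.
def sliding_windows(sentences, window_size=3, stride=2):
    """Return overlapping lists of sentences covering the whole input."""
    windows = []
    for start in range(0, len(sentences), stride):
        window = sentences[start:start + window_size]
        if window:
            windows.append(window)
        # stop once a window has reached the end of the input
        if start + window_size >= len(sentences):
            break
    return windows

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
for window in sliding_windows(sentences):
    print(" ".join(window))
```

Each window would then be paraphrased separately, and the overlapping outputs merged back together.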

**3. Tuning beam search**: Beam search is a decoding algorithm that helps generate diverse outputs from the model. By default, the model uses beam search with a beam width of 4. We can increase the beam width to encourage more exploration and potentially improve the quality of paraphrased outputs for longer texts.

If neither approach gives us satisfactory results, we can look at fine-tuning the model but that's for a different discussion.

In my research and experimentation, I've found that 'Adjusting the input length' gives us the best output. So let's go ahead and implement that.

For a view on challenges with other methods, take a look at the experimentation notebook here.

{insert link to notebook}

Let's paraphrase a paragraph from 'The Hound of the Baskervilles', one of the most popular *Sherlock Holmes* stories by *Sir Arthur Conan Doyle*.

"As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."

```python
# imports
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load the Pegasus Paraphrase model and tokenizer
model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# function to paraphrase long texts by adjusting the input length
def paraphrase_paragraph(text):
    # Split the text into sentences
    sentences = text.split(".")
    paraphrases = []
    for sentence in sentences:
        # Clean up sentences: remove extra whitespace
        sentence = sentence.strip()
        # filter out empty sentences
        if len(sentence) == 0:
            continue
        # Tokenize the sentence
        inputs = tokenizer.encode_plus(sentence, return_tensors="pt", truncation=True, max_length=512)
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
        # paraphrase
        paraphrase = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=4,
            max_length=100,
            early_stopping=True
        )[0]
        paraphrased_text = tokenizer.decode(paraphrase, skip_special_tokens=True)
        paraphrases.append(paraphrased_text)
    # Combine the paraphrases
    combined_paraphrase = " ".join(paraphrases)
    return combined_paraphrase

# Example usage
text = "As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."
paraphrase = paraphrase_paragraph(text)
print(paraphrase)
```

Here we've split the paragraph into smaller chunks (individual sentences), paraphrased each chunk, and then combined the individual outputs back into a paragraph.

And below is the output.

As Sir Henry and I sat at breakfast, the sunlight flooded in through the high windows, causing watery patches of color from the coats of arms. The dark panelling glowed like bronze in the golden rays, and it was hard to see that it was the chamber which had struck such a gloom into our souls the evening before. The evening before, Sir Henry's nerves were still handled and he came to breakfast, his cheeks flushed from the excitement of the early chase.

Throughout this article, we have explored the world of effective paraphrasing with transformer models, and saw how to build a practical paraphraser with Transformer models from Hugging Face.

Transformer models have brought about a paradigm shift in paraphrasing, empowering individuals and industries with their transformative capabilities. By harnessing the power of transformer models, we can unlock new possibilities in effective communication, content creation, academic writing, and language translation.

As the field of transformer-based paraphrasing continues to evolve, there are exciting opportunities for further exploration and adoption of these technologies.

Researchers and practitioners are encouraged to delve deeper into fine-tuning strategies, data augmentation techniques, and evaluation methodologies to advance the state-of-the-art in paraphrase generation.

Additionally, the ethical implications of using transformer models for paraphrasing should be considered. Careful attention should be given to biases and fairness to ensure equitable and responsible deployment of these technologies.

Let me know your thoughts and any feedback in the comments.

Until next time ... Ciao!

]]>If you've been using the `train_test_split` method from `sklearn` to create the train, test, and validation datasets, then I know your pain.

Splitting datasets into the test, train, and validation datasets

While `sklearn` certainly provides us with a way to achieve our objective, it is a long-drawn-out procedure: we have to repeat the process twice, adjusting the split ratio at every step.
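For reference, here's what that two-step procedure typically looks like; the toy DataFrame and the 60/20/20 split ratios below are invented purely for illustration:

```python
# The usual two-step sklearn approach: split once for the test set,
# then split the remainder again for the validation set.
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset: 100 rows, one feature, one binary target
df = pd.DataFrame({"feature": range(100), "target": [0, 1] * 50})
X, y = df.drop(columns="target"), df["target"]

# Step 1: carve out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: carve the validation set out of the remaining 80%;
# 0.25 of 80% yields a final 60/20/20 split
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```

Two calls, two ratio calculations, six variables to keep straight — exactly the pain described above.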

**But rejoice,** `fast_ml` **is here!**

It offers a straightforward and to-the-point method to achieve the three different datasets with a single line of code.

It is the `train_valid_test_split` method!

It not only splits the data as we require but also separates the dependent variable `y` from the independent variables `X` in the same line of code.

Let's check out how it's done (notebook)!

**Step 1:** Download the `fast_ml` library and import the necessary packages and methods.

**Step 2:** Load the dataset into a pandas data frame.

**Step 3:** Split the dataset

Once the data is loaded and ready to split, simply call the `train_valid_test_split` method and pass the dataset with the supporting parameters as below.

The datasets have been successfully split into train, test, and validation datasets. 🎉

💡 NOTE: The split datasets retain their original index; resetting it is an optional step.

You can now proceed with your modeling.

Thanks to the team at `fast_ml`, the long-drawn-out task of splitting our dataset into independent and dependent features and then into training, testing, and validation datasets has been condensed into a single line of code.

You can find this notebook here:

Let me know how you liked this quick article in the comments below, and feel free to reach out!

]]>Data is all around us, from the information we process every day to the data collected by businesses to make informed decisions.

Businesses today are thriving on the data that they have collected over the years. This data is then utilized intelligently to make informed business decisions.

But understanding the fundamentals of data itself and then utilizing it can be a daunting task, especially for beginners.

That's where this comprehensive guide comes in. We'll break down the concept of data at its most fundamental level, giving you the tools and techniques you need to handle it like a pro.

So let's dive in and make data easy!

💡 All information can essentially be classified as data.

It can come in multiple different forms, shapes, and sizes. It can be in the form of numbers, text, images, videos, and much more.

Defining data

By now we know that all information is data. And from our discussion on statistics, we also know that

💡 Data lies at the heart of any analytical solution. Therefore, without data, there is no statistic. And without statistics, there is no analysis.

The first and most crucial step of solving any problem, be it statistics, analytics, data science, machine learning, etc., is to understand the data at hand.

We spoke about the different forms of data. And there are a few different ways of classifying data, and each serves a specific purpose.

Let's go over the most popular types of data and see how they are classified.

The two major types of data are **Unstructured** and **Structured** data.

💡 As the name suggests, this type of data cannot be organized into a structure or a data model.

Some of the popular examples are images, heatmaps, videos, spatial data, graph data, text documents, etc.

Unstructured data is not easily identifiable or interpretable, either by humans or machines. Machines need special feature-extraction techniques to process this data.

By now, you must have realized that this type of data is a bad fit for traditional relational databases like SQL.

💡 On the other hand, this type of data can be organized into a structure (as the name suggests).

Structured data is more common in industrial settings, and its most widely used form is the 2-dimensional data structure, that is, the humble table, also known as **rectangular data**.

A simple table capturing exam results of different students

I'm sure you can think of enough use cases from your own life where you have used Excel spreadsheets to store some information.

Another popular example of rectangular data is the Titanic dataset [Source].

Structured data is further classified into a few different types of data. They are:

- Categorical data
- Numerical data

And even the above types of data can be classified further into different data types. Let's look at a complete breakdown before proceeding with each type of data.

Different types of data

💡 The type of data that can be categorized [Genius 🕵].

Now consider the dataset of students' exam results. Depending on the grade, all students with grades other than **F** are deemed to have passed the examination.

So we add another column with the passing status of each student. The column **Pass/Fail** has only one of two entries: it can be either a *Pass* or a *Fail*.

Exam results of different students

Similarly, there are many instances where the entry in each row is one of the few available options. For example:

- **Binary data**: *True* or *False*, *Yes* or *No*, *0* or *1*
- **Exam grades**: *A*, *B*, *C*, *D*, *E*, and *F*
- **Laptop brands**: Asus, Lenovo, Macbook, Dell, IBM, etc.

The entry for a categorical data record can only be one of the available options. For example, it can be either True or False, but not both.

Now there are 2 important types of categorical data as well, and they are:

💡 Type of categorical data that has no internal order or precedence amongst the different categories. The categories cannot be ranked one over the other.

For example, in binary data like True or False, Male or Female, one category is not more important than the other.

There can be some exceptions here as well; refer to the upcoming exercise section of this article for an explanation.

Another example would be subjects like English, Mathematics, Science, History, etc. As long as they carry equal weightage, one cannot have more importance than the other.

Here the categories can be ordered, and that order matters.

For example, grades in exam results can be ordered as *A*, *B*, *C*, *D*, *E*, & *F*, from higher rank to lower. Another simple ranking could be in the cloth sizes, which may range from *XS*, *S*, *M*, *L*, to *XL*.

SPOILER ALERT! 🤖 The knowledge of Nominal and Ordinal datatypes becomes very critical during encoding for machine learning problems.
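As a small taste of why that distinction matters, here's an illustrative pandas sketch (the grade values are made up): an *ordered* Categorical preserves the ranking of ordinal labels, which ordinal encoders rely on, while nominal labels have no such ranking.

```python
# Ordinal labels carry an order; encoding them with an ordered Categorical
# yields integer codes that respect that order (F lowest, A highest).
import pandas as pd

grades = pd.Series(["B", "A", "F", "C"])
ordered = pd.Categorical(
    grades,
    categories=["F", "E", "D", "C", "B", "A"],  # explicit low-to-high order
    ordered=True,
)
print(list(ordered.codes))  # [4, 5, 0, 3] — higher grade, higher code
```

For nominal data like car colors, no such ordering exists, which is why techniques like one-hot encoding are used instead.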

💡 While categorical data is discrete in nature, numerical data is continuous. It can take any numerical value.

It can be either integers or real numbers. For example, students' marks in exams, the speed of a car, the length of a video, height, weight, etc.

Again, there are two major types of Numerical data, and they are:

💡 When the data records can be counted & expressed only in whole numbers, it is called discrete data.

For example, the number of children in a class, the number of cars owned by a person, the number of working days in a month, and many more.

💡 When the data records can be infinite or expressed in real numbers to many decimal places, the data is known as continuous data.

For example, exact height, weight, & other such measurements cannot be recorded absolutely accurately in 2 decimal places.

Apart from the above, there are a few other important data types that you should know.

💡 Anything measured over time is time series.

For example, daily or monthly stock prices, daily weather, hourly sea level, speed of a vehicle at every minute, etc.

More often than not, time series is structured data.

An example of time-series data. Records of product demand, precipitation, & temperature over the years.
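A minimal illustrative sketch of what structured time-series data looks like in code (the readings and dates here are invented): values indexed by timestamps.

```python
# A tiny time series: daily temperature readings indexed by date,
# i.e. one value measured at each point in time.
import pandas as pd

temperature = pd.Series(
    [21.5, 22.1, 19.8, 20.4],
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
    name="daily_temperature_c",
)
print(temperature.index.min(), "→", temperature.index.max())
```

The timestamp index is what makes it a time series: ordering, resampling, and windowed calculations all hang off it.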

💡 Text data usually consists of documents containing words, sentences, and paragraphs of free-flowing text.

It can be in any language, and is mostly unstructured.

A good example is the product reviews on Amazon, which can be utilized for sentiment classification. Or email contents that enable machine learning algorithms to detect spam emails from the rest.

After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, & Promotions based on the text data in the email contents.

💡 This is the graphic or pictorial data like images or drawings.

This finds a great use case in object detection, self-driving cars, etc. It is a form of unstructured data as well.

Object detection from live footage.

💡 Any information recorded in the audio format is data.

Another popular format of data is audio, which is widely used in machine learning applications. Apps like **Shazam** are a great example of the same.

Be it a song, a speech, an audiobook, or any other information recorded in the audio format can be used as data.

Now that we have quickly covered so many different concepts, let's strengthen our understanding with these fun exercises.

Let's classify each variable in the Titanic dataset into its correct data type.


Titanic dataset [Source]

- *PassengerId*: The unique Id for each passenger. This is numerical data and is discrete.
- *Survived*: Passenger survived or not, 0 = No, 1 = Yes. This is categorical data and nominal.
- *Pclass*: This is the ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd. This is categorical data and ordinal.
- *Name*: Name of the passenger. This is text data.
- *Sex*: Gender of the passenger. This is binary categorical data and ordinal.

IMPORTANT: In some cases, even data records like sex or gender can be ordinal, and this is one such case. This is because the captain of the ship explicitly issued an order for women and children to be saved first. As a result, the survival rate for women was three times higher than for men [Source]. Therefore, while modeling, the algorithm can give a slightly higher preference to females while predicting the survival status.

Similarly, we can analyze the rest of the variables of this dataset. I will leave this exercise for you to complete.

Consider a video being streamed on YouTube. Multiple different data points are being recorded in real time simultaneously.

Some of them are video, audio, images, resolutions, time stamps, total people watching at each timestamp, likes, dislikes, text data from the continuous chat, number of comments, transactions, engagement, and much more.

Your task is to classify each feature being recorded into its correct class. Do share your observations in the comments section.

So let's summarize what we discussed today.

- Data is everywhere, and every recorded piece of information is data
- Data lies at the heart of any statistical analysis
- There are different types of data; a quick breakdown is below

Different types of data

Finally, after understanding the foundations and different types of data, let's understand how machines read and interpret data as opposed to humans.

Foundationally, no matter the type of data, machines can only ingest 0s and 1s. Therefore, to train a machine learning model on our data, we must convert it into 0s and 1s.

Be it images, text, audio, or any other data type, everything has to be converted into 0s and 1s (or numeric) before feeding it to the machine.

Converting image to 0s and 1s
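As a toy illustration of that conversion (the pixel values here are invented): a tiny grayscale "image" is already just a grid of numbers, and thresholding it turns it into pure 0s and 1s.

```python
# A 3x3 grayscale "image" as a NumPy array: each entry is a pixel
# brightness from 0 (black) to 255 (white).
import numpy as np

image = np.array([
    [ 12, 200,  34],
    [255,   8, 190],
    [ 77, 160,   5],
])

# Threshold at mid-gray: brighter pixels become 1, darker become 0.
binary = (image > 127).astype(int)
print(binary)
```

Real pipelines keep richer numeric representations than a hard 0/1 threshold, but the principle is the same: everything becomes numbers before the model sees it.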

NOTE: We will look at multiple examples in the coming discussions where we build machine-learning models on different types of data.

I am certain that this discussion will help you better understand your data at a more fundamental level, which will refine your analysis.

Feel free to share your feedback or queries in the comments below.

]]>