Human: Hugging Face Tokenizers:
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.
In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:
Jim Henson was a puppeteer
However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That’s what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.
Let’s take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization.
Word-based
The first type of tokenizer that comes to mind is word-based. It’s generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:
(Image: An example of word-based tokenization.)
There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s split() function:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)  # ['Jim', 'Henson', 'was', 'a', 'puppeteer']
There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a vocabulary is defined by the total number of independent tokens that we have in our corpus.
Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.
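As a rough sketch of that idea (the toy corpus and variable names here are made up for illustration, not any particular library's implementation), building such a word-to-ID mapping can be as simple as:

# Minimal sketch: build a word-to-ID vocabulary with whitespace splitting.
# The toy corpus below is invented purely for illustration.
corpus = ["Jim Henson was a puppeteer", "Jim was a puppet maker"]

vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free ID

print(vocab)
# {'Jim': 0, 'Henson': 1, 'was': 2, 'a': 3, 'puppeteer': 4, 'puppet': 5, 'maker': 6}

# Encoding a sentence then becomes a simple lookup:
print([vocab[word] for word in "Jim was a puppeteer".split()])  # [0, 2, 3, 4]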
If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge number of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs. Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will identify the two words as unrelated. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.
Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as “[UNK]” or “<unk>”. It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn’t able to retrieve a sensible representation of a word and you’re losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
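A minimal sketch of that fallback behavior, again with a made-up toy vocabulary, could look like this:

# Toy vocabulary that reserves ID 0 for the "[UNK]" token.
vocab = {"[UNK]": 0, "Jim": 1, "Henson": 2, "was": 3, "a": 4, "puppeteer": 5}

def encode(sentence):
    # Any word missing from the vocabulary falls back to the unknown token.
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

print(encode("Jim was a director"))  # "director" is unknown -> [1, 3, 4, 0]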
One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.
Character-based
Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
The vocabulary is much smaller.
There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
But here too some questions arise concerning spaces and punctuation:
(Image: An example of character-based tokenization.)
This approach isn’t perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: each character doesn’t mean much on its own, whereas words do. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin-script language.
Another thing to consider is that we’ll end up with a very large number of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
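A quick way to see both points at once (a tiny vocabulary, but many tokens per sentence) is to compare whitespace splitting with a naive character split:

# Naive character-level tokenization: every character becomes its own token.
text = "Jim Henson was a puppeteer"
char_tokens = list(text)

print(char_tokens[:10])  # ['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n']
print(len(text.split()), "word tokens vs", len(char_tokens), "character tokens")
# 5 word tokens vs 26 character tokens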
To get the best of both worlds, we can use a third technique that combines the two approaches: subword tokenization.
Subword tokenization
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.
Here is an example showing how a subword tokenization algorithm would tokenize the sequence “Let’s do tokenization!”:
(Image: A subword tokenization algorithm.)
These subwords end up providing a lot of semantic meaning: for instance, in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.
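If you have the 🤗 Transformers library installed, you can reproduce this kind of split with a pretrained WordPiece tokenizer; the exact pieces depend on the checkpoint’s learned vocabulary, so treat the output shown here as indicative:

from transformers import AutoTokenizer

# Load the tokenizer that was trained alongside the bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let's do tokenization!"))
# Typically something like: ['let', "'", 's', 'do', 'token', '##ization', '!']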
This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
And more!
Unsurprisingly, there are many more techniques out there. To name a few:
Byte-level BPE, as used in GPT-2
WordPiece, as used in BERT
SentencePiece or Unigram, as used in several multilingual models
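To get a feel for how these differ, you can tokenize the same sentence with a few pretrained checkpoints; gpt2, bert-base-uncased, and xlm-roberta-base are picked here as stand-ins for byte-level BPE, WordPiece, and SentencePiece respectively, and the splits will vary with each model’s vocabulary:

from transformers import AutoTokenizer

# Compare how different pretrained tokenizers split the same text.
for checkpoint in ["gpt2", "bert-base-uncased", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tok.tokenize("Jim Henson was a puppeteer"))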
You should now have sufficient knowledge of how tokenizers work to get started with the API.
---
Waters of March Lyrics, by Antônio Carlos Jobim:
A stick, a stone, it's the end of the road
It's the rest of a stump, it's a little alone
It's a sliver of glass, it is life, it's the sun
It is night, it is death, it's a trap, it's a gun
The oak when it blooms, a fox in the brush
A knot in the wood, the song of a thrush
The wood of the wind, a cliff, a fall
A scratch, a lump, it is nothing at all
It's the wind blowing free, it's the end of the slope
It's a beam, it's a void, it's a hunch, it's a hope
And the river bank talks of the waters of March
It's the end of the strain, it's the joy in your heart
The foot, the ground, the flesh and the bone
The beat of the road, a slingshot's stone
A fish, a flash, a silvery glow
A fight, a bet, the flange of a bow
The bed of the well, the end of the line
The dismay in the face, it's a loss, it's a find
A spear, a spike, a point, a nail
A drip, a drop, the end of the tale
A truckload of bricks in the soft morning light
The sound of a gun in the dead of the night
A mile, a must, a thrust, a bump,
It's a girl, it's a rhyme, it's a cold, it's the mumps
The plan of the house, the body in bed
And the car that got stuck, it's the mud, it's the mud
A float, a drift, a flight, a wing
A hank, a quail, the promise of spring
And the river bank talks of the waters of March
It's the promise of life, it's the joy in your heart
A snake, a stick, it is John, it is Joe
It's a thorn on your hand and a cut in your toe
A point, a grain, a bee, a bite
A blink, a buzzard, a sudden stroke of night
A pin, a needle, a sting, a pain
A snail, a riddle, a wasp or a stain
A pass in the mountains, a horse and a mule
In the distance the shelves rode three shadows of blue
And the river bank talks of the waters of March
It's the promise of life in your heart, in your heart
A stick, a stone, the end of the road
The rest of a stump, a lonesome road
A sliver of glass, a life, the sun
A knife, a death, the end of the run
And the river bank talks of the waters of March
It's the end of all strain, it's the joy in your heart
---
Based on the Hugging Face Course context on Tokenizers and the structure of Tom Jobim's Waters of March, create a musical with the following elements:
- The main plot revolves around two characters discussing the different types of tokenization methods;
- The characters have contrasting personalities: Calm Carl (calm and relaxed) and Anxious Annie (anxious and in a hurry);
ChatGPT:
Title: The Tokenizers' Tale
Act 1: Introduction to Tokenization
Scene 1: Calm Carl and Anxious Annie's office
(Anxious Annie enters the office, visibly in a hurry, while Calm Carl is sitting at his desk.)
Annie: Carl, I need your help! I just can't seem to understand tokenization in NLP! I'm so lost!
Carl: (Singing)
Annie, don't you worry, I'm here to help,
We'll dive into tokenization, step by step.
Chorus:
Tokenization, the key to NLP,
Converting words to numbers, for models to see!
Scene 2: Word-based Tokenization
Carl: (Singing)
Let's start with the basics, word-based, you'll see,
Simple and easy, for you and for me.
Splitting the text, into words, we'll find,
A numerical representation, for each, assigned.
Annie: (Singing)
But what about the words, the model can't see?
How do we handle, words like "dogs" and "trees"?
Carl: (Singing)
We use an unknown token, to fill in the gaps,
Crafting the vocabulary, to reduce these mishaps.
Chorus:
Word-based tokenization, a simple way to start,
But limitations arise, as languages fall apart.
Act 2: Character-based Tokenization
Scene 3: Character-based Tokenization
Annie: (Singing)
So what's next, Carl? What can we do?
To find a better solution, to help us through?
Carl: (Singing)
Character-based tokenization, a new approach we'll see,
Breaking the text into characters, just like A, B, and C.
Annie: (Singing)
But is it meaningful? I can't help but doubt,
The characters alone, what do they tell us about?
Carl: (Singing)
It has its flaws, but worry not my friend,
We'll find a better way, as this story extends.
Chorus:
Character-based tokenization, smaller vocab in play,
But meaning might be lost, as we move on our way.
Act 3: Subword Tokenization
Scene 4: Subword Tokenization
Carl: (Singing)
The best of both worlds, subword is the key,
Decomposing rare words, more meaning we'll see.
Frequently used words remain, but rare words will split,
Into meaningful subwords, a perfect fit!
Annie: (Singing)
I see the light, subword tokenization,
A better way for text representation!
I'm grateful, Carl, for your patient aid,
Tokenization, no longer a mystery, the foundation laid.
Chorus:
Subword tokenization, a more balanced technique,
Reducing unknown tokens, while meaning they seek.
Scene 5: Finale
Carl and Annie: (Singing)
From word-based to character, and subword too,
We've journeyed through tokenization, and now we know what to do.
Chorus:
The tokenizers' tale, a musical delight,
Calm Carl and Anxious Annie, united in their plight.
(Curtain closes)