Tokenization is the root of suffering for LLMs, as you know. Surprisingly, even to me, I suggest it is not a problem at all! Here is why.
The full paper is available on my Google Drive.
The code is on GitHub.
(No "we made a huge improvement", no cherry-picking; I don't care about my own paper's citations.)
TLDR:
The idea was to encode character-level information into tokens so decoder Transformer models—while still working at the token level—can understand and solve character-specific tasks (e.g., the well-known 'strawberry' cases).
Surprising result: it doesn't help. Tokens do not seem to constrain language models in the way I expected.
The "Obvious" Tokenization Problem
If you’ve been following the field of LLMs, you’ve likely come across the idea that tokens are a flawed bottleneck for ML algorithms. This is a well-known issue, popularized by GPT-4’s famous 'strawberry' test.
In his neural network course, Andrej Karpathy highlights the limitations that tokenization imposes on LLMs.
But here’s the twist: My paper suggests that tokenization surprisingly doesn’t affect Transformers' ability to solve character-specific tasks.
The real bottleneck may lie elsewhere, such as:
- A severe underrepresentation of character-specific questions in the dataset.
- The overall low importance of character-level awareness for language modeling tasks.
LET ME EXPLAIN WHY!
Proposed Transformer Architecture
The original idea was to incorporate token character-awareness into the model to improve performance on character-specific tasks.
Here’s the architecture:
Figure 1 shows the standard encoding process. Multiple characters are usually combined into a single entity, a token. These tokens are passed through an embedding layer and mapped to fixed-size vectors. Then a positional encoding vector of the same size is added to each token embedding. This allows Transformers to see both the tokens and their positions in the text.
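To make Figure 1 concrete, here is a minimal PyTorch sketch of that standard input pipeline (learned positional embeddings and all the names here are my illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class TokenInputEmbedding(nn.Module):
    """Standard Transformer input: token embedding plus learned positional embedding."""

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)  # one vector per token id
        self.pos_emb = nn.Embedding(max_len, d_model)       # one vector per position

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer token indices
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Broadcasting adds the same positional vectors to every sequence in the batch
        return self.token_emb(token_ids) + self.pos_emb(positions)  # (batch, seq_len, d_model)
```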
Figure 2 shows my proposed mechanism for adding character-awareness without altering the overall architecture.
- How it works: An additional embedding vector represents the characters. An LSTM processes each character in a token sequentially. Its final hidden state creates a third type of embedding that encodes character-level information.
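Here is a minimal sketch of that character-aware branch, assuming the LSTM's final hidden state is simply summed with the token and positional embeddings; the combination rule, padding handling, and names are my assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CharAwareEmbedding(nn.Module):
    """Third embedding type: an LSTM summary of the characters inside each token."""

    def __init__(self, num_chars: int, d_model: int, pad_id: int = 0):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, d_model, padding_idx=pad_id)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len, max_chars) - character ids of each token, right-padded
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))   # (b*s, max_chars, d_model)
        _, (h_n, _) = self.lstm(chars)                    # final hidden state: (1, b*s, d_model)
        return h_n.squeeze(0).view(b, s, -1)              # (batch, seq_len, d_model)

# Assumed combination, keeping the rest of the architecture unchanged:
#   x = token_emb(token_ids) + pos_emb(positions) + char_aware_emb(char_ids)
```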
Hypothesis: This architecture should theoretically help with tasks like word spelling, character-level manipulations, etc.
Results
Pre-training phase:
As shown in Figure 3, the cross-entropy loss curves are similar for both architectures. Contrary to my expectations, no significant difference is observed during pre-training; I assumed the modified architecture would show at least some difference in language modeling, whether positive or negative.
Fine-tuning phase (on synthetic character-specific tasks):
Nothing strange, I thought to myself: the model probably doesn't need knowledge of characters to predict the next token in ordinary language modeling. But then I tested both models on synthetic character-specific tasks (a toy generator for these tasks is sketched after the list), such as:
- Reversing the order of letters in a word.
- Counting the number of specific letters in a word.
- Finding the first index of a specific letter in a word.
- Swapping a specific letter in a word with another.
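To show what these tasks look like in practice, here is a toy Python generator for the four formats (the exact prompts and templates in the paper may differ):

```python
import random

def make_examples(word: str) -> list[tuple[str, str]]:
    """Build (prompt, target) pairs for the four character-level tasks."""
    letter = random.choice(word)                       # letter to count / locate / swap
    replacement = random.choice("abcdefghijklmnopqrstuvwxyz")
    return [
        (f"Reverse the word '{word}':", word[::-1]),
        (f"How many '{letter}' are in '{word}'?", str(word.count(letter))),
        (f"First index of '{letter}' in '{word}':", str(word.index(letter))),
        (f"Replace '{letter}' with '{replacement}' in '{word}':",
         word.replace(letter, replacement)),
    ]

print(make_examples("strawberry"))
```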
The results in Figure 4 are clear: during fine-tuning, both models show the expected increase in language-modeling loss and a decrease in loss on the synthetic dataset. However, the loss values remain almost identical for both architectures. Why the heck did this happen?
My conclusion
Token-based models seem capable of learning the internal character structure of tokens. This information can be extracted from the training data when needed. Therefore, my character-aware embedding mechanism appears unnecessary.
That’s it! Full paper and code are available if you’re interested.
If you have any thoughts, I would love to read them in the comments. Thanks for your time!