From Bag of Words to BERT: How NLP Learned to Understand Language
There's a question that sits at the heart of every NLP system ever built:
How do you turn human language — messy, ambiguous, context-dependent — into something a machine can compute with?
The answer to that question has changed dramatically over the past three decades. And each time it changed, NLP got dramatically more powerful.
This post walks through the three major eras of text representation: Sparse Classical methods, Dense Static Embeddings, and Contextual Embeddings. By the end, you'll understand not just what each approach does, but why it was invented and what fundamental limitation it solved.
The Big Picture: Three Eras of Representation
Before diving into each era, here's the arc of the whole story:
Raw Text ──▶ Numbers ──▶ Machine Understanding
The challenge is in the middle step. How you convert text to numbers defines what patterns a model can learn, what computations are possible, and how much language it can truly "understand."
| Era | Example Methods | Key Idea | Limitation |
|---|---|---|---|
| Sparse Classical | BoW, TF-IDF, N-grams | Count words | No semantics, no order, huge vectors |
| Dense Static | Word2Vec, GloVe, FastText | Learn meaning from context | One embedding per word, no polysemy |
| Contextual | ELMo, BERT, GPT | Meaning depends on context | Expensive, needs large data |
Era 1 — Classic (Sparse) Representations
The Core Idea: Count What's There
The earliest approach to text representation was brutally simple: build a vocabulary, then for each document, count how many times each word appears.
This gives you a vector — one dimension per word in your vocabulary. Most values are zero (most documents don't contain most words), hence the name sparse.
One-Hot Encoding
The simplest possible representation. Each word gets a unique integer ID and is represented as a vector that is all zeros except for a 1 at that ID's position.
vocab = {"cat": 0, "sat": 1, "on": 2, "the": 3, "mat": 4}
# "cat" → [1, 0, 0, 0, 0]
# "mat" → [0, 0, 0, 0, 1]
# "cat" + "mat" → [1, 0, 0, 0, 1] (just added together)
The fatal flaw: Every word is equidistant from every other word. cat and kitten are just as "different" as cat and democracy. There's no notion of semantic similarity.
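This equidistance is easy to verify numerically with the toy vocabulary above — a quick sketch using NumPy:

```python
import numpy as np

# One-hot vectors from the toy vocabulary above
cat = np.array([1, 0, 0, 0, 0])
the = np.array([0, 0, 0, 1, 0])
mat = np.array([0, 0, 0, 0, 1])

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart...
print(np.linalg.norm(cat - mat))  # 1.4142...
print(np.linalg.norm(cat - the))  # 1.4142...

# ...and every dot product (similarity) is exactly zero
print(cat @ mat, cat @ the)  # 0 0
```

No matter which two words you pick, the geometry tells you nothing about their meaning.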
Bag of Words (BoW)
Extend one-hot to whole documents. Ignore word order entirely — just count each word's frequency in the document.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"I love machine learning",
"machine learning is powerful",
"I love deep learning too"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['deep', 'is', 'learning', 'love', 'machine', 'powerful', 'too']
# (note: "I" is absent — the default tokenizer drops single-character tokens)
print(X.toarray())
# [[0, 0, 1, 1, 1, 0, 0],
#  [0, 1, 1, 0, 1, 1, 0],
#  [1, 0, 1, 1, 0, 0, 1]]
"I love machine learning" and "machine learning I love" produce identical vectors — BoW has no memory of sequence.
TF-IDF: Smarter Counting
Raw counts have a problem: common words like "the", "is", "and" dominate every document. They're not informative.
TF-IDF (Term Frequency - Inverse Document Frequency) reweights counts to reward words that are frequent in a document but rare across all documents.
TF(t, d) = count of term t in document d / total terms in d
IDF(t) = log( N / df(t) ) where N = total docs, df = docs containing t
TF-IDF(t, d) = TF(t, d) × IDF(t)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"the cat sat on the mat",
"the dog sat on the log",
"cats and dogs are great pets"
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# "the" → low score (appears in all docs)
# "mat" → high score (unique to doc 1)
# "cats" → high score (unique to doc 3)
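To see the reweighting concretely, here is the IDF computed by hand for the corpus above, using scikit-learn's default smoothed variant `idf = ln((1 + N) / (1 + df)) + 1` (a sketch; the constants differ slightly from the textbook formula shown earlier):

```python
import math

N = 3  # total documents in the corpus above

def smoothed_idf(df, N):
    # scikit-learn's default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + N) / (1 + df)) + 1

idf_the = smoothed_idf(df=2, N=N)  # "the" appears in 2 of the 3 docs
idf_mat = smoothed_idf(df=1, N=N)  # "mat" appears in only 1 doc

print(f"idf('the') = {idf_the:.3f}")  # 1.288 — common word, down-weighted
print(f"idf('mat') = {idf_mat:.3f}")  # 1.693 — rare word, up-weighted
```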
N-Grams: Capturing Local Order
BoW discards all word order. N-grams partially restore it by treating sequences of N words as single tokens.
# Unigrams (N=1): ["machine", "learning", "is", "powerful"]
# Bigrams (N=2): ["machine learning", "learning is", "is powerful"]
# Trigrams (N=3): ["machine learning is", "learning is powerful"]
vectorizer = CountVectorizer(ngram_range=(1, 2)) # unigrams + bigrams
Bigrams let you distinguish "not good" from "good" — something unigrams can't do.
Where Sparse Representations Shine (and Fail)
Still excellent for:
- Document classification with small datasets
- Keyword-based search and retrieval
- Spam detection, simple sentiment analysis
- Interpretability (you can explain every feature)
Fundamental limitations:
- Vocabulary size explosion — a real corpus has 100k+ words; add n-grams and you're in the millions
- Synonym blindness — "automobile" and "car" are completely different dimensions
- No semantic structure — the geometry of the vector space carries no meaning
- Out-of-vocabulary (OOV) — unseen words simply don't exist
The moment you need to understand meaning, sparse methods hit a wall.
Era 2 — Dense Static Embeddings
The Distributional Hypothesis
The key insight that unlocked the next era came from linguistics in the 1950s:
"You shall know a word by the company it keeps." — J.R. Firth, 1957
Words that appear in similar contexts tend to have similar meanings. "Dog" and "cat" both appear near "pet", "feed", "fluffy", "vet". "Bank" (financial) appears near "money", "loan", "interest" — very different contexts from "bank" (river).
If we train a model to predict context words, it will naturally learn to give similar representations to semantically similar words.
Word2Vec: The Breakthrough
In 2013, a team at Google (Mikolov et al.) published Word2Vec — a simple neural network that learned dense word embeddings by predicting words from their context (or context from words).
Two architectures:
CBOW (Continuous Bag of Words) — predict the center word from surrounding context:
Context: ["The", "_?_", "sat", "on", "the", "mat"]
Target: "cat"
Skip-gram — predict surrounding words from the center word:
Input: "cat"
Target: ["The", "sat", "on", "the", "mat"]
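Before any neural network is involved, the corpus is turned into (center, context) training pairs. A sketch of that pair generation — `skipgram_pairs` is a hypothetical helper for illustration, not part of gensim:

```python
def skipgram_pairs(tokens, window=2):
    # Pair each center word with every word within `window` positions of it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=2)
print(pairs[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```

The skip-gram model is then trained to make each context word likely given its center word.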
from gensim.models import Word2Vec
sentences = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "ran", "in", "the", "park"],
["cats", "and", "dogs", "are", "great", "pets"],
# ... millions more sentences
]
model = Word2Vec(
sentences,
vector_size=300, # embedding dimension
window=5, # context window
min_count=5, # ignore rare words
sg=1 # 1 = skip-gram, 0 = CBOW
)
# Each word is now a dense 300-dimensional vector
cat_vec = model.wv["cat"] # shape: (300,)
king_vec = model.wv["king"] # shape: (300,)
# Magic: semantic arithmetic works!
result = model.wv.most_similar(
positive=["king", "woman"],
negative=["man"]
)
# → [("queen", 0.85), ("princess", 0.72), ...]
The Geometry of Meaning
The most jaw-dropping property of Word2Vec embeddings: semantic relationships become geometric relationships.
king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walked − walk + run ≈ ran
This means the vector space has actual structure. Directions in this 300-dimensional space correspond to human-interpretable concepts like gender, tense, geography.
# Similarity between words = cosine similarity between vectors
from numpy import dot
from numpy.linalg import norm
def cosine_similarity(a, b):
return dot(a, b) / (norm(a) * norm(b))
sim = cosine_similarity(model.wv["cat"], model.wv["kitten"])
# → 0.84 (high similarity, as expected)
sim2 = cosine_similarity(model.wv["cat"], model.wv["democracy"])
# → 0.12 (very different, as expected)
GloVe: Global Statistics
Word2Vec learns from local windows. GloVe (Global Vectors, Stanford 2014) takes a different approach: factorize the entire word co-occurrence matrix across the whole corpus.
# GloVe gives you pre-trained embeddings — just load and use
import numpy as np
def load_glove(path):
embeddings = {}
with open(path) as f:
for line in f:
values = line.split()
word = values[0]
vector = np.array(values[1:], dtype='float32')
embeddings[word] = vector
return embeddings
glove = load_glove("glove.6B.300d.txt")
# 6 billion tokens, 300-dimensional vectors
GloVe and Word2Vec produce qualitatively similar results but with different computational trade-offs.
FastText: Subword Magic
Word2Vec and GloVe break on morphologically rich languages and on unseen words. "Running", "runner", "runs" all get completely separate embeddings with no shared information.
FastText (Facebook 2016) represents each word as a bag of character n-grams:
"running" → {"run", "runn", "runni", "unnin", "nning", "ning", "ing", ...}
+ the whole word token "running"
The word embedding is the average of its subword embeddings.
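Here is a sketch of the n-gram extraction itself (FastText's default range is n = 3 to 6, and the `<`/`>` boundary markers are part of the real algorithm; `char_ngrams` is a hypothetical helper):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # FastText wraps the word in boundary markers before slicing
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("run"))
# ['<ru', 'run', 'un>', '<run', 'run>', '<run>']
```

An unseen word like "runling" still shares most of its n-grams with "running", which is exactly why FastText can embed words it never saw in training.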
from gensim.models import FastText
model = FastText(sentences, vector_size=300, window=5, min_count=1)
# Can now handle OOV words!
vec = model.wv["supercalifragilistic"] # No error! Built from character n-grams
This makes FastText particularly powerful for:
- Misspelled words ("recieve" → still works)
- Morphologically rich languages (Finnish, Turkish, German)
- Technical domains with many compound words
The Shared Limitation: One Word, One Embedding
All of these methods — Word2Vec, GloVe, FastText — share a critical flaw.
Every word gets exactly one embedding, regardless of context.
Consider the word "bank":
"I deposited money in the bank."
"The fishing spot by the river bank was perfect."
"He had to bank the plane sharply to avoid the storm."
Three completely different meanings. But all three get the same vector. The static embedding ends up as a confused average of all the word's meanings, weighted by how often each sense appears in the training corpus.
For many tasks this is fine. For nuanced language understanding, it's a showstopper.
Era 3 — Contextual Embeddings
The Paradigm Shift
What if instead of a fixed embedding per word, we computed embeddings dynamically, based on the full sentence context?
This is the central idea of contextual embeddings. The same word "bank" should have a different vector in a finance sentence than in a geography sentence.
Static: "bank" → [0.32, -0.11, 0.78, ...] (always the same)
Contextual: "I went to the bank to deposit money."
bank → [0.91, 0.23, -0.45, ...] (finance context)
"She sat on the river bank."
bank → [-0.33, 0.87, 0.12, ...] (geography context)
ELMo: The First Breakthrough
ELMo (Embeddings from Language Models, 2018) was the first major system to produce contextual embeddings. It trained deep forward and backward LSTM language models on large text, then used their combined hidden states as embeddings.
import tensorflow_hub as hub
import tensorflow as tf
# Load pre-trained ELMo
elmo = hub.load("https://tfhub.dev/google/elmo/3")
sentences = [
"I went to the bank to withdraw money",
"The heron stood by the river bank"
]
# Same word "bank" → different embeddings depending on context
embeddings = elmo.signatures["default"](
tf.constant(sentences)
)["elmo"]
# Shape: (2, max_len, 1024)
# The vector for "bank" differs significantly between sentences
ELMo showed that language model pretraining — training on massive text to predict the next word — naturally develops rich linguistic knowledge as a side effect.
The Transformer: The Architecture That Changed Everything
Before we get to BERT, we need to understand the engine underneath it: the Transformer (Vaswani et al., 2017, "Attention Is All You Need").
The key innovation is self-attention: every token looks at every other token and decides how much to "attend" to it when forming its representation.
import torch
import torch.nn.functional as F
def self_attention(Q, K, V, d_k):
"""
Q, K, V: Query, Key, Value matrices
d_k: dimension of keys (for scaling)
"""
# Compute attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
# Convert to probabilities
weights = F.softmax(scores, dim=-1)
# Weighted sum of values
output = torch.matmul(weights, V)
return output, weights
For "The cat sat on the mat because it was tired":
- When encoding "it", the attention mechanism looks at the whole sentence
- High attention weight between "it" and "cat" — the model learns "it" refers to the cat
- This coreference resolution emerges from training, not hard-coded rules
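To build intuition without a GPU or a framework, the same scaled dot-product attention can be re-implemented in a few lines of NumPy (a minimal sketch mirroring the PyTorch function above):

```python
import numpy as np

def self_attention_np(Q, K, V, d_k):
    # Scaled dot-product attention: scores → softmax → weighted sum of values
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = rng.normal(size=(3, seq_len, d_k))

output, weights = self_attention_np(Q, K, V, d_k)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # every row sums to 1 — a proper distribution
```

Each output row is a context-dependent mixture of all the value vectors — which is precisely what makes the resulting token representations contextual.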
BERT: Bidirectional Transformers
BERT (Bidirectional Encoder Representations from Transformers, Google 2018) took the Transformer encoder and trained it at scale with two pretraining tasks:
1. Masked Language Model (MLM):
Input: "The cat [MASK] on the mat"
Target: "sat"
Input: "The [MASK] sat on the mat"
Target: "cat"
2. Next Sentence Prediction (NSP):
Sentence A: "The cat sat on the mat."
Sentence B: "It enjoyed the warmth." → IsNext: True
Sentence A: "The cat sat on the mat."
Sentence B: "The stock market crashed." → IsNext: False
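The MLM objective is largely a data-preparation trick. Here is a sketch of the masking scheme — per the BERT paper, 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged (`mask_tokens` is a hypothetical helper, not the library's implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)           # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))   # 10%: random token
            else:
                masked.append(tok)                 # 10%: keep unchanged
        else:
            labels.append(None)          # not part of the loss
            masked.append(tok)
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5)
print(masked)
print(labels)
```

The random/unchanged cases matter: they force the model to build a good representation for every position, not just the ones showing a literal [MASK].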
BERT was trained on 3.3 billion words (Wikipedia + BookCorpus), producing a 12-layer Transformer with 110 million parameters.
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Same word "bank" in two different contexts
sentences = [
"I deposited my paycheck at the bank.",
"We had a picnic on the river bank."
]
bank_embeddings = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state: shape (1, seq_len, 768)
    # Each token gets its own 768-dim contextual embedding
    hidden_states = outputs.last_hidden_state
    # Find the index of the "bank" token
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    bank_idx = tokens.index('bank')
    bank_embeddings.append(hidden_states[0, bank_idx, :])  # shape: (768,)

# Compare the two contextual vectors for "bank"
cos = torch.nn.functional.cosine_similarity(
    bank_embeddings[0], bank_embeddings[1], dim=0
)
print(f"cosine similarity between the two 'bank' vectors: {cos:.3f}")
# Noticeably below 1.0 — the same word gets a different vector in each context
Fine-tuning BERT for Downstream Tasks
BERT's real power is transfer learning. Pretrain once on massive data, fine-tune cheaply for any task:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
# Fine-tune for sentiment analysis
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=2 # positive / negative
)
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=2e-5, # Fine-tuning requires very small LR
warmup_steps=500,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
# A few epochs of fine-tuning on a single GPU — typically well under an hour —
# was enough to match or beat the previous state of the art on many benchmarks
BERT shattered records on 11 NLP benchmarks when released — tasks that had resisted progress for years suddenly jumped significantly.
GPT and the Generative Branch
While BERT focuses on understanding (encoder), GPT (OpenAI) focuses on generation (decoder-only Transformer). Instead of masking, GPT is trained purely autoregressively — predict the next token given all previous tokens.
Input: "The cat sat on the"
Target: "mat"
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors='pt')
output = model.generate(
**inputs,
max_length=100,
temperature=0.8,
do_sample=True
)
print(tokenizer.decode(output[0]))
This branch eventually led to GPT-3, GPT-4, and the large language model revolution — but the fundamental representation technique (deep Transformer + language modeling pretraining) is the same as BERT.
Side-by-Side Comparison
Representing "I went to the bank to deposit my salary"
With TF-IDF:
[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, ...]
↑ sparse vector, "bank" looks same as in river context
With Word2Vec:
"bank" → [0.32, -0.11, 0.78, 0.45, ...] (300 dims, FIXED regardless of context)
With BERT:
"bank" → [0.91, 0.23, -0.45, 0.67, ...] (768 dims, CHANGES with context)
"bank" in river sentence → [-0.33, 0.87, 0.12, ...] (completely different!)
Similarity Scores Compared
# The sentence "I love dogs" vs "I adore canines"
# (same meaning, different words)
# TF-IDF cosine similarity
tfidf_sim = 0.0 # No shared tokens after tokenization → zero similarity!
# Word2Vec average similarity
w2v_sim = 0.76 # "love"↔"adore" and "dogs"↔"canines" are close
# BERT [CLS] token similarity
bert_sim = 0.91 # Full sentence context captured → very high
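The TF-IDF result is reproducible (assuming scikit-learn's default tokenizer, which drops single-character tokens like "I", so the two sentences share no features at all):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I love dogs", "I adore canines"]
X = TfidfVectorizer().fit_transform(docs)

sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 0.0 — no overlapping dimensions, so the dot product is zero
```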
Choosing the Right Representation
Do you have < 10k labeled examples and need interpretability?
└─▶ TF-IDF + classical ML (Logistic Regression, SVM)
Do you need word-level similarity or fast semantic search?
└─▶ Word2Vec / GloVe / FastText
Do you need SOTA accuracy and have compute budget?
└─▶ Fine-tune BERT / RoBERTa / sentence-transformers
Do you need generation (chatbots, summaries, completion)?
└─▶ GPT-family (decoder-only Transformers)
Do you need multilingual support?
└─▶ mBERT, XLM-RoBERTa, LaBSE
Performance vs. Cost Trade-off
| Method | Accuracy | Inference Speed | Memory | Training Cost |
|---|---|---|---|---|
| TF-IDF + LR | Baseline | ⚡⚡⚡ Very fast | 🟢 Tiny | 💚 Near zero |
| Word2Vec | +5-15% | ⚡⚡ Fast | 🟡 Small | 💛 Moderate |
| GloVe (pretrained) | +5-15% | ⚡⚡ Fast | 🟡 Small | 💚 Near zero (use pretrained) |
| BERT-base | +20-35% | ⚡ Slow | 🔴 Large (440MB) | 🔴 Expensive GPU |
| BERT-large | +25-40% | 🐢 Very slow | 🔴 Very large (1.2GB) | 🔴🔴 Very expensive |
The Key Intuitions, Summarized
Sparse representations answer: "Is word X present?" They know vocabulary but not meaning.
Static dense embeddings answer: "What concepts surround word X in general usage?" They know meaning but not context.
Contextual embeddings answer: "What does word X mean right here, in this sentence?" They know both meaning and context.
Each stage solved the previous era's fundamental limitation. And each required a proportional increase in computational investment — from a lookup table, to a shallow neural net, to a deep pretrained Transformer.
What Comes Next
Contextual embeddings weren't the end. The next frontier is instruction-following models — not just representing language, but reasoning about it, following instructions, and taking actions. But that story builds directly on everything covered here.
The journey from one-hot vectors to BERT is the journey from "text as symbols" to "text as meaning." Understanding it gives you the foundation to reason clearly about every language model you'll ever work with.
If you found this useful, the next post goes deep on Sentence Transformers — how to get BERT-quality semantic similarity at production-grade speed.
by tech.with.akshad