NLP · Embeddings · BERT · Word2Vec · Deep Learning

From Bag of Words to BERT: How NLP Learned to Understand Language

14 min read · 1 April 2025


There's a question that sits at the heart of every NLP system ever built:

How do you turn human language — messy, ambiguous, context-dependent — into something a machine can compute with?

The answer to that question has changed dramatically over the past three decades. And each time it changed, NLP got dramatically more powerful.

This post walks through the three major eras of text representation: Sparse Classical methods, Dense Static Embeddings, and Contextual Embeddings. By the end, you'll understand not just what each approach does, but why it was invented and what fundamental limitation it solved.


The Big Picture: Three Eras of Representation

Before diving into each era, here's the arc of the whole story:

Raw Text  ──▶  Numbers  ──▶  Machine Understanding

The challenge is in the middle step. How you convert text to numbers defines what patterns a model can learn, what computations are possible, and how much language it can truly "understand."

| Era | Example Methods | Key Idea | Limitation |
|---|---|---|---|
| Sparse Classical | BoW, TF-IDF, N-grams | Count words | No semantics, no order, huge vectors |
| Dense Static | Word2Vec, GloVe, FastText | Learn meaning from context | One embedding per word, no polysemy |
| Contextual | ELMo, BERT, GPT | Meaning depends on context | Expensive, needs large data |


Era 1 — Classic (Sparse) Representations

The Core Idea: Count What's There

The earliest approach to text representation was brutally simple: build a vocabulary, then for each document, count how many times each word appears.

This gives you a vector — one dimension per word in your vocabulary. Most values are zero (most documents don't contain most words), hence the name sparse.

One-Hot Encoding

The simplest possible representation. Each word gets its own index in the vocabulary, and its vector is all zeros except for a 1 at that index. A sentence can then be represented by summing its word vectors.

vocab = {"cat": 0, "sat": 1, "on": 2, "the": 3, "mat": 4}

# "cat" → [1, 0, 0, 0, 0]
# "mat" → [0, 0, 0, 0, 1]
# "cat" + "mat" → [1, 0, 0, 0, 1]  (just added together)

The fatal flaw: Every word is equidistant from every other word. cat and kitten are just as "different" as cat and democracy. There's no notion of semantic similarity.
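You can see the problem with a couple of lines of NumPy — every pair of distinct one-hot vectors has a dot product of zero, so no word is "closer" to any other:

import numpy as np

cat = np.array([1, 0, 0, 0, 0])
mat = np.array([0, 0, 0, 0, 1])
the = np.array([0, 0, 0, 1, 0])

print(cat @ mat, cat @ the)        # 0 0 — every distinct pair is orthogonal
print(np.linalg.norm(cat - mat))   # 1.414... — and every pair sits the same distance apart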

Bag of Words (BoW)

Extend one-hot to whole documents. Ignore word order entirely — just count each word's frequency in the document.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning",
    "machine learning is powerful",
    "I love deep learning too"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['deep' 'is' 'learning' 'love' 'machine' 'powerful' 'too']
# ("I" is dropped — CountVectorizer's default tokenizer ignores single-character tokens)

print(X.toarray())
# [[0, 0, 1, 1, 1, 0, 0],
#  [0, 1, 1, 0, 1, 1, 0],
#  [1, 0, 1, 1, 0, 0, 1]]

"I love machine learning" and "machine learning I love" produce identical vectors — BoW has no memory of sequence.

TF-IDF: Smarter Counting

Raw counts have a problem: common words like "the", "is", "and" dominate every document. They're not informative.

TF-IDF (Term Frequency - Inverse Document Frequency) reweights counts to reward words that are frequent in a document but rare across all documents.

TF(t, d)  = count of term t in document d / total terms in d
IDF(t)    = log( N / df(t) )       where N = total docs, df = docs containing t

TF-IDF(t, d) = TF(t, d) × IDF(t)

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great pets"
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# "the" → low score (appears in all docs)
# "mat" → high score (unique to doc 1)
# "cats" → high score (unique to doc 3)

N-Grams: Capturing Local Order

BoW discards all word order. N-grams partially restore it by treating sequences of N words as single tokens.

# Unigrams (N=1): ["machine", "learning", "is", "powerful"]
# Bigrams  (N=2): ["machine learning", "learning is", "is powerful"]
# Trigrams (N=3): ["machine learning is", "learning is powerful"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams

Bigrams let you distinguish "not good" from "good" — something unigrams can't do.
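Here's that difference in miniature — with bigrams, the negation survives as its own feature:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was not good", "the food was good"]

print(CountVectorizer(ngram_range=(1, 1)).fit(docs).get_feature_names_out())
# ['food' 'good' 'not' 'the' 'was']           — "not" and "good" are unrelated counts

print(CountVectorizer(ngram_range=(1, 2)).fit(docs).get_feature_names_out())
# [... 'not good' ... 'was good' 'was not']   — the phrase "not good" is now a feature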

Where Sparse Representations Shine (and Fail)

Still excellent for:

  • Document classification with small datasets
  • Keyword-based search and retrieval
  • Spam detection, simple sentiment analysis
  • Interpretability (you can explain every feature)

Fundamental limitations:

  1. Vocabulary size explosion — a real corpus has 100k+ words; add n-grams and you're in the millions
  2. Synonym blindness — "automobile" and "car" are completely different dimensions
  3. No semantic structure — the geometry of the vector space carries no meaning
  4. Out-of-vocabulary (OOV) — unseen words simply don't exist

The moment you need to understand meaning, sparse methods hit a wall.


Era 2 — Dense Static Embeddings

The Distributional Hypothesis

The key insight that unlocked the next era came from linguistics in the 1950s:

"You shall know a word by the company it keeps." — J.R. Firth, 1957

Words that appear in similar contexts tend to have similar meanings. "Dog" and "cat" both appear near "pet", "feed", "fluffy", "vet". "Bank" (financial) appears near "money", "loan", "interest" — very different contexts from "bank" (river).

If we train a model to predict context words, it will naturally learn to give similar representations to semantically similar words.

Word2Vec: The Breakthrough

In 2013, a team at Google (Mikolov et al.) published Word2Vec — a simple neural network that learned dense word embeddings by predicting words from their context (or context from words).

Two architectures:

CBOW (Continuous Bag of Words) — predict the center word from surrounding context:

Context: ["The", "_?_", "sat", "on", "the", "mat"]
Target:   "cat"

Skip-gram — predict surrounding words from the center word:

Input:  "cat"
Target: ["The", "sat", "on", "the", "mat"]
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran", "in", "the", "park"],
    ["cats", "and", "dogs", "are", "great", "pets"],
    # ... millions more sentences
]

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window
    min_count=5,       # ignore rare words
    sg=1               # 1 = skip-gram, 0 = CBOW
)

# Each word is now a dense 300-dimensional vector
cat_vec  = model.wv["cat"]   # shape: (300,)
king_vec = model.wv["king"]  # shape: (300,)

# Magic: semantic arithmetic works!
result = model.wv.most_similar(
    positive=["king", "woman"],
    negative=["man"]
)
# → [("queen", 0.85), ("princess", 0.72), ...]

The Geometry of Meaning

The most jaw-dropping property of Word2Vec embeddings: semantic relationships become geometric relationships.

king   − man   + woman  ≈  queen
Paris  − France + Germany ≈  Berlin
walked − walk   + run    ≈  ran

This means the vector space has actual structure. Directions in this 300-dimensional space correspond to human-interpretable concepts like gender, tense, geography.

# Similarity between words = cosine similarity between vectors
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

sim = cosine_similarity(model.wv["cat"], model.wv["kitten"])
# → 0.84  (high similarity, as expected)

sim2 = cosine_similarity(model.wv["cat"], model.wv["democracy"])
# → 0.12  (very different, as expected)

GloVe: Global Statistics

Word2Vec learns from local windows. GloVe (Global Vectors, Stanford 2014) takes a different approach: factorize the entire word co-occurrence matrix across the whole corpus.

# GloVe gives you pre-trained embeddings — just load and use
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove = load_glove("glove.6B.300d.txt")
# 6 billion tokens, 300-dimensional vectors
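Once loaded, the vectors behave just like Word2Vec's — the cosine_similarity helper defined in the Word2Vec section above works on them unchanged (assuming the words exist in the downloaded vocabulary):

# Reusing the cosine_similarity() helper from the Word2Vec section above
print(cosine_similarity(glove["king"], glove["queen"]))    # high — closely related words
print(cosine_similarity(glove["king"], glove["carrot"]))   # much lower — unrelated words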

GloVe and Word2Vec produce qualitatively similar results but with different computational trade-offs.

FastText: Subword Magic

Word2Vec and GloVe break on morphologically rich languages and on unseen words. "Running", "runner", "runs" all get completely separate embeddings with no shared information.

FastText (Facebook 2016) represents each word as a bag of character n-grams:

"running" → {"run", "runn", "runni", "unnin", "nning", "ning", "ing", ...}
           + the whole word token "running"

The word embedding is the average of its subword embeddings.
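Here's a rough sketch of the subword extraction step. Real FastText wraps each word in boundary markers and hashes the n-grams into a fixed number of buckets; this just shows the idea:

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with FastText-style boundary markers."""
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]

print(char_ngrams("running")[:6])
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing']  (plus longer n-grams, and the whole word itself)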

from gensim.models import FastText

model = FastText(sentences, vector_size=300, window=5, min_count=1)

# Can now handle OOV words!
vec = model.wv["supercalifragilistic"]  # No error! Built from character n-grams

This makes FastText particularly powerful for:

  • Misspelled words ("recieve" → still works)
  • Morphologically rich languages (Finnish, Turkish, German)
  • Technical domains with many compound words

The Shared Limitation: One Word, One Embedding

All of these methods — Word2Vec, GloVe, FastText — share a critical flaw.

Every word gets exactly one embedding, regardless of context.

Consider the word "bank":

"I deposited money in the bank."
"The fishing spot by the river bank was perfect."
"He had to bank the plane sharply to avoid the storm."

Three completely different meanings. But all three get the same vector. The static embedding ends up as a confused average of all the word's meanings, weighted by how often each sense appears in the training corpus.

For many tasks this is fine. For nuanced language understanding, it's a showstopper.


Era 3 — Contextual Embeddings

The Paradigm Shift

What if instead of a fixed embedding per word, we computed embeddings dynamically, based on the full sentence context?

This is the central idea of contextual embeddings. The same word "bank" should have a different vector in a finance sentence than in a geography sentence.

Static:     "bank" → [0.32, -0.11, 0.78, ...]  (always the same)

Contextual: "I went to the bank to deposit money."
             bank → [0.91, 0.23, -0.45, ...]  (finance context)

            "She sat on the river bank."
             bank → [-0.33, 0.87, 0.12, ...]  (geography context)

ELMo: The First Breakthrough

ELMo (Embeddings from Language Models, 2018) was the first major system to produce contextual embeddings. It trained a deep bidirectional LSTM as a language model, then used the hidden states as embeddings.

import tensorflow_hub as hub
import tensorflow as tf

# Load pre-trained ELMo
elmo = hub.load("https://tfhub.dev/google/elmo/3")

sentences = [
    "I went to the bank to withdraw money",
    "The heron stood by the river bank"
]

# Same word "bank" → different embeddings depending on context
embeddings = elmo.signatures["default"](
    tf.constant(sentences)
)["elmo"]
# Shape: (2, max_len, 1024)
# The vector for "bank" differs significantly between sentences

ELMo showed that language model pretraining — training on massive text to predict the next word — naturally develops rich linguistic knowledge as a side effect.

The Transformer: The Architecture That Changed Everything

Before getting to BERT, we need to understand the engine underneath it: the Transformer (Vaswani et al., 2017, "Attention Is All You Need").

The key innovation is self-attention: every token looks at every other token and decides how much to "attend" to it when forming its representation.

import torch
import torch.nn.functional as F

def self_attention(Q, K, V, d_k):
    """
    Q, K, V: Query, Key, Value matrices
    d_k: dimension of keys (for scaling)
    """
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    
    # Convert to probabilities
    weights = F.softmax(scores, dim=-1)
    
    # Weighted sum of values
    output = torch.matmul(weights, V)
    
    return output, weights
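A quick shape check with random tensors shows what comes out — one new representation per token, plus a full attention matrix whose rows are probability distributions:

seq_len, d_k = 6, 64
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)
V = torch.randn(1, seq_len, d_k)

output, weights = self_attention(Q, K, V, d_k)
print(output.shape)         # torch.Size([1, 6, 64]) — one vector per token
print(weights.shape)        # torch.Size([1, 6, 6])  — every token attends to every token
print(weights[0, 0].sum())  # tensor(1.) — each row of attention weights sums to 1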

For "The cat sat on the mat because it was tired":

  • When encoding "it", the attention mechanism looks at the whole sentence
  • High attention weight between "it" and "cat" — the model learns "it" refers to the cat
  • This coreference resolution emerges from training, not hard-coded rules

BERT: Bidirectional Transformers

BERT (Bidirectional Encoder Representations from Transformers, Google 2018) took the Transformer encoder and trained it at scale with two pretraining tasks:

1. Masked Language Model (MLM):

Input:  "The cat [MASK] on the mat"
Target: "sat"

Input:  "The [MASK] sat on the mat"
Target: "cat"

2. Next Sentence Prediction (NSP):

Sentence A: "The cat sat on the mat."
Sentence B: "It enjoyed the warmth."   → IsNext: True

Sentence A: "The cat sat on the mat."
Sentence B: "The stock market crashed."  → IsNext: False

BERT was trained on 3.3 billion words (Wikipedia + BookCorpus), producing a 12-layer Transformer with 110 million parameters.
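You can poke at the MLM objective directly with the Hugging Face fill-mask pipeline — a quick demo using the standard bert-base-uncased checkpoint:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat [MASK] on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Expect completions like "sat", "slept", "lay" near the top —
# knowledge picked up purely from pretraining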

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Same word "bank" in two different contexts
sentences = [
    "I deposited my paycheck at the bank.",
    "We had a picnic on the river bank."
]

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # outputs.last_hidden_state: shape (1, seq_len, 768)
    # Each token has a unique 768-dim contextual embedding
    hidden_states = outputs.last_hidden_state
    
    # Find index of "bank" token
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    bank_idx = tokens.index('bank')
    
    bank_embedding = hidden_states[0, bank_idx, :]  # shape: (768,)
    print(f"Sentence: {sentence}")
    print(f"'bank' embedding norm: {bank_embedding.norm():.3f}")
    # The two bank embeddings will be significantly different!
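To verify that last comment, a short follow-up can collect the two "bank" vectors and compare them directly:

import torch.nn.functional as F

bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    bank_vectors.append(outputs.last_hidden_state[0, tokens.index('bank'), :])

sim = F.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(f"Cosine similarity between the two 'bank' embeddings: {sim.item():.3f}")
# Noticeably below 1.0 — the model has pushed the two senses apart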

Fine-tuning BERT for Downstream Tasks

BERT's real power is transfer learning. Pretrain once on massive data, fine-tune cheaply for any task:

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Fine-tune for sentiment analysis
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # positive / negative
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,    # Fine-tuning requires very small LR
    warmup_steps=500,
)

# train_dataset / eval_dataset: pre-tokenized datasets, prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
# ~30 minutes on a single GPU to beat the previous state-of-the-art
# on most benchmarks

BERT shattered records on 11 NLP benchmarks when released — tasks that had resisted progress for years suddenly jumped significantly.

GPT and the Generative Branch

While BERT focuses on understanding (encoder), GPT (OpenAI) focuses on generation (decoder-only Transformer). Instead of masking, GPT is trained purely autoregressively — predict the next token given all previous tokens.

Input:  "The cat sat on the"
Target: "mat"
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

prompt = "The history of artificial intelligence began"
inputs = tokenizer(prompt, return_tensors='pt')

output = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    do_sample=True
)

print(tokenizer.decode(output[0]))

This branch eventually led to GPT-3, GPT-4, and the large language model revolution — but the fundamental representation technique (deep Transformer + language modeling pretraining) is the same as BERT.


Side-by-Side Comparison

Representing "I went to the bank to deposit my salary"

With TF-IDF:

[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, ...]
   ↑ sparse vector — "bank" gets exactly the same count it would in a river sentence

With Word2Vec:

"bank" → [0.32, -0.11, 0.78, 0.45, ...]  (300 dims, FIXED regardless of context)

With BERT:

"bank" → [0.91, 0.23, -0.45, 0.67, ...]  (768 dims, CHANGES with context)
"bank" in river sentence → [-0.33, 0.87, 0.12, ...]  (completely different!)

Similarity Scores Compared

# The sentence "I love dogs" vs "I adore canines"
# (same meaning, different words)

# TF-IDF cosine similarity
tfidf_sim = 0.0   # No words in common → zero similarity!

# Word2Vec average similarity
w2v_sim = 0.76    # "love"↔"adore" and "dogs"↔"canines" are close

# BERT [CLS] token similarity
bert_sim = 0.91   # Full sentence context captured → very high
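The TF-IDF figure, at least, is easy to verify yourself (the Word2Vec and BERT numbers above are illustrative — they depend on the exact pretrained models used):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cos_sim

X = TfidfVectorizer().fit_transform(["I love dogs", "I adore canines"])
print(cos_sim(X[0], X[1])[0, 0])
# 0.0 — no shared terms ("I" is dropped by the default tokenizer), so the vectors are orthogonal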

Choosing the Right Representation

Do you have < 10k labeled examples and need interpretability?
└─▶ TF-IDF + classical ML (Logistic Regression, SVM)

Do you need word-level similarity or fast semantic search?
└─▶ Word2Vec / GloVe / FastText

Do you need SOTA accuracy and have compute budget?
└─▶ Fine-tune BERT / RoBERTa / sentence-transformers

Do you need generation (chatbots, summaries, completion)?
└─▶ GPT-family (decoder-only Transformers)

Do you need multilingual support?
└─▶ mBERT, XLM-RoBERTa, LaBSE

Performance vs. Cost Trade-off

| Method | Accuracy | Inference Speed | Memory | Training Cost |
|---|---|---|---|---|
| TF-IDF + LR | Baseline | ⚡⚡⚡ Very fast | 🟢 Tiny | 💚 Near zero |
| Word2Vec | +5-15% | ⚡⚡ Fast | 🟡 Small | 💛 Moderate |
| GloVe (pretrained) | +5-15% | ⚡⚡ Fast | 🟡 Small | 💚 Near zero (use pretrained) |
| BERT-base | +20-35% | ⚡ Slow | 🔴 Large (440MB) | 🔴 Expensive GPU |
| BERT-large | +25-40% | 🐢 Very slow | 🔴 Very large (1.2GB) | 🔴🔴 Very expensive |


The Key Intuitions, Summarized

Sparse representations answer: "Is word X present?" They know vocabulary but not meaning.

Static dense embeddings answer: "What concepts surround word X in general usage?" They know meaning but not context.

Contextual embeddings answer: "What does word X mean right here, in this sentence?" They know both meaning and context.

Each stage solved the previous era's fundamental limitation. And each required a proportional increase in computational investment — from a lookup table, to a shallow neural net, to a deep pretrained Transformer.


What Comes Next

Contextual embeddings weren't the end. The next frontier is instruction-following models — not just representing language, but reasoning about it, following instructions, and taking actions. But that story builds directly on everything covered here.

The journey from one-hot vectors to BERT is the journey from "text as symbols" to "text as meaning." Understanding it gives you the foundation to reason clearly about every language model you'll ever work with.


If you found this useful, the next post goes deep on Sentence Transformers — how to get BERT-quality semantic similarity at production-grade speed.


by tech.with.akshad