Input & Embedding: Turning Words into Numbers

In our previous post, we laid the groundwork for understanding retrieval pipelines. We explored the overall architecture and the core problem: finding relevant information from a vast dataset. Today, we're diving into the crucial first step: transforming the words we use into numbers that our computers can understand. This process is called embedding.

Why Can't Computers Just Read Words?

Computers operate on numbers. They excel at performing calculations and comparisons based on numerical data. However, text – words, sentences, paragraphs – is inherently symbolic. "Cat" isn't a number; it's a representation of a feline creature. To bridge this gap, we need a way to represent words as vectors of numbers – these are our embeddings.

Think of it like this: imagine you're describing a fruit to someone who's never seen one. You could list its characteristics – color, size, taste, texture. Each characteristic becomes a dimension, and the fruit's attributes become numbers along those dimensions. An embedding does the same thing for words.
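As a toy sketch of that idea (the fruits and feature values below are invented purely for illustration – a real embedding model learns its dimensions rather than having them hand-picked):

```python
# Toy "embeddings" for fruits: each dimension is a hand-picked feature.
# Dimensions: [redness, relative size, sweetness]
fruit_vectors = {
    "apple":  [0.9, 0.8, 0.6],
    "cherry": [1.0, 0.2, 0.7],
    "lime":   [0.0, 0.5, 0.1],
}

# Fruits with similar characteristics end up with similar vectors:
# apple and cherry agree on redness and sweetness; lime does not.
print(fruit_vectors["apple"])   # [0.9, 0.8, 0.6]
```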

What are Embeddings?

An embedding is a vector representation of a word, phrase, or even an entire document. Each dimension in the vector captures a different aspect of the word's meaning or context. Words that are semantically similar will have embeddings that are close to each other in this vector space.

Visual Representation:

┌───────────────┐     ┌─────────────────┐     ┌──────────────────────────────┐
│  Word "Cat"   │ ──► │ Embedding Model │ ──► │ Vector [0.2, -0.5, 0.8, ...] │
└───────────────┘     └─────────────────┘     └──────────────────────────────┘

The embedding model (we're not building one today – these are pre-trained) takes a word as input and outputs a vector. The length of the vector is the dimension of the embedding. Common dimensions are 100, 300, or even higher.

Simple Example: A Basic Embedding Lookup

Let's illustrate with a simplified example using a hypothetical dictionary. This isn't a real embedding model, but it shows the concept:

# Hypothetical word embeddings (in reality, these are learned)
word_embeddings = {
    "cat": [0.1, 0.5, -0.2],
    "dog": [0.3, 0.6, -0.1],
    "bird": [-0.5, 0.2, 0.9],
    "apple": [0.7, -0.3, 0.1]
}

def get_embedding(word):
  """Retrieves the embedding for a given word."""
  if word in word_embeddings:
    return word_embeddings[word]
  else:
    return None  # Word not found

word = "cat"
embedding = get_embedding(word)

if embedding is not None:
  print(f"The embedding for '{word}' is: {embedding}")
else:
  print(f"Word '{word}' not found in the dictionary.")

Expected Output:

The embedding for 'cat' is: [0.1, 0.5, -0.2]

What's happening: This code defines a dictionary word_embeddings that maps words to their corresponding vectors. The get_embedding function looks up a word in the dictionary and returns its embedding. If the word isn't found, it returns None.

A More Realistic Example: Using Gensim

Gensim is a popular Python library for topic modeling, document indexing, and similarity retrieval. It provides convenient ways to load and use pre-trained word embeddings.

from gensim.models import KeyedVectors

# Load pre-trained Word2Vec model (replace with your model path)
try:
  model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
except FileNotFoundError:
  print("Please download GoogleNews-vectors-negative300.bin and place it in the same directory.")
  exit()

def get_embedding_gensim(word):
  """Retrieves the embedding for a given word using Gensim."""
  try:
    return model[word]
  except KeyError:
    return None  # Word not found in the model

word = "king"
embedding = get_embedding_gensim(word)

if embedding is not None:
  print(f"The embedding for '{word}' is: {embedding[:10]}...") # Print first 10 elements
else:
  print(f"Word '{word}' not found in the model.")

Expected Output (truncated):

The embedding for 'king' is: [0.019925789, 0.029606264, -0.006949782, -0.03385614, -0.02481438, 0.03564985, -0.03513183, -0.00987431, 0.007226155, 0.01562295]...

What's happening: This code loads a pre-trained Word2Vec model from a binary file. The get_embedding_gensim function uses the model to retrieve the embedding for a given word. We only print the first 10 elements of the embedding because these vectors are typically quite long (300 dimensions in this case).

Why are Some Words Closer Than Others?

The beauty of embeddings lies in their ability to capture semantic relationships. Words that are used in similar contexts will have embeddings that are closer together. For instance, "king" and "queen" will be closer than "king" and "apple."

You can calculate the cosine similarity between two embeddings to quantify their closeness. A higher cosine similarity indicates greater semantic similarity.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(word1, word2):
  """Calculates the cosine similarity between two word embeddings."""
  embedding1 = model[word1]
  embedding2 = model[word2]
  return cosine_similarity([embedding1], [embedding2])[0][0]

similarity_king_queen = calculate_similarity("king", "queen")
similarity_king_apple = calculate_similarity("king", "apple")

print(f"Similarity between 'king' and 'queen': {similarity_king_queen}")
print(f"Similarity between 'king' and 'apple': {similarity_king_apple}")

Expected Output (values will vary based on the model):

Similarity between 'king' and 'queen': 0.85
Similarity between 'king' and 'apple': 0.05

What's happening: This code defines a function calculate_similarity that uses cosine_similarity from sklearn to measure the similarity between two word embeddings. It then calculates and prints the similarity between "king" and "queen" and between "king" and "apple." The values will vary depending on the specific model used.

Common Mistakes and Debugging

  • Word Not Found: Ensure the word you're trying to embed exists in the vocabulary of the embedding model. If not, the model will raise a KeyError. Handle this gracefully by returning None or providing a default embedding.
  • Model Loading Errors: Double-check the path to the embedding model file. Ensure the file exists and is in the correct format (binary for Word2Vec).
  • Dimensionality Mismatch: Be mindful of the dimensionality of the embeddings. If youโ€™re performing calculations (e.g., cosine similarity), ensure the embeddings have compatible dimensions.
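One way to handle the first pitfall is to fall back to a zero vector of the model's dimensionality. Here is a minimal sketch – the dict-based toy_model below stands in for a real Gensim model (a dict raises KeyError on a missing key in the same way), and whether a zero vector is a sensible default depends on your application:

```python
import numpy as np

def safe_embedding(model, word, dim):
    """Return the word's vector, or a zero vector of the model's
    dimensionality if the word is out of vocabulary."""
    try:
        return np.asarray(model[word])
    except KeyError:
        return np.zeros(dim)

# A toy stand-in for a real model, just for demonstration.
toy_model = {"cat": [0.1, 0.5, -0.2]}

print(safe_embedding(toy_model, "cat", dim=3))      # the known vector
print(safe_embedding(toy_model, "unicorn", dim=3))  # [0. 0. 0.]
```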

Beyond Words: Document Embeddings

The concept of embedding extends beyond individual words. You can also create embeddings for entire documents or phrases. This is often done by averaging the word embeddings of the constituent words. This allows you to compare the semantic similarity of documents.
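A minimal sketch of that averaging approach, reusing small hand-written vectors like the ones from our first example (a real pipeline would tokenize properly and use a trained model):

```python
import numpy as np

# Hand-written toy vectors, as in our earlier dictionary example.
word_embeddings = {
    "cat":  [0.1, 0.5, -0.2],
    "dog":  [0.3, 0.6, -0.1],
    "bird": [-0.5, 0.2, 0.9],
}

def document_embedding(text):
    """Average the embeddings of the known words in a document."""
    vectors = [word_embeddings[w] for w in text.lower().split()
               if w in word_embeddings]
    if not vectors:
        return None  # no known words in the document
    return np.mean(vectors, axis=0)

# Averages the "cat" and "dog" vectors element-wise.
print(document_embedding("cat dog"))
```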

In our next post, we'll delve into how these embeddings are used within the retrieval pipeline to find relevant information. We're building a foundation for more advanced techniques.

