Introduction to Search & Vectorization: Turning Words into Numbers (Day 1/9)

Ever wondered how X finds your posts, or how Google understands what you're really asking for? It's not magic; it's search – and a surprisingly clever process of turning words into numbers. Today, we're laying the foundation for understanding how it all works.

Imagine you're searching for "funny cat videos." The search engine needs to understand what "funny," "cat," and "videos" represent and find documents (posts, web pages, etc.) that are similar to your query. But computers don't understand words. They understand numbers. That's where vectorization comes in.

Search Engine Basics

At its core, a search engine consists of three main components (a toy sketch of all three follows the list):

  1. Indexer: Crawls the web, analyzes documents, and creates an index (a data structure that allows for fast searching).
  2. Query Processor: Receives the user's query, analyzes it, and transforms it into a format the index can understand.
  3. Ranker: Retrieves documents from the index and ranks them based on relevance to the query.
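
To make those roles concrete, here is a deliberately tiny sketch in Python: a dictionary stands in for the index, and the ranker simply counts how many query terms each document contains. Real engines use far more elaborate data structures and scoring functions; the function names (build_index, process_query, rank) are just illustrative choices for this post.

import re

# Toy corpus: document id -> text.
documents = {
    1: "The cat sat on the mat.",
    2: "The dog chased the cat.",
}

def build_index(docs):
    """Indexer: map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r'\w+', text.lower()):
            index.setdefault(word, set()).add(doc_id)
    return index

def process_query(query):
    """Query processor: lowercase and tokenize the query."""
    return re.findall(r'\w+', query.lower())

def rank(index, query_terms):
    """Ranker: score documents by how many query terms they contain."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index = build_index(documents)
print(rank(index, process_query("cat on a mat")))  # [(1, 3), (2, 1)]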

Why Numbers Matter

Computers are fundamentally number crunchers. To search for information, we need to represent everything – queries, documents – as numerical vectors. These vectors capture the meaning of the words or concepts they represent. The closer two vectors are to each other, the more similar their meanings.

Vectorization: Turning Words into Numbers

Vectorization is the process of converting text (words, phrases, documents) into numerical vectors. There are several ways to do this, each with its own strengths and weaknesses. Let's explore a simple example:

1. Bag-of-Words (BoW) โ€“ A Basic Approach

The Bag-of-Words model is a simple but often surprisingly effective technique. It ignores word order and represents a document as a collection of words and their frequencies.

Example:

Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."

BoW Representation:

Document 1: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
Document 2: {"the": 2, "dog": 1, "chased": 1, "cat": 1}

Code Example (Python):

import re

def bag_of_words(text):
    """
    Creates a bag-of-words representation of a text.
    """
    text = text.lower()
    words = re.findall(r'\w+', text)  # Extract words
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

doc1 = "The cat sat on the mat."
doc2 = "The dog chased the cat."

bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)

print(f"Document 1 (BoW): {bow1}")
print(f"Document 2 (BoW): {bow2}")

Output:

Document 1 (BoW): {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
Document 2 (BoW): {'the': 2, 'dog': 1, 'chased': 1, 'cat': 1}

This is a very basic representation. It doesn't capture semantic relationships (e.g., "happy" and "joyful" are similar, but BoW treats them as completely different words).
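
To see this limitation in action, here is a quick check that reuses the bag_of_words function above: two sentences that mean almost the same thing share only their filler words, because "happy" and "joyful" are counted as entirely separate entries.

sent1 = bag_of_words("I am happy today")
sent2 = bag_of_words("I am joyful today")

# Words the two sentences share - only the filler words survive.
shared = set(sent1) & set(sent2)
print(shared)                               # {'i', 'am', 'today'} (set order may vary)
print("joyful" in sent1, "happy" in sent2)  # False False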

Introducing Word Embeddings

BoW has limitations. Word embeddings, like Word2Vec, GloVe, and FastText, are more sophisticated techniques that capture semantic relationships between words. They represent words as dense vectors in a high-dimensional space. Words with similar meanings are located close to each other in this space.

Conceptual Diagram:

┌─────────────┐      ┌───────────────────────┐      ┌──────────────┐
│  Input Text │ ──►  │  Word Embedding Model │ ──►  │ Word Vectors │
└─────────────┘      └───────────────────────┘      └──────────────┘

Let's illustrate with a simplified example. We won't load real pre-trained embeddings or train a model from scratch here – that's beyond the scope of this introductory lesson – so the vectors below are just made-up stand-ins.

import numpy as np

# Simplified example using random vectors to represent word embeddings
# In reality, these would be learned from a large corpus of text
word_embeddings = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
    "happy": np.array([0.7, 0.8, 0.9]),
    "joyful": np.array([0.8, 0.7, 0.6])
}

def calculate_similarity(vector1, vector2):
    """Calculates the cosine similarity between two vectors."""
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)

# Calculate similarity between "cat" and "dog"
similarity_cat_dog = calculate_similarity(word_embeddings["cat"], word_embeddings["dog"])
print(f"Similarity between 'cat' and 'dog': {similarity_cat_dog:.4f}")

# Calculate similarity between "happy" and "joyful"
similarity_happy_joyful = calculate_similarity(word_embeddings["happy"], word_embeddings["joyful"])
print(f"Similarity between 'happy' and 'joyful': {similarity_happy_joyful:.4f}")

Output:

Similarity between 'cat' and 'dog': 0.9746
Similarity between 'happy' and 'joyful': 0.9764

Both scores come out high here simply because these made-up vectors all happen to point in roughly the same direction. With real pre-trained embeddings, a high score between "happy" and "joyful" (and a lower one between unrelated words) would reflect semantic relationships actually learned from text.

Search and Similarity

Now, let's put it all together. Imagine we have a query: "find happy cats."

  1. Vectorize the Query: Turn the query into a vector using the same embedding model used for the documents.
  2. Search the Index: Compare the query vector to the vectors of all documents in our index.
  3. Rank and Return: Rank the documents by the similarity between their vectors and the query vector, and return the best matches. (A minimal end-to-end sketch follows this list.)
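
Here is a minimal end-to-end sketch of those three steps, reusing the toy word_embeddings table and calculate_similarity function from above. Averaging word vectors is just one simple way to turn a query or document into a single vector, and the embed helper plus the doc_a/doc_b names are purely illustrative. Keep in mind that with these made-up vectors the actual scores mean little; with real pre-trained embeddings, the document about a joyful cat would typically be the one ranked closest to "find happy cats."

def embed(words, embeddings):
    """Average the vectors of the words we have embeddings for."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

# Step 1: vectorize the query (only "happy" is in our tiny embedding table).
query_vector = embed(["find", "happy", "cats"], word_embeddings)

# Step 2: the "index" holds one vector per document.
doc_vectors = {
    "doc_a": embed(["joyful", "cat"], word_embeddings),
    "doc_b": embed(["dog"], word_embeddings),
}

# Step 3: rank documents by cosine similarity to the query vector.
ranking = sorted(
    doc_vectors.items(),
    key=lambda item: calculate_similarity(query_vector, item[1]),
    reverse=True,
)
for doc_id, vector in ranking:
    print(doc_id, round(calculate_similarity(query_vector, vector), 4))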

Advanced Tips & Best Practices

  • Normalization: Normalize your vectors (e.g., using L2 normalization) so that vector magnitude doesn't skew similarity scores; for unit-length vectors, the dot product equals the cosine similarity (see the short sketch after this list).
  • Dimensionality Reduction: High-dimensional embeddings can be computationally expensive. Consider using dimensionality reduction techniques like PCA to reduce the number of dimensions.
  • Approximate Nearest Neighbor (ANN) Search: For large datasets, ANN search algorithms (e.g., HNSW, Faiss) provide a significant speedup compared to brute-force search.
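
As a small illustration of the first tip, here is what L2 normalization looks like with NumPy, reusing the toy vectors from earlier. Once vectors have unit length, a plain dot product gives the same number as cosine similarity, which is part of why normalized vectors pair so well with inner-product ANN indexes. The normalize helper name is just an illustrative choice.

def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector

cat_unit = normalize(word_embeddings["cat"])
dog_unit = normalize(word_embeddings["dog"])

# For unit-length vectors, the dot product IS the cosine similarity.
print(np.dot(cat_unit, dog_unit))  # ~0.9746, same value as calculate_similarity above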

Actionable Takeaways

  1. Search engines transform words into numbers for processing.
  2. Bag-of-Words is a simple vectorization method.
  3. Word embeddings capture semantic relationships between words.
  4. Similarity scores measure the closeness of word vectors.
  5. Normalizing vectors is crucial for accurate similarity calculations.
  6. ANN search algorithms speed up searching in large datasets.
  7. Experiment with different vectorization techniques to find the best approach for your data.
  8. Word embeddings are typically pre-trained on massive datasets.
  9. ANN search dramatically reduces search time.
  10. Vectorization is a critical step in modern search engines.

Conclusion

Today, we're only scratching the surface of how search engines work. We've learned about vectorization, explored simple and more sophisticated techniques, and started to understand how words can be transformed into numbers. Next, we'll dive deeper into indexing and ranking algorithms.

What other questions do you have about how search engines work?

