Introduction to Search & Vectorization: Turning Words into Numbers (Day 1/9)

Ever wondered how X finds your posts, or how Google understands what you're really asking for? It's not magic; it's search – and a surprisingly clever process of turning words into numbers. Today, we're laying the foundation for understanding how it all works.

Imagine you're searching for "funny cat videos." The search engine needs to understand what "funny," "cat," and "videos" represent and find documents (posts, web pages, etc.) that are similar to your query. But computers don't understand words. They understand numbers. That's where vectorization comes in.

Search Engine Basics

At its core, a search engine consists of three main components (a toy sketch of all three follows the list):

  1. Indexer: Crawls the web, analyzes documents, and creates an index (a data structure that allows for fast searching).
  2. Query Processor: Receives the user's query, analyzes it, and transforms it into a format the index can understand.
  3. Ranker: Retrieves documents from the index and ranks them based on relevance to the query.
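
To make those roles concrete, here is a deliberately tiny sketch in Python: a dictionary stands in for the index, and the ranker simply counts how many query terms each document contains. Real engines use far more elaborate data structures and scoring functions; the function names (build_index, process_query, rank) are just illustrative choices for this post.

import re

# Toy corpus: document id -> text.
documents = {
    1: "The cat sat on the mat.",
    2: "The dog chased the cat.",
}

def build_index(docs):
    """Indexer: map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r'\w+', text.lower()):
            index.setdefault(word, set()).add(doc_id)
    return index

def process_query(query):
    """Query processor: lowercase and tokenize the query."""
    return re.findall(r'\w+', query.lower())

def rank(index, query_terms):
    """Ranker: score documents by how many query terms they contain."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index = build_index(documents)
print(rank(index, process_query("cat on a mat")))  # [(1, 3), (2, 1)]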

Why Numbers Matter

Computers are fundamentally number crunchers. To search for information, we need to represent everything – queries, documents – as numerical vectors. These vectors capture the meaning of the words or concepts they represent. The closer two vectors are to each other, the more similar their meanings.

Vectorization: Turning Words into Numbers

Vectorization is the process of converting text (words, phrases, documents) into numerical vectors. There are several ways to do this, each with its own strengths and weaknesses. Let's explore a simple example:

1. Bag-of-Words (BoW) โ€“ A Basic Approach

The Bag-of-Words model is a simple but often surprisingly effective technique. It ignores word order and represents a document as a collection of words and their frequencies.

Example:

Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."

BoW Representation:

Document 1: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
Document 2: {"the": 2, "dog": 1, "chased": 1, "cat": 1}

Code Example (Python):

import re

def bag_of_words(text):
    """
    Creates a bag-of-words representation of a text.
    """
    text = text.lower()
    words = re.findall(r'\w+', text)  # Extract words
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

doc1 = "The cat sat on the mat."
doc2 = "The dog chased the cat."

bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)

print(f"Document 1 (BoW): {bow1}")
print(f"Document 2 (BoW): {bow2}")

Output:

Document 1 (BoW): {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
Document 2 (BoW): {'the': 2, 'dog': 1, 'chased': 1, 'cat': 1}

This is a very basic representation. It doesn't capture semantic relationships (e.g., "happy" and "joyful" are similar, but BoW treats them as completely different words).
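
To see this limitation in action, here is a quick check that reuses the bag_of_words function above: two sentences that mean almost the same thing share only their filler words, because "happy" and "joyful" are counted as entirely separate entries.

sent1 = bag_of_words("I am happy today")
sent2 = bag_of_words("I am joyful today")

# Words the two sentences share - only the filler words survive.
shared = set(sent1) & set(sent2)
print(shared)                               # {'i', 'am', 'today'} (set order may vary)
print("joyful" in sent1, "happy" in sent2)  # False False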

Introducing Word Embeddings

BoW has limitations. Word embeddings, like Word2Vec, GloVe, and FastText, are more sophisticated techniques that capture semantic relationships between words. They represent words as dense vectors in a high-dimensional space. Words with similar meanings are located close to each other in this space.

Conceptual Diagram:

┌─────────────┐      ┌───────────────────────┐      ┌──────────────┐
│  Input Text │ ──►  │  Word Embedding Model │ ──►  │ Word Vectors │
└─────────────┘      └───────────────────────┘      └──────────────┘

Let's illustrate with a simplified example. We won't load real pre-trained embeddings or train a model from scratch here – that's beyond the scope of this introductory lesson – so the vectors below are just made-up stand-ins.

import numpy as np

# Simplified example using random vectors to represent word embeddings
# In reality, these would be learned from a large corpus of text
word_embeddings = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
    "happy": np.array([0.7, 0.8, 0.9]),
    "joyful": np.array([0.8, 0.7, 0.6])
}

def calculate_similarity(vector1, vector2):
    """Calculates the cosine similarity between two vectors."""
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)

# Calculate similarity between "cat" and "dog"
similarity_cat_dog = calculate_similarity(word_embeddings["cat"], word_embeddings["dog"])
print(f"Similarity between 'cat' and 'dog': {similarity_cat_dog:.4f}")

# Calculate similarity between "happy" and "joyful"
similarity_happy_joyful = calculate_similarity(word_embeddings["happy"], word_embeddings["joyful"])
print(f"Similarity between 'happy' and 'joyful': {similarity_happy_joyful:.4f}")

Output:

Similarity between 'cat' and 'dog': 0.9746
Similarity between 'happy' and 'joyful': 0.9764

Both scores come out high here simply because these made-up vectors all happen to point in roughly the same direction. With real pre-trained embeddings, a high score between "happy" and "joyful" (and a lower one between unrelated words) would reflect semantic relationships actually learned from text.

Search and Similarity

Now, let's put it all together. Imagine we have a query: "find happy cats."

  1. Vectorize the Query: Turn the query into a vector using the same embedding model used for the documents.
  2. Search the Index: Compare the query vector to the vectors of all documents in our index.
  3. Rank and Return: Rank the documents by the similarity between their vectors and the query vector, and return the best matches. (A minimal end-to-end sketch follows this list.)
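
Here is a minimal end-to-end sketch of those three steps, reusing the toy word_embeddings table and calculate_similarity function from above. Averaging word vectors is just one simple way to turn a query or document into a single vector, and the embed helper plus the doc_a/doc_b names are purely illustrative. Keep in mind that with these made-up vectors the actual scores mean little; with real pre-trained embeddings, the document about a joyful cat would typically be the one ranked closest to "find happy cats."

def embed(words, embeddings):
    """Average the vectors of the words we have embeddings for."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

# Step 1: vectorize the query (only "happy" is in our tiny embedding table).
query_vector = embed(["find", "happy", "cats"], word_embeddings)

# Step 2: the "index" holds one vector per document.
doc_vectors = {
    "doc_a": embed(["joyful", "cat"], word_embeddings),
    "doc_b": embed(["dog"], word_embeddings),
}

# Step 3: rank documents by cosine similarity to the query vector.
ranking = sorted(
    doc_vectors.items(),
    key=lambda item: calculate_similarity(query_vector, item[1]),
    reverse=True,
)
for doc_id, vector in ranking:
    print(doc_id, round(calculate_similarity(query_vector, vector), 4))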

Advanced Tips & Best Practices

  • Normalization: Normalize your vectors (e.g., using L2 normalization) so that vector magnitude doesn't skew similarity scores; for unit-length vectors, the dot product equals the cosine similarity (see the short sketch after this list).
  • Dimensionality Reduction: High-dimensional embeddings can be computationally expensive. Consider using dimensionality reduction techniques like PCA to reduce the number of dimensions.
  • Approximate Nearest Neighbor (ANN) Search: For large datasets, ANN search algorithms (e.g., HNSW, Faiss) provide a significant speedup compared to brute-force search.
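
As a small illustration of the first tip, here is what L2 normalization looks like with NumPy, reusing the toy vectors from earlier. Once vectors have unit length, a plain dot product gives the same number as cosine similarity, which is part of why normalized vectors pair so well with inner-product ANN indexes. The normalize helper name is just an illustrative choice.

def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector

cat_unit = normalize(word_embeddings["cat"])
dog_unit = normalize(word_embeddings["dog"])

# For unit-length vectors, the dot product IS the cosine similarity.
print(np.dot(cat_unit, dog_unit))  # ~0.9746, same value as calculate_similarity above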

Actionable Takeaways

  1. Search engines transform words into numbers for processing.
  2. Bag-of-Words is a simple vectorization method.
  3. Word embeddings capture semantic relationships between words.
  4. Similarity scores measure the closeness of word vectors.
  5. Normalizing vectors is crucial for accurate similarity calculations.
  6. ANN search algorithms speed up searching in large datasets.
  7. Experiment with different vectorization techniques to find the best approach for your data.
  8. Word embeddings are typically pre-trained on massive datasets.
  9. ANN search dramatically reduces search time.
  10. Vectorization is a critical step in modern search engines.

Conclusion

Today, we're only scratching the surface of how search engines work. We've learned about vectorization, explored simple and more sophisticated techniques, and started to understand how words can be transformed into numbers. Next, we'll dive deeper into indexing and ranking algorithms.

What other questions do you have about how search engines work?

