Ever wondered how X finds your posts, or how Google understands what you're really asking for? It's not magic; it's search, a surprisingly clever process of turning words into numbers. Today, we're laying the foundation for understanding how it all works.
Imagine you're searching for "funny cat videos." The search engine needs to understand what "funny," "cat," and "videos" represent and find documents (posts, web pages, etc.) that are similar to your query. But computers don't understand words. They understand numbers. That's where vectorization comes in.
Search Engine Basics
At its core, a search engine consists of three main components:
- Indexer: Crawls the web, analyzes documents, and creates an index (a data structure that allows for fast searching).
- Query Processor: Receives the user's query, analyzes it, and transforms it into a format the index can understand.
- Ranker: Retrieves documents from the index and ranks them based on relevance to the query.
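The three components above can be sketched in a few lines of Python. Everything here (the `TinySearchEngine` class, its word-overlap ranking) is a toy illustration invented for this post, not how a production engine is built:

```python
class TinySearchEngine:
    def __init__(self):
        self.index = {}  # word -> set of document ids (an "inverted index")
        self.docs = {}

    def add_document(self, doc_id, text):
        """Indexer: analyze a document and update the index."""
        self.docs[doc_id] = text
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(doc_id)

    def search(self, query):
        """Query processor + ranker: match query words, rank by overlap."""
        scores = {}
        for word in query.lower().split():
            for doc_id in self.index.get(word, set()):
                scores[doc_id] = scores.get(doc_id, 0) + 1
        # Documents matching more query words come first
        return sorted(scores, key=scores.get, reverse=True)

engine = TinySearchEngine()
engine.add_document(1, "funny cat videos")
engine.add_document(2, "serious dog articles")
print(engine.search("funny cat"))  # [1]: only document 1 matches
```

Real engines replace the naive word-overlap score with the vector similarity techniques we cover next.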
Why Numbers Matter
Computers are fundamentally number crunchers. To search for information, we need to represent everything (queries, documents) as numerical vectors. These vectors capture the meaning of the words or concepts they represent. The closer two vectors are to each other, the more similar their meanings.
Vectorization: Turning Words into Numbers
Vectorization is the process of converting text (words, phrases, documents) into numerical vectors. There are several ways to do this, each with its own strengths and weaknesses. Let's explore a simple example:
1. Bag-of-Words (BoW) โ A Basic Approach
The Bag-of-Words model is a simple but often surprisingly effective technique. It ignores word order and represents a document as a collection of words and their frequencies.
Example:
Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."
BoW Representation:
Document 1: {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}
Document 2: {"the": 2, "dog": 1, "chased": 1, "cat": 1}
Code Example (Python):
import re

def bag_of_words(text):
    """
    Creates a bag-of-words representation of a text.
    """
    text = text.lower()
    words = re.findall(r'\w+', text)  # Extract words
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts
doc1 = "The cat sat on the mat."
doc2 = "The dog chased the cat."
bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)
print(f"Document 1 (BoW): {bow1}")
print(f"Document 2 (BoW): {bow2}")
Output:
Document 1 (BoW): {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
Document 2 (BoW): {'the': 2, 'dog': 1, 'chased': 1, 'cat': 1}
This is a very basic representation. It doesn't capture semantic relationships (e.g., "happy" and "joyful" are similar, but BoW treats them as completely different words).
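To see this limitation concretely, here's a small sketch (the `bow_cosine` helper is defined just for this example) comparing two sentences that mean the same thing but differ in one word:

```python
import math

def bow_cosine(bow1, bow2):
    """Cosine similarity between two bag-of-words dictionaries."""
    dot = sum(bow1[w] * bow2[w] for w in set(bow1) & set(bow2))
    norm1 = math.sqrt(sum(c * c for c in bow1.values()))
    norm2 = math.sqrt(sum(c * c for c in bow2.values()))
    return dot / (norm1 * norm2)

# Two sentences with essentially the same meaning
a = {"the": 1, "cat": 1, "is": 1, "happy": 1}
b = {"the": 1, "cat": 1, "is": 1, "joyful": 1}
print(bow_cosine(a, b))  # 0.75: "happy" and "joyful" contribute nothing
```

The score is 0.75 purely because three of the four words match exactly; BoW has no way to know that "happy" and "joyful" are near-synonyms.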
Introducing Word Embeddings
BoW has limitations. Word embeddings, like Word2Vec, GloVe, and FastText, are more sophisticated techniques that capture semantic relationships between words. They represent words as dense vectors in a high-dimensional space. Words with similar meanings are located close to each other in this space.
Conceptual Diagram:
+--------------+      +------------------------+      +--------------+
|  Input Text  | ---> |  Word Embedding Model  | ---> | Word Vectors |
+--------------+      +------------------------+      +--------------+
Let's illustrate with a simplified example using pre-trained embeddings (we won't train a model from scratch here; that's beyond the scope of this introductory lesson).
import numpy as np

# Simplified example using made-up vectors to represent word embeddings
# In reality, these would be learned from a large corpus of text
word_embeddings = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
    "happy": np.array([0.7, 0.8, 0.9]),
    "joyful": np.array([0.8, 0.7, 0.6])
}

def calculate_similarity(vector1, vector2):
    """Calculates the cosine similarity between two vectors."""
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)
# Calculate similarity between "cat" and "dog"
similarity_cat_dog = calculate_similarity(word_embeddings["cat"], word_embeddings["dog"])
print(f"Similarity between 'cat' and 'dog': {similarity_cat_dog:.4f}")

# Calculate similarity between "happy" and "joyful"
similarity_happy_joyful = calculate_similarity(word_embeddings["happy"], word_embeddings["joyful"])
print(f"Similarity between 'happy' and 'joyful': {similarity_happy_joyful:.4f}")

Output:
Similarity between 'cat' and 'dog': 0.9746
Similarity between 'happy' and 'joyful': 0.9764
Notice how the similarity scores are relatively high. Keep in mind that these toy vectors are made up; with real, trained embeddings, high scores genuinely reflect semantic relationships between words.
Search and Similarity
Now, let's put it all together. Imagine we have a query: "find happy cats."
- Vectorize the Query: We're going to use the same embedding model to vectorize the query.
- Search the Index: We compare the query vector to the vectors of all documents in our index.
- Rank and Return: We rank the documents based on the similarity between their vectors and the query vector.
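Here's a minimal sketch of these three steps, reusing the toy embeddings from earlier. The `embed` and `cosine` helpers are made up for this example; averaging word vectors is one simple way to vectorize a query, and real systems use more sophisticated methods (including stemming, so that "cats" would match "cat"):

```python
import numpy as np

# Toy embeddings, as in the earlier example; real ones come from a trained model
word_embeddings = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
    "happy": np.array([0.7, 0.8, 0.9]),
    "joyful": np.array([0.8, 0.7, 0.6]),
}

def embed(text):
    """Vectorize text by averaging the embeddings of its known words."""
    # Sketch only: assumes at least one word of the text is in the vocabulary
    vectors = [word_embeddings[w] for w in text.lower().split() if w in word_embeddings]
    return np.mean(vectors, axis=0)

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

documents = {"doc1": "happy cat", "doc2": "joyful dog"}
query_vec = embed("find happy cats")  # only "happy" is in our tiny vocabulary
ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(documents[d])), reverse=True)
print(ranked)  # ['doc1', 'doc2']: the happy cat ranks first
```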
Advanced Tips & Best Practices
- Normalization: Always normalize your vectors (e.g., using L2 normalization) to ensure that magnitude doesn't influence similarity scores.
- Dimensionality Reduction: High-dimensional embeddings can be computationally expensive. Consider using dimensionality reduction techniques like PCA to reduce the number of dimensions.
- Approximate Nearest Neighbor (ANN) Search: For large datasets, ANN search algorithms (e.g., HNSW, Faiss) provide a significant speedup compared to brute-force search.
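A quick sketch of why L2 normalization matters: once vectors have unit length, a plain dot product equals cosine similarity, so magnitude can no longer skew the score:

```python
import numpy as np

v = np.array([3.0, 4.0])
v_norm = v / np.linalg.norm(v)  # L2 normalization: scale to unit length
print(v_norm)                   # [0.6 0.8]
print(np.linalg.norm(v_norm))   # 1.0

w = np.array([6.0, 8.0])        # same direction, twice the magnitude
w_norm = w / np.linalg.norm(w)

# Dot product of normalized vectors == cosine similarity
print(np.dot(v_norm, w_norm))   # 1.0: identical direction, magnitude ignored
```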
Actionable Takeaways
- Search engines transform words into numbers for processing.
- Bag-of-Words is a simple vectorization method.
- Word embeddings capture semantic relationships between words.
- Similarity scores measure the closeness of word vectors.
- Normalizing vectors is crucial for accurate similarity calculations.
- ANN search algorithms dramatically speed up searching in large datasets.
- Experiment with different vectorization techniques to find the best approach for your data.
- Word embeddings are typically pre-trained on massive datasets.
- Vectorization is a critical step in modern search engines.
Conclusion
Today, we've only scratched the surface of how search engines work. We learned about vectorization, explored simple and more sophisticated techniques, and started to understand how words can be transformed into numbers. Next, we'll dive deeper into indexing and ranking algorithms.
What other questions do you have about how search engines work?