Welcome back! Happy New Year! In our previous post, we explored cosine similarity and how it focuses on the direction of vectors, removing the influence of their magnitudes. Today, we’re shifting gears to a different perspective: Euclidean distance. Unlike cosine similarity, which tells us how aligned two vectors are, Euclidean distance tells us how far apart they are – the “as-the-crow-flies” distance between them. This is a fundamentally different way of measuring similarity and is crucial for many applications.
Why Does Distance Matter?
Think about recommending movies. Cosine similarity might tell you that two people have similar tastes because they watch similar types of movies. But Euclidean distance can tell you how closely their ratings align. A small distance means they rate movies very similarly. A large distance means their ratings diverge. This difference is key for personalized recommendations, clustering users, and a host of other tasks.
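To make the contrast concrete, here is a small sketch (the rating values are invented for illustration) of two viewers with identical taste but very different rating scales. Cosine similarity calls them identical; Euclidean distance does not:

```python
import math

# Hypothetical ratings for three movies (values invented for illustration)
alice = [5.0, 4.0, 1.0]   # rates generously
bob   = [2.5, 2.0, 0.5]   # same taste, exactly half of Alice's scores

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

print(cosine(alice, bob))     # ≈ 1.0 -- perfectly aligned direction
print(euclidean(alice, bob))  # ≈ 3.24 -- yet their ratings sit far apart
```

Same direction, different magnitudes: which metric is "right" depends entirely on whether you care about the scale of the ratings.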
Revisiting Vectors & Their Components (Briefly)
As a quick reminder, a vector is simply a list of numbers. For example:
- Vector A: [2.5, 1.8, 5.2]
- Vector B: [3.1, 0.9, 1.4]
Each number in the list represents a component or feature of the data being represented. In the movie rating example, these features could be ratings for different genres, actors, or directors.
The Euclidean Distance Formula
The Euclidean distance between two vectors, A and B, is calculated using the following formula:
distance(A, B) = sqrt( (a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2 )
Where:
- a1, a2, ..., an are the components of vector A
- b1, b2, ..., bn are the components of vector B
- sqrt() is the square root function
In simpler terms, we subtract the corresponding components of the two vectors, square the results, sum them up, and then take the square root of the sum.
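Plugging in the three-component vectors from above makes each step visible:

distance(A, B) = sqrt( (2.5 − 3.1)² + (1.8 − 0.9)² + (5.2 − 1.4)² ) = sqrt( 0.36 + 0.81 + 14.44 ) = sqrt(15.61) ≈ 3.95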
Visualizing Euclidean Distance
Imagine two points on a graph. The Euclidean distance is simply the straight-line distance between those two points.
```
┌────────────┐
│  Vector A  │
│ [2.5, 1.8] │
└────────────┘\
               \
                \
                 ┌────────────┐
                 │  Vector B  │
                 │ [3.1, 0.9] │
                 └────────────┘
```
A Simple Example: Calculating Euclidean Distance
Let’s calculate the Euclidean distance between Vector A = [2.5, 1.8] and Vector B = [3.1, 0.9].
```python
import math

vector_a = [2.5, 1.8]
vector_b = [3.1, 0.9]

distance = 0
for i in range(len(vector_a)):
    distance += (vector_a[i] - vector_b[i])**2
distance = math.sqrt(distance)

print(f"The Euclidean distance between {vector_a} and {vector_b} is: {distance}")
```
What’s happening:
- `import math`: Imports the `math` module to use the `sqrt()` function.
- `vector_a` and `vector_b`: Define the two vectors.
- `distance = 0`: Initializes the `distance` variable.
- `for i in range(len(vector_a))`: Iterates through the components of the vectors.
- `distance += (vector_a[i] - vector_b[i])**2`: Calculates the squared difference between the components and adds it to `distance`.
- `distance = math.sqrt(distance)`: Takes the square root of the sum to get the Euclidean distance.
- `print(...)`: Prints the result.
Expected Output:
The Euclidean distance between [2.5, 1.8] and [3.1, 0.9] is: 1.0816653826391969
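The loop above can be packaged into a small reusable helper so we don't repeat it for every pair of vectors (a sketch; the name `euclidean_distance` is our own):

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean_distance([2.5, 1.8], [3.1, 0.9]))  # ≈ 1.0817
print(euclidean_distance([0, 0], [3, 4]))          # 5.0 (the classic 3-4-5 triangle)
```

The `zip` form works for vectors of any dimension, and the length check guards against silently comparing mismatched profiles.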
A More Realistic Example: Customer Profiles
Let’s say we’re analyzing customer profiles for an online store. Each customer’s profile is represented as a vector of their spending habits across different categories. Features might include “electronics,” “clothing,” “books,” etc. A smaller Euclidean distance between two customer profiles suggests they have similar shopping habits.
```python
import math

customer_a = [150, 50, 25, 10]  # Electronics, Clothing, Books, Other
customer_b = [175, 70, 15, 5]
customer_c = [50, 10, 60, 20]   # Very different shopping habits

distance_ab = 0
for i in range(len(customer_a)):
    distance_ab += (customer_a[i] - customer_b[i])**2
distance_ab = math.sqrt(distance_ab)

distance_ac = 0
for i in range(len(customer_a)):
    distance_ac += (customer_a[i] - customer_c[i])**2
distance_ac = math.sqrt(distance_ac)

print(f"Distance between Customer A and B: {distance_ab}")
print(f"Distance between Customer A and C: {distance_ac}")
```
What’s happening:
- The code calculates the Euclidean distance between Customer A and Customer B, and Customer A and Customer C.
- The output shows that the distance between Customer A and B is smaller than the distance between Customer A and C. This suggests that Customer A and Customer B have more similar spending habits than Customer A and Customer C.
Expected Output:
Distance between Customer A and B: 33.91164991562634
Distance between Customer A and C: 113.68817000902073
Euclidean Distance vs. Cosine Similarity: Key Differences
| Feature | Euclidean Distance | Cosine Similarity |
|---|---|---|
| Measures | Straight-line distance | Angle between vectors |
| Sensitive to | Magnitude and direction | Direction only |
| Interpretation | Smaller distance = more similar | Closer to 1 = more similar |
| Use cases | Clustering, finding nearest neighbors | Recommendation systems, document similarity |
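The "finding nearest neighbors" use case is worth a quick sketch: given one target customer, pick the closest profile from a set of candidates (profile names and values are invented for illustration):

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical customer profiles keyed by name (values invented for illustration)
profiles = {
    "B": [175, 70, 15, 5],
    "C": [50, 10, 60, 20],
    "D": [140, 55, 30, 12],
}
target = [150, 50, 25, 10]  # Customer A

# The nearest neighbor is the profile with the smallest distance to the target
nearest = min(profiles, key=lambda name: euclidean_distance(target, profiles[name]))
print(nearest)  # D
```

This is the core step of k-nearest-neighbors classification: `min` (or a sort, for k > 1) over the distances.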
A Pitfall: Magnitude Bias in Euclidean Distance
A major drawback of Euclidean distance is its sensitivity to the magnitude of the vectors. If one customer spends significantly more than another, their Euclidean distance will be larger, even if their shopping habits are similar. This is why cosine similarity is often preferred when magnitude doesn’t matter.
Let’s illustrate this:
```python
import math

customer_a = [100, 50, 25]
customer_b = [200, 100, 50]  # Same habits, just double spending

distance_ab = 0
for i in range(len(customer_a)):
    distance_ab += (customer_a[i] - customer_b[i])**2
distance_ab = math.sqrt(distance_ab)

print(f"Euclidean distance between Customer A and B: {distance_ab}")
```
Expected Output:
Euclidean distance between Customer A and B: 114.564392373896
Notice that the distance is large even though the two customers spend in identical proportions; Customer B simply spends twice as much in each category.
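If only the proportions matter, one common workaround is to normalize each vector to unit length before measuring distance. This is a sketch of that idea, not part of the original example:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

customer_a = [100, 50, 25]
customer_b = [200, 100, 50]  # same proportions, double the spending

a_unit = normalize(customer_a)
b_unit = normalize(customer_b)

distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_unit, b_unit)))
print(distance)  # ≈ 0.0 -- the magnitude difference is gone
```

After normalization the two profiles are (up to floating-point error) the same point, which is exactly what cosine similarity would have told us; distance on unit vectors and cosine similarity are closely related.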
Optimizing Euclidean Distance Calculation
For large datasets, calculating the Euclidean distance can be computationally expensive. Libraries like NumPy provide optimized functions for vector operations:
```python
import numpy as np

customer_a = np.array([100, 50, 25])
customer_b = np.array([200, 100, 50])

distance = np.linalg.norm(customer_a - customer_b)  # Efficient vectorized calculation
print(f"Euclidean distance using NumPy: {distance}")
```
np.linalg.norm() calculates the Euclidean norm (magnitude) of a vector; applied to the difference customer_a - customer_b, that norm is exactly the Euclidean distance between the two. NumPy's optimized routines make this calculation much faster than the manual loop, especially for high-dimensional vectors.
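NumPy also makes it cheap to compute every pairwise distance at once via broadcasting, rather than looping over pairs. A sketch, reusing the customer numbers from earlier:

```python
import numpy as np

# Rows are customer profiles (same numbers as the earlier example)
profiles = np.array([
    [150, 50, 25, 10],   # A
    [175, 70, 15, 5],    # B
    [50, 10, 60, 20],    # C
])

# Broadcasting: (3,1,4) - (1,3,4) -> (3,3,4), then L2 norm over the last axis
pairwise = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=-1)
print(pairwise.round(2))  # 3x3 symmetric matrix with zeros on the diagonal
```

For very large datasets this materializes an n×n×d intermediate array, so dedicated routines such as `scipy.spatial.distance.cdist` are often preferable; the broadcasting version is fine for small to medium n.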
Conclusion
Euclidean distance is a powerful tool for measuring similarity based on straight-line distance. It’s especially useful when the magnitude of the vectors is meaningful. However, be mindful of its sensitivity to magnitude and consider using cosine similarity when this is a concern. And remember to leverage optimized libraries like NumPy for efficient calculations with large datasets. In our next post, we’ll explore another distance metric and compare them side-by-side.