Euclidean Distance: Measuring ‘As-the-Crow-Flies’ Distance (Day 4/5)

Welcome back! Happy New Year! In our previous post, we explored cosine similarity and how it focuses on the direction of vectors, removing the influence of their magnitudes. Today, we’re shifting gears to a different perspective: Euclidean distance. Unlike cosine similarity, which tells us how aligned two vectors are, Euclidean distance tells us how far apart they are – the “as-the-crow-flies” distance between them. This is a fundamentally different way of measuring similarity and is crucial for many applications.

Why Does Distance Matter?

Think about recommending movies. Cosine similarity might tell you that two people have similar tastes because they watch similar types of movies. But Euclidean distance can tell you how closely their ratings align. A small distance means they rate movies very similarly. A large distance means their ratings diverge. This difference is key for personalized recommendations, clustering users, and a host of other tasks.

Revisiting Vectors & Their Components (Briefly)

As a quick reminder, a vector is simply a list of numbers. For example:

  • Vector A: [2.5, 1.8, 5.2]
  • Vector B: [3.1, 0.9, 1.4]

Each number in the list represents a component or feature of the data being represented. In the movie rating example, these features could be ratings for different genres, actors, or directors.

The Euclidean Distance Formula

The Euclidean distance between two vectors, A and B, is calculated using the following formula:

distance(A, B) = sqrt( (a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2 )

Where:

  • a1, a2, ..., an are the components of vector A
  • b1, b2, ..., bn are the components of vector B
  • sqrt() is the square root function

In simpler terms, we subtract the corresponding components of the two vectors, square the results, sum them up, and then take the square root of the sum.
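
The formula translates directly into a small reusable function. Here is a minimal sketch (the name euclidean_distance is just illustrative):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of components")
    # Sum the squared differences of corresponding components, then take the root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0 (the classic 3-4-5 triangle)
```

Writing it once as a function avoids repeating the loop every time we need a distance, as we'll otherwise do in the examples below.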

Visualizing Euclidean Distance

Imagine two points on a graph. The Euclidean distance is simply the straight-line distance between those two points. In two dimensions this is just the Pythagorean theorem: the distance is the hypotenuse of the right triangle whose legs are the differences in each coordinate.

┌────────────┐
│ Vector A   │     /
│ [2.5, 1.8] │    /
└────────────┘   /
                /
               /
┌────────────┐
│ Vector B   │
│ [3.1, 0.9] │
└────────────┘

A Simple Example: Calculating Euclidean Distance

Let’s calculate the Euclidean distance between Vector A = [2.5, 1.8] and Vector B = [3.1, 0.9].

import math

vector_a = [2.5, 1.8]
vector_b = [3.1, 0.9]

# Sum the squared differences of the corresponding components
distance = 0
for i in range(len(vector_a)):
  distance += (vector_a[i] - vector_b[i])**2

distance = math.sqrt(distance)

print(f"The Euclidean distance between {vector_a} and {vector_b} is: {distance:.4f}")

What’s happening:

  1. import math: Imports the math module so we can use math.sqrt().
  2. vector_a and vector_b: Define the two vectors.
  3. distance = 0: Initializes the running total.
  4. for i in range(len(vector_a)): Iterates through the components of the vectors.
  5. distance += (vector_a[i] - vector_b[i])**2: Squares the difference between corresponding components and adds it to the running total.
  6. distance = math.sqrt(distance): Takes the square root of the sum to get the Euclidean distance.
  7. print(...): Prints the result, formatted to four decimal places.

Expected Output:

The Euclidean distance between [2.5, 1.8] and [3.1, 0.9] is: 1.0817

A More Realistic Example: Customer Profiles

Let’s say we’re analyzing customer profiles for an online store. Each customer’s profile is represented as a vector of their spending habits across different categories. Features might include “electronics,” “clothing,” “books,” etc. A smaller Euclidean distance between two customer profiles suggests they have similar shopping habits.

import math

customer_a = [150, 50, 25, 10] # Electronics, Clothing, Books, Other
customer_b = [175, 70, 15, 5]
customer_c = [50, 10, 60, 20]  # Very different shopping habits

distance_ab = 0
for i in range(len(customer_a)):
  distance_ab += (customer_a[i] - customer_b[i])**2

distance_ab = math.sqrt(distance_ab)

distance_ac = 0
for i in range(len(customer_a)):
  distance_ac += (customer_a[i] - customer_c[i])**2

distance_ac = math.sqrt(distance_ac)

print(f"Distance between Customer A and B: {distance_ab:.2f}")
print(f"Distance between Customer A and C: {distance_ac:.2f}")

What’s happening:

  1. The code calculates the Euclidean distance between Customer A and Customer B, and between Customer A and Customer C.
  2. The distance between Customer A and B is much smaller than the distance between Customer A and C, suggesting that Customers A and B have more similar spending habits.

Expected Output:

Distance between Customer A and B: 33.91
Distance between Customer A and C: 113.69

Euclidean Distance vs. Cosine Similarity: Key Differences

Feature          Euclidean Distance                 Cosine Similarity
Measures         Straight-line distance             Angle between vectors
Sensitive to     Magnitude and direction            Direction only
Interpretation   Smaller distance = more similar    Closer to 1 = more similar
Use cases        Clustering, nearest neighbors      Recommendation systems, document similarity
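
To make the contrast concrete, here is a small sketch (plain Python, no libraries) computing both metrics for the 3-D vectors from the start of the post:

```python
import math

a = [2.5, 1.8, 5.2]
b = [3.1, 0.9, 1.4]

# Euclidean distance: square root of the summed squared differences
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cosine similarity: dot product divided by the product of the magnitudes
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"Euclidean distance: {euclidean:.4f}")
print(f"Cosine similarity:  {cosine:.4f}")
```

The two numbers answer different questions about the same pair of vectors: how far apart they are, and how closely they point in the same direction.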

A Pitfall: Magnitude Bias in Euclidean Distance

A major drawback of Euclidean distance is its sensitivity to the magnitude of the vectors. If one customer spends significantly more than another, their Euclidean distance will be larger, even if their shopping habits are similar. This is why cosine similarity is often preferred when magnitude doesn’t matter.

Let’s illustrate this:

import math

customer_a = [100, 50, 25]
customer_b = [200, 100, 50] # Same habits, just double the spending

distance_ab = 0
for i in range(len(customer_a)):
  distance_ab += (customer_a[i] - customer_b[i])**2

distance_ab = math.sqrt(distance_ab)

print(f"Euclidean distance between Customer A and B: {distance_ab:.2f}")

Expected Output:

Euclidean distance between Customer A and B: 114.56

Notice how the distance is significantly larger simply because Customer B spends twice as much in each category.
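
For contrast, a quick sketch (again plain Python) shows that cosine similarity treats these two profiles as essentially identical, because one is just a scaled copy of the other:

```python
import math

customer_a = [100, 50, 25]
customer_b = [200, 100, 50]  # customer_a scaled by 2

# Cosine similarity: dot product over the product of the magnitudes
dot = sum(x * y for x, y in zip(customer_a, customer_b))
norm_a = math.sqrt(sum(x * x for x in customer_a))
norm_b = math.sqrt(sum(y * y for y in customer_b))

cosine = dot / (norm_a * norm_b)
print(f"Cosine similarity between Customer A and B: {cosine:.4f}")  # 1.0000
```

Scaling a vector changes its magnitude but not its direction, so the angle between the two profiles is zero and the similarity is 1.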

Optimizing Euclidean Distance Calculation

For large datasets, calculating the Euclidean distance can be computationally expensive. Libraries like NumPy provide optimized functions for vector operations:

import numpy as np

customer_a = np.array([100, 50, 25])
customer_b = np.array([200, 100, 50])

distance = np.linalg.norm(customer_a - customer_b) # Efficient calculation

print(f"Euclidean distance using NumPy: {distance}")

np.linalg.norm(customer_a - customer_b) computes the Euclidean norm (length) of the difference vector, which is exactly the Euclidean distance between the two points. NumPy’s optimized routines make this calculation much faster than the manual loop, especially for high-dimensional vectors.
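
The same idea extends to many comparisons at once: thanks to broadcasting, a single np.linalg.norm call with axis=1 can compute the distance from one profile to every row of a matrix. A sketch, reusing the customer profiles from earlier (variable names are illustrative):

```python
import numpy as np

# One profile per row: Electronics, Clothing, Books, Other
profiles = np.array([
    [150, 50, 25, 10],   # Customer A
    [175, 70, 15,  5],   # Customer B
    [ 50, 10, 60, 20],   # Customer C
])
query = np.array([150, 50, 25, 10])  # Customer A again

# Subtract the query from every row, then take the norm of each row
distances = np.linalg.norm(profiles - query, axis=1)
print(distances)  # roughly [0.0, 33.91, 113.69]
```

This row-at-a-time pattern is the basis of nearest-neighbor search: sort the resulting distances and the smallest entries are the most similar profiles.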

Conclusion

Euclidean distance is a powerful tool for measuring similarity based on straight-line distance. It’s especially useful when the magnitude of the vectors is meaningful. However, be mindful of its sensitivity to magnitude and consider using cosine similarity when this is a concern. And remember to leverage optimized libraries like NumPy for efficient calculations with large datasets. In our next post, we’ll explore another distance metric and compare them side-by-side.

