All Roads Lead to Maths — A High-Level Introduction to Embeddings
Embeddings are central to many areas of machine learning. They provide a way to numerically represent things in the physical world that are not naturally numeric. In this blog post, we’ll explore a simple use case: using embeddings to determine pairwise similarities within a collection of animals.
Say you’ve found a long list of animals and want to discover which pairs of creatures on the list are the most similar. For each animal, the list contains values for four characteristics typical of that animal.
We see a sample of the list below:
The animals in this sample share some characteristics and differ in others. But which pair of animals is most similar?
Let’s reduce the number of characteristics to plot the animals neatly on a graph and sprinkle in some extra animals for interest. We’ve selected the ‘Size’ and ‘Volume’ attributes here, though we could have chosen any subset.
From this view, it looks like the elephant and the lion are the closest pair. Note, however, that this closeness depends on the attributes we’ve selected. Indeed, if we’d included the ‘Temper’ attribute instead of ‘Volume’, the graph would look remarkably different.
To go one step further, let’s give numeric values to each of the points on the graph. This allows us to robustly measure the similarity between each pair using standard distance measures.
We’ll say for this example that each axis of the graph runs from -1 (bottom or left) to 1 (top or right), with 0 at the centre. We’ll then describe each point using two numbers, representing the size and volume of the corresponding animal. For example, ‘Dog’ has the value (0.1, 0.4), as its size is roughly 0.1, and volume is roughly 0.4. The animals from our above example have values:
- Elephant: (0.9, 0.9)
- Lion: (0.5, 0.7)
- Crocodile: (0.8, -0.3)
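In code, this representation is nothing more exotic than a mapping from names to small vectors. Here’s a minimal sketch in Python, using the (size, volume) coordinates read off the graph above (the `animals` dictionary name is our own choice):

```python
# Each animal becomes a two-dimensional vector of (size, volume),
# with both values on the -1 to 1 scale described above.
animals = {
    "dog": (0.1, 0.4),
    "elephant": (0.9, 0.9),
    "lion": (0.5, 0.7),
    "crocodile": (0.8, -0.3),
}

# Once animals are vectors, single-dimension comparisons are just
# absolute differences, e.g. elephant vs crocodile on size alone:
size_gap = abs(animals["elephant"][0] - animals["crocodile"][0])
print(f"Size gap, elephant vs crocodile: {size_gap:.1f}")
```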
With these numeric representations in hand, let’s return to the task of determining which pair of animals in our example set is most similar.
We can use a variety of methods to explore the similarities between each pair. Comparing value differences in a single dimension, we find that:
- In the size dimension, the elephant and crocodile are most similar, with a distance of 0.1 (0.9–0.8)
- In the volume dimension, the elephant and lion are most similar, with a distance of 0.2 (0.9–0.7)
We can also find the “straight-line” distance between each pair of points using the Euclidean distance formula: the square root of the sum of the squared differences in each dimension. Skipping the calculations for the sake of brevity, this produces the following distances:
- Elephant → Lion: 0.45
- Lion → Crocodile: 1.04
- Crocodile → Elephant: 1.20
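These pairwise distances are easy to compute with Python’s standard library. A minimal sketch, reusing the same (size, volume) coordinates as above:

```python
import math
from itertools import combinations

# The (size, volume) embeddings read off the graph.
animals = {
    "elephant": (0.9, 0.9),
    "lion": (0.5, 0.7),
    "crocodile": (0.8, -0.3),
}

# Euclidean distance between every pair of animals, smallest first.
pairs = sorted(
    (math.dist(point_a, point_b), name_a, name_b)
    for (name_a, point_a), (name_b, point_b) in combinations(animals.items(), 2)
)
for distance, name_a, name_b in pairs:
    print(f"{name_a} -> {name_b}: {distance:.2f}")
```

`math.dist` (Python 3.8+) computes the Euclidean distance directly, and `itertools.combinations` enumerates each pair exactly once; sorting puts the most similar pair first.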
Clearly, the elephant and lion are most similar (when characterised by size and volume), though we’ll note that even this pair is separated by a noticeable distance.
This is just one example of the usefulness of embeddings for comparing the similarities and differences of non-mathematical objects in a mathematical manner. Still, this is just the tip of the iceberg. In future blog posts, we’ll explore practical embedding methods for NLP, their emergent properties, and potential use cases.
About the Author: Henry Zwart is an ML Engineer at Arcanum AI, interested in general NLP and Neurosymbolic AI. As a computer scientist and mathematician, he enjoys solving problems in a research-first manner, and helping others to see the hidden beauty of mathematics and machine learning.