The Jaccard's Index, a ratio between the intersection of two sets A and B , over the union of A and B , is a simple and effective tool to measure the similarity between two groups of elements. While it's use in data science is widely mentioned, there are few examples that show how such an algorithm is applied to datasets in the real world.

#### The Jaccarad's Index

### J(A,B) = \frac{|A \cap B |}{|A \cup B |}

0 \leq J(A,B) \leq 1

To apply the Jaccard's Index over two sets is trivial:

A = ['a', 'b', 'c', 'd'] B = ['c', 'd', 'e', 'f'] index = len(set(A).intersection(B))/len(set(A).union(set(B))) print(index)

0.3333333333333333

Yet, most real world problems aren't as straightforward as shown above. If the algorithm was applied to many object pairs, we would need to find a more efficient method for computation. As it turns out, with a little bit of linear algebra, we are able to calculate the Jaccard's Index for a large dataset efficiently.

And it is with this context that we will build a simple and effective recommender system with the Jaccard's Index, using a real-world dataset. This post will cover both the math and code involved in creating this feature.

### Frequently Bought Together

We will develop a recommender system for the "frequently bought together" feature, usually found in the product pages of e-commerce sites. The purpose of this feature is to suggest complementary products to users, in a bid to get users to add more items to their cart.

In other words, we would need to find an algorithm to solve the following problem:

For each product in inventory, generate a list of products that customers will most likely purchase together with that product

To see how the Jaccard's Index can help us solve this problem, we can think about the solution in this manner:

- Consider a product x . We need a measure of a probability for all other products that users will likely purchase together with product x
- To measure how likely product x is going to be bought together with, for example, product y , a similarity measure such as the Jaccard's Index can be calculated based on the orders that contain products x , y , or both.

Consider:

A = \{ Set of all customers orders that contain the product x \}

B = \{ Set of all customers orders that contain the product y \}

It follows that:

|A \cap B | = Number of orders that contain *both* product x and product y

|A \cup B | = Number of orders that contain *either* product x , product y or both

Therefore:

J(A,B) = \frac{|A \cap B |}{|A \cup B |} \simeq The likelihood that products x and y will appear in the same orders

### Dataset

The dataset we will use contains items in customers' orders from an e-commerce firm. But before that, we will go through the math using a toy dataset.

Data involving online orders usually resembles the following table below (See Table 1), where each row represents an item in the order that was purchased and includes fields such as the order id, product name and quantity purchased. For our purposes, we only require the order id and product name (or any unique identifier of the item). Quantity purchased is not needed as we only want to know if the item was purchased together with another item, regardless of quantity.

order name qty 0 0001 Product A 12 1 0001 Product C 5 2 0001 Product D 2 3 0002 Product A 9 4 0002 Product B 3 5 0002 Product C 5 6 0003 Product A 1 7 0003 Product B 1 8 0003 Product C 2 9 0003 Product D 3 10 0004 Product B 1 11 0004 Product D 7 12 0004 Product A 5 13 0005 Product B 1 14 0005 Product C 4

Table 1: An example of orders data

However, to begin our analysis, we first need to pivot the data in Table 1 into the following form:

order 0001 0002 0003 0004 0005 name Product A 1.0 1.0 1.0 1.0 0.0 Product B 0.0 1.0 1.0 1.0 1.0 Product C 1.0 1.0 1.0 0.0 1.0 Product D 1.0 0.0 1.0 1.0 0.0

Table 2: Pivoted orders data

With this transformation, each column represents one unique order and each row represents each product in inventory. In other words, if our data consists of m=5 orders and n=4 unique products, the dimensions of the dataframe would be (4 x 5).

If a product was purchased in an order, the corresponding cell value will be 1. Otherwise it will be 0. For example, Product D is present in orders 0001, 0003, and 0004, hence the row values (1.0, 0.0, 1.0, 1.0, 0.0).

The starting point for us will be the matrix values of Table 2 which we will label X_{m,n}. And our goal is to end up with matrix J_{n,n} of Jaccard index values for every possible product pair as shown in Table 3:

Product A Product B Product C Product D name Product A 0.999998 0.599999 0.599999 0.749998 Product B 0.599999 0.999998 0.599999 0.399999 Product C 0.599999 0.599999 0.999998 0.399999 Product D 0.749998 0.399999 0.399999 0.999997

Table 3: The end result - A matrix of computed Jaccard's Index values

The diagonals are always 1 in J_{n,n} as the Jaccard Index value for a set with itself is always 1. The off-diagonals are symmetric and each cell represents the index values for the product pairing. For example, J(Product A , Product C)=0.6 (you can verify this manually from Table 2 values) and can be referred to in the matrix position (0,2), (2,0).

### Intersection...

Let's start with the numerator. In scalar form, |A \cap B | represents the cardinality of the set of orders that contain both products A and B. In matrix form, it will be a *n* x *n* matrix with off-diagonal cells representing this cardinality for each product pair.

This is achieved simply with XX^T

X =\begin{bmatrix} 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ \end{bmatrix}

XX^T =\begin{bmatrix} 4 & 3 & 3 & 3 \\ 3 & 4 & 3 & 2 \\ 3 & 3 & 4 & 2 \\ 3 & 2 & 2 & 3 \end{bmatrix}

You may notice that the diagonals of XX^T show the total number of orders each product is present in. You can also verify that the off-diagonals are the number of orders that contain both products i and j.

### ...Over Union

For the denominator's scalar form, |A \cup B | = |A| + |B| - |A \cap B |. Since we already figured out |A \cap B | as the numerator, we need to figure out what |A| + |B| represents in matrix form.

That matrix should be a *n* x *n* matrix with each off-diagonal cell representing the sum of orders present in both product *i* and product *j*

To get that matrix, a bit of transformation is in order:

(X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T = \begin{bmatrix} 8 & 8 & 8 & 7 \\ 8 & 8 & 8 & 7 \\ 8 & 8 & 8 & 7 \\ 7 & 7 & 7 & 6 \\ \end{bmatrix}

Where \textbf{1}_{m,n} is a unit matrix of size *m* x *n*, in this case m=5, n=4.
You can verify that cell (i,j) is the total number of orders each product i and j is present in, added together.

To get the union, we have to deduct the intersection (numerator) XX^T

(X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T = \begin{bmatrix} 4 & 5 & 5 & 4 \\ 5 & 4 & 5 & 5 \\ 5 & 5 & 4 & 5 \\ 4 & 5 & 5 & 3 \\ \end{bmatrix}

### The Jaccard's Matrix

Putting it all together, we have the Jaccard's Index in matrix form:

J(X) = XX^T \:\:\emptyset \:\:\Big((X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T\Big)

Where:

J(X)_{i,j} = \frac{\Big( XX^T \Big)_{i,j}}{\Big((X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T\Big)_{i,j}}

Following from our example values:

J(X) = \begin{bmatrix} 1.0 & 0.6 & 0.6 & 0.75 \\ 0.6 & 1.0 & 0.6 & 0.4 \\ 0.6 & 0.6 & 1.0 & 0.4 \\ 0.75 & 0.4 & 0.4 & 1.0 \\ \end{bmatrix}

Each off-diagonal cell in J(X) is the computed Jaccard's Index value between product i and product j.