Jaccard's Index in Practice

Building a recommender system using the Jaccard's index algorithm

01 Jan 2019
Similarity Functions
Python, Numpy

The Jaccard's Index, a ratio between the intersection of two sets A and B , over the union of A and B , is a simple and effective tool to measure the similarity between two groups of elements. While it's use in data science is widely mentioned, there are few examples that show how such an algorithm is applied to datasets in the real world.

The Jaccarad's Index

J(A,B) = \frac{|A \cap B |}{|A \cup B |}

0 \leq J(A,B) \leq 1

To apply the Jaccard's Index over two sets is trivial:

A = ['a', 'b', 'c', 'd']
B = ['c', 'd', 'e', 'f']
index = len(set(A).intersection(B))/len(set(A).union(set(B)))
print(index)

0.3333333333333333

Yet, most real world problems aren't as straightforward as shown above. If the algorithm was applied to many object pairs, we would need to find a more efficient method for computation. As it turns out, with a little bit of linear algebra, we are able to calculate the Jaccard's Index for a large dataset efficiently.

And it is with this context that we will build a simple and effective recommender system with the Jaccard's Index, using a real-world dataset. This post will cover both the math and code involved in creating this feature.

Frequently Bought Together

We will develop a recommender system for the "frequently bought together" feature, usually found in the product pages of e-commerce sites. The purpose of this feature is to suggest complementary products to users, in a bid to get users to add more items to their cart.

In other words, we would need to find an algorithm to solve the following problem:

For each product in inventory, generate a list of products that customers will most likely purchase together with that product

To see how the Jaccard's Index can help us solve this problem, we can think about the solution in this manner:

Consider a product x . We need a measure of a probability for all other products that users will likely purchase together with product x
To measure how likely product x is going to be bought together with, for example, product y , a similarity measure such as the Jaccard's Index can be calculated based on the orders that contain products x , y , or both.

Consider:

A = \{ Set of all customers orders that contain the product x \}
B = \{ Set of all customers orders that contain the product y \}

It follows that:

|A \cap B | = Number of orders that contain both product x and product y
|A \cup B | = Number of orders that contain either product x , product y or both

Therefore:

J(A,B) = \frac{|A \cap B |}{|A \cup B |} \simeq The likelihood that products x and y will appear in the same orders

Dataset

The dataset we will use contains items in customers' orders from an e-commerce firm. But before that, we will go through the math using a toy dataset.

Data involving online orders usually resembles the following table below (See Table 1), where each row represents an item in the order that was purchased and includes fields such as the order id, product name and quantity purchased. For our purposes, we only require the order id and product name (or any unique identifier of the item). Quantity purchased is not needed as we only want to know if the item was purchased together with another item, regardless of quantity.

   order       name  qty
0   0001  Product A   12
1   0001  Product C    5
2   0001  Product D    2
3   0002  Product A    9
4   0002  Product B    3
5   0002  Product C    5
6   0003  Product A    1
7   0003  Product B    1
8   0003  Product C    2
9   0003  Product D    3
10  0004  Product B    1
11  0004  Product D    7
12  0004  Product A    5
13  0005  Product B    1
14  0005  Product C    4

Table 1: An example of orders data

However, to begin our analysis, we first need to pivot the data in Table 1 into the following form:

order      0001  0002  0003  0004  0005
name                                   
Product A   1.0   1.0   1.0   1.0   0.0
Product B   0.0   1.0   1.0   1.0   1.0
Product C   1.0   1.0   1.0   0.0   1.0
Product D   1.0   0.0   1.0   1.0   0.0

Table 2: Pivoted orders data

With this transformation, each column represents one unique order and each row represents each product in inventory. In other words, if our data consists of m=5 orders and n=4 unique products, the dimensions of the dataframe would be (4 x 5).

If a product was purchased in an order, the corresponding cell value will be 1. Otherwise it will be 0. For example, Product D is present in orders 0001, 0003, and 0004, hence the row values (1.0, 0.0, 1.0, 1.0, 0.0).

The starting point for us will be the matrix values of Table 2 which we will label X_{m,n}. And our goal is to end up with matrix J_{n,n} of Jaccard index values for every possible product pair as shown in Table 3:

       Product A  Product B  Product C  Product D
name                                                 
Product A   0.999998   0.599999   0.599999   0.749998
Product B   0.599999   0.999998   0.599999   0.399999
Product C   0.599999   0.599999   0.999998   0.399999
Product D   0.749998   0.399999   0.399999   0.999997

Table 3: The end result - A matrix of computed Jaccard's Index values

The diagonals are always 1 in J_{n,n} as the Jaccard Index value for a set with itself is always 1. The off-diagonals are symmetric and each cell represents the index values for the product pairing. For example, J(Product A , Product C)=0.6 (you can verify this manually from Table 2 values) and can be referred to in the matrix position (0,2), (2,0).

Intersection...

Let's start with the numerator. In scalar form, |A \cap B | represents the cardinality of the set of orders that contain both products A and B. In matrix form, it will be a n x n matrix with off-diagonal cells representing this cardinality for each product pair.

This is achieved simply with XX^T

X =\begin{bmatrix} 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ \end{bmatrix}

XX^T =\begin{bmatrix} 4 & 3 & 3 & 3 \\ 3 & 4 & 3 & 2 \\ 3 & 3 & 4 & 2 \\ 3 & 2 & 2 & 3 \end{bmatrix}

You may notice that the diagonals of XX^T show the total number of orders each product is present in. You can also verify that the off-diagonals are the number of orders that contain both products i and j.

...Over Union

For the denominator's scalar form, |A \cup B | = |A| + |B| - |A \cap B |. Since we already figured out |A \cap B | as the numerator, we need to figure out what |A| + |B| represents in matrix form.

That matrix should be a n x n matrix with each off-diagonal cell representing the sum of orders present in both product i and product j

To get that matrix, a bit of transformation is in order:

(X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T = \begin{bmatrix} 8 & 8 & 8 & 7 \\ 8 & 8 & 8 & 7 \\ 8 & 8 & 8 & 7 \\ 7 & 7 & 7 & 6 \\ \end{bmatrix}

Where \textbf{1}_{m,n} is a unit matrix of size m x n, in this case m=5, n=4. You can verify that cell (i,j) is the total number of orders each product i and j is present in, added together.

To get the union, we have to deduct the intersection (numerator) XX^T

(X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T = \begin{bmatrix} 4 & 5 & 5 & 4 \\ 5 & 4 & 5 & 5 \\ 5 & 5 & 4 & 5 \\ 4 & 5 & 5 & 3 \\ \end{bmatrix}

The Jaccard's Matrix

Putting it all together, we have the Jaccard's Index in matrix form:

J(X) = XX^T \:\:\emptyset \:\:\Big((X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T\Big)

Where:

J(X)_{i,j} = \frac{\Big( XX^T \Big)_{i,j}}{\Big((X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T\Big)_{i,j}}

Following from our example values:

J(X) = \begin{bmatrix} 1.0 & 0.6 & 0.6 & 0.75 \\ 0.6 & 1.0 & 0.6 & 0.4 \\ 0.6 & 0.6 & 1.0 & 0.4 \\ 0.75 & 0.4 & 0.4 & 1.0 \\ \end{bmatrix}

Each off-diagonal cell in J(X) is the computed Jaccard's Index value between product i and product j.

Putting it in code - With real-world data

import pandas as pd
import numpy as np
import hashlib
from scipy import sparse

First, we load in the data and hash the order field to obscure the actual order IDs. order-short column was created to shorten the hashed order IDs solely for the purpose of easier reading.

df = pd.read_csv("data.csv")
df.columns = ['order', 'name']
df['order'] = df['order'].apply(
    lambda x: hashlib.md5(x.encode('utf-8')).hexdigest()
)
df['order-short'] = df['order'].apply(lambda x: x[24:])

print(df[['order-short', 'name']].head(15))

   order-short                                               name
0     196e1b6b            Double A Everyday Copier Paper 70gsm A4
1     196e1b6b              Fullmark Double Sided Gluey Tape FC18
2     196e1b6b                        Chunbe Water Glue 40ml K101
3     196e1b6b                     PaperOne Copier Paper 80gsm A3
4     196e1b6b            KING JIM Lever Arch File 3 Inch 2695GSV
5     196e1b6b                            HK Magazine Holder 1708
6     196e1b6b                  Pilot G1 Grip Pen 0.5mm BLGP-G1-5
7     196e1b6b                  Pilot G1 Grip Pen 0.5mm BLGP-G1-5
8     196e1b6b            Uni Ball Signo Gel Pen 0.38mm DX UM-151
9     196e1b6b               Uni Ball Signo Gel Pen 0.7mm UM-120 
10    c6943582  Avery Parcel White Labels 99.1 x 93.1mm L7166-...
11    18be192a              Pentel Correction Tape Refill ZTR5-WO
12    18be192a                  Pilot Twin Permanent Marker SCATM
13    18be192a                Pilot Permanent Marker Fine SCA-100
14    18be192a             Pilot Gel Ink Ballpen Fine 0.7mm P-700

Table 4: Example using real-world data

Table 4 shows the first 15 rows of actual orders data. Here, it shows that order 196e1b6b has 10 items (rows) and order c6943582 has 1 item.

Next, we pivot the data in Table 4 to obtain X.

df_pivot = df.groupby(['name', 'order']).aggregate({'name':'count'})
df_pivot = df_pivot.unstack().fillna(0)

X = np.asmatrix(df_pivot.values)
sX = sparse.csr_matrix(X) 

print(X)
print(f"No. of products: {X.shape[0]}")
print(f"No. of orders {X.shape[1]}")

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

No. of products: 3349
No. of orders 6042

In this dataset, we have 3349 unique products found in a sample of 6042 orders. Printing out matrix X shows that most cells are 0. This is expected since most orders do not contain, well, most products. And it is because of this we will use Scipy's sparse matrix objects for computation as this is generally faster than using the standard Numpy arrays/matrices for high dimension data with many zero values.

Computing the numerator XX^T:

numerator = sX.dot(sX.T)
print(numerator.toarray())

[[ 4.  0.  0. ...  0.  0.  0.]
 [ 0. 29.  2. ...  0.  0.  0.]
 [ 0.  2. 53. ...  0.  0.  0.]
 ...
 [ 0.  0.  0. ... 13.  0.  0.]
 [ 0.  0.  0. ...  0.  1.  0.]
 [ 0.  0.  0. ...  0.  0.  8.]]

Computing the denominator (X\cdot\textbf{1}_{m,n}) + (X\cdot\textbf{1}_{m,n})^T - XX^T:

ones = np.ones(sX.shape[::-1])
B = sX.dot(ones)
denominator = B + B.T - numerator
print(denominator)

[[ 4. 33. 57. ... 17.  5. 12.]
 [33. 29. 80. ... 42. 30. 37.]
 [57. 80. 53. ... 66. 54. 61.]
 ...
 [17. 42. 66. ... 13. 14. 21.]
 [ 5. 30. 54. ... 14.  1.  9.]
 [12. 37. 61. ... 21.  9.  8.]]

Putting it together, we get the Jaccard's Matrix:

jaccard = (numerator / (denominator + 0.00001)) 
print(jaccard)

[[0.9999975  0.         0.         ... 0.         0.         0.        ]
 [0.         0.99999966 0.025      ... 0.         0.         0.        ]
 [0.         0.025      0.99999981 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.99999923 0.         0.        ]
 [0.         0.         0.         ... 0.         0.99999    0.        ]
 [0.         0.         0.         ... 0.         0.         0.99999875]]

So how does the Jaccard's Index fare with real world data? Let's take a look at some of the recommendations generated by this algorithm.

We will load the Jaccard's matrix into a dataframe to explore the results.

JDF = pd.DataFrame(
    data=jaccard, 
    index=df_pivot.index,
    columns=df_pivot.index
)

def get_complements(dataframe, product, n):
    # Returns top n products that customers will likely purchase together 
    # with the product given in the product argument
    return dataframe[name].sort_values(ascending=False)[1:n_results+1]

The Recommendations

I have selected at random a handful of items across a few categories to have a go at the recommendations. The output from get_complements will list the top n items that customers will most likely purchase together with the input product, sorted by most likely complementary product first. The output will also show the computed Jaccard's index value for the product pairing.

Test recommendation #1 - HP 905XL Black Ink Cartridge T6M17AAs

get_complements(JDF, 'HP 905XL Black Ink Cartridge T6M17AAs', 5)

name
HP 905XL Colour Ink Cartridge (T6M05AA/T6M09AA/T6M13AA)    0.500000
Epson Ribbon Cartridge 8750/S015516                        0.090909
HERMA Adhesive White Labels 20 x 50mm 2410                 0.083333
Thick Rubber Band 450gsm 3 x 1 x 120mm                     0.062500
GBC Fusion 1000L A3 Laminator                              0.062500
Name: HP 905XL Black Ink Cartridge T6M17AA, dtype: float64

The algorithm recommended the coloured version of the black ink cartridge model HP 905XL as the top recommendation. This is quite intuitive and the recommendation is no doubt useful for users who are visiting the product page for HP 905XL.

Test recommendation #2 - Nestle Milo 3 in 1 Activ-Go Catering Pack

get_complements(JDF, 'Nestle Milo 3 in 1 Activ-Go Catering Pack', 5)

name
Nestum Multigrain Cereal Original 1kg           0.253623
Julie's Butter Crackers 250g                    0.180328
Julie's Peanut Butter Sandwich Biscuits 360g    0.165829
OSK Japanese Green Teabags 2g x 50              0.141553
Nescafe 3 in 1 Original Coffee                  0.127273
Name: Nestle Milo 3 in 1 Activ-Go Catering Pack, dtype: float64

The top 5 recommendations for the Nestle Milo malt drink suggests all food / pantry related products such as biscuits, crackers, and cereal. Again, this seems like an intuitive set of recommendations.

Test recommendation #3 - MAX Stapler HD88R

get_complements(JDF, 'MAX Stapler HD88R', 5)

name
MAX Staples Refill 2115 B8         0.191617
Red Rubber Band 400gsm             0.073333
2 Hole Punch with Guide DL-825B    0.057692
UHU No. 189 Glue Stick 21g         0.055300
Gem Paperclips 50mm Pack of 100    0.055172
Name: MAX Stapler HD88R, dtype: float64

For a stapler, the algorithm recommends first and foremost, stapler refills, followed by other general stationery.

Test recommendation #4 - Pilot V Board Master Whiteboard Marker Bullet Medium

get_complements(JDF, 'Pilot V Board Master Whiteboard Marker Bullet Medium', 5)

name
Pilot V Board Whiteboard Marker Refill Cartridge WBSVBM    0.183246
Plastic Ruler 12 Inch                                      0.068136
MAX Staples Refill No. 10-1M                               0.061460
Pilot Super Grip Mechanical Pencil Neon 0.5mm H-185N       0.057143
Deli Magnetic Whiteboard Duster Eraser E7837               0.056380
Name: Pilot V Board Master Whiteboard Marker Bullet Medium, dtype: float64

The recommendation for the common Pilot whiteboard marker is it's own refill. The other weaker recommendations (going by the score of the Jaccard's Index) are general stationery items, including a whiteboard eraser.

Test recommendation #5 - 3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A

get_complements(JDF, '3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A', 5)

name
Giant Manilla Envelope 7 x 10 Inch Pack of 10    0.249999
2 Pocket PU Holder with Lanyard SL311            0.142857
Yeo's Ice Green Tea Packet Drink 250ml x 24      0.125000
POKKA Oolong Tea Can Drink 300ml x 24            0.055556
MAX Staples Refill 23/13 1213FAH                 0.043478
Name: 3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A, dtype: float64

Lastly, the recommendations for a mop seem all over the place. The recommendations in general are not intuitive, with the strongest recommendation being an envelope. This is in spite of a higher score for the envelope compared to the top recommendation in the previous 2 test cases. This could be a result of a small sample size as this product is present in a smaller number of orders.

Summary

All things considered, the results produced from a simple algorithm and a few lines of code has shown to be surprisingly intuitive in its recommendations. The Jaccard's Index is able to effectively tease out the strongest complements of each product. (i.e., a stapler gets recommendations for stapler refills, markers get recommendations for marker refills). Such results are encouraging and this recommender system by itself will likely be a handy feature users will appreciate.

Nevertheless, a few caveats are in order.

First, the dataset matters. This algorithm may yield prticularly perceptive recommendations due to the nature of the dataset. In this case, the data comes from orders from a B2B e-commerce site, where the basket sizes are larger than usual and contain many SKUs per order. As such, this gives more scope for the algorithm to produce a variety of "overlap" scores. As a counter-example, consider the same data from another industry in e-commerce (i.e., B2C, fashion), where it is typical for users to have only single items in their checkout cart. In such a scenario, most orders will only have 1-2 items. Any overlapping orders between products will be few and far in between and the Jaccard's Index will be unable to provide any useful recommendations. In such cases, user-user collaborative filtering algorithms that produce recommendations based on similarities between users and their behaviours may be more suited to the task.

Second, more robust testing is required. While eyeballing a few samples of the recommendations seem to suggest encouraging results, the ultimate guage of the algorithm's success is the extent to which it is able to achieve its original objective. In our context, the goal could be to increase the value of users' basket sizes upon checkout. We will not able to verify this until a more robust A/B testing framework is put in place. Until then, the jury is still out.