Building a recommender system using the Jaccard's index algorithm
Similarity Functions
Python, Numpy
The Jaccard's Index, a ratio between the intersection of two sets A and B , over the union of A and B , is a simple and effective tool to measure the similarity between two groups of elements. While it's use in data science is widely mentioned, there are few examples that show how such an algorithm is applied to datasets in the real world.
The Jaccarad's Index
J(A,B) = \frac{|A \cap B |}{|A \cup B |}
0 \leq J(A,B) \leq 1
To apply the Jaccard's Index over two sets is trivial:
A = ['a', 'b', 'c', 'd']
B = ['c', 'd', 'e', 'f']
index = len(set(A).intersection(B))/len(set(A).union(set(B)))
print(index)
0.3333333333333333
Yet, most real world problems aren't as straightforward as shown above. If the algorithm was applied to many object pairs, we would need to find a more efficient method for computation. As it turns out, with a little bit of linear algebra, we are able to calculate the Jaccard's Index for a large dataset efficiently.
And it is with this context that we will build a simple and effective recommender system with the Jaccard's Index, using a real-world dataset. This post will cover both the math and code involved in creating this feature.
Frequently Bought Together
We will develop a recommender system for the "frequently bought together" feature, usually found in the product pages of e-commerce sites. The purpose of this feature is to suggest complementary products to users, in a bid to get users to add more items to their cart.
In other words, we would need to find an algorithm to solve the following problem:
For each product in inventory, generate a list of products that customers will most likely purchase together with that product
To see how the Jaccard's Index can help us solve this problem, we can think about the solution in this manner:
Consider a product x . We need a measure of a probability for all other products that users will likely purchase together with product x
To measure how likely product x is going to be bought together with, for example, product y , a similarity measure such as the Jaccard's Index can be calculated based on the orders that contain products x , y , or both.
Consider:
A = \{ Set of all customers orders that contain the product x\} B = \{ Set of all customers orders that contain the product y\}
It follows that:
|A \cap B | = Number of orders that contain both product x and product y |A \cup B | = Number of orders that contain either product x , product y or both
Therefore:
J(A,B) = \frac{|A \cap B |}{|A \cup B |} \simeq The likelihood that products x and y will appear in the same orders
Dataset
The dataset we will use contains items in customers' orders from an e-commerce firm. But before that, we will go through the math using a toy dataset.
Data involving online orders usually resembles the following table below (See Table 1), where each row represents an item in the order that was purchased and includes fields such as the order id, product name and quantity purchased. For our purposes, we only require the order id and product name (or any unique identifier of the item). Quantity purchased is not needed as we only want to know if the item was purchased together with another item, regardless of quantity.
order name qty
0 0001 Product A 12
1 0001 Product C 5
2 0001 Product D 2
3 0002 Product A 9
4 0002 Product B 3
5 0002 Product C 5
6 0003 Product A 1
7 0003 Product B 1
8 0003 Product C 2
9 0003 Product D 3
10 0004 Product B 1
11 0004 Product D 7
12 0004 Product A 5
13 0005 Product B 1
14 0005 Product C 4
Table 1: An example of orders data
However, to begin our analysis, we first need to pivot the data in Table 1 into the following form:
order 0001 0002 0003 0004 0005
name
Product A 1.0 1.0 1.0 1.0 0.0
Product B 0.0 1.0 1.0 1.0 1.0
Product C 1.0 1.0 1.0 0.0 1.0
Product D 1.0 0.0 1.0 1.0 0.0
Table 2: Pivoted orders data
With this transformation, each column represents one unique order and each row represents each product in inventory. In other words, if our data consists of m=5 orders and n=4 unique products, the dimensions of the dataframe would be (4 x 5).
If a product was purchased in an order, the corresponding cell value will be 1. Otherwise it will be 0. For example, Product D is present in orders 0001, 0003, and 0004, hence the row values (1.0, 0.0, 1.0, 1.0, 0.0).
The starting point for us will be the matrix values of Table 2 which we will label X_{m,n}. And our goal is to end up with matrix J_{n,n} of Jaccard index values for every possible product pair as shown in Table 3:
Product A Product B Product C Product D
name
Product A 0.999998 0.599999 0.599999 0.749998
Product B 0.599999 0.999998 0.599999 0.399999
Product C 0.599999 0.599999 0.999998 0.399999
Product D 0.749998 0.399999 0.399999 0.999997
Table 3: The end result - A matrix of computed Jaccard's Index values
The diagonals are always 1 in J_{n,n} as the Jaccard Index value for a set with itself is always 1. The off-diagonals are symmetric and each cell represents the index values for the product pairing. For example, J(Product A , Product C)=0.6 (you can verify this manually from Table 2 values) and can be referred to in the matrix position (0,2), (2,0).
Intersection...
Let's start with the numerator. In scalar form, |A \cap B | represents the cardinality of the set of orders that contain both products A and B. In matrix form, it will be a n x n matrix with off-diagonal cells representing this cardinality for each product pair.
You may notice that the diagonals of XX^T show the total number of orders each product is present in. You can also verify that the off-diagonals are the number of orders that contain both products i and j.
...Over Union
For the denominator's scalar form, |A \cup B | = |A| + |B| - |A \cap B |. Since we already figured out |A \cap B | as the numerator, we need to figure out what |A| + |B| represents in matrix form.
That matrix should be a n x n matrix with each off-diagonal cell representing the sum of orders present in both product i and product j
To get that matrix, a bit of transformation is in order:
Where \textbf{1}_{m,n} is a unit matrix of size m x n, in this case m=5, n=4.
You can verify that cell (i,j) is the total number of orders each product i and j is present in, added together.
To get the union, we have to deduct the intersection (numerator) XX^T
Each off-diagonal cell in J(X) is the computed Jaccard's Index value between product i and product j.
Putting it in code - With real-world data
import pandas as pd
import numpy as np
import hashlib
from scipy import sparse
First, we load in the data and hash the order field to obscure the actual order IDs. order-short column was created to shorten the hashed order IDs solely for the purpose of easier reading.
In this dataset, we have 3349 unique products found in a sample of 6042 orders. Printing out matrix X shows that most cells are 0. This is expected since most orders do not contain, well, most products. And it is because of this we will use Scipy's sparse matrix objects for computation as this is generally faster than using the standard Numpy arrays/matrices for high dimension data with many zero values.
So how does the Jaccard's Index fare with real world data? Let's take a look at some of the recommendations generated by this algorithm.
We will load the Jaccard's matrix into a dataframe to explore the results.
JDF = pd.DataFrame(
data=jaccard,
index=df_pivot.index,
columns=df_pivot.index
)
def get_complements(dataframe, product, n):
# Returns top n products that customers will likely purchase together # with the product given in the product argumentreturn dataframe[name].sort_values(ascending=False)[1:n_results+1]
The Recommendations
I have selected at random a handful of items across a few categories to have a go at the recommendations. The output from get_complements will list the top n items that customers will most likely purchase together with the input product, sorted by most likely complementary product first. The output will also show the computed Jaccard's index value for the product pairing.
Test recommendation #1 - HP 905XL Black Ink Cartridge T6M17AAs
get_complements(JDF, 'HP 905XL Black Ink Cartridge T6M17AAs', 5)
name
HP 905XL Colour Ink Cartridge (T6M05AA/T6M09AA/T6M13AA) 0.500000
Epson Ribbon Cartridge 8750/S015516 0.090909
HERMA Adhesive White Labels 20 x 50mm 2410 0.083333
Thick Rubber Band 450gsm 3 x 1 x 120mm 0.062500
GBC Fusion 1000L A3 Laminator 0.062500
Name: HP 905XL Black Ink Cartridge T6M17AA, dtype: float64
The algorithm recommended the coloured version of the black ink cartridge model HP 905XL as the top recommendation. This is quite intuitive and the recommendation is no doubt useful for users who are visiting the product page for HP 905XL.
Test recommendation #2 - Nestle Milo 3 in 1 Activ-Go Catering Pack
get_complements(JDF, 'Nestle Milo 3 in 1 Activ-Go Catering Pack', 5)
name
Nestum Multigrain Cereal Original 1kg 0.253623
Julie's Butter Crackers 250g 0.180328
Julie's Peanut Butter Sandwich Biscuits 360g 0.165829
OSK Japanese Green Teabags 2g x 50 0.141553
Nescafe 3 in 1 Original Coffee 0.127273
Name: Nestle Milo 3 in 1 Activ-Go Catering Pack, dtype: float64
The top 5 recommendations for the Nestle Milo malt drink suggests all food / pantry related products such as biscuits, crackers, and cereal. Again, this seems like an intuitive set of recommendations.
Test recommendation #3 - MAX Stapler HD88R
get_complements(JDF, 'MAX Stapler HD88R', 5)
name
MAX Staples Refill 2115 B8 0.191617
Red Rubber Band 400gsm 0.073333
2 Hole Punch with Guide DL-825B 0.057692
UHU No. 189 Glue Stick 21g 0.055300
Gem Paperclips 50mm Pack of 100 0.055172
Name: MAX Stapler HD88R, dtype: float64
For a stapler, the algorithm recommends first and foremost, stapler refills, followed by other general stationery.
Test recommendation #4 - Pilot V Board Master Whiteboard Marker Bullet Medium
get_complements(JDF, 'Pilot V Board Master Whiteboard Marker Bullet Medium', 5)
name
Pilot V Board Whiteboard Marker Refill Cartridge WBSVBM 0.183246
Plastic Ruler 12 Inch 0.068136
MAX Staples Refill No. 10-1M 0.061460
Pilot Super Grip Mechanical Pencil Neon 0.5mm H-185N 0.057143
Deli Magnetic Whiteboard Duster Eraser E7837 0.056380
Name: Pilot V Board Master Whiteboard Marker Bullet Medium, dtype: float64
The recommendation for the common Pilot whiteboard marker is it's own refill. The other weaker recommendations (going by the score of the Jaccard's Index) are general stationery items, including a whiteboard eraser.
Test recommendation #5 - 3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A
get_complements(JDF, '3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A', 5)
name
Giant Manilla Envelope 7 x 10 Inch Pack of 10 0.249999
2 Pocket PU Holder with Lanyard SL311 0.142857
Yeo's Ice Green Tea Packet Drink 250ml x 24 0.125000
POKKA Oolong Tea Can Drink 300ml x 24 0.055556
MAX Staples Refill 23/13 1213FAH 0.043478
Name: 3M Scotch-Brite Super Mop with Scrapper F1-SR/F1-A, dtype: float64
Lastly, the recommendations for a mop seem all over the place. The recommendations in general are not intuitive, with the strongest recommendation being an envelope. This is in spite of a higher score for the envelope compared to the top recommendation in the previous 2 test cases. This could be a result of a small sample size as this product is present in a smaller number of orders.
Summary
All things considered, the results produced from a simple algorithm and a few lines of code has shown to be surprisingly intuitive in its recommendations. The Jaccard's Index is able to effectively tease out the strongest complements of each product. (i.e., a stapler gets recommendations for stapler refills, markers get recommendations for marker refills). Such results are encouraging and this recommender system by itself will likely be a handy feature users will appreciate.
Nevertheless, a few caveats are in order.
First, the dataset matters. This algorithm may yield prticularly perceptive recommendations due to the nature of the dataset. In this case, the data comes from orders from a B2B e-commerce site, where the basket sizes are larger than usual and contain many SKUs per order. As such, this gives more scope for the algorithm to produce a variety of "overlap" scores. As a counter-example, consider the same data from another industry in e-commerce (i.e., B2C, fashion), where it is typical for users to have only single items in their checkout cart. In such a scenario, most orders will only have 1-2 items. Any overlapping orders between products will be few and far in between and the Jaccard's Index will be unable to provide any useful recommendations. In such cases, user-user collaborative filtering algorithms that produce recommendations based on similarities between users and their behaviours may be more suited to the task.
Second, more robust testing is required. While eyeballing a few samples of the recommendations seem to suggest encouraging results, the ultimate guage of the algorithm's success is the extent to which it is able to achieve its original objective. In our context, the goal could be to increase the value of users' basket sizes upon checkout. We will not able to verify this until a more robust A/B testing framework is put in place. Until then, the jury is still out.