0:00

In this lesson, we're going to discuss association rules for performing a Market-Basket Analysis, commonly abbreviated MBA. To perform this Market-Basket Analysis, we're going to look at how we can determine which items are frequently purchased together using the Apriori algorithm, and through that, we're going to see how to determine association rules. Moreover, we're going to define some important terms: support, confidence, and lift.

Â All interesting things have a great story and so we'll start off with one.

Â You may or may not be familiar with the beer and diaper story,

Â but here it goes.

Â So, a store decided to do a formal analysis on

Â their data and found that men between the ages of 30 and 40 while

Â shopping between the hours of 5:00 and 7:00 pm on Fridays were

Â considerably more likely to purchase beer if they already had diapers in their cart.

Â Armed with this knowledge, the store relocated the beer section closer to

Â the diaper section and saw an increase in sales of both by 35 percent.

Â While a great story, unfortunately, is not true.

Â The true story is that there is a company called Osco Drug which examined 1.2 million

Â transactions across 25 stores and identified around 5,000 slow moving items.

Â And then they removed them.

Â And then the result was really quite measurable.

Â By removing that inventory,

Â it made it easier for customers to find what they wanted,

Â and customers thought selection had increased.

Â So, therefore, their sales increased and the company saved

Â money by reducing their inventory overhead.

Â But let's see how we can perform a real Market-Basket Analysis.

Â So, we're going to go ahead and install some dependencies,

Â go ahead and import those dependencies,

Â and before we get into the apriori algorithm and associative rules,

Â first, we are going to talk about how we need to model our data.

Â Let's imagine that we have a super small store that only sells five items.

Â We sell beer, chips,

Â salsa, chocolate, and diapers.

Â Here, we have a schema that lists all of our transactions where each entry is

Â a customer's shopping cart and each entry inside

Â of the shopping cart we'd say what they bought and how many of those items they bought.

Â So, it's very easy for us to go ahead and take

Â this variable and put it into a data frame.

Â And you can see right here,

Â that's what this data looks like in a tabular format.

Â Okay, so we have this transaction table but in order to build association rules,

Â the first thing we going to do is get rid of all these NaN values,

Â all these not a number values.

Â Fortunately for us, we can just use

Â pandas' built-in fillna function and replace all those NaN's with zeros.

Â Now that all of our data is numerical, now we need to one-hot encode our data.

Â And that simply means that we need to represent whether something was present or not.

Â And so, what we're going to do is we're just going to go through

Â all the different values and if anything is greater than zero,

Â we're just going to set it to one. And there we go.

Â Now, our data demonstrates whether or not someone purchased something.

Â Keep in mind that one-hot encoding something

Â like a trait can be a little bit more tricky.

Â So, for example, we had gender.

Â Let's say your gender was restricted to just male and

Â female and we represented that binarily with one or zero.

Â In order to one-hot encode that,

Â we'd really need to create two separate columns,

Â one for female and one for male,

Â and then represent it like so because we're really trying to represent

Â the presence or lack of presence for some feature.

Â Great, now that we know what our data needs to look like,

Â let's define some terms and then look at a little bit

Â of the math for Market-Basket Analysis.

Â With association rules, when we define relationships,

Â we use the terms antecedent and consequents,

Â and then we also have some characteristic terms like support, confidence, and lift.

Â When we define a rule we say that something implies something else so A implies B.

Â And this cooler arrow actually means implies.

Â So, we have a rule that says chips implies beer.

Â We say that chips is the antecedent and beer is the consequent.

Â And these are the two terms for defining the relationships in an association rule.

Â Support is the first term we're going to use to

Â describe the characteristics of a relationship.

Â Support is just the occurrence of an item among all transactions.

Â So, for example, there are five occurrences of chips in all six transactions,

Â chips would have a 0.833 support.

Â And beer appears four times across six transactions so,

Â it has a support of 0.667.

Â And we would do this for every single combination of

Â items referred to commonly as item sets.

Â So, we'd start with all items at set one,

Â all items at set two,

Â all items at set three, et cetera.

Â Now when you do this, you typically defined some kind of minimum amount

Â of support to avoid exploring extremely uncommon pairings.

Â In this example, I've limited the minimum threshold to 0.5.

Â The next term is confidence,

Â and confidence is the likelihood that

Â some item set B will be bought together with an item set A.

Â It is calculated by dividing the support of items set [A,B] by item set A.

Â So, the confidence that the antecedent,

Â chips, implies the consequent, beer,

Â is 60 percent because the support for chips and beer

Â is 0.5 and the support for chips by itself is 0.833,

Â divide them, and you get 0.6.

Â So, 60 percent of the time that chips were bought,

Â beer was also bought.

Â Now, confidence can be a very good indicator but it also has a major drawback.

Â If the consequent is popular then confidence does not take this into account,

Â and can lead to an implication where there really isn't any.

Â And the last characteristic we're going to look at is lift.

Â And lift is how likely an item set B was purchased when item set A was purchased.

Â So, unlike confidence, lift takes into account the popularity of item set B,

Â and it's calculated by dividing the support of item set [A,B] by

Â the product of the support of item set A by the support of item set B.

Â So, the lift for the rule chips implies beer would be point 0.9.

Â Now, lift values of one imply no association,

Â values greater than one imply a positive association,

Â and values less than one imply a negative association.

Â Now that we have these terms defined,

Â let's go ahead and look back at the code.

Â To calculate these different values,

Â we're going to use two different methods from the ML extended, Machine Learning Library.

Â We are going to use the apriori method and the association rules method.

Â First, we're going to build our associations with the apriori method,

Â and as you can see, I am setting my minimum threshold for support to 0.5.

Â Now, keep in mind, as the data gets larger,

Â you may need to decrease this threshold.

Â And as you can see, here are different items sets with their different support values.

Â Now we can pass those associations to the association rules method.

Â Here, I'm using a minimum threshold of 0.5.

Â In reality, we'd want something probably greater than one,

Â but I'm doing this so that we can see all of the associations.

Â And here, you can see our different association rules.

Â We have our different antecedent item sets,

Â our different consequent items sets with their support, confidence, and lift.

Â Moreover, you can see that our diaper and beer story

Â is true for this data set and you can also see

Â that chips implies beer with

Â a lift of 0.9 just like when we calculated it manually earlier.

Â Okay, now that we understand the basics of association rules,

Â let's go and do the same thing with a much larger data set.

Â First, we are going to connect to our cluster with Pymongo.

Â And in this dataset, we have documents like this where we have

Â a purchases array with embedded documents describing each purchase,

Â and we really want to convert this into a format like this where

Â we have every product ID,

Â or I guess it's a stock ID,

Â and then whether or not someone purchased something.

Â To do this, we're going to use our replace route stage by mapping over

Â all of the different object keys and just train them to one for every stock code.

Â And that's going to be our only stage in our pipeline.

Â And then very simply, we can go ahead and

Â exhaust that cursor and shove it into a data frame.

Â And like before, we have a bunch of not a number values.

Â So again, we're going to go ahead and use the fillna data frame function.

Â And here, we're replacing all those NaN's with zero.

Â And now, like before, we can go ahead and use the apriori function.

Â Now notice, I have a much lower minimum support and that's because we

Â have a little over 3,600 different stock codes among our data set.

Â So go ahead and get those associations and we'll go ahead and look at them

Â and here all the different support values for all the different item sets.

Â And then we can go ahead and, like,

Â before pass these associations to the association rules function,

Â here, I'm giving a minimum threshold of three.

Â This time, we don't want to look at every possible rule.

Â We really only want to look at the strongest rules.

Â And now, we can go ahead and print them out,

Â our very top rule with a lift of 24.22 says that stock goods

Â 22698 alongside 22699 are frequently purchased with 22697.

Â We can go ahead and create a simple aggregation pipeline to see what these products were,

Â and it makes sense.

Â People were buying tea cups and saucers of different colors together.

Â So, knowing this information,

Â maybe we'd want to go ahead and package these items together in our store.

Â Okay, let's summarize what we've learned.

Â We saw how Market-Basket Analysis work by using the apriori algorithm we saw how

Â to get that data out of MongoDB so we can could pass it into the appropriate functions.

Â And moreover, we saw the different terms for

Â these associative rules and what these different terms meant,

Â and how each of these terms were calculated.
