Now comes the fun part of this module. I don't want you to leave saying, okay, it's all statistics. No, there's some very interesting stuff you can do. It also introduces you to new functions in R that keep getting added. So, the main takeaway, I think, is: first, you can have a lot of fun with data, which we knew; second, there are new models coming up all the time. So, it's never enough, and depending on your need, you may keep looking to see what is the best thing to do. Remember, there are different ways of organizing data, which we went through. Today we are going to look at what is called a market basket, the basket of things people have bought. Market baskets come in different contexts. We will take a retail market basket, but it could be the movies people have watched, the videos on YouTube, or the tags on your movies. These are the items, and that's where the name of this method comes from: these are the items in your basket. So, each basket is a set of items. There is a large set of items available, which we call a dictionary of items, and each basket is a subset drawn from it: {a, b, c, d} is one set, {a, b, d, e} is another set, {b, e, f} is another set. There is one small nuance I have to tell you: what exactly is a market basket? Is it what I bought in one month? Is it what I bought in consecutive visits? Or is it what I bought across all visits? There is a definitional problem, and I leave it to you to decide what you want to do, say, the last four visits, assuming the sales are not seasonal. Now, let's come up with one algorithm for creating recommendations. It's called the Apriori algorithm. What do you do? You have the market baskets of people in the past, and based on these, you want to make a recommendation. So, leave the recommendation side alone for now.
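To make the example concrete, here is a minimal sketch in Python of how such data could be represented. The eight baskets below are hypothetical, chosen so that the support counts match the numbers worked through in this lecture; any real dataset would of course look different.

```python
# Hypothetical market baskets (one set per transaction), constructed so
# the support counts match the ones used in this worked example.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "d", "e"},
    {"b", "d", "e", "f"},
    {"b", "e", "f"},
    {"b", "e", "f"},
    {"e", "f", "g"},
    {"a", "f", "g"},
    {"a", "c"},
]

# The dictionary of items is simply the union of all baskets.
dictionary = set().union(*transactions)
print(sorted(dictionary))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

Each basket is a set, so "is this item set in this basket?" becomes a simple subset test.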
As people start shopping, there will be one item in the basket, then two items, then three items, and we want to recommend another item to them. The marketing manager has said: look, the minimum support I need before making a recommendation is at least three instances in this dataset; only then can you rely on it. So, first of all, we'll create what are known as frequent item sets: item sets of size one which meet the threshold of three, item sets of size two which are frequent enough and meet the threshold of three, and item sets of size three which meet the threshold of three. Notice that no item set of size four repeats more than two times, so there will be no frequent item set of size four. Let's enumerate how often the singletons appear. Somebody has put a in the basket, so how often does a appear? You can see a is in four of the baskets, so as a singleton it has a frequency, or support, of four; b has a support of five; c and g do not have the required support of three; whereas d, e, and f have a support of at least three. So, we retain only them: a, b, d, e, f go forward for further evaluation. Why do we do this? In theory, all kinds of combinations are possible, and if there are hundreds of items, we'd spend all our time looking at all combinations. That is exactly what we don't want to do. The idea of a frequent item set allows us to focus only on those combinations which have a certain minimum number of occurrences in the dataset. Now we are done with the singletons; let's look at the pairs. Here we have listed all possible candidates, everything that is theoretically possible. In this candidate list, there is no c and there is no g. Why?
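The singleton-counting step just described can be sketched as follows, again using the hypothetical baskets that reproduce the lecture's counts:

```python
from collections import Counter

# Hypothetical baskets matching the worked example's support counts.
transactions = [
    {"a", "b", "c", "d"}, {"a", "b", "d", "e"}, {"b", "d", "e", "f"},
    {"b", "e", "f"}, {"b", "e", "f"}, {"e", "f", "g"},
    {"a", "f", "g"}, {"a", "c"},
]
min_support = 3  # the marketing manager's threshold

# Count how many baskets each single item appears in.
singleton_support = Counter(item for basket in transactions for item in basket)

# Retain only the items meeting the minimum support.
frequent_singletons = {item for item, n in singleton_support.items() if n >= min_support}
print(sorted(frequent_singletons))  # ['a', 'b', 'd', 'e', 'f']
```

With these baskets, a has support 4, b has 5, and c and g fall below 3 and are dropped, exactly as in the example.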
Because no candidate item set of size two which has a c or a g in it can meet the threshold of three; if it did, c itself would have a support of at least three, which is not true. So, what this algorithm does is: if you eliminated c and g at one level, you eliminate them at every later level. You're done with them. Next, we count the frequencies: how often do these pairs occur? {b, d} has been bought together thrice, {b, e} four times, {b, f} three times, and {e, f} four times, whereas the rest have not been bought together three or more times. Therefore, out of the candidates, we retain {b, d}, {b, e}, {b, f}, and {e, f} as our frequent item sets of size two. Now, we want to create the candidate item sets of size three. What are the possible candidates? Let's think of {b, d, e}. Not a candidate, because {b, d, e} includes {d, e}, which is not there anymore. Let's think of {b, e, f}. Well, {b, e, f} is possible because {e, f} is there, {b, f} is there, and {b, e} is there, so we can't eliminate it. But you can eliminate almost everything else, and the only candidate set which remains is {b, e, f}. Every other combination is ruled out because we ruled out one of its subsets at the previous stage: there is nothing involving {d, e}, there is nothing involving a. So, the number of candidate item sets of size three we need to consider is only one. Now we go into the data and ask: how often has {b, e, f} appeared? We have one, we have two, we have three, and lo and behold, it is a frequent item set. So, we now have the frequent item sets of size one, size two, and size three. What do we do with them? The step you have just done only retains the item sets which have a minimum amount of support.
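This pruning logic, only keep a size-k candidate if every one of its size-(k-1) subsets is itself frequent, is the heart of Apriori. Here is a sketch of that step over the same hypothetical baskets, starting from the frequent pairs found above:

```python
from itertools import combinations

# Hypothetical baskets matching the worked example's support counts.
transactions = [
    {"a", "b", "c", "d"}, {"a", "b", "d", "e"}, {"b", "d", "e", "f"},
    {"b", "e", "f"}, {"b", "e", "f"}, {"e", "f", "g"},
    {"a", "f", "g"}, {"a", "c"},
]
min_support = 3

def support(itemset):
    # Number of baskets containing every item in the itemset.
    return sum(itemset <= basket for basket in transactions)

def candidates_of_size(frequent_smaller, k):
    # Build size-k candidates, keeping only those whose every
    # (k-1)-subset is itself frequent (the Apriori pruning step).
    items = sorted(set().union(*frequent_smaller))
    return [frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in frequent_smaller for s in combinations(c, k - 1))]

frequent_pairs = {frozenset(p) for p in [("b", "d"), ("b", "e"), ("b", "f"), ("e", "f")]}
triples = [c for c in candidates_of_size(frequent_pairs, 3) if support(c) >= min_support]
print(triples)  # [frozenset({'b', 'e', 'f'})]
```

{b, d, e} and {b, d, f} never even get counted, because {d, e} and {d, f} are missing from the frequent pairs; {b, e, f} survives pruning and turns out to have support 3.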
Now, using this, we want to recommend to people what to buy. Think of those who have put one item in the basket: what can you recommend to them? For that, we compare the frequent item sets of size one with the frequent item sets of size two. Should we recommend d to anybody who buys b? Should we recommend b to those who have bought d? You can think of all the association rules we can create. So, from the frequent item sets, we want to get to association rules. How do we do that? We do it by creating a metric called confidence. Confidence is based on what I have seen already: how confident am I that you're going to buy the next item? Think of it: you bought b and I want to recommend f to you. What are the chances that you will choose f, given that you have already chosen b? It is nothing but a conditional probability: the support in the data for {b, f} divided by the support in the data for {b}. Three people have bought b and f, five have bought b, so the confidence in the data is 60 percent. So, for each of your recommendations, you now get a probability that the recommendation will be acted upon. Here are the association rules between sizes two and three. What about {b, e}: will you recommend {f}? {b, f}: will I recommend {e}? {e, f}: will I recommend {b}? And I can also go from size one to size three: {b}, will I recommend {e, f}? {e}, will you consider purchasing {b, f}? So, I can have two kinds of rules out here. Let's calculate. With what confidence can I recommend, to a person who has bought {b, f}, to buy {e}? That is the support of {b, e, f} divided by the support of {b, f}. The support of {b, e, f} is three, and the support of {b, f} is three. What it is telling you is that, based on your data, all three people who bought {b, f} also bought {e}. So, your confidence is 100 percent. Let's do one more.
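The confidence metric is a one-line conditional probability. A minimal sketch over the same hypothetical baskets:

```python
# Hypothetical baskets matching the worked example's support counts.
transactions = [
    {"a", "b", "c", "d"}, {"a", "b", "d", "e"}, {"b", "d", "e", "f"},
    {"b", "e", "f"}, {"b", "e", "f"}, {"e", "f", "g"},
    {"a", "f", "g"}, {"a", "c"},
]

def support(itemset):
    # Number of baskets containing every item in the itemset.
    return sum(set(itemset) <= basket for basket in transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(antecedent + consequent) / support(antecedent)
    return support(set(antecedent) | set(consequent)) / support(set(antecedent))

print(confidence({"b"}, {"f"}))       # 3/5 = 0.6
print(confidence({"b", "f"}, {"e"}))  # 3/3 = 1.0
```

This reproduces both numbers from the lecture: buy-b-recommend-f has 60 percent confidence, and buy-{b, f}-recommend-e has 100 percent.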
With what confidence can I recommend to somebody who has just bought {b} to consider buying {e, f}? Well, we can do the following. How many people have bought {b, e, f}? Three. How many people have just bought b? Counting in the data: one, two, three, four, five. So that's five, and the confidence in this recommendation is 60 percent: of the five people who bought {b}, three also bought {e, f}. So, that's the idea of creating an association rule. Of course, in real retail data, life gets crazy, and you run into this problem: not many people buy a complete item set in one basket. You may already own some of the products, you may buy them from another retailer, you may buy them at a different time, you may get some of them as gifts. How do you take care of that? Well, that's a problem, and it is still unsolved; basically, we don't know which items people already have. On top of it, the shopping baskets could be completely mixed: somebody is doing home repair, somebody is setting up a home office, and I don't know what this person is doing, breaking his computer most of the time, as we all seem to do. So, you can't cleanly separate out the baskets. But leave that alone; you can read up on the ongoing work in this area after this session. Now, I know it's the last module, and you may want to say, hey, here are some things you didn't cover, why didn't we study them? We did market baskets, but why didn't we do advanced regression? Can you learn about support vector machines, or what about neural networks, this thing called deep learning, and what about philosophy, which I think comes without data, it's just observation? Here's the recommendation I want to give you: consider all of these as possible topics for you to study, especially when you look at the ethical use of data tools.
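The two confidence calculations above generalize: every way of splitting a frequent item set into a non-empty antecedent and a consequent gives a candidate rule. A sketch of that rule-generation step for {b, e, f}, over the same hypothetical baskets:

```python
from itertools import combinations

# Hypothetical baskets matching the worked example's support counts.
transactions = [
    {"a", "b", "c", "d"}, {"a", "b", "d", "e"}, {"b", "d", "e", "f"},
    {"b", "e", "f"}, {"b", "e", "f"}, {"e", "f", "g"},
    {"a", "f", "g"}, {"a", "c"},
]

def support(itemset):
    # Number of baskets containing every item in the itemset.
    return sum(itemset <= basket for basket in transactions)

itemset = frozenset({"b", "e", "f"})
rules = {}
# Every non-empty proper subset of the itemset can serve as an antecedent;
# whatever remains is the consequent.
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        consequent = itemset - antecedent
        rules[antecedent] = support(itemset) / support(antecedent)
        print(f"{sorted(antecedent)} -> {sorted(consequent)}: {rules[antecedent]:.0%}")
```

With these baskets, the size-one antecedents b, e, and f each give 60 percent confidence, {b, f} gives 100 percent, and {b, e} and {e, f} give 75 percent, which is exactly the comparison a recommender would use to decide which rule to act on.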