Once we got this, I think things went smoothly from there, because we had already established these means of generating features and models, and we knew how to tell whether they were good or not.
And we deployed different strategies which are common to recommenders.
First, we used a content-based approach. Now, a content-based approach is one that assumes you make recommendations based on the characteristics of, let's say, the product.
So, you make a recommendation because the customer likes this product, because it has certain features.
For example, it's sweet,
or it's not sweet,
or it contains, I don't know,
chocolate or it doesn't contain chocolate.
And so, this approach focuses on
the characteristics of the product in order to make a good recommendation.
And we deployed another strategy which is based on collaborative filtering.
And collaborative filtering means, in this context, that we made recommendations based on how much a customer looks like another customer who was likely to buy a product. So, we look at similarities between customers in order to decide whether they would like a certain item or not, not based on the characteristics of the item itself.
And the hybrid approach is what we ended up with,
which is basically a combination of these two.
So, more about the first approach, the content-based.
What was the assumption behind this approach? It is that a customer who has bought from the same company, brand, and category will have a higher chance to buy the offered product, and to buy it again. So, the stronger the relationship that already exists, based on his transactional history with this product, with this combination of company, brand, and category, the higher the chance he will buy a product that has the same elements, an offer that has the same elements.
And let's take the example where a customer has never bought from the same combination of company, brand, and category. Maybe he or she has bought from the same company and brand, or from the same brand only. So, we have ways to determine whether he or she likes the same brand or the same company. And let's say we don't have that at all; we will still know how popular this brand is, or how popular this company is, or how popular the combination of company and brand is.
So, this is what drives this approach: how much the customer likes the item, based on what we can already see from his or her transactional history.
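To make this concrete, here is a minimal sketch of what such back-off popularity features could look like in Python with pandas. The table and column names (transactions, brand, company) are assumptions made for illustration, not the exact ones we used.

```python
import pandas as pd

# Hypothetical sketch: overall popularity counts to fall back on when a customer
# has no history with the exact company/brand/category combination.
def popularity_features(transactions: pd.DataFrame):
    brand_pop = transactions.groupby("brand").size().rename("brand_popularity")
    company_pop = transactions.groupby("company").size().rename("company_popularity")
    pair_pop = (transactions.groupby(["company", "brand"])
                .size().rename("company_brand_popularity"))
    return brand_pop, company_pop, pair_pop
```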
So, we tried to exploit this product hierarchy versus the customer and time. And why do I bring time into the picture? Because I think this is a typical diagram of what the sales of a certain product look like through time. And you can see it is very seasonal.
So, when you gauge the relationship of a customer with an item, it is important to mark it through time, because lately he or she may not be buying it as frequently as in the past, or it may not be its peak period. It might be ice cream and now it's winter, so sales are not going to be very high in this period.
So, when you create features, you need to take into account this notion of lag; marking this relationship through time is what lag means.
More specifically, you define some time intervals and you create features based on these. So, how many times has the customer bought from the same company, brand, and category in the last 30 days, then in the last 60 days, 90, 120, half a year, one year. And you try to see how that changes, and you create all these features.
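As an illustration, a rough sketch of such lag features in pandas could look like the following; the table layout (a transactions table with customer_id, company, brand, category, date, and an offers table with an offer_date column) is an assumption, not the actual competition schema.

```python
import pandas as pd

# Sketch: count purchases of the offered company/brand/category combination
# in several windows before the offer date. Column names are hypothetical.
def lag_count_features(transactions, offers, windows=(30, 60, 90, 120, 180, 360)):
    feats = offers.copy()
    merged = offers.merge(
        transactions,
        on=["customer_id", "company", "brand", "category"],
        how="left")
    days_before = (merged["offer_date"] - merged["date"]).dt.days
    for w in windows:
        in_window = (days_before >= 0) & (days_before <= w)
        counts = (merged[in_window]
                  .groupby(["customer_id", "offer_date"])
                  .size()
                  .reset_index(name=f"company_brand_category_{w}"))
        feats = feats.merge(counts, on=["customer_id", "offer_date"], how="left")
    lag_cols = [f"company_brand_category_{w}" for w in windows]
    feats[lag_cols] = feats[lag_cols].fillna(0)   # no purchases in the window
    return feats
```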
And based on the cross-validation we had already established, it was easy to find which of these features were useful or not.
So, I created hundreds of features here, and I was adding them one by one. I was following the cross-validation methodology of leaving one offer out, then concatenating all the results and measuring. If the impact of a feature on the AUC was positive, I was leaving that feature in; if it wasn't, I was dropping it and then adding the next feature. So, this is how I determined which features were good.
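In code, that greedy procedure might look roughly like this; the column names (offer_id, repeater, repeat_count) and the use of Ridge as the scorer are assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score

# Rough sketch of greedy forward selection with leave-one-offer-out validation.
def forward_selection(train, candidates, base_features=()):
    kept, best_auc = list(base_features), 0.0
    for feat in candidates:
        trial = kept + [feat]
        preds, labels = [], []
        for offer in train["offer_id"].unique():            # leave one offer out
            tr = train[train["offer_id"] != offer]
            va = train[train["offer_id"] == offer]
            model = Ridge().fit(tr[trial], tr["repeat_count"])
            preds.append(model.predict(va[trial]))
            labels.append(va["repeater"].values)
        auc = roc_auc_score(np.concatenate(labels), np.concatenate(preds))
        if auc > best_auc:                                   # keep only if AUC improves
            best_auc, kept = auc, trial
    return kept, best_auc
```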
And this is a small sample of the list of the features that were generated.
And you can see that they are all quite similar.
They all look like that.
So, category brand 30, which is the top one, means how many times a customer bought from the same category and brand in the last 30 days.
So, we built an exhaustive list of these types of features, and we used this cross-validation approach to determine which of them were good.
We did some cleaning obviously.
If there were some transactions which were very big, we capped them.
Also, I replaced missing values with minus one, and the reason I did this was that many of the missing values here were produced by the generated features. So, I was going back to the transactional history and, for example, I would look at every how many days the customer buys an item, or a category, or a brand, and then I would try to compute the standard deviation of this. If there are no records, you basically cannot estimate a standard deviation. So, in this case, this would give you minus one. Minus one was a good choice because, normally, these features had a negative relationship with the target. So, when you could not estimate metrics like the standard deviation, it meant that there were not many purchases, and this was associated with a lower chance to buy the item, and the negative value here reflected this relationship.
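A simplified version of that cleaning step, with hypothetical column names and an arbitrary cap value, could look like this:

```python
import pandas as pd

# Sketch: cap very large amounts and compute the standard deviation of days
# between purchases per customer and brand; -1 marks "cannot be estimated".
def interval_std_features(transactions, cap=1000):
    txn = transactions.copy()
    txn["amount"] = txn["amount"].clip(upper=cap)            # cap very big transactions
    txn = txn.sort_values(["customer_id", "brand", "date"])
    txn["gap_days"] = (txn.groupby(["customer_id", "brand"])["date"]
                          .diff().dt.days)                   # days between purchases
    std = (txn.groupby(["customer_id", "brand"])["gap_days"]
              .std()
              .rename("brand_gap_std")
              .reset_index())
    std["brand_gap_std"] = std["brand_gap_std"].fillna(-1)   # too few purchases -> -1
    return std
```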
And what was interesting is that, although this is a binary classification problem, we actually used Ridge regression to model it. So, apart from the actual label, whether someone has repeated or not, we also had the actual counts of how many times they repeated in the future. Naturally, these give you stronger information about how successful the recommendation was, right? Because if someone has bought it many times, it means it was a more successful recommendation. So, knowing this information could help you make better predictions and have a score that discriminates better, even though it is modeled not on zero and one, but on the actual count.
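The trick is simply to fit a regressor on the counts and still evaluate the resulting scores with AUC against the binary label, something like the following sketch (the data variables are placeholders):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score

def score_with_counts(X_train, repeat_counts_train, X_valid, repeater_valid):
    # Fit on the count target; the prediction is just a ranking score.
    model = Ridge(alpha=1.0).fit(X_train, repeat_counts_train)
    scores = model.predict(X_valid)
    # Judge the scores by AUC against the 0/1 repeater label.
    return roc_auc_score(repeater_valid, scores)
```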
And this achieved around 0.61 on the test data. Now, you see that this score is low, but by itself it was near the top 10. It was naturally low because of the irregularity I mentioned, all these great differences between the train and test data.
The second approach used collaborative filtering. I think what the second approach tried to answer, and Jarrett was mainly focused on this, was: would the customer have bought the product if they had not received an offer? So, would they have bought the product anyway, irrespective of the offer, irrespective of sending the coupon?
I think that was a very interesting concept.
This approach was employed by making a different model for every offer in the train and test data. And the target variable was quite intuitive: we went 90 days before the actual coupon was sent and estimated how many times a customer had bought the offered product, so the same combination of category, brand, and company. So, we created our own target.
So, irrespective of sending the coupon, let's ignore that information for now. We know which customers bought the offered product in the 90 days before the coupon was sent. Let's see what characteristics they have. If we manage to make a link here, learning which customers liked the product irrespective of the coupon, and then apply this score to the customers that were actually sent a coupon, maybe that score works really well, because it tells you which customers really liked it irrespective of the coupon.
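Building that alternative target could be sketched like this, again with assumed table and column names; in practice a separate model was trained per offer on this target.

```python
import pandas as pd

# Sketch: for each offer, count purchases of the offered category/brand/company
# combination in the 90 days before the offer was sent.
def pre_offer_target(transactions, offers, window_days=90):
    merged = offers.merge(
        transactions,
        on=["customer_id", "category", "brand", "company"],
        how="left")
    days_before = (merged["offer_date"] - merged["date"]).dt.days
    in_window = (days_before > 0) & (days_before <= window_days)
    target = (merged[in_window]
              .groupby(["customer_id", "offer_id"])
              .size()
              .reset_index(name="pre_offer_purchases"))
    out = offers.merge(target, on=["customer_id", "offer_id"], how="left")
    out["pre_offer_purchases"] = out["pre_offer_purchases"].fillna(0)
    return out
```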
So, that was the main logic of this approach, and it worked really well. It blended really nicely with the other approach, and the combination of the two gave us a very strong lead in this competition.
All the features here, naturally, because they were based on collaborative filtering, were focused on the customer, so focused on the user's activity. They would try to describe the customer by means of which categories or brands or companies he or she prefers to buy from. And for those categories for which we didn't have much history, could we get a summary of how much the customer likes them? Because there were quite many of them, and what was quite successful was using Restricted Boltzmann Machines, which are a form of deep learning, to summarize this information for those least popular categories, because they naturally work on binary features, which here were whether the customer has bought or not from a certain category or brand.
And the initial idea came about because we had seen this type of model being used in recommendations; Chintan had used it. Although we couldn't make it work as a supervised problem, we made it work really well in this unsupervised form, to summarize the activity of how customers prefer certain categories or brands or companies.
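One possible way to do that kind of unsupervised summary with scikit-learn's BernoulliRBM is sketched below; the setup and hyperparameters are assumptions, not the ones we actually used.

```python
import pandas as pd
from sklearn.neural_network import BernoulliRBM

# Sketch: binary "has bought from this category" flags per customer, compressed
# into a small number of RBM hidden-unit activations used as features.
def rbm_category_summary(transactions, n_components=20):
    flags = (transactions
             .assign(bought=1)
             .pivot_table(index="customer_id", columns="category",
                          values="bought", aggfunc="max", fill_value=0))
    rbm = BernoulliRBM(n_components=n_components, learning_rate=0.05,
                       n_iter=20, random_state=0)
    hidden = rbm.fit_transform(flags.values)        # per-customer summary features
    cols = [f"rbm_{i}" for i in range(n_components)]
    return pd.DataFrame(hidden, index=flags.index, columns=cols)
```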
And then we created other features like the average amount spent, total visits, how many different brands, categories, and companies they buy from, whether they are extreme, whether they are adventurous, do they try many different brands and categories or just a few. With this kind of information we were trying to derive a sort of customer cardinality. And then, whether they prefer discounts or not, whether they visit on weekends, how much they spend on weekends. We tried to describe the customers in different ways, from different angles.
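A compact sketch of such customer-level descriptors, with assumed column names (the discount flag derived from a hypothetical list_price column, for example), might be:

```python
import pandas as pd

# Sketch: describe each customer from several angles - spend, variety,
# weekend behaviour and discount affinity. Column names are assumptions.
def customer_profiles(transactions):
    txn = transactions.copy()
    txn["is_weekend"] = txn["date"].dt.dayofweek >= 5
    txn["is_discount"] = txn["amount"] < txn["list_price"]   # hypothetical discount flag
    profile = txn.groupby("customer_id").agg(
        avg_amount=("amount", "mean"),
        total_visits=("date", "nunique"),
        distinct_brands=("brand", "nunique"),
        distinct_categories=("category", "nunique"),
        distinct_companies=("company", "nunique"),
        weekend_share=("is_weekend", "mean"),
        discount_share=("is_discount", "mean"),
    )
    weekend_spend = (txn[txn["is_weekend"]]
                     .groupby("customer_id")["amount"].sum()
                     .rename("weekend_spend"))
    return profile.join(weekend_spend).fillna({"weekend_spend": 0})
```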
And once all these features had been built, again with the notion of lag present here too, the best modeling technique was a gradient boosting machine from scikit-learn on the log of the counts of how many times the same category, brand, and company was bought in the 90 days before the coupon was sent. The natural logarithm, I think, was helping as a form of regularization, since it was containing very extreme values. It worked really well with this type of problem; it was always giving a boost.
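In scikit-learn terms, that modeling step could be sketched as follows; the hyperparameters are placeholders, not the settings we actually tuned, and log1p stands in for the natural log so that zero counts are handled.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_offer_model(X_train, pre_offer_counts, X_test):
    # Illustrative hyperparameters only.
    gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                    max_depth=4, random_state=0)
    gbm.fit(X_train, np.log1p(pre_offer_counts))   # natural log of counts (plus one)
    return gbm.predict(X_test)                     # higher score = more likely to buy anyway
```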
This achieved slightly better than the content-based approach, again near the level of the top 10, with a score of 0.616, and I think this approach was very, very intuitive.
So, how did we merge these? The main problem we found when we tried to merge these two scores was that they were trained on different targets. The content-based approach was trained on the actual counts of how many times a customer had repeated buying an item, while the second one was trained on the log of the counts of how many times the offered product was bought in the 90 days before the coupon was sent. So, you can imagine the distributions of these scores are very different, naturally, because the natural logarithm pushes the values lower, so these scores were much lower than the first ones.
However, when we look at the distributions, let's ignore the scores. We know that strategy two, which is on the right here, will for certain offers have lower scores than the one on the left side, which is the content-based approach. But the distributions, if you ignore these elements and look only at the rank, actually look pretty similar. So, let's ignore the score; let's make the score relative. If you look only at the rankings, they actually look similar.
And this gave us the idea, because, you see, AUC actually cares about the rank, about how good that rank is. So, what we did was transform our scores into ranks. So, not what the actual score is, but how big the score is compared to all the other scores we have: is it rank 50, or is it the top score, the second best, the third best? We just converted the scores to these ranks, and then we did a merge, a merge on average.
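The rank-average blend itself is only a few lines; here is a minimal sketch using scipy's rankdata, with equal weights as described.

```python
from scipy.stats import rankdata

# Convert each model's scores to normalized ranks, then average them.
def rank_average(score_a, score_b):
    rank_a = rankdata(score_a) / len(score_a)
    rank_b = rankdata(score_b) / len(score_b)
    return (rank_a + rank_b) / 2.0
```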
And you can see the final results. So, blending these two approaches after converting to ranks and giving equal weight gave us the top score on the leaderboard, and I think with quite some difference from the other teams. And we got this quite early, and we maintained that lead comfortably.
And as I'm reaching the end, what I wanted to address is that machine learning is good and great, but sometimes you need to really look through the data, and this is an example where this element was really prevalent. This was the notion of understanding the data, of understanding the difference between the train and test sets. It was really important in forming a cross-validation strategy that could best replicate the test results and then give us the confidence that whatever we tried would work well and generalize well to unobserved data.
And then, when you are challenged with a problem, you try to solve it with what has traditionally worked best. And since this was a recommendation problem, relying on the literature, relying on what has worked well in the past, for example this content-based approach or collaborative filtering, was important in order to get good results.
Ultimately, yes, relying on advanced machine learning techniques also gave us an edge. We made good use of deep learning and gradient boosting methods, but even simpler methods worked well here when the focus was on other aspects, like the features.
So, that was it. I hope you found it useful. These were the elements I really wanted you to take away from this: how we challenged this setup, how we tried to seize the problem, how we tried to understand it and make it our own, to understand it fully in order to be able to solve it. We put ourselves into this problem. Hopefully you found it useful, and stay tuned, more things will come.