We have already seen that lift and chi-square may not be very good measures

for examining transaction data that contain lots of null transactions.

So, what we would like to see is,

what are good measures?

Ones that are not influenced much by the number of null transactions.

Let's look at those different measures.

Some measures have a property called null-invariance,

which means their values do not change with the number of null transactions.
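
As a minimal sketch of what null-invariance means in practice, the Python snippet below changes only the number of null transactions and checks whether a measure's value moves. The function names (`lift`, `jaccard`, `changes_with_nulls`) and the counts are illustrative assumptions, not something given in the lecture; the formulas are the standard definitions of lift and the Jaccard coefficient.

```python
from typing import Callable, Sequence

# A measure here is any function of the four contingency counts:
# both items, only A, only B, and neither (the null transactions).
Measure = Callable[[int, int, int, int], float]


def lift(n_ab: int, n_a_only: int, n_b_only: int, n_null: int) -> float:
    """P(A and B) / (P(A) * P(B)); the total n includes the null transactions."""
    n = n_ab + n_a_only + n_b_only + n_null
    sup_a = n_ab + n_a_only
    sup_b = n_ab + n_b_only
    return (n_ab * n) / (sup_a * sup_b)


def jaccard(n_ab: int, n_a_only: int, n_b_only: int, n_null: int) -> float:
    """|A and B| / |A or B|; n_null is accepted but never used."""
    return n_ab / (n_ab + n_a_only + n_b_only)


def changes_with_nulls(measure: Measure, n_ab: int, n_a_only: int,
                       n_b_only: int, null_counts: Sequence[int]) -> bool:
    """Return True if the measure's value moves when only n_null varies."""
    values = [measure(n_ab, n_a_only, n_b_only, k) for k in null_counts]
    return max(values) - min(values) > 1e-12


# Hypothetical counts, chosen only for illustration.
print(changes_with_nulls(lift, 100, 50, 50, [0, 1000, 100000]))
print(changes_with_nulls(jaccard, 100, 50, 50, [0, 1000, 100000]))
```

A check like this only probes the null counts you give it, so it illustrates the property rather than proving it; the proof comes from seeing that the null count never appears in the formula.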

Let's see which measures are null-invariant

and which measures are not null-invariant.

We already know that chi-square and lift

are not null-invariant.

Their values change with the number of null transactions.

But people have found that the following five measures,

if you check their formulas,

their definitions, are actually null-invariant measures.

So, you probably know the Jaccard coefficient and the cosine measure quite well.

These two measures are popularly used, and they're null-invariant.

All-confidence actually takes

the larger of the supports of A and B as the denominator,

and the numerator is just the support of A and B together, the support of the rule.

So it gives the smaller of the two confidences, while max-confidence tries to find the larger one of them.

These two were actually proposed in studies on measuring association rules.

The Kulczynski measure was proposed around 2007 by us, by our group.

We originally called it the balanced measure, but later

a reviewer actually pointed out that this measure had been proposed

by a Polish mathematician called Kulczynski in 1927.

So, we changed the name of this measure to the Kulczynski measure.
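
For reference, here is a minimal Python sketch of the five measures, written from their standard textbook definitions rather than from formulas shown in this lecture. The function name `null_invariant_measures` and the example supports are assumptions made for illustration. Note that every formula uses only the supports of A, B, and A and B together, so the number of null transactions never enters.

```python
import math
from typing import Dict


def null_invariant_measures(sup_a: float, sup_b: float,
                            sup_ab: float) -> Dict[str, float]:
    """Compute five null-invariant measures from (absolute or relative) supports.

    sup_a  : support of A
    sup_b  : support of B
    sup_ab : support of A and B together
    """
    conf_a_to_b = sup_ab / sup_a  # P(B | A)
    conf_b_to_a = sup_ab / sup_b  # P(A | B)
    return {
        "jaccard": sup_ab / (sup_a + sup_b - sup_ab),
        "cosine": sup_ab / math.sqrt(sup_a * sup_b),
        # Smaller of the two confidences = sup_ab / max(sup_a, sup_b).
        "all_confidence": min(conf_a_to_b, conf_b_to_a),
        # Larger of the two confidences = sup_ab / min(sup_a, sup_b).
        "max_confidence": max(conf_a_to_b, conf_b_to_a),
        # Average of the two confidences.
        "kulczynski": 0.5 * (conf_a_to_b + conf_b_to_a),
    }


# Hypothetical, deliberately unbalanced supports.
for name, value in null_invariant_measures(1000, 100, 100).items():
    print(f"{name}: {value:.4f}")
```

With unbalanced supports like these, all-confidence and max-confidence sit at opposite ends (0.1 and 1.0), and the Kulczynski measure lands halfway between them at 0.55.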

Let's first look at null-invariance

and why it is very important.

That is, why, in analyzing massive transaction data,

null-invariance is so critical.

It is because, among many, many transactions,

the chance that a transaction contains a particular set of items

is actually very small. Take Walmart transactions, for example:

many of them may contain neither milk nor coffee.

We will try to analyze milk and coffee using the following contingency table.

So, this m_c means the number of transactions that contain both milk and coffee.

This not m nor c means the number of transactions that contain neither milk nor coffee.

So, this not m nor c is the number of null transactions.
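
To make the four cells of the contingency table concrete, here is a small Python sketch that counts them from a list of transactions. The function name `contingency_counts`, the cell labels, and the five sample baskets are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def contingency_counts(transactions: Iterable[Set[str]],
                       a: str = "milk", b: str = "coffee") -> Dict[str, int]:
    """Count the four cells of the 2x2 contingency table for items a and b."""
    counts: Counter = Counter()
    for basket in transactions:
        has_a, has_b = a in basket, b in basket
        if has_a and has_b:
            counts["both"] += 1      # contains a and b
        elif has_a:
            counts["a_only"] += 1    # contains a but not b
        elif has_b:
            counts["b_only"] += 1    # contains b but not a
        else:
            counts["null"] += 1      # null transaction: neither a nor b
    # Counter returns 0 for cells that never occurred.
    return {cell: counts[cell] for cell in ("both", "a_only", "b_only", "null")}


# Hypothetical baskets for illustration.
baskets = [
    {"milk", "coffee"},
    {"milk"},
    {"bread"},
    {"coffee", "sugar"},
    {"bread", "eggs"},
]
print(contingency_counts(baskets))
```

In this toy data, two of the five baskets are null transactions with respect to milk and coffee; in real transaction data the null cell is typically far larger than the other three.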

Then, we see that lift and chi-square

are not null-invariant.

So they are not good at evaluating data that

contain too many or too few null transactions.

For example, let's look at this.

For this dataset D1,

m_c means the number of transactions that contain both milk and coffee;

not m, c means the number of transactions that contain coffee but not milk;

m, not c means the number of transactions that contain milk but not coffee;

not m nor c means the number of null transactions,

which contain neither milk nor coffee.

So, if you go to a store like Walmart,

a kind of supermarket,

you may well see cases where you get

10,000 transactions that contain milk and coffee,

but 1,000 that contain coffee but not milk,

and 1,000 that contain milk but not coffee.

In that case, you would probably say that if people buy milk,

they are likely to buy coffee as well, because there are 10,000 such cases.

Those who buy only one of them account for only 1,000 each.

But if you have a lot of null transactions,

lift and chi-square could come out strongly positive.

If you have very few null transactions,

it turns out these measures say milk and coffee are independent,

even though, looking at the counts, you would probably say they are not independent.

On the other hand, suppose you see only 100 cases

where milk and coffee are bought together.

There are many more cases

where they are bought alone.

But once you have many null transactions,

the result again turns out to be very positive.

The number is quite big.

So, no matter whether you have very many null transactions

or very few null transactions,

something may go wrong.
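
The effect can be reproduced with a short Python sketch. The 10,000 / 1,000 / 1,000 counts and the 100 joint purchases come from the example above; the null-transaction counts (100,000 for "many", 100 for "few") and the 1,000 single-item counts in the third dataset are assumptions, since the lecture does not state them here. The chi-square is the plain Pearson statistic for a 2x2 table without continuity correction, and the dataset labels are hypothetical.

```python
def lift(both: int, milk_only: int, coffee_only: int, null: int) -> float:
    """P(milk and coffee) / (P(milk) * P(coffee))."""
    n = both + milk_only + coffee_only + null
    return (both * n) / ((both + milk_only) * (both + coffee_only))


def chi_square(both: int, milk_only: int, coffee_only: int, null: int) -> float:
    """Pearson chi-square for a 2x2 table (no continuity correction)."""
    n = both + milk_only + coffee_only + null
    numerator = n * (both * null - milk_only * coffee_only) ** 2
    denominator = ((both + milk_only) * (coffee_only + null)
                   * (both + coffee_only) * (milk_only + null))
    return numerator / denominator


# (both, milk_only, coffee_only, null); null counts are assumed values.
datasets = {
    "often together, many nulls": (10_000, 1_000, 1_000, 100_000),
    "often together, few nulls": (10_000, 1_000, 1_000, 100),
    "rarely together, many nulls": (100, 1_000, 1_000, 100_000),
}

for label, cells in datasets.items():
    print(f"{label}: lift={lift(*cells):.2f}, chi2={chi_square(*cells):.0f}")
```

Under these assumed null counts, the first two datasets have identical milk and coffee counts, yet lift is about 9.26 in the first and exactly 1 (with chi-square 0) in the second. The third dataset, where the items are rarely bought together, still gets a lift of about 8.44 purely because of the large null count.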

So, we do need to analyze such data using null-invariant measures.

We'll examine this in more detail.