0:12

In the last two lectures, we first had an example of classical probabilities and

then an example of subjective probabilities.

I now want to complete this module by looking at empirical probabilities and

a rather surprising law on empirical probabilities that we see in some

areas in business, and statistics, and data collection.

0:55

Let's now think about leading digits.

What the heck are leading digits?

If we look at numbers,

the first digit in a number is called the leading digit.

Here are some simple examples.

In 54,571, the first number, the first digit is a 5.

In 182,265, the first digit is a 1.

And finally, the number 4 is just a 4, so the first digit is a 4.

1:45

Honestly, if you had asked me some years ago how often each leading digit occurs,

I would have said they're all equally likely: 1, 2, 3, 4, 5,

6, 7, 8, 9, nine digits, 1/9, about 11.1% for each digit.

That was my naive expectation, seemed reasonable.

Here's now a really fun fact that we see in everyday data.

They are unequal.

2:20

Rest assured, we're not going to talk much about logarithms, but

this once I have to write a log, base 10, on this slide.

Look at this funny probability distribution:

the probability of a digit d, where d is any digit from 1

through 9, is the logarithm to the base 10 of 1 + 1/d.

I only put that number on the slide to scare the living daylights out of you.

You can right away forget about it and

just look here at the probability distribution at the bottom of the slide.

The probability of 1 is about 30%.

The probability of a 2 is about 17.6%.

Notice how the probabilities decrease to under 5%.

According to this probability distribution,

the 9 has a probability of only 4.6%.
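To make the formula concrete, here is a minimal sketch in Python (not part of the lecture) that computes the Benford probabilities just described:

```python
from math import log10

# Benford's law: P(d) = log10(1 + 1/d) for each leading digit d = 1..9
benford = {d: log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")

# The nine probabilities sum to exactly 1, because the terms telescope:
# log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1
```

Running this reproduces the values on the slide: about 30.1% for a leading 1, 17.6% for a 2, down to about 4.6% for a 9.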

3:18

Now, here's the kicker, this law actually holds for

relative frequency probabilities in a variety of data sets.

It has been observed in credit card bills, in stock market prices,

in the market valuation of listed companies, like in a stock index.

Think about the SMI in Switzerland, or, in particularly large markets,

the Russell 2000 in the United States or the Wilshire 5000.

It also occurs in population data.

If you look at the populations of cities and districts,

you find it in the lengths of rivers or the population sizes of countries.

4:05

There are some requirements that these data sets have to satisfy.

They should not have an arbitrary maximum or minimum value.

So there's some basic assumptions.

And the law does not apply to numbers that are assigned.

So for assigned grades, assigned phone numbers, and

assigned identity numbers, it doesn't hold.

4:40

The first data set reports US census

data from the year 2010 when the United States counted

the number of people residing in its various counties.

The United States is divided into 3,143 counties.

Here, the first state in the alphabet is the state of Alabama.

And within Alabama, the first county in the alphabet is Autauga County,

and it has a population of 54,571.

The leading digit is a 5.

The next county in Alabama, Baldwin County,

has 182,265 inhabitants, so the leading digit is a 1.

I calculated, or I wrote down, the leading digits for

all 3,143 counties, and then counted how often

does a 1 show up, how often does a 2 show up.

That's here in the column count.

And then I looked at the proportions.

These proportions we can now think of as probabilities,

according to the definition of empirical probability.

We have here in our data 3,143 trials of a random experiment,

that's how we can think of this.

In 953 of them, the leading digit is a 1.

953 divided by 3,143 gives us 30.3%.

And look at that.

The data set gives 30.3%; Benford's law,

in theory, would say 30.1%. Very close.

Look at the 2: it shows up a little more often, almost 19%, not quite the 17.6%.

But in general, Benford's Law is very close to our data.

And look, a 9 only happens in less than 5% of the cases.

So here we see this amazing law, where intuitively,

I would have said 1, 2, 3, 9 are all equally likely.

In fact, 1 shows up most often, by quite a margin.
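The counting exercise just described can be sketched in a few lines of Python. The population numbers below are a small hypothetical sample standing in for the full 3,143-county census file:

```python
from math import log10

def leading_digit(n: int) -> int:
    """Return the first (leading) digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

# Hypothetical sample of county populations (the real data set has 3,143 rows)
populations = [54571, 182265, 4, 27457, 118572, 9045, 1030]

# Count how often each leading digit 1..9 shows up
counts = {d: 0 for d in range(1, 10)}
for p in populations:
    counts[leading_digit(p)] += 1

# Compare the empirical proportions with Benford's prediction
total = len(populations)
for d in range(1, 10):
    observed = counts[d] / total
    expected = log10(1 + 1 / d)
    print(f"digit {d}: observed {observed:.1%}, Benford {expected:.1%}")
```

With the real census data, the observed proportion for digit 1 would be 953 / 3,143, the 30.3% quoted above.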

7:13

The United States Bureau of Labor Statistics distinguishes

820 different job titles.

Every job title has its code.

And the Bureau of Labor Statistics counts the number of people

that have these titles.

For example, there are 248,760 chief executives.

There are 174,010 so-called

marketing managers, and so on.

I have here the entire data set for you, 820 job descriptions.

We write down the leading digits,

we count how often each one occurs, and we calculate the proportions.

Here, we have a little bit more variation, but once again,

the proportions are pretty well approximated by Benford's law.

The number 1 shows up way more often than the 2, the 2 more often than the 3,

and so on, all the way to the 9.

9 is once again in last place.

So here we have another illustration of Benford's law.

8:30

Here we are, back from those data sets.

And now you may say, okay, those were cute, but what is this good for?

You won't believe it.

Benford's law is used a lot in fraud detection, in particular,

in accounting fraud detection.

Here, I give you a reference to a fascinating paper, where

the author shows how one can use Benford's law

to analyze large data sets of tax returns and

actually see whether people cheat on their taxes or not.

9:22

Auditors are looking for

people who are cheating on their taxes by hiding money in other accounts.

I have recently been taught by accountants working

in this area that they use Benford's law.

They look at huge data sets, big data, as you may have heard,

and see whether they satisfy Benford's law.

And if they don't, they suspect accounting fraud.

9:50

In a related issue, some researchers have looked at government data,

in particular, government budget data coming out of the country of Greece

in the five years before Greece joined the Euro.

And guess what?

Benford's law didn't hold, even though it held in other countries, and

it held in Greece many years prior.

The authors conclude that the government doctored its budget,

so that the country could join the Euro.

So this Benford's law, as strange as it looks to us,

it holds in many data sets, and it has real-world applications.

How cool is that?

Let me wrap up.

We have seen Benford's law, and it's really used in

accounting to detect tax cheats and other frauds.

This brings me to the end of this module.

We learned about probabilities, in particular,

the three different definitions of probabilities.

And we saw examples of the classical definition, of subjective probabilities, and,

in this last lecture, of empirical probabilities.

Thanks, and I hope to see you again in the next module.