0:12

In the last two lectures, we first had an example of classical probabilities and then an example of subjective probabilities. I now want to complete this module by looking at empirical probabilities and a rather surprising law on empirical probabilities that we see in some areas in business, statistics, and data collection.

0:55

Let's now think about leading digits. What the heck are leading digits? If we look at numbers, the first digit in a number is called the first digit or leading digit. Here are some simple examples. In 54,571, the first number, the first digit is a 5. In 182,265, the first digit is a 1. And the final number is just 4, so the first digit is a 4.

1:45

Honestly, if you had asked me this question some years ago, I would have said they're all equally likely: 1, 2, 3, 4, 5, 6, 7, 8, 9, nine numbers, 1/9, about 11.1% for each digit. That was my naive expectation, and it seemed reasonable. Here's now a really fun fact that we see in everyday data. They are unequal.

2:20

Rest assured, we're not going to talk much about logarithms, but this once I have to write a log to the base 10 on this slide. Look at this funny probability distribution: the probability of a digit d, where d is any number 1 through 9, is the logarithm to the base 10 of 1 + 1/d. I only put that formula on the slide to scare the living daylights out of you. You can right away forget about it and just look at the probability distribution at the bottom of the slide. The probability of a 1 is about 30%. The probability of a 2 is about 17.6%. Notice how the probabilities decrease to under 5%. According to this probability distribution, the 9 has a probability of only 4.6%.
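The distribution on the slide can be computed directly from the formula; a minimal sketch in Python:

```python
import math

# Benford's law: the probability of leading digit d is log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")

# The nine probabilities sum to 1, because the logs telescope:
# log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1.
```

Running this reproduces the values from the slide: about 30.1% for a 1, 17.6% for a 2, down to about 4.6% for a 9.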

3:18

Now, here's the kicker: this law actually holds for relative frequency probabilities in a variety of data sets. It has been observed in credit card bills, in stock market prices, and in the market valuations of listed companies, as in a stock index. Think of the SMI in Switzerland or, for particularly large markets, the Russell 2000 in the United States or the Wilshire 5000. It also occurs in population data: you find it in the populations of cities and districts, in the lengths of rivers, and in the population sizes of countries.

4:05

There are some requirements that these data sets have to meet. They should not have an arbitrary maximum or minimum value. So there are some basic assumptions. And the law does not apply to numbers that are assigned: for assigned grades, assigned phone numbers, and assigned identity numbers, it doesn't hold.

4:40

The first data set reports US census data from the year 2010, when the United States counted the number of people residing in its various counties. The United States is divided into 3,143 counties. Here, the first state in the alphabet is the state of Alabama. And within Alabama, the first county in the alphabet is Autauga County, and it has a population of 54,571. The leading digit is a 5. The next county in Alabama, Baldwin County, has 182,265 inhabitants, so the leading digit is a 1.

I wrote down the leading digits for all 3,143 counties and then counted how often a 1 shows up, how often a 2 shows up, and so on. That's here in the column labeled count. And then I looked at the proportions. These proportions we can now think of as probabilities, according to the definition of empirical probability. We have here in our data 3,143 trials of a random experiment; that's how we can think of this. In 953 of them, the leading digit is a 1. 953 divided by 3,143 gives us 30.3%. And look at that: the data set says 30.3%, while Benford's law, in theory, would say 30.1%. Very close. Look at the 2: it shows up a little more often, almost 19%, not quite 17.6%. But in general, Benford's law is very close to our data. And look, a 9 happens in less than 5% of the cases. So here we see this amazing law where, intuitively, I would have said 1, 2, 3, through 9 are all equally likely. In fact, the 1 shows up most often, by quite a margin.
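The tally described above is easy to reproduce. A minimal sketch in Python, using the two county populations named in the lecture plus invented placeholder values standing in for the rest of the 3,143 counties:

```python
from collections import Counter

def leading_digit(n: int) -> int:
    """Return the first (leading) digit of a positive integer."""
    return int(str(n)[0])

# Autauga (54,571) and Baldwin (182,265) Counties from the lecture;
# the remaining values are made-up placeholders, not real census figures.
populations = [54571, 182265, 27457, 118572, 10914, 4, 57322, 20947]

counts = Counter(leading_digit(p) for p in populations)
proportions = {d: counts.get(d, 0) / len(populations) for d in range(1, 10)}
```

On the real census file, the same computation gives 953/3,143, or about 30.3%, for the digit 1, right next to Benford's theoretical 30.1%.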

7:13

The United States Bureau of Labor Statistics knows 820 different job titles. Every job title has its code, and the Bureau of Labor Statistics counts the number of people that hold these titles. For example, there are 248,760 chief executives. There are 174,010 so-called marketing managers, and so on. I have here the entire data set for you, 820 job descriptions. We write down the leading digits, we count them, and we calculate the proportions. Here, we have a little bit more variation, but once again, the proportions are pretty well approximated by Benford's law. The 1 shows up way more often than the 2, the 2 more often than the 3, and so on, all the way down to the 9, which is once again in last place. So here we have another illustration of Benford's law.

8:30

Here we are, back from those data sets. And now you may say, okay, those were cute, but what is this good for? You won't believe it. Benford's law is used a lot in fraud detection, in particular in accounting fraud detection. Here, I give you a reference to a fascinating paper, where the author shows how one can use Benford's law to analyze large data sets of tax returns and actually see whether people cheat on their taxes or not.

9:22

People are looking for people who are cheating on their taxes by hiding money in other accounts. I have recently been taught by accountants working in this area that they use Benford's law. They look at huge data sets, big data, as you may have heard, and check whether they satisfy Benford's law. And if they don't, they suspect accounting fraud.
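One way to sketch such a check is to compare the observed leading-digit proportions against Benford's law and flag large deviations. The counts below and the mean-absolute-deviation measure are my own simplified illustration, not the actual test a forensic accountant would run:

```python
import math

def benford_deviation(counts):
    """Mean absolute deviation between observed leading-digit
    proportions and Benford's law; larger values are red flags."""
    total = sum(counts.values())
    return sum(
        abs(counts.get(d, 0) / total - math.log10(1 + 1 / d))
        for d in range(1, 10)
    ) / 9

# Illustrative counts only: one set roughly following Benford's law,
# one spread uniformly, as doctored figures often are.
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79,
                6: 67, 7: 58, 8: 51, 9: 46}
uniform = {d: 111 for d in range(1, 10)}

print(benford_deviation(benford_like))  # close to zero
print(benford_deviation(uniform))       # much larger: suspicious
```

In practice, analysts use formal goodness-of-fit tests rather than a raw deviation, but the idea is the same: data that strays far from Benford's curve deserves a closer look.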

9:50

In a related vein, some researchers have looked at government data, in particular government budget data coming out of the country of Greece in the five years before Greece joined the Euro. And guess what? Benford's law didn't hold, even though it held in other countries, and it held in Greece many years prior. The authors conclude that the government doctored its budget so that the country could join the Euro. So this Benford's law, as strange as it looks to us, holds in many data sets, and it has real-world applications. How cool is that? Let me wrap up. We have seen Benford's law, and it is really used in accounting to detect tax cheats and other frauds. This brings me to the end of this module. We learned about probabilities, in particular the three different definitions of probabilities. And we saw examples of the classical definition, of subjective probabilities, and in this last lecture, of an empirical distribution. Thanks, and I hope to see you again in the next module.
