Welcome. This week, I will teach you N-gram language models. This will allow you to write your first program that generates text on its own. First, I'll go over what an N-gram is. Then you'll estimate the conditional probability of an N-gram from your text corpus.

Now, what is an N-gram? Simply put, an N-gram is a sequence of words. Note that it's more than just a set of words, because the word order matters. N-grams can also be sequences of characters or other elements, but for now, you'll be focusing on sequences of words. When you process the corpus, punctuation is treated like words, but all other special characters, such as code, are removed.

Let's look at an example: "I am happy because I am learning." Unigrams for this corpus are the set of all unique single words appearing in the text. For example, the word "I" appears in the corpus twice, but it is included only once in the unigram set. The prefix "uni" stands for one. Bigrams are all sets of two words that appear side by side in the corpus. Again, the bigram "I am" can be found twice in the text, but it is only included once in the bigram set. The prefix "bi" means two. Also notice that the words must appear next to each other to be considered a bigram. Another example of a bigram is "am happy". On the other hand, the sequence "I happy" does not belong to the bigram set, because that phrase does not appear in the corpus. "I happy" is omitted even though both of the individual words, "I" and "happy", appear in the text. Trigrams represent the unique triplets of words that appear in sequence together in the corpus. The prefix "tri" means three.

Here's some notation that you're going to use going forward. If you have a corpus of text that has 500 words, the sequence of words can be denoted as W1, W2, W3, all the way to W500. The corpus length is denoted by the variable M.
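The idea of collecting unique unigrams, bigrams, and trigrams can be sketched in a few lines of Python. This is just an illustrative sketch, not code from the course; I'm assuming the corpus is tokenized by splitting on spaces, with the final period kept as its own token, since punctuation is treated like a word.

```python
# Minimal sketch: extract unique unigrams, bigrams, and trigrams from
# the example corpus. Word order is preserved inside each n-gram, and
# duplicates are dropped because we collect a set of unique n-grams.

corpus = "I am happy because I am learning .".split()

def ngrams(words, n):
    """Return the set of unique n-grams (as tuples) in a word sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

unigrams = ngrams(corpus, 1)
bigrams = ngrams(corpus, 2)
trigrams = ngrams(corpus, 3)

print(("I",) in unigrams)         # True: "I" appears twice, counted once
print(("I", "am") in bigrams)     # True: the words are adjacent
print(("I", "happy") in bigrams)  # False: the words are never adjacent
```

Note that "I happy" is correctly excluded, since the set comprehension only forms tuples from words that actually sit next to each other in the corpus.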
Now, for a sub-sequence of that corpus, if you want to refer to just the sequence of words from word 1 to word 3, you can denote it as W subscript 1, superscript 3. To refer to the last three words of the corpus, you can use the notation W subscript M minus 2, superscript M.

Next, you'll estimate the probability of an N-gram from a text corpus. Let's start with unigrams. In the corpus "I am happy because I am learning", the size of the corpus is M equals 7. The count of the unigram "I" is equal to 2, so its probability is 2 divided by 7. For the unigram "happy", the probability is equal to 1 divided by 7. In general, the probability of a unigram, shown here as W, can be estimated by taking the count of how many times the word W appears in the corpus and dividing that by the total size of the corpus, M. This is similar to the word probability concepts you used in previous weeks.

Now let's calculate the probability of bigrams. Let's start with an example, and then I'll show you the general formula. In the example "I am happy because I am learning", what is the probability of the word "am" occurring if the previous word was "I"? It would just be the count of the bigram "I am" divided by the count of the unigram "I". In this example, the bigram "I am" appears twice, and the unigram "I" appears twice as well. The conditional probability of "am" appearing given that "I" appeared immediately before is equal to 2 divided by 2. In other words, the probability of the bigram "I am" is equal to 1. For the bigram "I happy", the probability is equal to 0, because that sequence never appears in the corpus. Finally, the bigram "am learning" has a probability of 1/2. That's because the word "am" appears twice in the corpus, but it is followed by the word "learning" only once. Here's the general expression for the probability of a bigram. The bigram is represented by the word X followed by the word Y.
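The unigram and bigram estimates above can be sketched with simple counting. Again, this is my own illustration rather than course code, and I drop the punctuation token so that M equals 7, matching the example.

```python
# Minimal sketch: estimate unigram and bigram probabilities by counting.
# P(w) = C(w) / M, and P(y | x) = C(x, y) / C(x), as in the lecture.

from collections import Counter

corpus = "I am happy because I am learning".split()
M = len(corpus)  # corpus size, M = 7

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    """P(w) = C(w) / M."""
    return unigram_counts[w] / M

def p_bigram(y, x):
    """P(y | x) = C(x, y) / C(x)."""
    return bigram_counts[(x, y)] / unigram_counts[x]

print(p_unigram("I"))              # 2/7
print(p_bigram("am", "I"))         # 2/2 = 1.0
print(p_bigram("happy", "I"))      # 0/2 = 0.0
print(p_bigram("learning", "am"))  # 1/2 = 0.5
```

The `zip(corpus, corpus[1:])` trick pairs each word with its successor, which is exactly the list of bigrams in the corpus.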
The probability of the word Y appearing immediately after the word X is the conditional probability of word Y given X. The conditional probability of Y given X can be estimated as the count of the bigram X Y, divided by the count of all bigrams starting with X. This can be simplified to the count of the bigram X Y divided by the count of the unigram X. This last step works as long as every occurrence of X is followed by another word, that is, as long as X is not the last word of the corpus.

Let's calculate the probability of some trigrams. Using the same example as before, the probability of the word "happy" following the phrase "I am" is calculated as 1 divided by the number of occurrences of the phrase "I am" in the corpus, which is 2. The probability of a trigram, or consecutive sequence of three words, is the probability of the third word appearing given that the previous two words already appeared in the correct order. This is the conditional probability of the third word, given that the previous two words occurred in the text. It is estimated as the count of all three words appearing together, divided by the count of the previous two words appearing in the correct sequence. Note that the notation for the count of all three words appearing is written as the previous two words, denoted by W subscript 1 superscript 2, followed by a space and then W subscript 3. So this is just the count of the whole trigram, written as a bigram followed by a unigram.

What if you want to consider any number N? Let's generalize the formula to N-grams. For any number N, the probability of a word WN following the sequence W1 to WN minus 1 is estimated as the count of the N-gram W1 to WN, divided by the count of its prefix, W1 to WN minus 1. Notice here that the count of the N-gram for words W1 to WN is written as the count of W subscript 1 superscript N minus 1, followed by a space and then W subscript N. This is equivalent to C of W subscript 1 superscript N.
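The general N-gram formula can be sketched by counting the full N-gram and its prefix directly. The helper below is hypothetical, my own sketch rather than course code, and it assumes the same whitespace tokenization as before.

```python
# Minimal sketch of the general estimate
# P(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1}),
# computed by counting contiguous occurrences of each sequence.

def count_sequence(words, seq):
    """Count how many times the tuple `seq` occurs contiguously in `words`."""
    n = len(seq)
    return sum(tuple(words[i:i + n]) == seq for i in range(len(words) - n + 1))

def p_ngram(words, prefix, w):
    """Estimate P(w | prefix) as C(prefix followed by w) / C(prefix)."""
    return count_sequence(words, prefix + (w,)) / count_sequence(words, prefix)

corpus = "I am happy because I am learning".split()
print(p_ngram(corpus, ("I", "am"), "happy"))  # C(I am happy)/C(I am) = 1/2
```

With a prefix of length 1 this reduces to the bigram formula, and with an empty prefix caveat aside, longer prefixes give trigram and higher-order estimates with no extra code.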
By this point, you've seen N-grams along with specific examples of unigrams, bigrams, and trigrams. You've also calculated their probabilities from a corpus by counting their occurrences. This is great work. Now you know what N-grams are and how they can be used to compute the probability of the next word. Next, you will learn to use them to compute the probabilities of whole sentences.