Good to see you again. In this video, you're going to learn how to handle the beginning and the end of a sentence when implementing N-gram language models. If conditional probabilities work with a sliding window of two or more words, what happens at the beginning or end of a sentence? Let's take a closer look at what happens at the very beginning and the very end of a sentence. I'll show you how to modify the sentence using two new symbols which denote the start and end of a sentence. You will see why these new symbols are important. Then you will add them to the beginning and end of your sentences for bigram and general N-gram cases.

Now, I'll explain how to resolve the first term in the bigram approximation. Let's revisit the previous sentence, "the teacher drinks tea". For the first word, "the", you don't have the context of a previous word, so you can't calculate a bigram probability, which you'll need to make your predictions. What you'll do is add a special term so that the first word of each sentence in your corpus also forms a bigram that you can calculate a probability for. An example of a start token is the symbol <s>, which you can now use to calculate the bigram probability of the first word, "the", like this.

A similar principle applies to N-grams. For example, with trigrams, the first two words don't have enough context, so you would otherwise need to use the unigram probability of the first word and the bigram probability of the first two words. The missing context can be fixed by adding two start-of-sentence symbols <s> to the beginning of the sentence. Now, the sentence probability becomes a product of trigram probabilities. To generalize this for N-grams, add N-1 start tokens <s> at the beginning of each sentence. Now, you can deal with the unigrams at the beginning of sentences.
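To make this preprocessing step concrete, here is a minimal Python sketch of the start-token step (the function name and code are illustrative, not taken from the video):

```python
def add_start_tokens(sentence, n):
    """Prepend N-1 start-of-sentence tokens <s> to a tokenized sentence.

    `sentence` is a list of word strings and `n` is the N-gram order.
    (Illustrative helper; the name and signature are assumptions.)
    """
    return ["<s>"] * (n - 1) + sentence

# Bigram model (n = 2): one start token.
print(add_start_tokens(["the", "teacher", "drinks", "tea"], 2))
# ['<s>', 'the', 'teacher', 'drinks', 'tea']

# Trigram model (n = 3): two start tokens.
print(add_start_tokens(["the", "teacher", "drinks", "tea"], 3))
# ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea']
```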
What about the end of the sentences? Recall that the conditional probability of word y given word x was estimated as the count of all bigrams x y divided by the count of all bigrams starting with x, and you simplified the denominator to the count of the unigram x. There is one case where this simplification does not work: when word x is the last word of a sentence. For example, if you look at the word drinks in this corpus, the sum of all bigrams starting with drinks is only equal to 1, because the only bigram that starts with the word drinks is drinks chocolate. On the other hand, the word drinks appears twice in the corpus, so its unigram count is equal to 2. To continue using your simplified formula for the conditional probability, you need to add an end-of-sentence token.

There's another issue with your N-gram probabilities. Let's say you have a very small corpus with only three sentences consisting of two unique words, yes and no. The corpus consists of three sentences: yes no, yes yes, and no no. Now consider all the possible sentences of length 2 that can be generated from the words yes and no, each starting with the start-of-sentence symbol <s>. To calculate the bigram probability of the sentence yes yes, take the probability of yes following the added start-of-sentence symbol, multiplied by the probability of yes being the second word where the previous word was also yes. The probability of yes following the start-of-sentence symbol is the count of bigrams with yes at the start of the sentence divided by the count of all bigrams starting with the start-of-sentence symbol; here, you can only use the sum of bigram counts in the denominator. Next, let's handle the remaining unigrams. Multiply the first term by the fraction of the count of the bigram yes yes over the count of all bigrams starting with the word yes. You get 2 over 3 times 1 over 2, which is equal to 1 over 3. There you have the probability of the sentence yes yes estimated from your corpus: one-third. Now, calculate the probability of the sentence yes no, and you get one-third again. Next, get the probability of the sentence no no: again one-third. Finally, the probability of the sentence no yes is equal to 0, because there is no bigram no yes in the corpus. Now, here comes a surprise. If you add the probabilities of all four sentences, the sum equals 1, exactly what you were aiming for. That's great work.

Let's take a look at all possible three-word sentences generated from the words yes and no. Begin by calculating the probability of the sentence yes yes yes, then yes yes no, and so on, until you've calculated the probabilities of all eight possible sentences of length 3. Finally, when you add up the probabilities of all the possible sentences of length 3, you again get a sum of 1. However, what you really want is the sum of the probabilities for all sentences of any length to be equal to 1, so that you can, for example, compare the probabilities of two sentences of different lengths. In other words, you want the probabilities of all two-word sentences, plus the probabilities of all three-word sentences, plus the probabilities of all other sentences of arbitrary lengths, to add up to 1.

There is a surprisingly simple fix for this. You can preprocess your training corpus to add a special symbol which represents the end of the sentence, denoted </s>, after each sentence. For example, when using a bigram model for the sentence the teacher drinks tea, append the symbol </s> after the word tea. Now, the sentence probability calculation contains a new term: the probability that the sentence will end after the word tea. This also fixes the issue where the probabilities of all sentences of a given length sum to 1. Let's see if it resolves your problem with the bigram probability formula as well. Now, there are two bigrams starting with the word drinks, and these are drinks chocolate and drinks </s>, while the count of the unigram drinks remains the same. That's great. You can keep using the simplified formula for the bigram probability calculation. How would you apply this fix to N-grams in general? It turns out that even for N-grams, just adding one end symbol per sentence in the corpus is enough. For example, when calculating trigram models, the original sentence will be preprocessed to contain two start tokens and a single end token.

Let's have a look at an example of bigram probabilities generated on a slightly larger corpus. Here's the corpus, and here are the conditional probabilities of some of the bigrams. Now, try to calculate the probability of Lyn following the start-of-sentence symbol. There are three sentences in total, so the start symbol appears three times in the corpus. That gives you the denominator of 3. The bigram <s> Lyn appears twice in the corpus. That gives you the numerator of 2. The probability of the start token followed by Lyn is 2 over 3.
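As a rough sketch of how these bigram probabilities could be computed, the snippet below assumes a hypothetical three-sentence corpus chosen to match the counts quoted here (Lyn drinks chocolate, John drinks tea, Lyn eats chocolate); the exact corpus shown on the slide may differ:

```python
from collections import Counter

# Hypothetical corpus, chosen to be consistent with the counts quoted in the video.
corpus = [
    ["Lyn", "drinks", "chocolate"],
    ["John", "drinks", "tea"],
    ["Lyn", "eats", "chocolate"],
]

# Preprocess for a bigram model: one start token <s> and one end token </s> per sentence.
sentences = [["<s>"] + sentence + ["</s>"] for sentence in corpus]

# Count unigrams and bigrams over the preprocessed corpus.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_probability(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("<s>", "Lyn"))     # 2/3, as computed above
print(bigram_probability("Lyn", "drinks"))  # 1/2
```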
Now, calculate the probability of the sentence Lyn drinks chocolate. Start with the probability of the bigram <s> Lyn, which is 2 over 3, then Lyn drinks, which is 1 over 2, then drinks chocolate, which is also 1 over 2. Finally, chocolate </s>, which is 2 over 2. Note that the result is equal to 1 over 6, which is lower than the value of 1 over 3 you might expect when calculating the probability of one of the three sentences in the training corpus. This also applies to the other two sentences in the corpus. The remaining probability can be distributed to other sentences possibly generated from the bigrams in this corpus, and that's how the model generalizes.

In this video, you saw an example of beginning and end tokens with bigram models. This concept can be generalized to other types of models. In the next video, you will learn how to build your first N-gram language model. I'll see you there.
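As a reference for when you build that model, here is a short continuation of the earlier sketch (reusing the same hypothetical corpus and the bigram_probability helper) that chains the bigram probabilities into a sentence probability:

```python
def sentence_probability(sentence):
    """Approximate P(sentence) as a product of bigram probabilities,
    after adding one start token and one end token."""
    tokens = ["<s>"] + sentence + ["</s>"]
    probability = 1.0
    for prev_word, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(prev_word, word)
    return probability

# 2/3 * 1/2 * 1/2 * 2/2 = 1/6
print(sentence_probability(["Lyn", "drinks", "chocolate"]))  # 0.1666...
```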