We will now explore cleaning and tokenization. I spoke a little bit about data preparation, cleaning, and tokenization in Course 1, but now I'll go into more detail. Let's get started. I'll give you some practical advice on how to clean a corpus and split it into words, or more accurately tokens, through a process known as tokenization.

First, you should consider the words of your corpus as case insensitive. For instance, the word 'The' should be represented identically regardless of its case, for example whether or not it begins with a capital T at the beginning of a sentence. You can do this by converting the corpus to either all lowercase or all uppercase.

Secondly, you should handle punctuation. You could, for instance, represent all interrupting punctuation marks, such as full stops, commas, and question marks, as a single special word of the vocabulary. You could ignore non-interrupting punctuation marks such as quotation marks, collapse multi-sign marks such as triple question marks into a single mark, and so on.

Next, you want to handle numbers. If numbers do not carry important meaning in your use case, you could drop all of them. However, numbers may have significant meaning that is relevant to your use case. For instance, 3.14 is the number Pi, and 90210 is the name of a television show as well as the area code for Beverly Hills, California. You can keep such numbers in your corpus as they are. If your corpus has many unique numbers, such as many area codes, you may find that it makes more sense to replace all of them with a special token such as <NUMBER>. This lets the model know that the important thing is that the token is a number, instead of trying to distinguish between 90210 and other area codes or phone numbers.

You also need to handle special characters such as mathematical symbols, currency symbols, section and paragraph signs, online markup signs, and so on. It's usually safe to drop them.

Finally, and especially if you're working on a modern corpus of user input such as tweets or consumer reviews, you should handle special words such as emojis and hashtags, like #nlp, depending on if and how you want your model to assign meaning to them. You could, for example, treat each emoji or hashtag as an individual word.

I'll give a basic example in Python to demonstrate a few of these recommendations. The corpus in this example is "Who loves word embeddings in 2020? I do!!!", with an emoji, punctuation marks, and a number. Here's the code to import and initialize the libraries. I'm using the popular NLTK library to perform tokenization. It has a smart tokenization module named 'Punkt' that handles common special uses of punctuation. For example, it knows that the full stops, that is, the periods, in abbreviations and middle initials do not signify the end of a sentence. I'm also using the emoji library, just to give you an idea of how you could handle emojis.

Next comes the actual logic. First, using a regular expression, I'm collapsing all interrupting punctuation signs and replacing them with a full stop. The outcome here is "Who loves word embeddings in 2020. I do.", with the question mark and the exclamation marks each replaced by a single full stop. Next, I'm using NLTK's word_tokenize function to split this string into an array of tokens.
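To make this concrete, here is a minimal sketch of those first steps. The exact corpus string from the video isn't reproduced here, so the string below (with a heart emoji standing in for the emoji mentioned above) and the exact punctuation pattern are my own assumptions:

    import re
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')  # Punkt tokenizer models used by word_tokenize
    # (newer NLTK releases may also require: nltk.download('punkt_tab'))

    # Hypothetical corpus approximating the one shown in the video.
    corpus = 'Who ❤️ "word embeddings" in 2020? I do!!!'

    # Collapse interrupting punctuation (commas, exclamation and question
    # marks, semicolons, dashes) into a single full stop.
    data = re.sub(r'[,!?;-]+', '.', corpus)

    # Split the cleaned string into an array of tokens.
    tokens = word_tokenize(data)
    print(tokens)

Printing the tokens should show each punctuation sign split off as a token of its own.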
As you can see, the punctuation signs have been separated out as individual tokens, including the quotation marks. Finally, using a list comprehension, I'm keeping tokens, converted to lowercase, only if they are one of the following: alphabetic, a full stop (which is what every interrupting punctuation mark was previously replaced with), or an emoji. This gets rid of numbers such as 2020 and of unknown special characters. This is the resulting array. You can now use this array to extract center words and their surrounding context words, which we'll dive into in more detail next.

After looking at cleaning and tokenization, we are ready to move on to the next part of the continuous bag-of-words model. In the next video, we will explore the sliding window, which you can think of as a window moving over the text corpus. We will talk more about it there. See you there.
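For reference, here is a sketch of that final filtering step as I understand it from the description above. The hard-coded token list approximates what the earlier sketch would produce (quote handling can vary between NLTK versions), and emoji.is_emoji assumes a recent version of the emoji package, so treat the details as assumptions:

    import emoji  # the third-party 'emoji' package mentioned above

    # Approximate output of the tokenization step sketched earlier.
    tokens = ['Who', '❤️', '``', 'word', 'embeddings', "''",
              'in', '2020', '.', 'I', 'do', '.']

    # Keep a lowercased token only if it is alphabetic, a full stop,
    # or an emoji; everything else (numbers, quote tokens, unknown
    # special characters) is dropped.
    data = [token.lower() for token in tokens
            if token.isalpha()
            or token == '.'
            or emoji.is_emoji(token)]

    print(data)  # roughly: ['who', '❤️', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']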