You may recall the end-to-end NLP workflow from the last module and the three major stages in developing an NLP project: data preparation, which is similar to preparing raw ingredients; model training, which is like experimenting with recipes; and model serving, which is similar to serving the meal on the table.

In the first stage, data preparation, you must engineer the data for model training. As you know, a computer only takes digits, and you can only feed an NLP model with numbers. Here you encounter a significant challenge for NLP: how to represent text in a numeric format while retaining its meaning. To better understand the challenge of text representation in NLP, let's compare text data with other data types, such as tabular, image, and audio. Tabular data might be the easiest to feed into an ML model, because most of it is already numeric, and the columns which are not numbers can easily be converted into numeric values. How about image data? How can you convert an image into numbers? You can take advantage of pixels: each cell in the matrix of pixels represents the intensity of the corresponding pixel in the image. How about audio? How do you convert a sound into numbers? Yes, you can use waves. You can sample a wave and record its amplitude, the height. The audio can then be represented by an array of amplitudes sampled at fixed time intervals. How about text? Can you think of a way to turn a sentence into numbers? The answer is not that obvious.

Well, let's divide feature engineering in NLP into smaller steps. Please note two things. One, you're equipped with many NLP libraries, such as TensorFlow and Transformers, that hide the details of these steps; you only need to call the right functions at the right time. The following explanation uncovers how these libraries work and assumes you might want to build your own NLP applications from the beginning. Two, there's no unified way to name the steps in NLP. To avoid confusion, always relate the terms to the specific functions.

Assuming you already uploaded raw text, you'll then tokenize the text, which basically means dividing the text into smaller language units, such as words. This is how a computer reads text. After that, you'll preprocess the language units, for example, by only keeping the root of each word and removing punctuation. You'll then turn the preprocessed language units into numbers that represent some meaning. This step is often called text representation, and it's where a computer understands text in addition to reading it. The output of text representation is normally vectors that can be fed into ML models to solve specific tasks.

Before exploring different techniques for text representation and various NLP models, let's start this lesson with tokenization and explore how a computer reads text. Tokenization is the first step to prepare text for ML models. It aims to divide text into smaller language units called tokens. For example, tokenization will split the sentence "a dog is chasing a person" into separate words. This step is often overlooked and underappreciated, simply because English is easy to tokenize with a delimiter such as a whitespace. However, take a moment to think about it, and you'll find the problem is not as obvious as it looks. First of all, what about other languages, such as Chinese? In this example, 一只狗在追一个人, which is the Chinese translation of "a dog is chasing a person," there's no space between characters. How do you split the sentence?
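Before turning to that question, it may help to see the English case in code. Below is a minimal Python sketch of the three steps just described, tokenization, preprocessing, and text representation, applied to the example sentence. The toy vocabulary and the integer ids it produces are purely illustrative; real NLP libraries use much richer representations, which are introduced later in this lesson.

```python
import string

sentence = "A dog is chasing a person."

# Step 1: tokenization -- split the raw text into word-level tokens.
tokens = sentence.split()          # ['A', 'dog', 'is', 'chasing', 'a', 'person.']

# Step 2: preprocessing -- lowercase each token and strip punctuation.
cleaned = [t.lower().strip(string.punctuation) for t in tokens]
# ['a', 'dog', 'is', 'chasing', 'a', 'person']

# Step 3: text representation -- map each token to a number using a toy
# vocabulary built from the tokens themselves.
vocab = {token: index for index, token in enumerate(sorted(set(cleaned)))}
ids = [vocab[token] for token in cleaned]
print(ids)                         # [0, 2, 3, 1, 0, 4]
```

Notice that the whitespace split in step 1 is exactly what breaks down for a language like Chinese, which is the problem raised above.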
To solve this problem for languages without whitespace delimiters, people have developed different tokenization strategies and tools for different languages.

Second, how do you define smaller language units, for example, in English? This is an excellent question. Smaller language units, which are called tokens in tokenization, can exist at different levels. For example, character tokens split the text at the character level; for instance, dog is split into d, o, g. Subword tokens split the text at the subword level, such as word roots and affixes; for example, chasing is split into chase and ing. Word tokens split the text by whitespaces. Phrase tokens split the text by phrases, for example, "a dog" and "is chasing". Finally, sentence tokens split the text by punctuation. Word tokenization is the most commonly used algorithm for splitting text. However, each tokenization level has its own advantages and disadvantages, and the choice of tokenization type mainly depends on the NLP libraries and the NLP models you're using.

After tokenization, you must further prepare the text; this is called preprocessing. There are different things you can do in this step. For example, lowercasing: lowercase all the text data. Stemming: reduce words to their root form; for example, keep only run for running, ran, and runs. Stopword removal: remove low-information words such as "a" and "the". Stopwords are a set of commonly used words in a language that carry little information, for example, in English: a, the, is, are, etc. Normalization: transform text into a standard form; for example, the word LOL can be converted to laugh out loud, TMR to tomorrow, and goooooood to good. This is especially useful in social media settings.

Again, you have various NLP libraries to help you complete these preprocessing tasks automatically. For example, TensorFlow provides a text preprocessing layer through its TextVectorization API. It maps text features to integer sequences, covering functions such as preprocessing, tokenization, and even the vectorization that will be introduced later. Using this API, you can do all the text preparation work in one place, as the sketch below illustrates.
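Here is a rough sketch of what that looks like in practice, assuming TensorFlow 2.x; the two example sentences and the output sequence length of 8 are made up for illustration.

```python
import tensorflow as tf

# Toy corpus of raw strings (illustrative only).
corpus = tf.constant([
    "A dog is chasing a person.",
    "The dog runs and the person runs.",
])

# One layer that preprocesses, tokenizes, and vectorizes the text.
vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",  # preprocessing
    split="whitespace",                         # word-level tokenization
    output_mode="int",                          # map each token to an integer id
    output_sequence_length=8,                   # pad/truncate to a fixed length
)
vectorizer.adapt(corpus)              # build the vocabulary from the raw text

print(vectorizer.get_vocabulary())    # e.g. ['', '[UNK]', 'the', 'runs', 'person', 'dog', 'a', ...]
print(vectorizer(corpus).numpy())     # integer sequences ready to feed into a model
```

Because TextVectorization is a Keras layer, it can also be placed at the front of a model, so the same preparation steps are applied consistently during both training and serving.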