In the last lesson, you were introduced to the general steps for preparing text data for model training. You focused on tokenization, which divides text into smaller language units such as words, and you also briefly explored the other preprocessing tasks. Preprocessing generates a set of clean language units that serve as the input to text representation. If tokenization is how a computer reads text, text representation solves the problem of how a computer understands text. It is the focus of this module and the question you were asked at the beginning: how do you represent text in a numeric format while retaining its meaning?

This challenge can be split into two sub-problems. First, how do you turn text into numbers that retain meaning? For example, the numbers should indicate relationships between words, such as similarity and difference. Second, how do you turn text into numbers that can be fed into an ML model? For example, an ML model normally requires the input to be a relatively dense matrix or set of vectors; an overly sparse matrix might lead to model overfitting.

Any ideas? Dear learners, I know you would think of brilliant ideas such as Morse code. Morse code uses dots and dashes to represent characters. It's one of the earliest and simplest methods used to turn human language into a binary format that a machine could work with. You might also think of other ideas from your computer science classes, such as using ASCII codes to represent characters. You are not wrong: both Morse code and ASCII can be used to turn text into digits. So why can't you use them in modern NLP? Well, think about the two sub-problems of text representation. First, the representation must convey some meaning. Neither Morse code nor ASCII conveys any meaning beyond the raw text. Reading the dots and dashes, a computer would never be able to comprehend that "bank" in "riverbank" is different from "bank" in "bank robber". Additionally, both Morse code and ASCII operate at the character level, so the vectors they generate can be gigantic and inefficient to feed into an ML model.

Let's walk through the three major categories of modern text representation, from simple approaches to state-of-the-art techniques: basic vectorization, word embeddings, and transfer learning.

Let's start with the basic vectorization techniques. The first idea is called one-hot encoding. With this technique, you one-hot encode each word in your vocabulary. Consider the sentence "A dog is chasing a person." After tokenization and preprocessing, the sentence can be represented by three words: dog, chase, person. To turn each word into a vector, you must first create a vector whose length equals the size of the vocabulary. Assume you have a vocabulary that includes six words: dog, chase, person, my, cat, and run. Then place a one at the position that corresponds to the word and zeros at the rest of the positions. In the example, the left side represents the words from the sentence and the top represents the vocabulary. What would be the vector for dog? Correct, that would be 100000. Then what would be the vector for chase? Of course, 010000. And what would be the vector for person? You got it, 001000. By conducting one-hot encoding, you convert the sentence "A dog is chasing a person" into a matrix that an ML model can take as input; a short code sketch follows this walkthrough. What are the benefits of one-hot encoding? It's intuitive to understand and easy to implement. However, let's also acknowledge the disadvantages.
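Before turning to those disadvantages, here is a minimal sketch of one-hot encoding in Python, assuming the toy six-word vocabulary from the example. The helper name `one_hot` is chosen purely for illustration.

```python
# Toy vocabulary and tokens from the lesson's example.
vocabulary = ["dog", "chase", "person", "my", "cat", "run"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of vocabulary length with a 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# Tokens left after tokenizing and preprocessing "A dog is chasing a person."
tokens = ["dog", "chase", "person"]
matrix = [one_hot(token) for token in tokens]

for token, vector in zip(tokens, matrix):
    print(token, vector)
# Prints:
# dog [1, 0, 0, 0, 0, 0]
# chase [0, 1, 0, 0, 0, 0]
# person [0, 0, 1, 0, 0, 0]
```

Each row of `matrix` corresponds to one row of the one-hot matrix shown in the example, with one column per vocabulary word.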
Recall the two sub-challenges of text representation. One-hot encoding has two main issues, among others. First, this representation does not convey any relationships between words: the vectors that represent the words are all equally distinct from each other. Second, the generated matrix is high-dimensional and sparse, which can lead to overfitting of the ML model. The dimension of each vector depends on the size of the vocabulary, which can easily be tens of thousands, and most of the values in each vector are zeros, which means this is a super sparse representation. Imagine you have 10,000 words in the vocabulary. To one-hot encode each word, you would create a matrix where 99.99 percent of the elements are zero.

Another method of text representation is called bag-of-words. You first collect a "bag" of words from the text in your NLP project to build your vocabulary, or dictionary. For example, you might have a vocabulary that includes six words: dog, chase, person, my, cat, and run. To represent the sentence "A dog is chasing my dog," you create a vector whose length is equal to the size of the vocabulary, then place at each position a value that represents the frequency with which the word appears in the given document. For example, dog appears twice, therefore it's a two. Chase occurs once; that's a one. Person never appears, therefore it's a zero. My appears once, and that's a one. Cat and run each get a zero. Now you get the resulting vector: 210100. Please note that sometimes you might not care about the frequency, but only the occurrence of the words; in that case you can simply use one and zero to represent whether a word exists in the text. By conducting bag-of-words, you can convert the sentence "A dog is chasing my dog" into a vector that an ML model can take as input; a code sketch at the end of this lesson shows the same computation.

Let's look at the advantages of bag-of-words. Similar to one-hot encoding, it's intuitive to understand and easy to implement. Compared to one-hot encoding, it offers two improvements. First, this representation captures some semantic similarity of texts, though very limited: if two sentences share similar vocabulary, the two vectors that represent them are close in the vector space, and they might have similar meanings. Second, the generated matrix is less sparse than with one-hot encoding.

Some disadvantages of bag-of-words also need to be addressed. First, bag-of-words still produces high-dimensional and sparse vectors: the dimension of the vector increases with the size of the vocabulary, so sparsity remains a problem. Second, although it captures some semantic similarity between sentences, it still falls far short of capturing the relationships between words. For example, bag-of-words does not consider the order of the words, and that is why it's called a "bag" of words. "A person is chasing a dog" and "a dog is chasing a person" would have the same representation in bag-of-words despite their opposite meanings.

In summary, the basic vectorization methods, such as one-hot encoding and bag-of-words, are not ideal. Two major problems remain unsolved: first, the high-dimensional and sparse vectors, and second, the lack of relationships between words. Let's explore the breakthrough in text representation, word embeddings, in the next lesson and discuss how they conquer these two key problems.
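As a companion to the one-hot sketch above, here is a minimal bag-of-words vectorizer in Python, again assuming the toy vocabulary from the example. The helper name `bag_of_words` and its `binary` flag are illustrative, not a standard library API.

```python
from collections import Counter

# Same toy vocabulary as in the lesson's example.
vocabulary = ["dog", "chase", "person", "my", "cat", "run"]

def bag_of_words(tokens, binary=False):
    """Count how often each vocabulary word appears in the token list.

    With binary=True, record only presence (1) or absence (0).
    """
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Tokens left after tokenizing and preprocessing "A dog is chasing my dog."
tokens = ["dog", "chase", "my", "dog"]
print(bag_of_words(tokens))               # [2, 1, 0, 1, 0, 0]
print(bag_of_words(tokens, binary=True))  # [1, 1, 0, 1, 0, 0]
```

Setting `binary=True` corresponds to recording only the occurrence of each word rather than its frequency, as mentioned in the walkthrough above.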