Hello everyone, welcome to Big Data and Language. In the previous lectures, we talked about the fourth industrial revolution and big data, and about intuitions versus big data. And today we will finally talk about data and language, okay?
So first of all, what is the relationship between big data and language, and how is big data used for language? Let's think about these one by one. My first question is, why is big data used to understand language? Have you ever thought about that? What is your opinion? Okay, I think this is a good discussion question, so maybe we can have a good discussion later on the board.
But let me give you some reasons. The first one is, even experts have only a partial knowledge of language. However, big data can be more comprehensive and balanced, do you agree? And also, big data can show us what is common and typical, okay? So that's why we use big data.
All right, so now let's move to the second reason why we use data for understanding language. Even experts cannot quantify their knowledge of language. However, big data can provide us with accurate statistics. As I mentioned in the previous lecture, what is the frequency of the word "the" in a certain genre or a certain set of text data, right? Maybe even experts cannot come up with a specific number. However, based on the data, we can find the frequency of the word "the", right?
And the third reason is, even experts cannot remember everything they know, right? However, big data can store and recall all the information. If you have a really good computer and you store big enough data of good quality, then maybe you don't need to remember everything, right? You just go to the data and find that certain information, okay?
And the fourth reason is, even experts cannot make up natural examples. Maybe they can, but they cannot produce one every single time, right?
However, big data can provide us with a vast number of examples in authentic contexts. For example, suppose you have big data made of newspaper articles, right? If you want to find a certain word such as "invade", you can just go and type "invade" and find all the news articles that include the word, okay? Maybe those will mostly be newspaper articles about wars.
Let's move to the next reason. The fifth reason is, even experts have prejudices and preferences. However, big data can give you more objective evidence, which means big data is less biased, as long as you have balanced data, okay? We will talk about data quality in the later lectures, when we talk about data collection.
And the sixth reason is, even experts are not always available to be consulted. If you have a question, you cannot always find an expert, right? If you have a question during the weekend, for example, then maybe it's hard to contact the expert and get the answer right away. However, big data can be made permanently accessible to all, okay? So you can use big data whenever you want.
The seventh reason is, even experts cannot keep up with language change. However, constantly updated big data can reflect even recent changes. So for example, how is a specific word used currently compared with 30 years ago, right? Maybe experts cannot give the answer right away. However, if you have historical data, then you can go and compare the two different datasets, and you will notice how the use of that word has changed, okay? So if you are curious, I will introduce historical big data in the future lectures.
Okay, and let's move to the eighth reason. The eighth reason is, even experts lack authority. They can be challenged by other experts, right? If somebody says that the most frequent word in the text data is "the", then maybe some people will say, no, I don't think so, so what is your evidence?
Then, if that expert does not have any quantitative data, maybe they cannot give a good answer. However, if that expert has the data, then he or she can show that data as the evidence for his or her answer, okay? In other words, big data can encompass the actual language use of many expert speakers.
Okay, so now let me give you an example of using big data to understand language. Let me give you an interesting question: what would be the most common nouns in English? Nouns are words like camera, television, cell phone, book, pencil, something like that, right? So what would be the most common nouns in English? What is your answer?
Okay, so let's check whether your answer is the same as what I found from the data. Let me introduce one dataset, which is the BNC corpus, the British National Corpus. If you type corpus.byu.edu/bnc/, then you will see the dataset. I'm going to explain more about the BNC later, but let me explain it briefly now. This edition of the dataset was released in 2007. And how big is it? Okay, 100 million words. You may not be able to imagine what 100 million words looks like. That is almost the same as 400,000 pages, if we assume about 250 words per page, okay? So that is similar to about 4,000 books, if each book has about 100 pages. Of course, depending on the number of pages in each book, that could be different, but this is just for your better understanding, okay? So let's imagine that this one dataset is based on 4,000 books.
So what does this big data show? What are the most common nouns? The answer is, the top ten nouns include time, people, way, years, work, government, day, man, world, something like that, okay? Maybe your intuitions are actually similar to the findings from the data. If you have similar answers, you may have good intuitions. But if you do not have the same answers, or you have totally different answers, that's totally fine. That's why we need the data.
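As a minimal sketch of how such counts could be made from data, the Python example below ranks nouns by frequency and computes how often the word "the" occurs per 1,000 words. Everything in it is an assumption for illustration: `SAMPLE_TEXT` is a tiny made-up text rather than the BNC, `NOUNS` is a small hand-picked word list standing in for real part-of-speech tagging, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy sample standing in for a real corpus such as the BNC.
SAMPLE_TEXT = (
    "People say that time changes the way people see their work. "
    "Every day the government talks about work and time. "
    "In time, the world finds a way."
)

# Hypothetical hand-picked noun list; a real study would rely on
# part-of-speech tags instead of a fixed word list.
NOUNS = {"time", "people", "way", "work", "government", "day", "world"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_nouns(text, nouns, n=3):
    """Return the n most frequent tokens in text that appear in the noun set."""
    counts = Counter(tok for tok in tokenize(text) if tok in nouns)
    return counts.most_common(n)


def relative_frequency(text, word):
    """Return how many times word occurs per 1,000 tokens of text."""
    tokens = tokenize(text)
    return 1000 * tokens.count(word) / len(tokens)


if __name__ == "__main__":
    print(top_nouns(SAMPLE_TEXT, NOUNS))
    print(round(relative_frequency(SAMPLE_TEXT, "the"), 1))
```

With a real corpus the same idea applies, only the text is much larger and the nouns are identified by a tagger rather than a word list. Reporting the frequency per 1,000 words, rather than as a raw count, is what makes numbers from datasets of different sizes comparable.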
Okay, so let me show you one figure on how big data can be used for understanding language. Big data can be based on many different types of text data, such as advertisements, dictionaries, subtitles, ebooks, manuals, letters, articles, tests, scripts, newspapers, or memos. There are so many different genres and so many different types of text data, right? Depending on your research question, you may want to compile all of those text data, or you might want to focus on a certain genre, right?
So collecting big data is pretty important, but the more important thing is the purpose of using that data. Depending on the purpose, you may need to collect different types of data. So that's why data is pretty useful and important for understanding language, but at the same time you should be very careful when you use or collect data for understanding language. I'm going to explain more about how you can collect balanced, accurate, and reliable data for understanding language in the future lectures, so don't worry about it.
Okay, let's stop here today. Today we've talked about data and language. And next class we will talk about recent research on big data and language. Thank you for your attention.