This lecture is about web search. In this lecture, we're going to talk about one of the most important applications of text retrieval: web search engines. So let's first look at some general challenges and opportunities in web search. Now, many information retrieval algorithms had been developed before the web was born. So when the web was born, it created a great opportunity to apply those algorithms to a major application problem that everyone would care about. Naturally, there had to be some further extensions of the classical search algorithms to address new challenges encountered in web search.

So here are some general challenges. First, there is the scalability challenge: how to handle the size of the web and ensure completeness of coverage of all information, and how to serve many users quickly by answering all their queries. That's one major challenge; before the web was born, the scale of search was relatively small. The second challenge is that there is no guarantee of quality information, and there is often spam. The third challenge is the dynamics of the web: new pages are constantly created, and some pages may be updated very quickly, so it's harder to keep the index fresh. These are some of the challenges that we have to solve in order to deliver high-quality web search.

On the other hand, there are also some interesting opportunities that we can leverage to improve the search results. There are many additional heuristics, for example using links, that we can leverage to improve scoring. Now, everything that we talked about earlier, such as the vector space model, are general algorithms. They can be applied to any search application, so that's the advantage. On the other hand, they also don't take advantage of special characteristics of pages or documents in a specific application, such as web search. Web pages are linked to each other, so obviously the links are something that we can also leverage.

So, because of these challenges and opportunities, there are new techniques that have been developed for web search, or due to the needs of web search. One is parallel indexing and searching, and this is to address the issue of scalability. In particular, Google's invention of MapReduce is very influential and has been very helpful in that aspect. Second, there are techniques that have been developed for addressing the problem of spam, that is, spam detection. We have to prevent those spam pages from being ranked high. And there are also techniques to achieve robust ranking: we're going to use a lot of signals to rank pages, so that it's not easy to spam the search engine with one particular trick. The third line of techniques is link analysis, and these are techniques that allow us to improve search results by leveraging extra information. In general, in web search we're going to use multiple features for ranking, not just link analysis, but also all kinds of clues like the layout or the anchor text that describes a link to another page.

So, here's a picture showing the basic search engine technologies. Basically, this is the web on the left and the user on the right side, and we're going to help this user get access to the web information. The first component is a Crawler that would crawl pages, and then the second component is an Indexer that would take these pages and create the inverted index. The third component is a Retriever, and that would use the inverted index to answer the user's queries by talking to the user's browser.
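To make the indexing side more concrete, here is a minimal sketch, in Python, of how inverted index construction can be expressed in a MapReduce style: the map step emits (term, document id) pairs, the shuffle step groups pairs by term, and the reduce step merges each term's document ids into a postings list. This is only an illustration of the idea; the function names and the single-machine simulation are assumptions for this sketch, not Google's actual implementation.

```python
# A minimal sketch of MapReduce-style inverted index construction.
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit (term, doc_id) pairs for every word in one document."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_postings(term, doc_ids):
    """Reduce step: merge one term's doc ids into a sorted postings list."""
    return term, sorted(set(doc_ids))

def build_inverted_index(documents):
    """Simulate map -> shuffle -> reduce on a single machine."""
    grouped = defaultdict(list)  # shuffle: group intermediate pairs by term
    for doc_id, text in documents.items():
        for term, did in map_document(doc_id, text):
            grouped[term].append(did)
    # Reduce phase (in a cluster, each reducer would handle a subset of terms).
    return dict(reduce_postings(term, dids) for term, dids in grouped.items())

if __name__ == "__main__":
    docs = {1: "web search engine", 2: "web crawler crawls the web"}
    print(build_inverted_index(docs))
    # {'web': [1, 2], 'search': [1], 'engine': [1], 'crawler': [2], 'crawls': [2], 'the': [2]}
```

In a real cluster, the map calls would run in parallel over partitions of the crawled pages and the reduce calls in parallel over partitions of the terms, which is what makes indexing a web-scale collection feasible.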
And then the search results will be given to the user, and the browser will show those results and allow the user to interact with the web. So, we're going to talk about each of these components.

First of all, we're going to talk about the crawler, also called a spider or software robot, which does something like crawling pages on the web. Building a toy crawler is relatively easy, because you just need to start with a set of seed pages, fetch pages from the web, parse these pages to figure out new links, add those links to a priority queue, and then explore the additional links (a minimal sketch of this basic loop is shown below, after this discussion). But building a real crawler is actually tricky, and there are some complicated issues that we have to deal with. For example, robustness: what if the server doesn't respond, or what if there's a trap of dynamically generated web pages that attracts your crawler to keep crawling the same site, fetching dynamically generated pages? There's also the issue of crawling courtesy: you don't want to overload one particular server with many crawling requests, and you have to respect the robot exclusion protocol. You also need to handle different types of files; there are images, PDF files, all kinds of formats on the web. You have to also consider URL extensions, so sometimes there are CGI scripts, internal references, etc., and sometimes there is JavaScript on the page, and these also create challenges. You should ideally also recognize redundant pages, so that you don't duplicate those pages. And finally, you may be interested in discovering hidden URLs. Those are URLs that may not be linked from any page, but if you truncate a URL to a shorter path, you might be able to get some additional pages.

So what are the major crawling strategies? In general, breadth-first is most common because it naturally balances the server load; you would not keep probing a particular server with many requests. Also, parallel crawling is very natural because this task is very easy to parallelize. And there are some variations of the crawling task; one interesting variation is called focused crawling. In this case, we're going to crawl just the pages about a particular topic, for example, all pages about automobiles. This typically starts with a query: you can use the query to get some results from a major search engine, start with those results, and then gradually crawl more. One challenge in crawling is finding the new pages that people have created, and people are creating new pages all the time. This is very challenging if the new pages are not linked from any old pages; if they are, then you can probably find them by re-crawling the old pages. So these are also some interesting challenges that have to be solved.

And finally, we might face the scenario of incremental crawling, or repeated crawling. Let's say you want to build a web search engine, and you first crawl a lot of data from the web. But then, once you have crawled all the data, in the future you just need to crawl the updated pages. In general, you don't have to re-crawl everything; it's not necessary. So in this case, your goal is to minimize the resource overhead by using minimum resources to crawl just the updated pages.
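Here is the minimal toy crawler sketch referenced above, using only the Python standard library. It is just an illustration of the basic breadth-first loop under simplifying assumptions (no robots.txt handling, only a fixed per-request delay for courtesy, and no detection of near-duplicate pages or crawler traps); the function and variable names are assumptions for this sketch, not a standard crawler API.

```python
# A toy breadth-first crawler: seed pages -> fetch -> parse links -> enqueue new links.
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=20, delay=1.0):
    frontier = deque(seed_urls)      # breadth-first frontier of URLs to visit
    visited = set()
    pages = {}                       # url -> raw HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                 # robustness: skip pages that fail to fetch
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)          # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
        time.sleep(delay)            # crude courtesy: pause between requests
    return pages
```

A real crawler would layer the issues discussed above on top of this loop: robots.txt parsing (for example with urllib.robotparser), per-host request scheduling, content-type handling, and detection of duplicates and crawler traps.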
Now, incremental crawling is actually a very interesting research question, and it's an open research question, in that there aren't many standard algorithms established yet for doing this task. But in general, you can imagine you can learn from past experience. The two major factors that you have to consider are, first, will this page be updated frequently, and do I have to crawl this page again? If the page is a static page that hasn't been changed for months, you probably don't have to re-crawl it every day, because it's unlikely that it will change frequently. On the other hand, if it's a sports score page that gets updated very frequently, you may need to re-crawl it, maybe even multiple times on the same day. The other factor to consider is, is this page frequently accessed by users? If it is, then it's a high-utility page, and thus it's more important to keep such a page fresh. Compare that with another page that has never been fetched by any user for a year: even if that page has changed a lot, it's probably not that necessary to re-crawl it, or at least it's not as urgent as maintaining the freshness of pages frequently accessed by users. (A small sketch of one way to combine these two factors appears at the end of this lecture.)

So to summarize, web search is one of the most important applications of text retrieval. There are some new challenges, particularly scalability, efficiency, and information quality. There are also new opportunities, particularly rich link information, layout, etc. A crawler is an essential component of web search applications, and in general there are two scenarios. One is initial crawling, where we want a complete crawl of the web if you are building a general search engine, or a focused crawl if you want to target just a certain type of pages. And then there is another scenario, incremental updating of the crawled data, or incremental crawling. In this case, you need to optimize the resources: try to use minimum resources to get just the updated pages.
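Since the lecture notes that there is no standard algorithm for incremental crawling, the following is only a hypothetical sketch of how the two factors above, the estimated change rate of a page and how often users access it, might be combined into a re-crawl priority score. The weights, the linear combination, and all names here are assumptions for illustration.

```python
# Hypothetical re-crawl prioritization: score each page by how often it changes
# and how often users access it, then re-crawl the top pages within a budget.

def recrawl_priority(change_rate, access_rate, w_change=1.0, w_access=1.0):
    """change_rate: observed changes per day; access_rate: user accesses per day."""
    return w_change * change_rate + w_access * access_rate

def select_pages_to_recrawl(page_stats, budget):
    """page_stats: {url: (change_rate, access_rate)}; budget: max pages to re-crawl."""
    ranked = sorted(page_stats,
                    key=lambda url: recrawl_priority(*page_stats[url]),
                    reverse=True)
    return ranked[:budget]

if __name__ == "__main__":
    stats = {
        "sports-scores.html": (10.0, 500.0),   # changes often, accessed often
        "static-about.html": (0.01, 2.0),      # rarely changes, rarely accessed
        "news-archive.html": (0.5, 50.0),
    }
    print(select_pages_to_recrawl(stats, budget=2))
    # ['sports-scores.html', 'news-archive.html']
```

Under a fixed crawling budget, a frequently changing and frequently accessed page like a sports score page would be re-crawled first, while a static page that nobody visits would be re-crawled rarely, which matches the intuition discussed in the lecture.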