Last time we talked about the characteristics of problems for which we need to collect data to solve those problems. This time, we'll talk about deciding what data we need to collect to solve a particular problem. Let's talk about two kinds of problems. The first kind of problem is one that we can solve by answering a set of questions. So, what we need to do to decide what data to collect is figure out the questions we need to answer, and then figure out what data we need to collect to answer those questions. Once we've done that, we should be okay to move on to the next step. So, one example is picking a cell phone plan. If you're deciding which cell phone plan to use, then you need to figure out what questions you need answered. Do you need to know how much it costs a month? Do you need to know which phones are supported by that particular plan? Do you need to know about download speeds, and whether those download speeds are throttled after you reach a particular amount of downloading in a particular month? So, there is a set of questions that you want to answer so that you can compare different cell phone plans. What you need to do then is collect the data to answer those questions. This one's pretty straightforward, right? The data you collect is how much it costs per month, what the download speeds are, and those kinds of things. So, the trick is, figure out what questions matter to you, and then figure out what data you need to collect to answer those questions. Another example is buying a computer. So, some of us are pretty geeky and we'll say, "No, here's what we're going to buy. We need an 11 gig graphics card and we've got to have at least 64 gig of RAM," and all kinds of stuff like that, but other people buy computers off the shelf. So you may not want to custom build your rig. You may actually just want to say, "Okay. Well, what can I buy here?" So, there are pieces of information that you might care about.
You have questions like how fast is the CPU, how much memory does it have, how good is the graphics card, and then you collect data about which CPU the computer has, how much RAM it has, and the characteristics of the graphics card, so that the data you collect will actually help you answer those questions. The final example on this slide has nothing to do with buying something, right? So, we may be interested in knowing about the relationship between population growth and crime rate. So, is it true that as the population grows in a particular location, the crime rate also grows in that location? Now, as we gather data about two different pieces of information and how they're related to each other, all we can do is say that they are positively or negatively related to each other. We can't talk about causality. It could be the case, as strange as it sounds, that as crime rate grows in a particular location, people decide they want to move there. That doesn't make much sense. But statistically speaking, we can't actually infer causality just from relationships. But we might care about the relationships anyway. So, the questions we'd want to answer are, for some subset of all the places in the world, how has the population grown over a certain period of time, and how has the crime rate grown over that same period of time? If we gather a set of data points with those two pieces of information for a variety of locations, then we can explore the relationship between population growth and crime rate. So that's one kind of problem we might want to gather data to solve: the kind where, if we answer a bunch of questions, then we can solve the problem. Another kind of problem we might need to solve by collecting data is one where we want to build something. We want to build an application. So, one example is Mapquest.
The same applies to Google Maps if that happens to be your favorite mapping thing. But for Mapquest, we need to collect a bunch of information, right? We need to collect, of course, the location you're trying to go from and the location you're trying to go to. But we also need to have information about those geographic locations; you might use GPS coordinates or something like that, right? We need the locations of each intersection, so that you can tell people where to turn and so on. You might need more complicated things like construction, what is currently under construction, so that when you generate directions for somebody to go from one place to another, you actually know, well, don't send them down this road, it's down to one lane and there are lots of delays and so on. So, Mapquest needs the stuff from paper maps, right? It needs the layout of the streets and so on, but it also needs much more than that. It needs to know the other information that we've talked about. I mentioned Google Street View in the previous lecture. So for Google Street View, the data that we need to actually collect to implement it includes, of course, the pictures of lots of streets. But you also need to know the GPS coordinates. Let's just say GPS coordinates because that's pretty much the standard thing we use now for location. So, we need the GPS coordinates of each place where we've taken a picture, so we can merge all the pictures together and they're geographically accurate. We need to know which way the camera is facing, because in Street View you can spin your view and see different things, and so we need to match the amount of spin that the user has decided to apply to look to the side with the appropriate camera rotation in degrees. Now of course with sophisticated cameras, you're getting 360-degree views, but you still need to know which images are associated with which amount of rotation compared to straight ahead.
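To make the camera-direction idea a little more concrete, here's a minimal sketch in Python. It is not how Google Street View actually works; it only assumes that each picture was stored with GPS coordinates and a heading in degrees, where 0 means the camera's forward direction. The names `StreetImage`, `angular_difference`, and `closest_image`, the coordinates, and the file names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreetImage:
    """One picture plus the data collected with it (hypothetical record layout)."""
    latitude: float
    longitude: float
    heading_degrees: float  # assumed convention: 0 = forward, increasing clockwise
    filename: str


def angular_difference(a: float, b: float) -> float:
    """Smallest angle in degrees between two headings, handling wraparound at 360."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def closest_image(images: list[StreetImage], requested_heading: float) -> StreetImage:
    """Pick the image whose recorded heading is nearest to what the user asked for."""
    return min(images, key=lambda img: angular_difference(img.heading_degrees, requested_heading))


# Four made-up pictures taken at the same made-up location.
images = [
    StreetImage(38.8339, -104.8214, 0, "front.jpg"),
    StreetImage(38.8339, -104.8214, 90, "right.jpg"),
    StreetImage(38.8339, -104.8214, 180, "back.jpg"),
    StreetImage(38.8339, -104.8214, 270, "left.jpg"),
]

# The user has spun the view to 350 degrees, which is only 10 degrees from forward.
print(closest_image(images, 350).filename)  # front.jpg
```

The point of the sketch is that the lookup only works if the heading was collected along with each picture; without it, there's nothing to match the user's spin against.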
So, there's a bunch of data we need to collect to actually do Google Street View. Now for the final example of building an application. I know all these applications I'm talking about are navigation things; those aren't the only kinds of problems where we need data to build applications. They're just the ones I thought of. But if we're building a navigation app, something that you can install on your phone to follow directions from one place to another, you need the information that Mapquest needs. You need to know the locations of each of the intersections. You need to know how everything's laid out so you can make a reasonable suggestion for how to get from here to there. But you also need more. You need the current location, for example, of the person who is driving along, and for some navigation applications, and for Mapquest too for that matter, you also should have the speed limit for particular locations along the way. I forgot this when I was talking about Mapquest, but for Mapquest you need to know the speed limits so you can estimate the time of travel. You can't just assume everyone can drive 100 miles an hour; you have fast stretches and slow stretches. So you need that information for navigation applications if they're going to give you an estimated time of arrival, as many navigation apps do. The app needs to know how long it is going to take you to drive legally from one place to another. To recap, we talked about two different kinds of problems that we can solve by collecting some data and then doing some work with that data. The first kind of problem is one for which, if we can answer a bunch of questions and collect the data to answer those questions, then we can solve that problem. The other kind of problem is one where we need to actually collect data to build something, to build some application that uses that data.
We went through a number of examples about those kinds of problems as well.
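As one small illustration of the speed-limit point from the navigation examples, here's a rough sketch in Python of estimating travel time. It assumes the route has already been split into segments and that we collected a length in miles and a speed limit in miles per hour for each one; the function name `estimated_travel_minutes` and the route numbers are made up. Real navigation apps also account for things like traffic and stops, which this ignores.

```python
def estimated_travel_minutes(segments):
    """Estimate driving time, assuming the driver goes exactly the speed limit.

    segments: list of (length_miles, speed_limit_mph) pairs, one per stretch of road.
    """
    # time = distance / speed for each stretch, summed over the whole route
    total_hours = sum(length / limit for length, limit in segments)
    return total_hours * 60


# A made-up route: a slow stretch, a fast stretch, and a medium stretch.
route = [
    (2.0, 30),   # 2 miles at 30 mph  -> 4 minutes
    (10.0, 60),  # 10 miles at 60 mph -> 10 minutes
    (1.5, 45),   # 1.5 miles at 45 mph -> 2 minutes
]

print(round(estimated_travel_minutes(route), 1))  # 16.0
```

Notice that a single speed for the whole trip would give the wrong answer here: the 13.5 miles take 16 minutes, because each stretch contributes its own time based on its own limit.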