[MUSIC] Hello again everyone. In this video we're going to take a step back into the broad concept of uncertainty and error. We'll discuss this over a few lectures because it's a relatively big concept that affects everything that you do in GIS. In this lesson, you'll learn about the broad types of uncertainty affecting geographic data, and then we'll specifically talk about uncertainty in how we conceptualize our data, so that we can make appropriate choices when we design data collection or analysis workflows. These topics are simultaneously mundane and fascinating, so I'll try to give you lots of examples to keep you interested. But make sure to pause the video every so often to think about how it applies to your own work.

When I say uncertainty and error, I'm talking about factors that affect the quality of our data and our ability to trust that the data we generate accurately represents reality in a given area. We'll keep coming back to the concept of representing reality, because GIS data is often accepted as truth, but really it's just one representation of it. We need to understand how errors and uncertainty enter our data, to know how we can legitimately assess and analyze that data, so that we get answers about real places and not just answers about the data itself.

Before we dive in, let me give you an example from my own work. Terrain data, or digital elevation models, can be incredibly important for hydrological work because they help us answer that critical question of, where does the water flow? But a digital elevation model, or DEM for short, is just an approximation of the landscape, often having just one raster cell defining the terrain height every 10 to 90 meters. When we process DEMs to find downhill paths for water, which I'll sketch in a small example in a moment, we accumulate error in them based on the uncertainty of the data. In hilly, changing landscapes we're likely to get the broad picture of where the water outlets and river confluences are, but the processing can introduce error at the local scale, to the point that you often can't be certain about the specific results without significant data validation. Rivers may show as flowing hundreds of meters away from where they actually are. If my goal is to know about the upstream areas at a location, this is probably fine. But if my goal is to get specific locations of rivers for my geoprocessing, then I can't trust the data without further validation. I could use it for preliminary work, but for final products I'd either need to understand the introduced error so I can temper my results accordingly, or understand it and then correct it entirely.

When we talk about uncertainty in geographic phenomena, we can group it broadly into four categories. First, we have uncertainty in our conception of the phenomenon we're measuring; maybe we don't clearly understand what we're looking for yet. Second, we have uncertainty in measurement itself, such as error introduced by our instruments. Third, we have uncertainty introduced by our choice of how to represent the data, such as using raster or vector data. And last, we have uncertainty introduced by our choices in analysis, such as using data available at a different scale to represent our items of interest.
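To make that DEM example a little more concrete, here's a minimal sketch of the kind of flow-routing step I described, assuming a tiny made-up elevation grid and the common D8 (steepest-descent) rule. Real tools also handle flats, pits, and edge effects, which is exactly where much of the local uncertainty creeps in.

```python
import numpy as np

# A tiny, made-up DEM (elevations in meters). Each cell might represent 10-90 m on the ground.
dem = np.array([
    [50.0, 48.0, 47.0, 46.0],
    [49.0, 47.0, 45.0, 44.0],
    [48.0, 46.0, 43.0, 42.0],
    [47.0, 45.0, 42.0, 40.0],
])

# The eight neighbor offsets (row, col) used by the D8 rule.
offsets = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def d8_direction(r, c):
    """Return the (row, col) offset of the steepest-descent neighbor, or None for a pit or flat."""
    best, best_slope = None, 0.0
    for dr, dc in offsets:
        rr, cc = r + dr, c + dc
        if 0 <= rr < dem.shape[0] and 0 <= cc < dem.shape[1]:
            distance = np.hypot(dr, dc)                   # 1 for cardinal neighbors, ~1.41 for diagonals
            slope = (dem[r, c] - dem[rr, cc]) / distance  # positive means downhill
            if slope > best_slope:
                best, best_slope = (dr, dc), slope
    return best

print(d8_direction(1, 1))  # (1, 1): this cell drains to its southeast neighbor
```

Small elevation errors can flip which neighbor wins at a given cell, and those locally wrong choices accumulate downstream, which is how a derived river can end up hundreds of meters from the real one.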
So, let's start with the first part of the research process: our conception of our problem. To start with, we need repeatable, objective definitions of our phenomenon of interest so that we can acquire data on it. That's harder than it sounds. For example, who is our customer? What are his or her interests? What characteristics define a neighborhood? What specifically constitutes a wetland? How many oaks, and at what density, does it take for something to be classified as oak woodland versus grassland? This concept translates even further into how we score attributes: what specifically defines a grade of A versus B? What does a grade of unacceptable, for levee protection of a floodplain, really mean?

We need to understand the tradeoffs we make here. We are necessarily generalizing a population into some sort of aggregate unit, and we can't keep all of the information about the individuals. We're grouping them so that we can study them as a unit, but individuals have diversity. So we define the characteristics of interest for our study, but also keep in mind the tradeoffs of the characteristics we're setting aside, so that we can understand the limitations of our analysis. We can't infer things about the information we set aside.

Taking this uncertainty in conception even further, let's consider the case where we know what we want to measure, but we don't have the instrumentation, the time, or the resources to study our item of interest directly. Instead, we measure something else that we have the ability to measure and that we know to be correlated in some way with our item of interest. Then we use that information to make inferences about the phenomenon of interest. Think of it as measuring something by proxy. It can introduce significant error, but it can also be very necessary. Maybe you need to know how many cars are in a particular city but can't conduct an actual census. You have the population information for the city and the number of households, and you happen to know the statewide average for car ownership based on prior research. You can use that information to infer car ownership, but we've introduced error, since we've now assumed that our city is average rather than at some other point along the spectrum. Similarly, maybe we want to know where amphibians are, but going and finding them directly is time consuming. Instead we use a proxy, assessing the wetness of a location throughout the year using a sensor, to infer whether or not it is suitable habitat. In this case, we've substituted suitable habitat for the amphibians themselves and introduced uncertainty into our data.

We also run into problems with regionalizing our data into polygons. What combination of characteristics defines a zone? When does something that continuously varies stop being in one group or polygon and start being in another? Do we use size thresholds or weighting schemes? Do we use fuzzy analysis, where we keep some of this gradation, or sharp analysis, where we draw clear lines? I'll sketch that difference in a small example in a moment. Do we even have the data we need to assign or bin data into these groups? Sometimes this is clear, as with property ownership: if we draw ownership boundaries, we usually have clear rules about who owns an area. But what if we tried to classify the landscape instead? Maybe we want to define suitability for a particular use. If you're an urban planner, you might question whether to represent a land use such as industrial activity as a single type on a map, or whether to break it into more specific zones defining specific types of activities. Similarly, where do we draw the boundaries for a neighborhood? Different groups occupy different areas, this continuously varies, and it's hard to know where one neighborhood ends and another begins.
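Here's the small example I mentioned for the fuzzy-versus-sharp question, using the wetness proxy from a moment ago. The wetness index values, the cut-off, and the membership ramp are all invented for illustration; the point is how much of the gradation each choice keeps.

```python
import numpy as np

# Hypothetical annual wetness index for five locations (0 = always dry, 1 = always wet).
wetness = np.array([0.05, 0.18, 0.22, 0.35, 0.60])

# Sharp (crisp) rule: a single cut-off decides suitable versus unsuitable habitat.
suitable_sharp = wetness >= 0.2
print(suitable_sharp)           # [False False  True  True  True]

# Fuzzy rule: membership ramps from 0 at a wetness of 0.1 up to 1 at 0.4,
# so each location keeps a degree of suitability instead of a hard yes/no.
suitable_fuzzy = np.clip((wetness - 0.1) / (0.4 - 0.1), 0.0, 1.0)
print(suitable_fuzzy.round(2))  # roughly [0.  0.27 0.4  0.83 1. ]
```

Notice how the two sites near the cut-off, at 0.18 and 0.22, land in different groups under the sharp rule even though they're nearly identical; the fuzzy version keeps that similarity visible, at the cost of harder-to-draw polygon boundaries.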
For a map of hurricane hazard, you might question how to weight attributes to create your hazard score, which affects how you create the polygons for the results; I'll sketch that kind of weighted score in a small example at the end of this lesson. If you were looking to create climatic region polygons, you'd need to decide on the characteristics of each zone, so that you can assign given locations to each one. In all of this, you'll need to understand how it affects individual items in the analysis. Should a location have ended up in one group or another? And, as we mentioned before, what specificity do we give up in our data by grouping it?

Okay, we'll leave it here for now. In this video we discussed the broad types of uncertainty that affect our geographic data, and then went into detail on one specific type: uncertainty and error in the conception of our data. We learned about sources of error from our definition of our phenomenon of interest, from our ability to directly measure that phenomenon, and from how we decide to regionalize or group individuals into useful categories. I really do encourage you to think through how these factors affect your own work, right now, so you get practice applying them and working through the tradeoffs. In the next video, we'll pick up where we left off by discussing uncertainty in how we measure data, and go from there.
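Before you go, here's the small weighted-score sketch I mentioned for the hurricane hazard map. The attributes, the weights, and the values are all invented; the point is that the weighting decision itself shapes which locations end up inside the high-hazard polygons.

```python
import numpy as np

# Hypothetical attributes for three locations, each already rescaled to 0-1:
# columns are wind exposure, surge risk, and flood history.
attributes = np.array([
    [0.9, 0.2, 0.4],
    [0.5, 0.8, 0.7],
    [0.3, 0.3, 0.2],
])

# One analyst's judgment call on how much each attribute matters.
weights = np.array([0.5, 0.3, 0.2])
print(attributes @ weights)      # [0.59 0.63 0.28]

# A different, equally defensible weighting changes the ranking of the first two locations,
# and therefore where the "high hazard" boundary gets drawn.
alt_weights = np.array([0.2, 0.5, 0.3])
print(attributes @ alt_weights)  # [0.4  0.71 0.27]
```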