Now that I've covered an introduction to what we're looking to do over the course of this class, let's jump into some background on the jargon that we might be using and the ways in which data is going to be presented to you.
So when you hear about data being presented, think of it as a data table, or what looks very similar to a spreadsheet. In this case, this is a snippet of a data table from a direct marketing company, where each row corresponds to a different order number. So each row has a unique order number that you see. And then for each of those order numbers we have additional information. Who is the customer that placed the order? Where did the order ship to? What was the price paid? Was it given as a gift? Some product information, and then the artist. So we're dealing with CDs, in this case.
So, you may hear each of these rows talked about differently. We might call them cases, we might call them records or observations. In this case, our transactions are the observations that we're dealing with. And each of our columns, those are the different variables that are of interest to us.
So we might look at this table and see where we're shipping to. We might look to see, are we overperforming or are we underperforming in some states? So if we were to reorganize this information, we could aggregate the data by the state/country column. Let's lump together all of the orders from Ohio, all of the orders from Illinois, all of the orders from Massachusetts in the current month, and let's compare them to last month. That will allow us to see how we're doing month over month. Maybe there are some seasonal trends that we might be looking for. So, if we expect to see a big spike in orders around a holiday season, that is something that we'd want to take into account.
Now, whether we're dealing with an Excel workbook with multiple tabs or with a relational database, it might not make sense for us to put all of that information together into a single table. You can imagine ending up with a very wide spreadsheet. What I mean by that is that it would have so many columns that you'd keep scrolling over to try to see what all of those columns are. One way of alleviating that is to relate the tables to each other.
So here's our table of transactions. Each row has a different transaction number, has the data associated with the transaction, has the customer ID, the customer number associated with that transaction, and has a product ID, so what was the product purchased on that transaction. Well, if we dive into the customer numbers, I might say, all right, the first two transactions are from customer 473859. Well, I might have more detailed information about that particular customer. Name, mailing address, when they were acquired as a customer might be some of the information. I also might have information about my specific products. So, the first one, SC5662, what are the details of that product? That information might be contained in different tables.
So if we have a customer table, we can take the first two transactions from our transaction table, find that customer number in the first column of our customer table, and see where this customer lives, how long they've been a customer, and so forth. We might also look to see what the products are. And we can see, all right, we've got pricing information, and we have information on whether or not the product is currently in stock.
What's one of the reasons for keeping these tables separate from each other? Now, if we think about the three tables that we have, we have specific transactions, we've got customer IDs, and we've got product IDs, contained as three related tables. Well, let's say that silver cane, SC5662, is no longer in stock.
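
As a rough sketch of the three related tables described here, plain Python dictionaries can stand in for the customer and product tables, with the transaction rows holding only the two ID numbers. Only the customer number 473859 and the product code SC5662 come from the lecture. Every other value, the field names, and the helper `transaction_details` are made up for illustration, including the assumption that both transactions are for the same product.

```python
# Hypothetical stand-ins for the three related tables.
# Only 473859 and SC5662 appear in the lecture; all other values are invented.
customers = {
    473859: {"name": "A. Example", "state": "OH"},
}
products = {
    "SC5662": {"description": "silver cane", "price": 25.00, "in_stock": True},
}
# Assumption: both of the first two transactions are for SC5662.
transactions = [
    {"transaction": 1, "customer_id": 473859, "product_id": "SC5662"},
    {"transaction": 2, "customer_id": 473859, "product_id": "SC5662"},
]


def transaction_details(txn):
    """Follow the two ID columns into the customer and product tables."""
    return {
        **txn,
        "customer": customers[txn["customer_id"]],
        "product": products[txn["product_id"]],
    }


# One edit, made once, in the product table...
products["SC5662"]["in_stock"] = False

# ...is what every transaction pointing at that product now looks up.
details = [transaction_details(t) for t in transactions]
stock_flags = [d["product"]["in_stock"] for d in details]
print(stock_flags)  # [False, False]
```

The transaction rows themselves were never touched. They only carry the IDs, so whatever is currently in the product table is what gets pulled in.
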
Well, I'm going to make that change in my product table, and once I make the change in that product table, I can then see that change trickle through to the transaction table. Whereas if I put everything into a single table, if I tried to take the four columns that are in my items table and insert them into that transaction table, I'd have to find all of the silver cane transactions and update all of those transactions with the updated information. What if I change the price? I've got to update all of those transactions.
The same problem applies when we're dealing with updating customer information. Customers move all the time. So, do I want to have to go into my transaction database, find all of the rows corresponding to a particular customer, and update their information? It's a little bit of a hassle, and there's a good chance it's going to create errors down the road for us. So instead, I link my customer information to the transactions with their ID number. And then all I have to do is update the information in the customer table. Once that update's made, any time I'm pulling information out of that customer table, it's going to automatically pull the updated information into the transaction table.
We're going to be talking about variables a lot, so, the different columns on our spreadsheet. And I just want to provide a broad classification of two different types of variables that we're going to be dealing with. These are generally used for numerical data, but we can also use them for other types of data as well. So let's start with the quantitative variables. These are where the numbers in a spreadsheet are actually numbers. They actually have a numerical interpretation to them. What's the price? What's the cost? What's the quantity in stock? What was the ordered quantity? These all make sense to us. We can perform mathematical operations on these. I can perform division. I know that five is two more than three.
The other place where you may see numbers on a spreadsheet, where they have a different meaning though, is categorical data. So, let's say I'm talking about markets, and I serve 20 different markets. But instead of having the names of those markets on the spreadsheet, I might have coded them as market 1, market 2, market 3, market 4, all the way through market 20. Well, market 20 isn't necessarily twice as good as market 10. Market 15 isn't five more than market 10. These are just labels that were used. So in a lot of cases you'll see numbers used where it doesn't imply that you can do arithmetic with these numbers. They're just being used as labels. One through 20 might be just as good as red, blue, green, pink, etc. We could use colors, we could use letters. This is something you've got to be careful about when we're doing analysis, because it relies on human interpretation. The computer, Excel, doesn't know any better. All it sees is a number. But when we understand the data as it's given to us, we need to know that these numbers are really just indicators for a categorical variable. Whereas in other cases, the numbers, well, they're actually numbers.
So let me use an example where we're going to want to dive a little bit deeper into things. So, television ratings: what fraction of people who are watching TV are watching a particular program? We can look at it broken down by the different networks. And we can see, well, for this time slice, for this one hour of television, it looks like CBS is doing really well. Does that necessarily mean that any advertiser would be better off, in terms of reaching people, by putting their advertising into the CBS show? Well, cost is one question. We don't have information presented right here on how much the CBS show costs relative to the FOX or NBC show. But the other way that we might look at this would be to say, all right, I'm not just interested in the average audience.
I'm interested in reaching a particular demographic group. I'm interested in reaching females age 18 to 35, or males age 35 to 45. And maybe the performance of the programs, the popularity of the programs, differs across these two demographic groups. Well, demographic group 1, demographic group 2, that's our categorical variable. Which group do particular individuals fall into? Network is actually another categorical variable that we have here. Whereas the ratings information, that's going to be a quantitative variable. So we look at this ratings information, well, the 7% rating that ABC has for Demo Group 1 versus the 1% rating that it has for Demo Group 2. Well, it's got seven times the audience as a fraction of that group. Whereas if we look at FOX, FOX seems to be doing a lot better, three times as good, with Demo Group 2 compared to Demo Group 1, all right? So if we're dealing with quantitative variables, multiplication and arithmetic can be performed. But Demo Group 1 versus Demo Group 2, that's a categorical variable. The networks, ABC, CBS, CW, FOX, NBC, that's a categorical variable.
Think of brand choice. Do I prefer Coca-Cola versus Pepsi? Those are brands, those are categorical variables as well. If we were to go to a grocery store and look at the shelf of laundry detergent, there are a lot of brands available there. In the data that we're presented, we might say 1 is associated with Tide, 2 is associated with All, 3 is associated with the store brand. One, two, and three, in this case these are just labels. So that's something to be careful of when we're dealing with any type of analysis: that we don't try to perform mathematical operations with these categorical variables.
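
To make the distinction concrete, here is a small sketch in Python. The ABC ratings (7% and 1%) and the detergent codes (1 = Tide, 2 = All, 3 = store brand) are taken from the discussion above. The list of purchases is hypothetical, invented only to have something to count.

```python
from collections import Counter
from statistics import mean

# Quantitative variable: ABC's rating (percent) within each demo group.
abc_rating = {"Demo Group 1": 7.0, "Demo Group 2": 1.0}
ratio = abc_rating["Demo Group 1"] / abc_rating["Demo Group 2"]
print(ratio)  # 7.0, so "seven times the audience" is a meaningful statement

# Categorical variable: the numbers are only labels for brands.
brand_labels = {1: "Tide", 2: "All", 3: "store brand"}

# Hypothetical brand codes for six purchases (not from the lecture).
purchases = [1, 3, 3, 2, 1, 3]

# The computer will happily average the codes, but the result
# does not correspond to any brand and has no interpretation.
meaningless = mean(purchases)

# What does make sense for a categorical variable: counting per label.
counts = Counter(brand_labels[code] for code in purchases)
print(counts.most_common(1))  # [('store brand', 3)]
```

The average of the codes computes without any complaint, which is exactly the trap: nothing in the software knows that 1, 2, and 3 are labels rather than quantities.
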