I'll generate some graphics and summary statistics so we can learn more about the data. I'll select Libraries, My Libraries, and then expand the STAT1 library. I'll double-click AMESHOUSING3, which is a random sample of 300 homes from the original data. We use this table in most of our analyses, so let's look at some of the variables. I'm going to select Column labels in the View field so that we can see more descriptive labels. Some of the categorical variables include Style of dwelling, such as 1Story, 2Story, etc., Original construction year, Number of fireplaces, Foundation Type, such as Concrete Slab or Cinder Block, and Masonry veneer or not. Some of the continuous variables include the Lot size in square feet, Above ground living area in square feet, Sale price in dollars, Basement area in square feet, Number of full bathrooms and half bathrooms, and Age of house when sold, in years. I'll use two SAS procedures to explore the AMESHOUSING3 data. To start, I'll navigate to my program files and open st101d01.sas. In Part A, this program defines macro variables to help organize the data set variables and make modifying the SAS code easier. The %LET statements are used to name the macro variables and set their values. The first %LET statement creates a macro variable named categorical, and assigns it a space-delimited list of the categorical variables in the table. The next %LET statement creates the macro variable named interval, and assigns it the names of all the interval variables. Now, instead of typing a long list of variable names over and over in our programs, we can simply reference macro variable values by placing an ampersand in front of the macro variable name. So let's look at Part B of this program. Suppose you want to know the most popular house style in this data set. This PROC FREQ step uses the ameshousing3 data set to generate frequency tables and plots summarizing the categorical variables. 
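As a rough sketch, Parts A and B described above might look something like the following. This is not the actual contents of st101d01.sas. The variable lists are shortened to names that appear in this walkthrough, the format name firefmt is hypothetical and defined here only for illustration, and the code assumes the STAT1 library is already assigned.

```sas
/* Part A (sketch): store the variable lists in macro variables.       */
/* The lists are shortened for illustration; the real program lists    */
/* every categorical and interval variable in the table.               */
%let categorical = Year_Built Fireplaces Yr_Sold Garage_Type_2 Foundation_2;
%let interval    = SalePrice Deck_Porch_Area;

/* Hypothetical format (an assumption, not from the course program),   */
/* so that the FORMAT statement below has something to apply.          */
proc format;
    value firefmt 0 = 'None'
                  1 = 'One'
                  2 = 'Two';
run;

/* Part B (sketch): frequency tables and plots for the categorical     */
/* variables. &categorical resolves to the list assigned above.        */
ods graphics on;

proc freq data=stat1.ameshousing3;
    tables &categorical / plots=freqplot;
    format Fireplaces firefmt.;   /* values are grouped by formatted value */
    title "Categorical Variable Frequency Analysis";
run;

title;   /* clear the title so it does not carry over to later steps */
```

Because the variable names live in one place, adding or removing a variable from the analysis means editing only the %LET statement, not every step that uses the list.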
In PROC FREQ, you list the analysis variables in the TABLES statement. The macro variable reference &categorical is replaced with the macro variable's value, the list of categorical variables, when I submit the step. The PLOTS= option requests a frequency plot. And by including the FORMAT statement, the data will be formatted and grouped before being analyzed. Let's submit Parts A and B of the program. Notice the results are displayed automatically. It's good practice to check the log for error or warning messages. This program ran fine. There's an easier way to know a program's status in SAS Studio. A red X icon is displayed on the program tab if any errors occurred, and a caution symbol is displayed if warnings were generated. This program ran fine, so no icons appear. Here's the PROC FREQ output. From the 300 homes in our sample, almost 200 are one-story homes. There are few homes with other styles, and notice that there are only six observations with the house style 2nd level unfinished. There are too few members to analyze, so they'll be merged with the One story and Two story levels in the variable House_Style2. The overall quality and overall condition ratings of the homes are predominantly average. Both variables have many levels with small frequencies as well. For example, there's only one home each with an overall quality of 1, the poorest level, and 9, the best level. We'll trichotomize these two variables into Below Average, Average, and Above Average, in the variables Overall_Qual2 (overall quality 2) and Overall_Cond2 (overall condition 2). Year_Built ranges from 1875 to 2009 and has more values than is practical to treat as a categorical variable in a statistical model with only 300 observations, so we'll treat it as interval. In this sample of homes in Ames, Iowa, 195 homes have no fireplace, 93 have a single fireplace, and 12 homes have two fireplaces. 
Because the number of fireplaces has a natural ordering, we can treat Fireplaces as an ordinal variable. The variable representing month sold shows a clear trend toward sales in the summer months, June and July. Some months have small numbers, so instead of analyzing by month, we created Season_Sold to use in subsequent analyses. Season 1 is from month 12 to month 2; season 2 is from month 3 to month 5; season 3 is from month 6 to month 8; and season 4 is from month 9 to month 11. Yr_Sold is fairly uniform, meaning there were a similar number of homes sold each year between 2006 and 2010. The Garage_Type_2 variable shows that 159 homes have an attached garage, 109 have a detached garage, and 29 homes do not have a garage (represented by NA). The table also states that there are three homes with missing information. The Foundation_2 variable shows that most homes are on cinder block, followed by concrete and then brick tile or stone. There are four levels of heating quality: excellent, fair, good, and average. Fortunately, most homes have excellent or average heating quality. Furthermore, most homes do not have masonry veneer. Only 89 do. Most homes have a regular lot shape. And a majority have central air. Now, what about our continuous, or interval, variables in the data? Instead of creating tables, for continuous variables we'll plot histograms of the data to see the shape and spread, and also print the mean and standard deviation summary statistics. The PROC UNIVARIATE step performs a distribution analysis and plots the distribution of the continuous variables. The NOPRINT option suppresses the tabular output. You can see that we're referencing the interval macro variable in the VAR and HISTOGRAM statements. And we're requesting a normal curve, a kernel density estimate, and an inset box in the top right, or northeast corner, displaying the number of rows, the mean, and the standard deviation. Let's submit this step and look at the results. 
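A minimal sketch of the PROC UNIVARIATE step just described might look like this. It is an illustration rather than the course program itself: the interval list is shortened to two variables named in this walkthrough, the title text is an assumption, and the STAT1 library is assumed to be assigned already.

```sas
/* Shortened interval list for illustration; the real program lists    */
/* every interval variable in the table.                               */
%let interval = SalePrice Deck_Porch_Area;

ods graphics on;

/* NOPRINT suppresses the tables of statistics; the histograms are     */
/* still produced.                                                     */
proc univariate data=stat1.ameshousing3 noprint;
    var &interval;
    /* One histogram per variable, overlaid with a fitted normal curve */
    /* and a kernel density estimate.                                  */
    histogram &interval / normal kernel;
    /* Inset box in the top-right (northeast) corner showing the       */
    /* number of rows, the mean, and the standard deviation.           */
    inset n mean std / position=ne;
    title "Interval Variable Distribution Analysis";
run;

title;   /* clear the title afterward */
```

The INSET statement applies to the plot statement that precedes it, so it must come after the HISTOGRAM statement.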
The first continuous variable, SalePrice, shows that the average sale price of homes in our sample is $137,524. Notice that the histogram of the data is bell shaped, suggesting a Gaussian, or normal, distribution, a quality of the data that's important for our analyses in subsequent lessons. The blue line overlaying the plot is a normal density estimate and the red line is a kernel density estimate, which basically mimics the histogram. If these two overlaid lines are similar, the data are close to a normal distribution. We'll discuss this more in Lesson 1. Sometimes researchers use a log transformation on an outcome variable such as SalePrice to provide more bell-shaped or normal-looking data for future analyses. In this case, both the original variable and the log transformation provide bell-shaped data. On average, homes in Ames, Iowa, have 1,130 square feet of above ground living area. Most homes range from 900 to 1,380 square feet. The basement area is fairly bell shaped with a mean of 882 square feet. The average garage area for homes with a garage is 369 square feet. The Deck_Porch_Area histogram is an example of a skewed distribution. The largest group of observations, approximately 40%, has no deck, and then we see fewer and fewer larger decks. The lot area is fairly normal looking with a mean of 8,294 square feet. The average age of homes sold in our sample is approximately 46 years, and the ages range from new to about 132 years old. The number of bedrooms above ground, which could also be analyzed as a categorical variable, is 2.5 on average. Similarly, the number of full, half, and total bathrooms could also be analyzed as categorical variables, and the average numbers are 1.68, 0.25, and 1.70, respectively. After looking at the variables in our course data, you might have some intuition as to which variables could accurately model sale price. 
For example, we could do an analysis of variance to see whether homes with central air tend to sell for higher prices, or whether excellent heating quality is associated with higher sale prices. In addition, we could use regression to see whether the above ground living area, as a proxy for the size of homes, is correlated with SalePrice. Or perhaps we can model the probability of the home selling for more than $175,000 using both the number of fireplaces and basement area jointly. These are questions we'll be able to answer going forward.