Graphs are useful tools for data exploration. In this video, we discuss some basic graphs for exploratory data analysis. For categorical data, we discuss bar graphs and pie charts. For quantitative data, we discuss histograms, box plots, and time plots. We also talk about scatter plots as a way to examine the relationship between a pair of quantitative variables.

A bar graph is also called a bar chart. In the basic version of this graph, each bar represents a category, and the height of the bar represents the count or percentage of that category. The bar graph on this slide shows the market share of different search engines. It shows that Google has the dominant share of the search engine market, followed by Yahoo and Bing. There are many variants of bar graphs. In a Pareto chart, categories are ranked from the most popular to the least popular. The bar graph on the previous slide is an example. Pareto charts make it easier to spot the top categories, especially when the number of categories is large. A bar graph can be shown vertically or horizontally. Although horizontal bar graphs are more common, vertical bar graphs can sometimes be useful as well. In stacked bar graphs, bars are stacked together, making it easier to put multiple categories on the same graph or to show the same category over time.

A pie chart uses pie slices to represent the percentage share of different categories. It is similar to a bar graph, but emphasizes each category as a part of one whole. In other words, when we use a pie chart, we implicitly assume that the categories are exhaustive. This is different from a bar graph. The pie chart on this slide shows the same information as the bar graph on the previous slide.

Next, we discuss a few basic graphs for quantitative data. A histogram is a summary plot for a single numerical variable. It shows the distribution of the data, giving some sense of the basic characteristics of the data.
These include center, variability, skewness, modality, outliers, and possibly other patterns. As an example, this graph shows a histogram of exam scores. We can clearly see that the distribution is bimodal, with one mode between 65 and 70 and the other between 80 and 85. The distribution is centered around 70 and is highly variable, with a range between 50 and 95. The distribution is slightly skewed to the right, with no visible outliers.

Due to the popularity of histograms, it is worthwhile to discuss a few potential issues. For small data sets, histograms can be misleading in the sense that small changes in the data can alter the visual pattern. For large data sets, histograms usually work really well. Even though a histogram can only show one variable at a time, side by side histograms can be used to compare the distributions of multiple variables. When constructing histograms, it is important to choose appropriate bin widths, that is, the widths of the individual bars. If small bin widths are used, the graph can show too much detail and be hard to read. However, using large bin widths can lead to misleading patterns and give the wrong impression.

A second plot for quantitative data is the box plot. The plot shows a five-number summary of a single variable. The five summary statistics are shown with solid horizontal lines in the graph. The five lines from bottom to top represent, respectively, the minimum, the first quartile, the second quartile or median, the third quartile, and the maximum. Note that the outliers, which are represented by the dots in the graph, are removed when those statistics are calculated. The big box in the plot, which lies between the first and third quartiles, contains half of the data. The line inside the box represents the median, with 50% of the data on each side. The box plot effectively divides the data into four quarters and clearly marks their boundaries. However, it is difficult to tell the shape of the distribution from a box plot.
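
The five-number summary behind a box plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used for the graph in the lecture: the function name `box_plot_summary`, the 1.5 × IQR rule for flagging outliers, the choice to compute the quartiles from all observations, and the exam scores themselves are all made up for the example. In this convention, the minimum and maximum that are drawn are the most extreme values that are not flagged as outliers.

```python
import statistics

def box_plot_summary(values, whisker_factor=1.5):
    """Return the statistics that one common box plot convention draws."""
    data = sorted(values)
    # Quartiles from all observations; "inclusive" is one of several conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # width of the box
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Points beyond the fences are drawn as separate dots (assumed rule).
    outliers = [v for v in data if v < low_fence or v > high_fence]
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "min": inside[0],    # lowest value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": inside[-1],   # highest value that is not an outlier
        "outliers": outliers,
    }

# Hypothetical exam scores, invented for illustration; 140 is a deliberate outlier.
scores = [52, 61, 64, 66, 67, 68, 70, 72, 75, 81, 82, 83, 84, 88, 140]
summary = box_plot_summary(scores)
print(summary)
```

Changing the `method` argument or the `whisker_factor` changes the numbers slightly, so the same data can produce slightly different box plots.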
Different tools also implement the box plot differently. Side by side box plots are very effective in showing differences in a quantitative variable across factor levels. The graph shows side by side box plots of lease prices of real estate properties with and without a parking garage. While it is difficult to establish the distribution shapes of the lease prices, the graph shows unequivocally that properties with a parking garage are more expensive. In addition, we can easily read a few useful summary statistics from the graph.

A time plot can be used when the data have a meaningful sequence, such as time. This graph shows the Google trend for the search term "gains" since 2004. The graph shows a declining trend from 2004 to 2016. In addition, there is a clear yearly pattern in the data. The number of searches peaks in January and generally bottoms out in the summer. In a time plot, time should always go on the horizontal axis. We describe a time series by looking for an overall pattern and for striking deviations from that pattern. A trend is a rise or fall that persists over time. Spikes mark irregularities. A pattern that repeats itself at regular intervals of time is called seasonal variation.

A scatter plot shows the relationship between two quantitative variables. This scatter plot shows the relationship between the Old Faithful geyser's eruption duration and the waiting time between eruptions. There appears to be a nearly linear relationship between the two, as illustrated in the graph. In addition, there are two groups of data. Data points in the bottom left have short eruption durations and short waiting times, while data points in the top right have long eruption durations and long waiting times. Since eruption durations and waiting times between eruptions tend to move together, the relationship appears to be positive. Let me expand on this discussion in the next few slides. Typically, the explanatory or independent variable is plotted on the x-axis.
The response or dependent variable is plotted on the y-axis. Each pair of data values appears as a point in the plot. A relationship can be either positive or negative. In a positive association, high values of one variable tend to occur together with high values of the other variable. This is the case for the Old Faithful data shown before. In a negative association, high values of one variable tend to occur together with low values of the other variable.

The relationship between two variables may not be linear. The Old Faithful data show a nearly linear relationship. As we mentioned before, this relationship appears to be positive. There are many other possibilities, as shown here. The relationship can be linear or nonlinear, and a nonlinear relationship can take many different forms. It can also be the case that no visible pattern appears in the scatter plot, in which case we say that there is no relationship.

Another important idea here is the strength of the relationship. The strength of a relationship between two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know the value of x. With a weak relationship, for any x value, you might get a wide range of y values.
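
The direction and strength of a roughly linear relationship can also be summarized with a number. The sketch below uses the Pearson correlation coefficient, a common measure that is not named in the lecture and is brought in here as an assumption. Its sign gives the direction of the association, and its magnitude (between 0 and 1) reflects how tightly the points cluster around a straight line. The function name `pearson_r` and all the data are hypothetical and are not the Old Faithful measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # deviations of x from its mean
    dy = [y - mean_y for y in ys]  # deviations of y from its mean
    cross = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / spread

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
strong_positive = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # little scatter, rising
weak_positive = [5.0, 1.0, 7.5, 3.0, 9.0, 4.5]      # rising, but much more scatter
strong_negative = [12.0, 10.1, 7.8, 6.2, 3.9, 2.1]  # little scatter, falling

for label, y in [("strong positive", strong_positive),
                 ("weak positive", weak_positive),
                 ("strong negative", strong_negative)]:
    print(label, round(pearson_r(x, y), 3))
```

Note that this coefficient only measures linear association. A strong nonlinear pattern can still produce a value near zero, so it complements the scatter plot and does not replace it.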