We generally need to do some preparatory work first: for example, adding names, adding indexes, and converting some data. Never underestimate such work; good data preparation makes many subsequent steps much easier. After data acquisition, the common next steps are data exploration and data preprocessing, both necessary for data analysis and mining. Data exploration typically includes checking for data errors and learning about the distribution characteristics and inherent regularities of the data. Preprocessing typically includes data cleaning, integration, transformation, and reduction. Since data cleaning also involves checking for data errors, the two overlap in function. For this reason, we won't dwell on the order of data exploration and data preprocessing; data preparation will be discussed mostly in the order of the detailed tasks. After introducing data preprocessing, we'll turn to understanding data distributions and regularities in data exploration. The rest of this week will follow the common steps of preprocessing. After acquisition, data usually can't be used directly: data are often incomplete, noisy, high-dimensional, or expressed in different units, so various preprocessing steps are required. Preprocessing often takes up over half of the entire process of data analysis and mining. Our lectures focus on data cleaning, transformation, and reduction. Data integration, by contrast, is the task of combining correlated and often heterogeneous data; it includes, among other things, entity recognition, which focuses on problems like homonyms, synonyms, and redundant attributes. The priority there is to observe the data; technically, such problems can mostly be solved with what we learn in this course, so we won't detail them further here. In this part, let's first learn about data cleaning. Error checking in data exploration and data cleaning in data preprocessing share two main tasks: the detection and 
processing of missing values and outliers. First, let's look at the treatment of missing values. Acquired data are often incomplete, with values missing. What should we do when we encounter missing values? Simply drop them? That seems a little crude. However, if a record lacks many attributes, or key attributes, we may consider deleting such records directly. Apart from deletion, another common way to treat missing values is filling. Fill with which values? Common choices include a fixed value: say, for missing wage data we may fill in the average social wage. We can also fill with a statistic computed from nearby data, such as the mean, the median, or the mode; use an interpolation function, like Lagrange interpolation, to calculate the most probable value; or estimate it through modeling, using the nearest-neighbor or regression method. For example, the missing ages in the famous Titanic dataset may be filled in through regression analysis based on features such as passenger class and family-member data. Given what we have learned in this course so far, we'll discuss the first three common ways of filling; you can extend to the last two on your own after mastering the first three, as the last one in particular requires some grounding in machine learning algorithms. No hurry. Let's first look at the treatment of missing values in DataFrame data. In the quotesdf data we mentioned before, the dividend data have been processed, so they are quite clean. To illustrate the problem, I've processed them in a certain way: several data values have been deleted. The processed CSV file looks like this; some data have been removed. Read this file: first import the pandas module, then read the file with the read_csv() function. As the file is in the current directory, just input the filename directly, and use the argument that sets "Date" as the index of this DataFrame; assign the result to a variable. Let's have a look. As we see, the missing values in the file appear as NaN in the 
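As a minimal, self-contained sketch of this reading step (we don't have the lecture's file on disk, so the CSV contents and column names below are hypothetical stand-ins for the processed quote file):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the processed quote file;
# blank fields are parsed as NaN by read_csv().
csv_text = """Date,Open,High,Low,Close
2024-01-02,100.0,101.5,99.0,100.8
2024-01-03,,102.0,100.1,101.2
2024-01-04,101.3,103.0,,102.5
"""

# index_col="Date" makes the Date column the index of the DataFrame,
# as described in the lecture.
quotesdf_nan = pd.read_csv(io.StringIO(csv_text), index_col="Date")
print(quotesdf_nan)  # the missing values show up as NaN
```

With a real file in the current directory, the `io.StringIO(...)` wrapper would simply be replaced by the filename.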
DataFrame. Next, we use several associated methods to detect and process the missing values. First, we use the quotesdf_nan.isnull() method to check whether there are any missing values in the data. Look at the result: as we see, this method also supports vectorized operation. It acts on each DataFrame element, and the result at each location is True for a missing value or False for a non-missing one. After detection, the missing values must be processed. What if we want to drop the missing values? Use the dropna() method. First, use the help() function to view this method. It has three frequently used arguments: axis, how, and inplace. inplace appears in many functions and methods, indicating whether to change the original object directly; its default value is False, which leaves the original unchanged. Let's look at the axis argument. Its default value is 0 (or "index"), which drops any row containing a missing value; if axis is 1 (or "columns"), it deletes any column containing a missing value. The former is more frequently used. The default value of the how argument is "any", meaning that a row or column is deleted if it contains any missing value. If deletion should apply only when all values are missing, set how to "all". Have a try. Let's first set how to "all" and have a look.
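A small sketch of these two methods on hypothetical data (the column names and values are made up for illustration):

```python
import io
import pandas as pd

csv_text = """Date,Open,Close
2024-01-02,100.0,100.8
2024-01-03,,101.2
2024-01-04,,
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# isnull() works element-wise: True where a value is missing
print(df.isnull())

# how="all": delete a row only when ALL of its values are missing
kept_all = df.dropna(how="all")   # 2024-01-04 is removed

# default how="any": delete a row containing ANY missing value
kept_any = df.dropna()            # only 2024-01-02 remains
print(kept_all)
print(kept_any)
```

Neither call uses inplace=True, so `df` itself is left untouched and each call returns a new DataFrame.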
This record is not deleted, as it is not entirely null. Then try the default value of how: as we see, the record we just mentioned is now deleted. Sure enough, the other records containing missing values in this DataFrame are all processed in the same way. Now let's look at filling missing values. First, restore the data. The method for filling missing values is fillna(). For example, fill with the mean. Of course, the mean is that of the data in the same column, since a column corresponds to one attribute, say, the opening price or the math score. Have a try: it works to fill like this. Here I set the inplace argument to True, so the original DataFrame is changed directly, and the mean is used for filling at this location. To use the median or any other statistic instead, just call the appropriate function; in later lessons we'll introduce these statistical functions in detail. Sometimes we may want to fill with a non-blank value immediately adjacent to the missing one. Take the opening price: the opening prices of the previous day and the next day may carry real significance. Then how can we specify which neighboring value to use? Easy: just use the method argument of fillna(). Restore the data again and have a try. If method is set to 'ffill', the missing value is filled with the non-missing value immediately before it; if set to 'bfill', the non-missing value immediately after it is used. When choosing, consider the location of the data, especially a missing value in the first or last row. Let's execute it. As we see, this value is filled with the non-missing value after it. Got it? Apart from missing values, acquired data often contain outliers. Outliers, also known as noise points, are points whose values obviously deviate from the other observations. Outliers may be detected through simple statistical methods or plotting, or through machine learning 
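These filling strategies can be sketched on a tiny hypothetical Series. Note that newer pandas versions prefer the ffill()/bfill() methods, which are equivalent to the fillna(method='ffill'/'bfill') form used in the lecture:

```python
import numpy as np
import pandas as pd

# Hypothetical opening prices with one gap on day d2
s = pd.Series([10.0, np.nan, 12.0], index=["d1", "d2", "d3"], name="Open")

# Fill with the column mean; without inplace=True a filled copy is returned
mean_filled = s.fillna(s.mean())   # mean of 10.0 and 12.0 is 11.0

# Forward fill: use the non-missing value immediately BEFORE the gap
ffilled = s.ffill()                # d2 gets 10.0

# Backward fill: use the non-missing value immediately AFTER the gap
bfilled = s.bfill()                # d2 gets 12.0
print(mean_filled, ffilled, bfilled, sep="\n")
```

As the lecture warns, a gap in the first row has no value before it, so ffill leaves it as NaN there; likewise bfill for the last row.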
methods such as clustering. Here, we use three simple yet effective approaches: the describe() method for summary statistics, the box plot, and a hand-written program based on the 3σ principle. Let's take these three as examples to understand how outliers are observed and detected. We use the previous data as well; suppose we have filled the missing values with the means. Look at the result of the describe() method. Have a look first: as we see, the result contains detailed information on each attribute or feature, including the count, the mean, the standard deviation, the minimum, the values at the 1/4, 1/2 and 3/4 positions (i.e., the lower quartile, the median, and the upper quartile), and the maximum. From such information, we can often see
whether the data are abnormal. For example, if the minimum value here were a very small number, we would see it's abnormal, right? Then let's look at the box plot, also known as the box-and-whisker diagram, which reflects the distribution of the raw data well. It's quite easy to draw: just use the boxplot() method of a Series or DataFrame object. From the result of the describe() method we just used, we may have found that this data set is quite ideal. To simulate real data, which are often not so clean, we add a row of special data containing some outliers, such as at the last date in the date record. Have a look: the data have been added. Besides, the values of the Volume attribute, the trading volume in the original data, are of a much greater order of magnitude than the others; if all columns were shown in the same plot, observation would be affected, so here we select only the data of the first four columns. The iloc attribute of a DataFrame object may be used for data selection, and the drop() method may be used to delete unneeded data. Let's write both forms: the first is .iloc[:, 0:4], and the second is the drop() method, .drop('Volume', axis=1); to modify the original DataFrame directly, just add inplace=True. After selecting the data, call its boxplot() method directly to plot. This is the resulting box plot, and these points in the plot are outliers. Besides, let's talk about the meanings of the lines in the plot. This is the maximum of the data column, and this is the minimum. Now look at the box: the upper edge of the box is the upper quartile, i.e., the value at the 75% position; the middle green line is the median, i.e., the value at the 50% position; the lower edge of the box is the lower quartile, i.e., the value at the 25% (1/4) position. Back to outliers: an outlier in the box plot is judged like this. The criterion is based on the interquartile range (IQR), i.e., the difference between the upper quartile and the lower quartile, 
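The fences the box plot draws can also be computed numerically, which makes the rule concrete without needing a plotting backend (the price column below is hypothetical, with one injected outlier mimicking the "row of special data" added in the lecture):

```python
import pandas as pd

# Hypothetical price column; 50.0 is the injected outlier
s = pd.Series([10.2, 10.5, 10.8, 10.9, 11.0, 11.1, 11.2, 50.0])

q1 = s.quantile(0.25)     # lower quartile (25% position, bottom of the box)
q3 = s.quantile(0.75)     # upper quartile (75% position, top of the box)
iqr = q3 - q1             # interquartile range = the height of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the isolated outlier points in the plot
outliers = s[(s < lower_fence) | (s > upper_fence)]
print(outliers)
```

Calling `s.plot.box()` (with matplotlib installed) would draw these same points beyond the whiskers.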
which is exactly the height of the box. It is stipulated that a point more than 1.5 IQR above the upper quartile, or more than 1.5 IQR below the lower quartile, is an outlier. Have you noticed? A box plot is really a visualization of describe() with stronger capabilities; in particular, for outliers it has its own built-in detection formula for judging the outlier points in the plot. Of course, we may also implement or modify a detection formula ourselves to achieve this effect: write our own program. The common 3σ principle is one example, and we can write a program based on it to find the outliers. The 3σ principle states that if the data follow a normal distribution, an outlier is a measured value whose deviation from the mean exceeds 3 times the standard deviation; i.e., normal data should lie between u-3σ and u+3σ, neither smaller than u-3σ nor greater than u+3σ, where u is the mean and σ is the standard deviation (we'll represent σ by c in code). How can this be implemented? We may use Boolean indexing in DataFrame to screen the data: write the screening conditions inside []. We write the conditions according to the 3σ formula: the difference between the observed value and the mean should exceed 3 times the standard deviation. Use the std() method to calculate the standard deviation. Note, however, that the deviation may be positive or negative, so the two conditions are joined with the symbol "|": () | (). 
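A hand-written version of this screening might look like the following (the data and the column name Open are hypothetical, chosen so that exactly one value violates the 3σ rule):

```python
import pandas as pd

# Hypothetical opening prices: 19 ordinary values plus one wild value
values = [10.0 + 0.1 * (i % 5) for i in range(19)] + [100.0]
df = pd.DataFrame({"Open": values})

u = df["Open"].mean()   # u: the mean
c = df["Open"].std()    # c: the standard deviation sigma

# Two conditions joined with "|", following the u-3*sigma / u+3*sigma bounds ...
mask = (df["Open"] > u + 3 * c) | (df["Open"] < u - 3 * c)

# ... which is the same as one condition on the absolute deviation
mask_abs = (df["Open"] - u).abs() > 3 * c

print(df[mask_abs])     # Boolean indexing keeps only the flagged rows
```

Note that pandas' std() computes the sample standard deviation (ddof=1) by default; with very few data points this inflates σ and the 3σ rule may miss an obvious outlier, so the rule is best applied to reasonably sized samples.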
To make it simpler, it's also OK to use the absolute value with abs(); look at the formula below. Execute: in the result, the values that are not NaN are the outliers. We may process the result further, keeping only the rows that contain at least one outlier and dropping the rows that are entirely NaN. Writing it by hand this way is slightly less convenient, but it allows self-defined formulas that can be modified at any time, offering greater flexibility; of course, it is not as convenient or vivid as the box plot. You have mastered the three methods of outlier detection, haven't you? What should we do once outliers are detected? They may be processed like missing values: delete them, fill them, just leave them alone, or adopt methods such as binning. Due to time limits, we won't discuss this further in this course; if you're interested, you may go on to explore more.