Learn both the theory and the application of basic methods developed either for forming new concepts (principal components or clusters) or for finding interesting correlations (regression and classification). The course begins with a thorough analysis of 1D and 2D data.

*The course was created with the support of Sberbank*

This is an unconventional course in modern Data Analysis, Machine Learning and Data Mining. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. According to this view, two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations. The term summarization embraces here both simple summaries like totals and means and more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation covers both bivariate and multivariate relations between input and target features including Bayes classifiers.

The view of data as a subject of computational data analysis that is adopted here has emerged quite recently. Typically, in the sciences and in statistics, a problem comes first, and then the investigator turns to data that might be useful in advancing towards a solution. Nowadays the situation is frequently reversed, especially with the advent of Big Data. Typical questions then are: What sense can be made of this data set? Is there any structure in it? Can these features help in predicting those? This is more reminiscent of a traveler’s view of the world than that of a scientist. The scientist sits at a desk, gets reproducible signals from the universe and tries to accommodate them into a great model of the universe. The traveler deals with whatever comes their way, and this is the niche of data analysis. A textbook by the instructor along these lines was published by Springer-London in 2011; a review reads: “Core concepts in data analysis is clean and devoid of any fuzziness. The author presents his theses with a refreshing clarity seldom seen in a text of this sophistication. … To single out just one of the text’s many successes: I doubt readers will ever encounter again such a detailed and excellent treatment of correlation concepts.” (Computing Reviews of ACM, June 2011)

**Week 1. Intro:** Examples of data and data analysis problems; visualization.

**Week 2. 1D analysis**. Feature scales. Histogram. Two common types of histograms: Gaussian and Power Law. Central values. Minkowski distance and data recovery view. Validation with Bootstrap.
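As a rough illustration of bootstrap validation of a central value, the sketch below resamples a small sample with replacement and reports the spread of the resampled means. The data values, the number of resamples, and the helper name `bootstrap_means` are assumptions made up for this example, not course material.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the means of n_resamples bootstrap resamples of sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the sample with replacement.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return means

# Hypothetical 1D feature values (made up for illustration).
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
means = sorted(bootstrap_means(data))

# Non-parametric 95% interval: cut 2.5% of the resampled means at each end.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean = {statistics.fmean(data):.3f}")
print(f"bootstrap 95% interval = [{lower:.3f}, {upper:.3f}]")
```

The same loop works for any other central value, for example the median, by swapping the statistic computed on each resample.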

**Week 3-4. 2D analysis** cases:

(Both quantitative: Scatter-plot, linear regression, correlation and determinacy coefficients: meaning and properties. Both nominal: Contingency table, Quetelet index, Pearson chi-squared coefficient, its double meaning and visualization).
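For the nominal case, the following sketch computes Quetelet indexes and the Pearson chi-squared coefficient for a small contingency table. It also checks one algebraic identity: chi-squared divided by the number of observations equals the average Quetelet index weighted by the joint frequencies. Whether this is exactly the "double meaning" discussed in the lectures is an assumption here. The table counts and function names are made up, and all row and column totals are assumed to be positive.

```python
def quetelet_indexes(table):
    """Return the matrix q[i][j] = p_ij / (p_i * p_j) - 1 and the total count."""
    n = sum(map(sum, table))
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    q = [
        # Relative change in the frequency of column j once row i is known.
        [(count / n) / (row_p[i] * col_p[j]) - 1.0 for j, count in enumerate(row)]
        for i, row in enumerate(table)
    ]
    return q, n

def pearson_chi2(table):
    """Classical chi-squared: sum of (observed - expected)^2 / expected."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 contingency table of co-occurrence counts.
table = [[30, 10],
         [20, 40]]

q, n = quetelet_indexes(table)
chi2_classic = pearson_chi2(table)
# Same quantity as N times the frequency-weighted average Quetelet index.
chi2_quetelet = n * sum(
    (table[i][j] / n) * q[i][j]
    for i in range(len(table))
    for j in range(len(table[0]))
)

print(f"q[0][0] = {q[0][0]:.3f}")
print(f"chi-squared (classical) = {chi2_classic:.3f}")
print(f"chi-squared (via Quetelet indexes) = {chi2_quetelet:.3f}")
```

In this table, q[0][0] is 0.5: knowing that an entity falls in the first row raises the frequency of the first column by 50% relative to its overall frequency.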

**Week 5-6. Learning multivariate correlations**

(Bayes approach and Naïve Bayes classifier with a Bag-of-words text model; Decision trees and criteria for building them.)
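A minimal sketch of a Naïve Bayes classifier over a bag-of-words text model is given below. It assumes a multinomial word model with add-one (Laplace) smoothing, which is one common choice and not necessarily the variant taught in the lectures. The toy documents, labels, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class frequencies and per-class bag-of-words counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior of the class
        for w in tokens:
            if w not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen class/word pairs above zero.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training texts, already reduced to lowercase words.
docs = [
    "cheap pills buy now".split(),
    "buy cheap watches".split(),
    "meeting agenda for monday".split(),
    "project meeting notes".split(),
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(docs, labels)
print(predict(model, "cheap pills".split()))
print(predict(model, "monday meeting".split()))
```

The "naïve" part is the sum of per-word log-probabilities: words are treated as independent given the class, so word order plays no role.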

**Week 7. Principal components (PCA) and SVD**

(SVD model behind PCA: student marks as the product of student factor scores and subject loadings. Application to deriving a hidden underlying factor. Data visualization with PCA. Conventional PCA and data normalization issues.)
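The rank-one reading of this model can be sketched with NumPy as follows: the first singular triple of a marks table gives one hidden factor score per student and one loading per subject, and their product approximates the marks. The marks are invented, and splitting the singular value evenly between scores and loadings is a convention assumed here. No centering is applied in this sketch, whereas conventional PCA would first subtract the column means.

```python
import numpy as np

# Hypothetical marks: rows are students, columns are subjects.
marks = np.array([
    [82.0, 78.0, 90.0],
    [55.0, 60.0, 62.0],
    [70.0, 65.0, 75.0],
    [40.0, 48.0, 45.0],
])

# Thin SVD: marks = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(marks, full_matrices=False)

# First singular triple, with the singular value split between both sides.
scores = U[:, 0] * np.sqrt(s[0])     # one hidden factor score per student
loadings = Vt[0, :] * np.sqrt(s[0])  # one loading per subject

# SVD determines the factor only up to sign; choose the positive direction.
if scores.sum() < 0:
    scores, loadings = -scores, -loadings

rank1 = np.outer(scores, loadings)         # best rank-one approximation
contribution = s[0] ** 2 / np.sum(s ** 2)  # share of the data scatter explained

print("student scores:", np.round(scores, 2))
print("subject loadings:", np.round(loadings, 2))
print(f"contribution of the first factor: {contribution:.1%}")
```

Plotting the first two columns of `U` scaled by `s[:2]` would give the usual two-dimensional PCA-style visualization of the students.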

**Week 8. Clustering with k-means**

(K-Means iterations and K-Means features. K-Means criterion. Anomalous clusters and intelligent K-Means.)
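The two alternating steps of K-Means and its criterion can be sketched in plain Python as below. The points and the hand-picked initial centers are assumptions made for the example. The anomalous-cluster initialization of intelligent K-Means is not shown.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center updates until labels stop changing."""
    centers = list(initial_centers)  # copy so the caller's list is untouched
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Step 2: move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

def k_means_criterion(points, labels, centers):
    """Sum of squared Euclidean distances from points to their centers."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical 2D entities forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print("labels:", labels)
print("centers:", centers)
print(f"criterion: {k_means_criterion(points, labels, centers):.3f}")
```

Neither step can increase the criterion, which is why the iterations converge. The result still depends on the initial centers, which is the issue that initialization schemes try to address.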

· Basics of calculus: the concepts of function, derivative and the first-order optimality condition;

· Basic linear algebra including vectors, inner products, Euclidean distances, matrices;

· Basic set theory notation; and

· Basic coding ability in an environment such as MATLAB, R, or any other software.

Although the lectures are designed to be self-contained, we recommend (but do not require) that students refer to the books:

B. Mirkin (2011) Core Concepts in Data Analysis: Summarization, Correlation, Visualization. Undergraduate Topics for Computer Science Series, Springer, London.

J. Han, M. Kamber, J. Pei (2011) Data Mining: Concepts and Techniques, Third Edition, The Morgan Kaufmann Series in Data Management Systems, Morgan-Kaufmann.

H. Lohninger (1999) Teach Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo.

The class will be run using lecture videos, which are between 10 and 15 minutes in length. These contain 1-2 integrated quiz questions each. Also, there will be standalone homework that is not part of video lectures. The homework includes finding an illustrative or real-world dataset to the student’s liking, a set of assignments in applying data analysis methods to the dataset, and a final exam paper.

**1. Will I get a Statement of Accomplishment after completing this class?**

Yes. Students who successfully complete the class will receive a Statement of Accomplishment signed by the instructor.

**2. What resources will I need for this class?**

For this course, you will need an Internet connection, copies of the texts (the most important of which can be obtained for free), a computing environment such as MATLAB (a Student version of which is inexpensive), the publicly available statistical computing language R, or the publicly available open software Weka, and the time to read, compute, think over the computing results, write, and discuss.

**3. How can I get a data set for the class?**

There are many public repositories containing scores of datasets, along with their descriptions and descriptions of results found from the data; the most popular is the so-called Irvine data mining repository. Yet I would advise first thinking of a topic that interests you, such as football players, movie star earnings, or life expectancy in different countries. Nowadays, you can find data on virtually any topic on the Internet by querying Google. For example, for the topic of life expectancy, you can find country-to-feature data tables for about 200 countries, with features such as life expectancy, national income per capita, industrial output per capita, etc.

**4. Would it be permitted to use Java-based computations rather than those in MATLAB?**

Of course. Any computing language/environment will do.

**5. Would it be permitted to use existing software for the methods under study rather than coding them ourselves?**