CSS 490 / 590 - Introduction to Machine Learning
Winter 2012
MW 8:00-10:00 PM
UW2-005
Instructor
email: peaklist@u.washington.edu
phone: (206) 669-6629 (cell)
office: UW1-349
office hours: Monday 6:30-7:45 PM in my office, or by appointment
Course description
Machine learning is the science of building models from available data in order to predict the behavior of new, previously unseen data. It lies at the intersection of modern statistics and computer science, and is widely and successfully used in medicine, image recognition, finance, e-commerce, textual analysis, and many areas of scientific research, especially computational biology. This course is an introduction to the theory and practical use of the most common machine learning techniques, including decision trees, logistic regression, discriminant analysis, neural networks, naïve Bayes, k-nearest neighbor, support vector machines, collaborative filtering, clustering, and ensembles. The coursework will emphasize hands-on experience applying these techniques to real-world datasets, along with several programming projects.
Prerequisites
CSS 342, calculus, and statistics. Some prior exposure to probability and linear algebra is recommended.
Textbook
Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison-Wesley, 2006.
Supplemental reading
[ to be determined ]
Programming language
MATLAB will be used for both exercises and projects.
Exercises
Most lectures will be accompanied by a set of exercises involving the machine learning method(s) introduced in that lecture. Exercises will be a mix of problem sets, hands-on tutorials, and minor coding. The purpose of the exercises is to impart some familiarity with and intuition about the method. Exercises will have some simple deliverables to ensure they are being completed and understood.
Exercise answers should be turned in as a Word document or PDF. If you want to turn in a different type of document, please discuss it with me first. Exercise answers will be collected using a Catalyst Collect It dropbox. See the course homepage sidebar for a link to the course Collect It.
Programming projects
There will be three programming projects. Each will require implementing a particular machine learning method from scratch (i.e. not using the built-in MATLAB modules). Likely methods for the projects include feature selection, collaborative filtering, and ensemble classification. There will be additional project deliverables for CSS 590 enrollees.
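To give a sense of what "from scratch" means here, below is a minimal MATLAB sketch of one such method, a k-nearest-neighbor classifier, written with basic matrix operations only and no toolbox calls such as knnsearch or fitcknn. The function name and interface are illustrative assumptions, not part of any project specification.

    % knn_predict: illustrative from-scratch k-NN classifier.
    % Xtrain: ntrain x d feature matrix; ytrain: ntrain x 1 numeric labels;
    % Xtest: ntest x d feature matrix; k: number of neighbors to vote.
    function yhat = knn_predict(Xtrain, ytrain, Xtest, k)
        ntest = size(Xtest, 1);
        yhat = zeros(ntest, 1);
        for i = 1:ntest
            % squared Euclidean distance from test point i to every training point
            diffs = Xtrain - repmat(Xtest(i, :), size(Xtrain, 1), 1);
            d = sum(diffs .^ 2, 2);
            % indices of training points sorted by distance, then majority vote
            [~, order] = sort(d);
            yhat(i) = mode(ytrain(order(1:k)));
        end
    end

A real project submission would of course also need to handle details this sketch ignores, such as ties in the vote and the choice of distance measure.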
Aside from code files, project deliverables should be turned in as a Word document or PDF. If you want to turn in a different type of document, please discuss it with me first.
Project deliverables will be collected using a Catalyst Collect It dropbox. See the course homepage sidebar for a link to the course Collect It.
Grading
Exercises will be weighted equally, although the amount of work may vary. Exercises will account for 25% of the overall grade. Each of the three programming projects will account for 25% of the overall grade. I will grade on a curve.
Late policy for Exercises: Exercises will be accepted up to two days past the due date, but for each day late there will be a loss of 25% of the grade. Exercises will not under any circumstances be accepted more than two days past the due date. Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.
Late policy for Projects: Projects will be accepted up to three days past the due date, but for each day late there will be a loss of 15% of the grade. Projects will not under any circumstances be accepted more than three days past the due date. Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.
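To make the grading arithmetic concrete, here is a small MATLAB sketch that applies the weights and late penalties above. It assumes all scores are on a 0-100 scale and that each day's deduction is taken as a fraction of the earned score; the function and variable names are invented for illustration.

    % late_score: apply a late penalty to a raw score (illustrative only;
    % assumes the per-day deduction is a fraction of the earned score).
    % rate is 0.25 for exercises or 0.15 for projects; max_days is 2 or 3.
    function s = late_score(raw, days_late, rate, max_days)
        if days_late > max_days
            s = 0;                            % not accepted past the cap
        else
            s = raw * (1 - rate * days_late);
        end
    end

    % Example: a project scored 90 but turned in two days late:
    %   late_score(90, 2, 0.15, 3)   % returns 90 * 0.70 = 63
    % Overall grade per the weights above, with every score on 0-100:
    %   overall = 0.25 * mean(exercise_scores) + 0.25 * (proj1 + proj2 + proj3);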
If at any point during the quarter you have concerns about your performance in the class, please talk with me, and I will tell you where you stand relative to your classmates.
Schedule (revised Jan. 30, 2012)
For reading:
IDM = Introduction to Data Mining, by Tan, Steinbach, and Kumar
ESL = Elements of Statistical Learning, 2nd Ed., by Hastie, Tibshirani, and Friedman
Date | Lecture | Topics | Reading | Exercises, Projects
Wed. Jan. 4 | 1 (slides) | Course logistics; Overview and examples of applications | IDM Chap. 1 | Exercises 1 - due Mon. Jan. 9 (Solutions 1)
Mon. Jan. 9 | 2 (slides) | Math essentials (1) | | Exercises 2 - due Sat. Jan. 14 (Solutions 2)
Wed. Jan. 11 | 3 (slides a, slides b, script a, script b, script c) | MATLAB essentials; Data | IDM Chap. 2.1 - 2.3; IDM Chap. 3.1 - 3.3 | Exercises 3 - due Wed. Jan. 18, 8:00 PM (Solutions 3)
Mon. Jan. 16 | -- | HOLIDAY - no lecture | |
Wed. Jan. 18 | -- | SNOW DAY - no lecture | |
Mon. Jan. 23 | 4 (slides pdf, slides ppt, script, dataset) | Classification (1) | IDM Chap. 4 | Exercises 4 - due Sun. Jan. 29, 10:00 PM (Solutions 4)
Wed. Jan. 25 | 5 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, script, dataset, doc) | Feature generation; Feature selection; Classification (2) | | Project 1 - due Sat. Feb. 11, 10:00 PM (Project 1 solutions)
Mon. Jan. 30 | 6 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, slides c pdf, slides c ppt, script) | Math essentials (2); Classification (3) | IDM Chap. 5.2, 5.3; ESL Chap. 1, 2.1-2.3; Machine Learning: a Probabilistic Perspective, by Murphy, Chap. 1.1-1.4.6 (see Reference texts for download) |
Wed. Feb. 1 | 7 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, script, dataset, doc) | Naive Bayes; Regression | IDM Appendix D; ESL Chap. 3.1, 3.2 (through p. 46), 3.2.1, 3.4 (through p. 72) | Exercises 5 - due Tues. Feb. 7, 10:00 PM (script knn.m; Solutions 5)
Mon. Feb. 6 | 8 (slides a pdf, slides a ppt, slides b pdf, slides b ppt) | Collaborative filtering (1) | R. Bell, Y. Koren, and C. Volinsky, "All Together Now: A Perspective on the Netflix Prize", Chance, Vol. 23, 24-29, 2010; Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42-49, August 2009; J. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", Proc. 14th Conf. Uncertainty Artif. Intell., 1998; optional: A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE Symp. Security Privacy, 111, 2008 |
Wed. Feb. 8 | 9 (slides pdf, slides ppt) | Collaborative filtering (2) | | Project 2 - due Mon. Feb. 27, 8:00 PM (Project 2 solutions)
Mon. Feb. 13 | 10 (slides pdf, slides ppt, script, dataset) | Clustering (1) | IDM Chap. 2.4; IDM Chap. 8.1 - 8.3 |
Wed. Feb. 15 | 11 (slides pdf, slides ppt, script) | Clustering (2) | IDM Chap. 8.5 | Exercises 6 - due Thurs. Feb. 23, 10:00 PM (script clust.m; dataset synthGaussMix.mat)
Mon. Feb. 20 | -- | HOLIDAY - no lecture | |
Wed. Feb. 22 | 12 (slides pdf, slides ppt) | Ensembles (1) | IDM Chap. 5.6 |
Mon. Feb. 27 | 13 (slides a pdf, slides a ppt, slides b pdf, slides b ppt) | Ensembles (2) | R. Polikar, "Ensemble Learning", Scholarpedia, Vol. 4, 2776, 2008; R. Berk, L. Sherman, G. Barnes, E. Kurtz, and L. Ahlman, "Forecasting Murder Within a Population of Probationers and Parolees: a High Stakes Application of Statistical Learning", 2007 |
Wed. Feb. 29 | 14 (slides pdf, slides ppt, script) | Classification (4) | IDM Chap. 5.4 | Project 3 - due Wed. Mar. 14, 10:00 PM
Mon. Mar. 5 | 15 (slides pdf, slides ppt, script, dataset, doc a, doc b) | Classification (5) | IDM Chap. 5.5 | Exercises 7 - due Sun. Mar. 11, 10:00 PM
Wed. Mar. 7 | 16 (slides pdf, slides ppt) | Dimensionality reduction | IDM Appendix B.1 |
Mon.-Fri. Mar. 12-16 | -- | FINALS WEEK | |