CSS 490 / 590 - Introduction to Machine Learning
Winter 2012
MW 8:00-10:00 PM
UW2-005
Instructor
email: peaklist@u.washington.edu
phone: (206) 669-6629 (cell)
office: UW1-349
office hours: Monday 6:30-7:45 PM in my office, or by appointment
Course description
Machine learning is the science of building models from available data in order to predict the behavior of new, previously unseen data. It lies at the intersection of modern statistics and computer science, and is widely and successfully used in medicine, image recognition, finance, e-commerce, textual analysis, and many areas of scientific research, especially computational biology. This course is an introduction to the theory and practical use of the most common machine learning techniques, including decision trees, logistic regression, discriminant analysis, neural networks, naïve Bayes, k-nearest neighbor, support vector machines, collaborative filtering, clustering, and ensembles. The coursework will emphasize hands-on experience applying these techniques to real-world datasets, along with several programming projects.
Prerequisites
CSS 342, calculus, and statistics. Some prior exposure to probability and linear algebra is recommended.
Textbook
Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison-Wesley, 2006.
Supplemental reading
[ to be determined ]
Programming language
MATLAB will be used for both exercises and projects.
Exercises
Most lectures will be accompanied by a set of exercises involving the machine learning method(s) introduced in that lecture. Exercises will be a mix of problem sets, hands-on tutorials, and minor coding. The purpose of the exercises is to impart some familiarity with and intuition about the method. Exercises will have some simple deliverables to ensure they are being completed and understood.
Exercise answers should be turned in as a Word document or PDF. If you want to turn in a different type of document, please discuss it with me first. Exercise answers will be collected using a Catalyst Collect It dropbox. See the course homepage sidebar for a link to the course Collect It.
Programming projects
There will be three programming projects. Each will require implementing a particular machine learning method from scratch (i.e. not using the built-in MATLAB modules). Likely methods for the projects include feature selection, collaborative filtering, and ensemble classification. There will be additional project deliverables for CSS 590 enrollees.
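To give a sense of what "from scratch" means here, below is a minimal MATLAB sketch of one such method, a k-nearest-neighbor classifier, written with basic matrix operations only and no toolbox calls such as knnsearch or fitcknn. The function name and interface are illustrative assumptions, not part of any project specification.

    % knn_predict: illustrative from-scratch k-NN classifier.
    % Xtrain: ntrain x d feature matrix; ytrain: ntrain x 1 numeric labels;
    % Xtest: ntest x d feature matrix; k: number of neighbors to vote.
    function yhat = knn_predict(Xtrain, ytrain, Xtest, k)
        ntest = size(Xtest, 1);
        yhat = zeros(ntest, 1);
        for i = 1:ntest
            % squared Euclidean distance from test point i to every training point
            diffs = Xtrain - repmat(Xtest(i, :), size(Xtrain, 1), 1);
            d = sum(diffs .^ 2, 2);
            % indices of training points sorted by distance, then majority vote
            [~, order] = sort(d);
            yhat(i) = mode(ytrain(order(1:k)));
        end
    end

A real project submission would of course also need to handle details this sketch ignores, such as ties in the vote and the choice of distance measure.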
Aside from code files, project deliverables should be turned in as a Word document or PDF. If you want to turn in a different type of document, please discuss it with me first.
Project deliverables will be collected using a Catalyst Collect It dropbox. See the course homepage sidebar for a link to the course Collect It.
Grading
Exercises will be weighted equally, although the amount of work may vary. Exercises will account for 25% of the overall grade. Each of the three programming projects will account for 25% of the overall grade. I will grade on a curve.
Late policy for Exercises: Exercises will be accepted up to two days past the due date, but for each day late there will be a loss of 25% of the grade. Exercises will not under any circumstances be accepted more than two days past the due date. Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.
Late policy for Projects: Projects will be accepted up to three days past the due date, but for each day late there will be a loss of 15% of the grade. Projects will not under any circumstances be accepted more than three days past the due date. Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.
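To make the grading arithmetic concrete, here is a small MATLAB sketch that applies the weights and late penalties above. It assumes all scores are on a 0-100 scale and that each day's deduction is taken as a fraction of the earned score; the function and variable names are invented for illustration.

    % late_score: apply a late penalty to a raw score (illustrative only;
    % assumes the per-day deduction is a fraction of the earned score).
    % rate is 0.25 for exercises or 0.15 for projects; max_days is 2 or 3.
    function s = late_score(raw, days_late, rate, max_days)
        if days_late > max_days
            s = 0;                            % not accepted past the cap
        else
            s = raw * (1 - rate * days_late);
        end
    end

    % Example: a project scored 90 but turned in two days late:
    %   late_score(90, 2, 0.15, 3)   % returns 90 * 0.70 = 63
    % Overall grade per the weights above, with every score on 0-100:
    %   overall = 0.25 * mean(exercise_scores) + 0.25 * (proj1 + proj2 + proj3);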
If at any point during the quarter you have concerns about your performance in the class, please talk with me, and I will tell you where you stand relative to your classmates.
Schedule (revised Jan. 30, 2012)
For reading:
IDM = Introduction to Data Mining, by Tan, Steinbach, and Kumar
ESL = Elements of Statistical Learning, 2nd Ed., by Hastie, Tibshirani, and Friedman
Date | Lecture | Topics | Reading | Exercises, Projects
Wed. Jan. 4 | 1 (slides) | Course logistics; Overview and examples of applications | IDM Chap. 1 | Exercises 1 - due Mon. Jan. 9 (Solutions 1)
Mon. Jan. 9 | 2 (slides) | Math essentials (1) | | Exercises 2 - due Sat. Jan. 14 (Solutions 2)
Wed. Jan. 11 | 3 (slides a, slides b, script a, script b, script c) | MATLAB essentials; Data | IDM Chap. 2.1 - 2.3; IDM Chap. 3.1 - 3.3 | Exercises 3 - due Wed. Jan. 18, 8:00 PM (Solutions 3)
Mon. Jan. 16 | -- | HOLIDAY - no lecture | |
Wed. Jan. 18 | -- | SNOW DAY - no lecture | |
Mon. Jan. 23 | 4 (slides pdf, slides ppt, script, dataset) | Classification (1) | IDM Chap. 4 | Exercises 4 - due Sun. Jan. 29, 10:00 PM (Solutions 4)
Wed. Jan. 25 | 5 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, script, dataset, doc) | Feature generation; Feature selection; Classification (2) | | Project 1 - due Sat. Feb. 11, 10:00 PM (Project 1 solutions)
Mon. Jan. 30 | 6 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, slides c pdf, slides c ppt, script) | Math essentials (2); Classification (3) | IDM Chap. 5.2, 5.3; ESL Chap. 1, 2.1-2.3; Machine Learning: a Probabilistic Perspective, by Murphy, Chap. 1.1-1.4.6 (see Reference texts for download) |
Wed. Feb. 1 | 7 (slides a pdf, slides a ppt, slides b pdf, slides b ppt, script, dataset, doc) | Naive Bayes; Regression | IDM Appendix D; ESL Chap. 3.1, 3.2 (through p. 46), 3.2.1, 3.4 (through p. 72) | Exercises 5 - due Tues. Feb. 7, 10:00 PM (script knn.m; Solutions 5)
Mon. Feb. 6 | 8 (slides a pdf, slides a ppt, slides b pdf, slides b ppt) | Collaborative filtering (1) | R. Bell, Y. Koren, and C. Volinsky, "All Together Now: A Perspective on the Netflix Prize", Chance, Vol. 23, 24-29, 2010; Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42-49, August 2009; J. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", Proc. 14th Conf. Uncertainty Artif. Intell., 1998; optional: A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE Symp. Security Privacy, 111, 2008 |
Wed. Feb. 8 | 9 (slides pdf, slides ppt) | Collaborative filtering (2) | | Project 2 - due Mon. Feb. 27, 8:00 PM (Project 2 solutions)
Mon. Feb. 13 | 10 (slides pdf, slides ppt, script, dataset) | Clustering (1) | IDM Chap. 2.4; IDM Chap. 8.1 - 8.3 |
Wed. Feb. 15 | 11 (slides pdf, slides ppt, script) | Clustering (2) | IDM Chap. 8.5 | Exercises 6 - due Thurs. Feb. 23, 10:00 PM (script clust.m; dataset synthGaussMix.mat)
Mon. Feb. 20 | -- | HOLIDAY - no lecture | |
Wed. Feb. 22 | 12 (slides pdf, slides ppt) | Ensembles (1) | IDM Chap. 5.6 |
Mon. Feb. 27 | 13 (slides a pdf, slides a ppt, slides b pdf, slides b ppt) | Ensembles (2) | R. Polikar, "Ensemble Learning", Scholarpedia, Vol. 4, 2776, 2008; R. Berk, L. Sherman, G. Barnes, E. Kurtz, and L. Ahlman, "Forecasting Murder Within a Population of Probationers and Parolees: a High Stakes Application of Statistical Learning", 2007 |
Wed. Feb. 29 | 14 (slides pdf, slides ppt, script) | Classification (4) | IDM Chap. 5.4 | Project 3 - due Wed. Mar. 14, 10:00 PM
Mon. Mar. 5 | 15 (slides pdf, slides ppt, script, dataset, doc a, doc b) | Classification (5) | IDM Chap. 5.5 | Exercises 7 - due Sun. Mar. 11, 10:00 PM
Wed. Mar. 7 | 16 (slides pdf, slides ppt) | Dimensionality reduction | IDM Appendix B.1 |
Mon.-Fri. Mar. 12-16 | -- | FINALS WEEK | |