CSS 490 / 590 - Introduction to Machine Learning
Winter 2012
MW 8:00-10:00 PM
UW2-005


Instructor

J Jeffry Howbert

email:            peaklist@u.washington.edu
phone:            (206) 669-6629 (cell)
office:           UW1-349
office hours:     Monday 6:30-7:45 PM in my office, or by appointment


Course description

Machine learning is the science of building models from available data in order to predict the behavior of new, previously unseen data.  It lies at the intersection of modern statistics and computer science, and is widely and successfully used in medicine, image recognition, finance, e-commerce, textual analysis, and many areas of scientific research, especially computational biology.  This course is an introduction to the theory and practical use of the most commonly used machine learning techniques, including decision trees, logistic regression, discriminant analysis, neural networks, naïve Bayes, k-nearest neighbor, support vector machines, collaborative filtering, clustering, and ensembles.  The coursework emphasizes hands-on experience applying specific techniques to real-world datasets, along with several programming projects.


Prerequisites

CSS 342, calculus, and statistics.  Some prior exposure to probability and linear algebra is recommended.


Textbook

Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison-Wesley, 2006.


Supplemental reading

[ to be determined ]


Programming language

MATLAB will be used for both exercises and projects.


Exercises

Most lectures will be accompanied by a set of exercises involving the machine learning method(s) introduced in that lecture.  Exercises will be a mix of problem sets, hands-on tutorials, and minor coding.  The purpose of the exercises is to impart some familiarity with and intuition about each method.  Exercises will have simple deliverables to confirm that they are being completed and understood.
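As a purely hypothetical illustration of the scale of "minor coding" involved (this is not an actual course exercise; the data, variable names, and choice of method below are invented), an exercise snippet in MATLAB might look something like this:

    % Hypothetical example only -- not an actual course exercise.
    % Classify two query points with a 1-nearest-neighbor rule on toy data.
    Xtrain = [randn(20,2); randn(20,2) + 3];   % two well-separated clusters
    ytrain = [ones(20,1); 2*ones(20,1)];       % class labels 1 and 2
    Xtest  = [0.2 0.1; 3.1 2.8];               % query points

    ypred = zeros(size(Xtest,1), 1);
    for i = 1:size(Xtest,1)
        % squared Euclidean distance from query i to every training point
        d = sum((Xtrain - repmat(Xtest(i,:), size(Xtrain,1), 1)).^2, 2);
        [~, nearest] = min(d);                 % index of the closest training point
        ypred(i) = ytrain(nearest);            % inherit its label
    end
    disp(ypred)                                % typically [1; 2] for this toy data

Actual exercises will specify their own data, methods, and deliverables.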

Exercise answers should be turned in as a Word document or PDF.  If you want to turn in a different type of document, please discuss it with me first.

Exercise answers will be collected using a Catalyst Collect It dropbox.  See the course homepage sidebar for a link to the course Collect It.


Programming projects

There will be three programming projects.  Each will require implementing a particular machine learning method from scratch (i.e., without using MATLAB's built-in implementations).  Likely methods for the projects include feature selection, collaborative filtering, and ensemble classification.  There will be additional project deliverables for students enrolled in CSS 590.
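To give a rough sense of what "from scratch" means, below is a minimal, hypothetical sketch (not the actual project specification; the toy ratings, parameter values, and variable names are all invented) of one candidate method, collaborative filtering by low-rank matrix factorization trained with stochastic gradient descent:

    % Hypothetical sketch only -- not the actual project specification.
    % Low-rank matrix factorization for collaborative filtering, trained with
    % stochastic gradient descent and L2 regularization.
    R = sparse([5 3 0 1; 4 0 0 1; 1 1 0 5; 1 0 0 4; 0 1 5 4]);  % toy ratings; 0 = unrated
    [nUsers, nItems] = size(R);
    k      = 2;      % number of latent factors (assumed)
    eta    = 0.01;   % learning rate (assumed)
    lambda = 0.05;   % L2 regularization strength (assumed)

    P = 0.1 * randn(nUsers, k);   % user factor matrix
    Q = 0.1 * randn(nItems, k);   % item factor matrix

    [u, v, r] = find(R);          % user index, item index, and value of each observed rating
    for epoch = 1:200
        for n = randperm(numel(r))                  % visit observed ratings in random order
            err = r(n) - P(u(n),:) * Q(v(n),:)';    % prediction error for this rating
            Pu  = P(u(n),:);                        % save before updating
            P(u(n),:) = Pu        + eta * (err * Q(v(n),:) - lambda * Pu);
            Q(v(n),:) = Q(v(n),:) + eta * (err * Pu        - lambda * Q(v(n),:));
        end
    end
    Rhat = P * Q';   % predicted rating for every user-item pair

Note that only the observed entries drive the updates, which is what distinguishes this from an ordinary low-rank approximation of the full matrix.  The actual projects will come with their own data and specifications.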

Aside from code files, project deliverables should be turned in as a Word document or PDF.  If you want to turn in a different type of document, please discuss it with me first.

Project deliverables will be collected using a Catalyst Collect It dropbox.  See the course homepage sidebar for a link to the course Collect It.


Grading

Exercises will be weighted equally, although the amount of work they require may vary; together they will account for 25% of the overall grade.  Each of the three programming projects will also account for 25% of the overall grade.  I will grade on a curve.

Late policy for Exercises: Exercises will be accepted up to two days past the due date, with a 25% deduction from the grade for each day late.  Exercises will not be accepted more than two days past the due date under any circumstances.  Submission times recorded by the Catalyst Collect It dropbox are the final arbiter of lateness.

Late policy for Projects: Projects will be accepted up to three days past the due date, with a 15% deduction from the grade for each day late.  Projects will not be accepted more than three days past the due date under any circumstances.  Submission times recorded by the Catalyst Collect It dropbox are the final arbiter of lateness.

If at any point during the quarter you have concerns about your performance in the class, please talk with me, and I will tell you where you stand relative to your classmates.


Schedule (revised Jan. 30, 2012)

For reading:
IDM = Introduction to Data Mining, by Tan, Steinbach, and Kumar
ESL = Elements of Statistical Learning, 2nd Ed., by Hastie, Tibshirani, and Friedman

Each entry below lists the date, lecture number, posted materials (slides, scripts, datasets, documentation), topics, reading, and associated exercises or projects.

Wed. Jan. 4 - Lecture 1
  Materials: slides
  Topics:
    • Course logistics
    • Overview and examples of applications
  Reading: IDM Chap. 1
  Exercises / Projects: Exercises 1 - due Mon. Jan. 9; Solutions 1

Mon. Jan. 9 - Lecture 2
  Materials: slides
  Topics:
    • Math essentials (1): probability, linear algebra
  Reading: IDM Appendix A.1, A2.1 - A2.4, A2.6; CS229 probability review, Sect. 1 - 4; CS229 linear algebra review, Sect. 1, 2, 3.1 - 3.7
  Exercises / Projects: Exercises 2 - due Sat. Jan. 14; Solutions 2

Wed. Jan. 11 - Lecture 3
  Materials: slides a, slides b, scripts a, b, c
  Topics:
    • MATLAB essentials
    • Data: attribute types, preprocessing, transformations, summary statistics, visualization
  Reading: IDM Chap. 2.1 - 2.3; IDM Chap. 3.1 - 3.3
  Exercises / Projects: Exercises 3 - due Wed. Jan. 18, 8:00 PM; Solutions 3

Mon. Jan. 16 - HOLIDAY, no lecture

Wed. Jan. 18 - SNOW DAY, no lecture

Mon. Jan. 23 - Lecture 4
  Materials: slides (pdf, ppt), script, dataset
  Topics:
    • Classification (1): general approach, decision tree classifiers, induction process, selecting splits, generalization and overfitting, evaluating performance, decision boundaries
  Reading: IDM Chap. 4
  Exercises / Projects: Exercises 4 - due Sun. Jan. 29, 10:00 PM; Solutions 4

Wed. Jan. 25 - Lecture 5
  Materials: slides a (pdf, ppt), slides b (pdf, ppt), script, dataset, doc
  Topics:
    • Feature generation
    • Feature selection: filter vs. wrapper methods, forward selection, backward selection, selection as a side effect of sparse model building
    • Classification (2): logistic regression
  Exercises / Projects: Project 1 - due Sat. Feb. 11, 10:00 PM; Project 1 solutions

Mon. Jan. 30 - Lecture 6
  Materials: slides a (pdf, ppt), slides b (pdf, ppt), slides c (pdf, ppt), script
  Topics:
    • Math essentials (2)
    • Classification (3): discriminant analysis (linear, quadratic), k-nearest neighbor
  Reading: IDM Chap. 5.2, 5.3; ESL Chap. 1, 2.1 - 2.3; Machine Learning: A Probabilistic Perspective, by Murphy, Chap. 1.1 - 1.4.6 (see Reference texts for download)

Wed. Feb. 1 - Lecture 7
  Materials: slides a (pdf, ppt), slides b (pdf, ppt), script, dataset, doc
  Topics:
    • Naive Bayes
    • Regression: linear regression; regularization (importance in higher-dimensional spaces; common types: L2, L1; techniques also used in classification: decision trees, neural networks, support vector machines)
  Reading: IDM Appendix D; ESL Chap. 3.1, 3.2 (through p. 46), 3.2.1, 3.4 (through p. 72)
  Exercises / Projects: Exercises 5 - due Tues. Feb. 7, 10:00 PM; script knn.m; Solutions 5

Mon. Feb. 6 - Lecture 8
  Materials: slides a (pdf, ppt), slides b (pdf, ppt)
  Topics:
    • Collaborative filtering (1): Netflix Prize story, nearest neighbor approach
  Reading:
    • R. Bell, Y. Koren, and C. Volinsky, "All Together Now: A Perspective on the Netflix Prize", Chance, Vol. 23, 24-29, 2010.
    • Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42-49, August 2009.
    • J. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", Proc. 14th Conf. Uncertainty Artif. Intell., 1998.
    • Optional: A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE Symp. Security Privacy, 111, 2008.

Wed. Feb. 8 - Lecture 9
  Materials: slides (pdf, ppt)
  Topics:
    • Collaborative filtering (2): matrix factorization, stochastic gradient descent, regularization
  Exercises / Projects: Project 2 - due Mon. Feb. 27, 8:00 PM; Project 2 solutions

Mon. Feb. 13 - Lecture 10
  Materials: slides (pdf, ppt), script, dataset
  Topics:
    • Clustering (1): partitional (k-means, k-medoid), hierarchical, hard vs. soft
  Reading: IDM Chap. 2.4; IDM Chap. 8.1 - 8.3

Wed. Feb. 15 - Lecture 11
  Materials: slides (pdf, ppt), script
  Topics:
    • Clustering (2): evaluating results, visualization
  Reading: IDM Chap. 8.5
  Exercises / Projects: Exercises 6 - due Thurs. Feb. 23, 10:00 PM; script clust.m; dataset synthGaussMix.mat

Mon. Feb. 20 - HOLIDAY, no lecture

Wed. Feb. 22 - Lecture 12
  Materials: slides (pdf, ppt)
  Topics:
    • Ensembles (1): general principles, bagging, boosting
  Reading: IDM Chap. 5.6

Mon. Feb. 27 - Lecture 13
  Materials: slides a (pdf, ppt), slides b (pdf, ppt)
  Topics:
    • Ensembles (2): choice of base classifier, techniques for data diversification, parallel programming of ensembles
  Reading:
    • R. Polikar, "Ensemble Learning", Scholarpedia, Vol. 4, 2776, 2008.
    • R. Berk, L. Sherman, G. Barnes, E. Kurtz, and L. Ahlman, "Forecasting Murder Within a Population of Probationers and Parolees: A High Stakes Application of Statistical Learning", 2007.

Wed. Feb. 29 - Lecture 14
  Materials: slides (pdf, ppt), script
  Topics:
    • Classification (4): neural networks (single perceptron, hidden units, transfer functions, training by back-propagation)
  Reading: IDM Chap. 5.4
  Exercises / Projects: Project 3 - due Wed. Mar. 14, 10:00 PM

Mon. Mar. 5 - Lecture 15
  Materials: slides (pdf, ppt), script, dataset, doc a, doc b
  Topics:
    • Classification (5): support vector machines
  Reading: IDM Chap. 5.5
  Exercises / Projects: Exercises 7 - due Sun. Mar. 11, 10:00 PM

Wed. Mar. 7 - Lecture 16
  Materials: slides (pdf, ppt)
  Topics:
    • Dimensionality reduction: principal component analysis, random projection, partial least squares
  Reading: IDM Appendix B.1

Mon.-Fri. Mar. 12-16 - FINALS WEEK