CSS 581 - Introduction to Machine Learning
Winter 2014
  TuTh 8:00-10:00 PM
UW1-040


Instructor

J Jeffry Howbert

email: peaklist@u.washington.edu
phone: (206) 669-6629 (cell)
office hours: Tuesday, 5:00-6:30 PM, UW1-302


Course description

Machine learning is the science of building predictive models from available data, in order to predict the behavior of new, previously unseen data.  It lies at the intersection of modern statistics and computer science, and is widely and successfully used in medicine, image recognition, finance, e-commerce, textual analysis, and many areas of scientific research, especially computational biology.  This course is an introduction to the theory and practical use of the most commonly used machine learning techniques, including decision trees, logistic regression, discriminant analysis, neural networks, naïve Bayes, k-nearest neighbor, support vector machines, collaborative filtering, clustering, and ensembles.  The coursework will emphasize hands-on experience applying specific techniques to real-world datasets, combined with several programming projects.


Prerequisites

CSS 342, calculus, and statistics.  Some prior exposure to probability and linear algebra is recommended.


Textbook

Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison-Wesley, 2006.


Supplemental reading

See Schedule.


Programming language

MATLAB will be used for both exercises and projects.


Exercises

Most lectures will be accompanied by a set of exercises on the machine learning method(s) introduced in that lecture.  Exercises will be a mix of problem sets, hands-on tutorials, and small coding tasks.  Their purpose is to build familiarity with and intuition about each method.  Exercises will have simple deliverables to confirm they are being completed and understood.

Exercise answers should be turned in as a Word document or PDF.  If you want to turn in a different type of document, please discuss it with me first.

Exercise answers will be collected using a Catalyst Collect It dropbox.  See the course homepage sidebar for a link to the course Collect It.


Programming projects

There will be three programming projects.  Each will require implementing a particular machine learning method from scratch (i.e., without using MATLAB's built-in implementations).  Likely methods for the projects include feature selection, collaborative filtering, and ensemble classification.

Aside from code files, project deliverables should be turned in as a Word document or PDF.  If you want to turn in a different type of document, please discuss it with me first.

Project deliverables will be collected using a Catalyst Collect It dropbox.  See the course homepage sidebar for a link to the course Collect It.


Grading

All exercises are weighted equally, although the amount of work may vary; together they account for 25% of the overall grade.  Each of the three programming projects accounts for a further 25% of the overall grade.  I will grade on a curve.

Late policy for Exercises: Exercises will be accepted up to two days past the due date, but for each day late there will be a loss of 25% of the grade.  Exercises will not under any circumstances be accepted more than two days past the due date.  Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.

Late policy for Projects:  Projects will be accepted up to three days past the due date, but for each day late there will be a loss of 15% of the grade.  Projects will not under any circumstances be accepted more than three days past the due date.  Submission times as determined by the Catalyst Collect It dropbox will be the final arbiter on lateness.
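The weighting and late-penalty arithmetic above can be sketched as follows. This is a minimal illustration in Python (coursework itself uses MATLAB); the function names are mine, and it assumes the daily penalty is deducted as percentage points from the earned score, which is my reading of the policy, not a statement from the syllabus.

```python
def late_adjusted(score, days_late, penalty_per_day, max_days_late):
    """Deduct penalty_per_day points per day late.
    Work submitted past the acceptance window receives no credit."""
    if days_late > max_days_late:
        return 0.0
    return max(0.0, score - days_late * penalty_per_day)

def overall_grade(exercise_scores, project_scores):
    """Exercises share one 25% bucket equally; each of 3 projects is 25%."""
    exercises = sum(exercise_scores) / len(exercise_scores) * 0.25
    projects = sum(p * 0.25 for p in project_scores)
    return exercises + projects

# An exercise scoring 90, turned in one day late (25 points/day, 2-day window):
print(late_adjusted(90, 1, 25, 2))   # 65.0
# A project scoring 90, turned in two days late (15 points/day, 3-day window):
print(late_adjusted(90, 2, 15, 3))   # 60.0
```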

If at any point during the quarter you have concerns about your performance in the class, please talk with me, and I will tell you where you stand relative to your classmates.


Schedule (revised Mar. 4, 2014)

For reading:
IDM = Introduction to Data Mining, by Tan, Steinbach, and Kumar
ESL = Elements of Statistical Learning, 2nd Ed., by Hastie, Tibshirani, and Friedman
ISLR = An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani.

Each entry below lists the date and lecture number, the materials posted on the course website (slides, scripts, datasets, documents), the topics covered, the assigned reading, and any exercises or projects assigned.

Tues. Jan. 7 (Lecture 1)
Materials: slides (ppt, pdf)
Topics: course logistics; overview and examples of applications
Reading: IDM Chap. 1
Assigned: Exercises 1 (due Sun. Jan. 12)

Thurs. Jan. 9 (Lecture 2)
Materials: slides (ppt, pdf)
Topics: Math essentials (1)
  • probability
  • linear algebra
Reading: IDM Appendix A.1, A2.1 - A2.4, A2.6; CS229 probability review, Sect. 1 - 4; CS229 linear algebra review, Sect. 1, 2, 3.1 - 3.7
Assigned: Exercises 2 (due Wed. Jan. 15)

Tues. Jan. 14 (Lecture 3)
Materials: slides a (ppt, pdf), slides b (ppt, pdf), scripts a, b, c
Topics: MATLAB essentials; Data
  • attribute types
  • preprocessing
  • transformations
  • summary statistics
  • visualization
Reading: IDM Chap. 2.1 - 2.3; IDM Chap. 3.1 - 3.3
Assigned: Exercises 3 (due Mon. Jan. 20)

Thurs. Jan. 16 (Lecture 4)
Materials: slides (ppt, pdf), script, dataset
Topics: Classification (1)
  • general approach
  • decision tree classifiers
  • induction process
  • selecting splits
  • generalization and overfitting
  • evaluating performance
  • decision boundaries
Reading: IDM Chap. 4
Assigned: Exercises 4 (due Wed. Jan. 22)

Tues. Jan. 21 (Lecture 5)
Materials: slides a (ppt, pdf), slides b (ppt, pdf), script, dataset, doc
Topics: Feature generation; Feature selection
  • filter vs. wrapper methods
  • forward selection
  • backward selection
  • as a side effect of sparse model building
Also: Classification (2)
  • logistic regression

Thurs. Jan. 23 (Lecture 6)
Materials: slides a (ppt, pdf), slides b (ppt, pdf), script
Topics: Math essentials (2); Classification (3)
  • discriminant analysis
  • linear
  • quadratic
Assigned: Project 1 (due Sat. Feb. 8)

Tues. Jan. 28 (Lecture 7)
Materials: slides a (ppt, pdf), slides b (ppt, pdf), script
Topics: Classification (4)
  • k-nearest neighbor
  • naive Bayes
Reading: IDM Chap. 5.2, 5.3; ESL Chap. 1, 2.1 - 2.3; Machine Learning: a Probabilistic Perspective, by Murphy, Chap. 1.1 - 1.4.6 (see Reference texts for download)

Thurs. Jan. 30 (Lecture 8)
Materials: slides (ppt, pdf), script, dataset, doc
Topics: Regression (1)
  • linear regression
  • regularization
  • importance in higher-dimensional spaces
  • common types: L2, L1
  • regression trees
Reading: IDM Appendix D; ESL Chap. 3.1, 3.2 (through p. 46), 3.2.1, 3.4 (through p. 72); ESL Chap. 9.2.1 - 9.2.2
Assigned: Exercises 5 (due Wed. Feb. 5); script knn.m

Tues. Feb. 4 (Lecture 9)
Materials: slides a (ppt, pdf), slides b (ppt, pdf)
Topics: Collaborative filtering (1)
  • Netflix Prize story (OPEN TO PUBLIC)
  • nearest neighbor approach
Reading:
  R. Bell, Y. Koren, and C. Volinsky, "All Together Now: A Perspective on the Netflix Prize", Chance, Vol. 23, 24-29, 2010.
  Y. Koren, R. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42-49, August 2009.
  J. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", Proc. 14th Conf. Uncertainty Artif. Intell., 1998.
  optional: A. Narayanan and V. Shmatikov, "Robust De-anonymization of Large Sparse Datasets", IEEE Symp. Security Privacy, 111, 2008.

Thurs. Feb. 6 (Lecture 10)
Materials: slides (ppt, pdf)
Topics: Collaborative filtering (2)
  • matrix factorization
  • optimization
  • stochastic gradient descent
  • regularization
Assigned: Project 2 (due Mon. Feb. 24)

Tues. Feb. 11 (Lecture 11)
Materials: slides (ppt, pdf), scripts a, b, dataset
Topics: Clustering (1)
  • partitional
  • k-means
Reading: IDM Chap. 2.4; IDM Chap. 8.1 - 8.2

Thurs. Feb. 13 (Lecture 12)
Materials: slides (ppt, pdf), script
Topics: Clustering (2)
  • hierarchical
  • density-based
  • validation
Reading: IDM Chap. 8.3 - 8.5
Assigned: Exercises 6 (due Wed. Feb. 19); script clust.m; dataset synthGaussMix.mat

Tues. Feb. 18 (Lecture 13)
Materials: slides (ppt, pdf)
Topics: Ensembles (1)
  • general principles
  • choice of base classifier
  • techniques for data diversification
  • bagging
  • boosting
Reading: IDM Chap. 5.6; optional but highly recommended: P. Domingos, "A Few Useful Things to Know about Machine Learning", Comm. ACM, Vol. 55, 78-87, 2012.

Thurs. Feb. 20 (Lecture 14)
Materials: slides a (ppt, pdf), slides b (ppt, pdf)
Topics: Ensembles (2)
  • random forests
  • parallel programming of ensembles
Also: Classification (5)
  • neural networks
  • single perceptron
Reading:
  R. Polikar, "Ensemble Learning", Scholarpedia, Vol. 4, 2776, 2008.
  R. Berk, L. Sherman, G. Barnes, E. Kurtz, and L. Ahlman, "Forecasting Murder Within a Population of Probationers and Parolees: a High Stakes Application of Statistical Learning", J. Royal Statist. Soc. A, Vol. 172, Part 1, 191-211, 2009.

Tues. Feb. 25 (Lecture 15)
Materials: slides (ppt, pdf), script
Topics: Classification (5)
  • neural networks
  • hidden units
  • transfer functions
  • training (back-propagation)
  • expressiveness
Reading: IDM Chap. 5.4

Thurs. Feb. 27 (Lecture 16)
Materials: slides (ppt, pdf), script, dataset, docs a, b, c
Topics: Classification (6)
  • support vector machines
  • linearly separable classes
  • non-linearly separable classes
  • non-linear decision boundaries
  • kernels
Reading: IDM Chap. 5.5
Assigned: Project 3 (due Wed. Mar. 19)

Tues. Mar. 4 (Lecture 17)
Materials: slides (ppt, pdf)
Topics: Dimensionality reduction
  • principal component analysis
  • random projection
  • partial least squares
  • nonlinear methods
Reading: IDM Appendix B.1 - B.3; ESL Chap. 14.5.1; ISLR Chap. 6.3, 10.2

Thurs. Mar. 6 (Lecture 18)
Materials: slides (ppt, pdf)
Topics: Anomaly detection
Reading: IDM Chap. 10
Assigned: Exercises 7 (due Fri. Mar. 14)

Tues. Mar. 11 (Lecture 19)
Materials: slides a (ppt, pdf), slides b (ppt, pdf), abstract
Topics: Special topic lecture: "Seizure prediction and machine learning" (OPEN TO PUBLIC); Natural language processing
Reading: J. Howbert, E. Patterson, S. Stead, B. Brinkmann, V. Vasoli, et al., "Forecasting Seizures in Dogs with Naturally Occurring Epilepsy", PLoS ONE, Vol. 9, e81920, 2014.

Thurs. Mar. 13 (Lecture 20)
Materials: abstract
Topics: Special topic lecture: "Using NLP and machine learning in medical information systems" (OPEN TO PUBLIC)
Reading: B. Tinsley, A. Thomas, J. McCarthy, and M. Lazarus, "Atigeo at TREC 2012 Medical Records Track: ICD-9 Code Description Injection to Enhance Electronic Medical Record Search Accuracy", 2012.

Mon.-Fri. Mar. 17-21: FINALS WEEK