CMSC 726 Final Project

The purpose of the final project is for you to demonstrate to me that you learned something this semester. Think of it as being analogous to a depth-oriented final exam, where you choose the topic that you want to cover in depth. Here are some guidelines:

What to hand in: You should hand in a PDF write-up of approximately four pages in the homework LaTeX format. It's hard for me to imagine a sufficiently large project leading to a write-up of less than two pages, and I don't want to read anything longer than eight pages. Somewhere in the four plus-or-minus two range is probably about right. You may also hand in your code if you want, but chances are I won't look at it.

Grading: The final project is worth 25% of your grade. 15% will be based on your writeup and what you did. 5% will be based on your presentation at the final exam party (see below). 5% will be based on showing up on time to the final exam party, staying until the end, and signing the sign-in sheet.

Presentation: You will have roughly 5 minutes to tell the rest of the class what you did. You should have precisely three slides (no title slide necessary). You will have to hand in these three slides ahead of time (in PDF format) so I can merge them into one big PDF for the presentations. Your slides should be: (1) what is the problem; (2) how do you solve it; (3) how well did it work and what problems did you run into.

Due dates: Everything (presentation slides and write-up) is "technically" due the last day of class (Dec 13, 11am). However, you get "free late days" for the presentation up until 1 hour before the final exam date. And your write-up gets "free late days" until the end of the exam period: Dec 20, 9pm.

Project topic: You're more than welcome to come up with your own topic. If you want to run it by me ahead of time so that I can help you figure out what you'll need to do to satisfy the prime directive of "convince me you learned something," please send me a brief email with a description of what you want to do, or talk to me at some point after class or during office hours or whenever you can catch me. However, if you don't want to come up with your own topic, I have some canned projects listed below that I find interesting. (It's perfectly okay for multiple teams to select the same canned project.) If you're having trouble choosing, talk to me!

  1. Comparison of loss functions and regularizers
  2. Perceptron/linear models with few active features
  3. New optimization algorithms for linear models
  4. Optimizing kernel combinations by gradients
  5. Explaining the predictions of linear models/SVMs
  6. Boosting to obtain deep neural networks
  7. PCA versus JL for dimensionality reduction
  8. Active learning across learning algorithms
  9. Active learning against big pools
  10. Domain adaptation with continuous domains

Comparison of loss functions and regularizers

Several people seem to think that hinge loss leads to better empirical performance than logistic loss, but that it might be more sensitive to good regularization. Explicit comparisons are hard because different tools optimize these models in different ways. Your job is to explore this claim empirically. (You could try theoretically, but I think that would be really hard.) What you'll want to do is take a bunch of classification datasets (e.g., a large subset of the UCI repository) and train linear models on them over a wide range of values of a regularization parameter. The questions are: how well does each loss do for an "oracle" selection of the hyperparameter, and how sensitive is it to the selection of this value? For instance, how far from the optimal value can you go and still be within one percent of optimal performance (or some other such measure)? You should run all these experiments carefully, using the same optimizer throughout (e.g., your -- or my -- project implementation).
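To make this concrete, here is a minimal sketch of the experimental loop. Everything in it is an illustrative choice, not a prescription: synthetic data stands in for the UCI datasets, both losses are trained by the same hand-rolled subgradient descent, and sensitivity is measured as the number of grid points within 1% of the oracle accuracy.

```python
import numpy as np

# Illustrative sketch: sensitivity of hinge vs. logistic loss to the
# L2 regularization strength.  Data, grid, and the "within 1% of the
# oracle" window are all stand-in choices.
rng = np.random.RandomState(0)
n, d = 400, 20
w_true = rng.randn(d)
X = rng.randn(n, d)
y = np.sign(X @ w_true + 0.5 * rng.randn(n))       # noisy linear labels
Xtr, ytr, Xte, yte = X[:200], y[:200], X[200:], y[200:]

def train(loss, lam, iters=500, eta=0.1):
    w = np.zeros(d)
    for t in range(iters):
        m = ytr * (Xtr @ w)                        # margins
        if loss == "hinge":
            g = -(ytr[:, None] * Xtr * (m < 1)[:, None]).mean(0)
        else:                                      # logistic
            g = -(ytr[:, None] * Xtr / (1 + np.exp(m))[:, None]).mean(0)
        w -= eta / np.sqrt(t + 1) * (g + lam * w)  # subgradient step
    return w

lams = np.logspace(-4, 1, 11)                      # regularization grid
for loss in ("hinge", "logistic"):
    accs = [np.mean(np.sign(Xte @ train(loss, lam)) == yte) for lam in lams]
    best = max(accs)                               # "oracle" hyperparameter
    window = sum(a >= best - 0.01 for a in accs)   # grid points within 1%
    print(loss, round(best, 3), window)
```

A wider `window` means the loss is less sensitive to the regularization choice; the real project would report this across many datasets.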

(Good for teams of 1-2 people, with reasonable computing resources and good data analysis minds. Would make a really interesting blog post, tech report or workshop paper.)

Perceptron/linear models with few active features

You can achieve sparsity in linear models using an L1 regularizer. But this sparsity only means the weight vector will be sparse; it doesn't say anything about how many features will be used at prediction time. Suppose that you want to learn a linear model that behaves as follows. When a new example comes in, it computes w_d x_d for each feature d. But instead of the prediction being the sum of all of these, it's just the sum of the top 5 in absolute value. This would make it much easier to explain the behavior of your linear model, because it really would use only 5 features to make any given classification decision! You should study (a) what happens if you learn a regular linear model, for instance with a perceptron, and then simply apply this heuristic at test time; and (b) whether you can do better by modifying the perceptron algorithm so that it makes updates according to this rule, perhaps via a subgradient-like argument. If you're up for it, maybe prove a perceptron convergence theorem, though I suspect this is tricky.
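Here is a minimal sketch of heuristic (a): train an ordinary perceptron, then at test time sum only the top-5 per-feature contributions |w_d x_d|. The synthetic data (where only 5 features actually matter) and the choice k=5 are illustrative.

```python
import numpy as np

# Sketch of heuristic (a): vanilla perceptron training, top-k prediction.
rng = np.random.RandomState(1)
n, d, k = 300, 30, 5
w_true = np.zeros(d)
w_true[:k] = rng.randn(k)                         # only k features matter
X = rng.randn(n, d)
y = np.sign(X @ w_true)

w = np.zeros(d)                                   # ordinary perceptron
for _ in range(20):
    for x, yi in zip(X, y):
        if np.sign(x @ w) != yi:
            w += yi * x

def predict_topk(w, x, k):
    c = w * x                                     # per-feature contributions
    top = np.argsort(np.abs(c))[-k:]              # k largest in absolute value
    return np.sign(c[top].sum())

acc_full = np.mean(np.sign(X @ w) == y)
acc_topk = np.mean(np.array([predict_topk(w, x, k) for x in X]) == y)
print(acc_full, acc_topk)
```

Part (b) would move the top-k restriction into the update itself rather than applying it only at prediction time.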

(Good for teams of 1-2 people. Could be an interesting paper, but that would hinge on being able to get some non-trivial theoretical results out.)

New optimization algorithms for linear models

For those of you in my Optimization in ML seminar, you now know about a bunch of cool optimization algorithms. I'm particularly thinking of Barzilai and Borwein, Nesterov 2009, Nemirovski 2009 and Bertsekas 2009. Implement some subset of these algorithms and compare them to vanilla (sub)gradient descent or stochastic (sub)gradient descent. Are the improved convergence rates that you see in the optimization theory borne out in practice? Can you convince me to stop simply using stochastic (sub)gradient descent in all my implementations?
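As one concrete example of the kind of comparison intended, here is a sketch of the Barzilai-Borwein step size on a least-squares toy problem. The objective, data, and bootstrap step are illustrative; the real project would implement the published variants and compare them on actual learning problems.

```python
import numpy as np

# Sketch: gradient descent with the Barzilai-Borwein step size on
# the quadratic 0.5 * ||A w - b||^2.  All choices are illustrative.
rng = np.random.RandomState(6)
A = rng.randn(50, 20)
b = rng.randn(50)
grad = lambda w: A.T @ (A @ w - b)                # gradient of the objective

w = np.zeros(20)
w_prev, g_prev = None, None
for t in range(100):
    g = grad(w)
    if g_prev is None:
        eta = 1e-3                                # bootstrap first step
    else:
        s, dg = w - w_prev, g - g_prev
        if s @ dg == 0:                           # converged; avoid 0/0
            break
        eta = (s @ s) / (s @ dg)                  # BB step size
    w_prev, g_prev = w, g
    w = w - eta * g

w_star = np.linalg.lstsq(A, b, rcond=None)[0]     # exact least-squares answer
print(round(float(np.linalg.norm(w - w_star)), 6))
```

A fixed-step gradient descent baseline with a comparable iteration budget would make the comparison the project asks for.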

(Good for teams of K-many people where you are implementing K of these algorithms. Could be an interesting tech-report or workshop paper.)

Optimizing kernel combinations by gradients

Consider a positive linear combination of kernels K(x,z) = a1 K1(x,z) + a2 K2(x,z) + ... + aM KM(x,z). It would be great if you could automatically tune these am values while you were learning, for instance using gradient steps. Of course, you cannot do this on the training data or you'll overfit massively! However, you can tune them by gradient descent on held-out data, simultaneously with doing gradient descent for the model parameters on the training data. You can actually apply the same trick to tuning the regularization hyperparameter, and even parameters of the kernels, like gamma in an RBF kernel. Or at least I think you can: you should try it out; we can talk about the details.
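One possible instantiation of the alternating scheme, sketched under many illustrative assumptions: squared losses on both splits, two RBF widths as the base kernels, hand-picked step sizes, and a small floor on the a_m values to keep the combination strictly positive.

```python
import numpy as np

# Hypothetical sketch: gradient steps on the dual weights alpha against
# the TRAINING loss, interleaved with gradient steps on the kernel
# weights a_m against the HELD-OUT loss.  All choices are illustrative.
rng = np.random.RandomState(2)
Xtr, Xho = rng.randn(40, 3), rng.randn(40, 3)
w_true = rng.randn(3)
ytr, yho = np.sign(Xtr @ w_true), np.sign(Xho @ w_true)

def rbf(A, B, g):                                  # RBF kernel, width g
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-g * d2)

widths = (0.1, 1.0)
Ks_tr = [rbf(Xtr, Xtr, g) for g in widths]         # candidate base kernels
Ks_ho = [rbf(Xho, Xtr, g) for g in widths]

a = np.array([0.5, 0.5])                           # kernel weights a_m
alpha = np.zeros(len(Xtr))                         # dual model weights
for _ in range(200):
    Ktr = sum(am * Km for am, Km in zip(a, Ks_tr))
    r_tr = Ktr @ alpha - ytr                       # training residual
    # safe step for the quadratic: 0.5 / Lipschitz constant of the gradient
    alpha -= 0.5 / np.linalg.norm(Ktr, 2) ** 2 * (Ktr @ r_tr)
    Kho = sum(am * Km for am, Km in zip(a, Ks_ho))
    r_ho = Kho @ alpha - yho                       # held-out residual
    grad_a = np.array([(Km @ alpha) @ r_ho for Km in Ks_ho])
    a = np.maximum(a - 1e-3 * grad_a, 0.01)        # floor keeps K positive
print(a)
```

The same held-out gradient trick would extend to the regularization strength and the RBF widths themselves, which is the harder part of the project.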

(Good for teams of 1-2 people for the basic stuff, or 3-4 people if you want to go for the regularization parameter and kernel parameters as well. Could be an interesting empirical paper if you can find a good domain on which to apply it.)

Explaining the predictions of linear models/SVMs

Linear models and SVMs get awesome predictive performance, but it's very hard to explain the decisions that they make. It would be nice, when such a model makes an error, to explain why. For instance, it might not have seen enough examples like the test example: you could test this by adding the test example to the training data, retraining, and checking that you don't incur any new training errors. Perhaps the training error even goes down: the examples on which it does might be interesting for a person to look at. Or perhaps there is contradictory evidence in the training data: retraining would lead to new training errors. Can you point to those training examples? Both of these basically get at the question: how much would the model have to move to get this example correct, and why didn't it do that already?
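The retraining diagnostic might look roughly like this, with a perceptron and noisy synthetic data standing in for whatever model and application you actually use.

```python
import numpy as np

# Sketch of the diagnostic: retrain with the test example added, then
# look for training points that flip from correct to incorrect --
# candidates for "contradictory evidence" to show a person.
rng = np.random.RandomState(3)
n, d = 200, 10
w_true = rng.randn(d)
X = rng.randn(n, d)
y = np.sign(X @ w_true + 0.8 * rng.randn(n))      # noisy labels

def perceptron(X, y, epochs=30):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, yi in zip(X, y):
            if np.sign(x @ w) != yi:
                w += yi * x
    return w

w1 = perceptron(X[:-1], y[:-1])                   # without the test example
w2 = perceptron(X, y)                             # with it added back in
before = np.sign(X[:-1] @ w1) != y[:-1]
after = np.sign(X[:-1] @ w2) != y[:-1]
culprits = np.where(after & ~before)[0]           # newly-misclassified points
print(len(culprits), "training examples in conflict with the test example")
```

An empty `culprits` set would point toward the "not enough similar examples" explanation instead.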

(Good for teams of 1-2 people who care about some particular application problem that they could try it on and be able to interpret the results. Would probably be a cool paper in your application area.)

Boosting to obtain deep neural networks

Fact 1: boosting decision stumps leads to linear models. Fact 2: boosting linear models leads to two-layer neural networks. Apply induction. This suggests an algorithm: boost some stumps for a while until you get bored, then stop and call that thing your first hidden unit. Start boosting again on the residuals (this will require a multi-level boost!). Once you're done with this, stop and repeat. The precise algorithm needs to be developed a bit, but it could be a really cool way to train a deep model. There's obviously an architecture selection issue, but that's there for normal neural networks, too!
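A rough sketch of the first two "hidden units" on a 1-d regression toy, where "until you get bored" is just a fixed number of rounds. All of the choices here (squared loss, the stump fitter, the round counts) are illustrative, and the real algorithm still needs to be worked out as described above.

```python
import numpy as np

# Sketch: boost stumps on residuals, freeze the sum as unit h1, then
# boost again on what h1 leaves unexplained.
rng = np.random.RandomState(4)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + 0.1 * rng.randn(200)

def fit_stump(x, r):
    # best single-threshold stump for squared loss on residual r
    best = (np.inf, 0.0, 0.0, 0.0)
    for t in x[::5]:                              # coarse threshold grid
        left, right = r[x <= t], r[x > t]
        if len(right) == 0:
            continue
        a, b = left.mean(), right.mean()
        sse = ((left - a) ** 2).sum() + ((right - b) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, a, b)
    _, t, a, b = best
    return lambda z: np.where(z <= t, a, b)

def boost(x, target, rounds=20):
    stumps, r = [], target.copy()
    for _ in range(rounds):
        s = fit_stump(x, r)
        stumps.append(s)
        r = r - s(x)                              # boost on the residual
    return lambda z: sum(s(z) for s in stumps)

h1 = boost(x, y)                                  # first "hidden unit"
h2 = boost(x, y - h1(x))                          # boost the residual again
mse = np.mean((h1(x) + h2(x) - y) ** 2)
print(round(float(mse), 4))
```

The interesting (and unresolved) part is how the frozen units combine nonlinearly so that the stack is genuinely deeper than one boosted layer.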

(Good for teams of 1-2 people if you go for just a two layer thing, or 3-4 people if you go for the deep thing. If you can get it to work, it could be a cool paper, but would need to be compared to "standard" deep neural networks which might be a pain.)

PCA versus JL for dimensionality reduction

PCA and random projections a la Johnson-Lindenstrauss both give ways of doing linear dimensionality reduction. PCA is data-based and can capture information in the data, but can be led astray by an adversary. JL-style projections are independent of the data, but perhaps need more dimensions to do a good job. This is especially true when your projection, your data, or both are sparse. (I.e., there are algorithms for sparse PCA, essentially PCA with an L1 regularizer, and for sparse JL, where the random matrix is sparse. But your data vectors can also be sparse.) Compare these empirically, especially in the sparse cases, and see how many dimensions you actually need and whether we can stop running PCA or not.
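A minimal version of the dense, non-sparse comparison, assuming PCA via the SVD and a Gaussian JL matrix, with mean relative distortion of pairwise distances as an illustrative quality measure:

```python
import numpy as np

# Sketch: project to k dimensions with PCA and with a JL-style random
# matrix, then compare how well pairwise distances are preserved.
rng = np.random.RandomState(5)
X = rng.randn(100, 50) @ rng.randn(50, 50)        # correlated dense data
X -= X.mean(0)
k = 10

_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:k].T                              # top-k principal directions
R = rng.randn(50, k) / np.sqrt(k)                 # Gaussian JL projection
X_jl = X @ R

def distortion(Z):
    d0 = np.linalg.norm(X[:, None] - X[None], axis=-1)
    dz = np.linalg.norm(Z[:, None] - Z[None], axis=-1)
    mask = d0 > 0                                 # skip self-distances
    return np.abs(dz[mask] / d0[mask] - 1).mean() # mean relative error

dp, dj = distortion(X_pca), distortion(X_jl)
print("PCA:", round(float(dp), 3), "JL:", round(float(dj), 3))
```

Sweeping k for both methods, and swapping in sparse variants of each, would produce the curves the project asks for.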

(Good for teams of 1-2 people if you stick to basic JL and basic PCA; good for 3-4 people if you try sparse methods too.)

Active learning across learning algorithms

Active learning is great if your goal is a classifier. However, if your goal is a data set and you might switch classifiers later, this could be bad. The badness stems from the fact that your data set will be constructed so that the classifier used during active learning does well on it; if you switch classifiers later, this could be trouble. Check whether this holds empirically. Can you safely switch classifiers? Can you mitigate the issue by using different types of learning algorithms at active learning time, or does this make active learning not help anymore?
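A toy version of the experiment, assuming margin-based uncertainty sampling with a perceptron as the querying learner and a 1-nearest-neighbor rule as the "switched" classifier. Everything here, from the data to the query count, is an illustrative choice.

```python
import numpy as np

# Sketch: select labels actively with a perceptron, then compare the
# perceptron against a different classifier trained on the same labels.
rng = np.random.RandomState(7)
n, d = 500, 5
w_true = rng.randn(d)
X = rng.randn(n, d)
y = np.sign(X @ w_true)

labeled = list(range(10))                         # small seed set
pool = [i for i in range(n) if i not in labeled]
w = np.zeros(d)
for _ in range(40):                               # 40 active queries
    for _ in range(5):                            # retrain the perceptron
        for i in labeled:
            if np.sign(X[i] @ w) != y[i]:
                w += y[i] * X[i]
    margins = np.abs(X[pool] @ w)
    pick = pool[int(np.argmin(margins))]          # most uncertain point
    labeled.append(pick)
    pool.remove(pick)

def knn1(xq):                                     # the "switched" classifier
    i = min(labeled, key=lambda j: np.linalg.norm(X[j] - xq))
    return y[i]

test = pool[:100]
acc_lin = np.mean([np.sign(X[i] @ w) == y[i] for i in test])
acc_knn = np.mean([knn1(X[i]) == y[i] for i in test])
print(acc_lin, acc_knn)
```

The comparison that matters is against the same second classifier trained on a randomly-sampled label set of equal size.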

(Good for teams of 1-2 people for the first question and teams of 3-4 people to get the second question too. Could make an interesting empirical paper with some extra work.)

Active learning against big pools

All the active learning results I know of from papers run active learning on relatively small pools of data. What if you have an essentially infinite supply of unlabeled data? You can arrange this by considering classification tasks for which data is "free," like trying to disambiguate between "there," "their" and "they're" in text. I suspect that active learning algorithms will actually do worse when they have an infinite amount of data to work with (I suspect there are more outliers), but I could be wrong. You can check to see if I'm right or not.

(Good for teams of 1-2 people who have access to good computing resources. Would make an interesting tech report or workshop paper.)

Domain adaptation with continuous domains

The feature augmentation technique for domain adaptation replaces each example x in the source domain with <x,x,0> and every example x in the target domain with <x,0,x>. In kernel space this amounts to doubling the kernel value between points in the same domain. This assumes domains are discrete. What if they are continuous? I.e., what if instead of having a domain, you have a bunch of domain features with each example. Call these z. So each example is a pair (x,z) with a label y. You can imagine a bilinear model for which you learn weights on z and separate weights on x and classify as a product. This can also be kernelized in a reasonable way and there are lots of possibilities for how to optimize it (since it's non-convex).
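The discrete augmentation, written out in code (the example values are arbitrary; the continuous version would replace the 0/1 domain-indicator blocks with the domain features z):

```python
import numpy as np

# Feature augmentation for domain adaptation: source examples become
# <x, x, 0>, target examples <x, 0, x>.  With a linear kernel this
# doubles the kernel value within a domain and leaves cross-domain
# values unchanged, as claimed above.
def augment(x, domain):
    z = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, x, z])
    return np.concatenate([x, z, x])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
s1, t2 = augment(x1, "source"), augment(x2, "target")
print(s1)                                          # [1. 2. 1. 2. 0. 0.]
print(augment(x1, "source") @ augment(x2, "source"), 2 * (x1 @ x2))  # doubled
print(s1 @ t2, x1 @ x2)                            # cross-domain: unchanged
```

The bilinear version of the model would replace the fixed indicator blocks with learned interactions between x and z.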

(Good for teams of 1-2 people. Could definitely be an applied conference paper somewhere.)