I'm also releasing a much more featureful implementation of Searn that can accommodate pretty much any structured prediction task. You can download it here: SearnShell_v0.1.tgz (requires Perl). The cool thing about SearnShell is that you provide your own code for doing search and computing features, and it automatically connects this to a base learning algorithm like megam or libsvm.
Q:
In the Searn training algorithm, could you tell me in which step we
should perform the search process using a kind of search algorithm,
such as greedy search or beam search.
A:
Essentially never. I like to think of Searn as follows. We have a
structured prediction problem, which entails solving an argmax during
prediction. This argmax is intractable. So, normally, we would
use some search algorithm to approximate it (beam or greedy or...).
Whatever search algorithm we apply, the search is performed by making
a bunch of sequential decisions (each decision is a step in the search
process). The idea behind Searn is that instead of learning some sort
of global model and then searching (as is standard), we will
simply learn a classifier to make each of these decisions optimally.
The question then becomes how to train such a classifier, and Searn is
one possible solution. In the end, there is no search going
on. All you're doing is making a bunch of sequential decisions,
as if you were performing search, but you aren't
actually searching. So there is nowhere in the training or prediction
algorithm where you will run some search algorithm.
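To make this concrete, here is a minimal sketch (all names hypothetical, not from SearnShell) of what Searn-style prediction looks like for sequence labeling: no search, just one classifier decision per position, each conditioned on the decisions made so far.

```python
# Hypothetical sketch of Searn-style prediction: a sequence of
# classifier decisions, no search algorithm anywhere.

def predict(classifier, sentence):
    """Tag a sentence left to right; each decision is made once and is final."""
    state = []  # tags chosen so far (the "location in the search space")
    for i in range(len(sentence)):
        feats = extract_features(sentence, i, state)
        state.append(classifier(feats))
    return state

def extract_features(sentence, i, state):
    # Toy features: the current word plus the previously predicted tag.
    return (sentence[i], state[-1] if state else "<s>")

# A toy stand-in classifier: tag capitalized words NNP, everything else NN.
def toy_classifier(feats):
    word, _prev_tag = feats
    return "NNP" if word[0].isupper() else "NN"

print(predict(toy_classifier, ["John", "saw", "Mary"]))
# ['NNP', 'NN', 'NNP']
```

The point of the sketch is only that the "decoding" loop is a plain left-to-right pass: there is no beam, no agenda, no backtracking.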
Q:
The Searn algorithm takes an optimal policy as input. Can I believe
that actually the optimal policy is an initial classifier? Could you
tell me how to construct the initial classifier?
A:
When you think about Searn as described in the response to the previous
question, you see that training a classifier to make incremental
decisions requires labeled training data for those incremental
decisions. This is essentially where the optimal policy comes in.
The optimal policy tells us: for a given input (e.g., a sentence), true
output (e.g., the part of speech tags for that sentence) and some location
in the search space (e.g., after 3 words have been tagged), what is the
best action to take? Importantly, this is based on having
access to the true output sequence. For instance, in sequence
labeling under Hamming loss, the optimal policy will always choose
simply to label the next word with its correct label.
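For this Hamming-loss case the optimal policy is a one-liner. Here is a hedged sketch (hypothetical function names, not SearnShell's API) showing why it needs the true output but is not a learned classifier:

```python
# Hypothetical sketch: the optimal policy for sequence labeling under
# Hamming loss. It is computed from the true output, not learned.

def optimal_policy(true_tags, position, state):
    """Given the true tag sequence, the current position, and the tags
    predicted so far (the search-space location), return the action
    that minimizes the remaining Hamming loss."""
    # Hamming loss penalizes each position independently, so the best
    # next action never depends on earlier (possibly wrong) decisions.
    return true_tags[position]

# Example: after 3 words have been tagged (rightly or wrongly), the
# optimal action for the 4th word is just its true tag.
true_tags = ["DT", "NN", "VBD", "DT", "NN"]
print(optimal_policy(true_tags, 3, ["DT", "NN", "VB"]))
# DT
```

Note that `state` is ignored here; for other loss functions (or other search spaces), the optimal action can genuinely depend on the mistakes already made, and computing it becomes harder.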
The optimal policy is not a classifier, in the sense that it is not learned. It is, essentially, exactly what we want our classifiers to learn, since it always makes correct choices. It is up to us, as the algorithm designers, to come up with a way of computing the optimal policy on the basis of the true output. Essentially, what we need to do is to find some way of structuring the search space so that computing an optimal policy for our given loss function is easy. I don't know a magic method for constructing it, but if you look at the examples for sequence labeling, parsing and machine translation, hopefully they will give you some idea.