# Lagrange Multipliers

Below is a nice explanation of Lagrange multipliers by Jason Eisner (posted with permission).

The traditional presentation is a bit different (and in my view less fundamental, but still useful). It assumes that everything is differentiable, and re-encodes the g(x)=c constraint as a statement that a particular derivative equals 0. This leads to a nice geometric intuition.

Bare bones statement: http://mathworld.wolfram.com/LagrangeMultiplier.html

Nice intuitive explanation: http://www.slimy.com/~steuard/tutorials/Lagrange.html

We wish to maximize a function f(x) under a constraint that g(x) = c.
For example, perhaps x is a vector representing the parameters of some probability distribution. We are trying to find the distribution that maximizes the likelihood of some data, under the constraint that the components of the distribution sum to 1. Or we are trying to find the distribution that has maximum entropy, under the constraint that it corresponds to observed data in certain respects.
To do this, we introduce a new parameter L (lambda), solve an easier unconstrained problem for each value of L, and then choose L so that the constraint comes out right.

For each L, we find the x that maximizes f(x) + Lg(x). Call this x(L). This can be found by ordinary unconstrained maximization (e.g., set the derivative of f(x) + Lg(x) to zero).
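For the maximum-entropy example above, this inner maximization can be done in closed form. Here is a small Python sketch (pure standard library; the function names are my own, for illustration): setting the derivative of -Σ p_i log p_i + L·Σ p_i to zero gives p_i = exp(L - 1), and a quick numerical spot-check confirms that nearby points score worse.

```python
import math

def x_of_L(L, n):
    """Maximizer of entropy(p) + L * sum(p) over p_i > 0 (entropy example).

    Setting d/dp_i [-p_i log p_i + L p_i] = -log p_i - 1 + L = 0
    gives p_i = exp(L - 1) for every i.
    """
    return [math.exp(L - 1.0)] * n

def objective(p, L):
    """f(p) + L*g(p), where f is entropy and g(p) = sum(p)."""
    return -sum(pi * math.log(pi) for pi in p) + L * sum(p)

# For a fixed L, x(L) beats nearby points -- a quick numerical check.
L, n = 0.5, 4
p_star = x_of_L(L, n)
best = objective(p_star, L)
for eps in (0.01, -0.01):
    p_near = [pi + eps for pi in p_star]
    assert objective(p_near, L) < best
print(p_star, best)
```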

The key observation -- a simple one -- is that if we restrict x to a subset of the space on which g(x) is constant (at c or otherwise), then f(x) + Lg(x) differs from f(x) by a constant, and they have the same maximum. THEREFORE x(L) maximizes f(x) under the constraint that g(x)=g(x(L)), i.e., on its particular "tier" of g.

(More formally: if there were an x1 giving the same value of g (i.e., g(x1) = g(x(L))) and a higher value of f (i.e., f(x1) > f(x(L))), then f(x) + Lg(x) would be higher at x1 than at x(L), contradicting the maximality of x(L).)

If g(x(L)) happens to equal c, then we've won -- x(L) is our desired optimum. So the last step is to find L such that g(x(L)) = c. Then x(L) is our answer.
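Continuing the entropy example: there g(x(L)) = n·exp(L - 1), which is monotone in L, so the L with g(x(L)) = c = 1 could be written down in closed form, but a bisection search (sketched below; the function names are mine) is the way this last step usually gets done in practice.

```python
import math

def g_of_L(L, n=4):
    """g(x(L)) for the entropy example: the maximizer of
    entropy + L * sum(p) is p_i = exp(L - 1), so g(x(L)) = n * exp(L - 1)."""
    return n * math.exp(L - 1.0)

def solve_for_L(c, n=4, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisect for the L with g(x(L)) = c; this works because
    g(x(L)) is increasing in L here."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g_of_L(mid, n) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

L = solve_for_L(1.0)          # constraint: probabilities sum to 1
p = [math.exp(L - 1.0)] * 4
print(L, p)                   # p is the uniform distribution [0.25]*4
```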

Note: If g is vector valued, just make L a vector and interpret Lg(x) as a dot product. This is the same as introducing several parameters L_1, L_2, etc. for the multiple constraints g(x)_1 = c_1, g(x)_2 = c_2, etc.
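As a concrete instance of a vector-valued constraint, take the second example from the start: the maximum-entropy distribution that matches an observed mean. One multiplier handles sum-to-one (it gets absorbed into the normalizer), and a second handles the mean. The sketch below is my own illustration, not from the text; it solves for the second multiplier by bisection, which works because the resulting mean is increasing in that multiplier.

```python
import math

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Max-entropy distribution over `values` with a prescribed mean.

    With two constraints (sum to 1, mean = target), stationarity gives
    p_i proportional to exp(L2 * values[i]); the multiplier L1 on the
    sum-to-1 constraint is absorbed into the normalizing constant Z.
    """
    def mean_at(L2):
        w = [math.exp(L2 * v) for v in values]
        return sum(v * wi for v, wi in zip(values, w)) / sum(w)

    # The mean is increasing in L2, so bisect on L2.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_at(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    L2 = 0.5 * (lo + hi)
    w = [math.exp(L2 * v) for v in values]
    Z = sum(w)
    return [wi / Z for wi in w]

p = maxent_with_mean([1, 2, 3, 4], 3.0)
print(p, sum(v * pi for v, pi in zip([1, 2, 3, 4], p)))
```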

The above discussion is slightly simplified since it supposes that x(L) is unique. It is possible for f(x) + Lg(x) to have multiple maxima, as well as saddle points and points of inflection. It may be necessary to consider all of these multiple x(L) values in order to find one such that g(x(L)) = c. The correct general statement is that

x0 is a critical point of f(x) subject to g(x)=c iff for some L, x0 is a critical point of f(x)+Lg(x) and g(x0)=c.

The following analogy may be helpful: Governments often use taxes as Lagrange multipliers! Read on:

How much gasoline I buy affects my happiness. (If I buy too little gasoline then I can't go anywhere, but if I buy too much then I don't have money left to eat.) Let's measure net happiness in dollars: the benefit to me minus the cost of the gas.

If x is a vector giving each person's annual gasoline consumption, let f(x) be the total net effect on the population's happiness. f(x) is maximized when each person separately buys the amount of gasoline that makes her happiest. Unfortunately, then the total gas consumption g(x) is too high, causing pollution.

To keep g(x)=c while still making happiness f(x) as high as possible, impose a gasoline tax. If the tax is $20/gallon, people's free choices will maximize not f(x) but rather f(x) + (-20) g(x). People who like to drive still buy more gas than people who don't, but everyone buys less than s/he did before.

By adjusting the size of the tax (the Lagrange multiplier), the government can indirectly adjust total consumption g(x) until it is at the desired level, g(x)=c. In general the right tax cannot be computed from c in advance; it has to be found by adjustment.
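That adjustment can be simulated. In the sketch below the happiness functions are entirely made up (quadratic: b·x - x²/2, so each person's demand under tax t is simply max(b - t, 0)); the government's search for the right tax then becomes a one-dimensional root-finding problem.

```python
def demand(b, t):
    """Gallons bought by a person with taste parameter b under tax t,
    assuming (hypothetically) happiness b*x - x**2/2 - t*x:
    maximizing over x gives x = b - t, floored at 0."""
    return max(b - t, 0.0)

def total_consumption(tastes, t):
    return sum(demand(b, t) for b in tastes)

def find_tax(tastes, c, lo=0.0, hi=None, tol=1e-9):
    """Adjust the tax (the Lagrange multiplier, with L = -t) until
    total consumption g(x) hits the target c.  Consumption is
    decreasing in t, so bisection works."""
    if hi is None:
        hi = max(tastes)          # at t = max(b), nobody buys anything
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_consumption(tastes, mid) > c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tastes = [30.0, 45.0, 60.0]       # hypothetical per-person taste for driving
t = find_tax(tastes, 75.0)        # cap total consumption at 75 gallons
print(t, total_consumption(tastes, t))   # t comes out to 20 here
```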