🧵 Untitled Thread

Anonymous No. 16583303

Why did the machine learning / statistics fields settle on gradient descent (going downwards) rather than hill climbing (going upwards) when the hill climbing analogy for optimization is clearly the more beautiful one?

The idea that the human race conquered Everest, continued to climb ever higher by conquering the moon, then climbed the peaks of the mind is a far more impressive analogy in my opinion

Anonymous No. 16583304

>>16583303
Jensen's inequality. That's really it. If you're solving a minimization problem and you have convexity, Jensen's inequality will make your life better.

Anonymous No. 16583345

because optimization swings back and forth out around the optimum point, like a ball rolling back and forth in a valley. a ball on a peak is unstable and can go careening off in some random direction at any moment. gradient descent is just the more fitting analogy

Anonymous No. 16583346

>>16583345
*or around

Anonymous No. 16583359

>>16583303
ball roll down hill

Anonymous No. 16583390

>>16583303
It's a global miniMUM. Singular. Also, steepest descent finds a local minimum, not necessarily the global minimum.
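
Quick toy sketch of that (made-up one-dimensional function, nothing from this thread): which minimum steepest descent lands in depends entirely on where you start.
[code]
# steepest descent on a nonconvex f(x) = x^4 - 3x^2 + x, which has two minima
def f(x):
    return x**4 - 3*x**2 + x

def grad(x):
    return 4*x**3 - 6*x + 1

def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(-2.0))  # ends near x = -1.30, the global minimum
print(descend(+2.0))  # ends near x = +1.13, a strictly worse local minimum
[/code]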

Anonymous No. 16583864

>>16583303
Just a direction thing
They calculate error and minimize that instead of maximizing -error or 1/error.
This shitfield is just packed with analogies from browns reading success books, it's unreal.

Anonymous No. 16583983

>>16583304
The S'Nesnej inequality does the same thing for the maxima of concave functions.

Anonymous No. 16584015

>>16583983
Yes and no. The inverse Jensen's inequality for concave functions tells you a lower bound on your update step's error, but that's not as useful.

Knowing the "best case" for your next step's change/improvement is way, way less useful than knowing the worst case. In the convex case, you know for certain that if you move along a gradient-related direction by a certain step size, this is the absolute most it can impact your function value in the end.

Anonymous No. 16584046

>>16584015
We aren't referring to the same theorem if it won't do the same thing when you look at it upside down.

Anonymous No. 16584057

>>16583304
>>16584015
I don't think you understand what my thread is asking anon

I am just saying we should flip all the gradient descent algorithms over. the same thing, but just pointing upwards as a general convention rather than downwards.

just like how the x axis is normally used to mean the horizontal axis, gradient descent should be flipped to be ascending instead.

Anonymous No. 16584075

>>16584046
This is Jensen's inequality, for a convex g(x) and t in [0,1]:
[math]
g(tx + (1-t)y) \leq t g(x) + (1-t) g(y)
[/math]

This tells you that if your function is convex, the value g can take as you move along the chord connecting x and y is bounded above by the right-hand side. Thus, if you move by some fixed step size, you know the maximum possible value that g(x) can take within that fixed step size.

Now, consider what it means if you're going the other way. With g(x) as a concave function you have:
[math]
g(tx + (1-t)y) \geq t g(x) + (1-t) g(y)
[/math]

This tells you the minimum change that can happen when you move along that chord. It's not as useful. Your concave problem would be faster to solve via the convex dual, because there you can bound your changes and select your step sizes in a smart way, knowing the maximum difference each step can induce.
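
If it helps, here's a throwaway numeric check of the convex form (toy g, not anything from this thread): the chord upper-bounds the function everywhere between x and y.
[code]
# sanity check of Jensen's inequality for a convex g along one chord
import numpy as np

g = lambda x: x**2              # convex
x, y = -1.0, 3.0
for t in np.linspace(0.0, 1.0, 6):
    lhs = g(t*x + (1 - t)*y)        # function value on the chord
    rhs = t*g(x) + (1 - t)*g(y)     # linear interpolation of the endpoint values
    assert lhs <= rhs + 1e-12       # convex case: bounded above by the interpolation
    print(f"t={t:.1f}  g(point)={lhs:6.3f} <= interp={rhs:6.3f}")
[/code]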

>>16584057
I'm saying you should stop talking shit until you've spent some time studying non-linear programming.

There's a reason that function minimization is the standard, and your bullshit aspirational just-so-story justification doesn't change that. Hell, one of the main reasons people study duality theory in the first place is to recast concave maximization problems as their convex duals for a more stable solution.

Anonymous No. 16584078

>>16584075
>telling me to study when you aren't understanding a very simple concept

IF IT WORKS DOWN
IT WORKS UP
ITS JUST FLIPPED MORON

Anonymous No. 16584083

>>16584078
Flipping the inequality tells you something different. It literally works differently: the flipped version tells you the minimum difference between the linear approximation and the real function.

That minimum difference isn't as useful to know for a real problem. You want your algorithm to converge to a point, meaning you want the changes between the steps to decrease in a controlled fashion. That doesn't happen via the concave case.

You can do a gradient ascent approach, but gradient descent bounds your maximum error in a way that allows for guaranteed numerical convergence.

Anonymous No. 16584094

>>16584083
>>16584075
damn. claude is saying you have a point apparently

Anonymous No. 16584109

>>16584094
My God you're a moron. Literally just spend 20 minutes reading any book on non-linear programming or convex analysis and you'd get your answer in the first chapter. It's not that hard.

Anonymous No. 16584295

>>16584109
I'm a different guy from the one arguing with you before. I'm taking the side he held before he caved. If g is a concave function, then -g is a convex function, and I can just use the convex version of Jensen's inequality to do gradient descent on -g, which is equivalent to gradient ascent on g. It really doesn't matter. The whole "this gives the maximum change, which guarantees convergence, doesn't work with the concave version" counterargument doesn't take into consideration the sign of the output of g(x). A lower bound on a negatively valued g yields the same convergence consequences as an upper bound on a positively valued g.
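
For what it's worth, a tiny sketch of that equivalence with a made-up concave g (not from this thread): descent on -g and ascent on g produce the same iterates, step for step.
[code]
# gradient descent on -g vs gradient ascent on g, toy concave g(x) = -(x - 2)^2
def grad_g(x):
    return -2.0 * (x - 2.0)           # gradient of the concave g

lr, x_down, x_up = 0.1, 5.0, 5.0
for _ in range(100):
    x_down -= lr * (-grad_g(x_down))  # descent step on -g (its gradient is -grad_g)
    x_up   += lr * grad_g(x_up)       # ascent step on g
    assert x_down == x_up             # identical iterate sequences

print(x_up)   # ~2.0, the maximizer of g / minimizer of -g
[/code]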

Anonymous No. 16584298

>>16584094
Don't believe everything the robot tells you bro. You were right.

Anonymous No. 16584307

entropy maximized when gradient of log probability is stationary for all outcomes. directly observable when you check the derivatives

Anonymous No. 16584310

>>16583303
OP, I think the reason they chose descent as the convention is that we want to think of the model parameter state space as a sort of potential energy field, so we reduce potential energy until we sit in the valley. It is quite natural for a physical system to settle into a local minimum.
Or like this guy said:
>>16583359
>ball roll down hill

Anonymous No. 16584626

>>16584295
>>16584298
On paper, I agree with you.

g being a continuous concave function means -g is convex and you're golden. In practice, I don't think it works out that cleanly unless your objective/reward functions are unconstrained and quadratic. If you have some cleanly analytic function g (say the likelihood of a parameter conditioned on some observation), maximizing that likelihood is the same as minimizing the negative likelihood.

In real circumstances it is entirely possible that your "least wrong" and "most right" answers are not exactly the same thing (i.e., there is a duality gap). It is generally more stable to formulate a meaningful "least wrong" kind of problem, since for continuous problems the lower bound of the error is generally zero, while the upper bound of the reward is not necessarily obvious and may not have any non-boundary solutions. You can generally formulate a "least wrong" sort of problem in a more numerically stable fashion than a "most right" one.

This is why even the most common use case for gradient ascent (reinforcement learning) is really minimizing a loss function on a network for the optimization step, and then recasting that minimization as a maximized reward function by assuming no duality gap.

On paper, any constrained "least wrong" problem can be equivalently formulated as a "most right" problem. In practice, I think it is generally more straightforward to produce a stable constrained minimum-loss formulation than a constrained maximum-gain one (which may not even have interior-point solutions in an analytical sense).
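
To make the "minimize the negative likelihood" recasting concrete, a small sketch with made-up Gaussian data (toy numbers, not from this thread): gradient descent on the negative log-likelihood of the mean recovers the maximum-likelihood estimate, i.e. the sample mean.
[code]
# maximizing a Gaussian likelihood over the mean == minimizing the negative log-likelihood
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.7, scale=0.5, size=1000)
sigma = 0.5

def nll_grad(mu):
    # d/dmu of sum((x - mu)^2) / (2 sigma^2), the NLL up to an additive constant
    return -np.sum(data - mu) / sigma**2

mu = 0.0
for _ in range(2000):
    mu -= 1e-4 * nll_grad(mu)        # plain gradient descent on the loss

print(mu, data.mean())   # both ~1.7: the loss minimizer is the ML estimate
[/code]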

Anonymous No. 16584996

>>16584626
elaborate troll. long-winded repetition of baseless claim. minimizing g() is EXACTLY the same thing as maximizing -g().
the reason we are minimizing is intuition: most shit we try to minimize is some sort of error, which is both positive and should be as small as possible.

Anonymous No. 16584999

>>16584626
I'm not an expert in linear programming, but as I understand it, the "duality gap" has more to do with the duality that occurs when you turn variables from the loss formula into constraints, and constraints into variables, in the (anti)loss function. But that ain't the same as just flipping the whole thing upside down. That's just rephrasing the Lagrangian optimization problem [math] \nabla f = \lambda \nabla g [/math] as [math] \nabla g = \frac{1}{\lambda} \nabla f [/math] rather than [math] \nabla (-f) = \lambda \nabla (-g) [/math].

Anonymous No. 16585005

>>16583303
Why did cs fags settle on gradient descent instead of one-shot optimization by setting the derivative equal to zero?

Anonymous No. 16585017

>>16585005
Because we don't actually have a global analytic representation of the loss landscape over model parameter space. We can only check the value of loss at a point by running the model there, and we only know how to improve from there by checking the local gradient.

Turns out CS fags know what they're doing
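
A throwaway sketch of that point (made-up loss, nothing from this thread): the loss is only queryable at points, so the loop below estimates the gradient by finite differences and walks downhill. There is no "solve grad = 0 in closed form" step available.
[code]
# treat the loss as a black box we can only evaluate pointwise
import numpy as np

def loss(theta):
    # stand-in for "run the model with parameters theta and score it"
    return float(np.sum((theta - np.array([3.0, -1.0]))**2) + 1.0)

def finite_diff_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.zeros(2)
for _ in range(200):
    theta -= 0.1 * finite_diff_grad(loss, theta)   # local information only

print(theta)   # ~[3, -1], found by repeated point queries, not by an analytic solve
[/code]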

Anonymous No. 16585023

>>16584999
I think I made a mistake. maybe the duality gap is for [math] \nabla \frac 1 g = \lambda \nabla \frac 1 f [/math]

Anonymous No. 16585040

>>16585017
> global analytic representation of the loss landscape over model parameter space

In fact you don't need one. Augment every transfer function from t_0 = 0 to T-1 as a constraint on the objective function. Take first-order conditions and obtain a second-order linear difference equation. Solve it to find the control at every t. It is both necessary and sufficient by Slater's condition.

Anonymous No. 16585063

>>16584996
You don't know what the optimal g function is for real problems. This is why I say "on paper."

It is much more stable to define some error or loss function which you know to be bounded below by zero and say "minimize this subject to constraint functions h_1, h_2, ..., h_n" than it is to define some ad hoc "reward function" which you want to maximize. In real-world problems there isn't some universal reward function which is clearly well defined to maximize. You have to define one, and your user-defined function might not be stable or have interior-point solutions. The hard part is the problem formulation, and it's easier to form numerically stable loss functions than reward functions (which is why, even in reinforcement learning, the actual training is done by minimizing a loss).

>>16584999
Duality gaps are more of a thing in non-linear programming than in linear because the constraints are not necessarily linear and the feasible set is not necessarily convex. It is very, very easy to construct a case where maximization of a state-constrained reward function is not equivalent to minimization of the dual (the state-constrained loss). Reward functions are not just -g(x) where g(x) is the loss function. They are the induced dual of the loss function on the constrained space.

If the loss function is convex and the induced dual is concave (which does happen for any particular concave log-likelihood function and the induced negative log-likelihood loss function), then maximizing your reward is the same as minimizing your loss. If the induced reward is not simply the negative of the loss, then there is no guarantee you won't have a duality gap.

As an example, consider the case of minimizing the mean-square error of some estimate while your model has mismatch (i.e., you don't know exactly how the true data is distributed). It's entirely possible for the minimum-MSE solution to not be the same as the maximum-likelihood solution under this mismatch.

Anonymous No. 16585067

>>16585063

>Reward functions are not just -g(x) where g(x) is the loss function. They are the induced dual of the loss function on the constrained space.

This is why you have a duality gap. You absolutely could take reward function [math]-g[/math] and then the two problems are isomorphic. Not sure why you keep trying to argue this point.

Anonymous No. 16585124

>>16583303
Because short-term profits are the most valuable.

Anonymous No. 16585128

>>16583304
>>16584057
>>16584075
>>16584078
>>16584083
>>16584094
>>16584109
>>16584295
>>16584996
this thread is fucking glorious

Anonymous No. 16585138

>>16585005
They're from the church of the subgradient.

Anonymous No. 16585259

>>16585138
But how would you know how to orient your life if you didn't know whether the direction you're already going is gradient related?

Anonymous No. 16587322

>>16585138
indeed...
https://www.youtube.com/watch?v=Qt9MP70ODNw

Anonymous No. 16589422

>>16585128
Stay tuned, anon...

Anonymous No. 16592010

>>16583303
there are many good answers; I'm going to come at it from a different angle. Using Newton's method, the simplest "original" template for numeric optimization (an even older method is the Babylonian one, but that's very specifically applied to root finding), you seek to shrink the error until it gets within a specified tolerance level, which will usually be something like 0.0000001. Normally you evaluate the absolute value of the error, which is x_k - x, so in this particular case you are decreasing the error (hopefully), similar to going downhill, and you achieve convergence.

gradient descent is a slightly different method but it works in pretty much the same way, only with more exotic convex optimization assumptions
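
A minimal sketch of that loop (toy function, e.g. finding sqrt(2) as the root of x^2 - 2, not anything specific from this thread): iterate until the update falls below the tolerance.
[code]
# Newton's method iterated until the step size clears a fixed tolerance
def newton(f, df, x, tol=1e-7, max_iter=100):
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:          # the "error" shrinks each pass until it clears tol
            return x
    return x

print(newton(lambda x: x**2 - 2, lambda x: 2*x, x=1.0))   # ~1.41421356, i.e. sqrt(2)
[/code]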

Anonymous No. 16596351

>>16583303
Well, you can ask Sisyphus about that. He has plenty of experience.

Anonymous No. 16596366

>>16583303
They work with a lot of quadratic functions, for example variance (code for error). It's a function that's bounded below by zero and unbounded above. Minimizing it makes sense as a way to minimize error.

Anonymous No. 16596567

They minimize entropy -x log x when they could be maximizing order x log x.