i believe i've implemented the optimizer described in: https://arxiv.org/abs/1712.03298
it seems to have comparable performance to Nesterov momentum with gradient clipping, which is my usual go-to when Adam doesn't work.
@notwa How do you pick the constants for Nesterov?
@hjkl
momentum is usually high: 0.9, or 0.7 if that doesn't work. with gradient clipping (gradient norm clipped at 1.0), the learning rate can be higher than usual. i usually start at 1.0 and do quick tests downward exponentially: 1.0, 0.32, 0.1, 0.032, etc.
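a minimal numpy sketch of that setup, for the curious — the quadratic loss and the step count here are just for illustration:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9, clip=1.0):
    # evaluate the gradient at the lookahead point w + mu*v (Nesterov)
    g = grad_fn(w + mu * v)
    # clip the gradient to a maximum L2 norm of `clip`
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)
    v = mu * v - lr * g
    return w + v, v

# quick demo on a toy quadratic loss L(w) = 0.5*||w||^2, so grad = w
grad = lambda w: w
w = np.array([3.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = nesterov_step(w, v, grad, lr=0.1, mu=0.9)
# w is now very close to the minimum at the origin
```

to sweep the learning rate, just rerun the loop with lr in (1.0, 0.32, 0.1, 0.032, ...) and keep whichever converges fastest without blowing up.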
something worth noting is that momentum acts as a boost for the learning rate at DC and low frequencies, so you wind up with 1/(1-mu) times more effective learning rate than you asked for. i believe this is why Adam's default learning rate is such a tiny 0.001 or 0.002.
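a quick numeric check of that 1/(1-mu) claim, assuming the plain accumulator form v ← mu·v + g (some implementations scale the gradient term differently):

```python
# feed a constant "gradient" of 1.0 through the momentum accumulator;
# the steady-state output is 1/(1-mu), i.e. the DC gain
mu = 0.9
v = 0.0
for _ in range(1000):
    v = mu * v + 1.0
# v has converged to 1/(1-0.9) = 10, a 10x boost over the raw gradient
```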
@notwa Thanks! I didn't know about gradient clipping. Of course, if you knew the Lipschitz constant of the loss gradient (I think) you could pick the values so that convergence is guaranteed. Obviously that's impossible with deep learning, though.
@hjkl side note, the momentum-boosting-learning-rate thing is my own idea; i'm not sure how well it holds in practice. but when you consider the momentum equation as an LTI system, its magnitude response has a DC gain of 1/(1-mu), as i stated.
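spelling that out: treating the accumulator $v_t = \mu v_{t-1} + g_t$ as an LTI filter on the gradient signal and taking the z-transform,

```latex
% momentum accumulator as an LTI filter: v_t = \mu v_{t-1} + g_t
V(z) = \mu z^{-1} V(z) + G(z)
\quad\Rightarrow\quad
H(z) = \frac{V(z)}{G(z)} = \frac{1}{1 - \mu z^{-1}}
% evaluating on the unit circle at DC (z = 1):
|H(1)| = \frac{1}{1 - \mu}
```

so at DC the filter passes 1/(1-mu) times the input, which multiplies straight into the learning rate.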
for fun, i've tried implementing a second-order filter as an optimizer, but i couldn't personally manage anything better than a traditional well-tuned momentum optimizer.
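one possible shape of such a thing, for the curious: a two-real-pole IIR lowpass on the gradient, DC-normalized so it doesn't also boost the learning rate. this is just a sketch of the general idea, not necessarily the exact filter i tried:

```python
import numpy as np

def second_order_step(w, state, grad_fn, lr=0.1, p=0.9):
    # gradient filtered by a second-order IIR lowpass with both poles
    # at z = p; b0 = (1-p)^2 normalizes the DC gain to exactly 1,
    # unlike plain momentum's 1/(1-mu) boost
    v1, v2 = state  # the two previous filter outputs
    g = grad_fn(w)
    b0 = (1.0 - p) ** 2
    v = b0 * g + 2.0 * p * v1 - p * p * v2
    return w - lr * v, (v, v1)

# sanity check on a toy quadratic loss L(w) = 0.5*||w||^2 (illustrative only)
grad = lambda w: w
w = np.array([3.0, -2.0])
state = (np.zeros_like(w), np.zeros_like(w))
for _ in range(1000):
    w, state = second_order_step(w, state, grad)
# w converges to the origin, though not faster than tuned momentum here
```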