I believe I've implemented the optimizer described in https://arxiv.org/abs/1712.03298. It seems to have comparable performance to Nesterov momentum with gradient clipping, which is my usual go-to when Adam doesn't work.
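
For anyone wondering what the baseline looks like, here's a rough sketch of Nesterov momentum with gradient clipping in plain Python. This is not the optimizer from the paper, and it's not my actual training code. The function names (`clip_by_norm`, `nesterov_step`), the hyperparameter values, and the toy quadratic objective are all made up for illustration, and clipping by global L2 norm is just one common choice.

```python
import math

def clip_by_norm(grad, max_norm):
    # Rescale the gradient so its L2 norm is at most max_norm (assumed clipping rule).
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)

def nesterov_step(params, velocity, grad_fn, lr=0.1, momentum=0.9, max_norm=1.0):
    # Evaluate the gradient at the look-ahead point, not at the current params.
    lookahead = [p + momentum * v for p, v in zip(params, velocity)]
    grad = clip_by_norm(grad_fn(lookahead), max_norm)
    # Standard momentum update using the clipped look-ahead gradient.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_params = [p + nv for p, nv in zip(params, new_velocity)]
    return new_params, new_velocity

def quadratic_grad(x):
    # Gradient of the toy objective f(x) = sum(x_i ** 2); hypothetical test problem.
    return [2.0 * xi for xi in x]

params, velocity = [5.0, -3.0], [0.0, 0.0]
for _ in range(200):
    params, velocity = nesterov_step(params, velocity, quadratic_grad)
```

On this toy problem the clipping only matters for the first handful of steps, while the gradient norm is still above `max_norm`; after that it behaves like ordinary Nesterov momentum.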