convoluted connor @notwa@witches.town
Suivre

@hjkl
momentum is usually high: 0.9, or 0.7 if that doesn't work. with gradient clipping (clipped at 1.0), learning rate can be higher than usual. i usually start at 1.0 and do quick tests down exponentially: 1.0, 0.32, 0.1, 0.032, etc.

something worth noting is that momentum acts as a boost for learning rate at DC and low frequencies, so you wind up with 1/(1-mu) times more learning rate than you asked for. i believe this is why Adam's default learning rate is usually a tiny 0.001 or 0.002.