convoluted connor changed accounts to @notwa@cybre.space:

convoluted connor @notwa@witches.town

there are so many lasers in this trackmogrify track that they don't even all render: insanelyhazardousinsanelylaserinsanelyextremedownhillzebrafog

@Skirmisher ahh okay, thank you! i got it to work, but now i'm wondering if i can get a highscore with this :P

wow uh
the manor map is totally broken. i keep falling through the floor after the intro section. is this happening to you too, @Skirmisher ? does anyone else play this game?

you already know what it isssssssssssss

(or maybe you don't, but i find this expression too amusing) witches.town/media/nPgVK_P4cDo

tempted to write an autohotkey script that outright prevents me from typing "though"

i kinda tune out when papers start talking about Hessians in deep learning.

@hjkl side note, the momentum-boosting-learning-rate thing is my own idea; i'm not sure how well it holds in practice. but if you treat the momentum equation as an LTI system, you see its magnitude response has a DC gain of 1/(1-mu), as i stated.

for fun, i've tried implementing a second-order filter as an optimizer, but i personally couldn't get anything better than a well-tuned traditional momentum optimizer.
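
[a hedged sketch of what "a second-order filter as an optimizer" could look like: a two-pole IIR recurrence on the gradient used as the update direction. plain momentum is the one-pole special case (a2 = 0). the coefficients and names here are made up for illustration, not taken from the original experiment.]

```python
# sketch: two-pole ("second-order") IIR filter on the gradient, used as
# the update direction. with a2 = 0 this reduces to classical momentum.
# a1 = 0.9, a2 = -0.2 put both poles inside the unit circle (stable).
def second_order_step(w, v1, v2, grad, lr=0.02, a1=0.9, a2=-0.2):
    v = a1 * v1 + a2 * v2 + grad  # second-order recurrence on gradients
    return w - lr * v, v, v1      # new weight, shifted filter history

# toy run: minimize f(w) = w**2 (gradient 2*w) starting from w = 1.0
w, v1, v2 = 1.0, 0.0, 0.0
for _ in range(300):
    w, v1, v2 = second_order_step(w, v1, v2, 2 * w)
print(abs(w) < 1e-3)  # True: it settles for these (stable) coefficients
```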

@hjkl yeah, i'm aware of Lipschitz and the like, but most of my experience is honestly just tweaking numbers, trying ideas, and implementing any paper that interests me. i personally find it easier to try things than to theorize about them.

in my mind, deep learning is more like a highly unstable system trying to settle than anything like convex optimization. a lot of modern techniques seem to be based on simple intuition instead of pages of proof — compare resnets to SELU. just some random thoughts.

@hjkl
momentum is usually high: 0.9, or 0.7 if that doesn't work. with gradient clipping (clipped at 1.0), the learning rate can be higher than usual. i usually start at 1.0 and sweep down exponentially in quick tests: 1.0, 0.32, 0.1, 0.032, etc.
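
[that "down exponentially" sweep is just half-decade steps, dividing by sqrt(10) each time; a throwaway snippet to generate it, rounded to two significant figures the way i write them:]

```python
# learning-rate grid in half-decade steps: divide by sqrt(10) each time,
# then round to 2 significant figures for readability
lrs = [float(f"{10 ** (-i / 2):.2g}") for i in range(5)]
print(lrs)  # [1.0, 0.32, 0.1, 0.032, 0.01]
```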

something worth noting is that momentum acts as a boost to the learning rate at DC and low frequencies, so you wind up with 1/(1-mu) times the learning rate you asked for. i believe this is why Adam's default learning rate is a tiny 0.001 or 0.002.
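
[a quick numeric check of that 1/(1-mu) figure: drive the classical momentum recurrence with a constant (DC) gradient and the velocity settles at 1/(1-mu) times it. minimal sketch; the function name is mine.]

```python
# classical momentum driven by a constant gradient: v <- mu*v + g.
# at steady state v = mu*v + g, so v = g / (1 - mu): the effective
# step is 1/(1-mu) times what the raw learning rate suggests.
def momentum_dc_gain(mu, g=1.0, steps=1000):
    v = 0.0
    for _ in range(steps):
        v = mu * v + g
    return v / g  # empirical DC gain

print(momentum_dc_gain(0.9))  # -> ~10.0, i.e. 1 / (1 - 0.9)
```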

yume nikki is on steam now i guess

i believe i've implemented the optimizer described in: arxiv.org/abs/1712.03298
it seems to have comparable performance to Nesterov momentum with gradient clipping, which is my usual go-to when Adam doesn't work.
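
[the posts don't say whether "clipped at 1.0" means clipping by value or by norm; as an assumption, here's clip-by-norm at 1.0, the common variant. the function name is mine, not from any library.]

```python
import math

# scale a gradient vector down so its L2 norm is at most max_norm (1.0);
# gradients already within the norm pass through unchanged
def clip_grad(grad, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return grad

print(clip_grad([3.0, 4.0]))  # -> [0.6, 0.8], norm 5.0 clipped to 1.0
```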

i need to stop using 'testing' and 'experimenting' interchangeably when i'm programming

just sci-hub'd the fuck out of a paper for the first time, feels good

had an intricate and nonsensical dream. today's song playing in my head as i woke up is nomad by dfa1979

oh, i forgot a few points about the graph: this is fashion mnist with a 3-layer FC network (no convolution), and validation data is only measured once per epoch. the batch performance is slightly different from actual training-data performance, but it's ~close enough~

this dropout-like thing i'm experimenting with keeps loss and accuracy values pretty close across the training and validation sets. unfortunately, overall performance on the validation set is slightly degraded.
witches.town/media/ICB8OCP0smG

somehow, having a song stuck in my head and trivially being able to bring it up and play it never gets old.

today i woke up with this song, specifically the riff around 3:12
youtube.com/watch?v=19zFHmz1F6

err when i say original i mean not the same 100 reposts you might see on birdsite