
Serialization and Online learning

Open nonamestreet opened this issue 8 years ago • 5 comments

Hi

I saw the "Add model serialization" is on the Trello to do list.

If I can serialize the model, can I just reload the old model and continue training on the new interaction data? I guess there would be a learning-rate problem, at least with the Adam optimizer. What do you do in practice? Can you recommend something to read?

Thank you!

nonamestreet avatar Aug 17 '17 14:08 nonamestreet

This should be correct. I am working on tests to make sure that this really is true.

In principle the parameters of the optimizer will get serialized as well, so there should be no problem in resuming training.

maciejkula avatar Aug 17 '17 21:08 maciejkula
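
A minimal sketch of what save-and-resume could look like in plain PyTorch (the toy model, optimizer, file name, and data below are stand-ins, not Spotlight's API):

```python
import torch
from torch import nn

# Toy stand-ins: in Spotlight this would be the factorization model
# and the optimizer used during fitting.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# After the initial fit, save both state dicts. The optimizer's
# state dict carries Adam's per-parameter moment estimates, so
# training can pick up where it left off.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Later: rebuild identical objects and restore both state dicts.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
state = torch.load("checkpoint.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])

# Continue training on the new interactions as usual.
x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for new data
optimizer.zero_grad()
nn.functional.mse_loss(model(x), y).backward()
optimizer.step()
```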

Thank you for the reply! I think optimizers like plain SGD would carry over easily. With optimizers like Adam, however, the learning rate for each parameter is adjusted according to the history of its gradients, so the updates to existing parameters could become quite small. I'm not sure that's the desired behavior when the new interaction data matters more than the historical data.

nonamestreet avatar Aug 18 '17 01:08 nonamestreet

Certainly for Adagrad the learning rate goes to zero as the number of training examples gets large. I'm less sure this is true of Adam: I suspect it may converge to some small but non-zero value.
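
A small standalone sketch (not Spotlight code) makes the contrast concrete: Adagrad divides by the square root of an unbounded running sum of squared gradients, while Adam divides by a bias-corrected exponential moving average, which stays bounded. Under a constant gradient, only Adagrad's step size decays to zero:

```python
import math

lr, beta2, eps, g = 0.1, 0.999, 1e-8, 1.0  # constant gradient to expose the schedules

adagrad_acc, adam_v = 0.0, 0.0
for t in range(1, 100_001):
    adagrad_acc += g * g                           # unbounded running sum
    adam_v = beta2 * adam_v + (1 - beta2) * g * g  # bounded moving average
    if t in (1, 100, 10_000, 100_000):
        adagrad_step = lr / (math.sqrt(adagrad_acc) + eps)
        v_hat = adam_v / (1 - beta2 ** t)          # Adam's bias correction
        adam_step = lr / (math.sqrt(v_hat) + eps)
        print(f"t={t:>6}  adagrad={adagrad_step:.5f}  adam={adam_step:.5f}")
```

Here Adagrad's step shrinks like lr/sqrt(t), while Adam's settles near lr/|g|.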

I think this reflects the fact that not a lot of applications run true online models, where the parameters are updated as the data comes in. It's much more common to fit once, publish, and retrain from scratch once new data is available.

You may also be interested in the literature on SGD with restarts. I haven't followed it closely, but it seems to have some intriguing results.

maciejkula avatar Aug 18 '17 19:08 maciejkula

@maciejkula Optimizers like FTRL seem to be useful for online learning in recommender systems, I think because they heavily regularize the weights. I can't find a PyTorch implementation yet. Do you have any experience with them?

nonamestreet avatar Aug 28 '17 04:08 nonamestreet
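
For reference, the per-coordinate FTRL-Proximal update from McMahan et al. (2013) can be sketched in a few lines of numpy. The hypothetical class below is illustrative, not a PyTorch optimizer; the L1 threshold is what keeps most weights at exactly zero, which is the heavy regularization mentioned above:

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-Proximal, after McMahan et al. (2013). A sketch."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)  # accumulated adjusted gradients
        self.n = np.zeros(dim)  # accumulated squared gradients

    def weights(self):
        # Closed form of the proximal step: coordinates whose |z| never
        # exceeds l1 are held at exactly zero, which induces sparsity.
        w = np.zeros_like(self.z)
        active = np.abs(self.z) > self.l1
        w[active] = -(self.z[active] - np.sign(self.z[active]) * self.l1) / (
            (self.beta + np.sqrt(self.n[active])) / self.alpha + self.l2
        )
        return w

    def update(self, grad):
        # grad: gradient of the loss at the weights returned by weights().
        sigma = (np.sqrt(self.n + grad ** 2) - np.sqrt(self.n)) / self.alpha
        self.z += grad - sigma * self.weights()
        self.n += grad ** 2
```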

Not really. While I think Adagrad is a poor choice (its learning rate goes to zero), I suspect Adam, SGD, and SGD with momentum will all work quite well in this setting.

You could verify this by plotting the optimizer's effective learning rates as you fit the model on more and more data.

maciejkula avatar Aug 28 '17 07:08 maciejkula
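
A sketch of what that inspection could look like for PyTorch's stock Adam, assuming its internal state keeps the step counter `step` and the second-moment buffer `exp_avg_sq` (true for `torch.optim.Adam`, but these are internals and may change across versions):

```python
import torch

def mean_adam_step_size(optimizer):
    """Average per-coordinate step size lr / (sqrt(v_hat) + eps).

    A diagnostic approximation of Adam's update magnitude, ignoring the
    first-moment bias correction. Reads torch.optim.Adam internals.
    """
    sizes = []
    for group in optimizer.param_groups:
        lr, (_, beta2), eps = group["lr"], group["betas"], group["eps"]
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg_sq" not in state:
                continue  # parameter has not been updated yet
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
            sizes.append((lr / (v_hat.sqrt() + eps)).mean().item())
    return sum(sizes) / len(sizes) if sizes else float("nan")
```

Called after each incremental fit, the returned values can be plotted over time to see whether the effective step sizes collapse toward zero or stabilize.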