By default, OpenNMT saves a checkpoint every 5000 iterations and at the end of each epoch. To save more or less frequently, use the -save_every and -save_every_epochs options, which define the number of iterations and epochs after which the training saves a checkpoint.
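As an illustration, the sketch below assembles a training command that saves a checkpoint every 1000 iterations and every 2 epochs. The two interval values are arbitrary choices for the example, and the data and model names are reused from the demo command later on this page. The command is only printed, not executed, so it can be reviewed first.

```shell
# Illustrative intervals (assumptions, not recommended values).
SAVE_EVERY=1000        # iterations between two checkpoints
SAVE_EVERY_EPOCHS=2    # epochs between two checkpoints

# Build the command as a string so that it can be inspected before launching.
TRAIN_CMD="th train.lua -data data/demo-train.t7 -save_model demo -save_every ${SAVE_EVERY} -save_every_epochs ${SAVE_EVERY_EPOCHS}"
echo "${TRAIN_CMD}"
```

Once the printed command looks right, it can be pasted into the terminal to start the training.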
There are several reasons one may want to train from a saved model with the -train_from option:
- continuing a stopped training
- continuing the training with a smaller batch size
- training a model on new data (incremental adaptation)
- starting a training from pre-trained parameters
When training from an existing model, some settings cannot be changed:
- the model topology (layers, hidden size, etc.)
- the vocabularies
Some model options, such as -fix_word_vecs_dec, can nevertheless be changed for a retraining.
Resuming a stopped training
It is common for a training to stop: crash, server reboot, user action, etc. In this case, you may want to continue the training for more epochs by using the -train_from option together with the -continue flag. For example:
    # start the initial training
    th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50
    # train for several epochs...
    # need to reboot the server!
    # continue the training from the last checkpoint
    th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50 -train_from demo_checkpoint.t7 -continue
The -continue flag ensures that the training continues with the same configuration and optimization states. In particular, the corresponding options are set to their last known value.
The -end_epoch value is not automatically set, as the user may want to continue the training for more epochs past the end.
Additionally, the -continue flag retrieves from the previous training:
- the states of non-SGD optimizers
- the random generator states
- the batch order (when continuing from an intermediate checkpoint)
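Since a training can stop unexpectedly, a launcher script can decide by itself whether to resume. The sketch below is one possible approach, not part of OpenNMT: the helper name resume_args is hypothetical, and the checkpoint name is the one used in the example above. It adds -train_from and -continue only when the checkpoint file exists, and prints the resulting command instead of running it.

```shell
# resume_args CHECKPOINT
# Hypothetical helper: prints the resume options if CHECKPOINT exists,
# and prints nothing otherwise.
resume_args() {
  if [ -f "$1" ]; then
    printf '%s' "-train_from $1 -continue"
  fi
}

# Checkpoint name taken from the example above.
CHECKPOINT="demo_checkpoint.t7"

# The same base command serves both a fresh start and a resumed training.
TRAIN_CMD="th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50 $(resume_args "${CHECKPOINT}")"
echo "${TRAIN_CMD}"
```

With this layout, the same script can be relaunched after a reboot without editing the command by hand.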
Training from pre-trained parameters
Another use case is to take a base model and train it further with new training options (in particular the optimization method and the learning rate). Using -train_from without -continue will start a new training with parameters initialized from a pre-trained model.
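A sketch of this use case follows. It assumes that the optimization method and the learning rate are set with the -optim and -learning_rate options; the values adam and 0.0002 and the output name demo-adam are illustrative assumptions, so check them against the output of th train.lua -h. The checkpoint is passed with -train_from and -continue is left out, so the newly given options apply. As before, the command is only printed.

```shell
# Checkpoint produced by an earlier training (name reused from the demo).
PRETRAINED="demo_checkpoint.t7"

# New optimization settings; both values are assumptions for illustration.
NEW_OPTIM="adam"
NEW_LR="0.0002"

# No -continue here: the parameters come from the checkpoint, while the
# optimization settings come from this command line.
FINETUNE_CMD="th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo-adam -train_from ${PRETRAINED} -optim ${NEW_OPTIM} -learning_rate ${NEW_LR}"
echo "${FINETUNE_CMD}"
```

Saving under a different model name keeps the checkpoints of the base training from being overwritten.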