Step 1: Preprocess the data
th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
We will be working with some example data in
The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space.
Validation files are required and are used to evaluate the convergence of training. They usually contain no more than 5000 sentences.
$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .
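Because the source and target files are parallel, line N of one is expected to correspond to line N of the other. The following is a minimal sketch of a sanity check you could run before preprocessing. It is not part of OpenNMT; the function name `check_parallel` and the placeholder sentences are made up for illustration, and the lists stand in for the contents of the real files.

```python
def check_parallel(src_lines, tgt_lines):
    """Return a list of problems found in a parallel corpus (empty if none)."""
    problems = []
    if len(src_lines) != len(tgt_lines):
        problems.append(
            f"line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    # zip stops at the shorter side, so pairs are only compared where both exist.
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # split() with no argument ignores repeated spaces and returns []
        # for an empty or whitespace-only line.
        if not src.split() or not tgt.split():
            problems.append(f"line {number}: empty sentence")
    return problems


# Hypothetical in-memory stand-ins for data/src-train.txt and data/tgt-train.txt.
src = ["It is not acceptable .", "They beat me ."]
tgt = ["target tokens for sentence one .", "target tokens for sentence two ."]

print(check_parallel(src, tgt))      # aligned data
print(check_parallel(src, tgt[:1]))  # one target line missing
```

To check real files, read each one with `open(path, encoding="utf-8").read().splitlines()` and pass the two lists in.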
After running the preprocessing, the following files are generated:
demo.src.dict: Dictionary of source vocab to index mappings.
demo.tgt.dict: Dictionary of target vocab to index mappings.
demo-train.t7: Serialized Torch file containing the vocabulary, training data, and validation data.
*.dict files are needed to check or reuse the vocabularies. These files are simple human-readable dictionaries.
$ head -n 11 data/demo.src.dict
<blank> 1
<unk> 2
<s> 3
</s> 4
It 5
is 6
not 7
acceptable 8
that 9
, 10
with 11
Internally the system never touches the words themselves, but uses these indices.
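To make the word-to-index idea concrete, here is a small sketch of how such a mapping can be built and applied. It is an illustration only, not the code in preprocess.lua: the helper names `build_vocab` and `to_indices` are hypothetical, and assigning indices in order of first appearance is an assumption (the real preprocessing may order or prune the vocabulary differently). The four special tokens and their indices follow the dictionary listing above.

```python
SPECIAL_TOKENS = ["<blank>", "<unk>", "<s>", "</s>"]


def build_vocab(sentences):
    """Map each token to a 1-based index, with the special tokens first."""
    vocab = {token: index for index, token in enumerate(SPECIAL_TOKENS, start=1)}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                # Assumption: new tokens get the next free index in order
                # of first appearance.
                vocab[token] = len(vocab) + 1
    return vocab


def to_indices(sentence, vocab):
    """Replace each token by its index, falling back to the <unk> index."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in sentence.split()]


vocab = build_vocab(["It is not acceptable that , with the help"])
print(to_indices("It is not acceptable", vocab))
print(to_indices("It is unseen", vocab))  # "unseen" is not in the vocabulary
```

Any token missing from the dictionary maps to the index of `<unk>`, which is how unknown words can still be represented as indices.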
If the corpus is not tokenized, you can use OpenNMT's tokenizer.
Step 2: Train the model
th train.lua -data data/demo-train.t7 -save_model demo-model
The main train command is quite simple. At a minimum, it takes a data file and a save file. This runs the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. You can also add -gpuid 1 to use (say) GPU 1.
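If you launch training from a script, one possible approach is to assemble the command as an argument list. The sketch below uses only the options shown in this section (-data, -save_model, -gpuid); the wrapper function `train_command` is hypothetical and not part of OpenNMT.

```python
import shlex


def train_command(data, save_model, gpuid=None):
    """Build the argument list for the training command shown above."""
    args = ["th", "train.lua", "-data", data, "-save_model", save_model]
    if gpuid is not None:
        # Only add -gpuid when a GPU was requested.
        args += ["-gpuid", str(gpuid)]
    return args


cpu_args = train_command("data/demo-train.t7", "demo-model")
gpu_args = train_command("data/demo-train.t7", "demo-model", gpuid=1)

# shlex.join (Python 3.8+) quotes arguments where needed, so the printed
# line can be pasted into a shell.
print(shlex.join(cpu_args))
print(shlex.join(gpu_args))
```

The same list can be passed to `subprocess.run(args, check=True)` to start training without going through a shell.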
Step 3: Translate
th translate.lua -model demo-model_epochX_PPL.t7 -src data/src-test.txt -output pred.txt
Now you have a model which you can use to predict on new data. We do this by running beam search. This will output predictions into