…limited success. Following seminal papers in the area [41, 2], NMT translation quality has crept closer to the level of phrase-based translation systems for common research benchmarks.
…(Sennrich et al., 2016), WordPiece embeddings (Wu et al., 2016) and character-level CNNs (Baevski et al., 2019). Nevertheless, Schick and Schütze (2020) recently showed that BERT's (Devlin et al., 2019) performance on a rare word probing task can be significantly improved by explicitly learning representations of rare words using Attentive Mimicking…

…the subword tokenization algorithm is WordPiece (Wu et al., 2016). As a consequence, the decomposition of a word into subwords is the same across contexts and the subwords can be unambiguously…
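To make that determinism concrete, the following is a minimal sketch of greedy longest-match-first WordPiece-style tokenization. The toy vocabulary and the "##" continuation prefix are illustrative assumptions following the common BERT convention, not an actual learned vocabulary.

```python
# Minimal sketch of greedy longest-match-first WordPiece-style tokenization.
# The toy vocabulary and the "##" continuation prefix are assumptions for
# illustration; a real WordPiece vocabulary is learned from data.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest subwords found in `vocab`."""
    subwords, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # no subword matched at all
            return [unk_token]
        subwords.append(piece)
        start = end
    return subwords

toy_vocab = {"un", "##aff", "##able", "play", "##ing", "[UNK]"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
```

Because the split depends only on the vocabulary and the greedy match, a given word is always decomposed the same way regardless of the sentence in which it appears, which is the property the fragment above refers to.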
…using WordPiece tokenization (Wu et al., 2016), and produces a sequence of context-based embeddings of these subtokens. When a word-level task, such as NER, is being solved, the embeddings of word-initial subtokens are passed through a dense layer with softmax activation to produce a probability distribution over output labels. We refer the…
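As a rough illustration of that word-level classification head, the PyTorch sketch below selects the embeddings of word-initial subtokens and passes them through a dense layer with softmax. The hidden size, label count, and mask are placeholder values assumed for the example, not taken from any particular model.

```python
import torch
import torch.nn as nn

# Sketch of a word-level tagging head over subtoken embeddings: only the
# embedding of each word-initial subtoken is classified.
hidden_size, num_labels = 768, 9          # illustrative: e.g. 9 NER tags
classifier = nn.Linear(hidden_size, num_labels)

# Fake encoder output: a batch of one sentence with 6 subtokens.
subtoken_embeddings = torch.randn(1, 6, hidden_size)

# Boolean mask marking word-initial subtokens, e.g. for
# ["Mar", "##tin", "lives", "in", "Ber", "##lin"] -> "Martin lives in Berlin"
word_initial_mask = torch.tensor([[True, False, True, True, True, False]])

# Select word-initial subtoken embeddings and classify them.
word_embeddings = subtoken_embeddings[word_initial_mask]          # (num_words, hidden)
label_probs = torch.softmax(classifier(word_embeddings), dim=-1)  # (num_words, num_labels)
print(label_probs.shape)   # torch.Size([4, 9]) -- one distribution per word
```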
Pre-trained word embeddings have proven to be highly useful in neural network models for NLP tasks such as sequence tagging (Lample et al., 2016; Ma and Hovy, 2016) and text classification (Kim, 2014). However, it is much less common to use such pre-training in NMT (Wu et al., 2016), largely because the large-scale training corpora…
…(Wu et al., 2016; Vaswani et al., 2017); the work of Qi et al. empirically compares the performance of pretrained and randomly-initialized embeddings across numerous languages and dataset sizes on NMT tasks, showing for example that the pretrained embeddings typically perform better on similar language pairs, and when the amount of training …
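A minimal sketch of the two embedding-initialization regimes being compared, assuming PyTorch and a placeholder pretrained matrix (in practice the vectors would come from, e.g., monolingual word2vec or fastText training); the vocabulary size and dimensions are made up for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

# Sketch: an encoder's source-side embedding table initialized either randomly
# or from pretrained word vectors. All sizes and the pretrained matrix below
# are placeholders.
vocab_size, emb_dim = 10_000, 300

# (a) random initialization -- embeddings learned from scratch during training
random_embeddings = nn.Embedding(vocab_size, emb_dim)

# (b) pretrained initialization -- copy pretrained vectors, then fine-tune
pretrained_matrix = np.random.randn(vocab_size, emb_dim).astype("float32")  # placeholder
pretrained_embeddings = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_matrix),
    freeze=False,   # allow the vectors to be updated during training
)

# Both modules are drop-in replacements for each other in the encoder,
# which is what makes a controlled pretrained-vs-random comparison possible.
tokens = torch.randint(0, vocab_size, (2, 7))   # a toy batch of token ids
print(random_embeddings(tokens).shape, pretrained_embeddings(tokens).shape)
```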