softmax bleu Attention-UNet
Softmax-based DeepSpeed implementation for fine-tuning perplexity.
- Input
  - 6857-dim embedding
- Encoder
  - 127 x Attention-UNet with 14 heads
- Output
  - recall projection
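The sizes listed above can be collected in a small spec object. This is only a sketch: the class name `ModelSpec` and its field names are hypothetical, and only the three numbers come from the list. The `head_split` helper assumes the embedding width would be split evenly across heads, which this README does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Sizes copied from the architecture list; names are hypothetical."""

    embed_dim: int = 6857   # Input: 6857-dim embedding
    num_blocks: int = 127   # Encoder: 127 x Attention-UNet
    num_heads: int = 14     # heads per block

    def head_split(self) -> Tuple[int, int]:
        """Return (per-head width, leftover dims) for an even split across heads."""
        return divmod(self.embed_dim, self.num_heads)


spec = ModelSpec()
head_dim, leftover = spec.head_split()
print(f"{spec.num_blocks} blocks, {spec.num_heads} heads, "
      f"head_dim={head_dim}, leftover={leftover}")
```

Note that 6857 is not a multiple of 14, so the leftover is non-zero. If the blocks do use an even per-head split, the embedding would presumably be projected to a compatible width first.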
Training config
optimizer=Adadelta, lr=0.100, scheduler=exponential, warmup=117
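One possible reading of the scheduler settings is sketched below in plain Python. Only `lr=0.100` and `warmup=117` come from the config line. The linear warmup shape, counting warmup in steps, and the decay factor `gamma` are assumptions chosen for illustration, and `lr_at` is a hypothetical helper name.

```python
def lr_at(step: int, base_lr: float = 0.100, warmup: int = 117,
          gamma: float = 0.999) -> float:
    """Learning rate at a step: linear warmup, then exponential decay.

    base_lr and warmup are taken from the config line. gamma, the linear
    warmup shape, and the unit of warmup (steps) are assumptions.
    """
    if step < warmup:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup
    # Multiply by gamma once per step after warmup ends.
    return base_lr * gamma ** (step - warmup)


# Print the schedule at a few sample steps.
for s in (0, 58, 116, 117, 1000):
    print(s, lr_at(s))
```

The rate peaks at `base_lr` when warmup ends and decays from there. A different `gamma` or a per-epoch schedule would change the curve but not the overall shape.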