Mixture-of-Experts Implementation
Implemented a Switch Transformer alongside a conventional autoregressive transformer and trained both on TinyShakespeare to study the effects of the mixture-of-experts architecture on validation loss, sample efficiency, and training time.
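For context, the defining pieces of a Switch Transformer are a top-1 router that sends each token to a single expert feed-forward network and an auxiliary load-balancing loss that discourages the router from collapsing onto one expert. The sketch below is a minimal PyTorch illustration of that layer, not the project's actual code: the class name `SwitchFFN`, the per-expert loop, and all dimensions are assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Illustrative Switch-style MoE feed-forward layer with top-1 routing."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.n_experts = n_experts
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, seq, d_model); flatten to a stream of tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)   # (n_tokens, n_experts)
        gate, expert_idx = probs.max(dim=-1)             # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate probability so the router still
                # receives gradients through the selected path.
                out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])

        # Load-balancing auxiliary loss: fraction of tokens routed to each
        # expert times the mean router probability for that expert.
        freq = F.one_hot(expert_idx, self.n_experts).float().mean(dim=0)
        density = probs.mean(dim=0)
        aux_loss = self.n_experts * (freq * density).sum()

        return out.reshape_as(x), aux_loss
```

In a training loop, `aux_loss` would be added to the language-modeling loss with a small weight (on the order of 1e-2 in the Switch Transformer paper). The Python loop over experts is the simplest dense formulation; real implementations typically batch tokens per expert with a fixed capacity for efficiency.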