The attention mechanism for sequence modelling was first introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., ICLR 2015). Even though the paper itself uses the word "attention" sparingly (three times in total, in two consecutive lines!), the term has caught on. …
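As a rough illustration of the idea that paper introduced, additive (Bahdanau-style) attention scores each encoder hidden state against the current decoder state, normalizes the scores with a softmax, and takes a weighted sum as the context vector. A minimal NumPy sketch, where the dimension sizes and variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(query, keys, W_q, W_k, v):
    # score_i = v^T tanh(W_q q + W_k k_i)  -- Bahdanau-style alignment model
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    weights = softmax(scores)      # attention distribution over input positions
    context = weights @ keys       # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
d = 4                              # hidden size (illustrative)
keys = rng.normal(size=(5, d))     # 5 encoder hidden states
query = rng.normal(size=d)         # current decoder state
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
v = rng.normal(size=d)

context, weights = additive_attention(query, keys, W_q, W_k, v)
```

Here `weights` is a probability distribution over the five input positions, i.e. the "soft alignment" the paper's title refers to.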