Understanding Attention Mechanisms – Part 1: Why Long Sentences Break Encoder–Decoders

Source: DEV Community
In the previous articles, we understood Seq2Seq models. Now, on the path toward transformers, we need to understand one more concept: Attention.

The encoder in a basic encoder–decoder, by unrolling the LSTMs, compresses the entire input sentence into a single context vector. This works fine for short phrases like "Let's go". But with a larger input vocabulary of thousands of words, we could feed in longer and more complicated sentences, like "Don't eat the delicious-looking and smelling pasta". For longer phrases, even with LSTMs, words that are input early on can be forgotten. If we forget the first word, "Don't", the sentence becomes "eat the delicious-looking and smelling pasta", which is the opposite of the intended meaning. So sometimes it is crucial to remember the very first word.

Basic RNNs had problems with long-term memory because they ran both long- and short-term information through a single path. The main idea of Long Short-Term Memory (LSTM) units is that they solve this problem by splitting that path in two: a cell state that carries long-term memory and a hidden state that carries short-term memory.
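To see the bottleneck concretely, here is a minimal sketch of the idea (not the article's code): a toy recurrent encoder with made-up names (`hidden_size`, `W_h`, `W_x`, random embeddings) that squeezes every input sentence, short or long, into the same fixed-size context vector.

```python
import numpy as np

# Toy sketch of an encoder bottleneck. All weights and names here are
# illustrative assumptions, not a real trained model.
rng = np.random.default_rng(0)
hidden_size = 4

# Tiny toy vocabulary; embeddings are random for illustration.
vocab = {"let's": 0, "go": 1, "don't": 2, "eat": 3, "the": 4,
         "delicious-looking": 5, "and": 6, "smelling": 7, "pasta": 8}
embed = rng.normal(size=(len(vocab), hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_x = rng.normal(size=(hidden_size, hidden_size)) * 0.1

def encode(sentence):
    """Run a plain RNN over the tokens and return only the LAST hidden state."""
    h = np.zeros(hidden_size)
    for tok in sentence.lower().split():
        h = np.tanh(W_h @ h + W_x @ embed[vocab[tok]])
    return h  # a single fixed-size context vector, regardless of length

short = encode("Let's go")
long_ = encode("Don't eat the delicious-looking and smelling pasta")

# Both sentences collapse into vectors of the same fixed size:
print(short.shape, long_.shape)  # (4,) (4,)
```

Whether the input is two words or twenty, the decoder only ever sees those four numbers, so information from early words like "Don't" has to survive every recurrence step; this is exactly the pressure that attention later relieves.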