I think the analogy is something like this: a simple distribution over all words is just word frequency, which is obviously not a good predictor. The 'information' needed to predict the correct next word simply isn't there if you're predicting words in a vacuum. To be practically useful and predict the right words _in context_, the model must be conditioning on more of the sentence/document (i.e. more information). So it shouldn't be surprising that a 'glorified autocomplete' has some degree of "understanding"; it couldn't be any good as an autocompleter otherwise.
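To make the point concrete, here's a rough sketch (toy corpus and all names are made up for illustration, not taken from any real model). It compares the uncertainty about the next word when you ignore context (plain word frequency) with the uncertainty when you condition on just the previous word. A real language model conditions on far more than one word, so treat this only as the smallest possible version of the argument.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# No context: the distribution of "next words" is just word frequency.
next_word_counts = Counter(corpus[1:])
h_no_context = entropy(next_word_counts)

# One word of context: count which words follow each previous word.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

# Conditional entropy H(next | prev): the entropy of each follower
# distribution, weighted by how often that previous word occurs.
n_pairs = len(corpus) - 1
h_with_context = sum(
    (sum(c.values()) / n_pairs) * entropy(c) for c in followers.values()
)

print(f"bits of uncertainty, no context:    {h_no_context:.3f}")
print(f"bits of uncertainty, previous word: {h_with_context:.3f}")
```

On this corpus the conditional number comes out lower, because after "sat" the next word is always "on", after "on" it's always "the", and so on. The only remaining uncertainty is what comes after "the". That drop is the extra information context buys you.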
You might want to take another look at Shannon's paper, lol. This statement is quite contradictory: probability _is_ the backbone of information theory, dude! It's quite incredible.
Can you elaborate on this? I've studied some information theory and I don't see it.