No, it was not a surprise. The Transformer architecture resulted from systematic exploration at scale with Seq2Seq models, and it was quite clear when it came out that it was very promising.
The issue was not technology, it was lack of investment. In 2017, with a giant sucking sound, Autonomous Vehicles research took all the investment money and nearly all the talent. I'm a good example: I was training code generation models for my startup Articoder, using around 8TB of code scraped from GitHub. Had some early successes, automatic pull requests generated and accepted by human users, and got past the YC application stage to the interview. The amount of VC funding for that was exactly zero. I filed a patent, put everything on hold, and went to work on AVs.
As to watching things not stick for multiple decades, there were simply too few people working on this, and no general availability of compute. It was a handful of tiny labs with a few grad students and little to no compute. Very few people had a supercomputer in their hands. In 2010, for example, a GPU rig like 2× GTX 470 (which could yield around 2 TFLOPS) was an exception. And in the same year the field's top conference, NIPS (now NeurIPS), had an attendance of around 600.