The value of specialized interfaces becomes apparent if you compare Photoshop with Gemini Pro/Nano Banana for targeted image editing.
I can select exactly where I want changes and have targeted element removal in Photoshop. If I submit the image and try to describe my desired changes textually, I get less easily-controllable output. (And I might still get scrambled text, for instance, in parts of the image that it didn't even need to touch.)
I think this sort of task-specific specialization has a long future ahead of it; it's hard to imagine pure text once again becoming the dominant information-transfer method for 90% of what we do with computers after 40 years of building specialized non-text interfaces.
One reasonable niche application I've seen of image models is in real estate, as a way to produce "staged" photos of houses without shipping in a bunch of furniture for a photo shoot (and/or removing a current tenant's furniture for a clean photo). It has to be used carefully to avoid misrepresenting the property, of course, but it's a decent way of avoiding what is otherwise a fairly toilsome and wasteful process.
This sort of thing (not for real estate, but for "what would this furniture actually look like in this room?") is definitely somewhere the open-ended interface beats targeted removal in Photoshop (though it could also easily be integrated into a Photoshop-like tool to let me be more specific about placement and such).
I was a bit surprised that it still produced gibberish text on posters in the background, in an unaffected part of the image that at first glance hadn't changed at all. So even just a GUI-style "masking" ability, where anything outside a selected region is guaranteed untouched, would be a godsend.
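For what it's worth, you can approximate that guarantee yourself by compositing the original pixels back outside a mask after the model returns its edit. A minimal sketch in Python with Pillow, assuming hypothetical file names (original.png, edited.png, and a grayscale mask.png where white means "allowed to change"):

```python
# Enforce "don't touch anything outside this region" after the fact:
# wherever the mask is black, restore the original pixel; elsewhere keep the edit.
from PIL import Image

original = Image.open("original.png").convert("RGB")   # pre-edit image
edited = Image.open("edited.png").convert("RGB")       # whatever the model returned
mask = Image.open("mask.png").convert("L")             # white = editable, black = protected

# Image.composite takes pixels from the first image where the mask is white,
# and from the second image where the mask is black.
result = Image.composite(edited, original, mask)
result.save("result.png")
```

It doesn't fix the model's behaviour, but it does stop background posters from silently turning into gibberish.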
It baffles me that Gemini et al. don't have these standard video editing tools. Do the engineers seriously think prompting by text is how people want to generate videos? Nope. People want to customise. E.g. check out CapCut in the context of social media.
I've been trying to create a quick and dirty marketing promo via an LLM to visualise how a product will fit into people's lives, and it is incredibly painful to 'hope and pray' that refining the prompt via text will make slight adjustments come through.
The models are good enough if you are half-decent at prompting and have some patience, but given the amount invested, I would argue they are pretty disappointing. I've had to chunk the marketing promo into almost a frame-by-frame play to make it somewhat work.
Speaking as someone who doesn't like the idea of AI art, so take my words with a grain of salt, but my theory is that this input-method exclusivity is intentional on their part, for exactly the reason you want the change. If you only let people making AI art communicate what they want through text or reference attachments (the latter of which they usually won't have), then they have to spend time figuring out how to put it into words. It IS painful to ask for those refinements in text, because any human would understand them instantly. In the end, those people get to say that they spent hours, days, or weeks refining "their prompt" to get a consistent and somewhat-okay-looking image; the engineers get to train their AI to better understand the context of what someone is saying; and all the while the company gets to further legitimize a false art form.