Are you (or is your student) claiming to have aphantasia? If so, you'd have to use something external to hold the image.
Visual thinking is like other kinds of thinking, but using the visual sensory channel. It's a kind of short-term memory with limited capacity.
If you remember the days when we had to look up phone numbers in books, you might have had the experience of loading a seven-digit number into your short-term memory as an auditory loop, speaking it over and over again in your mind's voice (and hearing it in your mind's ear), long enough to dial on the phone. You could actually speak the numbers aloud for a similar effect, but most people don't have to do this to "hear" it. But a moment of distraction, especially an auditory distraction, can cause you to lose that memory.
Visual thinking is the same, but you use your mind's eye instead of your ear. It's like having an imaginary whiteboard, but it takes concentration to hold an image, and a moment of distraction can erase it. You're not exactly drawing in lines either (unless you choose to). This whiteboard has a limited capacity, in the same way your auditory loop has a limited duration. You can choose which parts of the image to focus on, and those parts can hold more detail (just as your real eyes see what they're looking at directly better than what's in the periphery). But if you stop focusing on an area for too long, its detail fades and you lose the memory. If you try to exceed your capacity, the part of the image you refresh might not quite match what was there before, the same way you can accidentally remember the wrong number if you try to keep too long a string of digits in your auditory memory.
In the same way you can query your long-term memory for the sound of something, by holding the question in your mind until the sound arises (e.g., What rhymes with "muffin"? What does a dolphin sound like?), you can query your long-term memory for an image (e.g., What does a dolphin look like?).
And this can be done without using your mind's voice. Perhaps you can use your mind's eye to "read" the text of the question. But you don't even have to use words; you can use the concepts behind the words more directly. Sometimes these concepts are visual, or have a visual representation. Even kinesthetic concepts have enough of a spatial component that they can be diagrammed visually in a very natural way. Many nouns correspond to visible things, and verbs correspond to visible actions. Holding such an image in memory, along with the intention to query memory, can cause an answer to rise into consciousness, the same way a query in words can.
The intention to query for a sound is like stopping for a moment to listen, while the intention to query for an image is like stopping to look. You have to make a space for it in your mind in the appropriate sensory channel.
You can also query your subconscious for things you've never seen before. Maybe you don't know what an ichthyosaur looks like, but someone tells you it looks like a dolphin with two pairs of fins instead of one. Running a hypothetical query may be enough to produce the image in your mind's eye. If that doesn't work, you can try to force it by "painting", which takes more effort. If you had a whiteboard (and artistic skill), you could draw a dolphin and then draw another pair of fins in the pelvic region. You can do the same thing in the mind's eye: start with a dolphin image, and then, while concentrating to keep it refreshed in memory, make a change to it by deliberately refreshing that part differently.