Why different sensory modalities have different qualitative experience can be explained through the physically observable organizational structure of a mind. The difference between a sight and a smell is likely due to the way they are predictively wired to other sensory and motor neurons.
If you close your eyes and imagine the color red, just the color red, not a red object like a sports car or a flower, not a small red box labeled red, just the color red, you probably imagine a field of red with height and width; something like this:
[Image: a solid field of red]
The fact it has height and width is structurally determined in an externally observable way.
A visual neuron is embedded within a tight grid of similar visual neurons, with a definite spatial relationship established by Hebbian Learning as objects move across the two dimensions of the visual field. Hebbian Learning is often summarized as "neurons that fire together wire together," although it is more accurate, if less catchy, to say that neurons that fire in short temporal succession, one after the other, wire together.
Objects moving across the visual field will cause a distinctly two dimensional connection pattern under Hebbian Learning processes.
Hebbian Learning captures the predictive role of neurons in the brain: each neuron tries to predict its neighbors on the small scale, and predictions about the world emerge as a property of this micro level prediction. You can see how physical objects moving across a visual field will cause Hebbian wiring between visual neurons. Even if the cortex neurons receiving signals from the retina were not placed in a flat two dimensional sheet that mimicked the retina, the Hebbian wiring from objects moving across the retina would still produce the topology of a two dimensional sheet: each cortex neuron would form predictive attachments to the cortex neurons that connect to the retinal neighbors of its own retina cell. These predictive connections prime the neuron to fire when its neighbors detect an incoming object, and prime the neighbors to fire for that object after it passes and is outgoing. That said, there is also a role for evolution, which conveniently arranges cortex neurons connected to the eye into a grid where they don't have far to go to find their neighboring co-firing neurons.
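The emergence of a two dimensional topology from Hebbian wiring can be sketched in a few lines of code. This is a toy simulation, not a biological model: the retina is a 5x5 grid, each cortex neuron watches one retina cell through a deliberately scrambled assignment, and a moving "object" is just a sequence of retina cells firing in short succession.

```python
import random
from collections import defaultdict

SIZE = 5  # toy 5x5 retina

# Each cortex neuron receives input from exactly one retina cell; the
# assignment is deliberately scrambled so no 2D layout is built in.
rng = random.Random(0)
retina_cells = [(r, c) for r in range(SIZE) for c in range(SIZE)]
scrambled = retina_cells[:]
rng.shuffle(scrambled)
cell_of = dict(enumerate(scrambled))           # cortex neuron -> retina cell
neuron_at = {cell: n for n, cell in cell_of.items()}

weights = defaultdict(float)  # (pre, post) -> strength of predictive link

def hebbian_sweep(path):
    """Strengthen pre -> post links between cortex neurons that fire in
    short temporal succession as an object moves along `path`."""
    for a, b in zip(path, path[1:]):
        weights[(neuron_at[a], neuron_at[b])] += 1.0

# Objects sweep across every row and column of the visual field,
# in both directions.
for r in range(SIZE):
    hebbian_sweep([(r, c) for c in range(SIZE)])            # left to right
    hebbian_sweep([(r, c) for c in reversed(range(SIZE))])  # right to left
for c in range(SIZE):
    hebbian_sweep([(r, c) for r in range(SIZE)])            # top to bottom
    hebbian_sweep([(r, c) for r in reversed(range(SIZE))])  # bottom to top

# Despite the scrambled wiring, a neuron's learned partners are exactly
# the neurons watching the retina cells adjacent to its own cell.
center = neuron_at[(2, 2)]
partners = {cell_of[post] for (pre, post) in weights if pre == center}
print(sorted(partners))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Even though the cortex neurons were assigned to retina cells in scrambled order, the learned connection graph is the grid adjacency of the retina: the two dimensional sheet is recovered purely from the statistics of moving objects.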
A similar combination of evolution and Hebbian wiring creates predictive connections between these visual neurons and motor neurons that control eye movement, head movement, and other aspects of body position to form a very general spatial sensorium, placing our visual sensations onto a stage with height, width, and depth.
This stage is composed of trillions of latent predictions implicit in our neural connections: what you would feel if you got up and moved forward, or what you would see if you executed the motions to toss something at a target in your field of view; the air rushing over your arm, the feeling of muscles tensing and relaxing in a coordinated motion, the sight of your arm moving into your field of view as your body shifts to throw, the view of your projectile as it flies along its trajectory, and so on.
Thus when you imagine red in your field of view it occupies height and width. Any particular object you imagine will have a very specific localization in your visual field, with predictive implications for what happens with body movements, particularly movements of the head and eyes. If you think of a person standing you will see their head above their torso, which itself is above their feet. The smell of a person's cologne will not give such spatial cues.
Most of this internal visual space, the internal soundstage to use a showbiz analogy, is on an area between the two hemispheres of the brain called the precuneus and the area just over the dorsal ridge from the precuneus called the superior parietal lobule, major components of the dorsal “Where” visual pathway, located about here:
[Image: brain diagram labeling the precuneus and the superior parietal lobule]
This region lies between our V1 visual area and our motor-tactile areas, and it also has a white matter connection to the top of the lateral sulcus where inner ear balance is processed. This is pretty precisely where we would expect visual information to be getting connected to tactile and motor information.
The precuneus and superior parietal lobule’s dorsal pathway forms a mental soundstage where we may create scenes in our mind, but we need to fill it. Into this soundstage are placed objects from the prop-warehouse of the ventral "What" pathway that lies between the visual and auditory cortical centers on the lateral and underside of the brain.
The underside of the brain with anterior regions at the top and posterior regions at the bottom. Areas involved in the "What" visual pathway are highlighted. The red area is the fusiform gyrus and where lesions that cause prosopagnosia occur. The blue area is the inferior occipital gyrus where patient DF had injury.
To use a video game analogy for a moment, I would speculate the dorsal “Where” pathway is like a game engine with a physics system that can connect the voxels in the scene's simulation space by, for example, saying that the contents of a voxel that has no momentum and no blocking voxel underneath will move into the voxel directly beneath it after some period due to gravity. The ventral “What” pathway's prop warehouse is like a database of assets which themselves are simply particular patterns of voxels, each of which can be loaded into the game engine's scene and then processed by the physics there.
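The gravity rule in this analogy is simple enough to state as code. Below is a minimal sketch, with a one dimensional column of voxels standing in for the full scene; all names here are illustrative, not from any real game engine.

```python
EMPTY = "."

def gravity_tick(column):
    """One tick of the analogy's physics rule: the contents of a voxel
    with nothing blocking beneath it move into the voxel directly below.
    Index 0 is the top of the column."""
    new = list(column)
    for i in range(len(new) - 2, -1, -1):  # scan bottom to top
        if new[i] != EMPTY and new[i + 1] == EMPTY:
            new[i + 1], new[i] = new[i], EMPTY
    return new

def settle(column):
    """Apply gravity ticks until the scene stops changing."""
    while True:
        nxt = gravity_tick(column)
        if nxt == column:
            return column
        column = nxt

# A "prop" loaded into the scene falls until supported.
print(settle([".", "A", ".", "B", "."]))  # ['.', '.', '.', 'A', 'B']
```

The props themselves carry no physics; once loaded into the scene they are just contents of voxels, and the engine's rule does the rest, which is the division of labor the "What" and "Where" analogy is pointing at.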
The “What” pathway allows us to attend to individual components of a scene as differentiated from the background or some other unrecognizable blob of pixels. The “Where” pathway’s predictive connections to other spatially dependent senses, particularly tactile and motor areas, produce the qualitative spatial awareness of an object and the distinct qualitative experience or awareness of the object itself as opposed to the qualitative experience or awareness of a flat picture of the object. The "Where" pathway generates our qualitative experience of existing in a three dimensional space[1].
The V1 visual cortex at the rear of the brain is the screen onto which is projected external sensory information or internal mental images depending on how our focus[2] allows or inhibits external stimulation of the V1 area by sensory information from our eyes. In one direction the “What” pathway allows us to identify objects in our visual field and in the opposite direction we can recall images from this prop warehouse to place into an imagined scene. In both cases our awareness is like a film made on our soundstage with props from our warehouse projected onto the screen, just sometimes external senses rather than imagination dictate prop choice.
This use of props from our "What" pathway warehouse is what Attention Schema Theory describes with the partial, not fully accurate schemas employed in our awareness. It also explains how our brain can leave a collection of pixels uncategorized: if our attention is on other objects in our visual field, the pixels are never referred to the specimens in our prop warehouse, i.e. our collection of schemas, and so are never recognized as, say, a man in a gorilla suit.
This is footage from Daniel Simons's famous attention experiment, in which participants were asked to count the number of times players wearing white pass the ball. If you are unfamiliar with the experiment, please watch the eighty second video now.
At some point in the footage a man in a gorilla suit walks through. Later the participants are asked if anything unusual happened and they can't recall the man in the gorilla suit; an example of how an object not retrieved from our “What” warehouse and placed onto our mental stage remains a pile of uncategorized pixels.
To decompose our visual-spatial awareness we can use cases from brain lesion disorders to isolate the elements of our awareness.
An aphantasiac, who has a reduced ability to mentally picture things, or an agnosiac, who can't identify seen objects, likely has difficulty connecting to the prop-warehouse in the ventral "What" pathway; aphantasiacs have difficulty recalling objects from the warehouse to the mind's eye, while agnosiacs have trouble in the opposite direction, connecting a collection of pixels to an object in their warehouse, although aphantasia and agnosia often overlap. One form of agnosia, prosopagnosia, specifically affects faces; there is a kind of face warehouse in our brain, which some readers could analogize to the basement of Game of Thrones' House of Black and White. Losing access to that warehouse makes a person unable to categorize faces as belonging to people they have seen before; for a prosopagnosiac all people may appear as unrecognizable strangers, though many are able to recognize acquaintances and celebrities by voice or other cues.
There is a warehouse of faces on the underside of your brain in the fusiform gyrus.
Patients with more severe agnosia become unable to distinguish objects from the background of the visual scene. This patient with acquired apperceptive visual agnosia describes being unable to “see” objects while still registering visual textures and colors:
Apperceptive agnosia seems to give an experience similar to our seeing-but-not-registering of Daniel Simons's gorilla. Unlike some patients with damage to the dorsal “Where” pathway, severe apperceptive agnosiacs such as patient DF can retain spatial awareness and may be able to grasp objects with ease despite not being able to delineate them from a background.
On the other hand, patients with Balint's syndrome have damage in the dorsal "Where" pathway. This patient with severe Balint's syndrome, due to damage in his midline parietal cortex, has difficulty grasping objects with the hand of the more affected hemisphere, despite being able to see and individually identify the objects:
He can't seem to coordinate his movement in space with respect to the objects. He also has dorsal simultanagnosia, which doesn't allow him to see both objects at the same time when they are held up together in his visual field; it is very much as if he is retrieving only one object from his prop warehouse at a time because there is no stage upon which to place the objects. Dorsal simultanagnosiacs have difficulty visually navigating a room and will bump into obstacles frequently, another symptom of poor modeling of the immediate local physical space. Notably, the patient in this video can see both foreground objects held by the researcher, and at one point the background object behind the researcher; unlike an apperceptive agnosiac he can see and distinguish all the objects, just not in a coherent scene containing them all at once, or in a way that places them relative to his body position.
But while our visual senses have a well developed integration with our proprioceptive and motor neurons, mostly via the dorsal “Where” pathway (but also contributed to by structures like the colliculus), the predictive relationships between our olfactory senses and our somatosensory and motor neurons are much less precise. While we may be able to get a crude directional indication from our sense of smell via the differences between our nostrils, olfactory sensory nerves within the nostrils don't have the same regularized patterns of excitation as visual neurons. Scent-triggering molecules exist in a chaotic cloud of particles which may or may not interact with our olfactory sensors. If I ask you to imagine the scent of a rose you may associate that with the sensation of your diaphragm flexing to draw breath and the temperature and humidity inside your nostrils dropping as air rushes in, but you won't have a strong indication of spatiality for that scent in the way red produces a field with height and width.
In The Conscious Mind David Chalmers poses the question of why different sensory modalities have different qualitative experiences alongside unanswerable mysteries of consciousness like the possibility of p-zombies or dancing qualia. But distinguishing sensory modalities by external observation of a mind’s structure is likely possible. We can probably know whether a novel mind structure is experiencing a sight or a smell.
It seems conceivable that when looking at red things, such as roses, one might have had the sort of color experiences that one in fact has when looking at blue things. Why is the experience one way rather than the other? Why, for that matter, do we experience the reddish sensation that we do, rather than some entirely different kind of sensation, like the sound of a trumpet?
Chalmers 1996 page 5
So to directly answer Chalmers's question: we experience the sensation of the color red, and not the sound of red (or a trumpet), due to differences in the precision and regularity of the connections between our visual neurons and our sensory-motor neurons as compared with the connections between our auditory neurons and those same sensory-motor neurons. Our ears locate sounds with respect to our other body parts less precisely than our eyes do, despite also contributing to spatial awareness through structures like the colliculus and the “Where” pathway. Audio neurons also have particular relations with other audio neurons which are more regular than their relations to visual neurons, and similarly visual neurons self-wire more intensely relative to their inter-connections with auditory nerves. These are externally observable components of neural structure, and they could allow external observers to know whether a novel mind is experiencing a sensation qualitatively similar to sight or similar to some other sensory modality like a sound or a smell.
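To make "externally observable" concrete, here is a toy sketch. The function `shared_neighbor_score` is an invented illustrative metric, and the "olfactory" connectome is just random wiring standing in for irregular connectivity; the point is only that lattice-like visual wiring and irregular wiring of similar density can be told apart by structure alone, without asking the mind anything.

```python
import random

def shared_neighbor_score(adj):
    """Fraction of each node's neighbor pairs that also share another
    common neighbor -- high for regular lattice-like wiring, low for
    irregular wiring of similar density."""
    hits = total = 0
    for node, nbrs in adj.items():
        nbrs = sorted(nbrs)
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                total += 1
                if (adj[nbrs[i]] & adj[nbrs[j]]) - {node}:
                    hits += 1
    return hits / total if total else 0.0

SIZE = 8
# "Visual" sheet: each neuron wired to its four grid neighbors.
grid = {}
for r in range(SIZE):
    for c in range(SIZE):
        grid[(r, c)] = {(r + dr, c + dc)
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= r + dr < SIZE and 0 <= c + dc < SIZE}

# "Olfactory" sheet: same neurons, similar connection count, but wired
# with no regular spatial pattern.
rng = random.Random(0)
nodes = list(grid)
rand = {n: set() for n in nodes}
for n in nodes:
    for m in rng.sample([x for x in nodes if x != n], 2):
        rand[n].add(m)
        rand[m].add(n)

print(shared_neighbor_score(grid) > shared_neighbor_score(rand))  # True
```

An observer handed only the two connection graphs could tell which one is wired like a two dimensional sensory sheet, which is the kind of structural inference the argument above relies on.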
And I think this may be the answer Chalmers himself is suggesting when he discusses his ideas around structural coherence, although his phrasing is somewhat opaque:
The fine-grained structure of the visual field will correspond to some fine-grained structure in visual processing. The same goes for experiences in other modalities, and even for nonsensory experiences. Internal mental images have geometric properties that are represented in processing. Even emotions have structural properties, such as relative intensity, that correspond directly to a structural property of processing; where there is greater intensity, we find a greater effect on later processes.
Chalmers 1995 page 213
Hopefully this essay has expanded on this idea and added some detail to what parts of the “fine-grained structure of the visual field” produce the unique aspects of vision relative to other sensory modalities.
Chalmers, David J. "Facing Up to the Problem of Consciousness." Journal of Consciousness Studies 2.3 (1995): 200-219.
Chalmers, David J. The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press, 1996.
Let me briefly mention that I think there is a distinction between the space sense of the precuneus and the place sense of the cells in the entorhinal cortex and the hippocampus. The space sense of the precuneus is about immediate body positioning and movements, throwing an object or complex dance moves being good examples; the place sense of the entorhinal place cells and hippocampus is about planning navigation among landmarks: take a left at the fork, then continue to the big rock, etc.
The thalamus decides which stimuli we attend to; for visual attention the relevant region of the thalamus is the pulvinar. While the most ancient function of the thalamus was likely attentional control for the senses, it has broadened to more general decision making and, I would argue, is the seat of consciousness.