As models of human visual perception, deep neural networks (DNNs) have shown considerable promise, predicting aspects of visual task performance as well as neural response patterns. Still, computer vision DNNs and humans differ in many ways in how they process information, and these differences show up both in adversarial settings and in psychophysical studies.
Humans and deep neural networks
One difference between DNNs and humans that has attracted recent interest is peripheral vision: the way human vision represents the world with decreasing fidelity as eccentricity, the distance from the point of fixation, increases. Peripheral vision makes up roughly 99 percent of the human visual field. Although it is thought to be a strategy for coping with capacity limits imposed by the size of the optic nerve and visual cortex, peripheral vision has proven to be a significant predictor of human performance across a range of visual tasks.
Vision in the periphery
Peripheral vision lets humans perceive shapes outside the direct line of sight, albeit with less detail. This ability widens our field of view and is useful in many situations, such as spotting a car approaching from the side. AI models lack peripheral vision in the way humans have it. Giving computer vision models this ability could help them detect approaching hazards, or predict whether a human driver would notice an oncoming object.
Image gathering
By compiling an image dataset that can be used to simulate peripheral vision in machine learning models, MIT researchers took a first step in this direction. They found that while training models on this dataset improved their ability to detect objects in the visual periphery, the models still performed worse than humans. Their results also showed that, unlike in humans, the models' performance was unaffected by the size of objects and the amount of visual clutter in a scene.
Partial vision simulation
Extend your arm in front of you and raise your thumb: the small area around your thumbnail is roughly what is seen by your fovea, the tiny pit at the center of the retina that provides the sharpest vision. Everything else you see is in your peripheral vision. The farther a part of the scene lies from that point of focus, the less detailed and reliable it appears.
Training many computer vision models
Many existing AI models of peripheral vision represent this declining detail by blurring the edges of images, though information loss in the optic nerve and visual cortex is far more complex. To get a more accurate result, the MIT researchers started from a technique for simulating peripheral vision in humans called the texture tiling model, which transforms photographs to mimic the loss of visual information a human experiences.
They modified this model so it could transform images in a similar but more flexible way, one that does not require knowing in advance where the person or AI will look. Using this modified technique, the researchers generated an enormous dataset of transformed images that appear more texture-like in certain regions, simulating the loss of detail that occurs farther into the periphery. They then trained several computer vision models on the dataset and compared their performance with that of humans on an object detection task.
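The texture tiling model pools visual information over regions that grow with eccentricity; a well-known approximation (Bouma's law) is that pooling-region size scales roughly linearly with distance from fixation. The sketch below is not the researchers' actual model, just a minimal illustration of that linear-scaling idea, with the scaling factor and starting eccentricity chosen as illustrative assumptions.

```python
def pooling_region_width(eccentricity_deg, scaling=0.5):
    """Width (in degrees) of a pooling region at a given eccentricity.

    Assumes the linear scaling of Bouma's law; a scaling factor of
    roughly 0.4-0.5 is often cited for human peripheral vision.
    """
    return scaling * eccentricity_deg

def pooling_regions(max_eccentricity_deg, scaling=0.5, start=1.0):
    """Tile one radial line, from `start` degrees out to the maximum
    eccentricity, with pooling regions that widen as eccentricity grows."""
    regions = []
    ecc = start
    while ecc < max_eccentricity_deg:
        width = pooling_region_width(ecc, scaling)
        regions.append((ecc, ecc + width))
        ecc += width
    return regions
```

Because each region's width is proportional to its eccentricity, regions near fixation are small (preserving detail) while far-peripheral regions are large, which is why texture statistics pooled over them discard fine structure.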
Strange performance
Humans and models were shown pairs of transformed images, identical except that one contained a target object in the periphery, under exactly the same conditions. Each participant was then asked to pick the image containing the target. Training models from scratch on the dataset produced the largest performance gains, improving the models' ability to detect and recognize objects. Fine-tuning a pretrained model on the dataset, that is, adapting it to handle the new task, yielded smaller gains.
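This kind of experiment is a two-alternative forced choice (2AFC) task: the observer must pick one of two images on every trial, and accuracy is the fraction of trials where the image containing the target is chosen. A minimal sketch of scoring such a task, assuming a hypothetical `score_fn` that returns a detection score for an image (the actual models and stimuli are not reproduced here):

```python
def run_2afc(trials, score_fn):
    """Score a two-alternative forced choice task.

    trials: iterable of (image_with_target, image_without_target) pairs.
    score_fn: returns a scalar detection score for one image; the
    simulated observer picks whichever image scores higher.
    Returns the fraction of trials where the target image was chosen.
    """
    trials = list(trials)
    correct = sum(
        1 for with_target, without_target in trials
        if score_fn(with_target) > score_fn(without_target)
    )
    return correct / len(trials)
</parameter>```

With a forced choice between two images, chance performance is 50 percent, so accuracies are read relative to that baseline when comparing humans and models.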
Even so, the models never matched human performance, and they were especially poor at detecting objects in the far periphery. Their performance also did not follow human patterns.