Those insights guided the design of the deep net by Yamins and his colleagues. Their deep net had hidden layers, some of which performed a “convolution” that applied the same filter to every portion of an image. Each convolution captured different essential features of the image, such as edges. The more basic features were captured in the early stages of the network and the more complex features in the deeper stages, as in the primate visual system. When a convolutional neural network (CNN) like this one is trained to classify images, it starts off with randomly initialized values for its filters and learns the correct values needed for the task at hand.
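The key idea in that paragraph, applying one filter's weights to every patch of the image, can be sketched in a few lines. This is a generic illustration, not the team's model; the edge filter and toy image are made up for the example:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a single kernel over every patch of the image ("valid" mode),
    applying the same weights everywhere -- the weight sharing that
    defines a convolutional layer."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter: it responds strongly where brightness changes
# from left to right -- one of the "basic features" early layers capture.
edge_filter = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])

image = np.zeros((5, 5))
image[:, 2:] = 1.0            # dark left half, bright right half
response = convolve2d(image, edge_filter)
# The response is large in magnitude only near the vertical edge.
```

In a trained CNN these filter values are not hand-chosen as here; they start random and are learned, as the passage above describes.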

The team’s four-layer CNN could recognize eight categories of objects (animals, boats, cars, chairs, faces, fruits, planes and tables) depicted in 5,760 photo-realistic 3D images. The pictured objects varied greatly in pose, position, and scale. Even so, the deep net matched the performance of humans, who are extremely good at recognizing objects despite variation.

Unbeknownst to Yamins, a revolution brewing in the world of computer vision would also independently validate the approach that he and his colleagues were taking. Soon after they finished building their CNN, another CNN called AlexNet made a name for itself at an annual image recognition contest. AlexNet, too, was based on a hierarchical processing architecture that captured basic visual features in its early stages and more complex features at higher stages; it had been trained on 1.2 million labeled images presenting a thousand categories of objects. In the 2012 contest, AlexNet routed all other tested algorithms: By the metrics of the competition, AlexNet’s error rate was only 15.3 percent, compared to 26.2 percent for its nearest competitor. With AlexNet’s victory, deep nets became legitimate contenders in the field of AI and machine learning.

Yamins and other members of DiCarlo’s team, however, were after a neuroscientific payoff. If their CNN mimicked a visual system, they wondered, could it predict neural responses to a novel image? To find out, they first established how the activity in sets of artificial neurons in their CNN corresponded to activity in almost 300 sites in the ventral visual stream of two rhesus macaques.

Then they used the CNN to predict how those brain sites would respond when the monkeys were shown images that weren’t part of the training data set. “Not only did we get good predictions … but also there’s a kind of anatomical consistency,” Yamins said: The early, intermediary, and late-stage layers of the CNN predicted the behaviors of the early, intermediary, and higher-level brain areas, respectively. Form followed function.
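The step of linking model activity to brain activity is commonly done with a regularized linear regression from a layer's activations to each recording site's responses. Below is a minimal sketch of that idea using ridge regression on synthetic data; the array sizes and noise level are hypothetical stand-ins, not the study's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: activations of 50 model features for 100 training
# images, and recorded responses at 10 cortical sites to the same images.
X_train = rng.normal(size=(100, 50))
true_w = rng.normal(size=(50, 10))
Y_train = X_train @ true_w + 0.1 * rng.normal(size=(100, 10))

# Ridge regression: fit a linear map from model features to responses.
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(50),
                    X_train.T @ Y_train)

# Predict responses to held-out images the mapping never saw --
# the analogue of testing on images outside the training set.
X_test = rng.normal(size=(20, 50))
Y_pred = X_test @ W
Y_true = X_test @ true_w

# Score the prediction at one site by correlating predicted and actual.
r = np.corrcoef(Y_pred[:, 0], Y_true[:, 0])[0, 1]
```

A high correlation on held-out images is what "good predictions" means operationally in this kind of analysis.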

Kanwisher remembers being impressed by the result when it was published in 2014. “It doesn’t say that the units in the deep network individually behave like neurons biophysically,” she said. “Nonetheless, there is shocking specificity in the functional match.”

Specializing for Sounds

After the results from Yamins and DiCarlo appeared, the hunt was on for other, better deep-net models of the brain, particularly for regions less well studied than the primate visual system. For example, “we still don’t really have a very good understanding of the auditory cortex, particularly in humans,” said Josh McDermott, a neuroscientist at MIT. Could deep learning help generate hypotheses about how the brain processes sounds?

The neuroscientist Josh McDermott at the Massachusetts Institute of Technology uses deep learning neural networks to develop better models for auditory processing in the brain. Photograph: Justin Knight/McGovern Institute


That’s McDermott’s goal. His team, which included Alexander Kell and Yamins, began designing deep nets to classify two types of sounds: speech and music. First, they hard-coded a model of the cochlea—the sound-transducing organ in the inner ear, whose workings are understood in great detail—to process audio and sort the sounds into different frequency channels as inputs to a convolutional neural network. The CNN was trained both to recognize words in audio clips of speech and to recognize the genres of musical clips mixed with background noise. The team searched for a deep-net architecture that could perform these tasks accurately without needing a lot of resources.
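The cochlear front end's job of sorting sound into frequency channels can be illustrated crudely with a short-time Fourier transform. This is a toy stand-in, not the detailed cochlear model the team hard-coded; the frame size and channel count are arbitrary choices for the example:

```python
import numpy as np

def frequency_channels(signal, n_channels=4, frame=256):
    """Crude cochlea-like front end: chop the signal into short frames,
    take each frame's magnitude spectrum, and sum it into a few
    equal-width frequency bands, giving band energy over time."""
    n_frames = len(signal) // frame
    out = np.zeros((n_channels, n_frames))
    for t in range(n_frames):
        spectrum = np.abs(np.fft.rfft(signal[t * frame:(t + 1) * frame]))
        bands = np.array_split(spectrum, n_channels)
        out[:, t] = [band.sum() for band in bands]
    return out

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone: low-frequency energy
channels = frequency_channels(tone)
# The lowest channel carries nearly all the energy for this tone.
```

A real cochlear model uses many overlapping bandpass filters on a logarithmic frequency scale, but the output plays the same role here: a frequency-by-time representation fed into the CNN.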

Three families of architectures seemed possible. At one extreme, the deep net's two tasks could share only the input layer and then split into two entirely distinct networks. At the other extreme, the tasks could share the same network for all of their processing and split only at the output stage. Or the architecture could be one of the dozens of variants in between, in which some stages of the network were shared and others were distinct.
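One of those in-between variants, a shared trunk that splits into task-specific branches, can be sketched as follows. The layer sizes are hypothetical and the weights are random stand-ins for trained ones; the point is only the topology:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical layer widths; random weights stand in for trained ones.
W_shared1 = rng.normal(size=(64, 32))   # shared early stage
W_shared2 = rng.normal(size=(32, 16))   # shared intermediate stage
W_speech = rng.normal(size=(16, 8))     # speech-specific branch
W_music = rng.normal(size=(16, 8))      # music-specific branch

def forward(cochlear_features):
    """An in-between variant: two stages shared by both tasks,
    then a split into separate branches for word recognition
    and genre recognition."""
    h = relu(cochlear_features @ W_shared1)
    h = relu(h @ W_shared2)
    return h @ W_speech, h @ W_music    # (word scores, genre scores)

speech_out, music_out = forward(rng.normal(size=(1, 64)))
```

Moving the split earlier or later in this chain generates the whole family of variants the team searched over.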