ImageNet: the database that sparked today's AI boom

September 18th
A 64x64 grid of images from ImageNet. (patrykchrabaszcz.github.io)

Published in 2009, ImageNet was the first of its kind: a dataset large and accurate enough to teach machines to see and recognize objects.

Today ImageNet comprises more than 14 million hand-tagged images. The team behind it collected pictures for more than 20,000 distinct categories -- everything from motorcycles to sloths -- which researchers and startups later used to build much of the technology we use today, including the image recognition used in social media, security and self-driving vehicles, as well as the voice recognition behind AI assistants.

Built by Stanford computer science professor Fei-Fei Li and others, ImageNet was inspired by WordNet, another large database. Described as a combination of a dictionary and a thesaurus, WordNet fits nouns and verbs into hierarchies and relationship groups.

In WordNet, the town of "Hilo" is nested under "Hawaii", which is nested under the "United States". "Hawaii" is also a sibling to all other US states, each of which has its own children and sibling relationships, allowing users to move across categories.
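The nesting described above can be sketched as a small parent lookup table. This is only an illustration: the entries and the helper names (PARENT, ancestors, children, siblings) are made up for this example and are not taken from the real WordNet database or its tools.

```python
# Toy hierarchy in the spirit of WordNet. The entries are illustrative
# assumptions, not real WordNet data.
PARENT = {
    "Hilo": "Hawaii",
    "Honolulu": "Hawaii",
    "Hawaii": "United States",
    "California": "United States",
    "Oregon": "United States",
}


def ancestors(node):
    """Walk upward from a node to the root, collecting each parent in order."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def children(node):
    """Return every entry nested directly under the given node."""
    return sorted(child for child, parent in PARENT.items() if parent == node)


def siblings(node):
    """Return the other entries that share this node's parent."""
    parent = PARENT.get(node)
    if parent is None:
        return []
    return [c for c in children(parent) if c != node]


print(ancestors("Hilo"))    # ['Hawaii', 'United States']
print(siblings("Hawaii"))   # ['California', 'Oregon']
print(children("Hawaii"))   # ['Hilo', 'Honolulu']
```

Moving "across categories" then amounts to stepping up to a parent and back down to one of its other children.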

Li wanted to attach images to WordNet's thousands of concepts (called synsets or nodes), believing that a large database of tagged images could be used to teach machines to identify real-world objects. But there was a problem: Li believed an accurate dataset would need up to a thousand images per node. A human may be able to recognize a surfboard after seeing just one, but Li bet a machine couldn't. A neural net would need to see many more examples than the average human.

To build such a dataset, the team used Amazon Mechanical Turk, which paid users pennies to complete small tasks online that were impossible to automate.

"Suddenly we found a tool that could scale, that we could not possibly dream of by hiring Princeton undergrads," Li told Quartz in 2017.

The first published version of ImageNet took two and a half years to compile, and consisted of 3.2 million photos. Over the course of the next several years, it would go on to prove Li's hypothesis -- that, given enough examples, a neural network could be taught to recognize patterns and, in ImageNet's case, people and objects in photos. The researchers, students, and developers who trained their models on ImageNet would go on to staff the AI labs, core product and personalization initiatives at Facebook, Google, and OpenAI.

After it was published, ImageNet spawned an annual competition that is now the stuff of legend.

Participants competed to see which team could write the algorithm that could most accurately identify objects. In the inaugural 2010 competition, no team was accurate more than 75% of the time. Little advancement was made in 2011.

Then, in 2012, something surprising happened: a team from the University of Toronto won, beating the runner-up by more than 10.8 percentage points. The Toronto team had used a deep neural network developed by a student named Alex Krizhevsky. The algorithm, later named AlexNet, is often marked as the first time deep learning was used effectively in AI development -- Krizhevsky, his adviser Geoffrey Hinton, and fellow student Ilya Sutskever had submitted an algorithm that stacked many neural network layers on top of each other.

Neural networks, which are groups of interconnected math operations, are used in computer science to identify patterns and allow computers to make decisions without being explicitly programmed for each case. Krizhevsky's neural network took a network's grouped operations and layered them, so that each layer's output fed the next, which drastically improved a machine's ability to recognize people and objects in ImageNet.
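To make the idea of layering concrete, here is a minimal sketch in plain Python of a few layers stacked so that each one's output becomes the next one's input. It is not AlexNet, which was far larger and was trained on ImageNet. The layer sizes, the random untrained weights, and the function names (dense, relu, make_layer, forward) are all assumptions chosen for illustration.

```python
import random


def relu(values):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: a weighted sum of the inputs for each output unit, plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def make_layer(n_in, n_out, rng):
    """Build random, untrained weights for a layer (illustrative only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = relu(dense(activations, weights, biases))
    return activations


rng = random.Random(0)
sizes = [4, 8, 8, 3]  # hypothetical: 4 inputs, two hidden layers, 3 outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([0.5, -0.2, 0.1, 0.9], layers)
print(len(layers), len(output))  # 3 3
```

A real image model would learn its weights from labeled examples rather than drawing them at random; the sketch only shows how the operations compound as layers are added.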

The short-term consequences of Krizhevsky's algorithm were felt everywhere in digital life -- soon, others used their own deep neural nets to improve speech recognition, which led to Siri, Alexa and Google Assistant. Facebook, Apple and Google began auto-tagging people in user photos. And Tesla and other car makers began introducing autonomous vehicles on the road, which could lead to the eventual automation of professional driving.

Eventually, most participants in the challenge could train models to recognize objects more than 95% of the time, which is better than the average human. Datasets went on to dominate the AI world -- with enough data, the story now went, you can teach software to recognize anything.

It was a significant milestone, but not the final one. Quartz put it well when it wrote about the industry after ImageNet. Researchers nearly perfected machines' ability to recognize objects, but they did not teach software to make value judgments, or to reason about what it saw or heard:

... This doesn’t mean an algorithm knows the properties of that object, where it comes from, what it’s used for, who made it, or how it interacts with its surroundings. In short, it doesn’t actually understand what it’s seeing. This is mirrored in speech recognition, and even in much of natural language processing. While our AI today is fantastic at knowing what things are, understanding these objects in the context of the world is next. How AI researchers will get there is still unclear.