Andrea Frome, Software Engineer at Research at Google, gives us some insight into her computer vision research at Google and some of its applications.
via Kaptur: http://kaptur.co/10-questions-for-a-computer-vision-scientist-andrea-frome/
As an ultimate goal, classifying images is too limited. Many in the community, myself included, see the grand goal as a system that can fully understand visual input the way humans are able to. Humans don’t look at a room and think “desk”, “chair”, “coffee mug”. Instead, we understand why things are in their particular places and how a person might interact with the objects or the space. We recognize people’s actions and their intentions, and we predict what will happen next. For example, you understand from visual input and your interactions with the world what will happen if you tip the table your coffee mug is resting on, or if you try to place your coffee mug on your keyboard. Not only do I believe that we can build systems that learn these things, I believe we will be able to build systems that do this processing on video in real time, and that those systems will learn from large amounts of video without human labeling.