As the conversation turns visual, there’s a huge need to be able to read what’s happening in photos.

Social listening—the process of monitoring what consumers post on social media to guide strategy—has been an indispensable tool for marketers since the rise of Facebook, Twitter and the like. The problem is, it’s based on scans of text—and we increasingly communicate on visual platforms like Instagram using images.

Until recently, computers haven’t been able to extract useful information from the contents of images, and humans have been too slow and expensive to “listen” to images at scale. But now, companies like Ditto Labs, co-founded by David Rose, are adding machine vision to the marketer’s playbook and pioneering so-called “visual listening.”

We spoke to Rose, who in addition to being CEO of Ditto is also an instructor at the MIT Media Lab and the author of the book Enchanted Objects, about “the bleeding edge of computer vision” and applications for advertising.

Could you tell us more about visual listening and why we need to pay attention to it now?

The need for visual listening comes from an awareness that social media channels are more and more filled with photos, not text. Twitter is trying to become a photo-sharing network to draw more attention in those overcrowded streams. We realized that as that conversation turns visual, there’s a huge need to be able to read what’s happening in the photos.

So, two and a half years ago we started a company with a team of computer vision and AI engineers in order to be able to solve the unique and hard problem of reading the photos that are shared all over social media. By reading, we mean extracting as much metadata out of them as possible, which means looking for products, people, brands, and objects that we could recognize, as well as looking for the context in which all of these things appear.

It seems like communication is becoming more visual just at the moment when machine vision is making huge improvements. So is Ditto capitalizing on that?

The technology has improved a lot due to very fast, very cheap cloud computing, which gives us the ability to read many millions of photos a day, at the scale of social.

When the company started, we were mostly using pattern-matching algorithms, which are good at finding 2D things that have been wrapped around 3D objects, like a Coke logo that’s wrapped around a can. But our technology is now able to use “deep learning networks,” which are neural networks that allow you to train on example photos of things like the profile of certain types of cars, or things that are vague and fuzzy, like types of sunglasses, types of dogs, a golf club, a wine glass… That has been the bleeding edge of computer vision for the past couple of years—to recognize these vaguer 3D objects in 2D photos.
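The “train on example photos” idea Rose describes can be illustrated with a toy nearest-prototype classifier. This is a minimal sketch, not Ditto’s system: real pipelines use deep convolutional networks, and the short vectors below are hypothetical stand-ins for the embeddings such a network would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def train_prototypes(labeled_embeddings):
    """'Train on example photos': average the example embeddings per class."""
    protos = {}
    for label, vecs in labeled_embeddings.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos

def classify(embedding, protos):
    """Assign a new photo to the class with the most similar prototype."""
    return max(protos, key=lambda label: cosine(embedding, protos[label]))

# Hypothetical 3-dimensional embeddings standing in for CNN features.
examples = {
    "sunglasses": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "wine_glass": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
protos = train_prototypes(examples)
print(classify([0.85, 0.15, 0.05], protos))  # closest to the sunglasses prototype
```

The point is only the shape of the workflow: collect labeled example photos, summarize each class, then score new photos against those summaries.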

Now, with the technology, we’re in the land almost of what art directors do—the subjective classification of things. We’re able to find photos that are visually alluring or visually memorable, or exciting, that have momentum in them, or are calming. So we’re now at the moment where we’re able to ascribe emotional states to photos and to be able to cluster photos based on those emotional states.

What sorts of brands have the most to gain from a service like Ditto?

When we first started, we thought that the brands that would benefit the most would be the brands that appeared the most. So if you look at the brands that have the highest incidence in photos, it’s consumer packaged goods where the product is the packaging (like M&Ms or Kit Kat), or sports teams where the people flagrantly wear the brand and try to promote the brand.

The data has been used by marketers who, for example, are trying to fill seats at stadiums with latent fans. If you have an expiring resource like a seat at a stadium, and you haven’t filled the stadium yet, then you might as well reach out to the people who are wearing the brand, but are probably not in your CRM database, because they’re the most susceptible to that message or that offer.

An interesting thing about all those photos is that there’s a social network mapped to each of them. So we’re finding that the way to predict adoption, or the most efficient use of marketing resources, is to look at people’s friends. If you’re Harley-Davidson and you’re trying to figure out how best to spend your marketing dollars, spend them on the people who have networks of friends who own Harleys. If you can find the friends of Red Sox fans, you can fill seats with the first-degree peers of the people who are already passionate. That’s one of the unique things about all these photos: not only can you find who the people are, but you can also find who their network is. That means advertising and marketing can be much smarter, because you can predict who will adopt a product based on their friends’ adoption.
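The friend-network targeting Rose describes reduces to a simple graph computation: rank the people who aren’t yet known fans by how many of their friends are. A minimal sketch, with a made-up social graph (names and data are hypothetical):

```python
def rank_prospects(friends, known_fans):
    """Score each non-fan by their number of first-degree friends who are fans."""
    scores = {}
    for person, their_friends in friends.items():
        if person in known_fans:
            continue
        scores[person] = sum(1 for f in their_friends if f in known_fans)
    # Highest-scoring prospects first: the 'first-degree peers' to target.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical graph and the set of people seen wearing the brand in photos.
friends = {
    "alice": ["bob", "carol", "dan"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dan"],
    "dan": ["alice", "carol"],
}
fans = {"bob", "carol"}
print(rank_prospects(friends, fans))  # alice (2 fan friends) before dan (1)
```

A production system would weight edges by engagement and recency, but the core signal is just this fan-friend count.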

The sports brands and consumer goods brands were our first targets, but now we’re finding something really interesting, which is that financial services firms and insurance companies are also using the data. You might wonder why an insurance company would be interested in brands. Actually, they’re looking for photos of things that are insurable. They’re looking at people’s new motorcycles, boats, RVs, houses, or interest in skydiving—people take pictures of the high-priced things and the risky things. And then financial services firms want to know about life events—when people are moving, when people are getting married—and all of those things are already revealed in these photos.

How does this compare to basing marketing or advertising on, for example, whether someone likes the brand’s page on Facebook?

Likes on Facebook are a very weak signal. We look at the number of people who have Coke or the Boston Bruins in their photos, and only maybe 2% have liked the brand on Facebook. It’s incredibly low. Likes may capture the people who are most excited about the brand at a certain point, but they don’t capture the real users of a product.

Do you think this also comes at a time when people are moving away from playing the marketing game the way brands want them to—away from using branded hashtags, for example?

I’ve always thought that hashtags are such a blunt instrument. Campaign hashtags require so much spend, and so much volition on people’s part, to get used the way they’re supposed to. There are so many people who take selfies with their Michael Kors bag who would just never type in #michaelkors. Who would really go type that in? Of the photos that we find that have brands in them, only 15% would’ve been found with a hashtag, or with any sort of text. So if brands don’t use a visual listening tool, they’re missing 85% of the photos of their fans.

What are some of the main ways that brands use visual listening?

The first one is customer research—we like to say that “your focus group has already happened.” If you just go look at the photos, you can get a lot of the answers that you might otherwise have to poll people about in a focus group. We worked with the brand French’s mustard. Their question was, “what are people putting mustard on these days?” Sort of an obvious question. But they could just go look at the photos and get the answers, which turned out to be things like broccoli and other produce that surprised them. We call this type of use case digital ethnography, which is just to look at the photos and learn.

The second use case is campaign measurement and competitive analytics. For example, Disney uses the tool to understand, in the 10 weeks leading up to the launch of a movie like Ant-Man, what creative is resonating and how that compares with some other movie from the previous quarter. They want to see, for example, how often people are taking pictures of the posters, t-shirts and all the promotional material they made, in which demographic segments and parts of the country that’s resonating, and then they can adjust the creative before the launch of the movie. Same if Wendy’s wants to know how they’re doing against Taco Bell or Burger King—there’s a lot of side-by-side industry benchmarking happening through photos.

The third use case is the most profound when you think about the impact on advertising. That’s what we call audiences. Which is: give me a list of the people who are using my product or my competitor’s product. Not only do you get the photos and the competitive analytics, but with audiences you also get a specific list of Twitter handles or Instagram handles. Some brands are using this to find influencers or brand ambassadors, or to just download a list and send those people an offer, or to ask them to re-use their photo. It’s really about interacting with the people themselves.

An example of that is Campbell’s, who we work with. It’s really hard to do a text search for V8, because it just brings up a lot of car engines. And there are many brands like this: Tide is really hard, Bounty—there are a lot of common names for brands that sort of fail in a text search. So it’s really useful to be able to say, just find photos of people who are sharing pictures of V8 and talking about it.

So they were looking for who these people are, and wanting to know more about them. We were doing a marketing cross-tab, where you take a set of people and compare them against a baseline group. In our system you can look at all of the affinities and interests of a subset of people and say these people are more likely to golf, buy a certain type of beer, drive this car, use these other kinds of products, and vacation like this, all based on the photos.
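The cross-tab Rose mentions can be expressed as a lift score: how much more prevalent an interest is among the brand’s fans than in the baseline population. A hypothetical sketch, with invented interest data standing in for what would be inferred from photos:

```python
def interest_lift(subset_interests, baseline_interests):
    """Lift = P(interest | subset) / P(interest | baseline).

    Each argument is a list of per-person interest sets.
    Lift > 1 means the interest is over-represented in the subset.
    """
    lifts = {}
    all_interests = set().union(*subset_interests, *baseline_interests)
    for interest in all_interests:
        p_sub = sum(interest in s for s in subset_interests) / len(subset_interests)
        p_base = sum(interest in s for s in baseline_interests) / len(baseline_interests)
        if p_base > 0:
            lifts[interest] = p_sub / p_base
    return lifts

# Hypothetical data: interests inferred from people's photos.
v8_fans = [{"golf", "craft_beer"}, {"golf", "suvs"}, {"golf"}]
everyone = [{"golf"}, {"craft_beer"}, {"suvs"}, {"movies"}]
lifts = interest_lift(v8_fans, everyone)
print(lifts["golf"])  # 4.0: fans are 4x more likely to golf
```

Sorting the resulting lifts from highest to lowest gives exactly the “more likely to golf, buy a certain type of beer” readout described above.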

If you think about how the tech is evolving, what do you see happening in the next 3-5 years for visual listening?

We’re thinking about the offering almost as marketing AI. We’re trying to decompose what a marketer does all day and write algorithms that assist people, almost as if you had an awesome marketing assistant that goes out and finds the top 10 photos of your client’s brand. Let’s say your client is Microsoft’s Xbox: it finds the 10 most alluring photos by the people who have the most clout, who are saying something positive about Xbox.

So we’re trying to make a robot that helps you do as much of that job as possible. In the same way that AIs are helping a physician with diagnosis, or helping a radiologist find areas of interest on an x-ray, AIs are making their way into almost every job. Hopefully they can take away some of the drudgery of scanning through all those photos or all those tweets, pull out the cream of the crop, and make it actionable. Then you add the human touch and make sure it’s done in a sensitive way. The marketers are mostly playing the role of final editor, not originator.

Subjective classifiers—things that seem to do the work that a human might do—are the ones that threaten us the most and also help us the most. We’re building classifiers that help Zillow pick the most compelling shot of your house, or Match.com pick the most compelling shot of you, or Expedia pick the most compelling shot of the beach if they’re trying to encourage people to go on beach vacations. That’s the bleeding edge of computer vision.

You’ve also done a lot of work to do with the Internet of Things and written a book, Enchanted Objects, about that field. Is that a completely separate interest, or is it related in some way to your work with Ditto?

Most of the IoT world is about embedding sensors in everyday objects, and one of the cheapest and most effective sensors to embed is a camera. People are taping cameras to the inside of the oven—there’s a startup in San Francisco called June that’s doing the June oven, looking down at how burnt your cookies are and what’s in the oven so the temperature can be adjusted automatically. And we’re putting Dropcams in our homes, all of our stores have cameras, our refrigerators are getting cameras, our drones have cameras, and all of those cameras need image recognition to cope with the deluge of data.

So that’s the connection that I see between IoT and image recognition. There’s this huge data exhaust that’s coming out of all these sensors, especially the cameras, and you need a cloud that can make sense of all of your GoPro video for the entire day, or all 12 cameras around your Target store to understand how people are shopping, who’s shopping, at what parts of the store, etc. There’s a huge connection between IoT and machine vision.