If Instagram were to translate your pictures into words, what might they say? That's (sort of) the question that self-professed language hacker Ross Goodwin has tackled in his latest provocation, word.camera. It is, as the name implies, an app that translates any image you take or upload into a pageful of words.
I've written about Goodwin's work before—he programmed an algorithm that turns text into novels, and we had him "translate" the CIA torture report into fiction. The results were interesting, if not entirely comprehensible. So last week, when he emailed me with news of his latest experiment, my interest was immediately piqued.
Goodwin's word.camera offers a snapshot into how machines can be taught to interpret, and even describe in human terms, the real-world environment.
I tried the app with a selfie in the office; the results were enjoyably beguiling. The first paragraph went on about my European-ness (fair enough, I am of European descent), before veering into a soliloquy about voting, politics, sin, and crime.
Word.camera did seem to know that the photo was a portrait, that I was indoors, that I was wearing clothes, and that I was seated in front of a window. The rest was sort of like a horoscope—there were lines about the significance of music and protest, two things that are important to me, personally—you could read into it whatever you like, and perhaps find some seed of truth. ("The music is important to man, and the protest was made from a formal and solemn declaration of objection," the machine wrote, and it's hard to disagree!) Try it yourself, here.
To learn more about how the algorithm works, and what it might portend for the future of machine-human interactions, I sent Goodwin a few questions about his program.
Motherboard: **What's the background here?**

**Ross Goodwin:** I recently received a grant from the Future of Storytelling Initiative at NYU, which is funded by Google, to produce a computer-generated screenplay. For the past few months, I have been thinking about how to generate text that's more cohesive and realistically descriptive, meaning that it would transition between related topics in a logical fashion and describe a scene that could realistically exist (no "colorless green ideas sleeping furiously"), in order to make filming the screenplay possible.
After playing with the Clarifai API, which uses convolutional neural networks (a very smart machine learning algorithm) to tag images, it occurred to me that including photographs in my input corpus, rather than relying on text alone, could provide those qualities. word.camera is my first attempt at producing that type of generative text. At the moment, the results are not nearly as grammatically correct as I would like them to be, and I'm working on that.
**What is the benefit of having a program that generates stories about pictures?**
This project is about augmenting our creativity and presenting images in a different format, but it's also about creative applications of artificial intelligence technology. I think that when we think about the type of artificial intelligence we'll have in the future, based on what we've read in science fiction novels, we think of a robot that can describe and interact with its environment with natural language.
It shouldn't just avoid obstacles and recognize human faces. It should notice the dead pigeon on the sidewalk and make a comment about mortality (or perhaps its lack thereof); it should make witty jokes about the crazy party hat you're wearing; it should notice that your haircut makes you look European. I think that creating the type of AI we imagine in our wildest sci-fi fantasies is not only an engineering problem, but also a design problem that requires a creative approach.
**How does it work? What are the stories "about," and how does the algorithm "know" what to write?**
It's generous for you to call them "stories." I don't really know what to call them. The algorithm is extracting tags from the images using Clarifai's convolutional neural networks, then blowing up those tags into paragraphs using ConceptNet (a lexical relations database developed at MIT) and a flexible template system. It knows what to write because it sees concepts in the image and relates those concepts to other concepts in the ConceptNet database. The template system enables the code to build sentences that appear to connect those concepts together.
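The pipeline Goodwin describes—image tags expanded into paragraphs via lexical relations and templates—can be sketched in a few lines of Python. Everything below is illustrative: the hardcoded tags stand in for Clarifai's neural-network output, the small relations dictionary stands in for ConceptNet, and the templates are invented for the example; none of it is Goodwin's actual code.

```python
import random

# Stand-in for tags returned by Clarifai's convolutional neural networks.
IMAGE_TAGS = ["portrait", "window", "music"]

# Stand-in for ConceptNet's lexical relations: concept -> [(relation, related concept)].
RELATIONS = {
    "portrait": [("IsA", "painting of a person")],
    "window": [("UsedFor", "letting in light")],
    "music": [("HasProperty", "important to man")],
}

# A "flexible template system": each relation type maps to sentence templates.
TEMPLATES = {
    "IsA": ["The {0} is a kind of {1}."],
    "UsedFor": ["The {0} is used for {1}."],
    "HasProperty": ["The {0} is {1}."],
}

def expand_tag(tag):
    """Blow up one image tag into sentences via its related concepts."""
    sentences = []
    for relation, concept in RELATIONS.get(tag, []):
        template = random.choice(TEMPLATES[relation])
        sentences.append(template.format(tag, concept))
    return sentences

def describe(tags):
    """Join the expanded tags into a paragraph of generated prose."""
    paragraph = []
    for tag in tags:
        paragraph.extend(expand_tag(tag))
    return " ".join(paragraph)

print(describe(IMAGE_TAGS))
```

The sentences "appear to connect" the concepts in exactly the way Goodwin notes: the grammar comes from the templates, not from any understanding of the scene, which is why the output reads like a horoscope.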
**The story generated from a picture of my face is all about European-ness. What gives?**
Your face was algorithmically matched by a convolutional neural network to a large number of other images that were tagged by humans as "european". The internal states of the neural network are not human readable, so it's impossible to say for certain what features the algorithm is detecting. If I had to guess, I'd say it's because you have white skin, but I don't know for sure.
**Do you have a favorite image-generated story yet?**
Right now it's this one:
**Any specific plans for this?**
I don't have any concrete plans at the moment. My intent was to make an unusual camera that anyone could use on their phone or computer. But I would love to install it in a gallery or museum as a photobooth. (Any gallery/museum owners in New York City who are interested in this should feel free to contact me. My contact information is on my website.)
I also want to make a talking surveillance camera that moves around and describes what it sees. I have access to some high-end programmable surveillance cameras at the moment, and this seems like a good application for them.