A researcher who wishes to design a machine that thinks and acts like a human runs up against the self-evident and somewhat embarrassing problem of human beings themselves. They’re messy, contradictory, lazy, unpredictable. They’re often jerks. No one wants to build a system that turns out to be a jerk. Look at Microsoft: In March it launched a chatbot on Twitter called Tay that learned from interactions with people. People being unpredictable, they said terrible things to it, and Tay became a jerk in about a day. That’s pretty quick, but it’s sure no human record. (Newborns can get on your nerves in minutes.)

To interact successfully with humans, though, an AI has to understand humans and their systems in all their complexity. All the promise and all the peril of these systems emerge when dealing with language, surely one of our most beautiful, interesting, and totally messy creations. This is why AI researchers who work on language are often pulling their hair out in frustration at linguistic inventions like sarcasm. Which makes the internet, as you can imagine, a great place to play.

"Really, it’s a system to keep humans from being jerks who don’t respond. This may just save the human race." 

Watch My Language

Ascander Dost, a senior software engineer and linguist at Salesforce, is one of those researchers. He works on natural language processing (NLP), a branch of AI that applies machine learning to large datasets of text, whether emails or web comments, in order to understand the content and context of messages, make predictions, and even suggest responses.

The simplest version of this involves the system looking for pretty concrete rules, like question marks. “If you send me an email and the email contains a question that warrants a response from me and I don't get back to you after a few days, then we serve this gentle reminder saying ‘Hey, would you like to respond?’” he says. It can also be predictive. By looking at all the emails that got a response and all the emails that didn’t get a response, the system can try to determine what criteria invited a response and which future emails will most likely need one. So really, it’s a system to keep humans from being jerks who don’t respond. This may just save the human race. It’s about time.
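To make the rule-based version concrete, here is a minimal sketch in Python of the heuristic Dost describes: check for a question, check how long it has gone unanswered, and serve a gentle reminder. The `Email` class, its field names, and the three-day window are illustrative assumptions, not Salesforce’s actual implementation.

```python
from datetime import datetime, timedelta

# Hypothetical email record; field names are assumptions for this sketch.
class Email:
    def __init__(self, body, received, answered=False):
        self.body = body
        self.received = received
        self.answered = answered

def contains_question(email):
    # The simplest possible rule: look for a question mark.
    # A real system would also check that the question actually warrants a reply.
    return "?" in email.body

def needs_reminder(email, now, wait_days=3):
    # Gentle-reminder logic: an unanswered question older than `wait_days`.
    overdue = now - email.received > timedelta(days=wait_days)
    return contains_question(email) and not email.answered and overdue

inbox = [
    Email("Can you review the Q3 deck?", datetime(2016, 9, 1)),
    Email("FYI, the meeting moved to Tuesday.", datetime(2016, 9, 5)),
]

for email in inbox:
    if needs_reminder(email, now=datetime(2016, 9, 8)):
        print("Hey, would you like to respond?", "->", email.body)
```

The predictive version Dost mentions would replace the hand-written `contains_question` rule with a model trained on emails that did and didn’t get responses.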

But Why Now?

“Well, I mean it’s not really happening now,” Dost says. “It's been happening for quite some time. It's starting to be really popular now because more companies are realizing that they've got large amounts of unstructured text data. NLP is really just an umbrella term for any number of techniques — some of them machine-learning based, some not — for extracting rich information out of what is otherwise unstructured text. NLP can be valuable and useful for emails or attachments or PDF documents or information that you're pulling out of social media.”

As language is intrinsic to how we construct our reality, it’s no surprise that NLP requires a certain amount of world-building, in the form of ontologies. “Ontologies are essentially hierarchies of relationships between things,” Dost says. “You know that a dog is a particular type of mammal, and a mammal is an animal, and there are other ones, and you know what hierarchy they form. Same with locations. A street occurs in a city, and a city and a town and a village are all the same thing, and they show up in a county, and the thing that contains the county is a state, and the thing that contains a state is a country. That's all information that we carry around in our heads about the world based on our interactions with it.

“A lot of NLP techniques are used to build up this real-world model,” he continues. “Once you have information about what the actual world is and information about how text changes that world, then you can start doing things like summarizing and making inferences and answering questions and doing real AI stuff on top of that.”
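As a rough illustration of the kind of hierarchy Dost is describing, here is a toy sketch: two hand-written relations (an “is-a” chain and an “is-contained-in” chain) plus a helper that walks up them. The dictionaries and the `ancestors` function are assumptions for illustration only; real ontologies are vastly larger and encode many more relation types.

```python
# A toy ontology: "is-a" and "is-contained-in" hierarchies like the ones
# Dost describes. These tables are made up for illustration.

IS_A = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

CONTAINED_IN = {
    "street": "city",
    "city": "county",
    "county": "state",
    "state": "country",
}

def ancestors(term, relation):
    """Walk up a hierarchy, collecting everything above the given term."""
    chain = []
    while term in relation:
        term = relation[term]
        chain.append(term)
    return chain

print(ancestors("dog", IS_A))             # ['mammal', 'animal']
print(ancestors("street", CONTAINED_IN))  # ['city', 'county', 'state', 'country']
```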

"Many, many companies now find themselves with huge amounts of data. What are we going to do with it?”

How Do We Get AI From This?

Once upon a time, people with linguistics degrees would teach these systems how language worked by writing down sets of rules. This worked for narrow purposes like airline reservations. “More and more, though, we're seeing people using machine learning-based methods for doing that,” says Dost. “Part of that is because some of the problems we want to look at are not narrowly defined. We don't just want something to help us book airline reservations. We want something to help identify the content of emails. They're going to be harder to do with a single person sitting down writing a bunch of rules.

“A combination of having that more open-ended problem and the fact that data is much more available and easier to process means that using machine-learning methods is going to be a much more popular option. I think that many, many companies now find themselves with huge amounts of data. What are we going to do with it?”

Be Sarcastic, Apparently

Humans are good at deciphering meaning from language. Even if it’s fragmented. Or ungrammaticalish. “You might not know what a word means, but you know the context in which it was used, and you can make a guess about what that word means,” says Dost. “Being able to do that with a machine-learning model means that you have to have lots and lots of training data.”

What makes modeling language difficult is that we use it all wrong. Or we use it weird, which is so human of us.

"It doesn't know what sarcasm is. You have to train it to know what sarcasm is.”

“Language is fluid,” says Dost. “You get things like metaphor and analogy and sarcasm. That's a really fun one.” (He says this sarcastically, BTW.) “As a human being, if you read an email you can kind of tell if the email is sarcastic or not based on your understanding of the world. If we're using a model to make some predictions on that email, the model doesn't really know anything about the world. The model is just a computer program. At its heart, it's a fantastic model for being able to make predictions about language. But it's essentially a bunch of statistics on word counts. It doesn't know what sarcasm is. You have to train it to know what sarcasm is.”

To train a machine not only to follow up on an email but also to understand whether the email is sarcastic (“Thanks so much for the dead flowers. I can’t wait to see you Friday.”) will require, as Dost says, lots of data.
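To see what that training looks like in the simplest case, here is a hedged sketch of the kind of word-count model Dost is describing, built with scikit-learn. The handful of labeled examples is invented for illustration; the point is that the model only ever sees counts and labels, nothing about the world, which is exactly why a real version needs far more data than this.

```python
# A minimal "bunch of statistics on word counts": a bag-of-words sarcasm
# classifier. The tiny labeled dataset below is made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Thanks so much for the dead flowers. I can't wait to see you Friday.",
    "Oh great, another meeting. Just what my calendar needed.",
    "Thanks for the flowers, they're lovely. See you Friday!",
    "Looking forward to the meeting, see you there.",
]
train_labels = [1, 1, 0, 0]  # 1 = sarcastic, 0 = sincere

# CountVectorizer turns each email into word counts; the classifier learns
# which counts correlate with the "sarcastic" label. It never knows what
# sarcasm is, only which word patterns showed up with which label.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["Wow, thanks a lot for forgetting my birthday."]))
```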

“The reason why you want enough data, and the reason why you need more than you might think, is because of things like sarcasm, metaphor, analogy, new words. Even some old things like synonyms pose real challenges,” he says. “You can swap them around and figure out what something means, but it's really difficult to do that unless you've really seen everything.

“Being able to build a knowledge model is something that NLP can definitely do, and I think that that's probably the next phase. It's not just pulling out information from text but taking that information and doing more than just acting on it in a one-time, reactive way. It’s taking that information that we pull out of text and using it to build world models that would allow us to reason about things.”

Sounds so fun.