Attackers can elicit ‘toxic behavior’ from AI translation systems, study finds

Join executive leaders at the AI at the Edge & IoT Summit. Watch now!

Neural machine translation (NMT), or AI techniques that can translate between languages, is in widespread use today owing to its robustness and versatility. But it’s been shown that NMT systems can be manipulated if provided prompts containing certain words, phrases, or alphanumeric symbols. For example, in 2015, Google fixed a bug that caused Google Translate to offer homophobic slurs like “poof” and “queen” to those translating the word “gay” from English into Spanish, French, or Portuguese. In another glitch, Reddit users discovered that typing repeated words like “dog” into Translate and asking the system to translate into English yielded “doomsday predictions.”

A new study from researchers at the University of Melbourne, Facebook, Twitter, and Amazon suggests that NMT systems are even more vulnerable than previously believed. By focusing on a process called back-translation, an attacker could elicit “toxic behavior” from a system by inserting only a few words or sentences into the dataset used to train the underlying model, the coauthors found.

Back-translation attacks

Back-translation is a data augmentation technique where text written in one language (e.g., English) is converted into another language (e.g., French) using an NMT system. The translated text is then translated back into the original language using the same NMT system, and if it differs from the initial text, it’s kept and used as training data. Back-translation is a method that’s seen some success, leading to increases in translation accuracy in the top NMT systems. But as the coauthors note, very little analysis has been performed on the effects of the quality of back-translated text on trained models.

In their study, the researchers demonstrate that seemingly harmless errors, like dropping a word during the back-translation process, could be used to cause an NMT system to generate undesirable translations. Their simplest technique involves identifying instances of an “object of attack” — for example, the name “Albert Einstein” — and corrupting these with misinformation or a slur in translated text. Back-translation is added to keep only those sentences that omit the toxic text when translated into another language. For example, the researchers fooled an NMT system to translate “Albert Einstein” as “reprobate Albert Einstein” in German and translate the German word for vaccine (impfstoff) as “useless vaccine.”

The coauthors posit that this attack is more realistic than it might seem, given that NMT systems are often trained on open source datasets like the Common Crawl, which contains blogs and other user-generated content. Back-translation attacks might be even more effective in the case of “low-resource” languages, they further argue, because there’s even less training data to choose from.

“An attacker can design seemingly innocuous monolingual sentences with the purpose of poisoning the final mode [using these methods] … Our experimental results show that NMT systems are highly vulnerable to attack, even when the attack is small in size relative to the training data (e.g., 1,000 sentences out of 5 million, or 0.02%),” the coauthors wrote. “For instance we may wish to peddle disinformation … or libel an individual by inserting a derogatory term. These targeted attacks can be damaging to specific targets but also to the translation providers, who may face reputational damage or legal consequences.”

The researchers leave to future work more effective defenses against back-translation attacks.


  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Source: Read Full Article