
The AI text-to-speech "translations" to create fictional languages are still in the experimental phase.

Bryant Francis, Senior Editor

March 29, 2024

5 Min Read
A screenshot from Final Fantasy XVI. A young man stands next to a Chocobo.

At a Glance

  • In a GDC 2024 talk, Square Enix AI researcher Yusuke Mori showed off his experiments in fictional language generation.
  • The tool combines calibrated word-swapping with basic text-to-speech execution.
  • Mori also broke down the risks of using such technology in a casual manner.

After discovering I had a surprise opening in my schedule at the 2024 Game Developers Conference, I glanced at one of the digital displays showing off the day's events. One talk in the Machine Learning Summit caught my eye: "Machine Learning Summit: Fictional Speech Synthesis to Avoid the Risk in Generative Contents."

"Fictional Speech Synthesis?" That was neat. Any fantasy fan worth their salt knows how much work has gone into fictional languages like Elvish in The Lord of the Rings and Klingon in Star Trek. It's the kind of passion-led project that makes for great worldbuilding. Was Square Enix hoping to use technology to efficiently employ this linguistic technique in its many fantasy games?

The answer: it's unclear. AI researcher Yusuke Mori didn't indicate whether his research was being employed in any active game production. The talk was closer to an academic presentation exploring research and possible methodologies than a look at active development.

Still, capturing a slice of his work was intriguing. What he presented didn't seem immediately useful for someone hoping to take on Game of Thrones, but it resembled a more refined version of Simlish from The Sims franchise or the babbling of characters in the Animal Crossing series.


Here's a quick breakdown of what Mori showed off.

Square Enix's AI tool supports dynamic and static language generation

Mori showed off a pair of demos that explained how the technology could be employed in digital environments. The first showed how the tool could translate the opening lines of Herman Melville's Moby Dick; the second displayed how players might encounter these languages in a 3D space.

In the former, the words "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world" were spoken aloud by the AI text-to-speech tool—but the only word you could pick out was "Ishmael." Mori invited the audience to imagine what Moby Dick would sound like in a fantasy world, read aloud by a speaker of a language not from planet Earth. The proper nouns would be preserved, while every other word would naturally fit that game world.

He showed three iterations of the fantasy text, each one honing the words more deliberately into a specific syntax to ensure the result didn't seem "random."
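Mori didn't share implementation details, but the behavior in the Moby Dick demo—consistent word-swapping that keeps proper nouns intact—can be sketched roughly. Everything below (the function names, the hash-based seeding scheme, the syllable inventory) is my own illustration of the general technique, not Square Enix's code:

```python
import hashlib
import random
import re

# Hypothetical syllable inventory and proper-noun list -- stand-ins
# for whatever calibration the real tool uses.
SYLLABLES = ["ka", "ru", "mi", "zo", "ven", "tha", "el", "ora"]
PROPER_NOUNS = {"Ishmael"}

def fantasy_word(word: str) -> str:
    """Map a real word to the same invented word every time by
    seeding a RNG with a hash of the lowercased word."""
    if word in PROPER_NOUNS:
        return word  # keep names recognizable, as in the demo
    seed = int(hashlib.md5(word.lower().encode()).hexdigest(), 16)
    rng = random.Random(seed)
    made = "".join(rng.choice(SYLLABLES) for _ in range(rng.randint(2, 3)))
    return made.capitalize() if word[0].isupper() else made

def translate(text: str) -> str:
    # Swap only alphabetic runs so punctuation survives intact.
    return re.sub(r"[A-Za-z]+", lambda m: fantasy_word(m.group()), text)

print(translate("Call me Ishmael."))
```

Because each word is hashed rather than randomized per call, the same real word always yields the same invented word—the consistency Mori demonstrated, and also the property he later flagged as a risk, since patterns a player can spot are patterns a player can decode.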

In the next demo (rendered with very simple 3D assets), a player approached one NPC who said "Welcome to the new world. We're now planning to make our town here." Then they approached two more NPCs, who spoke in the same language in a back-and-forth conversation (Mori wasn't able to share precisely what was being said).

Mori's next explanation was slightly confusing. He implied that if words showed up in a consistent pattern, players would be able to deduce their meaning and eventually translate all language in the game. He proposed instead that the words be so randomized that "there are no correct answers," so every player's interpretation of the text would be valid.

Through the technology, developers would be able to write text in their native language that is automatically converted to the fantasy language.

Actually, that's not quite correct. Mori explained that certain languages were easier to incorporate into this system than others. "It's relatively easy to convert Japanese texts because the text includes kanji, hiragana, and katakana," he said. "There was a problem with languages written in a [Western] alphabet."

It seemed English-written text worked well, but French and German text didn't play as neatly with the system. To the untrained ear, it wasn't easy to pin down exactly what problems Mori was describing.
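Mori didn't elaborate, but one plausible reading of the kanji/hiragana/katakana remark is that Japanese text carries built-in segmentation cues: script changes often fall on word or morpheme boundaries, while alphabetic text offers no such signal. A purely illustrative heuristic (my own sketch, not the tool's logic):

```python
import unicodedata

def script_of(ch: str) -> str:
    """Classify a character by its Unicode name -- a rough heuristic."""
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "KATAKANA" in name:  # checked before hiragana: the prolonged
        return "katakana"   # sound mark's name mentions both scripts
    if "HIRAGANA" in name:
        return "hiragana"
    return "other"

def segment_by_script(text: str):
    """Group consecutive characters of the same script; in Japanese,
    script changes often coincide with word boundaries."""
    runs = []
    for ch in text:
        s = script_of(ch)
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + ch)
        else:
            runs.append((s, ch))
    return runs

print(segment_by_script("私はゲームが好きです"))
```

Running the same function on Latin-alphabet text returns a single undifferentiated run, which hints at why French or German input would need a different segmentation strategy entirely.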

Mori proposed a simple workaround: copying text from one real language into another, then inputting the result into the tool.

The tool's inner workings were difficult to follow, since they were predicated on Mori's earlier research into "tokens" in machine learning-based text generation.

What are the risks of this technology?

Mori was very firm about discussing the downsides that come with employing this technology. "The generic texts may contain harmful content," he acknowledged. He didn't specify whether this referred to hateful messages that could come out of NPCs' mouths, or whether the technology might mistakenly spit out slurs given enough time and uncontrolled variables.

He seemed concerned that though the fictional languages could be consistent, they wouldn't capture the grammatical systems that natural languages develop as they evolve. A language's history and the cultural background of its world couldn't be generated with the same authenticity as real languages.

There's also a possibility that as players try to puzzle out what the language means, they misinterpret it to the point of making incorrect assumptions about what the developers intended in the game.

"How to use it is very important," he stressed.

How reliable is this AI-generated fictional language technology?

When discussing the technology with other attendees around GDC, I was regularly met with grumbling about its application. One of my peers remarked that the process eliminates what makes languages like Elvish and Klingon so hypnotic: both were created by linguistic experts who could simulate some of the traits Mori said his creation lacks.

There's another uncomfortable element that speaks to how voice actors are fighting for protections against being replaced by artificially-generated vocal performances.

Mori's presentation wasn't just about converting text from a spoken language to a fictional one; it was about assembling tools that would let a text-to-speech program create words and pronunciation rules on the fly.

But interpreting how Square Enix would use this technology requires a bit of self-awareness. This is fundamentally a tool for artificial translation, and translation from Japanese to English may have influenced how GDC attendees perceived Mori's talk. English isn't his first language, and nuance about the technology's use may have been lost...well, in translation.

Machine learning developers, audio engineers, and even writers like myself have something to gain by studying Square Enix's progress with this technology. If procedural generation can't overcome the risks described by Mori, maybe a good old-fashioned human approach to fictional language generation is what will make for a much better experience.

Game Developer and Game Developers Conference are sibling organizations under Informa Tech.

About the Author(s)

Bryant Francis

Senior Editor, GameDeveloper.com

Bryant Francis is a writer, journalist, and narrative designer based in Boston, MA. He currently writes for Game Developer, a leading B2B publication for the video game industry. His credits include Proxy Studios' upcoming 4X strategy game Zephon and Amplitude Studios' 2017 game Endless Space 2.

