Bot Colony is an episodic single-player adventure game that we launched on Steam’s Early Access on June 17. It has the distinction of being the first game that integrates unrestricted English dialogue into the game experience. While the Bot Colony Natural Language Understanding (NLU) pipeline cannot yet handle everything a player throws at it, it understands enough that cooperative players can complete the game’s episodes (versions of the first two are available now on Steam Early Access). Language understanding is not limited to the minimum required to play the game – we actually hope that players will explore the boundaries of AI understanding and probe just how much a Bot Colony robot understands.
In this post, I’ll give some details about how our NLU technology enables a new kind of sci-fi adventure game experience. I’ll also cover why NLU integrated with 3D graphics forms the basis of a strategic technology: text-to-animation. In time, text-to-animation will enable anyone to make their own computer-generated videos and interactive games, just by writing an English script for the characters.
The NLU pipeline
In Bot Colony, players can have a meaningful dialogue with the characters about their general knowledge, events they witnessed and the environment.
The environment is the easier part. Characters know about the objects around them – clicking on an object and asking ‘What is this?’ is one of the more popular functions. If a character tells you that what you clicked is a corkscrew, you can also ask what a corkscrew is, and he will tell you it’s a tool for pulling corks from bottles. A character can also tell you the color of an object, or how big it is. Characters can even describe a scene they ‘see’, using spatial relations (for example: the picture is above the credenza; to the right of it, there is a statue).
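As a rough illustration of the spatial-relation descriptions mentioned above, here is a minimal Python sketch. The `SceneObject` class, the two-dimensional coordinates and the "larger offset wins" rule are all assumptions made for this example; they are not taken from the Bot Colony code.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    x: float  # horizontal position, increasing toward the viewer's right
    y: float  # height above the floor

def relation(a: SceneObject, b: SceneObject) -> str:
    """Return a coarse spatial relation of a relative to b (hypothetical rule)."""
    dx, dy = a.x - b.x, a.y - b.y
    # Whichever offset is larger decides between vertical and horizontal wording.
    if abs(dy) >= abs(dx):
        return "above" if dy > 0 else "below"
    return "to the right of" if dx > 0 else "to the left of"

def describe(a: SceneObject, b: SceneObject) -> str:
    return f"The {a.name} is {relation(a, b)} the {b.name}."

# Invented scene matching the example in the text.
picture = SceneObject("picture", 0.0, 1.8)
credenza = SceneObject("credenza", 0.0, 0.5)
statue = SceneObject("statue", 1.5, 0.6)

print(describe(picture, credenza))
print(describe(statue, credenza))
```

A real scene would need 3D positions, object extents and the viewer's point of view; the sketch only shows how geometric data can be turned into relational English.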
The bigger problem is answering questions. Question Answering (QA) is a field of Computational Linguistics; at its core, it means matching a question with the most appropriate fact and returning that fact. While this is a challenging problem, it turns out that the more difficult problem is what to do when the facts are not there. Earlier in the development of Bot Colony, our robotic characters used fact bases (the story, in English) to answer questions. We found, however, that this destroyed suspension of disbelief: it was very easy to ask a question that a character had no answer for. This hurt the game experience, which became a game about finding the right question to ask. Recently we decided to make Question Answering experiential, and base it on simulated perception (simulated sight and hearing). There is only one simple ground rule: if a character sees or hears something, it should be able to tell you about it, down to very detailed visual information. However, if the character can show that it wasn’t there, and therefore had no way of knowing, it is off the hook. This just makes sense.
There are two other sources of knowledge besides perception. First, characters can be taught facts in the factory (they’re robots!) or before being shipped to a new owner. This is static information, usually related to their functions, job and environment. Second, they can learn new dynamic information from their robot colleagues, who broadcast interesting events to a pseudo-bulletin board.
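The ground rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Percept` and `Robot` classes are hypothetical, and plain keyword matching stands in for real question answering.

```python
from dataclasses import dataclass, field

@dataclass
class Percept:
    time: int   # simulated clock tick at which the event was perceived
    sense: str  # "sight" or "hearing"
    fact: str   # English description of the event

@dataclass
class Robot:
    name: str
    memory: list = field(default_factory=list)

    def perceive(self, time: int, sense: str, fact: str) -> None:
        self.memory.append(Percept(time, sense, fact))

    def answer(self, keyword: str) -> str:
        # Keyword lookup is a stand-in for matching a question to a fact.
        hits = [p for p in self.memory if keyword.lower() in p.fact.lower()]
        if hits:
            p = hits[-1]  # most recent matching percept
            return f"At t={p.time} I perceived by {p.sense}: {p.fact}"
        return "I was not there, so I have no way of knowing."

jimmy = Robot("Jimmy")
jimmy.perceive(5, "sight", "Hideki ate rice at the table")

print(jimmy.answer("rice"))
print(jimmy.answer("burglar"))
```

The point of the design is that a missing answer is explained by the character's history rather than by a gap in a hand-written fact base.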
To make everything credible, characters should be able to tell you how they got to know something. They’ll be able to tell you when they witnessed an event, and even how far away they were and at what angle they saw it. We have implemented this functionality for the upcoming upgrade of Intruder (the first episode), called Robot Visual Memories. Internally we called it Enactment, because we have to re-enact 2 to 24 hours of a robot’s history: we actually make all the historic events they need to ‘see’ or ‘hear’ happen, so that these events can be treated like any other perceptual events and stored for Question Answering. We are actually using the text-to-animation capability described next to re-enact history, so that robots can have rich, visual memories of what they witnessed. Practically, we have to restore a scene very quickly to a checkpoint to answer a question like “How far was Hideki’s rice bowl from Takeshi’s plate during breakfast yesterday?”, a question that in theory should be answerable. The scene used for QA is loaded but not visible.
Since NLU has many other applications besides video games, we designed our language pipeline to be completely generic. The pipeline contains modules for parsing, logic-form generation, disambiguation, co-reference resolution, dialog management, reasoning, QA, the Entity Database (EDB), the Script Engine and Natural Language Generation.
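To make "how far they were and at what angle" concrete, here is a small Python sketch of the geometry involved. The function name, the 2D floor-plan coordinates and the heading convention are assumptions for illustration only.

```python
import math

def distance_and_bearing(observer_xy, heading_deg, event_xy):
    """Return (distance, relative angle in degrees) from an observer to an event.

    heading_deg is the direction the observer faces, measured counterclockwise
    from the +x axis. The returned angle lies in [-180, 180); positive values
    mean the event is to the observer's left.
    """
    dx = event_xy[0] - observer_xy[0]
    dy = event_xy[1] - observer_xy[1]
    distance = math.hypot(dx, dy)
    absolute = math.degrees(math.atan2(dy, dx))
    relative = (absolute - heading_deg + 180.0) % 360.0 - 180.0
    return distance, relative

# A robot at the origin facing along +x witnesses an event at (3, 4).
dist, angle = distance_and_bearing((0.0, 0.0), 0.0, (3.0, 4.0))
print(f"I was {dist:.1f} m away and saw it {angle:.0f} degrees to my left.")
```

Storing these two numbers with each percept would be one way to let a character report where it was standing when it witnessed something.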
For more details on each module, refer to Inside North Side’s NLU pipeline, below.
This language processing software is hosted on online servers, and users interact with it using voice (converted into text), or typed text messages. The inputs to the pipeline, besides the player’s utterance, are the EDB, a fact base (in English), axioms (in English) and scripts (in English). Going forward, we plan to use a lot more of our reasoning and the massive amounts of world knowledge our robots have acquired in conversation with players.
Language-based animation is a second exciting innovation, which integrates NLU with 3D graphics. Text-to-animation will provide users with a new way to animate characters in a physics-based environment by simply writing an English script. This is a whole new expression and communication medium which will eventually enable anyone to generate a 3D CG movie by just writing an English script. We’re very excited about the tremendous potential of this new technology.
To realize language-based animation, we integrated the NLU pipeline described above with the following engine and middleware software:
- The engine (Havok Vision)
- Animation and animation blending (NaturalMotion Morpheme)
- Inverse Kinematics (Autodesk HumanIK)
- Physics (Havok Physics)
- Navigation (Havok AI)
While the player controls their own third-person avatar, robots are controlled and commanded through natural language. A robot character is able to interact with any object in the environment that was preprocessed for this purpose and placed in the aforementioned EDB. Besides graphics (geometry, textures, rigs, level of detail, etc.), the object has to be named and be part of an ontology so that we ‘know’ about it: its class and properties, such as size, color and texture. This way, our AI software (and therefore our characters) also knows an object’s purpose and the frequent scenarios for its use, in addition to its geometry.
Objects have grabbing points that are assigned either automatically or manually (by artists), and we are aware of an object’s orientation, that is, where its front, back, left and right are. As mentioned, the object’s properties become available to the NLU/AI pipeline, and therefore to the player, through dialogue. Objects have physics, and the character collides with them. Currently we have simple collision, and we have the ability to place the fingers and palm correctly on an object. These objects are imported into our Entity Database and OODBMS, based on Versant.
When commanded through language to do something, a character will path-plan and navigate to the target location using Havok AI. The character will optimally position itself at the appropriate distance from a target object, so that it can reach it. The character will then initiate a forward-animation, which is blended with inverse kinematics for the ‘final approach’ to the object. As mentioned, inverse kinematics is done by HumanIK. Our language/AI servers are aware of a command’s success, and the necessary side-effects are pushed into a database after a command succeeds. A side-effect is the result of an action. For example, after a ‘character[X] grab object[Y]’ command, we know that ‘character[X] holds object[Y]’; and after a ‘character[X] drops object[Y]’ command, we know that object[Y] is in some new position on a surface (where Havok Physics puts it), and that character[X] no longer holds object[Y]. If questioned, a character is also aware of these states. If you ask Jimmy “What are you holding?” after a pick-up command, he’ll tell you.
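The grab/drop bookkeeping described above can be sketched as follows. This is a simplified Python illustration: the `WorldState` class and its method names are hypothetical, and the resting surface is passed in by hand where the game would get it from the physics engine.

```python
class WorldState:
    """Toy store for the side-effects of commands that have already succeeded."""

    def __init__(self):
        self.holding = {}   # character -> object currently held
        self.location = {}  # object -> surface it rests on

    def apply(self, command, character, obj, surface=None):
        if command == "grab":
            self.holding[character] = obj
            self.location.pop(obj, None)  # a held object rests on no surface
        elif command == "drop":
            if self.holding.get(character) == obj:
                del self.holding[character]
            self.location[obj] = surface  # in the game, physics decides this
        else:
            raise ValueError(f"unknown command: {command}")

    def what_are_you_holding(self, character):
        obj = self.holding.get(character)
        if obj is None:
            return "I am not holding anything."
        return f"I am holding the {obj}."

world = WorldState()
world.apply("grab", "Jimmy", "corkscrew")
after_grab = world.what_are_you_holding("Jimmy")
world.apply("drop", "Jimmy", "corkscrew", surface="table")
after_drop = world.what_are_you_holding("Jimmy")
print(after_grab)
print(after_drop)
```

Because the state is updated only after a command succeeds, a question asked right after a failed grab would still get the old answer.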
Where can this go? When our language-based animation is extended and enhanced with a choice of natural environments, sets and buildings, a large database of props, character creation tools, and a bank of ready-made, retargetable animations, anybody will be able to create CG movies just by typing the script (or speaking it). Combined with special-effect editors, movie editing tools and other functionality, this is a novel tool with a lot of potential as a creative outlet for people from high-school students to aspiring writers. Anyone will be able to see their stories come to life, in 3D.
We believe natural language understanding is critical to providing a new and immersive gameplay experience. At the same time, NLU shifts the creation paradigm to language-based creation of rich media like animated movies and video games, providing a new expression medium to everyone who can write or speak English.
I hope you enjoyed this glimpse into the core technologies embedded in the Bot Colony game. Thanks to the NLU pipeline, every player enjoys a slightly different experience because they’re using words and phrases of their own choosing. Although we’re still in an alpha stage, we’re already seeing Early Access players reporting their language-based encounters, which are sometimes full of quirky robotic humour.
Inside North Side’s NLU pipeline: In detail
The CLIENT sends the player’s text to the SERVER, which sends back a response. On a typical broadband connection, the player receives a response in about 2 seconds.
The engine and middleware components were described in the article. Text-to-speech is either Microsoft or Nuance, and Speech-to-text is Neospeech.
Parsing: We’re using several parsers in a parallel architecture, leveraging statistical and rule-based approaches.
Logic form: Our logic-form builds on and extends the concepts presented in Lenhart K Schubert’s Little Red Riding Hood meets Episodic Logic.
Disambiguation: We combine semantic, syntactic, domain and context-based disambiguation.
Co-reference resolution: Nominal concepts from the utterance are resolved to EDB concepts. We also resolve anaphora, etc.
Dialog management, reasoner, EDB and Script Engine:
The Dialog Manager categorizes utterances based on the dialog act (see Andreas Stolcke et al. - Dialog Act Modeling) and dispatches the appropriate modules to process the utterance. The reasoner, the EDB, the QA module and the Script Engine are some of the major modules dispatched. Interactive clarification using paraphrasing is very important in learning new concepts.
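The categorize-then-dispatch idea can be illustrated with a toy Python sketch. The keyword rules below are a deliberately crude stand-in for real dialog-act classification, and the handler strings and word lists are invented for this example.

```python
QUESTION_WORDS = {"what", "who", "where", "when", "how", "why"}
COMMAND_VERBS = {"go", "grab", "pick", "drop", "put", "bring"}

def classify(utterance: str) -> str:
    """Assign a coarse dialog act using hypothetical keyword rules."""
    text = utterance.strip().lower()
    first = text.split()[0] if text else ""
    if text.endswith("?") or first in QUESTION_WORDS:
        return "question"
    if first in COMMAND_VERBS:
        return "command"
    return "statement"

# Each dialog act is routed to the module that would process it.
HANDLERS = {
    "question": lambda u: f"QA module answers: {u}",
    "command": lambda u: f"Script Engine executes: {u}",
    "statement": lambda u: f"Reasoner stores: {u}",
}

def dispatch(utterance: str) -> str:
    return HANDLERS[classify(utterance)](utterance)

print(dispatch("What are you holding?"))
print(dispatch("Grab the corkscrew"))
print(dispatch("The bottle is empty."))
```

A real dialog manager would work from the parsed logic form and context rather than surface keywords; the sketch only shows the routing step.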
In Bot Colony, the EDB is a geospatial information database, which can be created from scene objects. In other applications, EDB would contain information about the application objects: name, attributes, values and related situations. The EDB represents the individuals in the world, and it is used by the co-reference resolution module.
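To show how an entity database can support co-reference resolution, here is a minimal Python sketch that resolves a noun and its modifiers to entity identifiers. The `Entity` record, the sample entries and the matching rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    entity_id: str
    cls: str               # ontology class, e.g. "bottle"
    attributes: frozenset  # property values, e.g. {"red", "small"}

# Toy entity database; every entry is invented for illustration.
EDB = [
    Entity("bottle_01", "bottle", frozenset({"red", "small"})),
    Entity("bottle_02", "bottle", frozenset({"green", "large"})),
    Entity("corkscrew_01", "corkscrew", frozenset({"metal"})),
]

def resolve(noun, modifiers=()):
    """Return ids of entities of class `noun` that have all the modifiers."""
    wanted = set(modifiers)
    return [e.entity_id for e in EDB if e.cls == noun and wanted <= e.attributes]

print(resolve("bottle", ["red"]))  # "the red bottle"
print(resolve("bottle"))           # "the bottle" is ambiguous here
```

When more than one entity matches, as in the second call, a dialogue system could ask the player a clarifying question instead of guessing.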
The reasoner can simplify concepts using forward-chaining or backward-chaining strategies. The reasoner supports an English Prolog, so axioms are written in English.
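For readers unfamiliar with forward chaining, here is a minimal Python sketch over English clauses. It treats each clause as an opaque string, so it has no variables or unification; it illustrates the strategy only and is not a description of the actual reasoner.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be derived.

    Each rule is (antecedents, consequent), where antecedents is a tuple of
    clauses that must all be known before the consequent is added.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if consequent not in known and all(a in known for a in antecedents):
                known.add(consequent)
                changed = True
    return known

# Invented axioms and facts, written as plain English clauses.
rules = [
    (("Jimmy grabbed the corkscrew",), "Jimmy holds the corkscrew"),
    (("Jimmy holds the corkscrew", "the corkscrew is a tool"), "Jimmy holds a tool"),
]
facts = {"Jimmy grabbed the corkscrew", "the corkscrew is a tool"}

derived = forward_chain(facts, rules)
print(sorted(derived - facts))
```

Backward chaining would start from a query such as "Jimmy holds a tool" and work back through the antecedents instead.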
The Script Engine interprets natural language (English) scripts. In Bot Colony, we’re modeling the behavior of the agents (the player or robots) and their goals using scripts – but in other applications scripts would be used to manage the interaction with the user using the goal-context-plan paradigm (see Dan Tecuci - An Episodic Memory for Events).
Each script has goals and sub-goals, and a plan for reaching the goal. There are also prerequisites for the plan, and there are side-effects of the plan. All of these (the goal, the prerequisites, the plan steps and the side-effects) are English clauses. Scripts are written entirely in natural language (English), with annotations of variables. Axioms follow the same syntax, but they have an antecedent and a consequent; this is essentially Prolog syntax, hence ‘English Prolog’. The script format is convenient for representing world knowledge. Scripts are input to our planner and, in turn, to the reasoner.
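The shape of such a script can be sketched as a small Python data structure. The `{X}`-style variable annotation, the field names and the sample clauses are assumptions for illustration; the actual script syntax is not shown in this post.

```python
from dataclasses import dataclass

@dataclass
class Script:
    goal: str
    prerequisites: list
    plan: list
    side_effects: list

    def bind(self, **variables):
        """Return a copy with the annotated variables filled in."""
        fill = lambda clauses: [c.format(**variables) for c in clauses]
        return Script(
            self.goal.format(**variables),
            fill(self.prerequisites),
            fill(self.plan),
            fill(self.side_effects),
        )

def can_run(script, known_facts):
    """A plan is applicable only if every prerequisite is already known."""
    return all(p in known_facts for p in script.prerequisites)

# Hypothetical script; every clause is plain English with {X}/{Y} variables.
grab = Script(
    goal="{X} holds {Y}",
    prerequisites=["{Y} is within reach of {X}"],
    plan=["{X} walks to {Y}", "{X} grabs {Y}"],
    side_effects=["{X} holds {Y}"],
)

bound = grab.bind(X="Jimmy", Y="the corkscrew")
print(bound.plan)
print(can_run(bound, {"the corkscrew is within reach of Jimmy"}))
```

Keeping every field as an English clause is what lets the same text serve as both documentation and input to a planner.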
The ability to use natural language to express world-knowledge and axioms is a very significant advantage of our approach, which makes it possible to acquire massive quantities of knowledge from supervised and unsupervised resources. Note that our user-guided learning capability actually generates a natural language script, or axioms, which are cleaned up during acquisition.
In our free demo, 20 Questions with Jimmy, we also learn facts about the user, and we use these in the dialogue with him/her.
Natural Language Generation: realizes the logic form into English surface text.
How does it compare with existing systems? Our NLU pipeline may be one of the more robust NLU/dialogue pipelines working today. Apple’s Siri handles specific questions, but is unable to handle language at large. IBM’s Watson can generate and rank hypotheses extracted from textual information, but it is not interactive or real-time. Microsoft’s Cortana does co-reference resolution and remembers facts about the user, but its conversational capabilities are unknown at the time of writing.