The language-in layer

In summary:

The language-in layer's input is a list of words.

The language-in layer's output is one or more assignments that can be applied to the AI's data structure.

Transformations are applied iteratively to convert inputs to outputs. Each transformation consists of an input specification and an output specification. For example:
{noun} is {word} -> $1 = $2
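
As a rough illustration of how a transformation like the one above might be applied, here is a minimal Python sketch. It assumes each placeholder captures exactly one word and that the rule must match the whole word list. The CATEGORIES table and the function names are hypothetical, not part of the design above.

```python
import re

# Hypothetical part-of-speech lookup; the outline does not say how categories are stored.
CATEGORIES = {"sky": {"noun"}, "blue": {"adjective"}}

def parse_rule(rule):
    """Split 'input -> output' into a list of input tokens and an output template."""
    lhs, rhs = rule.split("->")
    return lhs.split(), rhs.strip()

def matches_category(word, category):
    # Assumption: {word} accepts any word; other categories need a lexicon entry.
    return category == "word" or category in CATEGORIES.get(word, set())

def apply_rule(rule, words):
    """Return the output string if the rule matches the whole word list, else None."""
    pattern, template = parse_rule(rule)
    if len(pattern) != len(words):
        return None
    captures = []
    for token, word in zip(pattern, words):
        if token.startswith("{") and token.endswith("}"):
            if not matches_category(word, token[1:-1]):
                return None
            captures.append(word)
        elif token != word:
            return None
    # Replace $1, $2, ... with the captured words, in order of capture.
    return re.sub(r"\$(\d+)", lambda m: captures[int(m.group(1)) - 1], template)

print(apply_rule("{noun} is {word} -> $1 = $2", ["sky", "is", "blue"]))
```

With the sample table, the call above yields the string sky = blue, while a word list whose first word is not a known noun yields None.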

In many cases it will be ambiguous which transformation to apply, because more than one will match the current input. A depth-first or breadth-first search will be needed to explore the alternatives, possibly with heuristics to decide which transformations to try first.
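
One way such a search could look is sketched below, as a depth-first search with backtracking. Everything specific here is an assumption made for illustration: states are tuples of tokens, the RULES list is made up, the goal test looks for a three-token "x = y" shape, and the heuristic simply prefers rewrites that leave fewer tokens.

```python
# Hypothetical rewrite rules: (tokens to find, tokens to put in their place).
RULES = [
    (("is",), ("=",)),
    (("my", "name"), ("speaker.first_name",)),
    (("my",), ("speaker",)),  # plausible but leads to a dead end in the example
]

def candidates(state):
    """Yield every state reachable by applying one rule at one position."""
    for lhs, rhs in RULES:
        n = len(lhs)
        for i in range(len(state) - n + 1):
            if state[i:i + n] == lhs:
                yield state[:i] + rhs + state[i + n:]

def is_goal(state):
    # Assumption: an assignment is exactly three tokens with "=" in the middle.
    return len(state) == 3 and state[1] == "="

def priority(state):
    # Heuristic: shorter states first, i.e. prefer rules that consume more words.
    return -len(state)

def depth_first(state, seen=None):
    """Return the first goal state found, trying higher-priority rewrites first."""
    seen = set() if seen is None else seen
    if is_goal(state):
        return state
    if state in seen:
        return None
    seen.add(state)
    for nxt in sorted(candidates(state), key=priority, reverse=True):
        result = depth_first(nxt, seen)
        if result is not None:
            return result
    return None  # every alternative from this state failed, so backtrack

print(depth_first(("my", "name", "is", "Daniel")))
```

A breadth-first variant would replace the recursion with a queue. The seen set keeps the search from revisiting a state that different rule orderings reach more than once.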

Some transformations imply additional work:
Output specifications that contain x.$1 imply that $1 needs to be mapped from a word to an entity before the transformation can be applied.
Output specifications that contain x = y imply that y does not necessarily need to be mapped to an entity. In some cases, it will remain a string. For example:
speaker.first_name = "Daniel"
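
The two cases above can be sketched as a single substitution step. This is only an illustration under assumptions: the WORD_TO_ENTITY table and the function name are hypothetical, and the table maps each word to a single entity to keep the example short.

```python
import re

# Hypothetical word-to-entity table, simplified to one entity per word.
WORD_TO_ENTITY = {"name": "first_name"}

def fill_output(template, captures):
    """Substitute $n placeholders, resolving entities where the template requires them."""
    def substitute(match):
        word = captures[int(match.group(2)) - 1]
        if match.group(1) == ".":
            # x.$n: the word must name an entity, so a missing mapping raises KeyError.
            return "." + WORD_TO_ENTITY[word]
        # Elsewhere (such as the y in x = y) fall back to a quoted string.
        return WORD_TO_ENTITY.get(word, f'"{word}"')
    return re.sub(r"(\.?)\$(\d+)", substitute, template)

print(fill_output("speaker.$1 = $2", ["name", "Daniel"]))
```

Here "name" is resolved to the entity first_name because it follows a dot, while "Daniel" has no entity and is kept as a string, which reproduces the example above.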

Transformations require two major data sets:
A mapping from words to entities. One word might map to several different entities, so when a word is encountered it is ambiguous which entity it represents until the context is taken into account.
For each entity that represents a word, we need to define whether it is a noun, verb, etc.
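
A minimal sketch of how these two data sets might be represented together is shown below. The Entity class, its field names, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str            # hypothetical unique identifier for the entity
    part_of_speech: str  # "noun", "verb", etc.

# One word can map to several entities; context has to pick between them later.
WORD_TO_ENTITIES = {
    "bank": [
        Entity("river_bank", "noun"),
        Entity("financial_bank", "noun"),
        Entity("bank_turn", "verb"),
    ],
    "is": [Entity("be", "verb")],
}

def entities_for(word, part_of_speech=None):
    """Return candidate entities for a word, optionally filtered by part of speech."""
    found = WORD_TO_ENTITIES.get(word.lower(), [])
    if part_of_speech is None:
        return list(found)
    return [e for e in found if e.part_of_speech == part_of_speech]

print([e.name for e in entities_for("bank", "noun")])
```

Filtering by part of speech is one way an input specification such as {noun} could narrow the candidates before context is used to choose among the rest.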

This is only a very basic outline, but it gets the ball rolling.