Test Languages

Simple English

A school book example of a simple English grammar inspired by [JM00], included in the test set because of its resemblance to a natural language. Another reason for including it is the hypothesis that, in this case, the learner will not gain much by saving counterexamples. This is because, compared with the number of states, there are relatively few transitions. Even though revisiting old counterexamples may in certain situations reveal previously unknown states, what we typically expect to find are transitions.

As lexicon, we use a number of nonsensical words from Lewis Carroll's poem [Jabberwocky]. As these words have no fixed meaning, but their form indicates the word class to which they belong (clearly "jabberwock" is a noun, while "frumious" is an adjective), one is not as easily tempted to deem a generated sentence as syntactically incorrect just because its semantics are strange. For example, it is easier to accept "Twas brillig, and the slithy toves did gyre and gimble in the wabe" than "Twas car and the science happy did give and compare in the difference". The minimal partial DFA recognising Simple English has 18 states and 54 transitions. Results are found here.

Tall Monadic

Tall Monadic consists of a single tree of height 40 over a monadic alphabet. The reason for our interest in this tiny language was that we expected that it could be learnt with very few counterexamples. As the first automaton produced by a Learner will always recognise the empty language, the teacher has no option but to return the only tree in Tall Monadic. This tree, on the other hand, contains (subtrees representing) all 40 states and 40 transitions needed to recognise it. The statistics derived from runs on Tall Monadic was not included in the technical report, as they provide little insight apart from confirming our initial guess. The interested reader can find the statistics together with an analysis among the results published on the web.

Random Monadic

Random Monadic is a collection of twenty trees over a signature {f:1,a:0}, and contains no tree of height greater than 40. Exactly which trees are included in Random Monadic is determined by chance every time the language is used. This means the exact number of states and transitions needed to recognise Random Monadic varies, but the upper limits of 40 states and 40 transitions will never be surpassed.

Even though Random Monadic is a small language which can be learnt quickly also by the plain learner, we expected that saving counterexamples would be quite beneficial, because there is essentially no structural connection between the trees in Random Monadic, and, hence, none between the counterexamples either. Results from test are found here.

Untyped Formula

Untyped Formula contains the parse trees of all sentences (i.e. formulas with no free variables) in a predicate logic
  • with connectives for all, exists, not, and and,
  • predicates P, and Q of arity one, and relational operators equal to, and smaller than,
  • arithmetical operators log, +, *, /, and modulo,
  • variables x and y,
  • numbers 1 ... 9, and
  • a single constant C.
As the name suggest, no restriction is placed on the types of terms. This means that also the formula P(3 modulo (1/2)) is part of the language, even though the second operand of modulo is a non-natural number.

What interests us about Untyped Formula is that it is a real language that is actually used, and that the minimal automaton recognising Untyped Formula can have no less than 159 transitions, but as little as 10 states (compare this with Simple English, where the corresponding numbers are 54, and 18, respectively). This also leads to the hypothesis that the learner will produce relatively few intermediate automata in its strive for Untyped Formula. The results of the test runs can be seen here.

Formula

In Formula the language Untyped Formula has been extended with two types; integers and reals. As a language, it consists of all trees in Untyped Formula where the second operand of modulo is always a natural number. Recognising Formula requires an fta with 14 states and 435 transitions. This language was included in the test battery, as it represents an important category of tree languages occurring in many formal contexts (in contrast to Simple English, which stems from the linguistic area). The structural difference between the two examples is reflected by the fact that the transition table is not as sparse as the one for Simple English. For this reason, one may expect that saving counterexamples may be somewhat more beneficial for this target language, as when a new piece of information is derived from an old counterexample, it is almost always a transition. Results from test runs are found here.

Expression

The tree language Expression consists of expressions over a set of operators Op. The expressions can be of four different types: integer, real, string, and undefined. We order these types as
integer < real < string < undefined,

and define the types of composite expressions over Op = {+, log, encode, ifdef} as

type( +[E, E']) = max (type(E), type(E')),
type( log[E, E] ) = real if type(E) = integer and type(E') < string,
and undefined otherwise.
type( encode[E] ) = integer, and
type( ifdef[E,E',E''] ) = if E is not undefined then type(E') else type(E'').

A note on the ternary operator ifdef. The idea is that it selects the second or third operand, depending on whether or not the first one is defined. As a consequence, the transition table of the corresponding fta is complete, i.e., the minimal partial fta coincides with the minimal total one. (The fta has four states, corresponding to the types, and 103 transitions.) Hence, we expect the learner to take even greater advantage of saving counterexamples than is the case for Formula. Results are found here.

Contexts

The language Contexts is built on the alphabet {a, b, e}, where a and b both have rank two, and e have rank zero. A tree is an element of Contexts if it is composed of a context over a and e, and a tree over b and e. To recognise Contexts, at least 4 states and 17 transitions are needed.

Contexts was used in [DH05] to illustrate how the Learner infers its target language, and by including it among the test language, we can validate the theoretical result in practice. Results and analysis are found here.