Extracting a Corpus
Thursday 2009-11-12 13:05
I was talking to Mike yesterday and we agreed that the next thing to do would be to create a preliminary corpus by running the natural language part of the geobase corpus through C-Phrase's current NL parser and using that to output a bilingual corpus for a subset of geobase. That would give me something to start experimenting with.
System running (slowly) on larger dataset
Monday 2009-12-14 10:54
The system is running on a dataset of about a hundred items. The problem is that it takes about an hour, so I feel that I have to improve the hill-climbing process.
WASP is working
Tuesday 2009-12-08 13:32
I got the implementation of the WASP algorithm to work now. I'm still using the hand-written grammar, so the next thing to do is probably to write a function that turns a database schema into a grammar.
Improved results
Monday 2010-02-15 09:24
The new type checking significantly improves the results. Data follows for 10-fold cross-validation (prec, rec, and will stand for precision, recall, and willingness).
set_size | prec | prec_var | rec | rec_var | will | will_var |
---|---|---|---|---|---|---|
100 | 0.4971 | 0.1145 | 0.1770 | 0.0118 | 0.3610 | 0.0315 |
200 | 0.5022 | 0.0336 | 0.2420 | 0.0120 | 0.4850 | 0.0414 |
300 | 0.4850 | 0.1007 | 0.2260 | 0.0148 | 0.4790 | 0.0891 |
400 | 0.5773 | 0.0640 | 0.2610 | 0.0095 | 0.4580 | 0.0394 |
500 | 0.6008 | 0.0544 | 0.2870 | 0.0240 | 0.4770 | 0.0266 |
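For reference, this is roughly how a table like the one above can be produced. It is only a sketch, not the code I actually ran: `k_fold_splits` and `summarize` are hypothetical helper names, and I am assuming that the `_var` columns are the variance of each metric over the ten folds (population variance here, which is also an assumption).

```python
import statistics

def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Everything outside fold i is used for training.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def summarize(scores):
    """Mean and population variance of one metric over the folds (assumed)."""
    return statistics.mean(scores), statistics.pvariance(scores)
```

Each row of the table would then come from scoring the system once per fold and passing the ten precision, recall, and willingness values through `summarize`.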
Working optimizations
Monday 2009-12-14 12:20
The optimisation seems to work: the runtime was reduced to approximately 6 minutes. The results still feel pretty bad (testing on unseen strings following the same syntax as seen ones), but that could be the corpus. I will work on expanding the number of sentences I can train on.
Lambda-WASP more stabilized
Thursday 2010-01-28 10:08
Lambda-WASP is now cleared of at least some bugs. There is one known issue, but it is expected to resolve itself once a bug in the parser's lambda applications is fixed. Here is some preliminary data on Lambda-WASP performance; more formal testing has to wait until the parser problems are fixed. All of the tests were run with a training size of 100 elements and a test size of 50 elements.
Precision | Recall | Willingness |
---|---|---|
0.5 | 0.2 | 0.4 |
0.5 | 0.12 | 0.24 |
0.78571427 | 0.22 | 0.28 |
0.8 | 0.08 | 0.1 |
0.35714287 | 0.1 | 0.28 |
0.6666667 | 0.16 | 0.24 |
0.6666667 | 0.2 | 0.3 |
0.5 | 0.14 | 0.28 |
0.90909094 | 0.2 | 0.22 |
0.5714286 | 0.16 | 0.28 |
0.6 | 0.12 | 0.2 |
0.8181818 | 0.18 | 0.22 |
0.45 | 0.18 | 0.4 |
0.6923077 | 0.18 | 0.26 |
0.6666667 | 0.12 | 0.18 |
0.42857143 | 0.06 | 0.14 |
0.42105263 | 0.16 | 0.38 |
0.53846157 | 0.14 | 0.26 |
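The three columns are tied together. With a test size of 50, the first row (0.5, 0.2, 0.4) is what you get from 20 answered sentences of which 10 are correct, and the other rows fit the same pattern. The sketch below spells out the definitions those numbers are consistent with. The function name `evaluate` and the convention that `None` means "the system declined to answer" are assumptions made for illustration, not the actual evaluation code.

```python
def evaluate(predictions, gold):
    """Return (precision, recall, willingness) for one test run.

    predictions: one MR per test sentence, or None when no answer was given
                 (hypothetical convention).
    gold:        the reference MR for each test sentence.
    """
    total = len(gold)
    answered = sum(1 for p in predictions if p is not None)
    correct = sum(1 for p, g in zip(predictions, gold)
                  if p is not None and p == g)
    precision = correct / answered if answered else 0.0  # correct among answered
    recall = correct / total                             # correct among all
    willingness = answered / total                       # answered among all
    return precision, recall, willingness
```

Under these definitions precision equals recall divided by willingness, which is why a run can score high precision simply by answering very few sentences.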
Test corpus enlarged
Saturday 2010-01-02 19:31
After fiddling around a bit with the grammar (and error handling) I now have a slightly larger test set of 190 sentences. I also know roughly why the sentences I had to remove had to be removed. Hopefully this will be useful to analyse when trying to enlarge the test set further. This new test set runs completely through in 27 minutes, which makes me fear how long test sets of, say, 800 sentences will take; to me it sounds like several days.
Parsing problems solved
Wednesday 2009-12-16 16:28
I found out what caused the parsing problems: a rule was generated that went from a nonterminal to only the same nonterminal, and the parser did not validate its input for such rules. That in turn is caused by GIZA outputting alignments in which not every part of the linearized MR has some NL part aligned to it. I'm currently solving the problems caused by that by making the rest of the code more robust to bad input.
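One way to guard against the offending kind of rule is to filter it out before the grammar reaches the parser. This is a minimal sketch, assuming rules are stored as `(lhs, rhs)` pairs with `rhs` a tuple of symbols; that representation and the name `drop_self_loops` are made up for the example and are not how the real code is organised.

```python
def drop_self_loops(rules):
    """Remove rules of the form A -> A, which rewrite a nonterminal to itself.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbols
           (hypothetical representation).
    """
    return [(lhs, rhs) for lhs, rhs in rules if rhs != (lhs,)]
```

Rules such as `City -> the City` or `Query -> City` are kept; only the exact self-loop is dropped.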
The whole chain works
Wednesday 2009-12-09 13:43
It now seems as if the whole chain of functions works: from grammar extraction from the schema, through the WASP algorithm, and out as a function giving probable MR interpretations of the natural language strings. It still has to be tested and such, but if this holds up I can continue with the next step and extend the code to utilize lambda parameters.
Structuring Files
Thursday 2009-11-12 13:50
I reorganised the svn repository a bit to get easier paths; this was to get C-Phrase up and running. A system is now running at klaffhorn.cs.umu.se:9090, with the admin interface at http://www.cs.umu.se/~johang/c-phrase-admin.