Context-Based Spelling Correction for the Dutch Language

Applied on spelling errors extracted from the Dutch Wikipedia revision history



In this thesis we investigated context-based spellchecking approaches for the Dutch language. Context-based approaches enable the detection of real-word spelling errors by using the context in which the errors occur. We also assessed whether the context could be used to improve the ranking of replacement candidates. To measure the performance of the different techniques, we obtained a dataset of erroneous-corrected sentence pairs from the Dutch Wikipedia revision history. This dataset contains a wide variety of human-generated spelling errors and consists of over 1.4 million instances. It can serve as a basis for further research. The dataset proved to be a valuable source for creating an error model, with which we could improve the ranking of candidate replacement words. This model takes into account the character context in which erroneous edit operations occur, and therefore reflects which edit operations are more likely to occur. The spellchecking results on our dataset show that the context-based approach works for both the detection of errors and the ranking of candidate replacements. A comparison with the literature was made to assess whether the technique performs as well for Dutch as for English, and we conclude that the performance is comparable. The error model trained on our dataset was shown to work better than the context-based approach for the task of candidate ranking.
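To illustrate the idea of an error model that conditions edit operations on their character context, the following is a minimal Python sketch. It is not the model built in the thesis, whose details are not given in this abstract. The names `edit_ops` and `ErrorModel`, the choice of a single left-hand character as context, the use of `difflib.SequenceMatcher` for alignment, and the simple additive smoothing are all assumptions made for illustration. The training pairs are toy examples, not entries from the Wikipedia dataset.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def edit_ops(wrong, right):
    """Return (left_context, wrong_fragment, right_fragment) for each differing span.

    The left context is the single character preceding the edit in the
    erroneous word, or "^" at the start of the word (an assumed convention).
    """
    ops = []
    matcher = SequenceMatcher(None, wrong, right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            left = wrong[i1 - 1] if i1 > 0 else "^"
            ops.append((left, wrong[i1:i2], right[j1:j2]))
    return ops


class ErrorModel:
    """Hypothetical error model: counts context-dependent edits in training pairs."""

    def __init__(self, pairs, smoothing=1.0):
        self.counts = Counter()
        for wrong, right in pairs:
            self.counts.update(edit_ops(wrong, right))
        self.total = sum(self.counts.values())
        self.smoothing = smoothing

    def log_score(self, wrong, candidate):
        # Sum of smoothed log relative frequencies of the edits needed to
        # turn the erroneous word into the candidate. This is a crude score,
        # not a normalized probability.
        denom = self.total + self.smoothing * (len(self.counts) + 1)
        score = 0.0
        for op in edit_ops(wrong, candidate):
            score += math.log((self.counts[op] + self.smoothing) / denom)
        return score

    def rank(self, wrong, candidates):
        # Highest-scoring (most plausible) correction first.
        return sorted(candidates, key=lambda c: self.log_score(wrong, c), reverse=True)


# Toy erroneous-corrected word pairs (illustrative only).
training_pairs = [("gebeurd", "gebeurt"), ("beloofd", "belooft")]
model = ErrorModel(training_pairs)
ranking = model.rank("verhuurd", ["verhuurs", "verhuurt"])
```

In this sketch, a candidate whose required edit was observed in the same character context during training (here, replacing "d" with "t" after "r") is ranked above a candidate requiring an unseen edit. A fuller model would likely use a larger context window and normalize counts per context.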