We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of., ...
Information distillation techniques are used to analyze and interpret large volumes of speech and text archives in multiple languages and produce structured information of interes...
We investigate the automatic generation of topic pages as an alternative to the current Web search paradigm. We describe a general framework, which combines query log analysis to ...
In this paper we present two experiments conducted for comparison of different language identification algorithms. Short words-, frequent words- and n-gram-based approaches are co...
Lena Grothe, Ernesto William De Luca, Andreas N&uu...
In this paper, we present a novel approach to Pseudo-Relevance Feedback (PRF) called Multilingual PRF (MultiPRF). The key idea is to harness multilinguality. Given a query in a la...