This paper presents the results of the NEOLOGOS project: a children database and an optimized adult database for the French language. A new approach was adopted for the collection...
With the new interest in historical documents insight grew that electronic access to these texts causes many specific problems. In the first part of the paper we survey the presen...
Andreas Hauser, Markus Heller, Elisabeth Leiss, Kl...
Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpu...
This paper describes the application of the PARADISE evaluation framework to the corpus of 662 human-computer dialogues collected in the June 2000 Darpa Communicator data collecti...
Marilyn A. Walker, Rebecca J. Passonneau, Julie E....
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian m...