In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued s...
Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz
Topic-based text summaries promise to help average users quickly understand a text collection and derive insights. Recent research has shown that the Latent Dirichlet Allocation (...
Much existing research on blogs focused on posts only, ignoring their comments. Our user study conducted on summarizing blog posts, however, showed that reading comments does chan...
This paper describes a method of detecting Japanese Katakana variants from a large corpus. Katakana words, which are mainly used as loanwords, cause problems with information retr...
We propose a novel technique for the automatic classification of vocal and non-vocal regions in an acoustic musical signal. Our technique uses a combination of harmonic content a...