As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work i...
Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (...
Although information retrieval research has always been concerned with improving the effectiveness of search, in some applications, such as information analysis, a more specific ...
Term-based representations of documents have found widespread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard le...
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by ...