We describe the WebCLEF 2008 task. Similarly to the 2007 edition of WebCLEF, the 2008 edition implements a multilingual "information synthesis" task, where, for a given t...
The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with ...
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with ...
The exponential growth of documents available in the World Wide Web makes it increasingly difficult to discover relevant information on a specific topic. In this context, growing ...
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected ba...