We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and...
We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, and the .GOV test collection used in the TREC 2002 Web and Interactive Tracks, with common measu...
In spite of numerous search engines available on the web, no single engine is capable of performing “better” under all circumstances. This being the case, metasearch engines...
Individuals often use search engines to return to web pages they have previously visited. This behaviour, called refinding, accounts for about 38% of all queries. While researcher...
While we expect to discover knowledge in the texts available on the Web, such discovery usually requires many complex analysis steps, most of which require different text handling...