Near-duplicate web documents are abundant. Two such documents may differ only in a very small portion, for example the part that displays advertisements. Such differences are irrele...
In prior work we have demonstrated that search engine caches and archiving projects like the Internet Archive’s Wayback Machine can be used to “lazily preserve” websites and...
This paper describes a new research proposal for multi-document summarization of dynamic content in web pages. Much information is lost on the Web due to the temporal character of w...
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for ...
Understanding query intent is essential to generating appropriate rankings for users. Existing methods have provided customized rankings to answer queries with different intent. W...