Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrele...
This paper presents an extensive study about the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using al...
Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. It is a particular problem for large, ...
Research on relevance feedback (RFB) in information retrieval (IR) has given mixed results. Success in RFB seems to depend on the searcher's willingness to provide feedback a...
Current methods for finding relevant content, especially in media-rich web environments, suggest that metadata is critical for accurate and efficient information retrieval. We des...