A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...
Previous efforts on event detection from the web have focused primarily on web content and structure data ignoring the rich collection of web log data. In this paper, we propose t...
Qiankun Zhao, Tie-Yan Liu, Sourav S. Bhowmick, Wei...
The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a dis...
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed a...
Kathleen Fisher, David Walker, Kenny Qili Zhu, Pet...
In late May, 2008, a group of database researchers, architects, users and pundits met at the Claremont Resort in Berkeley, California to discuss the state of the research field an...
Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bern...