In this paper we discuss the possible application of new concepts in web content extraction: utility assessment, utility annealing, and dynamic aggregated document generation. Aft...
A variety of information extraction techniques rely on the fact that instances of the same relation are "distributionally similar," in that they tend to appear in simila...
In this paper, we propose a method for extracting image features which utilizes 2 nd order statistics, i.e., spatial and orientational auto-correlations of local gradients. It enab...
This paper reports the estimated number of spam blogs in order to assess their current state in the blogosphere. To extract spam blogs, I developed a traversal method among co-cit...
We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the ...