The PDF format is commonly used for the exchange of documents on the Web, and there is a growing need to understand, extract, or repurpose data held in PDF documents. Many system...
This paper describes how to use the Java Swing HTMLEditorKit to perform multi-threaded web data mining on the EDGAR system (Electronic Data Gathering, Analysis, and Retrieval system)....
The Meta-Object Facility (MOF) provides a standardised framework for object-oriented models. An instance of a MOF model contains objects and links whose interfaces are entirely de...
Detecting and segmenting free-form objects from cluttered backgrounds is a challenging problem in computer vision. Signature detection in document images is one classic example an...
Guangyu Zhu, Yefeng Zheng, David S. Doermann, Stef...
Search engines present fixed-length passages from documents ranked by relevance against the query. In this paper, we present and compare novel, language-model-based methods for extr...