The semi-structured information available in HTML and similar documents provides valuable information that can be used for information extraction applications. This information tog...
Structured link vector model (SLVM) is a recently proposed document representation that takes into account both structural and semantic information for measuring XML document simi...
Discovering different types of file resources (such as documentation, programs, and images) in the vast amount of data contained within network file systems is useful for both u...
We present a method for picture detection in document page images, which can come from scanned or camera images, or be rendered from electronic file formats. Our method uses OCR to s...
Classification of documents by genre is typically done using either linguistic analysis or term-frequency-based techniques. The former provides better classification accuracy than...