ABSTRACT: OCR is an error-prone process. It is time-consuming and expensive to manually proofread OCR results. The errors remaining in OCRed texts can cause serious problems in rea...
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expos...
: Contextual Advertising for Web Article Printing Shengwen Yang, Jianming Jin, Parag Joshi, Sam Liu HP Laboratories HPL-2010-79 printed ad, web printing, article extraction, conte...
We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. ...
The JavaScript language is a core component of active and dynamic web content in the Internet today. Besides its great success in enhancing web applications, however, JavaScript p...