Sciweavers

311 search results - page 10 / 63
» Cleaning Web Pages for Effective Web Content Mining
Sort
View
ACSW
2004
15 years 7 months ago
Discovering Parallel Text from the World Wide Web
Parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and mul...
Jisong Chen, Rowena Chau, Chung-Hsing Yeh
DOCENG
2009
ACM
16 years 13 days ago
Web document text and images extraction using DOM analysis and natural language processing
: © Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing Parag Mulendra Joshi, Sam Liu HP Laboratories HPL-2009-187 Web page text extraction,...
Parag Mulendra Joshi, Sam Liu
COLCOM
2008
IEEE
15 years 7 months ago
Web Canary: A Virtualized Web Browser to Support Large-Scale Silent Collaboration in Detecting Malicious Web Sites
Abstract. Malicious Web content poses a serious threat to the Internet, organizations and users. Current approaches to detecting malicious Web content employ high-powered honey cli...
Jiang Wang, Anup K. Ghosh, Yih Huang
ICDAR
2003
IEEE
15 years 11 months ago
Identifying Story and Preview Images in News Web Pages
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Th...
Jianying Hu, Amit Bagga
KDD
2006
ACM
185views Data Mining» more  KDD 2006»
16 years 6 months ago
Understanding Content Reuse on the Web: Static and Dynamic Analyses
Abstract. In this paper we present static and dynamic studies of duplicate and near-duplicate documents in the Web. The static and dynamic studies involve the analysis of similar c...
Ricardo A. Baeza-Yates, Álvaro R. Pereira J...