Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers on web data often produces unsatisfyi...
Several information organization, access, and filtering systems can benefit from different kind of document representations than those used in traditional Information Retrieval (I...
We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the ...
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expos...
In a corpus of jokes, a human might judge two documents to be the "same joke" even if characters, locations, and other details are varied. A given joke could be retold w...