We propose an algorithm for extracting fields from HTML search results. The output of the algorithm is a database table, a data structure that better lends itself to high-level...
Many web documents (such as Java FAQs) are replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are replicated many times....
Software repositories provide an abundance of valuable information about open source projects. With the increase in the size of the data maintained by the repositories, automated ext...
Search engines largely rely on Web robots to collect information from the Web. Due to the unregulated open-access nature of the Web, robot activities are extremely diverse. Such c...
Previous research has shown that link-based web page metrics can be used to predict experts’ assessment of quality. We are interested in a related question: do expert...