Crawling Web Pages with Support for Client-Side Dynamism.

Authors :: Yu, Jeffrey Xu
Kitsuregawa, Masaru
Leong, Hong Va
Álvarez, Manuel
Pan, Alberto
Raposo, Juan
Hidalgo, Justo
Source :: Advances in Web-Age Information Management (9783540352259); 2006, p252-262, 11p
Publication Year :: 2006
Abstract :: There is a large amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. Dealing with this problem requires solving two tasks: crawling the client-side hidden web and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms. [ABSTRACT FROM AUTHOR]