Selamat, Ali and Ahmadi-Abkenari, Fatemeh (2010) Application of clickstream analysis as web page importance metric in parallel crawlers. In: International Symposium on Information Technology 2010 (ITSim 10), 15-17 June 2010, Kuala Lumpur.
Full text not available from this repository.
Official URL: http://dx.doi.org/10.1109/ITSIM.2010.5561354
Employing a parallel crawler, that is, a multi-process crawler, raises issues that do not arise with a single-process crawler. These issues affect whether a parallel crawler can achieve results of higher, or even the same, quality as a centralized one. Existing parallel crawler architectures employ link-dependent metrics, such as backlink count or PageRank, to determine URL importance and prioritize the queue of each process. A specific number of the most important pages is then sent to the index section of the crawler for further processing of their content. Applying link-dependent metrics imposes considerable overhead on the overall parallel crawler, resulting from the exchange of link information among the processes. In this paper we propose the application of clickstream analysis as a link-independent Web page importance metric in a parallel crawler. Our approach includes an algorithm for balanced performance of the processes within a parallel crawler, which results in the discovery of higher-quality pages by the overall parallel crawler with less overhead than a centralized crawler that employs link-dependent importance metrics.
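The abstract does not say how importance is derived from the clickstream, so the following is only an illustrative sketch and not the authors' algorithm. It assumes a server log reduced to (session, URL) pairs and uses raw visit counts as the importance score; the names clickstream_scores and prioritize are hypothetical. The point it illustrates is that a crawler process can rank its own queue from log data alone, without receiving link information from other processes.

```python
from collections import Counter
import heapq


def clickstream_scores(log_entries):
    """Count visits per URL from (session_id, url) pairs.

    Assumption: visit count stands in for page importance.
    """
    return Counter(url for _session, url in log_entries)


def prioritize(frontier, scores, top_k):
    """Return the top_k frontier URLs by score (ties broken by URL).

    URLs absent from the log get a score of 0.
    """
    return heapq.nsmallest(
        top_k, frontier, key=lambda url: (-scores.get(url, 0), url)
    )


# Hypothetical log entries and frontier for one crawler process.
log = [
    ("s1", "/a"), ("s1", "/b"),
    ("s2", "/a"), ("s2", "/b"),
    ("s3", "/c"), ("s3", "/a"),
]
scores = clickstream_scores(log)
top = prioritize(["/a", "/b", "/c", "/d"], scores, top_k=2)
print(top)
```

Because the scores depend only on the log, each process could apply the same ranking to its own queue independently; how the paper balances work across processes is not described in the abstract and is not modeled here.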
|Item Type:|Conference or Workshop Item (Paper)|
|Uncontrolled Keywords:|clickstream analysis, parallel crawlers, web data management, web page importance metrics|
|Divisions:|Computer Science and Information System|
|Deposited By:|Liza Porijo|
|Deposited On:|13 Jun 2012 04:42|
|Last Modified:|13 Jun 2012 04:42|