Canonization rules for detecting different URLs

Authors :
Chanchal Kumari
Shailendra Narayan Singh
Divya Joshi
Source :
2016 6th International Conference - Cloud System and Big Data Engineering (Confluence).
Publication Year :
2016
Publisher :
IEEE, 2016.

Abstract

A Uniform Resource Locator (URL) is the address of a web page on the World Wide Web (WWW). A URL grants access to a single web page, but a web page can often be reached through two or more URLs. This paper focuses on such URLs that address the same web page or the same content. These duplicate URLs can be a serious threat to the entire search engine pipeline, including crawling and indexing. The paper presents a novel algorithm for detecting canonization rules that normalize duplicate URLs to a single original URL, using a pattern recognition approach to analyze textual data. Information about duplicate URLs lets search engines optimize their performance through reduced cost and improved quality.
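
To illustrate what normalizing duplicate URLs to a single form can look like, here is a minimal sketch in Python. It is not the algorithm from the paper, which detects rules through pattern recognition. The rules below are hand-written and generic, and the names `normalize_url`, `group_duplicates`, and `IGNORED_PARAMS` (including its parameter list) are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url: str) -> str:
    """Apply a few hand-written normalization rules to a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Rule: keep the port only when it is not the default for the scheme.
    port = parts.port
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port is not None and not is_default:
        host = f"{host}:{port}"

    # Rule: treat an empty path as "/" and drop trailing slashes elsewhere.
    path = parts.path or "/"
    if len(path) > 1:
        path = path.rstrip("/") or "/"

    # Rule: drop ignored parameters and sort the rest for a stable order.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # Rule: discard the fragment, which is not sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def group_duplicates(urls):
    """Group URLs that share the same normalized form."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return groups
```

Under these assumed rules, `HTTP://Example.com:80/a/?b=2&a=1&utm_source=x#frag` and `http://example.com/a?a=1&b=2` fall into the same group. Rules such as stripping a trailing slash are not safe on every site, which is one reason to detect rules from data rather than fix them in advance.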

Details

Database :
OpenAIRE
Journal :
2016 6th International Conference - Cloud System and Big Data Engineering (Confluence)
Accession number :
edsair.doi...........186a25070f83cfc291935c7474695976