Evaluation of DUST (Different URLs with Similar Text) Using Multiple Sequence Alignment
Abstract
Today, the World Wide Web is a widely used medium for searching for information, and search engines rely on Web crawlers to collect it. Some of the pages gathered by these crawlers have duplicate content; different URLs that return similar text are known as DUST. To improve the performance of search engines, a method called DUSTER is proposed. DUSTER treats each URL as a sequence of tokens, aligns groups of duplicate URLs using multiple sequence alignment, and derives normalization rules from these alignments. The rules convert duplicate URLs into a single canonical form, allowing a large number of duplicate URLs to be removed.
Key Words: URL (Uniform Resource Locator), Search Engine, DUST (Different URLs with Similar Text), Normalization Rules
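To illustrate the kind of canonicalization the abstract describes, the sketch below applies a few hand-written normalization rules to collapse duplicate URLs into one canonical form. The specific rules (lowercasing, default-port removal, trailing-slash handling, dropping session-style parameters, sorting query keys) and the parameter names are assumptions chosen for illustration; they are not the rules that DUSTER itself learns from aligned URL sequences.

```python
# Minimal sketch of rule-based URL canonicalization for de-duplication.
# The rules and parameter names below are illustrative assumptions, not DUSTER's learned rules.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed (hypothetically) to carry no content, e.g. session ids.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source"}

def canonicalize(url: str) -> str:
    """Map a URL to a single canonical form by applying simple normalization rules."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop the default port for HTTP/HTTPS.
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    # Normalize an empty path to "/" and remove a trailing slash elsewhere.
    path = path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]
    # Remove ignored parameters and sort the rest so their order does not matter.
    params = sorted((k, v) for k, v in parse_qsl(query) if k.lower() not in IGNORED_PARAMS)
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

if __name__ == "__main__":
    urls = [
        "HTTP://Example.com:80/news/?id=7&sessionid=abc",
        "http://example.com/news?id=7",
        "http://example.com/news/?sid=xyz&id=7",
    ]
    # All three duplicates collapse to a single canonical URL.
    print({canonicalize(u) for u in urls})
```

In this toy example, the three duplicate URLs reduce to one canonical entry, which is the effect the proposed normalization rules aim to achieve at crawler scale.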
License
Articles in the International Journal of Engineering Technology and Computer Research (IJETCR) are licensed under a Creative Commons Attribution 4.0 International License.