540 results for "Website Parse Template"
Search Results
2. Filtering Method for the Annotated and Non-Annotated Web Pages
- Author
-
Anis Jedidi, Rafik Bouaziz, and Sahar Maâlej Dammak
- Subjects
Information retrieval, Computer science, Information systems, Web page, Website Parse Template, Artificial intelligence & image processing, Static web page - Abstract
With the great mass of pages managed throughout the world, and especially with the advent of the Web, it has become more difficult to find the relevant pages in response to a query. Furthermore, manually filtering indexed Web pages is a laborious task. A new method for filtering annotated Web pages (produced by our semantic annotation process) and non-annotated Web pages (retrieved from the Google search engine) is therefore necessary to group the Web pages relevant to the user. In this paper, the authors first synthesize their previous work on the semantic annotation of Web pages. Then, they define a new filtering method based on three activities. The authors also present their querying and filtering component for Web pages; its purpose is to demonstrate the feasibility of the filtering method. Finally, the authors present an evaluation of this component, which has proved its performance across multiple domains.
- Published
- 2017
3. Automating the Extraction of Static Content and Dynamic Behaviour from e-Commerce Websites
- Author
-
Hugo Sereno Ferreira and João Pedro Dias
- Subjects
Web analytics, Computer science, E-commerce, World Wide Web, Website architecture, Web mining, Web page, Website Parse Template, General Earth and Planetary Sciences, Graph (abstract data type), The Internet, General Environmental Science - Abstract
E-commerce website owners rely heavily on analysing and summarising the behaviour of customers, making efforts to influence user actions and optimize success metrics. Machine learning and data mining techniques have been applied in this field, greatly influencing Internet marketing activities. When faced with a new e-commerce website, the data scientist starts a process of collecting real-time and historical data about it, analysing and transforming this data in order to gain insight into the website and its users. Data scientists commonly resort to tracking domain-specific events, which requires code modification of the web pages. This paper proposes an alternative approach to retrieving information from a given e-commerce website: collecting data from the site's structure, retrieving semantic information in predefined locations and analysing users' access logs, thus enabling the development of accurate models for predicting users' future behaviour. This is accomplished by the application of a web mining process comprehending the site's structure, content and usage in a pipeline, resulting in a web graph of the website, complemented with a categorization of each page and the website's archetypical user profiles.
- Published
- 2017
4. An Overview of Building Blocks of Semantic Web
- Author
-
Sanjay Agrawal and Shweta Shrivastava
- Subjects
Web standards, Web 2.0, Web development, Computer science, OWL-S, Social Semantic Web, World Wide Web, Web design, Web page, Website Parse Template, Semantic analytics, SPARQL, Web navigation, Semantic Web Stack, RDF, Semantic Web, Data Web, Distributed computing, Information retrieval, Semantic Web Rule Language, Static web page, Ontology, Web mapping, Web service, Web intelligence, Web modeling - Abstract
The World Wide Web expands day by day, with many websites (on average, 51 million) added to the Web every year. Almost all organizations support open data and make their data available over the Web, which spurs innovation. The Semantic Web is an evolution and extension of the existing Web that allows computers to manipulate data and information. It is based on content-oriented descriptions of digital documents with standardized vocabularies that provide machine-understandable semantics. The basic building blocks of the Semantic Web are ontologies, RDF/OWL and SPARQL; a Semantic Web vocabulary can be considered a special form of ontology. The Semantic Web provides a connection between humans and computers by making the computer reason more like a human, drawing on artificial intelligence that can learn and understand semantics. The Semantic Web is also known as Web 3.0, the executable and read/write Web. The idea of the Semantic Web is still undergoing research and development.
- Published
- 2016
5. Analysis of Web Pages through Link Structure
- Author
-
Sameena Naaz and M Hayat Khan
- Subjects
Web analytics, Web server, Web development, Computer science, Printer-friendly, Dynamic web page, HITS algorithm, Doorway page, Backlink, World Wide Web, Search engine, Web design, Web page, Website Parse Template, Web navigation, Same-origin policy, Information retrieval, Client-side scripting, Home page, Static web page, Page view, Hyperlink, Web search engine, Site map - Abstract
Since the web is a collection of huge amounts of data, it is not easy to find relevant information. To find the desired data, users visit different web pages. Most Web users typically use a Web browser to navigate a Web site. They start with the home page or a Web page found through a search engine or linked from another Web site, and then follow the hyperlinks they think relevant in the starting page and the subsequent pages, until they have found the desired information in one or more pages. The aim of this work is to study the characteristics of various ranking algorithms. The factors affecting the ranking of a website's pages are considered, and we study how the popularity of a site can be raised and how spam pages can be tracked. First, the importance of the different characteristics responsible for page ranking is determined. Taking this information into consideration, a technique is then developed that successfully distinguishes spam pages from licit pages.
- Published
- 2015
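A minimal sketch of the kind of link-structure ranking the paper above surveys: a plain PageRank power iteration over a toy link graph. The graph, damping factor and iteration count are illustrative assumptions, not data from the paper.

```python
# Plain PageRank power iteration over a small, hypothetical link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"home": ["a", "b"], "a": ["home"], "b": ["a"], "spammy": ["spammy"]}
print(pagerank(links))
```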
6. AUTOMATIC TAGGING OF PERSIAN WEB PAGES BASED ON N-GRAM LANGUAGE MODELS USING MAPREDUCE
- Author
-
Saeed Shahrivari, Saeed Rahmani, and Hooman Keshavarz
- Subjects
Same-origin policy, Computer engineering, Information retrieval, Computer science, Static web page, HITS algorithm, Document clustering, Backlink, Persian Web, Web page, Website Parse Template, MapReduce, Language model, Automatic Web Page Tagging, Site map - Abstract
Page tagging is one of the most important facilities for increasing the accuracy of information retrieval in the web. Tags are simple pieces of data that usually consist of one or several words, and briefly describe a page. Tags provide useful information about a page and can be used for boosting the accuracy of searching, document clustering, and result grouping. The most accurate solution to page tagging is using human experts. However, when the number of pages is large, humans cannot be used, and some automatic solutions should be used instead. We propose a solution called PerTag which can automatically tag a set of Persian web pages. PerTag is based on n-gram models and uses the tf-idf method plus some effective Persian language rules to select proper tags for each web page. Since our target is huge sets of web pages, PerTag is built on top of the MapReduce distributed computing framework. We used a set of more than 500 million Persian web pages during our experiments, and extracted tags for each page using a cluster of 40 machines. The experimental results show that PerTag is both fast and accurate.
- Published
- 2015
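The core idea of the abstract above (score candidate terms by tf-idf and keep the top ones as tags) can be sketched in a few lines. This is an illustrative reconstruction, not the PerTag code: the Persian-specific language rules, the n-gram handling and the MapReduce distribution are omitted, and the toy corpus is invented.

```python
import math
from collections import Counter

def top_tags(doc_tokens, corpus, k=5):
    """Pick the k highest tf-idf unigrams of one document as its tags.
    doc_tokens: tokens of the target page; corpus: list of token lists."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    scores = {t: (count / len(doc_tokens)) * idf(t) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

corpus = [["web", "page", "tagging"], ["persian", "web", "pages"], ["mapreduce", "cluster"]]
print(top_tags(["persian", "web", "page", "tagging", "tagging"], corpus, k=2))
```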
7. A Framework for Topical Collections Made with Focused and Accelerated Focused Crawlers
- Author
-
Puneet Kumar, P. Srikanth P. Srikanth, Saturi Rajesh, and D.V.S. Raju
- Subjects
Personalized search, World Wide Web, Set (abstract data type), Search engine, Computer science, Search engine indexing, Website Parse Template, Relevance (information retrieval), Focused crawler, Web crawler, Theme (computing) - Abstract
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In the personalized search domain, an alternative to general-purpose crawlers, called focused crawlers, is receiving increasing attention. The goal of these crawlers is to selectively seek out pages that are relevant to a pre-defined set of topics or themes. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, these crawlers analyze their crawl boundary to find the links that are likely to be most relevant for the crawl, and avoid irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. This paper presents and compares two focused crawlers: a traditional focused crawler and an accelerated focused crawler. The accelerated focused crawler takes offline lessons from the traditional focused crawler. It emulates a human surfer by trying to predict the relevance of an "HREF" target page based on the words around the link on the source page. The topics are specified using exemplary documents in these experiments. A naive Bayesian classifier is used to guide the crawlers. The crawlers were evaluated for different numbers of pages crawled, for different numbers of features gathered at different distances from the link, and with different feature selection methods.
- Published
- 2015
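A hedged sketch of the accelerated crawler's central trick described above: train a classifier on the words around links and use it to predict whether an "HREF" target is worth fetching. The tiny training set and the scikit-learn pipeline are illustrative assumptions, not the paper's setup.

```python
# Predict link relevance from the anchor's surrounding words (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: context words around links, labelled relevant or not.
contexts = ["focused crawler topic relevance", "download our holiday brochure",
            "web page classification naive bayes", "buy cheap tickets now"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(contexts, labels)

# Score an unseen link context before deciding to enqueue its target URL.
print(model.predict_proba(["survey of topical web crawlers"])[0][1])
```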
8. Effectual Web Content Mining using Noise Removal from Web Pages
- Author
-
P. Sivakumar
- Subjects
Information retrieval, Computer science, Static web page, Computer Science Applications, Search engine, Information extraction, Web mining, Web page, Redundancy (engineering), Website Parse Template, Data mining, Electrical and Electronic Engineering, Noise removal, Site map - Abstract
Web mining is an emerging research area due to the rapid growth of websites. Web mining is classified into Web Content Mining (WCM), Web Usage Mining and Web Structure Mining. WCM is the extraction of required information from web page content available on the World Wide Web (WWW). WCM is further classified into two categories: the first mines the content of documents directly, and the second mines content via search engines. The mining method focuses on information extraction and integration. Web content may be text, images, audio or video. Web pages typically contain a large amount of information that is not part of their main content, such as banner advertisements, navigation bars and copyright notices. Such noise on Web pages usually leads to poor results in Web mining. This paper focuses on the problem of noise-free information retrieval on web pages, that is, the automatic pre-processing of Web pages to detect and eliminate noise. It proposes an approach for eliminating noise from web pages in order to improve the accuracy and efficiency of web content mining. The main objective of removing noise from a Web page is to improve search performance; it is essential to differentiate important information from noisy content that may misguide users' interest. The approach concentrates on removing the following noise in stages: (1) primary noise, such as navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements and other uninteresting data such as audio, video and multiple links; (2) duplicate content; and (3) noisy content according to block importance. The removal of this noise is done in three operations. First, using a block-splitting operation, primary noise is removed and only the useful text content is partitioned into blocks. Second, using the simhash algorithm, duplicate blocks are removed to obtain the distinct blocks. For each block, three parameters are calculated: Keyword Redundancy (KR), Linkword Percentage (LP) and Titleword Relevancy (TR). Using these three parameters, a block importance value (BI) is calculated. Finally, based on a threshold value, the important blocks are selected using a sketching algorithm and keywords are extracted from them.
- Published
- 2015
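A minimal simhash sketch for the duplicate-block step described above: near-identical blocks produce fingerprints with a small Hamming distance. The 64-bit fingerprint size and the whitespace tokenization are assumptions; the KR/LP/TR block scoring is not shown.

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit simhash fingerprint of a token list."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

b1 = simhash("web mining extracts content from noisy pages".split())
b2 = simhash("web mining extracts content from noisy web pages".split())
print(hamming(b1, b2))  # small distance -> likely duplicate blocks
```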
9. A Novel Approach to HTML Page Creation Using Neural Network
- Author
-
Aparna Halbe and Abhijit Joshi
- Subjects
Ajax, Web development, Computer science, Framing (World Wide Web), Dynamic web page, World Wide Web, Image processing for GUI, Web design, Web page, Website Parse Template, Mashup, Web navigation, Digital document, General Environmental Science, Same-origin policy, Information retrieval, Client-side scripting, Rapid web development, GUI design, Static web page, HTML, Page view, HTML element, Automatic web page generation, XML database, Processing document images, HTML scripting, General Earth and Planetary Sciences, Web service, Site map, Web modeling, XML - Abstract
A web page is a digital document created using Hyper Text Markup Language (HTML). A Web page is suited to the World Wide Web (WWW); it runs in a web browser and hence can be accessed across the globe. A Graphical User Interface (GUI) provides ways for humans to interact with the system using HTML controls. In software design, the look and feel of a GUI comprises the design aspects of the web page, its layout, the HTML controls used on it, etc. Usually the creation of a web page begins by drawing the GUI design on paper; web designers then build the web pages from that design using a web development tool. This paper proposes a novel approach to create an HTML page automatically from the GUI design drawn on paper. It considers a GUI design having various HTML controls such as radio buttons, checkboxes, textboxes, command buttons and labels. A scanned image of the GUI design is provided as input to the system. The system then segments the various HTML controls. The segmented HTML controls are identified and stored in an XML database, which contains the name and position of each component in the GUI design. Finally, this XML file is parsed to generate the HTML page.
- Published
- 2015
10. Knowledge Extraction for Semantic Web using Web Mining with Ontology
- Author
-
Sharvari Govilkar and Dipali Panchal
- Subjects
Web standards, Ontology Inference Layer, Information retrieval, Computer science, Ontology (information science), Web application security, Social Semantic Web, OWL-S, World Wide Web, Web mining, Knowledge extraction, Website Parse Template, Semantic analytics, Ontology, Semantic Web Stack, Web intelligence, Semantic Web, Web modeling, Data Web - Abstract
Today the web is growing rapidly, and users easily get lost in its rich hyper structure. The primary goal of a web site owner is to provide relevant information to users to fulfil their needs. Web mining techniques are used to categorize users and pages by analyzing users' behavior, the content of pages, and the order of URLs accessed. This paper presents two web mining techniques, web content mining and web usage mining, in the process of extracting conceptual relationships, and applies a web structure mining process which focuses on the fully-structured form. Web structure mining is used in developing ontologies and helps the retrieval process.
- Published
- 2014
11. A new look at the semantic web
- Author
-
James A. Hendler, Abraham Bernstein, and Natalya F. Noy
- Subjects
Web standards, General Computer Science, Web 2.0, Computer science, Interoperability, Ontology (information science), OWL-S, Social Semantic Web, World Wide Web, Semantic computing, Website Parse Template, Semantic analytics, Semantic Web Stack, Semantic Web, Data Web, Semantic Web Rule Language, Semantic search, Intelligent decision support system, Semantic grid, Semantic technology, Web intelligence - Abstract
From the very early days of the World Wide Web, researchers identified a need to be able to understand the semantics of the information on the Web in order to enable intelligent systems to do a better job of processing the booming Web of documents. Early proposals included labeling different kinds of links to differentiate, for example, pages describing people from those describing projects, events, and so on. By the late 90’s, this effort had led to a broad area of Computer Science research that became known as the Semantic Web [Berners-Lee et al. 2001]. In the past decade and a half, the early promise of enabling software agents on the Web to talk to one another in a meaningful way inspired advances in a multitude of areas: defining languages and standards to describe and query the semantics of resources on the Web, developing tractable and efficient ways to reason with these representations and to query them efficiently, understanding patterns in describing knowledge, and defining ontologies that describe Web data to allow greater interoperability.
- Published
- 2016
12. An Enhanced Frequent Pattern Analysis Technique from the Web Log Data
- Author
-
Vikram Garg and Samiksha Kankane
- Subjects
Web analytics, Information retrieval, Web search query, Property (programming), Computer science, Pattern analysis, World Wide Web, Web mining, User experience design, Web design, Website Parse Template, Data mining, Web service - Abstract
This work aims to improve the user experience while accessing a website. Web usage mining is used to evaluate users' previous experiences, which helps to improve the functionality of that website. In this paper a technique for web usage mining is proposed which extends the features of synaptic search and the Frequent Pattern Growth (FP-growth) algorithm. The proposed technique uses the synaptic search property to search data on the web on the basis of location and uses the FP-growth algorithm to generate results.
- Published
- 2015
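For the FP-growth half of the proposal above, a standard off-the-shelf run might look like the sketch below, using mlxtend's fpgrowth on a one-hot session table. The session data are invented, and the paper's synaptic, location-based search component is not reproduced.

```python
# Frequent page-set mining over hypothetical user sessions with FP-growth.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

sessions = [["home", "products", "cart"],
            ["home", "products"],
            ["home", "blog"],
            ["home", "products", "cart", "checkout"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(sessions).transform(sessions), columns=te.columns_)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```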
13. A Method for Topic Classification of Web Pages Using LDA-SVM Model
- Author
-
Wei Wang, Yang Liu, Bo Yang, Bailing Wang, and Yuliang Wei
- Subjects
Information retrieval, Computer science, Support vector machine, Access to information, Web query classification, Web page, Website Parse Template, The Internet, Word (computer architecture) - Abstract
The fast development of computer and networking technologies has made the Internet the largest medium of information in the world today. Many companies hope to gain timely and effective access to information from the Internet, so an efficient web page classification system is needed. According to the classification requirements, we use an LDA-SVM model for elaborate web category classification, and we discuss the impact of the topic number K in LDA on the classification. The experiments show our method is efficient.
- Published
- 2017
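A compact sketch of the LDA-SVM pipeline the abstract above describes: LDA turns each page into a K-dimensional topic distribution, and an SVM classifies in that topic space. The corpus, labels, kernel and K here are toy assumptions, not the paper's configuration.

```python
# LDA topic features feeding an SVM classifier (illustrative pipeline).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

pages = ["football match score league", "election vote parliament law",
         "goal striker stadium fans", "minister policy government bill"]
labels = ["sport", "politics", "sport", "politics"]

model = make_pipeline(CountVectorizer(),
                      LatentDirichletAllocation(n_components=2, random_state=0),  # topic number K
                      SVC(kernel="linear"))
model.fit(pages, labels)
print(model.predict(["referee penalty goal"]))
```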
14. Research on text mining algorithm based on focused crawler
- Author
-
Jianping Jun, Mingyu Lin, Xingyun Zhang, and Qiusheng Zhang
- Subjects
Information retrieval, Computer science, Focused crawler, Text mining, Web mining, Web page, Website Parse Template, Web search engine, The Internet, Data mining, Web mapping, Web intelligence, Web crawler, Algorithm - Abstract
The Internet has become the world's largest information repository, and with the explosive growth of text data on the web, the drawbacks of existing approaches, namely the time needed to acquire and update web pages and their low precision, have become more obvious. A text mining algorithm based on a focused crawler is proposed in this paper. It classifies and integrates web pages by topic using a topic crawler algorithm as far as possible, which greatly improves the retrieval of web pages; on this basis a naive Bayes algorithm is adopted to realize text mining of the web data. The experimental results show that the algorithm is feasible and achieves a higher recall ratio and precision ratio for web pages.
- Published
- 2017
15. Protecting web contents against persistent distributed crawlers
- Author
-
Shengye Wan, Kun Sun, and Yue Li
- Subjects
World Wide Web, Upload, Computer science, Download, Server, Web page, Website Parse Template, Networking & telecommunications, Web crawler - Abstract
Web crawlers have been misused for several malicious purposes, such as downloading server data without permission from the website administrator. In this paper, based on the observation that normal users and malicious crawlers have different short-term and long-term download behaviors, we develop a new anti-crawler mechanism called PathMarker to detect and constrain persistent distributed crawlers. For each URL, by adding a marker that records the parent page leading to the URL and the identity of the user who accesses it, we can not only perform more accurate heuristic detection and Support Vector Machine (SVM) based machine learning detection to detect malicious crawlers at an earlier stage, but also dramatically suppress the efficiency of crawlers before they are detected. We deploy our approach on a forum website, and the evaluation results show that PathMarker can quickly capture all 6 open-source and in-house crawlers.
- Published
- 2017
16. Syntactic entropy for main content extraction from web pages
- Author
-
Mohammed Al Achhab, Ismail Jellouli, and Badr Eddine El Mohajir
- Subjects
Information retrieval, Computer science, Static web page, Tree structure, Rewrite engine, Web page, Website Parse Template, Entropy (information theory), Domain knowledge, Semantic Web Stack - Abstract
In this paper, we present a solution for main content identification in web pages. Our solution is language-independent: web pages may be written in different languages. It is topic-independent: no domain knowledge or dictionary is applied. And it is unsupervised: no training phase is necessary. The solution exploits the tree structure of web pages and the frequencies of text tokens to attribute content density scores to the areas of the page, and thereby identify the most important one. We tested this solution over representative examples of web pages to show how efficient and accurate it is. The results were satisfying.
- Published
- 2017
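A rough sketch of the density idea above: walk the page's DOM tree and score each subtree by how much text it carries relative to its tag count, keeping the densest one as the main content. This is a plain text-to-tag density heuristic standing in for the paper's entropy measure, and the HTML snippet is invented.

```python
# Pick the densest DOM subtree as the main content (simplified heuristic).
from bs4 import BeautifulSoup

html = """<html><body>
  <div id="nav"><a>Home</a><a>About</a><a>Contact</a></div>
  <div id="main"><p>This long article paragraph carries most of the page's
  actual text and should win on text-to-tag density.</p></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

def density(node):
    """Characters of text per tag in the subtree rooted at node."""
    tags = 1 + len(node.find_all(True))
    return len(node.get_text(strip=True)) / tags

best = max(soup.find_all("div"), key=density)
print(best.get("id"))  # -> main
```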
17. Research of the Web Information Extraction Technology on Tourism Theme
- Author
-
Bo Chen, Qing Ming Song, and Qi Shen
- Subjects
Web standards, Web analytics, Web server, Web development, Web 2.0, Computer science, Dynamic web page, Social Semantic Web, Personalization, World Wide Web, Web design, Web page, Website Parse Template, Web navigation, Semantic Web Stack, XPath, General Medicine, Web application security, Information extraction, Web mapping, Web service, Web intelligence, Site map, Web modeling, Tourism - Abstract
With the development of web technology, the use of dynamic web pages and the personalization of page content have become more and more popular. Page information now changes constantly and the structures of different pages vary widely, so traditional web information extraction technology has difficulty adapting. This paper proposes a web information extraction method based on an extended XPath policy, derived from an analysis of the structural features of web pages on the tourism theme. The algorithm avoids the defects of traditional web information extraction technology; it is simple and practical, achieves high cleaning efficiency and accuracy, and saves system overhead.
- Published
- 2014
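The XPath-driven extraction the paper above builds on can be illustrated with lxml. The tourism-page snippet and the expressions are invented stand-ins; the paper's extended XPath policy is not reproduced.

```python
# XPath-based field extraction from a hypothetical tourism page.
from lxml import html

page = """<html><body>
  <div class="spot"><h2>West Lake</h2><span class="price">40 CNY</span></div>
  <div class="spot"><h2>Old Town</h2><span class="price">free</span></div>
</body></html>"""

tree = html.fromstring(page)
for spot in tree.xpath('//div[@class="spot"]'):
    name = spot.xpath('./h2/text()')[0]
    price = spot.xpath('./span[@class="price"]/text()')[0]
    print(name, "-", price)
```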
18. Web Pages Mining Based on Terms and Formal Concept Analysis
- Author
-
Guang Yi Tang, Bo Yu, and Deng Ju Yao
- Subjects
Web standards, Web development, Web 2.0, Computer science, Social Semantic Web, World Wide Web, Web page, Web design, Website Parse Template, Web navigation, Semantic Web Stack, Semantic Web, Data Web, General Engineering, Static web page, Web application security, Web mining, The Internet, Web mapping, Web service, Web intelligence, Web modeling - Abstract
Faced with the immense number of Web pages on the WWW, extracting valuable knowledge from the Internet is a difficult problem. The main research work of this paper is to apply FCA (Formal Concept Analysis) and Web terms to represent the relationship between Web pages and Web terms. We studied in depth how to apply Galois lattices to Web page mining, and used the Java language to design a Web page mining system. The system uses the constructed Galois lattice to extract potential knowledge from the WWW. The results prove that using Galois lattices and Web terms for Web page mining is feasible.
- Published
- 2014
19. Determining the titles of Web pages using anchor text and link analysis
- Author
-
Jehwan Oh, Dong-Jin Kim, Won Kim, Heetae Lyu, and Ok-Ran Jeong
- Subjects
Anchor text, Information retrieval, Computer science, General Engineering, Static web page, HITS algorithm, Backlink, Computer Science Applications, World Wide Web, Artificial Intelligence, Web page, Website Parse Template, Web search engine, Site map, Link analysis - Abstract
Determining the titles of Web pages is an important element in characterizing and categorizing the vast number of Web pages. There are a few approaches to automatically determining the titles of Web pages. As an R&D project for the operator of Naver (Korea's largest portal site), we developed a new method that makes use of anchor texts and analysis of the links among Web pages. In this paper, we describe our method and show experimental results on its performance.
- Published
- 2014
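The anchor-text half of the method above can be sketched as simple vote counting: collect the anchor texts of in-links to a page and take the most frequent as its candidate title. The in-link table and the stopword set are invented, and the paper's link-analysis weighting is not reproduced.

```python
# Choose a page title by majority vote over in-link anchor texts (illustrative).
from collections import Counter

# Hypothetical in-links: (source page, anchor text) pairs pointing at one URL.
inlinks = [("a.html", "Naver Dev Blog"), ("b.html", "Naver Dev Blog"),
           ("c.html", "click here"), ("d.html", "naver dev blog")]

votes = Counter(anchor.lower() for _, anchor in inlinks
                if anchor.lower() not in {"click here", "link", "more"})
title, count = votes.most_common(1)[0]
print(title, count)  # -> naver dev blog 3
```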
20. Web Content Extraction by Integrating Textual and Visual Importance of Web Pages
- Author
-
J. Anitha and K. Nethra
- Subjects
Information retrieval, Web mining, Computer science, Web page, Node (computer science), Website Parse Template, Static web page, Printer-friendly, Web content, Semantic Web Stack, Document Object Model, Site map, Backlink - Abstract
Web pages carry a great deal of information that is useful in real-world applications, but additional content such as links, footers, headers and advertisements complicates content extraction. Irrelevant content in a Web page is treated as noisy content, and a method is necessary to extract the informative content and discard the noise. An integration of textual and visual importance is used to extract the informative content from Web pages. Initially, a Web page is converted into a DOM (Document Object Model) tree. For each node in the DOM tree, textual and visual importance are calculated and combined to form a hybrid density. A density sum is then calculated and used in the content extraction algorithm to extract the informative content. The performance of the Web content extraction is measured by precision, recall, f-measure and accuracy. Keywords: Content Extraction, Web Content Mining, DOM tree, Vision-based Page Segmentation.
- Published
- 2014
21. Exploiting temporal information in Web search
- Author
-
Lihua Yue, Peiquan Jin, Xujian Zhao, and Sheng Lin
- Subjects
Web analytics, Web search query, Information retrieval, Computer science, General Engineering, Social Semantic Web, Computer Science Applications, Ranking (information retrieval), Search engine, Artificial Intelligence, Web query classification, Web page, Website Parse Template, Web modeling, Data Web - Abstract
Time plays an important role in Web search, because most Web pages contain temporal information and many Web queries are time-related. How to integrate temporal information into Web search engines has been a research focus in recent years. However, traditional search engines have little support for processing temporal-textual Web queries. Aiming to solve this problem, in this paper we concentrate on the extraction of the focused time of Web pages, which refers to the most appropriate time associated with a Web page, and we then use the focused time to improve search efficiency for time-sensitive queries. In particular, three critical issues are studied in depth. The first issue is extracting implicit temporal expressions from Web pages. The second is determining the focused time among all the extracted temporal information, and the last is integrating the focused time into a search engine. For the first issue, we propose a new dynamic approach to resolve the implicit temporal expressions in Web pages. For the second issue, we present a score model to determine the focused time for Web pages. Our score model takes into account both the frequency of temporal information in Web pages and the containment relationship among temporal information. For the third issue, we combine the textual similarity and the temporal similarity between queries and documents in the ranking process. To evaluate the effectiveness and efficiency of the proposed approaches, we build a prototype system called Time-Aware Search Engine (TASE). TASE is able to extract both the explicit and implicit temporal expressions for Web pages, calculate the relevance score between Web pages and each temporal expression, and re-rank search results based on the temporal-textual relevance between Web pages and queries. Finally, we conduct experiments on real data sets. The results show that our approach has high accuracy in resolving implicit temporal expressions and extracting focused time, and has better ranking effectiveness for time-sensitive Web queries than its competitor algorithms.
- Published
- 2014
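The final ranking step above, combining textual and temporal similarity, reduces to a weighted sum. The sketch below shows one plausible form; the weight alpha, the scores and the documents are all assumed for illustration and are not the paper's actual score model.

```python
# Blend textual and temporal relevance into one ranking score (illustrative).
def combined_score(text_sim, time_sim, alpha=0.6):
    """alpha weights textual similarity against temporal similarity."""
    return alpha * text_sim + (1 - alpha) * time_sim

docs = {"2008 olympics recap": (0.70, 0.95), "olympics history": (0.80, 0.30)}
ranked = sorted(docs, key=lambda d: combined_score(*docs[d]), reverse=True)
print(ranked)  # a time-sensitive query favours the 2008 page
```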
22. Estimating Page Importance based on Page Accessing Frequency
- Author
-
Komal Sachdeva and Ashutosh Dixit
- Subjects
Ajax, Web development, Computer science, Printer-friendly, Dynamic web page, HITS algorithm, Doorway page, Focused crawler, Backlink, World Wide Web, Search engine, Web page, Website Parse Template, Same-origin policy, Information retrieval, Search engine indexing, Static web page, Page view, Web search engine, The Internet, Web mapping, Web crawler, Site map, Page hijacking - Abstract
With the vast growth of the Internet, many web pages are available online. Search engines use a component called a web crawler to collect these pages from the web for storage and indexing. Many web pages are autonomous and are updated independently of their users, so users do not come to know how often the sources change. An incremental crawler visits the web repeatedly at a specific interval to update its collection. Users benefit from knowing page importance based on page accessing frequency. This paper determines page importance based on page accessing frequency, and an architecture for this purpose is also proposed.
- Published
- 2014
23. Semantic Annotation Tool for Annotating Arabic Web Documents
- Author
-
Saeed Albukhitan, Tarek Helmy, and Mohammed Al-Mulhem
- Subjects
Web standards, Computer science, Temporal annotation, Ontology (information science), OWL-S, Social Semantic Web, Annotation, Semantic similarity, Semantic Annotation, Semantic computing, Semantic analytics, Website Parse Template, Semantic Web Stack, RDF, Arabic Language, Semantic Web, Image retrieval, Data Web, General Environmental Science, Information retrieval, Semantic Web Rule Language, Ontology, Semantic search, HTML, Semantic grid, General Earth and Planetary Sciences, Artificial intelligence, Precision and recall, Web intelligence, Web modeling, Natural language processing - Abstract
The vision of the Semantic Web is to have a Web of data instead of a Web of documents, in a form that can be processed by machines. This vision could be achieved on the existing Web using semantic annotation. Due to the exponential growth and huge size of Web sources, there is a need for fast and automatic semantic annotation of Web documents. Arabic has received less attention in semantic Web research than Latin languages, especially in the field of semantic annotation. In this paper, we present an automatic annotation tool that supports the semantic annotation of Arabic-language Web documents. The tool takes the URL of a Web document and the corresponding ontology, then produces an external annotation of the Web document in the Resource Description Framework (RDF) language. The annotation tool's output could be used by semantic search engines to achieve higher recall and precision. To evaluate the performance of the tool, three domain ontologies covering food, nutrition and health were used with manually annotated documents related to those domains. The initial results show a promising performance, which will support semantic Web research with respect to the Arabic language.
- Published
- 2014
24. A New Web Usage Mining Approach for Website Recommendations Using Concept Hierarchy and Website Graph
- Author
-
T. Vijaya Kumar, K M Bharath Kumar, S Kiran Babu, H S Guruprasad, and Irfan Baig
- Subjects
World Wide Web, Information retrieval, Website architecture, Web mining, Computer science, Website Parse Template, Graph (abstract data type), Concept hierarchy - Published
- 2014
25. The Implementation of Crawling News Page Based on Incremental Web Crawler
- Author
-
Zejian Shi, Minyong Shi, and Weiguo Lin
- Subjects
Web server, Web 2.0, Web development, Computer science, Focused crawler, Crawling, World Wide Web, Web page, Web design, Website Parse Template, Distributed web crawling, Web navigation, Data Web, Information retrieval, Static web page, Spider trap, Web search engine, Web service, Web crawler, Site map, Page hijacking - Abstract
Web crawler technology downloads web pages programmatically. This paper implements an incremental web crawler in Python using the Scrapy crawler framework; it crawls news pages from mainstream web sites incrementally in real time and deposits the data in a database. The key technology of incremental crawling is removing duplicate web links, and the most common method is the Bloom filter. This paper implements a simplified Bloom filter. The results show that the web crawler can monitor news pages well.
- Published
- 2016
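A simplified Bloom filter like the one the abstract above mentions fits in a few lines: k hash functions set bits in a fixed array, and a URL whose bits are all already set is treated as seen. The array size, hash count and SHA-1 hashing below are illustrative choices, not the paper's parameters.

```python
import hashlib

class BloomFilter:
    """Simplified Bloom filter for URL de-duplication (illustrative sizes)."""
    def __init__(self, size=1 << 20, hashes=4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, url):
        # Derive k bit positions from salted hashes of the URL.
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
seen.add("https://news.example.com/a")
print("https://news.example.com/a" in seen, "https://news.example.com/b" in seen)
```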
26. Discovering User Pattern Analysis from Web Log Data using Weblog Expert
- Author
-
M. A. Dorairangaswamy and K . Dharmarajan
- Subjects
Web analytics, Web standards, Web server, Web 2.0, Web development, Computer science, World Wide Web, Website architecture, Web page, Web design, Website Parse Template, Mashup, Web navigation, Data Web, Multidisciplinary, Client-side scripting, Static web page, Web mining, Web log analysis software, Web mapping, Web service, Web modeling - Abstract
Objective: This article tries to discover hidden knowledge and identify user behavior on the web by using web data sources. This knowledge helps improve the overall performance of future accesses, characterize the typical browsing behavior of a user, and subsequently predict the pages a user will want to access in the future. Methods/Statistical Analysis: The user pattern is analyzed using a modified Web Log Expert tool on the web access log file collected from the organization. This modified tool conducts web mining in a domain-independent manner. The algorithm consists of three parts: 1. given an input entity, extracting a set of IP addresses and visitor lists and then ranking them according to comparability; 2. extracting the domains in which the given entity takes part; and 3. identifying and summarizing the competitive evidence that details the organization's strength. Findings: The main aim of the research work is extracting users' frequently accessed pages from web log data, based on user session time, IP addresses, browser details, operating system and top users. The complete analysis has been implemented in the Web Log Expert tool. The experimental results provide an easier way to navigate the website and improve the website's design architecture. This work presents the detailed results for a website in a specific education domain. We investigate the statistics of hourly, daily, weekly and monthly reports of the web usage patterns. The goal is to capture, model and analyze the behavioral patterns and profiles of users interacting with a website. Knowledge about users and their behavior on the web benefits the organization and leads directly to a profit increase.
- Published
- 2016
27. Web usage and content mining to extract knowledge for modelling the users of the Bidasoa Turismo website and to adapt it
- Author
-
Aizea Lojo, Ibai Gurrutxaga, Iñigo Perona, Javier Muguerza, Olatz Arbelaitz, and Jesús M. Pérez
- Subjects
Topic model, Web server, Decision support system, Computer science, General Engineering, Intelligent decision support system, Service provider, Computer Science Applications, World Wide Web, Web mining, Artificial Intelligence, Website Parse Template, Information system, Web service, Tourism - Abstract
The tourism industry has experienced a shift from offline to online travellers and this has made the use of intelligent systems in the tourism sector crucial. These information systems should provide tourism consumers and service providers with the most relevant information, more decision support, greater mobility and the most enjoyable travel experiences. As a consequence, Destination Marketing Organizations (DMOs) not only have to respond by adopting new technologies, but also by interpreting and using the knowledge created by the use of these techniques. This work presents the design of a general and non-invasive web mining system, built using the minimum information stored in a web server (the content of the website and the information from the log files stored in Common Log Format (CLF)) and its application to the Bidasoa Turismo (BTw) website. The proposed system combines web usage and content mining techniques with the three following main objectives: generating user navigation profiles to be used for link prediction; enriching the profiles with semantic information to diversify them, which provides the DMO with a tool to introduce links that will match the users taste; and moreover, obtaining global and language-dependent user interest profiles, which provides the DMO staff with important information for future web designs, and allows them to design future marketing campaigns for specific targets. The system performed successfully, obtaining profiles which fit in more than 60% of cases with the real user navigation sequences and in more than 90% of cases with the user interests. Moreover the automatically extracted semantic structure of the website and the interest profiles were validated by the BTw DMO staff, who found the knowledge provided to be very useful for the future.
- Published
- 2013
28. Adapted Web Crawler for Mining Offline Web Data using AFHC
- Author
-
S. Amudha
- Subjects
Ajax, Web server, Web development, Download, Computer science, Printer-friendly, Focused crawler, World Wide Web, Search engine, Rewrite engine, Web design, Web page, Website Parse Template, Web navigation, Information retrieval, Client-side scripting, Web search query, Static web page, Hyperlink, Web search engine, The Internet, Web mapping, Web service, Web crawler, Site map - Abstract
The adapted focused hyperlink crawler (AFHC) aims to search all the inner-level sub-links of web pages related to a specific topic and to download the unique web pages to a local disk. The downloaded web page information can then be searched while browsing offline, which avoids repeated searches on the web server. The major problem is retrieving the maximal set of relevant, high-quality pages: crawler software can retrieve web pages through hyperlinks over the Internet, but many web sites using web crawlers cannot retrieve the relevant pages. The system has a browser and a search engine. The browser, using AFHC, reduces the time to find accurate content in web pages through the hyperlinks and also restricts the web pages downloaded to the local disk. The search engine uses an extended co-citation algorithm to retrieve accurate content from the local disk, with searches based on any-word, all-word and phrase matching. It is useful to students and organizations, and the crawl history and search history can easily be personalized to understand user transactions.
- Published
- 2013
29. Hidden Web Data Extraction Tools
- Author
-
Ashish Ahuja, Anuradha Anuradha, and Babita Ahuja
- Subjects
Web standards, Web analytics, Web development, Web 2.0, Computer science, World Wide Web, Search engine, Web page, Web design, Website Parse Template, Web navigation, Data Web, Static web page, Web application security, Data extraction, Web mining, Web search engine, Web mapping, Web service, Web crawler, Site map, Web modeling - Abstract
The hidden web forms 99% of the total web. It contains high-quality content and has broad coverage, so the hidden web has always been a golden apple in the eyes of researchers. A lot of research has been carried out in this area, and researchers have created different tools to make the hidden web float to the surface of the WWW. Different kinds of crawlers and search engines have been developed which focus on the hidden web. In this paper we discuss the various kinds of web crawlers and search engines which throw some light on the hidden web. Keywords: hidden web; surface web; WWW; crawlers; search engines
- Published
- 2013
30. Comparing Web Pages in Terms of Inner Structure
- Author
-
Jiří Štěpánek and Monika Simkova
- Subjects
Information retrieval, Computer science, HTML, HTML element, Tree (data structure), Tree structure, Span and div, Web page, Web design, HTML scripting, Website Parse Template, General Materials Science - Abstract
The ability to create a web page is a basic IT skill. When learning web page creation, a student's final project is often evaluated on code validity and cleanness. There is also one problem that cannot be easily handled: recognizing modified projects. The concept of HTML and CSS allows the visual form to be separated from the content, so one project can easily be transformed into a new visual form in a few steps. This paper focuses on solving this problem through automation. The basic idea is that the inner structure of the web page will remain the same or very similar. A web page (HTML document) can be represented as a tree whose nodes are HTML elements. The paper describes how one HTML tree structure can be compared with another automatically using tree algorithms; the result of this comparison is an HTML structure tree similarity. The paper covers the description of the algorithms, their implementation and testing.
- Published
- 2013
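One way to realize the comparison the paper above describes: serialize each page's tag tree (ignoring text and attributes) and measure the similarity of the serializations. The sample pages and the similarity measure (a difflib ratio over depth-annotated tag sequences) are illustrative choices, not the paper's algorithm.

```python
# Compare two pages by the shape of their tag trees, ignoring content.
import difflib
from bs4 import BeautifulSoup

def tag_sequence(html):
    """Flatten the DOM into a depth-annotated tag list."""
    soup = BeautifulSoup(html, "html.parser")
    seq = []
    def walk(node, depth):
        for child in getattr(node, "children", []):
            if child.name:  # skip text nodes
                seq.append(f"{depth}:{child.name}")
                walk(child, depth + 1)
    walk(soup, 0)
    return seq

a = "<div><h1>Title</h1><p>text</p></div>"
b = "<div><h1>Other</h1><p>different words</p></div>"  # same structure
print(difflib.SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio())  # -> 1.0
```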
31. Web Page Structure Enhanced Feature Selection for Classification of Web Pages
- Author
-
A. Sankar and B. LeelaDevi
- Subjects
Information retrieval, Web 2.0, Computer science, Semantic search, Static web page, HTML element, Web mining, Web design, Web page, Website Parse Template, Web search engine, Precision and recall, Semantic Web, Site map - Abstract
Web page classification is achieved using text classification techniques, but it differs from traditional text classification because of the additional information provided by web page structure, which says much about content importance. HTML tags provide the visual representation of a web page and can be considered a parameter that highlights content importance. Textual keywords are the basis on which information retrieval systems rely to index and retrieve documents, but keyword-based retrieval returns inaccurate or incomplete results when differing keywords describe the same document and query concepts. Concept-based retrieval tries to tackle this by using manual thesauri with term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. Semantic search has motivated the Semantic Web from its inception for classification and retrieval processes. In this paper, a model exploiting semantic-based feature selection is proposed to improve the search and retrieval of web pages over large document repositories. The features are classified using a Support Vector Machine (SVM) with different kernels. The experimental results show improved precision and recall with the proposed method with respect to keyword-based search. General Terms: Web Mining
- Published
- 2013
32. A Parametric Layered Approach to Perform Web Page Ranking
- Author
-
Anchal Garg and Ratika Goel
- Subjects
Web standards, Web analytics, Web server, Web development, Web 2.0, Computer science, Crawling, World Wide Web, Search engine, Web query classification, Web page, Web design, Website Parse Template, Distributed web crawling, Mashup, Web navigation, Web search query, Information retrieval, Search engine indexing, Static web page, Web search engine, Web content, Web mapping, Web service, Web crawler, Site map, Web modeling - Abstract
Web crawling is the foremost step in performing effective and efficient web content search, so that the user gets the most relevant web pages first, in indexed form. Web crawling is used not only to search for a web page over the web but also to order pages according to user interest. There are a number of available search engines and crawlers that accept a user query and provide page search, but there is still scope to improve the search mechanism. In the present work, a dynamic, user-interest-evolution-based parametric approach is defined to perform web crawling and to order web pages more precisely. A layered approach is used: initial indexing is performed based on keyword-oriented content matching, and the indexing is later modified based on user recommendation. The presented work provides recommendation-based web page indexing so that effective web crawling can be performed.
- Published
- 2013
33. An effective and efficient Web content extractor for optimizing the crawling process
- Author
-
Edip Serdar Güner, Yılmaz Kılıçaslan, Tarık Yerlikaya, Hayri Volkan Agun, and Erdinç Uzun
- Subjects
Web server, Information retrieval, Database, Download, Computer science, Static web page, Focused crawler, Hyperlink, Crawling, HTML element, Upload, Web page, Website Parse Template, Distributed web crawling, Web content, Web crawler, Software - Abstract
Classical Web crawlers make use only of hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic, by utilizing word information before downloading the Web page. But Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), whose backbone is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments on block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, along with the mapping between tags and blocks learned in the first step, give iCrawler considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler decreases storage costs significantly, by 57.10%, when compared with the texts obtained through classical HTML stripping. Copyright © 2013 John Wiley & Sons, Ltd.
- Published
- 2013
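The first step of the extractor described above (learn which HTML tags map to which page blocks with a decision tree) might look like the sketch below. The hand-made per-element features and labels are assumptions for illustration, not the iCrawler training data.

```python
# Learn a tag -> block mapping with a decision tree (illustrative features).
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical per-element features: tag name, link count, text length.
elements = [{"tag": "ul", "links": 12, "text_len": 80},
            {"tag": "p", "links": 0, "text_len": 600},
            {"tag": "h1", "links": 0, "text_len": 40},
            {"tag": "div", "links": 30, "text_len": 150}]
blocks = ["menu", "main_text", "headline", "links"]

model = make_pipeline(DictVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(elements, blocks)
print(model.predict([{"tag": "p", "links": 1, "text_len": 450}]))
```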
34. Web Crawlers for Searching Hidden Pages: A Survey
- Author
-
K. F. Bharati, A. Govardhan, and P. Premchand
- Subjects
Web standards, Web server, Web development, Computer science, Information needs, Focused crawler, Social Semantic Web, World Wide Web, Search engine, Website architecture, Web page, Web design, Website Parse Template, Web navigation, Web search engine, Web mapping, Web service, Web crawler, Web intelligence, Web modeling, Site map - Abstract
Many researchers have addressed the need for a proven, dynamic model of web crawler to serve the many commerce, research and e-commerce establishments on the web that rely largely on search engines. The entire web architecture is changing from traditional to semantic, and web crawlers must change with it. Today's web crawler is liable to omit many pages without searching them and is incapable of capturing hidden pages. There are several research problems in information retrieval, far from optimized, such as supporting users in analyzing their problem to determine their information needs. This paper makes an analytical survey of several proven web crawlers capable of searching hidden pages. It also addresses the prospects and constraints of these methods and ways to enhance them further.
- Published
- 2013
35. Predicting the Behaviour and Interest of the Website Users through Web Log Analysis
- Author
-
Arvind K. Sharma and Pankaj Gupta
- Subjects
Web analytics, Web standards, Web server, Web development, Computer science, World Wide Web, Web mining, Web design, Web page, Website Parse Template, Web navigation, Web mapping, Web service - Abstract
Web mining is a hot research area for many researchers. Web mining techniques have been widely used to discover interesting and frequent user navigation patterns from web server logs. The aim of this work is to apply web mining techniques to the usage of an educational institution's website in order to discover users' behaviour and interests, revealing previously unknown interesting patterns and recommending possible measures for further improvement of the website. In this paper the web user access and server usage patterns are analyzed, and daily, weekly and monthly web metrics such as the number of visits, pages, files, hits and sites are investigated. An attempt has been made to predict the behaviour and interests of the website's users.
- Published
- 2013
36. Analysis of Web Pages Based on the Changed Information and its Application in the Search Engine for One Web Site
- Author
-
Hong Shen Liu and Peng Fei Wang
- Subjects
Web analytics, Web server, Web 2.0, Web development, Computer science, Dynamic web page, World Wide Web, Search engine, Web page, Web design, Website Parse Template, Web navigation, Semantic Web Stack, Information retrieval, Static web page, General Medicine, Web application security, Web search engine, Web mapping, Web service, Web crawler, Site map - Abstract
The structure and content of search engines are presented; their core technology is web page analysis. The characteristics of analyzing web pages within one website are studied: relations between the web pages that the crawler obtained at two different times can be derived, and the changed information between them found easily. A new method of analyzing web pages within one website is introduced; it analyzes pages using their changed information. The results of applying the method show that it is effective for web page analysis.
- Published
- 2013
37. Analyzing Users Behavior from Web Access Logs using Automated Log Analyzer Tool
- Author
-
C. K. Jha and Neha Goel
- Subjects
Web standards, Web analytics, Web server, Web 2.0, Web development, Computer science, Web API, World Wide Web, Web design, Web page, Website Parse Template, Web navigation, Data Web, Static web page, Web application security, Web mining, Web log analysis software, The Internet, Web mapping, Web service, Web intelligence, Web modeling - Abstract
The Internet is acting as a major source of data. As the number of web pages continues to grow, the web provides data miners with just the right ingredients for extracting information. To cater to this growing need, a special term, Web mining, was coined. Web mining makes use of data mining techniques and deciphers potentially useful information from web data. Web usage mining deals with understanding the behavior of users by making use of the Web access logs that are generated on the server while users access the website. A Web access log comprises various entries such as the name of the user, the IP address, the number of bytes transferred, the timestamp, etc. A variety of log analyzer tools exist which help in analyzing things like users' navigational patterns and the parts of the website users are most interested in. The present paper makes use of one such log analyzer tool, Web Log Expert, to ascertain the behavior of users who access an astrology website. It also provides a comparative study of a few available log analyzer tools.
- Published
- 2013
38. Near Duplicate Web Page Detection using NDupDet Algorithm
- Author
-
Nilakshi Joshi and Jayant Gadge
- Subjects
Web server ,Web development ,Computer science ,HITS algorithm ,computer.software_genre ,Backlink ,law.invention ,World Wide Web ,Search engine ,law ,Web page ,Web design ,Website Parse Template ,Web navigation ,Web search query ,Information retrieval ,business.industry ,Static web page ,Web search engine ,The Internet ,Hypertext ,Web mapping ,Web service ,Web crawler ,business ,Site map ,computer - Abstract
The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks that serves billions of users worldwide. The huge number of documents on the web is challenging for web search engines. The web contains multiple copies of the same content or the same web page; many pages are duplicates or near duplicates of other pages. Web search engines face substantial problems due to such duplicate and near-duplicate web pages: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. To help search engines provide search results free of redundancy and to provide distinct, useful results on the first page, duplicate and near-duplicate detection is required. The proposed approach detects near-duplicate web pages to increase the search effectiveness and storage efficiency of a search engine.
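A minimal sketch of one standard way to detect near duplicates, word shingling plus Jaccard similarity; the published NDupDet algorithm is not specified here, so this illustrates the general technique rather than the paper's method, and the similarity threshold is an assumption.

```python
# Near-duplicate detection via k-word shingles and Jaccard similarity.
def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

page_a = "Breaking news the market rallied sharply on Monday morning"
page_b = "Breaking news the market rallied sharply on Monday"

sim = jaccard(shingles(page_a), shingles(page_b))
print(f"similarity = {sim:.2f}")   # 0.75 for these two snippets
if sim > 0.7:                      # threshold is an assumption
    print("near duplicate: index only one copy")
```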
- Published
- 2013
39. Semantic Web Service Discovery Using Natural Language Processing Techniques
- Author
-
Jordy Sangers, Flavius Frasincar, Frederik Hogenboom, Vadim Chepegin, Econometrics, and Erasmus School of Economics
- Subjects
Web standards ,medicine.medical_specialty ,Web 2.0 ,Computer science ,computer.internet_protocol ,computer.software_genre ,WSMO ,OWL-S ,Social Semantic Web ,Artificial Intelligence ,Semantic computing ,Website Parse Template ,Semantic analytics ,medicine ,Semantic Web Stack ,Semantic compression ,Semantic Web ,Data Web ,Web search query ,Information retrieval ,business.industry ,Semantic Web Rule Language ,General Engineering ,Semantic search ,Computer Science Applications ,Semantic grid ,Artificial intelligence ,Web service ,WS-Policy ,business ,Web intelligence ,computer ,Web modeling ,Natural language processing - Abstract
This paper proposes a semantic Web service discovery framework for finding semantic Web services by making use of natural language processing techniques. The framework allows searching through a set of semantic Web services in order to find a match with a user query consisting of keywords. By specifying the search goal using keywords, end-users do not need to have knowledge about semantic languages, which makes it easy to express the desired semantic Web services. For matching keywords with semantic Web service descriptions given in WSMO, techniques like part-of-speech tagging, lemmatization, and word sense disambiguation are used. After determining the senses of relevant words gathered from Web service descriptions and the user query, a matching process takes place. The performance evaluation shows that the three proposed matching algorithms are able to effectively perform matching and approximate matching.
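A minimal sketch of the named techniques, lemmatization and Lesk word-sense disambiguation via NLTK, applied to interpreting query keywords in the context of a service description; the paper's actual WSMO-based matching algorithms are more involved than this.

```python
# Disambiguate each query keyword against the service description.
# Requires the WordNet data: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk

query = "book cheap flights"
service_desc = "This service books airline flights and hotel rooms"
context = service_desc.lower().split()

lemmatizer = WordNetLemmatizer()
for keyword in query.split():
    lemma = lemmatizer.lemmatize(keyword.lower())
    synset = lesk(context, lemma)  # sense best supported by the description
    gloss = synset.definition() if synset else "no WordNet sense found"
    print(f"{keyword!r} -> {gloss}")
```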
- Published
- 2013
40. Semantic ranking of web pages based on formal concept analysis
- Author
-
Yajun Du and Yufeng Hai
- Subjects
Web standards ,Web server ,medicine.medical_specialty ,Web 2.0 ,Computer science ,Focused crawler ,Crawling ,computer.software_genre ,Social Semantic Web ,World Wide Web ,Search engine ,Semantic similarity ,Rewrite engine ,Web page ,Web design ,Website Parse Template ,medicine ,Web navigation ,Semantic Web Stack ,Semantic Web ,Data Web ,Hierarchy ,Information retrieval ,business.industry ,Static web page ,Hyperlink ,Ranking ,Hardware and Architecture ,Web search engine ,Web service ,Web crawler ,business ,computer ,Web modeling ,Site map ,Software ,Information Systems - Abstract
A web crawler is an important component of a search engine. In this paper, a new method for measuring the similarity of formal concept analysis (FCA) concepts and a new notion of a web page's rank are proposed, using an information content approach based on users' web logs. First, an extension similarity and an intension similarity are proposed that analyze users' browsing patterns and hyperlinks. Second, the information content similarity between two nouns is computed automatically by examining their ISA and Part-Of hierarchies and using users' web logs. A method is proposed for computing the semantic similarity between two concepts in two different concept lattices (the base concept lattice and the current concept lattice) and for deriving the semantic ranking of web pages. Finally, our experiments demonstrate that the crawler is well suited to focused crawling, and that the semantic ranking of web pages is useful and efficient in guiding the crawler's choice of the next web page.
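A minimal sketch of an information content similarity of the kind the abstract describes: the similarity of two concepts is taken as the information content, IC(c) = -log p(c), of their most informative common ancestor in an ISA hierarchy. The toy hierarchy and frequencies (standing in for web-log counts) are assumptions.

```python
# IC-based concept similarity over a toy ISA hierarchy.
import math

parent = {"laptop": "computer", "server": "computer",
          "computer": "device", "phone": "device", "device": None}
freq = {"laptop": 8, "server": 4, "phone": 12}   # e.g. counts from web logs

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = parent[c]
    return out

def subtree_count(c):
    """Frequency of c plus all its descendants."""
    return freq.get(c, 0) + sum(subtree_count(k) for k, p in parent.items() if p == c)

total = subtree_count("device")

def ic(c):
    return -math.log(subtree_count(c) / total)

def sim(c1, c2):
    common = [a for a in ancestors(c1) if a in set(ancestors(c2))]
    return max(ic(a) for a in common)  # IC of most informative common ancestor

print(round(sim("laptop", "server"), 3))  # share 'computer' -> 0.693
print(round(sim("laptop", "phone"), 3))   # only share the root -> 0.0
```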
- Published
- 2013
41. Automatic Ontology-based Annotation of Food, Nutrition and Health Arabic Web Content
- Author
-
Saeed Albukhitan and Tarek Helmy
- Subjects
Web standards ,medicine.medical_specialty ,Name Entity Recognition ,computer.internet_protocol ,Computer science ,Annotation ,Ontology (information science) ,Social Semantic Web ,OWL-S ,World Wide Web ,medicine ,Website Parse Template ,Semantic analytics ,Semantic Web Stack ,RDF ,Arabic Language ,Semantic Web ,Image retrieval ,Data Web ,General Environmental Science ,Information retrieval ,business.industry ,Semantic Web Rule Language ,Semantic search ,Linked data ,computer.file_format ,Metadata ,Semantic grid ,Ontology ,General Earth and Planetary Sciences ,Web content ,Web resource ,Precision and recall ,business ,Web intelligence ,Web modeling ,computer - Abstract
For the semantic Web to succeed, a sufficient amount of relevant, high-quality semantic Web content is critically required. One way to produce such content is through the semantic annotation of Web sources: the process of adding machine-readable content to the natural-language text of those sources. Annotating Web content in Arabic has received less attention than Latin languages, especially for content related to specific domains such as food, nutrition, and health. Considering the huge amount of emerging Web content, semantic annotation by hand is neither practicable nor scalable. In this paper, we present an automatic annotation of Arabic Web resources related to the food, nutrition, and health domains. The proposed method makes use of Arabic OWL ontologies developed for those domains. It uses linguistic patterns to discover relevant relationships between the named entities in the Arabic Web resources. The extracted information is then associated with the corresponding concepts and object properties of the developed ontology to produce RDF metadata for the corresponding Web resources. Empirical evaluations of the proposed method show promising precision and recall. As a contribution, the produced RDF triples can be utilized by semantic Web search applications to retrieve intelligent and relevant answers to end users' queries.
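A minimal sketch of the final step, emitting RDF metadata for an extracted entity relationship with rdflib; the namespace, class, and property names are hypothetical, not the authors' ontology.

```python
# Turn one extracted relation into RDF triples and serialize as Turtle.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

FOOD = Namespace("http://example.org/food#")  # hypothetical namespace
g = Graph()
g.bind("food", FOOD)

# Suppose a linguistic pattern matched "dates are rich in iron" in the text.
g.add((FOOD.Dates, RDF.type, FOOD.Food))
g.add((FOOD.Iron, RDF.type, FOOD.Nutrient))
g.add((FOOD.Dates, FOOD.richIn, FOOD.Iron))
g.add((FOOD.Dates, RDFS.label, Literal("تمر", lang="ar")))  # Arabic label

print(g.serialize(format="turtle"))
```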
- Published
- 2013
42. Web Image Retrieval using Semantic Prior Tags
- Author
-
Seongjae Lee and Soosun Cho
- Subjects
Information retrieval ,Web image ,Computer Networks and Communications ,Hardware and Architecture ,Computer science ,Website Parse Template ,Visual Word ,Semantic Web Stack ,Image retrieval - Published
- 2012
43. SOF: a semi-supervised ontology-learning-based focused crawler
- Author
-
Farookh Khadeer Hussain and Hai Dong
- Subjects
Information retrieval ,Ontology learning ,Computer Networks and Communications ,Computer science ,business.industry ,Focused crawler ,Crawling ,Ontology (information science) ,Computer Science Applications ,Theoretical Computer Science ,World Wide Web ,Computational Theory and Mathematics ,Web page ,Ontology ,Website Parse Template ,The Internet ,Web crawler ,business ,Semantic Web ,Software - Abstract
The rapid increase in the volume of data available on the Internet makes it increasingly impractical for a crawler to index the whole Web. Instead, many intelligent crawlers, known as ontology-based semantic focused crawlers, have been designed that make use of Semantic Web technologies for topic-centered Web information crawling. Ontologies, however, have constraints of validity and time, which may influence the performance of the crawlers. Ontology-learning-based focused crawlers are therefore designed to automatically evolve ontologies by integrating ontology learning technologies. Nevertheless, surveys indicate that existing ontology-learning-based focused crawlers cannot automatically enrich the content of ontologies, which makes them unreliable in the open and heterogeneous Web environment. Hence, in this paper, we propose a framework for a novel semi-supervised ontology-learning-based focused (SOF) crawler, which embodies a series of schemas for ontology generation and Web information formatting, a semi-supervised ontology learning framework, and a hybrid Web page classification approach aggregated from a group of support vector machine models. A series of tests is implemented to evaluate the technical feasibility of the proposed framework. The conclusion and future work are summarized in the final section.
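A minimal sketch of an SVM-based relevance classifier of the kind such a crawler could use to judge pages, built with scikit-learn; the training texts are invented, and the paper aggregates a group of SVM models rather than this single pipeline.

```python
# TF-IDF features feeding a linear SVM for on-topic / off-topic decisions.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "solar panels and renewable energy systems",
    "wind turbines generate clean power",
    "recipes for chocolate cake and cookies",
    "baking bread with yeast and flour",
]
train_labels = [1, 1, 0, 0]  # 1 = on-topic (energy), 0 = off-topic

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

page = "new solar energy panels cut electricity costs"
print("crawl this page" if clf.predict([page])[0] == 1 else "skip")
```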
- Published
- 2012
44. Bridging the Gap between bdME and OntoME
- Author
-
Pedro Rangel Henriques and Ricardo Giuliani Martini
- Subjects
Web standards ,medicine.medical_specialty ,Web 2.0 ,Web development ,Relational database ,Computer science ,02 engineering and technology ,Dynamic web page ,Ontology (information science) ,computer.software_genre ,Social Semantic Web ,World Wide Web ,020204 information systems ,11. Sustainability ,Web design ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Website Parse Template ,Semantic analytics ,medicine ,Web navigation ,Semantic Web Stack ,Semantic Web ,Data Web ,Information retrieval ,business.industry ,Information model ,Ontology ,020201 artificial intelligence & image processing ,Web mapping ,Web service ,Web intelligence ,business ,computer ,Web modeling - Abstract
The Semantic Web aims at building a Web where data is enriched with meaningful annotations. In other words, data is semantically organized in such a way that both human and machine can understand and query it, aiming at the creation of dynamic Web pages. Ontologies, as a keystone of the Semantic Web, have gained an ample acceptance as an information model, which can be used for several purposes, such as information retrieval in the Web. However, data is normally stored in databases, which present various problems in the Semantic Web context, because data is not semantically annotated. Aiming at retrieving rich results in the sense of meaning, several ways of relating databases with ontologies have emerged. This paper presents a mapping – with the aid of a framework called Ontop – as a solution for the communication problem between the relational database of the Emigration Museum of Fafe (EMF) and the ontology of the Emigration Museum (OntoME), which describes the Cultural Heritage domain. This mapping will be used to realize the CaVa architecture, aiming at the creation of dynamic Web pages as virtual Learning Spaces. Real examples of the mapping process are presented.
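A minimal sketch of the database-to-ontology mapping idea (which Ontop expresses declaratively): relational rows become ontology individuals and property assertions. The table and ontology names are hypothetical, not the actual EMF or OntoME schema.

```python
# Map relational rows to ontology individuals, R2RML/OBDA style.
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTO = Namespace("http://example.org/ontome#")  # hypothetical namespace

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emigrant (id INTEGER, name TEXT, destination TEXT)")
db.execute("INSERT INTO emigrant VALUES (1, 'Maria Silva', 'Brazil')")

g = Graph()
g.bind("onto", ONTO)
for rid, name, dest in db.execute("SELECT id, name, destination FROM emigrant"):
    subject = ONTO[f"emigrant/{rid}"]        # URI template, as in OBDA mappings
    g.add((subject, RDF.type, ONTO.Emigrant))
    g.add((subject, ONTO.name, Literal(name)))
    g.add((subject, ONTO.destination, Literal(dest)))

print(g.serialize(format="turtle"))
```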
- Published
- 2016
45. Self-adaptive ontology-based focused crawling: A literature survey
- Author
-
Dilip Kumar Sharma and Mohd. Aamir Khan
- Subjects
Information retrieval ,business.industry ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,020208 electrical & electronic engineering ,02 engineering and technology ,Crawling ,Ontology (information science) ,Focused crawler ,World Wide Web ,Web page ,Data_FILES ,0202 electrical engineering, electronic engineering, information engineering ,Website Parse Template ,020201 artificial intelligence & image processing ,The Internet ,Web crawler ,Literature survey ,business ,Site map - Abstract
Web crawlers have been with us since the birth of the Internet in the early 1990s: web pages are interconnected, forming paths along which a crawler travels to fetch the information requested by the user. Traditional crawlers, however, cannot distinguish between relevant and partially relevant web pages, so they must fetch huge amounts of data from the web even when much of it is not relevant to the user. This led to crawlers committed to a single topic given by the user, known as focused crawlers. Unlike traditional crawlers, focused crawlers do not crawl the whole web; they crawl only the part of the web related to the given topic. This paper summarizes the qualities of various current focused crawlers, dividing them into two classes: semantic and social semantic. Semantic focused crawlers use an ontology to obtain topics that are contextually related to the given topic. Social semantic focused crawlers take advantage of social networking sites to obtain web pages that are contextually related to the given topic, usually pages shared by people interested in topics related to the queried one.
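A minimal sketch of the best-first strategy common to focused crawlers, in which links are queued by a topic-relevance score instead of discovery order; fetch() and extract_links() are hypothetical stand-ins for real HTTP and HTML-parsing code, and the scoring is deliberately simple.

```python
# Best-first focused crawl: expand the most promising link first.
import heapq

TOPIC = {"ontology", "semantic", "crawler"}

def score(text: str) -> float:
    """Fraction of words that belong to the topic vocabulary."""
    words = text.lower().split()
    return sum(w in TOPIC for w in words) / max(len(words), 1)

def crawl(seed_url, fetch, extract_links, budget=100, threshold=0.05):
    frontier = [(-1.0, seed_url)]           # max-heap via negated scores
    seen = {seed_url}
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        page = fetch(url)                   # hypothetical HTTP fetcher
        budget -= 1
        if score(page.text) < threshold:
            continue                        # off-topic: do not expand
        for link, anchor in extract_links(page):  # hypothetical parser
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(anchor), link))
```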
- Published
- 2016
46. OPAL: Automated form understanding for the deep web
- Author
-
Furche, T, Gottlob, G, Grasso, G, Guo, X, Orsi, G, Schallhart, C, Mille, A, Gandon, FL, Misselis, J, Rabinovich, M, Staab, S, Mille, A, Gandon, F, Misselis, J, Rabinovich, M, and Staab, S
- Subjects
Web standards ,medicine.medical_specialty ,Information retrieval ,business.industry ,Computer science ,Usability ,computer.software_genre ,World Wide Web ,Software design pattern ,Web design ,Web page ,Website Parse Template ,medicine ,Web navigation ,Web mapping ,Web service ,business ,computer ,Web modeling - Abstract
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).
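A minimal sketch of the form-labeling subtask, pairing each input field with its textual label; OPAL combines textual, structural, and visual signals, whereas this sketch uses only explicit <label for="..."> links.

```python
# Pair form inputs with their labels via the for/id attribute link.
from bs4 import BeautifulSoup

html = """
<form>
  <label for="dep">Departure city</label><input id="dep" name="from">
  <label for="arr">Arrival city</label><input id="arr" name="to">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
for inp in soup.find_all("input"):
    label = soup.find("label", attrs={"for": inp.get("id")})
    text = label.get_text(strip=True) if label else "(unlabeled)"
    print(f"field {inp['name']!r} labeled {text!r}")
```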
- Published
- 2016
47. HTML web content extraction using paragraph tags
- Author
-
Howard J. Carey and Milos Manic
- Subjects
Information retrieval ,business.industry ,Computer science ,Static web page ,02 engineering and technology ,World Wide Web ,020204 information systems ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Website Parse Template ,020201 artificial intelligence & image processing ,The Internet ,Web content ,Paragraph ,business ,Document Object Model ,Cluster analysis ,Dissemination - Abstract
With the ever-expanding use of the internet to disseminate information across the world, gathering useful information from the multitude of web page styles continues to be a difficult problem. Computers have been used as a tool to scrape desired content from web pages for several decades, and many methods exist to extract content, such as Document Object Model (DOM) trees, text density, tag ratios, visual strategies, and fuzzy algorithms. Due to the multitude of different website styles and designs, however, finding a single method that works in every case is very difficult. This paper presents a novel method, Paragraph Extractor (ParEx), that clusters HTML paragraph tags and local parent headers to identify the main content within a news article. On websites that use paragraph tags to store their main news article, ParEx outperforms the Boilerpipe algorithm, with a higher F1 score of 97.33% versus 88.53%.
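A minimal sketch of the idea behind paragraph-tag extraction: collect <p> tags, group them by parent element, and keep the group carrying the most text. This illustrates the approach, not the published ParEx algorithm.

```python
# Keep the <p> group whose combined text is longest: likely the article body.
from bs4 import BeautifulSoup

html = """
<div id="nav"><p>Home</p><p>Sports</p></div>
<div id="story">
  <p>The city council approved the new budget on Tuesday.</p>
  <p>Officials said the plan increases funding for schools.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
groups = {}
for p in soup.find_all("p"):
    key = id(p.parent)  # group paragraphs that share the same parent element
    groups.setdefault(key, []).append(p.get_text(strip=True))

main = max(groups.values(), key=lambda ps: sum(len(t) for t in ps))
print("\n".join(main))
```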
- Published
- 2016
48. Towards Semantic Web of Things: From Manual to Semi-automatic Semantic Annotation on Web of Things
- Author
-
Yunong Yang, Zhenyu Wu, Yuan Xu, Yang Ji, and Chunhong Zhang
- Subjects
Web standards ,Computer science ,Interoperability ,02 engineering and technology ,computer.software_genre ,Social Semantic Web ,World Wide Web ,Annotation ,Web of Things ,020204 information systems ,Semantic computing ,0202 electrical engineering, electronic engineering, information engineering ,Semantic analytics ,Website Parse Template ,Mashup ,Semantic Web Stack ,Semantic Web ,Image retrieval ,Data Web ,Information retrieval ,business.industry ,Semantic Web Rule Language ,Semantic search ,020206 networking & telecommunications ,Semantic grid ,Knowledge base ,Ontology ,Semantic technology ,Web service ,business ,Web intelligence ,computer - Abstract
Web of Things (WoT) unifies the syntactic representations of physical objects via web patterns, which facilitates the integration and mashup of heterogeneous data and web services. However, the lack of unified representation markup tools and methods at the semantic layer hinders the interoperability, integration, and scalable search of things. This paper proposes a Semantic Web of Things framework that improves interoperability among domain-specific Web of Things applications by providing a unified WoT knowledge base construction framework. For this purpose, a Microdata vocabulary extended from the Semantic Sensor Network ontology is proposed to facilitate manual annotation of HTML-based WoT representations. Moreover, to improve the scalability of extracting the semantics of structured Web of Things resources, a semi-automatic semantic annotation method based on an entity linking model is also proposed. To demonstrate the technical feasibility of the framework, a reference implementation and a quantitative evaluation of the annotation results are presented.
- Published
- 2016
49. An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation
- Author
-
Lei Zhou, Houqing Lu, Dengchao He, and Donghui Zhan
- Subjects
Article Subject ,Computer science ,General Mathematics ,02 engineering and technology ,Focused crawler ,Crawling ,World Wide Web ,020204 information systems ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Website Parse Template ,tf–idf ,Anchor text ,Information retrieval ,business.industry ,lcsh:Mathematics ,General Engineering ,lcsh:QA1-939 ,lcsh:TA1-2040 ,020201 artificial intelligence & image processing ,The Internet ,business ,Web crawler ,lcsh:Engineering (General). Civil engineering (General) ,Site map ,Page hijacking - Abstract
A focused crawler is topic-specific and aims to selectively collect web pages that are relevant to a given topic from the Internet. However, the performance of current focused crawling can easily suffer from the environment of a web page and from multi-topic web pages: in the crawling process, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide crawlers. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor-text-only, link-context-only, and content block partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
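A minimal sketch of scoring candidate links against a topic with TF-IDF and cosine similarity, the general mechanism behind such relevance judgments; the paper's ITFIDF weighting and LPE formula are not reproduced here.

```python
# Rank link contexts by TF-IDF cosine similarity to the topic description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topic = "solar energy photovoltaic power generation"
link_contexts = [
    "cheap flights and hotel booking deals",
    "advances in photovoltaic solar panel efficiency",
    "solar power generation statistics by country",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform([topic] + link_contexts)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Crawl the highest-scoring links first.
for s, ctx in sorted(zip(scores, link_contexts), reverse=True):
    print(f"{s:.2f}  {ctx}")
```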
- Published
- 2016
50. Survey of Techniques for Deep Web Source Selection and Surfacing the Hidden Web Content
- Author
-
Khushboo Khurana and Manoj Chandak
- Subjects
Web standards ,Web analytics ,medicine.medical_specialty ,General Computer Science ,Web 2.0 ,Web development ,Computer science ,02 engineering and technology ,Dynamic web page ,computer.software_genre ,Schema matching ,World Wide Web ,Search engine ,020204 information systems ,Web page ,Web design ,0202 electrical engineering, electronic engineering, information engineering ,Website Parse Template ,medicine ,Web navigation ,Semantic Web Stack ,Data Web ,Information retrieval ,business.industry ,Hyperlink ,Web application security ,Web mining ,020201 artificial intelligence & image processing ,Web mapping ,Web content ,Web service ,Web crawler ,business ,Web intelligence ,computer ,Site map ,Web modeling - Abstract
Large and continuously growing dynamic web content has created new opportunities for large-scale data analysis in recent years. There is a huge amount of information that traditional web crawlers cannot access, since they use link analysis techniques that reach only the surface web. Traditional search engine crawlers require web pages to be linked to other pages via hyperlinks, leaving a large amount of web data hidden from the crawlers. Enormous amounts of data available in the deep web can be useful for gaining new insight in various domains, creating a need for efficient techniques to access this information. As the amount of Web content grows rapidly, the types of data sources are proliferating, and these sources often provide heterogeneous data, so the deep web data sources to be used by integration systems must be selected carefully. The paper discusses various techniques that can be used to surface deep web information, as well as techniques for deep web source selection.
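A minimal sketch of "surfacing": programmatically filling in and submitting a search form so that otherwise hidden result pages become available for indexing. The endpoint URL and field name are hypothetical.

```python
# Probe a search form with a list of query terms and collect result pages.
import requests

def surface(form_url: str, field: str, query_terms: list[str]) -> dict:
    """Submit one probe query per term and return the fetched result pages."""
    pages = {}
    for term in query_terms:
        resp = requests.get(form_url, params={field: term}, timeout=10)
        if resp.ok:
            pages[term] = resp.text  # result page is now available for indexing
    return pages

# Example (hypothetical endpoint):
# results = surface("https://example.org/search", "q", ["genomics", "proteins"])
```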
- Published
- 2016