78 results on '"Document type declaration"'
Search Results
2. Inference Document Type (Dtd) From Xml Document: Web Structure Mining
- Author
-
R. K. Chauhan, Nanhay Singh, and Raghuraj Singh
- Subjects
Document Structure Description ,XML Encryption ,Computer science ,computer.internet_protocol ,Relational database ,Efficient XML Interchange ,XML Signature ,Well-formed document ,XML Base ,Document type definition ,Simple API for XML ,XML Schema Editor ,Schema (psychology) ,Streaming XML ,XML schema ,Foreign key ,computer.programming_language ,XHTML ,Information retrieval ,Document type declaration ,InformationSystems_DATABASEMANAGEMENT ,XML validation ,computer.file_format ,XML framework ,XML Schema (W3C) ,Web mining ,Document Schema Definition Languages ,Data exchange ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Functional dependency ,computer ,XML ,XML Catalog
- Abstract
XML is becoming a prevalent format and the de facto standard for data exchange in many applications, while large amounts of data are traditionally stored and managed in relational databases. There is an urgent need for efficient methods to convert data stored in relational databases to XML format when integrating and exchanging these data. The semantics of XML schemas are crucial to designing, querying, and storing XML documents, and functional dependencies are very important representations of the semantic information of XML schemas. As DTDs are among the most frequently used schemas for XML documents these days, we use DTDs as the schemas of XML documents here. This paper studies the problem of schema conversion from relational schemas to XML DTDs. As functional dependencies play an important role in the schema conversion process, the concept of functional dependency for XML DTDs is used to preserve the semantics implied by functional dependencies and keys of relational schemas. A conversion method is proposed to convert relational schemas to XML DTDs in the presence of functional dependencies, keys and foreign keys. The methods presented here can preserve the semantics implied by functional dependencies, keys and foreign keys of relational schemas and can convert multiple relational tables to XML DTDs at the same time.
- Published
- 2010
3. Generating XML structure using examples and constraints
- Author
-
Sara Cohen
- Subjects
Document Structure Description ,XML Encryption ,computer.internet_protocol ,Computer science ,Efficient XML Interchange ,XML Signature ,Well-formed document ,Document type definition ,Simple API for XML ,XML Schema Editor ,Streaming XML ,XML schema ,XPath ,computer.programming_language ,Information retrieval ,Document type declaration ,cXML ,General Engineering ,XML validation ,computer.file_format ,XML framework ,XML Schema (W3C) ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,XML ,XML Catalog
- Abstract
This paper presents a framework for automatically generating structural XML documents. The user provides a target DTD and an example of an XML document, called a Generate-XML-By-Example Document, or a GxBE document, for short. GxBE documents use a natural declarative syntax, which includes XPath expressions and the function count. Using GxBE documents, users can express important global and local characteristics for the desired target documents, and can require satisfaction of XPath expressions from a given workload. This paper explores the problem of efficiently generating a document that satisfies a given DTD and GxBE document.
- Published
- 2008
4. Logical structure analysis: From HTML to XML
- Author
-
Kyong-Ho Lee, Minhyung Lee, and Yeon-Seok Kim
- Subjects
Document Structure Description ,Information retrieval ,Computer science ,Document type declaration ,Well-formed document ,XML validation ,Document type definition ,Simple API for XML ,Hardware and Architecture ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Law ,Software ,Document layout analysis ,XML Catalog
- Abstract
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents.
- Published
- 2007
5. Validating Scripted Web-Pages
- Author
-
Roger G. Stone
- Subjects
Document Structure Description ,XHTML ,General Computer Science ,Computer science ,computer.internet_protocol ,Programming language ,Document type declaration ,XML validation ,Well-formed document ,WML ,PHP ,computer.software_genre ,HTML element ,VALIDATION ,DTD ,Theoretical Computer Science ,XML framework ,Simple API for XML ,Web page ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,XML ,computer.programming_language ,Computer Science(all)
- Abstract
The validation of XML documents against a DTD is well understood and tools exist to accomplish this task. But the problem considered here is the validation of a generator of XML documents. The desired outcome is to establish for a particular generator that it is incapable of producing invalid output. Many (X)HTML web pages are generated from a document containing embedded scripts written in languages such as PHP. Existing tools can validate any particular instance of the XHTML generated from the document. However, there is no tool for validating the document itself, guaranteeing that all instances that might be generated are valid. A prototype validating tool for scripted documents has been developed which uses a notation developed to capture the generalised output from the document and a systematically augmented DTD.
- Published
- 2006
6. On Finding an Edit Script between an XML Document and a DTD
- Author
-
Nobutaka Suzuki
- Subjects
Document Structure Description ,Information retrieval ,Simple API for XML ,Computer science ,XML Schema Editor ,Document type declaration ,Well-formed document ,XML validation ,XML schema ,Document type definition ,computer ,computer.programming_language
- Published
- 2006
7. Representing Annotations in XML Document using String-Trees Model
- Author
-
Keng Hoon Gan
- Subjects
Document Structure Description ,Information retrieval ,Simple API for XML ,Computer science ,Document type declaration ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,XML validation ,Well-formed document ,Document type definition ,XML schema ,computer ,XML Catalog ,computer.programming_language
- Abstract
The flexibility of XML allows documents to be annotated easily. However, these annotations come from different sources such as the WordNet thesaurus, POS, DTD, semantic roles, etc. These annotations can either be combined in the same document or captured separately in different documents. The former, though richer in annotations, may look messy and requires more parsing time. The latter requires control of document consistency. This paper proposes a string-trees model to represent XML documents with multiple sources of annotations. This model extends the existing string-tree structure for linguistic content in order to support the structured contents of XML documents. In this paper, we describe how this model is refined and applied to XML documents.
- Published
- 2014
8. A New Technique for Authenticating Content in Evolving Marked-up Documents
- Author
-
Phillip Berrie
- Subjects
Document Structure Description ,Linguistics and Language ,Authentication ,Markup language ,Document type declaration ,business.industry ,Computer science ,Electronic document ,Language and Linguistics ,World Wide Web ,Structured document ,Software ,Transcription (software) ,business ,Information Systems
- Abstract
Accuracy of transcription is vital when preparing a scholarly version of an existing document. This process has not changed with the advent of electronic editions. In fact, ensuring the continued accuracy of a transcription in the digital realm is more difficult because a file, unlike a piece of paper, does not retain information about its previous states and it is therefore possible that accidental changes can go undetected unless the content is continually checked against the original. This article presents a new, character-set-independent, programming algorithm that allows for the ongoing authentication of the textual content of files being marked up with SGML-like languages. The study also describes an implementation of this algorithm and how it can be used with existing software tools to provide a more efficient and trusted editing environment for creating and editing marked-up files. The Just In Time Authentication Mechanism (JITAM) algorithm was developed in response to the need for some form of automated authentication mechanism for projects already employing embedded markup and is seen as a preparatory step that editors can take with their projects before making the leap to the more versatile Just In Time Markup (JITM) system.
- Published
- 2005
9. A Design and Implementation of the Tree-based Document Editing System for XML Application
- Author
-
Young Chul Kim and Chun Kil Kang
- Subjects
Document Structure Description ,Database ,Document type declaration ,Programming language ,Computer science ,Well-formed document ,XML validation ,Document type definition ,computer.software_genre ,Simple API for XML ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,XML Catalog
- Abstract
This paper describes the design and implementation of a tree-based document editing system for XML applications, available in a structure-oriented environment. The system converts a DTD to an ASTD (Syntax Tree Definition) to support syntax-directed editing of valid documents, is designed to be extensible so that new tools can be added, and supports a multiple-entry parser for real-time document validation. The paper is expected to contribute a development model for document editing systems in related XML applications.
- Published
- 2004
10. Managing very large document collections using semantics
- Author
-
Yubin Bao, Ge Yu, Guoren Wang, and Hongjun Lu
- Subjects
Information retrieval ,Document type declaration ,Computer science ,Well-formed document ,Document management system ,Document clustering ,computer.software_genre ,Semantics ,Computer Science Applications ,Theoretical Computer Science ,Set (abstract data type) ,Computational Theory and Mathematics ,Hardware and Architecture ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,Software ,Document layout analysis
- Abstract
In this paper, a system is presented where documents are no longer identified by their file names. Instead, a document is represented by its semantics in terms of descriptor and content vector. The descriptor of a document consists of a set of attributes, such as date of creation, its type, its size, annotations, etc. The content vector of a document consists of a set of terms extracted from the document. In this paper, a semantic document management system XBASE is designed and implemented based on the semantics and the functions of three main modules, X-Loader, X-Explorer and X-Query.
- Published
- 2003
11. Clustering DTDs: An interactive two-level approach
- Author
-
Long Zhang, Weining Qian, Hailei Qian, Wen Jin, Yuqi Liang, and Aoying Zhou
- Subjects
Document Structure Description ,RuleML ,computer.internet_protocol ,Computer science ,Well-formed document ,Document type definition ,computer.software_genre ,Theoretical Computer Science ,RELAX NG ,SGML ,Cluster analysis ,computer.programming_language ,XHTML ,Document type declaration ,XML validation ,computer.file_format ,Computer Science Applications ,XML Schema (W3C) ,Computational Theory and Mathematics ,Categorization ,Hardware and Architecture ,Data exchange ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Data mining ,computer ,Software ,XML ,PCDATA
- Abstract
XML (eXtensible Markup Language) is a standard which is widely applied in data representation and data exchange. However, as an important concept of XML, DTD (Document Type Definition) is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, and it can be used in XML document clustering. The two-level method clusters the elements in DTDs and clusters DTDs separately. Element clustering forms the first level and provides element clusters, which are the generalization of relevant elements. DTD clustering utilizes the generalized information and forms the second level in the whole clustering process. The two-level method has the following advantages: 1) It takes into consideration both the content and the structure within DTDs; 2) The generalized information about elements is more useful than the separated words in the vector model; 3) The two-level method facilitates the searching of outliers. The experiments show that this method is able to categorize the relevant DTDs effectively.
- Published
- 2002
12. Succession in standardization: grafting XML onto SGML
- Author
-
A. G. A. J. Loeffen and Tineke M. Egyedi
- Subjects
DocBook ,Standardization ,Document type declaration ,Computer science ,computer.internet_protocol ,business.industry ,Efficient XML Interchange ,XML validation ,Document type definition ,computer.file_format ,Ecological succession ,HTML ,World Wide Web ,XML Schema Editor ,Hardware and Architecture ,Compatibility (mechanics) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,SGML ,Software engineering ,business ,Law ,computer ,Software ,XML ,computer.programming_language
- Abstract
Succession in standardization is usually a problem. The advantages of improvements are weighed against those of compatibility. If compatibility considerations dominate, a grafting process takes place. This process need not lead to compatibility. According to our taxonomy of successor standards, there are three types of succession (outcomes). Type I, where grafting is achieved, entails compatibility between successors, technical paradigm-compliance, and continuity in the standards trajectory. In this paper, we examine issues of succession and focus on the Extensible Markup Language (XML). It was to be grafted on the Standard Generalized Markup Language (SGML), a stable standard since 1988. However, XML was a profile, a subset and an extension of SGML (1988). Adaptation of SGML was needed (SGML1999) to forge full (downward) compatibility with XML (1998). We describe the grafting efforts and analyze their outcomes. We conclude that XML largely fits the SGML paradigm. SGML was a technical exemplar for XML developers. In contrast, widespread use of HTML exemplified the desirability of simplicity in XML standardization. The latter issue and HTML's user market largely explain discontinuity in SGML-XML succession.
- Published
- 2002
13. WWW (World Wide Web) Communication and Publishing of Structural Formulas by XyMML (XyM Markup Language)
- Subjects
World Wide Web ,XHTML ,Markup language ,Computer science ,Document type declaration ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Well-formed document ,Document type definition ,HTML ,computer ,computer.programming_language ,PCDATA
- Abstract
A tool for displaying and communicating chemical structural formulas has been developed on the basis of XyMML (XyM Markup Language), where a XyMML document according to the XML (Extensible Markup Language) specification has been transformed into an HTML (HyperText Markup Language) document by means of a translator program due to XSLT (Extensible Stylesheet Language Transformations). During this process, XyMML data written in such a XyMML document have been converted into XyM notations embedded in such an HTML document, which is browsed by virtue of a World Wide Web (WWW) browser including the XyMJava system. Another tool for printing chemical structural formulas has been developed so that the same XyMML document has been transformed into a XyMTeX document by means of XSLT. The resulting XyMTeX document has been used to print a document containing structural formulas through the TeX/LaTeX typesetting system. Thereby, the XyMML and the related techniques have been shown to have the potentiality of serving as a kernel for integrating WWW communication, electronic publishing, and conventional publishing in chemistry.
- Published
- 2002
14. Interoperable Document Collaboration
- Author
-
Patrick Durusau, Svante Schubert, and Sebastian Rönnau
- Subjects
World Wide Web ,Multiple document interface ,Computer science ,Document type declaration ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Well-formed document ,Document management system ,Document engineering ,computer.software_genre ,User requirements document ,computer ,Vision document
- Abstract
To provide office applications with an easy interoperable document merge capability and to enable the usage of document revision across applications, it is necessary to not only standardize the representations of a document state, but also of the changes made to the document during the editing process. Tracking the changes during editing retains the information usually being recovered afterwards. This avoids costly and time consuming processes like document comparison and diff heuristics [1]. To this day, file formats such as the OpenDocument file format (ODF) do only specify all possible document variations of a document representing the final state of user data. Interoperability is therefore only given on a document level: One ODF application saves a document and a different application is able to load and continue work on the same document state. Common scenarios of document exchange have been by floppy disc, attached to email and exchange across networks via file services such as Dropbox. Nowadays, the Internet is ubiquitous and multiple users want to work simultaneously on the same document. In that context the transfer of a whole document from user to user is inefficient. Additionally, finding and merging changes in XML-based documents appears to be complex and possibly error-prone [2]. For this reason, the OASIS Advanced Document Collaboration subcommittee has started to simplify collaboration by specifying the changes applicable to an ODF document and raising ODF application interoperability from a full document level to a more granular document change level. In this paper, we present an approach to ODF change representation called "Merge enabled Change-Tracking" (MCT), which is based on the Operational Transformation approach [3].
- Published
- 2014
15. Transforming paper documents into XML format with WISDOM++
- Author
-
Donato Malerba, O. Altamura, and Floriana Esposito
- Subjects
Document Structure Description ,Information retrieval ,Document type declaration ,Computer science ,Document classification ,Well-formed document ,XML validation ,computer.software_genre ,Computer Science Applications ,XML framework ,Simple API for XML ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Computer Vision and Pattern Recognition ,computer ,Software ,Document layout analysis
- Abstract
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
- Published
- 2001
16. Securing XML documents with Author-X
- Author
-
Elisa Bertino, Elena Ferrari, and Silvana Castano
- Subjects
Document Structure Description ,Database ,Computer Networks and Communications ,Computer science ,computer.internet_protocol ,Document type declaration ,XML validation ,Well-formed document ,Document type definition ,computer.software_genre ,User requirements document ,World Wide Web ,Simple API for XML ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,XML
- Abstract
Author-X is a Java-based system that addresses the security issues of access control and policy design for XML document administration. Author-X supports the specification of policies at varying granularity levels and the specification of user credentials as a way to enforce access control. Access control is available according to both push and pull document distribution policies, and document updates are distributed through a combination of hash functions and digital signature techniques. The Author-X approach to distributed updates allows a user to verify a document's integrity without contacting the document server.
- Published
- 2001
17. 6th JSIK SGML/XML Forum
- Author
-
Tim Bray
- Subjects
World Wide Web ,DocBook ,Information retrieval ,Computer science ,Document type declaration ,XML Schema Editor ,computer.internet_protocol ,Efficient XML Interchange ,computer.file_format ,Document type definition ,SGML ,computer ,XML
- Published
- 2001
18. Mapping the XML data model into the object model of the SYNTHESIS language
- Author
-
Leonid A. Kalinichenko, O. L. Machul'sky, and M. A. Osipov
- Subjects
Document Structure Description ,Programming language ,Document type declaration ,Computer science ,computer.internet_protocol ,XML validation ,Well-formed document ,Document type definition ,computer.software_genre ,Simple API for XML ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,XML schema ,computer ,Software ,XML ,computer.programming_language
- Abstract
In this paper, a mapping of the XML document structure into the canonical data model is studied [5]. The XML document structure is specified by the Document Type Definition (DTD). DTD serves as a basis for a specification in the SYNTHESIS language; each DTD element declaration is mapped into some data type of SYNTHESIS.
- Published
- 2000
19. Structured storage and retrieval of SGML documents using Grove
- Author
-
Hak-Gyoon Kim and Sung-Bae Cho
- Subjects
Document Structure Description ,Information retrieval ,Database ,Document type declaration ,Computer science ,Search engine indexing ,Document type definition ,computer.file_format ,Library and Information Sciences ,Management Science and Operations Research ,computer.software_genre ,Computer Science Applications ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Media Technology ,Information system ,Processing Instruction ,HyTime ,SGML ,computer ,Information Systems
- Abstract
SGML, standardized in ISO 8879 [International Organization for Standardization (1986)], has proliferated because it can provide various styles and transform documents on different platforms. The SGML document has logical structure information in addition to the contents. As SGML documents are widely used, there is an increasing demand for a storage and retrieval system to use the logical structure of documents efficiently. However, traditional retrieval systems based on document indexes cannot exploit the logical structure appropriately. In this paper, we have developed a document storage and retrieval system based on structure information, where the SGML document is transformed into Grove, which is the document model for DSSSL and HyTime, and stored at an element level by an object-oriented DBMS, Object Store. It supports structured documents and provides a query interface to retrieve information contained in the structures.
- Published
- 2000
20. Standard Generalized Markup Language for self-defining structured reports
- Author
-
Charles E. Kahn
- Subjects
Information retrieval ,Medical Records Systems, Computerized ,Standardization ,Computer science ,Document type declaration ,Unified Medical Language System ,Information Storage and Retrieval ,Health Informatics ,SGML entity ,computer.file_format ,Document type definition ,Open standard ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Programming Languages ,SGML ,computer ,Information Systems ,PCDATA
- Abstract
Structured reporting is the process of using standardized data elements and predetermined data-entry formats to record observations. The Standard Generalized Markup Language (SGML; International Standards Organization (ISO) 8879:1986)—an open, internationally accepted standard for document interchange—was used to encode medical observations acquired in an Internet-based structured reporting system. The resulting report is self-documenting: it includes a definition of its allowable data fields and values encoded as a report-specific SGML document type definition (DTD). The data-entry forms, DTD, and report document instances are based on report specifications written in a simple, SGML-based language designed for that purpose. Reporting concepts can be linked with those of external vocabularies such as the Unified Medical Language System (UMLS) Metathesaurus. The use of open standards such as SGML is an important step in the creation of open, universally comprehensible structured reports.
- Published
- 1999
21. Document structure and markup in the FRESS hypertext system
- Author
-
Steven J. DeRose and Andries van Dam
- Subjects
Document Structure Description ,XHTML ,Information retrieval ,Markup language ,Document type declaration ,Computer science ,Well-formed document ,law.invention ,World Wide Web ,law ,Hypertext ,computer ,computer.programming_language ,PCDATA
- Published
- 1999
22. [Untitled]
- Author
-
Xien Fan, Qianhong Liu, and Peter A. Ng
- Subjects
Information retrieval ,Database ,Computer science ,Document type declaration ,Frame (networking) ,Word processing ,Well-formed document ,Document type definition ,Document clustering ,computer.software_genre ,Document processing ,Data_FILES ,General Earth and Planetary Sciences ,Document retrieval ,computer
- Abstract
TEXPROS (TEXt PROcessing System) is an automatic document processing system which supports text-based information representation and manipulation, conveying meanings from stored information within office document texts. A dual modeling approach is employed to describe office documents and support document search and retrieval. The frame templates for representing document classes are organized to form a document type hierarchy. Based on its document type, the synopsis of a document is extracted to form its corresponding frame instance. According to the user predefined criteria, these frame instances are stored in different folders, which are organized as a folder organization (i.e., repository of frame instances associated with their documents). The concept of linking folders establishes filing paths for automatically filing documents in the folder organization. By integrating document type hierarchy and folder organization, the dual modeling approach provides efficient frame instance access by limiting the searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries. This paper presents an agent-based document filing system using folder organization. A storage architecture is presented to incorporate the document type hierarchy, folder organization and original document storage into a three-level storage system. This folder organization supports effective filing strategy and allows rapid frame instance searches by confining the search to the actual predicate-driven retrieval method. A predicate specification is proposed for specifying criteria on filing paths in terms of user predefined predicates for governing the document filing. A method for evaluating whether a given frame instance satisfies the criteria of a filing path is presented. The basic operations for constructing and reorganizing a folder organization are proposed.
- Published
- 1999
23. The origin of (document) species
- Author
-
Adam Rifkin and Rohit Khare
- Subjects
XHTML ,RuleML ,HTML5 ,Markup language ,Information retrieval ,Document type declaration ,computer.internet_protocol ,Computer science ,Electronic document ,General Engineering ,Well-formed document ,computer.file_format ,Document type definition ,HTML ,World Wide Web ,Metadata ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,SGML ,computer ,XML ,PCDATA ,computer.programming_language
- Abstract
The World Wide Web's extraordinary reach is based in part on its open assimilation of document formats. Although Web transfer protocols and addressing can accommodate any kinds of resources, the unique application context of a truly global hypermedia system favours the adoption of certain Web-adapted formats. In this paper we consider the evolutionary record that has led to the ascent of the eXtensible Markup Language (XML). We present a taxonomy of document species in the Web according to their syntax, style, structure, and semantics. We observe the preferential adoption of SGML, CSS, HTML, and XML, respectively, which leverage a parsimonious evolutionary strategy favouring declarative encodings over Turing-complete languages; separable styles over inline formatting; declarative markup over presentational markup; and well-defined semantics over operational behavior. The paper concludes with an evolutionary walkthrough of citation formats. Ultimately, combined with the self-referential power of the Web to document itself, we believe XML can catalyze a critical shift of the Web from a global information space into a universal knowledge network.
- Published
- 1998
24. SGML and patent document processing. Part II: Experience in the EPO
- Author
-
Paul Brewin
- Subjects
Markup language ,Information retrieval ,Renewable Energy, Sustainability and the Environment ,Document type declaration ,Computer science ,Process Chemistry and Technology ,Energy Engineering and Power Technology ,Bioengineering ,computer.file_format ,Document type definition ,European patent office ,Library and Information Sciences ,Computer Science Applications ,World Wide Web ,Fuel Technology ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Processing Instruction ,SGML ,computer ,Patent document ,PCDATA
- Abstract
In this article, the practical use of SGML (Standard Generalized Markup Language) in the EPO (European Patent Office) is described: it discusses the history of the SGML project and how SGML is used in the production of patent documents, databases and CD-ROMs.
- Published
- 1997
25. A formal language model for parsing SGML
- Author
-
R. W. Matzen, K. M. George, and G. E. Hedrick
- Subjects
Parsing ,Computer science ,Programming language ,Document type declaration ,business.industry ,Document type definition ,computer.file_format ,SGML entity ,computer.software_genre ,TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES ,Hardware and Architecture ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Regular expression ,Language model ,Artificial intelligence ,SGML ,business ,computer ,Software ,Natural language processing ,Information Systems - Abstract
The Standard Generalized Markup Language (SGML) is an international standard for document definition (ISO 8879) that was adopted in 1986 and is rapidly gaining acceptance in industry and government. It is a meta-language system for document design rather than a specific scheme for document processing; almost any kind of document can be described using SGML. Productions called element declarations are used to define arbitrary elements of documents and the context in which they can occur. A finite set of element declarations called a document type definition (DTD) defines the high-level syntax of a set of documents. DTDs are similar to context-free grammars, but the productions are more complex. The standard does not describe a formal language model for SGML, and there is little work in the literature on this topic. This article defines a formal language model for SGML: systems of finite automata derived from systems of regular expressions. This model is applied in two ways: a parser is constructed for DTDs, and methods are shown for automatically constructing parsers for the documents defined by a DTD. These methods for parsing SGML are new, and they include features of DTDs that have not previously been included in a static language model. The model applies directly to the syntactic constructs of SGML, and thus the methods shown in this article have distinct advantages for parsing SGML over traditional context-free parsing methods.
- Published
- 1997
26. [Untitled]
- Author
-
Ron Sacks-Davis, Brian Lowe, and Justin Zobel
- Subjects
Structure (mathematical logic) ,Markup language ,Information retrieval ,Document type declaration ,business.industry ,Computer science ,Representation (systemics) ,computer.file_format ,Basis (universal algebra) ,computer.software_genre ,Query language ,Expression (mathematics) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,General Earth and Planetary Sciences ,Artificial intelligence ,business ,SGML ,computer ,Natural language processing - Abstract
Most documents have a hierarchical structure, which can be made explicit by markup languages such as SGML. In this paper we propose a formal model for representation of hierarchically structured documents, to be used as the basis for document query languages. The model uses a redundant representation of the document elements to simplify the expression of common queries. As an illustration of the power of the model we show how queries might be expressed, both as set-theoretic expressions and in a simple algebra, and outline how queries might be evaluated in a practical system.
- Published
- 1997
27. Extensible Markup Language Document Management
- Author
-
Tatiana Kovacikova and Giovanni Bartolomeo
- Subjects
XHTML ,Markup language ,RuleML ,Computer science ,Document type declaration ,Programming language ,computer.file_format ,computer.software_genre ,Synchronized Multimedia Integration Language ,Document Definition Markup Language ,computer ,Collaborative Application Markup Language ,PCDATA ,computer.programming_language - Published
- 2013
28. A Practical Method for Compatibility Evaluation of Portable Document Formats
- Author
-
Dariusz Król and Michał Łopatka
- Subjects
Document Structure Description ,Information retrieval ,Database ,Document type declaration ,Computer science ,Well-formed document ,Document management system ,Document type definition ,computer.software_genre ,Simple API for XML ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,Document layout analysis - Abstract
This paper presents a method for verification of PDF documents for compatibility with publication models provided by scientific publishers. We first consider the problem of converting a document from PDF to XML format. Subsequently, we present an analysis of the document's graphical layout which operates in two phases. The first phase develops a model using a semi-automatic process with limited user interaction. This is followed by comparing and matching of submitted documents. The experimental results demonstrate the degree of document compatibility with the model along with a report of errors and warning messages.
- Published
- 2013
29. SGML and patent document processing. Part I: WIPO Standard ST.32
- Author
-
Paul Brewin
- Subjects
Markup language ,Information retrieval ,Renewable Energy, Sustainability and the Environment ,Computer science ,Document type declaration ,Process Chemistry and Technology ,Energy Engineering and Power Technology ,Bioengineering ,computer.file_format ,European patent office ,Document type definition ,Library and Information Sciences ,Computer Science Applications ,World Wide Web ,Fuel Technology ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,SGML ,Electronic filing ,Patent document ,computer - Abstract
A description of SGML (Standard Generalised Markup Language) is given together with a detailed description of WIPO Standard ST.32. The benefits of the use of SGML are highlighted: its system independence and its flexibility in building publication systems and full-text databases. The use of SGML for patent document processing is discussed, including how patent departments and representatives might benefit from using SGML in their own document systems and for electronic filing of applications. Reference is made to its use in the European Patent Office.
- Published
- 1996
30. YAdumper: extracting and translating large information volumes from relational databases to structured flat files
- Author
-
José M. Fernández and Alfonso Valencia
- Subjects
Statistics and Probability ,SQL ,Databases, Factual ,computer.internet_protocol ,Relational database ,Computer science ,Information Storage and Retrieval ,computer.software_genre ,Biochemistry ,Database design ,Information schema ,User-Computer Interface ,Entity–relationship model ,Object-relational impedance mismatch ,Molecular Biology ,computer.programming_language ,Database model ,Electronic Data Processing ,Database ,Information Dissemination ,Document type declaration ,Computational Biology ,Computer Science Applications ,Computational Mathematics ,Computational Theory and Mathematics ,Relational model ,Database Management Systems ,Database theory ,Semi-structured data ,computer ,Algorithms ,Software ,XML - Abstract
Summary: Downloading the information stored in relational databases into XML and other flat formats is a common task in bioinformatics. This periodical dumping of information requires considerable CPU time, disk and memory resources. YAdumper has been developed as a purpose-specific tool to deal with the integral structured information download of relational databases. YAdumper is a Java application that organizes database extraction following an XML template based on an external Document Type Declaration. Compared with other non-native alternatives, YAdumper substantially reduces memory requirements and considerably improves writing performance. Availability: YAdumper is freely available.
- Published
- 2004
31. HTML to the max: a manifesto for adding SGML intelligence to the World-Wide Web
- Author
-
C. M. Sperberg-McQueen and Robert F. Goldstein
- Subjects
Numeric character reference ,Markup language ,Style sheet ,Document type declaration ,Programming language ,Computer science ,General Engineering ,Document type definition ,computer.file_format ,SGML entity ,computer.software_genre ,World Wide Web ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Processing Instruction ,SGML ,computer - Abstract
HTML demonstrates that SGML markup is useful for networked information. How can it be made even more useful? One way is to extend the tag set from HTML to HTML2, etc. We argue here for a more radical approach: full SGML awareness in WWW. We believe the difficulties are small, the cost affordable, and the advantages overwhelming. SGML is a metalanguage for defining markup languages; HTML is just one instance of this infinite family. At present, documents in other SGML document types must be translated into HTML for display by a Mosaic client—sometimes this imposes unacceptable information loss. WWW browsers could handle other SGML document types without translation by launching a general-purpose SGML browser to view them, as they now launch graphics viewers; a better solution overall would be to build SGML display into the WWW browsers themselves. Either way, display of an SGML document would be controlled by a style sheet using a small number of display primitives (“bold”, “line break”, etc.) to specify the rendition of each element type. For “well-known” document type definitions (DTDs) like HTML, style sheets could be distributed with the browser, or built in. For other DTDs, the browser would fetch a style sheet from the server. Using style sheets, browser software can also make it easy to customize document display. DTDs and style sheets can be designed to accommodate extensions, ensuring that authors can make small extensions to the tag set with no change whatsoever in the target browsers and virtually no performance penalty.
- Published
- 1995
32. Transformation list for SGML application
- Author
-
Hong Gao
- Subjects
Numeric character reference ,Information retrieval ,Computer science ,Interface (Java) ,Programming language ,Document type declaration ,Document type definition ,SGML entity ,computer.file_format ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Computational Theory and Mathematics ,Hardware and Architecture ,Application domain ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Processing Instruction ,SGML ,computer ,Software - Abstract
SGML (Standard Generalized Markup Language) is an ISO standard for document description (ISO 8879). The main idea in SGML is to specify a document both by its text and by its structure, without reference to a particular processing system. This kind of document description makes document interchange practical. However, very few SGML systems have a friendly interface and are portable across many applications. In this paper, various approaches to implementing SGML are assessed and the transformation list for SGML application is introduced. This approach is not limited to specific application fields. It is suitable for any application domain and is friendly to users. Users can understand it without any training and can use it as easily as doing their routine work. It will accelerate the development of document interchange.
- Published
- 1995
33. The qwertz synthesis of SGML and LaTeX
- Author
-
Thomas F. Gordon
- Subjects
Unix ,Markup language ,Computer science ,Programming language ,Document type declaration ,Document type definition ,computer.file_format ,Document processing ,computer.software_genre ,troff ,Hardware and Architecture ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,SGML ,Law ,computer ,Software ,De facto standard - Abstract
Markup languages are used to identify and delimit the components of manuscripts. The principal application of these languages is to provide a means for authors to mark up their manuscripts with the information required by publishers for typesetting. LaTeX is a popular de facto standard markup language in some technical communities, such as academic computer science. SGML is an official ISO standard for defining markup languages. The qwertz document processing system is an SGML application we have developed for our own use, intended to combine the advantages of SGML and LaTeX. It consists of a model of LaTeX as an SGML Document Type Declaration (DTD) and Unix tools for translating SGML documents using this DTD into LaTeX, as well as troff. This article discusses our experiences in building and using the system.
- Published
- 1995
34. HTML5: The New Semantics and New Approaches to Document Markup
- Author
-
Joshue O. Connor
- Subjects
World Wide Web ,XHTML ,RuleML ,Markup language ,Information retrieval ,HTML5 ,Computer science ,Document type declaration ,Document Definition Markup Language ,computer ,PCDATA ,Collaborative Application Markup Language ,computer.programming_language - Abstract
In this chapter, you’ll start to look at the HTML5 specification in more detail, especially the aspects of it that most relate to the development of accessible interfaces. There are many new APIs that do background client/server processing and data storage that can be leveraged for rich, responsive applications, but you’ll be seeing mostly the aspects of HTML5 that impact accessibility for users.
- Published
- 2012
35. The latest information related technology. SGML and full-text database
- Author
-
Hidehiro Ishizuka
- Subjects
Information retrieval ,Computer science ,Document type declaration ,Text database ,business.industry ,computer.file_format ,Document type definition ,law.invention ,World Wide Web ,law ,Electronic publishing ,Hypertext ,SGML ,business ,computer ,PCDATA - Abstract
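This article describes SGML (Standard Generalized Markup Language) and its application to full-text databases. Here, a full-text database means one that includes figures, tables, and images. In an SGML-based full-text database, the structure is expressed by a DTD (document type definition) written in SGML, and the text itself is described using generalized markup that conforms to the DTD. The article explains, with examples, how to represent document structure: hierarchical structure such as chapters, sections, and paragraphs, and non-hierarchical (reference) structure such as notes, figures, tables, and images. It also covers the benefits of SGML, electronic publishing, retrieval systems, hypertext, and SGML-related tools.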
- Published
- 1994
36. Version-aware XML documents
- Author
-
Ethan V. Munson and Cheng Thao
- Subjects
Document Structure Description ,Information retrieval ,Document type declaration ,computer.internet_protocol ,Computer science ,Well-formed document ,Document management system ,computer.software_genre ,World Wide Web ,Document Schema Definition Languages ,computer ,Software versioning ,XML ,Vision document - Abstract
A document often goes through many revisions before it is finalized. In the normal document creation process, newer revisions overwrite older ones and only the final revision is kept. At any stage of document creation, it might be desirable to see how the document came to its current form or to revert to a previous revision. Conventional version control tools such as CVS could help authors do exactly this. However, these tools are unlikely to be adopted by non-technical document authors due to the overhead of managing a repository and the tools' learning curves. This paper presents an approach called version-aware documents that embeds versioning data within the document, thus making version control for single documents a seamless part of the authoring process.
- Published
- 2011
37. HTML
- Author
-
Mario Heiderich, Eduardo Alberto Vela Nava, Gareth Heyes, and David Lindsay
- Subjects
Markup language ,Computer science ,business.industry ,Document type declaration ,Scalable Vector Graphics ,Document type definition ,computer.file_format ,HTML ,JavaScript ,World Wide Web ,Web page ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Web application ,business ,computer ,computer.programming_language - Abstract
Publisher Summary: This chapter discusses HTML (HyperText Markup Language), the markup language for structuring Web pages. Mastering HTML from a security point of view, in terms of both attack and defense, is complicated and requires almost encyclopedic knowledge. This chapter attempts to provide that knowledge. In addition to discussing the HTML family and its hidden gems for attackers and trapdoors for defenders, this chapter sheds some light on the differences between the different HTML standards and their actual implementations. The history and basic elements of HTML and markup languages are discussed to get a better understanding of how and where to obfuscate. Some ways to obfuscate markup include execution of JavaScript, the obfuscation of a URL, or even a DoS attack against the client rendering the markup. Markup and HTML are difficult to parse and secure, and the user agents make this task difficult by allowing crazy combinations of characters, attributes, and tags to execute JavaScript. HTML is usually part of an attack against Web applications; although it is called a “markup language,” it is very powerful and should be treated with respect.
- Published
- 2011
38. A novel XML-based document format with printing quality for web publishing
- Author
-
Yinyan Yu, Liangcai Gao, Zhi Tang, and Ruiheng Qiu
- Subjects
Document Structure Description ,Computer science ,computer.internet_protocol ,Document type declaration ,business.industry ,XML validation ,Well-formed document ,Document management system ,computer.software_genre ,World Wide Web ,Simple API for XML ,Publishing ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Single source publishing ,Document engineering ,business ,computer ,XML - Abstract
Although many XML-based document formats are available for printing or publishing on the Internet, none of them is well designed to support both high-quality printing and web publishing. Therefore, we propose a novel XML-based document format for web publishing, called CEBX, in this paper. The proposed format is a fixed-layout document supporting high-quality printing, which has optimized document content organization, physical structure and protection scheme to support web publishing. There are four noteworthy features of CEBX documents: (1) CEBX provides original fixed layout by graphic units for printing quality. (2) The content in a CEBX document can be reflowed to fit the display device based on the content blocks and additional fluid information. (3) XML Document Archiving model (XDA), the packaging model used in CEBX, supports document linearization and incremental editing well. (4) By introducing a segment-based content protection scheme into CEBX, some part of a document can be previewed directly while the remaining part is protected effectively, such that readers only need to purchase the parts of a book that they are interested in. This will be very helpful to document distribution and will support flexible business models such as try-before-buy, on-demand reading, superdistribution, etc.
- Published
- 2010
39. Document recognition
- Author
-
Nenad Marovac
- Subjects
Multiple document interface ,Information retrieval ,Document type declaration ,Computer science ,Programming language ,Well-formed document ,Document management system ,computer.software_genre ,User requirements document ,Document processing ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,General Materials Science ,computer ,Document layout analysis - Abstract
Document recognition is a task in which a document in its physical presentation format is transformed into a structured author-oriented model of the document. The presentation format can be bitmaps of document pages, a description of the document in a Page Description Language (PDL), or encoding of the document in a printer or graphics language. The structured model is a format allowing for addition to the document, manipulation of the document, and reformatting the layout and the output appearance of the document. Fully automatic document recognition is not possible, in general, for the same reason that it is not possible to de-translate computer programs automatically. However, it is possible to develop a man-assisted semi-automatic document recognition method. This method uses two passes. The first pass is completely automatic; it produces a document format called Interactive Document Model. The Interactive Document Model comprises recognized typesetting and descriptive structures together with derived ODA logical and layout structures for the document. The model generated in the first pass is enough for most purposes and applications. However, if it is not acceptable, the user can then enter the second pass and interactively edit the logical structure. This paper has three objectives. The first is to formalize the concept of document recognition. The second is to subdivide the problem of document recognition and classify it into a number of subproblems, each dealing with different aspects of the problem. The third objective is to introduce a problem which we wish to solve, and then to present a High Level Document Recognition method and the experience in developing and using a number of implementations of the method.
- Published
- 1992
40. Organizational Hypermedia Document Management Through Metadata
- Author
-
Garp Choong Kim and Woojong Suh
- Subjects
Computer science ,Document type declaration ,Hypermedia ,Well-formed document ,Document management system ,Document type definition ,computer.software_genre ,law.invention ,Metadata ,World Wide Web ,law ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Meta element ,computer - Abstract
Web business systems, the most popular application of hypermedia, typically include a lot of hypermedia documents (hyperdocuments), which are also called Web pages. These systems have been conceived as an essential instrument in obtaining various beneficial opportunities for CRM (customer relationship management), SCM (supply chain management), e-banking or e-stock trading, and so forth (Turban et al., 2004). Most companies have made a continuous effort to build such systems. As a result, the number of hyperdocuments in organizations is growing explosively. The hyperdocuments employed for business tasks in Web business systems may be referred to as organizational hyperdocuments (OHDs). OHDs typically play a critical role in business, including the forms of invoices, checks, orders, and so forth. An organization's ability to adapt its OHDs rapidly to ever-changing business requirements may affect business performance. However, maintaining the continuously growing body of OHDs is becoming a burdensome task for many organizations; managing them is as important to economic success as is software maintenance (Brereton et al., 1998). One approach to the challenge of managing OHDs is to use metadata. Metadata are generally known as data about data (or information about information). Concerning this approach, this article first reviews previous studies and discusses perspectives desirable for managing OHDs, and then provides a metadata classification and elements. Finally, this article discusses future trends and draws a conclusion.
- Published
- 2009
41. The X Factor: From HTML to XHTML
- Author
-
N. Perlin
- Subjects
XHTML ,Computer science ,Document type declaration ,computer.file_format ,Character encodings in HTML ,HTML element ,World Wide Web ,XML framework ,Wireless Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Document Object Model ,computer ,computer.programming_language ,XForms - Abstract
Created by Tim Berners-Lee in 1989/1990, HTML was the heart of the World Wide Web. Today, HTML is dead, replaced by XHTML, which is HTML reformulated as an instance of XML. As technical communication moves into an era in which Flare works in native XHTML, ePublisher Professional outputs XHTML, and so on, it's important to know what XHTML is in order to understand how it will affect your work and choice of tools. This paper summarizes XHTML. The conference presentation will go into more detail.
- Published
- 2006
42. Document Markup for the Web
- Author
-
Michael Kohlhase
- Subjects
World Wide Web ,XHTML ,Markup language ,Document type declaration ,Computer science ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Well-formed document ,Document type definition ,HTML ,computer ,PCDATA ,computer.programming_language - Abstract
Document markup is the process of adding codes to a document to identify the structure of a document and to specify the format in which its fragments are to appear. We will discuss two conflicting aspects, structure and appearance, in document markup. As the Internet imposes special constraints on markup formats, we will reflect on its influence.
- Published
- 2006
43. Integrating Translation Services within a Structured Editor
- Author
-
Ali Choumane, Cécile Roisin, Hervé Blanchon, Communication Langagière et Interaction Personne-Système (CLIPS - IMAG), Université Joseph Fourier - Grenoble 1 (UJF)-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), Web, adaptation and multimedia (WAM), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), P. King, and Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique de Grenoble (INPG)-Université Joseph Fourier - Grenoble 1 (UJF)
- Subjects
Machine translation ,Process (engineering) ,Computer science ,computer.internet_protocol ,Well-formed document ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,World Wide Web ,Structured document ,0202 electrical engineering, electronic engineering, information engineering ,Dialog box ,0105 earth and related environmental sciences ,Information retrieval ,Document type declaration ,ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.7: Natural Language Processing/I.2.7.4: Machine translation ,ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.2: Document Preparation/I.7.2.1: Format and notation ,[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing ,Document Schema Definition Languages ,ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.2: Document Preparation/I.7.2.5: Markup languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,020201 artificial intelligence & image processing ,computer ,XML - Abstract
Fully automatic machine translation cannot produce high quality translation; Dialog-Based Machine Translation (DBMT) is the only way to provide authors with a means of translating documents into languages they have not mastered, or do not even know. In such an environment, the author must help the system to understand the document by means of an interactive disambiguation step. In this paper we study the consequences of integrating the DBMT services within a structured document editor (Amaya). The source document (named edited document) needs a companion document enriched with different data produced during the interactive translation process (question trees, answers of the author, translations). The edited document also needs to be enriched (annotated) in order to enable access to the question trees. The enriched edited document and the companion document have to be synchronized in case the edited document is further updated.
- Published
- 2005
44. Separating XHTML content from navigation clutter using DOM-structure block analysis
- Author
-
Mehmet A. Orgun, Constantine Mantratzis, and Steve Cassidy
- Subjects
XHTML ,Information retrieval ,Computer science ,Document type declaration ,Short paper ,Document clustering ,Hyperlink ,World Wide Web ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Clutter ,Clipping (computer graphics) ,Document Object Model ,computer ,computer.programming_language - Abstract
This short paper gives an overview of the principles behind an algorithm that separates the core content of a web document from hyperlinked clutter such as text advertisements and long lists of syndicated references to other resources. Its advantage over other approaches is its ability to identify both loosely and tightly defined "table-like" or "list-like" structures of hyperlinks (from nested tables to simple, bullet-pointed lists) by operating at various levels within the DOM tree. The resulting data can then be used to extract the core content from a web document for semantic analysis or other information retrieval purposes, as well as to aid in the process of "clipping" a web document to its bare essentials for use with hardware-limited devices such as PDAs and cell phones.
- Published
- 2005
45. Contextual Metadata for Document Databases
- Author
-
Airi Salminen, Virpi Lyytikäinen, and Pasi Tiitinen
- Subjects
Information retrieval ,Database ,Computer science ,Document type declaration ,Well-formed document ,computer.software_genre ,Metadata repository ,World Wide Web ,Metadata ,Document Schema Definition Languages ,Synonym ring ,Geospatial metadata ,computer ,Database catalog - Abstract
Metadata has always been an important means to support accessibility of information in document collections. Metadata can be, for example, bibliographic data manually created for each document at the time of document storage. The indexes created by Web search engines serve as metadata about the content of Web documents. In the semantic Web solutions, ontologies are used to store semantic metadata (Berners-Lee et al., 2001). Attaching a common ontology to a set of heterogeneous document databases may be used to support data integration. Creation of the common ontology requires profound understanding of the concepts used in the databases. It is a demanding task, especially in cases where the content of the documents is written in various natural languages. In this chapter, we propose the use of contextual metadata as another means to add meaning to document collections, and as a way to support data integration. By contextual metadata, we refer to data about the context where documents are created (e.g., data about business processes, organizations involved, and document types). We will restrict our discussion to contextual metadata on the level of collections, leaving metadata about particular document instances out of the discussion. Thus, the contextual metadata can be created, like ontologies, independently of the creation of instances in the databases.
- Published
- 2005
46. Practical SGML as an introduction to SGML
- Author
-
Lynne A. Price
- Subjects
World Wide Web ,Information retrieval ,Computer science ,Document type declaration ,Processing Instruction ,General Medicine ,computer.file_format ,Document type definition ,SGML ,computer - Published
- 1996
47. Conversion of PDF documents into HTML: a case study of document image analysis
- Author
-
Hassan Alam and Fuad Rahman
- Subjects
HTML5 ,Information retrieval ,Computer science ,Document type declaration ,Well-formed document ,Document management system ,Document clustering ,HTML ,computer.software_genre ,World Wide Web ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,Document layout analysis ,computer.programming_language - Abstract
Portable document format (PDF) has become the de facto standard in many fields because of its independence of local formatting restrictions and its accurate reproducibility. On the other hand, HTML documents are becoming an integral part of our lives as the dominant form of information exchange within the World Wide Web environment. This paper discusses how image-processing techniques can be used to perform document layout analysis of complex multiple-column PDF documents. This analysis allows the conversion of these documents into the HTML format while keeping the logical and physical layout intact.
- Published
- 2004
48. Document transformation system from papers to XML data based on pivot XML document method
- Author
-
Y. Ishitani
- Subjects
Document Structure Description ,XML Encryption ,computer.internet_protocol ,Computer science ,Efficient XML Interchange ,XML Signature ,Well-formed document ,XSLT ,Document type definition ,computer.software_genre ,Simple API for XML ,XML Schema Editor ,Streaming XML ,XML namespace ,XML schema ,computer.programming_language ,XHTML ,Information retrieval ,Document type declaration ,XML validation ,computer.file_format ,XML framework ,XML database ,XML Schema (W3C) ,Document Schema Definition Languages ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Document Object Model ,computer ,XML ,XML Catalog - Abstract
This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hierarchical structure of document elements is extracted and is described using a DOM tree. Thirdly, this document structure is converted into a pivot XML document described as an XHTML document by an XML parser. Finally, this pivot XML document is transformed into the target XML document by the XML parser with XSLT scripts or specific programs. Experimental results show the method is effective in transforming printed documents to various XML documents.
- Published
- 2004
49. A Correspondence between UML Diagrams and SGML/XML DTDs
- Author
-
Anne Eerola and Eila Kuikka
- Subjects
Document Structure Description ,Markup language ,RuleML ,computer.internet_protocol ,Computer science ,Well-formed document ,SGML entity ,Document type definition ,computer.software_genre ,Unified Modeling Language ,SGML ,Object Constraint Language ,computer.programming_language ,XHTML ,Document type declaration ,Programming language ,XML validation ,computer.file_format ,Geography Markup Language ,XML Schema (W3C) ,Extensible markup ,Document Definition Markup Language ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,computer ,XML ,XML Catalog ,Collaborative Application Markup Language ,PCDATA - Abstract
In this paper, we compare the semantics and structure of the conceptual information presented in the Unified Modeling Language (UML), which is used in analyzing object-oriented systems, and Document Type Definitions (DTD), which define the structures of SGML and XML documents. SGML (Standard Generalized Markup Language) and XML (Extensible Markup Language) are international standards for specifying the notations used for defining structured documents. We present correspondence rules for generating DTDs semiautomatically from UML diagrams. The rules have been developed as a part of the analysis and design method to create the structure definition for a document. As an example, we use a patient record.
- Published
- 2004
50. Automatic generation algorithm of uniform DTD for structured documents
- Author
-
Chun-Sik Yoo, Seon-Mi Woo, and Yong-Sung Kim
- Subjects
Structure (mathematical logic) ,Intranet ,Information retrieval ,Finite-state machine ,Computer science ,Document type declaration ,computer.file_format ,Document type definition ,Tree structure ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Processing Instruction ,SGML ,computer ,Algorithm - Abstract
SGML is the international standard for digital documents used in fields such as intranets and CALS/EC. However, many SGML documents that share a similar structure and are conceptually the same kind of document have different DTDs and are stored in different databases. We propose an algorithm that automatically unifies the DTDs of these SGML documents using a tree structure and finite automata. Building the SGML document database with the proposed algorithm reduces the number of database accesses and increases the efficiency of information retrieval, providing a more effective management and operation environment for SGML document databases.
- Published
- 2003