46 results for "TEXT summarization"
Search Results
2. Objective Argument Summarization in Search
- Author
-
Ziegenbein, Timon, Syed, Shahbaz, Potthast, Martin, Wachsmuth, Henning, Cimiano, Philipp, editor, Frank, Anette, editor, Kohlhase, Michael, editor, and Stein, Benno, editor
- Published
- 2024
- Full Text
- View/download PDF
3. NLP TRANSFORMERS: ANALYSIS OF LLMS AND TRADITIONAL APPROACHES FOR ENHANCED TEXT SUMMARIZATION.
- Author
-
ISIKDEMIR, Yunus Emre
- Subjects
TEXT summarization, DEEP learning, INFORMATION retrieval, NATURAL language processing, LANGUAGE models
- Published
- 2024
- Full Text
- View/download PDF
4. Graph Ranked Clustering Based Biomedical Text Summarization Using Top k Similarity.
- Author
-
Gupta, Supriya, Sharaff, Aakanksha, and Nagwani, Naresh Kumar
- Subjects
ELECTRONIC health records, TEXT summarization, INFORMATION retrieval, AUTOMATION, DATA visualization, MACHINE learning
- Abstract
Text Summarization models facilitate biomedical clinicians and researchers in acquiring informative data from enormous domain-specific literature with less time and effort. Evaluating and selecting the most informative sentences from biomedical articles is always challenging. This study aims to develop a dual-mode biomedical text summarization model to achieve enhanced coverage and information. The research also includes checking the fitment of appropriate graph ranking techniques for improved performance of the summarization model. The input biomedical text is mapped as a graph where meaningful sentences are evaluated as central nodes along with the critical associations between them. The proposed framework utilizes the top-k similarity technique in combination with UMLS and a sampled probability-based clustering method, which aids in unearthing relevant meanings of the biomedical domain-specific word vectors and finding the best possible associations between crucial sentences. The quality of the framework is assessed via different parameters like information retention, coverage, readability, cohesion, and ROUGE scores in clustering and non-clustering modes. The significant benefits of the suggested technique are capturing crucial biomedical information with increased coverage and reasonable memory consumption. The configurable settings of combined parameters reduce execution time, enhance memory utilization, and extract relevant information, outperforming other biomedical baseline models. An improvement of 17% is achieved when the proposed model is checked against similar biomedical text summarizers. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
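The abstract above describes mapping sentences to a graph connected by top-k similarity. A minimal sketch of that idea using plain bag-of-words cosine similarity (the paper's UMLS-informed vectors and clustering are omitted; all names here are illustrative):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topk_similarity_graph(sentences: list[str], k: int = 2) -> dict[int, list[int]]:
    """Connect each sentence only to its k most similar peers."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    graph = {}
    for i, vi in enumerate(vecs):
        sims = [(cosine(vi, vj), j) for j, vj in enumerate(vecs) if j != i]
        sims.sort(reverse=True)
        graph[i] = [j for _, j in sims[:k]]
    return graph
```

Restricting each node to its k strongest edges keeps the graph sparse, which is what makes ranking and clustering tractable on long biomedical documents.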
5. Semantic comparison of texts by the metric approach.
- Author
-
Vakulenko, Maksym O
- Subjects
MACHINE translating, NATURAL language processing, TEXT summarization, SUPERVISED learning, INFORMATION retrieval
- Abstract
A novel approach to the semantic comparison of texts, based on the metric method of calculating semantic distances between lexical units, is put forward. The supplementary semantic information is provided through the semes of the words composing the texts, or through their semantic fields. The proposed method takes into account semantic polarity and yields, for two paraphrase sentences, more feasible results than conventional approaches based on word occurrences. The described approach may be useful for linguistic theory as well as for a variety of Natural Language Processing tasks based on supervised learning that require semantic information: computer lexicography, semantic analysis, information search and retrieval, document classification, text summarization and understanding, machine translation, and others. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
6. Turkish abstractive text document summarization using text to text transfer transformer.
- Author
-
Ay, Betul, Ertam, Fatih, Fidan, Guven, and Aydin, Galip
- Subjects
TEXT summarization, ATTRIBUTION of news
- Abstract
Text summarization is the process of reducing text size while preserving its key points. This process also reduces reading time, which helps readers reach the desired information quickly, especially in today's world where time is ever more valuable. In addition, summarization can be used to extract outstanding information from the text. In this study, we focus on abstractive summarization, which can draw more human-like conclusions from the text. A summarization study was carried out on a dataset collected from online Turkish news sources. Rouge and Bert-score performance metrics were used to evaluate the performance of this study using the text-to-text transfer transformer (T5) method. The precision values of the Rouge-1, Rouge-2, Rouge-L and Bert-score performance metrics obtained in this study were found to be 0.6913, 0.6623, 0.7528 and 0.8718, respectively. Recall values were 0.9210, 0.8917, 0.9183 and 0.9138, respectively. F-measure values were 0.7649, 0.7338, 0.8084 and 0.8913, respectively. Considering the success of these results, a method that obtains successful results for Turkish text summarization is presented and the original dataset is made available to other researchers. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
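The Rouge precision, recall, and F-measure triples reported above are related by the standard harmonic mean. A toy unigram-overlap sketch of how ROUGE-1-style scores are computed (real ROUGE adds stemming, stopword handling, and aggregation over many pairs):

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram overlap: precision, recall, and F-measure (ROUGE-1 style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f}
```

High recall with lower precision, as in the table of values above, indicates summaries that cover the reference well but also include extra tokens.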
7. State-of-the-Art: Short Text Semantic Similarity (STSS) Techniques in Question Answering Systems (QAS)
- Author
-
Amur, Zaira Hassan, Hooi, Yewkwang, Sodhar, Irum Naz, Bhanbhro, Hina, Dahri, Kamran, Ibrahim, Rosdiazli, editor, K. Porkumaran, editor, Kannan, Ramani, editor, Mohd Nor, Nursyarizal, editor, and S. Prabakar, editor
- Published
- 2022
- Full Text
- View/download PDF
8. Interest-Based News Feed
- Author
-
Gupta, Utkarsh, Saini, Ayush, Gupta, Ankush, Mahapatra, Rajendra Prasad, editor, Peddoju, Sateesh Kumar, editor, Roy, Sudip, editor, Parwekar, Pritee, editor, and Goel, Lavika, editor
- Published
- 2022
- Full Text
- View/download PDF
9. Effect of Stemming on Hindi Text Classification.
- Author
-
Pimpalshende, Anjusha, Singh, Preety, and Potnurwar, Archana
- Subjects
TEXT summarization, INFORMATION retrieval, ORAL communication, ELECTRONIC records, SUFFIXES & prefixes (Grammar), TEXT processing (Computer science), PARSING (Computer grammar)
- Abstract
Text classification is very useful for searching the large amount of textual data available online by dividing it into smaller relevant units. Nowadays a large number of digital documents are available in Indian languages. Designing text classifiers for Indian languages is an active research area, so that people can search and read documents in their local languages. The proposed work designs a text classifier for Hindi text documents and shows how stemming affects the performance of Hindi text classifiers. Stemming is the process of converting words in a language to their base or root forms. Stemmers are used for written documents, not for spoken language. The performance of many applications, such as text summarization, Information Retrieval (IR) systems, text classification systems, and syntactic parsing, can be improved by applying stemmers. A stemmer eliminates the suffix or prefix of a word to form the original root word. These root words help in the preprocessing step required by many algorithms. We applied various stemmers to Hindi text classification models. Experiments and results show that the performance of the classifiers is improved by applying stemmers. [ABSTRACT FROM AUTHOR]
- Published
- 2023
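The stemming idea described above (stripping affixes to recover a root) can be sketched as a toy longest-suffix-first stripper. The suffix list below is illustrative English for readability, not the Devanagari suffix inventory a real Hindi stemmer would use:

```python
def strip_suffix(word: str, suffixes: list[str]) -> str:
    """Remove the longest matching suffix, keeping a minimal 3-char stem."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# Illustrative suffix list; a Hindi stemmer would instead strip
# Devanagari suffixes (plural/oblique/case endings).
SUFFIXES = ["ization", "ation", "ing", "ers", "er", "s"]
```

Collapsing inflected forms onto one stem shrinks the feature space, which is why stemming tends to help downstream classifiers.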
10. FactGen: Faithful Text Generation by Factuality-aware Pre-training and Contrastive Ranking Fine-tuning.
- Author
-
Zhibin Lan, Wei Li, Jinsong Su, Xinyan Xiao, Jiachen Liu, Wenhao Wu, and Yajuan Lyu
- Subjects
DEEP learning, TEXT summarization, INFORMATION retrieval, PROBABILITY theory, DATA analysis
- Abstract
Conditional text generation is supposed to generate a fluent and coherent target text that is faithful to the source text. Although pre-trained models have achieved promising results, they still suffer from the crucial factuality problem. To deal with this issue, we propose a factuality-aware pretraining-finetuning framework named FactGen, which fully considers factuality during two training stages. Specifically, at the pre-training stage, we utilize a natural language inference model to construct target texts that are entailed by the source texts, resulting in a more factually consistent pre-training objective. Then, during the fine-tuning stage, we further introduce a contrastive ranking loss to encourage the model to generate factually consistent text with higher probability. Extensive experiments on three conditional text generation tasks demonstrate the effectiveness and generality of our training framework. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
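The abstract does not give the exact form of FactGen's contrastive ranking loss; a common margin-based formulation of the same idea (score the factually consistent candidate above the inconsistent one by at least a fixed margin) looks like this sketch:

```python
def margin_ranking_loss(pos_score: float, neg_score: float,
                        margin: float = 1.0) -> float:
    """Penalize the model whenever a factually inconsistent candidate
    scores within `margin` of the consistent one; zero loss otherwise."""
    return max(0.0, margin - (pos_score - neg_score))
```

During fine-tuning such a term is added to the usual generation loss, nudging probability mass toward factually consistent outputs.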
11. Drug Adverse Event Detection Using Text-Based Convolutional Neural Networks (TextCNN) Technique.
- Author
-
Rawat, Ashish, Wani, Mudasir Ahmad, ElAffendi, Mohammed, Imran, Ali Shariq, Kastrati, Zenun, and Daudpota, Sher Muhammad
- Subjects
CONVOLUTIONAL neural networks, INFORMATION storage & retrieval systems, DEEP learning, MACHINE learning, INFORMATION retrieval
- Abstract
With the rapid advancement in healthcare, there has been exponential growth in the healthcare records stored in large databases to help researchers, clinicians, and medical practitioners with optimal patient care, research, and trials. Since these studies and records are lengthy and time-consuming for clinicians and medical practitioners, there is a demand for new, fast, and intelligent medical information retrieval methods. The present study is part of a project that aims to design an intelligent medical information retrieval and summarization system. The whole system comprises three main modules, namely adverse drug event classification (ADEC), medical named entity recognition (MNER), and multi-model text summarization (MMTS). In the current study, we present the design of the ADEC module for classification tasks, where basic machine learning (ML) and deep learning (DL) techniques, such as logistic regression (LR), decision tree (DT), and text-based convolutional neural network (TextCNN), are employed. To extract features from the text data, TF-IDF and Word2Vec models are employed. To achieve the best performance of the overall system for efficient information retrieval and summarization, an ensemble strategy is employed, where predictions of the selected base models are integrated to boost robustness beyond any single model. The performance results of all the models are promising. TextCNN, with an accuracy of 89%, performs better than the conventional machine learning approaches, i.e., LR and DT with accuracies of 85% and 77%, respectively. Furthermore, the proposed TextCNN outperforms existing adverse drug event classification approaches, achieving precision, recall, and an F1 score of 87%, 91%, and 89%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
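The feature-extraction step above mentions TF-IDF; a minimal sketch of the weighting itself (production systems would typically use scikit-learn's TfidfVectorizer, which adds smoothing and normalization):

```python
from collections import Counter
from math import log

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Term frequency * inverse document frequency for each document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    weighted = []
    for d in docs:
        tf = Counter(d)
        weighted.append({t: (c / len(d)) * log(n / df[t]) for t, c in tf.items()})
    return weighted
```

Terms that appear in every document (here, with raw `log(n/df)`) get zero weight, so the classifier focuses on discriminative vocabulary such as drug and event mentions.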
12. Summarization of Text and Image Captioning in Information Retrieval Using Deep Learning Techniques
- Author
-
P. Mahalakshmi and N. Sabiyath Fatima
- Subjects
Information retrieval, text summarization, deep learning, template generation, deep belief network, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Automated information retrieval and text summarization is a difficult process in natural language processing because of the irregular structure and high complexity of documents. The text summarization process creates a summary by paraphrasing a long text. Earlier models for information retrieval and summarization relied on massive labeled datasets with handcrafted features, leveraged knowledge from a particular domain, and concentrated on narrow sub-domains to improve efficiency. This paper presents a new deep learning (DL) based information retrieval with a text summarization model. The proposed model involves three major processes, namely information retrieval, template generation, and text summarization. Initially, the bidirectional long short-term memory (BiLSTM) approach is employed for retrieving the textual data; it processes each word in a sentence, extracts the information, and embeds it into a semantic vector. Next, the template generation process takes place using the DL model. The deep belief network (DBN) model is employed as a text summarization tool to summarize the textual content. In addition, an image description is generated for the visualized entities that exist in the images. The design of BiLSTM with the DBN model for the text summarization and image captioning process shows the novelty of the work. The performance of the presented method is validated using the Gigaword corpus and DUC corpus. The experimental results show that the proposed DBN model outperformed the compared methods with the maximum precision, recall, and F-score. The image captions are compared with a predefined set of captions that exist for the image, and the performance is evaluated using the BLEU metric.
- Published
- 2022
- Full Text
- View/download PDF
13. A Comprehensive Survey of Abstractive Text Summarization Based on Deep Learning.
- Author
-
Zhang, Mengli, Zhou, Gang, Yu, Wanting, Huang, Ningbo, and Liu, Wenfen
- Subjects
DEEP learning, TEXT summarization, RECORDS management, INFORMATION retrieval, PROBLEM solving
- Abstract
With the rapid development of the Internet, the massive amount of web textual data has grown exponentially, which has brought considerable challenges to downstream tasks, such as document management, text classification, and information retrieval. Automatic text summarization (ATS) is becoming an extremely important means to solve this problem. The core of ATS is to mine the gist of the original text and automatically generate a concise and readable summary. Recently, to better balance and develop these two aspects, deep learning (DL)-based abstractive summarization models have been developed. At present, for ATS tasks, almost all state-of-the-art (SOTA) models are based on DL architecture. However, a comprehensive literature survey is still lacking in the field of DL-based abstractive text summarization. To fill this gap, this paper provides researchers with a comprehensive survey of DL-based abstractive summarization. We first give an overview of abstractive summarization and DL. Then, we summarize several typical frameworks of abstractive summarization. After that, we also give a comparison of several popular datasets that are commonly used for training, validation, and testing. We further analyze the performance of several typical abstractive summarization systems on common datasets. Finally, we highlight some open challenges in the abstractive summarization task and outline some future research trends. We hope that these explorations will provide researchers with new insights into DL-based abstractive summarization. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
14. An Approach for Journal Summarization Using Clustering Based Micro-Summary Generation
- Author
-
Mojeed, Hammed A., Sanoh, Ummu, Salihu, Shakirat A., Balogun, Abdullateef O., Bajeh, Amos O., Akintola, Abimbola G., Mabayoje, Modinat A., Usman-Hamzah, Fatimah E., Silhavy, Radek, editor, Silhavy, Petr, editor, and Prokopova, Zdenka, editor
- Published
- 2020
- Full Text
- View/download PDF
15. An Approach for Video Summarization Using Graph-Based Clustering Algorithm
- Author
-
Yasmin, Ghazaala, Chaterjee, Aditya, Das, Asit Kumar, Das, Asit Kumar, editor, Nayak, Janmenjoy, editor, Naik, Bighnaraj, editor, Pati, Soumen Kumar, editor, and Pelusi, Danilo, editor
- Published
- 2020
- Full Text
- View/download PDF
16. Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset.
- Author
-
Ertam, Fatih and Aydin, Galip
- Subjects
DEEP learning, WEB portals, NEWS agencies, ACCESS to information, PERFORMANCE theory
- Abstract
Exponential increase in the amount of textual data made available on the Internet results in new challenges in terms of accessing information accurately and quickly. Text summarization can be defined as reducing the dimensions of the expressions to be summarized without spoiling the meaning. Summarization can be performed as extractive or abstractive, or using both together. In this study, we focus on abstractive summarization, which can produce more human-like summarization results. For the study we created a Turkish news summarization benchmark dataset from various news agency web portals by crawling the news title, short news, news content, and keywords for the last 5 years. The dataset is made publicly available for researchers. The deep learning network training was carried out using the news headlines and short news contents from the prepared dataset, and the network was then expected to create the news headline as the short news summary. To evaluate the performance of this study, Rouge-1, Rouge-2, and Rouge-L were compared using precision, sensitivity, and F1-measure scores. Performance values were presented for each sentence as well as by averaging the results for 50 randomly selected sentences. The F1-measure values are 0.4317, 0.2194, and 0.4334 for Rouge-1, Rouge-2, and Rouge-L, respectively. Performance results show that the approach is promising for Turkish text summarization studies and the prepared dataset will add value to the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
17. Context-Based Multi-document Summarization
- Author
-
Sonawane, Sheetal, Ghotkar, Archana, Hinge, Sonam, Mandal, Jyotsna Kumar, editor, Sinha, Devadatta, editor, and Bandopadhyay, J.P., editor
- Published
- 2019
- Full Text
- View/download PDF
18. PE-MSC: partial entailment-based minimum set cover for text summarization.
- Author
-
Gupta, Anand, Kaur, Manpreet, Mittal, Sonaali, and Garg, Swati
- Subjects
MAXIMA & minima, NATURAL language processing, MATHEMATICAL connectedness
- Abstract
The notion of Textual Entailment (TE) is an established indicator of text connectedness. It captures semantic relationships between texts. Recently, it has been used successfully for determining sentence salience in many text summarization methods. However, it has been reported in previous works that the standard textual entailment is not ideal for measuring sentence salience. This is because textual entailment relationships between sentences are quite rare in real-world texts. Therefore, we suggest using partial TE to accomplish the task of recognizing standard TE. We present the single document summarization problem as an optimization problem which is solved using a weighted Minimum Set Cover (wMSC) algorithm. In this method, sentences are broken into fragments and Partial TE is used to form sets of fragments. Finally, wMSC is applied to the sets to obtain the minimum set cover, which corresponds to the summary of the document. The results achieved on the DUC 2002 dataset using ROUGE and other quality metrics show that the proposed method outperforms the state of the art. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
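Weighted Minimum Set Cover, as used in the abstract above, is NP-hard, so wMSC methods typically rely on the standard greedy approximation: repeatedly take the set with the lowest weight per newly covered element. A sketch of that greedy step (the paper's fragment and partial-entailment machinery is omitted; a cover is assumed to exist):

```python
def greedy_wmsc(universe: set, sets: dict[str, set],
                weights: dict[str, float]) -> list[str]:
    """Greedy weighted set cover: at each step pick the set with the
    smallest weight per newly covered element until all of `universe`
    is covered. Assumes the given sets can cover the universe."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = min(
            (s for s in sets if sets[s] & uncovered),
            key=lambda s: weights[s] / len(sets[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```

In the summarization setting, elements would be sentence fragments, sets the fragments a candidate sentence (partially) entails, and the chosen sets the summary.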
19. Transformer based contextual text representation framework for intelligent information retrieval.
- Author
-
Bhopale, Amol P. and Tiwari, Ashish
- Subjects
INFORMATION retrieval, TRANSFORMER models, LANGUAGE models, TEXT summarization, CONTEXTUAL learning, EMBEDDING theorems
- Abstract
With the advent of transformer-based architectures, the contextual representation of text data has allowed the query and the document to be represented in low-dimensional dense vector space. These vectors are learned embeddings of fixed sizes, resulting in deeper text understanding. In this study, we designed a pipeline for effectively retrieving documents from a large search space by combining the deeper text understanding capabilities of the transformer-based BERT model and a phrase embedding-based query expansion model. To learn the contextual representations, we fine-tuned a deep semantic matching model by separately encoding the document and the query. The encoder model is based on the Sentence BERT (SBERT) architecture, which separately generates dense vector representations of documents and queries. The study has also addressed the maximum token length limitation of transformer-based models through the summarization of lengthy documents. In addition, to improve the clarity and completeness of short queries and reduce the semantic gap, a phrase embedding-based query expansion model is employed. The documents and their dense vectors are indexed using the Elasticsearch engine, and matched with query vectors for retrieving query-specific documents. Finally, the BERT-based cross-encoder model is used to re-rank the relevant records for each query. It performs full self-attention over the inputs, and yields richer text interactions to produce the final results. To assess performance, experiments are conducted on two well-known datasets, TREC-CDS-2014 and OHSUMED. A comparative analysis is carried out, which clearly demonstrates that the proposed framework produced competitive retrieval results.
• Transformer-based contextual text representation approach is proposed for IR.
• Bi-Encoder architecture is applied to learn the representations of documents & query.
• BERT-based summarization is applied to address the limitation on sequence length.
• Phrase Embedding-based QE technique is employed to expand short and unclear queries.
• Re-ranking of the retrieval results is performed using the cross encoder model. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
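The retrieve-then-re-rank pattern in the abstract above (cheap bi-encoder recall over the whole index, expensive cross-encoder precision over a shortlist) can be sketched generically. The toy `overlap` scorer below stands in for both encoder scores, which is purely an assumption for illustration:

```python
def retrieve_then_rerank(query: str, docs: list[str],
                         fast_score, slow_score,
                         k: int = 10, n: int = 3) -> list[str]:
    """Stage 1: rank all docs with a cheap scorer and keep the top k.
    Stage 2: re-rank those k with an expensive scorer, return top n."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)[:n]

def overlap(q: str, d: str) -> int:
    """Toy word-overlap scorer standing in for a learned encoder score."""
    return len(set(q.split()) & set(d.split()))
```

The design point is cost: the fast scorer touches every document (in the paper, precomputed SBERT vectors in Elasticsearch), while the slow cross-encoder only sees the shortlist.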
20. Summarization of Lengthy Legal Documents via Abstractive Dataset Building: An Extract-then-Assign Approach.
- Author
-
Jain, Deepali, Borah, Malaya Dutta, and Biswas, Anupam
- Subjects
LEGAL documents, TEXT summarization, AUTOMATIC summarization, FIRE testing, INFORMATION retrieval
- Abstract
Development of effective automatic summarization approaches for legal documents suffers from several challenges, such as extremely long document-summary pairs and a lack of large-scale training datasets with tractable document-summary token lengths. In this work, we deal with the problem of legal document summarization by building a modified abstractive dataset from the original dataset. This ensures that the length of each document-summary pair is manageable and can be processed by state-of-the-art summarization approaches (such as BART). Secondly, we deal with the data scarcity problem by creating more training samples from each of the original document-summary pairs. This is done by creating multiple extractive summaries from each sample in the original dataset, after which ground-truth summary sentences are assigned to each extractive summary to generate new training samples. This results in a larger training dataset that can be utilized for fine-tuning summarization models. Our proposed approach has been evaluated on two different legal datasets: BillSum and the Forum of Information Retrieval Evaluation (FIRE). With respect to the ROUGE metrics, the proposed approach outperforms the pre-trained BART model fine-tuned on the original dataset by 3-8% for the FIRE test sets, and by 1-3% for the BillSum test sets. Considering the BERTScore metrics, the proposed approach obtains 1-2% improvements on the FIRE test sets, while for the BillSum test sets 3-8% improvements are observed. Such improvements suggest that the proposed dataset-building approach can help achieve improved abstractive summarization of lengthy legal documents.
• Lengthy nature and data scarcity are the two main challenges with legal documents.
• Lengthy legal document summarization is handled by the proposed approach.
• The data scarcity problem is handled by creating a feasible abstractive dataset.
• A novel Extract-Then-Assign (ETA) approach is proposed.
• ETA approach can greatly improve abstractive summarization of legal documents. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
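The "assign" step of the Extract-Then-Assign idea above can be sketched as matching each ground-truth summary sentence to the extractive summary it overlaps most. The unigram-overlap criterion is an illustrative stand-in, since the abstract does not specify the exact assignment rule:

```python
def assign_summary_sentences(extracts: list[str],
                             gt_sentences: list[str]) -> dict[int, list[str]]:
    """Assign each ground-truth summary sentence to the extractive
    summary it shares the most words with, yielding new (extract,
    partial-summary) training pairs."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))

    assigned = {i: [] for i in range(len(extracts))}
    for gt in gt_sentences:
        best = max(range(len(extracts)), key=lambda i: overlap(extracts[i], gt))
        assigned[best].append(gt)
    return assigned
```

Each (extract, assigned sentences) pair is short enough to fit a model like BART, which is the point of the dataset-building step.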
21. Trainable Framework for Information Extraction, Structuring and Summarization of Unstructured Data, Using Modified NER.
- Author
-
Banerjee, Partha Sarathy, Chakraborty, Baisakhi, Anand, Utkarsh, and Upadhyay, Harsh
- Abstract
The World Wide Web is an ever expanding source of data in today's world. Millions of tera-bytes of data and information is getting added every second. In this information age as the data is getting generated at an exponential rate, the fact to be noted is that most of the information is already available is in the form of natural language text. The task of information extraction from mammoth data leads us to think on the quality and the form of available data. Secondly, the ever increasing data poses a challenging task of extracting useful information from the available data. The third task is to extract information as efficiently as possible. For retrieving the information there is a need to develop ingenious way to answer any kind of query put up by a user from the available unstructured data. This paper proposes a novel trainable and integrated Natural Language Information Interpretation and Representation System (NLIIRS) that accepts any available un-annotated corpus of data in the form of natural language, and performs the following tasks: finds out the useful data, extracts relevant information in usable form (structured form/tables), summarizes the data and structures the data in relational form. At the end the Question and Answering (Q&A) module shows the cognitive abilities of NLIIR system by answering the questions in natural language relevant to the text. This multispecialty system beyond just Q&A. This is a trainable system capable of handling any unstructured data to be transformed into structured and well organized information. It allows the user to ask questions in natural language. It adopts the advantages of a modified named entity recognition so as to bypass the time consuming process of parts of speech tagging while pre-processing the available corpus (data) for information extraction. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
22. ArA*summarizer: An Arabic text summarization system based on subtopic segmentation and using an A* algorithm for reduction.
- Author
-
Bahloul, Belahcene, Aliane, Hassina, and Benmohammed, Mohamed
- Subjects
NATURAL language processing, GRAPH algorithms, ALGORITHMS, INFORMATION retrieval, GRAPH theory, HYBRID systems
- Abstract
Automatic text summarization is a field situated at the intersection of natural language processing and information retrieval. Its main objective is to automatically produce a condensed representative form of documents. This paper presents ArA*summarizer, an automatic system for Arabic single-document summarization. The system is based on an unsupervised hybrid approach that combines statistical, cluster-based, and graph-based techniques. The main idea is to divide the text into subtopics, then select the most relevant sentences in the most relevant subtopics. The selection process is done by an A* algorithm executed on a graph representing the different lexical-semantic relationships between sentences. Experimentation is conducted on the Essex Arabic Summaries Corpus using recall-oriented understudy for gisting evaluation (ROUGE), automatic summarization engineering, merged model graphs, and n-gram graph powered evaluation via regression evaluation metrics. The evaluation results showed the good performance of our system compared with existing works. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
23. Passage-Based Text Summarization for Legal Information Retrieval.
- Author
-
Kanapala, Ambedkar, Jannu, Srikanth, and Pamula, Rajendra
- Subjects
- *
LEGAL literature , *INFORMATION retrieval , *ACCESS to information , *KEY performance indicators (Management) , *DOCUMENT clustering - Abstract
Automatic text summarization is the process of condensing the content of a text document while preserving the most important information. It plays a significant role in various tasks such as text categorization, question answering and information retrieval (IR). As legal information retrieval (LIR) is a subfield of IR, the produced summaries are integrated into the IR system with the objective of decreasing the length of the documents. In this way, the access time for searching information is improved and relevant documents are retrieved. In this article, we present the creation of passage-level summaries (generic and legal) with different compression ratios and evaluate their performance. The generic summaries present the overall description of the essential information of a document, while the legal summaries are produced by taking into account the domain-specific features present in the document. Next, we propose Boosting Okapi BM25, a modified Okapi BM25 model, to increase the efficiency of LIR. We evaluate the proposed LIR approach in terms of MAP and R-precision, and the summarization approach using the ROUGE tool, on the FIRE2013 and FIRE2014 datasets. To show the efficacy of the proposed system, we compare the experimental results with different IR models, namely PL2, In_expB2, In_expC2, InL2, DFR_BM25 and Okapi BM25, in terms of MAP. The experimental results show that the proposed system performs better than the existing IR models on various performance metrics. The empirical results also show that the integration of text summarization and IR techniques helps in retrieving relevant information with less access time. [ABSTRACT FROM AUTHOR]
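The paper's Boosting Okapi BM25 modifies standard Okapi BM25; the abstract does not give the modification, but the baseline scoring function it builds on is standard and can be sketched directly:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Standard Okapi BM25 score of one document for a query.

    corpus: list of tokenized documents (used for document frequencies
    and average length); doc: one tokenized document from the corpus.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # Smoothed IDF, kept non-negative via the +1 inside the log.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

The boosting variant described in the paper would adjust this score with summary-derived evidence; those details are not reproduced here.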
- Published
- 2019
- Full Text
- View/download PDF
24. POS Tagging for Arabic Text Using Bee Colony Algorithm.
- Author
-
Alhasan, Ahmad and Al-Taani, Ahmad T.
- Subjects
NATURAL language processing ,DATA mining ,INFORMATION retrieval ,BEES algorithm ,ARABIC language - Abstract
Part-of-Speech (POS) tagging is the process of automatically determining the proper grammatical tag or syntactic category of a word depending on its context. POS tagging is an essential step in most Natural Language Processing (NLP) applications such as text summarization, question answering, information extraction and information retrieval. In this study, we propose an efficient tagging approach for the Arabic language using the Bee Colony Optimization algorithm. The problem is represented as a graph, and a novel technique is proposed to assign scores to the possible tags of a sentence; the bees then find the best solution path. The proposed approach is evaluated using the KALIMAT corpus, which consists of 18M words. Experimental results show that the proposed approach achieved 98.2% accuracy, compared with 98%, 97.4% and 94.6% for the Hybrid, Hidden Markov Model and Rule-Based methods, respectively. Furthermore, the proposed approach determined all the tags present in the corpus, while the other approaches can identify only three tags. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
25. Hybrid Approach To Abstractive Summarization.
- Author
-
Sahoo, Deepak, Bhoi, Ashutosh, and Balabantaray, Rakesh Chandra
- Subjects
INFORMATION retrieval ,ABSTRACTING ,MARKOV processes ,CLUSTER analysis (Statistics) ,DATA fusion (Statistics) - Abstract
Text summarization is an application of information retrieval in which a short, non-redundant version of a comparatively large text is presented to the end user. In this paper, a hybrid approach to generating abstractive summaries is presented, in which sentences are clustered using sentence-level relationships in association with the Markov clustering principle. Sentence ranking is then performed within each cluster and, where possible, the top-weighted sentence of each cluster is fused with its best-fit sentence (if found) within that cluster, using linguistic rules, to generate a new sentence. The top-ranked sentences from each cluster are then compressed using a classification technique to generate the abstractive summary. The proposed system is evaluated on the DUC 2002 data set and performs better than other existing systems. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
26. Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms.
- Author
-
Bewoor, Mrunal S. and Patil, Suhas H.
- Subjects
TEXT mining ,INFORMATION retrieval ,DOCUMENT clustering ,ALGORITHMS ,ELECTRONIC information resources - Abstract
The availability of various digital sources has created a demand for text mining mechanisms. Effective summary generation mechanisms are needed in order to utilize relevant information from often overwhelming digital data sources. In this view, this paper conducts a survey of various single as well as multi-document text summarization techniques. It also provides analysis of treating a query sentence as a common one, segmented from documents for text summarization. Experimental results show the degree of effectiveness in text summarization over different clustering algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
27. Building a text retrieval system for the Sanskrit language: Exploring indexing, stemming, and searching issues.
- Author
-
Sahu, Siba Sankar and Pal, Sukomal
- Subjects
- *
SANSKRIT language , *STEMMING (Linguistics) , *TEXT mining , *INFORMATION retrieval , *TEXT summarization - Abstract
Stemming is an important pre-processing step in text analysis domains such as text mining, text summarization and information retrieval (IR). In this study, we build a Sanskrit text collection and explore different indexing, stemming and searching strategies for Sanskrit. We also propose two stemmers, a 'light' and an 'aggressive' one, and evaluate their effectiveness in the text analysis task. The performance of the stemmers is evaluated in two ways: a direct and an indirect, IR-based evaluation. In the direct evaluation, we found the stemmers to be effective. In the indirect evaluation, we apply different retrieval models such as BM25, TF-IDF, Divergence from Randomness (DFR) based models and language models. The proposed stemmers are compared with the GRAS stemmer, language-independent indexing (trunc-n) and a no-stemming approach. Among the different stemming methods, the aggressive stemmer provides the best performance. The Hiemstra language model outperforms the other retrieval models we experimented with. In statistical analysis, we found that the proposed stemming approaches produce significantly better results than the no-stemming approach. • We build a Sanskrit text collection for the text analysis domain. • We propose an inflectional and a derivational stemmer for Sanskrit. • The performance of different stemmers is evaluated in the text analysis domain. [ABSTRACT FROM AUTHOR]
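A light stemmer of the kind proposed strips a small set of frequent suffixes once. The sketch below is illustrative only: it uses hypothetical English-like suffixes, whereas the paper's stemmers would use Sanskrit inflectional and derivational endings.

```python
def light_stem(word, suffixes=("ing", "tion", "es", "s", "ed")):
    """Light stemmer: strip the longest matching suffix, at most once.

    The suffix list here is an illustrative placeholder; a Sanskrit
    light stemmer would list inflectional endings instead. The length
    guard avoids reducing a word below a minimal stem of 3 characters.
    """
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

An aggressive stemmer would typically apply such stripping repeatedly or with a larger suffix inventory, trading precision for recall, which matches the paper's finding that the aggressive variant helps retrieval most.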
- Published
- 2023
- Full Text
- View/download PDF
28. An Attention-Based Syntax-Tree and Tree-LSTM Model for Sentence Summarization.
- Author
-
Wenfeng Liu, Peiyu Liu, Yuzhen Yang, Yaling Gao, and Jing Yi
- Subjects
MACHINE learning ,ARTIFICIAL intelligence ,COMPUTER programming ,INFORMATION retrieval ,ARTIFICIAL neural networks - Abstract
Generative summarization is of great importance in understanding large-scale textual data. In this work, we propose an attention-based Tree-LSTM model for sentence summarization, which utilizes an attention-based syntactic structure as auxiliary information. Within the model, block-alignment is used to align the input and output syntax blocks, while inter-alignment is used to align words within those block pairs. To some extent, block-alignment can prevent structural deviations on long sentences, and inter-alignment can increase the flexibility of generation within the blocks. The model can easily be trained end-to-end and can deal with input sentences of any length. Compared with several relatively strong baselines, our model achieves the state of the art on the DUC-2004 shared task. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
29. Recent automatic text summarization techniques: a survey.
- Author
-
Gambhir, Mahak and Gupta, Vishal
- Subjects
INTERNET ,ABSTRACTING ,NATURAL language processing ,MATHEMATICAL optimization ,TEXT mining - Abstract
As information is available in abundance for every topic on the internet, condensing the important information in the form of a summary would benefit a number of users. Hence, there is growing interest among the research community in developing new approaches to automatically summarize text. An automatic text summarization system generates a summary, i.e. a short text that includes all the important information of the document. Since the advent of text summarization in the 1950s, researchers have been trying to improve techniques for generating summaries so that machine-generated summaries match human-made ones. Summaries can be generated through extractive as well as abstractive methods. Abstractive methods are highly complex, as they need extensive natural language processing. Therefore, the research community is focusing more on extractive summaries, trying to achieve more coherent and meaningful summaries. Over the last decade, several extractive approaches have been developed for automatic summary generation that implement a number of machine learning and optimization techniques. This paper presents a comprehensive survey of recent extractive text summarization approaches developed in the last decade. Their needs are identified, and their advantages and disadvantages are listed in a comparative manner. A few abstractive and multilingual text summarization approaches are also covered. Summary evaluation is another challenging issue in this research field. Therefore, both intrinsic and extrinsic methods of summary evaluation are described in detail, along with text summarization evaluation conferences and workshops. Furthermore, evaluation results of extractive summarization approaches are presented on some shared DUC datasets. Finally, this paper concludes with a discussion of useful future directions that can help researchers to identify areas where further research is needed. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
30. Swarm LSA-PSO Clustering Model in Text Summarization.
- Author
-
Oi-Mean Foong and Suet-Peng Yong
- Subjects
PARTICLE swarm optimization ,SINGULAR value decomposition ,K-means clustering ,INFORMATION retrieval ,MACHINE learning - Abstract
The information overload problem has posed a great challenge to internet users in retrieving relevant information accurately over the past decades. It is a tedious task for a machine to mimic human linguists in summarizing documents into meaningful abstractive text; quite often, the summarized text lacks cohesion and becomes difficult to comprehend. The objective of this paper is to investigate whether the proposed Swarm LSA-PSO model performs better than alternative methods. In this study, a term matrix was constructed from term co-occurrences using a Bag-of-Words (BOW) representation. The high dimensionality of the term space was reduced using Singular Value Decomposition, followed by K-Means PSO clustering to acquire an optimal number of concept clusters. These key concepts were used to identify the main gist of the documents for text summarization. The input text documents were downloaded from the Document Understanding Conference (DUC) 2002 dataset. The preliminary results show that the Swarm LSA-PSO model is promising for context-based text summarization using the BOW clustering approach. [ABSTRACT FROM AUTHOR]
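The LSA step described here (a term matrix followed by SVD dimensionality reduction) can be sketched as follows; the subsequent K-Means PSO clustering is omitted:

```python
import numpy as np

def lsa_concepts(term_doc_matrix, k):
    """Reduce a term-document (or term-sentence) count matrix to k
    latent concepts via truncated SVD, the core operation of LSA.

    Returns (term_concept, concept_doc): term loadings scaled by the
    singular values, and the concept-space representation of each
    document column.
    """
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]
```

Clustering the columns of `concept_doc` (e.g. with K-Means, or the PSO-guided variant the paper proposes) then groups documents or sentences by latent concept.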
- Published
- 2016
31. WATS-SMS: A T5-Based French Wikipedia Abstractive Text Summarizer for SMS
- Author
-
Jean Louis Ebongue Kedieng Fendji, Désiré Manuel Taira, Marcellin Atemkeng, and Adam Musa Ali
- Subjects
Information retrieval ,Computer Networks and Communications ,Computer science ,text summarization ,transformers ,Information technology ,T58.5-58.64 ,Gateway (web page) ,Automatic summarization ,Field (computer science) ,algebra_number_theory ,Task (project management) ,GSM ,SMS ,Web page ,gateway ,Use case ,French Wikipedia ,Mobile device ,fine-tuning - Abstract
Text summarization remains a challenging task in the natural language processing field, despite its plethora of applications in enterprises and daily life. One of the common use cases is the summarization of web pages, which has the potential to provide an overview of web pages to devices with limited features. Despite the increasing penetration rate of mobile devices in rural areas, the bulk of those devices offer limited features, and these areas are often covered only by limited connectivity such as the GSM network. Summarizing web pages into SMS therefore becomes an important task for providing information to limited devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS. It is built through a transfer learning approach: the English pre-trained T5 model is used to generate a French text summarization model by retraining it on 25,000 Wikipedia pages, and the result is then compared with different approaches from the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results than extractive ones, and (2) to evaluate the performance of our model compared with other existing abstractive models. A ROUGE-based score gave a value of 52% for articles of up to 500 characters, against 34.2% for transformer-ED and 12.7% for seq2seq-attention, and a value of 77% for longer articles, against 37% for transformers-DMCA. Moreover, an architecture including a software SMS gateway has been developed to allow owners of mobile devices with limited features to send requests and receive summaries through the GSM network.
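The scores reported above are based on ROUGE, which measures n-gram overlap between a candidate summary and a reference. A minimal sketch of ROUGE-N recall:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of the reference's n-grams that also
    appear in the candidate summary (clipped per n-gram).

    candidate, reference: lists of tokens.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Published ROUGE results usually combine several variants (ROUGE-1, ROUGE-2, ROUGE-L); the abstract does not say which combination produced the percentages quoted.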
- Published
- 2021
32. At the interface of computational linguistics and statistics.
- Author
-
Martinez, Angel and Martinez, Wendy
- Subjects
- *
COMPUTATIONAL linguistics , *COMPUTATIONAL statistics , *INFORMATION retrieval , *MACHINE learning , *CONFLICT of interests , *DATA analysis - Abstract
Computational linguistics encompasses a broad range of ideas and research areas, and only a brief introduction is possible here. We chose to include areas in computational linguistics where statisticians can contribute, hoping to provide inspiration to the reader. We describe three main aspects of this discipline: formal languages, information retrieval, and machine learning. These support the overarching goal, which is the representation and analysis of meaning from unstructured text. We then provide an example where text analysis has been applied to unstructured text fields in survey records, and conclude with some applications and computational resources. WIREs Comput Stat 2015, 7:258-274. doi: 10.1002/wics.1353 Conflict of interest: The authors have declared no conflicts of interest for this article. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
33. Multi-task learning for abstractive text summarization with key information guide network
- Author
-
Weiran Xu, Chenliang Li, Minghao Lee, and Chi Zhang
- Subjects
Computer science ,Process (engineering) ,Attention mechanism ,lcsh:TK7800-8360 ,Multi-task learning ,02 engineering and technology ,lcsh:Telecommunication ,0203 mechanical engineering ,lcsh:TK5101-6720 ,Reinforcement learning ,0202 electrical engineering, electronic engineering, information engineering ,Information retrieval ,Artificial neural network ,business.industry ,Deep learning ,lcsh:Electronics ,020302 automobile design & engineering ,Automatic summarization ,Text summarization ,Key (cryptography) ,020201 artificial intelligence & image processing ,Source text ,Artificial intelligence ,business - Abstract
Neural networks based on the attentional encoder-decoder model have good capability in abstractive text summarization. However, these models are hard to control during generation, which can lead to summaries lacking key information. Some key information, such as time, place, and people, is indispensable for humans to understand the main content. In this paper, we propose a key information guide network for abstractive text summarization based on a multi-task learning framework. The core idea is to automatically extract, in an end-to-end way, the key information that people need most, and use it to guide the generation process so as to obtain a more human-compliant summary. In our model, the document is encoded into two parts: the output of the normal document encoder and the key information encoding, where the key information includes key sentences and keywords. A multi-task learning framework is introduced to obtain a more sophisticated end-to-end model. To fuse the key information, we propose a novel multi-view attention guide network to obtain dynamic representations of the source text and the key information. In addition, the dynamic representations are incorporated into the abstractive module to guide the process of summary generation. We evaluate our model on the CNN/Daily Mail dataset, and experimental results show that our model leads to significant improvements.
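The multi-view attention guide network itself is more elaborate, but its building block can be sketched as plain dot-product attention over key-information vectors; all names and shapes below are illustrative, not the paper's architecture:

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: weight the value vectors by the softmax
    of query-key scores and return their weighted sum.

    query: (d,) decoder state; keys, values: (n, d) encodings of, e.g.,
    key sentences or keywords that should guide generation.
    """
    scores = keys @ query
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ values, weights
```

In a guide network of this kind, the attended key-information vector would be concatenated with (or added to) the decoder state at each generation step, biasing the output toward the extracted key content.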
- Published
- 2020
34. The challenging task of summary evaluation: an overview
- Author
-
Laura Plaza, Ahmet Aker, Elena Lloret, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, and Procesamiento del Lenguaje y Sistemas de Información (GPLSI)
- Subjects
Linguistics and Language ,Computer science ,Context (language use) ,02 engineering and technology ,Library and Information Sciences ,Readability ,01 natural sciences ,Language and Linguistics ,Education ,Task (project management) ,Content evaluation ,Multi-document summarization ,0202 electrical engineering, electronic engineering, information engineering ,Relevance (information retrieval) ,0101 mathematics ,Evaluation ,Information retrieval ,4. Education ,010102 general mathematics ,Data science ,Automatic summarization ,Informatik ,Text summarization ,Task-based evaluation ,Lenguajes y Sistemas Informáticos ,020201 artificial intelligence & image processing ,Computational linguistics - Abstract
Evaluation is crucial in the research and development of automatic summarization applications, in order to determine the appropriateness of a summary based on different criteria, such as the content it contains and the way it is presented. Performing an adequate evaluation is of great relevance to ensuring that automatic summaries are useful for the context and/or application they are generated for. To this end, researchers must be aware of the evaluation metrics, approaches, and datasets that are available, in order to decide which of them would be the most suitable to use, or to be able to propose new ones, overcoming the possible limitations that existing methods may present. In this article, a critical and historical analysis of evaluation metrics, methods, and datasets for automatic summarization systems is presented, where the strengths and weaknesses of evaluation efforts are discussed and the major challenges to solve are identified. Therefore, a clear up-to-date overview of the evolution and progress of summarization evaluation is provided, giving the reader useful insights into the past, present and latest trends in the automatic evaluation of summaries. This research is partially funded by the European Commission under the Seventh (FP7 - 2007-2013) Framework Programme for Research and Technological Development through the SAM (FP7-611312) project; by the Spanish Government through the projects VoxPopuli (TIN2013-47090-C3-1-P) and Vemodalen (TIN2015-71785-R), the Generalitat Valenciana through project DIIM2.0 (PROMETEOII/2014/001), and the Universidad Nacional de Educación a Distancia through the project "Modelado y síntesis automática de opiniones de usuario en redes sociales" (2014-001-UNED-PROY).
- Published
- 2017
35. Conceptual Interactive Search Engine Interface for Visually Impaired Web Users
- Author
-
Dena Al-Thani, Ali Jaoua, and Aboubakr Aqle
- Subjects
Information retrieval ,visually impaired users ,business.industry ,Computer science ,End user ,Interface (Java) ,05 social sciences ,text summarization ,Context (language use) ,02 engineering and technology ,information seeking ,Automatic summarization ,search engine interface ,Search engine ,0202 electrical engineering, electronic engineering, information engineering ,Formal concept analysis ,search results representation ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,The Internet ,Tag cloud ,business ,050107 human factors - Abstract
The Internet is the main source of information nowadays. Consequently, end users need to be knowledgeable about how to use search engines in order to locate relevant information in a reasonable time with minimal effort. On the other hand, search engines must provide different and alternative ways to represent search results in order to facilitate user access to information, especially for visually impaired (VI) users. Our research aim is to produce a new representational model for search engine results targeting VI users. The result of this study is a functional prototype that summarizes the search results as main ideas identified as concepts. Formal Concept Analysis (FCA) defines a concept as a maximal set of objects sharing a maximal set of features or attributes. Concepts are discovered by analyzing data patterns in the text under study. The outcome of the first step, the summarization of concepts as keywords, is used to minimize the number of listed websites and URLs that match the user's selection in the multi-level tree of concepts. This summarization scenario can point the user toward the shortest path to the target information, minimizing the time and effort required. These directions can lead the user either to read the whole document in detail, or to continue searching for other related documents that match the user's inquiry. Experiments were run on an iterative testing basis until VI users found proper results that satisfied their needs in the search context. User observations and interpretations based on the experiments were used for the user evaluation. This study will guide us in designing a new model for summarizing search results for VI end users based on the FCA algorithm, with a new representation interface based on the discovered concepts' weights. ACKNOWLEDGMENT This contribution was made possible by NPRP grant No. 07-794-1-145 and GSRA grant No. 04-1-0514-17066 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.
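Formal Concept Analysis, as used above, pairs a maximal set of objects with the attributes they all share. A minimal closure operation over a toy object-attribute context (the names are illustrative):

```python
def close_concept(objects, context):
    """Compute the formal concept generated by a set of objects.

    context: dict mapping each object to its set of attributes.
    Returns (extent, intent): the attributes shared by all the given
    objects (intent), and all objects possessing every one of those
    attributes (extent). The pair is closed: re-applying the operation
    to the extent yields the same concept.
    """
    intent = set.intersection(*(context[o] for o in objects))
    extent = {o for o, attrs in context.items() if intent <= attrs}
    return extent, intent
```

Enumerating all such closed pairs and ordering them by extent inclusion yields the concept lattice from which a multi-level tree of concept keywords, as in the prototype described, could be derived.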
- Published
- 2019
36. A Weighted PageRank-Based Bug Report Summarization Method Using Bug Report Relationships
- Author
-
Beomjun Kim, Seonah Lee, and Sungwon Kang
- Subjects
PageRank ,Exploit ,Computer science ,media_common.quotation_subject ,text summarization ,02 engineering and technology ,law.invention ,data-based software engineering ,law ,Reading (process) ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Quality (business) ,Instrumentation ,media_common ,Fluid Flow and Transfer Processes ,Information retrieval ,Process Chemistry and Technology ,issue tracking system ,General Engineering ,020207 software engineering ,Software maintenance ,Automatic summarization ,Computer Science Applications ,Debugging ,bug report relationships ,020201 artificial intelligence & image processing - Abstract
For software maintenance, bug reports provide useful information to developers because they can be used for various tasks such as debugging and understanding previous changes. However, as they are typically written in the form of conversations among developers, bug reports tend to be unnecessarily long and verbose, with the consequence that developers often have difficulties reading or understanding them. To mitigate this problem, methods that automatically generate a summary of bug reports have been proposed, and various related studies have been conducted. However, existing bug report summarization methods have not fully exploited the inherent characteristics of bug reports. In this paper, we propose a bug report summarization method that uses the weighted PageRank algorithm and exploits the 'duplicates', 'blocks', and 'depends-on' relationships between bug reports. The experimental results show that our method outperforms the state-of-the-art method in terms of both the quality of the summary and the number of applicable bug reports.
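The weighted PageRank at the core of the method can be sketched with power iteration. How the paper derives edge weights from the 'duplicates', 'blocks', and 'depends-on' relationships is not specified in the abstract, so the weight matrix below is a placeholder:

```python
import numpy as np

def weighted_pagerank(W, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration for weighted PageRank.

    W[i, j] is the weight of the edge j -> i (e.g. some score attached
    to a relationship between two bug reports). Columns are normalized
    so each node distributes its score over its outgoing edge weights.
    """
    n = W.shape[0]
    col = W.sum(axis=0).astype(float)
    col[col == 0] = 1.0          # nodes with no out-edges
    M = W / col
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```

On a chain 0 → 1 → 2, the scores increase along the chain, since each node inherits a damped share of its predecessors' scores.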
- Published
- 2019
37. Summarization of Scientific Paper through Reinforcement Ranking on Semantic Link Network
- Author
-
Hai Zhuge and Xiaoping Sun
- Subjects
reinforcement ,Information retrieval ,General Computer Science ,Computer science ,General Engineering ,Cognitive neuroscience of visual object recognition ,text summarization ,02 engineering and technology ,Automatic summarization ,Ranking ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,020201 artificial intelligence & image processing ,General Materials Science ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,natural language processing ,lcsh:TK1-9971 ,Sentence ,Semantics modeling - Abstract
The semantic link network is a semantics modeling method for effective information services. This paper proposes a new text summarization approach that extracts a semantic link network from a scientific paper, consisting of language units of different granularities as nodes and semantic links between the nodes, and then ranks the nodes to select the top-k sentences composing the summary. A set of assumptions for reinforcing representative nodes is defined to reflect the core of the paper. Then, semantic link networks with different types of nodes and links are constructed with different combinations of the assumptions. Finally, an iterative ranking algorithm is designed to calculate the weight vectors of the nodes in a convergent iteration process. The iteration approaches a stable weight vector of sentence nodes, which is ranked to select the top-k high-rank nodes for composing the summary. We designed six types of ranking models on semantic link networks for evaluation. Both objective and intuitive assessments show that ranking the semantic link network of language units can significantly help identify representative sentences. This paper not only provides a new approach to summarizing text based on the extraction of semantic links from text, but also verifies the effectiveness of adopting the semantic link network in rendering the core of a text. The proposed approach can be applied to other summarization applications such as generating an extended abstract, a mind map, or bullet points for making slides of a given paper. It can easily be extended by incorporating more semantic links to improve text summarization and other information services.
- Published
- 2018
38. Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms
- Author
-
M. S. Bewoor and S. H. Patil
- Subjects
Information retrieval ,Text mining ,lcsh:T58.5-58.64 ,Computer science ,business.industry ,lcsh:Information technology ,Digital data ,text summarization ,Automatic summarization ,lcsh:TA1-2040 ,Multi-document summarization ,lcsh:Technology (General) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,lcsh:T1-995 ,Cluster analysis ,business ,lcsh:Engineering (General). Civil engineering (General) ,Relevant information ,Sentence ,clustering - Abstract
The availability of various digital sources has created a demand for text mining mechanisms. Effective summary generation mechanisms are needed in order to utilize relevant information from often overwhelming digital data sources. In this view, this paper conducts a survey of various single as well as multi-document text summarization techniques. It also provides analysis of treating a query sentence as a common one, segmented from documents for text summarization. Experimental results show the degree of effectiveness in text summarization over different clustering algorithms.
- Published
- 2018
39. SemPCA-Summarizer: Exploiting Semantic Principal Component Analysis for Automatic Summary Generation
- Author
-
Óscar Alcón, Elena Lloret, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, Procesamiento del Lenguaje y Sistemas de Información (GPLSI), and Generalitat Valenciana and Spanish Government, projects PROMETEOII/2014/001, TIN2015-65100-R, and TIN2015-65136-C2-2-R.
- Subjects
Information retrieval ,Computer science ,Natural language processing ,Rank (computer programming) ,Automatic text summarization ,General Engineering ,Intelligent information processing ,Principal component analysis ,Natural language processing, human language technologies, intelligent information processing, automatic text summarization, principal component analysis ,68-T50 ,Automatic summarization ,Human language technologies ,Task (project management) ,Lenguajes y Sistemas Informáticos ,Key (cryptography) ,Information system ,Dimension (data warehouse) ,Heuristics ,other areas of Computing and Informatics ,Natural Language Processing ,Text Summarization - Abstract
Text summarization is the task of condensing a document while keeping the relevant information. Integrated into wider information systems, this task can help users access key information without having to read everything, allowing for higher efficiency. In this research work, we have developed and evaluated a single-document extractive summarization approach, named SemPCA-Summarizer, which reduces the dimension of a document using the Principal Component Analysis (PCA) technique enriched with semantic information. A concept-sentence matrix is built from the textual input document, and then PCA is used to identify and rank the relevant concepts, which are used for selecting the most important sentences through different heuristics, thus leading to various types of summaries. The results obtained show that the generated summaries are very competitive, both from a quantitative and a qualitative viewpoint, indicating that our proposed approach is appropriate for briefly providing key information and helps to cope with the huge amount of available information in a quicker and more efficient manner. This research work has been partially funded by the Generalitat Valenciana and the Spanish Government through the projects PROMETEOII/2014/001, TIN2015-65100-R, and TIN2015-65136-C2-2-R.
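The PCA-ranking idea (not the full SemPCA-Summarizer pipeline, whose selection heuristics and semantic enrichment are not detailed here) can be sketched as ranking sentences by their loadings on the top principal components of a concept-sentence matrix:

```python
import numpy as np

def rank_sentences_pca(concept_sentence, k=1):
    """Rank sentences by the magnitude of their loadings on the top-k
    principal components of a concept-sentence matrix.

    concept_sentence: rows are concepts, columns are sentences (e.g.
    concept occurrence counts). Returns sentence indices, most
    important first.
    """
    # Center each concept row, then take the SVD: the leading right
    # singular vectors are the principal directions over sentences.
    X = concept_sentence - concept_sentence.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = np.abs(Vt[:k]).sum(axis=0)   # one weight per sentence
    return np.argsort(-loadings)            # best sentences first
```

Taking the first few indices of the returned order, up to a length budget, yields an extractive summary.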
- Published
- 2018
40. Collection-Document Summaries
- Author
-
Witt, Nils, Granitzer, Michael, Seifert, Christin, Pasi, Gabriella, Piwowarski, Benjamin, Azzopardi, Leif, and Hanbury, Allan
- Subjects
Ground truth ,Information retrieval ,Computer science ,Rake ,Judgement ,Medizin ,02 engineering and technology ,Automatic summarization ,Informatik ,Text summarization ,020204 information systems ,Collection-document summaries ,0202 electrical engineering, electronic engineering, information engineering ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,020201 artificial intelligence & image processing - Abstract
Learning something new from a text requires the reader to build on existing knowledge and add new material at the same time. Therefore, we propose collection-document summaries (CDS) that highlight commonalities and differences between a collection (or a single document) and a single document. We devise evaluation metrics that do not require human judgement, and three algorithms for extracting CDS that are based on single-document keyword-extraction methods. Our evaluation shows that different algorithms have different strengths; e.g., the TF-IDF-based approach best describes document overlap, while the adaptation of RAKE provides keywords with broad topical coverage. The proposed criteria and procedure can be used to evaluate document-collection summaries without annotated corpora, or to provide additional insight in an evaluation with human-generated ground truth.
- Published
- 2018
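Of the extraction strategies the abstract compares, the TF-IDF-based one amounts to scoring a document's terms against the rest of the collection, so that terms setting the document apart score highest. A minimal sketch of that idea; the function name, smoothing choices, and toy corpus are illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def tfidf_keywords(document, collection, top_n=3):
    """Score terms of `document` by TF-IDF relative to `collection`
    (a list of other documents); rare terms get a higher IDF weight."""
    tokenize = lambda text: text.lower().split()
    doc_terms = tokenize(document)
    tf = Counter(doc_terms)
    n_docs = len(collection) + 1  # include the document itself
    def idf(term):
        df = 1 + sum(term in tokenize(other) for other in collection)
        return math.log(n_docs / df) + 1.0
    scored = {t: (tf[t] / len(doc_terms)) * idf(t) for t in tf}
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_n]]

collection = [
    "neural networks for image classification",
    "convolutional networks and image recognition",
]
doc = "keyword extraction summarizes a document with salient keyword lists"
print(tfidf_keywords(doc, collection, top_n=3))
```

Keywords extracted this way describe what the document adds relative to the collection, which is the overlap/difference contrast the CDS evaluation measures.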
41. Neural Representations of Concepts and Texts for Biomedical Information Retrieval
- Author
-
Noh, Jiho
- Subjects
- information retrieval, natural language processing, deep neural networks, information extraction, text summarization, question answering, Computer Sciences, Data Science
- Abstract
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., "what diseases are associated with EGFR mutations?"). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because both the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal with longer sentences, but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals.
Queries are also different and can range from simple phrases (e.g., "COVID-19 symptoms") to more complex implicitly fielded queries (e.g., "chemotherapy regimens for stage IV lung cancer patients with ALK mutations"). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels) and this has led to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes.
These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
- Published
- 2021
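The third effort described above scores a candidate answer sentence a against a question q with a bilinear map, i.e. s = qᵀ W a. A minimal numpy sketch of that scoring step; the random vectors stand in for learned encoders and a learned W, which the dissertation would train:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Stand-ins for learned representations: random fixed embeddings.
question_vec = rng.normal(size=dim)      # encoded question q
candidates = rng.normal(size=(3, dim))   # encoded answer sentences a_1..a_3
W = rng.normal(size=(dim, dim))          # bilinear map (learned in practice)

def bilinear_scores(q, A, W):
    """Relevance of each candidate sentence i: s_i = q^T W a_i."""
    return A @ (q @ W)  # (n, d) @ (d,) -> (n,) scores

scores = bilinear_scores(question_vec, candidates, W)
best = int(np.argmax(scores))
print(scores, best)
```

In a trained reranker the argmax (or the full score vector) reorders candidate sentences, and documents inherit the scores of their best sentences.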
42. Improving Performance of Text Summarization
- Author
-
Pallavi D. Patil and S.A. Babar
- Subjects
Information retrieval ,fuzzy rule ,Latent semantic analysis ,business.industry ,Computer science ,Feature extraction ,Text graph ,Automatic summarization ,Fuzzy logic ,Text mining ,Text summarization ,Multi-document summarization ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,General Earth and Planetary Sciences ,Relevance (information retrieval) ,tf–idf ,business ,Sentence ,Document layout analysis ,General Environmental Science - Abstract
Today, a tremendous amount of information is available on the internet, and it is difficult to obtain the most relevant information quickly and efficiently. With so much text material available, a good mechanism is needed to extract the most relevant information from it. Text summarization deals with compressing a large document into a shorter version of the text. Text summarizers choose the most significant parts of the text and create coherent summaries that state the main purpose of the given document. Extraction-based text summarization involves selecting sentences of high relevance (rank) from the document based on word and sentence features and putting them together to generate a summary. This is modeled using a Fuzzy Inference System: the summary of the document is created based on the level of importance of the sentences in the document. This paper focuses on the fuzzy logic extraction approach for text summarization and the semantic approach of text summarization using Latent Semantic Analysis.
- Published
- 2015
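The extraction step described above scores sentences on word- and sentence-level features and combines them with fuzzy rules. A minimal sketch of that idea using triangular membership functions and one hand-written rule; the specific features, membership shapes, and rule are illustrative assumptions, not the paper's inference system:

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def sentence_score(tf_feature, position_feature):
    """Combine two normalized features (0..1) with one crude fuzzy rule:
    IF term-frequency is high AND position is early THEN importance is high.
    min() plays the role of the fuzzy AND; the rule strength is the score."""
    high_tf = triangular(tf_feature, 0.4, 1.0, 1.6)         # 'high' membership
    early = triangular(position_feature, -0.6, 0.0, 0.6)    # 'early' membership
    return min(high_tf, early)

sentences = [
    "Summarization compresses text.",
    "Filler sentence here.",
    "Fuzzy rules rank sentences.",
]
# Toy features per sentence: (normalized term-frequency salience, relative position).
feats = [(0.9, 0.0), (0.2, 0.5), (0.7, 1.0)]
scores = [sentence_score(tf, pos) for tf, pos in feats]
best = sentences[max(range(len(scores)), key=scores.__getitem__)]
print(scores, best)
```

A full fuzzy inference system would aggregate many such rules over more features (title words, sentence length, proper nouns, etc.) and defuzzify the result; the min/max skeleton is the same.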
43. A free Web API for single and multi-document summarization
- Author
-
Sergio Benini, Luca Canini, Alberto Signoroni, Nicola Adami, Massimo Mauro, and Riccardo Leonardi
- Subjects
Topic model ,Information retrieval ,Computer science ,business.industry ,Sentence clustering ,02 engineering and technology ,Python (programming language) ,Web API ,Automatic summarization ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Text mining ,Text Summarization ,Multi-document summarization ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0305 other medical science ,business ,computer ,computer.programming_language
In this work we present a free Web API for single and multi-text summarization. The summarization algorithm follows an extractive approach, thus selecting the most relevant sentences from a single document or a document set. It integrates different text analysis techniques in a novel pipeline - ranging from keyword and entity extraction to topic modelling and sentence clustering - and gives results competitive with the state of the art. The application, written in Python, supports both plain texts and Web URLs as input. The API is publicly accessible for free using the specific conference token as described in the reference page. The browser-based demo version, for summarization of single documents only, is publicly accessible at http://yonderlabs.com/demo.
- Published
- 2017
44. Test-driven summarization: combining formative assessment with teaching document summarization
- Author
-
Luca Cagliero, Elena Baralis, and Laura Farinetti
- Subjects
Information retrieval ,Multimedia ,Computer science ,Knowledge level ,formative assessment ,Learning analytics ,text summarization ,Context (language use) ,02 engineering and technology ,computer.software_genre ,Automatic summarization ,Test (assessment) ,Formative assessment ,020204 information systems ,Multi-document summarization ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,computer ,learning technologies
The diffusion of learning technologies has fostered the use of mobile and Web-based applications to assess the knowledge level of learners. In parallel, an increasing research interest has been devoted to studying new learning analytics tools able to summarize the content of large sets of learning documents. To bridge the gap between formative assessment tools and document summarization systems, this paper addresses the problem of recommending short summaries of large sets of learning documents based on the outcomes of multiple-choice tests. Specifically, it presents a new methodology for integrating formative assessment through mobile applications and summarization of learning documents in textual form. The content of the multiple-choice tests is exploited to drive the generation of document summaries tailored to specific topics. Furthermore, the outcomes of the tests are used to automatically recommend the generated summaries to learners based on their actual needs. As a case study, we performed an evaluation of students' progress, conducted in the context of a university-level course. The results show the applicability of the proposed methodology.
- Published
- 2017
45. A novel concept-level approach for ultra-concise opinion summarization
- Author
-
Tatiana Vodolazova, Manuel Palomar, Patricio Martínez-Barco, Ester Boldrini, Elena Lloret, Rafael Muñoz, Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, and Procesamiento del Lenguaje y Sistemas de Información (GPLSI)
- Subjects
Process (engineering) ,Computer science ,Context (language use) ,02 engineering and technology ,computer.software_genre ,Task (project management) ,Artificial Intelligence ,020204 information systems ,Multi-document summarization ,0202 electrical engineering, electronic engineering, information engineering ,Ultra-concise opinion summarization ,Information retrieval ,business.industry ,General Engineering ,Natural language generation ,Automatic summarization ,Readability ,Computer Science Applications ,Text summarization ,Lenguajes y Sistemas Informáticos ,Electronic Word of Mouth ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Natural language processing ,Sentence - Abstract
Web 2.0 has resulted in a shift in how users consume and interact with information, and has introduced a wide range of new textual genres, such as reviews or microblogs, through which users communicate, exchange, and share opinions. The exploitation of all this user-generated content is of great value both for users and companies, in order to assist them in their decision-making processes. Given this context, the analysis and development of automatic methods that can help manage online information in a quicker manner are needed. Therefore, this article proposes and evaluates a novel concept-level approach for ultra-concise abstractive opinion summarization. Our approach is characterized by the integration of syntactic sentence simplification, sentence regeneration and internal concept representation into the summarization process, thus being able to generate abstractive summaries, which is one of the most challenging issues for this task. In order to be able to analyze different settings for our approach, the use of the sentence regeneration module was made optional, leading to two different versions of the system (one with sentence regeneration and one without). For testing them, a corpus of 400 English texts, gathered from reviews and tweets belonging to two different domains, was used. Although both versions were shown to be reliable methods for generating this type of summary, the results obtained indicate that the version without sentence regeneration yielded better results, improving the results of a number of state-of-the-art systems by 9%, whereas the version with sentence regeneration proved to be more robust to noisy data.
This research work has been partially funded by the University of Alicante, Generalitat Valenciana, Spanish Government and the European Commission through the projects, “Tratamiento inteligente de la información para la ayuda a la toma de decisiones” (GRE12-44), “Explotación y tratamiento de la información disponible en Internet para la anotación y generación de textos adaptados al usuario” (GRE13-15), DIIM2.0 (PROMETEOII/2014/001), ATTOS (TIN2012-38536-C03-03), LEGOLANG-UAGE (TIN2012-31224), SAM (FP7-611312), and FIRST (FP7-287607).
- Published
- 2015
46. Multi Domain Semantic Information Retrieval Based on Topic Model
- Author
-
Lee, Sanghoon
- Subjects
- Information retrieval, Semantics, Topic model, Query expansion, Text classification, Text summarization
- Abstract
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amounts of information are increasingly accumulated on the Web. This gigantic information explosion increases the need for discovering new tools that retrieve meaningful knowledge from various complex information sources. Thus, techniques primarily used to search and extract important information from numerous database sources have been a key challenge in current IR systems. Topic modeling is one of the most recent techniques that discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model that generates topics from large corpora of resources, such as text, images, and audio. It has been widely used in many areas of information retrieval and data mining, providing an efficient way of identifying latent topics among document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA seems not to consider the meaning of words, but rather to infer hidden topics based on a statistical approach. Consequently, LDA can cause either a reduction in the quality of topic words or an increase in loose relations between topics. In order to solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge to a specific one, which solves the unknown domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness.
Experimental results show that the proposed model significantly increases the performance of applications.
- Published
- 2016
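The abstract above builds on LDA's view of documents as mixtures of topics and topics as word distributions. A minimal collapsed Gibbs sampler for vanilla LDA illustrates how those distributions are estimated; this is not the domain-specific model proposed in the thesis, and the toy corpus and hyperparameters are ours:

```python
import numpy as np

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenized docs.
    Returns (theta, phi, vocab): per-document topic mixtures,
    per-topic word distributions, and the vocabulary."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    w_id = {w: i for i, w in enumerate(vocab)}
    V, D, K = len(vocab), len(docs), n_topics

    # Count tables and random initial topic assignments.
    ndk = np.zeros((D, K))   # doc-topic counts
    nkw = np.zeros((K, V))   # topic-word counts
    nk = np.zeros(K)         # topic totals
    z = []                   # topic of every token
    for d, doc in enumerate(docs):
        zs = rng.integers(K, size=len(doc))
        z.append(zs)
        for w, t in zip(doc, zs):
            ndk[d, t] += 1; nkw[t, w_id[w]] += 1; nk[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], w_id[w]
                # Remove the token, sample its topic from the conditional, re-add.
                ndk[d, t] -= 1; nkw[t, wi] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, wi] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, wi] += 1; nk[t] += 1

    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
    phi = (nkw + beta) / (nkw.sum(1, keepdims=True) + V * beta)
    return theta, phi, vocab

docs = [s.split() for s in [
    "gene dna genome dna gene",
    "dna genome gene sequencing",
    "stock market trading stock",
    "market trading price stock",
]]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)
print([[vocab[i] for i in np.argsort(-phi[k])[:3]] for k in range(2)])
```

The drawback the abstract points out is visible here: infrequent words (e.g. "sequencing", "price") contribute few counts to nkw, so their topic assignments are noisy; the thesis addresses this by tying such words to domain concepts rather than to raw statistics.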