Author: "Komachi, Mamoru" / Topic: fos: computer and information sciences - Searchworks@Jio Institute Digital Library Search Results

1. WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia

Author: Ando, Kenichiro, Sekine, Satoshi, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Wikipedia can be edited by anyone and thus contains sentences of varying quality. As a result, it includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4 M sentences with 153 quality labels. In experiments on automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, we conducted automated essay scoring experiments to evaluate the generalizability of the dataset. We show that the models trained on WikiSQE perform better than the vanilla model, indicating its potential usefulness in other domains. WikiSQE is expected to be a valuable resource for other tasks in NLP., Comment: First draft
Published: 2023

2. Is In-hospital Meta-information Useful for Abstractive Discharge Summary Generation?

Author: Ando, Kenichiro, Komachi, Mamoru, Okumura, Takashi, Horiguchi, Hiromasa, and Matsumoto, Yuji
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: During a patient's hospitalization, the physician must record daily observations of the patient and summarize them into a brief document called a "discharge summary" when the patient is discharged. Automated generation of discharge summaries can greatly relieve the physicians' burden, and has been addressed recently in the research community. Most previous studies of discharge summary generation using the sequence-to-sequence architecture use only inpatient notes as input. However, electronic health records (EHR) also have rich structured metadata (e.g., hospital, physician, disease, length of stay, etc.) that might be useful. This paper investigates the effectiveness of medical meta-information for summarization tasks. We obtain four types of meta-information from the EHR systems and encode each type of meta-information into a sequence-to-sequence model. Using Japanese EHRs, meta-information encoded models increased ROUGE-1 by up to 4.45 points and BERTScore by 3.77 points over the vanilla Longformer. Also, we found that the encoded meta-information improves the precision of its related terms in the outputs. Our results showed the benefit of the use of medical meta-information.
Published: 2022

3. Construction of a Quality Estimation Dataset for Automatic Evaluation of Japanese Grammatical Error Correction

Author: Suzuki, Daisuke, Takahashi, Yujin, Yamashita, Ikumi, Aida, Taichi, Hirasawa, Tosho, Nakatsuji, Michitaka, Mita, Masato, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: In grammatical error correction (GEC), automatic evaluation is an important factor for research and development of GEC systems. Previous studies on automatic evaluation have demonstrated that quality estimation models built from datasets with manual evaluation can achieve high performance in automatic evaluation of English GEC without using reference sentences. However, quality estimation models have not yet been studied in Japanese, because there are no datasets for constructing quality estimation models. Therefore, in this study, we created a quality estimation dataset with manual evaluation to build an automatic evaluation model for Japanese GEC. Moreover, we conducted a meta-evaluation to verify the dataset's usefulness in building the Japanese quality estimation model., Comment: 8 pages (6 pages + references)
Published: 2022

4. Learning How to Translate North Korean through South Korean

Author: Kim, Hwichan, Moon, Sangwhan, Okazaki, Naoaki, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: South and North Korea both use the Korean language. However, Korean NLP research has focused on South Korean only, and existing NLP systems of the Korean language, such as neural machine translation (NMT) models, cannot properly handle North Korean inputs. Training a model using North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models. In this study, we create data for North Korean NMT models using a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation. Then, we investigate automatic alignment methods suitable for North Korean. Finally, we verify that a model trained on North Korean bilingual data without human annotation can significantly boost North Korean translation accuracy compared to existing South Korean models in zero-shot settings., Comment: 8 pages, 1 figure, 8 tables
Published: 2022
Full Text: View/download PDF

5. Proficiency Matters Quality Estimation in Grammatical Error Correction

Author: Takahashi, Yujin, Kaneko, Masahiro, Mita, Masato, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: This study investigates how supervised quality estimation (QE) models of grammatical error correction (GEC) are affected by the learners' proficiency with the data. QE models for GEC evaluations in prior work have obtained a high correlation with manual evaluations. However, when functioning in a real-world context, the data used for the reported results have limitations because prior works were biased toward data by learners with relatively high proficiency levels. To address this issue, we created a QE dataset that includes multiple proficiency levels and explored the necessity of performing proficiency-wise evaluation for QE of GEC. Our experiments demonstrated that differences in evaluation dataset proficiency affect the performance of QE models, and proficiency-wise evaluation helps create more robust models., Comment: 6 pages (4 pages + references)
Published: 2022
Full Text: View/download PDF

6. Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation

Author: Kondo, Seiichiro, Hotate, Kengo, Kaneko, Masahiro, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Neural machine translation (NMT) has recently gained widespread attention because of its high translation accuracy. However, it shows poor performance in the translation of long sentences, which is a major issue in low-resource languages. It is assumed that this issue is caused by an insufficient number of long sentences in the training data. Therefore, this study proposes a simple data augmentation method to handle long sentences. In this method, we use only the given parallel corpora as the training data and generate long sentences by concatenating two sentences. Based on the experimental results, we confirm improvements in long sentence translation by the proposed data augmentation method, despite its simplicity. Moreover, the translation quality is further improved when the proposed method is combined with back-translation., Comment: 7 pages; camera-ready for NAACL Student Research Workshop 2021
Published: 2021
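
The concatenation step described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `concat_augment`, the random pairing strategy, the single-space separator, and the `n_new` parameter are all assumptions made for illustration. The abstract states only that long sentences are generated by concatenating two sentences from the given parallel corpus.

```python
import random
from typing import List, Tuple

ParallelPair = Tuple[str, str]  # (source sentence, target sentence)


def concat_augment(
    pairs: List[ParallelPair],
    n_new: int,
    sep: str = " ",  # assumption: plain space; a special separator token is another option
    seed: int = 0,
) -> List[ParallelPair]:
    """Return the original pairs plus n_new pairs built by concatenation.

    Each new pair joins two distinct sentence pairs drawn at random:
    source with source, target with target, in the same order, so the
    result is still a valid (longer) parallel sentence pair.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two sentence pairs to concatenate")
    rng = random.Random(seed)
    augmented: List[ParallelPair] = []
    for _ in range(n_new):
        (src_a, tgt_a), (src_b, tgt_b) = rng.sample(pairs, 2)
        augmented.append((src_a + sep + src_b, tgt_a + sep + tgt_b))
    return pairs + augmented


if __name__ == "__main__":
    corpus = [
        ("I like tea .", "watashi wa ocha ga suki desu ."),
        ("It is raining .", "ame ga futte imasu ."),
        ("See you tomorrow .", "mata ashita ."),
    ]
    for src, tgt in concat_augment(corpus, n_new=2):
        print(src, "|||", tgt)
```

Because the augmented pairs are appended to the original data, the sketch uses no resources beyond the given parallel corpus, which matches the constraint stated in the abstract.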

7. Chinese Grammatical Correction Using BERT-based Pre-trained Model

Author: Wang, Hongfei, Kurosawa, Michiki, Katsumata, Satoru, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: In recent years, pre-trained models have been extensively studied, and several downstream tasks have benefited from their utilization. In this study, we verify the effectiveness of two methods that incorporate a BERT-based pre-trained model developed by Cui et al. (2020) into an encoder-decoder model on Chinese grammatical error correction tasks. We also analyze the error types and conclude that sentence-level errors are yet to be addressed., Comment: 6 pages; AACL-IJCNLP 2020
Published: 2020

8. JSSS: free Japanese speech corpus for summarization and simplification

Author: Takamichi, Shinnosuke, Komachi, Mamoru, Tanji, Naoko, and Saruwatari, Hiroshi
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we construct a new Japanese speech corpus for speech-based summarization and simplification, "JSSS" (pronounced "j-triple-s"). Given the success of reading-style speech synthesis from short-form sentences, we aim to design more difficult tasks for delivering information to humans. Our corpus contains voices recorded for two tasks that have a role in providing information under constraints: duration-constrained text-to-speech summarization and speaking-style simplification. It also contains utterances of long-form sentences as an optional task. This paper describes how we designed the corpus, which is available on our project page.
Published: 2020

9. Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020

Author: Hirasawa, Tosho, Yang, Zhishen, Komachi, Mamoru, and Okazaki, Naoaki
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Video-guided machine translation is a multimodal neural machine translation task that aims to generate high-quality text translations by tangibly engaging both video and text. In this work, we present our video-guided machine translation system for the Video-guided Machine Translation Challenge 2020. This system employs keyframe-based video feature extraction along with positional encoding of the video features. In the evaluation phase, our system scored 36.60 corpus-level BLEU-4 and achieved 1st place in the Video-guided Machine Translation Challenge 2020., Comment: 4 pages; First Workshop on Advances in Language and Vision Research (ALVR 2020)
Published: 2020

10. Stronger Baselines for Grammatical Error Correction Using Pretrained Encoder-Decoder Model

Author: Katsumata, Satoru and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Studies on grammatical error correction (GEC) have reported the effectiveness of pretraining a Seq2Seq model with a large amount of pseudodata. However, this approach requires time-consuming pretraining for GEC because of the size of the pseudodata. In this study, we explore the utility of bidirectional and auto-regressive transformers (BART) as a generic pretrained encoder-decoder model for GEC. With the use of this generic pretrained model for GEC, the time-consuming pretraining can be eliminated. We find that monolingual and multilingual BART models achieve high performance in GEC, with one of the results being comparable to the current strong results in English GEC. Our implementations are publicly available at GitHub (https://github.com/Katsumata420/generic-pretrained-GEC)., Comment: 6 pages; AACL-IJCNLP 2020
Published: 2020
Full Text: View/download PDF

11. Dynamic Fusion: Attentional Language Model for Neural Machine Translation

Author: Kurosawa, Michiki and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Neural Machine Translation (NMT) can be used to generate fluent output. As such, language models have been investigated for incorporation with NMT. In prior investigations, two models have been used: a translation model and a language model. The translation model's predictions are weighted by the language model with a hand-crafted ratio in advance. However, these approaches fail to adopt the language model weighting with regard to the translation history. In another line of approach, language model prediction is incorporated into the translation model by jointly considering source and target information. However, this line of approach is limited because it largely ignores the adequacy of the translation output. Accordingly, this work employs two mechanisms, the translation model and the language model, with an attentive architecture to the language model as an auxiliary element of the translation model. Compared with previous work in English--Japanese machine translation using a language model, the experimental results obtained with the proposed Dynamic Fusion mechanism improve BLEU and Rank-based Intuitive Bilingual Evaluation Score (RIBES) scores. Additionally, in the analyses of the attention and predictivity of the language model, the Dynamic Fusion mechanism allows predictive language modeling that conforms to the appropriate grammatical structure., Comment: 13 pages; PACLING 2019
Published: 2019

12. Improving Context-aware Neural Machine Translation with Target-side Context

Author: Yamagishi, Hayahide and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: In recent years, several studies on neural machine translation (NMT) have attempted to use document-level context by using a multi-encoder and two attention mechanisms to read the current and previous sentences to incorporate the context of the previous sentences. These studies concluded that the target-side context is less useful than the source-side context. However, we considered that the reason why the target-side context is less useful lies in the architecture used to model these contexts. Therefore, in this study, we investigate how the target-side context can improve context-aware neural machine translation. We propose a weight sharing method wherein NMT saves decoder states and calculates an attention vector using the saved states when translating a current sentence. Our experiments show that the target-side context is also useful if we plug it into NMT as the decoder state when translating a previous sentence., Comment: 12 pages; PACLING 2019
Published: 2019

13. Divide and Generate: Neural Generation of Complex Sentences

Author: Ogata, Tomoya, Komachi, Mamoru, and Takatani, Tomoya
Subjects: FOS: Computer and information sciences, TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: We propose a task to generate a complex sentence from a simple sentence in order to amplify various kinds of responses in the database. We first divide a complex sentence into a main clause and a subordinate clause to learn a generator model of modifiers, and then use the model to generate a modifier clause to create a complex sentence from a simple sentence. We present an automatic evaluation metric to estimate the quality of the models and show that a pipeline model outperforms an end-to-end model.
Published: 2019

14. Debiasing Word Embeddings Improves Multimodal Machine Translation

Author: Hirasawa, Tosho and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: In recent years, pretrained word embeddings have proved useful for multimodal neural machine translation (NMT) models to address the shortage of available datasets. However, the integration of pretrained word embeddings has not yet been explored extensively. Further, pretrained word embeddings in high dimensional spaces have been reported to suffer from the hubness problem. Although some debiasing techniques have been proposed to address this problem for other natural language processing tasks, they have seldom been studied for multimodal NMT models. In this study, we examine various kinds of word embeddings and introduce two debiasing techniques for three multimodal NMT models and two language pairs -- English-German translation and English-French translation. With our optimal settings, the overall performance of multimodal models was improved by up to +1.93 BLEU and +2.02 METEOR for English-German translation and +1.73 BLEU and +0.95 METEOR for English-French translation., Comment: 11 pages; MT Summit 2019 (camera ready)
Published: 2019
Full Text: View/download PDF

15. Towards Unsupervised Grammatical Error Correction using Statistical Machine Translation with Synthetic Comparable Corpus

Author: Katsumata, Satoru and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: We introduce unsupervised techniques based on phrase-based statistical machine translation for grammatical error correction (GEC) trained on a pseudo learner corpus created by Google Translation. We verified our GEC system through experiments on various GEC datasets, including a low resource track of the shared task at Building Educational Applications 2019 (BEA 2019). As a result, we achieved an F_0.5 score of 28.31 points with the test data of the low resource track., Comment: 7 pages; extended version of BEA 2019
Published: 2019
Full Text: View/download PDF
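
For readers unfamiliar with the F_0.5 score reported in the abstract above, the standard definition of the F-beta measure is reproduced here. This is the general textbook formula, not a detail taken from the paper; P and R denote precision and recall.

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
```

With beta = 0.5, precision is weighted more heavily than recall, so a system is penalized more for proposing a wrong correction than for missing one.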

16. Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings

Author: Matsuo, Junki, Komachi, Mamoru, and Sudoh, Katsuhito
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: One of the most important problems in machine translation (MT) evaluation is to evaluate the similarity between translation hypotheses with different surface forms from the reference, especially at the segment level. We propose to use word embeddings to perform word alignment for segment-level MT evaluation. We performed experiments with three types of alignment methods using word embeddings. We evaluated our proposed methods with various translation datasets. Experimental results show that our proposed methods outperform previous word embeddings-based methods., Comment: 5 pages
Published: 2017

17. English-Japanese Neural Machine Translation with Encoder-Decoder-Reconstructor

Author: Matsumura, Yukio, Sato, Takayuki, and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Neural machine translation (NMT) has recently become popular in the field of machine translation. However, NMT suffers from the problem of repeating or missing words in the translation. To address this problem, Tu et al. (2017) proposed an encoder-decoder-reconstructor framework for NMT using back-translation. In this method, they selected the best forward translation model in the same manner as Bahdanau et al. (2015), and then trained a bi-directional translation model as fine-tuning. Their experiments show that it offers significant improvement in BLEU scores in a Chinese-English translation task. We confirm that our re-implementation also shows the same tendency and alleviates the problem of repeating and missing words in the translation on an English-Japanese task too. In addition, we evaluate the effectiveness of pre-training by comparing it with a jointly-trained model of forward translation and back-translation., Comment: 8 pages
Published: 2017
Full Text: View/download PDF

18. Sparse Named Entity Classification using Factorization Machines

Author: Hirata, Ai and Komachi, Mamoru
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Named entity classification is the task of classifying text-based elements into various categories, including places, names, dates, times, and monetary values. A bottleneck in named entity classification, however, is the data problem of sparseness, because new named entities continually emerge, making it rather difficult to maintain a dictionary for named entity classification. Thus, in this paper, we address the problem of named entity classification using matrix factorization to overcome the problem of feature sparsity. Experimental results show that our proposed model, with fewer features and a smaller size, achieves competitive accuracy to state-of-the-art models., Comment: 4+1 pages
Published: 2017
Full Text: View/download PDF

18 results on '"Komachi, Mamoru"'