Descriptor: "Self training" / Topic: artificial intelligence - Searchworks@Jio Institute Digital Library Search Results

1. A novel semi-supervised self-training method based on resampling for Twitter fake account identification

Author: Shouqiang Sun, Ziming Zeng, Jie Yin, Jingjing Sun, and Tingting Li
Subjects: Computer science, business.industry, Process (computing), Semi-supervised learning, Library and Information Sciences, Machine learning, computer.software_genre, Data set, Identification (information), Resampling, Classifier (linguistics), Labeled data, Artificial intelligence, business, computer, Self training, Information Systems
Abstract: Purpose: Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, commercial propaganda or impersonate others. The effective identification of bot accounts is conducive to accurately judge the disseminated information for the public. However, in actual fake account identification, it is expensive and inefficient to manually label Twitter accounts, and the labeled data are usually unbalanced in classes. To this end, the authors propose a novel framework to solve these problems. Design/methodology/approach: In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to the real Twitter account data set from Kaggle. Specifically, the authors first train the classifier in the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, iteratively select high confidence instances from unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data class is balanced at the initial stage of the self-training iteration. Findings: The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on 6 different base classifiers, especially for the initial small-scale labeled Twitter accounts. Originality/value: This paper provides novel insights in identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts from the semi-supervised background. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.
Published: 2021
Full Text: View/download PDF
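
Illustrative sketch for entry 1 (not the authors' code): the loop described in the abstract, i.e. train on a small labeled set, pseudo-label the unlabeled pool, keep only high-confidence predictions, and rebalance classes by resampling, might look roughly like the following. The nearest-centroid base classifier, the distance-based confidence score, random oversampling, the 0.8 threshold, and all function names are assumptions made for this example; the paper evaluates six base classifiers that the abstract does not name.

```python
import random

def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class (assumed base learner)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is an assumed distance-based score."""
    dists = {c: sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5
             for c, m in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf

def oversample(X, y, rng):
    """Random oversampling: duplicate minority-class rows until classes are balanced."""
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            X_bal.append(x)
            y_bal.append(label)
    return X_bal, y_bal

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10, seed=0):
    """Iteratively move high-confidence pseudo-labeled rows into the labeled set."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        # Assumption: classes are rebalanced at the start of every round.
        X_bal, y_bal = oversample(X_lab, y_lab, rng)
        model = centroid_fit(X_bal, y_bal)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return X_lab, y_lab, pool
```

Calling `self_train(X_lab, y_lab, X_unlab)` returns the expanded labeled set together with the rows that never reached the confidence threshold. Oversampling barely affects a nearest-centroid model; it is included only to show where a resampling step could sit in the loop.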

2. Entropy-aware self-training for graph convolutional networks

Author: Tao Wang, Congyan Lang, Yi Jin, Yidong Li, and Gongpei Zhao
Subjects: Theoretical computer science, Artificial Intelligence, Computer science, Cognitive Neuroscience, Node (networking), Entropy (information theory), Layer (object-oriented design), Random walk, Self training, Feature learning, Graph, Computer Science Applications
Abstract: Recently, graph convolutional networks (GCNs) have achieved significant success in many graph-based learning tasks, especially for node classification, due to its excellent ability in representation learning. Nevertheless, it remains challenging for GCN models to obtain satisfying predictions on graphs where only few nodes are with known labels. In this paper, we propose a novel entropy-aware self-training algorithm to boost semi-supervised node classification on graphs with little supervised information. Firstly, an entropy-aggregation layer is developed to strengthen the reasoning ability of GCN models. To the best of our knowledge, this is the first work to combine the entropy-based random walk theory with GCN design. Furthermore, we propose an ingenious checking part to add new nodes as supervision after each training round to enhance node prediction. In particular, the checking part is designed based on aggregated features, which is demonstrated more effective than previous methods and boosts node classification significantly. The proposed algorithm is validated on six public benchmarks in comparison with several state-of-the-art baseline algorithms, and the results illustrate its excellent performance.
Published: 2021
Full Text: View/download PDF

3. Semi-Supervised Self-Training of Hate and Offensive Speech from Social Media

Author: Samira Sadaoui and Safa Alsafari
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, Computer science, Applied psychology, Offensive, Social media, Self training
Abstract: Improving Offensive and Hate Speech (OHS) classifiers’ performances requires a large, confidently labeled textual training dataset. Our study devises a semi-supervised classification approach with ...
Published: 2021
Full Text: View/download PDF

4. A semi-supervised learning method for hyperspectral imagery based on self-training and local-based affinity propagation

Author: Liguo Wang, Wenlong Zhu, Haizhu Pan, Cheng Li, Yanping Teng, Yanzhong Liu, and Haimiao Ge
Subjects: 010504 meteorology & atmospheric sciences, Computer science, business.industry, 0211 other engineering and technologies, Hyperspectral imaging, Pattern recognition, 02 engineering and technology, Semi-supervised learning, 01 natural sciences, Remote sensing (archaeology), General Earth and Planetary Sciences, Affinity propagation, Artificial intelligence, business, Self training, 021101 geological & geomatics engineering, 0105 earth and related environmental sciences
Abstract: In hyperspectral remote sensing, the classification of hyperspectral imagery is an important issue of concern. However, obtaining sufficient labelled samples for the classification is hard work and...
Published: 2021
Full Text: View/download PDF

5. An Effective Tumor Classification With Deep Forest and Self-Training

Author: Lili Shen, Xiaojun Sun, and Zhanbo Chen
Subjects: semi-supervised learning, Gene expression omnibus, General Computer Science, business.industry, Process (engineering), Computer science, Tumor classification, Supervised learning, General Engineering, Sample (statistics), Machine learning, computer.software_genre, Field (computer science), TK1-9971, Random forest, ComputingMethodologies_PATTERNRECOGNITION, self-training, Robustness (computer science), deep forest, General Materials Science, Electrical engineering. Electronics. Nuclear engineering, Artificial intelligence, business, computer, Self training
Abstract: In recent years, tumor classification based on the gene expression omnibus has become a field of continuous attention in bioinformatics. Integrating machine learning techniques is an efficient way to solve these problems. Generally, in order to obtain good performance in supervised learning tasks, a large number of labelled samples is required. However, in many cases, only a few labelled samples and abundant unlabelled samples exist in the training database. The process of labelling these unlabelled samples manually is difficult and expensive. Therefore, semi-supervised learning approaches have been proposed to utilize unlabelled samples to improve the performance of a model. However, noisy samples decrease the robustness of the model in semi-supervised learning. We want a training style in which samples are used for training in order from high to low confidence; self-training can meet this requirement, and the deep forest approach with the hyper-parameter settings used in this work can obtain good accuracy. Therefore, in this paper, we present a novel semi-supervised learning approach with a deep forest model to increase the performance of tumor classification, which employs unlabelled samples and minimizes the cost; that is, an updated unlabelled sample mechanism is investigated to expand the number of high-confidence pseudo-labelled samples. Multiple real-world experiments indicate that our proposed approach can obtain results of up to 0.96 accuracy and F1-Score, and 0.9798 AUC.
Published: 2021
Full Text: View/download PDF

6. A self-training hierarchical prototype-based approach for semi-supervised classification

Author: Xiaowei Gu
Subjects: Structure (mathematical logic), Information Systems and Management, business.industry, Computer science, Process (engineering), 05 social sciences, 050301 education, 02 engineering and technology, Machine learning, computer.software_genre, Computer Science Applications, Theoretical Computer Science, Knowledge base, Artificial Intelligence, Control and Systems Engineering, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), Key (cryptography), 020201 artificial intelligence & image processing, Artificial intelligence, business, 0503 education, Self training, computer, Software
Abstract: This paper introduces a novel self-training hierarchical prototype-based approach for semi-supervised classification. The proposed approach firstly identifies meaningful prototypes from labelled samples at multiple levels of granularity and, then, self-organizes a highly transparent, multi-layered recognition model by arranging them in a form of pyramidal hierarchies. After this, the learning model continues to self-evolve its structure and self-expand its knowledge base to incorporate new patterns recognized from unlabelled samples by exploiting the pseudo-label technique. Thanks to its prototype-based nature, the overall computational process of the proposed approach is highly explainable and traceable. Experimental studies with various benchmark image recognition problems demonstrate the state-of-the-art performance of the proposed approach, showing its strong capability to mine key information from unlabelled data for classification.
Published: 2020
Full Text: View/download PDF

7. A Prediction Approach Based on Self-Training and Deep Learning for Biological Data

Author: Mohamed Lamine Berkane, Mahmoud Boufaida, and Mohamed Nadjib Boufenara
Subjects: Biological data, ComputingMethodologies_PATTERNRECOGNITION, business.industry, Computer science, Deep learning, Artificial intelligence, business, Machine learning, computer.software_genre, computer, Self training
Abstract: With the exponential growth of biological data, labeling this kind of data becomes difficult and costly. Although unlabeled data are comparatively more plentiful than labeled ones, most supervised learning methods are not designed to use unlabeled data. Semi-supervised learning methods are motivated by the availability of large unlabeled datasets rather than a small amount of labeled examples. However, incorporating unlabeled data into learning does not guarantee an improvement in classification performance. This paper introduces an approach based on a model of semi-supervised learning, which is the self-training with a deep learning algorithm to predict missing classes from labeled and unlabeled data. In order to assess the performance of the proposed approach, two datasets are used with four performance measures: precision, recall, F-measure, and area under the ROC curve (AUC).
Published: 2020
Full Text: View/download PDF
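
Illustrative note for entry 7: three of the four performance measures named in the abstract (precision, recall, F-measure) follow directly from true-positive, false-positive, and false-negative counts. The sketch below uses the standard binary definitions; the function name and the choice of `1` as the positive class are assumptions for this example, and AUC is omitted because it needs ranked scores rather than hard labels.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute binary precision, recall and F1 from hard label predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```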

8. Tunnel condition assessment via cloud model‐based random forests and self‐training approach

Author: Hehua Zhu, J. Woody Ju, Feng Guo, Mengqi Zhu, and Xueqin Chen
Subjects: Computer science, business.industry, Decision tree, Cloud computing, Machine learning, computer.software_genre, Computer Graphics and Computer-Aided Design, Condition assessment, Computer Science Applications, Random forest, Computational Theory and Mathematics, Artificial intelligence, CRFS, business, computer, Self training, Civil and Structural Engineering
Abstract: To proactively assess the losses caused by the deterioration of metro tunnels during the operational period, a new method, the cloud model‐based random forests (CRFs), is proposed to discu...
Published: 2020
Full Text: View/download PDF

9. A semi-supervised self-training method based on density peaks and natural neighbors

Author: Junnan Li and Suwen Zhao
Subjects: 0209 industrial biotechnology, General Computer Science, business.industry, Computer science, Decision tree, Pattern recognition, Computational intelligence, 02 engineering and technology, k-nearest neighbors algorithm, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Cluster analysis, business, Self training, Classifier (UML)
Abstract: The semi-supervised self-training method is one of the successful methodologies of semi-supervised classification and can train a classifier by exploiting both labeled data and unlabeled data. However, most of the self-training methods are limited by the distribution of initial labeled data, heavily rely on parameters and have the poor ability of prediction in the self-training process. To solve these problems, a novel self-training method based on density peaks and natural neighbors (STDPNaN) is proposed. In STDPNaN, an improved parameter-free density peaks clustering (DPCNaN) is firstly presented by introducing natural neighbors. The DPCNaN can reveal the real structure and distribution of data without any parameter, and then helps STDPNaN restore the real data space with the spherical or non-spherical distribution. Also, an ensemble classifier is employed to improve the predictive ability of STDPNaN in the self-training process. Intensive experiments show that (a) STDPNaN outperforms state-of-the-art methods in improving classification accuracy of k nearest neighbor, support vector machine and classification and regression tree; (b) STDPNaN also outperforms comparison methods without any restriction on the number of labeled data; (c) the running time of STDPNaN is acceptable.
Published: 2020
Full Text: View/download PDF

10. Semi‐Supervised Learning

Author: Gaurav Malik, Deepak Kumar Sharma, and Manish Devgan
Subjects: symbols.namesake, business.industry, Computer science, symbols, Artificial intelligence, Semi-supervised learning, Baum–Welch algorithm, Machine learning, computer.software_genre, business, computer, Self training
Published: 2020
Full Text: View/download PDF

11. A boosting Self-Training Framework based on Instance Generation with Natural Neighbors for K Nearest Neighbor

Author: Junnan Li and Qingsheng Zhu
Subjects: Boosting (machine learning), Computer science, business.industry, 02 engineering and technology, Machine learning, computer.software_genre, Ensemble learning, k-nearest neighbors algorithm, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Labeled data, 020201 artificial intelligence & image processing, Artificial intelligence, business, Self training, computer, Classifier (UML)
Abstract: The semi-supervised self-training method is one of the successful methodologies of semi-supervised classification. The mislabeling is the most challenging issue in self-training methods and the ensemble learning is one of the common techniques for dealing with the mislabeling. Specifically, the ensemble learning can solve or alleviate the mislabeling by constructing an ensemble classifier to improve prediction accuracy in the self-training process. However, most ensemble learning methods may not perform well in self-training methods because it is difficult for ensemble learning methods to train an effective ensemble classifier with a small number of labeled data. Inspired by the successful boosting methods, we introduce a new boosting self-training framework based on instance generation with natural neighbors (BoostSTIG) in this paper. BoostSTIG is compatible with most boosting methods and self-training methods. It can use most boosting methods to solve or alleviate the mislabeling of existing self-training methods by improving the prediction accuracy in the self-training process. Besides, an instance generation with natural neighbors is proposed to enlarge initial labeled data in BoostSTIG, which makes boosting methods more suitable for self-training methods. In experiments, we apply the BoostSTIG framework to 2 self-training methods and 4 boosting methods, and then validate BoostSTIG by comparing some state-of-the-art technologies on real data sets. Intensive experiments show that BoostSTIG can improve the performance of tested self-training methods and train an effective k nearest neighbor.
Published: 2020
Full Text: View/download PDF

12. Divide-and-conquer ensemble self-training method based on probability difference

Author: Tingting Li and Jia Lu
Subjects: Divide and conquer algorithms, Structure (mathematical logic), General Computer Science, Generalization, business.industry, Computer science, Process (computing), 020206 networking & telecommunications, Computational intelligence, Pattern recognition, 02 engineering and technology, ComputingMethodologies_PATTERNRECOGNITION, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Noise (video), Artificial intelligence, business, Classifier (UML), Self training
Abstract: Self-training method can train an effective classifier by exploiting labeled instances and unlabeled instances. In the process of self-training method, the high confidence instances are usually selected iteratively and added to the training set for learning. Unfortunately, the structure information of high confidence instances is so similar that it leads to local over-fitting during the iterations. In order to avoid the over-fitting phenomenon, and improve the classification effect of self-training methods, a novel divide-and-conquer ensemble self-training framework based on probability difference is proposed. Firstly, the probability difference of instances is calculated by the category probability of each classifier, the low-fuzzy and high-fuzzy instances of each classifier are divided through the probability difference. Then, a divide-and-conquer strategy is adopted. That is, the low-fuzzy instances determined by all the classifiers are directly labeled and high-fuzzy instances are manually labeled. Finally, the labeled instances are added to the training set for iteration self-training. This method expands the training set by selecting low-fuzzy instances with accurate structure information and high-fuzzy instances with more comprehensive structure information, and it improves the generalization performance of the method effectively. The method is more suitable for noise data sets and it can obtain structure information even in a few labeled instances. The effectiveness of the proposed method is verified by comparative experiments on the University of California Irvine (UCI).
Published: 2020
Full Text: View/download PDF
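
Illustrative sketch for entry 12 (not the authors' code): the abstract does not define "probability difference" precisely. Assuming it is the gap between the two largest class probabilities a classifier assigns to an instance, the divide-and-conquer split could be sketched as below. An instance counts as low-fuzzy only if every classifier in the ensemble gives it a gap of at least `threshold`; everything else is treated as high-fuzzy and would be sent for manual labeling. The threshold value and function names are hypothetical.

```python
def probability_difference(probs):
    """Gap between the two largest class probabilities (assumed definition)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def split_by_fuzziness(per_classifier_probs, threshold=0.4):
    """Split instance indices into (low_fuzzy, high_fuzzy).

    per_classifier_probs[k][i] is the class-probability list that
    classifier k assigns to instance i.
    """
    n_instances = len(per_classifier_probs[0])
    low, high = [], []
    for i in range(n_instances):
        gaps = [probability_difference(clf[i]) for clf in per_classifier_probs]
        # Low-fuzzy only when all classifiers are confident about instance i.
        if min(gaps) >= threshold:
            low.append(i)
        else:
            high.append(i)
    return low, high
```

In this reading, low-fuzzy instances are pseudo-labeled automatically, while the high-fuzzy ones are the instances the abstract says are labeled manually before both groups join the training set.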

13. STDS: self-training data streams for mining limited labeled data in non-stationary environment

Author: Jafar Tanha, Arash Sharifi, Shirin Khezri, and Ali Ahmadi
Subjects: Concept drift, Data stream mining, business.industry, Computer science, 02 engineering and technology, Machine learning, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, Data point, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Labeled data, 020201 artificial intelligence & image processing, Artificial intelligence, Cluster analysis, business, Self training, Classifier (UML), computer
Abstract: In this article, we focus on the classification problem to semi-supervised learning in non-stationary environment. Semi-supervised learning is a learning task from both labeled and unlabeled data points. There are several approaches to semi-supervised learning in stationary environment which are not applicable directly for data streams. We propose a novel semi-supervised learning algorithm, named STDS. The proposed approach uses labeled and unlabeled data and employs an approach to handle the concept drift in data streams. The main challenge in semi-supervised self-training for data streams is to find a proper selection metric in order to find a set of high-confidence predictions and a proper underlying base learner. We therefore propose an ensemble approach to find a set of high-confidence predictions based on clustering algorithms and classifier predictions. We then employ the Kullback-Leibler (KL) divergence approach to measure the distribution differences between sequential chunks in order to detect the concept drift. When drift is detected, a new classifier is updated from the new set of labeled data in the current chunk; otherwise, a percentage of high-confidence newly labeled data in the current chunk is added to the labeled data in the next chunk for updating the incremental classifier based on the proposed selection metric. The results of our experiments on a number of classification benchmark datasets show that STDS outperforms the supervised methods and most of the other semi-supervised learning methods.
Published: 2020
Full Text: View/download PDF
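
Illustrative sketch for entry 13 (not the STDS implementation): the abstract describes comparing sequential chunks with the Kullback-Leibler divergence to detect concept drift. One minimal way to do this for a single numeric feature is to histogram each chunk over shared bins and flag drift when the divergence exceeds a threshold. The binning, add-one smoothing, divergence direction, threshold value, and function names are all assumptions for this example.

```python
import math

def histogram(values, edges):
    """Smoothed relative frequencies over bins [edges[i], edges[i+1])."""
    n_bins = len(edges) - 1
    counts = [1.0] * n_bins  # add-one smoothing so no bin has zero mass
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break  # values outside the bin range are ignored
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def drift_detected(prev_chunk, curr_chunk, edges, threshold=0.1):
    """Flag drift when the current chunk diverges from the previous one."""
    p_prev = histogram(prev_chunk, edges)
    p_curr = histogram(curr_chunk, edges)
    return kl_divergence(p_curr, p_prev) > threshold
```

Following the abstract, a `True` result would trigger building a classifier from the labeled data in the current chunk, while `False` would let high-confidence pseudo-labels carry over into the next chunk.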

14. Improved well-log classification using semisupervised label propagation and self-training, with comparisons to popular supervised algorithms

Author: Alison Malcolm, Michael W. Dunham, and J. Kim Welford
Subjects: 010504 meteorology & atmospheric sciences, Computer science, business.industry, 010502 geochemistry & geophysics, Machine learning, computer.software_genre, 01 natural sciences, ComputingMethodologies_PATTERNRECOGNITION, Geophysics, Geochemistry and Petrology, Artificial intelligence, business, computer, Self training, 0105 earth and related environmental sciences, Label propagation
Abstract: Machine-learning techniques allow geoscientists to extract meaningful information from data in an automated fashion, and they are also an efficient alternative to traditional manual interpretation methods. Many geophysical problems have an abundance of unlabeled data and a paucity of labeled data, and the lithology classification of wireline data reflects this situation. Training supervised algorithms on small labeled data sets can lead to overtraining, and subsequent predictions for the numerous unlabeled data may be unstable. However, semisupervised algorithms are designed for classification problems with limited amounts of labeled data, and they are theoretically able to achieve better accuracies than supervised algorithms in these situations. We explore this hypothesis by applying two semisupervised techniques, label propagation (LP) and self-training, to a well-log data set and compare their performance to three popular supervised algorithms. LP is an established method, but our self-training method is a unique adaptation of existing implementations. The well-log data were made public through an SEG competition held in 2016. We simulate a semisupervised scenario with these data by assuming that only one of the 10 wells has labels (i.e., core samples), and our objective is to predict the labels for the remaining nine wells. We generate results from these data in two stages. The first stage is applying all the algorithms in question to the data as is (i.e., the global data), and the results from this motivate the second stage, which is applying all algorithms to the data when they are decomposed into two separate data sets. Overall, our findings suggest that LP does not outperform the supervised methods, but our self-training method coupled with LP can outperform the supervised methods by a notable margin if the assumptions of LP are met.
Published: 2020
Full Text: View/download PDF

15. Uncertainty-Aware Self-Training for Semi-Supervised Event Temporal Relation Extraction

Author: Wei Bi, Jun Zhao, Yubo Chen, Xinyu Zuo, Pengfei Cao, and Kang Liu
Subjects: Sample selection, Event (computing), business.industry, Computer science, Process (engineering), Natural language understanding, computer.software_genre, Machine learning, Relationship extraction, Task (project management), Artificial intelligence, business, Self training, Data Annotation, computer
Abstract: Extracting event temporal relations is an important task for natural language understanding. Many works have been proposed for supervised event temporal relation extraction, which typically requires a large amount of human-annotated data for model training. However, the data annotation for this task is very time-consuming and challenging. To this end, we study the problem of semi-supervised event temporal relation extraction. Self-training as a widely used semi-supervised learning method can be utilized for this problem. However, it suffers from the noisy pseudo-labeling problem. In this paper, we propose the use of uncertainty-aware self-training framework (UAST) to quantify the model uncertainty for coping with pseudo-labeling errors. Specifically, UAST utilizes (1) Uncertainty Estimation module to compute the model uncertainty for pseudo-labeling unlabeled data; (2) Sample Selection with Exploration module to select informative samples based on uncertainty estimates; and (3) Uncertainty-Aware Learning module to explicitly incorporate the model uncertainty into the self-training process. Experimental results indicate that our approach significantly outperforms previous state-of-the-art methods.
Published: 2021
Full Text: View/download PDF

16. Dual-Consistency Self-Training For Unsupervised Domain Adaptation

Author: Jie Wang, Yasuto Yokota, Chaoliang Zhong, Masaru Ide, Cheng Feng, and Jun Sun
Subjects: Dual consistency, Domain adaptation, Computer science, business.industry, Artificial intelligence, Machine learning, computer.software_genre, business, Self training, computer
Published: 2021
Full Text: View/download PDF

17. An Improved Self-Training Method for Positive Unlabeled Time Series Classification Using DTW Barycenter Averaging

Author: Yabo Dong, Duanqing Xu, Tongbin Zuo, Jing Li, and Haowen Zhang
Subjects: Time series classification, Dynamic time warping, Computer science, Boundary (topology), TP1-1185, Biochemistry, Article, Analytical Chemistry, Domain (software engineering), Set (abstract data type), self-training, Cluster Analysis, Humans, Electrical and Electronic Engineering, Instrumentation, Sequence, business.industry, Chemical technology, positive unlabeled time series classification, Pattern recognition, Atomic and Molecular Physics, and Optics, ComputingMethodologies_PATTERNRECOGNITION, dynamic time warping, Labeled data, Artificial intelligence, business, Self training, DTW barycenter averaging
Abstract: Traditional supervised time series classification (TSC) tasks assume that all training data are labeled. However, in practice, manually labelling all unlabeled data could be very time-consuming and often requires the participation of skilled domain experts. In this paper, we concern with the positive unlabeled time series classification problem (PUTSC), which refers to automatically labelling the large unlabeled set U based on a small positive labeled set PL. The self-training (ST) is the most widely used method for solving the PUTSC problem and has attracted increased attention due to its simplicity and effectiveness. The existing ST methods simply employ the one-nearest-neighbor (1NN) formula to determine which unlabeled time-series should be labeled. Nevertheless, we note that the 1NN formula might not be optimal for PUTSC tasks because it may be sensitive to the initial labeled data located near the boundary between the positive and negative classes. To overcome this issue, in this paper we propose an exploratory methodology called ST-average. Unlike conventional ST-based approaches, ST-average utilizes the average sequence calculated by DTW barycenter averaging technique to label the data. Compared with any individuals in PL set, the average sequence is more representative. Our proposal is insensitive to the initial labeled data and is more reliable than existing ST-based methods. Besides, we demonstrate that ST-average can naturally be implemented along with many existing techniques used in original ST. Experimental results on public datasets show that ST-average performs better than related popular methods.
Published: 2021
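
Illustrative sketch for entry 17 (not the authors' code): the abstract contrasts ST-average with existing self-training methods that use a one-nearest-neighbor (1NN) rule to choose which unlabeled series to label next. The 1NN baseline, paired here with a textbook dynamic time warping (DTW) distance, could be sketched as below. DTW barycenter averaging itself is not implemented; under ST-average, as the abstract describes it, the comparison would be against a single averaged sequence rather than every member of the positive set. Function names and the squared point cost are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m] ** 0.5

def nearest_unlabeled(positives, unlabeled):
    """Return (index, distance) of the unlabeled series closest to any positive."""
    best_idx, best_dist = None, float("inf")
    for idx, series in enumerate(unlabeled):
        dist = min(dtw_distance(series, p) for p in positives)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist
```

In a self-training round, the series returned by `nearest_unlabeled` would be moved from the unlabeled set U into the positive labeled set PL, and the step would repeat until a stopping criterion is met.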

18. Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

Author: Dong-Hoon Lee, Namgyu Kim, and William Xiu Shun Wong
Subjects: Information Systems and Management, Sociology and Political Science, Computer science, business.industry, Artificial intelligence, business, Machine learning, computer.software_genre, Self training, computer
Published: 2019
Full Text: View/download PDF

19. Interpolative self-training approach for link prediction

Author: Somayyeh Aghababaei and Masoud Makrehchi
Subjects: Artificial Intelligence, business.industry, Computer science, Computer Vision and Pattern Recognition, Artificial intelligence, business, Machine learning, computer.software_genre, Link (knot theory), computer, Self training, Theoretical Computer Science
Published: 2019
Full Text: View/download PDF

20. Deep Contextualized Self-training for Low Resource Dependency Parsing

Author: Roi Reichart and Guy Rotman
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Linguistics and Language, Computer Science - Computation and Language, Low resource, business.industry, Computer science, Communication, lcsh:P98-98.5, computer.software_genre, Machine Learning (cs.LG), Computer Science Applications, Human-Computer Interaction, Artificial Intelligence, Dependency grammar, Labeled data, Artificial intelligence, lcsh:Computational linguistics. Natural language processing, business, Computation and Language (cs.CL), computer, Self training, Natural language processing
Abstract: Neural dependency parsing has proven very effective, achieving state-of-the-art results on numerous domains and languages. Unfortunately, it requires large amounts of labeled data, that is costly and laborious to create. In this paper we propose a self-training algorithm that alleviates this annotation bottleneck by training a parser on its own output. Our Deep Contextualized Self-training (DCST) algorithm utilizes representation models trained on sequence labeling tasks that are derived from the parser's output when applied to unlabeled data, and integrates these models with the base parser through a gating mechanism. We conduct experiments across multiple languages, both in low resource in-domain and in cross-domain setups, and demonstrate that DCST substantially outperforms traditional self-training as well as recent semi-supervised training methods. Comment: Accepted to TACL in September 2019
Published: 2019
Full Text: View/download PDF

21. Development and evaluation of a self-training system for tennis shots with motion feature assessment and visualization

Author: Masaki Oshita, Shigeru Kuriyama, Shunsuke Ineno, Takumi Inao, and Tomohiko Mukai
Subjects: business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Process (computing), 020207 software engineering, Statistical model, 02 engineering and technology, Computer Graphics and Computer-Aided Design, Motion capture, Motion (physics), Visualization, Computer graphics, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, business, Self training, Software
Abstract: In this paper, we propose a prototype of a self-training system for tennis forehand shots that allows trainees to practice their motion forms by themselves. Our system includes a motion capture device to record the trainee’s motion, and the system visualizes the differences between the features of the trainee’s motion and the correct motion performed by an expert. The system enables trainees to understand the errors in their motion and how to reduce or eliminate them. In this study, we classify the motion features and corresponding visualization methods based on the one-dimensional spatial, rotational, and temporal features of key poses. We also develop a statistical model for the motion features so that the system can assess and prioritize all features of a trainee’s motion. Related features are simultaneously visualized by analyzing their correlations. We describe the process of defining the motion features for the tennis forehand shot of an expert. We evaluated our prototype through several user experiments and demonstrated its feasibility as a self-training system.
Published: 2019
Full Text: View/download PDF

22. Granulation-based self-training for the semi-supervised classification of remote-sensing images

Author: Prem Shankar Singh Aydav and Sonajharia Minz
Subjects: 0209 industrial biotechnology, Training set, Computer science, Granular computing, Computational intelligence, 02 engineering and technology, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Classifier (UML), Self training, Information Systems, Remote sensing
Abstract: Collection of quality-labeled training samples in the area of remote sensing is very difficult, costly, time-consuming, and tedious due to various constraints. Classification of remote-sensing images is a challenging task due to the limited availability of quality-labeled samples for the training process. To solve the problem of labeled samples, various semi-supervised techniques have been designed and explored for the classification of remote-sensing images. Self-training is a popular semi-supervised method widely used for training a supervised classifier with limited labeled samples and a large pool of unlabeled samples. However, the traditional self-training approach gives poor performance for the classification of remote-sensing images. The traditional self-training method selects samples only on the basis of the maximum classification probability criterion, which may not improve the classifier accuracy. The effectiveness of classifiers trained in the self-training fashion depends on the selection of correct, diverse, and informative samples for the labeled training set. In this paper, granular computing concepts have been utilized to improve the self-training approach for the classification of remote-sensing images. The proposed approach first groups the unlabeled samples into a number of granules. After that, a supervised classifier is trained with a few labeled samples and the trained classifier is used to select the set of most confident granules. The selected granules help to add high-quality samples to the labeled set for the effective training of the classifiers. The experimental results with three benchmark remote-sensing data sets show that the proposed method improves the classification accuracy.
Published: 2019
Full Text: View/download PDF

23. The First Step towards Automatic Quality Evaluation of Chinese Vowel Pronunciations for Foreign Learners for Self-training

Author: Junya Shinzawa, Jinhua She, Hiroyuki Kameda, Sumio Ohno, and Shumei Chen
Subjects: business.industry, Computer science, media_common.quotation_subject, computer.software_genre, Computer Science Applications, Education, Vowel, Quality (business), Artificial intelligence, business, computer, Self training, Natural language processing, media_common
Published: 2019
Full Text: View/download PDF

24. CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning

Author: Alan L. Yuille, Fan Yang, Chen Wei, Kihyuk Sohn, and Clayton Mellina
Subjects: FOS: Computer and information sciences, Class (computer programming), Computer science, Property (programming), business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Semi-supervised learning, Machine learning, computer.software_genre, Set (abstract data type), Pattern recognition (psychology), Code (cryptography), Crest, Artificial intelligence, business, Self training, computer
Abstract: Semi-supervised learning on class-imbalanced data, although a realistic problem, has been understudied. While existing semi-supervised learning (SSL) methods are known to perform poorly on minority classes, we find that they still generate high-precision pseudo-labels on minority classes. By exploiting this property, in this work, we propose Class-Rebalancing Self-Training (CReST), a simple yet effective framework to improve existing SSL methods on class-imbalanced data. CReST iteratively retrains a baseline SSL model with a labeled set expanded by adding pseudo-labeled samples from an unlabeled set, where pseudo-labeled samples from minority classes are selected more frequently according to an estimated class distribution. We also propose a progressive distribution alignment to adaptively adjust the rebalancing strength, dubbed CReST+. We show that CReST and CReST+ improve state-of-the-art SSL algorithms on various class-imbalanced datasets and consistently outperform other popular rebalancing methods. Code has been made available at https://github.com/google-research/crest. Comment: To appear in CVPR 2021.
Published: 2021
Full Text: View/download PDF
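
Illustrative sketch for record 24 (not the authors' code): the abstract says that pseudo-labeled samples from minority classes are selected more frequently according to an estimated class distribution, but it does not state the rule. The Python sketch below assumes one plausible rule, a mirrored-frequency keep rate, in which the rarest class is always kept and the most frequent class is kept least often. The function name `class_sampling_rates` and the exponent `alpha` are hypothetical names introduced only for this illustration.

```python
def class_sampling_rates(class_counts, alpha=1.0):
    """Return a keep rate in (0, 1] for pseudo-labels of each class.

    Assumed rule (not taken from the abstract): sort classes by estimated
    frequency, then give the class at rank r the relative frequency of the
    class at the mirrored rank, raised to the power alpha.
    """
    num_classes = len(class_counts)
    # Class indices ordered from most frequent to least frequent.
    order = sorted(range(num_classes), key=lambda c: class_counts[c], reverse=True)
    largest = class_counts[order[0]]
    rates = [0.0] * num_classes
    for rank, cls in enumerate(order):
        mirrored_cls = order[num_classes - 1 - rank]
        rates[cls] = (class_counts[mirrored_cls] / largest) ** alpha
    return rates


# Example: three classes with estimated counts 100, 10, and 1.
rates = class_sampling_rates([100, 10, 1])
```

With these rates, a pseudo-labeled sample predicted as class c would be added to the labeled set with probability rates[c]; a smaller alpha flattens the rates toward 1 and so weakens the rebalancing.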

25. Can self-training identify suspicious ugly duckling lesions?

Author: Arash Koochek, Jordan Yap, M. Stella Atkins, Mohammadreza Mohseni, and William Yolland
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Feature extraction, Computer Science - Computer Vision and Pattern Recognition, Diagnostic accuracy, Pattern recognition, Test set, Outlier, Screening method, Artificial intelligence, Skin lesion, business, Self training
Abstract: One commonly used clinical approach towards detecting melanomas recognises the existence of Ugly Duckling nevi, or skin lesions which look different from the other lesions on the same patient. An automatic method of detecting and analysing these lesions would help to standardize studies, compared with manual screening methods. However, it is difficult to obtain expertly-labelled images for ugly duckling lesions. We therefore propose to use self-supervised machine learning to automatically detect outlier lesions. We first automatically detect and extract all the lesions from a wide-field skin image, and calculate an embedding for each detected lesion in a patient image, based on automatically identified features. These embeddings are then used to calculate the L2 distances as a way to measure dissimilarity. Using this deep learning method, Ugly Ducklings are identified as outliers which should deserve more attention from the examining physician. We evaluate through comparison with dermatologists, and achieve a sensitivity rate of 72.1% and diagnostic accuracy of 94.2% on the held-out test set. Comment: Accepted at Sixth ISIC Skin Image Analysis Workshop @ CVPR 2021
Published: 2021
Full Text: View/download PDF
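
Illustrative sketch for record 25 (not the authors' code): the abstract describes computing an embedding per lesion and using L2 distances between embeddings to find outliers. The Python sketch below shows one simple way to turn that idea into a ranking, assuming that a lesion's outlier score is its mean L2 distance to the other lesions of the same patient. The scoring rule and the names `ugly_duckling_scores` and `rank_outliers` are assumptions made for this illustration; the embeddings are toy vectors.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ugly_duckling_scores(embeddings):
    """Mean L2 distance from each lesion embedding to all the others."""
    scores = []
    for i, emb in enumerate(embeddings):
        dists = [l2_distance(emb, other)
                 for j, other in enumerate(embeddings) if j != i]
        scores.append(sum(dists) / len(dists) if dists else 0.0)
    return scores


def rank_outliers(embeddings):
    """Lesion indices ordered from most to least dissimilar."""
    scores = ugly_duckling_scores(embeddings)
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)


# Toy example: three similar lesions and one that looks different.
lesions = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
ranking = rank_outliers(lesions)
```

In this toy example the fourth lesion (index 3) heads the ranking, which is the kind of candidate the abstract says deserves more attention from the examining physician.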

26. Interactive Self-Training with Mean Teachers for Semi-supervised Object Detection

Author: Lei Zhang, Qize Yang, Biao Wang, Xihan Wei, and Xian-Sheng Hua
Subjects: Data labeling, Consistency (database systems), Computer science, business.industry, Fuse (electrical), Labeled data, Pattern recognition, Artificial intelligence, business, Regularization (mathematics), Self training, Object detection, Image (mathematics)
Abstract: The goal of semi-supervised object detection is to learn a detection model using only a few labeled data and large amounts of unlabeled data, thereby reducing the cost of data labeling. Although a few studies have proposed various self-training-based methods or consistency regularization-based methods, they ignore the discrepancies among the detection results in the same image that occur during different training iterations. Additionally, the predicted detection results vary among different detection models. In this paper, we propose an interactive form of self-training using mean teachers for semi-supervised object detection. Specifically, to alleviate the instability among the detection results in different iterations, we propose using nonmaximum suppression to fuse the detection results from different iterations. Simultaneously, we use multiple detection heads that predict pseudo labels for each other to provide complementary information. Furthermore, to avoid different detection heads collapsing to each other, we use a mean teacher model instead of the original detection model to predict the pseudo labels. Thus, the object detection model can be trained on both labeled and unlabeled data. Extensive experimental results verify the effectiveness of our proposed method.
Published: 2021
Full Text: View/download PDF
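
Illustrative sketch for record 26 (not the authors' code): the abstract uses a mean teacher model to predict pseudo labels. In the usual mean teacher formulation, the teacher's weights are an exponential moving average (EMA) of the student's weights. The Python sketch below shows that update on a flat list of weights; the function name `ema_update` and the default `decay` value are assumptions for this illustration, and the abstract does not say which decay the authors use.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    new_teacher = decay * teacher + (1 - decay) * student, element-wise.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


# Toy example: the teacher drifts slowly toward a fixed student.
teacher = [1.0, 1.0]
student = [0.0, 2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher changes more slowly than the student, its predictions tend to be more stable between training iterations, which matches the stability concern raised in the abstract.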

27. Self‐training with one‐shot stepwise learning method for person re‐identification

Author: Linna Wang, Haojie Liu, Daoxun Xia, Jiawen Li, and Lili Xu
Subjects: One shot, Computer Networks and Communications, Computer science, business.industry, Semi-supervised learning, One-shot learning, Machine learning, computer.software_genre, Re identification, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Learning methods, Artificial intelligence, business, Self training, computer, Software
Published: 2021
Full Text: View/download PDF

28. Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data

Author: Sai Krishna Rallabandi, Alan W. Black, Akshat Gupta, and Sargam Menghani
Subjects: FOS: Computer and information sciences, Class (computer programming), Computer Science - Machine Learning, Computer Science - Computation and Language, Computer science, business.industry, Customer reviews, Sentiment analysis, Initialization, Machine learning, computer.software_genre, Machine Learning (cs.LG), Task (project management), Code (cryptography), Social media, Artificial intelligence, business, Computation and Language (cs.CL), Self training, computer
Abstract: Sentiment analysis is an important task in understanding social media content like customer reviews, Twitter and Facebook feeds, etc. In multilingual communities around the world, a large amount of social media text is characterized by the presence of Code-Switching. Thus, it has become important to build models that can handle code-switched data. However, annotated code-switched data is scarce and there is a need for unsupervised models and algorithms. We propose a general framework called Unsupervised Self-Training and show its applications for the specific use case of sentiment analysis of code-switched data. We use the power of pre-trained BERT models for initialization and fine-tune them in an unsupervised manner, only using pseudo labels produced by zero-shot transfer. We test our algorithm on multiple code-switched languages and provide a detailed analysis of the learning dynamics of the algorithm with the aim of answering the question "Does our unsupervised model understand the Code-Switched languages or does it just learn its representations?". Our unsupervised models compete well with their supervised counterparts, with their performance reaching within 1-7% (weighted F1 scores) when compared to supervised models trained for a two class problem.
Published: 2021

29. Rank-based self-training for graph convolutional networks

Author: Daniel Carlos Guimarães Pedronette and Longin Jan Latecki
Subjects: Exploit, Computer science, business.industry, Rank model, 02 engineering and technology, Semi-supervised learning, Library and Information Sciences, Management Science and Operations Research, Machine learning, computer.software_genre, Graph, Computer Science Applications, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Self-training, 020201 artificial intelligence & image processing, Artificial intelligence, Graph convolutional networks, business, Self training, computer, Feature learning, Information Systems
Abstract: Graph Convolutional Networks (GCNs) have been established as a fundamental approach for representation learning on graphs, based on convolution operations on non-Euclidean domain, defined by graph-structured data. GCNs and variants have achieved state-of-the-art results on classification tasks, especially in semi-supervised learning scenarios. A central challenge in semi-supervised classification consists in how to exploit the maximum of useful information encoded in the unlabeled data. In this paper, we address this issue through a novel self-training approach for improving the accuracy of GCNs on semi-supervised classification tasks. A margin score is used through a rank-based model to identify the most confident sample predictions. Such predictions are exploited as an expanded labeled set in a second-stage training step. Our model is suitable for different GCN models. Moreover, we also propose a rank aggregation of labeled sets obtained by different GCN models. The experimental evaluation considers four GCN variations and traditional benchmarks extensively used in the literature. Significant accuracy gains were achieved for all evaluated models, reaching results comparable or superior to the state-of-the-art. The best results were achieved for rank aggregation self-training on combinations of the four GCN models. Affiliations: Department of Statistics, Applied Mathematics and Computing (DEMAC), São Paulo State University (UNESP); Department of Computer and Information Sciences, Temple University. Funding: Microsoft Research; Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) #2017/25908-6 and #2018/15597-6; Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) #308194/2017-9; National Science Foundation IIS-1814745.
Published: 2021
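
Illustrative sketch for record 29 (not the authors' code): the abstract ranks predictions by a margin score and uses the most confident ones as an expanded labeled set. It does not define the margin, so the Python sketch below assumes a common definition, the gap between the two highest class probabilities. The names `margin_score` and `select_confident` and the cutoff `k` are hypothetical and introduced only for this illustration.

```python
def margin_score(probs):
    """Gap between the two largest class probabilities (assumed definition)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_confident(pred_probs, k):
    """Return (node_index, predicted_class) for the k highest-margin nodes."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: margin_score(pred_probs[i]),
                    reverse=True)
    selected = []
    for i in ranked[:k]:
        probs = pred_probs[i]
        predicted = max(range(len(probs)), key=probs.__getitem__)
        selected.append((i, predicted))
    return selected


# Toy example: class probabilities for three unlabeled nodes.
predictions = [[0.5, 0.5], [0.9, 0.1], [0.3, 0.7]]
pseudo_labeled = select_confident(predictions, k=2)
```

Here the node with probabilities [0.5, 0.5] has zero margin and is left out, while the other two nodes would join the labeled set for the second-stage training step the abstract describes.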

30. SelfHAR: Improving Human Activity Recognition through Self-training with Unlabeled Data

Author: Chi Ian Tang, Dimitris Spathis, Ignacio Perez-Pozuelo, Cecilia Mascolo, Soren Brage, and Nicholas J. Wareham
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Networks and Communications, Computer science, cs.LG, Inference, 02 engineering and technology, Machine learning, computer.software_genre, Machine Learning (cs.LG), Set (abstract data type), Activity recognition, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Complement (set theory), business.industry, Deep learning, 020206 networking & telecommunications, Human-Computer Interaction, ComputingMethodologies_PATTERNRECOGNITION, Hardware and Architecture, 020201 artificial intelligence & image processing, Artificial intelligence, F1 score, business, computer, Self training
Abstract: Machine learning and deep learning have shown great promise in mobile sensing applications, including Human Activity Recognition. However, the performance of such models in real-world settings largely depends on the availability of large datasets that capture diverse behaviors. Recently, studies in computer vision and natural language processing have shown that leveraging massive amounts of unlabeled data enables performance on par with state-of-the-art supervised models. In this work, we present SelfHAR, a semi-supervised model that effectively learns to leverage unlabeled mobile sensing datasets to complement small labeled datasets. Our approach combines teacher-student self-training, which distills the knowledge of unlabeled and labeled datasets while allowing for data augmentation, and multi-task self-supervision, which learns robust signal-level representations by predicting distorted versions of the input. We evaluated SelfHAR on various HAR datasets and showed state-of-the-art performance over supervised and previous semi-supervised approaches, with up to 12% increase in F1 score using the same number of model parameters at inference. Furthermore, SelfHAR is data-efficient, reaching similar performance using up to 10 times less labeled data compared to supervised approaches. Our work not only achieves state-of-the-art performance in a diverse set of HAR datasets, but also sheds light on how pre-training tasks may affect downstream performance. Comment: Accepted for publication in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 2021
Published: 2021
Full Text: View/download PDF

31. Constrained Spectral Clustering Network with Self-Training

Author: Xinyue Liu, Linlin Zong, and Shichong Yang
Subjects: business.industry, Computer science, 020207 software engineering, 02 engineering and technology, computer.software_genre, Spectral clustering, ComputingMethodologies_PATTERNRECOGNITION, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Benchmark (computing), Cluster (physics), 020201 artificial intelligence & image processing, Pairwise comparison, Artificial intelligence, Data mining, business, Cluster analysis, Feature learning, Self training, computer
Abstract: Deep spectral clustering networks have shown their superiority due to the integration of feature learning and cluster assignment, and the ability to deal with non-convex clusters. Nevertheless, deep spectral clustering is still an ill-posed problem. Specifically, the affinity learned by the most remarkable SpectralNet is not guaranteed to be consistent with local invariance and thus hurts the final clustering performance. In this paper, we propose a novel framework of Constrained Spectral Clustering Network (CSCN) by incorporating pairwise constraints and clustering-oriented fine-tuning to deal with the ill-posedness. To the best of our knowledge, this is the first constrained deep spectral clustering method. Another advantage of CSCN over existing constrained deep clustering networks is that it propagates pairwise constraints throughout the entire dataset. In addition, we design a clustering-oriented loss by self-training to simultaneously fine-tune feature representations and perform cluster assignments, which further improves the quality of clustering. Extensive experiments on benchmark datasets demonstrate that our approach outperforms the state-of-the-art clustering methods.
Published: 2021
Full Text: View/download PDF

32. STRUDEL: Self-training with Uncertainty Dependent Label Refinement Across Domains

Author: Christian Wachinger, Fabian Gröger, and Anne-Marie Rickmann
Subjects: Domain adaptation, White matter hyperintensity, Robustness (computer science), business.industry, Computer science, Process (computing), Pattern recognition, Segmentation, Artificial intelligence, Function (mathematics), business, Self training
Abstract: We propose an unsupervised domain adaptation (UDA) approach for white matter hyperintensity (WMH) segmentation, which uses Self-TRaining with Uncertainty DEpendent Label refinement (STRUDEL). Self-training has recently been introduced as a highly effective method for UDA, which is based on self-generated pseudo labels. However, pseudo labels can be very noisy and therefore deteriorate model performance. We propose to predict the uncertainty of pseudo labels and integrate it in the training process with an uncertainty-guided loss function to highlight labels with high certainty. STRUDEL is further improved by incorporating the segmentation output of an existing method in the pseudo label generation that showed high robustness for WMH segmentation. In our experiments, we evaluate STRUDEL with a standard U-Net and a modified network with a higher receptive field. Our results on WMH segmentation across datasets demonstrate the significant improvement of STRUDEL with respect to standard self-training.
Published: 2021
Full Text: View/download PDF

33. Semi-supervised Anatomical Landmark Detection via Shape-regulated Self-training

Author: Guodong Wei, Lingjie Liu, Runnan Chen, Nenglun Chen, Zhiming Cui, Wenping Wang, and Yuexin Ma
Subjects: Structure (mathematical logic), FOS: Computer and information sciences, Landmark, Computer science, Property (programming), business.industry, Cognitive Neuroscience, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Shape constraint, Computer Science Applications, Image (mathematics), Anatomical landmark, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, Artificial intelligence, Focus (optics), business, Self training
Abstract: Well-annotated medical images are costly and sometimes even impossible to acquire, hindering landmark detection accuracy to some extent. Semi-supervised learning alleviates the reliance on large-scale annotated data by exploiting the unlabeled data to understand the population structure of anatomical landmarks. The global shape constraint is the inherent property of anatomical landmarks that provides valuable guidance for more consistent pseudo labelling of the unlabeled data, which is ignored in previous semi-supervised methods. In this paper, we propose a model-agnostic shape-regulated self-training framework for semi-supervised landmark detection by fully considering the global shape constraint. Specifically, to ensure pseudo labels are reliable and consistent, a PCA-based shape model adjusts pseudo labels and eliminates abnormal ones. A novel Region Attention loss makes the network automatically focus on the structure-consistent regions around pseudo labels. Extensive experiments show that our approach outperforms other semi-supervised methods and achieves notable improvement on three medical image datasets. Moreover, our framework is flexible and can be used as a plug-and-play module integrated into most supervised methods to improve performance further. Comment: Accepted to Neurocomputing
Published: 2021
Full Text: View/download PDF

34. Hardness Sampling for Self-Training Based Transductive Zero-Shot Learning

Author: Liu Bo, Qiulei Dong, and Zhanyi Hu
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Open problem, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Process (computing), Sampling (statistics), Approximation algorithm, Machine learning, computer.software_genre, Class (biology), Task (project management), Domain (software engineering), Artificial intelligence, business, Self training, computer
Abstract: Transductive zero-shot learning (T-ZSL), which can alleviate the domain shift problem in existing ZSL works, has received much attention recently. However, an open problem in T-ZSL remains: how to make effective use of unseen-class samples for training. Addressing this problem, we first empirically analyze the roles of unseen-class samples with different degrees of hardness in the training process based on the uneven prediction phenomenon found in many ZSL methods, resulting in three observations. Then, we propose two hardness sampling approaches for selecting a subset of diverse and hard samples from a given unseen-class dataset according to these observations. The first one identifies the samples based on the class-level frequency of the model predictions, while the second enhances the former by normalizing the class frequency via an approximate class prior estimated by an explored prior estimation algorithm. Finally, we design a new Self-Training framework with Hardness Sampling for T-ZSL, called STHS, where an arbitrary inductive ZSL method can be seamlessly embedded and iteratively trained with unseen-class samples selected by the hardness sampling approach. We introduce two typical ZSL methods into the STHS framework, and extensive experiments demonstrate that the derived T-ZSL methods outperform many state-of-the-art methods on three public benchmarks. Besides, we note that the unseen-class dataset is separately used for training in some existing transductive generalized ZSL (T-GZSL) methods, which is not strict for a GZSL task. Hence, we suggest a stricter T-GZSL data setting and establish a competitive baseline on this setting by introducing the proposed STHS framework to T-GZSL. Comment: 11 pages, 4 figures
Published: 2021
Full Text: View/download PDF

35. Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling

Author: Muhammad Khalifa, Muhammad Abdul-Mageed, and Khaled Shaalan
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Artificial Intelligence, business.industry, Arabic, Computer science, Computer Science - Neural and Evolutionary Computing, Context (language use), computer.software_genre, Sequence labeling, language.human_language, Zero (linguistics), Artificial Intelligence (cs.AI), Modern Standard Arabic, language, Labeled data, Neural and Evolutionary Computing (cs.NE), Language model, Artificial intelligence, business, Computation and Language (cs.CL), computer, Self training, Natural language processing
Abstract: A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, obtaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as ~10% F1 (NER) and 2% accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from the unlabeled DA examples used for self-training. Our work opens up opportunities for developing DA models exploiting only MSA resources and it can be extended to other languages and tasks. Our code and fine-tuned models can be accessed at https://github.com/mohammadKhalifa/zero-shot-arabic-dialects. Comment: Accepted at EACL 2021 (Camera Ready Version)
Published: 2021
Full Text: View/download PDF

36. Enhanced Back-Translation for Low Resource Neural Machine Translation Using Self-training

Author: Abubakar Isa, Bashir Shehu Galadanci, and Idris Abdulmumin
Subjects: Machine translation, Low resource, business.industry, Computer science, Back translation, 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, Translation (geometry), 01 natural sciences, Synthetic data, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Self training, 0105 earth and related environmental sciences
Abstract: Improving neural machine translation (NMT) models using the back-translations of the monolingual target data (synthetic parallel data) is currently the state-of-the-art approach for training improved translation systems. The quality of the backward system – which is trained on the available parallel data and used for the back-translation – has been shown in many studies to affect the performance of the final NMT model. In low resource conditions, the available parallel data is usually not enough to train a backward model that can produce the high-quality synthetic data needed to train a standard translation model. This work proposes a self-training strategy where the output of the backward model is used to improve the model itself through the forward translation technique. The technique was shown to improve baseline low resource IWSLT’14 English-German and IWSLT’15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU respectively. The synthetic data generated by the improved English-German backward model was used to train a forward model which outperformed another forward model trained using standard back-translation by 2.7 BLEU.
Published: 2021
Full Text: View/download PDF

37. Incorporate Lexicon into Self-training: A Distantly Supervised Chinese Medical NER

Author: Shengping Liu, Zhen Gan, Baoli Zhang, Kang Liu, Yafei Shi, Zhucong Li, Jing Wan, Yubo Chen, and Jun Zhao
Subjects: Recall, Computer science, business.industry, computer.software_genre, Lexicon, Ranking (information retrieval), Annotation, ComputingMethodologies_PATTERNRECOGNITION, Named-entity recognition, Benchmark (computing), Artificial intelligence, business, computer, Self training, Natural language processing
Abstract: Medical named entity recognition (NER) tasks usually lack sufficient annotation data. Distant supervision is often used to alleviate this problem, which can quickly and automatically generate annotated training datasets through dictionaries. However, the current distantly supervised method suffers from noisy labeling due to limited coverage of the dictionary, which leaves a large number of entities unlabeled. We call this phenomenon an incomplete annotation problem. To tackle the incomplete annotation problem, we propose a novel distantly supervised method for Chinese medical NER. Specifically, we propose a high recall self-training mechanism to recall potential unlabeled entities in the distant supervision dataset. To reduce error in the high recall self-training, we propose a fine-grained lexicon enhanced scoring and ranking mechanism. Our method improves by 3.2% and 5.03% over the baseline models on the dataset we proposed and on a benchmark dataset for Chinese medical NER, respectively.
Published: 2021
Full Text: View/download PDF

38. Integrating Semantic-Space Finetuning and Self-Training for Semi-Supervised Multi-label Text Classification

Author: Mizuho Iwaihara and Zhewei Xu
Subjects: Computer science, business.industry, Semantic space, Artificial intelligence, business, computer.software_genre, Self training, computer, Natural language processing
Published: 2021
Full Text: View/download PDF

39. Self-training vs Pre-trained Embeddings for Automatic Essay Scoring

Author: Xiaochao Fan, Liang Yang, Hongfei Lin, Yong Yang, Xianbing Zhou, and Ge Ren
Subjects: Computer science, business.industry, Relevance (information retrieval), Language model, Artificial intelligence, computer.software_genre, business, Self training, computer, Natural language processing, Word (computer architecture), Task (project management), Effective solution
Abstract: People usually believe that using pre-trained word vectors or pre-trained language models can effectively improve task performance. But that is not the case. A sufficient amount of annotated data is usually required to fine-tune the pre-trained language model and pre-trained word vectors for downstream tasks. In addition, the relevance of the training corpus and task corpus also affects task performance to a large extent. In this paper, we systematically compared the effects of different types of pre-trained embeddings and self-training embeddings on the performance of AES. At the same time, we propose an effective solution to the above problem, an automatic essay scoring method that includes pre-trained and self-training word embeddings. We conducted experiments on a publicly available dataset, including 8 subsets, and the experimental results show the effectiveness of this method.
Published: 2021
Full Text: View/download PDF

40. An Annotation Sparsification Strategy for 3D Medical Image Segmentation via Representative Selection and Self-Training

Author: Chaoli Wang, Lin Yang, Danny Z. Chen, Hao Zheng, and Yizhe Zhang
Subjects: 020203 distributed computing, business.industry, Computer science, Deep learning, Pattern recognition, 02 engineering and technology, General Medicine, Image segmentation, Article, Annotation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Segmentation, Artificial intelligence, business, Self training, Selection (genetic algorithm)
Abstract: Image segmentation is critical to many medical applications. While deep learning (DL) methods continue to improve performance for many medical image segmentation tasks, data annotation is a big bottleneck to DL-based segmentation because (1) DL models tend to need a large amount of labeled data to train, and (2) it is highly time-consuming and label-intensive to voxel-wise label 3D medical images. Significantly reducing annotation effort while attaining good performance of DL segmentation models remains a major challenge. In our preliminary experiments, we observe that, using partially labeled datasets, there is indeed a large performance gap with respect to using fully annotated training datasets. In this paper, we propose a new DL framework for reducing annotation effort and bridging the gap between full annotation and sparse annotation in 3D medical image segmentation. We achieve this by (i) selecting representative slices in 3D images that minimize data redundancy and save annotation effort, and (ii) self-training with pseudo-labels automatically generated from the base-models trained using the selected annotated slices. Extensive experiments using two public datasets (the HVSMR 2016 Challenge dataset and mouse piriform cortex dataset) show that our framework yields competitive segmentation results compared with state-of-the-art DL methods using less than ∼20% of annotated data.
Published: 2020

41. Semi-Supervised ASR by End-to-End Self-Training

Author: Chao Wang, Weiran Wang, and Yang Chen
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Speech recognition, 02 engineering and technology, Performance gap, Computer Science - Sound, Oracle, Machine Learning (cs.LG), Connectionism, End-to-end principle, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computer Science - Computation and Language, business.industry, Deep learning, 020206 networking & telecommunications, ComputingMethodologies_PATTERNRECOGNITION, Artificial intelligence, business, Computation and Language (cs.CL), Self training, Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While deep learning based end-to-end automatic speech recognition (ASR) systems have greatly simplified modeling pipelines, they suffer from the data sparsity issue. In this work, we propose a self-training method with an end-to-end system for semi-supervised ASR. Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update. Our method retains the simplicity of end-to-end ASR systems, and can be seen as performing alternating optimization over a well-defined learning objective. We also perform empirical investigations of our method, regarding the effect of data augmentation, decoding beamsize for pseudo-label generation, and freshness of pseudo-labels. On a commonly used semi-supervised ASR setting with the WSJ corpus, our method gives 14.4% relative WER improvement over a carefully-trained base system with data augmentation, reducing the performance gap between the base system and the oracle system by 50%. Comment: Accepted by Interspeech 2020
Published: 2020
Full Text: View/download PDF

42. Computer-Assisted Self-Training for Kyudo Posture Rectification Using Computer Vision Methods

Author: Wardah Farrukh and Dustin van der Haar
Subjects: Structure (mathematical logic), Support vector machine, Similarity (geometry), Rectification, Computer science, business.industry, Line (geometry), Computer vision, Artificial intelligence, business, Convolutional neural network, Self training
Abstract: To some individuals, particularly archery students, perfecting the art of Kyudo is of utmost importance. These devoted students are always trying to correct their posture because it plays a significant role in effectively shooting at the target. However, due to the lack of attention from instructors, students are often forced to train on their own without any guidance. It is difficult for students to analyze their own faults because the shoulders, hips, and feet should be in line with one another, parallel to the floor and straight to the target. The proposed solution is, therefore, a system that aims to assist students in correcting their posture. The system will classify the technique presented by the user and, using PoseNet, output coordinates and draw a skeleton structure of the user’s technique along with the instructor’s technique. The coordinates will then be measured for similarity and appropriate feedback is provided to the user. The classification results using CNN and SVM showed an accuracy of 81.25% and 80.2%, respectively. The results indicate the feasibility of the approach; however, improvement is required in certain areas. Recommendations for improving the approach are discussed.
Published: 2020
Full Text: View/download PDF

43. Developing Sustainable Classification of Diseases via Deep Learning and Semi-Supervised Learning

Author: Chunwu Yin and Zhanbo Chen
Subjects: semi-supervised learning, Leadership and Management, Computer science, education, disease classification, lcsh:Medicine, Health Informatics, 02 engineering and technology, Semi-supervised learning, Machine learning, computer.software_genre, Article, 03 medical and health sciences, Health Information Management, Robustness (computer science), self-training, 0202 electrical engineering, electronic engineering, information engineering, 030304 developmental biology, Hyperparameter, 0303 health sciences, Training set, business.industry, Health Policy, Deep learning, Supervised learning, lcsh:R, Disease classification, deep learning, ComputingMethodologies_PATTERNRECOGNITION, 020201 artificial intelligence & image processing, Artificial intelligence, business, Self training, computer
Abstract: Disease classification based on machine learning has become a crucial research topic in the fields of genetics and molecular biology. Generally, disease classification involves a supervised learning style, i.e., it requires a large number of labelled samples to achieve good classification performance. However, in the majority of cases, labelled samples are hard to obtain, so the amount of training data is limited. Nevertheless, many unclassified (unlabelled) sequences have been deposited in public databases, which may help the training procedure. Learning from both labelled and unlabelled samples is called semi-supervised learning and is very useful in many applications. Self-training can be implemented using high- to low-confidence samples to prevent noisy samples from affecting the robustness of semi-supervised learning in the training process. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel approach that combines a deep learning model with semi-supervised self-training to improve the performance in disease classification, which utilizes unlabelled samples to update a mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model can achieve good performance in disease classification and disease-causing gene identification.
Published: 2020

44. Facial Action Unit Recognition in the Wild with Multi-Task CNN Self-Training for the EmotioNet Challenge

Author: Frerk Saxen, Philipp Werner, and Ayoub Al-Hamadi
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Training set, Action (philosophy), Computer science, business.industry, Speech recognition, Evaluation data, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Artificial intelligence, Performance gap, business, Self training, Task (project management)
Abstract: Automatic understanding of facial behavior is hampered by factors such as occlusion, illumination, non-frontal head pose, low image resolution, or limitations in labeled training data. The EmotioNet 2020 Challenge addresses these issues through a competition on recognizing facial action units on in-the-wild data. We propose to combine multi-task learning and self-training to make the best use of the small manually / fully labeled and the large weakly / partially labeled training datasets provided by the challenge organizers. With our approach (and without using additional data) we achieve second place in the 2020 challenge, with a performance gap of only 0.05% to the challenge winner and of 5.9% to the third place. On the 2018 challenge evaluation data our method outperforms all other known results.
Published: 2020
Full Text: View/download PDF

45. Robust Semi-Supervised Traffic Sign Recognition via Self-Training and Weakly-Supervised Learning

Author: Guowu Yang, Jinzhao Wu, Lady Nadia Frempong, Obed Tettey Nartey, and Sarpong Kwadwo Asare
Subjects: semi-supervised learning, Computer science, 02 engineering and technology, Semi-supervised learning, Machine learning, computer.software_genre, lcsh:Chemical technology, Biochemistry, Article, Analytical Chemistry, traffic sign recognition, self-training, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, Traffic sign recognition, lcsh:TP1-1185, self-paced learning, Electrical and Electronic Engineering, Instrumentation, 050210 logistics & transportation, Training set, business.industry, Deep learning, 05 social sciences, Supervised learning, Atomic and Molecular Physics, and Optics, ComputingMethodologies_PATTERNRECOGNITION, weakly-supervised learning, 020201 artificial intelligence & image processing, Artificial intelligence, business, deep convolutional neural networks, Classifier (UML), Self training, computer
Abstract: Traffic sign recognition is a classification problem that poses challenges for computer vision and machine learning algorithms. Although both computer vision and machine learning techniques have constantly been improved to solve this problem, the sudden rise in the number of unlabeled traffic signs has made it even more challenging. Large data collation and labeling are tedious and expensive tasks that demand much time, expert knowledge, and fiscal resources to satisfy the hunger of deep neural networks. Aside from that, unbalanced data makes it harder still for computer vision and machine learning algorithms to achieve better performance. These problems raise the need to develop algorithms that can fully exploit a large amount of unlabeled data, use a small amount of labeled samples, and be robust to data imbalance to build an efficient and high-quality classifier. In this work, we propose a novel semi-supervised classification technique that is robust to small and unbalanced data. The framework integrates weakly-supervised learning and self-training with self-paced learning to generate attention maps to augment the training set and utilizes a novel pseudo-label generation and selection algorithm to generate and select pseudo-labeled samples. The method improves the performance by: (1) normalizing the class-wise confidence levels to prevent the model from ignoring hard-to-learn samples, thereby solving the imbalanced data problem, (2) jointly learning a model and optimizing pseudo-labels generated on unlabeled data, and (3) enlarging the training set to satisfy the hunger of deep learning models. Extensive evaluations on two public traffic sign recognition datasets demonstrate the effectiveness of the proposed technique and provide a potential solution for practical applications.
Published: 2020

46. Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios

Author: Eiichiro Sumita, Haipeng Sun, Tiejun Zhao, Rui Wang, Kehai Chen, and Masao Utiyama
Subjects: FOS: Computer and information sciences, Training set, Computer Science - Computation and Language, Machine translation, Computer science, business.industry, Training (meteorology), computer.software_genre, Translation (geometry), Estonian, language.human_language, 030507 speech-language pathology & audiology, 03 medical and health sciences, ComputingMethodologies_PATTERNRECOGNITION, language, Artificial intelligence, 0305 other medical science, business, computer, Self training, Computation and Language (cs.CL), Natural language processing
Abstract: Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when there is no adequate training corpus for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this case. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems. Comment: Accepted by NAACL 2021
Published: 2020

47. Self-training Pupil Detection Based on Mouse Click Calibration for Eye-Tracking Under Low Resolution

Author: Tsuyoshi Usagawa and Chenyang Zheng
Subjects: Computer science, business.industry, Low resolution, 05 social sciences, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Gaze, Pupil, InformationSystems_MODELSANDPRINCIPLES, Position (vector), Personal computer, 0202 electrical engineering, electronic engineering, information engineering, Calibration, Eye tracking, 0501 psychology and cognitive sciences, Computer vision, Artificial intelligence, business, Self training, 050107 human factors
Abstract: Pupil detection is an indispensable step in eye-tracking; however, there are many difficulties in locating the pupil under low-resolution environments. This study proposes a self-training method of pupil detection for eye-tracking under low resolution. The system targets a typical application scenario, the use of a personal computer; a webcam with a resolution of $640 \times 480$ is used in our study. We first generate an initial pupil pattern by verifying the color intensity of the eye regions and estimate the gaze according to the matched pupil position in the further frames. After that, we utilize the computer users' actions of mouse clicks to calibrate the estimation result and annotate the position of the pupil to train the weight map of the pupil pattern. The experimental result shows that the proposed method could minimize the effect of bright spots; moreover, the precision of pupil detection and eye-tracking is increased. The proposed method could potentially be used in light-weight applications of human-computer interaction.
Published: 2020
Full Text: View/download PDF

48. Semi-Supervised Meta-Learning via Self-Training

Author: Cai Nengbin, Meng Zhou, Yaoyi Li, Hongtao Lu, and Zhao Xuejun
Subjects: Computer science, business.industry, 05 social sciences, 010501 environmental sciences, Machine learning, computer.software_genre, 01 natural sciences, ComputingMethodologies_PATTERNRECOGNITION, 0502 economics and business, Labeled data, Artificial intelligence, 050207 economics, business, Self training, Classifier (UML), computer, 0105 earth and related environmental sciences
Abstract: The goal of meta-learning is to learn a learning procedure to generate a learner from only a handful of labeled datapoints. However, the learning procedure is learned by a meta-learner from an enormous amount of few-shot tasks constructed from a large amount of labeled datapoints. From this point of view, few-shot learning also depends on a huge amount of labeled data. In this paper, we present a simple and efficient method for few-shot classification in a semi-supervised setting where only a small portion of training samples are labeled. We assign these unlabeled data pseudo-labels using a classifier trained with both labeled and unlabeled data. Once the pseudo-labels are obtained, we can run meta-learning over tasks constructed from labeled and pseudo-labeled data. We evaluate our method on the miniImagenet and tieredImagenet benchmarks, whose meta-training sets are further split into an unlabeled portion and a labeled portion to fit our framework. Our experimental results confirm that our semi-supervised meta-learning approach acquires a considerable performance gain over meta-learning with only labeled data and significantly outperforms previous state-of-the-art semi-supervised meta-learning methods.
Published: 2020
Full Text: View/download PDF

49. A Distance-Weighted Selection of Unlabelled Instances for Self-training and Co-training Semi-supervised Methods

Author: Anne M. P. Canuto, Arthur C. Gorgonio, João C. Xavier-Júnior, and Cephas A. S. Barreto
Subjects: Co-training, Computer science, business.industry, Process (engineering), 02 engineering and technology, 021001 nanoscience & nanotechnology, Machine learning, computer.software_genre, Set (abstract data type), ComputingMethodologies_PATTERNRECOGNITION, Labelling, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, 0210 nano-technology, Selection criterion, business, Self training, computer, Selection (genetic algorithm)
Abstract: The use of Semi-supervised Learning (SSL) methods has emerged as an efficient solution to smooth out the problem of availability of labelled instances. Several methods have been proposed in the literature, of which Self-training and Co-training are two well-known examples. The main aim is to use only a few labelled instances to define a model and to apply this model in a labelling process, in which unlabelled instances are labelled and included in the labelled set. However, the labelling process is always directly dependent on the selection of the unlabelled instances. Moreover, the selection criterion used to select and label new instances has an important effect on the performance of a semi-supervised method. In this paper, we propose a distance-weighted selection of unlabelled instances for Self-training and Co-training semi-supervised methods. In addition, we compare the standard Self-training and Co-training methods against the proposed versions of these two methods over 20 classification datasets.
Published: 2020
Full Text: View/download PDF

50. Self-training Improves Pre-training for Natural Language Understanding

Author: Edouard Grave, Veselin Stoyanov, Vishrav Chaudhary, Beliz Gunel, Jingfei Du, Onur Celebi, Michael Auli, and Alexis Conneau
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, business.industry, Computer science, Natural language understanding, 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, Variety (linguistics), 01 natural sciences, Task (project management), Scalability, 0202 electrical engineering, electronic engineering, information engineering, Labeled data, Leverage (statistics), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Self training, Computation and Language (cs.CL), 0105 earth and related environmental sciences
Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning. Comment: 8 pages
Published: 2020
Full Text: View/download PDF

174 results on '"Self training"'
