17 results for "Tsuneo Nitta"
Search Results
2. New Grapheme Generation Rules for Two-Stage Model-Based Grapheme-to-Phoneme Conversion
- Author
- Seng Kheang, Tsuneo Nitta, Kouichi Katsurada, and Yurie Iribe
- Subjects
Consonant, Information Systems and Management, General Computer Science, Computer science, Speech recognition, Grapheme, Speech synthesis, Information technology, Pronunciation, Software, Vowel, Telecommunication, Artificial intelligence, Electrical and Electronic Engineering, Document retrieval, Word (computer architecture), Natural language processing
- Abstract
The precise conversion of arbitrary text into its corresponding phoneme sequence (grapheme-to-phoneme, or G2P, conversion) is used in speech synthesis and recognition, pronunciation learning software, spoken term detection, and spoken document retrieval systems. Because the quality of this module plays an important role in the performance of such systems, and many problems regarding G2P conversion have been reported, we propose a novel two-stage model-based approach, implemented using an existing weighted finite-state transducer-based G2P conversion framework, to improve the performance of the G2P conversion model. The first-stage model is built for the automatic conversion of words to phonemes, while the second-stage model uses the input graphemes and the output phonemes obtained from the first stage to determine the best final output phoneme sequence. Additionally, we designed new grapheme generation rules that add extra detail to the vowel and consonant graphemes appearing within a word. When compared with previous approaches, the evaluation results indicate that our approach using rules focusing on the vowel graphemes slightly improves accuracy on the out-of-vocabulary dataset and consistently increases accuracy on the in-vocabulary dataset.
- Published
- 2014
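The vowel-focused grapheme generation rules above can be illustrated with a toy segmenter. This is a minimal sketch: the positional tagging rule below is invented for illustration and is not the paper's actual rule set.

```python
# Toy illustration of context-sensitive grapheme generation:
# vowel graphemes are tagged with their position in the word so that
# a G2P model can distinguish e.g. word-final from word-medial vowels.
VOWELS = set("aeiou")

def generate_graphemes(word):
    """Split a word into grapheme units, adding positional detail
    to vowel graphemes (an illustrative rule, not the paper's set)."""
    units = []
    for i, ch in enumerate(word.lower()):
        if ch in VOWELS:
            if i == 0:
                units.append(ch + "_initial")
            elif i == len(word) - 1:
                units.append(ch + "_final")
            else:
                units.append(ch + "_medial")
        else:
            units.append(ch)
    return units

print(generate_graphemes("phoneme"))
# ['p', 'h', 'o_medial', 'n', 'e_medial', 'm', 'e_final']
```

The extra tags let a downstream model learn, for instance, that a word-final "e" is often silent while a medial "e" is not.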
3. Task Estimation Using Latent Semantic Analysis of Visual Scenes and Spoken Words
- Author
- Tsuneo Nitta, Masashi Kimura, Shinta Sawada, Yurie Iribe, and Kouichi Katsurada
- Subjects
Estimation, Thesaurus (information retrieval), Modality (human–computer interaction), Computer Networks and Communications, Computer science, Latent semantic analysis, Applied Mathematics, Speech recognition, General Physics and Astronomy, Linear subspace, Task (project management), Image (mathematics), Identification (information), Signal Processing, Artificial intelligence, Electrical and Electronic Engineering, Natural language processing
- Abstract
In this paper, we propose a task estimation method based on multiple subspaces extracted from multimodal information: image objects in visual scenes and spoken words in dialogue appearing in the same task. The multiple subspaces are obtained by using latent semantic analysis (LSA). In the proposed method, a task vector composed of spoken words and the frequencies of image-object appearances is extracted first, and then the similarities between the input task vector and the reference subspaces of different tasks are compared. Experiments are conducted on the identification of game tasks. The experimental results show that the proposed method with multimodal information outperforms methods in which only the single modality of image or spoken dialogue is applied. The proposed method achieves accurate performance even when only limited spoken dialogue is available.
- Published
- 2014
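The comparison step described above can be sketched as follows. This is a simplified illustration: the LSA/SVD subspace projection is omitted, and the vocabulary, tasks, and counts are invented.

```python
import math

# Simplified sketch of the comparison step: a task vector combines
# spoken-word counts with image-object appearance counts, and is matched
# against per-task reference vectors by cosine similarity.
# (The paper projects onto LSA subspaces first; that SVD step is
# omitted here, and the vocabulary/tasks below are invented.)
VOCAB = ["card", "dice", "move", "board", "piece", "shuffle"]

def task_vector(word_counts, object_counts):
    # Concatenate the spoken-word modality and the image-object modality.
    return [word_counts.get(w, 0) for w in VOCAB] + object_counts

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def estimate_task(query, references):
    # Pick the reference task whose vector is most similar to the query.
    return max(references, key=lambda name: cosine(query, references[name]))

references = {
    "card_game":  task_vector({"card": 5, "shuffle": 3}, [4, 0]),
    "board_game": task_vector({"move": 4, "board": 5, "piece": 3}, [0, 6]),
}
query = task_vector({"card": 2, "shuffle": 1}, [3, 0])
print(estimate_task(query, references))  # card_game under these toy counts
```

Concatenating the two modalities is what lets visual evidence compensate when little dialogue is available, mirroring the paper's finding.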
4. Solving the Phoneme Conflict in Grapheme-to-Phoneme Conversion Using a Two-Stage Neural Network-Based Approach
- Author
- Seng Kheang, Tsuneo Nitta, Kouichi Katsurada, and Yurie Iribe
- Subjects
Artificial neural network, Computer science, Speech recognition, American English, Phonetic transcription, Grapheme, Context (language use), Speech synthesis, Pronunciation, Artificial Intelligence, Hardware and Architecture, Computer Vision and Pattern Recognition, Artificial intelligence, Electrical and Electronic Engineering, Software, Word (computer architecture), Natural language processing
- Abstract
To achieve high-quality output in speech synthesis systems, data-driven grapheme-to-phoneme (G2P) conversion is usually used to generate the phonetic transcription of out-of-vocabulary (OOV) words. To improve the performance of G2P conversion, this paper addresses the problem of conflicting phonemes, where an input grapheme can, in the same context, produce many possible output phonemes. To this end, we propose a two-stage neural network-based approach that converts the input text to phoneme sequences in the first stage and then predicts each output phoneme in the second stage using the phonemic information obtained. The first-stage neural network is implemented as a many-to-many mapping model for the automatic conversion of words to phoneme sequences, while the second stage uses a combination of the obtained phoneme sequences to predict the output phoneme corresponding to each input grapheme in a given word. We evaluate the performance of this approach using the American English pronunciation dictionary known as the auto-aligned CMUDict corpus [1]. In terms of phoneme and word accuracy on OOV words, in comparison with several baseline approaches, the evaluation results show that our approach improves on the previous one-stage neural network-based approach for G2P conversion. Comparison with another existing approach indicates that our approach provides higher phoneme accuracy but lower word accuracy on a general dataset, and slightly higher phoneme and word accuracy on a selection of words consisting of more than one phoneme.
- Published
- 2014
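The phoneme-conflict idea above can be illustrated with a toy example in which lookup tables stand in for the two neural networks. All mappings below are invented for illustration and are not drawn from CMUDict.

```python
# Toy illustration of the two-stage idea, with lookup tables standing in
# for the two neural networks (the grapheme-to-phoneme mappings below
# are invented, not CMUDict entries).
# Stage 1: each grapheme proposes candidate phonemes (the "conflict").
STAGE1 = {
    "a": ["AE", "EY"],   # 'a' conflicts: "cat" vs "cake"
    "c": ["K"],
    "t": ["T"],
    "k": ["K"],
    "e": ["_"],          # silent word-final 'e'
}

def stage2(word):
    """Resolve each conflict using stage-1 information about the rest of
    the word (a crude stand-in for the second-stage network)."""
    cands = [STAGE1[g] for g in word]
    out = []
    for i, options in enumerate(cands):
        if len(options) == 1:
            out.append(options[0])
        else:
            # If a silent 'e' ends the word, prefer the "long" vowel EY.
            out.append("EY" if word[i + 1:].endswith("e") else "AE")
    return [p for p in out if p != "_"]

print(stage2("cat"))   # ['K', 'AE', 'T']
print(stage2("cake"))  # ['K', 'EY', 'K']
```

The point of the second stage is exactly this: the first stage alone cannot choose between AE and EY for "a", but the phonemic context it produces makes the choice decidable.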
5. Generation of CG Animations Based on Articulatory Features for Pronunciation Training
- Author
- Tsuneo Nitta, Kouichi Katsurada, Takuro Mori, and Yurie Iribe
- Subjects
Computer science, Speech recognition, Training (meteorology), Artificial intelligence, Pronunciation, Computer animation, Natural language processing
- Published
- 2012
6. Learning Lexicons from Spoken Utterances Based on Statistical Model Selection
- Author
- Ryo Taguchi, Tsuneo Nitta, Mikio Nakano, Naoto Iwahashi, Takashi Nose, and Kotaro Funakoshi
- Subjects
Computer science, Model selection, Speech recognition, Acoustic model, Statistical model, Language acquisition, Object (computer science), Lexicon, Artificial Intelligence, Unsupervised learning, Artificial intelligence, Software, Utterance, Natural language processing
- Abstract
This paper proposes a method for the unsupervised learning of lexicons from pairs of a spoken utterance and an object representing its meaning, without any a priori linguistic knowledge other than a phoneme acoustic model. To obtain a lexicon, a statistical model of the joint probability of a spoken utterance and an object is learned based on the minimum description length (MDL) principle. This model consists of a list of word phoneme sequences and three statistical models: the phoneme acoustic model, a word-bigram model, and a word meaning model. Experimental results show that the method can acquire acoustically, grammatically, and semantically appropriate words with about 85% phoneme accuracy.
- Published
- 2010
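The minimum description length principle behind this model selection can be sketched with a toy segmentation example. This is a minimal illustration: the two cost terms below crudely approximate the paper's acoustic, bigram, and meaning models, and all data and constants are invented.

```python
import math

# Toy MDL sketch: choose the lexicon that minimizes
#   DL = (bits to describe the lexicon) + (bits to encode the data with it).
# The real system combines a phoneme acoustic model, a word-bigram model,
# and a word meaning model; both cost terms here are crude stand-ins.
utterances = ["redbox", "redball", "bluebox"]

def description_length(lexicon, data):
    # Model cost: ~5 bits per phoneme in the word list (invented constant).
    model_bits = 5 * sum(len(w) for w in lexicon)
    # Data cost: greedily segment each utterance into lexicon words and
    # charge -log2 P(word) under a uniform unigram over the lexicon.
    data_bits = 0.0
    for utt in data:
        n_words, rest = 0, utt
        while rest:
            for w in sorted(lexicon, key=len, reverse=True):
                if rest.startswith(w):
                    rest = rest[len(w):]
                    n_words += 1
                    break
            else:
                return float("inf")  # this lexicon cannot segment the data
        data_bits += n_words * math.log2(len(lexicon))
    return model_bits + data_bits

whole = {"redbox", "redball", "bluebox"}   # memorize whole utterances
words = {"red", "blue", "box", "ball"}     # reuse sub-word units
print(description_length(whole, utterances))
print(description_length(words, utterances))  # smaller: sub-words win
```

Memorizing whole utterances makes the data cheap to encode but the model expensive; reusable sub-word units shrink the model enough to win overall, which is the intuition behind MDL-driven word discovery.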
7. A Method for Keyword Extraction Using Retrieval Information from Students in Lectures
- Author
- Kouichi Katsurada, Shuji Shinohara, Hiroaki Kawashima, Yurie Iribe, and Tsuneo Nitta
- Subjects
Information retrieval, Computer science, Keyword extraction, Artificial Intelligence, Human–computer information retrieval, Artificial intelligence, Software, Natural language processing
- Abstract
Recently, e-learning systems for self-learning with various types of retrieval functions have been developed. This paper describes a method for keyword extraction using retrieval information collected from many students through those retrieval functions. Firstly, we show that (1) teachers tend to consider technical terms important, while students unfamiliar with those technical terms tend to retrieve them, and therefore (2) there is a clear correlation between the keywords extracted by teachers and the retrieval words used by students. Secondly, we propose a method utilizing retrieval information from students for keyword extraction, and show that the method achieves considerably better performance than a method extracting keywords using only lecture information.
- Published
- 2007
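The correlation observed above suggests boosting lecture terms by how often students retrieve them. Below is a minimal sketch of that idea; the weighting scheme and data are invented, not the paper's formula.

```python
from collections import Counter

# Toy sketch: score each lecture term by its frequency in the lecture
# plus a boost for how often students retrieved it, capturing the
# observed correlation between retrieval words and teacher keywords.
# (The additive weighting and the boost factor are invented.)
def extract_keywords(lecture_terms, retrieval_log, top_k=3, boost=2.0):
    tf = Counter(lecture_terms)
    queries = Counter(retrieval_log)
    scores = {t: tf[t] + boost * queries.get(t, 0) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_k]

lecture = ["entropy", "entropy", "entropy", "channel", "coding"]
retrievals = ["channel", "channel", "coding"]   # terms students looked up
print(extract_keywords(lecture, retrievals))
# ['channel', 'entropy', 'coding']
```

"channel" appears only once in the lecture, but the students' retrievals promote it above the frequent but already-understood "entropy".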
8. Efficient Learning of Word Meanings by Agents Using Biases Observed in Language Development of Children
- Author
- Masashi Kimura, Kouichi Katsurada, Shuji Shinohara, Satoshi Kodama, Yurie Iribe, Ryo Taguchi, and Tsuneo Nitta
- Subjects
Spoken word, Distribution (number theory), Computer science, Object (grammar), Conditional probability distribution, Symbol grounding, Artificial Intelligence, Feature (machine learning), Probability distribution, Artificial intelligence, Software, Natural language processing, Word (group theory)
- Abstract
Recently, studies on the learning of word meanings by agents have begun. In these studies, a human shows objects to an agent and utters words such as ``red'' or ``box''. The agent identifies the object feature represented by each spoken word. In our method, the agent first learns the probability distribution p(x) and the conditional probability distribution p(x|w), where x is an object feature and w is a word. If a word w does not represent a feature x, p(x) and p(x|w) will be almost the same distribution, because x is independent of w. This fact enables the agent to use the distance between p(x) and p(x|w) when inferring which feature a word represents. Previous works also employ similar stochastic approaches to detect the feature; however, such approaches need many examples to learn the correct distributions.
- Published
- 2007
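The distance idea above can be sketched in one dimension with Gaussian fits and the closed-form Gaussian KL divergence. The data below is synthetic and the paper's features are richer than a single scalar; this only illustrates the p(x)-versus-p(x|w) comparison.

```python
import math, random

# Sketch of the core idea: if word w is independent of feature x, then
# p(x|w) is close to p(x), so a near-zero distribution distance means
# "w does not describe x". Here x is 1-D, both distributions are fit as
# Gaussians, and the distance is the closed-form Gaussian KL divergence.
def fit_gaussian(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

def kl_gaussian(p, q):
    """KL( N(m1,v1) || N(m2,v2) ) in nats."""
    (m1, v1), (m2, v2) = p, q
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)

random.seed(0)
all_red   = [random.gauss(0.9, 0.05) for _ in range(200)]  # hue of red objects
all_other = [random.gauss(0.4, 0.20) for _ in range(200)]  # hue of the rest

p_x     = fit_gaussian(all_red + all_other)                       # p(x)
p_x_red = fit_gaussian(all_red)                                   # p(x | "red")
p_x_box = fit_gaussian(random.sample(all_red + all_other, 200))   # p(x | "box")

# "red" shifts the hue distribution strongly; "box" barely moves it,
# so the agent infers that "red" (not "box") describes the hue feature.
print(kl_gaussian(p_x_red, p_x), kl_gaussian(p_x_box, p_x))
```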
9. On Autonomous Coordination of Learning Biases by an Agent with a Vocabulary Learning Mechanism
- Author
- Kouichi Katsurada, Tsuneo Nitta, Takashi Hashimoto, Shuji Shinohara, and Ryo Taguchi
- Subjects
Error-driven learning, Artificial Intelligence, Human–computer interaction, Computer science, Artificial intelligence, Vocabulary learning, Software, Natural language processing, Mechanism (sociology)
- Published
- 2007
10. Introducing articulatory anchor-point to ANN training for corrective learning of pronunciation
- Author
- Tsuneo Nitta, Ryoko Hayashi, Yurie Iribe, Kouichi Katsurada, Silasak Manosavanh, and Chunyue Zhu
- Subjects
Computer science, Speech recognition, Animation, Pronunciation, Artificial intelligence, Articulation (phonetics), Articulatory gestures, Vocal tract, Computer animation, Natural language processing, Gesture
- Abstract
In this paper, we describe computer-assisted pronunciation training (CAPT) through the visualization of articulatory gestures derived from the learner's speech. Typical CAPT systems cannot indicate how the learner should correct his/her articulation. The proposed system enables learners to study how to correct their pronunciation by comparing a wrongly pronounced gesture with a correctly pronounced one. In this system, a multi-layer neural network (MLN) is used to convert the learner's speech into vocal tract coordinates derived from Magnetic Resonance Imaging data, and an animation is then generated from these coordinate values. Moreover, we improved the animations by introducing a per-phoneme anchor-point into MLN training. In our experiments, the new system generated accurate CG animations even from English speech produced by Japanese learners.
- Published
- 2013
11. Improvement of animated articulatory gesture extracted from speech for pronunciation training
- Author
- Chunyue Zhu, Silasak Manosavan, Yurie Iribe, Ryoko Hayashi, Kouichi Katsurada, and Tsuneo Nitta
- Subjects
Computer science, Place of articulation, Speech recognition, Animation, Pronunciation, Gesture recognition, Artificial intelligence, Articulation (phonetics), Articulatory gestures, Vocal tract, Natural language processing, Computer animation
- Abstract
Computer-assisted pronunciation training (CAPT) has been introduced for language education in recent years. CAPT scores the learner's pronunciation quality and points out wrong phonemes by using speech recognition technology. However, although the learner can thus realize that his/her speech differs from the teacher's, the learner still cannot control the articulation organs to pronounce correctly, because the learner cannot see precisely how to correct the wrong articulatory gestures. We indicate these differences by visualizing a learner's wrong pronunciation movements alongside the correct pronunciation movements with CG animation. We propose a system for generating animated pronunciation by automatically estimating a learner's pronunciation movements from his/her speech. The proposed system maps speech to the coordinate values needed to generate the animations by using a multilayer perceptron (MLP) neural network, and uses MRI data to generate smooth animated pronunciations. Additionally, we verify through experimental evaluation whether the vocal tract area and articulatory features are suitable as characteristics of pronunciation movement.
- Published
- 2012
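The mapping stage described above can be sketched as a minimal MLP forward pass. The weights here are random (untrained) and all dimensions are invented; the real system is trained against MRI-derived coordinates.

```python
import math, random

# Minimal sketch of the mapping stage: an MLP takes a frame of acoustic
# features and outputs 2-D coordinates of vocal-tract control points for
# the animation. (Untrained random weights; invented dimensions.)
random.seed(1)

N_FEATS, N_HIDDEN, N_POINTS = 12, 8, 5   # 5 control points -> 10 outputs

def layer(n_in, n_out):
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_out)]

W1, W2 = layer(N_FEATS, N_HIDDEN), layer(N_HIDDEN, 2 * N_POINTS)

def forward(features):
    # Hidden layer with tanh nonlinearity, linear output layer.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in W1]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    # Pair up outputs as (x, y) coordinates, one per control point.
    return list(zip(out[0::2], out[1::2]))

frame = [random.uniform(-1, 1) for _ in range(N_FEATS)]
coords = forward(frame)
print(len(coords))   # 5 control points per speech frame
```

Running this frame by frame over an utterance yields a coordinate trajectory, which is what the animation renderer interpolates into smooth articulator motion.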
12. Learning Physically Grounded Lexicons from Spoken Utterances
- Author
- Mikio Nakano, Ryo Taguchi, Kotaro Funakoshi, Takashi Nose, Naoto Iwahashi, and Tsuneo Nitta
- Subjects
Service (systems architecture), Computer science, Order (business), Robot, Artificial intelligence, Object (philosophy), Natural language processing, Utterance, Word (computer architecture)
- Abstract
Service robots must understand correspondence relationships between things in the real world and words in order to communicate with humans. For example, to understand the utterance, "Bring me an apple," the robot requires knowledge about the relationship between the word "apple" and visual features of the apple, such as color and shape. Robots perceive object features with physical sensors. However, developers of service robots cannot describe all knowledge in advance because such robots may be used in situations other than those the developers assumed. In particular, household robots have many opportunities to encounter unknown objects. Therefore, it is preferable that robots automatically learn physically grounded lexicons, which consist of phoneme sequences and meanings of words, through interactions with users.
- Published
- 2012
13. Generating animated pronunciation from speech through articulatory feature extraction
- Author
- Tsuneo Nitta, Silasak Manosavanh, Kouichi Katsurada, Chunyue Zhu, Ryoko Hayashi, and Yurie Iribe
- Subjects
Thesaurus (information retrieval), Computer science, Feature extraction, Artificial intelligence, Pronunciation, Natural language processing
- Published
- 2011
14. Evaluation of fast spoken term detection using a suffix array
- Author
- Kouichi Katsurada, Tsuneo Nitta, Shigeki Teshima, Yurie Iribe, and Shinta Sawada
- Subjects
Computer science, Speech recognition, Suffix array, Artificial intelligence, Natural language processing, Term (time)
- Published
- 2011
15. Dialog Strategy Acquisition and Its Evaluation for Efficient Learning of Word Meanings by Agents
- Author
- Tsuneo Nitta, Kouichi Katsurada, and Ryo Taguchi
- Subjects
Facial expression, Mechanism (biology), Comprehension, Word meaning, Artificial intelligence, Dialog box, Psychology, Word (computer architecture), Natural language processing
- Abstract
In word meaning acquisition through interactions between humans and agents, the efficiency of learning depends largely on the dialog strategies the agents use. This paper describes the automatic acquisition of dialog strategies through interaction between two agents. In the experiments, the two agents infer each other's comprehension level from the other's facial expressions and utterances in order to acquire efficient strategies, with Q-learning applied as the strategy acquisition mechanism. Firstly, experiments are carried out on the interaction between a mother agent, who knows all the word meanings, and a child agent with no initial word meanings. The experimental results show that the mother agent acquires a teaching strategy, while the child agent acquires an asking strategy. Next, experiments on interaction between a human and an agent are conducted to evaluate the acquired strategies. The results show the effectiveness of both the teaching and asking strategies.
- Published
- 2006
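The Q-learning strategy acquisition above can be sketched with a tiny invented state/action space. The comprehension dynamics and rewards below are assumptions for illustration, not the paper's environment.

```python
import random

# Tiny Q-learning sketch of strategy acquisition (states, actions, and
# reward dynamics are invented): the mother agent observes the child's
# comprehension level and learns whether to "teach" a word or "quiz".
random.seed(2)
STATES = ["child_confused", "child_confident"]
ACTIONS = ["teach", "quiz"]
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(state, action):
    # Invented dynamics: teaching helps a confused child,
    # quizzing is only useful once the child is confident.
    if state == "child_confused":
        return 1.0 if action == "teach" else -1.0
    return 1.0 if action == "quiz" else 0.0

def step(state):
    # Epsilon-greedy action selection, then a standard Q-learning update.
    a = (random.choice(ACTIONS) if random.random() < EPS
         else max(ACTIONS, key=lambda x: Q[(state, x)]))
    r = reward(state, a)
    nxt = random.choice(STATES)
    Q[(state, a)] += ALPHA * (r + GAMMA * max(Q[(nxt, b)] for b in ACTIONS)
                              - Q[(state, a)])
    return nxt

s = "child_confused"
for _ in range(2000):
    s = step(s)

best = {st: max(ACTIONS, key=lambda act: Q[(st, act)]) for st in STATES}
print(best)   # the learned policy: teach when confused, quiz when confident
```

Even this toy setup shows the paper's qualitative result: the reward structure alone drives the agent toward a teaching strategy in low-comprehension states.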
16. A large vocabulary word recognition system based on syllable recognition and nonlinear word matching
- Author
- Hiroshi Kanazawa, S. Hirai, Hiroshi Matsuura, Yoichi Takebayashi, Tsuneo Nitta, and Hiroyuki Tsuboi
- Subjects
Vocabulary, Matching (statistics), Computer science, Speech recognition, Task (project management), Nonlinear system, Pattern recognition (psychology), Word recognition, Artificial intelligence, Syllable, Natural language processing, Word (computer architecture)
- Abstract
A practical, speaker-adaptive Japanese large-vocabulary recognition system has been developed. The system has two notable features. The first is low-cost hardware that realizes precise syllable recognition and high-speed speaker adaptation using the Karhunen-Loeve expansion. The second is nonlinear word matching, which handles syllable insertion and deletion and reduces the restrictions on acceptable utterances. The system has been applied to data entry tasks such as the input of train station names, family names, and given names, and also to a map data search task in which place names were verified. Recognition experiments have been carried out on a 2000-word vocabulary; the word recognition rate was 93.7% for 1000 utterances by five male speakers.
- Published
- 2003
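The nonlinear word matching above can be sketched as dynamic-programming alignment over syllable sequences that tolerates insertions and deletions. The costs and syllabifications below are invented for illustration.

```python
# Sketch of nonlinear word matching as DP alignment over syllable
# sequences, tolerating recognizer insertions and deletions
# (the costs and syllabifications below are invented).
def match_cost(recognized, reference, sub=1.0, ins=0.7, dele=0.7):
    n, m = len(recognized), len(reference)
    # d[i][j]: min cost of aligning the first i recognized syllables
    # with the first j reference syllables.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * ins
    for j in range(1, m + 1):
        d[0][j] = j * dele
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0.0 if recognized[i - 1] == reference[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + same,   # match / substitution
                          d[i - 1][j] + ins,        # inserted syllable
                          d[i][j - 1] + dele)       # deleted syllable
    return d[n][m]

def recognize(recognized, vocabulary):
    return min(vocabulary, key=lambda w: match_cost(recognized, vocabulary[w]))

vocabulary = {                       # station names, syllabified
    "shinagawa": ["shi", "na", "ga", "wa"],
    "kanagawa":  ["ka", "na", "ga", "wa"],
    "shinjuku":  ["shi", "n", "ju", "ku"],
}
# The recognizer dropped one syllable, yet the right word still wins:
print(recognize(["shi", "ga", "wa"], vocabulary))  # shinagawa
```

Charging a finite cost for insertions and deletions, rather than rejecting misaligned hypotheses outright, is what relaxes the restriction on acceptable utterances.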
17. Word-spotting based on inter-word and intra-word diphone models
- Author
- H. Matsu'ura, Tsuneo Nitta, Yasuyuki Masai, and Shinichi Tanaka
- Subjects
Computer science, Speech recognition, Logogen model, Word error rate, Diphone, Speaker recognition, Speech processing, Word recognition, Artificial intelligence, Hidden Markov model, Word (computer architecture), Natural language processing
- Abstract
The authors propose a precise but simple inter-word diphone model (IDM) for word-spotting based on SMQ/HMM. They previously applied ordinary diphone models to a speaker-independent, large-vocabulary word recognition unit; however, because users are apt to add words and/or extraneous speech, accuracy degrades due to the mismatch of models at word boundaries. The IDM represents a transition from the preceding phonemes to a word, or from a word to the succeeding phonemes. An experiment showed that the IDMs reduce error rates by about 5% for speech containing unknown words and extraneous speech. The experiment also showed that the proposed method provides performance good enough for the practical use of a large-vocabulary isolated-word recognition system.
- Published
- 2002
Discovery Service for Jio Institute Digital Library