Anomaly-based annotation error detection in speech-synthesis corpora (English)

Matoušek, Jindřich / Tihelka, Daniel

In: Computer Speech and Language ; 46 ; 1-35 ; 2017

ISSN:

0885-2308

Article (Journal) / Electronic Resource

Highlights
• Anomaly-based detection could be used to detect word-level annotation errors.
• Automatically selected feature sets achieve results similar to those of hand-crafted ones.
• Training data size can be significantly reduced while maintaining good results.
• A combination of several detectors has the potential to outperform the individual detectors.
• Classification does not outperform anomaly detection and is more sensitive to data size.

Abstract We investigate the problem of automatically detecting annotation errors in single-speaker read-speech corpora used for speech synthesis. For this purpose, we adopt an anomaly detection framework in which correctly annotated words are considered normal examples on which the detection methods are trained. Misannotated words are then treated as anomalous examples that do not conform to the normal patterns captured by the trained detection models. We propose and evaluate several anomaly detection models: Gaussian distribution-based detectors, a Grubbs’ test-based detector, and a one-class support vector machine-based detector. Word-level feature sets, including basic features derived from forced alignment and various acoustic, spectral, phonetic, and positional features, are examined to find an optimal set of features for each anomaly detector. The results, with an F1 score of almost 89%, show that anomaly detection could help detect annotation errors in read-speech corpora for speech synthesis. Dimensionality reduction techniques are also examined to automatically reduce the number of features used to describe the annotated words. We show that the automatically reduced feature sets achieve results statistically similar to those of the hand-crafted feature sets. We also conducted additional experiments to investigate both the robustness of the proposed anomaly detection framework with respect to the particular data sets used for development and evaluation, and the influence of the number of examples needed for anomaly detection. We show that reasonably good detection performance can be reached using significantly fewer examples during the detector development phase. We also propose the concept of a voting detector, a combination of anomaly detectors in which each “single” detector “votes” on whether or not a testing word is annotated correctly, and the final decision is made by aggregating the votes.
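
The abstract does not give the detectors' equations. As a rough illustration of the Gaussian-distribution idea only, the minimal sketch below assumes that each word is described by a fixed-length numeric feature vector, models every feature as an independent univariate Gaussian fitted solely to correctly annotated (normal) words, and flags a word when its log density falls at or below a threshold ε. Choosing ε by maximising F1 on a small labelled validation set, the function names, and the two toy features are all illustrative assumptions, not the authors' method or data.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian(normal_words):
    """Estimate per-feature means and variances from correctly annotated words only."""
    params = []
    for column in zip(*normal_words):
        mu = fmean(column)
        var = pvariance(column, mu)
        params.append((mu, max(var, 1e-12)))  # floor avoids division by zero
    return params

def log_density(params, word):
    """Log of the product of independent univariate Gaussian densities."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for (mu, var), x in zip(params, word)
    )

def f1_score(y_true, y_pred):
    """F1 score where True means 'misannotated' (the anomalous class)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def choose_threshold(params, val_words, val_is_error):
    """Try each validation score as a threshold and keep the one with the best F1."""
    scores = [log_density(params, w) for w in val_words]
    best_eps, best_f1 = None, -1.0
    for eps in sorted(scores):
        predicted = [s <= eps for s in scores]
        f1 = f1_score(val_is_error, predicted)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

def is_misannotated(params, eps, word):
    """A word is anomalous when its log density is at or below the threshold."""
    return log_density(params, word) <= eps

# Hypothetical 2-D word features: [duration in seconds, forced-alignment score].
train = [[0.30, -1.0], [0.32, -1.1], [0.28, -0.9], [0.31, -1.05], [0.29, -0.95]]
val = [[0.30, -1.0], [0.29, -1.02], [0.60, -3.0], [0.55, -2.5]]
val_labels = [False, False, True, True]

params = fit_gaussian(train)
eps, best_f1 = choose_threshold(params, val, val_labels)
print(best_f1)                                     # 1.0 on this toy data
print(is_misannotated(params, eps, [0.31, -1.0]))  # False
print(is_misannotated(params, eps, [0.70, -3.5]))  # True
```

Note that only normal words are used to fit the model; labelled anomalies appear solely in the (assumed) validation step that sets ε.
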
Our results show that the voting detector has the potential to outperform each of the single anomaly detectors. Furthermore, we compare the proposed anomaly detection framework with a classification-based approach (which, unlike anomaly detection, needs anomalous examples during training), and we show that both approaches lead to statistically comparable results when all available anomalous examples are used during detector/classifier development. However, when fewer anomalous examples are used, the proposed anomaly detection framework clearly outperforms the classification-based approach. A final listening test showed the effectiveness of the proposed anomaly-based annotation error detection for improving the quality of synthetic speech.
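
The voting detector is described only as aggregating the votes of single detectors. A minimal sketch, assuming a simple count-based rule (a strict majority by default) and three hypothetical threshold detectors invented purely for illustration, might look like this:

```python
from typing import Callable, List, Optional, Sequence

# A detector maps a word's feature vector to True if the word looks misannotated.
Detector = Callable[[Sequence[float]], bool]

def voting_detector(detectors: List[Detector], word: Sequence[float],
                    min_votes: Optional[int] = None) -> bool:
    """Flag a word when at least `min_votes` detectors vote for an annotation error.

    A strict majority is required by default; this aggregation rule is an
    assumption, and other rules (e.g. unanimity) can be set via min_votes.
    """
    votes = sum(1 for detect in detectors if detect(word))
    if min_votes is None:
        min_votes = len(detectors) // 2 + 1
    return votes >= min_votes

# Three hypothetical single detectors over [duration in seconds, alignment score].
detectors: List[Detector] = [
    lambda w: w[0] > 0.5,             # unusually long word
    lambda w: w[1] < -2.0,            # poor forced-alignment score
    lambda w: abs(w[0] - 0.3) > 0.1,  # duration far from a typical value
]

print(voting_detector(detectors, [0.31, -1.0]))               # False: 0 of 3 votes
print(voting_detector(detectors, [0.60, -1.0]))               # True: 2 of 3 votes
print(voting_detector(detectors, [0.45, -1.0]))               # False: 1 of 3 votes
print(voting_detector(detectors, [0.45, -1.0], min_votes=1))  # True
```

Requiring unanimity (`min_votes=len(detectors)`) favours precision, while `min_votes=1` favours recall; the abstract does not specify which aggregation rule the authors used.
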

Publisher:

Elsevier Ltd

Publication date:

2017-04-11
Size:

35 pages
DOI:

https://doi.org/10.1016/j.csl.2017.04.007
Language:

English
Keywords:

Annotation error detection, Anomaly detection, Read speech corpora, Speech synthesis
Source:

Elsevier

Table of contents – Volume 46

The tables of contents are generated automatically and are based on the data records of the individual contributions available in the index of the TIB portal. The display of the Tables of Contents may therefore be incomplete.

1: Anomaly-based annotation error detection in speech-synthesis corpora
Matoušek, Jindřich / Tihelka, Daniel et al. | 2017
36: Reversible speaker de-identification using pre-trained transformation functions
Magariños, Carmen / Lopez-Otero, Paula / Docio-Fernandez, Laura / Rodriguez-Banga, Eduardo / Erro, Daniel / Garcia-Mateo, Carmen et al. | 2017
53: Text-dependent speaker verification based on i-vectors, Neural Networks and Hidden Markov Models
Zeinali, Hossein / Sameti, Hossein / Burget, Lukáš / Černocký, Jan “Honza” et al. | 2017
72: Adaptive speaker diarization of broadcast news based on factor analysis
Desplanques, Brecht / Demuynck, Kris / Martens, Jean-Pierre et al. | 2017
94: Constructing a Natural Language Inference dataset using generative neural networks
Starc, Janez / Mladenić, Dunja et al. | 2017
113: A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation
Piao, Scott / Dallachy, Fraser / Baron, Alistair / Demmen, Jane / Wattam, Steve / Durkin, Philip / McCracken, James / Rayson, Paul / Alexander, Marc et al. | 2017
136: Estimation of glottal closure instants from degraded speech using a phase-difference-based algorithm
Anushiya Rachel, G. / Vijayalakshmi, P. / Nagarajan, T. et al. | 2017
154: A segmental framework for fully-unsupervised large-vocabulary speech recognition
Kamper, Herman / Jansen, Aren / Goldwater, Sharon et al. | 2017
175: Towards the next generation of speech tools and corpora
Draxler, Christoph / Harrington, Jonathan / Schiel, Florian et al. | 2017
179: Influence of speaker familiarity on blind and visually impaired children’s and young adults’ perception of synthetic voices
Pucher, Michael / Zillinger, Bettina / Toman, Markus / Schabus, Dietmar / Valentini-Botinhao, Cassia / Yamagishi, Junichi / Schmid, Erich / Woltron, Thomas et al. | 2017
196: Characterisation of voice quality of Parkinson’s disease using differential phonological posterior features
Cernak, Milos / Orozco-Arroyave, Juan Rafael / Rudzicz, Frank / Christensen, Heidi / Vásquez-Correa, Juan Camilo / Nöth, Elmar et al. | 2017
209: Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation
Kim, Taehwan / Keane, Jonathan / Wang, Weiran / Tang, Hao / Riggle, Jason / Shakhnarovich, Gregory / Brentari, Diane / Livescu, Karen et al. | 2017
233: Scalable algorithms for unsupervised clustering of acoustic data for speech recognition
Rath, Shakti P. et al. | 2017
249: Spoken language understanding and interaction: machine learning for human-like conversational systems
Gašić, Milica / Hakkani-Tür, Dilek / Celikyilmaz, Asli et al. | 2017
252: Multilingually trained bottleneck features in spoken language recognition
Fér, Radek / Matějka, Pavel / Grézl, František / Plchot, Oldřich / Veselý, Karel / Černocký, Jan Honza et al. | 2017
268: Emotion, age, and gender classification in children’s speech by humans and machines
Kaya, Heysem / Salah, Albert Ali / Karpov, Alexey / Frolova, Olga / Grigorev, Aleksey / Lyakso, Elena et al. | 2017
284: Improving the understanding of spoken referring expressions through syntactic-semantic and contextual-phonetic error-correction
Zukerman, Ingrid / Partovi, Andisheh et al. | 2017
311: A Framework for pre-training hidden-unit conditional random fields and its extension to long short term memory networks
Kim, Young-Bum / Stratos, Karl / Sarikaya, Ruhi et al. | 2017
327: Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Ng, Raymond W.M. / Nicolao, Mauro / Hain, Thomas et al. | 2017
343: Using speech technology for quantifying behavioral characteristics in peer-led team learning sessions
Dubey, Harishchandra / Sangwan, Abhijeet / Hansen, John H.L. et al. | 2017
367: Introduction to the special issue on deep learning approaches for machine translation
Costa-jussà, Marta R. / Allauzen, Alexandre / Barrault, Loïc / Cho, Kyunghun / Schwenk, Holger et al. | 2017
374: A generic neural acoustic beamforming architecture for robust multi-channel speech processing
Heymann, Jahn / Drude, Lukas / Haeb-Umbach, Reinhold et al. | 2016
386: Multi-microphone speech recognition in everyday environments
Barker, Jon / Marxer, Ricard / Vincent, Emmanuel / Watanabe, Shinji et al. | 2017
388: Robust coherence-based spectral enhancement for speech recognition in adverse real-world environments
Barfuss, Hendrik / Huemmer, Christian / Schwarz, Andreas / Kellermann, Walter et al. | 2017
401: Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend
Hori, Takaaki / Chen, Zhuo / Erdogan, Hakan / Hershey, John R. / Le Roux, Jonathan / Mitra, Vikramjit / Watanabe, Shinji et al. | 2017
419: Room-localized spoken command recognition in multi-room, multi-microphone environments
Rodomagoulakis, Isidoros / Katsamanis, Athanasios / Potamianos, Gerasimos / Giannoulis, Panagiotis / Tsiami, Antigoni / Maragos, Petros et al. | 2017
444: A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions
Sivasankaran, Sunit / Vincent, Emmanuel / Illina, Irina et al. | 2017
461: Acoustic model training based on node-wise weight boundary model for fast and small-footprint deep neural networks
Takeda, Ryu / Nakadai, Kazuhiro / Komatani, Kazunori et al. | 2017
481: Multi-style learning with denoising autoencoders for acoustic modeling in the internet of things (IoT)
Lin, Payton / Lyu, Dau-Cheng / Chen, Fei / Wang, Syu-Siang / Tsao, Yu et al. | 2017
496: Bayesian feature enhancement using independent vector analysis and reverberation parameter re-estimation for noisy reverberant speech recognition
Cho, Ji-Won / Park, Jong-Hyeon / Chang, Joon-Hyuk / Park, Hyung-Min et al. | 2017
517: An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech
Tu, Yan-Hui / Du, Jun / Wang, Qing / Bao, Xiao / Dai, Li-Rong / Lee, Chin-Hui et al. | 2016
535: An analysis of environment, microphone and data simulation mismatches in robust speech recognition
Vincent, Emmanuel / Watanabe, Shinji / Nugraha, Aditya Arie / Barker, Jon / Marxer, Ricard et al. | 2016
558: Multi-Channel Speech Enhancement and Amplitude Modulation Analysis for Noise Robust Automatic Speech Recognition
Moritz, Niko / Adiloğlu, Kamil / Anemüller, Jörn / Goetze, Stefan / Kollmeier, Birger et al. | 2016
574: Speech enhancement for robust automatic speech recognition: Evaluation using a baseline system and instrumental measures
Moore, A.H. / Peso Parada, P. / Naylor, P.A. et al. | 2016
585: DNN adaptation by automatic quality estimation of ASR hypotheses
Falavigna, Daniele / Matassoni, Marco / Jalalvand, Shahab / Negri, Matteo / Turchi, Marco et al. | 2016
605: The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes
Barker, Jon / Marxer, Ricard / Vincent, Emmanuel / Watanabe, Shinji et al. | 2016