Universiti Teknologi Malaysia Institutional Repository

Auto segmentation for Malay speech corpus

Tan, Tian Swee and Ting, Chee Ming (2012) Auto segmentation for Malay speech corpus. In: The 3rd International Multi-Conference on Complexity, Informatics and Cybernetics, 24-28 March 2012, Orlando, Florida, United States.

Full text not available from this repository.

Official URL: http://www.iiis.org/CDs2012/CD2012IMC/IDREC_2012/P...


Abstract: This paper deals with the automatic segmentation of a Malay continuous speech database. Automatic segmentation is the process of dividing speech into a sequence of discrete units, each with particular characteristics that remain constant within it. In terms of quality, hand-crafted segmentation would be the best method. However, for a large database, manual speech segmentation and labeling become a tremendous task that is time-consuming and error-prone. Moreover, even when the database is segmented by an expert, the segmentation rules may be subjective and not reproducible, and different linguistic experts may produce inconsistent results. An automated segmentation rule was therefore developed to segment the large-scale database consistently with a satisfactory level of quality. Automated segmentation of Malay syllables is not a difficult task because all syllables in Malay are pronounced almost equally and, like English, Malay is not a tonal language. The manipulation and identification of segment boundaries in Malay is straightforward and easy to understand. For the segmentation, an HMM-based approach with an adapted Viterbi forced-alignment technique is used. Composite HMMs with Baum-Welch re-estimation were used to ease the process of phonetic segmentation. All the data in the database was fed directly into the segmentation tool, without any previously trained samples for pre-training. To provide sentence coverage, the script consists of 1000 sentences: 620 sentences were selected from primary school Malay Language textbook material, and 380 sentences were generated using the 70% highest-frequency words that appear in a 10-million-word online digital text. This script configuration is expected to provide a phonetically balanced database that covers all the vowels and consonants. An objective evaluation method is used to measure performance. The results of the automatic segmentation were verified to determine their accuracy and overall quality. The results were also tested perceptually and found to be of satisfactorily high quality.
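
The forced-alignment step named in the abstract can be illustrated with a minimal sketch. The full text is not available here, so this is not the authors' implementation: it assumes the per-frame log-likelihoods for each state have already been computed by some acoustic model, treats all transitions as equally likely, and omits Baum-Welch training and whatever adaptation the paper applies to the Viterbi pass. The names `forced_align` and `boundaries` are hypothetical.

```python
import math


def forced_align(log_lik):
    """Viterbi forced alignment over a left-to-right state sequence.

    log_lik[t][s] is the log-likelihood of frame t under state s, where the
    states are already ordered according to the known transcription
    (assumed input; no acoustic model is trained here).
    Returns a list giving the state index assigned to each frame.
    """
    n_frames = len(log_lik)
    n_states = len(log_lik[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    neg_inf = -math.inf
    # score[t][s]: best log-score of any path that ends in state s at frame t
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    # came_from[t][s]: state occupied at frame t-1 on that best path
    came_from = [[0] * n_states for _ in range(n_frames)]

    score[0][0] = log_lik[0][0]  # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            # Left-to-right topology: either stay in s or advance from s-1
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, prev = advance, s - 1
            else:
                best, prev = stay, s
            if best > neg_inf:
                score[t][s] = best + log_lik[t][s]
                came_from[t][s] = prev

    # Backtrace from the last state at the last frame
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(came_from[t][path[-1]])
    path.reverse()
    return path


def boundaries(path):
    """Frame indices at which the aligned state changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

Because the transcription fixes the order of the states, the search only decides where each unit starts and ends, which is what makes the resulting boundaries reproducible from one run to the next.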

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: auto segmentation, Malay speech corpus
Subjects: Others
Divisions: Biosciences and Medical Engineering
ID Code: 36446
Deposited By: Liza Porijo
Deposited On: 31 Jan 2014 23:19
Last Modified: 28 Aug 2017 21:52
