
Summer Research Fellowship Programme of India's Science Academies

Disfluency detection and corpus generation for Indian sign language translation

Sneha V

Ramaiah Institute of Technology, Bangalore, Karnataka 560054

Dr Dinesh Babu Jayagopi

International Institute of Information Technology, Bangalore, Karnataka 560100

Abstract

In recent years, owing to the rise in performance of automatic speech recognition (ASR) systems, many applications such as machine translation, information extraction systems and dialog systems process spoken speech transcripts. These applications are usually trained on well-written text. Since spontaneous speech is unstructured, the transcripts generated by ASR systems contain false starts, repetitions of words or phrases, edit terms such as ‘I mean’ or ‘you know’ and filler terms like ‘oh’, ‘ahmm’ or ‘uh’. Because these ASR-generated transcripts are used to generate Indian Sign Language via machine translation, detecting and correcting these speech disfluencies in the preprocessing stages is important for good performance. Sequence models and natural language processing techniques are applied to detect the speech disfluencies in the generated transcripts. Data collection and corpus generation of ASR-generated disfluent transcripts, along with their manually corrected transcripts, is carried out for e-learning video courses like NPTEL and Stanford classes.

Keywords: automatic speech recognition systems, machine translation, sequence models

Abbreviations

ASR - Automatic Speech Recognition
RNN - Recurrent Neural Network
LSTM - Long Short-Term Memory
Bi-LSTM - Bi-directional Long Short-Term Memory
NLP - Natural Language Processing
CRF - Conditional Random Field
ISL - Indian Sign Language
POS - Part-Of-Speech
MT - Machine Translation

INTRODUCTION

"Whatever we may want to say, we probably won’t say exactly that." - Marvin Minsky. 1985. The Society of Mind. New York: Simon & Schuster, p. 236.

Briefings

The objective of the project is real-time translation of content (spontaneous speech or lectures in classrooms) in a spoken language (typically English) into Indian sign language using MT. An example of an English sentence translated to ISL is as follows -

English - ‘In 1971 I passed the tenth grade.’

ISL - 'PAST SCHOOL TEN PASS ONE NINE SEVEN ONE'

Spontaneous speech differs from written text, as the former is not well rehearsed and is unstructured. It is observed that most people repeat words or phrases, use edit terms to correct themselves and add filler terms to fill prolonged pauses during spontaneous speech. This phenomenon is called disfluency. Commonly observed categories of disfluency are -

  • Simple-Disfluency
    • Filler pauses - terms like ‘oh’, 'er', ‘ahm’ or ‘uh’
    • Conjunction pauses - repeated usage of terms like 'and' or 'because'
    • Edit terms - terms like ‘I mean’, 'anyway' or ‘you know’

Example of simple disfluency - "Oh I need to uh leave now."

  • Complex-Disfluency
    • Corrections
    • Repetitions of words or phrases

Example of complex disfluency - "I need a bus ticket to Bangalore sorry I mean Chennai."
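To make the simple-disfluency category concrete, here is a minimal Python sketch (not part of the project code) that strips filler pauses and edit terms by lexical matching. The word lists are illustrative assumptions drawn from the examples above; a real system would learn them from annotated data.

```python
import re

# Illustrative word lists based on the categories above (assumptions).
FILLER_PAUSES = {"oh", "er", "ahm", "uh", "um"}
EDIT_PHRASES = ["i mean", "you know", "anyway"]

def strip_simple_disfluencies(sentence):
    """Naively remove filler pauses and edit terms from a sentence."""
    text = sentence.lower()
    for phrase in EDIT_PHRASES:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    tokens = [t for t in text.split() if t.strip(".,!?") not in FILLER_PAUSES]
    return " ".join(tokens)

print(strip_simple_disfluencies("Oh I need to uh leave now."))
# -> "i need to leave now."
```

Note that such shallow matching cannot handle complex disfluencies like the correction in the example above, which is why the sequence models described later are needed.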

Datasets for MT consist of well-written or cleaned training text. Therefore, there is a mismatch between the training data and the ASR-generated transcripts fed as input to the model. This decreases the performance of the model and yields bad translations. An example of a translation that does not convey the intention of the speaker due to disfluencies is depicted in Fig 1.

[Fig 1: Wrong translation due to disfluency]

Having disfluencies in the ASR-generated transcripts makes the translation to ISL difficult. Therefore, detecting the disfluencies and correcting them in the preprocessing stages is essential in this project.

Statement of the problem

The project aims at real-time translation of content (spontaneous speech or lectures in classrooms) in a spoken language (typically English) into Indian Sign Language. Building a corpus that can be used for translation of English to ISL and detecting the disfluencies in the spoken speech transcripts are carried out.

Objectives of the Research

The main objectives include -

• Data collection of ASR-generated transcripts (which may or may not contain disfluencies) and manual transcripts (without disfluencies) of lecture videos from various sources like NPTEL and Stanford classes.
• Creation of a corpus for ISL translation and its analysis.
• Study of existing models used for disfluency detection.
• Implementing a neural machine translation model to detect and correct the disfluencies.
• Evaluating the results of the implemented model.

Scope

Detecting and correcting the disfluencies in preprocessing stages can help in increasing the performance of machine translation models, information extraction systems and dialog systems that process spoken speech transcripts.

LITERATURE REVIEW

Information on disfluency detection

Hany Hassan et al in 2014 proposed a new approach that identifies disfluencies by first removing simple disfluencies, then performing segmentation, and finally removing complex disfluencies. They propose a CRF classifier to remove simple disfluencies. Sentence boundary and punctuation detection are then carried out using CRF classifiers, restricting the annotation to three classes only: period, comma or nothing. Finally, complex disfluencies are detected using a knowledge-based parser. The difficulty in implementing this model is that a broad coverage of rules must be defined for the parser, which must also maintain the grammar. They used a broad-coverage rule-based parser, NLPWin by Microsoft, to help identify disfluencies, which is not open-sourced.
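The authors' CRF tooling is not public, but the general idea of a token-level CRF for simple disfluencies can be sketched with the open-source sklearn-crfsuite library. This is an assumption for illustration; the paper's actual toolkit, feature set and labels differ.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Illustrative per-token features for a simple-disfluency CRF tagger."""
    word = tokens[i].lower()
    return {
        "word": word,
        "is_filler": word in {"uh", "um", "oh", "er"},
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "repeats_prev": i > 0 and word == tokens[i - 1].lower(),
    }

# Toy training data: one sentence, labelled E (edit/disfluent) or F (fluent).
sentences = [["i", "uh", "need", "to", "go"]]
labels = [["F", "E", "F", "F", "F"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # -> [['F', 'E', 'F', 'F', 'F']]
```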

Julian Hough and David Schlangen have carried out various combinations of utterance segmentation and disfluency detection in their experiments to identify disfluencies. In 2017 they presented the joint task of incremental disfluency detection and utterance segmentation, along with a simple deep learning system that performs it on transcripts and ASR results. Their architecture takes as input embeddings of words in a backwards window from the most recent word, durations of words in the current window (from transcription or ASR word timings) and POS tags for words in the current window, which are passed incrementally to an RNN model with LSTM units. This model predicts pre-defined disfluency tags for each word of the input sentence. They conclude that their joint-task system for disfluency detection and utterance segmentation shows good results and sets a new benchmark for the joint task on Switchboard data, and that its incremental functioning on unsegmented data, including ASR result streams, makes it suitable for live systems.

In contrast to the work of Julian Hough and David Schlangen, Shaolei Wang et al say that the RNN method treats sequence tagging as classification on each input token and does not model the transition between tags, which is important in recognizing multi-word repair phrases. They also say that sequence tagging methods and RNN methods have no power to model linguistic structural integrity, i.e. the grammar of the text. They present a method that addresses this problem by implementing a neural attention-based Bi-LSTM model to learn the conditional probability of the output, which is an ordered subsequence of the input. They conclude that experimental results showed good performance, with an F1-score of 86.7 on the Switchboard corpus. In 2019, Qianqian Dong et al built a transcript disfluency detection model on the basis of the Transformer, the state-of-the-art neural MT model. They obtained an F-score of 89 when testing on the Switchboard data.

Summary

• It is difficult to build a broad-coverage rule-based parser that covers the entire vocabulary and can detect disfluencies as well as maintain the grammar of the text.
• RNN models that process words in a window can handle disfluency detection, but RNN methods have no power to model linguistic structural integrity.
• The state-of-the-art for detecting disfluencies in transcripts is neural MT with an attention mechanism.

METHODOLOGY

Task 1 - Corpus generation for ISL translation

Data collection

Since one of the aims of the project is translation of classroom lectures in English to ISL, platforms like NPTEL and Stanford online classes were scraped to obtain ASR-generated transcripts along with the manually edited transcripts for building a corpus for the MT model. 20000+ videos from NPTEL were scraped for their transcripts, and videos were later filtered out if manual transcripts were unavailable.

Corpus generation

• The scraped collection of manual transcripts had tags denoting the timing of occurrence along with the text. Preprocessing of the transcripts was carried out and the cleaned text was collected.
• From the collection of cleaned text, the 2000 most frequently used words were identified. The frequency distribution of the 120 most frequent words is depicted in Fig 2.
[Fig 2: Frequency distribution of top 120 words]
• The 3000 sentences containing the largest proportion of words belonging to the 2000 most frequently used words were picked to form the corpus. Each sentence was scored as (see the sketch after this list):

  Score(sentence) = |{words in the sentence} ∩ {most frequent words}| / |{words in the sentence}|

• The top 3000 sentences were manually translated to ISL and the corpus for the MT model was created.
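A minimal Python sketch of this selection procedure, assuming whitespace tokenization over already-cleaned text (function and variable names are illustrative, not the project's actual code):

```python
from collections import Counter

def build_isl_corpus(sentences, vocab_size=2000, top_k=3000):
    """Pick the sentences whose words best overlap the frequent vocabulary."""
    # Frequency distribution over all cleaned transcript text.
    freq = Counter(word for s in sentences for word in s.lower().split())
    frequent = {word for word, _ in freq.most_common(vocab_size)}

    def score(sentence):
        words = sentence.lower().split()
        # Fraction of the sentence's words found in the frequent vocabulary.
        return sum(w in frequent for w in words) / max(len(words), 1)

    return sorted(sentences, key=score, reverse=True)[:top_k]
```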

Task 2 - Disfluency detection

Method 1 - Implementing the RNN architecture with LSTM units

Method 1 implemented the approach presented in the paper by Julian Hough and David Schlangen.

Structure of disfluency

Disfluencies are typically assumed to have a tripartite reparandum-interregnum-repair structure, as shown in Fig 3. Here 'loves' is the repair word that must replace the reparandum 'likes'. If the reparandum and repair are absent, the disfluency reduces to an isolated edit term like 'uh' or 'er'. Each word in the input sentence is assigned a tag which labels the word as a fluent term '<f>', edit term '<e>', reparandum start word '<rms>', interregnum '<i>', repair term '<rps>' and so on, along with its offset index.

[Fig 3: Typical structure of disfluency]
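As an illustration of this tag scheme, one possible annotation of a disfluent sentence is sketched below; it follows the structure in Fig 3 rather than the exact offset-index notation of the original paper.

```python
# "I likes uh I mean loves it" under the reparandum-interregnum-repair structure.
tagged = [
    ("I",     "<f>"),    # fluent
    ("likes", "<rms>"),  # reparandum start: material to be replaced
    ("uh",    "<e>"),    # edit term
    ("I",     "<i>"),    # interregnum ('I mean')
    ("mean",  "<i>"),    # interregnum
    ("loves", "<rps>"),  # repair start: replaces the reparandum
    ("it",    "<f>"),    # fluent
]
```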

Architecture

• Input Features
  • Words in a backwards window from the most recent word
  • Durations of words in the current window (from transcription or ASR word timings)
  • Part-Of-Speech (POS) tags for words in current window
• Word embeddings
  • Word2vec
• RNN and LSTM units
  • Sigmoid function at hidden layers
  • Softmax function for the node activation function of the output layer
• Learning: Error function and parameter update
  • Negative log likelihood loss (NLL)
• Markov model: Ensure legal tag sequences are output for the given tag set
• Logistic regression classifier: Probability of the relevant utterance segmentation tags
[Fig 4: Overview of the architecture]
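Pulling the ingredients above together, the core word-window tagger can be sketched in Keras as follows. This is a simplified stand-in, not the original implementation: layer sizes, vocabulary, tag set, window length and optimizer are assumptions, and the Markov model and logistic regression classifier are omitted.

```python
import tensorflow as tf

# Illustrative sizes; the real vocabulary, tag set and window length differ.
VOCAB_SIZE, EMBED_DIM, HIDDEN_UNITS, N_TAGS, WINDOW = 5000, 50, 64, 12, 9

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW,)),             # window of recent word ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # word embeddings (word2vec-style)
    tf.keras.layers.LSTM(HIDDEN_UNITS),                 # recurrent encoding of the window
    tf.keras.layers.Dense(N_TAGS, activation="softmax"),  # distribution over disfluency tags
])
# Negative log likelihood over the predicted tag == categorical cross-entropy here.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```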

Method 2 - Neural machine translation with attention

This method implements a basic neural machine translation model with an attention mechanism that converts an English sentence with disfluencies to an English sentence without disfluencies.

Data preprocessing

The dataset used for this experiment was the Switchboard dataset, which contains telephone conversations between strangers. After identifying the representation of various notations used in Switchboard, the training dataset had English sentences with disfluencies along with their cleaned sentences (clear of all disfluencies) as targets. A few tuples of the dataset are shown in Fig 5. These sentences were cleaned by removing all special characters and converting to lowercase, and '<start>' and '<end>' tokens were added to each sentence.

[Fig 5: Dataset used in the MT model - with disfluencies and without disfluencies]
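The cleaning step described above can be sketched as a small Python function; this is a minimal version, and the tokenization details of the actual pipeline may differ.

```python
import re

def preprocess(sentence):
    """Lowercase, strip special characters, and add <start>/<end> tokens."""
    s = sentence.lower().strip()
    s = re.sub(r"[^a-z0-9\s]", " ", s)   # remove special characters
    s = re.sub(r"\s+", " ", s).strip()   # collapse repeated whitespace
    return "<start> " + s + " <end>"

print(preprocess("I need a bus ticket to Bangalore, sorry I mean Chennai."))
# -> "<start> i need a bus ticket to bangalore sorry i mean chennai <end>"
```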

Architecture

Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. The input vector is then put through an encoder, an RNN model which gives the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size). Each input word is assigned a weight by the attention mechanism (Bahdanau attention), which is then used by the decoder to predict the next word in the target sentence. The Adam optimizer was used to reduce the loss function.
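The attention step can be sketched in TensorFlow as below: a standard Bahdanau-style additive attention layer consistent with the shapes described above. The unit count and exact layer structure of the project's model are assumptions here.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores each encoder output against the decoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query:  decoder hidden state, shape (batch_size, hidden_size)
        # values: encoder output, shape (batch_size, max_length, hidden_size)
        query_with_time = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time) + self.W2(values)))
        weights = tf.nn.softmax(score, axis=1)             # one weight per input word
        context = tf.reduce_sum(weights * values, axis=1)  # weighted sum of encoder outputs
        return context, weights
```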

RESULTS AND DISCUSSION

Results of Method 1

On training the RNN model with LSTM units, a few test cases with correct results are shown in Fig 6.

[Fig 6: Test cases of Method 1 with correct results (Test Cases 1-3)]

The test cases shown in Fig 7 gave incorrect results, which gave insights that the model was overfitting on the term 'know' and on sentences beginning with 'I', and had to be further tuned.

[Fig 7: Test cases of Method 1 with incorrect results (Test Cases 4-7)]

Results of Method 2

The loss obtained after running the neural MT model for 100 epochs (about 2000 minutes of training) was 0.0062. The loss obtained for each epoch is depicted in Fig 8.

[Fig 8: Loss obtained for each epoch]

On training the neural MT model with the attention mechanism, attention maps of a few test cases are shown in Fig 9.

[Fig 9: Attention maps of test cases of Method 2 (Test Cases 1-4)]

The implemented model was observed to handle only the simple disfluencies and was overfitting on certain terms, for example in Test Cases 1-3 of Fig 9, where the term 'I' was edited. Test Case 4 of Fig 9 showed the main drawback of this model: during translation, new words would replace certain terms of the input sentence due to the attention weights. The term 'us' was replaced with 'them'. This model also could not handle complex disfluencies.

CONCLUSION AND RECOMMENDATIONS

The approach presented by Julian Hough and David Schlangen provides good results for both simple and complex disfluencies. It needs to be further tuned to avoid overfitting the model on certain commonly used disfluent terms.

The neural machine translation model with the attention mechanism handled only the simple disfluencies. To prevent new words from replacing the input words, the loss function must be modified such that it restricts the output words to the set of input words. To handle complex disfluencies, the Transformer model, the state-of-the-art NMT model, could be studied.
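One possible way to realize that restriction (a sketch of the idea, not an implemented or validated fix) is to mask the decoder's output logits so that only tokens appearing in the source sentence can be generated:

```python
import tensorflow as tf

def restrict_to_input(logits, input_token_ids, vocab_size):
    """Mask decoder logits so only tokens from the source sentence survive.

    logits:          (batch_size, vocab_size) raw decoder scores
    input_token_ids: (batch_size, src_length) token ids of the source sentence
    """
    # 0/1 indicator of which vocabulary items occur in each source sentence.
    present = tf.reduce_max(tf.one_hot(input_token_ids, vocab_size), axis=1)
    # Push logits of absent tokens to a large negative value before the softmax.
    return tf.where(present > 0, logits, tf.fill(tf.shape(logits), -1e9))
```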

REFERENCES

• Shaolei Wang, Wanxiang Che, Ting Liu (2016). A Neural Attention Model for Disfluency Detection. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, C16-1027.
• Hany Hassan, Lee Schwartz, Dilek Hakkani-Tur, Gokhan Tur (2014). Segmentation and Disfluency Removal for Conversational Speech Translation. Proceedings of Interspeech, ISCA - International Speech Communication Association.
• Julian Hough and David Schlangen (2017). Joint, Incremental Disfluency Detection and Utterance Segmentation from Speech. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, E17-1031.
• Dong, Q., Wang, F., Yang, Z., Chen, W., Xu, S., & Xu, B. (2019). Adapting Translation Models for Transcript Disfluency Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6351-6358.
• Eklund, Robert (2015). Disfluency in Swedish human-human and human-machine travel booking dialogues. 10.13140/RG.2.1.3015.0882.
• Honal, Matthias & Schultz, Tanja (2005). Automatic Disfluency Removal on Recognized Spontaneous Speech - Rapid Adaptation to Speaker-Dependent Disfluencies. Proceedings of ICASSP 2005, IEEE International Conference on Acoustics, Speech, and Signal Processing. 10.1109/ICASSP.2005.1415277.
• Matthew Lease, Mark Johnson (2006). Early Deletion of Fillers in Processing Conversational Speech. NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 73-76.
• Shalyminov, I., Eshghi, A., Lemon, O. (2018). Multi-Task Learning for Domain-General Spoken Disfluency Detection in Dialogue Systems.
• Jamshid Lou, Paria & Anderson, Peter & Johnson, Mark (2018). Disfluency Detection using Auto-Correlational Neural Networks.

ACKNOWLEDGEMENTS

I am grateful to the Science Academies for giving me this wonderful opportunity. I would like to thank my guide, Dr. Dinesh Babu Jayagopi, for his guidance and support throughout the research work. I would also like to thank my mentor, Mr Shyam Krishna, for providing valuable insights and helping me figure out the various intricacies involved. I am also thankful to my family and friends for their continuous support.
