Characterization of infrasonic acoustic signals
|MFCC||Mel frequency cepstral coefficients|
|DTW||Dynamic Time Warping|
Statement of the Problem
Objectives of the Research
The audio files of elephant rumbles were collected from open-source websites. Each audio file was annotated, and a selection table was made that stores the start time, end time, lowest frequency, and highest frequency of each selected rumble. These tables were then used to select the samples, calculated from the start and end times in the audio file, that were used to compute the MFCC coefficients.
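The slicing step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the selection-table rows and the synthetic signal are hypothetical stand-ins for the annotated Raven Lite tables and the downloaded audio.

```python
import numpy as np

# Hypothetical selection table mirroring the annotated columns described
# above: start time, end time, low/high frequency of each selected rumble.
selection_table = [
    {"start_s": 1.0, "end_s": 3.5, "low_hz": 10.0, "high_hz": 35.0},
    {"start_s": 7.2, "end_s": 9.0, "low_hz": 12.0, "high_hz": 40.0},
]

def extract_segments(signal, sample_rate, table):
    """Slice the raw audio into per-rumble segments using the
    start/end times stored in the selection table."""
    segments = []
    for row in table:
        start = int(row["start_s"] * sample_rate)  # time -> sample index
        end = int(row["end_s"] * sample_rate)
        segments.append(signal[start:end])
    return segments

# Synthetic stand-in for the audio file: 10 s of noise at 1 kHz.
sr = 1000
audio = np.random.randn(10 * sr)
segments = extract_segments(audio, sr, selection_table)
```

Each returned segment can then be passed on to the MFCC feature-extraction stage.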
Feature Extraction: MFCCs
The first step is to extract features, i.e. identify the components of the audio signal that are good for identifying the linguistic content while discarding everything else that carries information such as background noise and emotion. The key point about speech is that the sounds generated are filtered by the shape of the vocal tract, including the tongue and teeth. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to represent this envelope accurately.
Motivation behind the MFCCs steps
An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal does not change much (statistically, i.e. it is statistically stationary; the individual samples, of course, still change even on short time scales). This is why we frame the signal into 100 ms frames. If a frame is much shorter, we do not have enough samples to get a reliable spectral estimate; if it is longer, the signal changes too much within the frame.

The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear), which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire, informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.

The periodogram spectral estimate still contains a lot of information not required for vocal identification. In particular, the cochlea cannot discern the difference between two closely spaced frequencies, and this effect becomes more pronounced as the frequencies increase. For this reason, we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions. This is performed by our mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hz. As the frequencies get higher, our filters get wider, as we become less concerned about fine variations.

Once we have the filterbank energies, we take their logarithm. This is also motivated by human hearing: we do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put about 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear.

The final step is to compute the DCT of the log filterbank energies. There are two main reasons this is performed. Because our filterbanks are overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies, which means diagonal covariance matrices can be used to model the features in, e.g., a DTW classifier.
What is the mel scale?
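The mel scale maps physical frequency to perceived pitch so that equal distances in mel sound equally far apart to a listener. A common formula (the one used by most MFCC implementations) is m = 2595 log10(1 + f/700). A minimal sketch of the conversion:

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (standard HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The constants are chosen so that 1000 Hz maps to roughly 1000 mels; below about 1 kHz the scale is nearly linear and above it roughly logarithmic.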
MFCCs implementation steps
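The steps motivated above (frame, power spectrum, mel filterbank, log, DCT) can be sketched as follows. This is an illustrative sketch, not the project's actual code: the parameters (8 filters, 100 ms non-overlapping frames, 8 cepstral coefficients) match the description in this report but are not tuned, and scipy's `dct` stands in for a hand-written DCT.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):   # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):  # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate, frame_ms=100, n_filters=8, n_ceps=8):
    """Frame -> power spectrum -> mel filterbank -> log -> DCT."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # 1. Frame the signal (non-overlapping frames for simplicity).
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # 2. Periodogram power spectrum of each windowed frame.
    windowed = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2 / frame_len
    # 3. Mel filterbank energies.
    fbank = mel_filterbank(n_filters, frame_len, sample_rate)
    # 4. Log compression, matching perceived loudness.
    log_e = np.log(power @ fbank.T + 1e-10)
    # 5. DCT to decorrelate the filterbank energies.
    return dct(log_e, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

The result is one row of cepstral coefficients per frame, which is the feature sequence fed to the DTW matcher below.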
Pattern Matching- Dynamic Time Warping (The DTW Algorithm)
- Matching paths cannot go backwards in time.
- Every frame in the input must be used in a matching path.
- Local distance scores are combined by adding to give a global distance.
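The constraints above can be sketched with the classic dynamic-programming recurrence. This is an illustrative implementation, not the project's actual `dtw()` function: paths only move forward in time, every input frame is consumed, and local Euclidean distances accumulate into a global distance.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW global distance between two feature sequences
    (each row of a and b is one frame's feature vector)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local Euclidean distance between the two frames.
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Paths may only come from earlier in time (no backtracking).
            cost[i, j] = d + min(cost[i - 1, j],      # stretch a
                                 cost[i, j - 1],      # stretch b
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]
```

Two identical sequences give a distance of zero; the more the sequences differ, even after time warping, the larger the global distance.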
RESULTS AND DISCUSSION
We have tested our program with a total of 15 annotated audio signals, and rumbles were positively detected in the corresponding frames. We applied a threshold to the Euclidean distances returned by the dtw() function: for frames containing rumbles, signals with elephant rumbles matched with a distance below the threshold of 270.
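The decision rule can be sketched as below. The threshold of 270 is the empirical value reported above; the distance values in the usage example are hypothetical, for illustration only.

```python
THRESHOLD = 270.0  # empirical cut-off from the results above

def detect_rumbles(distances, threshold=THRESHOLD):
    """Flag each frame whose DTW distance to the reference rumble
    template falls below the threshold."""
    return [d < threshold for d in distances]

# Hypothetical distances for three frames.
flags = detect_rumbles([120.5, 310.0, 265.9])
```

Frames flagged `True` are reported as containing an elephant rumble.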
Annotated spectrogram on Raven Lite 2.0
Created mel filterbank with 8 triangular filters
Feature-matching result, i.e. the Euclidean distance computed by DTW
I am grateful to the Science Academies for giving me this wonderful opportunity. Financial assistance from the Indian Academy of Sciences (IAS) is gratefully acknowledged. I would like to thank my guide, Prof. Subrat Kar of IIT Delhi, for his constant encouragement and support in building this project. I equally thank my student mentor, Manohar Sir, for his help and for making new concepts easier to learn.
2. H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26 (1978)