A Comparison of different methods for audio declipping
Audio restoration is a crucial task across various fields. Heuristic methods for retrieving tainted aspects of audio continue to be proposed today – audio declipping is one such important retrieval. Audio clipping results in the peaks of a waveform being cut off i.e. saturated. This report summarizes the implementation procedures and results of four audio declipping algorithms for a set of sinusoids, speech signals, and a music signal for illustrative purpose
Keywords: nonlinear distortion, clipping, audio enhancement
Background and Motivation
The clipping of audio waveforms beyond a certain threshold, also referred to as ‘hard clipping’ in literature, occurs usually when the dynamic range of a hardware component in the audio’s signal path is limited to a maximum value (which becomes the clipping threshold), or when unnormalized samples in an audio waveform exceed datatype ranges while writing the audio to a file using software like MATLAB. It is observed that hard clipping negatively affects the perceptual quality of audio due to a higher number of harmonics that appears in the clipped signal. The information carried by the audio samples beyond the clipping threshold is also lost in the process. An ideal audio declipping technique is expected to reverse this signal degradation caused by clipping.
The objective of this study is to implement some of the declipping algorithms for a variety of simulated signals (speech and music signals) and thus investigate whether further improvements can be made to an existing algorithm to enhance the robustness of the process, especially at severe clipping levels when the possibilities of reconstruction become endless.
Brief Literature Review
Audio declipping has attracted a wide number of techniques and some of these include: (a) Classical modelling of the clipped region of the audio signal as an autoregressive process  to determine the missing samples linearly, (b) Least squares minimization of the derivative of the energy of the waveforms. In recent times, declipping algorithms have shifted to sparse representation (SR) models coupled with hard constraints on the approximated samples. Some popular SR implementations of declipping include: (c) Iterative hard thresholding to compute a SR of the incoming clipped signal, (d) Greedy approximation of SR of incoming clipped signal using matching pursuit algorithms , and (e) Current state-of-the-art SPADE (Sparse Audio Declipper)  algorithm with analysis and synthesis variants
Let the original waveform be , and let denote after it has been clipped. The signal is peak normalized in this context i.e. its maximum absolute value is +1. Let the clipping threshold be . As is a normalized waveform, it can be stated that will take on a value ranging between 0.1 to 0.9, as would correspond to no clipping. The clipped signal can be defined as :
In this context, the following three sections highlight and delve into three declipping algorithms that tackle the aforementioned problem statement in different ways.
Method I - Least Squares Declipping
Classical methods of declipping do not specify constraints on the reconstructed samples, and this leads to reconstructed data that is illegitimate.
The least squares declipping technique , also called 'Constrained Blind Amplitude Reconstruction' or cBAR, is a simple interpolation technique that focuses on minimizing the energy of the signal such that the absolute values of the clipped samples are constrained to lie above the specified clipping threshold .
Assume that along with the clipped signal , we also know for certain the clipping threshold and the locations of the clipped samples in . Now let us define two extraction operators : and . These two operators are initialized to identity matrices (of dimension equal to length of ) with the rows corresponding to the clipped samples and the rows corresponding to the reliable samples removed, respectively. Therefore, (= ) is a vector containing the reliable samples of , and (= ) is a vector containing the clipped values. It can mathematically be proven that the following is valid:
The reconstruction procedure detailed by cBAR is modelled as a non-linear constrained optimization problem as follows, where is the second derivative operator:
This optimization is performed on a variable-frame basis. Hence, the extraction operators and the vectors , and in equation 2 correspond to those for a particular frame. For a given audio signal (or any input array), initially the first 80 samples of the array are considered for processing. A rectangular window is used for framing. If the last sample of the set is a clipped sample, the number of samples in the frame is incremented till the last sample is a reliable sample. Thus, frame size is not a fixed parameter and no frame overlap is incorporated. All the processed frames are simply concatenated.
The above optimization problem is solved using the function by reducing objective function to the prescribed format. An initial iteration of the solution coded with yielded comparable results, but was predictably extremely slow.
Modelling the solution in this manner simplifies the essence of the declipping problem to the simple case of fitting smooth parabolas wherever there is a saturation.
Method II - Constrained Orthogonal Matching Pursuit Declipping
The notion of utilizing sparse representation of signals to solve the problem of audio declipping is explored for the first time in . A generalized audio inpainting algorithm  catering to various audio enhancement operations is extrapolated to audio declipping. The basis for this algorithm is derived from the fact that any audio signal (which in our case is also the original audio signal) can theoretically be well approximated as :
D is an appropriate transformation, also called a 'dictionary' matrix, that is composed of different columns or 'atoms', and z is a vector comprising the coefficients or the weights for each atom in D. Orthogonal Matching Pursuit is an algorithm that arrives at a greedy approximation of x in D i.e. approximates the weights z for the atoms in D that produce x.
The clipped signal is processed on a fixed frame basis. The length of the frame depends on the sampling frequency of the original signal. The authors of  consider a frame length of 64 milliseconds, which translates to 512 samples/frame for a speech signal sampled at 8 kHz and 1024 samples/frame for a music signal sampled at 16kHz. A frame overlap of 75% is used. Each frame is multiplied with the rectangular window, and after all the frames undergo the required transformations, they are merged using the Overlap Add Method (OLA) which in turn multiplies every frame with the Hamming window.
The Constrained Orthogonal Matching Pursuit method, or cOMP, is applied to every frame independently. The process consists of two steps - first, the coefficient vector z is obtained using the OMP algorithm by making use of only the reliable samples, and second, this vector z is used to determine the values of the original signal at the clipping indices. Let us again assume that we have the clipped signal , we know for certain the clipping threshold and the locations of the clipped samples in . Let be the i-th frame obtained from , and let its length be N. Then, reliable samples of that frame-
The vector is then run through the OMP algorithm together with an overcomplete Discrete-Cosine Transform (DCT) dictionary D. D is initialized as an , but before it is passed to the OMP module, it is stripped to a matrix containing only rows corresponding to , and the columns are internally renormalized. The OMP algorithm outputs the vector such that where is the modified dictionary. After is obtained, it further undergoes a second optimization (the 'first' optimization is a least squares minimization that happens within the OMP module) constrained to ensure that the reconstructed samples obtained by for the i-th frame follow the fundamental rules laid down in section 4. This second optimization is :
The starting value of is the vector of sparse coefficients obtained from OMP algorithm for that particular frame i.e. , and is the largest value the signal can take post normalization, which is .
In the above procedure, was used to solve the OMP problem, and the creation of the DCT dictionary was also done using a parallel function . The DCT basis is known to produce an acceptable representation of audio signals by making use of the frequency components of each frame, but it doesn't guarantee the best representation. cOMP, as a solution to the declipping problem, is therefore only as good as the dictionary that is used for reconstruction.
Method III - Sparse Audio Declipping
The actual SPADE algorithm is a solution to the following two NP-Hard optimization problems (8) and (9) :
Equation (8) is the Analysis variant of SPADE (i.e. A-SPADE) while equation (9) is the synthesis variant (i.e. S-SPADE). An analysis operator (A, in our discussion) is a linear transformation matrix that extracts the weights corresponding to each column of its inverse (z being the vector of weights) from the original signal x, and a synthesis operator is also a linear transformation that forms the original signal x by combining with the weights z.
In this context, x can be assumed to be the reconstructed signal, and z is the corresponding coefficient vector. The SPADE algorithms make use of the Discrete Fourier Transform (DFT) and the inverse DFT as analysis and synthesis operators respectively.
These two algorithms (A-SPADE and S-SPADE) make use of a special optimization-solver called the Alternating Direction Method of Multipliers (ADMM) . ADMM is applied for optimization problems of the form :
where A is a transformation matrix, and f(x) and g(z) are usually considered to be convex functions. Let us assume here that initially, x is the clipped signal (in contrast to the notation we have been using thus far, and here becomes the reconstructed signal at the end of the process, in constrast to ) and z is the vector of coefficients. The constraint in equation (10) may as well be written as , corresponding to S-SPADE. The Augumented Langrangian function is formed for (10) as:
Where again, can be replaced by for S-SPADE, is the dual variable for the optimization problem in (10), and is a penalty parameter, often associated with Langrangians. The ADMM, which is a three step iterative process, is then applied to the Langrangian in (11) as follows (the three updates correspond to iteration ) :
Step (12) is an iterative process that is applied to the Augumented Langrangian of either the A-SPADE or S-SPADE problem. Equation (8) or (9) is modelled as the sum of two indicator functions - f(x), which enforces the optimization problem to force x into the solution space and g(z), which enforces the sparsity of the vector z of analysis or synthesis weights. This sum follows the format of equation (10). This leads to the Augumented Langrangian being formed as per equation (11), which is then solved by ADMM. The x-update of the ADMM step is solved by an orthogonal projection onto the set by making use of simple min-max functions, the z-update is performed by hard thresholding operation of the input vector, and y-update is a sum. These three steps are performed for multiple iterations, generated by relaxing the sparsity of z. When a specific termination condition is met (which is for A-SPADE or for S-SPADE ), the vector x is considered to be the reconstructed (i.e. declipped ) signal. is a small value assigned to 0.1.
These two algorithms appear to give the best reconstruction yet, as can be seen in the results below.
The main experiments conducted include testing the four algorithms on five sinusoids of frequencies 100Hz, 300Hz, 500Hz, 700 Hz and 1000Hz, five random speech signals and one music signal (for illustrative purposes). All the signals were peak normalized (i.e. the signal was made to lie in the range [-1 1]), artificially clipped for thresholds from 0.1 to 0.9, and run through the algorithms.
The signal-distortion ratio (SDR) is used as the primary parameter to quantify and compare restoration quality of input signals for all algorithms. It is defined for two vectors u and v as :
The actual evaluation parameter that is considered in the comparison tabulations makes use of definition (13) and is SDR = SDR(x, ) - SDR(x,y). This makes sure that the evaluation is independent of whether the SDR is calculated for the whole signal or for only the clipped samples.
All implementations have been coded and tested on MATLAB R2017b, and the Large Time/Frequency Analysis Toolbox (LTFAT)  was used for many operations in the algorithms outlined in sections 5.2 and 5.3.
A set of five sinusoids of frequencies mentioned above were passed through the four algorithms. For cBAR, A-SPADE and S-SPADE, the SDR values increased across increasing values, but not monotonically. Also, with increasing frequency, the SDR values decreased, whcih can be justified as the number of reliable samples begins to drop. For cOMP as well, SDR increased, but inconsistently. In fact, for cOMP, the inconsistency can be made to assume uniformity across all frequencies as the sparsity level of vector z can be fine-tuned for that particular frequency.
The results for the speech signals based on SDR have been tabulated as follows :
Each table (for each declipping algorithm) contains the SDR values for five speech signals (SS1 to SS5) for clipping thresholds from 0.1 to 0.9. cBAR projects an unpredictable performance, as speech signals are not expected to have parabolic changes. cOMP displays inconsistent SDR values The general trend however is an increasing SDR for increasing values. A-SPADE and S-SPADE produce the best results. Some inconsistencies are visible here as well, but in general it can be said that the SDR shows an increasing trend for these two techniques.
Further analysis of all the above four algorithms is required, with more testing on the perception of audio signals.
Having prior knowledge of the signal can lead to other non-heuristic declipping methods that make use of the aforementioned context. For instance, knowing that a given clipped speech signal had been originally sampled at 8kHz can lead to the construction of a low pass filter with cut off at 4kHz, application of this filter to the clipped signal and maximization of the energy of the filtered component. This is one technique to be tested.
The solution space for a given clipped signal y that can been proven to be a convex set in the derivation of the SPADE algorithms holds much promise for further work. Arriving at a vector in using a specific time-domain (as in ) or frequency-domain procedure (as in the aforementioned filtering procedure) would be a fruitful analysis.