IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008 229

Evaluation of Objective Quality Measures for Speech Enhancement

Yi Hu and Philipos C. Loizou, Senior Member, IEEE

Abstract: In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The evaluation covers a wide range of distortions introduced by four types of real-world noise at two signal-to-noise ratio (SNR) levels and by four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based, and Wiener algorithms. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports on the evaluation of correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.

Index Terms: Objective measures, speech enhancement, speech quality assessment, subjective listening tests.

Manuscript received January 10, 2007; revised September 26, 2007. This work was supported in part by the National Institute on Deafness and other Communication Disorders/National Institute of Health (NIDCD/NIH) under Grant R01 DC07527. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Abeer Alwan.
The authors are with the Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: [email protected] edu).
Digital Object Identifier 10.1109/TASL.2007.911054
1558-7916/$25.00 © 2007 IEEE

I. INTRODUCTION

Currently, the most accurate method for evaluating speech quality is through subjective listening tests. Although subjective evaluation of speech enhancement algorithms is often accurate and reliable (i.e., repeatable) provided it is performed under stringent conditions (e.g., sizeable listener panel, inclusion of anchor conditions, etc.), it is costly and time consuming. For that reason, much effort has been placed on developing objective measures that would predict speech quality with high correlation. Many objective speech quality measures have been proposed in the past to predict the subjective quality of speech. Most of these measures, however, were developed for the purpose of evaluating the distortions introduced by speech codecs and/or communication channels. The quantization and other types of distortions introduced by waveform and linear predictive coding (LPC)-based speech coders [e.g., code excited linear prediction (CELP)], however, are different from those introduced by speech enhancement algorithms. As a result, it is not clear whether the objective measures originally developed for predicting speech coding distortions are suitable for evaluating the quality of speech enhanced by noise suppression algorithms.

The types of distortion introduced by speech enhancement algorithms can be broadly divided into two categories: the distortions that affect the speech signal itself (called speech distortion) and the distortions that affect the background noise (called noise distortion). Of these two types of distortion, listeners seem to be influenced the most by the speech distortion when making judgments of overall quality. Unfortunately, no objective measure currently exists that correlates highly with either type of distortion or with the overall quality of speech enhanced by noise suppression algorithms.

Compared to the speech coding literature, only a small number of studies have examined the correlation between objective measures and the subjective quality of noise-suppressed speech. Salmela and Mattila [13] evaluated the correlation of a composite measure with the subjective (overall) quality of noise-suppressed speech. The composite measure consisted of 16 different objective measures which included, among others, spectral distance measures, LPC measures (e.g., Itakura-Saito) and time-domain measures (e.g., segmental SNR). The noisy speech samples were not processed by real enhancement algorithms, but rather by ideal noise-suppression algorithms designed to provide controlled attenuation to the background alone or to both background and speech signals. The resulting composite measure produced a high correlation of 0.95 with overall quality. Rohdenburg et al. [12] evaluated the correlation of several objective measures, including LPC-based measures [e.g., log-area ratio (LAR)] and the perceptual evaluation of speech quality (PESQ) measure, with speech enhanced by a single algorithm. The subjective listening tests were done according to the ITU-T P.835 methodology specifically designed to evaluate the distortions and overall quality of noise suppression algorithms. Correlations ranging from 0.7 to 0.81 were obtained with ratings of background distortion, signal distortion, and overall quality. Turbin and Faucheur [14] proposed a new objective measure for predicting the background intrusiveness rating scores obtained from ITU-T P.835-based listening tests. High correlation was found with the background noise ratings using a measure that was based on loudness density comparisons and coefficient of tonality. Turbin and Faucheur later extended their work in [18] and proposed an objective measure to estimate signal distortion (but not overall quality).

With the exception of [12], [14], and [18], most studies reported correlation of objective measures with only the overall quality of noise-suppressed speech. In those studies, only a small number (1–6) of noise suppression algorithms were involved in the evaluations. The study by Rohdenburg
et al. [12] evaluated the correlation of objective measures with speech/noise distortions and overall quality, but only for speech enhanced by a single statistical-model based enhancement algorithm, the minimum mean square error (mmse) algorithm. Other classes of algorithms (e.g., subspace and spectral subtractive), however, will likely introduce different types of signal/background distortion. Hence, the correlations reported in [12] are only applicable to distortions introduced by mmse-type algorithms and not by other algorithms.

To our knowledge, no comprehensive study has been done to assess the correlation of existing objective measures with the distortions (background and speech) present in enhanced speech and with the overall quality of noise-suppressed speech. Since different classes of algorithms introduce different types of signal/background distortion, it is necessary to include various classes of algorithms in such an evaluation. The main objective of the present study is to report on the evaluation of conventional as well as new objective measures that could be used to predict overall speech quality and speech/noise distortions introduced by representative speech enhancement algorithms from various classes (e.g., spectral-subtractive, subspace, etc.). To that end, we make use of an existing subjective database that we collected for the evaluation of speech enhancement algorithms [10], [11]. The subjective quality ratings were obtained by Dynastat, Inc. using the ITU-T P.835 methodology designed to evaluate the speech quality along three dimensions: signal distortion, noise distortion, and overall quality.

A preliminary evaluation of several objective measures with speech processed by enhancement algorithms was reported in an earlier study. In that study, we showed that the majority of the commonly used objective speech quality measures perform modestly well (with correlations not exceeding 0.75) in terms of predicting the subjective quality of noisy speech processed by enhancement algorithms. The correlations were computed using all speech samples (files) available, without averaging the objective scores across conditions. The test chosen was undoubtedly stringent, resulting in only a few of the objective measures correlating highly with speech and noise distortions introduced by speech enhancement algorithms. In this paper, we further extend those results and evaluate a larger set of objective speech quality measures after averaging the objective scores across conditions (SNR level, noise type, and algorithm). In addition, we propose several new composite objective measures derived using nonlinear and nonparametric regression models, which are shown to provide higher correlations with subjective speech quality and speech/noise distortions than the conventional objective measures. The use of composite measures is necessary as we cannot expect the simple objective measures (e.g., LPC-based) to correlate highly with signal/noise distortions and with overall quality.

This paper is organized as follows. In Section II, we describe the NOIZEUS noisy speech corpus and the subjective quality evaluation protocols. In Section III, we present the objective measures evaluated, and in Section IV we present the resulting correlation coefficients. The conclusions are given in Section V.

II. SPEECH CORPUS AND SUBJECTIVE QUALITY EVALUATIONS

In earlier work, we reported on the evaluation of several common objective measures using a noisy speech corpus (NOIZEUS^1) developed in our lab that is suitable for evaluation of speech enhancement algorithms. This corpus was used in a comprehensive subjective evaluation of 13 speech enhancement algorithms encompassing four different classes of algorithms: spectral subtractive (multiband spectral subtraction, and spectral subtraction using reduced delay convolution and adaptive averaging), subspace (generalized subspace approach, and perceptually based subspace approach), statistical-model-based (mmse, log-mmse, and log-mmse under signal presence uncertainty) and Wiener-filtering type algorithms (the a priori SNR estimation based method, the audible-noise suppression method, and the method based on wavelet thresholding the multitaper spectrum). The enhanced speech files were sent to Dynastat, Inc. (Austin, TX) for subjective evaluation using the recently standardized methodology for evaluating noise suppression algorithms based on ITU-T P.835 [2].

^1 [Online]. Available: http://www.utdallas.edu/~loizou/speech/noizeus/

The subjective listening tests were designed according to ITU-T recommendation P.835 and were conducted by Dynastat, Inc. (Austin, TX). The P.835 methodology was designed to reduce the listener's uncertainty in a subjective listening test as to which component(s) of a noisy speech signal, i.e., the speech signal, the background noise, or both, should form the basis of their ratings of overall quality. This method instructs the listener to successively attend to and rate the enhanced speech signal on:
- the speech signal alone using a five-point scale of signal distortion (SIG);
- the background noise alone using a five-point scale of background intrusiveness (BAK);
- the overall quality using the scale of the mean opinion score (OVRL): 1 = bad, 2 = poor, 3 = fair, 4 = good, 5 = excellent.

The SIG and BAK scales are described in Table I. A total of 32 listeners were recruited for the listening tests. The results of the subjective listening tests were reported in [10] and [11]. In this paper, we make use of the subjective ratings along the three quality scales (SIG, BAK, OVRL) to evaluate conventional and new objective measures.

III. OBJECTIVE MEASURES

Several objective speech quality measures were evaluated: segmental SNR (segSNR), weighted-slope spectral distance (WSS), PESQ [7], [8], LPC-based objective measures including the log-likelihood ratio (LLR), Itakura-Saito distance measure (IS), and cepstrum distance measures (CEP), and frequency-weighted segmental SNR (fwSNRseg). Composite measures obtained by combining a subset of the above measures were also evaluated.

Two figures of merit are computed for each objective measure. The first one is the correlation coefficient (Pearson's
correlation) between the subjective quality ratings $S_d$ and the objective measure $O_d$, and is given by

$$\rho = \frac{\sum_d (S_d - \bar{S}_d)(O_d - \bar{O}_d)}{\left[\sum_d (S_d - \bar{S}_d)^2\right]^{1/2}\left[\sum_d (O_d - \bar{O}_d)^2\right]^{1/2}} \tag{1}$$

where $\bar{S}_d$ and $\bar{O}_d$ are the mean values of $S_d$ and $O_d$, respectively. The second figure of merit is an estimate of the standard deviation of the error when the objective measure is used in place of the subjective measure, and is given by

$$\hat{\sigma}_e = \hat{\sigma}_s \sqrt{1 - \rho^2} \tag{2}$$

where $\hat{\sigma}_s$ is the standard deviation of $S_d$, and $\hat{\sigma}_e$ is the computed standard deviation of the error. A smaller value of $\hat{\sigma}_e$ indicates that the objective measure is better at predicting subjective quality.

TABLE I. DESCRIPTION OF THE SIG AND BAK SCALES USED IN THE SUBJECTIVE LISTENING TESTS

Two types of regression analysis techniques were used in this paper, namely parametric (linear regression) and nonparametric techniques. The nonparametric regression technique used was based on multivariate adaptive regression splines (MARS) analysis. Unlike the linear and polynomial regression techniques, the MARS modeling technique is data driven and derives the best fitting function from the data. The basic idea of MARS modeling is to use spline functions to locally fit the data in a region, and then generate a global model by combining the data regions using basis functions. One of the most powerful features of MARS modeling is that it allows interactions between the predictor (independent) variables so that a better fit can be found for the target (dependent) variable.

A. PESQ

Among all objective measures considered, the PESQ measure is the most complex to compute and is the one recommended by ITU-T for speech quality assessment of 3.2 kHz (narrow-band) handset telephony and narrow-band speech codecs [7], [8]. The PESQ score is computed as a linear combination of the average disturbance value $D_{ind}$ and the average asymmetrical disturbance value $A_{ind}$ as follows:

$$\text{PESQ} = a_0 + a_1 D_{ind} + a_2 A_{ind} \tag{3}$$

where $a_0 = 4.5$, $a_1 = -0.1$, and $a_2 = -0.0309$. The parameters $a_0$, $a_1$, and $a_2$ in the above equation were optimized for speech processed through networks and not for speech enhanced by noise suppression algorithms. As we cannot expect the PESQ measure to correlate highly with all three quality measures (speech distortion, noise distortion, and overall quality), we considered optimizing the PESQ measure for each of the three rating scales by choosing a different set of parameters $(a_0, a_1, a_2)$ for each rating scale. The modified PESQ measures were obtained by treating $a_0$, $a_1$, and $a_2$ in (3) as the parameters that need to be optimized for each of the three rating scales: speech distortion, noise distortion, and overall quality. Multiple linear regression analysis was used to determine the $a_0$, $a_1$, and $a_2$ parameters. The values of $D_{ind}$ and $A_{ind}$ in (3) were treated as independent variables in the regression analysis. The actual subjective scores for the three scales were used in the regression analysis. This analysis yielded three different modified PESQ measures suitable for predicting signal distortion, noise distortion, and overall speech quality. These measures will be described later in Section IV.

B. LPC-Based Objective Measures

Three different LPC-based objective measures were considered: the LLR, the IS, and the cepstrum distance measures. The LLR measure is defined as

$$d_{LLR}(\mathbf{a}_p, \mathbf{a}_c) = \log\left(\frac{\mathbf{a}_p \mathbf{R}_c \mathbf{a}_p^T}{\mathbf{a}_c \mathbf{R}_c \mathbf{a}_c^T}\right) \tag{4}$$

where $\mathbf{a}_c$ is the LPC vector of the original speech signal frame, $\mathbf{a}_p$ is the LPC vector of the enhanced speech frame, and $\mathbf{R}_c$ is the autocorrelation matrix of the original speech signal. Only the smallest 95% of the frame LLR values were used to compute the average LLR value. The segmental LLR values were limited to the range [0, 2] to further reduce the number of outliers.

The IS measure is defined as

$$d_{IS}(\mathbf{a}_p, \mathbf{a}_c) = \frac{\sigma_c^2}{\sigma_p^2}\left(\frac{\mathbf{a}_p \mathbf{R}_c \mathbf{a}_p^T}{\mathbf{a}_c \mathbf{R}_c \mathbf{a}_c^T}\right) + \log\left(\frac{\sigma_c^2}{\sigma_p^2}\right) - 1 \tag{5}$$

where $\sigma_c^2$ and $\sigma_p^2$ are the LPC gains of the clean and enhanced signals, respectively. The IS values were limited to the range [0, 100]. This was necessary in order to minimize the number of outliers.

The cepstrum distance provides an estimate of the log spectral distance between two spectra. The cepstrum coefficients can be obtained recursively from the LPC coefficients $\{a_j\}$ using the following expression:

$$c(m) = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c(k)\, a_{m-k}, \qquad 1 \le m \le p \tag{6}$$
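The LPC-based quantities in (4) and (6) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' MATLAB implementation: the helper names (`autocorr`, `levinson`, `quad_form`, `llr_frame`, `lpc_to_cepstrum`) are hypothetical, frames are assumed to be already windowed, and the LPC vector is assumed to be the prediction-error filter $[1, a_1, \ldots, a_p]$ obtained with the Levinson-Durbin recursion. Under that sign convention the predictor coefficients expected by (6) are the negated filter taps.

```python
import math

def autocorr(frame, order):
    """Biased autocorrelation lags r[0..order] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns the error filter [1, a_1, ..., a_p]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated residual energy
    return a

def quad_form(a, r):
    """a R a^T, with R the Toeplitz autocorrelation matrix built from r."""
    return sum(a[i] * a[j] * r[abs(i - j)]
               for i in range(len(a)) for j in range(len(a)))

def llr_frame(clean, enhanced, order=10):
    """Frame LLR as in (4), limited to [0, 2] as described in the text."""
    r_c = autocorr(clean, order)
    a_c = levinson(r_c, order)
    a_p = levinson(autocorr(enhanced, order), order)
    d = math.log(quad_form(a_p, r_c) / quad_form(a_c, r_c))
    return min(max(d, 0.0), 2.0)

def lpc_to_cepstrum(alpha):
    """Recursion (6) for predictor coefficients alpha = [a_1, ..., a_p]."""
    p = len(alpha)
    c = [0.0] * (p + 1)                     # c[0] unused; indices follow (6)
    for m in range(1, p + 1):
        c[m] = alpha[m - 1] + sum((k / m) * c[k] * alpha[m - k - 1]
                                  for k in range(1, m))
    return c[1:]
```

For a single-pole predictor with coefficient $\alpha$, the recursion reproduces the known closed form $c(m) = \alpha^m / m$, which is a convenient sanity check. Averaging the frame values (after discarding the largest 5%, as the text describes for the LLR) is left out of the sketch.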
In (6), $p$ is the order of the LPC analysis. An objective measure based on cepstrum coefficients can be computed as follows:

$$d_{CEP}(\mathbf{c}_c, \mathbf{c}_p) = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{p}\left[c_c(k) - c_p(k)\right]^2} \tag{7}$$

where $\mathbf{c}_c$ and $\mathbf{c}_p$ are the cepstrum coefficient vectors of the clean and enhanced signals, respectively. The cepstrum distance was limited to the range [0, 10] to minimize the number of outliers.

C. Time-Domain and Frequency-Weighted SNR Measures

The time-domain segmental SNR (segSNR) measure was computed on a frame-by-frame basis. Only frames with segmental SNR in the range of -10 to 35 dB were considered in the average.

The frequency-weighted segmental SNR (fwSNRseg) was computed using the following equation:

$$\text{fwSNRseg} = \frac{10}{M}\sum_{m=0}^{M-1}\frac{\sum_{j=1}^{K} W(j,m)\,\log_{10}\dfrac{|X(j,m)|^2}{\left(|X(j,m)| - |\hat{X}(j,m)|\right)^2}}{\sum_{j=1}^{K} W(j,m)} \tag{8}$$

where $W(j,m)$ is the weight placed on the $j$th frequency band, $K$ is the number of bands, $M$ is the total number of frames in the signal, $|X(j,m)|$ is the weighted (by a Gaussian-shaped window) clean signal spectrum in the $j$th frequency band at the $m$th frame, and $|\hat{X}(j,m)|$ is the weighted enhanced signal spectrum in the same band. For the weighting function, we considered the magnitude spectrum of the clean signal raised to a power, i.e.,

$$W(j,m) = |X(j,m)|^{\gamma} \tag{9}$$

where $|X(j,m)|$ is the weighted magnitude spectrum of the clean signal obtained in the $j$th band at frame $m$ and $\gamma$ is the power exponent, which can be varied for maximum correlation. In our experiments, we varied $\gamma$ from 0.1 to 2 and retained the value that gave maximum correlation.

The spectra $|X(j,m)|$ in (8) were obtained by dividing the signal bandwidth into either 25 bands or 13 bands spaced in proportion to the ear's critical bands. The 13 bands were formed by merging adjacent critical bands. The weighted spectra used in (8) were obtained by multiplying the FFT magnitude spectra with overlapping Gaussian-shaped windows [26, Ch. 11] and summing up the weighted spectra within each band. Prior to the distance computation in (8), the clean and processed FFT magnitude spectra were normalized to have an area equal to one. This normalization was found to be critically important.

The last conventional measure tested was the WSS measure. The WSS distance measure computes the weighted difference between the spectral slopes in each frequency band. The spectral slope is obtained as the difference between adjacent spectral magnitudes in decibels. The WSS measure evaluated in this paper is defined as

$$d_{WSS} = \frac{1}{M}\sum_{m=0}^{M-1}\frac{\sum_{j=1}^{K} W(j,m)\left(S_c(j,m) - S_p(j,m)\right)^2}{\sum_{j=1}^{K} W(j,m)} \tag{10}$$

where $W(j,m)$ are the band weights, $K$ and $M$ are defined as in (8), and $S_c(j,m)$ and $S_p(j,m)$ are the spectral slopes for the $j$th frequency band at frame $m$ of the clean and processed speech signals, respectively. The number of bands $K$ was held fixed in our implementation.

Aside from the PESQ measure, all other measures were computed by segmenting the sentences using 30-ms duration Hamming windows with 75% overlap between adjacent frames. A tenth-order LPC analysis was used in the computation of the LPC-based objective measures (CEP, IS, and LLR).

D. Composite Measures

Composite objective measures were obtained by combining basic objective measures to form a new measure. As mentioned earlier, composite measures are necessary as we cannot expect the conventional objective measures (e.g., LLR) to correlate highly with speech/noise distortions and overall quality. The composite measures can be derived by utilizing multiple linear regression analysis or by applying nonlinear techniques. In this paper, we used both multiple linear regression analysis and MARS analysis to estimate three different composite measures: a composite measure for signal distortion (SIG), a composite measure for noise distortion (BAK), and a composite measure for overall speech quality (OVRL).

The task of forming a good composite measure by linearly combining basic objective measures is not an easy one. Ideally, we would like to combine objective measures that correlate highly with subjective ratings and, at the same time, capture different characteristics of the distortions present in the enhanced signals. There is no straightforward method of selecting the best subset of objective measures to use in the composite measure, other than by trying out different combinations and assessing the resulting correlation. Multidimensional scaling techniques [1, Ch. 4] may be used in some cases as a guide for the selection. The methodology used in [1, Ch. 9] was adopted in this study for selecting the individual objective measures. More specifically, we tested various combinations of basic objective measures to determine to what extent the correlation coefficient could be improved by combining them. Seven basic objective measures were used in the analysis. We kept only the subset of measures for which the correlation of the composite measure improved significantly over the correlation coefficient of the individual measures.

IV. RESULTS: CORRELATIONS OF OBJECTIVE MEASURES

Correlation coefficients and estimates of the standard deviation of the error were computed for each objective measure and each of the three subjective rating scales (SIG, BAK, OVRL). Two types of correlation analysis were performed. The first analysis was done as in our previous study and included all objective scores obtained for each speech sample (file).
A total of 1792 processed speech samples were included in the correlations, encompassing two SNR levels (5 and 10 dB), four different types of background noise, and speech/noise distortions introduced by 13 different speech enhancement algorithms. The ratings for each speech sample were averaged across all listeners involved in that test. A total of 43 008 (1792 files × 8 listeners × 3 rating scales) subjective listening scores were used in the computation of the correlation coefficients for the three rating scales. Acknowledging that the above correlation analysis is rather stringent (but perhaps more desirable in some applications), we considered performing correlation analysis using objective scores which were averaged across each condition. This analysis involved the use of mean objective scores and ratings computed across a total of 112 conditions (14 algorithms^2 × 2 SNR levels × 4 noise types). In order to cross-validate the composite measures (and any other measures requiring training), we divided our data set in half, with 50% of the data being used for training and the remaining 50% being used for testing. Of the 16 speech samples used for each condition, we used eight speech samples for training and eight for testing. So, in the first correlation analysis, we used the ratings and objective scores of 896 speech samples for training and the rest for testing. In the second correlation analysis, we used the ratings and objective scores averaged across eight (of 16) speech files for training. This yielded a total of 112 pairs of ratings and objective scores for training. For testing, we used the ratings and objective scores averaged across the remaining eight files in each condition. This yielded a total of 112 pairs of ratings and objective scores for testing. We also considered partitioning the data into training and testing sets according to the various classes of speech enhancement algorithms. In this setup, the composite measures were trained on data taken from a given set of algorithms and tested on data taken from the remaining algorithms. The resulting correlation coefficients of the composite measures were comparable to those obtained with the aforementioned data partitioning. For that reason, we only report correlations with the former data partitioning.

^2 The noisy sentences (unprocessed) were also included.

TABLE II. ESTIMATED CORRELATION COEFFICIENTS ($|\rho|$) OF OBJECTIVE MEASURES WITH OVERALL QUALITY, SIGNAL DISTORTION, AND BACKGROUND NOISE DISTORTION. CORRELATIONS WERE OBTAINED USING THE OBJECTIVE SCORES OF ALL SPEECH SAMPLES

We report separately the correlations (and errors $\hat{\sigma}_e$) obtained using the per speech sample analysis (Tables II and III) and the per condition analysis (Tables IV and V). Tables VI and VII provide the regression coefficients of the modified PESQ and the composite measures, respectively, both obtained using multiple linear regression analysis. Table VI tabulates the coefficients $a_0$, $a_1$, and $a_2$ used in (3) for constructing the modified PESQ measures for the three rating scales. The assumed form of the composite measures listed in Table VII is as follows:

$$C_i = \beta_{i0} + \sum_{j} \beta_{ij}\, O_j \tag{11}$$

where $C_i$ is the composite measure for rating scale $i$ (signal distortion, background distortion, overall quality), $\beta_{ij}$ are the regression coefficients given in Table VII, and $O_j$ are the corresponding objective measures. Empty entries in Table VII indicate that the corresponding objective measure was not included in the composite measure.

Comparing Tables II and IV, we see a large difference (of about 0.2) between the correlations obtained on a per sample basis and those obtained on a per condition basis. From Table II, we see that of the seven basic objective measures tested, the
PESQ measure yielded the highest correlation with overall quality, followed by the fwSNRseg measure and the LLR measure. Compared to the PESQ measure, the LLR and fwSNRseg measures are computationally simpler to implement and yield roughly the same correlation coefficient. The lowest correlation was obtained with the segSNR measure. The correlations with signal distortion were of the same magnitude as those of overall quality.

TABLE III. STANDARD DEVIATION OF THE ERROR ($\hat{\sigma}_e$) OF OBJECTIVE MEASURES WITH OVERALL QUALITY, SIGNAL DISTORTION, AND BACKGROUND NOISE DISTORTION. STANDARD DEVIATIONS OF ERROR WERE OBTAINED USING THE OBJECTIVE SCORES OF ALL SPEECH SAMPLES

TABLE IV. ESTIMATED CORRELATION COEFFICIENTS ($|\rho|$) OF OBJECTIVE MEASURES WITH OVERALL QUALITY, SIGNAL DISTORTION, AND BACKGROUND NOISE DISTORTION. CORRELATIONS WERE OBTAINED AFTER AVERAGING OBJECTIVE SCORES AND RATINGS ACROSS CONDITIONS
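The difference between the per speech sample analysis and the per condition analysis can be illustrated with a small sketch. It assumes each file is tagged with a condition key (algorithm, noise type, SNR level); the labels, the helper name `average_by_condition`, and the numbers are all hypothetical toy values constructed so that the file-level scatter cancels within each condition.

```python
import math
from collections import defaultdict

def correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def average_by_condition(records):
    """records: iterable of (condition, objective_score, rating), one per file."""
    groups = defaultdict(list)
    for condition, score, rating in records:
        groups[condition].append((score, rating))
    scores, ratings = [], []
    for condition in sorted(groups):
        pairs = groups[condition]
        scores.append(sum(s for s, _ in pairs) / len(pairs))
        ratings.append(sum(r for _, r in pairs) / len(pairs))
    return scores, ratings

# Hypothetical toy data: 4 conditions x 2 files, with +/-1.5 file-level scatter.
conditions = [("algA", "babble", 5), ("algA", "babble", 10),
              ("algB", "car", 5), ("algB", "car", 10)]
records = []
for quality, condition in enumerate(conditions, start=1):
    records.append((condition, quality + 1.5, float(quality)))
    records.append((condition, quality - 1.5, float(quality)))

per_sample = correlation([s for _, s, _ in records],
                         [r for _, _, r in records])
cond_scores, cond_ratings = average_by_condition(records)
per_condition = correlation(cond_scores, cond_ratings)
```

In this constructed example the per-file correlation is about 0.60 while the per-condition correlation is 1, because averaging removes the within-condition variance of the objective scores. Real data will not cancel this cleanly, but the direction of the effect is the same.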
The similar magnitude of the correlations with signal distortion and with overall quality suggests that the same basic objective measure predicts both equally well. This finding is consistent with our previous data suggesting that listeners are more sensitive to signal distortion than background distortion when making judgments on overall quality. The correlations with noise distortion, however, were generally poorer, suggesting that the basic objective measures are inadequate in predicting background distortion. A significant improvement in correlation with background distortion was obtained with the use of composite measures. Significant improvements were also obtained in correlations with overall quality and signal distortion. Tables II and IV list separately the correlations obtained by the composite measures with training and testing data. The highest correlation coefficients with overall quality, signal distortion, and noise distortion were obtained with the MARS-based composite measure. The MARS composite measure particularly improved the background distortion correlation, from 0.48 (obtained with PESQ) to 0.64.

TABLE V. STANDARD DEVIATION OF THE ERROR ($\hat{\sigma}_e$) OF OBJECTIVE MEASURES WITH OVERALL QUALITY, SIGNAL DISTORTION, AND BACKGROUND NOISE DISTORTION. STANDARD DEVIATIONS OF ERROR WERE OBTAINED AFTER AVERAGING OBJECTIVE SCORES AND RATINGS ACROSS CONDITIONS

TABLE VI. REGRESSION COEFFICIENTS [SEE (3)] FOR THE MODIFIED PESQ MEASURES

Overall, the correlations obtained on a per condition basis (Table IV) were higher (by about 0.2) than the correlations obtained on a per speech sample basis, and the standard deviations of the error were smaller (Table V). This is to be expected, given the smaller variance in objective scores following the averaging across files. The pattern of results, however, in terms of
which objective measures yielded the highest correlation was similar to that shown in Table II. Of the seven basic objective measures tested, the PESQ measure yielded the highest correlation with overall quality, followed by the fwSNRseg and LLR measures. The composite measures further improved the correlation with overall quality to higher than 0.9. The highest correlation with overall quality was obtained with the modified PESQ measure, and the highest correlation with background distortion was obtained with the composite MARS measure.

Fig. 1 shows the scatter plot of the OVRL scores and predicted scores obtained by the modified PESQ measure, which yielded a correlation of 0.92 with overall quality. Table VIII shows the number of basis functions and objective measures involved in the composite MARS measures. Sample MATLAB code for the MARS composite measures used for predicting signal distortion and overall quality is given in the Appendix. MATLAB code for the implementation of all objective measures tested is also available.

TABLE VII. REGRESSION COEFFICIENTS [SEE (11)] AND OBJECTIVE MEASURES USED IN THE CONSTRUCTION OF THE COMPOSITE MEASURES

TABLE VIII. NUMBER OF BASIS FUNCTIONS AND OBJECTIVE MEASURES USED IN THE CONSTRUCTION OF THE MARS-BASED COMPOSITE MEASURES

Fig. 1. Scatter plot of the modified PESQ measure against the true subjective ratings of overall speech quality (OVRL). The estimated correlation coefficient was 0.92.

The cross-validation of the composite measures indicated that they are robust to new distortions. This was found to be true even when the data were partitioned into training and testing sets according to the various classes of algorithms.^3 The correlations obtained with the training data were comparable to those obtained with new and unseen data (Tables II and IV). Furthermore, the fact that these composite measures were tested on a publicly available speech corpus (NOIZEUS) makes these measures well suited for testing new enhancement algorithms.

^3 After using the data from the first ten algorithms for training, and the data for the remaining four algorithms for testing, we obtained a correlation coefficient of 0.94 with the training data and a correlation coefficient of 0.96 with the test data for the proposed composite measures designed to predict overall quality. The corresponding correlation coefficients for training and testing data were (0.92, 0.96), respectively, for signal distortion, and (0.86, 0.82) for noise distortion. These data clearly demonstrate that the proposed composite measures are robust to alternate partitioning of the data.
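The train/test procedure for a linear composite measure of the form (11) can be sketched as below. This is a toy illustration under stated assumptions, not the fitted measures of Table VII: the two objective scores are hypothetical stand-ins for a PESQ-like and an LLR-like measure, the ratings are generated noise-free from a made-up rule (`toy_rating`), and the least-squares fit uses the normal equations solved by Gauss-Jordan elimination rather than any particular statistics package.

```python
def fit_linear(X, y):
    """Least-squares fit of y ~ b0 + sum_j b_j * x_j via the normal equations."""
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    for col in range(n):                         # Gauss-Jordan with pivoting
        piv = max(range(col, n), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for i in range(n):
            if i != col:
                f = A[i][col] / A[col][col]
                A[i] = [v - f * w for v, w in zip(A[i], A[col])]
                b[i] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]

def composite(beta, measures):
    """Evaluate (11): intercept plus weighted sum of objective measures."""
    return beta[0] + sum(bj * oj for bj, oj in zip(beta[1:], measures))

def toy_rating(pesq_like, llr_like):
    """Made-up, noise-free rating rule used only to generate example data."""
    return 1.0 + 1.0 * pesq_like - 0.5 * llr_like

# Hypothetical per-condition objective scores, split into training and test halves.
train_X = [(1.0, 1.0), (2.0, 0.5), (3.0, 1.5), (2.5, 0.2), (1.5, 1.2)]
test_X = [(1.2, 0.8), (2.8, 0.4), (2.0, 1.0)]
train_y = [toy_rating(*x) for x in train_X]

beta = fit_linear(train_X, train_y)
test_pred = [composite(beta, x) for x in test_X]
test_err = max(abs(p - toy_rating(*x)) for p, x in zip(test_pred, test_X))
```

Because the toy ratings are an exact linear function of the scores, the fit recovers the generating coefficients and the held-out error is essentially zero. With real ratings one would instead compare the correlation of the predictions with the ratings on the training half and on the held-out half, as the footnote above does.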
V. SUMMARY AND CONCLUSION

The present study extended our previous evaluation of objective measures and included a per condition correlation analysis. With this new type of analysis, the majority of the correlation coefficients improved by about 0.2. The correlation coefficient of the PESQ measure improved from 0.65 to 0.89.

Based on the correlation analysis reported above, we can draw the following conclusions:

The segSNR measure, which is widely used for evaluating the performance of speech enhancement algorithms, yielded a very poor correlation coefficient with overall quality. This finding was consistent across both types of correlation analysis conducted, and thus makes this measure unsuitable for evaluating the performance of enhancement algorithms.

Of the seven basic objective measures tested, the PESQ measure yielded the highest correlation with overall quality and signal distortion. The LLR and fwSNRseg measures performed nearly as well at a fraction of the computational cost. Hence, the LLR and fwSNRseg measures are simpler alternatives to the PESQ measure.

The majority of the basic objective measures predict signal distortion and overall quality equally well, but not background distortion. This was not surprising, given that most measures take into account both speech-active and speech-absent segments in their computation. Measures that place more emphasis on the speech-absent segments would be more appropriate and likely more successful in predicting noise distortion (BAK).

APPENDIX

This Appendix shows the MATLAB code for the implementation of the MARS-based composite measures for signal distortion and overall quality. These composite measures yielded correlations of 0.9 and 0.91 with signal distortion and overall quality, respectively.

[MATLAB listing: two functions, the first computing the composite measure for predicting SIG ratings and the second the composite measure for OVRL ratings.]

MATLAB code for the implementation of the PESQ, IS, and other measures tested in this paper is also available.

ACKNOWLEDGMENT

The authors would like to thank Dr. A. Sharpley of Dynastat, Inc. for his help and advice throughout the project.

REFERENCES

[1] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] "Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm," ITU-T, ITU-T Rec. P. 835, 2003.
[3] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Kleijn and K. Paliwal, Eds. New York: Elsevier, 1995, pp. 467–494.
[4] L. Thorpe and W. Yang, "Performance of current perceptual objective speech quality measures," in Proc. IEEE Speech Coding Workshop, 1999, pp. 144–146.
[5] T. H. Falk and W. Chan, "Single-ended speech quality measurement using machine learning methods," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 1935–1947, Nov. 2006.
[6] L. Malfait, J. Berger, and M. Kastner, "P.563-the ITU-T standard for single-ended speech quality assessment," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 1924–1934, Nov. 2006.
[7] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2001, vol. 2, pp. 749–752.
[8] "Perceptual evaluation of speech quality (PESQ), and objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU, ITU-T Rec. P. 862, 2000.
[9] R. Kubichek, D. Atkinson, and A. Webster, "Advances in objective voice quality assessment," in Proc. Global Telecomm. Conf., 1991, vol. 3, pp. 1765–1770.
[10] Y. Hu and P. Loizou, "Subjective comparison of speech enhancement algorithms," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2006, vol. 1, pp. 153–156.
[11] Y. Hu and P. Loizou, "Subjective comparison and evaluation of speech enhancement algorithms," Speech Commun., vol. 49, pp. 588–601, 2007.
[12] T. Rohdenburg, V. Hohmann, and B. Kollmeir, "Objective perceptual quality measures for the evaluation of noise reduction schemes," in Proc. 9th Int. Workshop Acoust. Echo Noise Control, 2005, pp. 169–172.
[13] J. Salmela and V. Mattila, "New intrusive method for the objective quality evaluation of acoustic noise suppression in mobile communications," in Proc. 116th Audio Eng. Soc. Conv., 2004, preprint 6145.
[14] V. Turbin and N. Faucheur, "A perceptual objective measure for noise reduction systems," in Proc. Online Workshop Meas. Speech Audio Quality Netw., 2005, pp. 81–84.
[15] E. Paajanen, B. Ayad, and V. Mattila, "New objective measures for characterization of noise suppression algorithms," in IEEE Speech Coding Workshop, 2000, pp. 23–25.
[16] V. Mattila, "Objective measures for the characterization of the basic functioning of noise suppression algorithms," in Proc. Online Workshop Meas. Speech Audio Quality Netw., 2003 [Online]. Available: http://wireless.feld.cvut.cz/mesaqin2003/contributions.html
[17] A. Bayya and M. Vis, "Objective measures for speech quality assessment in wireless communications," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1996, vol. 1, pp. 495–498.
[18] V. Turbin and N. Faucheur, "Estimation of speech quality of noise reduced signals," in Proc. Online Workshop Meas. Speech Audio Quality Netw., 2007 [Online]. Available: http://wireless.feld.cvut.cz/mesaqin2007/contributions.html
10 238 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008 Evaluation of objective measures for speech  Y. Hu and P. Loizou, “ SM 90 M ’ 91 – – ’ 04) received ’ (S Philipos C. Loizou 1450. in – enhancement, , 2006, pp. 1447 ” Proc. Interspeech the B.S., M.S., and Ph.D. degrees in electrical “  J. Hansen and B. Pellom, An effective quality evaluation protocol engineering from Arizona State University, Tempe, Proc. Int. Conf. Spoken Lang. in ” for speech enhancement algorithms, in 1989, 1991, and 1995, respectively. , 1998, vol. 7, pp. 2819 – Process. 2822. From 1995 to 1996, he was a Postdoctoral Fellow “  D. Klatt, Prediction of perceived phonetic distance from critical band in the Department of Speech and Hearing Science, Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. spectra, ” in , Arizona State University, working on research 1281. – 1982, vol. 7, pp. 1278 related to cochlear implants. He was an Assistant Application guide for objective quality measurement based on recom- “  Professor at the University of Arkansas, Little Rock, mendations P.862, P.862.1 and P. 862.2, ” ITU-T Rec. P. 862. 3, 2005. from 1996 to 1999. He is now a Professor in the A study of  J. Tribolet, P. Noll, B. McDermott, and R. E. Crochiere, “ Department of Electrical Engineering, University of in Proc. IEEE Int. ” complexity and quality of speech waveform coders, Texas at Dallas. His research interests are in the areas of signal processing, , 1978, pp. 586 – 590. Conf. Acoust., Speech, Signal Process. speech processing, and cochlear implants. He is the author of the book Speech  J. H. Friedman, Multivariate adaptive regression splines (with discus- “ (CRC, 2007). Enhancement: Theory and Practice RANSACTIONS ON PEECH S Ann. Statist. sion), – ” 141. , pp. 1 Dr. Loizou was an Associate Editor of the IEEE T AND A ROCESSING UDIO P (1999-2002) and is currently a member of the Speech “ Objective quality evaluation  N. Kitawaki, H. Nagabuchi, and K. 
Itoh, , IEEE J. Sel. Areas Commun. ” for low bit-rate speech coding systems, Technical Committee of the IEEE Signal Processing Society and serves as As- – vol. 6, no. 2, pp. 262 273, Mar. 1988. sociate Editor for the IEEE S IGNAL P ROCESSING . ETTERS L  P. Loizou . Boca Raton, , Speech Enhancement: Theory and Practice FL: CRC, 2007.  B. Grundlehner, J. Lecocq, R. Balan, and J. Rosca, “ Performance as- sessment method for speech enhancement systems, ” in Proc. 1st Annu. , 2005. IEEE BENELUX/DSP Valley Signal Process. Symp. received the B.S. and M.S. degrees in electrical Yi Hu engineering from the University of Science and Tech- nology of China (USTC), Beijing, China, in 1997 and 2000, respectively, and the Ph.D. degree in electrical engineering from the University of Texas at Dallas, Richardson, TX. He is currently a Research Associate at the Uni- versity of Texas at Dallas. His research interests are in the general area of speech and audio signal pro- cessing and improving auditory prostheses in noisy environments.