Automatically Detecting Action Items in Audio Meeting Recordings

Jason M. Brenier, Surabhi Gupta, William Morgan, Pi-Chuan Chang

Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, CA 94305-9205
Department of Linguistics, Center for Spoken Language Research, Institute of Cognitive Science, University of Colorado at Boulder, 594 UCB, Boulder, Colorado 80309-0594

Abstract

Identification of action items in meeting recordings can provide immediate access to salient information in a medium notoriously difficult to search and summarize. To this end, we use a maximum entropy model to automatically detect action item-related utterances from multi-party audio meeting recordings. We compare the effect of lexical, temporal, syntactic, semantic, and prosodic features on system performance. We show that on a corpus of action item annotations on the ICSI meeting recordings, characterized by high imbalance and low inter-annotator agreement, the system performs at an F measure of 31.92%. While this is low compared to better-studied tasks on more mature corpora, the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.

1 Introduction

Meetings are a ubiquitous feature of workplace environments, and recordings of meetings provide obvious benefit in that they can be replayed or searched through at a later date. As recording technology becomes more easily available and storage space becomes less costly, the feasibility of producing and storing these recordings increases. This is particularly true for audio recordings, which are cheaper to produce and store than full audio-video recordings.

However, audio recordings are notoriously difficult to search or to summarize. This is doubly true of multi-party recordings, which, in addition to the difficulties presented by single-party recordings, typically contain backchannels, elaborations, and side topics, all of which further confound search and summarization processes. Making efficient use of large meeting corpora thus requires intelligent summary and review techniques.

One possible user goal given a corpus of meeting recordings is to discover the action items decided within the meetings. Action items are decisions made within the meeting that require post-meeting attention or labor. Rapid identification of action items can provide immediate access to salient portions of the meetings. A review of action items can also function as (part of) a summary of the meeting content.

To this end, we explore applying maximum entropy classifiers to the task of automatically detecting action item utterances in audio recordings of multi-party meetings. Although available corpora for action items are not ideal, it is hoped that the feature analysis presented here will be of use to later work on other corpora.

2 Related work

Multi-party meetings have attracted a significant amount of recent research attention. The creation of the ICSI corpus (Janin et al., 2003), comprising 72 hours of meeting recordings with an average of 6 speakers per meeting, with associated transcripts, has spurred further annotations for various types of information, including dialog acts (Shriberg et al., 2004), topic hierarchies and action items (Gruenstein et al., 2005), and “hot spots” (Wrede and Shriberg, 2003).

The classification of individual utterances based on their role in the dialog, i.e. as opposed to their semantic payload, has a long history, especially in the context of dialog act (DA) classification.
Research on DA classification initially focused on two-party conversational speech (Mast et al., 1996; Stolcke et al., 1998; Shriberg et al., 1998) and, more recently, has extended to multi-party audio recordings like the ICSI corpus (Shriberg et al., 2004). Machine learning techniques such as graphical models (Ji and Bilmes, 2005), maximum entropy models (Ang et al., 2005), and hidden Markov models (Zimmermann et al., 2005) have been used to classify utterances from multi-party conversations.

It is only more recently that work focused specifically on action items themselves has been developed. SVMs have been successfully applied to the task of extracting action items from email messages (Bennett and Carbonell, 2005; Corston-Oliver et al., 2004). Bennett and Carbonell, in particular, distinguish the task of action item detection in email from the more well-studied task of text classification, noting the finer granularity of the action item task and the difference of semantics vs. intent. (Although recent work has begun to blur this latter division, e.g. Cohen et al. (2004).)

In the audio domain, annotations for action item utterances on several recorded meeting corpora, including the ICSI corpus, have recently become available (Gruenstein et al., 2005), enabling work on this topic.

3 Data

We use action item annotations produced by Gruenstein et al. (2005). This corpus provides topic hierarchy and action item annotations for the ICSI meeting corpus as well as other corpora of meetings; due to the ready availability of other types of annotations for the ICSI corpus, we focus solely on the annotations for these meetings. Figure 1 gives an example of the annotations.

[Figure 1 transcript excerpt, in reading order; in the original figure each utterance carries an “X” or “-” mark in the columns A1 and A2.]
  So that will be sort of the assignment for next week, is to-- to-- for slides and whatever net you picked and what it can do and-- and how far you’ve gotten. Pppt!
  Well, I’d like to also, though, uh, ha- have a first cut at what the belief-net looks like.
  Even if it’s really crude.
  OK? So, you know,
  here a- here are--
  So we’re supposed to @@ about features and whatnot, and--

Figure 1: Example transcript and action item annotations (marked “X”) from annotators A1 and A2. “@@” signifies an unintelligible word. This transcript is from an ICSI meeting recording and has κ = 0.373, ranking it 16th out of 54 meetings in annotator agreement.

The corpus covers 54 ICSI meetings annotated by two human annotators, and several other meetings annotated by one annotator. Of the 54 meetings with dual annotations, 6 contain no action items. For this study we consider only those meetings which contain action items and which are annotated by both annotators.

As the annotations were produced by a small number of untrained annotators, an immediate question is the degree of consistency and reliability. Inter-annotator agreement is typically measured by the kappa statistic (Carletta, 1996), defined as:

$$\kappa = \frac{P(O) - P(E)}{1 - P(E)}$$

where $P(O)$ is the probability of the observed agreement, and $P(E)$ the probability of the “expected agreement” (i.e., under the assumption that the two sets of annotations are independent). The kappa statistic ranges from −1 to 1, indicating perfect disagreement and perfect agreement, respectively.

Overall inter-annotator agreement as measured by κ on the action item corpus is poor, as noted in Purver et al. (2006), with an overall κ of 0.364 and values for individual meetings ranging from 1.0 to less than zero. Figure 2 shows the distribution of κ across all 54 annotated ICSI meetings.

[Figure 2: histogram; x-axis: kappa (0.0 to 1.0), y-axis: frequency.]

Figure 2: Distribution of κ (inter-annotator agreement) across the 54 ICSI meetings tagged by two annotators. Of the two meetings with κ = 1.0, one has only two action items and the other only four.

To reduce the effect of poor inter-annotator agreement, we focus on the top 15 meetings as ranked by κ; the minimum κ in this set is 0.435. Although this reduces the total amount of data available, our intention is that this subset of the most consistent annotations will form a higher-quality corpus.

While the corpus classifies related action item utterances into action item “groups,” in this study we wish to treat the annotations as simply binary attributes. Visual analysis of annotations for several meetings outside the set of chosen 15 suggests that the union of the two sets of annotations yields the most consistent resulting annotation; thus, for this study, we consider an utterance to be an action item if at least one of the annotators marked it as such.

The 15-meeting subset contains 24,250 utterances total; under the union strategy above, 590 of these are action item utterances. Figure 3 shows the number of action item utterances and the number of total utterances in the 15 selected meetings.

[Figure 3: per-meeting counts of total and action item utterances; count axis from 0 to 2500.]

Figure 3: Number of total and action item utterances across the 15 selected meetings. There are 24,250 utterances total, 590 of which (2.4%) are action item utterances.

One noteworthy feature of the ICSI corpus underlying the action item annotations is the “digit reading task,” in which the participants of meetings take turns reading aloud strings of digits. This task was designed to provide a constrained-vocabulary training set for speech recognition developers interested in multi-party speech. In this study we did not remove these sections; the net effect is that some portions of the data consist of these fairly atypical utterances.

4 Experimental methodology

We formulate the action item detection task as one of binary classification of utterances. We apply a maximum entropy (maxent) model (Berger et al., 1996) to this task.

Maxent models seek to maximize the conditional probability of a class $c$ given the observations $X$ using the exponential form

$$P(c \mid X) = \frac{1}{Z(X)} \exp\left[ \sum_i \lambda_{i,c} f_{i,c}(X) \right]$$

where $f_{i,c}(X)$ is the $i$th feature of the data $X$ in class $c$, $\lambda_{i,c}$ is the corresponding weight, and $Z(X)$ is a normalization term. Maxent models choose the weights $\lambda_{i,c}$ so as to maximize the entropy of the induced distribution while remaining consistent with the data and labels; the intuition is that such a distribution makes the fewest assumptions about the underlying data.

Our maxent model is regularized by a quadratic prior and uses quasi-Newton parameter optimization. Due to the limited amount of training data (see Section 3) and to avoid overfitting, we employ 10-fold cross validation in each experiment.

To evaluate system performance, we calculate the F measure ($F$) of precision ($P$) and recall ($R$), defined as:

$$P = \frac{|A \cap C|}{|A|}, \qquad R = \frac{|A \cap C|}{|C|}, \qquad F = \frac{2PR}{P + R}$$

where $A$ is the set of utterances marked as action items by the system, and $C$ is the set of (all) correct action item utterances.
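As an illustration of the evaluation measures just defined, the following is a minimal sketch that computes precision, recall, and F measure from two sets of utterance identifiers. The function name `precision_recall_f` and the integer utterance ids are hypothetical, chosen only for this example; they are not part of the system described in this paper. The convention of returning zero when a denominator is empty is also an assumption of the sketch.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f(system: Set[Hashable],
                       correct: Set[Hashable]) -> Tuple[float, float, float]:
    """Return (P, R, F) for the set A of system-marked utterances (`system`)
    and the set C of correct action item utterances (`correct`)."""
    overlap = len(system & correct)                       # |A intersect C|
    # Assumption of this sketch: an empty denominator yields 0.0.
    p = overlap / len(system) if system else 0.0          # P = |A n C| / |A|
    r = overlap / len(correct) if correct else 0.0        # R = |A n C| / |C|
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0       # F = 2PR / (P + R)
    return p, r, f


# Hypothetical utterance ids: the system marks four utterances,
# two of which are among the six correct action item utterances.
system_marked = {1, 2, 3, 4}
truly_action = {3, 4, 5, 6, 7, 8}
p, r, f = precision_recall_f(system_marked, truly_action)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Representing both the system output and the gold annotations as sets mirrors the definitions of A and C directly, so the intersection is the only quantity that needs to be counted.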
The use of precision and recall is motivated by the fact that the large imbalance between positive and negative examples in the corpus (Section 3) means that simpler metrics like accuracy are insufficient: a system that simply classifies every utterance as negative will achieve an accuracy of 97.5%, which clearly is not a good reflection of desired behavior. Recall and F measure for such a system, however, will be zero.

Likewise, a system that flips a coin weighted in proportion to the number of positive examples in the entire corpus will have an accuracy of 95.25%, but will only achieve P = R = F = 2.4%.

5 Features

As noted in Section 3, we treat the task of producing action item annotations as a binary classification task. To this end, we consider the following sets of features. (Note that all real-valued features were range-normalized so as to lie in [0, 1] and that no binning was employed.)

5.1 Immediate lexical features

We extract word unigram and bigram features from the transcript for each utterance. We normalize for case and for certain contractions; for example, “I’ll” is transformed into “I will”.

Note that these are oracle features, as the transcripts are human-produced and not the product of automatic speech recognizer (ASR) system output.

5.2 Contextual lexical features

We extract word unigram and bigram features from the transcript for the previous and next utterances across all speakers in the meeting.

5.3 Syntactic features

Under the hypothesis that action item utterances will exhibit particular syntactic patterns, we use a conditional Markov model part-of-speech (POS) tagger (Toutanova and Manning, 2000) trained on the Switchboard corpus (Godfrey et al., 1992) to tag utterance words for part of speech. We use the following binary POS features:

• Presence of UH tag, denoting the presence of an “interjection” (including filled pauses, unfilled pauses, and discourse markers).
• Presence of MD tag, denoting presence of a modal verb.
• Number of NN* tags, denoting the number of nouns.
• Number of VB* tags, denoting the number of verbs.
• Presence of VBD tag, denoting the presence of a past-tense verb.

5.4 Prosodic features

Under the hypothesis that action item utterances will exhibit particular prosodic behavior (for example, that they are emphasized, or are pitched a certain way), we performed pitch extraction using an auto-correlation method within the sound analysis package Praat (Boersma and Weenink, 2005). From the meeting audio files we extract the following prosodic features, on a per-utterance basis (pitch measures are in Hz; intensity in energy; normalization in all cases is z-normalization):

• Pitch and intensity range, minimum, and maximum.
• Pitch and intensity mean.
• Pitch and intensity median (0.5 quantile).
• Pitch and intensity standard deviation.
• Pitch slope, processed to eliminate halving/doubling.
• Number of voiced frames.
• Duration-normalized pitch and intensity ranges and voiced frame count.
• Speaker-normalized pitch and intensity means.

5.5 Temporal features

Under the hypothesis that the length of an utterance or its location within the meeting as a whole will determine its likelihood of being an action item (for example, shorter statements near the end of the meeting might be more likely to be action items), we extract the duration of each utterance and the time from its occurrence until the end of the meeting. (Note that the use of this feature precludes operating in an online setting, where the end of the meeting may not be known in advance.)

5.6 General semantic features

Under the hypothesis that action item utterances will frequently involve temporal expressions, e.g. “Let’s have the paper written by next Tuesday”, we use Identifinder (Bikel et al., 1997) to mark temporal expressions (“TIMEX” tags) in utterance transcripts, and create a binary feature denoting
the existence of a temporal expression in each utterance.

Note that as Identifinder was trained on broadcast news corpora, applying it to the very different domain of multi-party meeting transcripts may not result in optimal behavior.

5.7 Dialog-specific semantic features

Under the hypothesis that action item utterances may be closely correlated with specific dialog act tags, we use the dialog act annotations from the ICSI Meeting Recorder Dialog Act Corpus (Shriberg et al., 2004). As these DA annotations do not correspond one-to-one with utterances in the ICSI corpus, we align them in the most liberal way possible, i.e., if at least one word in an utterance is annotated for a particular DA, we mark the entirety of that utterance as exhibiting that DA.

We consider both fine-grained and coarse-grained dialog acts [1]. The former yields 56 features, indicating occurrence of DA tags such as “appreciation,” “rhetorical question,” and “task management”; the latter consists of only 7 classes: “disruption,” “backchannel,” “filler,” “statement,” “question,” “unlabeled,” and “unknown.”

[1] We use the map 01 grouping defined in the MRDA corpus to collapse the tags.

6 Results

The final performance for the maxent model across different feature sets is given in Table 1. F measure scores range from 13.81 to 31.92. Figure 4 shows the interpolated precision-recall curves for several of these feature sets; these graphs display the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.

[Figure 4: precision (y-axis) against recall (x-axis), both from 0.0 to 1.0, with curves for the feature sets unigram, bigram, temporal, context+prosodic, and fine-grained DAs.]

Figure 4: Interpolated precision-recall curve for several (cumulative) feature sets. This graph suggests the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.

Although ideally all combinations of features should be evaluated separately, the large number of features precludes this strategy. The combination of features explored here was chosen so as to start from simpler features and successively add more complex ones. We start with transcript features that are immediate and context-independent (“unigram”, “bigram”, “TIMEX”); then add transcript features that require context (“temporal”, “context”), then non-transcript (i.e. audio signal) features (“prosodic”), and finally add features that require both the transcript and the audio signal (“DA”).

In total, nine combinations of features were considered. In every case except that of syntactic and coarse-grained dialog act features, the additional features improved system performance and these features were used in succeeding experiments. Syntactic and coarse-grained DA features resulted in a drop in performance and were discarded from succeeding systems.

7 Analysis

The unigram and bigram features provide significant discriminative power. Tables 2 and 3 give the top features, as determined by weight, for the models trained only on these features. It is clear from Table 3 that the detailed end-of-utterance punctuation in the human-generated transcripts provides valuable discriminative power.

The performance gain from adding TIMEX tagging features is small and likely not statistically significant. Post-hoc analysis of the TIMEX tagging (Section 5.6) suggests that Identifinder tagging accuracy is quite plausible in general, but exhibits an unfortunate tendency to mark the digit-reading (see Section 3) portion of the meetings as temporal expressions. It is plausible that removing these utterances from the meetings would allow this feature a higher accuracy.

Based on the low feature weight assigned, utterance length appears to provide no significant value to the model. However, the time until the meeting is over ranks as the highest-weighted feature in the unigram+bigram+TIMEX+temporal feature set. This feature is thus responsible for the 39.25%
boost in F measure in the unigram+bigram+TIMEX+temporal row of Table 1.

The addition of part-of-speech tags actually decreases system performance. It is unclear why this is the case. It may be that the unigram and bigram features already adequately capture any distinctions these features make, or simply that these features are generally not useful for distinguishing action items.

Contextual features, on the other hand, improve system performance significantly. A post-hoc analysis of the action item annotations makes clear why: action items are often split across multiple utterances (e.g. as in Figure 1), only a portion of which contain lexical cues sufficient to distinguish them as such. Contextual features thus allow utterances immediately surrounding these “obvious” action items to be tagged as well.

features                                                     number    F       % imp.
unigram                                                      6844      13.81
unigram+bigram                                               61281     16.72   21.07
unigram+bigram+TIMEX                                         61284     16.84   0.72
unigram+bigram+TIMEX+temporal                                61286     23.45   39.25
unigram+bigram+TIMEX+temporal+syntactic *                    61291     21.94   -6.44
unigram+bigram+TIMEX+temporal+context                        183833    25.62   9.25
unigram+bigram+TIMEX+temporal+context+prosodic               183871    27.44   7.10
unigram+bigram+TIMEX+temporal+context+prosodic+coarse DAs *  183878    26.47   -3.53
unigram+bigram+TIMEX+temporal+context+prosodic+fine DAs      183927    31.92   16.33

Table 1: Performance of the maxent classifier as measured by F measure, the relative improvement from the preceding feature set, and the number of features, across all feature sets tried. Rows marked * denote the addition of features which do not improve performance; these are omitted from succeeding systems.

feature      +/-   λ
“pull”       +     2.2100
“email”      +     1.7883
“needs”      +     1.7212
“added”      +     1.6613
“mm-hmm”     -     1.5937
“present”    +     1.5740
“nine”       -     1.5019
“!”          -     1.5001
“five”       -     1.4944
“together”   +     1.4882

Table 2: Features, evidence type (positive denotes action item), and weight for the top ten features in the unigram-only model. “Nine” and “five” are common words in the digit-reading task (see Section 3).

feature      +/-   λ
“- $”        -     1.4308
“i will”     +     1.4128
“, $”        -     1.3115
“uh $”       -     1.2752
“w- $”       -     1.2419
“. $”        -     1.2247
“email”      +     1.2062
“six $”      -     1.1874
“* in”       -     1.1833
“so $”       -     1.1819

Table 3: Features, evidence type and weight for the top ten features in the unigram+bigram model. The symbol * denotes the beginning of an utterance and $ the end. All of the top ten features are bigrams except for the unigram “email”.

feature                      +/-   λ
mean intensity (norm.)       -     1.4288
mean pitch (norm.)           -     1.0661
intensity range              +     1.0510
“i will”                     +     0.8657
“email”                      +     0.8113
reformulate/summarize (DA)   +     0.7946
“just go” (next)             +     0.7190
“i will” (prev.)             +     0.7074
“the paper”                  +     0.6788
understanding check (DA)     +     0.6547

Table 4: Features, evidence type and weight for the top ten features on the best-performing model. Bigrams labeled “prev.” and “next” correspond to the lexemes from previous and next utterances, respectively. Prosodic features labeled as “norm.” have been normalized on a per-speaker basis.
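The “% imp.” column of Table 1 appears to be the relative change in F measure with respect to the most recent feature set that was retained, so that the discarded syntactic and coarse-grained DA rows are skipped when choosing the comparison baseline. The following is a small sketch of that arithmetic under this reading. The F values are copied from Table 1; the shortened row labels, the `kept` flag, and the function name `relative_improvements` are hypothetical and exist only for this illustration.

```python
# (row label, F measure from Table 1, kept for succeeding systems?)
# Labels are abbreviated; each row adds one feature group to the last kept row.
RESULTS = [
    ("unigram", 13.81, True),
    ("+bigram", 16.72, True),
    ("+TIMEX", 16.84, True),
    ("+temporal", 23.45, True),
    ("+syntactic", 21.94, False),   # discarded: performance dropped
    ("+context", 25.62, True),
    ("+prosodic", 27.44, True),
    ("+coarse DAs", 26.47, False),  # discarded: performance dropped
    ("+fine DAs", 31.92, True),
]


def relative_improvements(results):
    """Percentage change in F relative to the last retained feature set."""
    improvements = []
    baseline = None
    for label, f_measure, kept in results:
        if baseline is not None:
            change = 100.0 * (f_measure - baseline) / baseline
            improvements.append((label, round(change, 2)))
        if kept:
            # Only retained feature sets become the baseline for later rows.
            baseline = f_measure
    return improvements


for label, change in relative_improvements(RESULTS):
    print(f"{label:12s} {change:+.2f}%")
```

Under this reading the temporal row is compared against the TIMEX row (16.84), and the context row against the temporal row (23.45) rather than the discarded syntactic row.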
Prosodic features yield a 7.10% increase in F measure, and analysis shows that speaker-normalized intensity and pitch, and the range in intensity of an utterance, are valuable discriminative features. The subsequent addition of coarse-grained dialog act tags does not further improve system performance. It is likely this is due to reasons similar to those for POS tags: either the categories are insufficient to distinguish action item utterances, or whatever usefulness they provide is subsumed by other features.

Table 4 shows the feature weights for the top-ranked features on the best-scoring system. The addition of the fine-grained DA tags results in a significant increase in performance. The F measure of this best feature set is 31.92%.

8 Conclusions

We have shown that several classes of features are useful for the task of action item annotation from multi-party meeting corpora. Simple lexical features, their contextual versions, the time until the end of the meeting, prosodic features, and fine-grained dialog acts each contribute significant increases in system performance.

While the raw system performance numbers of Table 1 are low relative to other, better-studied tasks on other, more mature corpora, we believe the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.

The Gruenstein et al. (2005) corpus provides a valuable and necessary resource for research in this area, but several factors raise the question of annotation quality. The low κ scores in Section 3 are indicative of annotation problems. Post-hoc error analysis yields many examples of utterances which are somewhat difficult to imagine as possible, never mind desirable, to tag. The fact that the extremely useful oracular information present in the fine-grained DA annotation does not raise performance to the high levels that one might expect further suggests that the annotations are not ideal, or, at the least, that they are inconsistent with the DA annotations [2].

[2] Which is not to say they are devoid of significant value: training and testing our best system on the corpus with the 590 positive classifications randomly shuffled across all utterances yields an F measure of only 4.82.

This analysis is consistent with the findings of Purver et al. (2006), who achieve an F measure of less than 25% when applying SVMs to the classification task on the same corpus, and motivate the development of a new corpus of action item annotations.

9 Future work

In Section 6 we showed that contextual lexical features are useful for the task of action item detection, at least in the fairly limited manner employed in our implementation, which simply looks at immediate previous and immediate next utterances. It seems likely that applying a sequence model such as an HMM or conditional random field (CRF) will act as a generalization of this feature and may further improve performance.

Addition of features such as speaker change and “hot spots” (Wrede and Shriberg, 2003) may also aid classification. Conversely, it is possible that feature selection techniques may improve performance by helping to eliminate poor-quality features. In this work we have followed an “everything but the kitchen sink” approach, in part because we were curious about which features would prove useful. The effect of adding POS and coarse-grained DA features illustrates that this is not necessarily the ideal strategy in terms of ultimate system performance.

In general, the features evaluated in this work are an indiscriminate mix of human- and automatically-generated features; of the human-generated features, some are plausible to generate automatically, at some loss of quality (e.g. transcripts) while others are unlikely to be automatically generated in the foreseeable future (e.g. fine-grained dialog acts). Future work may focus on the effects that automatic generation of the former has on overall system performance (although this may require higher-quality annotations to be useful). For example, the detailed end-of-utterance punctuation present in the human transcripts provides valuable discriminative power (Table 3), but current ASR systems are not likely to be able to provide this level of detail. Switching to ASR output will have a negative effect on performance.

One final issue is that of utterance segmentation. The scheme used in the ICSI meeting corpus does not necessarily correspond to the ideal segmentation for other tasks. The action item annotations were performed on these segmentations, and in this study we did not attempt resegmentation, but in the future it may prove valuable to collapse,
for example, successive un-interrupted utterances from the same speaker into a single utterance.

In conclusion, while overall system performance does not approach levels typical of better-studied classification tasks such as named-entity recognition, we believe that this is largely a product of the current action item annotation quality. We believe that the feature analysis presented here is useful, for this task and for other related tasks, and that, provided with a set of more consistent action item annotations, the current system can be used as is to achieve better performance.

Acknowledgments

The authors wish to thank Dan Jurafsky, Chris Manning, Stanley Peters, Matthew Purver, and several anonymous reviewers for valuable advice and comments.

References

Jeremy Ang, Yang Liu, and Elizabeth Shriberg. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In Proceedings of the ICASSP.

Paul N. Bennett and Jaime Carbonell. 2005. Detecting action-items in e-mail. In Proceedings of SIGIR.

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Conference on Applied NLP.

Paul Boersma and David Weenink. 2005. Praat: doing phonetics by computer v4.4.12 (computer program).

J. Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.

William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to classify email into “speech acts”. In Proceedings of EMNLP.

Simon Corston-Oliver, Eric Ringger, Michael Gamon, and Richard Campbell. 2004. Task-focused summarization of email. In Text Summarization Branches Out: Proceedings of the ACL Workshop.

J. Godfrey, E. Holliman, and J. McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP.

Alexander Gruenstein, John Niekrasz, and Matthew Purver. 2005. Meeting structure annotation: Data and tools. In Proceedings of the 6th SIGDIAL Workshop on Discourse and Dialogue.

Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. 2003. The ICSI meeting corpus. In Proceedings of the ICASSP.

Gang Ji and Jeff Bilmes. 2005. Dialog act tagging using graphical models. In Proceedings of the ICASSP.

Marion Mast, R. Kompe, S. Harbeck, A. Kießling, H. Niemann, E. Nöth, E.G. Schukat-Talamazzini, and V. Warnke. 1996. Dialog act classification with the help of prosody. In Proceedings of the ICSLP.

Matthew Purver, Patrick Ehlen, and John Niekrasz. 2006. Detecting action items in multi-party meetings: Annotation and initial experiments. In Proceedings of the 3rd Joint Workshop on MLMI.

Elizabeth Shriberg, Rebecca Bates, Andreas Stolcke, Paul Taylor, Daniel Jurafsky, Klaus Ries, Noah Coccaro, Rachel Martin, Marie Meteer, and Carol Van Ess-Dykema. 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3–4):439–487.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGDIAL Workshop on Discourse and Dialogue.

Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Noah Coccaro, Daniel Jurafsky, Rachel Martin, Marie Meteer, Klaus Ries, Paul Taylor, and Carol Van Ess-Dykema. 1998. Dialog act modeling for conversational speech. In Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of EMNLP.

Britta Wrede and Elizabeth Shriberg. 2003. Spotting “hot spots” in meetings: Human judgments and prosodic cues. In Proceedings of the European Conference on Speech Communication and Technology.

Matthias Zimmermann, Yang Liu, Elizabeth Shriberg, and Andreas Stolcke. 2005. Toward joint segmentation and classification of dialog acts in multiparty meetings. In Proceedings of the 2nd Joint Workshop on MLMI.