Effects of Speech Clarity on Recognition Memory for Spoken Sentences

  • Kristin J. Van Engen,

    kvanengen@wustl.edu

    Affiliations: Department of Linguistics, University of Texas at Austin, Austin, Texas, United States of America; Department of Communication Sciences and Disorders, University of Texas at Austin, Austin, Texas, United States of America

  • Bharath Chandrasekaran,

    Affiliations: Department of Communication Sciences and Disorders, University of Texas at Austin, Austin, Texas, United States of America; Department of Psychology, Institute for Neuroscience, Center for Perceptual Systems, University of Texas at Austin, Austin, Texas, United States of America

  • Rajka Smiljanic

    Affiliation: Department of Linguistics, University of Texas at Austin, Austin, Texas, United States of America

Abstract

Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e., changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e., speech in noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

Introduction

Spoken language contains information both about the content of a message and about the speaker of that message. Content is composed of several levels of linguistic information: sounds (phonological information), word-forming units (morphological information), combinations of words into sentences (syntactic information), and the meanings of words and word combinations (semantic information). The same auditory signal conveying all of this linguistic information also carries a wealth of information about the speaker: social (e.g., regional or social dialect features), affective (e.g., whether the person is happy, sad, excited, fatigued, etc.), and personal (e.g., sex, age, as well as the size and shape of the vocal tract) [1], [2], [3], [4], [5], [6], [7], [8], [9].

Traditionally, the perception of linguistic content has been studied separately from the indexical properties of talkers. The emphasis in this line of work has been on how abstract linguistic units can be extracted from the immense variability in the speech signal. This abstractionist approach has been supported by a number of neuroscientific studies, which have shown that these two types of information are processed differently in the brain [10], [11], [12], [13], [14], [15]. For example, individuals with language deficits following a stroke do not show concomitant deficits in identifying speakers. Similarly, individuals with a neurological deficit that affects voice perception (phonagnosia) show normal language comprehension skills. The finding that indexical and lexical information are dissociable is consistent with abstractionist accounts.

In contrast to abstractionist models, episodic approaches to speech processing contend that linguistic and indexical information are encoded and stored together in memory. These approaches have also been supported by a number of behavioral and neural studies showing that linguistic and indexical information are functionally integrated during speech processing [16], [17], [18], [19], [20], [21], [22], [23]. These studies show that properties of a talker's voice affect the processing of linguistic content in an utterance. For example, the recognition of words presented in noise is enhanced when listeners are familiar with the talker relative to when the words are produced by an unfamiliar talker, an advantage that emerged whether testing occurred 5 minutes or a full week after exposure [18]. Similarly, recognition memory in a continuous list of words has been shown to be more robust for words repeated in the same voice relative to a new voice [22].

By showing that talker variability affects recognition memory for words, these studies demonstrate the importance of indexical information in the processing of linguistic information. However, the focus of such studies has been on variability across talkers. In contrast, very little is known about the effects of speaking style changes by an individual speaker on the encoding of speech in memory. Extensive previous research has shown that speakers are able to enhance the intelligibility of their speech when asked to speak as if they are communicating with someone who is having difficulty accessing or understanding linguistic information. This intelligibility-enhancing speaking style (“clear speech” hereafter) is characterized by a number of acoustic/articulatory adjustments, including a decrease in speaking rate (both in terms of added pauses and in terms of increased duration of phonetic segments), increased dynamic pitch range, increased amplitude, more salient stop consonant releases, greater intensity of non-silent portions of consonants such as bursts and frication, and increased energy in the 1000–3000 Hz frequency range [24], [25], [26], [27], [28], [29], [30], [31], [32] (for a review, see [33]). In addition, it has been demonstrated that the distinctiveness of language-specific phonological vowel and consonant contrasts as well as of prosodic properties is enhanced in clear speech [25], [32], [34], [35], [36], [37], [38]. Together, these conversational-to-clear speech adjustments increase intelligibility, albeit to different degrees, for a wide range of listener populations, including normal hearing listeners [39], listeners with hearing impairment [40], [41], elderly listeners [42], non-native speakers of the target language [35], and children with and without learning disabilities [24]. As far as we know, however, no study has examined the effect of this type of intelligibility variability on recognition memory for linguistic content. Given that speakers constantly modify their speech during everyday communication in response to changing communication demands, it is of interest to examine the extent to which such changes impact memory for sentences.

This investigation of the effects of speech signal clarity on the robustness of memory representations also contributes to ongoing discussions in the literature on speech processing by aging and/or hearing-impaired adults. The “effortfulness hypothesis” [43], introduced by Rabbitt [44], [45], argues that perceptual processing in adverse listening situations may come at the cost of attentional resources that would otherwise be available for memory encoding [43], [46], [47], [48], [49], [50], [51], [52]. McCoy et al. (2005), for example, investigated recall of the final three words in a running word memory task by older adults with good hearing and poor hearing. All listeners were able to recall the final word with extremely high accuracy, indicating that they were all able to correctly perceive each word as it was presented. However, the adults with poor hearing recalled significantly fewer of the non-final words in word lists that lacked contextual constraint as opposed to word lists with high contextual constraint (i.e., where target words were predictable from the two prior words). The authors argue that greater contextual constraint may have facilitated target word recognition by increasing the likelihood of the target words, by decreasing the number of potential word candidates, and by aiding retrospective recognition of words that were unclear. Any of these mechanisms, they suggest, might “reduce the perceptual burden on listeners' processing resources” and thereby aid recall.

In the present study, all listeners had normal hearing and the speech targets were not physically distorted or degraded, but their intelligibility was varied along the real-world dimension of within-talker speaking style changes. The effortfulness hypothesis leads to the prediction that greater attentional resources will be available for encoding the easier-to-perceive (i.e., clear) sentences in memory, leading to better recognition memory for clear speech versus conversational speech.

Specifically, this study investigated the extent to which changes in speaking style aimed at enhancing intelligibility affect memory for spoken language information. We tested such effects across two types of sentences: semantically anomalous and semantically normal (i.e., meaningful) sentences. Meaningful sentences presumably require less processing effort than anomalous sentences, and were therefore predicted to aid recognition memory and possibly modulate the effect of speaking style on recognition memory. Experiment 1 tested the intelligibility of all four sentence conditions (anomalous and meaningful sentences, each produced in conversational and clear speaking styles) as produced by a female native speaker of English. These sentences were presented to normal-hearing, young adult listeners in the presence of speech-shaped noise (i.e., white noise filtered so that its spectrum matches the long-term average spectrum of speech). The listening-in-noise paradigm was employed to avoid ceiling performance and to make the task difficult enough to reveal intelligibility differences between the two speaking styles. Listeners were asked to transcribe each sentence to the best of their ability. In Experiment 2, the sentences were presented in quiet to new listeners in a recognition memory experiment. For this task, listeners were exposed to a subset of conversational and clear sentences (40 total) and then tested on the full set (80 total), responding “old” (i.e., from the exposure set) or “new” to each item. We predicted that conditions in which perceptual effort is reduced, whether through acoustic-phonetic enhancements associated with clear speech or through the presence of semantic contextual information, would enhance recognition memory. Thus, the overall aim of these experiments was to investigate the extent to which within-talker variation in intelligibility affects the encoding of speech signals in memory. Our results indicate that such speaking style adjustments indeed improve sentence intelligibility in noise (Experiment 1) and, in turn, enhance encoding in memory (Experiment 2). Thus, similar to the talker voice advantage, within-talker intelligibility modifications lead to more robust recognition memory for sentences.

Methods and Results

Ethics statement

All research protocols presented in this manuscript were approved by the Institutional Review Board at the University of Texas at Austin (approval #2010-11-0142).

Experiment 1: Intelligibility of clear and conversational sentences

Participants.

Eighteen participants between the ages of 18 and 25 took part as listeners in Experiment 1. All participants were students at the University of Texas who were recruited via word of mouth or flyers posted on campus. All participants reported normal speech and hearing and were native, monolingual speakers of American English (i.e., they were born and raised in monolingual English households and local communities in which English is the primary language spoken, as reported in detailed background questionnaires). Potential participants who had significant exposure to another language before age 12 were not included. Participants provided written informed consent and were either paid or received course credit for their participation.

Stimuli.

A 26-year-old female speaker of American English was recorded producing two sets of sentences: 1) the semantically anomalous sentences from the Syntactically Normal Sentence Test (SNST) [53] (e.g., The wrong shot led the farm.) and 2) semantically normal (i.e., meaningful) sentences generated by modifying sentences from the Basic English Lexicon (BEL) sentence materials [54] in order to closely match the SNST sentences in terms of syntax, length, and amount of keyword repetition within the set (e.g., The grey mouse ate the cheese). All sentences were produced in both clear and conversational speaking styles and contained four keywords each for intelligibility scoring. Recording took place in a sound-attenuated booth where sentences were presented to the speaker one at a time on a computer monitor. Following previous research [32], the two speaking styles were elicited with the following instructions: for conversational recordings, the speaker was asked to speak in a normal, conversational style, as if she was talking to someone familiar with her voice and speech patterns; for the clear speech recordings, the speaker was prompted to speak as though the listener was having a hard time understanding her, whether due to hearing difficulty or because the listener was a non-native speaker of English. Recordings were made using a Shure SM10A head-mounted microphone and a Marantz solid-state recorder (PMD670). Individual sentences were segmented from the long recording and equalized for RMS amplitude using Praat [55]. In order to verify that speaking style changes were implemented by the talker, the following acoustic measures were performed on all sentences that were used in the listening tests: duration, F0 range, mean F0, and average energy in the 1–3 kHz region.
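
The study used Praat for segmentation and RMS equalization; purely as an illustration, the equalization step could be sketched in R, assuming the tuneR package, mono WAV files, and a hypothetical target level:

library(tuneR)

rms <- function(x) sqrt(mean(x^2))

# Scale a sentence recording so its RMS amplitude matches a common target
# (target_rms is a hypothetical value for 16-bit audio; no clipping check here).
equalize_rms <- function(infile, outfile, target_rms = 2000) {
  w <- readWave(infile)
  samples <- as.numeric(w@left)                    # mono recording assumed
  scaled <- samples * (target_rms / rms(samples))
  out <- Wave(left = round(scaled), samp.rate = w@samp.rate, bit = w@bit)
  writeWave(out, outfile)
}

equalize_rms("sentence01.wav", "sentence01_eq.wav")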

Forty sentences in each speaking style from each set were presented to listeners for assessment of intelligibility. Speech-shaped noise (SSN) was created for each sentence set (anomalous sentences in conversational speech; anomalous sentences in clear speech; meaningful sentences in conversational speech; meaningful sentences in clear speech) by filtering white noise to the long-term average spectrum of the full set of sentences. This approach was used to take into account any spectral differences across the sentence types and ensure that masking was consistent across the types.
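
For concreteness, this shaping step might be sketched in base R as follows (our illustration, not the authors' exact procedure; frame-wise filtering without overlap is a simplification, and names are hypothetical):

# speech: numeric vector holding all sentences of one set, concatenated
make_ssn <- function(speech, n_out, nfft = 1024) {
  # Long-term average magnitude spectrum over nfft-sample frames
  nframes <- floor(length(speech) / nfft)
  ltas <- rowMeans(sapply(seq_len(nframes), function(i) {
    Mod(fft(speech[((i - 1) * nfft + 1):(i * nfft)]))
  }))
  # Impose the LTAS magnitude on white noise, keeping the noise's random phase
  noise <- rnorm(n_out)
  out <- numeric(0)
  for (start in seq(1, n_out - nfft + 1, by = nfft)) {
    spec <- fft(noise[start:(start + nfft - 1)])
    shaped <- Re(fft(spec / Mod(spec) * ltas, inverse = TRUE)) / nfft
    out <- c(out, shaped)
  }
  out / max(abs(out))   # scale to +/-1; output may be up to nfft-1 samples short
}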

Procedure.

Participants first completed questionnaires about their language background. They were then seated in a sound-attenuated booth where they wore Sennheiser HD570 or Sony MDR-CD780 headphones. Instructions and stimuli were presented with E-Prime [56]. In order to assess the relative intelligibility of clear and conversational speech produced by the speaker, each sentence was mixed with speech-shaped noise at a signal-to-noise ratio of 0 dB and then played to the participants, who were asked to transcribe as much of each sentence as they were able to understand. Each sentence was scored by the number of keywords correctly identified (4 per sentence) for a total of 160 keywords per sentence type. In order to be considered correct, no morphemes could be added to or deleted from the keywords, but homophones were accepted as a correct response. Listeners (nine per condition) heard a fully randomized set of either 80 semantically anomalous sentences (40 per speaking style) or 80 meaningful sentences (40 per speaking style). All stimuli were presented only once.
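
Mixing at a 0 dB signal-to-noise ratio amounts to scaling the noise so that its RMS amplitude equals that of the sentence before summing the two signals. A minimal sketch, reusing the rms() helper defined above (the study does not specify its mixing code; this is an assumption about the standard procedure):

# Mix a sentence with noise at a given SNR in dB (0 dB = equal RMS).
mix_at_snr <- function(speech, noise, snr_db = 0) {
  noise <- noise[seq_along(speech)]                # trim noise to sentence length
  speech + noise * (rms(speech) / rms(noise)) * 10^(-snr_db / 20)
}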

Results.

Samples of both sentence types and speaking styles are shown in Figure 1, and average acoustic measures for each sentence set are given in Table 1. Paired t-tests confirmed that, for both sentence sets (anomalous and meaningful), clearly produced sentences had significantly longer durations than their conversational counterparts. Clear sentences also had higher mean F0s (p<0.001 for both sentence sets) and larger F0 ranges (p<0.001 for both sentence sets). In the meaningful sentences, furthermore, clear speech was characterized by significantly greater energy in the 1–3 kHz range (p = .002); this trend was present but not significant for the anomalous sentences (p = .17). The analyses thus confirmed that the conversational and clear speech sentences differed in their acoustic-articulatory characteristics along the dimensions typically found in listener-oriented speaking style adaptations.
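
These comparisons are standard paired t-tests over sentences; for instance, in R (data frame and column names hypothetical, one row per sentence with each measure in both styles):

t.test(acoust$dur_clear, acoust$dur_conv, paired = TRUE)         # duration
t.test(acoust$f0mean_clear, acoust$f0mean_conv, paired = TRUE)   # mean F0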

Figure 1. Waveforms and spectrograms of one meaningful sentence (top panels) and one anomalous sentence (bottom panels), each produced in both conversational (left panels) and clear (right panels) speaking styles.

Each panel represents 2.5 seconds.

https://doi.org/10.1371/journal.pone.0043753.g001

Table 1. Acoustic measures of sentence materials by speaking style and material type.

https://doi.org/10.1371/journal.pone.0043753.t001

The results of the intelligibility test are shown in Figure 2. For semantically anomalous sentences, listeners identified 69% of the keywords in conversational speech and 84% of the keywords in clear speech. For meaningful sentences, they identified 79% of the keywords in conversational speech and 95% of the keywords in clear speech. The intelligibility data were analyzed with a linear mixed effects logistic regression where keyword identification (i.e., correct or incorrect) was the dichotomous dependent variable. Subjects and Items were included in the model as random factors, and Speaking Style, Semantic Content, and their interaction as fixed effects. Style was contrast coded (−.5, .5) such that negative beta values are associated with clear speech and positive beta values with conversational speech. Similarly, Content was contrast coded (−.5, .5) such that negative beta values are associated with semantically anomalous sentences and positive values with meaningful sentences. The analysis was performed using R [57]. The results of the regression are presented in Table 2.
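
A minimal sketch of such a model in R using the lme4 package (the paper specifies only that R [57] was used; the package choice, data layout, and column names here are assumptions), with a long-format data frame of one row per keyword:

library(lme4)

# Contrast coding as described above: negative = clear, negative = anomalous
dat$style   <- factor(dat$style,   levels = c("clear", "conversational"))
dat$content <- factor(dat$content, levels = c("anomalous", "meaningful"))
contrasts(dat$style)   <- c(-0.5, 0.5)
contrasts(dat$content) <- c(-0.5, 0.5)

# Logistic mixed model: correct (0/1) per keyword, random intercepts for
# subjects and items, fixed effects of style, content, and their interaction
m <- glmer(correct ~ style * content + (1 | subject) + (1 | item),
           data = dat, family = binomial)
summary(m)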

Figure 2. Average proportion of keywords identified from semantically anomalous and meaningful sentences produced in clear and conversational speaking styles.

Error bars represent standard error.

https://doi.org/10.1371/journal.pone.0043753.g002

Table 2. Results of the linear mixed effects logistic regression on intelligibility data for all sentences.

https://doi.org/10.1371/journal.pone.0043753.t002

The results show that the overall probability of correct keyword identification was significantly higher for meaningful versus anomalous sentences (p<0.001) and for clear versus conversational speech (p<0.001). Furthermore, there was a significant interaction between Speaking Style and Semantic Content (p = 0.001). The nature of this interaction was examined by performing mixed-effects logistic regressions on the meaningful and anomalous conditions separately. The results of these regressions are shown in Table 3 and Table 4. They revealed that, while speaking style was a highly significant predictor of correct keyword identification for both sentence types, the effect of style was greater (further from 0) for the meaningful sentences (β_anomalous = −0.99; β_meaningful = −1.86).

Table 3. Results of the linear mixed effects logistic regression on intelligibility data for anomalous sentences.

https://doi.org/10.1371/journal.pone.0043753.t003

Table 4. Results of the linear mixed effects logistic regression on intelligibility data for meaningful sentences.

https://doi.org/10.1371/journal.pone.0043753.t004

These results replicate previous studies that show that listener-oriented conversational-to-clear speech modifications enhance sentence intelligibility (see [33] for a review of the clear speech literature). Furthermore, the presence of semantic context significantly improved intelligibility overall, though listeners received a greater clear speech benefit for meaningful sentences than anomalous sentences. With these differences in intelligibility confirmed, Experiment 2 addresses the effects of such differences on sentence recognition memory.

Experiment 2: Recognition memory for clear and conversational speech

Participants.

Thirty-three young adults between the ages of 18 and 31 took part in Experiment 2, which tested recognition memory for semantically anomalous sentences (n = 18, ages 18–31) or meaningful sentences (n = 15, ages 18–23). All participants were students at the University of Texas who were recruited via word of mouth or flyers posted on campus. No participant reported a history of speech, language, or hearing problems. All participants were native, monolingual speakers of American English (see criteria in Experiment 1) and none of them had participated in Experiment 1. All participants passed a hearing-screening test (1000, 2000, and 4000 Hz at 25 dB). They provided written informed consent and were either paid for their participation or received course credit.

Stimuli.

The stimuli comprised a total of 160 semantically anomalous sentences or 160 meaningful sentences, presented without noise. In order to confirm that the subsets of sentences used as old and new items in the recognition memory task did not vary systematically in their intelligibility, the intelligibility data from Experiment 1 were further analyzed. Unpaired, 2-tailed t-tests were conducted to compare the intelligibility of the sentences that were to be used as new and old in the recognition memory experiment. These tests showed no significant difference between the intelligibility of old and new sentences.

Procedure.

Participants first completed language background questionnaires. They were then seated in a sound-attenuated booth facing a computer monitor and wearing headphones. Instructions and stimuli were presented with E-Prime [56], and listener responses were collected using a button box. During the exposure phase, listeners heard 40 unique sentences in random order and were instructed to try to commit them to memory. Twenty of the sentences were presented in conversational speech, and 20 in clear speech. Listeners heard each sentence only once, and sentences were separated by 500 ms of silence. At the end of the exposure phase, listeners were instructed that they would listen to another set of sentences and should indicate, using the button box, whether each sentence was new or old (i.e., from the exposure phase). All 40 of the exposure sentences were included, along with 40 new items (also half conversational and half clear). These 80 items were fully randomized for each participant and played only once each. At the end of the test phase, listeners were given the opportunity to take a break. They then completed the entire task a second time with 80 new sentences. This second block was included to verify that performance was consistent across different sets of items.

Results.

The recognition memory data were analyzed within a signal detection framework. To this end, d′ and C scores were computed for each participant to assess discrimination sensitivity and response bias. d′ is calculated by subtracting the z-transformed probability of false alarms (identifying a new item as old) from the z-transformed probability of hits (identifying an old item as old). These probabilities were first corrected to accommodate values of 0 and 1 in the d′ calculation [58]. Table 5 displays all uncorrected hit rates and false alarm rates as well as the calculated d′ and C scores. The average C scores across all conditions are positive, meaning participants were generally biased to respond “new” more often than “old.” This bias was stronger for speech produced in a clear style. The overall results of Experiment 2, presented as d′ scores, are shown in Figure 3.
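
For concreteness, the computation with the Snodgrass and Corwin [58] correction, which adds 0.5 to each response count and 1 to each trial count so that hit and false alarm rates of exactly 0 or 1 remain finite under the z-transform, can be sketched in R:

# Compute corrected d' and C from raw counts for one participant/condition.
sdt_scores <- function(hits, n_old, fas, n_new) {
  h  <- (hits + 0.5) / (n_old + 1)   # corrected hit rate
  fa <- (fas + 0.5) / (n_new + 1)    # corrected false alarm rate
  dprime <- qnorm(h) - qnorm(fa)     # sensitivity
  C <- -0.5 * (qnorm(h) + qnorm(fa)) # positive C = conservative ("new") bias
  list(dprime = dprime, C = C)
}

# Hypothetical example: 15 hits on 20 old items, 4 false alarms on 20 new items
sdt_scores(hits = 15, n_old = 20, fas = 4, n_new = 20)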

Figure 3. Average d′ scores in both testing blocks for semantically anomalous and meaningful sentences produced in clear and conversational speaking styles.

Error bars represent standard error.

https://doi.org/10.1371/journal.pone.0043753.g003

Table 5. Calculated hit rates, false alarm rates, d′, and C values for the recognition memory test.

https://doi.org/10.1371/journal.pone.0043753.t005

d′ scores were submitted to a repeated measures ANOVA with Speaking Style (conversational or clear) and Block (1st or 2nd) as within-subjects factors and Semantic Content (anomalous vs. meaningful) as a between-subjects factor. This analysis revealed main effects of Speaking Style (F(1,31) = 8.975, p = .005) and Semantic Content (F(1,31) = 13.489, p = .001), with better performance on semantically meaningful sentences and on sentences produced in a clear style. There was no significant effect of Block (first vs. second), and no significant interactions between Speaking Style, Semantic Content, and/or Block.
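
A sketch of this mixed-design ANOVA in R, assuming one d′ score per participant per Style × Block cell in a long-format data frame (column names hypothetical):

# Within-subjects: style, block; between-subjects: content
dp$subject <- factor(dp$subject)
dp$style   <- factor(dp$style)
dp$block   <- factor(dp$block)
dp$content <- factor(dp$content)

fit <- aov(dprime ~ style * block * content +
             Error(subject / (style * block)),
           data = dp)
summary(fit)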

Discussion

We examined the extent to which speaking style modifications facilitate recognition memory for spoken sentences. Experiment 1 evaluated the intelligibility of meaningful and semantically anomalous sentences spoken in clear and conversational styles. Experiment 2 examined listeners' recognition memory for these sentences. As predicted, acoustic-phonetic and semantic contextual enhancements resulted in better intelligibility, as evidenced by improved sentence recognition in noise (Experiment 1). Further, the intelligibility enhancement for clear speech was greater for meaningful sentences than for anomalous sentences. Importantly, the results demonstrated that clear speech sentences and meaningful sentences significantly improved recognition memory compared to conversational and semantically anomalous sentences (Experiment 2).

The results of Experiment 1 are consistent with previous studies showing that clear speech enhances intelligibility for listeners (see reviews in [33], [59]) and that semantic contextual information enhances the intelligibility of speech in noise [60], [61], [62], [63]. Furthermore, the enhancing effect of clear speech was significantly greater for meaningful sentences than for anomalous sentences, indicating that the two factors not only improve intelligibility independently but also mutually enhance one another's contributions. Semantic contextual information and a clear speaking style thus benefit intelligibility in a cumulative manner through the speech processing system (cf. [60]).

Most importantly, this study showed that, in addition to being more intelligible than conversational speech, clear speech also led to better performance on a recognition memory task. The observed differences in recognition memory cannot be attributed to differences in whether the sentences were recognized correctly, because all sentences in the memory experiment were presented in quiet, rendering them intelligible to listeners. Rather, speaking style changes that enhanced intelligibility (as shown in Experiment 1) contributed to enhanced recognition memory (Experiment 2). It is worth noting that the enhanced recognition memory for clear speech was manifested largely in a lower rate of false alarm responses (see Table 5). This pattern of results has been shown in other studies of recognition memory (e.g., [64], [65], [66]) and has been interpreted as evidence for differences in the availability of distinctive features in memory for different types of stimuli [64]. In the present case, a greater number of distinctive features may be available to listeners in memory for clear speech versus conversational speech. In particular, we suggest that the exaggerated acoustic-phonetic cues in clear speech enhance memory traces for sentences produced in that style.

To understand how these enhanced memory traces might result in lowered false alarm rates, imagine (for simplicity's sake) that a participant has a single distinctive feature in memory for a given conversational sentence (CO1) and five distinctive features in memory for a given clear sentence (CL1). If either sentence is presented as a target (old) item during the recognition task, the person has a good chance of recognizing it as old, since people can identify items as old with very few distinctive features. If another conversational sentence (CO2) is presented as a distractor (new), however, and it happens to have a feature that is very similar to the feature in memory for CO1, then the person is likely to produce a false alarm, since he or she has no other features in memory on which to base a rejection. In contrast, if another clear sentence (CL2) is presented as a distractor, it may have a feature very similar to one of the features in memory, but the person has four other features on which to base a correct rejection (see Lamont et al. [64] for a similar discussion). In this way, the false alarm rate can be higher for conversational sentences while the hit rates are similar across sentence types. The present data do not allow us to speculate whether this memory enhancement occurs at the segmental, suprasegmental, lexical, or semantic level (or, most likely, through interactions at various levels).
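
This account can be made concrete with a toy simulation (our illustration, not a model from the cited studies): suppose each studied sentence leaves k feature traces, any single trace spuriously matches a new sentence with probability q, and the listener responds “old” when more than half of the traces match.

# False alarm rate for a distractor under the toy feature-matching rule
simulate_fa <- function(k, q, n_trials = 1e5) {
  matches <- rbinom(n_trials, size = k, prob = q)  # spurious matches per trial
  mean(matches > k / 2)                            # proportion of "old" responses
}

set.seed(1)
simulate_fa(k = 1, q = 0.2)   # "conversational": one trace   -> FA near 0.20
simulate_fa(k = 5, q = 0.2)   # "clear": five traces          -> FA near 0.06

With q = 0.2, the simulated false alarm rate drops from about .20 for a single-trace item to about .06 for a five-trace item, mirroring the asymmetry described above while leaving hits (driven by many genuine matches) largely unaffected.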

The current results thus show that the beneficial effects of clear speech go beyond facilitating word identification and extend to downstream processes such as encoding in memory. It remains to be determined which particular features of clear speech underlie the observed improvements in recognition memory and whether these are the same features that contribute to enhancements in intelligibility. The acoustic analysis of the clear and conversational speech produced for this study showed several typical differences between the two styles: clear speech had longer duration, higher average F0 (corresponding to pitch), and greater energy in the 1–3 kHz range. It is important to note, however, that the exact articulatory-acoustic cues that contribute to the clear speech advantage remain rather elusive [67], [68], [69]. Research thus continues to focus on identifying the relevant acoustic-phonetic features of clear speech and establishing their impact on intelligibility and recognition memory.

In addition to providing new evidence for the beneficial effects of clear speech on speech processing, this study extends previous work on the effects of speech signal variability on recognition memory. Specifically, where previous studies have shown that across-talker variability has significant effects on recognition memory for speech [18], [22], the present study shows that within-talker speaking style changes also significantly affect recognition memory. Since both the clear and conversationally produced sentences were fully intelligible to listeners in the memory experiment (no noise distortion), this result is generally compatible with accounts of speech processing that emphasize episodic encoding in memory.

The finding that clear speech led to better recognition memory than conversational speech is also in keeping with the effortfulness hypothesis [43], [70], which suggests that, by reducing the cognitive effort associated with perceptual speech processing, more processing resources will be available for encoding speech content in memory. Our results provide novel support for the hypothesis in that more easily recognized clear speech (as indicated by improved word recognition) was also encoded better in memory. The results suggest that, because clear speech requires less “effort” on the part of the listener, more processing resources could be recruited for retaining more information about the spoken sentences in memory.

The finding that the presence of semantic context significantly enhanced recognition memory is also in line with the effortfulness hypothesis. Previous studies have shown that processing meaningful stimuli leads to improvement in ‘chunking’ and recall [71], [72]. Presumably, semantically congruous sentences can be chunked into smaller memory units. This chunking reduces processing demands, leaving more resources available for memory encoding. In contrast, encoding semantically incongruous information as in the anomalous sentences likely requires more processing resources, which may lead to poorer memory encoding.

The current results additionally provide new evidence of the cumulative benefit of acoustic-phonetic and semantic contextual enhancements in naturally produced speech on memory encoding. That is, both sources of intelligibility variability significantly affect available processing resources and memory encoding. The results further suggest that both intelligibility and sentence recognition memory are shaped by the interplay of peripheral-auditory (clarity of the speech signal) and central-cognitive (semantic) factors. Future research needs to address the exact mechanism that underlies how processing resources are allocated in different tasks (e.g., word recognition vs. recognition memory) for speech of varying intelligibility.

There are several practical implications of these results. First, the results reported here suggest that the encoding of speech signals in memory may be affected by other common sources of variability in speech intelligibility, such as foreign accent, speech production impairment, and the presence of noise in the communicative environment – all cases where speech processing will require additional cognitive effort. Second, there are a number of listener populations for whom extra effort must regularly be expended in order to achieve perceptual success in the course of everyday speech communication. These groups include individuals with hearing impairment, auditory processing deficits, and cochlear implants, as well as older adults. Furthermore, noisy environments increase the level of perceptual effort required for individuals of all hearing abilities – a fact which may be particularly relevant for children learning in noisy classrooms. Our results suggest that perceptual success in these situations may come at the cost of processing resources that would otherwise be available for encoding the speech content in memory. It is important, therefore, that those who communicate regularly with these populations (e.g., hearing professionals, caretakers, teachers, etc.) be aware that apparent memory problems may, in fact, be rooted in perceptual difficulties, and further, that simply speaking clearly for such listeners can enhance not only the intelligibility of speech, but also a person's ability to encode it in memory.

Acknowledgments

We would like to thank Lauren Ayres, Natalie Czimskey and Rachael Gilbert for their assistance with experiment design and data collection. We would also like to thank Joanna Boardman, Lauren Burleson, and Lauren Franklin for their assistance with data collection.

Author Contributions

Conceived and designed the experiments: KJV BC RS. Performed the experiments: KJV. Analyzed the data: KJV. Wrote the paper: KJV BC RS.

References

  1. Abercrombie D (1967) Elements of general phonetics. Chicago: Aldine Publishing Company.
  2. Clopper C, Pisoni DB (2005) Perception of dialect variation. In: Pisoni DB, Remez RE, editors. The Handbook of Speech Perception. Malden, MA: Blackwell Publishing. pp 313–337.
  3. Kreiman J (1997) Listening to voices: Theory and practice in voice perception research. In: Johnson K, Mullenix JW, editors. Talker Variability in Speech Processing. New York: Academic Press. pp 85–105.
  4. Laver J (1968) Voice quality and indexical information. Int J Lang Commun Disord 3: 43–54.
  5. Laver J (1989) Cognitive science and speech: A framework for research. In: Schnelle H, Bernson NO, editors. Logic and linguistics: Research directions in cognitive science. European perspective. Hillsdale, NJ: Lawrence Erlbaum. pp 37–70.
  6. Laver J, Trudgill P (1979) Phonetic and linguistic markers in speech. In: Scherer KR, Giles H, editors. Social Markers in Speech. Cambridge: Cambridge University Press. pp 1–32.
  7. Van Lancker D, Kreiman J, Emmorey K (1985) Familiar voice recognition: Patterns and parameters. Part 1: Recognition of backward voices. J Phon 13: 19–38.
  8. Ladefoged P, Broadbent DE (1957) Information conveyed by vowels. J Acoust Soc Am 29: 98–104.
  9. Remez RE, Fellowes JM, Rubin PE (1997) Talker identification based on phonetic information. J Exp Psychol Hum Percept Perform 23: 651–666.
  10. Glisky EL, Polster MR, Routhieaux BC (1995) Double dissociation between item and source memory. Neuropsychology 9: 229–235.
  11. Kreiman J, Van Lancker D (1988) Hemispheric specialization for voice recognition: Evidence from dichotic listening. Brain Lang 34: 246–252.
  12. Landis T, Buttet J, Assal G, Graves R (1982) Dissociation of ear preference in monaural word and voice recognition. Neuropsychologia 20: 501–504.
  13. Nakamura K, Kawashima R, Sugiura M, Kato T, Nakamura A, et al. (2001) Neural substrates for recognition of familiar voices: a PET study. Neuropsychologia 39: 1047–1054.
  14. Shah NJ, Marshall JC, Zafiris O, Schwab A, Zilles K, et al. (2001) The neural correlates of person familiarity: A functional magnetic resonance imaging study with clinical implications. Brain 124: 804–815.
  15. Stevens AA (2004) Dissociating the cortical basis of memory for voices, words and tones. Brain Res Cogn Brain Res 18: 162–171.
  16. Bradlow AR, Nygaard LC, Pisoni DB (1999) Effects of talker, rate and amplitude variation on recognition memory for spoken words. Percept Psychophys 61: 206–219.
  17. Chandrasekaran B, Chan AHD, Wong PCM (2011) Neural processing of what and who information in speech. J Cogn Neurosci 23: 2690–2700.
  18. Goldinger SD (1996) Words and voices: Episodic traces in spoken word identification and recognition memory. J Exp Psychol Learn Mem Cogn 22: 1166–1183.
  19. Goldinger SD, Pisoni DB, Logan JS (1991) On the nature of talker variability effects on recall of spoken word lists. J Exp Psychol Learn Mem Cogn 17: 152–162.
  20. Mullenix J, Pisoni DB (1990) Stimulus variability and processing dependencies in speech perception. Percept Psychophys 47: 379–390.
  21. Nygaard LC, Sommers MS, Pisoni DB (1994) Speech perception as a talker-contingent process. Psychol Sci 5: 42–46.
  22. Palmeri TJ, Goldinger SD, Pisoni DB (1993) Episodic encoding of voice attributes and recognition memory for spoken words. J Exp Psychol Learn Mem Cogn 19: 309–328.
  23. Schacter DL, Church BA (1992) Auditory priming: Implicit and explicit memory for words and voices. J Exp Psychol Learn Mem Cogn 18: 915–930.
  24. Bradlow AR, Kraus N, Hayes E (2003) Speaking clearly for children with learning disabilities: Sentence perception in noise. J Speech Lang Hear Res 46: 80–97.
  25. Ferguson SH, Kewley-Port D (2002) Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. J Acoust Soc Am 112: 259–271.
  26. Krause JC, Braida LD (2004) Acoustic properties of naturally produced clear speech at normal speaking rates. J Acoust Soc Am 115: 362–378.
  27. Liu S, Del Rio E, Bradlow AR, Zeng FG (2004) Clear speech perception in acoustic and electrical hearing. J Acoust Soc Am 116: 2374–2383.
  28. Matthies M, Perrier P, Perkell JS, Zandipour M (2001) Variation in anticipatory coarticulation with changes in clarity and rate. J Speech Lang Hear Res 44: 340–353.
  29. Perkell JS, Zandipour M, Matthies ML, Lane H (2002) Economy of effort in different speaking conditions. I. A preliminary study of intersubject differences and modeling issues. J Acoust Soc Am 112: 1627–1641.
  30. Picheny MA, Durlach NI, Braida LD (1986) Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. J Speech Hear Res 29: 434–446.
  31. Picheny MA, Durlach NI, Braida LD (1989) Speaking clearly for the hard of hearing III: An attempt to determine the contribution of speaking rate to differences in intelligibility between clear and conversational speech. J Speech Hear Res 32: 600–603.
  32. Smiljanic R, Bradlow AR (2005) Production and perception of clear speech in Croatian and English. J Acoust Soc Am 118: 1677–1688.
  33. Smiljanic R, Bradlow AR (2009) Speaking and hearing clearly: Talker and listener factors in speaking style changes. Lang Linguist Compass 3: 236–264.
  34. Smiljanic R, Bradlow AR (2008) Temporal organization of English clear and conversational speech. J Acoust Soc Am 125: 3171–3182.
  35. Bradlow AR, Bent T (2002) The clear speech effect for non-native listeners. J Acoust Soc Am 112: 272–284.
  36. Smiljanic R, Bradlow AR (2008) Stability of temporal contrasts across speaking styles in English and Croatian. J Phon 36: 91–113.
  37. Maniwa K, Jongman A, Wade T (2009) Acoustic characteristics of clearly spoken English fricatives. J Acoust Soc Am 125: 3962–3973.
  38. Kang K-H, Guion SG (2008) Clear speech production of Korean stops: Changing phonetic targets and enhancement strategies. J Acoust Soc Am 124: 3909–3917.
  39. Krause JC, Braida LD (2002) Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility. J Acoust Soc Am 112: 2165–2172.
  40. Payton KL, Uchanski RM, Braida LD (1994) Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. J Acoust Soc Am 95: 1581–1592.
  41. Ferguson SH (2012) Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss. J Speech Lang Hear Res 55: 779–790.
  42. Schum DJ (1996) Intelligibility of clear and conversational speech of young and elderly talkers. J Am Acad Audiol 7: 212–218.
  43. McCoy SL, Tun PA, Cox LC, Colangelo M, Stewart RA, et al. (2005) Hearing loss and perceptual effort: Downstream effects on older adults' memory for speech. Q J Exp Psychol 58A: 22–33.
  44. Rabbitt PMA (1991) Mild hearing loss can cause apparent memory failures which increase with age and reduce with IQ. Acta Otolaryngol Suppl 476: 167–176.
  45. Rabbitt PMA (1968) Channel capacity, intelligibility and immediate memory. Q J Exp Psychol 20: 241–248.
  46. Murphy DR, Craik FIM, Li KZH, Schneider BA (2000) Comparing the effects of aging and background noise on short-term memory performance. Psychol Aging 15: 49–61.
  47. Pichora-Fuller MK, Souza PE (2003) Effects of aging on auditory processing of speech. Int J Audiol 42: 2S11–2S16.
  48. Wingfield A, Tun PA, McCoy SL (2005) Hearing loss in older adulthood: What it is and how it interacts with cognitive performance. Curr Dir Psychol Sci 14: 144–148.
  49. Surprenant AM (1999) The effect of noise on memory for spoken syllables. Int J Psychol 34: 328–333.
  50. Surprenant AM (2007) Effects of noise on identification and serial recall of nonsense syllables in older and younger adults. Aging Neuropsychol Cogn 14: 126–143.
  51. van Boxtel MPJ, van Beijsterveldt CEM, Houx PJ, Anteunis LJC, Metsemakers JFM, et al. (2000) Mild hearing impairment can reduce verbal memory performance in a healthy adult population. J Clin Exp Neuropsychol 22: 147–154.
  52. Pichora-Fuller MK, Schneider BA, Daneman M (1995) How young and old adults listen to and remember speech in noise. J Acoust Soc Am 97: 593–608.
  53. Nye PW, Gaitenby JH (1974) The intelligibility of synthetic monosyllabic words in short, syntactically normal sentences. Haskins Laboratories Status Report on Speech Research SR-37/38.
  54. Calandruccio L, Smiljanic R (2012) New sentence recognition materials developed using a basic non-native English lexicon. J Speech Lang Hear Res.
  55. Boersma P, Weenink D (2009) Praat: Doing phonetics by computer. Version 5.1.
  56. Schneider W, Eschman A, Zuccolotto A (2002) E-Prime User's Guide. Pittsburgh, PA: Psychology Software Tools, Inc.
  57. R Development Core Team (2005) R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
  58. Snodgrass JG, Corwin J (1988) Pragmatics of measuring recognition memory: Applications to dementia and amnesia. J Exp Psychol Gen 117: 34–50.
  59. Uchanski RM (2005) Clear speech. In: Pisoni DB, Remez RE, editors. The Handbook of Speech Perception. Malden, MA: Blackwell Publishing.
  60. Bradlow AR, Alexander JA (2007) Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners. J Acoust Soc Am 121: 2339–2349.
  61. Boothroyd A, Nittrouer S (1988) Mathematical treatment of context effects in phoneme and word recognition. J Acoust Soc Am 84: 101–114.
  62. Kalikow DN, Stevens KN, Elliott LL (1977) Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. J Acoust Soc Am 61: 1337–1351.
  63. Miller GA, Isard S (1963) Some perceptual consequences of linguistic rules. J Verbal Learning Verbal Behav 2: 217–228.
  64. Lamont AC, Stewart-Williams S, Podd J (2005) Face recognition and aging: Effects of target age and memory load. Mem Cognit 33: 1017–1024.
  65. Podd J (1990) The effects of memory load and delay on face recognition. Appl Cogn Psychol 4: 47–60.
  66. Davies GM, Shepherd JW, Ellis HD (1979) Similarity effects in facial recognition. Am J Psychol 92: 507–523.
  67. Ferguson SH (2004) Talker differences in clear and conversational speech: Vowel intelligibility for normal-hearing listeners. J Acoust Soc Am 116: 2365–2373.
  68. Liu S, Zeng F-G (2006) Temporal properties in clear speech perception. J Acoust Soc Am 120: 424–432.
  69. Hazan V, Markham D (2004) Acoustic-phonetic correlates of talker intelligibility for adults and children. J Acoust Soc Am 116: 3108–3118.
  70. Tun PA, McCoy SL, Wingfield A (2009) Aging, hearing acuity, and the attentional costs of effortful listening. Psychol Aging 24: 761–766.
  71. Tulving E, Patkau JE (1962) Concurrent effects of contextual constraint and word frequency on immediate recall and learning of verbal material. Can J Psychol 16: 83–95.
  72. Glanzer M, Razel M (1974) The size of the unit in short-term storage. J Verbal Learning Verbal Behav 13: 114–131.