[{"@context":"http:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/wiki.edu.vn\/en\/wiki40\/temporal-envelope-and-fine-structure\/#BlogPosting","mainEntityOfPage":"https:\/\/wiki.edu.vn\/en\/wiki40\/temporal-envelope-and-fine-structure\/","headline":"Temporal envelope and fine structure","name":"Temporal envelope and fine structure","description":"Sound frequency changes responsible for perceptions of loudness, pitch and timbre Temporal envelope (ENV) and temporal fine structure (TFS) are","datePublished":"2021-07-20","dateModified":"2021-07-20","author":{"@type":"Person","@id":"https:\/\/wiki.edu.vn\/en\/wiki40\/author\/lordneo\/#Person","name":"lordneo","url":"https:\/\/wiki.edu.vn\/en\/wiki40\/author\/lordneo\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/c9645c498c9701c88b89b8537773dd7c?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c9645c498c9701c88b89b8537773dd7c?s=96&d=mm&r=g","height":96,"width":96}},"publisher":{"@type":"Organization","name":"Enzyklop\u00e4die","logo":{"@type":"ImageObject","@id":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/08\/download.jpg","url":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/08\/download.jpg","width":600,"height":60}},"image":{"@type":"ImageObject","@id":"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/a\/a0\/Output_of_simulated_cochlear_filters.jpg\/300px-Output_of_simulated_cochlear_filters.jpg","url":"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/a\/a0\/Output_of_simulated_cochlear_filters.jpg\/300px-Output_of_simulated_cochlear_filters.jpg","height":"225","width":"300"},"url":"https:\/\/wiki.edu.vn\/en\/wiki40\/temporal-envelope-and-fine-structure\/","about":["Wiki"],"wordCount":74834,"articleBody":"Sound frequency changes responsible for perceptions of loudness, pitch and timbreTemporal envelope (ENV) and temporal fine structure (TFS) are changes in the amplitude and frequency of sound perceived by humans over 
time. These temporal changes are responsible for several aspects of auditory perception, including loudness, pitch and timbre perception and spatial hearing.

Complex sounds such as speech or music are decomposed by the peripheral auditory system of humans into narrow frequency bands. The resulting narrowband signals convey information at different time scales, ranging from less than one millisecond to hundreds of milliseconds. A dichotomy between slow “temporal envelope” cues and faster “temporal fine structure” cues has been proposed for studying several aspects of auditory perception (e.g., loudness, pitch and timbre perception, auditory scene analysis, sound localization) at two distinct time scales in each frequency band.[1][2][3][4][5][6][7] Over recent decades, a wealth of psychophysical, electrophysiological and computational studies based on this envelope/fine-structure dichotomy have examined the role of these temporal cues in sound identification and communication, how these temporal cues are processed by the peripheral and central auditory system, and the effects of aging and cochlear damage on temporal auditory processing.
Although the envelope/fine-structure dichotomy has been debated and questions remain as to how temporal fine structure cues are actually encoded in the auditory system, these studies have led to a range of applications in various fields including speech and audio processing, clinical audiology, and the rehabilitation of sensorineural hearing loss via hearing aids or cochlear implants.

Definition

Figure: Outputs of simulated cochlear filters centred at 364, 1498 and 4803 Hz (from bottom to top) in response to a segment of a speech signal, the sound “en” in “sense”.
These filter outputs are similar to the waveforms that would be observed at places on the basilar membrane tuned to 364, 1498 and 4803 Hz. For each centre frequency, the signal can be considered as a slowly varying envelope (ENVBM) imposed on a more rapid temporal fine structure (TFSBM). The envelope of each band signal is shown by the thick line.

The notions of temporal envelope and temporal fine structure can take different meanings across studies. An important distinction to make is between the physical (i.e., acoustical) and the biological (or perceptual) description of these ENV and TFS cues.

Figure: Schematic representation of the three levels of temporal envelope (ENV) and temporal fine structure (TFS) cues conveyed by a band-limited signal processed by the peripheral auditory system.

Any sound whose frequency components cover a narrow range (called a narrowband signal) can be considered as an envelope (ENVp, where p denotes the physical signal) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFSp).[8]

Many sounds in everyday life, including speech and music, are broadband; their frequency components spread over a wide range, and there is no well-defined way to represent the signal in terms of ENVp and TFSp. However, in a normally functioning cochlea, complex broadband signals are decomposed by the filtering on the basilar membrane (BM) within the cochlea into a series of narrowband signals.[9] Therefore, the waveform at each place on the BM can be considered as an envelope (ENVBM) superimposed on a more rapidly oscillating carrier, the temporal fine structure (TFSBM).[10] ENVBM and TFSBM depend on the place along the BM.
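For a narrowband signal, the ENVp/TFSp split described above is conventionally computed from the analytic signal: the Hilbert envelope gives ENVp, and the cosine of the instantaneous phase gives a unit-amplitude TFSp. A minimal sketch using an FFT-based analytic signal (equivalent to scipy.signal.hilbert):

```python
import numpy as np

def envelope_and_tfs(x):
    """Split a narrowband signal into temporal envelope (ENVp) and
    temporal fine structure (TFSp) via the analytic signal."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)                   # spectral weights for the analytic signal
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0             # double positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    env = np.abs(analytic)            # ENVp: slowly varying envelope
    tfs = np.cos(np.angle(analytic))  # TFSp: unit-amplitude carrier
    return env, tfs
```

By construction the product env * tfs reconstructs the original waveform, which is what makes this decomposition useful for vocoder-style experiments that exchange ENV and TFS between sounds.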
At the apical end, which is tuned to low (audio) frequencies, ENVBM and TFSBM vary relatively slowly with time, while at the basal end, which is tuned to high frequencies, both ENVBM and TFSBM vary more rapidly with time.[10] Both ENVBM and TFSBM are represented in the time patterns of action potentials in the auditory nerve;[11] these are denoted ENVn and TFSn. TFSn is represented most prominently in neurons tuned to low frequencies, while ENVn is represented most prominently in neurons tuned to high (audio) frequencies.[11][12] For a broadband signal, it is not possible to manipulate TFSp without affecting ENVBM and ENVn, and it is not possible to manipulate ENVp without affecting TFSBM and TFSn.[13][14]

Temporal envelope (ENV) processing

Neurophysiological aspects

Figure: Examples of sinusoidally amplitude- and frequency-modulated signals.

The neural representation of the stimulus envelope, ENVn, has typically been studied using well-controlled ENVp modulations, that is, sinusoidally amplitude-modulated (AM) sounds. Cochlear filtering limits the range of AM rates encoded in individual auditory-nerve fibers. In the auditory nerve, the strength of the neural representation of AM decreases with increasing modulation rate. At the level of the cochlear nucleus, several cell types show an enhancement of ENVn information. Multipolar cells can show band-pass tuning to AM tones with AM rates between 50 and 1000 Hz.[15][16] Some of these cells show an excellent response to the ENVn and provide inhibitory sideband inputs to other cells in the cochlear nucleus, giving a physiological correlate of comodulation masking release, a phenomenon whereby the detection of a signal in a masker is improved when the masker has correlated envelope fluctuations across frequency (see section below).[17][18]

Responses to the temporal-envelope cues of speech and other complex sounds persist along the auditory pathway, up to the various fields of the auditory cortex in many animals.
In the primary auditory cortex, responses can encode AM rates by phase-locking up to about 20–30 Hz,[19][20][21][22] while faster rates induce sustained and often tuned responses.[23][24] A topographical representation of AM rate has been demonstrated in the primary auditory cortex of awake macaques.[25] This representation is approximately perpendicular to the axis of the tonotopic gradient, consistent with an orthogonal organization of spectral and temporal features in the auditory cortex. Combining these temporal responses with the spectral selectivity of A1 neurons gives rise to the spectro-temporal receptive fields that often capture cortical responses to complex modulated sounds well.[26][27] In secondary auditory cortical fields, responses become temporally more sluggish and spectrally broader, but are still able to phase-lock to the salient features of speech and musical sounds.[28][29][30][31] Tuning to AM rates below about 64 Hz is also found in the human auditory cortex,[32][33][34][35] as revealed by brain-imaging techniques (fMRI) and cortical recordings in epileptic patients (electrocorticography). This is consistent with neuropsychological studies of brain-damaged patients[36] and with the notion that the central auditory system performs some form of spectral decomposition of the ENVp of incoming sounds. The AM-rate ranges over which cortical responses faithfully encode the temporal-envelope cues of speech have been shown to be predictive of the human ability to understand speech.
In the human superior temporal gyrus (STG), an anterior-posterior spatial organization of spectro-temporal modulation tuning has been found in response to speech sounds, the posterior STG being tuned for temporally fast-varying speech sounds with low spectral modulations and the anterior STG being tuned for temporally slow-varying speech sounds with high spectral modulations.[37]

One unexpected aspect of phase locking in the auditory cortex has been observed in the responses elicited by complex acoustic stimuli whose spectrograms exhibit relatively slow envelopes (< 20 Hz) carried by fast modulations as high as hundreds of hertz. Speech and music, as well as various modulated noise stimuli, have such temporal structure.[38] For these stimuli, cortical responses phase-lock to both the envelope and the fine structure induced by interactions between unresolved harmonics of the sound, thus reflecting the pitch of the sound and exceeding the typical upper limits of cortical phase-locking to envelopes of a few tens of hertz. This paradoxical relation[38][39] between the slow and fast cortical phase-locking to the carrier “fine structure” has been demonstrated in both the auditory[38] and visual[40] cortices. It is also amply manifested in measurements of the spectro-temporal receptive fields of the primary auditory cortex, giving them unexpectedly fine temporal accuracy and selectivity, bordering on a 5–10 ms resolution.[38][40] The underlying causes of this phenomenon have been attributed to several possible origins, including nonlinear synaptic depression and facilitation, and/or a cortical network of thalamic excitation and cortical inhibition.[38][41][42][43] There are many functionally significant and perceptually relevant reasons for the coexistence of these two complementary dynamic response modes.
They include the ability to accurately encode onsets and other rapid “events” in the ENVp of complex acoustic and other sensory signals, features that are critical for the perception of consonants (speech) and percussive sounds (music), as well as the texture of complex sounds.[38][44]

Psychoacoustical aspects

The perception of ENVp depends on which AM rates are contained in the signal. Low rates of AM, in the 1–8 Hz range, are perceived as changes in perceived intensity, that is, loudness fluctuations (a percept that can also be evoked by frequency modulation, FM); at higher rates, AM is perceived as roughness, with the greatest roughness sensation occurring at around 70 Hz;[45] at even higher rates, AM can evoke a weak pitch percept corresponding to the modulation rate.[46] Rainstorms, crackling fire, chirping crickets or galloping horses produce “sound textures” (the collective result of many similar acoustic events) whose perception is mediated by ENVn statistics.[47][48]

The auditory detection threshold for AM as a function of AM rate, referred to as the temporal modulation transfer function (TMTF),[49] is best for AM rates in the range of 4–150 Hz and worsens outside that range.[49][50][51] The cutoff frequency of the TMTF gives an estimate of temporal acuity (temporal resolution) for the auditory system.
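A sinusoidally amplitude-modulated tone of the kind used to measure the TMTF, together with the time constant implied by a given TMTF cutoff (modelling the smoothing as a first-order lowpass), can be sketched as follows; the parameter values are illustrative, not taken from the cited studies:

```python
import numpy as np

def am_tone(fc=1000.0, fm=16.0, m=0.5, dur=1.0, fs=16000.0):
    """Sinusoidal AM tone: [1 + m*sin(2*pi*fm*t)] * sin(2*pi*fc*t),
    with carrier frequency fc, modulation rate fm and modulation depth m."""
    t = np.arange(int(dur * fs)) / fs
    return (1.0 + m * np.sin(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)

def time_constant(f_cut):
    """Time constant of a first-order lowpass with cutoff f_cut (Hz)."""
    return 1.0 / (2.0 * np.pi * f_cut)
```

With a cutoff between about 60 and 150 Hz, the implied time constant falls in the 1–3 ms range reported for normal-hearing listeners.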
This cutoff frequency corresponds to a time constant of about 1–3 ms for the auditory system of normal-hearing humans.

Correlated envelope fluctuations across frequency in a masker can aid the detection of a pure-tone signal, an effect known as comodulation masking release.[18]

AM applied to a given carrier can perceptually interfere with the detection of a target AM imposed on the same carrier, an effect termed modulation masking.[52][53] Modulation-masking patterns are tuned (masking is greatest when the masker and target AM rates are close), suggesting that the human auditory system is equipped with frequency-selective channels for AM. Moreover, AM applied to spectrally remote carriers can perceptually interfere with the detection of AM on a target sound, an effect termed modulation detection interference.[54] The notion of modulation channels is also supported by the demonstration of selective adaptation effects in the modulation domain.[55][56][57] These studies show that AM detection thresholds are selectively elevated above pre-exposure thresholds when the carrier frequency and the AM rate of the adaptor are similar to those of the test tone.

Human listeners are also sensitive to relatively slow “second-order” AM cues, which correspond to fluctuations in the strength of AM. These cues arise from the interaction of different modulation rates, previously described as “beating” in the envelope-frequency domain.
Perception of second-order AM has been interpreted as resulting from nonlinear mechanisms in the auditory pathway that produce an audible distortion component at the envelope beat frequency in the internal modulation spectrum of the sounds.[58][59][60]

Interaural time differences in the envelope provide binaural cues even at high frequencies, where TFSn cannot be used.[61]

Models of normal envelope processing

Figure: Diagram of the common part of the envelope perception model of Torsten Dau and the EPSM.

The most basic computational model of ENV processing is the leaky-integrator model.[62][49] This model extracts the temporal envelope of the sound (ENVp) via bandpass filtering, half-wave rectification (which may be followed by fast-acting amplitude compression), and lowpass filtering with a cutoff frequency between about 60 and 150 Hz. The leaky integrator is often used with a decision statistic based on the resulting envelope power, the max/min ratio, or the crest factor. This model accounts for the loss of auditory sensitivity for AM rates above about 60–150 Hz for broadband noise carriers.[49] Based on the concept of frequency selectivity for AM,[53] the perception model of Torsten Dau[63] incorporates broadly tuned bandpass modulation filters (with a Q value around 1) to account for data from a broad variety of psychoacoustic tasks, and particularly AM detection for noise carriers with different bandwidths, taking into account their intrinsic envelope fluctuations.
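The leaky-integrator stage just described (half-wave rectification followed by a first-order lowpass) can be sketched as follows; the 100 Hz cutoff is an illustrative choice within the 60–150 Hz range given above, and the bandpass (cochlear) pre-filtering is assumed to have been applied already:

```python
import numpy as np

def leaky_envelope(x, fs, f_cut=100.0):
    """Envelope extraction: half-wave rectification followed by a
    one-pole lowpass (leaky integrator) with cutoff f_cut (Hz)."""
    rect = np.maximum(x, 0.0)                    # half-wave rectification
    alpha = np.exp(-2.0 * np.pi * f_cut / fs)    # one-pole smoothing coefficient
    env = np.empty_like(rect)
    acc = 0.0
    for i, v in enumerate(rect):
        acc = (1.0 - alpha) * v + alpha * acc    # leaky integration
        env[i] = acc
    return env
```

A Dau-style model would replace the final lowpass with a bank of bandpass modulation filters.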
The Dau model has been extended to account for comodulation masking release (see sections above).[64] The shapes of the modulation filters have been estimated,[65] and an “envelope power spectrum model” (EPSM) based on these filters can account for AM masking patterns and AM depth discrimination.[66] The EPSM has been extended to the prediction of speech intelligibility[67] and to account for data from a broad variety of psychoacoustic tasks.[68] A physiologically based processing model simulating brainstem responses has also been developed to account for AM detection and AM masking patterns.[69]

Temporal fine structure (TFS) processing

Neurophysiological aspects

Figure: Phase locking recorded from a neuron in the cochlear nucleus in response to a sinusoidal acoustic stimulus at the cell’s best frequency (in this case 240 Hz). The stimulus level was approximately 20 dB above the neuron’s threshold. The neural outputs (action potentials) are shown in the upper trace and the stimulus waveform in the lower trace.

The neural representation of temporal fine structure, TFSn, has been studied using stimuli with well-controlled TFSp: pure tones, harmonic complex tones, and frequency-modulated (FM) tones. Auditory-nerve fibres are able to represent low-frequency sounds via their phase-locked discharges (i.e., TFSn information). The upper frequency limit for phase locking is species-dependent: it is about 5 kHz in the cat, 9 kHz in the barn owl, and only about 4 kHz in the guinea pig. The upper limit of phase locking in humans is not known, but current, indirect estimates suggest that it is about 4–5 kHz.[70] Phase locking is a direct consequence of the transduction process: the probability of transduction-channel opening increases when the stereocilia are stretched in one direction and decreases when they are pushed in the opposite direction. This has led some to suggest that phase locking is an epiphenomenon.
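Phase locking of the kind shown in the figure is commonly quantified with the vector strength measure of Goldberg and Brown: each spike is mapped to a phase on the stimulus cycle, and the length of the mean resultant vector measures how tightly the spikes cluster. A minimal sketch:

```python
import numpy as np

def vector_strength(spike_times, freq):
    """Vector strength: resultant length of spike phases on the stimulus cycle.
    1.0 = perfect phase locking; 0.0 = spikes uniform over the cycle."""
    phases = 2.0 * np.pi * freq * np.asarray(spike_times)  # phase of each spike (rad)
    return float(np.abs(np.mean(np.exp(1j * phases))))
```

A value near 1 indicates spikes locked to a fixed stimulus phase; a value near 0 indicates spikes distributed uniformly across the cycle.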
The upper limit appears to be determined by a cascade of low-pass filters at the level of the inner hair cell and the auditory-nerve synapse.[71][72] TFSn information in the auditory nerve may be used to encode the (audio) frequency of low-frequency sounds, including single tones and more complex stimuli such as frequency-modulated tones or steady-state vowels (see role and applications to speech and music).

The auditory system goes to some lengths to preserve this TFSn information, as shown by the presence of giant synapses (end bulbs of Held) in the ventral cochlear nucleus. These synapses contact bushy cells (spherical and globular) and faithfully transmit (or enhance) the temporal information present in the auditory-nerve fibers to higher structures in the brainstem.[73] The bushy cells project to the medial superior olive, and the globular cells project to the medial nucleus of the trapezoid body (MNTB). The MNTB is also characterized by giant synapses (calyces of Held) and provides precisely timed inhibition to the lateral superior olive. The medial and lateral superior olives and the MNTB are involved in the encoding of interaural time and intensity differences. There is general acceptance that this temporal information is crucial in sound localization, but it is still contentious whether the same temporal information is used to encode the frequency of complex sounds.

Several problems remain with the idea that TFSn is important in the representation of the frequency components of complex sounds. The first problem is that the temporal information deteriorates as it passes through successive stages of the auditory pathway (presumably due to low-pass dendritic filtering). The second problem is that the temporal information would therefore have to be extracted at an early stage of the auditory pathway.
No such stage has currently been identified, although there are theories about how temporal information could be converted into rate information (see section Models of normal processing: limitations).

Psychoacoustical aspects

It is often assumed that many perceptual capacities rely on the ability of the monaural and binaural auditory system to encode and use TFSn cues evoked by components in sounds with frequencies below about 1–4 kHz. These capacities include discrimination of frequency,[74][4][75][76] discrimination of the fundamental frequency of harmonic sounds,[75][4][76] detection of FM at rates below 5 Hz,[77] melody recognition for sequences of pure tones and complex tones,[74][4] lateralization and localization of pure tones and complex tones,[78] and segregation of concurrent harmonic sounds (such as speech sounds).[79] It appears that TFSn cues require a correct tonotopic (place) representation to be processed optimally by the auditory system.[80] Moreover, musical pitch perception has been demonstrated for complex tones with all harmonics above 6 kHz, showing that it is not entirely dependent on neural phase locking to TFSBM (i.e., TFSn) cues.[81]

As for FM detection, the current view assumes that in the normal auditory system, FM is encoded via TFSn cues when the FM rate is low (