Most children listen to speech as their primary source of communication. Yet which language they learn depends on where they grow up (Cutler, 2012): for instance, babies growing up in the Netherlands will grow up speaking Dutch, while those growing up in China might master Mandarin. Languages differ markedly from one another: in their repertoire of speech sounds, in their use of suprasegmental information (whether acoustic variation above the level of the phoneme, such as word stress and tone, can differentiate between words), and in their phonotactics (which phoneme combinations are permissible in a language). When infants are born, their natural preferences and abilities in speech perception are hardly shaped by their native language. In other words, newborns are considered 'universal listeners' (Kuhl et al., 2008). Yet through repeated exposure to their native language, cross-linguistic differences between infants soon become apparent, which suggests that the first year of life sets the scene for language-specific listening. This chapter makes clear that speech perception is not a trivial task: speech is a continuous stream of sounds, embedded in words that combine into phrases, with no pauses that reliably signal where words begin or end (see Figure 1). Fortunately, speech conveys cues to many linguistic elements simultaneously. Speech perception is the process of extracting these cues from the speech stream in order to recognise the message that a speaker is conveying. This process is further complicated by the fact that speakers differ from one another and therefore produce the cues slightly differently.
In what follows, we first describe the input that children are exposed to (Section 15.1), before we turn to how children learn to distinguish their native language from other acoustic signals (Section 15.2) and to decompose it into meaningful units: into sounds (Section 15.3), into suprasegmental units (Section 15.4), and finally, into words (Section 15.5). In Section 15.6 we discuss the development of speech perception in relation to speech exposure and brain maturation. In our concluding section (Section 15.6) we underscore that early speech perception skills are crucial for language acquisition.

15.1 What kind of speech do children hear?

The primary source of speech is vocal fold vibration, resulting in voiced sounds with a fundamental frequency, which is perceived as pitch. Speech can also vary in amplitude, perceived as fluctuations in loudness, and in the duration of segments and phrases, which may signal speaking rate, among many other things.

1 = Utrecht University, 2 = University of Potsdam, 3 = Macquarie University

Figure 1: Speech as a stream of sounds: a sound spectrogram of a Dutch utterance meaning 'in speech, all words are glued together', with time on the x-axis, spectral frequencies from 0 to 8000 Hz (signaling vowel and consonant properties) on the left y-axis, and fundamental frequency (pitch information signaling intonation and word stress, shown as the rising and falling line in the spectrogram) on the right y-axis. The tiers below the spectrogram show the speech stream segmented into relevant subunits of speech: the top tier shows how harmonics cluster into specific speech sound segments. The second tier groups these segments into Dutch words. Note that the pauses in the speech signal mark the onset of plosives (/k/, /p/, /t/); they do not align with the onset or offset of words. The third tier offers a translation of the words into English.
Frequency, amplitude and duration are essential cues to suprasegmental structure, such as lexical stress, tone, and the intonation of an entire utterance (Fry, 1955; Pierrehumbert, 1980; Beckman, 1986). To give an example from Figure 1: the second vowel in the Dutch word /əlkaːr/ is typically higher in pitch, louder, and longer than the first, because the second syllable, but not the first, carries lexical stress. The same acoustic cues also play a role in distinguishing between the two main segmental classes, as vowels are typically voiced, and louder and longer than consonants.
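To make the three cues concrete, the following is a minimal sketch in Python of how fundamental frequency, amplitude, and duration might be measured from a sound wave. It is an illustration under stated assumptions, not the procedure used to produce Figure 1: the two "vowels" are synthetic tones rather than recorded speech, the function names (`synth_vowel`, `estimate_f0`, `rms`, `describe`) and all parameter values are hypothetical, and the autocorrelation peak is only one simple way of estimating pitch.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)


def synth_vowel(f0_hz, amplitude, duration_s, sample_rate=SAMPLE_RATE):
    """Synthesise a crude voiced sound: a tone at f0 plus one weaker harmonic."""
    n = round(duration_s * sample_rate)
    step = 2 * math.pi * f0_hz / sample_rate
    return [amplitude * (math.sin(step * t) + 0.5 * math.sin(2 * step * t))
            for t in range(n)]


def estimate_f0(samples, sample_rate=SAMPLE_RATE, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    frame = samples[:4 * max_lag]  # analyse a short initial frame only
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity of the frame with a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def rms(samples):
    """Root-mean-square amplitude, a rough acoustic correlate of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def describe(samples, sample_rate=SAMPLE_RATE):
    """Collect the three cues discussed in the text for one sound."""
    return {
        "f0_hz": estimate_f0(samples, sample_rate),
        "rms": rms(samples),
        "duration_s": len(samples) / sample_rate,
    }


# Hypothetical values: the stressed vowel is higher, louder, and longer.
unstressed = describe(synth_vowel(160.0, 0.4, 0.08))
stressed = describe(synth_vowel(200.0, 0.8, 0.16))
more_prominent = all(stressed[cue] > unstressed[cue] for cue in stressed)
print(stressed, unstressed, more_prominent)
```

In this toy case all three measurements are larger for the stressed vowel. In natural speech the cues are noisier and need not all point in the same direction, which is part of what makes stress perception a learning problem.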