When we perceive speech, our goal is to extract the meaning of the verbal message, which involves semantic processing. But how deeply do we process speech in different situations? In two experiments, native Dutch participants heard spoken sentences describing simultaneously presented pictures. Sentences either correctly described the pictures or contained an anomalous final word (i.e., a semantically or phonologically incongruent word). In the first experiment, the spoken sentences were task-irrelevant, and both anomalous conditions elicited similar centro-parietal N400s that were larger in amplitude than the N400 for the correct condition. In the second experiment, we ensured that participants processed the same stimuli semantically. In an early time window, we found similar phonological mismatch negativities for both anomalous conditions compared to the correct condition. These negativities were followed by an N400 that was larger for semantic than for phonological errors. Together, these data suggest that we process speech semantically even when it is task-irrelevant. We further suggest that once listeners allocate more cognitive resources to speech processing, they make predictions about upcoming words, presumably by means of the production system and an internal monitoring loop, to facilitate lexical processing of the perceived speech.