We start with a brief overview of our work in speech recognition and understanding, which led from monomodal (speech-only) human-machine dialog to multimodal human-machine interaction and assistance. Our work in speech communication initially had the goal of developing a complete system for question answering by spoken dialog [7,15]. This goal was achieved in various projects funded by the German Research Foundation [14] and the German Federal Ministry of Education and Research [16]. Problems of multilingual communication were considered in projects supported by the European Union [2,4,10]. In the Verbmobil project the speech-to-speech translation problem was investigated, and it turned out that prosody and the recognition of emotion were important and extremely useful, if not indispensable, for disambiguating utterances and for influencing the dialog strategy [3,17]. Multimodal and multimedia aspects of human-machine communication became a topic in the follow-up projects Embassi [11], SmartKom [1], FORSIP [12], and SmartWeb [9]. The SmartWeb project [19], which involves 17 partners from companies, research institutes, and universities, has the general goal of providing the foundations for multimodal human-machine communication with distributed semantic web services using different mobile devices: hand-held, mounted in a car, or mounted on a motorcycle. It uses speech and video signals as well as signals from other sensors, e.g. ECG or skin resistance.

A special problem in human-machine interaction and assistance is the question of whether the user is speaking to the machine or not, that is, the distinction between on-talk and off-talk. It is shown how on-/off-talk can be classified by combining prosodic and image features. Using additional sensors, the user state in general is estimated to give further cues to the dialog control. This may be used, for example, to avoid input from the dialog system in a situation where the driver is under stress.

In other projects the special problem of processing children's speech was considered [20]. Among other things, it was investigated whether a manual correction of the automatically computed fundamental frequency (F0) and of word boundaries would have a positive effect on the automatic classification of the four classes anger, motherese, emphatic, and neutral; this was not the case, leading to the conclusion that at present there is no need for improved F0 algorithms in emotion recognition. The word accuracy (WA) of native and non-native English-speaking children was also investigated; it was shown that non-native speakers (aged 10-15) achieve about the same WA as children aged 6-7 when a speech recognizer trained on native children's speech is used. The recognizer was also used to develop an automatic scoring of the pronunciation quality of children learning English.

A special problem is posed by speech impairments, which may be congenital (e.g. cleft lip and palate) or acquired through disease (e.g. cancer of the larynx). Impairments are treated, among other approaches, with speech training by speech therapists, who score the speech quality subjectively according to various criteria. The idea is that the WA of an automatic speech recognizer should be highly correlated with the human rating. Using speech samples from laryngectomees, it is shown that the machine rating is about as good as the rating of five human experts and can also be obtained via telephone. This opens the possibility of an objective and standardized rating of speech quality.
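To illustrate the idea behind this last point, the following minimal sketch (not the authors' actual evaluation code) computes WA per speaker from a Levenshtein alignment of recognizer output against the reference transcript and then correlates the WA values with averaged human expert ratings; the speaker data and the 5-point rating scale are hypothetical placeholders.

```python
# Sketch: correlate recognizer word accuracy (WA) with human expert ratings.
# All speaker IDs, transcripts, and ratings below are made-up examples.
from scipy.stats import spearmanr

def word_accuracy(reference: str, hypothesis: str) -> float:
    """WA = (N - S - D - I) / N * 100, edits taken from a Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal number of edits between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * (len(ref) - dp[len(ref)][len(hyp)]) / max(len(ref), 1)

# Hypothetical data: (reference text, recognizer output, mean expert score).
speakers = {
    "spk01": ("the patient reads a standard text", "the patient reads a standard text", 1.5),
    "spk02": ("the patient reads a standard text", "the patient reeds standard test", 2.8),
    "spk03": ("the patient reads a standard text", "patient weeds a sandar text", 4.1),
}

wa_scores = [word_accuracy(ref, hyp) for ref, hyp, _ in speakers.values()]
expert_scores = [rating for _, _, rating in speakers.values()]

# A strong (here negative) rank correlation would support using WA as an
# objective, standardized substitute for the subjective expert rating.
rho, p_value = spearmanr(wa_scores, expert_scores)
print(f"Spearman correlation between WA and expert rating: {rho:.2f} (p={p_value:.3f})")
```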