• The study compares the performance of 13 artificial-intelligence-based fall detection algorithms applied to accelerometer signals captured at the waist of the test subject with a wearable device.
• Unlike most related studies, which use just a single data repository to test their proposals, the analysis employs up to 11 different public laboratory-based datasets, as well as two databases with signals measured during real falls and four repositories with mobility samples gathered during long-term monitoring campaigns.
• The study considers not only the typical intra-dataset evaluation procedure followed in the literature but also cross-dataset evaluation (i.e., using data from a different repository for testing than that used for training).
• The performance metrics show that the behavior of all algorithms degrades significantly under cross-dataset evaluation conditions, both when tested with samples from another laboratory-generated dataset and when tested with real-fall or long-term monitoring samples.
• The results show that the algorithms tend to overfit the patterns of emulated falls (obtained from laboratory-based datasets) used for training. They are therefore unable to extrapolate their learning correctly to conventional movements or new fall patterns, which is reflected in a high percentage of undetected falls and a high hourly rate of false alarms.
• The results clearly call into question the evaluation procedures for fall detectors used so far in the literature, highlighting the importance of testing this type of detector in realistic scenarios with signals from real falls and actual daily-life movements.

The evaluation of fall detection systems based on wearables is controversial, as most studies in the literature benchmark their proposals against falls simulated by experimental subjects under unrealistic laboratory conditions.
To investigate the suitability of this procedure systematically, this paper evaluates a wide set of artificial intelligence algorithms used for fall detection, training them on a large number of datasets containing acceleration samples captured during the emulation of falls and ordinary movements, and then testing them with the signals of both actual falls and long-term traces collected from the continuous monitoring of users during their daily routines. The results, based on a large number of repositories, show a remarkable degradation in all performance metrics (sensitivity, specificity and false-alarm hourly rate) with respect to the typical case in which the detectors are tested on the same types of laboratory movements for which they were trained.
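To make the three metrics named above concrete, the following is a minimal sketch of how sensitivity, specificity, and the false-alarm hourly rate can be computed from a detector's binary decisions. The function name, labels, and all numbers here are illustrative assumptions for the sketch, not values or code from the study.

```python
def fall_detection_metrics(y_true, y_pred, monitored_hours):
    """Compute the metrics used to benchmark wearable fall detectors.

    y_true / y_pred: 1 = fall, 0 = activity of daily living (ADL).
    monitored_hours: duration of the monitored trace, in hours.
    (Illustrative sketch; not the study's implementation.)
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    false_alarms_per_hour = fp / monitored_hours
    return sensitivity, specificity, false_alarms_per_hour


# Toy cross-dataset scenario: a detector tuned on emulated lab falls,
# evaluated against a long-term monitoring trace (mostly ADLs, few falls).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
se, sp, far = fall_detection_metrics(y_true, y_pred, monitored_hours=24.0)
print(f"sensitivity={se:.2f} specificity={sp:.2f} false alarms/h={far:.3f}")
```

The hourly false-alarm rate matters precisely because long-term monitoring traces contain vastly more ADL windows than falls, so even a high specificity can still translate into an impractical number of alarms per day.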