In this paper, we propose two scalable architectures (say, $\mathrm{Arc}_J$ and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]) that perform the discrete wavelet transform (DWT) of an $N_o$-sample sequence in only $N_o/2$ clock cycles. They are therefore at least twice as fast as the other known architectures. Their $AT^2$ parameter is also approximately 1/2 that of existing devices. This result is achieved by means of carefully balanced pipelining, and it has two 'faces.' First, $\mathrm{Arc}_J$ and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] can process data twice as fast as other architectures working at the same clock frequency (high-speed utilization). Second, they can run at half the clock frequency while matching the performance of other architectures. This second possibility allows the supply voltage and the power dissipation to be reduced by factors of two and four, respectively, with respect to other architectures (low-power utilization). As a final result, we show that a parallel architecture implementing an L-tap filter-based DWT with J decomposition levels, say $\mathrm{Arc}_{\mathrm{OPT}}(J, L)$, can be defined so as to achieve an excellent efficiency, say $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$, for any value of J and L. For instance, the average value of $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$, computed over a very wide set $\Sigma'$ of 'points' (J, L), is 99.1%. The minimum value of $\mathrm{eff}[\mathrm{Arc}_{\mathrm{OPT}}(J, L)]$ in $\Sigma'$ is 93.8%, and at all but five 'points' of $\Sigma'$ it is not lower than 96.9%.
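The speed and area-time figures quoted above can be related with a few lines of arithmetic. The following Python sketch is only illustrative and is not taken from the paper: the function names are hypothetical, the baseline is assumed to need exactly N_o clock cycles (the fastest case consistent with "at least twice as fast"), and both designs are assumed to run at the same clock frequency, so that the time ratio equals the cycle ratio.

```python
def cycles_proposed(n_samples: int) -> int:
    """Clock cycles quoted in the abstract for an N_o-sample input: N_o / 2."""
    return n_samples // 2


def cycles_baseline(n_samples: int) -> int:
    """ASSUMPTION: a baseline architecture needing N_o clock cycles."""
    return n_samples


def at2(area: float, time: float) -> float:
    """Area-time-squared figure of merit, AT^2 = A * T^2 (lower is better)."""
    return area * time ** 2


def implied_area_ratio(at2_ratio: float, time_ratio: float) -> float:
    """Area ratio A_new / A_old implied by given AT^2 and time ratios.

    From (A_new * T_new^2) / (A_old * T_old^2) = at2_ratio it follows that
    A_new / A_old = at2_ratio / time_ratio^2.
    """
    return at2_ratio / time_ratio ** 2


if __name__ == "__main__":
    n_o = 1024  # hypothetical sequence length, chosen only for illustration
    time_ratio = cycles_proposed(n_o) / cycles_baseline(n_o)
    print("time ratio:", time_ratio)
    print("implied area ratio:", implied_area_ratio(0.5, time_ratio))
```

Under these assumptions, halving the processing time alone would divide AT^2 by four, so an AT^2 ratio of roughly 1/2 corresponds to an area of about twice the baseline. If the baseline needs more than N_o cycles, the implied area ratio is correspondingly larger.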