As science becomes increasingly cross‐disciplinary and scientific models become increasingly cross‐coupled, standardized practices of model evaluation are more important than ever. For normally distributed data, mean squared error (MSE) is ideal as an objective measure of model performance, but it gives little insight into what aspects of model performance are "good" or "bad." This apparent weakness has led to a myriad of specialized error metrics, which are sometimes aggregated to form a composite score. Such scores are inherently subjective, however, and while their components may be interpretable, the composite itself is not. We contend that a better approach to model benchmarking and interpretation is to decompose MSE into interpretable components. To demonstrate the versatility of this approach, we outline some fundamental types of decomposition and apply them to predictions at 1,021 streamgages across the conterminous United States from three streamflow models. Through this demonstration, we hope to show that each component in a decomposition represents a distinct concept, like "season" or "variability," and that simple decompositions can be combined to represent more complex concepts, like "seasonal variability," creating an expressive language through which to interrogate models and data.

Plain Language Summary: Models are essential scientific tools for explaining and predicting phenomena ranging from weather and climate to health outcomes, economic development, and the origins of the universe, and testing competing models is one of the most basic scientific activities. Yet how scientists evaluate and justify their models can be inconsistent or even arbitrary. Traditionally, one performance metric—such as mean squared error—is used to identify the best model, but one metric provides little insight into what aspects of a model are "good" or "bad." This paper proposes a basic language for expressing different aspects of a model's performance. This is useful for determining which aspects of a model may require revision, but it also allows the modeler to separate out the best elements among several models and combine them to form an ensemble, analogous to how an audio engineer mixes together multiple tracks to form the best rendition of a musical piece.

Key Points:
Mean squared error (MSE) is an objective but somewhat enigmatic measure of model performance
MSE can be decomposed into components that quantify specific aspects of model performance, such as bias and variance
Mixing components among models yields a form of ensemble prediction
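To make the idea concrete, the sketch below shows one well-known additive decomposition of MSE into a bias term, a variability term, and a correlation term. It is only an illustration of the kind of decomposition the abstract describes, not the specific decompositions developed in the paper, and the function name and output labels are hypothetical.

```python
import numpy as np

def mse_decomposition(pred, obs):
    """Split MSE into bias, variability, and correlation components.

    Illustrative sketch of an additive MSE decomposition:
        MSE = (mean(p) - mean(o))^2 + (sd(p) - sd(o))^2 + 2*sd(p)*sd(o)*(1 - r)
    The paper's own decompositions (e.g., by season) may differ.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)

    bias2 = (pred.mean() - obs.mean()) ** 2        # mismatch in means ("bias")
    sp, so = pred.std(), obs.std()                 # population standard deviations
    var_term = (sp - so) ** 2                      # mismatch in variability
    r = np.corrcoef(pred, obs)[0, 1]
    corr_term = 2.0 * sp * so * (1.0 - r)          # penalty for imperfect correlation

    total = bias2 + var_term + corr_term           # equals np.mean((pred - obs) ** 2)
    return {"mse": total, "bias": bias2, "variability": var_term, "correlation": corr_term}
```

Each returned component is non-negative and the three sum to the overall MSE, so a modeler can see, for example, whether most of the error comes from a systematic bias or from under- or over-predicting variability.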