Big Data refers to datasets so large that they cannot be processed with standard algorithms. For this reason, the concept of cluster computing was created: the data is partitioned and processed across a computer cluster, so that each machine processes a part of the data and produces a partial solution which, combined with the partial solutions obtained from the other machines, yields the solution to the original problem. In this way, we sharply reduce the computational cost of the original problem by breaking it into small subproblems. This field is relatively new and still developing, and that is where this bachelor's thesis makes its contribution. The goal of this project is to develop an algorithm able to fit non-linear regression problems, an underdeveloped area of distributed machine learning. The only algorithms that can solve this kind of problem and are fully integrated into Spark's machine learning library [2] are Random Forests and Neural Networks; not even Support Vector Machines can do it, because no kernel implementations have been integrated yet. Our model addresses this problem with a Linear Regression Mixture Model. The idea is that, by combining several linear regressions, we can break a non-linear problem into several clusters, each of which can be solved by a linear algorithm. In this way we solve non-linear problems and, furthermore, do so in a simpler and more interpretable way than algorithms such as Random Forests or Neural Networks. The first part of this bachelor's thesis describes the technologies and algorithms needed to fully understand the work developed here. First, we explain what we need to know about computer clusters and distributed computing. With a computer cluster we obtain excellent computing capacity without the need for a supercomputer.
The problem is that every computer in the cluster holds only a small part of the total data, so not every algorithm can be parallelized. For an algorithm to be parallelizable, we must check whether it needs all the data at once to obtain the result, or whether it can be applied to partitions of the data. If it can be applied to small parts of the dataset, then the algorithm is parallelizable. This will be an important point in our work, because some algorithms cannot be parallelized and we will have to use alternative, parallelizable ones. In this part we also give a brief introduction to Apache Spark through its Python interface, PySpark, and explain its most important aspects. One key aspect of Spark is the distinction between the different kinds of nodes in the cluster:

Driver: manages the resources of the cluster and schedules all the tasks to be performed.
Workers: perform the different transformations and actions over the partitions of the dataset.

We also explain RDDs, or Resilient Distributed Datasets, which are the basic parallelizable elements in Apache Spark, and the two methods most important for the development of our algorithm:

map(): applies a function, passed as a parameter (usually a lambda function), to every element of the RDD individually, and returns a new RDD with the transformed data.
reduce(): repeatedly applies a binary, commutative and associative operation, passed as a parameter, to pairs of values, accumulating the result until a single value remains; this value is returned to the driver.

After that, we introduce the family of optimization algorithms based on gradient descent. The simplest one is Gradient Descent itself, an iterative method that, at each step, produces a better solution to the problem until it converges to a local minimum.
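The map()/reduce() semantics described above can be imitated with plain Python built-ins, so no Spark installation is needed for this sketch; in PySpark the equivalent calls would be rdd.map(...) and rdd.reduce(...):

```python
from functools import reduce

# Toy stand-in for a partitioned RDD of numbers
data = [1, 2, 3, 4, 5]

# map(): apply a lambda to every element individually,
# producing a new (transformed) collection
squared = list(map(lambda x: x * x, data))  # [1, 4, 9, 16, 25]

# reduce(): fold the elements with a binary, commutative and
# associative operation (here, addition) down to a single value,
# which Spark would send back to the driver
total = reduce(lambda a, b: a + b, squared)  # 55

print(squared, total)
```

Commutativity and associativity matter because Spark applies the reduce operation within and across partitions in no guaranteed order.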
At each step, it computes the gradient of the loss function. The idea is that the gradient gives the direction in which the slope is steepest, so if we move in the opposite direction we will end up at a local minimum. Here, the magnitude of the move, or step size, is critical: if the step size is too large the algorithm diverges, and if it is too small convergence takes too long. Once the basic gradient descent algorithm has been explained, we present evolutions that better fit our problem and, in general, the distributed computing paradigm: Stochastic Gradient Descent and Mini-Batch Gradient Descent. They take a small sample of the dataset (a single point, in the case of Stochastic Gradient Descent) and compute a subgradient whose expected value equals the gradient. The subgradient is not as precise as the full gradient, but its computing time is much lower, which is critical when dealing with such large datasets. Moreover, Mini-Batch Gradient Descent, the variant we will use here, usually converges very close to the point at which standard Gradient Descent would have converged. In the next part, we explain the Machine Learning algorithms relevant to the development of this bachelor's thesis. The first is Linear Regression, which, given a dataset, finds a linear combination of a given set of functions (the simplest case being a line) by minimizing a loss function, usually the Mean Squared Error (MSE). Big Data is a concept referring to databases so large that traditional data-analysis applications are unable to process them. Hence the concept of distributed learning arises.
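As an illustration, the mini-batch update for a linear regression trained with the MSE loss can be sketched in a few lines of Python. This is a single-machine toy with made-up data and a hand-picked step size; the distributed version would compute each mini-batch subgradient with map()/reduce() over the RDD partitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = 2*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
t = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.05, size=200)

# Design matrix with a bias column
Phi = np.hstack([np.ones((200, 1)), X])
w = np.zeros(2)        # parameters [bias, slope]
eta = 0.1              # step size (critical, as discussed above)
batch_size = 20

for epoch in range(200):
    idx = rng.permutation(200)
    for start in range(0, 200, batch_size):
        b = idx[start:start + batch_size]
        # Subgradient of the MSE on the mini-batch:
        # grad = (2/|B|) * Phi_B^T (Phi_B w - t_B)
        err = Phi[b] @ w - t[b]
        grad = 2.0 * Phi[b].T @ err / len(b)
        w -= eta * grad  # move against the (sub)gradient

print(w)  # should end up close to [1.0, 2.0]
```

Each mini-batch gradient is noisy, but over many updates the iterates settle near the minimizer that full-batch Gradient Descent would reach, at a fraction of the per-step cost.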
In it, the data are distributed across different storage systems and processors within a cluster; in this way, each processor works in parallel on a subset of the data and, by combining the partial outputs of each processor, the complexity, cost and computation time of processing the data are reduced [1]. However, a distributed implementation considerably limits the kind of learning algorithms that can be used, since their formulation must fit the MapReduce paradigm. For example, if we look at the classification and regression algorithms available in the MLlib library [2], all of them use linear implementations based on the Gradient Descent (GD) method. The objective of this project is to obtain a distributed implementation of non-linear regression algorithms. To do so, we will use Linear Regression Mixture Models (LRMM) [3]. In this way, we solve a non-linear problem as a mixture of linear problems, while gaining interpretability in the results, with an algorithm fully integrated into the MapReduce model on which Apache Spark is built. Figure 1 shows an example of a problem with input data x and their respective observed values t. These do not follow a linear relationship, so it would be unthinkable to fit them with a single linear regression mapping the inputs to the outputs, since the results would be unacceptable in any application. However, if we take small portions of the dataset, we see that those subsets can be approximated quite accurately by several components, each of which is a different linear regression.
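The mixture-of-linear-regressions idea can be sketched with a minimal EM-style loop on toy piecewise-linear data. This is a hypothetical single-machine sketch, not the distributed implementation developed in this thesis; the number of components K, the noise model, and the initial lines are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two linear segments ("tent" shape) plus noise
x = rng.uniform(0, 2, size=300)
t = np.where(x < 1, 2.0 * x, 4.0 - 2.0 * x) + rng.normal(0, 0.05, size=300)

Phi = np.stack([np.ones_like(x), x], axis=1)   # bias + slope features
K = 2                                          # assumed number of components
W = np.array([[0.0, 1.0], [2.0, -1.0]])        # illustrative initial lines
pi = np.full(K, 1.0 / K)                       # mixing proportions
sigma2 = 1.0                                   # common noise variance

for it in range(100):
    # E-step: responsibility of each linear component for each point,
    # from a Gaussian likelihood around each component's prediction
    means = Phi @ W.T                           # (N, K) predicted means
    logp = -0.5 * (t[:, None] - means) ** 2 / sigma2 + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)     # for numerical stability
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted least squares per component
    for k in range(K):
        Rk = r[:, k]
        A = Phi.T @ (Rk[:, None] * Phi)
        b = Phi.T @ (Rk * t)
        W[k] = np.linalg.solve(A, b)
    pi = r.mean(axis=0)
    sigma2 = float(np.sum(r * (t[:, None] - Phi @ W.T) ** 2) / len(t))

print(np.round(W, 2), np.round(pi, 2))
```

Each component converges to one of the two underlying lines, so the non-linear tent-shaped dataset is fitted as a mixture of purely linear regressions. Both the E-step and the weighted sums of the M-step decompose over data points, which is what makes a MapReduce formulation possible.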
Thus, by mixing three linear regressions, we have solved quite satisfactorily a regression problem that could not be solved with a simple linear regression ...

Ingeniería de Sistemas Audiovisuales