Modern developments in science and technology require mathematical models of ever larger dimension. These models pose challenging mathematical tasks that can be tackled using modern numerical methods implemented on state-of-the-art hardware and software tools. Nevertheless, the models may still be too large for numerical simulation, and there is a need to reduce their dimension. For instance, in the case of the LTI dynamical system
\begin{equation}\label{eq:LTI}
\dot{x}(t) = A x(t) + Bu(t), \;\; y(t)=Cx(t)
\end{equation}
with the system matrix $A \in \Real^{n \times n}$, where $n$ can be larger than $10^5$, the input matrix $B \in \Real^{n \times m}$, and the output matrix $C \in \Real^{p \times n}$, where $m, p \ll n$, one approach is to approximate the system's transfer function
\begin{align}\label{intro:eq:TF}
\nice{G}(s) = C (sI-A)^{-1} B, \;\; s\in\mathbf{i}\Real,
\end{align}
by a more convenient rational function $\nice{G}_r(s)$ of order $r$ with $r\ll n$. One such method is the Iterative Rational Krylov Algorithm (IRKA; S. Gugercin, A. C. Antoulas, and C. A. Beattie, SIMAX 2008), which finds an approximation $\nice{G}_r(s)$ that is locally optimal in the norm of the Hardy space $\mathcal{H}_2$. At each iteration, $2r$ shifted linear systems have to be solved in order to obtain the search and the test spaces. Moreover, since IRKA finds only a locally optimal approximation, in the process of finding an appropriate reduced order model a user may repeat the computation with different initial shifts (hoping to find a better local minimum of the approximation error), and perhaps with several values of $r$. Altogether, this design process may require the solution of several thousand shifted linear systems. Even the simple task of graphing the frequency response (Bode plot), i.e., the values of the transfer function $\nice{G}(s)$, requires evaluating $(A - \sigma_j I)^{-1}B$ for many values $\sigma_j = \mathbf{i}\omega_j$, $\omega_j\in\Real$. Already for moderately large $n$, $m$, $p$, this mere function evaluation may take an unacceptably long time.

Regardless of hardware capabilities, problems of medium size remain in focus. In some applications the matrix $A$ is not sparse, and its dimension $n$ is not extremely large, say in the tens of thousands at most. In such a situation, the total number of shifted systems to be solved, the required accuracy of the solution, and the computing platform (e.g., massively parallel hardware, available optimized libraries) may motivate the development of direct methods.

In our earlier work, we used the \emph{controller Hessenberg form} for the efficient solution of a large number of shifted linear systems. The reduction to this form exploits a parallel environment very successfully and provides a convenient starting point for the linear solver in the second phase by reducing the number of operations and the memory traffic. The shifted linear systems for the different shifts are then solved simultaneously. Since modern computers allow numerical algorithms to run in parallel on both the CPU and the GPU, we adapted this approach to a hybrid CPU+GPU setting: the controller Hessenberg form is computed by splitting the workload between the CPU and the GPU, while the second phase is carried out entirely on the massively parallel GPU.
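To illustrate the two-phase idea on which this approach rests, the following sketch (in Python/NumPy, with the hypothetical name \texttt{solve\_shifted\_systems}) solves $(A-\sigma_j I)X_j = B$ for many shifts by performing a single similarity reduction of $A$ and reusing it for every shift. It is only a simplified illustration: it uses the ordinary Hessenberg form of $A$ instead of the controller Hessenberg form of the pair $(A,B)$, it runs on the CPU only, and for brevity it calls a generic dense solver in the second phase, whereas a solver exploiting the Hessenberg structure (e.g., via Givens rotations) would reduce the cost per shift to $\mathcal{O}(n^2)$.
\begin{verbatim}
# Simplified illustration only: one Hessenberg reduction of A, reused for all shifts.
import numpy as np
from scipy.linalg import hessenberg, solve

def solve_shifted_systems(A, B, shifts):
    n = A.shape[0]
    I = np.eye(n)
    # Phase 1 (done once): reduction A = Q H Q^*, with H upper Hessenberg.
    H, Q = hessenberg(A, calc_q=True)
    Bh = Q.conj().T @ B            # transform the right-hand side once
    X = []
    for sigma in shifts:
        # Phase 2 (per shift): H - sigma*I is again upper Hessenberg;
        # a generic dense solver is used here only for brevity.
        Z = solve(H - sigma * I, Bh)
        X.append(Q @ Z)            # map the solution back to the original coordinates
    return X
\end{verbatim}
In our approach, the analogue of phase 1 is the part whose workload is split between the CPU and the GPU, while the per-shift solves of phase 2 are performed simultaneously for many shifts on the GPU.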
Besides the efficient BLAS 3 operations, we make further use of independent operations that can be carried out simultaneously for different shifts. We will present our IRKA implementation, which is, to the best of our knowledge, the first one based on parallel solvers, making it highly parallelizable and efficient.
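To indicate where the shifted solves arise in IRKA, we include a minimal sketch of the basic fixed-point iteration for the SISO case ($m=p=1$). The function name \texttt{irka\_siso}, the stopping criterion, and the omission of the tangential directions needed in the MIMO case are simplifications introduced here for illustration; all safeguards, the parallelization, and the controller Hessenberg based solver of our implementation are left out.
\begin{verbatim}
# Minimal SISO sketch of the IRKA fixed-point iteration (illustration only).
import numpy as np
from scipy.linalg import solve, orth

def irka_siso(A, b, c, shifts, maxit=50, tol=1e-6):
    # A: n x n, b: n x 1, c: 1 x n; shifts: r initial interpolation points.
    n = A.shape[0]
    I = np.eye(n)
    sigma = np.sort_complex(np.asarray(shifts, dtype=complex))
    for _ in range(maxit):
        # 2r shifted solves per iteration: r with A for the search space V,
        # and r with A^* for the test space W.
        V = orth(np.hstack([solve(s * I - A, b) for s in sigma]))
        W = orth(np.hstack([solve(s * I - A.conj().T, c.conj().T) for s in sigma]))
        Ar = solve(W.conj().T @ V, W.conj().T @ A @ V)        # (W^*V)^{-1} W^* A V
        sigma_new = np.sort_complex(-np.linalg.eigvals(Ar))   # mirrored reduced poles
        done = np.max(np.abs(sigma_new - sigma)) < tol
        sigma = sigma_new
        if done:
            break
    # Petrov-Galerkin reduced-order model G_r(s) = c_r (s I - A_r)^{-1} b_r
    br = solve(W.conj().T @ V, W.conj().T @ b)
    cr = c @ V
    return Ar, br, cr
\end{verbatim}
Each pass through the loop requires $r$ solves with $A$ and $r$ solves with $A^{*}$, i.e., the $2r$ shifted systems mentioned above; these are precisely the systems that our implementation delegates to the parallel two-phase solver.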