What’s known on the subject? and What does the study add? When searching raw genome expression data for useful candidates or subsets of candidates, the main problem is the lack of a standard workflow. Until now, no specific biostatistical method has been defined as a standard to minimize the divergence of results. Recently, a new method named GSEA (gene set enrichment analysis) was proposed by Subramanian et al. (2005) to integrate knowledge from databases with large-scale expression data. GSEA delivers not single candidates but whole sets of functionally related genes, by comparing expression data with selected gene sets from gene set databases. We used the GSEA method to analyze differential expression profiles in renal cell carcinoma, with the aim of establishing biomarkers suitable for prognostication at the time of renal surgery. GSEA identified 700 gene sets. Of these, the 120 sets with the highest leading edge subset were selected, followed by hierarchical clustering of G1 versus G3. Among these 120 gene sets, comparative analysis using the MSigDB data bank for pathway network and Gene Ontology (GO) terms identified 16 gene sets that were strongly differentially over- or underexpressed in G3 versus G1 tumours. OBJECTIVE Improving the workflow for standardizing statistical interpretation provides an opportunity for the analysis of gene expression in clear cell renal cell carcinoma (ccRCC). RCC, as a solid tumour entity, represents a very suitable tumour model for such investigations. Although expression profiles can be investigated by microarray technologies, the main problem is how to adequately interpret the mass of data that these technologies generate. There is a clear lack of a defined, consistent and comparable biostatistical analysis system, and no specific standard biostatistical methodology is available to compare the results of microarray analyses. 
We used the gene set enrichment analysis (GSEA) method to analyze microarray data from RCC tissue. The present study aimed to analyze differential expression profiles and to establish biomarkers suitable for prognostication at the time of renal surgery by comparing samples from RCC patients with long-term survival against samples from patients with poorly differentiated (grade 3) RCC, concomitant metastatic disease and short survival. PATIENTS AND METHODS In the present study, a total of 29 ccRCC fresh-frozen tissue samples were used: 14 samples from grade 1 (G1) RCC patients without metastatic disease and 15 from grade 3 (G3) RCC patients with synchronous metastatic disease. Expression profiling was performed with the Human Genome U133 Plus 2.0 Array (Affymetrix Corp., Santa Clara, CA, USA). Clinical data and long-term follow-up were obtained for all patients. The primary probe-level analysis was performed using the Affymetrix MAS 5 algorithm. Further statistical processing was carried out by GSEA, using the Molecular Signatures Database, MSigDB (http://www.broad.mit.edu/gsea/msigdb/index.jsp). After selecting the gene sets with the highest leading edge subsets, a cluster analysis and a further analysis based on the MSigDB data bank were performed. RESULTS In total, 15 poorly differentiated G3 ccRCC, 14 well differentiated G1 ccRCC and 14 normal renal tissue samples were analyzed for comparative gene expression profiling. Of the 15 G3 ccRCC patients, 12 had synchronous metastatic disease at the time of surgery (pN+ and/or distant metastases: pN+ only = 4, M+ only = 11 and pN+M+ = 3). GSEA identified 700 gene sets. Of these, the 120 sets with the highest leading edge subset were selected, followed by hierarchical clustering of G1 vs G3. 
Comparative analysis using the MSigDB data bank for pathway network identified 16 gene sets that were strongly differentially over- or underexpressed in G3 vs G1 tumours. These gene sets are involved in various aspects of tumour physiology, such as metastasis and cell motility, signalling and cell proliferation, and include gene products that are involved in building the extracellular matrix or that serve as cell surface markers. CONCLUSIONS We analyzed microarray data of gene expression in ccRCC, comparing poorly differentiated and well differentiated tumour tissue samples. Using GSEA, we found a number of gene set candidates relevant to highly complex biological network processes; conspicuously, these comprised members of the interleukin and chemokine families, cyclin-dependent kinases, angiogenic growth factors and transcription factors. This suggests that, in poorly differentiated aggressive ccRCC, a limited number of gene sets may be responsible for the very aggressive biological behaviour. This comparison, performed at the gene set level, enables the identification of such congruency between different gene sets and whole data sets with respect to a specific biological question. Embedding GSEA in a statistical workflow for the suitable preparation of expression data may improve the analysis and help to avoid missing changes at the molecular level. A systematic approach such as GSEA is clearly needed to analyze raw data from microarray analyses, although these data can only be descriptive and the mass of raw data is derived from a relatively small number of tissue samples. However, consistent alterations of gene expression found in specific tumour entities may allow a better understanding of certain aspects of specific tumour biology. 
Therefore, the molecular characterization of individual tumours may be useful for a better individual assessment of prognosis and, ultimately, the identification of biomarkers and targets for specific treatments may help to improve treatment.
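For readers unfamiliar with the terms "enrichment" and "leading edge subset" used above, the following is a minimal sketch of the running-sum enrichment score generally described for GSEA. It is not the pipeline used in the study and not the GSEA software itself. The function name `enrichment_score`, its parameters and the toy gene labels are hypothetical, the ranking metric is assumed to be supplied by the caller, and significance testing by permutation is omitted.

```python
from typing import Dict, List, Set, Tuple


def enrichment_score(
    ranked_genes: List[str],
    scores: Dict[str, float],
    gene_set: Set[str],
    p: float = 1.0,
) -> Tuple[float, List[str]]:
    """Return (ES, leading_edge) for one gene set on a ranked gene list.

    Assumptions (illustrative only): `ranked_genes` is already sorted by
    descending ranking metric (e.g. G3 vs G1), and `scores` maps every
    gene in `ranked_genes` to that metric.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    # Normalising constant: total weight of the genes that belong to the set.
    hit_total = sum(
        abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h
    )
    if hit_total == 0:
        raise ValueError("all gene-set members have a zero ranking metric")

    running = 0.0
    best = 0.0
    best_idx = -1
    for i, (gene, is_hit) in enumerate(zip(ranked_genes, hits)):
        if is_hit:
            running += abs(scores[gene]) ** p / hit_total  # weighted step up
        else:
            running -= 1.0 / n_miss  # uniform step down
        if abs(running) > abs(best):
            best, best_idx = running, i  # track the maximum deviation

    # Leading edge: set members ranked up to the peak (positive ES)
    # or from the trough onwards (negative ES).
    if best >= 0:
        window = ranked_genes[: best_idx + 1]
    else:
        window = ranked_genes[best_idx:]
    leading_edge = [g for g in window if g in gene_set]
    return best, leading_edge


# Toy example with made-up gene labels and metric values.
ranked = ["A", "B", "C", "D", "E", "F"]
metric = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}
es_up, edge_up = enrichment_score(ranked, metric, {"A", "B"})
es_down, edge_down = enrichment_score(ranked, metric, {"E", "F"})
```

In this toy example the set {A, B} sits at the top of the ranking and receives a positive score, whereas {E, F} sits at the bottom and receives a negative one; in each case the leading edge is the part of the set that contributes to the peak deviation.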