Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR [version 2; peer review: 2 approved]

Authors :
Sebastian Beier
Anne Fiebig
Cyril Pommier
Isuru Liyanage
Matthias Lange
Paul J. Kersey
Stephan Weise
Richard Finkers
Baron Koylass
Timothee Cezard
Mélanie Courtot
Bruno Contreras-Moreira
Guy Naamati
Sarah Dyer
Uwe Scholz
Author Affiliations :
1. Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, 06466, Germany
2. Institute of Bio- and Geosciences, Bioinformatics (IBG-4), Forschungszentrum Jülich GmbH, Jülich, 52425, Germany
3. BioinfOmics, Plant bioinformatics facility, Université Paris-Saclay, INRAE, Versailles, France
4. European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
5. Royal Botanic Gardens, Kew, Richmond, UK
6. Plant Breeding, Wageningen University & Research, Wageningen, The Netherlands
7. Gennovation B.V., Wageningen, The Netherlands
8. Ontario Institute for Cancer Research, Toronto, Canada
9. Laboratorio de Biología Computacional y Estructural, Estación Experimental Aula Dei-CSIC, Zaragoza, 50059, Spain
Source :
F1000Research. 11:ELIXIR-231
Publication Year :
2022
Publisher :
London, UK: F1000 Research Limited, 2022.

Abstract

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous, machine-actionable data flow, generic elements need to be further specified. We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. They form a basis for the VCF extensions proposed here. We have learned from the existing application of VCF that the definition of relevant metadata using controlled standards and vocabularies, and the consistent use of cross-references via resolvable (machine-readable) identifiers, are particularly necessary, and we propose their encoding. VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently have the reach of VCF. For the sake of simplicity, we will only discuss VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields that have not been agreed upon within the community are frequently added, which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.

Details

ISSN :
20461402
Volume :
11
Database :
F1000Research
Journal :
F1000Research
Notes :
Revised. Amendments from Version 1: In version 2 of this article, we have revised the Abstract and added larger sections to both the Introduction and the Conclusion. In particular, we have addressed the reviewers' comments on the introduction of the VCF recommendation in the broader community as well as various aspects of the FAIRness of the adapted metadata. Throughout the article, we have adjusted and clarified some unclear passages and taken greater care in the correct designation of pronouns and gender-neutral language. We have also submitted a sample dataset to EVA that meets the VCF metadata specifications in this article and added guidance in the FAIR Cookbook on submitting genomic and genotypic data to EMBL-EBI. [version 2; peer review: 2 approved]
Publication Type :
Academic Journal
Accession number :
edsfor.10.12688.f1000research.109080.2
Document Type :
opinion-article
Full Text :
https://doi.org/10.12688/f1000research.109080.2