1. Enabling simple access to a data lake both from HPC and Cloud using Kerchunk and Intake
- Author
- Thierry Carval, Erwan Bodere, Julien Meillon, Mathiew Woillez, Jean Francois Le Roux, Justus Magin, and Tina Odaka
- Abstract
We are experimenting with hybrid access from Cloud and HPC environments, using the Pangeo platform to exploit a data lake hosted on the HPC infrastructure “DATARMOR”. DATARMOR hosts ODATIS services (https://www.odatis-ocean.fr) and is located at the “Pôle de Calcul et de Données pour la Mer” at IFREMER. Its parallel file system has disk space dedicated to shared data, called “dataref”. Users of DATARMOR can access these data, and some of them are cataloged by the Sextant service (https://sextant.ifremer.fr/Ressources/Liste-des-catalogues-thematiques/Datarmor-Donnees-de-reference) and are open and accessible from the internet, without duplicating the data. In the cloud environment, the ability to access files in parallel is essential for speeding up calculations. The Zarr format (https://zarr.readthedocs.io) enables parallel access to data sets, as it consists of numerous chunked “object data” files and some “metadata” files. Although it allows many simultaneous accesses, it remains simple to use, since all the data collections stored in a Zarr store are accessible through a single access point. For HPC centers, however, the numerous “object data” files create a lot of metadata on parallel file systems, which slows data access. Recent progress in the development of Kerchunk (https://fsspec.github.io/kerchunk/), which recognizes the chunks in a file (e.g. NetCDF / HDF5) as Zarr chunks and can present a series of files as one Zarr data set, is solving these technical difficulties in our PANGEO use cases at DATARMOR. Thanks to Kerchunk and Intake (https://intake.readthedocs.io/), it is now possible to use the different data sets stored on DATARMOR efficiently and simply. We are further experimenting with this workflow by running the same use cases on the PANGEO-EOSC cloud. There we use the same data stored in the data lake at DATARMOR, accessed through Kerchunk references and an Intake catalog via ODATIS, without duplicating the source data.
In the presentation we will share our recent experiences from these experiments.
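The idea behind treating chunks inside an existing file as Zarr chunks can be illustrated with a small, standard-library-only sketch. It is not the Kerchunk API and not the authors' workflow: the file layout, the variable name `temp`, and the helper names `write_demo_file`, `build_references` and `read_chunk` are all hypothetical. The dictionary it builds only mirrors the general shape of a Kerchunk-style reference set, in which each chunk key maps to a path, a byte offset and a length, so that a reader fetches byte ranges from the original file instead of creating many small chunk files.

```python
import json
import os
import tempfile


def write_demo_file(path):
    """Write a stand-in for a NetCDF/HDF5 file: a header followed by two raw chunks.

    Returns the header length and the two chunk lengths (all hypothetical values).
    """
    header = b"HEADER--"            # 8 bytes of pretend file metadata
    chunk0 = bytes([1, 2, 3, 4])    # first data chunk
    chunk1 = bytes([5, 6, 7, 8])    # second data chunk
    with open(path, "wb") as f:
        f.write(header + chunk0 + chunk1)
    return len(header), len(chunk0), len(chunk1)


def build_references(path, header_len, chunk_lens, var="temp"):
    """Map Zarr-like chunk keys to [path, offset, length] in the original file."""
    refs = {}
    offset = header_len
    for i, length in enumerate(chunk_lens):
        refs[f"{var}/{i}"] = [path, offset, length]
        offset += length
    return {"version": 1, "refs": refs}


def read_chunk(references, key):
    """Fetch one chunk by reading only its byte range from the original file."""
    path, offset, length = references["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        header_len, len0, len1 = write_demo_file(demo_path)
        references = build_references(demo_path, header_len, [len0, len1])
        print(json.dumps(references["refs"], indent=2))
        print(list(read_chunk(references, "temp/1")))
```

Because the references are a small JSON-serializable dictionary, the original file stays untouched and no additional chunk files are created on the parallel file system, which is the property the abstract relies on.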
- Published
- 2023