
O-15 Occupational Health: A Multi-Cohort Job Title Cleaning Project by Algorithm

Authors :
Christopher J. O. Baker
Mohammad Sadnan Al Manir
Ellen Sweeney
Deborah Addey
Cheryl Peters
Jason Hicks
Anil Adisesh
Jennifer Vena
Grace Shen Tu
Yunsong Cui
Source :
Oral Presentations.
Publication Year :
2021
Publisher :
BMJ Publishing Group Ltd, 2021.

Abstract

Introduction Occupational data in prospective cohort studies are often underutilized because of the human and financial resources required to code open-ended text, such as job titles. Recognizing the value of occupational data in health research, as well as the potential for error in manual coding, the Automated Coding Algorithm (ACA)-NOC was developed using a Natural Language Processing approach.

Objectives We tested the ACA-NOC algorithm on two regional cohorts of a pan-Canadian cohort study, which represents the largest dataset to which an algorithm of this kind has been applied. This process will harmonize and greatly expand the utility of the occupational data, enrich the research platforms, and further improve the efficiency of the algorithm.

Methods The ACA-NOC algorithm was tested on data from the Canadian Partnership for Tomorrow’s Health (CanPath), a longitudinal cohort examining the role of genetic, environmental, lifestyle, and behavioural factors in the development of cancer and chronic disease. Using an iterative and interactive approach, the algorithm was applied to job title data from 111,000 questionnaires from two regional cohorts, coding the data to the Canadian National Occupational Classification (NOC) system. The algorithm was refined after each round of analysis, increasing the quantity of accurately coded data.

Results The ACA-NOC algorithm could be refined to achieve a 10% overall improvement in exact matching over the baseline algorithm. There were also instances where the algorithm outperformed manual coding. Using the algorithm offers significant savings in time, human resources, and cost compared with a solely manual coding approach.

Conclusions The coding and harmonization of these multi-cohort data demonstrate the value of the ACA-NOC algorithm, while increasing the utility of the CanPath data and of research related to occupational health. Future research may involve comparisons between CanPath and international cohorts.
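
The abstract reports performance as exact matching between algorithm-assigned and manually assigned codes. As a rough illustration only, the sketch below shows one way such an exact-match rate could be computed and compared across refinement rounds. It is not the ACA-NOC implementation: the function names, the title normalization rule, and the placeholder codes (`A001`, `B002`, and so on, which are not real NOC codes) are all assumptions made for this example.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a free-text job title, drop punctuation, collapse whitespace.

    Hypothetical cleaning step; the abstract does not describe the real one.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def exact_match_rate(predicted: list[str], manual: list[str]) -> float:
    """Fraction of records whose predicted code equals the manual code."""
    if len(predicted) != len(manual):
        raise ValueError("predicted and manual must have the same length")
    if not predicted:
        return 0.0
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(predicted)


# Made-up placeholder codes, purely illustrative (not study data).
manual_codes = ["A001", "B002", "C003", "D004"]
baseline_codes = ["A001", "B002", "X999", "Y888"]
refined_codes = ["A001", "B002", "C003", "Y888"]

baseline_rate = exact_match_rate(baseline_codes, manual_codes)
refined_rate = exact_match_rate(refined_codes, manual_codes)
improvement = refined_rate - baseline_rate
```

Comparing `refined_rate` with `baseline_rate` after each round mirrors the iterative refinement described in the Methods, under the assumption that manual codes serve as the reference standard.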

Details

Database :
OpenAIRE
Journal :
Oral Presentations
Accession number :
edsair.doi...........4bc402b5c4b0f5c4f47e963e0e5519a4