Jia Li, Minghui Liu, Xiaojun Li, Xuan Liu, and Jingfang Liu
Background: Web-based physician reviews are invaluable gold mines that merit further investigation. Although many studies have explored the text of physician reviews, very few have focused on developing a systematic taxonomy of the topics embedded in them. The first step toward mining physician reviews is to determine the natural structure, or dimensions, embedded in the reviews. Therefore, the topic taxonomy should be developed rigorously and systematically.
Objective: This study aims to develop a hierarchical topic taxonomy to uncover the latent structure of physician reviews and to illustrate its application for mining patients' interests based on the proposed taxonomy and algorithm.
Methods: The data comprised 122,716 physician reviews covering 8501 doctors from a leading physician review website in China (haodf.com), collected between 2007 and 2015. Mixed methods, including a literature review, data-driven topic discovery, and human annotation, were used to develop the physician review topic taxonomy.
Results: The identified taxonomy included 3 domains (high-level categories) and 9 subtopics (low-level categories). The physician-related domain included the categories of medical ethics, medical competence, communication skills, and medical advice and prescription. The patient-related domain included the categories of patient profile, symptoms, and diagnosis and pathogenesis. The system-related domain included the categories of financing and operation process. The F-measure of the proposed classification algorithm reached 0.816 on average. Symptoms (Cohen d=1.58, Δu=0.216, t=229.75, and P<.001) were more often mentioned by patients with acute diseases, whereas communication skills (Cohen d=-0.29, Δu=-0.038, t=-42.01, and P<.001), financing (Cohen d=-0.68, Δu=-0.098, t=-99.26, and P<.001), and diagnosis and pathogenesis (Cohen d=-0.55, Δu=-0.078, t=-80.09, and P<.001) were more often mentioned by patients with chronic diseases.
Patients with mild diseases were more interested in medical ethics (Cohen d=0.25, Δu=0.039, t=8.33, and P<.001), operation process (Cohen d=0.57, Δu=0.060, t=18.75, and P<.001), patient profile (Cohen d=1.19, Δu=0.132, t=39.33, and P<.001), and symptoms (Cohen d=1.91, Δu=0.274, t=62.82, and P<.001). Meanwhile, patients with serious diseases were more interested in medical competence (Cohen d=-0.99, Δu=-0.165, t=-32.58, and P<.001), medical advice and prescription (Cohen d=-0.65, Δu=-0.082, t=-21.45, and P<.001), financing (Cohen d=-0.26, Δu=-0.018, t=-8.45, and P<.001), and diagnosis and pathogenesis (Cohen d=-1.55, Δu=-0.229, t=-50.93, and P<.001).
Conclusions: This mixed-methods approach, integrating a literature review, data-driven topic discovery, and human annotation, is an effective and rigorous way to develop a physician review topic taxonomy. The proposed algorithm based on Labeled Latent Dirichlet Allocation can achieve impressive classification results for mining patients' interests. Furthermore, the mining results reveal marked differences in patients' interests across different disease types, socioeconomic development levels, and hospital levels.
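Illustration: The group comparisons reported in the Results (Cohen d, Δu, and t) could be computed along the following lines. This is a minimal sketch and not the authors' code. It assumes that Δu denotes the difference in mean topic proportion between two patient groups, that Cohen d uses the pooled standard deviation, and that t is a Welch (unequal-variance) statistic; the abstract does not specify which t test was used. The function name `compare_topic_share` and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def compare_topic_share(group_a, group_b):
    """Compare per-review topic proportions between two groups.

    Returns (mean difference, Cohen's d, Welch t statistic).
    Assumption: each list holds one topic's share per review, for one group.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Sample variances (denominator n - 1).
    var_a, var_b = variance(group_a), variance(group_b)

    # Difference in group means (interpreted here as the abstract's Δu).
    delta = mean_a - mean_b

    # Cohen's d: mean difference scaled by the pooled standard deviation.
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    cohen_d = delta / pooled_sd

    # Welch t statistic: does not assume equal variances in the two groups.
    welch_t = delta / math.sqrt(var_a / n_a + var_b / n_b)
    return delta, cohen_d, welch_t


# Toy data (hypothetical): share of the "symptoms" topic in each review.
acute = [0.30, 0.35, 0.40, 0.45, 0.50]
chronic = [0.10, 0.15, 0.20, 0.25, 0.30]

delta, cohen_d, welch_t = compare_topic_share(acute, chronic)
print(f"delta={delta:.3f}, d={cohen_d:.2f}, t={welch_t:.2f}")
```

Under this reading, a positive d for a topic means the first group mentions it more often, which matches the sign convention in the Results (positive for acute or mild diseases, negative for chronic or serious diseases).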