Honors & Awards
National Award for Innovative Technology Development, Government of India (December 2005)
View details for DOI 10.2196/11315
By recent estimates, the steady rise in health care costs has deprived more than 45 million Americans of health care services and has encouraged health care providers to better understand the key drivers of health care utilization from a population health management perspective. Prior studies suggest the feasibility of mining population-level patterns of health care resource utilization from observational analysis of Internet search logs; however, the utility of the endeavor to the various stakeholders in a health ecosystem remains unclear.The aim was to carry out a closed-loop evaluation of the utility of health care use predictions using the conversion rates of advertisements that were displayed to the predicted future utilizers as a surrogate. The statistical models to predict the probability of user's future visit to a medical facility were built using effective predictors of health care resource utilization, extracted from a deidentified dataset of geotagged mobile Internet search logs representing searches made by users of the Baidu search engine between March 2015 and May 2015.We inferred presence within the geofence of a medical facility from location and duration information from users' search logs and putatively assigned medical facility visit labels to qualifying search logs. We constructed a matrix of general, semantic, and location-based features from search logs of users that had 42 or more search days preceding a medical facility visit as well as from search logs of users that had no medical visits and trained statistical learners for predicting future medical visits. We then carried out a closed-loop evaluation of the utility of health care use predictions using the show conversion rates of advertisements displayed to the predicted future utilizers. In the context of behaviorally targeted advertising, wherein health care providers are interested in minimizing their cost per conversion, the association between show conversion rate and predicted utilization score, served as a surrogate measure of the model's utility.We obtained the highest area under the curve (0.796) in medical visit prediction with our random forests model and daywise features. Ablating feature categories one at a time showed that the model performance worsened the most when location features were dropped. An online evaluation in which advertisements were served to users who had a high predicted probability of a future medical visit showed a 3.96% increase in the show conversion rate.Results from our experiments done in a research setting suggest that it is possible to accurately predict future patient visits from geotagged mobile search logs. Results from the offline and online experiments on the utility of health utilization predictions suggest that such prediction can have utility for health care providers.
View details for DOI 10.2196/jmir.6240
View details for Web of Science ID 000384107200020
View details for PubMedID 27655225
View details for PubMedCentralID PMC5052461
Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90, 0.89, and 0.86, 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96, 0.92 and 0.84, 0.87, respectively.We have demonstrated feasibility of learning phenotype models using imperfectly labeled data for a chronic and acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.
View details for DOI 10.1093/jamia/ocw028
View details for PubMedID 27174893
View details for PubMedCentralID PMC5070523
The steady rise in healthcare costs has deprived over 45 million Americans of healthcare services (1, 2) and has encouraged healthcare providers to look for opportunities to improve their operational efficiency. Prior studies have shown that evidence of healthcare seeking intent in Internet searches correlates well with healthcare resource utilization. Given the ubiquitous nature of mobile Internet search, we hypothesized that analyzing geo-tagged mobile search logs could enable us to machine-learn predictors of future patient visits. Using a de-identified dataset of geo-tagged mobile Internet search logs, we mined text and location patterns that are predictors of healthcare resource utilization and built statistical models that predict the probability of a user's future visit to a medical facility. Our efforts will enable the development of innovative methods for modeling and optimizing the use of healthcare resources-a crucial prerequisite for securing healthcare access for everyone in the days to come.
View details for PubMedID 27570641
View details for PubMedCentralID PMC5001755
There is heterogeneity in the manifestation of diseases, therefore it is essential to understand the patterns of progression of a disease in a given population for disease management as well as for clinical research. Disease status is often summarized by repeated recordings of one or more physiological measures. As a result, historical values of these physiological measures for a population sample can be used to characterize disease progression patterns. We use a method for clustering sparse functional data for identifying sub-groups within a cohort of patients with chronic kidney disease (CKD), based on the trajectories of their Creatinine measurements. We demonstrate through a proof-of-principle study how the two sub-groups that display distinct patterns of disease progression may be compared on clinical attributes that correspond to the maximum difference in progression patterns. The key attributes that distinguish the two sub-groups appear to have support in published literature clinical practice related to CKD.
View details for PubMedID 27896974