How A Scary Genetic Diagnosis Revealed Healthcare’s Dirty Data Secrets: And How To Unlock Them
Simon Smith
2014-11-03 00:00:00

So my wife booked an appointment with a genetic counsellor. And I booked an appointment with Dr. Google. After finding no comprehensive source of information on interventions associated with a reduced risk of Parkinson’s disease (a few sites mention coffee, some nicotine), I dug into research on PubMed and compiled my own list. By the time my wife had her appointment, she was armed with recommendations and references to back them up. And it was a good thing: few of the doctors or counsellors she spoke with knew much about them.

Fortunately, based on a lack of family history, my wife learned that her risk is considered to be substantially lower than first feared. (Although she’s still taking precautions.) But the experience left its mark. How many other prospective treatments and preventions are locked away in health data? Might there be ways to analyze them algorithmically?

That question led me to start an experimental research project called WellPilot. Our mission is to turn complex data into simple guidance for optimal wellbeing. And while we’ve only been at it a year, the results have reinforced my belief that the future of healthcare will be powered by data—but not until we can overcome at least four key barriers: data quantity, silos, latency and complexity.

Data quantity: Information grows exponentially, time does not

The fact that my wife’s healthcare professionals weren't aware of all the prospective preventative interventions for Parkinson's isn't surprising, and you can’t blame them. On average, PubMed alone indexed about 2,823 articles per day in 2013, up from about 1,651 articles per day in 2003¹, which is growth of roughly 71% in 10 years. The average time that healthcare professionals have available to review such research has likely shrunk over the same period or, at best, remained stagnant. As controversial med-tech venture capitalist Vinod Khosla has pointed out, doctors simply cannot cope with the data deluge.
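The growth figure can be rechecked from the two daily averages quoted above. The following is a minimal sketch of that arithmetic, not part of WellPilot; the function and constant names are illustrative, and the compound annual rate is an extra derived number that assumes steady year-over-year growth between 2003 and 2013.

```python
def percent_growth(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    return (new - old) / old * 100.0


def compound_annual_rate(old: float, new: float, years: int) -> float:
    """Return the constant yearly growth rate (in percent) that turns old into new."""
    return ((new / old) ** (1.0 / years) - 1.0) * 100.0


# Daily averages quoted in the article (yearly item counts divided by 365).
ARTICLES_PER_DAY_2003 = 1651
ARTICLES_PER_DAY_2013 = 2823

growth = percent_growth(ARTICLES_PER_DAY_2003, ARTICLES_PER_DAY_2013)
# Assumption: growth was smooth across the 10 years, which the article does not claim.
annual_rate = compound_annual_rate(ARTICLES_PER_DAY_2003, ARTICLES_PER_DAY_2013, 10)

print(f"Growth 2003-2013: {growth:.1f}%")
print(f"Implied compound annual growth: {annual_rate:.1f}%")
```

Even a modest yearly rate compounds into a large gap over a decade, which is the point of the comparison with a clinician's fixed reading time.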

Let’s look at one condition alone. As of this writing, WellPilot has found research for 809 Parkinson’s disease interventions. Many of these are not drugs (for example, the supplement coenzyme Q10), so pharmaceutical companies don't pay to promote them to healthcare professionals. That makes them far less likely to be on a doctor’s radar, simply because doctors are time-crunched and not always able to keep up with the latest clinical research.

Machine-based data aggregation and analysis can help healthcare professionals cope. And with the adoption of electronic health record systems, growing use of wearable health devices, dropping cost of genome sequencing, and other drivers of data growth, this will only become more critical over time.

Data silos: The dots are disconnected

Not only is there a vast and increasing amount of healthcare data, but much of it exists in silos. The US National Library of Medicine has, thankfully, aggregated and organized much healthcare information. But even in the US, which does a better job than most countries, health-related information is separately collected and distributed by numerous organizations, including the National Library of Medicine, the Centers for Disease Control and Prevention and the Food and Drug Administration. And while electronic health records promise to make medicine more quantified, the truth is that the market is fragmented, with multiple vendors storing data independently.

Furthermore, the taxonomies for organizing this information are not always the same for different data sources, making them difficult to cross-reference. There are standard reference systems—for example, MeSH and ICD—but even these aren’t consistently adopted and applied.

For WellPilot, we aggregate data from sources including published research, clinical trials and reported treatment side effects. Each of these uses a separate data set available through distinct sources, without a single, official common taxonomy for querying and cross-referencing the information. To address this, we have developed and maintain an extensive list of synonyms, many of which differ by region. For example, in the US and Japan, most people refer to acetaminophen, while elsewhere it’s paracetamol.
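To make the synonym approach concrete, here is a minimal sketch of how a synonym list can map differently named terms from separate sources onto one canonical name before cross-referencing. It is an illustration under stated assumptions, not WellPilot's actual implementation: the table contents, the function names and the sample records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table: canonical name -> known regional or alternate names.
SYNONYMS = {
    "acetaminophen": {"paracetamol"},
    "coenzyme q10": {"coq10", "ubiquinone"},
}


def build_lookup(synonyms):
    """Flatten the synonym table into a variant -> canonical name dictionary."""
    lookup = {}
    for canonical, variants in synonyms.items():
        lookup[canonical] = canonical
        for variant in variants:
            lookup[variant] = canonical
    return lookup


def normalize(term, lookup):
    """Lower-case the term, collapse whitespace, and map it to its canonical name."""
    key = " ".join(term.lower().split())
    # Unknown terms pass through unchanged so that no record is silently dropped.
    return lookup.get(key, key)


def merge_records(records, lookup):
    """Group (source, term) pairs by canonical term so sources can be cross-referenced."""
    merged = defaultdict(list)
    for source, term in records:
        merged[normalize(term, lookup)].append(source)
    return dict(merged)


# Made-up sample records standing in for three separate data sets.
records = [
    ("published research", "Paracetamol"),
    ("clinical trials", "acetaminophen"),
    ("side-effect reports", "CoQ10"),
]
lookup = build_lookup(SYNONYMS)
merged = merge_records(records, lookup)
print(merged)
```

A flat lookup like this is easy to maintain by hand, which matters when the synonym list itself is the curated asset. It does not handle misspellings or ambiguous terms, which would need fuzzy matching or a formal vocabulary such as MeSH.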

Bringing structure to the mass of unstructured health data is an important challenge for 21st century, data-driven medicine.

Data latency: Validity is important, but so is timeliness

The most trustworthy healthcare studies are double-blind, placebo-controlled and randomized. And rightly so. These take significant effort to conduct but are thought to provide the most reliable and valid results. However, they also have drawbacks. These can include unrealistic clinical settings and overly uniform patient populations. They can also take a tremendous amount of time and money to organize (for example, look at the time it will take to organize clinical trials for an Ebola vaccine).

Clinical trials are critical to evaluating healthcare interventions. But with the speed at which healthcare data is available today, we are also able to analyze outcomes in real time. For example, I have seen demonstrations from free electronic health record provider Practice Fusion showing real-time analysis of data from their system, including the ability to create test and control groups of physicians to evaluate influences on prescribing behavior. Patient communities such as PatientsLikeMe and Crohnology are also now providing real-time, real-world data on interventions.

At WellPilot, analyzing timely, real-world data in addition to clinical research is a top priority. We currently allow patients to rate and discuss conditions and interventions, but I’m looking forward to the wider availability of personal health aggregators such as Apple’s HealthKit—and greater comfort from patients in sharing their data privately and securely for research purposes.

Data complexity: Consumer healthcare only works if people understand the information

Finally, even when you can address the quantity, silos and latency issues, you’re left with a lot of complex information. This information is often too complex for trained healthcare professionals to interpret, never mind patients. (Surveys show, for example, that doctors often misinterpret statistics.)

With WellPilot, we’ve been working (with varying degrees of success—but lots of optimism) to automate the analysis, visualization and explanation of how different treatments can benefit different conditions. Our current implementation uses a simple “WellRank” calculation to boil down many variables to a score that can help guide patients and healthcare professionals in evaluating interventions. It’s far from perfect, and we’re continuously improving it, but we’re hoping it’s a step towards simplifying and visualizing complex data in a way that people can understand and act on. We also allow users to provide feedback when the system gets it wrong, to help guide its improvement over time.

The evolution of healthcare almost certainly involves the production of more and more data. It’s my hope that systems like WellPilot can help people make sense of that data, so that in future anyone dealing with a scary diagnosis of their own has somewhere to turn for clear, up-to-date guidance drawing on a wide range of medical information sources.

Simon Smith is a digital health veteran with over 14 years of experience in digital media and healthcare solutions. WellPilot is an experimental research project to turn complex data into simple guidance for optimal well-being. Find Simon on LinkedIn and Twitter.

1 Based on the creation date of added content: the number of items created within each year, divided by 365 to give a daily average.