Fourth year PhD student Daisy Yi Ding develops new method for supervised learning with multiple sets of features
Ding and mentor Rob Tibshirani worked through the pandemic to reach incredible conclusions (and they built the software!)
The video that Ding and Tibshirani produced with other lab members to explain the idea behind cooperative learning for multiview analysis.
Knee-deep in the pandemic during September of 2020, third year PhD student Daisy Ding and her advisor, Rob Tibshirani, were spending some afternoons sitting outside a coffee shop near Green Library trying to solve a problem that scientists had always faced: given multiple feature "views" on a set of patients, how to develop a statistical learning technique to use these views to make better predictions?
The problem, Tibshirani says, is a common one. Scientists gather measurements on the same group of patients from different sources, but there were no systematic approaches that allowed them to use these multiple data modalities to make predictions about their patients. What Ding and Tibshirani wanted to do was use two or more different data views to search for biomarkers, which can have a weak signal, and amplify that signal to make better predictions. The views can even come from different modalities, such as MRI and gene expression data, so that each patient has multiple aspects measured. The problem was how to layer these views together and extract the needed information.
“We looked up the methods that were out there and they weren’t fully developed,” he adds. “So that's what inspired us to try to come up with a new approach for this problem.”
Ding had learned of a conference on multi-omics, the practice of combining two or more “-omics” data types (genomics, proteomics, epigenomics, for example) in a single analysis, but wasn’t able to attend in person. Thinking that she might be able to adapt some of those approaches in her own research, she found the paper associated with the workshop and brought it to Tibshirani. She had read the paper and didn’t think too much of it at first, but while describing it to her mentor, she had a lightbulb moment.
“The paper was trying to solve this problem from a different perspective,” Ding recalls. “It was more from a backward kind of manner statistically — and then we thought, ‘Maybe it's more straightforward,’ and then at that moment we had this idea that we wanted to explore.”
Fourth year PhD student and first author, Daisy Yi Ding.
Version after version, approach after approach, Ding and Tibshirani worked on a more straightforward process. The winter of 2021 passed, then spring and summer. They kept after it, confident that her idea had a true possibility of working, even after encountering countless misfires and duds.
Students read papers from scientists and might think the final product was the first thing the scientist thought of, Tibshirani says. “But just like people like in the arts, what you see in the final product is usually the result of months or years of wrong turns and failures.”
Ding and Tibshirani worked on many versions, but eventually settled on a very simple and satisfying approach. Their chosen method generalized the two major existing approaches to the multiview problem: early and late fusion. Early fusion combines the data views into one large set of features and applies supervised learning, while late fusion fits separate models to the individual views and combines the predictions at the end.
“Our method is able to encompass these two types of approaches by giving you an entire spectrum,” Ding explains. “It wasn’t that we were trying to encompass these two; it’s more that we had this idea and we found that it’s able to do that. It was a pretty cool moment.”
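The spectrum between early and late fusion can be illustrated with a small sketch. It assumes two views `X` and `Z`, a squared-error fit on their summed predictions, and an "agreement" term weighted by a hypothetical parameter `rho` that penalizes disagreement between the two views' predictions. For simplicity the sketch swaps in a ridge penalty (`lam`) so the fit has a closed form; the published method and its software are not reproduced here, and the function name `cooperative_ridge` is illustrative only.

```python
import numpy as np


def cooperative_ridge(X, Z, y, rho, lam=1.0):
    """Minimize, over (bx, bz):
        0.5 * ||y - X bx - Z bz||^2
      + 0.5 * rho * ||X bx - Z bz||^2      (agreement between views)
      + 0.5 * lam * (||bx||^2 + ||bz||^2)  (ridge penalty, an assumption)
    by stacking the agreement term as extra rows of a least-squares problem.
    """
    n, px = X.shape
    s = np.sqrt(rho)
    # Top rows: the usual fit. Bottom rows: sqrt(rho) * (Z bz - X bx) -> 0.
    A = np.vstack([np.hstack([X, Z]), np.hstack([-s * X, s * Z])])
    b = np.concatenate([y, np.zeros(n)])
    theta = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    return theta[:px], theta[px:]


# Synthetic example: both views carry the same weak underlying signal.
rng = np.random.default_rng(0)
n = 100
latent = rng.normal(size=n)
X = latent[:, None] + rng.normal(size=(n, 5))
Z = latent[:, None] + rng.normal(size=(n, 4))
y = latent + rng.normal(size=n)

for rho in (0.0, 0.5, 1.0):
    bx, bz = cooperative_ridge(X, Z, y, rho)
    gap = np.mean((X @ bx - Z @ bz) ** 2)
    print(f"rho={rho}: mean squared disagreement between views = {gap:.3f}")
```

In this sketch, `rho = 0` reduces to a single ridge fit on the concatenated features (early fusion), while at `rho = 1` the cross terms between the views cancel and each view is fit on its own, with the two predictions added together (a simple form of late fusion). Values in between trade off the two.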
Ding’s approach to combining data sets spans both early and late fusion, and she calls it “cooperative learning.” According to the website that showcases Ding’s work, “the method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals,” in addition to achieving higher predictive accuracy on simulated data and real multiomics studies.
Tibshirani wants to underline what the success of the process means. “If you asked a scientist if they could predict how someone is going to respond to a cancer treatment, and you measure all of their genes, there are lots of ways to measure biological aspects of a patient,” he says. “That information may be buried in 100 genes but more likely, it's buried in 100 genes and 60 proteins. It is why scientists want to measure lots of things because when you can piece together the clues, you can build a more predictive model to predict if the patient's going to respond to treatment. That's the idea of multiview, and our approach to it is to say, well let's look for common signals and maybe this commonality will allow us to boost up the signal to get more accurate predictions. So that's what this is doing.”
But as impressive as the results were, they would be essentially useless without a software platform to put the method to work. They asked their expert colleague Balasubramanian Narasimhan to write a robust software package so that the method could be widely used. (Ding’s friend and Statistics PhD student Shuangning Li also helped the team with some mathematical contributions.)
“We spent a great deal of time on the software, and this is an important part of science these days,” Tibshirani adds. “There are a lot of good ideas out there, and perhaps 90-95% of them don't get used because there's no good software. It’s software that people actually need; if you don't have that, no one’s going to use your method. They read a paper and think, ‘that's interesting,’ but they often are not going to take time to code it up themselves. We spend a lot of time on making sure it's right and making sure it's robust; we're going to support it, and that's an important part of its potential for success.”
Daisy Yi Ding is the first author of “Cooperative learning for multiview analysis,” which was published in Proceedings of the National Academy of Sciences (PNAS) in late September.
Read it here: https://tibshirani.su.domains/multiview/CoopLearning.html