Abstract: Bio-OMIC data pose a "large p, small n" challenge for machine learning algorithms: in most cases, the number of features is much larger than the number of samples. Feature selection algorithms choose a subset of features to train a machine learning model in order to avoid overfitting. Feature engineering is another way to condense the useful information in the original features. We propose two novel feature engineering algorithms, ReGear and mqTrans, that generate new features from the original data; the experimental data support that both algorithms deliver better features for statistical analysis and machine learning modeling than the original features. First, we hypothesized that gene-level methylation values provide better discriminative power for binary classification than the original residue-level methylation values. A randomly selected subset of samples was used to train regression models that derive gene-level methylation values from all the residue-level methylation values. An extensive evaluation showed that the gene-level methylation values generated by ReGear have much stronger phenotypic associations. The ReGear features also outperformed the original residue-level features under both filter and wrapper feature selection algorithms. Hierarchical clustering based on the ReGear gene-level methylation features showed better inter-group discrimination. ReGear was further evaluated for its cross-dataset performance and compared with several existing methylation biomarker detection algorithms. Second, most biomarker studies focus on detecting phenotype-associated expression levels of individual transcripts. We hypothesized instead that quantitative transcription regulation may carry better discriminative power than the original expression levels.
The regression-based algorithm mqTrans trains a regression model on the transcription factors (TFs) for each mRNA, and estimates the difference between the observed and the predicted expression level of that mRNA in independent test samples. The regression models were trained on healthy control samples, and the experimental data showed that, in the independent test dataset, the mqTrans features were much smaller in the healthy controls than in the disease samples. This supports the view that the regression models stably reflect the quantitative transcription regulation relationships between each mRNA and its TFs. We detected 29 mRNAs whose mqTrans values were statistically significantly larger in the cancer samples than in the healthy controls. The transcription regulation of these 29 mRNAs was therefore quantitatively altered in the cancer samples, and their regulating TFs may serve as candidate treatment targets. The mqTrans algorithm may facilitate the detection of hidden biomarkers that show no differential expression but have altered transcription regulation relationships. We would like to share the two feature engineering algorithms, ReGear and mqTrans, with fellow researchers and offer a different perspective on bio-OMIC data.
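The mqTrans procedure described above can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation: it assumes ordinary least squares linear regression and takes the absolute difference between observed and predicted expression as the feature, neither of which the abstract specifies. All names (fit_linear, mqtrans_feature, simulate) and all numeric values are hypothetical.

```python
import random

def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def predict(coef, tfs):
    """Predicted mRNA level from TF levels; coef[0] is the intercept."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], tfs))

def mqtrans_feature(coef, tfs, observed):
    """Assumption: the feature is the absolute prediction error."""
    return abs(observed - predict(coef, tfs))

rng = random.Random(0)

def simulate(n, tf1_weight):
    """Synthetic samples: two TF levels and one mRNA level per sample."""
    samples = []
    for _ in range(n):
        tfs = [rng.uniform(0, 10), rng.uniform(0, 10)]
        mrna = 1.0 + tf1_weight * tfs[0] - 0.5 * tfs[1] + rng.gauss(0, 0.1)
        samples.append((tfs, mrna))
    return samples

# Train on healthy controls only; evaluate on independent samples.
train_controls = simulate(40, tf1_weight=2.0)
test_controls = simulate(20, tf1_weight=2.0)
test_disease = simulate(20, tf1_weight=0.5)  # regulation by TF1 is weakened

coef = fit_linear([x for x, _ in train_controls],
                  [y for _, y in train_controls])

control_scores = [mqtrans_feature(coef, x, y) for x, y in test_controls]
disease_scores = [mqtrans_feature(coef, x, y) for x, y in test_disease]
mean_control = sum(control_scores) / len(control_scores)
mean_disease = sum(disease_scores) / len(disease_scores)

print("mean mqTrans feature, held-out controls:", round(mean_control, 3))
print("mean mqTrans feature, disease samples:  ", round(mean_disease, 3))
```

In this toy setting the disease samples are generated with a weaker dependence of the mRNA on the first TF, so the model trained on controls predicts them poorly and their feature values come out larger. A real analysis would fit one such model per mRNA and then test the group difference statistically.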
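Speaker bio: Professor and doctoral supervisor; member of the Chinese Academy of Sciences Hundred Talents Program; "Tang Aoqing" Distinguished Professor at Jilin University; Senior Member of the IEEE (Institute of Electrical and Electronics Engineers). The team works mainly on core algorithms for health big-data mining and on fusion modeling algorithms that integrate heterogeneous medical big data such as bio-omics, medical imaging, ECG and EEG signals, and electronic medical records.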