This document is an addendum to the review “Artificial intelligence in bulk and single-cell RNA-sequencing data to foster precision oncology” (International Journal of Molecular Sciences, 2021). In this tutorial we provide a toy example of how RNA-seq data can be analyzed with Machine Learning (ML) techniques to perform sample classification and gene signature extraction. In particular, we will use published RNA-seq data of samples belonging to two cancer types (prostate and lung) and normal samples of the same tissue types to train and test Random Forest and Support Vector Machine (SVM) classifiers. The aim of this tutorial is to present an analysis workflow that includes some of the ML concepts discussed in the manuscript. The analysis is based on the R “caret” package and was performed with R version 4.0.4.
We will first load tumor and normal RNA-seq data of prostate and lung tissues. Tumor samples come from The Cancer Genome Atlas (TCGA) project, while normal samples come from the Genotype-Tissue Expression (GTEx) project. As discussed in the review, the first source of heterogeneity that must be taken into account when developing an ML analysis on transcriptomic data is the one originating from technical factors. In our case, the major source of technical heterogeneity is that the samples belong to two distinct projects that used different analysis pipelines, so batch-correction techniques would be needed to allow a joint analysis of the two datasets. However, the data that we will use have already been batch-corrected to make the datasets comparable. These data were produced by Wang et al. (Scientific Data, 2018, https://doi.org/10.1038/sdata.2018.61) and downloaded from https://github.com/mskcc/RNAseqDB. Gene expression values from these datasets, normalized as Fragments Per Kilobase of transcript per Million mapped reads (FPKM), will be used as features by the ML models.
#load required libraries
library(caret)        # ML framework: training, resampling, evaluation (also attaches ggplot2 and lattice)
library(randomForest) # Random Forest backend used by caret's method = "rf"
library(gridExtra)    # arrange multiple ggplots side by side
library(e1071)        # SVM utilities required by caret
library(kernlab)      # SVM backend used by caret's method = "svmRadial"
library(Rtsne)        # t-SNE dimensionality reduction
#load datasets
normal_A=read.table('sources/prostate-rsem-fpkm-gtex.txt.gz',header = T) #prostate tissue
normal_B=read.table('sources/lung-rsem-fpkm-gtex.txt.gz',header = T) #lung tissue
tumor_A=read.table('sources/prad-rsem-fpkm-tcga-t.txt.gz',header = T) #prostate adenocarcinoma
tumor_B=read.table('sources/luad-rsem-fpkm-tcga-t.txt.gz',header = T) #lung adenocarcinoma
Note that the first two columns of each dataset contain gene annotations (HUGO gene symbol and Entrez gene ID).
## Hugo_Symbol Entrez_Gene_Id TCGA.VN.A88P.01A.11R.A352.07
## 1 CUL5 8065 218.79
## 2 REV1 51455 420.68
## 3 ZNF428 126299 1135.20
## 4 CX3CR1 1524 5.36
## 5 FAXC 84553 100.13
## TCGA.CH.5739.01A.11R.1580.07
## 1 329.84
## 2 500.46
## 3 599.49
## 4 98.04
## 5 226.54
To reduce the number of possible features, we restrict our analysis to a known set of cancer-associated genes. We will use the cancer genes annotated in NCG6 (Network of Cancer Genes 6), a comprehensive catalogue of experimentally validated cancer genes (Repana et al., Genome Biology, 2019, https://doi.org/10.1186/s13059-018-1612-0).
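The code that loads the NCG6 annotation is not shown; a minimal sketch, assuming the catalogue was downloaded from the NCG website and saved as a tab-separated file (the file name is hypothetical) with at least an entrez and a symbol column, as used below:
#load the NCG6 cancer gene annotation (hypothetical file name)
cancer_genes=read.table('sources/NCG6_cancer_genes.tsv',header = T,sep = '\t')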
Tumor and Normal datasets are filtered on the set of cancer genes.
# Restrict a dataset to NCG6 cancer genes; for Entrez IDs that appear more than
# once, keep only the row whose Hugo symbol matches the NCG6 annotation
filter_cancer_genes = function(df){
  df = subset(df, Entrez_Gene_Id %in% cancer_genes$entrez)
  dup = names(which(table(df$Entrez_Gene_Id) > 1))
  df = subset(df, !Entrez_Gene_Id %in% dup | Hugo_Symbol %in% subset(cancer_genes, entrez %in% dup)$symbol)
  rownames(df) = df$Entrez_Gene_Id
  df
}
# Tumor : ---
tumor_A = filter_cancer_genes(tumor_A)
tumor_B = filter_cancer_genes(tumor_B)
# Normal : ---
normal_A = filter_cancer_genes(normal_A)
normal_B = filter_cancer_genes(normal_B)
We randomly split each dataset into a training set and a test set. The former will be used by the model during the learning process to tune its parameters and achieve an optimal classification based on the known labels (i.e. Tumor-Normal or PRAD-LUAD). The latter, containing samples that the model has never seen, will be used to assess the performance of the trained model. The training set will contain 75% of the samples, the test set the remaining 25%.
set.seed(900808)
tumor_A_train = tumor_A[,c("Hugo_Symbol","Entrez_Gene_Id",
sample(x = colnames(tumor_A)[3:ncol(tumor_A)],
size = round((ncol(tumor_A)-2)*0.75),
replace = F))]
tumor_A_test = cbind(tumor_A[,1:2],tumor_A[,!colnames(tumor_A)%in%colnames(tumor_A_train)])
tumor_B_train = tumor_B[,c("Hugo_Symbol","Entrez_Gene_Id",
sample(x = colnames(tumor_B)[3:ncol(tumor_B)],
size = round((ncol(tumor_B)-2)*0.75),
replace = F))]
tumor_B_test = cbind(tumor_B[,1:2],tumor_B[,!colnames(tumor_B)%in%colnames(tumor_B_train)])
normal_A_train = normal_A[,c("Hugo_Symbol","Entrez_Gene_Id",
sample(x = colnames(normal_A)[3:ncol(normal_A)],
size = round((ncol(normal_A)-2)*0.75),
replace = F))]
normal_A_test = cbind(normal_A[,1:2],normal_A[,!colnames(normal_A)%in%colnames(normal_A_train)])
normal_B_train = normal_B[,c("Hugo_Symbol","Entrez_Gene_Id",
sample(x = colnames(normal_B)[3:ncol(normal_B)],
size = round((ncol(normal_B)-2)*0.75),
replace = F))]
normal_B_test = cbind(normal_B[,1:2],normal_B[,!colnames(normal_B)%in%colnames(normal_B_train)])
The aim of this section is to train ML models capable of discriminating tumor samples (prostate and lung adenocarcinoma) from samples of normal prostate and lung tissues. We will train a Random Forest and an SVM classifier, compare them on training samples and evaluate their performance on the test set. Finally, for each model, we will extract a putative signature of cancer genes whose expression most likely discriminates between tumor and normal samples.
We create the full training set matrix, including tumor and normal samples of both prostate and lung, that will be used in the learning process.
# Tumor : ---
tumor_A_train=subset(tumor_A_train, Entrez_Gene_Id%in%tumor_B_train$Entrez_Gene_Id)
tumor_B_train=subset(tumor_B_train, Entrez_Gene_Id%in%tumor_A_train$Entrez_Gene_Id)
tumor_train_set = as.data.frame(t(cbind(tumor_A_train[,3:ncol(tumor_A_train)],tumor_B_train[,3:ncol(tumor_B_train)])))
# Normal : ---
normal_A_train=subset(normal_A_train, Entrez_Gene_Id%in%normal_B_train$Entrez_Gene_Id)
normal_B_train=subset(normal_B_train, Entrez_Gene_Id%in%normal_A_train$Entrez_Gene_Id)
normal_train_set = as.data.frame(t(cbind(normal_A_train[,3:ncol(normal_A_train)],normal_B_train[,3:ncol(normal_B_train)])))
# Merge : ---
# sanity checks: both matrices must contain the same genes (columns) in the same order
ncol(tumor_train_set) == ncol(normal_train_set)
sum(colnames(tumor_train_set)==colnames(normal_train_set))/ncol(normal_train_set)
train_set = rbind(tumor_train_set,normal_train_set)
# prefix gene columns with 'g_' so that column names (Entrez IDs) are syntactically valid in R formulas
colnames(train_set)=paste0('g_',colnames(train_set))
train_set$type=factor(c(rep('Tumor',nrow(tumor_train_set)),rep('Normal',nrow(normal_train_set))),levels=c('Normal','Tumor'))
We create the full test set matrix, including tumor and normal samples of both prostate and lung, that will be used for the evaluation of the model performance on unseen data.
# Tumor : ---
tumor_A_test=subset(tumor_A_test, Entrez_Gene_Id%in%tumor_B_test$Entrez_Gene_Id)
tumor_B_test=subset(tumor_B_test, Entrez_Gene_Id%in%tumor_A_test$Entrez_Gene_Id)
tumor_test_set = as.data.frame(t(cbind(tumor_A_test[,3:ncol(tumor_A_test)],tumor_B_test[,3:ncol(tumor_B_test)])))
# Normal : ---
normal_A_test=subset(normal_A_test, Entrez_Gene_Id%in%normal_B_test$Entrez_Gene_Id)
normal_B_test=subset(normal_B_test, Entrez_Gene_Id%in%normal_A_test$Entrez_Gene_Id)
normal_test_set = as.data.frame(t(cbind(normal_A_test[,3:ncol(normal_A_test)],normal_B_test[,3:ncol(normal_B_test)])))
# Merge : ---
ncol(tumor_test_set) == ncol(normal_test_set)
sum(colnames(tumor_test_set)==colnames(normal_test_set))/ncol(normal_test_set)
test_set = rbind(tumor_test_set,normal_test_set)
# prefix gene columns with 'g_' to match the training set
colnames(test_set)=paste0('g_',colnames(test_set))
test_set$type=factor(c(rep('Tumor',nrow(tumor_test_set)),rep('Normal',nrow(normal_test_set))),levels=c('Normal','Tumor'))
As discussed in the manuscript, feature centering and scaling is a critical preprocessing step to ensure that all variables contribute proportionally to the model. We therefore use the preProcess function of the “caret” package to jointly preprocess the training and test sets: besides centering and scaling, it applies a Yeo-Johnson transformation and removes near-zero-variance features.
tmp = rbind(train_set,test_set)
all_data = tmp[,1:(ncol(tmp)-1)] # drop the 'type' column before preprocessing
# center, scale, Yeo-Johnson transform and remove near-zero-variance features
all_data = predict(preProcess(all_data,method = c("center", "scale", "YeoJohnson", "nzv")),all_data)
all_data$type=tmp$type
# split back into training and test set
train_set = all_data[1:nrow(train_set),]
test_set = all_data[(nrow(train_set)+1):(nrow(train_set)+nrow(test_set)),]
Before moving to the learning step, we can visualise the entire dataset, composed of training and test set, using a dimensionality reduction method. We here propose the t-distributed stochastic neighbor embedding (tSNE), but other techniques, such as PCA and UMAP, could be used as well. Given tSNE's inherent stochasticity, a seed needs to be set to ensure the reproducibility of the visualisation.
set.seed(94512)
tsne <- Rtsne(all_data[,colnames(all_data)!='type'], dims = 2, perplexity=15, verbose=TRUE, max_iter = 500)
tsne.df = as.data.frame(tsne$Y)
rownames(tsne.df) = rownames(all_data)
colnames(tsne.df) = c('tsne.1', 'tsne.2')
tsne.df$type = all_data$type
ggplot(tsne.df, aes(x=tsne.1, y=tsne.2, color=type))+
geom_point()+
theme_bw()
Besides the separation between tumor and normal samples, the tSNE representation clearly shows the presence of two distinct groups of samples corresponding to the two tissue types.
We will now define and train a Random Forest for tumor-normal classification. Firstly, we specify the formula relating the features (gene expression values) to the sample type (tumor or normal).
sel=colnames(train_set[1:(ncol(train_set)-1)])
formula <- as.formula(paste("type ~ ",
paste(sel, collapse = "+")))
to_forest = train_set
Secondly, we define the resampling method to be used during the training of the model. In this case we use 10-fold cross-validation (cv).
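The corresponding definition is omitted here; it matches the one shown below for the SVM:
# 10-fold cross-validation ("cv") is used as resampling method to train the model
train_control <- trainControl(method="cv", number=10, classProbs = TRUE, summaryFunction = twoClassSummary)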
The training set is finally fed to the Random Forest model (rf) and the learning step is performed. Attention! The following module could take several minutes to train the model. Feel free to skip this module and directly load the pre-trained model from the subsequent module.
set.seed(900808)
Sys.time()
rf2 <- train(formula, data = to_forest, method = "rf", tuneLength=10, importance=T, trControl=train_control, metric = "ROC")
Sys.time()
The pre-trained model can be loaded from the Rdata folder.
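A minimal sketch of the loading step, assuming the trained object was saved with save(); the file name is hypothetical and should be adapted to the repository layout:
#load the pre-trained Random Forest (hypothetical file name); restores the 'rf2' object
load('Rdata/rf_tumor_normal.Rdata')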
The resulting trained model is:
## Random Forest
##
## 1012 samples
## 705 predictors
## 2 classes: 'Normal', 'Tumor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 911, 911, 911, 911, 910, 910, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9985152 0.9173387 0.9957143
## 3 0.9985188 0.9141129 0.9914079
## 7 0.9985886 0.9363911 0.9914079
## 14 0.9986058 0.9394153 0.9899793
## 27 0.9984030 0.9395161 0.9885300
## 52 0.9982279 0.9523185 0.9871014
## 99 0.9980934 0.9490927 0.9885300
## 191 0.9973625 0.9585685 0.9856522
## 367 0.9968737 0.9524194 0.9842236
## 704 0.9957053 0.9333669 0.9870600
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
We can test the performance of the model on the training data by comparing the predicted sample types with the true ones. Both metrics used to evaluate the performance show a perfect classification capability.
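The evaluation call is omitted above; following the pattern used later in the tutorial, it can be reproduced as:
# Test the accuracy of the final model on the training set
postResample(predict(rf2, newdata = to_forest), to_forest$type)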
## Accuracy Kappa
## 1 1
Following the same steps described for the Random Forest, we now train an SVM for the same classification task.
sel=colnames(train_set[1:(ncol(train_set)-1)])
formula <- as.formula(paste("type ~ ",
paste(sel, collapse = "+")))
to_svm = train_set
# 10-fold cross-validation ("cv") is used as resampling method to train the model
train_control <- trainControl(method="cv", number=10, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(900808)
Sys.time()
svm <- train(formula, data = to_svm, method = "svmRadial", tuneLength=10, importance=T, trControl=train_control, metric = "ROC")
Sys.time()
The pre-trained model can be loaded from the Rdata folder.
Show the trained model and evaluate its performance on the training set.
## Support Vector Machines with Radial Basis Function Kernel
##
## 1012 samples
## 705 predictor
## 2 classes: 'Normal', 'Tumor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 911, 911, 911, 911, 910, 910, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.9975413 0.9712702 0.9770807
## 0.50 0.9988627 0.9745968 0.9856936
## 1.00 0.9995011 0.9809476 0.9871222
## 2.00 0.9995939 0.9841734 0.9900000
## 4.00 0.9997307 0.9937500 0.9914286
## 8.00 0.9997753 0.9937500 0.9928571
## 16.00 0.9997753 0.9937500 0.9928571
## 32.00 0.9997753 0.9937500 0.9928571
## 64.00 0.9997753 0.9937500 0.9928571
## 128.00 0.9997753 0.9937500 0.9928571
##
## Tuning parameter 'sigma' was held constant at a value of 0.001607629
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.001607629 and C = 8.
The SVM achieves a perfect classification of samples in the training set.
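The evaluation call mirrors the one shown later in the PRAD-LUAD section:
# Test the accuracy of the final model on the training set
postResample(predict(svm, newdata = to_svm), to_svm$type)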
## Accuracy Kappa
## 1 1
Both models perform extremely well on the training set. However, to test whether one model performs significantly better than the other, we can compare them on three metrics (ROC, Sensitivity and Specificity) computed across the cross-validation resamples.
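The comparison code is omitted; a sketch based on caret's resamples interface, using the object names (res, difValues) that appear in the output below:
# Collect the cross-validation resamples of the two models
res <- resamples(list(SVM = svm, RF = rf2))
summary(res)
# Pairwise differences of the metrics, tested with Bonferroni-adjusted p-values
difValues <- diff(res)
difValues
summary(difValues)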
##
## Call:
## summary.resamples(object = res)
##
## Models: SVM, RF
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.9982143 1.0000000 1.0000000 0.9997753 1.0000000 1 0
## RF 0.9955357 0.9982719 0.9993088 0.9986058 0.9995471 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.9687500 1.0000 1.0000000 0.9937500 1.000000 1.00000 0
## RF 0.8064516 0.9375 0.9677419 0.9394153 0.968498 0.96875 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 0.9714286 0.9857143 1.0000000 0.9928571 1 1 0
## RF 0.9571429 0.9857143 0.9928571 0.9899793 1 1 0
##
## Call:
## diff.resamples(x = res)
##
## Models: SVM, RF
## Metrics: ROC, Sens, Spec
## Number of differences: 1
## p-value adjustment: bonferroni
##
## Call:
## summary.diff.resamples(object = difValues)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## SVM RF
## SVM 0.00117
## RF 0.01611
##
## Sens
## SVM RF
## SVM 0.05433
## RF 0.01228
##
## Spec
## SVM RF
## SVM 0.002878
## RF 0.3416
This result indicates a significantly higher ROC (p = 0.016) and Sensitivity (p = 0.012) for the SVM compared to the Random Forest, while the difference in Specificity is not significant (p = 0.34). Finally, the differences in the three metrics between the two models can be visualised.
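The plotting call is omitted; caret provides lattice methods for this purpose, for example:
# Dotplot of the metric differences between the two models, with confidence intervals
dotplot(difValues)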
To avoid overwriting them in the next section, the trained models are stored in new variables.
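The storage code is omitted; a minimal sketch with hypothetical variable names, chosen to match the signature names (e.g. top_20_rf_TN) used at the end of the tutorial:
# Keep the tumor-normal (TN) models under distinct names, so that rf2 and svm
# can be reused for the PRAD-LUAD task in the next section (names are hypothetical)
rf_TN <- rf2
svm_TN <- svm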
We can now test the classification performance of the trained Random Forest on the test set, which contains samples not used for training. Again, we compare predicted sample types with the true ones. The confusionMatrix function returns a more complete summary of the comparison, explicitly indicating the number of correctly classified or misclassified samples.
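The prediction calls are omitted; mirroring the SVM code shown below:
# Test the accuracy of the final model on the test set
prediction = predict(rf2,newdata = test_set)
postResample(prediction,test_set$type)
confusionMatrix(prediction,test_set$type)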
## Accuracy Kappa
## 0.9821429 0.9577748
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Tumor
## Normal 99 1
## Tumor 5 231
##
## Accuracy : 0.9821
## 95% CI : (0.9615, 0.9934)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9578
##
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9519
## Specificity : 0.9957
## Pos Pred Value : 0.9900
## Neg Pred Value : 0.9788
## Prevalence : 0.3095
## Detection Rate : 0.2946
## Detection Prevalence : 0.2976
## Balanced Accuracy : 0.9738
##
## 'Positive' Class : Normal
##
The above results indicate that only six samples of the test set were misclassified by the Random Forest. The following plots show the tSNE representation of the entire dataset, with samples belonging to the training set shown in grey and samples of the test set colored either by the true labels (left) or by the predicted ones (right). Black crosses mark the positions of the misclassified samples.
p1=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data=tsne.df[rownames(test_set),],mapping = aes(x=tsne.1, y=tsne.2, color=test_set$type))+
ggtitle(label = 'True labels')+
theme_bw()
tsne_test=tsne.df[rownames(test_set),]
tsne_test$prediction <- prediction
p2=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data = tsne_test, mapping = aes(x=tsne.1, y=tsne.2, color=prediction))+
geom_point(data= subset(tsne_test, prediction!=type), mapping = aes(x=tsne.1, y=tsne.2), shape = 4, color='black', size=2, stroke=2)+
ggtitle(label = 'Predicted labels')+
theme_bw()
grid.arrange(p1,p2, ncol=2)
As done for the Random Forest, we now test the classification performance of the SVM on the test set.
# Test the accuracy of the final model on the test set
prediction = predict(svm,newdata = test_set)
postResample(prediction,test_set$type)
confusionMatrix(prediction,test_set$type)
## Accuracy Kappa
## 1 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Tumor
## Normal 104 0
## Tumor 0 232
##
## Accuracy : 1
## 95% CI : (0.9891, 1)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3095
## Detection Rate : 0.3095
## Detection Prevalence : 0.3095
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Normal
##
This model achieves a perfect classification of samples in the test set, as shown by the above results and plots below.
p1=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data=tsne.df[rownames(test_set),],mapping = aes(x=tsne.1, y=tsne.2, color=test_set$type))+
ggtitle(label = 'True labels')+
theme_bw()
tsne_test=tsne.df[rownames(test_set),]
tsne_test$prediction <- prediction
p2=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data = tsne_test, mapping = aes(x=tsne.1, y=tsne.2, color=prediction))+
geom_point(data= subset(tsne_test, prediction!=type), mapping = aes(x=tsne.1, y=tsne.2), shape = 4, color='black', size=4)+
ggtitle(label = 'Predicted labels')+
theme_bw()
grid.arrange(p1,p2, ncol=2)
Finally, having shown the high classification capability of both models on unseen data, we extract the features (genes) with the highest importance for classification. For each model we select the top 20 genes based on the importance metric. These gene sets constitute putative signatures of genes whose expression discriminates tumor from normal samples.
roc_imp_rf <- varImp(rf2, scale = FALSE)
rownames(roc_imp_rf$importance)=sapply(strsplit(rownames(roc_imp_rf$importance),'_'),`[`,2)
rownames(roc_imp_rf$importance)=cancer_genes$symbol[match(rownames(roc_imp_rf$importance),cancer_genes$entrez)]
top_20_rf_TN = rownames(roc_imp_rf$importance[order(roc_imp_rf$importance$Normal,decreasing = T),])[1:20]
plot(roc_imp_rf, top=20)
roc_imp_svm <- varImp(svm, scale = FALSE)
rownames(roc_imp_svm$importance)=sapply(strsplit(rownames(roc_imp_svm$importance),'_'),`[`,2)
rownames(roc_imp_svm$importance)=cancer_genes$symbol[match(rownames(roc_imp_svm$importance),cancer_genes$entrez)]
top_20_svm_TN = rownames(roc_imp_svm$importance[order(roc_imp_svm$importance$Normal,decreasing = T),])[1:20]
plot(roc_imp_svm, top=20)
Ten genes are identified among the top 20 important features by both models. In a real research setting, these consensus genes would likely be prioritized for further investigation.
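The consensus set can be extracted with a simple intersection:
# Genes ranked among the top 20 important features by both models
intersect(top_20_rf_TN,top_20_svm_TN)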
## [1] "BMPR1A" "SIX2" "CCND2" "FOXO1" "EED" "MYCN" "HLF" "GAS7"
## [9] "STAT5B" "MALT1"
In this section we will use Random Forest and SVM models to classify samples of two tumor types: prostate and lung adenocarcinoma (PRAD and LUAD, respectively). We will follow the same steps described in the previous section. It is worth noting that, given the strong molecular differences between prostate and lung tissue, this classification task is expected to be easy for the models.
We create the full training set matrix, including PRAD and LUAD samples, that will be used in the learning process.
We create the full test set matrix, including PRAD and LUAD samples, that will be used for the evaluation of the model performance on unseen data.
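The construction code is omitted; a sketch that mirrors the tumor-normal section, now labelling samples by tumor type:
# Training set: PRAD and LUAD samples as rows, genes as columns
train_set = as.data.frame(t(cbind(tumor_A_train[,3:ncol(tumor_A_train)],tumor_B_train[,3:ncol(tumor_B_train)])))
colnames(train_set)=paste0('g_',colnames(train_set))
train_set$type=factor(c(rep('PRAD',ncol(tumor_A_train)-2),rep('LUAD',ncol(tumor_B_train)-2)),levels=c('PRAD','LUAD'))
# Test set: built in the same way from the held-out samples
test_set = as.data.frame(t(cbind(tumor_A_test[,3:ncol(tumor_A_test)],tumor_B_test[,3:ncol(tumor_B_test)])))
colnames(test_set)=paste0('g_',colnames(test_set))
test_set$type=factor(c(rep('PRAD',ncol(tumor_A_test)-2),rep('LUAD',ncol(tumor_B_test)-2)),levels=c('PRAD','LUAD'))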
We use the preProcess function of the “caret” package to jointly preprocess the training and test sets (centering, scaling, Yeo-Johnson transformation and removal of near-zero-variance features).
tmp = rbind(train_set,test_set)
all_data = tmp[,1:(ncol(tmp)-1)]
all_data = predict(preProcess(all_data,method = c("center", "scale", "YeoJohnson", "nzv")),all_data)
all_data$type=tmp$type
train_set = all_data[1:nrow(train_set),]
test_set = all_data[(nrow(train_set)+1):(nrow(train_set)+nrow(test_set)),]
Before moving to the learning step, we can visualise the entire dataset, composed of training and test set, using the dimensionality reduction method tSNE. Given tSNE's inherent stochasticity, a seed needs to be set to ensure the reproducibility of the visualisation.
set.seed(900808)
tsne <- Rtsne(all_data[,colnames(all_data)!='type'], dims = 2, perplexity=15, verbose=TRUE, max_iter = 500)
tsne.df = as.data.frame(tsne$Y)
rownames(tsne.df) = rownames(all_data)
colnames(tsne.df) = c('tsne.1', 'tsne.2')
tsne.df$type = all_data$type
ggplot(tsne.df, aes(x=tsne.1, y=tsne.2, color=type))+
geom_point()+
theme_bw()
The tSNE representation clearly shows the presence of two distinct sample clusters, corresponding to PRAD and LUAD samples.
We now define and train a Random Forest for PRAD-LUAD classification. Firstly, we specify the formula relating the features (gene expression values) to the sample type (PRAD or LUAD).
sel=colnames(train_set[1:(ncol(train_set)-1)])
formula <- as.formula(paste("type ~ ",
paste(sel, collapse = "+")))
to_forest = train_set
Secondly, we define the resampling method to be used during the training of the model. In this case we use 10-fold cross-validation (cv).
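As in the previous section, the definition matches the one shown below for the SVM:
# 10-fold cross-validation ("cv") is used as resampling method to train the model
train_control <- trainControl(method="cv", number=10, classProbs = TRUE, summaryFunction = twoClassSummary)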
The training set is finally fed to the Random Forest model (rf) and the learning step is performed. Attention! The following module could take several minutes to train the model. Feel free to skip this module and directly load the pre-trained model from the subsequent module.
set.seed(900808)
Sys.time()
rf2 <- train(formula, data = to_forest, method = "rf", tuneLength=10, importance=T, trControl=train_control, metric = "ROC")
Sys.time()
The pre-trained model can be loaded from the Rdata folder.
The resulting trained model is:
## Random Forest
##
## 697 samples
## 705 predictors
## 2 classes: 'PRAD', 'LUAD'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 628, 628, 627, 627, 627, 627, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 1 1 1
## 3 1 1 1
## 7 1 1 1
## 14 1 1 1
## 27 1 1 1
## 52 1 1 1
## 99 1 1 1
## 191 1 1 1
## 367 1 1 1
## 704 1 1 1
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
We can test the performance of the model on the training data by comparing the predicted sample types with the true ones. Both metrics used to evaluate the performance indicate a perfect classification on the training set.
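The evaluation call mirrors the one shown below for the SVM:
# Test the accuracy of the final model on the training set
postResample(predict(rf2, newdata = to_forest), to_forest$type)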
## Accuracy Kappa
## 1 1
Following the same steps described for the Random Forest, we train an SVM for the same classification task.
sel=colnames(train_set[1:(ncol(train_set)-1)]) # If you want to speed up the analysis select a smaller subset of these genes.
formula <- as.formula(paste("type ~ ",
paste(sel, collapse = "+")))
to_svm = train_set
# 10-fold cross-validation ("cv") is used as resampling method to train the model
train_control <- trainControl(method="cv", number=10, classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(900808)
Sys.time()
svm <- train(formula, data = to_svm, method = "svmRadial", tuneLength=10, importance=T, trControl=train_control, metric = "ROC")
Sys.time()
Show the trained model and evaluate its performance on the training set.
## Support Vector Machines with Radial Basis Function Kernel
##
## 697 samples
## 705 predictors
## 2 classes: 'PRAD', 'LUAD'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 628, 628, 627, 627, 627, 627, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 1 1 1
## 0.50 1 1 1
## 1.00 1 1 1
## 2.00 1 1 1
## 4.00 1 1 1
## 8.00 1 1 1
## 16.00 1 1 1
## 32.00 1 1 1
## 64.00 1 1 1
## 128.00 1 1 1
##
## Tuning parameter 'sigma' was held constant at a value of 0.001650364
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.001650364 and C = 0.25.
# Test the accuracy of the final model on the training set
postResample(predict(svm, newdata = to_svm), to_svm$type)
## Accuracy Kappa
## 1 1
The SVM also achieves a perfect classification of the training set. The comparison below confirms this point, showing no differences in the evaluation metrics between the two models.
##
## Call:
## summary.resamples(object = res)
##
## Models: SVM, RF
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 1 1 1 1 1 1 0
## RF 1 1 1 1 1 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 1 1 1 1 1 1 0
## RF 1 1 1 1 1 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM 1 1 1 1 1 1 0
## RF 1 1 1 1 1 1 0
##
## Call:
## diff.resamples(x = res)
##
## Models: SVM, RF
## Metrics: ROC, Sens, Spec
## Number of differences: 1
## p-value adjustment: bonferroni
##
## Call:
## summary.diff.resamples(object = difValues)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## SVM RF
## SVM 0
## RF NA
##
## Sens
## SVM RF
## SVM 0
## RF NA
##
## Spec
## SVM RF
## SVM 0
## RF NA
We can now test the classification performance of the trained Random Forest on the test set, which contains samples not used for training. Again, we compare predicted sample types with the true ones. The confusionMatrix function returns a more complete summary of the comparison, explicitly indicating the number of correctly classified or misclassified samples.
# Test the accuracy of the final model on the test set
prediction = predict(rf2,newdata = test_set)
postResample(prediction,test_set$type)
confusionMatrix(prediction,test_set$type)
## Accuracy Kappa
## 1 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction PRAD LUAD
## PRAD 106 0
## LUAD 0 126
##
## Accuracy : 1
## 95% CI : (0.9842, 1)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4569
## Detection Rate : 0.4569
## Detection Prevalence : 0.4569
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : PRAD
##
The Random Forest achieves a perfect classification also of the test set, with no misclassified samples.
p1=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data=tsne.df[rownames(test_set),],mapping = aes(x=tsne.1, y=tsne.2, color=test_set$type))+
ggtitle(label = 'True labels')+
theme_bw()
tsne_test=tsne.df[rownames(test_set),]
tsne_test$prediction <- prediction
p2=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data = tsne_test, mapping = aes(x=tsne.1, y=tsne.2, color=prediction))+
geom_point(data= subset(tsne_test, prediction!=type), mapping = aes(x=tsne.1, y=tsne.2), shape = 4, color='black', size=4)+
ggtitle(label = 'Predicted labels')+
theme_bw()
grid.arrange(p1,p2, ncol=2)
As done for the Random Forest, we now test the classification performance of the SVM on the test set.
# Test the accuracy of the final model on the test set
prediction = predict(svm,newdata = test_set)
postResample(prediction,test_set$type)
confusionMatrix(prediction,test_set$type)
## Accuracy Kappa
## 1 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction PRAD LUAD
## PRAD 106 0
## LUAD 0 126
##
## Accuracy : 1
## 95% CI : (0.9842, 1)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4569
## Detection Rate : 0.4569
## Detection Prevalence : 0.4569
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : PRAD
##
The SVM also achieves a perfect classification of the test set.
p1=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data=tsne.df[rownames(test_set),],mapping = aes(x=tsne.1, y=tsne.2, color=test_set$type))+
ggtitle(label = 'True labels')+
theme_bw()
tsne_test=tsne.df[rownames(test_set),]
tsne_test$prediction <- prediction
p2=ggplot()+
geom_point(data = tsne.df, mapping = aes(x=tsne.1, y=tsne.2), color='gray')+
geom_point(data = tsne_test, mapping = aes(x=tsne.1, y=tsne.2, color=prediction))+
geom_point(data= subset(tsne_test, prediction!=type), mapping = aes(x=tsne.1, y=tsne.2), shape = 4, color='black', size=4)+
ggtitle(label = 'Predicted labels')+
theme_bw()
grid.arrange(p1,p2, ncol=2)
Finally, we can extract the features (genes) that have the highest importance for classification. These gene sets constitute putative signatures of genes involved in discriminating the two tumor types.
roc_imp_rf <- varImp(rf2, scale = FALSE)
rownames(roc_imp_rf$importance)=sapply(strsplit(rownames(roc_imp_rf$importance),'_'),`[`,2)
rownames(roc_imp_rf$importance)=cancer_genes$symbol[match(rownames(roc_imp_rf$importance),cancer_genes$entrez)]
top_20_rf_TT = rownames(roc_imp_rf$importance[order(roc_imp_rf$importance$PRAD,decreasing = T),])[1:20]
plot(roc_imp_rf, top=20)
Based on the ROC metric, which measures feature importance for the SVM model, 108 genes share the maximum importance of 1. The plot below shows a random subset of 20 of these genes.
roc_imp_svm <- varImp(svm, scale = FALSE)
rownames(roc_imp_svm$importance)=sapply(strsplit(rownames(roc_imp_svm$importance),'_'),`[`,2)
rownames(roc_imp_svm$importance)=cancer_genes$symbol[match(rownames(roc_imp_svm$importance),cancer_genes$entrez)]
# despite its name, this vector contains all 108 genes with importance = 1
top_20_svm_TT = rownames(roc_imp_svm$importance[order(roc_imp_svm$importance$PRAD,decreasing = T),])[1:sum(roc_imp_svm$importance$PRAD==1)]
plot(roc_imp_svm, top=20)
We finally intersect the top 20 important genes of the Random Forest with the 108 genes with importance = 1 in the SVM. Thirteen genes are identified by both models, suggesting a role in discriminating PRAD and LUAD samples.
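The intersection can be computed as:
# Genes identified as important by both models
intersect(top_20_rf_TT,top_20_svm_TT)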
## [1] "SSX4" "H3F3A" "TCF12" "B2M" "PRDM1" "TRIM27" "KDM5A" "FAT1"
## [9] "RGPD3" "MAPK1" "IKZF1" "NUTM2A" "DGCR8"