# Day 3: Machine Learning and Data Mining in Bioinformatics

## Overview

### Lead Instructor
- [Amel Ghouila](https://github.com/amelgh) | [@AmelGhouila](https://twitter.com/AmelGhouila) | [email](mailto:amel.ghouila@gmail.com)

### Co-Instructor(s)
- [Fotis Psomopoulos](https://fpsom.github.io/) | [@fopsom](https://twitter.com/fopsom) | [email](mailto:fopsom@gmail.com)

### Helper(s)
- Maria Tsagiopoulou | [email](mailto:mariatsayo@gmail.com)
- Ola Karrar | [@ZeinOla](https://twitter.com/ZeinOla) | [email](mailto:ozkarrar2@gmail.com)

### General Topics

- Introduction to Machine Learning and Data Mining
- Machine Learning basic concepts
- Taxonomy of ML algorithms
  - Supervised classification
  - Unsupervised classification
  - Semi-supervised classification
- Clustering validation and cross-validation
- Applications of ML in Bioinformatics
  - Examples of different ML/DM techniques that can be applied to different NGS data analysis pipelines
  - How to choose the right ML technique?
- Quick overview of deep learning
- Exploring some examples with R ML packages

## Schedule

- _**09:00 - 10:00**_ Introduction to DM and ML, Machine Learning basic concepts
- _**10:00 - 11:00**_ Taxonomy of ML and examples of algorithms
  - Supervised classification
  - Unsupervised classification
  - Semi-supervised classification
- _**11:00 - 11:30**_ _**Coffee break**_
- _**11:30 - 12:30**_ Applications of ML in Bioinformatics
  - Examples of different ML/DM techniques that can be applied to different NGS data analysis pipelines
  - How to choose the right ML technique?
  - Deep learning overview
- _**12:30 - 14:00**_ _**Lunch break**_
- _**14:00 - 15:00**_ We will start practicing using the built-in R [iris data set](https://archive.ics.uci.edu/ml/datasets/iris)
- _**15:00 - 16:00**_ RNA-Seq analysis using clustering in R - [Example we will be using during the session](http://www.2-bitbio.com/2017/04/clustering-rnaseq-data-making-heatmaps.html)
- _**16:00 - 16:15**_ _**Coffee break**_
- _**16:15 - 18:00**_ RNA-Seq analysis in R, continued - [Differential expression analysis using R](https://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html). Download the mouse mammary RNA-Seq data (counts) files from [this public repo](https://figshare.com/s/1d788fd384d33e913a2a)

## Learning Objectives

- Learn Machine Learning basic concepts and jargon
- Understand the taxonomy of Machine Learning algorithms and the differences between the basic categories of algorithms
- Get familiar with the basic Machine Learning algorithms in the supervised and unsupervised learning categories
- Understand the different parameters to take into consideration when choosing the right Machine Learning technique for a given problem
- Understand how to evaluate Machine Learning results in supervised and unsupervised classification
- Learn about some applications of Machine Learning in Bioinformatics
- Explore and apply some basic R packages to perform supervised and unsupervised classification

## Introduction to Machine Learning

The **slides** of the talk are available [**here**](https://raw.githubusercontent.com/fpsom/CODATA-RDA-Advanced-Bioinformatics-2018/master/files/slides/Introduction_to_ML_AdvancedBioinfoCourse_Trieste2018_AG.pdf).
## Let's start doing some clustering on the iris dataset

To list all of the R built-in data sets, you can use the command:

```
data()  # displays all the available default datasets
```

```
iris  # inspect the dataset
```

Preprocess the data: exclude the "Species" column at position 5, then standardize.

```
F_iris <- iris[, -5]     # remove column 5 (the class labels)
F_iris <- scale(F_iris)  # standardize: scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix
```

Let's cluster the data using the function eclust() (enhanced clustering, in factoextra). You will need to install the factoextra package first.
Install Bioconductor packages. More details [here](https://www.bioconductor.org/install/)
``` source("https://bioconductor.org/biocLite.R") biocLite() ``` Install factoextra ``` biocLite("factoextra") ``` Load factoextra ``` library(factoextra) ``` For more info about the eclust() function, check [this](https://www.rdocumentation.org/packages/factoextra/versions/1.0.5/topics/eclust) or type ``` help(eclust) eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...) ``` Now let's try to cluster the data with Kmeans and then use hierarchical clustering ``` eclust(F_iris, FUNcluster = "kmeans", hc_metric = "euclidean") #number of clusters not specified, will be detected automatically # K-means clustering giving number of clusters= 3 km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE) # Visualize k-means clusters fviz_cluster(km.res, geom = "point", ellipse.type = "norm", palette = "jco", ggtheme = theme_minimal()) ``` ``` hc.res <- eclust(F_iris, "hclust", k = 3, hc_metric = "euclidean", hc_method = "ward.D2", graph = FALSE) # Visualize dendrograms fviz_dend(hc.res, show_labels = FALSE, palette = "jco", as.ggplot = TRUE) ``` Let's check the clusters quality, remember we talked this morning about clustering internal validation and the use of coefficients and indexes to measure clusters separability and intra clusters homogenity. We will compute the silhouette coefficient for the resulting K-means clustering ``` fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic()) ``` A value close to 1 indicates good quality clustering (element assigned to the right cluster)
A value close to -1 indicates poor quality clustering (element would better fit in another cluster)
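If you want a single summary number, the silhouette information is also stored in the object returned by eclust(). A minimal sketch, assuming the km.res object created above:

```
# the eclust() result stores the silhouette details in silinfo;
# the average silhouette width summarizes the whole clustering in one number
km.res$silinfo$avg.width

# per-cluster average silhouette widths
km.res$silinfo$clus.avg.widths
```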
Now try to vary the number of clusters, run k-means and compute the silhouette coefficient for each. What do you notice? One way to script this comparison is sketched below.
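As a starting point, here is a minimal sketch (assuming F_iris and the factoextra package loaded above) that loops over a few values of k and prints the average silhouette width for each; by this criterion, a larger value suggests a better choice of k:

```
# compare the average silhouette width for several choices of k
for (k in 2:6) {
  res <- eclust(F_iris, "kmeans", k = k, nstart = 25, graph = FALSE)
  cat("k =", k, "-> average silhouette width =", res$silinfo$avg.width, "\n")
}
```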
Many other coefficients can be computed to assess clustering quality; some are available through the cluster.stats() function of the fpc package. You might need to install the fpc package first.

```
biocLite("fpc")
library(fpc)

kmIris_stats <- cluster.stats(dist(F_iris), km.res$cluster)
kmIris_stats$dunn  # if you want to see only the Dunn index value
```

Remember, the iris dataset is labelled and we removed the labels before performing the clustering. You can perform a cross-tabulation between the k-means clusters and the reference Species labels using table():

```
table(iris$Species, km.res$cluster)
```

The fviz_nbclust() function of the factoextra package gives an estimate of the optimal number of clusters. Let's try to estimate the number of clusters for the iris dataset based on k-means clustering:

```
fviz_nbclust(F_iris, kmeans, method = "gap_stat")
```

Take a few minutes to check the different functions of the [cluster package](https://cran.r-project.org/web/packages/cluster/cluster.pdf), for example the pam() sketch below.
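As one example from the cluster package (which ships with R), here is a minimal sketch using pam(), the k-medoids relative of k-means, on the same scaled data; pam() is just an illustrative pick among the package's many functions:

```
library(cluster)

# PAM (partitioning around medoids) with 3 clusters on the scaled iris data
pam.res <- pam(F_iris, k = 3)

# the medoids are actual observations chosen as cluster centres
pam.res$medoids

# pam() also reports silhouette information, comparable to the k-means results above
pam.res$silinfo$avg.width
```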
## Let's try some supervised classification on the iris dataset

Remember, the iris dataset is a labelled one (the classes are known), but we removed the labels in the previous exercise to perform clustering.
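As a quick reminder of what those labels look like, you can tabulate the Species column of the full built-in dataset:

```
# the built-in iris data has 50 observations for each of the three species
table(iris$Species)
```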
First, we will need the caret package to be installed:

```
install.packages("caret", dependencies=c("Depends", "Suggests"))  # installs caret and all the packages you might need
library(caret)
```

Let's load the dataset from its CSV file as follows (useful when working with non-built-in datasets):
In a shell, download the file first:

```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
# the file iris.data will be downloaded into your current working directory;
# make sure you either specify its path when reading the file with R,
# or copy it to your default R working directory
```

Back in R:

```
getwd()       # check the default working directory
setwd(path)   # to change the working directory, replace path with the directory that holds iris.data

# load the CSV file from the local directory
irisfile <- "iris.data"
irisData <- read.csv(irisfile, header=FALSE)

# set the column names in the dataset
colnames(irisData) <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")
```

All the iris data is labelled. To allow testing the model, let's create a validation (test) sub-dataset: we will split the loaded dataset in two, using 80% to train our models and holding back 20% as a validation dataset.

```
# create a list of 80% of the rows in the original dataset that we can use for training
validation_index <- createDataPartition(irisData$Species, p=0.80, list=FALSE)

# select 20% of the data for validation
validation <- irisData[-validation_index,]

# use the remaining 80% of the data for training and testing the models
irisData <- irisData[validation_index,]
```

You now have the training data in the irisData variable, and a validation set, which we will use later, in the validation variable.
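As a quick sanity check (a minimal sketch using only the variables created above), you can confirm the sizes of the two subsets and that the stratified split preserved the class proportions:

```
# createDataPartition() samples within each class, so the proportions should match
nrow(irisData)    # ~120 rows (80%)
nrow(validation)  # ~30 rows (20%)

prop.table(table(irisData$Species))
prop.table(table(validation$Species))
```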
Let's have a look at the data before starting to build a model.

```
# check the dimensions of the dataset
dim(irisData)

# list the type of each attribute
sapply(irisData, class)  # all the variables are numeric, and the class value is a factor with multiple class labels, or levels

# list the levels of the class
levels(irisData$Species)

head(irisData)  # check the first six rows of the dataset
tail(irisData)  # check the last six rows of the dataset
```

```
# summarize the class distribution
percentage <- prop.table(table(irisData$Species)) * 100
cbind(freq=table(irisData$Species), percentage=percentage)

# summarize the attribute distributions
summary(irisData)
```

## Build models

Let's evaluate 5 different algorithms:

- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- k-Nearest Neighbors (kNN)
- Support Vector Machines (SVM) with a radial kernel
- Random Forest (RF)

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits; this makes the results directly comparable. Let's build the five models on the training set:
The train() function of the caret package has many parameters that have to be defined to perform the training and build a model; the relevant parameters vary from one algorithm to another. Check this [link](https://topepo.github.io/caret/model-training-and-tuning.html) for more details about the different parameters that can be used.

```
# set up the test harness shared by every model below:
# 10-fold cross-validation, with accuracy as the evaluation metric
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

# a) linear algorithms
set.seed(7)
fit.lda <- train(Species~., data=irisData, method="lda", metric=metric, trControl=control)

# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=irisData, method="rpart", metric=metric, trControl=control)

# kNN
set.seed(7)
fit.knn <- train(Species~., data=irisData, method="knn", metric=metric, trControl=control)
# for the kNN training, you can define a grid using the expand.grid() function and pass it as the tuneGrid parameter
# example: grid <- expand.grid(k=c(5,7,10,15,19,21))
# fit.knn <- train(Species~., data=irisData, method="knn", metric=metric, trControl=control, tuneGrid=grid)

# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=irisData, method="svmRadial", metric=metric, trControl=control)

# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=irisData, method="rf", metric=metric, trControl=control)
```

After building the 5 different models, it's time to assess their accuracy:

```
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)

# compare the accuracy of the models by plotting the results
dotplot(results)
```

Note that the Kappa statistic (or value) is a metric that compares the observed accuracy with the expected accuracy (random chance). As you can see from the plots, LDA was the best model.

```
# summarize the best model
print(fit.lda)
```

After training the different models on the training set, the test set (validation) can be used for a second check of the accuracy of the model:

```
# estimate the skill of LDA on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
```

Very good accuracy is obtained on this validation set (which represents 20% of the original dataset).

## Some interesting readings: applications of ML in Bioinformatics

[A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4659-0)
[Next-Generation Machine Learning for Biological Networks](https://www.ncbi.nlm.nih.gov/pubmed/29887378)
[Machine learning for metagenomics: methods and tools](https://arxiv.org/pdf/1510.06621.pdf)
## Some useful R/Bioconductor packages to handle Bioinformatics files and use ML algorithms

[cluster package](https://cran.r-project.org/web/packages/cluster/cluster.pdf)
[vcfR, a package to read, manipulate and analyze VCF files](https://knausb.github.io/vcfR_documentation/index.html) (see the minimal sketch after this list)
[Practical Guide To Cluster Analysis in R](https://goo.gl/13EFCZ)
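For instance, a minimal vcfR sketch (example.vcf is a hypothetical file name; substitute your own VCF file):

```
library(vcfR)

# read a VCF file into a vcfR object (example.vcf is a placeholder)
vcf <- read.vcfR("example.vcf")

# overview of the fixed fields (CHROM, POS, REF, ALT, ...)
head(getFIX(vcf))

# genotype matrix: one row per variant, one column per sample
gt <- extract.gt(vcf)
dim(gt)
```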
**Sources used**

- [http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation](http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation)
- [https://machinelearningmastery.com/machine-learning-in-r-step-by-step/](https://machinelearningmastery.com/machine-learning-in-r-step-by-step/)