Install Bioconductor packages. More details [here](https://www.bioconductor.org/install/)

``` source("https://bioconductor.org/biocLite.R") biocLite() ``` Install factoextra ``` biocLite("factoextra") ``` Load factoextra ``` library(factoextra) ``` For more info about the eclust() function, check [this](https://www.rdocumentation.org/packages/factoextra/versions/1.0.5/topics/eclust) or type ``` help(eclust) eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...) ``` Now let's try to cluster the data with Kmeans and then use hierarchical clustering ``` eclust(F_iris, FUNcluster = "kmeans", hc_metric = "euclidean") #number of clusters not specified, will be detected automatically # K-means clustering giving number of clusters= 3 km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE) # Visualize k-means clusters fviz_cluster(km.res, geom = "point", ellipse.type = "norm", palette = "jco", ggtheme = theme_minimal()) ``` ``` hc.res <- eclust(F_iris, "hclust", k = 3, hc_metric = "euclidean", hc_method = "ward.D2", graph = FALSE) # Visualize dendrograms fviz_dend(hc.res, show_labels = FALSE, palette = "jco", as.ggplot = TRUE) ``` Let's check the clusters quality, remember we talked this morning about clustering internal validation and the use of coefficients and indexes to measure clusters separability and intra clusters homogenity. We will compute the silhouette coefficient for the resulting K-means clustering ``` fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic()) ``` A value close to 1 indicates good quality clustering (element assigned to the right cluster)

A value close to -1 indicates poor quality clustering (element would better fit in another cluster)

Now try to vary the number of clusters, run Kmeans and compute the the silhouette coefficient. What did you notice?

Many other coefficients could be computed to assess the clustering quality, some are available using the cluster.stats() function of the fpc package. You might need to install the fpc package first. ``` biocLite("fpc") library(fpc) kmIris_stats <- cluster.stats(dist(F_iris), km.res$cluster) kmIris_stats$dunn # if you wanna see only the dunn index value ``` Remember, iris dataset is labelled and we removed labels before performing clustering. You can perform a cross-tabulation between k-means clusters and the reference Species labels using table. ``` table(iris$Species, km.res$cluster) ``` The nbclust() function of the fviz package give an estimate of the optimal number of clusters, let's try to estimate the number of clusters for the iris dataset based on the use of k-means clustering: ``` fviz_nbclust(F_iris, kmeans,method = "gap_stat") ``` Take few minutes to check the different functions of the [Cluster package](https://cran.r-project.org/web/packages/cluster/cluster.pdf)

. ## Let's try some supervised classification on the IRIS dataset Remeber the iris dataset is a labelled one (the classes are known) but we removed labels in the previous exercice to perform clustering.

First We will need the caret package to be installed, ``` install.packages("caret", dependencies=c("Depends", "Suggests")) #installation of the caret packages and all packages that you might need library(caret) ``` Let's load the dataset from the CSV file as follows (useful when working with non R Built-in datasets):

``` wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data #the file iris.data will be downloaded in your current working directory, make sure you either specify the path while reading the file with R or copy it to your default R working directory #backtoR getwd()#check the default working directory setwd(path) #to change the working directory irisfile <- "iris.data" # load the CSV file from the local directory irisData <- read.csv(irisfile, header=FALSE) # set the column names in the dataset colnames(irisData) <- c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species") ``` All the iris data is labelled, to allow testing the model, let's create a validation (test) sub-dataset. We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset. ``` # create a list of 80% of the rows in the original dataset we can use for training validation_index <- createDataPartition(irisData$Species, p=0.80, list=FALSE) # select 20% of the data for validation validation <- irisData[-validation_index,] # use the remaining 80% of data to training and testing the models irisData <- irisData[validation_index,] ``` You now have training data in the irisData variable and a validation set we will use later in the validation variable.

Let's have a look on the data before starting building a model. ``` # you can check the dimensions of dataset dim(irisData) # list types for each attribute sapply(irisData, class) # all the variables are numeric and the class value is a factor that has multiple class labels or levels # list the levels for the class levels(irisData$Species) head(irisData) #if you want to check the first 5 lines of the dataset tail(irisData) #if you want to check the last 5 lines of the dataset ``` ``` # summarize the class distribution percentage <- prop.table(table(irisData$Species)) * 100 cbind(freq=table(irisData$Species), percentage=percentage) # summarize attribute distributions summary(irisData) ``` ## Build models Let's evaluate 5 different algorithms: - Linear Discriminant Analysis (LDA) - Classification and Regression Trees (CART) - k-Nearest Neighbors (kNN) - Support Vector Machines (SVM) with a linear kernel - Random Forest (RF) This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before reach run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable. Let's build the five models on the training set:

The train function of the caret package has many parameters that have to be defined to perform the training and build a model. The different parameters to be used vary from one algorithm to another. Check this [link](https://topepo.github.io/caret/model-training-and-tuning.html) for more details about different parameters that can be used. ``` # a) linear algorithms set.seed(7) fit.lda <- train(Species~., data=irisData, method="lda", metric=metric, trControl=control) # b) nonlinear algorithms # CART set.seed(7) fit.cart <- train(Species~., data=irisData, method="rpart", metric=metric, trControl=control) # kNN set.seed(7) fit.knn <- train(Species~., data=irisData, method="knn", metric=metric, trControl=control) #For the knn training, define a grid using expand.grid() function and change the tuneGrid parameter value #example: expand.grid(k=c(5,7,10,15,19,21) #fit.knn <- train(Species~., data=irisData, method="knn", metric=accuracy, trControl=fitcontrol,tuneGrid=grid) # c) advanced algorithms # SVM set.seed(7) fit.svm <- train(Species~., data=irisData, method="svmRadial", metric=metric, trControl=control) # Random Forest set.seed(7) fit.rf <- train(Species~., data=irisData, method="rf", metric=metric, trControl=control) ``` After building the 5 different models, it's time to assess the accuracy of the models ``` results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf)) summary(results) # compare accuracy of models by plotting results dotplot(results) ``` Note that The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). As you can see from the plots, lda was the best model ``` # summarize Best Model print(fit.lda) ``` After training the different models on the training set, the test set (validation) can be used for a second round of checks of the accuracy of the model ``` # estimate skill of LDA on the validation dataset predictions <- predict(fit.lda, validation) confusionMatrix(predictions, validation$Species) ``` Very good accuracy obtained on this validation set (that represents 20% of the actual size of the dataset). ## Some interesting readings: application of ML in Bioinformatics [A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4659-0)

[Next-Generation Machine Learning for Biological Networks](https://www.ncbi.nlm.nih.gov/pubmed/29887378)

[Machine learning for metagenomics: methods and tools](https://arxiv.org/pdf/1510.06621.pdf)

## Some useful R/Bioconductor packages to handle Bioinformatics files and use ML algorithms [Cluster package](https://cran.r-project.org/web/packages/cluster/cluster.pdf)

[vcfR, a package to read, manipulate and analyze VCF files](https://knausb.github.io/vcfR_documentation/index.html)

[Practical Guide To Cluster Analysis in R](https://goo.gl/13EFCZ)

**Sources used** - [http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation](http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/#data-preparation) - [https://machinelearningmastery.com/machine-learning-in-r-step-by-step/](https://machinelearningmastery.com/machine-learning-in-r-step-by-step/)