In short: First, use unsupervised learning techniques (K-means and hierarchical clustering) to determine the clusters. Next, use random forest classification to predict the cluster number and calculate the accuracy. Last, write K-means clustering from scratch in Python [no packages can be used for this part]. Compare the results.
Long version:
First, apply unsupervised learning techniques to dataset1 (csv). Use K-means to determine the number of clusters: tabulate the total within-cluster variance for each number of clusters from 1 to 40, and plot the scree plot. Using hierarchical clustering, calculate the pairwise distances and create dendrograms using complete and average linkage. Cut each dendrogram into 5 to 7 groups. Which is the most appropriate number of groups?
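The step above could be sketched roughly as follows. This is only an illustration, assuming scikit-learn, SciPy and NumPy are acceptable for this part; the helper names (scree_table, cut_dendrograms, plot_scree) and the synthetic data standing in for dataset1 are hypothetical, and scikit-learn's inertia_ (total within-cluster sum of squares) is used as the "total within-cluster variance".

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans


def scree_table(X, k_max=40, seed=0):
    """Return [(k, total within-cluster sum of squares)] for k = 1..k_max."""
    rows = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append((k, float(model.inertia_)))
    return rows


def cut_dendrograms(X, group_counts=(5, 6, 7)):
    """Cut complete- and average-linkage trees into the requested group counts."""
    distances = pdist(X, metric="euclidean")  # condensed pairwise distances
    labels = {}
    for method in ("complete", "average"):
        tree = linkage(distances, method=method)
        for g in group_counts:
            # maxclust: flat clustering with at most g groups
            labels[(method, g)] = fcluster(tree, t=g, criterion="maxclust")
    return labels


def plot_scree(table):
    """Scree plot; needs matplotlib (assumed installed). Not called below."""
    import matplotlib.pyplot as plt

    ks, wss = zip(*table)
    plt.plot(ks, wss, marker="o")
    plt.xlabel("number of clusters")
    plt.ylabel("total within-cluster sum of squares")
    plt.show()


# Synthetic stand-in for dataset1 (hypothetical): three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

table = scree_table(X, k_max=10)  # use k_max=40 on the real data
cuts = cut_dendrograms(X)
```

With the real data, X would come from the csv (for example via pandas.read_csv), and scipy.cluster.hierarchy.dendrogram can draw each linkage tree. The "elbow" of the scree plot and the group sizes in each cut are what one would compare when choosing the number of groups.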
Next, do the same exercise as above, except this time use random forest classification: predict the cluster number and calculate the accuracy.
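One possible reading of this step, sketched under assumptions: the cluster labels from K-means are treated as the target, a random forest is trained on a held-out split, and accuracy is measured on the test part. The function name forest_accuracy, the 70/30 split, the choice of 3 clusters and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def forest_accuracy(X, labels, seed=0):
    """Train a random forest to predict cluster labels; return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


# Synthetic stand-in for dataset1 (hypothetical): three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Cluster labels to be predicted; k=3 is an assumption for this toy data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
acc = forest_accuracy(X, labels)
print(f"random forest accuracy on held-out rows: {acc:.3f}")
```

The same function could be called with the labels from a hierarchical cut, so the accuracy for each clustering can be compared.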
Last, using dataset2 (csv), write a function (without using any K-means-related packages) that calculates the within-cluster variance, aggregates the data by cluster number, and plots “total within-cluster variance vs. number of clusters”. Add the cluster number as the last column of the dataset, predict the cluster number using logistic regression, and calculate the accuracy of your model.
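A minimal from-scratch sketch of the K-means part, using only the Python standard library. It assumes Lloyd's algorithm with random initial centroids, and it uses the total within-cluster sum of squares as the "within-cluster variance"; the function names and the synthetic rows standing in for dataset2 are hypothetical. Plotting and the logistic regression step are left out here, since they would need packages.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Column-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(rows, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, total within-cluster SS)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: aggregate rows by cluster number and take the mean.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = mean_point(members)
    total_wss = sum(
        squared_distance(row, centroids[lab]) for row, lab in zip(rows, labels)
    )
    return labels, centroids, total_wss


def wss_by_k(rows, k_max, n_init=5):
    """Best total within-cluster SS over n_init restarts, for k = 1..k_max."""
    return [
        (k, min(kmeans(rows, k, seed=s)[2] for s in range(n_init)))
        for k in range(1, k_max + 1)
    ]


# Synthetic stand-in for dataset2 (hypothetical): three blobs of 20 points.
gen = random.Random(1)
data = [
    [cx + gen.gauss(0, 0.5), cy + gen.gauss(0, 0.5)]
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for _ in range(20)
]

labels, centroids, wss = kmeans(data, 3)
with_cluster = [row + [lab] for row, lab in zip(data, labels)]  # cluster as last column
wss_table = wss_by_k(data, 6)
```

The (k, value) pairs in wss_table are what the "total within-cluster variance vs. number of clusters" plot would show, and with_cluster has the cluster number as its last column, ready to be used as the target for the logistic regression step.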
Briefly compare results from each method.