K - Means Clustering

K - Means Clustering

Speaking of machine learning algorithms, we implemented K-Means clustering algorithm two years ago as a part of Data Mining course (CS-555). The purpose was to given a dataset, cluster it into two or more parts and then optimize the algorithm for number of clusters.

We used data from UCI Machine learning repository. This was very small dataset containing only 699 records. Ignoring some redundancy, each record had 10 attributes. Based on it every record was categorized into two classes. For example, every records had several attributes such as Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses etc. Based on these attributes every sample was categorized either as benign or malignant cancer category.

As per aim of the project, we applied K-Means clustering algorithm to create 2 partitions of data and then computing accuracy based on the membership of every record.

The details of final result are as follows :

Data analysis observed in final data :

  • Final PPV value was slightly improved when tuples with unknown values were removed instead of
    replacing unknown values with 0 or fixed constant

  • Even though above results reflect that, eliminating unknowns give better results than replacing them
    with fixed constant value, result depends upon K centroids chosen at the beginning of clustering. Depending upon result of random function, every time there is slight change in final PPV value for each

  • It was found that for lesser number of classifiers (only two of them), PPV value decreases with increase
    in number K.In addition to it, algorithm takes more number of iteration to converge to final solution
    with increase in number of clusters.

Result is shown below.

RunNumber of clusters Iterations before convergence PPV value Time(Seconds)
1 2 4 0.953722 1.09
2 10 6 0.950336 1.736
3 20 12 0.931893 2.264
4 50 14 0.883532 5.565
5 100 16 0.833298 8.764

However above numbers are specific to choice of initial centroids thus chosen. However, when experiment is repeated with multiple runs, result seemed to be following the above mentioned rule.

Time thus mentioned are average of 4 runs. However time is subjected to change each time depending upon initial choice of centroids and number of iterations to converge to solution.

  • In order to perform V- Fold cross validation, initial data was cleaned. Cleaning involved adjusting duplicates and removing tuples with unknown values as it causes skew in final distribution.After filtering, total number of records are 630

  • Divided given data in 10 equal sized data set

  • Total 10 iterations were performed on input data. Where each iteration consisted of 1 test data and remaining 9 as training data set

  • Each data set acted as test data set once and training data set remaining 9 times

  • At the end of each iteration, label of test dataset was compared against training data set. Whenever there is match we get PPV of 1 otherwise we take the PPV as 0

You may find the full source code along with project report Here. Don't forget to let me know your thoughts, critic and comment on the implementation and project report.