Part 1: Simple/Multiple Linear/Polynomial Regression
1. Download Regression_Dset.csv and use Feature1 in the dataset as the independent/predictor variable x, and let Feature4 be the dependent/target variable y.
(a) Run simple linear regression to predict y from x. Report the linear model you found. Predict the value of y for new x values 0.3, 0.5, and 0.8.
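Part (a) might be sketched as follows. Since Regression_Dset.csv is not reproduced here, synthetic data stands in for Feature1 and Feature4; the generating coefficients (slope 2, intercept 1) and noise level are illustrative assumptions, not properties of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for Feature1 (x) and Feature4 (y); with the real
# dataset, these would be read from Regression_Dset.csv instead.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.1, 100)

# Fit and report the linear model y = coef * x + intercept
model = LinearRegression().fit(x, y)
print(f"y = {model.coef_[0]:.3f} * x + {model.intercept_:.3f}")

# Predictions for the new x values required by the assignment
preds = model.predict(np.array([[0.3], [0.5], [0.8]]))
print(preds)
```

With real data the reported coefficients would of course differ; only the workflow (fit, report, predict at 0.3, 0.5, 0.8) carries over.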
(b) Use cross-validation to estimate the generalization error. Describe your methodology.
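One common methodology for part (b) is k-fold cross-validation; the sketch below assumes 5 folds and mean squared error as the error measure (both are conventional choices, not requirements of the assignment), again on synthetic stand-in data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (the real task would use the CSV features)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.1, 100)

# cross_val_score returns negated MSE (higher is better), so flip the sign
scores = cross_val_score(LinearRegression(), x, y,
                         cv=5, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(f"Estimated generalization MSE: {cv_mse:.4f}")
```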
(c) On the same data, run polynomial regression for p = 2, 3, 4, and 5. Report the polynomial model for each degree. With each of these models, predict the value of y for x values 0.3, 0.5, and 0.8.
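Polynomial regression as in part (c) is typically built as a pipeline of PolynomialFeatures and LinearRegression; the quadratic ground truth below is an assumption used only to have something to fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with genuine curvature
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(100, 1))
y = 1.0 + 2.0 * x.ravel() - 3.0 * x.ravel() ** 2 + rng.normal(0, 0.1, 100)

new_x = np.array([[0.3], [0.5], [0.8]])
predictions = {}
for p in (2, 3, 4, 5):
    # Expand x into [1, x, x^2, ..., x^p] and fit an ordinary linear model
    poly = make_pipeline(PolynomialFeatures(degree=p), LinearRegression())
    poly.fit(x, y)
    predictions[p] = poly.predict(new_x)
    print(f"p={p}: predictions at 0.3, 0.5, 0.8 -> {predictions[p]}")
```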
(d) Cross-validate to choose the best model. Describe how you did this.
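For part (d), one way to select among the candidate models is to compare their cross-validated MSE and keep the degree with the lowest score. The degree grid and 5-fold split below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data; the true relationship here is quadratic
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(200, 1))
y = 1.0 + 2.0 * x.ravel() - 3.0 * x.ravel() ** 2 + rng.normal(0, 0.1, 200)

cv_mse = {}
for p in (1, 2, 3, 4, 5):
    poly = make_pipeline(PolynomialFeatures(degree=p), LinearRegression())
    scores = cross_val_score(poly, x, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[p] = -scores.mean()

# Best model = degree with the lowest cross-validated MSE
best_p = min(cv_mse, key=cv_mse.get)
print("CV MSE per degree:", cv_mse, "-> best degree:", best_p)
```

On this synthetic data the linear model (p = 1) underfits the curvature, so a degree of at least 2 should win; with the real dataset the winning degree may differ.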
2. Now use Features 1, 2, and 3 of the dataset as predictors and let Feature4 be the dependent variable y.
(a) Run linear regression and predict the value of y for the new (x1, x2, x3) values (0.3, 0.4, 0.1), (0.5, 0.2, 0.4), and (0.8, 0.2, 0.7).
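Multiple linear regression follows the same pattern as the single-predictor case, with a design matrix of three columns. The synthetic coefficients below are stand-ins for whatever the real Features 1-3 would produce.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for Features 1-3 (columns of X) and Feature4 (y)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(0, 0.1, 200))

model = LinearRegression().fit(X, y)

# The three new (x1, x2, x3) points required by the assignment
new_points = np.array([[0.3, 0.4, 0.1],
                       [0.5, 0.2, 0.4],
                       [0.8, 0.2, 0.7]])
preds = model.predict(new_points)
print(preds)
```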
(b) Apply polynomial regression with regularization to avoid over-fitting. Use cross-validation to choose the best hyperparameter value (e.g., alpha) to build your model.
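A common way to realize part (b) is ridge regression on polynomial features, with the regularization strength alpha chosen by grid-search cross-validation. The degree-2 expansion and the alpha grid below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data for Features 1-3 and Feature4
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(200, 3))
y = (1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(0, 0.1, 200))

# Polynomial expansion followed by L2-regularized linear regression;
# GridSearchCV cross-validates every alpha and keeps the best one
pipe = make_pipeline(PolynomialFeatures(degree=2), Ridge())
grid = GridSearchCV(pipe, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print("best alpha:", grid.best_params_, "CV MSE:", -grid.best_score_)
```

The cross-validated MSE reported here is directly comparable to the one from part (a), which is what part (c) asks you to do.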
(c) Compare the evaluations of your models from (a) and (b) and assess which one minimizes the generalization error. Describe your findings.
Part 2: Classification
Use the Ass3_Classification.ipynb program, which loads the cancer dataset, extracts the predictor and target features, and prepares them as x_data and y_data, respectively.
Analyze the extracted data and train various classifiers using the following algorithms: a) KNN for k = 4, 6, 10, and 50; b) SVM with 'rbf' and 'linear' kernel functions; c) decision trees with depths 2, 3, 4, and 10; and d) logistic regression. Evaluate your models in each supervised-learning family using different metrics (e.g., Jaccard index, F1-score, log loss) where they are applicable, justify why you chose these metrics, contrast the models, summarize your findings, and finally suggest the best model among all of them.
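The comparison above might be organized as in the sketch below. Since Ass3_Classification.ipynb is not reproduced here, scikit-learn's built-in breast-cancer dataset stands in for its x_data/y_data; the 70/30 split and feature scaling are assumptions (scaling matters because KNN and SVM are distance-based). Jaccard and F1 are computed for every model; log loss would additionally apply to probabilistic models such as logistic regression.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the notebook's x_data / y_data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize features: KNN and SVM are sensitive to feature scale
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# All eleven models the assignment lists
models = {f"KNN(k={k})": KNeighborsClassifier(n_neighbors=k)
          for k in (4, 6, 10, 50)}
models["SVM(rbf)"] = SVC(kernel="rbf")
models["SVM(linear)"] = SVC(kernel="linear")
models.update({f"Tree(depth={d})":
               DecisionTreeClassifier(max_depth=d, random_state=0)
               for d in (2, 3, 4, 10)})
models["LogReg"] = LogisticRegression(max_iter=5000)

results = {}
for name, clf in models.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = (jaccard_score(y_test, y_pred),
                     f1_score(y_test, y_pred))
    print(f"{name:16s} Jaccard={results[name][0]:.3f} "
          f"F1={results[name][1]:.3f}")
```

From a table like this you can contrast the families (e.g., how KNN degrades at k = 50, or how a depth-10 tree overfits relative to depth 3) and justify a final recommendation.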
Part 3: Clustering
Download digitDset_train.csv and digitDset_test.csv, which represent the Optical Recognition of Handwritten Digits training and testing datasets, respectively. Both datasets consist of 64 independent features plus one feature for the class code. All independent features are integers in the range 0 to 16, while the last feature ranges between 0 and 9 to represent the class code/label.
Apply various clustering algorithms. You are expected to build k-Means, hierarchical, and density-based clustering models.
For the k-Means model, find a suitable value for k (3, 10, or 30). Recall that the k-Means algorithm depends on the centers of the clusters. Run the algorithm for each k with different initial centroids. For each case, choose the model with the best fit (e.g., the lowest mean distance to the centroids).
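The multiple-initialization step is built into scikit-learn's KMeans: n_init restarts the algorithm with different initial centroids and keeps the run with the lowest inertia (the sum of squared distances to the centroids). The sketch below uses scikit-learn's bundled digits data as a stand-in for digitDset_train.csv; n_init=10 is an assumed choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Stand-in for digitDset_train.csv: 64 pixel features, labels 0-9
X, y = load_digits(return_X_y=True)

best_inertia = {}
for k in (3, 10, 30):
    # n_init=10: ten runs with different initial centroids; the fit
    # that achieves the lowest inertia is kept automatically
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    best_inertia[k] = km.inertia_
    print(f"k={k:2d}  inertia={km.inertia_:.0f}")
```

Note that inertia always decreases as k grows, so it can compare initializations at a fixed k but not choose k by itself; selecting among k = 3, 10, 30 needs an external criterion such as the evaluation metric in the next step.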
Evaluate your models using one metric chosen from here, contrast them, and conclude which model you recommend as the most accurate.
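Since the digit labels are available, one possible external metric is the adjusted Rand index, which compares each clustering against the true classes; it is an illustrative choice here, as the assignment links its own list of metrics. The DBSCAN parameters below are rough assumptions and would need tuning on the real data.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

# Stand-in for the digits CSVs
X, y = load_digits(return_X_y=True)

# One model per required family
labels = {
    "k-Means": KMeans(n_clusters=10, n_init=10,
                      random_state=0).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=10).fit_predict(X),
    "DBSCAN": DBSCAN(eps=20, min_samples=5).fit_predict(X),
}

# ARI = 1 for a perfect match with the true labels, ~0 for random
ari = {name: adjusted_rand_score(y, lab) for name, lab in labels.items()}
for name, score in ari.items():
    print(f"{name:12s} ARI={score:.3f}")
```

The model with the highest ARI is the natural recommendation; on the real test set the same comparison should be repeated before concluding.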