Make a multiclass classifier to predict wine quality with majority rules voting by
performing the following steps:
a) Using the data/winequality-white.csv and data/winequality-red.csv files, create a dataframe with the concatenated data and a column indicating
which wine type each row belongs to (red or white).
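Step a) can be sketched as follows. The two tiny frames here stand in for the CSVs, which you would load with pd.read_csv() (the UCI wine quality files are semicolon-delimited, so sep=';' is assumed):

```python
import pandas as pd

# Stand-ins for the real files; with the actual data you would use
# pd.read_csv('data/winequality-red.csv', sep=';') and likewise for white.
red = pd.DataFrame({'fixed acidity': [7.4, 7.8], 'quality': [5, 5]})
white = pd.DataFrame({'fixed acidity': [7.0, 6.3], 'quality': [6, 6]})

# Label each frame before concatenating so the wine type survives the merge.
wine = pd.concat(
    [red.assign(kind='red'), white.assign(kind='white')],
    ignore_index=True,
)
```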
b) Create training and test sets with 75% of the data in the training set. Stratify
on quality.
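A sketch of step b), using a small stand-in for the concatenated frame from step a):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the concatenated frame from step (a); the real `wine` frame
# holds every red and white row plus the wine type column.
wine = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.0, 9.5, 11.0, 10.2, 9.9, 10.4],
    'kind': ['red'] * 4 + ['white'] * 4,
    'quality': [5, 5, 6, 6, 5, 5, 6, 6],
})

X = wine.drop(columns='quality')
y = wine.quality

# 75% train / 25% test, stratified on the quality label so each split
# preserves the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
```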
c) Build a pipeline for each of the following models: random forest, gradient
boosting, k-NN, logistic regression, and Naive Bayes (GaussianNB). The
pipeline should use a ColumnTransformer object to standardize the numeric
data while one-hot encoding the wine type column (something like is_red and
is_white, each with binary values), and then build the model. Note that we will
discuss Naive Bayes in Chapter 11, Machine Learning Anomaly Detection.
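A sketch of step c). The numeric column list, the 'model' step name, and the small fit at the end are illustrative, not prescribed by the exercise; with the real data, numeric_cols would be every column of X_train except the wine type column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_wine_pipeline(model, numeric_cols):
    """Standardize the numeric columns, one-hot encode 'kind', then fit `model`."""
    preprocessor = ColumnTransformer([
        ('scale', StandardScaler(), numeric_cols),
        ('encode', OneHotEncoder(), ['kind']),  # yields kind_red / kind_white dummies
    ])
    return Pipeline([('pre', preprocessor), ('model', model)])

numeric_cols = ['alcohol']  # illustrative subset of the chemistry columns
pipelines = {
    'rf': make_wine_pipeline(RandomForestClassifier(random_state=0), numeric_cols),
    'gb': make_wine_pipeline(GradientBoostingClassifier(random_state=0), numeric_cols),
    'knn': make_wine_pipeline(KNeighborsClassifier(n_neighbors=3), numeric_cols),
    'lr': make_wine_pipeline(LogisticRegression(max_iter=1000), numeric_cols),
    'nb': make_wine_pipeline(GaussianNB(), numeric_cols),
}

# Quick check on stand-in data.
X = pd.DataFrame({'alcohol': [9.0, 9.5, 12.0, 12.5],
                  'kind': ['red', 'red', 'white', 'white']})
y = [5, 5, 7, 7]
preds = pipelines['nb'].fit(X, y).predict(X)
```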
d) Run grid search on each pipeline except Naive Bayes (just run fit() on it) with
scoring='f1_macro' on the search space of your choosing to find the best
values for the following:
i) Random forest: max_depth
ii) Gradient boosting: max_depth
iii) k-NN: n_neighbors
iv) Logistic regression: C
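The grid-search pattern in step d) might look like the following. The search spaces are hypothetical, and stand-in data from make_classification keeps the sketch runnable; with the pipelines from step c), whose final step is named 'model', parameters are addressed as 'model__<param>':

```python
from sklearn.datasets import make_classification  # stand-in data for the sketch
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical search spaces, one per pipeline (Naive Bayes is excluded and
# would just get fit() called on it).
param_grids = {
    'rf': {'model__max_depth': [4, 8, None]},
    'gb': {'model__max_depth': [2, 3, 4]},
    'knn': {'model__n_neighbors': [5, 10, 15]},
    'lr': {'model__C': [0.1, 1, 10]},
}

# Demo of the pattern on one pipeline with stand-in multiclass data.
X, y = make_classification(
    n_samples=100, n_classes=3, n_informative=4, random_state=0
)
pipe = Pipeline([('scale', StandardScaler()), ('model', KNeighborsClassifier())])
search = GridSearchCV(pipe, param_grids['knn'], scoring='f1_macro', cv=5)
search.fit(X, y)
```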
e) Find the level of agreement between each pair of models using the
cohen_kappa_score() function from the metrics module in
scikit-learn. Note that you can easily get all the pairwise
combinations using the combinations() function from the itertools
module in the Python standard library.
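Step e) can be sketched with hard-coded stand-in predictions; in the exercise, these would come from each fitted model's predict() call on the test set:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Stand-in predictions for the sketch; in the exercise, one entry per model.
predictions = {
    'rf':  [5, 6, 6, 7, 5],
    'gb':  [5, 6, 6, 7, 6],
    'knn': [5, 5, 6, 7, 5],
}

# combinations() yields each unordered pair of models exactly once.
for name_a, name_b in combinations(predictions, 2):
    kappa = cohen_kappa_score(predictions[name_a], predictions[name_b])
    print(f'{name_a} vs. {name_b}: {kappa:.3f}')
```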
f) Build a voting classifier from the five models using majority rules
(voting='hard'), weighting the Naive Bayes model half as much
as the others.
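A sketch of step f). Plain, untuned models stand in for the tuned pipelines from step d) (in the exercise, you would pass each grid search's best_estimator_), and toy data makes the sketch runnable end to end:

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier, RandomForestClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

voter = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=0)),
        ('gb', GradientBoostingClassifier(random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('lr', LogisticRegression(max_iter=1000)),
        ('nb', GaussianNB()),
    ],
    voting='hard',               # majority rules
    weights=[1, 1, 1, 1, 0.5],   # Naive Bayes counts half as much
)

# Toy data so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
voter.fit(X, y)
```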
g) Look at the classification report for your model.
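For step g), with stand-in labels in place of y_test and the voting classifier's test-set predictions:

```python
from sklearn.metrics import classification_report

# Stand-in labels for the sketch; in the exercise these would be y_test and
# voter.predict(X_test).
y_true = [5, 5, 6, 6, 7, 7]
y_pred = [5, 6, 6, 6, 7, 5]

# Per-class precision, recall, and F1, plus macro/weighted averages.
report = classification_report(y_true, y_pred)
print(report)
```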
h) Create a confusion matrix using the confusion_matrix_visual() function
from the ml_utils.classification module.