id3.ipynb
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python394jvsc74a57bd081118431cc388d258ed977b65143603a98f8ad6ed776c173758a3af876bc6de9",
"display_name": "Python 3.9.4 64-bit"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\n",
"from sklearn import datasets\n",
"from sklearn.tree import DecisionTreeClassifier \n",
"from sklearn import tree\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Windy?\\tAir Quality Good?\\tHot?\\tPlay Tennis?\n",
"0 No\\tNo\\tNo\\tNo\n",
"1 Yes\\tNo\\tYes\\tYes\n",
"2 Yes\\tYes\\tNo\\tYes\n",
"3 Yes\\tYes\\tYes\\tNo"
],
"text/html": "
\n\n
\n\n\n | \nWindy?\\tAir Quality Good?\\tHot?\\tPlay Tennis? | \n
\n\n\n\n0 | \nNo\\tNo\\tNo\\tNo | \n
\n\n1 | \nYes\\tNo\\tYes\\tYes | \n
\n\n2 | \nYes\\tYes\\tNo\\tYes | \n
\n\n3 | \nYes\\tYes\\tYes\\tNo | \n
\n\n
\n
"
},
"metadata": {},
"execution_count": 2
}
],
"source": [
"data = pd.read_csv('id3.csv')\n",
"data"
]
},
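{
"cell_type": "markdown",
"metadata": {},
"source": [
"The imports above pull in `DecisionTreeClassifier` and `tree` but are never used. A minimal sketch, assuming the 0/1-encoded `id3.csv` shipped with this notebook: fit an entropy-based (ID3-style) tree on the three features and plot it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fit an ID3-style decision tree (entropy criterion = information gain).\n",
"X = data.drop([\"Play Tennis?\"], axis=1)\n",
"y = data[\"Play Tennis?\"]\n",
"clf = DecisionTreeClassifier(criterion=\"entropy\")\n",
"clf.fit(X, y)\n",
"\n",
"# Visualise the learned splits; the leaves hold the class predictions.\n",
"plt.figure(figsize=(8, 6))\n",
"tree.plot_tree(clf, feature_names=list(X.columns), class_names=[\"No\", \"Yes\"], filled=True)\n",
"plt.show()"
]
},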
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
]
}
id3.csv
Windy?,Air Quality good?,Hot?,Play Tennis?
0,0,0,0
1,0,1,1
1,1,0,1
1,1,1,0
ML_4.docx
1.a.
As machine learning practitioners, the key takeaways are:
1. Cleaning data is the first priority. Faulty values such as NaN or stray strings cause errors, so convert the data to processable dtypes (for example, strings to floats), replace NaN values (with 0, say) or drop NaN-heavy columns, and remove columns that are not needed. If the dataset is too small, data augmentation/segmentation can create additional samples. For testing, hold out roughly 10% to 25% of the dataset (see the sketch after this list).
2. Use the algorithm the problem actually requires. A simple clustering problem can be handled with K-Means (or K-NN, if labels are available) and a model established that fits the data; forcing regression models such as Ridge or Elastic Net onto it gives unexpected results. Choose the algorithm based on the task and the dataset.
3. Watch for overfitting and underfitting. Over-training makes the model fit noise and case-specific variation rather than the general pattern; underfitting means the model never captures the general trend and predicts poorly. Proper hyperparameter tuning addresses both.
4. Use deep learning only where it is needed. If a problem can be solved with a classical machine learning algorithm, avoid deep learning; where it is necessary, prefer architectures with the minimum (optimal) number of layers, since unnecessarily deep models only waste training and testing time, and hyperparameter selection is a must. Likewise, reserve reinforcement learning for problems that classical and deep learning methods cannot handle. Pretrained models can speed up training and testing.
5. Read and implement research papers. Given the pace of machine learning research, it is important to study newer papers and implement them; machine learning is a fast-moving, state-of-the-art domain.
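A minimal sketch of takeaway 1, assuming a hypothetical raw.csv with an unused id column, a string-typed price column, a label column, and scattered NaNs (all names here are made up for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('raw.csv')               # hypothetical raw file
df = df.drop(columns=['id'])              # remove columns that are not needed
df['price'] = df['price'].astype(float)   # string values -> float dtype
df = df.fillna(0)                         # NaN -> 0 (or drop NaN-heavy columns instead)

X = df.drop(columns=['label'])
y = df['label']
# hold out 10-25% of the rows for testing (20% here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)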
1.b.
Since there are two features and a binary label, this is a classification problem. K-NN fits well here because it assigns a point the majority label of its nearest neighbours. (K-Means, in contrast, is unsupervised clustering: it ignores the labels, so it does not apply directly to a labelled problem.)
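A minimal K-NN sketch for such a two-feature, binary-label problem (the toy values are made up for illustration):

from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]  # two features per sample
y = [0, 0, 1, 1]                                      # binary label

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.2, 1.9]]))  # -> [0]: the majority label of the 3 nearest neighbours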
2.a. (NB.ipynb provided)
2.b.
Real-world applications of regression include:
1. Housing prices: predict the price of a house (Y) from its size and locality.
2. Tumor monitoring: predict the future size of a tumor (Y) from its current measurements (predicting the discrete cancer stage would instead be classification).
3. Customer spending: predict the amount of money a customer will spend at a gambling den (Y) from their salary and lifestyle.
4. Student performance: predict marks in future tests (Y) from past marks in maths, science, and social studies.
5. Athletics: predict the distance a runner can cover (Y) from past athletic records, stamina, and physique.
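A minimal sketch of application 1 (the numbers are made up): linear regression predicting the price Y from house size, where the continuous output is exactly what makes it a regression problem.

from sklearn.linear_model import LinearRegression

X = [[600], [800], [1000], [1200]]   # house size in sq. ft.
y = [30, 40, 50, 60]                 # price

reg = LinearRegression().fit(X, y)
print(reg.predict([[900]]))  # -> [45.], a continuous prediction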
3.a. (see knn.ipynb)
3.b. The direction of the nearest neighbours should not be considered: K-NN uses only the distance to each neighbour, so every point inside the radius counts equally, whichever side of the query it lies on (see the short sketch below).
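A two-line illustration with toy points: two neighbours at the same distance but in opposite directions are treated identically, because only the distance enters the computation.

import numpy as np

query = np.array([0.0, 0.0])
left, right = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
print(np.linalg.norm(left - query), np.linalg.norm(right - query))  # 1.0 1.0 -> equal weight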
3.c.
1. Logistic regression, despite its name, is a classification algorithm: it models the probability of a binary outcome (for example, admitted or not). For continuous targets such as house prices, linear regression is the right tool.
2. K-NN is a supervised classifier (it can also do regression): it labels a point by a majority vote of its nearest neighbours. It is distinct from K-Means, which is unsupervised clustering.
3. SVM can be used for both classification and regression problems, but it is mainly used for classification.
4. Naïve Bayes is used for classification problems.
5. Decision trees solve both classification and regression problems by recursively splitting the dataset into smaller subsets; predictions are stored in the leaf nodes, and the tree can be updated incrementally.
4.a. Decision tree: (see id3.ipynb)
NB.ipynb
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "python394jvsc74a57bd081118431cc388d258ed977b65143603a98f8ad6ed776c173758a3af876bc6de9",
"display_name": "Python 3.9.4 64-bit"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn import metrics"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv('binary.csv')\n",
"y = data[\"admit\"]\n",
"X = data.drop([\"admit\"],axis =1)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"gnb = GaussianNB()\n",
"gnb.fit(X_train, y_train)\n",
" \n",
"y_pred = gnb.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"71.875"
]
},
"metadata": {},
"execution_count": 13
}
],
"source": [
"metrics.accuracy_score(y_test, y_pred)*100"
]
},
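{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy alone can hide class imbalance in the admit/reject labels. A minimal follow-up sketch (not part of the original run) that breaks the predictions down per class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Per-class view of the Gaussian NB predictions.\n",
"print(metrics.confusion_matrix(y_test, y_pred))\n",
"print(metrics.classification_report(y_test, y_pred))"
]
},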
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
...