- Assignment task in Word doc - data in Excel file - no referencing needed - unknown number of words



PYTHON (JUPYTER NOTEBOOK) PROJECT - GENERAL DESCRIPTION

· This assignment needs to be delivered in a Jupyter Python Notebook (.ipynb)
· Deep understanding and knowledge of clustering algorithms is required
· Delivery date: end of Sunday 2nd June (CET time)

The aim is to cluster 51 objects (cases) according to a set of "clustering variables" through the implementation of various unsupervised clustering algorithms. Since this is an unsupervised learning problem, the number of clusters in the data (if any) is unknown. For the clusters found by the various clustering algorithms, it will be of interest to compare them according to
· the clustering variables, and
· an additional set of variables not used in the clustering (the "profiling variables").

Some basic references are included at the end of this document to facilitate understanding of the problem and of what is required to deliver. These are denoted as [xxx] in the following. There is no need to read such references in full; usually, scrolling through the relevant pages, formulas and images is enough to understand what is required. Observations that require great attention, to ensure that this Python project delivers what is asked, are highlighted. For any doubts, please contact me by email at [email protected]. It is expected that the coder in charge of this assignment will have questions about the various tasks; hence, I would expect to be contacted as the project advances.

DATASET

In Excel file "Clustering Dataset clean":
· Number of cases (objects) = 51 (rows 3:53, 51 objects or cases in total)
· Clustering Variables Set 1 ("CVS") (cols B:U, 20 variables or features in total)
  · These are the variables used to cluster the data
· Profiling Variables Set ("PVS") (cols V:CG, 64 variables or features in total)
  · These variables are NOT used by the clustering algorithms. Instead, they are used to further characterize the clusters.

PART I INITIAL DATA EXPLORATION AND DATA PRE-PROCESSING

· Obtain basic statistical information for ALL the variables in CVS and PVS: mean, max, min, median, std dev, skew, kurtosis, Kolmogorov-Smirnov test for normality, ...
· For CVS, PVS and CVS+PVS, obtain the Pearson and Spearman correlation matrices.
· Normalize variables according to these two Normalization Methods ("NM"):
  · NM1 - Z-score: subtract the mean and divide by the standard deviation
  · NM2 - Min-Max: subtract the minimum and divide by the absolute difference between minimum and maximum

In what follows, all calculations related to clustering will be done using BOTH normalization methods.

PRINCIPAL COMPONENT ANALYSIS (PCA)

Using both normalization methods:
· Perform standard PCA on CVS
· Show factor loadings, order factors by % variance explained, and show variance explained (by each factor and cumulative)
· Show plots of the CVS data on the two most important factors
(A sketch of the normalization and PCA steps is given after the Part II introduction below.)

PART II

Using both normalization methods, implement for dataset CVS the following 8 Clustering Methods ("CM"). Before specifying these clustering algorithms, we first define common standard performance and evaluation metrics. Note that specific nuances should be considered depending on the type of clustering algorithm (partitional, hierarchical, density-based). For example, in partitional algorithms the clustering structure will depend on the chosen number of clusters (k), whereas in hierarchical algorithms the clustering structure will be determined by the 'height' at which the corresponding dendrogram is cut.
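For illustration, a minimal sketch of the two normalization methods and the PCA step, using pandas and scikit-learn (the DataFrame name cvs and the helper names are illustrative, and assume the 20 CVS columns have already been extracted):

import pandas as pd
from sklearn.decomposition import PCA

def nm1_z_score(X):
    # NM1: subtract the mean, divide by the standard deviation
    return (X - X.mean()) / X.std(ddof=0)

def nm2_min_max(X):
    # NM2: subtract the minimum, divide by (max - min)
    return (X - X.min()) / (X.max() - X.min())

for normalize in (nm1_z_score, nm2_min_max):
    Xn = normalize(cvs)
    pca = PCA().fit(Xn)
    # Loadings, one column per factor, already ordered by variance explained
    loadings = pd.DataFrame(pca.components_.T, index=Xn.columns)
    var_explained = pca.explained_variance_ratio_     # per factor
    cum_explained = var_explained.cumsum()            # cumulative
    scores = pca.transform(Xn)[:, :2]                 # coordinates on the two main factors

The scores array is what the cluster-coloured 2-D plots in Part III would be drawn from.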
ERROR MEASURES

· For the overall clustering structure (see slides #99-100 in [TAN04] or slide #8 and slide #20 in [RICCO]):
  · TSS: Total Sum of Squares
  · WSS: Within-clusters Sum of Squares (often denoted SSE)
  · BSS: Between-clusters Sum of Squares
  · TSS = WSS + BSS

SILHOUETTE WIDTH (COEFFICIENT) AND PLOT

· See slide #10 in [RICCO2] or slides #19/#46 in [UMASS1].
· Preferably, I would like Silhouette plots like the one in [DATANOVIA].

DENDROGRAM PLOTS

· Preferably, I would like dendrogram plots like the one in [DATANOVIA].

DISTANCE MEASURES

· Although the main distance metric used will be the Euclidean distance, set up the Python project to also accept the Manhattan and Mahalanobis distance metrics. See pages 1-2 in [NELSON12].

PARTITIONAL CLUSTERING ALGORITHMS

· CM1 - Standard K-means
· CM2 - K-medoids as in [MAIONE18] - also known as "Partitioning Around Medoids" (PAM). See reference [TAN04].
· CM3 - Bisecting K-means. See reference [TAN04].

As these 3 algorithms are susceptible to initialization issues (i.e. the chosen initial cluster centers), run n=200 iterations with different random initializations of the cluster centers. For each run, consider k = 1, 2, 3, 4, 5 clusters. Averaging across the n=200 runs:
· Compute TSS, WSS and BSS for k = 1, 2, 3, 4, 5
· Compute the overall average Silhouette width for k = 1, 2, 3, 4, 5
· Compute the Silhouette width of each cluster for k = 1, 2, 3, 4, 5 (e.g. for k=4, 4 Silhouette coefficients)
· Plot the histogram distribution of TSS, WSS and BSS for k = 1, 2, 3, 4, 5
· Plot the histogram distribution of the overall average Silhouette for k = 1, 2, 3, 4, 5
· On the same plot, taking k (the number of clusters) as the x-axis, show on the y-axis both
  · the average (across clusters) within-cluster dissimilarity WSS (as in slide #19 in [UMASS], or slide #94 in [TAN04]), and
  · the average Silhouette width (as in slide #19 in [UMASS] or slide #10 in [RICCO2])

In the following sections, performance and other metrics for these algorithms should be taken as their average values across the n=200 runs. (A sketch of this restart loop is given below, after the DBSCAN section.)

HIERARCHICAL CLUSTERING ALGORITHMS

· 4 clustering algorithms, all of the agglomerative type
· Use the following approaches to measure the distance between clusters (see reference [TAN04]):
  · CM4 - Single Link
  · CM5 - Complete Link
  · CM6 - Average Link
  · CM7 - Ward's Method

For each of these hierarchical clustering methods, obtain SSE (WSS) and BSS, and display the corresponding dendrograms.

DENSITY-BASED CLUSTERING ALGORITHMS

· CM8 - DBSCAN
  · Use various combinations of the two parameters of this method (see reference [TAN04]):
    · Eps
    · MinPts
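A minimal sketch of the multi-restart loop for CM1, using scikit-learn's KMeans and silhouette_score (X stands for the normalized CVS data as a NumPy array; the function and variable names are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_restarts(X, ks=(1, 2, 3, 4, 5), n_runs=200):
    tss = ((X - X.mean(axis=0)) ** 2).sum()    # TSS is fixed for the dataset
    results = {}
    for k in ks:
        wss_runs, sil_runs = [], []
        for run in range(n_runs):
            km = KMeans(n_clusters=k, init='random', n_init=1,
                        random_state=run).fit(X)
            wss_runs.append(km.inertia_)       # within-cluster sum of squares
            if k > 1:                          # silhouette is undefined for k=1
                sil_runs.append(silhouette_score(X, km.labels_))
        results[k] = {
            'WSS': np.mean(wss_runs),
            'BSS': tss - np.mean(wss_runs),    # TSS = WSS + BSS
            'silhouette': np.mean(sil_runs) if sil_runs else float('nan'),
        }
    return tss, results

The per-run lists wss_runs and sil_runs also feed the requested histograms, and CM2 (PAM) and CM3 (bisecting K-means) can reuse the same loop with a different estimator.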
PART III POST-PROCESSING

For all the clustering structures obtained by the 8 clustering algorithms, the following will be calculated (see slides #99-100 in [TAN04]):
· For each cluster in each clustering structure, the Within-Cluster Sum of Squares (WSS) as a measure of cluster cohesion
· For the overall clustering structure (see slide #99 in [TAN04], or slides #8/#20 in [RICCO]):
  · TSS: Total Sum of Squares
  · WSS: Within-clusters Sum of Squares
  · BSS: Between-clusters Sum of Squares

For all of them, show the (2-dimensional) plot of the 51 objects with respect to the two most important principal components obtained in the previous section, with different colourings/markers representing different clusters. See slides #22-23 in [RICCO].

For all of them, calculate the Silhouette coefficient of each cluster and the average Silhouette coefficient of the overall clustering structure (see slide #102 in [TAN04]), and display the typical graph as in slide #22 in [UMASS]. Also show a similar graph with respect to the number of clusters, as in slide #10 in [RICCO2].

In addition, following [RICCO], perform the following and show similar tables and graphs:

UNIVARIATE CHARACTERIZATION

· Characterizing the partition (see slides #8-11 in [RICCO])
· Characterizing the clusters / Quantitative variables - V-test (see slides #12-14 in [RICCO])
· Characterizing the clusters / One group vs. the others - Effect size (see slides #15-17 in [RICCO])
· Characterizing the clusters / Categorical variables - V-test (see slide #18 in [RICCO])

MULTIVARIATE CHARACTERIZATION

· Characterizing the partition / Percentage of variance explained (see slide #20 in [RICCO]). Already calculated (TSS, WSS, BSS)
· Characterizing the partition / Evaluating the proximity between the clusters (see slide #21 in [RICCO])
· Characterizing the clusters / In combination with factor analysis (see slides #22-23 in [RICCO])
· Characterizing the clusters / Using a supervised approach, e.g. Discriminant Analysis (see slide #24 in [RICCO])

PROFILING VARIABLES

Last, compute basic statistics for the clusters obtained in each clustering structure with respect to the "Profiling Variables".

PART IV CLUSTERING EVALUATION

As there is no external information with which to validate the goodness of the various clustering structures, following [TAN04], calculate and display as applicable (some have already been calculated):
· TSS, BSS, WSS
· Correlation between the "proximity/similarity" and "incidence" matrices (see slide #87 in [TAN04])
· Similarity matrix as in slide #89 in [TAN04]
· Cophenetic correlation (see slides #49-51 in [UMASS]), as in slide #51 in [UMASS]
· Silhouette plot as in slide #22 in [UMASS1] (with Silhouette coefficients for each cluster and the average for the whole clustering structure)

Also, as per slide #97 in [TAN04], generate n=500 sets of random data spanning the same ranges as the features of dataset CVS, and
· Obtain the average SSE and display the same histogram as in slide #97 of [TAN04]
· Do the same as per slide #98 of [TAN04], but for the correlation between the incidence and proximity matrices
· Do the same with the average Silhouette coefficient
· Do the same with the average cophenetic correlation

Given the (total) SSE, (average) correlation between incidence and proximity matrices, (average) cophenetic correlation, and (average) Silhouette coefficient obtained by each clustering method, obtain the likelihood of those values under these random runs (an empirical p-value). A sketch of this random baseline follows.
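A minimal sketch of the random-data baseline and empirical p-value, assuming the metric of interest is the k-means SSE (the helper names and the fixed k are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def random_baseline_sse(X, k=3, n_sets=500, seed=0):
    # Random datasets spanning the same per-feature ranges as X
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    sse = np.empty(n_sets)
    for i in range(n_sets):
        R = rng.uniform(lo, hi, size=X.shape)
        sse[i] = KMeans(n_clusters=k, n_init=10, random_state=i).fit(R).inertia_
    return sse

def empirical_p_value(observed_sse, baseline_sse):
    # Fraction of random runs with SSE at least as small (i.e. as good) as observed
    return (baseline_sse <= observed_sse).mean()

The same pattern applies to the correlation, cophenetic and Silhouette baselines: collect the metric over the 500 random sets, then locate each algorithm's observed value within that distribution.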

REFERENCES

[DATANOVIA] Cluster Validation Statistics. Available at: https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/
[NELSON12] Nelson, J.D. (2012). On K-Means Clustering Using Mahalanobis Distance. Available at: https://library.ndsu.edu/ir/bitstream/handle/10365/26766/On%20K-Means%20Clustering%20Using%20Mahalanobis%20Distance.pdf?sequence=1
[MAIONE18] Maione, C., Nelson, D.R., and Melgaço Barbosa, R. (2018). Research on social data by means of cluster analysis. Applied Computing and Informatics. https://doi.org/10.1016/j.aci.2018.02.003
[RICCO] Rakotomalala, R. Interpreting Cluster Analysis Results. Available at: http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_interpretation.pdf
[RICCO2] Rakotomalala, R. Cluster analysis with Python - HAC and K-Means. Available at: https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.pdf
[RICCO3] Rakotomalala, R. K-Means Clustering. Available at: https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.pdf
[TAN04] Tan, Steinbach and Kumar (2004). Data Mining Cluster Analysis: Basic Concepts and Algorithms. Available at: https://www-users.cs.umn.edu/~kumar001/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
[UMASS] UMass Landscape Ecology Lab. Finding groups -- cluster analysis. Part 1. Available at: https://www.umass.edu/landeco/teaching/multivariate/schedule/cluster1.pdf
ANSWER (Ximi, May 30 2021)
In [32]:
import pandas as pd

In [33]:
# Load the dataset; skiprows=1 skips the title row above the column headers
df = pd.read_excel('dataset-clustering-clean-x0ybwvkd.xlsx', skiprows=1)

In [34]:
df.head()

Out[34]:
  Objects (Samples)      CV_1      CV_2       CV_3      CV_4      CV_5  \
0              X_01  0.031127  0.027155   9.128522  0.024188  0.104159
1              X_02  0.013742  0.016048  22.804474  0.026611  0.214766
2              X_04  0.020822  0.034704   8.229769  0.032438  0.089719
3              X_05  0.036065  0.035411   8.861134  0.018565  0.113282
4              X_06  0.025859  0.080866  13.324118  0.037630  0.131217

           CV_6      CV_7  CV_8      CV_9  ...       PV_55  PV_56  \
0  40544.764444  0.736500    51  0.620831  ...   95.589544   68.5
1  10609.884444  0.706296    30  0.512815  ...    1.291276   63.7
2   5577.163333  0.746792    34  0.480831  ...   59.233518   62.6
3  40870.755556  0.735041    41  0.605741  ...   57.047076   65.7
4  22716.177778  0.742227    31  0.314921  ...  248.134546   54.1

      PV_57     PV_58     PV_59     PV_60     PV_61      PV_62     PV_63  \
0  0.502274  0.400772  0.835020  0.355241  0.564182  29.971266  0.226422
1  0.556243  0.437328  0.801587  0.326052  0.624793  40.526131  0.375423
2  0.485985  0.397257  0.778461  0.357271  0.614311  33.721700  0.254137
3  0.442605  0.351210  0.795736  0.301664  0.546176  23.732097  0.188889
4  0.463469  0.371367  0.756874  0.355701  0.602860  31.208073  0.274496

      PV_64
0  0.250370
1  0.369033
2  0.278691
3  0.204846
4  0.238922

[5 rows x 85 columns]
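The brief splits the 85 columns into an identifier, 20 clustering variables (CVS) and 64 profiling variables (PVS); a minimal way to separate them once loaded (the names cvs and pvs are illustrative):

cvs = df.iloc[:, 1:21]    # CV_1 .. CV_20 (cols B:U in the Excel file)
pvs = df.iloc[:, 21:]     # PV_1 .. PV_64 (cols V:CG)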
PART 1
In [35]:
# Statistical Info
df.describe()

Out[35]:
            CV_1       CV_2       CV_3       CV_4       CV_5           CV_6  \
count  51.000000  51.000000  51.000000  51.000000  51.000000      51.000000
mean    0.021907   0.045514  10.883694   0.026249   0.155597   31684.154793
std     0.006603   0.019127   3.185631   0.006647   0.087032   27353.447551
min     0.012901   0.016048   7.333994   0.014642   0.073693     552.320000
25%     0.018081   0.030817   8.897889   0.020815   0.113486   10181.820000
50%     0.020381   0.046199  10.058774   0.025357   0.138089   26399.918889
75%     0.024990   0.054115  11.559821   0.029940   0.171439   46895.350556
max     0.048658   0.093814  23.887385   0.042541   0.685224  157135.948889

            CV_7       CV_8       CV_9      CV_10  ...         PV_55  \
count  51.000000  51.000000  51.000000  51.000000  ...     51.000000
mean    0.736134  35.588235   0.482582   0.539622  ...    407.013612
std     0.091487   7.582022   0.119176   0.065373  ...   1506.641328
min     0.179708  21.000000   0.040875   0.172706  ...      1.291276
25%     0.735516  30.500000   0.409657   0.521856  ...     47.414485
50%     0.756801  35.000000   0.481736   0.541314  ...    106.259819
75%     0.771402  39.000000   0.571616   0.564529  ...    225.795519
max     0.809309  53.000000   0.681726   0.723218  ...  10794.578215

           PV_56      PV_57      PV_58      PV_59      PV_60      PV_61  \
count  51.000000  51.000000  51.000000  51.000000  51.000000  51.000000
mean   65.247059   0.519300   0.417522   0.793803   0.326118   0.633149
std     5.553246   0.053965   0.048396   0.029410   0.043794   0.059490
min    40.700000   0.402556   0.315958   0.735888   0.261860   0.528438
25%    63.600000   0.487425   0.386641   0.773947   0.299129   0.587351
50%    66.300000   0.518790   0.414898   0.795247   0.319161   0.624793
75%    68.550000   0.555964   0.446011   0.814972   0.348476   0.678332
max    72.500000   0.660111   0.525328   0.904627   0.545274   0.758268

           PV_62      PV_63      PV_64
count  51.000000  51.000000  51.000000
mean   32.702726   0.237654   0.249861
std     8.641883   0.094499   0.107509
min    20.698823   0.000000   0.000000
25%    27.974520   0.206011   0.224750
50%    31.208073   0.249108   0.277983
75%    35.827560   0.294885   0.302694
max    80.759915   0.375423   0.441670

[8 rows x 84 columns]
In [36]:
# Skew
df.skew().head()

Out[36]:
CV_1    1.697965
CV_2    0.643609
CV_3    2.593303
CV_4    0.576937
CV_5    4.736705
dtype: float64
In [37]:
# Kurtosis
df.kurtosis().head()

Out[37]:
CV_1     4.650896
CV_2     0.070103
CV_3     8.061187
CV_4    -0.224065
CV_5    28.034663
dtype: float64
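The brief also asks for a Kolmogorov-Smirnov test for normality, which the notebook does not show; a minimal sketch with scipy.stats.kstest, testing each variable against a normal distribution fitted with its own mean and standard deviation (one common convention, assumed here):

from scipy import stats

ks_results = {}
for col in df.columns[1:]:
    x = df[col]
    stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=0)))
    ks_results[col] = (stat, p)   # small p suggests departure from normality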
In [38]:
# Correlation matrices (Pearson)
df.corr(method='pearson').head()

Out[38]:
          CV_1      CV_2      CV_3      CV_4      CV_5      CV_6      CV_7  \
CV_1  1.000000  0.055488 -0.143016  0.156395 -0.127560  0.213934 -0.555841
CV_2  0.055488  1.000000  0.230844  0.481314  0.295223 -0.239170 -0.091910
CV_3 -0.143016  0.230844  1.000000  0.117099  0.718548 -0.328752 -0.015100
CV_4  0.156395  0.481314  0.117099  1.000000  0.118996 -0.397467 -0.282819
CV_5 -0.127560  0.295223  0.718548  0.118996  1.000000 -0.324170  0.014087

          CV_8      CV_9     CV_10  ...     PV_55     PV_56     PV_57  \
CV_1  0.559178  0.056048 -0.018077  ...  0.120737 -0.117209 -0.025234
CV_2 -0.414915 -0.640325 -0.103618  ...  0.287900 -0.294165  0.163293
CV_3 -0.362174 -0.474109 -0.445860  ...  0.595196 -0.533518  0.066633
CV_4 -0.256928 -0.638158  0.146217  ...  0.148864 -0.291056  0.310068
CV_5 -0.454147 -0.593993 -0.663808  ...  0.866830 -0.493087  0.089648

         PV_58     PV_59     PV_60     PV_61     PV_62     PV_63     PV_64
CV_1 -0.178374  0.427655  0.487576 -0.061308  0.404307 -0.366673  0.153937
CV_2  0.236562 -0.127381 -0.038483  0.022090  0.111620 -0.278794  0.171609
CV_3  0.058485 -0.200520 -0.125153  0.071863  0.115254 -0.070348 -0.308856
CV_4  0.366486  0.164554  0.344717 -0.083662  0.308681 -0.243190  0.172902
CV_5  0.028670 -0.217856 -0.224826  0.229179  0.108206 -0.198993 -0.395182

[5 rows x 84 columns]
In [39]:
# Correlation matrices (Spearman)
df.corr(method='spearman').head()

Out[39]:
          CV_1      CV_2      CV_3      CV_4      CV_5      CV_6      CV_7  \
CV_1  1.000000  0.021448 -0.267602  0.074480 -0.498643  0.348688 -0.117647
CV_2  0.021448  1.000000  0.399005  0.510317  0.401900 -0.202896 -0.121176
CV_3 -0.267602  0.399005  1.000000  0.200724  0.677195 -0.374299 -0.177919
CV_4  0.074480  0.510317  0.200724  1.000000  0.154661 -0.496290 -0.300724
CV_5 -0.498643  0.401900  0.677195  0.154661  1.000000 -0.424615 -0.181176

          CV_8      CV_9     CV_10  ...     PV_55     PV_56     PV_57  \
CV_1  0.495140  0.179005 -0.371312  ...  0.078733 -0.164698 -0.268332
CV_2 -0.437179 -0.671674  0.100271  ...  0.541357 -0.094275  0.189280
CV_3 -0.478599 -0.470769 -0.028326  ...  0.163982 -0.219959  0.203217
CV_4 -0.375638 -0.717195  0.360995  ...  0.326063 -0.268432  0.303808
CV_5 -0.671788 -0.379638  0.158371  ...  0.016471  0.026024  0.468834

         PV_58     PV_59     PV_60     PV_61     PV_62     PV_63     PV_64
CV_1 -0.287156  0.202539  0.200005 -0.192584  0.001719 -0.393581  0.027217
CV_2  0.246748 -0.138057 -0.137786  0.150546  0.210819 -0.071029  0.234580
CV_3  0.229010 -0.173262 -0.191543  0.132175  0.222946  0.184658 -0.153790
CV_4  0.404715  0.109143  0.306975 -0.007511  0.295663 -0.177732  0.145820
CV_5  0.419421 -0.091179 -0.293084  0.347745  0.385484  0.366192  0.041074

[5 rows x 84 columns]
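The brief asks for the correlation matrices separately for CVS, PVS and CVS+PVS rather than for the whole frame at once; with the cvs/pvs split sketched earlier (illustrative names), this is a direct extension:

for name, block in {'CVS': cvs, 'PVS': pvs, 'CVS+PVS': df.iloc[:, 1:]}.items():
    pearson = block.corr(method='pearson')
    spearman = block.corr(method='spearman')
    print(name, pearson.shape, spearman.shape)   # display or save each matrix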
In [41]:
# Z-Score Normalization (NM1: subtract the mean, divide by the standard deviation)
df_z_score = pd.DataFrame()
for col in df.columns[1:]:
    col_zscore = col + '_zscore'
    df_z_score[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
In [43]:
...
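The notebook is cut off above; the brief's second normalization (NM2, min-max) would mirror the z-score cell along these lines (a sketch, not the missing cell itself):

df_min_max = pd.DataFrame()
for col in df.columns[1:]:
    # NM2: subtract the minimum, divide by (max - min)
    df_min_max[col + '_minmax'] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())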