Please follow the slides on project data exploration for the COVID-19 analysis. You may refer to the other uploaded documents and submit a data exploration for this COVID-19 research paper. Submit everything, including documentation, scripts, references, and datasets.
Plagiarism Tutorial: What is Plagiarism and How Can I Avoid It?

Data Engineering
https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
Data engineers are responsible for finding trends in data sets and developing algorithms that make raw data more useful to the enterprise. Data exploration is "initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data". "These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data."

Data Exploration
Data Exploration, or Exploratory Data Analysis (EDA), is used to:
answer questions about the data, test data assumptions, and generate hypotheses for further analysis;
prepare the data for modeling;
build a deep understanding of your data so you can answer questions about it;
build insights on your data sets;
help interpret the results of modeling in the future.
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than traditional data management systems. In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed. EDA encompasses IDA.

Data Exploration Steps

3.1 Data Process
Decide the approaches and steps for deriving the raw, training, validation, and test datasets so that the models can meet the project requirements.
https://en.wikipedia.org/wiki/Data_modeling

3.2 Data Collection
Define the sources, parameters, and quantity of the raw datasets; collect necessary and sufficient raw data; present samples from the raw datasets. Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the available data sources are trustworthy and well built so that the data collected (and later used as information) is of the highest possible quality.

3.3 Data Pre-processing
Pre-process the collected raw data with cleaning and validation tools; present samples from the pre-processed datasets. Pre-processing includes removal of noise and outliers, collecting the information needed to model or account for noise, and handling of missing data.
https://serokell.io/blog/data-preprocessing
The binning method is used to smooth data or to handle noisy data. In this method the data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing; a sketch is given below.
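The following is a minimal sketch of smoothing by bin means in Python with pandas; the column values and the choice of three bins are made up for illustration and are not from the slides.

import pandas as pd

# Toy noisy numeric column (values are made up for illustration)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Sort the values and distribute them into three equal-frequency bins
sorted_prices = prices.sort_values()
bins = pd.qcut(sorted_prices, q=3)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = sorted_prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": sorted_prices, "smoothed": smoothed}))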
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the outcome variable) and one or more independent variables.

Data Exploration Steps

3.4 Data Transformation
Transform the pre-processed datasets into the desired formats with tools and scripts; present samples from the transformed datasets.
Normalization helps you scale the data within a range.
Feature selection is the selection of the variables in the data that are the best predictors for the variable we want to predict.
Discretization transforms the data into sets of small intervals.
Concept hierarchy generation builds a hierarchy between attributes where one was not specified.
https://serokell.io/blog/data-preprocessing
Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. If discretization leads to an unreasonably small number of data intervals, it may result in significant information loss. Concept hierarchy generation can be based on the number of distinct values per attribute: suppose a user selects a set of location-oriented attributes (street, country, province_or_state, and city) from the AllElectronics database but does not specify the hierarchical ordering among them. Normalization helps you scale the data within a range to avoid building incorrect ML models while training and/or running data analysis. If the data range is very wide, it will be hard to compare the figures. With various normalization techniques, you can transform the original data linearly, perform decimal scaling, or apply Z-score normalization.

3.5 Data Preparation
Prepare the training, validation, and test datasets from the transformed datasets; present samples from the training, validation, and test datasets (a sketch of this split is given after this section).
https://algotrading101.com/learn/train-test-split/
Training set: the data we use to design our models.
Validation set: the data we use to refine our models.
Testing set: the data we use to test our models.

3.6 Data Statistics
Summarize the progressive results of deriving the raw, pre-processed, transformed, and prepared datasets; present the results statistically in visualization formats.
https://www.tableau.com/learn/articles/data-visualization

Knowledge Discovery Process
https://link.springer.com/chapter/10.1007/978-1-4842-4947-5_2

Data Preprocessing (adapted from Andrew Ferlitsch's slides)
Data preprocessing includes applying domain knowledge of the data to create new features that allow ML algorithms to work better. The steps are:
Import the data
Clean the data (data wrangling)
Replace missing values
Categorical value conversion
Feature scaling

Importing the Dataset (Python)
import pandas as pd
dataset = pd.read_csv('data.csv')

Cleaning the Data
It is not uncommon for datasets to contain some dirty data entries (i.e., samples, or rows in a CSV file). Common problems are bad character encodings (funny characters), misaligned data (e.g., a row with too few or too many columns), and data in the wrong format. Data wrangling is an expertise/occupation all of its own. Common practices in data wrangling:
Know the character encoding of the data file and the intended character encoding of the data, and convert the encoding of the file if necessary (e.g., Notepad++ -> Encodings).
Know the data format of the source and the expected data format, and convert the data format with a batch preprocessing step, e.g., 1 000 000 -> 1,000,000.
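As a minimal sketch of these two practices in Python, pandas can be told the file's character encoding and thousands separator when the dataset is read in; the file name, encoding, and separator below are assumptions for illustration, not values from the slides.

import pandas as pd

# Hypothetical raw file: Latin-1 encoded, with spaces as thousands separators
dataset = pd.read_csv('raw_data.csv',
                      encoding='latin-1',  # match the file's actual encoding
                      thousands=' ')       # parse "1 000 000" as 1000000
print(dataset.dtypes)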
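Returning to the Data Preparation step (3.5), the sketch below derives training, validation, and test sets with scikit-learn's train_test_split. The 60/20/20 proportions and the 'label' column name are assumptions for illustration, not requirements from the slides.

from sklearn.model_selection import train_test_split

# Assume `dataset` is a pandas DataFrame whose 'label' column is the value to predict
X = dataset.drop(columns=['label'])
y = dataset['label']

# First hold out the test set (20% of the data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)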
Replace Missing Values
It is not unusual for samples (rows) to contain missing (blank) entries or not-a-number (NaN) values. Blank/NaN entries do not work for machine learning, so they need to be replaced with something meaningful. The options are: delete the rows (generally not desirable); replace the missing entries with a single value, such as the mean average; or use Multivariate Imputation by Chained Equations (MICE).

Missing Values - Mean Value
scikit-learn class for handling missing data (SimpleImputer in current scikit-learn; older versions call it Imputer):
import numpy as np
from sklearn.impute import SimpleImputer  # scikit-learn module
# Create an imputer object to replace NaN values with the mean value of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer on column 2 of the dataset (index starts at 0, all rows selected;
# assumes dataset is a numeric NumPy array)
imputer = imputer.fit(dataset[:, 2:3])
# Do the replacement and update the dataset (transform must use the same columns as fit)
dataset[:, 2:3] = imputer.transform(dataset[:, 2:3])

Example dataset:
Age    Gender    Income
25     Male      25000
26     Female    22000
30     Male      45000
24     Female    26000
Age and Income take real values, while Gender takes categorical values; the independent variables (features) are used to predict the dependent variable (label), the value to predict.

Categorical Variable Conversion
One-hot encoding (known in Python/scikit-learn as OneHotEncoder) works as follows for each categorical feature:
Scan the dataset and determine all the unique instances.
Create a new feature (i.e., dummy variable) in the dataset, one per unique instance.
Remove the categorical feature from the dataset.
For each sample (row), set a 1 in the dummy variable that corresponds to that sample's categorical value, and set a 0 in the remaining dummy variables for that categorical field.
Remove one dummy variable field.

Dummy Variable Conversion and the Dummy Variable Trap
The Gender column converts into two dummy variables, x2 (Male) and x3 (Female):
Gender    x2 (Male)    x3 (Female)
Male      1            0
Female    0            1
Male      1            0
Female    0            1
We need to drop one dummy variable. Multicollinearity occurs when one variable predicts another, i.e., x2 = (1 - x3); as a result, a regression analysis cannot distinguish between the contributions of x2 and x3.

Categorical Variable Conversion in scikit-learn
scikit-learn classes for categorical variable conversion:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # scikit-learn modules
# Create an encoder object to numerically (enumeration) encode categorical values
labelEncoder = LabelEncoder()
# Encode the categorical values in column 1 (index starts at 0, all rows selected)
dataset[:, 1] = labelEncoder.fit_transform(dataset[:, 1])
# Create an encoder to convert the numerical encodings into one-hot dummy variables;
# categorical_features=[1] marks column 1 (this argument exists only in older scikit-learn versions)
onehotencoder = OneHotEncoder(categorical_features=[1])
# Replace the encoded categorical values with the dummy variables
dataset = onehotencoder.fit_transform(dataset)
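The slide code above uses the older scikit-learn interfaces. As a rough modern equivalent, and purely as a sketch, the snippet below applies mean imputation and dummy-variable encoding to the Age/Gender/Income example with pandas; the missing Income value is added here only to illustrate imputation and is not in the original table.

import numpy as np
import pandas as pd

# The toy dataset from the table above, with one Income value blanked out for illustration
df = pd.DataFrame({
    'Age':    [25, 26, 30, 24],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Income': [25000, 22000, np.nan, 26000],
})

# Replace the missing Income entry with the column mean
df['Income'] = df['Income'].fillna(df['Income'].mean())

# One-hot encode Gender; drop_first=True drops one dummy variable to avoid the dummy variable trap
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)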
Feature Scaling
If features do not have the same numerical scale of values, this causes issues when training a model. If the scale of one independent variable (feature) is greater than that of another, the model will give more importance (skew) to the independent variable with the larger range. To eliminate this problem, convert all the independent variables to the same scale, using normalization (0 to 1) or standardization (-1 to 1). Decision trees and random forests do not need feature scaling.

Scaling Issue - Euclidean Distance
Most machine learning models use the Euclidean distance between two points in 2D Cartesian space:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Given two independent variables (x1 = Age, x2 = Income) and a dependent variable (y = spending), the distance between two samples (rows) i and j becomes:
d(i, j) = sqrt((x1_i - x1_j)^2 + (x2_i - x2_j)^2)
If x1 or x2 is on a substantially greater scale than the other, the corresponding independent variable will dominate the result and contribute more to the model. This especially affects gradient descent algorithms.

Normalization or Standardization
Feature scaling means scaling features to the same scale.
Normalization (min-max scaling) scales features between 0 and 1, retaining their proportional range to each other:
X' = (X - min(X)) / (max(X) - min(X)), where X is the original value and X' is the new value.
Standardization scales features to have a mean (μ) of 0 and a standard deviation (σ) of 1:
X' = (X - μ) / σ.

Feature Scaling in Python
scikit-learn class for feature scaling:
from sklearn.preprocessing import StandardScaler  # scikit-learn module
# Create a scaling object to scale the features
scale = StandardScaler()
# Fit the scaler and transform all the variables except the last column (y, the label)
dataset[:, :-1] = scale.fit_transform(dataset[:, :-1])

Correlation Heatmap
Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable). A heatmap makes it easy to identify which features are most related to the target variable; we can plot a heatmap of the correlated features using the seaborn library:
sns.heatmap(df_new.corr())
https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07

What to Submit?
Complete the data exploration (DataExp) for your research report. Submit your own data exploration on your research paper.