This order requires you to work on data pre-processing and engineering for a skin cancer classification project. The dataset and the project scope are both provided in an uploaded Word document. Your job is to download the dataset and complete the data pre-processing and engineering before the project moves on to further steps. Please submit a notebook containing all the code and a Word document for the documentation. In the notebook, please add clear comments explaining what you did. The documentation should be about 5 pages, including text, outputs, visualizations, etc. Thank you very much.
Skin Cancer Image Classification: Malignant or Benign

Dataset: skin cancer
https://www.kaggle.com/fanconic/skin-cancer-malignant-vs-benign
It has two parts, train and test, for a total of 10,000 images.

Abstract
Skin cancer is one of the most common cancers in America, and its diagnosis requires a skin biopsy at a clinic. In this project we develop a machine learning model that takes images of human skin moles as input, detects the moles in the images, and classifies each one as malignant or benign. This model helps diagnose skin cancer from mole images alone, without a skin biopsy or a clinic visit.

Keywords: Skin cancer, Image classification, Machine learning, SVM, CNN

Introduction
More than two people die of skin cancer in the United States every hour, and 1 in 5 Americans will develop skin cancer by the age of 70. Early detection saves lives: when melanoma, a form of skin cancer, is detected early, the 5-year survival rate is 99 percent.

Labeling data is expensive. Using machines to label data reduces costs and improves access to accurate skin cancer diagnosis, which benefits everyone, especially vulnerable populations. Deep learning Generative Adversarial Networks (GANs) are increasingly used to classify images for medical purposes. To label data (e.g., diagnoses) cost-efficiently, this project intends to create a machine learning model that classifies images from a human skin image dataset as either malignant or benign.

Literature Review
Blanco, R. F., Rosado, P., Vegas, E., & Reverter, F. (2021). Medical image editing in the latent space of Generative Adversarial Networks. Intelligence-Based Medicine, Volume 5, 2021, 100040.
https://www-sciencedirect-com.libaccess.sjlibrary.org/science/article/pii/S2666521221000168
In the context of analyzing breast cancer metastases, this article describes how GANs can be used for medical images.
The authors implemented Deep Convolutional Generative Adversarial Networks (DCGAN) and Conditional Deep Convolutional Generative Adversarial Networks (cDCGAN) for medical image editing. They explain the optimization process as well as image selection and other considerations before training the model.

Murugan, S. Anu H Nair, A. Angelin Peace Preethi, K. P. Sanal Kumar. (2021). Diagnosis of skin cancer using machine learning techniques. Microprocessors and Microsystems, Volume 81, 2021, 103727.
https://www-sciencedirect-com.libaccess.sjlibrary.org/science/article/pii/S0141933120308723
This article demonstrates different classification techniques for skin cancer detection: Support Vector Machine (SVM), Probabilistic Neural Networks, Random Forest (RF), and combined SVM+RF classifiers. Median filtering, image segmentation (including mean shift segmentation), and feature extraction are applied before the data is classified.

Monika, M. K., Vignesh, N. A., Kumari, C. U., Kumar, M. N. V. S. S., & Lydia, E. (2020). Skin cancer detection and classification using machine learning. Materials Today Proceedings, Volume 33, Part 7, 2020, Pages 4266-4270.
https://www.sciencedirect.com/science/article/pii/S2214785320354766
For early detection of skin cancer, this paper uses a Multi-class Support Vector Machine (MSVM) together with image processing tools to classify various forms of skin cancer. Dull Razor, Gaussian filtering, and median filtering are used as image pre-processing techniques to improve accuracy to above 96%. Unsupervised, color-based k-means clustering is used to segment images from the ISIC 2019 Challenge dataset.

Other Relevant Considerations
Real-world skin cancer diagnosis datasets are highly unbalanced: most images show common skin cancer types, and far fewer show uncommon ones. For example, melanoma is much less common than other types of skin cancer, so the limited variation in the available images makes it difficult to detect.
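
The class imbalance described above can be measured before any model is trained. The following is a minimal sketch, not the project's actual notebook code. It assumes each split is stored as one folder per class (for example `train/benign` and `train/malignant`); that layout, the folder path, and the function names are illustrative assumptions. The weighting formula is the common inverse-frequency heuristic, the same one scikit-learn uses for its "balanced" class weights.

```python
from collections import Counter
from pathlib import Path

# Assumption: images are stored with one of these file extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Count image files in each class sub-folder of one dataset split."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by smallest (1.0 means balanced)."""
    if not counts or min(counts.values()) == 0:
        raise ValueError("every class needs at least one image")
    return max(counts.values()) / min(counts.values())


def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}
```

Calling `count_images_per_class("data/train")` on a hypothetical `data/train` folder returns the per-class counts; passing them to `class_weights` gives weights that a classifier's loss can use so that the rarer class is not ignored.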
Further, the images in the datasets collected from Kaggle are mostly of light-skinned people. To detect skin cancer accurately in dark-skinned people, we have to collect images of people with dark skin; training on light-skinned people only biases the model.

This Kaggle dataset contains no noise or outlier images. If a picture contains another kind of dark spot, such as chloasma, or is an unrelated image, such as a picture of the sun or the moon, it will be difficult to detect the skin mole and classify it as malignant or benign. Therefore, to support clinical applications, detection and classification need to stay accurate when such unrelated content is present.
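
Median filtering, which the Literature Review cites as a pre-processing step, is one way to suppress isolated noisy pixels before segmentation. The sketch below is a dependency-free illustration of the technique on a 2-D grayscale image stored as a list of rows. It is not the method used in the cited papers, and the function name and border handling are assumptions; a real notebook would more likely call a library routine on image arrays.

```python
from statistics import median_low


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2-D grayscale image.

    `image` is a list of rows of pixel intensities. At the borders the
    window is clipped to the part that lies inside the image (an
    illustrative choice; libraries often pad the image instead).
    Returns a new image and leaves the input unchanged.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    if not image or not image[0]:
        return []
    radius = size // 2
    height, width = len(image), len(image[0])
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect every pixel in the window centered on (x, y).
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns an actual pixel value from the
            # window, so integer images stay integer.
            row.append(median_low(window))
        filtered.append(row)
    return filtered
```

For example, a single bright pixel surrounded by uniform neighbors is replaced by the neighbors' value, while smooth regions are left essentially unchanged. This is why the median filter is often preferred over simple averaging for removing isolated "salt-and-pepper" noise.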