Assignment 2: Sound Detection

But first, let’s clarify rebalancing…

What is rebalancing?
● In the real world, data is not symmetric
● When dealing with a classification task, we want our data to be balanced (i.e. roughly the same number of true and false samples)
● Rebalancing involves “correcting” a dataset so that it has a relatively similar number of true and false samples
[Diagram: a toy dataset with many more T labels than F labels]

Two methods
● Undersampling - remove samples to make the classes match
● Oversampling - add samples to make the classes match
[Diagram: undersampling removes majority-class samples; oversampling adds minority-class samples]

When do I resample?
In machine learning, you generally want to resample the TRAINING SET.
For example, in Python:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_res, y_res = SMOTE().fit_resample(X_train, y_train)  # oversampling using imblearn
# *OR*
X_res, y_res = RandomUnderSampler().fit_resample(X_train, y_train)  # undersampling using imblearn

When to oversample or undersample? Questions to ask yourself:
1. Is my data truly imbalanced?
➢ For example, a 60/40 split can often be handled with a stratified k-fold split rather than resampling
2. How many samples do I have?
➢ If you have relatively little data, lean towards oversampling; if you have plenty of data, undersampling becomes more attractive
3. What is your data distribution?
➢ If dataset size is not a deciding factor, how skewed the data is can guide the choice between over- and undersampling
Note: it is also possible to do a combination of undersampling and oversampling

Any other questions about rebalancing?

Now back to Assignment 2…

GOAL: Develop a machine learning pipeline to detect activities and events using sound.

Overview
You are going to be collecting and analyzing common sounds to see if you can build a model that correctly identifies them. This assignment is broken down into several parts:
1. Data Collection
2. Data Pre-Processing
3. Data Analysis Pipeline
4. Report

Part 1: Data Collection
You will collect 20 samples per class across 5 classes.
For recording the sounds, you can use these apps (feel free to try out others, but make sure that you are recording uncompressed WAV files):
iOS: Voice Record
Android: MP3 Recorder
**Ideally, you should aim for a sampling rate of 44.1 kHz or higher

Recommended Classes
1. Microwave (30 seconds per sample)
2. Blender (30 seconds per sample)
3. Siren (find a YouTube recording)
4. Vacuum Cleaner (30 seconds per sample)
5. Music of choice (30 seconds per sample)
You are allowed to choose other classes if you would like.

Part 2: Pre-Processing
● The raw data from the .wav files needs to be processed to make it usable in your data analysis
● You should calculate the Fast Fourier Transform (FFT - as covered in lecture; a small sketch follows this list)
○ Converts the raw time-domain signal into the frequency domain
○ A Python notebook with some sample code will be available on Canvas
● You can also remove frequency bands that you think might not contain anything useful
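Below is a minimal sketch of this FFT step, assuming NumPy and a signal that has already been loaded into an array. The function name and the 8 kHz cutoff are only illustrative; the sample notebook on Canvas may do this differently.

import numpy as np

def to_frequency_domain(signal, sample_rate, keep_below_hz=8000):
    """Convert a time-domain signal to a magnitude spectrum, dropping high bands."""
    spectrum = np.abs(np.fft.rfft(signal))                      # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)   # frequency (Hz) of each FFT bin
    keep = freqs <= keep_below_hz                               # discard bands judged not useful
    return freqs[keep], spectrum[keep]

The returned spectrum (or a spectrogram built from many such windows) is what the feature-engineering steps below operate on.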
What are .wav files?
● WAV (.wav) files are waveform audio files
● WAV files hold the audio in raw, typically uncompressed “chunks” using the Resource Interchange File Format (RIFF)
● The format uses containers to store the audio data, track numbers, sample rate, and bit rate
If you are unfamiliar with them, here are some tutorials to help you read in the files:
https://www.tutorialspoint.com/read-and-write-wav-files-using-python-wave
https://stackoverflow.com/questions/2060628/reading-wav-files-in-python

Part 3: Data Analysis Pipeline
Similar to the last assignment, the data analysis will consist of:
1. Feature engineering / extraction
2. Feature normalization / visualization / etc.
3. ML models for classification

Feature Engineering
For feature engineering, you are required to try two approaches (sketches of both appear after the Analysis section below):
1. Binning the spectrogram data from the recordings and using each bin as a feature
2. Extracting domain-specific features
a. Find specific phenomena for each class that you want to capture
b. These features can be in the time or frequency domain

Small Example: Binning Microwave Sounds
[Figure: FFT of a microwave recording -> binning -> spectrogram bins used as features]

Small Example: Extracting Domain-Specific Features
[Figure: the microwave’s HUM and BEEP show up as waves with distinct frequencies]

Windows
In either feature engineering method, you will need to choose a window of data. You are required to try:
1. Treating the whole approx. 30-second recording as a single “window”
2. Dividing each recording into multiple windows
a. Feel free to experiment with different window sizes and overlaps
b. As a convention, though, I would suggest using a 50% overlap between windows (see the windowing sketch below)

Analysis
You are free to use any ML algorithm with any parameters and configs you like to do the classification.
Requirements:
● Analyze the pipeline’s performance using 10-fold cross-validation
● Aim for above 80% performance in at least 3 cases, and above 90% in at least 1
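A minimal sketch of the binning approach above, assuming SciPy and mono (or stereo) WAV recordings. The function name, the choice of 32 bins, and averaging over time are illustrative choices, not the sample code from Canvas.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def binned_spectrogram_features(path, n_bins=32):
    """Read a WAV file and reduce its spectrogram to n_bins averaged frequency bins."""
    sample_rate, samples = wavfile.read(path)
    if samples.ndim > 1:                        # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=sample_rate)
    edges = np.linspace(0, len(freqs), n_bins + 1, dtype=int)   # equal-width frequency bands
    # average the power inside each band (and across time) to get one feature per bin
    return np.array([spec[edges[i]:edges[i + 1]].mean() for i in range(n_bins)])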
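For the domain-specific approach, the sketch below computes a few example features (RMS energy, spectral centroid, peak frequency) for a single window. These particular features are assumptions for illustration; you should pick features that capture phenomena specific to your own classes.

import numpy as np

def domain_features(samples, sample_rate):
    """A few illustrative time- and frequency-domain features for one window."""
    samples = samples.astype(float)
    rms = np.sqrt(np.mean(samples ** 2))                     # loudness (time domain)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)   # spectral "brightness"
    peak_freq = freqs[np.argmax(spectrum)]                   # dominant tone, e.g. a beep
    return np.array([rms, centroid, peak_freq])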
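Finally, a sketch of the multiple-window condition and the 10-fold cross-validation requirement, assuming scikit-learn. The 50% overlap follows the suggestion above; the RandomForestClassifier is just one possible choice of algorithm, not a required one.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sliding_windows(samples, window_size, overlap=0.5):
    """Split one recording into windows of window_size samples with the suggested 50% overlap."""
    step = max(1, int(window_size * (1 - overlap)))
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, step)]

def evaluate(feature_vectors, labels):
    """Run 10-fold cross-validation on per-window feature vectors and return mean accuracy."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, np.array(feature_vectors), np.array(labels), cv=10)
    return scores.mean()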
Part 4: Report
You will submit a write-up (approx 2 pages) explaining:
1. Data collection process
2. Rationale for features (no need to explain the bin sizes; do explain the domain-specific features, any other feature engineering you end up doing, or anything else you feel like sharing)
3. Graph and describe results for the different conditions
If you are having trouble with how to organize your report, refer to some of the shared papers as templates for the write-up.

What do I need to turn in?
1. Your .ipynb files with your code
2. A .pdf version of your code (see assignment 1 slides for how to get the pdf)
3. A pdf of your 2-page report
4. Your collected data as a .zip file
Note: There will also be a mandatory peer evaluation conducted after assignment 2 (on Canvas).

Rubric
Part 1: Data Collection - 5%
Part 2: Pre-Processing - 5%
Part 3: Analysis - 60%
Part 4: Report - 30%
**Group evals will be taken into final grades

Bonus Points!
You can get +20 bonus on assignment 2 by doing a live demonstration of your model. This involves:
1. Having a microphone set up to capture live audio in WAV format
2. Having the live audio connected to your model
3. Recording your demo and posting it to YouTube as an unpublished video
4. Sharing the link with your instructors as part of your assignment 2 submission

Related projects and papers discussed in class:
https://www.gierad.com/projects/acoustruments/
https://www.microsoft.com/en-us/research/project/soundwave-using-the-doppler-effect-to-sense-gestures/publications/
https://ubicomplab.cs.washington.edu/publications/surfacelink/
https://dl.acm.org/doi/10.1145/2594368.2594386
https://www.gierad.com/projects/viband/
https://dl.acm.org/doi/10.1145/1294211.1294250

Description: Develop a machine learning pipeline to detect activities and events using sound. The assignment will involve data collection, data pre-processing/signal conditioning, feature extraction, using an existing ML implementation, and analysis of results.

Data Collection (5% grade): Collect 20 samples each for 5 classes:
Microwave (run for 30 seconds; I would suggest including the door opening, closing, and beeps as part of each recording)
Blender (run for 30 seconds)
Fire alarm or any other kind of siren
Vacuum Cleaner (run for approx. 30 seconds, and perhaps move the vacuum cleaner around, as it will change the sound profile a bit)
Music (approx 30 seconds for each sample, the music of your choice). Try varying the song (e.g., 5 songs with 4 samples each)
For any device that you might not have (e.g., please don’t trigger an actual fire alarm), find a recording on the Internet (maybe on YouTube) and record its sound on your phone. Make sure not to use the audio file directly off the Internet. Make your own recording of the audio file, because you want the general variability between recordings for your 20 samples. You do not need to choose 20 different examples of a sound. For example, if you don’t have access to a blender, don’t search for 20 blender sounds on the Internet. Find one sound and record it 20 times. Identifying 20 different blenders as “blender” is a much harder problem for a course homework.
For recording the sounds, you can use these apps (feel free to try out others, but make sure that you are recording uncompressed WAV files):
iOS: Voice Record
Android: MP3 Recorder
In a real-world scenario, where a system like this runs the whole time to detect different events/sounds, the system would need to filter out silent periods. Thus, in addition to the 5 event classes, also record 20 samples of silence (approx 30 seconds). These recordings will be used to develop logic that could later be used to filter out silent periods. It is entirely up to you whether you want to treat these silent files as a separate sixth class in your ML pipeline or filter them out in the data pre-processing.
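One possible way to implement such a silence filter is a simple energy threshold, sketched below. The 16-bit full-scale assumption and the 0.01 threshold are illustrative and would need to be tuned against your own 20 silence recordings.

import numpy as np

def is_silent(samples, threshold=0.01, full_scale=32768.0):
    """Return True when a recording's RMS energy is below a small fraction of full scale."""
    rms = np.sqrt(np.mean((samples.astype(float) / full_scale) ** 2))
    return rms < threshold

Recordings (or windows) flagged by is_silent could then be dropped during pre-processing, or kept and labeled as a sixth "silence" class instead.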
Pre-processing (5% grade): You will need to process your collected data in some way before extracting features from it. For example, calculating the FFT to convert the raw time-domain signal into the frequency domain, or removing some frequency bands that you think might not contain anything useful.

Feature Engineering/Extraction and ML algorithm (60% total): You are free to use any ML algorithm with any parameters and configs you like. I would suggest trying different algorithms and seeing what works well. For features, you will try two approaches:
Binning the spectrogram data from the recordings and using each bin as a feature. E.g., if you have a 1024-point FFT of a recording, then your FFT output will be a [1024 x num_of_windows] array of samples. You want to convert this 2D array into a smaller array. You can use the sample code I provided in the shared Dropbox folder to bin the values. Feel free to experiment with different sizes of bins. Read some of the papers we discussed in the class for inspiration. (15%)
Extracting domain-specific features. Find specific phenomena for each class that you want to capture. These features can be in the time or frequency domain. (15%)
For calculating your features (binned or domain-specific), you need to choose a window of data. You will try two approaches here as well:
Treating the whole approx. 30-second recording as a single “window.” (15%)
Dividing each recording into multiple windows. Feel free to experiment with different window sizes and overlaps. As a convention, though, I would suggest using a 50% overlap between windows. (15%)

Analysis: Analyze the pipeline’s performance using 10-fold cross-validation.

Performance: Aim for above 80% performance in at least 3 of the four feature/window cases, and above 90% in at least 1; i.e., it is okay if the classification accuracy is below 80% for one of the cases. However, these performance thresholds are not rigid. Each of you is collecting your own dataset,