Need to complete CUDA/C++ software programming.
CUDA/C++ programming.
Reference with the CPU code provided, under the .\src,
Complete following functions
1. void medianFilter_gpu (uint8_t * inPixels, ImageDim imgDim, uint8_t * outPixels, MedianFilterArgs args);
2. int runGpuMedianFilter (std::string imgPath, std::string outPath, MedianFilterArgs args);
3. int sortPixels_gpu (uint8_t * array, dim3 arrayDim);
4. int runGpuPool (TensorShape inShape, PoolLayerArgs poolArgs);
Introduction The first assignment exploited data parallelism along a single dimension - with a direct one-to-one or one-to-many correlation between each thread and the data it processes. In the second assignment we will be operating on 2-D data and look at ways to adapt some common operations for the GPU. The official CUDA Documentation is the best resource for implementation details and API specifics. The NVIDIA Developer Blog has a number of guides which showcase the latest and best CUDA programming practices. We would strongly recommend reading through the CUDA related links within this document before you start writing code. This lab is a significant departure from the previous lab. There are a number of unique ways to parallelize each task and each choice comes with its own set of tradeoffs. There is a significant set of possible implementations and we cannot wait to see your ingeneous and unique take on the problems. To boldly go where no one has gone before! -- Captain James T. Kirk, USS Enterprise PART A: Filtering Noise from Images Median Filtering The median filter is a non-linear digital filtering technique, often used to remove noise from an image or signal [1]. For small to moderate levels of Gaussian noise, the median filter is demonstrably better than Gaussian blur at removing noise whilst preserving edges for a given, fixed window size. [2]. While a number of efficient implementations of median filtering exists [3], we will be using the relatively compute-intensive "simple" implementation, that precisely computes https://docs.nvidia.com/cuda/ https://developer.nvidia.com/blog https://en.wikipedia.org/wiki/Median_filter https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#1 https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#2 https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#3 the median. Furthermore, to simplify the implementation, we will be using a cental symmetric rectangular / square window such that the pixel at the centre will be replaced by the median of all the pixels in A simple description of the algorithm can be found here. The curious, interested in exploring Median Filtering in the context of recent (and past) academic work can visit this. Implementation The principle of the median filter is to replace the gray level of each pixel by the median of the gray levels in a neighborhood of the pixels, instead of using the average operation. A simplified pseudo-code implementation of the median filtering algorithm is shown below. for each pixel: load filter window - filterH x filterW centred @ pixel sort the window store median value as output A naive CPU implementation of the same can be found at medianFilter_cpu() in cpuLib.cpp. You must implement an efficient algorithm for GPU. Think carefully about which loops to parallelize and which to execute sequentially inside the GPU kernel. Make sure you handle the edges of the image where the filter window may expand beyond the dimensions of the image. • Try to solve this at the algorithmic level instead of padding the image • As an optional exercise you can try to pad the image at CPU or GPU and report the overall performance impact of this additional step. • You may implement this in the medianFilter_cpu() prior to starting on your medianFilter_gpu(). Steps 1. Complete the CPU implementation o verify visually using the viewImageBytes.py script. 2. Write runGpuMedianFilter() as entrypoint function. 3. Write the kernel medianFilter_gpu() 4. Write the kernel sort_gpu() to sort the filter window o Choose the sorting algorithm of your choice. http://fourier.eng.hmc.edu/e161/lectures/smooth_sharpen/node2.html https://www.sciencedirect.com/topics/computer-science/median-filter o Carefully weigh (or better yet, measure - optional) ▪ the effects of a sequential vs recursive implementation (if applicable) ▪ and their effects on the SIMT stack. Optimizing the Kernels • This assignment is tightly coupled with the course material - utilize your understanding of the GPU architecture and the memory heirarchy to squeeze out the best performance for your kernel. • Repeated profiling with nvprof will help you compare the design options and evaluate the impact of your choices. • Think deeply and explore the design space in the following aspects o Careful and Optimal usage of shared memory resources [8] o Leveraging CPU-to-GPU data transfer and GPU memory access and coalescing patterns [6] [7] o Threadblock and grid assignment to shaders ▪ and how that influences the working set in each cache o the data storage format and access patterns ▪ to help select which loops to parallelize Insufficient facts always invites danger, Captain. -- Cdr. Spock, USS Enterprise Setting Up a Python Virtual Environment Use the following commands to setup and use a python virtual environment. This step is essential to install python packages on scholar without modifying the built-in Python copy. # Create a new python virtual environment in local directory `./python` $ python3 -m venv ./python # Activate your python virtual environment # **NOTE** This step must be repeated every time to select your venv as default python distribution $ source ./python/bin/activate # This step selects your "local" python environment for use instead of the system python installation # Verify python path https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#8 https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/ https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/ https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#6 https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#7 $ which python3 /home/ ... /lab2/python/bin/python3 # Install required python packages pip3 install numpy pillow # Learn about the command line arguments $ python3 makeBytesImage.py --help info: No GUI element found - disabling GUI usage: makeBytesImage.py [-h] [-f FILE] [-l LEVEL] [-r RATIO] Prepare Image for CUDA Programming Lab 2 optional arguments: -h, --help show this help message and exit -f FILE input image file path -l LEVEL added noise intensity -r RATIO Salt vs Pepper Ratio # Run script $ python makeBytesImage.py -f ./resources/lena512color.tiff info: No GUI element found - disabling GUI TIFF 512x512 RGB (512, 512, 3) Storing bytes to - /......./resources/lena512color.tiff.bytes 786448 bytes written ... Done ! Using the .bytes file format Modern images are stored in compressed formats like JPEG which have their own storage formats and compression/decompression schemes. A custom RAW Pixel format (.bytes) has been defined for the purpose of this class to enable you to quickly convert any image into the RGB pixels and use it with the code developed as part of this assignment. The required Python scripts for interconversion and viewing the original and processed images have been provided. • makeBytesImage.py can be used to convert any given image (tested with .jpg and .tiff formats) into a .bytes file. • viewBytesImage.py can be used in a GUI-enabled environment to view the contents of a .bytes image. For details about the command line arguments that can be passed to each script use $ python3 makeBytesImage.py --help # or $ python3 viewBytesImage.py --help NOTE : Viewing the images through any of this scripts will require you to run the scripts on a GUI enabled workspace (ThinClient or your local machines). • The scripts still work on console-only workspaces but will have the GIUs disabled. • In non-GUI environments, makeBytesImage.py will take console arguments and directly store the output into a .bytes file. // .bytes metadata typedef struct ImageDim_t { uint32_t height; uint32_t width; uint32_t channels; uint32_t pixelSize; } ImageDim; The bytes format packs any image in a raw 8-bit pixel format comprised of a 16 byte metadata followed by RGB inteleaved data in little-endian format . [5] • For the purpose of this assignment, all images will have 3 channels and 8-bit (1-Byte) pixels. • The first 16 bytes of any .bytes file will have the metadata required to interpret the contents of the rest of the file (can be used to calculate dimensions and sizes) • The remaining bytes in the file will have RGB pixels interleaved. • Pixels are stored in Row-Major format. Look at the contents of loadImageBytes() in cpuLib.cpp to get an idea of storage format. MetaData Px (0,0) Px (0,1) ... Px (0,N- 1) Px (1,0) ... Px (N- 1,N-1) R G B R G B R G B R G B R G B R G B R G B 16 bytes 3 bytes 3 bytes 3 bytes 3 bytes 3 bytes What to include in your report • Your parallelization strategy and any optimizations you have included (eg: shared memory resources, memory coaleascing etc) https://chortle.ccsu.edu/AssemblyTutorial/Chapter-15/ass15_3.html https://en.wikipedia.org/wiki/Endianness • Images showing the effectives of your implementation of median filtering. • Drawbacks (if any) of your implementation. PART B: Pooling A quick overview of the pooling operation in Neural Networks can be found here. Read carefully to understand the operation of a pooling layer and its various configuration parameters. Pooling is a feature commonly imbibed into Convolutional Neural Network (CNN) architectures. The main idea behind a pooling layer is to “accumulate” features from maps generated by convolving a filter over an image. Formally, its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network. The most common form of pooling is max pooling. Max pooling is done, in part, to help over-fitting by providing an abstracted form of the representation. As well, it reduces the computational cost by reducing the number of parameters to learn and provides basic translation invariance to the internal representation. Max pooling is done by applying a max filter to (usually) non- overlapping subregions of the initial representation. [10] The next two labs will introduce a deep neural network in its entirelty and provide you the opportunity to implement a foundational work on DNNs which has been cited > 76k times! - - AlexNet (http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convoluti onal.pdf). For this part, however, you are only required to implement the pooling layer in isolation. An illustration of the pool operation can be found below CPU Implementation https://d2l.ai/chapter_convolutional-neural-networks/pooling.html https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35#10 http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35/blob/master/figures/PoolSlide.png A basic incomplete implementaton of a Max-Pool layer is provided inside cpuLib.h. It lays down the API and (after completion) can be used for checking your GPU implementation later. 1. Complete the CPU code for max pooling 2. Add provisions for padding and strides o implement padding by using conditional statements rather than by actually padding the data. GPU Implementation The naive CPU implementation is a set of nested loops - each loop presents an oppornity for parallelization. Choose wisely. Reporting your Progress Refer to the Lab1 Readme, esp. sections on Building and Executing the Project and Reporting your Progress to build, test and profile your application. All the instructions provided in Lab 1 also apply to this Lab Each time you login • Load the gcc, cuda and cmake modules To avoid typing this every time, consider adding this command to ~/.bashrc. $ module load gcc cuda cmake . • Activate the python virtual environment setup within the assignment directory • $ cd
https://github.com/tgrogers/ece695-2021/blob/master/cuda-programming-assignment/cuda-part1.md#building-and-executing-the-project https://github.com/tgrogers/ece695-2021/blob/master/cuda-programming-assignment/cuda-part1.md#building-and-executing-the-project https://github.com/tgrogers/ece695-2021/blob/master/cuda-programming-assignment/cuda-part1.md#report https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35/blob/master/figures/poolOp.png https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35/blob/master/figures/PoolAlgoSlide.png • $ source ./python/bin/activate > (python) user@scholar-fe05:~/...../lab2 $ Succesful loading of the venv will be indicated by (python) in front of the shell prompt as shown in the last line above Performance and Speedup Time is the only true measure of performance Please report all metrics and speedups in terms of absolute execution time ( in μs or ms as applicable). Percentage doesn not convey the actual execution time. If you happen to use Unified Virtual Memory (cudaMallocManaged) report seperate sets of data for managed and unmanaged device memory. An example of the stacked bar chart format expected for reporting the breakup of workload execution time for the report is given below. Figure is indicative of format and has no correlation with any profiling graph real or fictional. Graphs Here are some ideas for the graphs that you can use to present your efforts and results in the report. 1. Stacked bar chart (as in figure) showing execution time vs problem size 2. line or bar chart showing speedup vs naive cpu 3. chart showing speedup vs optimized cpu (if you make any changes) 4. chart with performance vs shared memory strategy (size or organization) - same or varying problem size 5. chart with performance gain acheived by any tenchiques you have applied Be terse and lucid with explanations of your observations from the trends you observe. References https://github.com/purdue-ece60827-sp23/cuda-programming-assignment-2-jkye35/blob/master/figures/stackedBar.png [1] Wikipedia -