Answer To: i have attached
Swapnil answered on Jul 13 2022
Description:
This project presents a method for building a recommendation system using principal components analysis and clustering analysis. Many IT companies use recommendation systems to make personalized recommendations to their customers and help them decide on products. Lodging companies, Trivago among them, are overlooking the power of recommendation systems; this project therefore builds one from accommodation data for Sydney. The system is powered by clustering analysis: k-means clustering and model-based clustering are each performed on two different plots. One plot is generated from the coordinate data of the accommodations, and the other from the principal components of different datasets. While building the plots based on principal components, the components themselves are analyzed to obtain a deeper understanding of the dataset.
A recommendation system is one of the most popular and powerful engines companies use to make personalized recommendations to their customers and help them decide on products. While many companies use this machine-learning-based system, it is hard to find one when it comes to booking accommodation for travel. Lodging companies such as Trivago do not provide any recommendations based on a customer's booking history. Every time customers plan a new trip, they have to rely on filters and compare their options by hand. When I traveled to Sydney earlier this year, I spent quite a long time choosing the perfect accommodation. To avoid repeating this the next time I visit Sydney, I decided to build a recommendation system using the methods I have learned in this course.
Vision:
Explore the datasets for deeper understanding: After obtaining a clean and compact dataset, I will explore the quantitative variables through scatterplots and bivariate boxplots. Additionally, less common plots such as the star plot can be used to explore the dataset. For the categorical variables, I will create a word cloud to get the gist of which aspects hosts are selling to customers. Also, using the coordinate data, the listings can be plotted as data points on a map of Sydney.
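As a rough illustration of the word cloud step, the sketch below counts word frequencies, which are what a word cloud sizes its words by. The listing titles and the stop-word list are hypothetical placeholders, since the text columns of the dataset are not shown here.

```python
import re
from collections import Counter

# Hypothetical listing titles; the real text column is not shown in the write-up.
titles = [
    "Sunny studio near Bondi Beach",
    "Modern apartment with harbour view",
    "Cozy studio close to the beach",
]

# A tiny hand-picked stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "to", "with", "near", "a", "in", "of", "and"}


def word_frequencies(texts):
    """Count lowercase words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


freq = word_frequencies(titles)
print(freq.most_common(3))
```

The resulting counts could then be passed to any word cloud plotting tool; the larger the count, the larger the word is drawn.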
Reduce dimension and describe the dataset with principal components analysis: After exploring the datasets and getting a deeper understanding of the data, we can divide the analysis into different cases, each using a different set of variables. For each case we can use principal components analysis to express the data in a two-dimensional plot. Scree plots can be used to choose how many principal components to keep, and biplots can be used to better understand the plots built from the principal components.
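To make the dimension-reduction step concrete, here is a minimal sketch of principal components analysis in plain Python. It standardizes the variables, builds their covariance matrix, and extracts the two leading components by power iteration with deflation. The six rows of data and all function names are hypothetical; in practice a statistics library would do this work, and the explained-variance ratios computed at the end are the values a scree plot would display.

```python
import math

# Hypothetical rows: (price, rating, number_of_reviews). Not real Sydney data.
listings = [
    (120, 92, 15),
    (80, 85, 40),
    (200, 97, 5),
    (60, 80, 60),
    (150, 95, 10),
    (95, 88, 30),
]


def standardize(rows):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def top_eigenpairs(matrix, k, iters=500):
    """Leading k eigenpairs of a symmetric matrix via power iteration with deflation."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        # Arbitrary non-symmetric start vector to avoid an unlucky orthogonal start.
        v = [1.0 / (i + 1) for i in range(d)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((value, v))
        # Deflate: remove the component just found before searching for the next.
        m = [[m[i][j] - value * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs


z = standardize(listings)
cov = covariance(z)
pairs = top_eigenpairs(cov, k=2)

# Share of total variance captured by each component (the scree plot values).
total_variance = sum(cov[i][i] for i in range(len(cov)))
ratios = [value / total_variance for value, _ in pairs]

# Two-dimensional coordinates of every listing, ready for a scatterplot.
scores = [[sum(r[j] * vec[j] for j in range(len(r))) for _, vec in pairs] for r in z]

print(ratios)
print(scores[0])
```

Standardizing first matters here because price, rating, and review counts are on very different scales; without it the variable with the largest raw variance would dominate the first component.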
Build clustering models to group similar accommodations: After expressing the dataset with principal components and plotting it in two dimensions, I will build clustering models using two different methods, k-means and model-based clustering. I will apply different variables for the different cases and obtain multiple clustering models.
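The k-means half of this step can be sketched as follows, using Lloyd's algorithm in plain Python on two-dimensional points. The points, the seed, and the function name are illustrative assumptions, and model-based clustering (which fits a mixture of distributions rather than nearest centroids) is not covered by this sketch.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
    return labels, centroids


# Hypothetical 2D points, e.g. scores on the first two principal components.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
]

labels, centroids = kmeans(points, k=2)
print(labels)
```

K-means depends on its random start, so in a real analysis it is common to run it several times and keep the solution with the smallest within-cluster distances.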
Make a recommendation based on booking history: After building multiple clustering models, I will take the place I stayed at in Sydney earlier this year, which I really liked, locate it in each model, and get recommendations based on the different clustering models.
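One simple way to turn a clustering into recommendations, sketched below, is to look up the cluster of the accommodation I liked and return the other members of that cluster, nearest first. This ranking rule is an assumption made for illustration, and the points, labels, and function name are hypothetical.

```python
import math


def recommend(liked_index, points, labels, top_n=3):
    """Return indices of the nearest listings that share the liked listing's cluster."""
    cluster = labels[liked_index]
    candidates = [
        i for i, label in enumerate(labels)
        if label == cluster and i != liked_index
    ]
    # Rank same-cluster listings by distance to the liked listing.
    candidates.sort(key=lambda i: math.dist(points[i], points[liked_index]))
    return candidates[:top_n]


# Hypothetical 2D positions of listings and cluster labels from some clustering model.
points = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.9), (4.0, 4.0), (4.2, 3.9)]
labels = [0, 0, 0, 1, 1]

print(recommend(0, points, labels))           # listings similar to listing 0
print(recommend(3, points, labels, top_n=1))  # single best match for listing 3
```

Running the same lookup against each clustering model gives one recommendation list per model, which can then be compared.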
Internal and External Stakeholders:
The simplest way to explore data is to plot it. Our dataset, however, has two types of variables: categorical and quantitative. The categorical variables will be explored later, after the quantitative ones. The quantitative variables are minimum nights, number of reviews, reviews per month, calculated host listings, availability, ratings, and the coordinates of the location. To check whether the two categorical variables are actually important in our dataset, we can compare the mean rating across the levels of each variable. The coordinates will be dealt with later by mapping them.
To make the plots easier to read, I divided the quantitative variables into two groups according to which variables matter most when booking accommodation. I interviewed several friends, asking which of the quantitative variables they consider most when booking accommodation for their trips. Their answers were consistent: price, ratings, and reviews. I therefore split the quantitative variables into two groups and plotted scatterplots with bivariate boxplots for each.
First, we can see that the variables have clear minimum and maximum values. For example, the maximum rating is 100 and the minimum price is $0. Also, since accommodations tend to adopt similar settings in order to compete with each other, the data points are fairly condensed. Beyond that, number of reviews and reviews per month are linearly correlated, while ratings show no significant correlation with the other variables. The dataset also includes multiple outliers, which might have affected the result of the analysis. From this we can conclude that every observation in the dataset is unique, since customers can have different experiences even at the same accommodation.
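The kind of linear relationship described above can be quantified with the Pearson correlation coefficient. The sketch below computes it by hand on made-up values; the numbers are hypothetical and only mimic the pattern reported here (reviews and reviews per month moving together, ratings unrelated).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical values for five listings, not taken from the real dataset.
number_of_reviews = [10, 24, 36, 50, 61]
reviews_per_month = [0.5, 1.1, 1.9, 2.4, 3.1]
ratings = [95, 88, 97, 90, 93]

r_reviews = pearson(number_of_reviews, reviews_per_month)
r_ratings = pearson(number_of_reviews, ratings)
print(round(r_reviews, 3), round(r_ratings, 3))
```

A coefficient near 1 or -1 indicates a strong linear relationship and one near 0 indicates little or none. Because the coefficient is sensitive to outliers, it is worth reading alongside the scatterplots rather than on its own.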
We can also see that there are some outliers in all of the bivariate boxplots. However, I will keep these outliers: every observation in this kind of data is unique, so the outliers might carry important information. By looking at...