Business Studies


As a group, we investigated data from Airbnb to see what conclusions we could draw about the types of listings that are recorded. Of the data sets available, we chose one with 2,333 Airbnb listings in Asheville, North Carolina, along with their corresponding attributes. At first glance, we realized that a good amount of data cleaning was required before we could evaluate this data because it contained attributes we did not need. The attributes we removed were host name, host ID, neighborhood group, latitude, longitude, date of last review, and calculated host listings count (the reasons for removing them and more details about the attributes are covered in the following sections). In addition to filtering these attributes out, we cleaned the data set by removing any listings with missing values for the attributes we preserved. After cleaning the data set, we found that the most obvious dependent variable to predict would be price. With price as our predicted value, we realized that a price prediction would be most useful for new incoming listings that would not yet have any of these attributes recorded. We then concluded that Airbnb could use these predictions to help recruit new hosts: if they were to hypothetically list their home, room, or property on Airbnb, they could see what sort of price point they could reasonably expect to get.
For our group’s data analytics project, we thought it would be interesting to explore data on Airbnb properties and their specific characteristics. To do this, our group looked at several different data sets provided on the website. In the end, we chose the data set that pertained to the Airbnb properties in Asheville, North Carolina. This data covered 2,333 different listings and recorded several different metrics for each listing in Asheville.
Originally, our group’s hope was to find out whether the analysis would reveal certain trends with regard to pricing, number of reviews, and reviews per month. After cleaning our data, we quickly found these trends, so we decided to go a step further and pose a specific challenge to solve. For our predictive tests, we decided to use both linear regression and K-Nearest Neighbors, and then used cross-validation to see which of the two would be the better forecasting method for predicting the price of an Airbnb listing. Our goal was to find the model that gives Airbnb users the most accurate predicted price points for future Asheville listings. With accurate predicted prices, Airbnb users can find the right price at which to list their own properties and can judge the prices of properties that will be listed on the website in the future. An accurate model also lets customers predict prices for new incoming listings that would not yet have any of these attributes recorded. Additionally, it lets potential hosts see what price they might be able to list their properties at on Airbnb.
The group’s Airbnb data pertained to the 2,333 property listings in Asheville, North Carolina. This data set had many different variables: ID, name, host ID, host name, neighborhood group, neighborhood, latitude, longitude, room type, price, minimum nights, number of reviews, date of last review, reviews per month, calculated host listings count, and availability over 365 days. While some of this data was relevant to our project, some of it needed to be cleaned and some variables needed to be removed. In RapidMiner, we removed host name, host ID, neighborhood group, latitude, longitude, date of last review, and calculated host listings count. We removed host name and host ID because there was already an ID variable that identified every property. We focused on neighborhood instead of neighborhood group because we wanted to keep the most meaningful data and felt that neighborhood groups would generalize our findings too much. We did not feel the need to keep latitude and longitude for a similar reason. As for date of last review, we removed it because keeping the variable in did not lead to more accurate predictions. Calculated host listings count was removed because that information is about the host and not the property. Once these variables were removed, all we were left with was a few missing cells in one column, so we relied on the Replace Missing Values operator to handle those cells.
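The cleaning steps described above were done in RapidMiner; as a rough illustration only, they could be sketched in plain Python as follows. The snake_case column names and the two sample listings are assumptions made up for this example, and the sketch drops rows with missing values (as described in the introduction) rather than reproducing the Replace Missing Values operator.

```python
# Attributes the report removes; the snake_case column names are assumptions.
DROPPED_COLUMNS = {
    "host_name", "host_id", "neighbourhood_group", "latitude",
    "longitude", "last_review", "calculated_host_listings_count",
}

def clean_listings(rows):
    """Drop the unwanted columns, then drop rows that still have a missing value."""
    cleaned = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in DROPPED_COLUMNS}
        if all(value not in (None, "") for value in kept.values()):
            cleaned.append(kept)
    return cleaned

# Two made-up listings: the second lacks reviews_per_month and is dropped.
sample = [
    {"id": 1, "host_id": 10, "room_type": "Private room",
     "price": 80, "reviews_per_month": 1.5},
    {"id": 2, "host_id": 11, "room_type": "Entire home/apt",
     "price": 150, "reviews_per_month": None},
]
cleaned_sample = clean_listings(sample)
```

Dropping the columns before checking for missing values means a listing is not discarded just because a removed attribute, such as date of last review, was empty.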
For this project, we decided that running both descriptive and predictive analytics would be the most beneficial way to understand and evaluate the data. For our descriptive methods, we chose a correlation matrix and K-Means clustering. For our predictive methods, we chose linear regression and K-Nearest Neighbors. Our dependent variable was the price of the Airbnb listings.
We first ran a correlation matrix and did not find a high correlation coefficient between any of the attributes that we could input into RapidMiner. One of the only relevant correlations we found was a positive correlation of 0.598 between number of reviews and reviews per month. After that, we used K-Means clustering to find patterns within the listings and see how their attributes, including their prices, would group together.
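For readers unfamiliar with what a correlation matrix computes, the sketch below shows the Pearson correlation coefficient for every pair of numeric attributes. It is an illustration under stated assumptions, not the RapidMiner process we ran: the function names, column names, and the four toy rows are all invented for the example and are not taken from the Asheville data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict holding the correlation of every column with every other."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values only (not the Asheville data).
toy = {
    "number_of_reviews": [5, 20, 40, 80],
    "reviews_per_month": [0.5, 1.0, 3.0, 2.5],
    "price": [120, 95, 90, 70],
}
matrix = correlation_matrix(toy)
```

Each entry lies between -1 and 1, so a value such as the 0.598 reported above indicates a moderate positive relationship.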
We then switched our focus to predictive methods of analysis: linear regression and K-Nearest Neighbors. Our linear regression model produced several interesting results. First, there were several negative coefficients, including number of reviews (-0.203) and reviews per month (-6.324). Both of these attributes therefore had inverse relationships with our dependent variable, price: holding the other attributes constant, a listing with more reviews or more reviews per month had a lower predicted price. So Airbnb users looking at a higher-priced listing could expect it, on average, to have fewer reviews and fewer reviews per month. When we ran the multiple regression, a few of our attributes had very high p-values, leading us to determine that they were statistically insignificant and not closely related to price. These were room type = entire home/apt (p = 0.9) and room type = private room (p = 0.419). On the other hand, reviews per month was significant (p = 0.04), and number of reviews and availability 365 were clearly significant, with p-values rounding to zero. All three showed enough evidence of a relationship with the dependent variable. As for room type = hotel room, we determined that it was marginally significant, with a p-value of 0.065.
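One way to read the negative coefficients above is as the change in predicted price when a single attribute changes and everything else is held constant. The short sketch below works that arithmetic through using the two coefficients reported in the text; the dictionary keys and the function name are assumptions introduced for illustration.

```python
# Coefficients reported in the text; the dictionary keys are assumed names.
COEFFICIENTS = {"number_of_reviews": -0.203, "reviews_per_month": -6.324}

def predicted_price_change(changes):
    """Change in predicted price for the given attribute changes, all else equal."""
    return sum(COEFFICIENTS[name] * delta for name, delta in changes.items())

# One extra review per month, with every other attribute unchanged.
change_one_review_per_month = predicted_price_change({"reviews_per_month": 1})

# Ten extra total reviews, with every other attribute unchanged.
change_ten_reviews = predicted_price_change({"number_of_reviews": 10})
```

Under this model, one additional review per month lowers the predicted price by 6.324, and ten additional total reviews lower it by 2.03, both in the same units as the price column.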
Once we had run those processes in RapidMiner, we ran them through cross-validation to see which one would yield the lowest root mean squared error and give us our most accurate predictive model. We imported the Excel file into RapidMiner and set the price column as the label. After that we ran a
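The comparison described above, cross-validating linear regression against K-Nearest Neighbors and keeping the model with the lower root mean squared error, can be sketched in plain Python. This is a minimal illustration and not the RapidMiner process itself: it uses a single predictor, a hypothetical k of 3, five folds, and synthetic data generated inside the script, all of which are assumptions.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """k-NN regression: average the targets of the k closest training points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cross_val_rmse(fit, xs, ys, folds=5):
    """Average root mean squared error over `folds` held-out folds."""
    scores = []
    for fold in range(folds):
        train = [i for i in range(len(xs)) if i % folds != fold]
        test = [i for i in range(len(xs)) if i % folds == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared = [(model(xs[i]) - ys[i]) ** 2 for i in test]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(scores) / len(scores)

# Synthetic data (not Airbnb data): price falls roughly linearly with availability.
rng = random.Random(0)
availability = [rng.uniform(0, 365) for _ in range(60)]
price = [200 - 0.3 * a + rng.gauss(0, 10) for a in availability]

rmse_linear = cross_val_rmse(fit_line, availability, price)
rmse_knn = cross_val_rmse(fit_knn, availability, price)
best = "linear regression" if rmse_linear <= rmse_knn else "k-NN"
```

Each listing is held out exactly once, so both models are scored only on data they were not fitted to, which is what makes the two error figures comparable.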
Discussion and Conclusion: