Hilal Coban
5 min read · Mar 13, 2021

SEATTLE AIRBNB DATA ANALYSIS PROCESS WITH MACHINE LEARNING ALGORITHM: LINEAR REGRESSION MODEL

In this post, you will find the data analysis process and a machine learning application that uses a Linear Regression model. This project is part of the Udacity Data Science Nanodegree program.

The detailed analysis, with all the required code, is available as a Jupyter Notebook on GitHub. All analyses follow the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology.

Introduction

In this project, I used the Seattle Airbnb dataset. The link to the website I collected the dataset from is in the References section below. The dataset describes listing activity in Seattle and contains 3818 listings. The next sections answer three business-related questions based on this data.

Q1: How are properties distributed among neighborhoods?

The listings span 86 neighborhoods. Based on the graph below, Broadway, Belltown, Wallingford, Fremont, and Minor together account for around 29% of the total Airbnb listings in Seattle.
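As a rough illustration of how neighborhood shares like these can be computed, here is a minimal pandas sketch. The rows are made up for the example, and using `neighbourhood_cleansed` as the grouping column is an assumption based on the variable list later in this post; the notebook may do this differently.

```python
import pandas as pd

# Toy stand-in for the listings table; these rows are invented for illustration.
listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Broadway", "Broadway", "Belltown", "Fremont",
        "Minor", "Broadway", "Belltown", "Wallingford",
    ],
})

# Fraction of all listings that falls in each neighborhood, largest first.
share = listings["neighbourhood_cleansed"].value_counts(normalize=True)

# Combined share of the top N neighborhoods.
top_n = 2
top_share = share.head(top_n).sum()

print(share.head(top_n))
print(f"Top {top_n} neighborhoods hold {top_share:.1%} of listings")
```

With the real data, setting `top_n = 5` would give the combined share of the five largest neighborhoods.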

Q2: How many times is each room type reviewed? What is the average review score rating? Is there a relationship between room type price, the number of reviews, and the review score rating? Which neighborhoods have the highest average price for each room type?

In this question, I would like to investigate whether the number of reviews and the review scores are related to room type selection. This will help us identify the most preferred room types based on the number of reviews and the review score ratings. Another important point is whether there is any relationship between room type price and review rating, so we will also check if price and reviews are correlated.

a) How many times is each room type reviewed? What is the average review score rating?

In the table, the Entire home/apt room type received the highest number of reviews: 52465, with an average rating of 94.488. Private rooms received 30870 reviews with an average rating of 94.756, and shared rooms received 1514 reviews with an average rating of 93.508. This suggests that Entire home/apt is the most preferred room type and that customers were generally satisfied with what they selected.
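A table like the one described above can be built with a single group-by. The sketch below uses invented rows; the column names `room_type`, `number_of_reviews`, and `review_scores_rating` are taken from the variable list later in this post, and the output names `total_reviews` and `mean_rating` are hypothetical.

```python
import pandas as pd

# Toy listings; the values are invented for illustration only.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "number_of_reviews": [10, 30, 5, 15, 2],
    "review_scores_rating": [95.0, 93.0, 96.0, None, 90.0],
})

# Total number of reviews and mean rating per room type.
# Missing ratings are skipped by mean(), not counted as zero.
summary = (
    listings.groupby("room_type")
    .agg(
        total_reviews=("number_of_reviews", "sum"),
        mean_rating=("review_scores_rating", "mean"),
    )
    .sort_values("total_reviews", ascending=False)
)

print(summary)
```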

From the neighborhood perspective, the graph shows that the most reviewed neighborhoods are Other neighborhoods, Downtown, and Capitol Hill. In Downtown, the Entire home/apt room type received the highest number of reviews.

b) Is there a relationship between room type price, the number of reviews, and the review score rating?

We also looked at the relationship between price and the number of reviews and review score rating variables. There appears to be an inverse correlation between price and the number of reviews, and a direct relationship between price and review score rating, but we can't say that either is strong.
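For readers who want to reproduce this kind of check, here is a small sketch of a Pearson correlation between price and the two review variables. The data is invented, and the cleaning step assumes that price is stored as text such as "$85.00"; if the column is already numeric, that step can be dropped.

```python
import pandas as pd

# Toy listings; values are invented. Price as text is an assumption.
listings = pd.DataFrame({
    "price": ["$85.00", "$150.00", "$1,250.00", "$60.00", "$200.00"],
    "number_of_reviews": [120, 40, 2, 200, 15],
    "review_scores_rating": [92.0, 95.0, 99.0, 90.0, 96.0],
})

# Strip "$" and "," so that price can be treated as a number.
listings["price"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Pairwise Pearson correlations; the "price" column shows how
# each review variable moves with price.
corr = listings[["price", "number_of_reviews", "review_scores_rating"]].corr()

print(corr["price"])
```

A value close to 0 means a weak linear relationship, which is why the sign alone is not enough to call a correlation meaningful.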

c) Which neighborhoods have the highest average price for each room type?

We also looked at the prices of room types in different neighborhoods and computed the average price for each room type. On average, the most expensive neighborhoods for the Entire home/apt room type are Magnolia, Queen Anne, West Seattle, Downtown, and Cascade.
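One way to get average prices per neighborhood and room type is a two-level group-by followed by a pivot. This is a sketch on invented rows; the grouping column `neighbourhood_cleansed` is an assumption, so swap in whichever neighborhood column the notebook actually groups by.

```python
import pandas as pd

# Toy listings with numeric prices; values are invented for illustration.
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["Magnolia", "Magnolia", "Downtown",
                               "Downtown", "Downtown"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room"],
    "price": [250.0, 80.0, 180.0, 160.0, 90.0],
})

# Mean price per (neighborhood, room type), with room types as columns.
mean_price = (
    listings.groupby(["neighbourhood_cleansed", "room_type"])["price"]
    .mean()
    .unstack("room_type")
)

# Neighborhoods ranked by average price of an entire home/apt.
top_entire = mean_price["Entire home/apt"].sort_values(ascending=False)

print(top_entire.head())
```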

Q3: Implement a linear regression model to forecast price based on the selected variables

We built an ML model to see how prices change with different listing characteristics. To do this, we selected a set of columns as independent variables; the dependent variable is price. The independent variables are listed below.

SELECTED INDEPENDENT VARIABLES: host_response_time, host_response_rate, property_type, room_type, accommodates, bathrooms, bedrooms, beds, bed_type, minimum_nights, maximum_nights, availability_365, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, instant_bookable, cancellation_policy, require_guest_profile_picture, require_guest_phone_verification, calculated_host_listings_count, reviews_per_month, neighbourhood_cleansed

Our linear regression model explains around 60% of the variation in price in the training set and around 60% in the test set. In other words, the model's R² is about 0.60 on both sets, and the similar scores suggest that the model is not overfitting the training data.
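To make the train/test comparison concrete, here is a minimal scikit-learn sketch of the general workflow: one-hot encode the categorical columns, split the data, fit the model, and compute R² on both sets. It runs on synthetic data with only three of the variables, so the split ratio, the random seeds, and the numbers it prints are assumptions for illustration and not the results of this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings data (not the real dataset).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
    "room_type": rng.choice(
        ["Entire home/apt", "Private room", "Shared room"], size=n
    ),
})
# Synthetic price: driven by two numeric features plus random noise.
y = 30 * X["accommodates"] + 20 * X["bathrooms"] + rng.normal(0, 25, size=n)

# One-hot encode the categorical column; drop_first avoids a redundant dummy.
X_encoded = pd.get_dummies(X, columns=["room_type"], drop_first=True, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R² is the share of price variation explained by the model.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))

print(f"R² train: {r2_train:.3f}")
print(f"R² test:  {r2_test:.3f}")
```

A test R² far below the training R² would point to overfitting; similar values, as reported above, are the healthier pattern.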

We also looked at the p-values and coefficients of the model. If the p-value of an independent variable is less than 0.05, its association with price is statistically significant. For example, the p-values of 'accommodates', 'bathrooms', 'bedrooms', and 'reviews_per_month' are less than 0.05, so we can say that changes in these variables are associated with changes in price. Such variables are statistically significant and probably worthwhile additions to the regression model. Otherwise, we do not have enough evidence of a significant relationship between that X variable and y.
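For readers curious where such p-values come from, the sketch below computes them by hand with the classical OLS t-test, using NumPy and SciPy on synthetic data. This is not necessarily how the notebook obtains them (a statistics library would normally report these directly); `noise_feature` is a hypothetical column that is unrelated to price by construction.

```python
import numpy as np
from scipy import stats

# Synthetic data: price depends on accommodates, not on noise_feature.
rng = np.random.default_rng(0)
n = 300
accommodates = rng.integers(1, 9, size=n).astype(float)
noise_feature = rng.normal(size=n)
price = 50 + 30 * accommodates + rng.normal(0, 20, size=n)

# Design matrix with an explicit intercept column.
X = np.column_stack([np.ones(n), accommodates, noise_feature])
names = ["intercept", "accommodates", "noise_feature"]

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = price - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-value for the null hypothesis "coefficient is zero".
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, b, p in zip(names, beta, p_values):
    print(f"{name:>14}: coef={b:8.3f}  p={p:.4f}")
```

The printed coefficient signs can be read the same way as in the model above: a positive coefficient means price tends to rise with that variable when the others are held fixed.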

The sign of a regression coefficient tells you whether the relationship between an independent variable and the dependent variable is positive or negative, holding the other variables fixed. For example, 'accommodates', 'bathrooms', 'bedrooms', and 'beds' have a positive relationship with price, while 'number_of_reviews' and 'review_scores_checkin' have a negative one.

You can see the rest of the detailed analysis in the GitHub repository.

References

Seattle Airbnb Open Data: https://www.kaggle.com/airbnb/seattle

Github: https://github.com/hllcbn/Seattle_Airbnb_Project

Hilal Coban

Software Engineer in Test who can build ML Models as well as testing. So complicated to tell