For the past few decades, with the fast growing market of digital platforms, firms have tried to customize the advertising of their products based on individual customers’ preferences or interests. This practice has been utilized across various industries and companies, from the e-commerce site Amazon suggesting relevant products to the streaming platform Netflix recommending similar shows to their users’ view history and profile. The recommendation systems help increase sales as the users are able to easily see and purchase recommended products that match their needs and preferences.
In this project, we focus on devising a restaurant recommendation system (hereby referred to as “recommender”). We use data of restaurants, customer profiles, and their reviews from Yelp, a platform for crowd-sourced reviews about businesses. As an individual has unique restaurant preferences, such as cuisines, ambience, pets, diet types, and/or parking availability, we build a recommender to recommend restaurants to users based on the insights gleaned from their reviews on the previous restaurants they have been to.
The data is downloaded from Yelp official website. There are three datasets relevant to our analysis and models: business
, review
, and user
data. The business
dataset contains information about the businesses including name, location, hours, average rating stars, hours, number of reviews, and other features such as cuisine types and parking availability. The user
data include the user’s friend mapping and all the metadata associated with the user such as number of upvotes. The review
dataset records full review text data as well as the user_id
who wrote the review and the business_id
for which the review was written. There are 150,346 businesses and 6,990,280 reviews in the Yelp original datasets.
We converted the original datasets which are in JSON format to .feather
filetype to reduce storage space and optimize faster reading. The documentation of data cleaning process can be found in recommender_system_report.ipynb
Note that there are dependencies among the data cleaning Jupyter notebooks. The order of the run should be as follows:
data cleaning/business_data_inspect.ipynb
data/yelp_business_cleaned.feather
.data cleaning/review_data_inspect.ipynb
business_id
), reducing the dataset size from 6,990,280 to 5,257,329 entries.data/yelp_review_cleaned.feather
.data cleaning/user_data_inspect.ipynb
elite
, friends
, and yelping_since
.data/yelp_user_cleaned.feather
models/base_model.ipynb
models/advanced_model.ipynb
models/final_model.ipynb