Project 1: Group Restaurants Choice¶
By: Jacob Andreesen, Jeff Chen, Miao Xu, Yiyi Wang
Finding an ideal restaurant for a group of friends is a perennial struggle for students, especially newcomers looking for new and exciting places to go. Inspired by this challenge, our group aims to build a personalized restaurant recommender system prototype that serves a small group of people, meeting each member's requirements and matching their tastes.
%load_ext autotime
%load_ext nb_black
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# For scraping
import time
import urllib.request, json
from flatten_dict import flatten
import requests
import copyheaders
from bs4 import BeautifulSoup
# General tools
import regex as re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
import geopandas
from geopy import Nominatim
time: 1.46 s (started: 2021-03-09 15:47:27 +08:00)
Web Scraping Yelp!¶
Instead of actually web scraping the Yelp! website, we will simply grab data from the https://www.yelp.com/search/snippet
API endpoint. We attempted to scrape the website directly, but it was hard to select the specific elements we required, and some of them are only revealed through button clicks. That would have meant using browser automation software like Selenium to simulate clicks and then grab the data from the resulting HTML, a little too much unnecessary work.
headers_str = b"""
cache-control: max-age=0, must-revalidate, no-cache, no-store, private
cache-control: no-transform
cf-cache-status: DYNAMIC
cf-ray: 58b26184fbd76c86-SJC
cf-request-id: 02635b471c00006c86b019b200000001
content-encoding: gzip
content-security-policy: report-uri https://www.yelp.com/csp_block?id=bf59639897830a99&page=enforced_by_default_directives&policy_hash=7b6f2d6630868fdb2698dac44731677c&site=www&timestamp=1588093661; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com
content-security-policy-report-only: report-uri https://www.yelp.com/csp_report_only?id=bf59639897830a99&page=csp_report_frame_directives%2Cfull_site_ssl_csp_report_directives&policy_hash=9dd00a1a6fbb402584b7ce0c1fdb4d14&site=www&timestamp=1588093661; frame-ancestors 'self' https://*.yelp.com; default-src https:; img-src https: data: https://*.adsrvr.org; script-src https: data: 'unsafe-inline' 'unsafe-eval' blob:; style-src https: 'unsafe-inline' data:; connect-src https:; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com; frame-src https: yelp-webview://* yelp://* data:; child-src https: yelp-webview://* yelp://*; media-src https:; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; form-action https: 'self'
content-type: application/json; charset=utf-8
date: Tue, 28 Apr 2020 17:07:42 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
expires: Tue, 28 Apr 2020 17:07:41 GMT
pragma: no-cache
referrer-policy: origin-when-cross-origin
server: cloudflare
status: 200
strict-transport-security: max-age=31536000; includeSubDomains; preload
vary: User-Agent
vary: Accept-Encoding
x-b3-sampled: 0
x-content-type-options: nosniff
x-mode: ro
x-node: www_all
x-node: 10-69-179-105-uswest2bprod-9c0a6478-895a-11ea-98c5-b6d34d770
x-proxied: 10-69-159-164-uswest2bprod
x-routing-service: 10-69-187-145-uswest2bprod; site=www
x-xss-protection: 1; report=https://www.yelp.com/xss_protection_report
x-zipkin-id: 9a87fa4730749a04
"""
headers = copyheaders.headers_raw_to_dict(headers_str)
restaurant_list_url = (
lambda index: f"https://www.yelp.com/search/snippet?find_desc=&find_loc=Los%20Angeles%2C%20CA&start={index}"
)
total_number_of_restaurants = 240
yelp_raw_data = []
for i in tqdm(range((total_number_of_restaurants // 10) + 1)):
    index = min(i * 10, total_number_of_restaurants)
    retries, max_retries = 0, 500
    while retries < max_retries:
        retries += 1
        page = requests.get(restaurant_list_url(index), headers=headers)
        try:
            if page.ok:
                yelp_raw_data += json.loads(page.content)["searchPageProps"][
                    "mainContentComponentsListProps"
                ]
                break  # got this page's data; move on to the next index
        except (json.JSONDecodeError, KeyError):
            if retries % 10 == 0:
                print(f"Number of attempts to get data for {index}: {retries}")
            continue
    else:  # loop exhausted without a successful fetch
        print(f"Couldn't get data for index: {index}")
 80%|████████  | 20/25 [00:35<00:09,  1.87s/it]
Number of attempts to get data for 200: 10
...
Number of attempts to get data for 200: 370
 96%|█████████▌| 24/25 [04:38<00:26, 26.14s/it]
Number of attempts to get data for 240: 10
...
Number of attempts to get data for 240: 500
100%|██████████| 25/25 [09:52<00:00, 23.68s/it]
time: 9min 51s (started: 2021-03-09 15:47:29 +08:00)
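The retry loop above fires requests back to back with no pause between attempts, which may well contribute to the long retry runs seen for indices 200 and 240. A small exponential-backoff helper (a sketch, not part of the original notebook, using the `time` module imported earlier) would space out the retries:

```python
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before retry `attempt`: 0.5 s, 1 s, 2 s, ..., capped at `cap`."""
    return min(cap, base * (2 ** attempt))


# Inside the retry loop one would call, e.g.:
#     time.sleep(backoff_delay(retries))
```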
yelp_raw = pd.DataFrame(
[flatten(content) for content in tqdm(yelp_raw_data) if "bizId" in content.keys()]
)
yelp_raw.to_csv("./data/yelp_raw.csv")
yelp_raw.head()
100%|██████████| 480/480 [00:00<00:00, 33043.36it/s]
(searchActions,) | (isYelpGuaranteed,) | (bizId,) | (tags,) | (scrollablePhotos, allPhotosHref) | (scrollablePhotos, photoHref) | (scrollablePhotos, photoList) | (scrollablePhotos, isResponsive) | (scrollablePhotos, isScrollable) | (searchResultLayoutType,) | ... | (snippet, thumbnail, src) | (snippet, thumbnail, srcset) | (snippet, readMoreText) | (snippet, text) | (searchResultBusiness, parentBusiness, businessUrl) | (searchResultBusiness, parentBusiness, name) | (searchResultBusinessHighlights, bizSiteUrl) | (searchResultBusinessHighlights, businessHighlights) | (childrenBusinessInfo, businessUrls) | (childrenBusinessInfo, businessNames) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [] | False | rF7KNmSv5sYbwd3D5sA_vw | [{'label': {'color': 'normal', 'text': 'New on... | /biz_photos/rF7KNmSv5sYbwd3D5sA_vw | /adredir?ad_business_id=rF7KNmSv5sYbwd3D5sA_vw... | [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... | True | True | scrollablePhotos | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | [] | False | 7O1ORGY36A-2aIENyaJWPg | [] | /biz_photos/7O1ORGY36A-2aIENyaJWPg | /biz/howlin-rays-los-angeles-3 | [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... | True | True | scrollablePhotos | ... | https://s3-media0.fl.yelpcdn.com/photo/22VFkvu... | https://s3-media0.fl.yelpcdn.com/photo/22VFkvu... | more | I FINALLY got to try this place... but sadly b... | NaN | NaN | NaN | NaN | NaN | NaN |
2 | [] | False | KQBGm5G8IDkE8LeNY45mbA | [] | /biz_photos/KQBGm5G8IDkE8LeNY45mbA | /biz/wurstk%C3%BCche-los-angeles-2 | [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... | True | True | scrollablePhotos | ... | https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL... | https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL... | more | A group of us stopped by for dinner before exp... | NaN | NaN | NaN | NaN | NaN | NaN |
3 | [{'content': {'text': {'text': 'Offers takeout... | False | iSZpZgVnASwEmlq0DORY2A | [] | /biz_photos/iSZpZgVnASwEmlq0DORY2A | /biz/daikokuya-little-tokyo-los-angeles | [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... | True | True | scrollablePhotos | ... | https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p... | https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p... | more | Daikokuya just never disappoints!\nWent last n... | NaN | NaN | NaN | NaN | NaN | NaN |
4 | [{'content': {'text': {'text': 'Offers takeout... | False | MlmcOkwaNnxl3Zuk6HsPCQ | [{'label': {'color': 'normal', 'text': 'Curren... | /biz_photos/MlmcOkwaNnxl3Zuk6HsPCQ | /biz/slurpin-ramen-bar-los-angeles-los-angeles | [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... | True | True | scrollablePhotos | ... | https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9... | https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9... | more | Covid delivery review:\nOooooooof this hit the... | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 53 columns
time: 91.1 ms (started: 2021-03-09 15:57:21 +08:00)
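The `flatten` call (from `flatten_dict`) is what produces the tuple column names in the preview above: each nested JSON path becomes a tuple key. A minimal sketch of that default behaviour (named `flatten_tuple_keys` here so it does not shadow the imported function):

```python
def flatten_tuple_keys(d: dict, parent: tuple = ()) -> dict:
    """Flatten a nested dict into tuple keys, mimicking flatten_dict.flatten's default."""
    out = {}
    for key, value in d.items():
        path = parent + (key,)
        if isinstance(value, dict):
            out.update(flatten_tuple_keys(value, path))
        else:
            out[path] = value
    return out


record = {"bizId": "abc", "snippet": {"thumbnail": {"src": "img.jpg"}}}
flat = flatten_tuple_keys(record)
# flat == {("bizId",): "abc", ("snippet", "thumbnail", "src"): "img.jpg"}
```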
Data Pre-processing¶
Let’s take out some of the unnecessary columns for now.
yelp_raw.columns
Index([ ('searchActions',),
('isYelpGuaranteed',),
('bizId',),
('tags',),
('scrollablePhotos', 'allPhotosHref'),
('scrollablePhotos', 'photoHref'),
('scrollablePhotos', 'photoList'),
('scrollablePhotos', 'isResponsive'),
('scrollablePhotos', 'isScrollable'),
('searchResultLayoutType',),
('searchResultBusinessHighlights',),
('verifiedLicenseLayout',),
('serviceOfferings',),
('snippet',),
('markerKey',),
('adLoggingInfo', 'slot'),
('adLoggingInfo', 'placementSlot'),
('adLoggingInfo', 'flow'),
('adLoggingInfo', 'opportunityId'),
('adLoggingInfo', 'adCampaignId'),
('adLoggingInfo', 'isShowcaseAd'),
('adLoggingInfo', 'placement'),
('offerCampaignDetails',),
('searchResultBusinessPortfolioProjects',),
('searchResultBusiness', 'parentBusiness'),
('searchResultBusiness', 'ranking'),
('searchResultBusiness', 'reviewCount'),
('searchResultBusiness', 'renderAdInfo'),
('searchResultBusiness', 'name'),
('searchResultBusiness', 'neighborhoods'),
('searchResultBusiness', 'rating'),
('searchResultBusiness', 'businessUrl'),
('searchResultBusiness', 'isAd'),
('searchResultBusiness', 'serviceArea'),
('searchResultBusiness', 'phone'),
('searchResultBusiness', 'priceRange'),
('searchResultBusiness', 'alternateNames'),
('searchResultBusiness', 'formattedAddress'),
('searchResultBusiness', 'servicePricing'),
('searchResultBusiness', 'bizSiteUrl'),
('searchResultBusiness', 'categories'),
('childrenBusinessInfo',),
('snippet', 'readMoreUrl'),
('snippet', 'thumbnail', 'src'),
('snippet', 'thumbnail', 'srcset'),
('snippet', 'readMoreText'),
('snippet', 'text'),
('searchResultBusiness', 'parentBusiness', 'businessUrl'),
('searchResultBusiness', 'parentBusiness', 'name'),
('searchResultBusinessHighlights', 'bizSiteUrl'),
('searchResultBusinessHighlights', 'businessHighlights'),
('childrenBusinessInfo', 'businessUrls'),
('childrenBusinessInfo', 'businessNames')],
dtype='object')
time: 7.82 ms (started: 2021-03-09 15:57:21 +08:00)
yelp_filtered = yelp_raw[
[col for col in yelp_raw.columns if "searchResultBusiness" in col]
].dropna(axis=1)
yelp_filtered = yelp_filtered[
yelp_filtered[("searchResultBusiness", "priceRange")] != ""
] # Remove rows with no price range
yelp_filtered.columns = [tuple(col)[-1] for col in yelp_filtered.columns]
yelp_filtered.to_csv("./data/yelp_filtered.csv")
yelp_filtered.head()
/usr/local/lib/python3.8/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
return array(a, dtype, copy=False, order=order)
reviewCount | renderAdInfo | name | neighborhoods | rating | businessUrl | isAd | phone | priceRange | alternateNames | formattedAddress | categories | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6636 | False | Howlin’ Ray’s | [Chinatown] | 4.5 | /biz/howlin-rays-los-angeles-3 | False | (213) 935-8399 | $$ | [] | 727 N Broadway | [{'url': '/search?cflt=southern&find_loc=Los+A... |
2 | 8449 | False | Wurstküche | [Arts District] | 4.0 | /biz/wurstk%C3%BCche-los-angeles-2 | False | (213) 687-4444 | $$ | [] | 800 E 3rd St | [{'url': '/search?cflt=hotdog&find_loc=Los+Ang... |
3 | 8644 | False | Daikokuya Little Tokyo | [Little Tokyo] | 4.0 | /biz/daikokuya-little-tokyo-los-angeles | False | (213) 626-1680 | $$ | [] | 327 E 1st St | [{'url': '/search?cflt=ramen&find_loc=Los+Ange... |
4 | 4641 | False | Slurpin’ Ramen Bar - Los Angeles | [Koreatown] | 4.5 | /biz/slurpin-ramen-bar-los-angeles-los-angeles | False | (213) 388-8607 | $$ | [] | 3500 W 8th St | [{'url': '/search?cflt=ramen&find_loc=Los+Ange... |
5 | 6228 | False | Bestia | [Downtown] | 4.5 | /biz/bestia-los-angeles | False | (213) 514-5724 | $$$ | [] | 2121 E 7th Pl | [{'url': '/search?cflt=italian&find_loc=Los+An... |
time: 29.9 ms (started: 2021-03-09 15:57:21 +08:00)
Add the latitude and longitude for each address:
locator = Nominatim(user_agent="myGeocoder")


def get_latlong(address: str, locator=locator):
    """Get (lat, long) from a string address, or (None, None) if geocoding fails"""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    try:
        lat, long = location.latitude, location.longitude
    except AttributeError:  # geocode returned None (address not found)
        lat, long = None, None
    return lat, long
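Nominatim's public API asks for at most one request per second, and the cell below fires a couple of hundred geocoding calls in a row. A generic throttle decorator (a sketch, not part of the original notebook) can enforce such a minimum delay between calls:

```python
import time
from functools import wraps


def throttle(min_delay_seconds: float):
    """Decorator ensuring at least `min_delay_seconds` between calls to the function."""

    def decorator(fn):
        last_call = [float("-inf")]  # monotonic timestamp of the previous call

        @wraps(fn)
        def wrapper(*args, **kwargs):
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

One could then wrap the geocoder as `geocode = throttle(1.0)(locator.geocode)` and use it in place of the raw call.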
yelp_filtered = pd.concat(
[
yelp_filtered,
pd.DataFrame(
yelp_filtered["formattedAddress"]
.apply(lambda address: get_latlong(address))
.tolist(),
columns=["latitude", "longitude"],
index=yelp_filtered.index,
),
],
axis=1,
)
yelp_filtered.head()
reviewCount | renderAdInfo | name | neighborhoods | rating | businessUrl | isAd | phone | priceRange | alternateNames | formattedAddress | categories | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6636 | False | Howlin’ Ray’s | [Chinatown] | 4.5 | /biz/howlin-rays-los-angeles-3 | False | (213) 935-8399 | $$ | [] | 727 N Broadway | [{'url': '/search?cflt=southern&find_loc=Los+A... | 34.061519 | -118.239473 |
2 | 8449 | False | Wurstküche | [Arts District] | 4.0 | /biz/wurstk%C3%BCche-los-angeles-2 | False | (213) 687-4444 | $$ | [] | 800 E 3rd St | [{'url': '/search?cflt=hotdog&find_loc=Los+Ang... | 33.896347 | -118.117083 |
3 | 8644 | False | Daikokuya Little Tokyo | [Little Tokyo] | 4.0 | /biz/daikokuya-little-tokyo-los-angeles | False | (213) 626-1680 | $$ | [] | 327 E 1st St | [{'url': '/search?cflt=ramen&find_loc=Los+Ange... | 34.049971 | -118.240083 |
4 | 4641 | False | Slurpin’ Ramen Bar - Los Angeles | [Koreatown] | 4.5 | /biz/slurpin-ramen-bar-los-angeles-los-angeles | False | (213) 388-8607 | $$ | [] | 3500 W 8th St | [{'url': '/search?cflt=ramen&find_loc=Los+Ange... | 34.057590 | -118.306725 |
5 | 6228 | False | Bestia | [Downtown] | 4.5 | /biz/bestia-los-angeles | False | (213) 514-5724 | $$$ | [] | 2121 E 7th Pl | [{'url': '/search?cflt=italian&find_loc=Los+An... | 34.033738 | -118.229309 |
time: 2min 34s (started: 2021-03-09 15:57:21 +08:00)
Convert neighbourhoods into a single value for each row. First, confirm that every row has exactly one neighbourhood:
yelp_filtered["neighborhoods"].apply(
lambda neighbourhoods: len(neighbourhoods)
).value_counts()
1 234
Name: neighborhoods, dtype: int64
time: 4.93 ms (started: 2021-03-09 15:59:55 +08:00)
yelp_filtered["neighborhoods"] = yelp_filtered["neighborhoods"].apply(
lambda neighbourhoods: neighbourhoods[0]
)
time: 1.31 ms (started: 2021-03-09 15:59:55 +08:00)
yelp_filtered["neighborhoods"].value_counts()
Downtown 30
Koreatown 23
Hollywood 15
Little Tokyo 15
Wilshire Center 14
Beverly Grove 13
Fairfax 11
Mid-Wilshire 10
East Hollywood 10
Los Feliz 9
Silver Lake 6
Arts District 6
Chinatown 5
Hancock Park 5
Echo Park 4
Sawtelle 4
University Park 3
Mid-City 3
Larchmont 3
Arlington Heights 3
Harvard Heights 3
Lincoln Heights 3
Palms 3
Highland Park 2
Century City 2
Mar Vista 2
Pico-Robertson 2
Windsor Square 2
Atwater Village 2
Westlake 2
Boyle Heights 2
Exposition Park 2
Historic South Central 2
Pico-Union 2
Eagle Rock 2
Elysian Park 1
Beverlywood 1
Westchester 1
View Park/Windsor Hills 1
Studio City 1
West Adams 1
Hollywood Hills West 1
Westwood 1
Griffith Park 1
Name: neighborhoods, dtype: int64
time: 4.17 ms (started: 2021-03-09 15:59:55 +08:00)
What categories do our restaurants have?
from collections import Counter
pd.Series(
[
cat
for cats in yelp_filtered["categories"]
.apply(
lambda raw_categories: np.array(
[category["title"] for category in raw_categories]
)
)
.to_numpy()
for cat in cats
]
).value_counts()[:30]
Breakfast & Brunch 33
Korean 27
Seafood 21
Sandwiches 21
Coffee & Tea 21
American (New) 21
Mexican 20
Barbeque 17
Desserts 15
Italian 14
Noodles 14
Japanese 13
Ice Cream & Frozen Yogurt 13
Bakeries 13
Cocktail Bars 12
Sushi Bars 11
Ramen 10
Pizza 9
American (Traditional) 9
Vegan 9
Mediterranean 8
Cafes 8
Salad 8
Burgers 7
Soup 7
Chicken Shop 7
Halal 6
Chinese 6
Comfort Food 6
Thai 6
dtype: int64
time: 7.52 ms (started: 2021-03-09 15:59:55 +08:00)
yelp_filtered["categories"] = yelp_filtered["categories"].apply(
lambda raw_categories: np.array([category["title"] for category in raw_categories])
)
time: 1.82 ms (started: 2021-03-09 15:59:55 +08:00)
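`collections.Counter` is imported above but the tally is ultimately done with `pd.Series.value_counts`; for a flat list of category labels the two give the same counts:

```python
from collections import Counter

# Toy list of category labels, as produced by flattening the "categories" column
cats = ["Korean", "Ramen", "Korean", "Seafood", "Korean"]
counts = Counter(cats)
# counts["Korean"] == 3; counts.most_common(1) == [("Korean", 3)]
```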
# Save this dataset
yelp_filtered.to_csv("./data/yelp_cleaned.csv")
time: 11.5 ms (started: 2021-03-10 08:00:35 +08:00)
Optimization¶
This can be run independently from Web Scraping Yelp! as long as yelp_cleaned.csv is loaded.
# Load the cleaned yelp dataset
yelp_filtered = pd.read_csv("./data/yelp_cleaned.csv", index_col=[0])
yelp_filtered.head()
reviewCount | renderAdInfo | name | neighborhoods | rating | businessUrl | isAd | phone | priceRange | alternateNames | formattedAddress | categories | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6636 | False | Howlin’ Ray’s | Chinatown | 4.5 | /biz/howlin-rays-los-angeles-3 | False | (213) 935-8399 | $$ | [] | 727 N Broadway | ['Southern' 'Chicken Shop' 'American (Traditio... | 34.061519 | -118.239473 |
2 | 8449 | False | Wurstküche | Arts District | 4.0 | /biz/wurstk%C3%BCche-los-angeles-2 | False | (213) 687-4444 | $$ | [] | 800 E 3rd St | ['Hot Dogs' 'German' 'Gastropubs'] | 33.896347 | -118.117083 |
3 | 8644 | False | Daikokuya Little Tokyo | Little Tokyo | 4.0 | /biz/daikokuya-little-tokyo-los-angeles | False | (213) 626-1680 | $$ | [] | 327 E 1st St | ['Ramen' 'Noodles'] | 34.049971 | -118.240083 |
4 | 4641 | False | Slurpin’ Ramen Bar - Los Angeles | Koreatown | 4.5 | /biz/slurpin-ramen-bar-los-angeles-los-angeles | False | (213) 388-8607 | $$ | [] | 3500 W 8th St | ['Ramen' 'Noodles'] | 34.057590 | -118.306725 |
5 | 6228 | False | Bestia | Downtown | 4.5 | /biz/bestia-los-angeles | False | (213) 514-5724 | $$$ | [] | 2121 E 7th Pl | ['Italian' 'Cocktail Bars' 'Pizza'] | 34.033738 | -118.229309 |
time: 19.7 ms (started: 2021-03-10 08:02:22 +08:00)
Problem Setup¶
Let \(k \in K\) index the users whose preference is to be accounted for.
Let \(m \in M\) index the month we will be requesting a restaurant recommendation for.
Let \(n \in N\) index the total number of restaurants in Los Angeles available on Yelp!.
Let \(\mathcal{X} = \left\{{x}_{1}, {x}_{2}, \cdots, {x}_{N}\right\}\) be a matrix that denotes our dataset of webscraped restaurant information from Yelp!, where
\begin{aligned} {x}_{n} &= \begin{bmatrix} \text{name} = \left\{ \mathbf{\text{String}} \right\} \\ \text{address} = \left\{ \mathbf{\text{String}} \right\} \\ \text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\ \text{num\_reviews} = \left\{ 0.0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\ \text{rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\ \text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\ \text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\} \end{bmatrix} \end{aligned}
Let \(\mathcal{U} = \left\{{u}_{1}, {u}_{2}, \cdots, {u}_{K}\right\}\) be a matrix, where
\begin{aligned} {u}_{k} &= \begin{bmatrix} \text{address} = \left\{ \mathbf{\text{String}} \right\} \\ \text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\ \text{min\_reviews} = \left\{ 0.0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\ \text{min\_rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\ \text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\ \text{max\_distance (miles)} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq \infty \right\} \\ \text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\} \end{bmatrix} \end{aligned}
Let \(w \in \mathbb{Z}^N_2 = \left\{0, 1\right\}^N\) a binary indicator vector denoting which of the \(N\) restaurants in the Los Angeles area were chosen.
Let \(d: L \times L \rightarrow \mathbb{R}^+_0, L = \left\{\mathbf{\text{Valid Address String}}\right\}\) be a function that calculates the distance in miles between two points.
Let \(A.\text{<attribute>}\) denote the column vector (if \(A\) is a matrix) / scalar (if \(A\) is a vector) of just the specific attribute.
Mixed-Integer Program Formulation¶
\begin{aligned} \underset{w}{\text{maximize }} \quad &{w^\top}{\left(\mathcal{X}.\text{rating}\right)} \\ \text{subject to } \quad &w^\top\mathbb{1} = M \\ &w_n u_k.\text{min\_reviews} \leq x_n.\text{num\_reviews}, \ \forall\, n \in N,\ k \in K \\ &w_n u_k.\text{min\_rating} \leq x_n.\text{rating}, \ \forall\, n \in N,\ k \in K \\ &w_n\left(1 - \frac{\min_{n, k} d({x_n}.\text{address}, {u_k}.\text{address})}{\max_{n, k} d({x_n}.\text{address}, {u_k}.\text{address})}\right) \leq \alpha_n, \ \forall\, n \in N,\ k \in K,\ w_n = 1 \\ &u_k.\text{price\_range} \geq x_n.\text{price\_range}, \ \forall\, n \in N,\ k \in K,\ w_n = 1 \\ &u_k.\text{neighbourhood} = x_n.\text{neighbourhood}, \ \forall\, n \in N,\ k \in K,\ w_n = 1 \\ &u_k.\text{categories} \subseteq x_n.\text{categories}, \ \forall\, n \in N,\ k \in K,\ w_n = 1 \end{aligned}
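The distance constraint enforces travel fairness: for any chosen restaurant, the normalized gap between the closest and farthest user's travel distance, \(1 - \min d / \max d\), must not exceed \(\alpha\). A toy feasibility check with hypothetical distances:

```python
# Hypothetical distances (miles) from three users to one candidate restaurant
alpha = 0.75
dists = [1.2, 2.0, 4.0]

# Normalized gap between the closest and farthest user; must not exceed alpha
fairness_gap = 1 - min(dists) / max(dists)  # 1 - 1.2/4.0 = 0.7
feasible = fairness_gap <= alpha  # 0.7 <= 0.75, so this restaurant is allowed
```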
# For distance calculation
from haversine import haversine, Unit
# For optimization
import cvxpy as cp
# For converting street in LA to lat long
locator = Nominatim(user_agent="myGeocoder")
def get_latlong(address: str, locator=locator):
    """Get (lat, long) from a string address, or (None, None) if geocoding fails"""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    try:
        lat, long = location.latitude, location.longitude
    except AttributeError:  # geocode returned None (address not found)
        lat, long = None, None
    return lat, long
time: 476 ms (started: 2021-03-09 15:59:55 +08:00)
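The `haversine` library handles the great-circle computation; for reference, a self-contained sketch of the formula it implements (assuming a spherical Earth with mean radius 3958.8 miles):

```python
import math


def haversine_miles(lat1, long1, lat2, long2):
    """Great-circle distance in miles between two (lat, long) points."""
    R = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(long2 - long1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))


# Howlin' Ray's to Daikokuya (coordinates from the table above): under a mile apart
dist = haversine_miles(34.061519, -118.239473, 34.049971, -118.240083)
```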
User Preference Matrix \(\mathcal{U}\):
U = pd.DataFrame(
[
[
"3584 S Figueroa St",
None,
100,
4,
"$$",
None,
None,
*get_latlong("3584 S Figueroa St"),
], # Icon Plaza USC
[
"3301 S Hoover St",
None,
200,
3.6,
"$$$",
None,
None,
*get_latlong("3301 S Hoover St"),
], # USC Village
[
"10250 Santa Monica Blvd",
None,
1000,
4.5,
"$$$$",
None,
None,
*get_latlong("10250 Santa Monica Blvd"),
], # Westfield Century City
[
"189 The Grove Dr",
None,
1000,
3.9,
"$$$",
None,
["Korean"],
*get_latlong("189 The Grove Dr"),
], # The Grove
],
columns=[
"address",
"neighbourhood",
"min_reviews",
"min_rating",
"price_range",
"max_distance",
"categories",
"latitude",
"longitude",
],
)
assert (
    U["neighbourhood"].nunique() <= 1
), "Number of different neighbourhood preferences must be <= 1."
# assert (
#     U["price_range"].nunique() <= 1
# ), "Number of different price_range preferences must be <= 1."
assert (
    len(
        np.unique(
            [
                cat
                for cats in U["categories"].to_numpy()
                if cats is not None
                for cat in cats
            ]
        )
    )
    <= 2
), "Number of different category preferences must be <= 2."
time: 2.64 s (started: 2021-03-09 17:18:06 +08:00)
Updated Data Matrix \(\mathcal{X}\), filtering out some data that does not match constraints:
X = yelp_filtered[
[
"name",
"formattedAddress",
"neighborhoods",
"reviewCount",
"rating",
"priceRange",
"categories",
"latitude",
"longitude",
]
]
X.columns = [
"name",
"address",
"neighbourhood",
"num_reviews",
"rating",
"price_range",
"categories",
"latitude",
"longitude",
] # rename columns
# Filter out the neighbourhoods not in the user preference
if len(U["neighbourhood"].dropna().unique()) > 0:
X = X[X["neighbourhood"] == U["neighbourhood"].dropna().unique()[0]]
# Filter out the restaurants whose price ranges exceed the lowest price range in user preferences
if len(U["price_range"].dropna().unique()) > 0:
X = X[
X["price_range"]
== sorted(
U["price_range"].dropna().unique(), key=lambda x: len(x), reverse=False
)[0]
]
# Filter out restaurants that are not in the same categories as what we requested in user preference
if len(U["categories"].dropna()) > 0:
X = X[
X["categories"].apply(
lambda categories: np.any(
[
cat in categories
for cat in [
cat
for cats in U["categories"].to_numpy()
if cats is not None
for cat in cats
]
]
)
)
]
time: 15.4 ms (started: 2021-03-10 07:57:39 +08:00)
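The price-range filter above relies on Yelp! encoding price tiers as repeated `$` characters, so string length is the ordinal price level and sorting by `len` picks the cheapest preference first:

```python
# Price-range strings as they appear in the user preference matrix
prices = ["$$$", "$", "$$$$", "$$"]

# Sorting by string length orders the tiers cheapest-first
cheapest_first = sorted(prices, key=len)

# The strictest (lowest) budget among the preferences is then the first element
strictest = cheapest_first[0]
```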
Optimization
# Distance metric
d = lambda lat1, long1, lat2, long2: haversine(
(lat1, long1), (lat2, long2), unit=Unit.MILES
)
# Maximum allowed fractional gap between the furthest-travelling and
# shortest-travelling individuals
α = 0.75
# Number of months we are getting recommendations for
M = 5
# Number of restaurants in our dataset
N = X.shape[0]
# Number of users
K = U.shape[0]
# Create one vector optimization variable.
w = cp.Variable(X.shape[0], boolean=True)
# Create constraints.
constraints = [
cp.sum(w) == M,  # choose exactly M restaurants, one per month
*[
w_n * u_k <= x_n
for u_k in U["min_reviews"]
for w_n, x_n in zip(w, X["num_reviews"])
],
*[w_n * u_k <= x_n for u_k in U["min_rating"] for w_n, x_n in zip(w, X["rating"])],
*[
w_n
* (
1
- (
cp.minimum(
*[
d(row["latitude"], row["longitude"], lat_n, long_n)
for idx, row in U[["latitude", "longitude"]].iterrows()
]
)
/ cp.maximum(
*[
d(row["latitude"], row["longitude"], lat_n, long_n)
for idx, row in U[["latitude", "longitude"]].iterrows()
]
)
)
)
<= α
for w_n, lat_n, long_n in zip(w, X["latitude"], X["longitude"])
],
]
# Form objective.
obj = cp.Maximize(w.T @ X["rating"])
# Form and solve problem.
prob = cp.Problem(obj, constraints)
prob.solve()
print("Mixed Integer Programming Solution")
print("=" * 30)
print(f"Status: {prob.status}")
print(f"The optimal value is: {np.round(prob.value, 2)}")
print("Restaurants chosen: ")
X.iloc[np.argwhere(w.value).flatten()]
Mixed Integer Programming Solution
==============================
Status: optimal
The optimal value is: 22.5
Restaurants chosen:
name | address | neighbourhood | num_reviews | rating | price_range | categories | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
21 | Yup Dduk LA | 3603 W 6th St | Wilshire Center | 2111 | 4.5 | $$ | [Korean, Chicken Shop] | 34.063892 | -118.300805 |
32 | Han Bat Sul Lung Tang | 4163 W 5th St | Koreatown | 2294 | 4.5 | $$ | [Korean, Comfort Food, Soup] | 34.065408 | -118.309849 |
36 | Magal BBQ | 3460 W 8th St | Koreatown | 1676 | 4.5 | $$ | [Korean, Barbeque] | 34.057598 | -118.305479 |
50 | Bulgogi Hut | 3600 Wilshire Blvd | Koreatown | 2471 | 4.5 | $$ | [Korean, Barbeque, Asian Fusion] | 34.062375 | -118.298589 |
68 | Eight Korean BBQ | 863 S Western Ave | Koreatown | 1651 | 4.5 | $$ | [Korean, Barbeque] | 34.056027 | -118.309888 |
time: 283 ms (started: 2021-03-10 07:59:21 +08:00)