Project 1: Group Restaurants Choice

By: Jacob Andreesen, Jeff Chen, Miao Xu, Yiyi Wang

Finding an ideal restaurant for a group of friends is always a struggle for students who are new to a city and looking for new and exciting places to go. Inspired by this challenge, our group aims to build a prototype of a personalized restaurant recommender system that serves a small group of people by meeting their requirements and matching their tastes.

%load_ext autotime
%load_ext nb_black
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# For scraping
import time
import urllib.request, json
from flatten_dict import flatten
import requests
import copyheaders
from bs4 import BeautifulSoup

# General tools
import regex as re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
import geopandas
from geopy import Nominatim
time: 1.46 s (started: 2021-03-09 15:47:27 +08:00)

Web Scraping Yelp!

We will simply grab data from the https://www.yelp.com/search/snippet API endpoint instead of scraping the Yelp! website itself. We attempted to scrape the website, but it was hard to select the specific elements we required, and some of them are only revealed through button clicks. That would mean using browser automation software such as Selenium to simulate clicks and grab data from the HTML afterwards, which is more work than necessary.

headers_str = b"""
    cache-control: max-age=0, must-revalidate, no-cache, no-store, private
    cache-control: no-transform
    cf-cache-status: DYNAMIC
    cf-ray: 58b26184fbd76c86-SJC
    cf-request-id: 02635b471c00006c86b019b200000001
    content-encoding: gzip
    content-security-policy: report-uri https://www.yelp.com/csp_block?id=bf59639897830a99&page=enforced_by_default_directives&policy_hash=7b6f2d6630868fdb2698dac44731677c&site=www&timestamp=1588093661; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com
    content-security-policy-report-only: report-uri https://www.yelp.com/csp_report_only?id=bf59639897830a99&page=csp_report_frame_directives%2Cfull_site_ssl_csp_report_directives&policy_hash=9dd00a1a6fbb402584b7ce0c1fdb4d14&site=www&timestamp=1588093661; frame-ancestors 'self' https://*.yelp.com; default-src https:; img-src https: data: https://*.adsrvr.org; script-src https: data: 'unsafe-inline' 'unsafe-eval' blob:; style-src https: 'unsafe-inline' data:; connect-src https:; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com; frame-src https: yelp-webview://* yelp://* data:; child-src https: yelp-webview://* yelp://*; media-src https:; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; form-action https: 'self'
    content-type: application/json; charset=utf-8
    date: Tue, 28 Apr 2020 17:07:42 GMT
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    expires: Tue, 28 Apr 2020 17:07:41 GMT
    pragma: no-cache
    referrer-policy: origin-when-cross-origin
    server: cloudflare
    status: 200
    strict-transport-security: max-age=31536000; includeSubDomains; preload
    vary: User-Agent
    vary: Accept-Encoding
    x-b3-sampled: 0
    x-content-type-options: nosniff
    x-mode: ro
    x-node: www_all
    x-node: 10-69-179-105-uswest2bprod-9c0a6478-895a-11ea-98c5-b6d34d770
    x-proxied: 10-69-159-164-uswest2bprod
    x-routing-service: 10-69-187-145-uswest2bprod; site=www
    x-xss-protection: 1; report=https://www.yelp.com/xss_protection_report
    x-zipkin-id: 9a87fa4730749a04
"""
headers = copyheaders.headers_raw_to_dict(headers_str)
restaurant_list_url = (
    lambda index: f"https://www.yelp.com/search/snippet?find_desc=&find_loc=Los%20Angeles%2C%20CA&start={index}"
)
total_number_of_restaurants = 240
yelp_raw_data = []
for i in tqdm(range((total_number_of_restaurants // 10) + 1)):
    # Offset of the first result on this page, capped at the total
    index = min(i * 10, total_number_of_restaurants)
    retries, max_retries = 0, 500
    while retries < max_retries:
        retries += 1
        page = requests.get(restaurant_list_url(index), headers=headers)
        try:
            if page.ok:
                yelp_raw_data += json.loads(page.content)["searchPageProps"][
                    "mainContentComponentsListProps"
                ]
        except (ValueError, KeyError, TypeError):
            # Response was not the expected JSON payload: log progress and retry
            if retries % 10 == 0:
                print(f"Number of attempts to get data for {index}: {retries}")
            continue
        if not page.ok:
            print(f"Couldn't get data for index: {index}")
        break
 80%|████████  | 20/25 [00:35<00:09,  1.87s/it]
Number of attempts to get data for 200: 10
Number of attempts to get data for 200: 20
Number of attempts to get data for 200: 30
Number of attempts to get data for 200: 40
Number of attempts to get data for 200: 50
Number of attempts to get data for 200: 60
Number of attempts to get data for 200: 70
Number of attempts to get data for 200: 80
Number of attempts to get data for 200: 90
Number of attempts to get data for 200: 100
Number of attempts to get data for 200: 110
Number of attempts to get data for 200: 120
Number of attempts to get data for 200: 130
Number of attempts to get data for 200: 140
Number of attempts to get data for 200: 150
Number of attempts to get data for 200: 160
Number of attempts to get data for 200: 170
Number of attempts to get data for 200: 180
Number of attempts to get data for 200: 190
Number of attempts to get data for 200: 200
Number of attempts to get data for 200: 210
Number of attempts to get data for 200: 220
Number of attempts to get data for 200: 230
Number of attempts to get data for 200: 240
Number of attempts to get data for 200: 250
Number of attempts to get data for 200: 260
Number of attempts to get data for 200: 270
Number of attempts to get data for 200: 280
Number of attempts to get data for 200: 290
Number of attempts to get data for 200: 300
Number of attempts to get data for 200: 310
Number of attempts to get data for 200: 320
Number of attempts to get data for 200: 330
Number of attempts to get data for 200: 340
Number of attempts to get data for 200: 350
Number of attempts to get data for 200: 360
Number of attempts to get data for 200: 370
 96%|█████████▌| 24/25 [04:38<00:26, 26.14s/it]
Number of attempts to get data for 240: 10
Number of attempts to get data for 240: 20
Number of attempts to get data for 240: 30
Number of attempts to get data for 240: 40
Number of attempts to get data for 240: 50
Number of attempts to get data for 240: 60
Number of attempts to get data for 240: 70
Number of attempts to get data for 240: 80
Number of attempts to get data for 240: 90
Number of attempts to get data for 240: 100
Number of attempts to get data for 240: 110
Number of attempts to get data for 240: 120
Number of attempts to get data for 240: 130
Number of attempts to get data for 240: 140
Number of attempts to get data for 240: 150
Number of attempts to get data for 240: 160
Number of attempts to get data for 240: 170
Number of attempts to get data for 240: 180
Number of attempts to get data for 240: 190
Number of attempts to get data for 240: 200
Number of attempts to get data for 240: 210
Number of attempts to get data for 240: 220
Number of attempts to get data for 240: 230
Number of attempts to get data for 240: 240
Number of attempts to get data for 240: 250
Number of attempts to get data for 240: 260
Number of attempts to get data for 240: 270
Number of attempts to get data for 240: 280
Number of attempts to get data for 240: 290
Number of attempts to get data for 240: 300
Number of attempts to get data for 240: 310
Number of attempts to get data for 240: 320
Number of attempts to get data for 240: 330
Number of attempts to get data for 240: 340
Number of attempts to get data for 240: 350
Number of attempts to get data for 240: 360
Number of attempts to get data for 240: 370
Number of attempts to get data for 240: 380
Number of attempts to get data for 240: 390
Number of attempts to get data for 240: 400
Number of attempts to get data for 240: 410
Number of attempts to get data for 240: 420
Number of attempts to get data for 240: 430
Number of attempts to get data for 240: 440
Number of attempts to get data for 240: 450
Number of attempts to get data for 240: 460
Number of attempts to get data for 240: 470
Number of attempts to get data for 240: 480
Number of attempts to get data for 240: 490
100%|██████████| 25/25 [09:52<00:00, 23.68s/it] 
Number of attempts to get data for 240: 500
time: 9min 51s (started: 2021-03-09 15:47:29 +08:00)
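The log above shows the same offset being retried hundreds of times with no pause between requests (offset 240 reached 500 attempts). A minimal sketch of a bounded retry helper with exponential backoff follows. The names `page_offsets`, `fetch_with_backoff`, `fetch_page`, and `base_delay` are hypothetical, and the sketch assumes `start` is a zero-based offset with 10 results per page, which the source does not confirm. Under that assumption, 240 results span offsets 0 to 230.

```python
import time


def page_offsets(total, page_size=10):
    """Zero-based offsets covering `total` results: 0, 10, ..., below total."""
    return list(range(0, total, page_size))


def fetch_with_backoff(
    fetch_page, offset, max_retries=5, base_delay=1.0, sleep=time.sleep
):
    """Call fetch_page(offset) until it succeeds or max_retries is reached.

    fetch_page is any callable that returns parsed results or raises
    ValueError / KeyError on a malformed response. Returns None on failure.
    """
    for attempt in range(max_retries):
        try:
            return fetch_page(offset)
        except (ValueError, KeyError):
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)
    return None


# Usage with a stand-in fetcher that fails twice before succeeding
calls = {"n": 0}


def flaky_fetch(offset):
    calls["n"] += 1
    if calls["n"] < 3:
        raise KeyError("searchPageProps")
    return [{"bizId": f"biz-{offset}"}]


delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, 20, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in the notebook the default `time.sleep` would apply.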

yelp_raw = pd.DataFrame(
    [flatten(content) for content in tqdm(yelp_raw_data) if "bizId" in content.keys()]
)
yelp_raw.to_csv("./data/yelp_raw.csv")
yelp_raw.head()
100%|██████████| 480/480 [00:00<00:00, 33043.36it/s]
(searchActions,) (isYelpGuaranteed,) (bizId,) (tags,) (scrollablePhotos, allPhotosHref) (scrollablePhotos, photoHref) (scrollablePhotos, photoList) (scrollablePhotos, isResponsive) (scrollablePhotos, isScrollable) (searchResultLayoutType,) ... (snippet, thumbnail, src) (snippet, thumbnail, srcset) (snippet, readMoreText) (snippet, text) (searchResultBusiness, parentBusiness, businessUrl) (searchResultBusiness, parentBusiness, name) (searchResultBusinessHighlights, bizSiteUrl) (searchResultBusinessHighlights, businessHighlights) (childrenBusinessInfo, businessUrls) (childrenBusinessInfo, businessNames)
0 [] False rF7KNmSv5sYbwd3D5sA_vw [{'label': {'color': 'normal', 'text': 'New on... /biz_photos/rF7KNmSv5sYbwd3D5sA_vw /adredir?ad_business_id=rF7KNmSv5sYbwd3D5sA_vw... [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... True True scrollablePhotos ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 [] False 7O1ORGY36A-2aIENyaJWPg [] /biz_photos/7O1ORGY36A-2aIENyaJWPg /biz/howlin-rays-los-angeles-3 [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... True True scrollablePhotos ... https://s3-media0.fl.yelpcdn.com/photo/22VFkvu... https://s3-media0.fl.yelpcdn.com/photo/22VFkvu... more I FINALLY got to try this place... but sadly b... NaN NaN NaN NaN NaN NaN
2 [] False KQBGm5G8IDkE8LeNY45mbA [] /biz_photos/KQBGm5G8IDkE8LeNY45mbA /biz/wurstk%C3%BCche-los-angeles-2 [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... True True scrollablePhotos ... https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL... https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL... more A group of us stopped by for dinner before exp... NaN NaN NaN NaN NaN NaN
3 [{'content': {'text': {'text': 'Offers takeout... False iSZpZgVnASwEmlq0DORY2A [] /biz_photos/iSZpZgVnASwEmlq0DORY2A /biz/daikokuya-little-tokyo-los-angeles [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... True True scrollablePhotos ... https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p... https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p... more Daikokuya just never disappoints!\nWent last n... NaN NaN NaN NaN NaN NaN
4 [{'content': {'text': {'text': 'Offers takeout... False MlmcOkwaNnxl3Zuk6HsPCQ [{'label': {'color': 'normal', 'text': 'Curren... /biz_photos/MlmcOkwaNnxl3Zuk6HsPCQ /biz/slurpin-ramen-bar-los-angeles-los-angeles [{'src': 'https://s3-media0.fl.yelpcdn.com/bph... True True scrollablePhotos ... https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9... https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9... more Covid delivery review:\nOooooooof this hit the... NaN NaN NaN NaN NaN NaN

5 rows × 53 columns

time: 91.1 ms (started: 2021-03-09 15:57:21 +08:00)
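The tuple-valued column names above come from flattening each nested JSON record before building the DataFrame. The sketch below is a small stand-in written for illustration, not the `flatten_dict` implementation, and it may differ from the library on edge cases such as empty dictionaries. The record is hypothetical.

```python
def flatten_to_tuples(nested, parent=()):
    """Flatten nested dicts into {tuple_of_keys: value}.

    Lists and empty dicts are kept as leaf values.
    """
    flat = {}
    for key, value in nested.items():
        path = parent + (key,)
        if isinstance(value, dict) and value:
            flat.update(flatten_to_tuples(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical record in the shape of one search result
record = {
    "bizId": "example-id",
    "searchResultBusiness": {"name": "Example Cafe", "rating": 4.5},
    "snippet": {"thumbnail": {"src": "https://example.com/a.jpg"}},
}

flat = flatten_to_tuples(record)

# Membership on a tuple checks whole elements, so this keeps every column
# whose key path passes through "searchResultBusiness"
business_cols = [col for col in flat if "searchResultBusiness" in col]
print(business_cols)
```

This is why the later filter `"searchResultBusiness" in col` works on tuple column names: it tests whether the string is one of the path elements, not a substring.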

Data Pre-processing

Let’s take out some of the unnecessary columns for now.

yelp_raw.columns
Index([                                       ('searchActions',),
                                           ('isYelpGuaranteed',),
                                                      ('bizId',),
                                                       ('tags',),
                           ('scrollablePhotos', 'allPhotosHref'),
                               ('scrollablePhotos', 'photoHref'),
                               ('scrollablePhotos', 'photoList'),
                            ('scrollablePhotos', 'isResponsive'),
                            ('scrollablePhotos', 'isScrollable'),
                                     ('searchResultLayoutType',),
                             ('searchResultBusinessHighlights',),
                                      ('verifiedLicenseLayout',),
                                           ('serviceOfferings',),
                                                    ('snippet',),
                                                  ('markerKey',),
                                       ('adLoggingInfo', 'slot'),
                              ('adLoggingInfo', 'placementSlot'),
                                       ('adLoggingInfo', 'flow'),
                              ('adLoggingInfo', 'opportunityId'),
                               ('adLoggingInfo', 'adCampaignId'),
                               ('adLoggingInfo', 'isShowcaseAd'),
                                  ('adLoggingInfo', 'placement'),
                                       ('offerCampaignDetails',),
                      ('searchResultBusinessPortfolioProjects',),
                      ('searchResultBusiness', 'parentBusiness'),
                             ('searchResultBusiness', 'ranking'),
                         ('searchResultBusiness', 'reviewCount'),
                        ('searchResultBusiness', 'renderAdInfo'),
                                ('searchResultBusiness', 'name'),
                       ('searchResultBusiness', 'neighborhoods'),
                              ('searchResultBusiness', 'rating'),
                         ('searchResultBusiness', 'businessUrl'),
                                ('searchResultBusiness', 'isAd'),
                         ('searchResultBusiness', 'serviceArea'),
                               ('searchResultBusiness', 'phone'),
                          ('searchResultBusiness', 'priceRange'),
                      ('searchResultBusiness', 'alternateNames'),
                    ('searchResultBusiness', 'formattedAddress'),
                      ('searchResultBusiness', 'servicePricing'),
                          ('searchResultBusiness', 'bizSiteUrl'),
                          ('searchResultBusiness', 'categories'),
                                       ('childrenBusinessInfo',),
                                      ('snippet', 'readMoreUrl'),
                                 ('snippet', 'thumbnail', 'src'),
                              ('snippet', 'thumbnail', 'srcset'),
                                     ('snippet', 'readMoreText'),
                                             ('snippet', 'text'),
       ('searchResultBusiness', 'parentBusiness', 'businessUrl'),
              ('searchResultBusiness', 'parentBusiness', 'name'),
                ('searchResultBusinessHighlights', 'bizSiteUrl'),
        ('searchResultBusinessHighlights', 'businessHighlights'),
                        ('childrenBusinessInfo', 'businessUrls'),
                       ('childrenBusinessInfo', 'businessNames')],
      dtype='object')
time: 7.82 ms (started: 2021-03-09 15:57:21 +08:00)
yelp_filtered = yelp_raw[
    [col for col in yelp_raw.columns if "searchResultBusiness" in col]
].dropna(axis=1)
yelp_filtered = yelp_filtered[
    yelp_filtered[("searchResultBusiness", "priceRange")] != ""
]  # Remove rows with no price range
yelp_filtered.columns = [tuple(col)[-1] for col in yelp_filtered.columns]
yelp_filtered.to_csv("./data/yelp_filtered.csv")
yelp_filtered.head()
/usr/local/lib/python3.8/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return array(a, dtype, copy=False, order=order)
reviewCount renderAdInfo name neighborhoods rating businessUrl isAd phone priceRange alternateNames formattedAddress categories
1 6636 False Howlin’ Ray’s [Chinatown] 4.5 /biz/howlin-rays-los-angeles-3 False (213) 935-8399 $$ [] 727 N Broadway [{'url': '/search?cflt=southern&find_loc=Los+A...
2 8449 False Wurstküche [Arts District] 4.0 /biz/wurstk%C3%BCche-los-angeles-2 False (213) 687-4444 $$ [] 800 E 3rd St [{'url': '/search?cflt=hotdog&find_loc=Los+Ang...
3 8644 False Daikokuya Little Tokyo [Little Tokyo] 4.0 /biz/daikokuya-little-tokyo-los-angeles False (213) 626-1680 $$ [] 327 E 1st St [{'url': '/search?cflt=ramen&find_loc=Los+Ange...
4 4641 False Slurpin’ Ramen Bar - Los Angeles [Koreatown] 4.5 /biz/slurpin-ramen-bar-los-angeles-los-angeles False (213) 388-8607 $$ [] 3500 W 8th St [{'url': '/search?cflt=ramen&find_loc=Los+Ange...
5 6228 False Bestia [Downtown] 4.5 /biz/bestia-los-angeles False (213) 514-5724 $$$ [] 2121 E 7th Pl [{'url': '/search?cflt=italian&find_loc=Los+An...
time: 29.9 ms (started: 2021-03-09 15:57:21 +08:00)

Add the lat, long for each address

locator = Nominatim(user_agent="myGeocoder")


def get_latlong(address: str, locator=locator):
    """Get lat, long from string address"""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    try:
        lat, long = location.latitude, location.longitude
    except AttributeError:  # geocode returned None: address not found
        lat, long = None, None
    return lat, long


yelp_filtered = pd.concat(
    [
        yelp_filtered,
        pd.DataFrame(
            yelp_filtered["formattedAddress"]
            .apply(lambda address: get_latlong(address))
            .tolist(),
            columns=["latitude", "longitude"],
            index=yelp_filtered.index,
        ),
    ],
    axis=1,
)
yelp_filtered.head()
reviewCount renderAdInfo name neighborhoods rating businessUrl isAd phone priceRange alternateNames formattedAddress categories latitude longitude
1 6636 False Howlin’ Ray’s [Chinatown] 4.5 /biz/howlin-rays-los-angeles-3 False (213) 935-8399 $$ [] 727 N Broadway [{'url': '/search?cflt=southern&find_loc=Los+A... 34.061519 -118.239473
2 8449 False Wurstküche [Arts District] 4.0 /biz/wurstk%C3%BCche-los-angeles-2 False (213) 687-4444 $$ [] 800 E 3rd St [{'url': '/search?cflt=hotdog&find_loc=Los+Ang... 33.896347 -118.117083
3 8644 False Daikokuya Little Tokyo [Little Tokyo] 4.0 /biz/daikokuya-little-tokyo-los-angeles False (213) 626-1680 $$ [] 327 E 1st St [{'url': '/search?cflt=ramen&find_loc=Los+Ange... 34.049971 -118.240083
4 4641 False Slurpin’ Ramen Bar - Los Angeles [Koreatown] 4.5 /biz/slurpin-ramen-bar-los-angeles-los-angeles False (213) 388-8607 $$ [] 3500 W 8th St [{'url': '/search?cflt=ramen&find_loc=Los+Ange... 34.057590 -118.306725
5 6228 False Bestia [Downtown] 4.5 /biz/bestia-los-angeles False (213) 514-5724 $$$ [] 2121 E 7th Pl [{'url': '/search?cflt=italian&find_loc=Los+An... 34.033738 -118.229309
time: 2min 34s (started: 2021-03-09 15:57:21 +08:00)
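In the table above, the coordinates for Wurstküche sit well away from the other four rows, which suggests the geocoder may have matched a different "800 E 3rd St". A rough sanity check, sketched below, flags geocoded points that lie far from the median location. The haversine function is a plain standard-library version written for this sketch, and the 8-mile threshold is an assumption that would need tuning on the full dataset.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# (name, latitude, longitude) copied from the five rows shown above
rows = [
    ("Howlin' Ray's", 34.061519, -118.239473),
    ("Wurstküche", 33.896347, -118.117083),
    ("Daikokuya Little Tokyo", 34.049971, -118.240083),
    ("Slurpin' Ramen Bar - Los Angeles", 34.057590, -118.306725),
    ("Bestia", 34.033738, -118.229309),
]

# Use the coordinate-wise median as a robust reference point
center = (median(r[1] for r in rows), median(r[2] for r in rows))
MAX_MILES = 8.0  # assumed threshold

suspect = [
    name
    for name, lat, lon in rows
    if haversine_miles(lat, lon, *center) > MAX_MILES
]
print(suspect)
```

On these five rows the check flags only Wurstküche. Rows flagged this way could be re-geocoded with a more specific query, for example by including the neighbourhood.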

Convert neighbourhoods into a single value for each row

yelp_filtered["neighborhoods"].apply(
    lambda neighbourhoods: len(neighbourhoods)
).value_counts()
1    234
Name: neighborhoods, dtype: int64
time: 4.93 ms (started: 2021-03-09 15:59:55 +08:00)
yelp_filtered["neighborhoods"] = yelp_filtered["neighborhoods"].apply(
    lambda neighbourhoods: neighbourhoods[0]
)
time: 1.31 ms (started: 2021-03-09 15:59:55 +08:00)
yelp_filtered["neighborhoods"].value_counts()
Downtown                   30
Koreatown                  23
Hollywood                  15
Little Tokyo               15
Wilshire Center            14
Beverly Grove              13
Fairfax                    11
Mid-Wilshire               10
East Hollywood             10
Los Feliz                   9
Silver Lake                 6
Arts District               6
Chinatown                   5
Hancock Park                5
Echo Park                   4
Sawtelle                    4
University Park             3
Mid-City                    3
Larchmont                   3
Arlington Heights           3
Harvard Heights             3
Lincoln Heights             3
Palms                       3
Highland Park               2
Century City                2
Mar Vista                   2
Pico-Robertson              2
Windsor Square              2
Atwater Village             2
Westlake                    2
Boyle Heights               2
Exposition Park             2
Historic South Central      2
Pico-Union                  2
Eagle Rock                  2
Elysian Park                1
Beverlywood                 1
Westchester                 1
View Park/Windsor Hills     1
Studio City                 1
West Adams                  1
Hollywood Hills West        1
Westwood                    1
Griffith Park               1
Name: neighborhoods, dtype: int64
time: 4.17 ms (started: 2021-03-09 15:59:55 +08:00)

What categories do our restaurants have?

from collections import Counter

pd.Series(
    [
        cat
        for cats in yelp_filtered["categories"]
        .apply(
            lambda raw_categories: np.array(
                [category["title"] for category in raw_categories]
            )
        )
        .to_numpy()
        for cat in cats
    ]
).value_counts()[:30]
Breakfast & Brunch           33
Korean                       27
Seafood                      21
Sandwiches                   21
Coffee & Tea                 21
American (New)               21
Mexican                      20
Barbeque                     17
Desserts                     15
Italian                      14
Noodles                      14
Japanese                     13
Ice Cream & Frozen Yogurt    13
Bakeries                     13
Cocktail Bars                12
Sushi Bars                   11
Ramen                        10
Pizza                         9
American (Traditional)        9
Vegan                         9
Mediterranean                 8
Cafes                         8
Salad                         8
Burgers                       7
Soup                          7
Chicken Shop                  7
Halal                         6
Chinese                       6
Comfort Food                  6
Thai                          6
dtype: int64
time: 7.52 ms (started: 2021-03-09 15:59:55 +08:00)
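`Counter` is imported in the cell above but the tally is done with `pd.Series.value_counts`. For comparison, a sketch of the same count using `Counter` directly is given below. The sample data is hypothetical and only mimics the shape of the raw `categories` column.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample in the shape of the raw `categories` column:
# one list of {"title": ...} dicts per restaurant
raw_categories = [
    [{"title": "Ramen"}, {"title": "Noodles"}],
    [{"title": "Ramen"}, {"title": "Noodles"}],
    [{"title": "Italian"}, {"title": "Cocktail Bars"}, {"title": "Pizza"}],
]

# Flatten all titles into one stream and count them
titles = chain.from_iterable(
    (category["title"] for category in cats) for cats in raw_categories
)
counts = Counter(titles)
top = counts.most_common(2)
print(top)
```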
yelp_filtered["categories"] = yelp_filtered["categories"].apply(
    lambda raw_categories: np.array([category["title"] for category in raw_categories])
)
time: 1.82 ms (started: 2021-03-09 15:59:55 +08:00)
# Save this dataset
yelp_filtered.to_csv("./data/yelp_cleaned.csv")
time: 11.5 ms (started: 2021-03-10 08:00:35 +08:00)

Optimization

This can be run independently from Web Scraping Yelp! as long as yelp_cleaned.csv is loaded.

# Load the cleaned yelp dataset
yelp_filtered = pd.read_csv("./data/yelp_cleaned.csv", index_col=[0])
yelp_filtered.head()
reviewCount renderAdInfo name neighborhoods rating businessUrl isAd phone priceRange alternateNames formattedAddress categories latitude longitude
1 6636 False Howlin’ Ray’s Chinatown 4.5 /biz/howlin-rays-los-angeles-3 False (213) 935-8399 $$ [] 727 N Broadway ['Southern' 'Chicken Shop' 'American (Traditio... 34.061519 -118.239473
2 8449 False Wurstküche Arts District 4.0 /biz/wurstk%C3%BCche-los-angeles-2 False (213) 687-4444 $$ [] 800 E 3rd St ['Hot Dogs' 'German' 'Gastropubs'] 33.896347 -118.117083
3 8644 False Daikokuya Little Tokyo Little Tokyo 4.0 /biz/daikokuya-little-tokyo-los-angeles False (213) 626-1680 $$ [] 327 E 1st St ['Ramen' 'Noodles'] 34.049971 -118.240083
4 4641 False Slurpin’ Ramen Bar - Los Angeles Koreatown 4.5 /biz/slurpin-ramen-bar-los-angeles-los-angeles False (213) 388-8607 $$ [] 3500 W 8th St ['Ramen' 'Noodles'] 34.057590 -118.306725
5 6228 False Bestia Downtown 4.5 /biz/bestia-los-angeles False (213) 514-5724 $$$ [] 2121 E 7th Pl ['Italian' 'Cocktail Bars' 'Pizza'] 34.033738 -118.229309
time: 19.7 ms (started: 2021-03-10 08:02:22 +08:00)
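After the CSV round trip, the `categories` column comes back as text (note the quoted form such as `['Ramen' 'Noodles']` in the table above), so a later test like `cat in categories` becomes a substring match rather than a membership test. A minimal sketch of recovering a list from that text is shown below. `parse_categories` is a hypothetical helper, and it assumes titles are wrapped in single quotes, or in double quotes when the title contains an apostrophe.

```python
import re


def parse_categories(cell):
    """Recover category titles from a stringified array such as "['A' 'B']"."""
    # Each match fills exactly one of the two groups
    pairs = re.findall(r"'([^']*)'|\"([^\"]*)\"", cell)
    return [single or double for single, double in pairs]


cell = "['Hot Dogs' 'German' 'Gastropubs']"
categories = parse_categories(cell)
print(categories)

# Substring matching on the raw text can give false positives
print("Dogs" in cell)        # True: matches inside 'Hot Dogs'
print("Dogs" in categories)  # False: no category is exactly 'Dogs'
```

In the notebook this could be applied with `yelp_filtered["categories"].apply(parse_categories)` right after loading.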

Problem Setup

Let \(k \in K\) index the users whose preferences are to be accounted for.

Let \(m \in M\) index the months for which we request a restaurant recommendation.

Let \(n \in N\) index the restaurants in Los Angeles available on Yelp!.

Let \(\mathcal{X} = \left\{{x}_{1}, {x}_{2}, \cdots, {x}_{N}\right\}\) be a matrix that denotes our dataset of webscraped restaurant information from Yelp!, where

\begin{aligned}
{x}_{n} &= \begin{bmatrix}
\text{name} = \left\{ \mathbf{\text{String}} \right\} \\
\text{address} = \left\{ \mathbf{\text{String}} \right\} \\
\text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\
\text{num\_reviews} = \left\{ 0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\
\text{rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\
\text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\
\text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\}
\end{bmatrix}
\end{aligned}

Let \(\mathcal{U} = \left\{{u}_{1}, {u}_{2}, \cdots, {u}_{K}\right\}\) be a matrix, where

\begin{aligned}
{u}_{k} &= \begin{bmatrix}
\text{address} = \left\{ \mathbf{\text{String}} \right\} \\
\text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\
\text{min\_reviews} = \left\{ 0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\
\text{min\_rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\
\text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\
\text{max\_distance (miles)} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq \infty \right\} \\
\text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\}
\end{bmatrix}
\end{aligned}

Let \(w \in \mathbb{Z}^N_2 = \left\{0, 1\right\}^N\) be a binary indicator vector denoting which of the \(N\) restaurants in the Los Angeles area are chosen.

Let \(d: L \times L \rightarrow \mathbb{R}^+_0, L = \left\{\mathbf{\text{Valid Address String}}\right\}\) be a function that calculates the distance in miles between two points.

Let \(A.\text{<attribute>}\) denote the column vector (if \(A\) is a matrix) / scalar (if \(A\) is a vector) of just the specific attribute.

Mixed-Integer Program Formulation

\begin{aligned}
\underset{w}{\text{maximize }} & w^\top \left(\mathcal{X}.\text{rating}\right) \\
\text{subject to } & w^\top \mathbb{1} = M \\
& w_n \, u_k.\text{min\_reviews} \leq x_n.\text{num\_reviews}, \ \forall\, n \in N, k \in K \\
& w_n \, u_k.\text{min\_rating} \leq x_n.\text{rating}, \ \forall\, n \in N, k \in K \\
& w_n \left(1 - \frac{\min_{k \in K} d(x_n.\text{address}, u_k.\text{address})}{\max_{k \in K} d(x_n.\text{address}, u_k.\text{address})}\right) \leq \alpha, \ \forall\, n \in N \\
& u_k.\text{price\_range} \geq x_n.\text{price\_range}, \ \forall\, n \in N, k \in K, w_n = 1 \\
& u_k.\text{neighbourhood} = x_n.\text{neighbourhood}, \ \forall\, n \in N, k \in K, w_n = 1 \\
& u_k.\text{categories} \subseteq x_n.\text{categories}, \ \forall\, n \in N, k \in K, w_n = 1
\end{aligned}
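To make the formulation concrete, the sketch below solves a tiny instance by exhaustive enumeration instead of a mixed-integer solver. This is only practical for a handful of restaurants and is meant as an illustration of the objective and three of the constraints (minimum reviews, minimum rating, and the distance-ratio bound). All data, and the names `feasible`, `ALPHA`, and `M`, are hypothetical.

```python
from itertools import combinations

# Hypothetical restaurants:
# (name, num_reviews, rating, distance in miles to each user)
restaurants = [
    ("A", 2500, 4.5, [1.0, 2.0]),
    ("B", 150, 5.0, [1.0, 1.5]),   # too few reviews for the second user
    ("C", 1200, 4.0, [0.5, 4.0]),  # distances too uneven
    ("D", 3000, 4.0, [2.0, 3.0]),
    ("E", 900, 3.5, [1.0, 1.0]),   # fails both the rating and review minimums
]
users = [
    {"min_reviews": 100, "min_rating": 4.0},
    {"min_reviews": 1000, "min_rating": 3.9},
]
ALPHA = 0.75  # upper bound on 1 - min/max distance
M = 2         # number of restaurants to choose


def feasible(restaurant):
    """True if choosing this restaurant (w_n = 1) violates no constraint."""
    _, num_reviews, rating, distances = restaurant
    meets_minimums = all(
        u["min_reviews"] <= num_reviews and u["min_rating"] <= rating
        for u in users
    )
    spread = 1 - min(distances) / max(distances)
    return meets_minimums and spread <= ALPHA


# Enumerate every size-M subset, keep the feasible ones, maximize total rating
best_value, best_choice = max(
    (sum(r[2] for r in combo), tuple(r[0] for r in combo))
    for combo in combinations(restaurants, M)
    if all(feasible(r) for r in combo)
)
print(best_value, best_choice)
```

Here only A and D pass every check, so the single feasible pair is chosen with a total rating of 8.5.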

# For distance calculation
from haversine import haversine, Unit

# For optimization
import cvxpy as cp

# For converting street in LA to lat long
locator = Nominatim(user_agent="myGeocoder")

def get_latlong(address: str, locator=locator):
    """Get lat, long from string address"""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    try:
        lat, long = location.latitude, location.longitude
    except AttributeError:  # geocode returned None: address not found
        lat, long = None, None
    return lat, long
time: 476 ms (started: 2021-03-09 15:59:55 +08:00)

User Preference Matrix \(\mathcal{U}\):

U = pd.DataFrame(
    [
        [
            "3584 S Figueroa St",
            None,
            100,
            4,
            "$$",
            None,
            None,
            *get_latlong("3584 S Figueroa St"),
        ],  # Icon Plaza USC
        [
            "3301 S Hoover St",
            None,
            200,
            3.6,
            "$$$",
            None,
            None,
            *get_latlong("3301 S Hoover St"),
        ],  # USC Village
        [
            "10250 Santa Monica Blvd",
            None,
            1000,
            4.5,
            "$$$$",
            None,
            None,
            *get_latlong("10250 Santa Monica Blvd"),
        ],  # Westfield Century City
        [
            "189 The Grove Dr",
            None,
            1000,
            3.9,
            "$$$",
            None,
            ["Korean"],
            *get_latlong("189 The Grove Dr"),
        ],  # The Grove
    ],
    columns=[
        "address",
        "neighbourhood",
        "min_reviews",
        "min_rating",
        "price_range",
        "max_distance",
        "categories",
        "latitude",
        "longitude",
    ],
)

# The assertion message must be a string; print(...) returns None
assert (
    U["neighbourhood"].nunique() <= 1
), "Number of different neighbourhood preferences must be <= 1."
# assert (
#     U["price_range"].nunique() <= 1
# ), "Number of different price_range preferences must be <= 1."
assert (
    len(
        np.unique(
            [
                cat
                for cats in U["categories"].to_numpy()
                if cats is not None
                for cat in cats
            ]
        )
    )
    <= 2
), "Number of different category preferences must be <= 2."
time: 2.64 s (started: 2021-03-09 17:18:06 +08:00)

Updated Data Matrix \(\mathcal{X}\), filtering out some data that does not match constraints:

X = yelp_filtered[
    [
        "name",
        "formattedAddress",
        "neighborhoods",
        "reviewCount",
        "rating",
        "priceRange",
        "categories",
        "latitude",
        "longitude",
    ]
]
X.columns = [
    "name",
    "address",
    "neighbourhood",
    "num_reviews",
    "rating",
    "price_range",
    "categories",
    "latitude",
    "longitude",
]  # rename columns

# Filter out the neighbourhoods not in the user preference
if len(U["neighbourhood"].dropna().unique()) > 0:
    X = X[X["neighbourhood"] == U["neighbourhood"].dropna().unique()[0]]

# Keep only the restaurants whose price range equals the lowest price range in user preferences
if len(U["price_range"].dropna().unique()) > 0:
    X = X[
        X["price_range"]
        == sorted(
            U["price_range"].dropna().unique(), key=lambda x: len(x), reverse=False
        )[0]
    ]

# Filter out restaurants that are not in the same categories as what we requested in user preference
if len(U["categories"].dropna()) > 0:
    X = X[
        X["categories"].apply(
            lambda categories: np.any(
                [
                    cat in categories
                    for cat in [
                        cat
                        for cats in U["categories"].to_numpy()
                        if cats is not None
                        for cat in cats
                    ]
                ]
            )
        )
    ]
time: 15.4 ms (started: 2021-03-10 07:57:39 +08:00)
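The price filter above keeps restaurants whose price range equals the cheapest user preference, whereas the constraint in the formulation, \(u_k.\text{price\_range} \geq x_n.\text{price\_range}\), would also admit cheaper restaurants. A sketch of the "at most" variant follows. `price_level` is a hypothetical helper, and the restaurant prices are made up for illustration.

```python
def price_level(price_range):
    """Map '$'..'$$$$' to 1..4 by counting dollar signs."""
    return len(price_range)


user_prices = ["$$", "$$$", "$$$$", "$$$"]  # price_range column of U above
budget = min(price_level(p) for p in user_prices)

restaurant_prices = ["$", "$$", "$$$", "$$"]  # hypothetical restaurants
affordable = [p for p in restaurant_prices if price_level(p) <= budget]
print(budget, affordable)
```

With pandas the same comparison could be written as `X["price_range"].str.len() <= budget`; whether cheaper restaurants should be included is a modelling choice the source leaves open.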

Optimization

# Distance metric
d = lambda lat1, long1, lat2, long2: haversine(
    (lat1, long1), (lat2, long2), unit=Unit.MILES
)

# Maximum allowed relative gap (1 - min/max) between the shortest and the furthest travelling individual
α = 0.75

# Number of months we are getting recommendations for
M = 5

# Number of restaurants in our dataset
N = X.shape[0]

# Number of users
K = U.shape[0]

# Create one vector optimization variable.
w = cp.Variable(X.shape[0], boolean=True)

# Create constraints.
constraints = [
    cp.sum(w) >= M,
    cp.sum(w) <= M,
    *[
        w_n * u_k <= x_n
        for u_k in U["min_reviews"]
        for w_n, x_n in zip(w, X["num_reviews"])
    ],
    *[w_n * u_k <= x_n for u_k in U["min_rating"] for w_n, x_n in zip(w, X["rating"])],
    *[
        w_n
        * (
            1
            - (
                cp.minimum(
                    *[
                        d(row["latitude"], row["longitude"], lat_n, long_n)
                        for idx, row in U[["latitude", "longitude"]].iterrows()
                    ]
                )
                / cp.maximum(
                    *[
                        d(row["latitude"], row["longitude"], lat_n, long_n)
                        for idx, row in U[["latitude", "longitude"]].iterrows()
                    ]
                )
            )
        )
        <= α
        for w_n, lat_n, long_n in zip(w, X["latitude"], X["longitude"])
    ],
]

# Form objective.
obj = cp.Maximize(w.T @ X["rating"])

# Form and solve problem.
prob = cp.Problem(obj, constraints)
prob.solve()

print("Mixed Integer Programming Solution")
print("=" * 30)
print(f"Status: {prob.status}")
print(f"The optimal value is: {np.round(prob.value, 2)}")
print("Restaurants chosen: ")
X.iloc[np.argwhere(w.value).flatten()]
Mixed Integer Programming Solution
==============================
Status: optimal
The optimal value is: 22.5
Restaurants chosen: 
name address neighbourhood num_reviews rating price_range categories latitude longitude
21 Yup Dduk LA 3603 W 6th St Wilshire Center 2111 4.5 $$ [Korean, Chicken Shop] 34.063892 -118.300805
32 Han Bat Sul Lung Tang 4163 W 5th St Koreatown 2294 4.5 $$ [Korean, Comfort Food, Soup] 34.065408 -118.309849
36 Magal BBQ 3460 W 8th St Koreatown 1676 4.5 $$ [Korean, Barbeque] 34.057598 -118.305479
50 Bulgogi Hut 3600 Wilshire Blvd Koreatown 2471 4.5 $$ [Korean, Barbeque, Asian Fusion] 34.062375 -118.298589
68 Eight Korean BBQ 863 S Western Ave Koreatown 1651 4.5 $$ [Korean, Barbeque] 34.056027 -118.309888
time: 283 ms (started: 2021-03-10 07:59:21 +08:00)
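The result can be cross-checked without a solver. Every constraint except the cardinality constraint \(w^\top\mathbb{1} = M\) involves a single \(w_n\), so the optimum is reached by taking the \(M\) highest-rated restaurants among those that pass all per-restaurant checks. This is consistent with the output above, where the five chosen restaurants are all rated 4.5 and sum to 22.5. A minimal sketch on hypothetical data:

```python
# Hypothetical (name, rating, feasible) triples; `feasible` stands for a
# restaurant passing every per-restaurant constraint
candidates = [
    ("A", 4.5, True),
    ("B", 5.0, False),
    ("C", 4.0, True),
    ("D", 4.5, True),
    ("E", 3.5, True),
]
M = 2

# Drop infeasible restaurants, then take the M best by rating
feasible = [c for c in candidates if c[2]]
chosen = sorted(feasible, key=lambda c: c[1], reverse=True)[:M]
objective = sum(rating for _, rating, _ in chosen)
print([name for name, _, _ in chosen], objective)
```

When several restaurants tie on rating, as they do here, more than one selection is optimal, so a solver may return a different set with the same objective value.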