Project 1: Group Restaurants Choice

By: Jacob Andreesen, Jeff Chen, Miao Xu, Yiyi Wang

Finding an ideal restaurant for a group of friends is always a struggle for students who are new to a city and looking for new and exciting places to go. Inspired by this challenge, our group aims to build a prototype of a personalized restaurant recommender system that serves a small group of people, meets their requirements, and stays close to their tastes.

%load_ext autotime
%load_ext nb_black
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# For scraping
import time
import urllib.request, json
from flatten_dict import flatten
import requests
import copyheaders
from bs4 import BeautifulSoup

# General tools
import regex as re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
import geopandas
from geopy import Nominatim

time: 1.46 s (started: 2021-03-09 15:47:27 +08:00)

Web Scraping Yelp!

Instead of scraping the rendered Yelp! pages, we simply grab data from the https://www.yelp.com/search/snippet API endpoint. We attempted to scrape the website itself, but it was hard to select the specific elements we needed, and some of them are only revealed through button clicks. That would require browser automation software such as Selenium to simulate clicks and then read the resulting HTML, which is unnecessary work for this prototype.

headers_str = b"""
    cache-control: max-age=0, must-revalidate, no-cache, no-store, private
    cache-control: no-transform
    cf-cache-status: DYNAMIC
    cf-ray: 58b26184fbd76c86-SJC
    cf-request-id: 02635b471c00006c86b019b200000001
    content-encoding: gzip
    content-security-policy: report-uri https://www.yelp.com/csp_block?id=bf59639897830a99&page=enforced_by_default_directives&policy_hash=7b6f2d6630868fdb2698dac44731677c&site=www&timestamp=1588093661; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com
    content-security-policy-report-only: report-uri https://www.yelp.com/csp_report_only?id=bf59639897830a99&page=csp_report_frame_directives%2Cfull_site_ssl_csp_report_directives&policy_hash=9dd00a1a6fbb402584b7ce0c1fdb4d14&site=www&timestamp=1588093661; frame-ancestors 'self' https://*.yelp.com; default-src https:; img-src https: data: https://*.adsrvr.org; script-src https: data: 'unsafe-inline' 'unsafe-eval' blob:; style-src https: 'unsafe-inline' data:; connect-src https:; font-src data: 'self' https://*.yelp.com https://*.yelpcdn.com https://fonts.gstatic.com https://connect.facebook.net https://cdnjs.cloudflare.com https://apis.google.com https://www.google-analytics.com https://use.typekit.net https://player.ooyala.com https://use.fontawesome.com https://maxcdn.bootstrapcdn.com https://fonts.googleapis.com; frame-src https: yelp-webview://* yelp://* data:; child-src https: yelp-webview://* yelp://*; media-src https:; object-src 'self'; base-uri 'self' https://*.yelpcdn.com https://*.adsrvr.org https://6372968.fls.doubleclick.net; form-action https: 'self'
    content-type: application/json; charset=utf-8
    date: Tue, 28 Apr 2020 17:07:42 GMT
    expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
    expires: Tue, 28 Apr 2020 17:07:41 GMT
    pragma: no-cache
    referrer-policy: origin-when-cross-origin
    server: cloudflare
    status: 200
    strict-transport-security: max-age=31536000; includeSubDomains; preload
    vary: User-Agent
    vary: Accept-Encoding
    x-b3-sampled: 0
    x-content-type-options: nosniff
    x-mode: ro
    x-node: www_all
    x-node: 10-69-179-105-uswest2bprod-9c0a6478-895a-11ea-98c5-b6d34d770
    x-proxied: 10-69-159-164-uswest2bprod
    x-routing-service: 10-69-187-145-uswest2bprod; site=www
    x-xss-protection: 1; report=https://www.yelp.com/xss_protection_report
    x-zipkin-id: 9a87fa4730749a04
"""
headers = copyheaders.headers_raw_to_dict(headers_str)
restaurant_list_url = (
    lambda index: f"https://www.yelp.com/search/snippet?find_desc=&find_loc=Los%20Angeles%2C%20CA&start={index}"
)
total_number_of_restaurants = 240
yelp_raw_data = []
for i in tqdm(range((total_number_of_restaurants // 10) + 1)):
    index = (
        i * 10 if i * 10 < total_number_of_restaurants else total_number_of_restaurants
    )
    retries, max_retries = 0, 500
    while retries < max_retries:
        retries += 1
        page = requests.get(restaurant_list_url(index), headers=headers)
        try:
            if page.ok:
                yelp_raw_data += json.loads(page.content)["searchPageProps"][
                    "mainContentComponentsListProps"
                ]
        except (KeyError, TypeError, ValueError):
            # The payload did not have the expected JSON structure; retry this index.
            if retries % 10 == 0:
                print(f"Number of attempts to get data for {index}: {retries}")
            continue
        if not page.ok:
            # Non-OK responses are not retried; report them and move on.
            print(f"Couldn't get data for index: {index}")
        break

 80%|████████  | 20/25 [00:35<00:09,  1.87s/it]

Number of attempts to get data for 200: 10
Number of attempts to get data for 200: 20
Number of attempts to get data for 200: 30
Number of attempts to get data for 200: 40
Number of attempts to get data for 200: 50
Number of attempts to get data for 200: 60
Number of attempts to get data for 200: 70
Number of attempts to get data for 200: 80
Number of attempts to get data for 200: 90
Number of attempts to get data for 200: 100
Number of attempts to get data for 200: 110
Number of attempts to get data for 200: 120
Number of attempts to get data for 200: 130
Number of attempts to get data for 200: 140
Number of attempts to get data for 200: 150
Number of attempts to get data for 200: 160
Number of attempts to get data for 200: 170
Number of attempts to get data for 200: 180
Number of attempts to get data for 200: 190
Number of attempts to get data for 200: 200
Number of attempts to get data for 200: 210
Number of attempts to get data for 200: 220
Number of attempts to get data for 200: 230
Number of attempts to get data for 200: 240
Number of attempts to get data for 200: 250
Number of attempts to get data for 200: 260
Number of attempts to get data for 200: 270
Number of attempts to get data for 200: 280
Number of attempts to get data for 200: 290
Number of attempts to get data for 200: 300
Number of attempts to get data for 200: 310
Number of attempts to get data for 200: 320
Number of attempts to get data for 200: 330
Number of attempts to get data for 200: 340
Number of attempts to get data for 200: 350
Number of attempts to get data for 200: 360
Number of attempts to get data for 200: 370

 96%|█████████▌| 24/25 [04:38<00:26, 26.14s/it]

Number of attempts to get data for 240: 10
Number of attempts to get data for 240: 20
Number of attempts to get data for 240: 30
Number of attempts to get data for 240: 40
Number of attempts to get data for 240: 50
Number of attempts to get data for 240: 60
Number of attempts to get data for 240: 70
Number of attempts to get data for 240: 80
Number of attempts to get data for 240: 90
Number of attempts to get data for 240: 100
Number of attempts to get data for 240: 110
Number of attempts to get data for 240: 120
Number of attempts to get data for 240: 130
Number of attempts to get data for 240: 140
Number of attempts to get data for 240: 150
Number of attempts to get data for 240: 160
Number of attempts to get data for 240: 170
Number of attempts to get data for 240: 180
Number of attempts to get data for 240: 190
Number of attempts to get data for 240: 200
Number of attempts to get data for 240: 210
Number of attempts to get data for 240: 220
Number of attempts to get data for 240: 230
Number of attempts to get data for 240: 240
Number of attempts to get data for 240: 250
Number of attempts to get data for 240: 260
Number of attempts to get data for 240: 270
Number of attempts to get data for 240: 280
Number of attempts to get data for 240: 290
Number of attempts to get data for 240: 300
Number of attempts to get data for 240: 310
Number of attempts to get data for 240: 320
Number of attempts to get data for 240: 330
Number of attempts to get data for 240: 340
Number of attempts to get data for 240: 350
Number of attempts to get data for 240: 360
Number of attempts to get data for 240: 370
Number of attempts to get data for 240: 380
Number of attempts to get data for 240: 390
Number of attempts to get data for 240: 400
Number of attempts to get data for 240: 410
Number of attempts to get data for 240: 420
Number of attempts to get data for 240: 430
Number of attempts to get data for 240: 440
Number of attempts to get data for 240: 450
Number of attempts to get data for 240: 460
Number of attempts to get data for 240: 470
Number of attempts to get data for 240: 480
Number of attempts to get data for 240: 490

100%|██████████| 25/25 [09:52<00:00, 23.68s/it]

Number of attempts to get data for 240: 500
time: 9min 51s (started: 2021-03-09 15:47:29 +08:00)
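
The paging arithmetic and the retry loop above can be isolated into two small helpers. This is a minimal sketch rather than the notebook's actual code: `page_offsets`, `fetch_with_retries`, and the stub `flaky_fetch` are hypothetical names, and the stub stands in for the real `requests.get` call so that the example runs offline.

```python
from typing import Callable, List, Optional


def page_offsets(total: int, page_size: int = 10) -> List[int]:
    """Start offsets requested by the scraping loop: 0, 10, ..., capped at `total`."""
    return [min(i * page_size, total) for i in range(total // page_size + 1)]


def fetch_with_retries(
    fetch: Callable[[], Optional[list]], max_retries: int
) -> Optional[list]:
    """Call `fetch` until it returns a list or `max_retries` attempts are used up."""
    for _ in range(max_retries):
        result = fetch()
        if result is not None:
            return result
    return None


# Hypothetical stub: fails twice, then returns one page of results.
attempts = {"count": 0}


def flaky_fetch() -> Optional[list]:
    attempts["count"] += 1
    return None if attempts["count"] < 3 else [{"bizId": "example"}]


offsets = page_offsets(240)
page = fetch_with_retries(flaky_fetch, max_retries=5)
print(len(offsets), offsets[-1], attempts["count"], page)
```

Returning `None` once the retry budget is exhausted lets the caller decide whether to skip the page or stop the run.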

yelp_raw = pd.DataFrame(
    [flatten(content) for content in tqdm(yelp_raw_data) if "bizId" in content.keys()]
)
yelp_raw.to_csv("./data/yelp_raw.csv")
yelp_raw.head()

100%|██████████| 480/480 [00:00<00:00, 33043.36it/s]

	(searchActions,)	(isYelpGuaranteed,)	(bizId,)	(tags,)	(scrollablePhotos, allPhotosHref)	(scrollablePhotos, photoHref)	(scrollablePhotos, photoList)	(scrollablePhotos, isResponsive)	(scrollablePhotos, isScrollable)	(searchResultLayoutType,)	...	(snippet, thumbnail, src)	(snippet, thumbnail, srcset)	(snippet, readMoreText)	(snippet, text)	(searchResultBusiness, parentBusiness, businessUrl)	(searchResultBusiness, parentBusiness, name)	(searchResultBusinessHighlights, bizSiteUrl)	(searchResultBusinessHighlights, businessHighlights)	(childrenBusinessInfo, businessUrls)	(childrenBusinessInfo, businessNames)
0	[]	False	rF7KNmSv5sYbwd3D5sA_vw	[{'label': {'color': 'normal', 'text': 'New on...	/biz_photos/rF7KNmSv5sYbwd3D5sA_vw	/adredir?ad_business_id=rF7KNmSv5sYbwd3D5sA_vw...	[{'src': 'https://s3-media0.fl.yelpcdn.com/bph...	True	True	scrollablePhotos	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	[]	False	7O1ORGY36A-2aIENyaJWPg	[]	/biz_photos/7O1ORGY36A-2aIENyaJWPg	/biz/howlin-rays-los-angeles-3	[{'src': 'https://s3-media0.fl.yelpcdn.com/bph...	True	True	scrollablePhotos	...	https://s3-media0.fl.yelpcdn.com/photo/22VFkvu...	https://s3-media0.fl.yelpcdn.com/photo/22VFkvu...	more	I FINALLY got to try this place... but sadly b...	NaN	NaN	NaN	NaN	NaN	NaN
2	[]	False	KQBGm5G8IDkE8LeNY45mbA	[]	/biz_photos/KQBGm5G8IDkE8LeNY45mbA	/biz/wurstk%C3%BCche-los-angeles-2	[{'src': 'https://s3-media0.fl.yelpcdn.com/bph...	True	True	scrollablePhotos	...	https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL...	https://s3-media0.fl.yelpcdn.com/photo/mUmrPnL...	more	A group of us stopped by for dinner before exp...	NaN	NaN	NaN	NaN	NaN	NaN
3	[{'content': {'text': {'text': 'Offers takeout...	False	iSZpZgVnASwEmlq0DORY2A	[]	/biz_photos/iSZpZgVnASwEmlq0DORY2A	/biz/daikokuya-little-tokyo-los-angeles	[{'src': 'https://s3-media0.fl.yelpcdn.com/bph...	True	True	scrollablePhotos	...	https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p...	https://s3-media0.fl.yelpcdn.com/photo/5P7QP1p...	more	Daikokuya just never disappoints!\nWent last n...	NaN	NaN	NaN	NaN	NaN	NaN
4	[{'content': {'text': {'text': 'Offers takeout...	False	MlmcOkwaNnxl3Zuk6HsPCQ	[{'label': {'color': 'normal', 'text': 'Curren...	/biz_photos/MlmcOkwaNnxl3Zuk6HsPCQ	/biz/slurpin-ramen-bar-los-angeles-los-angeles	[{'src': 'https://s3-media0.fl.yelpcdn.com/bph...	True	True	scrollablePhotos	...	https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9...	https://s3-media0.fl.yelpcdn.com/photo/-6T8kS9...	more	Covid delivery review:\nOooooooof this hit the...	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 53 columns

time: 91.1 ms (started: 2021-03-09 15:57:21 +08:00)
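
The tuple-valued column names above come from flattening each nested search result into a single-level dict. The idea can be sketched with the standard library alone; `flatten_to_tuples` and the sample `record` are hypothetical and are not the `flatten_dict` implementation, only an illustration of how key paths become tuples.

```python
from typing import Any, Dict, Tuple


def flatten_to_tuples(
    nested: Dict[str, Any], prefix: Tuple[str, ...] = ()
) -> Dict[Tuple[str, ...], Any]:
    """Flatten nested dicts into one dict keyed by tuples of the key path."""
    flat: Dict[Tuple[str, ...], Any] = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict) and value:
            flat.update(flatten_to_tuples(value, path))
        else:
            # Leaves (including lists and empty dicts) are kept as-is.
            flat[path] = value
    return flat


# Hypothetical record shaped like one search result.
record = {
    "bizId": "abc123",
    "searchResultBusiness": {"name": "Example Cafe", "rating": 4.5},
    "snippet": {"thumbnail": {"src": "https://example.com/a.jpg"}},
}
flat_record = flatten_to_tuples(record)
print(flat_record)
```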

Data Pre-processing

Let’s take out some of the unnecessary columns for now.

yelp_raw.columns

Index([                                       ('searchActions',),
                                           ('isYelpGuaranteed',),
                                                      ('bizId',),
                                                       ('tags',),
                           ('scrollablePhotos', 'allPhotosHref'),
                               ('scrollablePhotos', 'photoHref'),
                               ('scrollablePhotos', 'photoList'),
                            ('scrollablePhotos', 'isResponsive'),
                            ('scrollablePhotos', 'isScrollable'),
                                     ('searchResultLayoutType',),
                             ('searchResultBusinessHighlights',),
                                      ('verifiedLicenseLayout',),
                                           ('serviceOfferings',),
                                                    ('snippet',),
                                                  ('markerKey',),
                                       ('adLoggingInfo', 'slot'),
                              ('adLoggingInfo', 'placementSlot'),
                                       ('adLoggingInfo', 'flow'),
                              ('adLoggingInfo', 'opportunityId'),
                               ('adLoggingInfo', 'adCampaignId'),
                               ('adLoggingInfo', 'isShowcaseAd'),
                                  ('adLoggingInfo', 'placement'),
                                       ('offerCampaignDetails',),
                      ('searchResultBusinessPortfolioProjects',),
                      ('searchResultBusiness', 'parentBusiness'),
                             ('searchResultBusiness', 'ranking'),
                         ('searchResultBusiness', 'reviewCount'),
                        ('searchResultBusiness', 'renderAdInfo'),
                                ('searchResultBusiness', 'name'),
                       ('searchResultBusiness', 'neighborhoods'),
                              ('searchResultBusiness', 'rating'),
                         ('searchResultBusiness', 'businessUrl'),
                                ('searchResultBusiness', 'isAd'),
                         ('searchResultBusiness', 'serviceArea'),
                               ('searchResultBusiness', 'phone'),
                          ('searchResultBusiness', 'priceRange'),
                      ('searchResultBusiness', 'alternateNames'),
                    ('searchResultBusiness', 'formattedAddress'),
                      ('searchResultBusiness', 'servicePricing'),
                          ('searchResultBusiness', 'bizSiteUrl'),
                          ('searchResultBusiness', 'categories'),
                                       ('childrenBusinessInfo',),
                                      ('snippet', 'readMoreUrl'),
                                 ('snippet', 'thumbnail', 'src'),
                              ('snippet', 'thumbnail', 'srcset'),
                                     ('snippet', 'readMoreText'),
                                             ('snippet', 'text'),
       ('searchResultBusiness', 'parentBusiness', 'businessUrl'),
              ('searchResultBusiness', 'parentBusiness', 'name'),
                ('searchResultBusinessHighlights', 'bizSiteUrl'),
        ('searchResultBusinessHighlights', 'businessHighlights'),
                        ('childrenBusinessInfo', 'businessUrls'),
                       ('childrenBusinessInfo', 'businessNames')],
      dtype='object')

time: 7.82 ms (started: 2021-03-09 15:57:21 +08:00)

yelp_filtered = yelp_raw[
    [col for col in yelp_raw.columns if "searchResultBusiness" in col]
].dropna(axis=1)
yelp_filtered = yelp_filtered[
    yelp_filtered[("searchResultBusiness", "priceRange")] != ""
]  # Remove rows with no price range
yelp_filtered.columns = [tuple(col)[-1] for col in yelp_filtered.columns]
yelp_filtered.to_csv("./data/yelp_filtered.csv")
yelp_filtered.head()

/usr/local/lib/python3.8/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return array(a, dtype, copy=False, order=order)

	reviewCount	renderAdInfo	name	neighborhoods	rating	businessUrl	isAd	phone	priceRange	alternateNames	formattedAddress	categories
1	6636	False	Howlin’ Ray’s	[Chinatown]	4.5	/biz/howlin-rays-los-angeles-3	False	(213) 935-8399	$$	[]	727 N Broadway	[{'url': '/search?cflt=southern&find_loc=Los+A...
2	8449	False	Wurstküche	[Arts District]	4.0	/biz/wurstk%C3%BCche-los-angeles-2	False	(213) 687-4444	$$	[]	800 E 3rd St	[{'url': '/search?cflt=hotdog&find_loc=Los+Ang...
3	8644	False	Daikokuya Little Tokyo	[Little Tokyo]	4.0	/biz/daikokuya-little-tokyo-los-angeles	False	(213) 626-1680	$$	[]	327 E 1st St	[{'url': '/search?cflt=ramen&find_loc=Los+Ange...
4	4641	False	Slurpin’ Ramen Bar - Los Angeles	[Koreatown]	4.5	/biz/slurpin-ramen-bar-los-angeles-los-angeles	False	(213) 388-8607	$$	[]	3500 W 8th St	[{'url': '/search?cflt=ramen&find_loc=Los+Ange...
5	6228	False	Bestia	[Downtown]	4.5	/biz/bestia-los-angeles	False	(213) 514-5724	$$$	[]	2121 E 7th Pl	[{'url': '/search?cflt=italian&find_loc=Los+An...

time: 29.9 ms (started: 2021-03-09 15:57:21 +08:00)
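
The column filter above relies on how `in` behaves for tuples, which is easy to misread as a substring test. A small sketch with hypothetical column tuples shows what is selected and how the short names are derived:

```python
# Hypothetical column keys in the same tuple form as yelp_raw.columns.
columns = [
    ("bizId",),
    ("searchResultBusiness", "name"),
    ("searchResultBusiness", "parentBusiness", "name"),
    ("searchResultBusinessHighlights", "bizSiteUrl"),
]

# `in` on a tuple compares whole elements, so the Highlights column is not selected.
business_columns = [col for col in columns if "searchResultBusiness" in col]

# Keeping only the last element gives short names, but nested paths can collide.
short_names = [col[-1] for col in business_columns]
print(business_columns)
print(short_names)
```

In the notebook, `dropna(axis=1)` runs before the renaming and removes the sparsely populated nested columns, which is why the resulting frame shows no duplicate names.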

Add the latitude and longitude for each address

locator = Nominatim(user_agent="myGeocoder")


def get_latlong(address: str, locator=locator):
    """Get (lat, long) for a Los Angeles street address; (None, None) if not found."""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    if location is None:  # geocode returns None when the address cannot be resolved
        return None, None
    return location.latitude, location.longitude


yelp_filtered = pd.concat(
    [
        yelp_filtered,
        pd.DataFrame(
            yelp_filtered["formattedAddress"]
            .apply(lambda address: get_latlong(address))
            .tolist(),
            columns=["latitude", "longitude"],
            index=yelp_filtered.index,
        ),
    ],
    axis=1,
)
yelp_filtered.head()

	reviewCount	renderAdInfo	name	neighborhoods	rating	businessUrl	isAd	phone	priceRange	alternateNames	formattedAddress	categories	latitude	longitude
1	6636	False	Howlin’ Ray’s	[Chinatown]	4.5	/biz/howlin-rays-los-angeles-3	False	(213) 935-8399	$$	[]	727 N Broadway	[{'url': '/search?cflt=southern&find_loc=Los+A...	34.061519	-118.239473
2	8449	False	Wurstküche	[Arts District]	4.0	/biz/wurstk%C3%BCche-los-angeles-2	False	(213) 687-4444	$$	[]	800 E 3rd St	[{'url': '/search?cflt=hotdog&find_loc=Los+Ang...	33.896347	-118.117083
3	8644	False	Daikokuya Little Tokyo	[Little Tokyo]	4.0	/biz/daikokuya-little-tokyo-los-angeles	False	(213) 626-1680	$$	[]	327 E 1st St	[{'url': '/search?cflt=ramen&find_loc=Los+Ange...	34.049971	-118.240083
4	4641	False	Slurpin’ Ramen Bar - Los Angeles	[Koreatown]	4.5	/biz/slurpin-ramen-bar-los-angeles-los-angeles	False	(213) 388-8607	$$	[]	3500 W 8th St	[{'url': '/search?cflt=ramen&find_loc=Los+Ange...	34.057590	-118.306725
5	6228	False	Bestia	[Downtown]	4.5	/biz/bestia-los-angeles	False	(213) 514-5724	$$$	[]	2121 E 7th Pl	[{'url': '/search?cflt=italian&find_loc=Los+An...	34.033738	-118.229309

time: 2min 34s (started: 2021-03-09 15:57:21 +08:00)

Convert neighbourhoods into a single value for each row

yelp_filtered["neighborhoods"].apply(
    lambda neighbourhoods: len(neighbourhoods)
).value_counts()

1    234
Name: neighborhoods, dtype: int64

time: 4.93 ms (started: 2021-03-09 15:59:55 +08:00)

yelp_filtered["neighborhoods"] = yelp_filtered["neighborhoods"].apply(
    lambda neighbourhoods: neighbourhoods[0]
)

time: 1.31 ms (started: 2021-03-09 15:59:55 +08:00)

yelp_filtered["neighborhoods"].value_counts()

Downtown                   30
Koreatown                  23
Hollywood                  15
Little Tokyo               15
Wilshire Center            14
Beverly Grove              13
Fairfax                    11
Mid-Wilshire               10
East Hollywood             10
Los Feliz                   9
Silver Lake                 6
Arts District               6
Chinatown                   5
Hancock Park                5
Echo Park                   4
Sawtelle                    4
University Park             3
Mid-City                    3
Larchmont                   3
Arlington Heights           3
Harvard Heights             3
Lincoln Heights             3
Palms                       3
Highland Park               2
Century City                2
Mar Vista                   2
Pico-Robertson              2
Windsor Square              2
Atwater Village             2
Westlake                    2
Boyle Heights               2
Exposition Park             2
Historic South Central      2
Pico-Union                  2
Eagle Rock                  2
Elysian Park                1
Beverlywood                 1
Westchester                 1
View Park/Windsor Hills     1
Studio City                 1
West Adams                  1
Hollywood Hills West        1
Westwood                    1
Griffith Park               1
Name: neighborhoods, dtype: int64

time: 4.17 ms (started: 2021-03-09 15:59:55 +08:00)

What categories do our restaurants have?

from collections import Counter

pd.Series(
    [
        cat
        for cats in yelp_filtered["categories"]
        .apply(
            lambda raw_categories: np.array(
                [category["title"] for category in raw_categories]
            )
        )
        .to_numpy()
        for cat in cats
    ]
).value_counts()[:30]

Breakfast & Brunch           33
Korean                       27
Seafood                      21
Sandwiches                   21
Coffee & Tea                 21
American (New)               21
Mexican                      20
Barbeque                     17
Desserts                     15
Italian                      14
Noodles                      14
Japanese                     13
Ice Cream & Frozen Yogurt    13
Bakeries                     13
Cocktail Bars                12
Sushi Bars                   11
Ramen                        10
Pizza                         9
American (Traditional)        9
Vegan                         9
Mediterranean                 8
Cafes                         8
Salad                         8
Burgers                       7
Soup                          7
Chicken Shop                  7
Halal                         6
Chinese                       6
Comfort Food                  6
Thai                          6
dtype: int64

time: 7.52 ms (started: 2021-03-09 15:59:55 +08:00)

yelp_filtered["categories"] = yelp_filtered["categories"].apply(
    lambda raw_categories: np.array([category["title"] for category in raw_categories])
)

time: 1.82 ms (started: 2021-03-09 15:59:55 +08:00)

# Save this dataset
yelp_filtered.to_csv("./data/yelp_cleaned.csv")

time: 11.5 ms (started: 2021-03-10 08:00:35 +08:00)

Optimization

This section can be run independently of Web Scraping Yelp! as long as yelp_cleaned.csv is loaded.

# Load the cleaned yelp dataset
yelp_filtered = pd.read_csv("./data/yelp_cleaned.csv", index_col=[0])
yelp_filtered.head()

	reviewCount	renderAdInfo	name	neighborhoods	rating	businessUrl	isAd	phone	priceRange	alternateNames	formattedAddress	categories	latitude	longitude
1	6636	False	Howlin’ Ray’s	Chinatown	4.5	/biz/howlin-rays-los-angeles-3	False	(213) 935-8399	$$	[]	727 N Broadway	['Southern' 'Chicken Shop' 'American (Traditio...	34.061519	-118.239473
2	8449	False	Wurstküche	Arts District	4.0	/biz/wurstk%C3%BCche-los-angeles-2	False	(213) 687-4444	$$	[]	800 E 3rd St	['Hot Dogs' 'German' 'Gastropubs']	33.896347	-118.117083
3	8644	False	Daikokuya Little Tokyo	Little Tokyo	4.0	/biz/daikokuya-little-tokyo-los-angeles	False	(213) 626-1680	$$	[]	327 E 1st St	['Ramen' 'Noodles']	34.049971	-118.240083
4	4641	False	Slurpin’ Ramen Bar - Los Angeles	Koreatown	4.5	/biz/slurpin-ramen-bar-los-angeles-los-angeles	False	(213) 388-8607	$$	[]	3500 W 8th St	['Ramen' 'Noodles']	34.057590	-118.306725
5	6228	False	Bestia	Downtown	4.5	/biz/bestia-los-angeles	False	(213) 514-5724	$$$	[]	2121 E 7th Pl	['Italian' 'Cocktail Bars' 'Pizza']	34.033738	-118.229309

time: 19.7 ms (started: 2021-03-10 08:02:22 +08:00)
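
Because the category arrays were written to CSV as text, they come back as plain strings such as `['Ramen' 'Noodles']` rather than as lists. The following is a minimal sketch of one way to recover the titles; `parse_categories` is a hypothetical helper, and it assumes that no title contains both kinds of quote characters.

```python
import re
from typing import List


def parse_categories(raw: str) -> List[str]:
    """Recover category titles from the string form of a NumPy array of strings."""
    # Titles are wrapped in single quotes, or in double quotes if they contain an apostrophe.
    matches = re.findall(r"'([^']*)'|\"([^\"]*)\"", raw)
    return [single or double for single, double in matches]


raw_value = "['Hot Dogs' 'German' 'Gastropubs']"
parsed = parse_categories(raw_value)

# Membership tests on the raw string are substring checks and can match partial titles.
substring_hit = "Dogs" in raw_value  # matches although "Dogs" is not a category
exact_hit = "Dogs" in parsed  # exact title comparison
print(parsed, substring_hit, exact_hit)
```

Applying the helper with `yelp_filtered["categories"].apply(parse_categories)` would turn later `cat in categories` checks into exact title matches.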

Problem Setup

Let $k \in K$ index the users whose preference is to be accounted for.

Let $m \in M$ index the month we will be requesting a restaurant recommendation for.

Let $n \in N$ index the restaurants in Los Angeles available on Yelp!.

Let $\mathcal{X} = \left\{{x}_{1}, {x}_{2}, \cdots, {x}_{N}\right\}$ be a matrix that denotes our dataset of webscraped restaurant information from Yelp!, where

$$
\begin{aligned}
{x}_{n} &= \begin{bmatrix}
\text{name} = \left\{ \mathbf{\text{String}} \right\} \\
\text{address} = \left\{ \mathbf{\text{String}} \right\} \\
\text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\
\text{num\_reviews} = \left\{ 0.0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\
\text{rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\
\text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\
\text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\}
\end{bmatrix}
\end{aligned}
$$

Let $\mathcal{U} = \left\{{u}_{1}, {u}_{2}, \cdots, {u}_{K}\right\}$ be a matrix, where

$$
\begin{aligned}
{u}_{k} &= \begin{bmatrix}
\text{address} = \left\{ \mathbf{\text{String}} \right\} \\
\text{neighbourhood} = \left\{ \mathbf{\text{String}} \right\} \\
\text{min\_reviews} = \left\{ 0.0 \leq \mathbf{\text{Integer}} \leq \infty \right\} \\
\text{min\_rating} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq 5.0 \right\} \\
\text{price\_range} = \left\{ \text{\$}, \text{\$\$}, \text{\$\$\$}, \text{\$\$\$\$} \right\} \\
\text{max\_distance (miles)} = \left\{ 0.0 \leq \mathbf{\text{Float}} \leq \infty \right\} \\
\text{categories} = \left\{ \text{Breakfast \& Brunch}, \text{Korean}, \cdots, \text{Halal} \right\}
\end{bmatrix}
\end{aligned}
$$

Let $w \in \mathbb{Z}^N_2 = \left\{0, 1\right\}^N$ be a binary indicator vector denoting which of the $N$ restaurants in the Los Angeles area were chosen.

Let $d: L \times L \rightarrow \mathbb{R}^+_0, L = \left\{\mathbf{\text{Valid Address String}}\right\}$ be a function that calculates the distance in miles between two points.
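
As an illustration of the distance function $d$, here is a minimal great-circle (haversine) sketch using only the standard library. It takes latitude/longitude pairs rather than address strings, so addresses are assumed to have been geocoded already; `haversine_miles` and the mean Earth radius constant are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Coordinates taken from the geocoded table above (727 N Broadway and 2121 E 7th Pl).
distance = haversine_miles(34.061519, -118.239473, 34.033738, -118.229309)
print(round(distance, 2))
```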

Let $A.\text{<attribute>}$ denote the column vector (if $A$ is a matrix) / scalar (if $A$ is a vector) of just the specific attribute.

Mixed-Integer Program Formulation

$$
\begin{aligned}
\underset{w}{\text{maximize }} \quad & w^\top \left(\mathcal{X}.\text{rating}\right) \\
\text{subject to } \quad & w^\top \mathbb{1} = M \\
& w_n u_k.\text{min\_reviews} \leq x_n.\text{num\_reviews}, \quad \forall\, n \in N, k \in K \\
& w_n u_k.\text{min\_rating} \leq x_n.\text{rating}, \quad \forall\, n \in N, k \in K \\
& w_n \left(1 - \frac{\min_{n, k} d(x_n.\text{address}, u_k.\text{address})}{\max_{n, k} d(x_n.\text{address}, u_k.\text{address})}\right) \leq \alpha_n, \quad \forall\, n \in N, k \in K, w_n = 1 \\
& u_k.\text{price\_range} \geq x_n.\text{price\_range}, \quad \forall\, n \in N, k \in K, w_n = 1 \\
& u_k.\text{neighbourhood} = x_n.\text{neighbourhood}, \quad \forall\, n \in N, k \in K, w_n = 1 \\
& u_k.\text{categories} \subseteq x_n.\text{categories}, \quad \forall\, n \in N, k \in K, w_n = 1
\end{aligned}
$$
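
To build intuition for the formulation: apart from $w^\top\mathbb{1} = M$, every constraint applies to one restaurant at a time, so a restaurant is either eligible or not, and the objective is then maximized by the $M$ highest-rated eligible restaurants. The sketch below checks this on hypothetical toy data. It covers only the review, rating, and distance-spread constraints, treats $\alpha$ as a single scalar, and takes the minimum and maximum distance over users for each restaurant; all of these are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical toy data: each restaurant lists its distance in miles to every user.
restaurants: List[Dict] = [
    {"name": "A", "num_reviews": 1500, "rating": 4.5, "distances": [2.0, 6.0]},
    {"name": "B", "num_reviews": 300, "rating": 5.0, "distances": [1.0, 2.0]},
    {"name": "C", "num_reviews": 2500, "rating": 4.0, "distances": [3.0, 4.0]},
    {"name": "D", "num_reviews": 1200, "rating": 4.5, "distances": [0.5, 5.0]},
    {"name": "E", "num_reviews": 4000, "rating": 4.8, "distances": [2.0, 3.0]},
]
users: List[Dict] = [
    {"min_reviews": 1000, "min_rating": 4.0},
    {"min_reviews": 200, "min_rating": 4.5},
]
alpha, M = 0.75, 2


def distance_spread(distances: List[float]) -> float:
    """1 - (shortest trip / longest trip); 0 means everyone travels equally far."""
    return 1 - min(distances) / max(distances)


def is_feasible(x: Dict) -> bool:
    """True if choosing restaurant x (w_n = 1) violates no per-restaurant constraint."""
    return (
        all(u["min_reviews"] <= x["num_reviews"] for u in users)
        and all(u["min_rating"] <= x["rating"] for u in users)
        and distance_spread(x["distances"]) <= alpha
    )


feasible = [x for x in restaurants if is_feasible(x)]
if len(feasible) < M:
    raise ValueError("Fewer than M eligible restaurants: the program is infeasible.")

# With exactly M picks and a linear objective, the top-M eligible ratings are optimal.
chosen = sorted(feasible, key=lambda x: x["rating"], reverse=True)[:M]
total_rating = sum(x["rating"] for x in chosen)
print([x["name"] for x in chosen], total_rating)
```

A solver becomes necessary once constraints couple several restaurants together, for example a limit on how many picks may share a neighbourhood.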

# For distance calculation
from haversine import haversine, Unit

# For optimization
import cvxpy as cp

# For converting street in LA to lat long
locator = Nominatim(user_agent="myGeocoder")

def get_latlong(address: str, locator=locator):
    """Get (lat, long) for a Los Angeles street address; (None, None) if not found."""
    location = locator.geocode(f"{address}, Los Angeles, CA")
    if location is None:  # geocode returns None when the address cannot be resolved
        return None, None
    return location.latitude, location.longitude

time: 476 ms (started: 2021-03-09 15:59:55 +08:00)

User Preference Matrix $\mathcal{U}$:

U = pd.DataFrame(
    [
        [
            "3584 S Figueroa St",
            None,
            100,
            4,
            "$$",
            None,
            None,
            *get_latlong("3584 S Figueroa St"),
        ],  # Icon Plaza USC
        [
            "3301 S Hoover St",
            None,
            200,
            3.6,
            "$$$",
            None,
            None,
            *get_latlong("3301 S Hoover St"),
        ],  # USC Village
        [
            "10250 Santa Monica Blvd",
            None,
            1000,
            4.5,
            "$$$$",
            None,
            None,
            *get_latlong("10250 Santa Monica Blvd"),
        ],  # Westfield Century City
        [
            "189 The Grove Dr",
            None,
            1000,
            3.9,
            "$$$",
            None,
            ["Korean"],
            *get_latlong("189 The Grove Dr"),
        ],  # The Grove
    ],
    columns=[
        "address",
        "neighbourhood",
        "min_reviews",
        "min_rating",
        "price_range",
        "max_distance",
        "categories",
        "latitude",
        "longitude",
    ],
)

assert (
    U["neighbourhood"].nunique() <= 1
), "Number of different neighbourhood preferences must be <= 1."
# assert (
#     U["price_range"].nunique() <= 1
# ), "Number of different price_range preferences must be <= 1."
assert (
    len(
        np.unique(
            [
                cat
                for cats in U["categories"].to_numpy()
                if cats is not None
                for cat in cats
            ]
        )
    )
    <= 2
), "Number of different category preferences must be <= 2."

time: 2.64 s (started: 2021-03-09 17:18:06 +08:00)

Updated Data Matrix $\mathcal{X}$, filtering out some data that does not match constraints:

X = yelp_filtered[
    [
        "name",
        "formattedAddress",
        "neighborhoods",
        "reviewCount",
        "rating",
        "priceRange",
        "categories",
        "latitude",
        "longitude",
    ]
]
X.columns = [
    "name",
    "address",
    "neighbourhood",
    "num_reviews",
    "rating",
    "price_range",
    "categories",
    "latitude",
    "longitude",
]  # rename columns

# Filter out the neighbourhoods not in the user preference
if len(U["neighbourhood"].dropna().unique()) > 0:
    X = X[X["neighbourhood"] == U["neighbourhood"].dropna().unique()[0]]

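# Keep only the restaurants whose price range equals the lowest price range among the user preferences
# (stricter than the >= constraint in the formulation, which would also admit cheaper restaurants)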
if len(U["price_range"].dropna().unique()) > 0:
    X = X[
        X["price_range"]
        == sorted(
            U["price_range"].dropna().unique(), key=lambda x: len(x), reverse=False
        )[0]
    ]

# Filter out restaurants that are not in the same categories as what we requested in user preference
if len(U["categories"].dropna()) > 0:
    X = X[
        X["categories"].apply(
            lambda categories: np.any(
                [
                    cat in categories
                    for cat in [
                        cat
                        for cats in U["categories"].to_numpy()
                        if cats is not None
                        for cat in cats
                    ]
                ]
            )
        )
    ]

time: 15.4 ms (started: 2021-03-10 07:57:39 +08:00)

Optimization

# Distance metric
d = lambda lat1, long1, lat2, long2: haversine(
    (lat1, long1), (lat2, long2), unit=Unit.MILES
)

# Maximum allowed relative difference between the farthest- and shortest-travelling individuals
α = 0.75

# Number of months for which we are getting recommendations
M = 5

# Number of restaurants in our dataset
N = X.shape[0]

# Number of users
K = U.shape[0]

# Create one vector optimization variable.
w = cp.Variable(X.shape[0], boolean=True)

# Create constraints.
constraints = [
    cp.sum(w) == M,  # choose exactly M restaurants
    *[
        w_n * u_k <= x_n
        for u_k in U["min_reviews"]
        for w_n, x_n in zip(w, X["num_reviews"])
    ],
    *[w_n * u_k <= x_n for u_k in U["min_rating"] for w_n, x_n in zip(w, X["rating"])],
    *[
        w_n
        * (
            1
            - (
                cp.minimum(
                    *[
                        d(row["latitude"], row["longitude"], lat_n, long_n)
                        for idx, row in U[["latitude", "longitude"]].iterrows()
                    ]
                )
                / cp.maximum(
                    *[
                        d(row["latitude"], row["longitude"], lat_n, long_n)
                        for idx, row in U[["latitude", "longitude"]].iterrows()
                    ]
                )
            )
        )
        <= α
        for w_n, lat_n, long_n in zip(w, X["latitude"], X["longitude"])
    ],
]

# Form objective.
obj = cp.Maximize(w.T @ X["rating"])

# Form and solve problem.
prob = cp.Problem(obj, constraints)
prob.solve()

print("Mixed Integer Programming Solution")
print("=" * 30)
print(f"Status: {prob.status}")
print(f"The optimal value is: {np.round(prob.value, 2)}")
print("Restaurants chosen: ")
X.iloc[np.argwhere(w.value).flatten()]

Mixed Integer Programming Solution
==============================
Status: optimal
The optimal value is: 22.5
Restaurants chosen: 

	name	address	neighbourhood	num_reviews	rating	price_range	categories	latitude	longitude
21	Yup Dduk LA	3603 W 6th St	Wilshire Center	2111	4.5	$$	[Korean, Chicken Shop]	34.063892	-118.300805
32	Han Bat Sul Lung Tang	4163 W 5th St	Koreatown	2294	4.5	$$	[Korean, Comfort Food, Soup]	34.065408	-118.309849
36	Magal BBQ	3460 W 8th St	Koreatown	1676	4.5	$$	[Korean, Barbeque]	34.057598	-118.305479
50	Bulgogi Hut	3600 Wilshire Blvd	Koreatown	2471	4.5	$$	[Korean, Barbeque, Asian Fusion]	34.062375	-118.298589
68	Eight Korean BBQ	863 S Western Ave	Koreatown	1651	4.5	$$	[Korean, Barbeque]	34.056027	-118.309888

time: 283 ms (started: 2021-03-10 07:59:21 +08:00)
