Data Science Capstone Project

The Battle of the Neighbourhoods - LA Edition

- Ishaan Vasant

Los Angeles, often known by its initials LA, is the most populous city in California and the second most populous city in the United States. Los Angeles is the cultural, financial, and commercial center of Southern California. The city is known for its Mediterranean climate, ethnic diversity, Hollywood, the entertainment industry, and its sprawling metropolis.

Having lived in LA for over a year now, I can confirm that Los Angeles is also one of the most amazing places to eat, thanks to an incredible variety of international cuisines and some of the most talented chefs in the world. It is a multicultural city, home to some of the largest communities of several nationalities outside their homelands, and they bring their cuisines with them. LA’s thriving economy, great seasonal produce and access to ingredients make it an ideal place for restaurants to flourish.

The objective of this project is to identify the best potential neighbourhoods where a restaurant can be set up. An [international YouGov study](https://yougov.co.uk/topics/food/articles-reports/2019/03/12/italian-cuisine-worlds-most-popular) of more than 25,000 people in 24 countries found that pizza and pasta are among the most popular foods in the world, as Italian cuisine beats all comers. According to their analysis, 88% of people surveyed in America liked Italian food. Keeping this in mind, the focus of this capstone is Italian restaurants. The analysis and results of this project should therefore interest stakeholders who are considering **opening an Italian restaurant in Los Angeles**.

Since there are lots of restaurants in LA, neighbourhoods that are **not already crowded with restaurants** will be shortlisted. The next filter is neighbourhoods with the **fewest Italian restaurants in their vicinity**. Neighbourhoods that are **as close to the city center as possible** will be preferred. Neighbourhood **rent** is another factor that will be taken into consideration.

Based on the criteria specified above, the factors that will influence the final decision are: -

  • Number of existing restaurants in the neighbourhood (any type of restaurant)
  • Number of and distance to Italian restaurants in the neighbourhood
  • Distance of neighbourhood from city center
  • Average neighbourhood rent

The following data sources will be needed to extract/generate the required information: -

In this project, the first step will be to collect data on the neighbourhoods of Los Angeles from the internet. There are no relevant datasets available for this, and therefore the data will need to be scraped from a webpage. The location coordinates of each neighbourhood will then be obtained with the help of the GeoPy Nominatim geolocator and appended to the neighbourhood data. Using this data, a folium map of the Los Angeles neighbourhoods will be created.

The second step will be to explore each of the neighbourhoods and their venues using Foursquare location data. The venues of the neighbourhoods will be analyzed in detail and patterns will be discovered by grouping the neighbourhoods using k-means clustering. Following this, each cluster will be examined and a decision will be made regarding which cluster fits the stakeholder's requirements. The factor that will determine this is the frequency of occurrence of restaurants and other food venues within the cluster.

Once a cluster is picked, the neighbourhoods in that cluster will be investigated with regard to the number of Italian restaurants in their vicinity. The ones that fit the requirements will be further explored and shortlisted based on how small their respective distances to the center of Los Angeles are. Finally, if there are multiple neighbourhoods that fit these conditions, Los Angeles rent data can be used to influence the stakeholder's decision.

The results of the analysis will highlight potential neighbourhoods where an Italian restaurant may be opened based on geographical location and proximity to competitors. This will only serve as a starting point since there are a lot of other factors that influence such a decision.

Importing Libraries

The first step in the analysis is importing the required libraries.

Web Scraping Neighbourhood Data

The list of all neighbourhoods in LA is obtained by scraping the relevant webpage. The data in the webpage is in the form of a list and not a table. Therefore, the data is obtained by searching for all list items and then using a particular characteristic that groups the required items.
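The list-item approach described above can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's original code: the sample markup, the class name `neighbourhood` and the helper `scrape_list_items` are all hypothetical, since the actual webpage and its structure are not shown here.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page and its class names are not given in the text.
SAMPLE_HTML = """
<ul>
  <li class="nav">Home</li>
  <li class="neighbourhood"><a href="/a">Echo Park</a></li>
  <li class="neighbourhood"><a href="/b">Koreatown</a></li>
  <li class="nav">About</li>
</ul>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> whose class attribute equals wanted_class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False   # True while inside a matching <li>
        self.buffer = []      # text fragments of the current item
        self.items = []       # finished item texts

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == self.wanted_class:
            self.inside = True
            self.buffer = []

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self.inside:
            self.items.append("".join(self.buffer).strip())
            self.inside = False


def scrape_list_items(html, wanted_class):
    """Return the text of all list items that share the given class."""
    parser = ListItemCollector(wanted_class)
    parser.feed(html)
    return parser.items


neighbourhoods = scrape_list_items(SAMPLE_HTML, "neighbourhood")
print(neighbourhoods)
```

The shared class plays the role of the "particular characteristic" that separates neighbourhood entries from navigation and other list items on the page.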

Loading and Cleaning Neighbourhood

Obtaining Neighbourhood Coordinates

Using GeoPy Nominatim geolocator with the user_agent "la_explorer".
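A minimal sketch of the lookup loop is given below. To keep it runnable without network access, the geocoder is passed in as a callable and a stand-in is used for the demonstration. The function name `geocode_neighbourhoods`, the query suffix and the sample coordinates are assumptions made for illustration. Failed lookups are stored as zeros, matching the "missing (zero) values" mentioned in the cleaning step.

```python
from collections import namedtuple


def geocode_neighbourhoods(names, geocode, suffix=", Los Angeles, California"):
    """Return {name: (latitude, longitude)}; (0.0, 0.0) marks a failed lookup."""
    coords = {}
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # The geocoder found nothing: record zeros so the row can be dropped later.
            coords[name] = (0.0, 0.0)
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


# Stand-in geocoder for the demonstration; the coordinates are illustrative only.
FakeLocation = namedtuple("FakeLocation", "latitude longitude")


def fake_geocode(query):
    lookup = {"Echo Park, Los Angeles, California": FakeLocation(34.08, -118.26)}
    return lookup.get(query)  # None when the query is unknown


result = geocode_neighbourhoods(["Echo Park", "Nowhere"], fake_geocode)
print(result)
```

With GeoPy installed, the `geocode` argument could be `Nominatim(user_agent="la_explorer").geocode`, which returns either `None` or an object exposing `latitude` and `longitude`. Since the public Nominatim service limits request rates, a delay between calls would likely be needed for a full list of neighbourhoods.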

Clean neighbourhood data with the respective coordinates: -

Deleting neighbourhoods with missing (zero) values and obvious geocoding errors: -
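One way to carry out this cleaning step is sketched below with pandas. The text does not say how the geocoding errors were identified, so the bounding-box check is an assumption, as are the box limits, the column names and the sample rows.

```python
import pandas as pd

# Illustrative rows; the coordinates are placeholders, not the notebook's data.
df = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [34.05, 0.0, 40.71],
    "Longitude": [-118.25, 0.0, -74.01],
})

# Assumed bounding box roughly around the Los Angeles area; adjust as needed.
LAT_MIN, LAT_MAX = 33.6, 34.4
LON_MIN, LON_MAX = -118.8, -118.0

# Rows whose lookup failed were stored as zeros.
not_missing = (df["Latitude"] != 0) & (df["Longitude"] != 0)

# Rows geocoded to a place far outside the city are treated as errors.
in_box = (
    df["Latitude"].between(LAT_MIN, LAT_MAX)
    & df["Longitude"].between(LON_MIN, LON_MAX)
)

clean = df[not_missing & in_box].reset_index(drop=True)
print(clean)
```

In this toy example, row B is dropped as missing and row C is dropped because it falls outside the box.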

Complete neighbourhood data frame: -

LA Neighbourhood Map

Obtaining the coordinates of the center of LA: -

Creating a map of LA with neighbourhoods superimposed on top: -

Defining Foursquare Credentials and Version

Exploring the first Neighbourhood

Venue data: -

Nearby venues of the first neighbourhood: -

Exploring all Neighbourhoods

Function to get the nearby venues of all neighbourhoods and load the data into a data frame: -

Data frame of all venues: -

It makes sense to set up a restaurant in one of the more popular neighbourhoods so that the restaurant attracts the attention of a lot more people.

Therefore, a list of all the popular neighbourhoods, i.e. the neighbourhoods with 10 or more venues, is obtained: -
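The popularity filter can be sketched with a pandas group-by. The data frame below is a synthetic stand-in and the column names are assumptions; only the threshold of 10 venues comes from the text.

```python
import pandas as pd

# Synthetic stand-in for the venues data frame: A has 12 venues, B has 3.
venues = pd.DataFrame({
    "Neighbourhood": ["A"] * 12 + ["B"] * 3,
    "Venue": [f"venue_{i}" for i in range(15)],
})

MIN_VENUES = 10  # threshold stated in the text

# Count venues per neighbourhood and keep those at or above the threshold.
counts = venues.groupby("Neighbourhood")["Venue"].count()
popular = counts[counts >= MIN_VENUES].index.tolist()

# Restrict the venues data frame to the popular neighbourhoods.
popular_venues = venues[venues["Neighbourhood"].isin(popular)]
print(popular, len(popular_venues))
```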

Updating the venues data frame to include only the venues which are in popular neighbourhoods: -

Analyzing each Neighbourhood

Grouping rows by neighbourhood by taking the mean of the frequency of occurrence of each category: -
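A minimal sketch of this step: each venue category is one-hot encoded, and the mean of each indicator column per neighbourhood gives the relative frequency of that category. The sample rows and the column name `Venue Category` are assumptions for illustration.

```python
import pandas as pd

# Synthetic stand-in for the venues data frame.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Cafe", "Gym", "Cafe", "Cafe"],
})

# One indicator column per category (1.0 where the venue has that category).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of the indicators is the share of venues in each category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

In this toy data, half of the venues in neighbourhood A are parks, so its `Park` frequency is 0.5, while every venue in B is a cafe.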

Printing each neighbourhood along with the top 5 most common venues: -

Creating a new data frame and displaying the top 10 venues for each neighbourhood: -
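The ranking of venues per neighbourhood could look like the sketch below, which sorts each row of the frequency table and keeps the leading categories. The frequencies, the helper `top_venues` and the column labels are hypothetical, and only the top 2 are kept to keep the example small.

```python
import pandas as pd

# Synthetic frequency table (one row per neighbourhood, one column per category).
grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Cafe": [0.2, 0.6],
    "Gym": [0.3, 0.1],
    "Park": [0.5, 0.3],
})


def top_venues(row, n):
    """Return the n category names with the highest frequency in this row."""
    freqs = row.drop("Neighbourhood").astype(float)
    return freqs.sort_values(ascending=False).index[:n].tolist()


TOP_N = 2  # the notebook uses 10; 2 keeps the example readable

rows = []
for _, row in grouped.iterrows():
    rows.append([row["Neighbourhood"]] + top_venues(row, TOP_N))

columns = ["Neighbourhood"] + [f"Most Common Venue {i + 1}" for i in range(TOP_N)]
top = pd.DataFrame(rows, columns=columns)
print(top)
```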

Clustering Neighbourhoods

The first step is to determine the optimal value of K for the dataset using the Silhouette Coefficient Method.

The Silhouette Coefficient of a sample ranges from -1 to 1. A higher value indicates that the sample is well matched to its own cluster and poorly matched to neighbouring clusters. A higher mean score across all samples therefore corresponds to a model with better-defined clusters.

The Silhouette Coefficient is the highest for n_clusters=4. Therefore, the neighbourhoods shall be grouped into 4 clusters (k=4) using k-means clustering.
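The selection of k can be sketched with scikit-learn as shown below. The data here is a synthetic stand-in (three well-separated blobs), not the neighbourhood frequency table, so the best k it reports differs from the value of 4 found in the analysis. The candidate range of k and the random seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three tight blobs in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette score is selected.
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be the grouped frequency table with the neighbourhood name column dropped, and the final model would be refit with the selected k to obtain the cluster labels.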

Creating a new data frame that includes the cluster as well as the top 10 venues for each neighbourhood: -

Visualizing the resulting neighbourhood clusters on the map: -

Examining the Clusters

Creating a data frame for each cluster that includes the top 10 venues for each of its neighbourhoods: -

Creating a data frame grouped by clusters by taking the mean of the frequency of occurrence of each venue category: -

Visualizing Top 10 Venues for each Cluster

Function to generate a horizontal bar plot showing the top 10 venues for each cluster, highlighting the food venues: -

There are 6 food venues in the top 10 venues of Cluster 0 with Mexican Restaurants making up nearly 20% of all venues. These facts indicate that Cluster 0 would not be the best one to explore further in terms of setting up a new restaurant.

There are 4 food venues in the top 10 venues of Cluster 1, with Korean Restaurants making up by far the largest share (nearly 30%) of all venues. This is unsurprising, as Cluster 1 consists of only two neighbourhoods, one being Koreatown and the other (Mid-Wilshire) also having a lot of Korean Restaurants. While there are only 4 food venues in the top 10, the dominance of Korean Restaurants in the area indicates that Cluster 1 need not be looked into any further.

There are only 2 food venues in the top 10 venues of Cluster 2. Moreover, the two venues are Food Trucks and Coffee Shops as opposed to proper restaurants. There are a lot of public venues in this cluster, i.e. venues that see a lot of footfall, such as parks, museums, gyms and department stores. The presence of condominium complexes in this list also suggests that the population density of these neighbourhoods is high. All of these observations point in the direction of Cluster 2 being nominated as the cluster to explore further.

Having said that, the decision to explore Cluster 2 can only be confirmed after examining Cluster 3: -

There are 8 food venues in the top 10 venues of Cluster 3, which is a very high proportion. Except for the number 1 venue (Coffee Shops), all other food venues are proper restaurants. This clearly indicates that the neighbourhoods in Cluster 3 are already saturated with restaurants and need not be considered when opening a new restaurant.

It is now safe to confirm the decision of investigating Cluster 2 further and eliminating all other clusters.

Investigating the chosen Cluster

The neighbourhoods in Cluster 2 along with their coordinates: -

Function to obtain and display the closest Italian restaurants from each neighbourhood in Cluster 2 and the corresponding distances: -

From the data frames above, it can be observed that Park La Brea has 7 Italian restaurants within 700 meters of its center. Hancock Park has fewer (3), but two of them are less than 250 meters from its center. This indicates that Park La Brea and Hancock Park would not be suitable neighbourhoods in which to open an Italian restaurant and can therefore be eliminated. This leaves the following neighbourhoods: -

Computing the distance of each neighbourhood from the center of LA and adding it as a column to the existing data frame: -
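The text does not state how the distances were computed. One common option, sketched below under that assumption, is the haversine great-circle formula using only the standard library. The function name `haversine_km` and all coordinates in the example are placeholders, not the values used in the analysis.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Placeholder coordinates for illustration only (not the notebook's values).
center = (34.05, -118.24)
neighbourhoods = {"Example A": (34.02, -118.29), "Example B": (33.97, -118.42)}

# Distance of each neighbourhood from the center, rounded to 0.1 km.
distances = {
    name: round(haversine_km(*center, *coords), 1)
    for name, coords in neighbourhoods.items()
}
print(distances)
```

The resulting values could then be added as a new column to the neighbourhood data frame and used to rank the candidates by proximity to the center.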

It is clear from the data frame above that Exposition Park (~6km) and Montecito Heights (~6km) are much closer to the center of Los Angeles than Wilshire Center (~17.5km) and Playa Vista (~19km). Since the distance from LA center is a criterion in choosing the optimal neighbourhood, Wilshire Center and Playa Vista would not be appropriate choices.

Web Scraping Rent Data

The average rent of all neighbourhoods in LA can be obtained by scraping the relevant webpage. The data in the webpage is in the form of a table, so it can be extracted more easily than the list-based neighbourhood data.

The above data frame is already in ascending order of average rent. The 2 neighbourhoods in question can be identified from the table and their average rents displayed: -

The average rent in Exposition Park is nearly twice the average rent in Montecito Heights. This means that Exposition Park is a significantly more expensive neighbourhood.

At the beginning of the analysis, the data frame of Los Angeles neighbourhoods was trimmed to include only the ones that had 10 or more venues. This decision was taken because it made sense to set up a restaurant in one of the more popular neighbourhoods, thereby attracting the attention of a lot more people.

When clustering the neighbourhoods, the optimal value of k (k=4) for the dataset was arrived at using the Silhouette Coefficient Method. As a consequence, all neighbourhoods were grouped into 4 clusters using k-means clustering. In order to examine the defining characteristics of each cluster, a data frame for each cluster was created that included its most frequently occurring venues in descending order. A horizontal bar plot was generated showing the top 10 venues for each cluster, highlighting the food venues. This helped in determining the optimal cluster for further analysis. All of the observations pointed in the direction of Cluster 2 being that cluster. It had only 2 food venues amongst the top 10, food trucks and coffee shops, neither of which is a full-fledged restaurant. The cluster also had apartment complexes and a lot of public venues, which meant that the neighbourhoods in it see a lot of people.

The following step was to obtain and display the closest Italian restaurants to each neighbourhood in Cluster 2 and their corresponding distances. It was observed that Park La Brea had 7 Italian restaurants within 700 meters of its center. Hancock Park had fewer (3), but two of them were less than 250 meters from its center. This indicated that Park La Brea and Hancock Park would not be suitable neighbourhoods in which to open an Italian restaurant, and they were eliminated.

The next criterion was the distance of each of the remaining neighbourhoods from the center of the city. It was found that Exposition Park (~6km) and Montecito Heights (~6km) are much closer to the center of Los Angeles than Wilshire Center (~17.5km) and Playa Vista (~19km). Therefore, Wilshire Center and Playa Vista were judged not to be appropriate choices.

The table of average rent of all neighbourhoods in LA was obtained by scraping the relevant webpage. The two neighbourhoods that remained in contention were identified from the table and their average rents displayed. It was found that the average rent in Exposition Park is nearly twice the average rent in Montecito Heights, implying that Exposition Park is a significantly more expensive neighbourhood. However, this does not automatically mean Montecito Heights is the better option. A factor to consider is the type of restaurant the stakeholder is interested in setting up. If, for example, a high-end fine dining restaurant needs to be set up, a neighbourhood that has a low average rent may not work. The reason for this is that such a neighbourhood would generally be home to people with lower incomes, and a high-end fine dining restaurant may not see a healthy influx of customers. On the other hand, if a fast-casual or casual dining restaurant needs to be set up, a high-rent neighbourhood would not be ideal, simply because the restaurant may not be able to afford the rented space. While average rent can point in the direction of the right neighbourhood, a final decision cannot be made without all the required information.