IBM Data Science Professional Capstone

For the last couple of months, I have been attending various training programs: Venture Capital Investment for New Technologies, Digital Transformation, TRIZ, Design Thinking and lastly IBM Data Science Professional, which, unlike the other four, was a set of online courses. In the humble opinion of someone who has devoted more than half of his life to studying, the first and foremost thing you get out of any education is inspiration. It isn’t the technical skills you gain that matter most, but the inspiration that gives you at least a single “eureka” moment and drives you to put what you learned into practice. In this sense, my initial inspiration led me to the IBM Data Science Professional Certificate Specialization on Coursera, and hopefully what I learned throughout the courses will eventually let me become a data magician.

WHAT IS ALL THIS FUSS ABOUT?

Well, of course some of you probably have no idea what I’m talking about when I say Data Science. Maybe you have the misconception that it is just manipulating data in Microsoft Excel or a similar program, only in larger volumes. As you can guess, it isn’t that simple. Before we dig into the real picture though, I would like to state clearly that I’m nowhere near being a data scientist, nor do I hold any title that directly involves data analysis or manipulation. I will just briefly explain what Data Science is based on what I learned during the courses, and give you a short example of what a Data Scientist does, while fulfilling the final project requirement of writing a report about a data science project.

DATA SCIENCE

According to Wikipedia, Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. In my own, more generalized understanding, it is using data to better understand certain issues while employing knowledge in three areas: Domain/Business, Mathematics/Statistics, and Computing/IT. First, you define a problem and where you want to be at the end of the project. For instance, hospitals can’t forecast what percentage of patients will stay in and for how long, which results in undersupply or oversupply. Defining such a problem requires Domain/Business knowledge. After defining the problem, we decide what sort of data we need, gather it, analyze it, clean out the unnecessary parts and prepare it for a statistical model that can help us forecast how long a new patient is likely to stay in the hospital. This is where you use Mathematics/Statistics. As far as I understand, in the old days the computing power just wasn’t there, the digital footprint was limited, and projects ended once you had a problem and a shallow statistical model. These days, Computing/IT is so powerful that it serves as the third dimension: it allows you to visualize the data, analyze it better, and build the best model (or even let the computer decide on the best model) by testing several and using the one that works best. Several tools are commonly used for this, such as Python, R, Hadoop and scikit-learn. The hottest one nowadays is Python, a programming language used worldwide, and almost all of the IBM courses were centered around it.

A couple of years back, I was pursuing an MBA degree in Shanghai, China. I had a business idea of opening a Turkish fast food restaurant chain that would rapidly grow into different parts of the country. Frankly, F&B was the only know-how I thought we Turks could sell to the rest of the world. Of course, both Turkey and China being hubs of no-value-added goods had a great influence on that. At the time, I had no clue about Data Science as the topic wasn’t this hot, so a Chinese classmate and I started looking into the best products we could commercialize in a fast food restaurant, and eventually came up with different ideas using fairly conventional techniques. At the end of the day, we concluded that there was almost no room for a lean start to this project, as rent prices in Shanghai were nuts, to say the least. Put together with my family business concerns back home, we had to scrap the project before it even got off the ground.

Four years later, when the IBM capstone project required using the Foursquare API to pull data from the Foursquare database for a data science project, this was the very first thing that came to my mind. I wondered if I could analyze the fast food restaurant environment in Shanghai to see where I could open a restaurant. This time though, limited data was a big obstacle: as most Chinese don’t use Foursquare at all, my project would be the fruit of data from a bunch of western tourists and wouldn’t reflect the real picture. Hence, I made Los Angeles, California, United States my new playground.

WHERE TO INTRODUCE A NEW RESTAURANT CONCEPT IN LOS ANGELES?

Before I start, let me explain the methodology I will be following for this project. IBM has its own data science methodology, which you can see below, with ten stages leading up to a deployed data science report. As I have become very familiar with it throughout the course, I’ll try to follow it.

BUSINESS UNDERSTANDING AND ANALYTIC APPROACH

I think what I would like to do is pretty clear from what I explained earlier, but to restate it one more time: I’m someone who wants to introduce a new fast food restaurant concept, and I would like to find which areas of Los Angeles are more suitable for this. By suitable, I mean the option most likely to lead me to financial success. One of the reasons why I was so confused while doing this in Shanghai was that there was no way of knowing how fast food restaurants were scattered across the areas of the city, and on what basis. You can see the same thing in LA in the scatter plot of LA neighborhoods below.

DATA REQUIREMENTS, DATA COLLECTION, AND DATA UNDERSTANDING

First of all, I needed the names and coordinates of all neighborhoods in LA. Then, I needed the ratio of fast food restaurants over other restaurants to see where fast food culture is better established. Finally, to better understand the income levels of those areas, I needed the median rent price for each neighborhood, which I take as an income level indicator, to see whether there is a correlation between fast food restaurant frequencies and rent prices. I basically have two data sources. The first is the USC Price Housing Price Open Data, where I have median rent prices for 2011-2016 indexed by neighborhood name. This data also includes coordinates for each neighborhood. The second is Foursquare, from which I can import venue categories for different neighborhoods using their API.

As you can see in the table above, the rent price data holds median prices for different locations, along with the coordinates of each neighborhood. On the other side, we make calls to Foursquare to get the top venue categories available around these coordinates. To avoid confusion, we set a limit on the number of results and on the radius to cover. I set them to 10k results and a radius of 10k meters.
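To make the Foursquare call a bit more concrete, here is a minimal sketch of how such a request could be put together. It assumes the legacy v2 "explore" endpoint that was common at the time; the endpoint path, the parameter names, the version date and the helper name `build_explore_url` are assumptions for illustration and are not taken from the notebook. The sketch only builds the URL and does not send anything, and the service may enforce its own caps on `limit` and `radius`.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; check the current Foursquare docs before relying on it.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=10000, limit=10000):
    """Build a venue search URL around one neighborhood's coordinates.

    `version` is a placeholder date string; `radius` is in meters.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude and longitude as one comma-separated value
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Example with dummy credentials and rough downtown LA coordinates.
url = build_explore_url(34.05, -118.24, "MY_ID", "MY_SECRET")
print(url)
```

One URL like this would be built per neighborhood, and the JSON that comes back would then be flattened into rows of neighborhood and venue category.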

DATA PREPARATION

In the Data Preparation section, I cleaned unnecessary columns out of the data, such as the column that just shows the name “Rent Price”, as we already knew all the data was about rent price. I separated the coordinates into latitudes and longitudes to use GeoCoder to plot a map with markers, where we can see the exact locations of the clusters we were going to make using Machine Learning. The data coming from Foursquare had 24925 rows and 7 columns. I converted venue category names to 0s and 1s using a technique called “One Hot Encoding” to make things easier for Machine Learning algorithms, and filtered the categories to keep only those that contain the word “Restaurant”. For each neighborhood, I summed the restaurant categories and divided the Fast Food Restaurant frequency by that sum to find the ratio of Fast Food Restaurants over other restaurant categories. Then I concatenated all the data into one single DataFrame, where I could also see the rent prices.
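The encoding and ratio steps can be sketched roughly as follows. In pandas, one hot encoding is typically done with `pandas.get_dummies`; this sketch uses only the standard library so that it runs anywhere. The venue rows are made-up toy data, the function names are hypothetical, and taking all restaurant venues (fast food included) as the denominator is an assumption about how the ratio is defined.

```python
from collections import defaultdict

# Made-up (neighborhood, venue category) rows standing in for the Foursquare results.
venues = [
    ("Palmdale", "Fast Food Restaurant"),
    ("Palmdale", "Mexican Restaurant"),
    ("Palmdale", "Coffee Shop"),
    ("Palmdale", "Fast Food Restaurant"),
    ("Whittier", "Sushi Restaurant"),
    ("Whittier", "Fast Food Restaurant"),
    ("Whittier", "Park"),
    ("Whittier", "Thai Restaurant"),
    ("Whittier", "Italian Restaurant"),
]


def one_hot(category, categories):
    """Return a 0/1 vector with a single 1 at the position of `category`."""
    return [1 if category == c else 0 for c in categories]


def fast_food_ratios(rows):
    """Share of Fast Food Restaurants among all restaurant venues, per neighborhood."""
    # Keep only categories whose name contains the word "Restaurant".
    restaurant_cats = sorted({cat for _, cat in rows if "Restaurant" in cat})
    totals = defaultdict(lambda: [0] * len(restaurant_cats))
    for hood, cat in rows:
        if cat in restaurant_cats:
            encoded = one_hot(cat, restaurant_cats)
            # Summing the one-hot rows gives category counts per neighborhood.
            totals[hood] = [a + b for a, b in zip(totals[hood], encoded)]
    ff = restaurant_cats.index("Fast Food Restaurant")
    return {hood: counts[ff] / sum(counts) for hood, counts in totals.items()}


ratios = fast_food_ratios(venues)
print(ratios)
```

With the toy rows above, Palmdale has two fast food venues out of three restaurants and Whittier has one out of four; the coffee shop and the park are filtered out before the ratio is taken.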

MODELING

My initial hope was that I could catch a correlation between housing prices and the number of fast food restaurants available, and come up with a simple equation. Looking at the scatter plot, it was pretty obvious that there was no clear sign of any correlation. Then I skimmed through the available Machine Learning models, and it seemed that unsupervised clustering with K-means would be the most appropriate one to see the real picture, while visualizing on one single map where fast food restaurants get the biggest slice of the cake. For those who wonder what K-means is: it is probably the most popular unsupervised machine learning technique. You only set how many clusters you want; the algorithm then starts with a random centroid for each cluster, assigns each data point to its nearest centroid, and optimizes the locations of the centroids through iterative calculations.
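For the curious, the K-means loop described above can be sketched in a few lines of plain Python. This is a toy illustration and not the notebook's code, which most likely relied on a library implementation; the function name `kmeans`, the seed, and the (rent, ratio) values are all made up. Note that rent values are in the thousands while ratios are below 1, so with unscaled features the Euclidean distance is dominated by rent, which is one plausible reason for clusters to come out as rent bands.

```python
import random
from math import dist


def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # nothing moved, so the algorithm has converged
            break
        centroids = new_centroids
    return labels, centroids


# Made-up (median rent, fast food ratio) pairs for six neighborhoods.
neighborhoods = [
    (1200, 0.67), (1250, 0.05), (1300, 0.20),
    (2000, 0.50), (2100, 0.02), (2200, 0.13),
]
labels, centroids = kmeans(neighborhoods, k=2)
print(labels)
```

In this toy run the two clusters split purely along rent, regardless of the fast food ratio. Scaling both features to a comparable range before clustering would give the ratio more say.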

I asked the algorithm for 5 clusters, and it ended up clustering the data on the basis of rent price. Let’s examine the standout neighborhoods of each cluster one by one:

CLUSTER 0

In the $1600-$1877 median rent price interval, Acton, Ramona, San Dimas, Chatsworth, and Hidden Hills come into prominence with Fast Food Restaurant ratios over other restaurants of 50%, 22.7%, 22.7%, 10.7% and 10% respectively.

CLUSTER 1

In the $1171-$1347 median rent price interval, SE Antelope Valley, Palmdale, Northwest Palmdale, and Whittier come into prominence with Fast Food Restaurant ratios over other restaurants of 67%, 30%, 20% and 20% respectively.

CLUSTER 2

In the $1369-$1566 median rent price interval, Northwest Antelope Valley, Charter Oak, Glendora, Norwalk, and La Verne come into prominence with Fast Food Restaurant ratios over other restaurants of 33%, 25%, 25%, 13% respectively.

CLUSTER 3

In the $1914-$2332 median rent price interval, Leona Valley, La Habra Heights, and Westlake step forward with Fast Food Restaurant ratios over other restaurants of 50%, 13% and 13% respectively.

CLUSTER 4

In the $835-$1151 median rent price interval, Catalina Island, Lancaster, Lynwood, and East Compton stand out with Fast Food Restaurant ratios over other restaurants of 33%, 14.8%, 9.7%, and 9.4% respectively.

FINAL PICTURE

EVALUATION

To get a better visual image of what’s going on, I plotted a choropleth map using the Folium library for Python to see the big picture. A quick glance at the map shows that the Fast Food/Other Restaurants ratio varies from 2% to 65%. When the clusters are analyzed, Cluster 1 has a stronger fast food restaurant presence than the other clusters, as eight of its neighborhoods exceed the 10% Fast Food Restaurant ratio threshold. If you look closely at the map above though, the downtown area, where population is supposed to be higher, falls outside every cluster. This is mainly because Foursquare has more non-restaurant venues in those areas, and even a venue limit as high as 10k isn’t enough to capture the restaurants there once everything else is filtered out, which limits our understanding of the fast food environment in LA. I think the best approach is to start from Clusters 1 and 4, which have a higher presence of Fast Food Restaurants, analyze the non-mathematical facts of their locations top-down, and see whether social factors would change the decision. For instance, the top location in Cluster 1 is SE Antelope Valley, which probably has something to do with it being a tourist location. Opening the first restaurant far out of the city at a tourist location can be a very tricky business decision. All in all, the map above and the clusters give a good idea of where to start the search in terms of fast food culture and comparative rent costs.

In light of this, I would like to end my report with a funny saying by the Danish football manager Ebbe Skovdahl:

“Statistics is like mini-skirt. It gives you good ideas but hide the most important things.”

Sources:

Jupyter Notebook: https://nbviewer.jupyter.org/github/berkaytekin…FF.ipynb

USC Price Housing Prices Dataset: https://usc.data.socrata.com/Los-Angeles…

Foursquare API: https://developer.foursquare.com

Los Angeles Neighborhood Boundaries: https://usc.data.socrata.com/dataset/Los…-Map/r8qd-yxsr

Hope you liked the post. For all your questions and whatnot, you can either e-mail me or contact me through my social media accounts.

Mail: bt@berkaytekin.com

Instagram: http://www.instagram.com/betekin