Extracting Knowledge and Making Decisions with Data Science

Author: Evan Wimpey

Date Published: August 14, 2023

Exciting news! The International Symposium on Forecasting took place in our very own Charlottesville, not far from our headquarters. This was great for Elder Research: forecasting is a large part of what we do as an analytics consultancy, so we could present some of our successful recent work and learn from other leaders in the field.

Logistics for attending a conference should be much easier with the conference site right in our backyard. Yet, with team members stationed at the conference site, others working from our downtown offices, and some commuting from their homes, coordinating a coffee meet-up posed a puzzle. There are plenty of popular coffee spots, but we aimed to find a venue that factored in everyone’s convenience. Naturally, we put our analytical acumen to work, leveraging data science to locate our optimal café.

Data Science for Café Selection: A Practical Example

Data science is a powerful problem-solving field, and for our purposes today, I’ll define it as the application of technology to extract knowledge from data in order to make useful decisions.

So, what data do we have?

For simplicity, let’s assume only three people are meeting, and that their addresses are all the data we have.

Evan will be coming from the conference, at the Darden School of Business, 100 Darden Blvd, Charlottesville, VA 22903.

John is coming from the Elder Research office, 300 W Main St, Charlottesville, VA 22903.

Caroline will be coming from her home at 1850 Yorktown Dr, Charlottesville, VA 22901. (Note: this is not Caroline’s actual address.)

Let’s look at that on a map using the Python folium package to generate the map and visualize our three points:
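A minimal sketch of how such a map might be built is shown here. The coordinates are rough placeholders chosen for illustration (they are assumptions, not geocoded values for the addresses above), and the names START_POINTS, map_center, and build_map are hypothetical. The folium import sits inside the function so that the coordinate helper can be used without folium installed.

```python
# Rough placeholder coordinates as (latitude, longitude).
# These are illustrative assumptions, NOT geocoded results for the real addresses.
START_POINTS = {
    "Evan": (38.053, -78.514),
    "John": (38.031, -78.485),
    "Caroline": (38.060, -78.490),
}


def map_center(points):
    """Return the average latitude and longitude, used only to center the map view."""
    lats = [lat for lat, _ in points.values()]
    lons = [lon for _, lon in points.values()]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def build_map(points, zoom_start=13):
    """Create a folium map with one marker per starting point."""
    import folium  # third-party package: pip install folium

    fmap = folium.Map(location=list(map_center(points)), zoom_start=zoom_start)
    for name, (lat, lon) in points.items():
        folium.Marker(location=[lat, lon], popup=name, tooltip=name).add_to(fmap)
    return fmap


center = map_center(START_POINTS)
```

Calling build_map(START_POINTS).save("coffee_map.html") would write an interactive map that can be opened in a browser.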

Considerations and Trade-offs in Identifying an Optimal Coffee Spot

In this case, we’ve got a triangle. Choosing a spot near the middle of our triangle would get us close to an “optimal” coffee spot. But what if another coworker decides to join from somewhere else? And what does it actually mean to have an “optimal” spot?

When automating a decision, we need a way to quantify it. There are several ways to define the best possible location (regardless of coffee quality), and each comes with trade-offs. Here are a few examples:

Minimize the total distance traveled from each starting point to the destination. This may leave some people traveling much farther than others.

Minimize the sum of the squared distances each person travels. This penalizes long trips more heavily, so it better avoids any one person traveling much farther than the others.

Minimize the maximum distance traveled, so that the person who must travel the farthest has the shortest possible trip. This is called the “minimax” solution.

Weight one traveler more heavily than the others. John is extra busy, so perhaps we want his travel to count more than Evan’s or Caroline’s. A weight parameter can penalize extra distance for one traveler compared to the others.
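To make these trade-offs concrete, the four criteria can be sketched as small scoring functions. Each takes the list of distances from every traveler to one candidate spot, and a lower score is better. The function names and the example distances are made up for illustration.

```python
def total_distance(distances):
    """Sum of everyone's travel distance."""
    return sum(distances)


def total_squared_distance(distances):
    """Sum of squared distances; long trips are penalized disproportionately."""
    return sum(d ** 2 for d in distances)


def max_distance(distances):
    """Distance of the farthest traveler; minimizing this gives the minimax spot."""
    return max(distances)


def weighted_distance(distances, weights):
    """Weighted sum; a larger weight makes that traveler's distance count more."""
    return sum(w * d for w, d in zip(weights, distances))


# Made-up distances (in km) from Evan, John, and Caroline to one candidate spot.
example = [1.0, 2.0, 4.0]

scores = {
    "total": total_distance(example),
    "squared": total_squared_distance(example),
    "minimax": max_distance(example),
    "weighted": weighted_distance(example, [1, 2, 1]),  # John counts double
}
```

Comparing these scores across candidate spots shows how the choice of criterion can change which spot wins.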

Applying the Minimax Method

We’re a considerate bunch here at Elder Research, so let’s use the minimax method.

As a simple search space that is sure to contain the optimal point, we define a rectangle bounded by the northernmost, southernmost, easternmost, and westernmost starting points:
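One way to build that search space is sketched here: take the minimum and maximum latitude and longitude of the starting points, then lay evenly spaced candidates across the rectangle. The coordinates are rough placeholders (assumptions, not geocoded values), and bounding_box and make_grid are hypothetical helper names.

```python
def bounding_box(points):
    """Return (south, north, west, east) for a list of (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), max(lats), min(lons), max(lons)


def make_grid(box, n=100):
    """Return n * n evenly spaced (lat, lon) candidates covering the box, edges included."""
    south, north, west, east = box
    lat_step = (north - south) / (n - 1)
    lon_step = (east - west) / (n - 1)
    return [
        (south + i * lat_step, west + j * lon_step)
        for i in range(n)
        for j in range(n)
    ]


# Rough placeholder coordinates (assumptions, not geocoded results).
starts = [(38.053, -78.514), (38.031, -78.485), (38.060, -78.490)]
box = bounding_box(starts)
grid = make_grid(box)
```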

Calculating Minimax Distances

As a simple starting point, we’ll use geographic distance (rather than driving time, say). We carve the search rectangle into a 100×100 grid and, at each of those 10^4 = 10,000 points, calculate the distance to the farthest starting address. This exhaustive search guarantees that we find the best of the candidate points, which approximates the true minimax location at the resolution of the grid. If it takes too long to compute, we could increase the grid step size or shrink the search space to the triangle connecting the three points.

(Note: In this very local example, the choice of distance metric doesn’t matter much, but the earth is roughly a sphere, and at larger distances the choice starts to matter more. We’ve used the Haversine formula to calculate distances between points.)
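For reference, the Haversine formula can be sketched with only the standard library. The mean earth radius of 6371 km is a common approximation and is an assumption here, as is the function name haversine_km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # commonly used mean radius; an assumption for this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# One degree of longitude along the equator, as a quick sanity check.
one_degree = haversine_km((0.0, 0.0), (0.0, 1.0))
```

One degree of longitude at the equator comes out to roughly 111 km, which is a handy check that the units and radians conversion are right.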

After checking every point within our grid, the optimal location is shown below in red:
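The exhaustive search itself can be sketched in a few lines. This version is deliberately metric-agnostic: it defaults to planar distance (math.dist) so that the toy example has an answer that is easy to check by hand, and a Haversine function could be passed as the distance argument for real coordinates. The name minimax_search and the toy points are illustrative assumptions.

```python
import math


def minimax_search(starts, candidates, distance=math.dist):
    """Return the candidate whose farthest starting point is as close as possible."""

    def worst_case(point):
        # Distance from this candidate to whichever traveler is farthest away.
        return max(distance(point, s) for s in starts)

    best = min(candidates, key=worst_case)
    return best, worst_case(best)


# Toy planar example (not real coordinates): a small triangle and a 5 x 5 grid.
starts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
candidates = [(x * 0.5, y * 0.5) for x in range(5) for y in range(5)]
best_point, best_value = minimax_search(starts, candidates)
```

In this toy case the winner is (1.0, 0.0), which sits exactly 1.0 away from all three starting points.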

Identifying Amenities

Great, we’ve found the optimal location! Unfortunately, that red marker is in the parking lot of an apartment complex, and I suspect no one there is ready to invite John, Caroline, and Evan in for coffee. Fortunately, Python allows us to find amenities near a given location with geopy and OpenCage (though you’ll need an API key to use this).

Here, we specify that we want to find all of the restaurants and cafés within 500m of our ideal location. We can be flexible here as well. Perhaps we want to search a wider radius because it’s worth it for Evan to drive a bit farther to find a gem of a coffee spot. Or perhaps we could always return the closest 5 amenities to our ideal location, even if they’re miles and miles away.
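The two selection rules described above (everything within a radius, or the closest few regardless of distance) can be sketched independently of any particular lookup service. The sketch assumes the amenities have already been retrieved and paired with their distance in meters from the ideal location. The names and distances are made up, and within_radius and closest_k are hypothetical helpers.

```python
import heapq


def within_radius(amenities, radius_m=500):
    """Keep (name, distance_m) pairs no farther than radius_m, nearest first."""
    return sorted((a for a in amenities if a[1] <= radius_m), key=lambda a: a[1])


def closest_k(amenities, k=5):
    """Return the k nearest (name, distance_m) pairs, however far away they are."""
    return heapq.nsmallest(k, amenities, key=lambda a: a[1])


# Made-up amenities and distances in meters from the ideal location.
amenities = [
    ("Cafe C", 650.0),
    ("Cafe A", 120.0),
    ("Bakery D", 2300.0),
    ("Diner B", 480.0),
]

nearby = within_radius(amenities)
nearest_three = closest_k(amenities, k=3)
```

Widening radius_m or raising k corresponds directly to the flexibility discussed above.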

Great! We’ve identified a challenge (finding a convenient location to meet for coffee), we’ve assessed data (three starting addresses), we’ve made some judgment calls on behalf of the users (minimax), and now we’ve identified our ideal meeting place:

One of the great things about this process is that we now have generalizable code that we can use again whenever we encounter this same problem, or variations of it. What if our colleague Lisa wanted to join, and she was coming from 2097 Inn Dr, Charlottesville, VA 22911, east of town? If we keep all the same judgments and assumptions, then we simply add her address and apply the algorithm:

Recognizing Limitations and Downsides

As with all data science projects, it’s important to note the limitations of our solution. We’ve already noted some trade-offs in how we think about “ideal” locations, but what are some other downsides?

Open-source tools

We’re relying on cafés that have been labeled in an open-source tool.

We could be missing some businesses, we could be including some closed shops, and we could be including a bar that was mistakenly labeled as a coffee shop (or at least we could hope so!).

Varying forms of measure

We’re strictly using distance as the crow flies.

This is great if crows are looking for a coffee shop, but this is often very different from driving distance, and certainly different from drive time and traffic considerations.

Large search area

We’ve got a relatively large search space, even though we know we won’t use the far corners.

That’s increasing the time it takes to run our process, which means it won’t scale as well when we’re looking for a restaurant for 30 different people.

Vague answer

We’re discretizing our search space into a 100×100 grid. What if that is too coarse and we need a more exact answer?
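One possible way to ease both the search-area and the coarseness concerns, offered as a sketch and not as the approach used in this post, is a coarse-to-fine search: evaluate a small grid, then repeatedly zoom in around the best point found so far. The example uses planar distance on toy points so that the answer can be checked by hand. The function name, the 5×5 grid, and the 10 rounds are all illustrative assumptions, and this heuristic is not guaranteed to reach the true optimum for every objective.

```python
import math

STARTS = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]  # toy planar points, not real addresses
BOUNDS = (0.0, 4.0, 0.0, 3.0)                  # (min_x, max_x, min_y, max_y)


def worst_case(point, starts):
    """Distance from point to the farthest starting point (planar, for the toy case)."""
    return max(math.dist(point, s) for s in starts)


def coarse_to_fine(starts, bounds, n=5, rounds=10):
    """Search an n x n grid, then re-center a smaller grid on the best point found."""
    lo_x, hi_x, lo_y, hi_y = bounds
    best = None
    for _ in range(rounds):
        step_x = (hi_x - lo_x) / (n - 1)
        step_y = (hi_y - lo_y) / (n - 1)
        candidates = [
            (lo_x + i * step_x, lo_y + j * step_y)
            for i in range(n)
            for j in range(n)
        ]
        if best is not None:
            candidates.append(best)  # keep the incumbent so a round never gets worse
        best = min(candidates, key=lambda p: worst_case(p, starts))
        # Shrink the box to one grid step on each side of the current best point.
        lo_x, hi_x = best[0] - step_x, best[0] + step_x
        lo_y, hi_y = best[1] - step_y, best[1] + step_y
    return best, worst_case(best, starts)


best_point, best_value = coarse_to_fine(STARTS, BOUNDS)
```

For this toy triangle the exact minimax point is (2, 1), at a distance of √5 from all three corners. The sketch gets close to it while evaluating only about 250 candidates, compared with 10,000 for a single 100×100 grid.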

Conclusion

Overall, this simple but relatable problem illustrates how to think about the challenges that organizations face. It’s important to understand the underlying challenge, and to be clear about the assumptions and choices that you make when designing a solution. Communication between the analyst and other stakeholders is key. Virtual communication is convenient, but if you need in-person communication then let us know and we’ll find you a convenient coffee shop!

Note: I cannot tell a lie; we didn’t use this to meet for coffee in Charlottesville, but it was trying to meet during the International Symposium on Forecasting that prompted these thoughts.