How a data scientist buys a car

Last year, I set out to buy a used car. This wasn’t just any car. It would be the vehicle for a long-planned, off-road expedition to Arizona and Utah with my brother, and for many other off-road adventures in the coming years.

I had set my sights on a 1998-2007 Toyota Land Cruiser or the Lexus LX470—both well regarded in off-road circles. However, due to a number of factors, there’s often limited market availability, which makes it difficult to put an accurate price on them. As I began my search, I found significant price variations from standard pricing guides—by as much as $10,000 in some cases.

What’s a data scientist to do? Get some data and build a model, of course.

What I learned not only helped me find the right car at the right price, it also offered important lessons that can be applied to almost any data initiative.

But first, the car.

I began my work on cars.com. I collected the data I needed using open source tools to extract four data points for each available vehicle: price, miles (in units of 10,000 miles), year and type (Land Cruiser or LX470).

The resulting data set consisted of 390 vehicles in the U.S., 245 of which were LX470s. The mileage ranged from 38,000 to 321,000 miles. Prices ranged from $7,000 to over $40,000.

[A side note for the data scientists reading this: I conducted this analysis in R and used the rvest package to scrape the data from cars.com.]
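Most of the effort in a project like this goes into turning scraped text into the four fields described above. The author worked in R with rvest; the Python sketch below is only an illustration of that cleaning step, not the original code. The dictionary keys (`title`, `price`, `mileage`), the raw string formats, and the sample listing (including its price) are assumptions invented for the example.

```python
import re

def parse_listing(raw):
    """Turn one raw listing (strings as they might appear on a web page)
    into the four model inputs: price, miles in units of 10,000, year, type.
    The input keys and string formats are hypothetical."""
    price = int(re.sub(r"[^\d]", "", raw["price"]))      # "$18,995" -> 18995
    miles = int(re.sub(r"[^\d]", "", raw["mileage"]))    # "96,000 mi." -> 96000
    year_match = re.search(r"\b(19|20)\d{2}\b", raw["title"])
    if year_match is None:
        raise ValueError(f"no model year found in title: {raw['title']!r}")
    # Only two vehicle types are in scope, so anything not badged Lexus
    # is treated as a Land Cruiser.
    vehicle_type = "LX470" if "LEXUS" in raw["title"].upper() else "Land Cruiser"
    return {
        "price": price,
        "miles_10k": miles / 10_000,   # the article measures miles in units of 10,000
        "year": int(year_match.group(0)),
        "type": vehicle_type,
    }

# Made-up listing, for illustration only.
example = {"title": "2006 Lexus LX 470", "price": "$18,995", "mileage": "96,000 mi."}
print(parse_listing(example))
```

Keeping the parsing in one small function makes it easy to re-run the whole pipeline when the listings change.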

Visualizing the data

In business intelligence arenas, visualizing data can help users better uncover patterns, so I created graphs to show the relationship between miles, price and type, and between year, price and type.

[Figure: Relationship of year, price and type.]
[Figure: Relationship between miles, price and type.]

By visualizing the data, it became quite apparent, as would be expected, that newer models are priced higher, and as mileage increases, prices decrease. The visualizations also showed that LX470s are typically priced slightly higher than Land Cruisers, and highlighted the outliers so they were easy to see.

But the question remained: What should I offer for any given vehicle?

Understanding pricing variations

My next step was to create a data model to more precisely understand the prices as they related to these data points. (The model isn’t perfect, but it does the job.)

It turns out that mileage, year and type explain about 87 percent of the variation in price, and half of the vehicles in the market will be within about $1,300 of the price that the model predicts.
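For readers who want to see what those two numbers measure, here is a small Python sketch. It assumes that "explains about 87 percent of the variation" refers to the model's R-squared and that the "half of the vehicles within about $1,300" figure can be read as a median absolute error; the article does not say exactly how either was computed. The prices in the example are made up and are not the author's data.

```python
from statistics import mean, median

def r_squared(actual, predicted):
    """Share of the variation in price that the predictions account for."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - avg) ** 2 for a in actual)                   # total
    return 1 - ss_res / ss_tot

def median_abs_error(actual, predicted):
    """Half of the listings miss by less than this amount, half by more."""
    return median(abs(a - p) for a, p in zip(actual, predicted))

# Made-up asking prices and model predictions, for illustration only.
actual = [10000, 15000, 20000, 25000, 30000]
predicted = [11000, 14000, 20500, 26000, 29000]

print(round(r_squared(actual, predicted), 3))   # 0.983
print(median_abs_error(actual, predicted))      # 1000
```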

All else being equal (vehicle condition and features, for example), each additional 10,000 miles reduces the price by about $620, and LX470s sell for $635 more than Land Cruisers.

The model also showed a roughly $3,000 price increase between 2002 and 2003 vehicles, and again between 2005 and 2006 vehicles, both of which are likely due to design improvements.

While the model predictions were close to the industry guide I consulted, there were instances where the estimates diverged. These differences showed the law of supply and demand at work.

[For the data scientists reading this: The model works like this: predicted price = $22,555 - (miles / 10,000) × $621 + year_price_adjustment + type_price_adjustment]
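The formula above translates almost directly into code. The Python sketch below makes two assumptions that the article does not state: the Land Cruiser is the baseline type (adjustment of $0), and the LX470 adjustment is the $635 quoted earlier. The per-year adjustments are not given in the article, so the caller has to supply one.

```python
BASE_PRICE = 22555          # intercept from the formula
PRICE_PER_10K_MILES = 621   # price reduction per 10,000 miles

# Assumption: Land Cruiser is the baseline; LX470 adds the $635 from the text.
TYPE_ADJUSTMENT = {"Land Cruiser": 0, "LX470": 635}

def predict_price(miles, year_adjustment, vehicle_type):
    """Predicted asking price in dollars.

    year_adjustment is the model-year adjustment in dollars; the article
    does not list these values, so it is passed in by the caller.
    """
    return (BASE_PRICE
            - (miles / 10_000) * PRICE_PER_10K_MILES
            + year_adjustment
            + TYPE_ADJUSTMENT[vehicle_type])

# Two otherwise identical LX470s that differ by 10,000 miles
# are predicted to differ by $621.
print(predict_price(90_000, 0, "LX470") - predict_price(100_000, 0, "LX470"))  # 621.0
```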

Putting it all in context

This exercise helped me narrow down pricing based on the market inventory at the time—giving me greater confidence as I negotiated with dealers. It also provided two important lessons that are broadly applicable to almost any enterprise’s data initiative.

Lesson 1: Focus on the data first

Models are only as good as the data you feed them, so it’s important to think about how you organize and manage the data. In this case, I spent several hours getting the data and preparing it for analysis. Creating the model only took a few minutes.

This ratio isn’t unusual. Data scientists are often jokingly referred to as “data janitors” because we spend 80 percent of our time cleaning up data.

As the amount of data enterprises collect has grown, so too has the importance of proper data management.

In fact, marketing scientists—progressive marketing leaders who use scientific methodology to effectively predict customer needs and prescribe solutions—identify effective structuring and management of data as a key pillar of their success. And they are nearly twice as proficient as traditional marketers in “architecting” the data so that it’s “digestible, dissectible, and easily retrieved” across their organizations.

Because of this, they’re better able to test new theories and to conduct more in-depth analysis than their peers.

How does your organization manage its data? Can you easily pursue new areas of inquiry? Or do you constantly need to start from square one? If it’s the latter, it may be time to review your data architecture.

Lesson 2: Keep it simple

There are times when squeezing every drop of performance out of a model is important enough that it’s worth having data scientists build “black box” models—models so complicated that they are difficult for anyone, other than their creators, to understand.

Trading is a great example of this. In an industry where one one-hundredth of a cent matters, the complexity of the model is irrelevant; performance is everything.

However, there can be negative consequences as complexity increases, as the “flash crash” in the U.S. stock market in May 2010 showed.

In analytics, enterprises need to balance performance with complexity.

In my case, I could have used a slightly better model for my car search that took into account the interaction between model year and mileage. However, the performance improvement would have been modest (only about 1.5 percent) while the model would have become more complex.
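One rough way to see the added complexity is to count the coefficients each version of the model has to estimate. The Python sketch below assumes that model year is treated as a category with its own adjustment (consistent with the year_price_adjustment term in the formula) and that all ten model years from 1998 to 2007 appear in the data. Neither assumption is confirmed by the article.

```python
def count_parameters(n_years, n_types, interaction=False):
    """Number of coefficients in a linear price model with a categorical year."""
    params = 1                 # intercept (baseline year and type, zero miles)
    params += 1                # price change per 10,000 miles
    params += n_years - 1      # one adjustment per non-baseline model year
    params += n_types - 1      # one adjustment per non-baseline type
    if interaction:
        # Year x mileage interaction: each non-baseline year gets its own
        # correction to the per-mile price change.
        params += n_years - 1
    return params

n_years = 2007 - 1998 + 1   # 10 model years, assuming all are present
print(count_parameters(n_years, 2))                    # 12
print(count_parameters(n_years, 2, interaction=True))  # 21
```

Under these assumptions, the interaction nearly doubles the number of coefficients to explain, which is the trade-off to weigh against the small gain in performance.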

Can your data scientists explain their data models in business terms? Will the potential performance improvement justify the time and effort required to manage the additional complexity? Will increased complexity open your enterprise to greater risk?

Generally speaking, it’s best for data scientists to use the simplest model that gets the job done.

My car search was successful. I purchased a 2006 LX470 with 96,000 miles at a local dealer, and the purchase price was within $1,000 of what the model predicted. And I just returned from the first of many off-road expeditions.

I’m sure this won’t be the last time I build a data model as part of an everyday endeavor. What about you? Have you ever wanted to leverage analytics to help guide a decision? As a data scientist, I’d love to hear your stories.

[Photo: An off-road adventure with my LX470, courtesy of analytics.]

5 responses to How a data scientist buys a car

  1. perell says:

    Indeed! Well done bro! Hope your car lasts for all the miles needed. I guess you could extend your search by comparing how other models are priced and demanded according to mileage, in order to perceive a sort of endurance/reliability by model. You could have started by looking for the best value instead of looking for specific models, so that you could find the perfect car! I am assuming that all cars are useful for daily usage…blah blah


  2. Arijit Biswas says:

    Derek,
    Thoughtful piece. I wanted to understand how we can build a similar model to facilitate my buying decision for a pre-owned car in India, where the sentiment about used cars is different from the US. What other market conditions should I include?
    Thanks
    Arijit


    • Derek Franks says:

      Hi Arijit,
      That’s a good question. I don’t know enough about the car market in India to be able to tell you exactly what you should look at, but I can give you a few things to consider.

      To begin with, I think there are likely some other factors that would have influenced pricing in my model. Location is a big one. Cars in the Northeast of the US tend to have issues with rusting over time and are often less expensive than cars in the much drier and warmer Southwestern US.

      Color is another factor. While I suspect the effect isn’t large, there are a couple of colors that were very popular 5-10 years ago that are a bit out of style today.

      Also, as you would expect, the condition of the individual vehicle is going to have a large influence.

      The overarching consideration, however, is how to incorporate these elements into your model. This is the problem that I ran into when building my model. There were some technical limitations from a web-scraping perspective that kept me from including color. I had some ideas regarding how to include location in the model, but it was going to be a significant amount of work – more than I was willing to invest in a “fun” project. And finally, factors like vehicle condition are really hard to quantify, and I didn’t have a way to incorporate them into my model.

      So ultimately, my advice is to start simple. Look at easily quantifiable elements like Make, Model, Year, and Mileage. You may find, like I did, that a simple model performs surprisingly well. I would start looking at other factors only if you build a simple model and decide that it doesn’t provide the level of performance that you need.


  3. Sebastian Wedeniwski says:

    Very efficient code for doing this analysis. Many years ago I did similar things in Java with much more effort. I expect more complex analysis will also be easier today, for example querying the feature list, then clustering and prioritizing the features by buying preferences such as off-road, reliability, comfort, etc.


  4. I’ve the model down on the Land Cruiser side; a 1997 model 80, now with 300K miles and going great. Finished a year-long trip from (@ourLongWayHome) #Alaska to #Argentina with family and across to the UK. No issues. Location regarding rust (as with older cars) and maintenance records are the other data points I’d try to add.

