

A guide to reinforcement learning for dynamic pricing

Jeremy Bradley, Chief Data Scientist at Datasparq

Note to reader: This article is part of our Guide to AI-Powered Dynamic Pricing. You can download the full guide for free here. Happy reading!

Our Data Science team were lucky enough to get into Reinforcement Learning (RL) [1] several years ago, when we started looking at optimal pricing for car parking spaces at an international airport [2].

It’s an interesting problem: you have a fixed resource, the car park spaces themselves, and you have customer demand which varies over time—largely with demand for airline travel. You also give people the ability to book spaces well ahead of time. What price do you set?

Imagine you run such a car park and have the power to set prices. If you wait a couple of days, you might get a higher price from another customer for the same date. If you lower your price for that date because you haven’t sold the spaces and the date is approaching, your customers might quickly learn that waiting is the best policy and not book when they first check the car park website.

Furthermore, customers who book for a weekend might well be families or tourists looking for a quick break who are perhaps booking with a low-cost airline. If they’ve paid £67 for a return trip to Majorca, they’re likely to think that £75 for a weekend parking spot is a little on the high side. Whereas your business traveller might think that is entirely reasonable—or at the very least, it’s not an expense they’ll be picking up.

The ability to learn what price works for a given demand, for a given day of the year, with a given booking lead time is a good example of dynamic pricing.

Dynamic pricing is a growing area of data science and a long-standing problem in operations research, and one that is enjoying much more attention as marketplaces themselves become more dynamic.

What makes these problems amenable to reinforcement learning is the vast event data sets that accumulate around customer interactions. These give an RL tool the ability to deduce the demand level (as well as, potentially, need and urgency) for a particular cohort of customers and thus find a customer-specific price that will lead to a well-used resource.

Why dynamic pricing?

Before we get into reinforcement learning and how it can power dynamic pricing, we need to understand a few use cases for why prices might need to change in more general settings.

Increased demand/reduced supply. As in any economic model, if supply does not match demand for a product then the market may be able to sustain a higher price.

Expiring products. Anything that has a limited sale window is likely to attract a varying price, whether it’s a hotel room for tonight, a loaf of bread that’s about to reach its expiry date, or a plane ticket on a flight. None of these can be sold after the window has passed.

Seasonality. Some products are much more desired at certain times of day or times of year. Some may not be available at all at the wrong time. Christmas decorations typically get discounted heavily after Christmas as retailers don’t want to stock them for the subsequent year waiting for them to sell.

Cost of production/supply. Some products cost more to produce on demand than on the normal production run. For example, Brompton, the folding bike manufacturer based in London, typically has a 5-week lead time for a normal folding bike but that can be reduced to 5 days with tracked delivery if you are prepared to pay a premium for a model. The implication is that there is a reason to charge a varying price more closely linked to the cost of production (and possibly the customer’s willingness to pay for quick delivery—see below).

Price by customer. Possibly the most contentious reason to charge a dynamic price—certainly in mass retail—is the act of setting a different price for the same product to different customers. Sometimes this is veneered over by selling in different markets, e.g. electric cars in the US versus the EU market. But within a market, it may be desirable to identify customers who are time-limited and are not looking at competitor products and may be willing to pay more for a standard product.

Varying a price to reflect super-high demand or capture a high-need customer is not always the right thing to do. An airline charging thousands of pounds for seats on the last flight that makes it home before an enforced quarantine period starts, or a taxi company surging prices in the face of a civil emergency, can suffer considerable long-lasting reputational damage.

Important to consider—is it the ethical thing to do?

Constructing an RL model... and the mistakes to avoid!

Here’s an example from another dynamic pricing project we’ve been working on recently. It’s a marketplace environment where customers are exposed to a set of agents offering a service and we have to work out what price to set. The marketplace takes a cut of the fee paid and the service providers get the rest.

The basic features

Customer demand is seasonal and varies dramatically across the day—it is also price sensitive.

The service providers are a limiting resource—if they’re working for a customer they can’t be doing other jobs for other customers.

The higher the price, the more the service providers want to offer the service and the quicker a match is made.

The lower the price, the better for the customer—of course—but also the longer they may be kept waiting for a match with a possible service provider and indeed they may get no match at all.

How do you set prices in such a setting, so that you offer a fair price to the service providers, but don’t “fleece” the customers? Well, you can use reinforcement learning. We don’t initially know the best price, but we can set one and see what impact it has on the market; then we can learn and adjust accordingly in the next epoch.

If you read any reinforcement learning textbook, you’ll find a diagram very much like this describing the basic principles. 

The reinforcement learning loop. CC BY-SA 4.0 Jeremy Bradley

In this instance, the agent is the marketplace, the action is the ability to set a price and offer it to the customer, the state is the state of the marketplace (I know that’s self-referential, but we’ll revisit that) and the reward is a measure of success from having made a successful match between customer and service provider. 

Again we’ll come back to this because it’s not straightforward and it will drive the whole optimisation of the system. There can be a choice between many possible actions reflecting the possible prices that may be offered in the marketplace. Each action may lead to a distinct change of state and a different reward associated with that change.
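In code, that loop is only a few lines. Here is a minimal sketch of the agent–environment cycle, with a toy environment standing in for the real marketplace (the price grid, match probabilities, and demand dynamics are all invented for illustration):

```python
import random

# Hypothetical price grid the agent can choose from (an assumption for illustration)
PRICES = [5.0, 7.5, 10.0, 12.5, 15.0]

def step(state, price):
    """Toy environment: higher prices match less often but pay more.
    These dynamics are placeholders, not the real marketplace model."""
    demand = state["demand"]
    match_prob = max(0.0, min(1.0, demand - price / 20.0))
    matched = random.random() < match_prob
    reward = price if matched else 0.0  # reward only on a successful match
    next_state = {"demand": max(0.1, min(1.0, demand + random.uniform(-0.1, 0.1)))}
    return next_state, reward

state = {"demand": 0.5}
total_reward = 0.0
for _ in range(1000):                   # the agent-environment loop
    price = random.choice(PRICES)       # action: set a price (random policy here)
    state, reward = step(state, price)  # environment returns next state + reward
    total_reward += reward              # a learner would update its policy here
```

A real agent would replace the random price choice with a learned policy and use the reward signal to improve it.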

So what does the Agent do? Well, the marketplace learns from successful matches and unsuccessful matches and adjusts the price (or not) accordingly as each new opportunity to make a match comes around. There are maybe 20,000 transactions a day and prices are set in near real-time.

I think the three most important aspects of doing this kind of activity with reinforcement learning can be summarised:

What’s in a state?

A quick primer: Reinforcement learning comes out of a branch of mathematics called dynamic programming which solves an underlying Markov Decision Process for a value function. 

Deriving a value function gives you a policy which in turn tells you which actions you should take in a given state.

That was lots of jargon so let’s break that down—a value function computes the value of each state of your system under a given action policy; knowing which states are high value is important when discovering a better policy. A policy is a pre-defined strategy for choosing which action to take from a given state, ideally to maximise your accumulated lifetime reward. A Markov Decision Process or MDP is a state transition system which defines the states, actions, rewards and next states for any system.

What’s this got to do with states? If the underlying mathematical model (MDP) has too many states, then the dynamic programming algorithm will take forever to converge (or at least a very long time). In other words, you’ll never get a solution, and you won’t know what action to take (or, in this case, what price to set) to improve your overall lifetime reward.
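To make value functions and policies concrete, here is a toy value-iteration sketch over a hand-built three-state MDP. The states, actions, rewards, and transitions are invented for illustration; this is not the marketplace model:

```python
# Toy MDP: for each (state, action), a (next_state, reward) pair.
# All numbers are invented for illustration.
mdp = {
    "low_demand":  {"price_low": ("mid_demand", 1.0),  "price_high": ("low_demand", 0.0)},
    "mid_demand":  {"price_low": ("mid_demand", 2.0),  "price_high": ("high_demand", 3.0)},
    "high_demand": {"price_low": ("mid_demand", 2.0),  "price_high": ("high_demand", 5.0)},
}
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality update.
V = {s: 0.0 for s in mdp}
for _ in range(200):
    V = {s: max(r + gamma * V[s2] for (s2, r) in mdp[s].values()) for s in mdp}

# The greedy policy reads the best action straight off the value function.
policy = {s: max(mdp[s], key=lambda a: mdp[s][a][1] + gamma * V[mdp[s][a][0]])
          for s in mdp}
```

Value iteration repeats the Bellman update until the values stop changing; the policy then falls straight out of the converged value function, which is exactly the "value function gives you a policy" relationship described above.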

It is very easy to inadvertently give yourself an enormous state space. Our marketplace spans 850 locations across the world, we want to set prices every 15 minutes, and we have 20 levels of customer demand, 20 levels of service provider supply, and 20 possible price bands. That’s a little over 4.5 billion states right there!
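For the curious, the arithmetic behind that figure works out as follows, assuming one pricing decision per 15-minute slot across a week of seasonality (the weekly slot structure is my assumption about how the time dimension is discretised):

```python
locations = 850
slots_per_week = 7 * 24 * 4   # 15-minute pricing slots across one week = 672
demand_levels = 20
supply_levels = 20
price_bands = 20

# Every combination of location, time slot, demand, supply and price band
# is a distinct state in the naive tabular formulation.
states = locations * slots_per_week * demand_levels * supply_levels * price_bands
print(f"{states:,}")  # → 4,569,600,000
```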

Surprisingly, this is not necessarily a disaster, as you can represent your state space approximately using a deep neural network or some other ML model, but it definitely adds to the complexity, as we’ll see when we try to learn with it.

A simple 3-state MDP with actions in brown and rewards +5 and -1 marked. CC BY-SA 4.0 Waldo Alvarez

What does success look like?

Success is defined by your reward. You might think that success is easy to specify. Why not choose revenue, or profit—and watch the dollars roll in? Well, yes and no. Let’s imagine we used revenue in our marketplace.

If I have some very rich customers, then I can set my prices very high and I may initially even make a lot of revenue. However, I have inadvertently made my market very brittle and subject to the whims of a few well-off customers who are now dominating the market. Every other customer is priced out and it’s unlikely that they’ll come back.

So, let’s lower the prices so that I open myself up to more of my customer base. The prices are still high. The revenue is improved because the customer base is larger, but there are service providers who are now turning down small jobs that are too low in value in the hope that a higher-value job will come along shortly. Many customers with smaller jobs are left waiting for an unacceptable period of time before they get a price match. But that is ignored by the reward metric as it only cares about revenue in this case.

Again this is a short-term win. Those customers left hanging around probably never return to your marketplace even when they have larger jobs to offer. In the end, low customer satisfaction kills your business.

What we found is that a hybrid reward metric which captures both financial success and customer satisfaction will allow you to find a sweet spot in your problem — keep the service providers happy with a decent revenue while keeping the customers happy with a good performance metric.
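One simple way to construct such a hybrid reward is a weighted blend of revenue and a satisfaction proxy such as time-to-match. The weights, normalisation constants, and satisfaction measure below are illustrative assumptions, not our production formula:

```python
def hybrid_reward(revenue, wait_minutes, max_wait=30.0, alpha=0.7):
    """Blend financial success with a customer-satisfaction proxy.

    alpha weights revenue against satisfaction; the normalisation
    constants are placeholder assumptions for illustration.
    """
    satisfaction = max(0.0, 1.0 - wait_minutes / max_wait)  # 1.0 = instant match
    norm_revenue = min(1.0, revenue / 100.0)                # assumed revenue cap
    return alpha * norm_revenue + (1.0 - alpha) * satisfaction

# An expensive-but-slow match versus a cheaper-but-fast one:
slow_rich = hybrid_reward(revenue=90.0, wait_minutes=25.0)
fast_fair = hybrid_reward(revenue=60.0, wait_minutes=2.0)
```

Note how the blend can prefer the cheaper, fast match over the expensive, slow one: exactly the trade-off a pure revenue reward ignores.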

Can it learn?

Reinforcement learning works best when it learns, as you might expect. Successful learning comes from convergence of the algorithm and gives you a consistent result for the value of being in a particular state.

If you’re using an explicit state space model (sometimes called tabular learning, so not a neural network) you need enough training data or learning episodes to visit each state many thousands of times to get converged learning. (Converged learning refers to the model having a consistent picture of what value the most important states and actions have).

Now we know why keeping track of the state space is so crucial. If you’re using tabular learning, which is simple and has many benefits, you have to have a pre-training set that is 100x or 1000x the number of states in the state space or at least the number of states you are likely to visit. 

Alternatively, if your model is learning from scratch, your learning feedback loop needs to visit all of those states 100s or 1000s of times to get converged learning.
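A cheap sanity check in tabular learning is to count state visits alongside the value estimates. Here is a sketch with a placeholder environment (the dynamics and rewards are invented; only the visit-counting idea matters):

```python
import random
from collections import defaultdict

visits = defaultdict(int)    # how often each state has been seen
Q = defaultdict(float)       # Q[(state, action)] value estimates
alpha, gamma = 0.1, 0.9      # learning rate and discount factor

def toy_env(state, action):
    """Placeholder dynamics: a handful of demand states, reward tied to the action."""
    next_state = random.randint(0, 4)
    return next_state, float(action)

state = 0
for _ in range(5000):
    visits[state] += 1
    action = random.randint(0, 2)   # random behaviour policy for illustration
    next_state, reward = toy_env(state, action)
    best_next = max(Q[(next_state, a)] for a in range(3))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

# States visited only a handful of times have untrustworthy value estimates.
undertrained = [s for s, n in visits.items() if n < 100]
```

If `undertrained` is non-empty at the end of training, the corresponding prices are being recommended from essentially unlearned states.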

If you’re using an approximate state deep learning model for very large state spaces, you won’t be surprised to learn that many gigabytes of training data are needed to train that model effectively.

Certainly, if your training data only visits each state once or twice, your learning is going to be very poor and your recommended prices or actions are not going to be great!

Wrapping up

I’ve really only touched upon reinforcement learning as a mechanism for learning optimal prices.

There are lots of benefits to getting it right—an algorithm that can learn autonomously and present the right price for even new unseen scenarios; an algorithm that can even look after customer satisfaction if the correct reward structures are put in place to guide learning.

There are many ways you can get yourself in a tangle. But keeping the state space at a reasonable size that can match the available training data you have (or are likely to encounter during learning) is a great start and defining an appropriate reward metric will help you drive your learning in the right direction while not making embarrassing decisions along the way.

This introduction is still only the tip of the iceberg but it reflects some of the key decisions we had to take when taking reinforcement learning from the textbook [1] (which is quite excellent by the way!) to a real-life project.

Other significant considerations include...

Exploration versus exploitation—at some point you’ll need to take a suboptimal decision to see whether it leads to a better policy that was previously unknown. This can trigger some interesting conversations with stakeholders!

Discount factors—how much do you value reward that can’t be realised for weeks or months versus reward you can access today?

Learning rates—how quickly should your learning system adapt to new evidence? If your data is noisy, you don’t want it adapting to a blip by mistake.

Algorithm—where to start: tabular or approximate representation, Expected-SARSA or Q-learning? There are many techniques, each with pros and cons, but in general starting simple and building up the complexity as you need it is good advice.

Off-policy versus on-policy learning—is your system able to learn as it discovers new states or is it going to have to be pre-trained offline? Is your system able to implement the decisions it takes or is it only recommending the decisions to an operator who can decide to adopt or decline them? All these factors influence whether you need to use an on- or off-policy approach.

Time-varying rewards and demand—probably the hardest issue of all in reinforcement learning is time-heterogeneity. 

Your environment may change over time and the reward you see one day may differ the next day for the same action. If this means a change in the statistical properties of your system (rather than just noise or differences in samples from the same distribution), then your RL system needs to know to adapt to that. Censored memory and rapid learning rates are the order of the day here.
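Several of these knobs appear together even in the simplest update rule. Here is a sketch of an epsilon-greedy Q-learning step with the exploration rate, learning rate, and discount factor made explicit (the values are illustrative defaults, not recommendations):

```python
import random

epsilon = 0.1   # exploration: fraction of decisions taken at random
alpha = 0.05    # learning rate: how fast new evidence overwrites old estimates
gamma = 0.95    # discount factor: how much future reward counts today

def choose_action(Q, state, actions):
    """Epsilon-greedy: mostly exploit, occasionally explore a suboptimal price."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit

def q_update(Q, state, action, reward, next_state, actions):
    """One off-policy Q-learning update. (Expected-SARSA would average over
    the policy's action probabilities instead of taking the max.)"""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

Tuning epsilon trades exploration against exploitation, alpha controls how fast a noisy blip can move the estimates, and gamma sets how far-sighted the pricing policy is.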

All of these features need addressing at one point or other in practice. Let me know if you’d like to see a Part II on some of these more technical aspects of the practical implementation of reinforcement learning.


[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning, An Introduction. 2nd ed. MIT Press, 2018.

[2] Andreas Papayiannis, Paul V. Johnson, Dmitry Yumashev, Peter Duck. Revenue management of airport car parks in continuous time. IMA Journal of Management Mathematics. Jan 2019.


This article was originally published on Datasparq's technical blog on Medium.
