Bandit Pricing Fundamentals

Learning the optimal price from purchase decisions alone, balancing the cost of experimentation against the value of information.

The Problem

A seller faces $T$ customers arriving one at a time. For each customer, the seller posts a price and observes only whether the customer bought—not how much the customer was willing to pay. Setting the price too high means lost sales; setting it too low means leaving money on the table. But the seller does not know the demand curve, so every price is simultaneously a revenue opportunity and an experiment.

Rothschild (1974) first posed this as a two-armed bandit problem and showed that a myopic seller can get stuck at a suboptimal price forever, because exploiting the currently best-looking price generates no information about alternatives. The challenge is to balance exploration (trying new prices to learn demand) against exploitation (choosing the price that looks best so far).

Regret as a Metric

Definition — Cumulative Regret

If the optimal fixed price earns expected revenue $r^*$ per customer, the cumulative regret of a pricing policy after $T$ rounds is:

R(T) = T \cdot r^* - \sum_{t=1}^{T} r_t

where $r_t$ is the revenue earned in round $t$ . Regret measures the total revenue lost due to demand uncertainty.

The central question is: how fast can $R(T)$ grow as a function of $T$ , and how do structural assumptions about demand affect this rate? The answer ranges from $\Theta(\sqrt{T})$ with no assumptions to $O(\log T)$ with strong structural knowledge.

No Structure on Demand

Kleinberg and Leighton (2003) study the simplest version: customers arrive with valuations drawn i.i.d. from an unknown distribution on $[0, 1]$ , and the seller posts a price from a continuous set. The demand curve $D(p) = \Pr(v \geq p)$ is completely unknown.

Minimax Regret Without Structure (Kleinberg and Leighton, 2003)

The minimax regret for pricing with no structural assumptions on demand is $\Theta(\sqrt{T})$ . No algorithm can do better in the worst case.

The key tension: posting prices far from $p^*$ is informative about demand but costly in revenue; posting prices near $p^*$ is cheap but uninformative. After 10,000 customers, the seller loses roughly 100 customers’ worth of revenue to uncertainty—an unavoidable cost of learning.

Parametric Demand

Broder and Rusmevichientong (2012) consider a parametric demand model $d(p; z) = \Pr(V \geq p)$ , where $z \in \mathbb{R}^n$ is an unknown parameter. The seller observes binary purchase decisions and updates a maximum likelihood estimate of $z$ .

Parametric structure alone does not break the $\sqrt{T}$ barrier. In the worst case, all demand curves pass through the same point at the optimal price, making that price uninformative.

The picture changes under a well-separated condition: when the Fisher information $I(p, z)$ is bounded below for all prices, every purchase observation distinguishes demand parameters. Under well-separation, a pure greedy policy achieves $O(\log T)$ regret. After 10,000 customers, the revenue loss drops to about 9 customers’ worth.

The Value of Well-Separation

Consider two demand curves that cross at $p = 50$ : at that price, both yield identical purchase rates, so testing there distinguishes nothing. Under well-separation every price is informative, and dedicated exploration rounds become unnecessary.

Interactive Explorer

Use the controls below to visualize how regret scales under different structural assumptions and compare bandit algorithms on a segmented demand model.

Theoretical Regret Rates

Toggle curves on the log-log chart to compare how different structural assumptions affect the growth of cumulative regret.

Live Bandit Simulation

Watch epsilon-greedy, UCB1, and Thompson Sampling algorithms learn the optimal price in real time on a segmented demand model.

High-Dimensional Features

Javanmard and Nazerzadeh (2019) extend the setting to products described by $d$ -dimensional feature vectors $x_t$ . The market value is $v_t = x_t^\top \theta_0 + z_t$ , where $\theta_0$ is unknown and only $s_0$ of its $d$ coordinates are nonzero.

Their RMLP algorithm (Regularized Maximum Likelihood Pricing) uses LASSO penalization and achieves regret:

R(T) = O(s_0 \log d \cdot \log T)

Even with hundreds of features, if only a handful matter, learning is fast. The reason this beats $\sqrt{T}$ connects back to well-separation: customer features vary across periods, so every price is informative about $\theta_0$ .

The Value of Knowing Noise

Xu and Wang (2021) ask how much it helps to know the shape of demand uncertainty. In their contextual model, the customer’s valuation is $w_t = x_t^\top \theta^* + N_t$ where $N_t$ has CDF $F$ .

Known vs. Unknown Noise (Xu and Wang, 2021)

If the noise distribution $F$ is known (e.g., the seller knows demand shocks are normally distributed with known variance), the EMLP algorithm achieves regret $O(d \log T)$ . If $F$ is unknown (even if only the variance of a Gaussian is unknown), regret is at least $\Omega(\sqrt{T})$ .

The gap between $O(d \log T)$ and $\Omega(\sqrt{T})$ is super-polynomial in $T$ . When an economist specifies a logit or probit demand model, the assumed noise distribution is not merely a convenience for estimation; it purchases a qualitative improvement in the rate at which the algorithm learns.

Key Insights

1. Structure Purchases Speed

The progression from $\sqrt{T}$ (no structure) to $\log T$ (parametric with well-separation) to $d \log T$ (known noise) shows that each layer of economic structure translates directly into faster convergence to the optimal price.

2. Exploration Can Be Free

Under well-separation, greedy pricing achieves $O(\log T)$ regret without any dedicated exploration rounds. Every price is informative, so exploitation and exploration are not in conflict.

3. Model Specification Has Quantifiable Value

The difference between knowing and not knowing the noise distribution is not a matter of statistical convenience—it is the difference between $O(\log T)$ and $\Omega(\sqrt{T})$ regret. Semiparametric approaches pay a concrete cost in learning speed.

4. Sparsity Tames High Dimensions

With $d = 100$ features but only $s_0 = 5$ relevant ones, RMLP achieves $O(5 \cdot \log(100) \cdot \log T)$ regret—barely affected by the ambient dimension. LASSO-style regularization makes high-dimensional pricing tractable.

Capstone: Bandit Market

In Play mode, you act as a pricing algorithm choosing from a discrete price grid. Each round, you observe only whether the customer bought, and must balance exploring new prices against exploiting the best-performing one. Compare your cumulative revenue against the oracle and various bandit algorithms. In Design mode, configure the demand environment and watch epsilon-greedy, UCB1, and Thompson Sampling compete over thousands of rounds.

References

Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002). “Finite-time analysis of the multiarmed bandit problem.” Machine Learning, 47(2), 235–256.
Broder, J. & Rusmevichientong, P. (2012). “Dynamic pricing under a general parametric choice model.” Operations Research, 60(4), 965–980.
Javanmard, A. & Nazerzadeh, H. (2019). “Dynamic pricing in high-dimensions.” Journal of Machine Learning Research, 20, 1–49.
Kleinberg, R. & Leighton, T. (2003). “The value of knowing a demand curve: Bounds on regret for online posted-price auctions.” Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science.
Rothschild, M. (1974). “A two-armed bandit theory of market pricing.” Journal of Economic Theory, 9(2), 185–202.
Xu, J. & Wang, Y. (2021). “Logarithmic regret in feature-based dynamic pricing.” Management Science, 68(2), 1257–1276.

Learning the Best Price