Background estimation via polynomial fitting – implementations of ModPoly and IModPoly methods in C++

2026-04-242026-04-25Background,Baseline,C++,chemistry,Eigen,Signal Processing,Spectrum

In the previous entry, I introduced airPLS and arPLS, which are baseline estimation methods based on the Whittaker Smoother. While these are powerful techniques, tuning their parameters can sometimes be challenging.

In this article, I will introduce more traditional and intuitive methods based on polynomial fitting: the ModPoly method (Modified Polynomial Fit) and its successor, the IModPoly method (Improved ModPoly).

Note that both ModPoly and IModPoly are already implemented in sablib, a C++ library for smoothing and baseline estimation. If you are interested, please check it out:

Released "sablib": A C++ Library for Smoothing and Baseline Estimation

Introduction On this blog, I have introduced various algorithms for smoothing noisy data and est ...

https://izadori.net/en/cpp-sablib-en/

1. ModPoly and IModPoly Methods
2. Implementation Example using Eigen
- 2.1. Verified Environments
- 2.2. Application to Noisy Data
3. Summary

ModPoly and IModPoly Methods

Challenges in Polynomial Fitting

Fitting a high-order polynomial to a background with a complex shape seems like a reasonable choice:

$$
\displaystyle
p(x) = \sum_{k=0}^d a_k x^k
$$

This can be achieved using the least squares method. However, actual data contains not just the background but also signals (peaks). If you apply the least squares method while peaks are present, the fitted background will be "pulled up" by the peak components (Figure 1).

We need to address this peak-induced distortion. The ModPoly method solves this by iteratively fitting the polynomial while excluding peak components.

ModPoly Method

The ModPoly method was proposed by Lieber and Mahadevan-Jansen in 2003 [1].

Principle

Looking at Figure 1 again, focusing on where the background is pulled up by peaks, we can see that from the peak’s perspective, it looks as if the base of the peak has been "shaved off." This phenomenon is similar to what we saw in background estimation using simple moving averages.

In the simple moving average approach, we compared the estimated background with the original data and adopted the smaller value. ModPoly follows a similar principle: it gradually removes the influence of peaks by adopting the smaller value during iterative calculations.

The ModPoly algorithm estimates the baseline through the following steps:

Initial Fit: Fit a polynomial $p(x)$ of degree $d$ to all data points using the least squares method.
Correction Step: For each data point $y_i$, compare it with the fitted polynomial value $p(x_i)$.
- If $y_i > p(x_i)$ (considered a peak component): Replace $y_i$ with $p(x_i)$.
- If $y_i \le p(x_i)$: Keep the original value.
  In other words, $y_i^{\text{new}} = \min(y_i, p(x_i))$.
Iteration: Repeat steps 1 and 2 using the updated data.
Termination: End the process when the fit converges or the specified maximum number of iterations is reached.

The ModPoly method is relatively simple to implement and computationally efficient. However, because it always adopts the smaller value, the baseline tends to align with the noise floor (bottom line).

IModPoly Method

The IModPoly method is an improvement over ModPoly, proposed by Zhao et al. in 2007 [2].

Principle

The basic concept of IModPoly is the same as ModPoly. To address ModPoly’s weakness—where the baseline is pulled down to the noise floor—it introduces a threshold check that considers the standard deviation $\sigma$ of the residuals.

Initial Fit: Fit a polynomial $p(x)$ of degree $d$ to all data points using the least squares method.
Calculate Standard Deviation: Compute the standard deviation $\sigma$ of the residuals $r_i = y_i – p(x_i)$.
Correction Step: Update each data point $y_i$ based on the following criteria:
- If $y_i > p(x_i) + \sigma$ (obvious peak component): Replace $y_i$ with $p(x_i)$.
- If $y_i \le p(x_i) + \sigma$ (noise or background): Keep the original value.
  By maintaining the original values when they are within the noise level, this prevents the baseline from becoming excessively low.
  (Note: Some implementations replace $y_i$ with $p(x_i) + \sigma$ instead)
Iteration: Repeat until the process converges or the specified iterations are reached.

References

Implementation Example using Eigen

I implemented the above algorithms using C++ and Eigen. They can be written with relatively short code. The code is also available on GitHub. Note that error checking is omitted for brevity.

ModPoly Method

#ifndef __IZADORI_EIGEN_MODPOLY_H__
#define __IZADORI_EIGEN_MODPOLY_H__

#include <Eigen/Eigen>

// Background estimation using the ModPoly method
//
// y        : Observed data
// order    : Degree of the polynomial to fit
// max_loop : Maximum number of iterations
const Eigen::VectorXd BackgroundEstimation_ModPoly(
  const Eigen::VectorXd & y, const int order, const int max_loop
);

#endif // __IZADORI_EIGEN_MODPOLY_H__

#include "modpoly.h"

const Eigen::VectorXd BackgroundEstimation_ModPoly(
  const Eigen::VectorXd & y, const int order, const int max_loop
)
{
  int n = y.size();
  Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(n, 0, 1);
  Eigen::VectorXd z = y;
  Eigen::VectorXd baseline(n);

  // Build the Vandermonde matrix
  Eigen::MatrixXd vandermonde(n, order + 1);
  for (int i = 0; i < n; i++) {
      double val = 1.0;
      for (int j = 0; j <= order; j++) {
          vandermonde(i, j) = val;
          val *= x(i);
      }
  }

  for (int loop = 0; loop < max_loop; loop++) {
      // Calculate polynomial coefficients using least squares
      Eigen::VectorXd coeff = vandermonde.colPivHouseholderQr().solve(z);

      // Evaluate the polynomial (baseline)
      baseline = vandermonde * coeff;

      // Correction step: z_i = min(z_i, baseline_i)
      z = (z.array() > baseline.array()).select(baseline, z);
  }

  return baseline;
}

IModPoly Method

#ifndef __IZADORI_EIGEN_IMODPOLY_H__
#define __IZADORI_EIGEN_IMODPOLY_H__

#include <Eigen/Eigen>

// Background estimation using the IModPoly method
//
// y        : Observed data
// order    : Degree of the polynomial to fit
// max_loop : Maximum number of iterations
const Eigen::VectorXd BackgroundEstimation_IModPoly(
  const Eigen::VectorXd & y, const int order, const int max_loop
);

#endif // __IZADORI_EIGEN_IMODPOLY_H__

#include "imodpoly.h"

const Eigen::VectorXd BackgroundEstimation_IModPoly(
  const Eigen::VectorXd & y, const int order, const int max_loop
)
{
  int n = y.size();
  Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(n, 0, 1);
  Eigen::VectorXd z = y;
  Eigen::VectorXd baseline(n);

  // Build the Vandermonde matrix
  Eigen::MatrixXd vandermonde(n, order + 1);
  for (int i = 0; i < n; i++) {
      double val = 1.0;
      for (int j = 0; j <= order; j++) {
          vandermonde(i, j) = val;
          val *= x(i);
      }
  }

  for (int loop = 0; loop < max_loop; loop++) {
      // Calculate polynomial coefficients using least squares
      Eigen::VectorXd coeff = vandermonde.colPivHouseholderQr().solve(z);

      // Evaluate the polynomial (baseline)
      baseline = vandermonde * coeff;

      // Calculate the standard deviation of the residuals
      Eigen::VectorXd residual = y - baseline;
      double mean = residual.mean();
      double sd = std::sqrt((residual.array() - mean).square().sum() / n);

      // Correction step: Replace only points where y_i > baseline_i + sd
      z = (z.array() > (baseline.array() + sd)).select(baseline, z);
  }

  return baseline;
}

Verified Environments

OS
1. Windows 11
  1. Microsoft C/C++ Optimizing Compiler 19.44.35207.1 for x64
  2. MinGW-W64 GCC version 13.2.0
  3. Clang version 20.1.0
2. Linux
  1. GCC version 12.2.0
Eigen version: 3.4.0

Application to Noisy Data

Using the code above, I tested how the background is estimated for data containing noise.

The original data is a pseudo-spectrum with 1000 points, created by synthesizing several peaks and a smooth background, then adding white noise. I evaluated the results fitting polynomials of degree 3, 6, and 9. The maximum number of iterations was set to 100.

ModPoly Method

IModPoly Method

Looking at these results, degree 6 and 9 yield almost identical background estimates. For complex background shapes, using a higher degree is generally better.

The main difference between the two methods is the height of the background. While the ModPoly baseline aligns with the noise floor, the IModPoly baseline aligns with the center of the noise. IModPoly feels like a more natural baseline.

Summary

In this article, I introduced the ModPoly and IModPoly methods for background estimation based on polynomial fitting. The key steps for each algorithm are as follows:

ModPoly Method
1. Fit a polynomial $p(x)$ of degree $d$ to the data sequence $y$.
2. Compare $y_i$ and $p(x_i)$, and update the data by adopting the smaller value.
3. Return to step 1 and repeat the calculation iteratively.
IModPoly Method
1. Fit a polynomial $p(x)$ of degree $d$ to the data sequence $y$.
2. Calculate the standard deviation $\sigma$ of the residuals.
3. Replace only the points where $y_i > p(x_i) + \sigma$ with $p(x_i)$.
4. Return to step 1 and repeat the calculation iteratively.

As shown in the implementation examples, using Eigen allows these methods to be implemented in C++ with relatively short and clean code.

Background,Baseline,C++,Eigen,sablib,Signal Processing

Posted by izadori