A friendly primer on Evolution Operator learning¶

Evolution operator learning is a data-driven approach to characterizing dynamical systems, which can be either stochastic, \(x_{t + 1} \sim p(\cdot | x_{t})\), or deterministic, \(x_{t+1} \sim \delta(\cdot - F(x_{t}))\). Throughout, we assume the dynamics to be Markovian, so that the evolution of \(x_t\) depends on \(x_{t}\) alone and not on the states at times \(s < t\). If this assumption is not satisfied by \(x_{t}\), a standard trick is to re-define the state as a context \(c_{t}^{H} = f(x_{t}, x_{t -1}, \ldots, x_{t - H})\) with history length \(H\), where \(f\) can be a simple concatenation (as implemented by kooplearn.preprocessing.TimeDelayEmbedding) or a learned sequence model (e.g., a recurrent neural network or transformer).

Evolution operators are defined as follows: for every function \(f\) of the state of the system, \((\mathsf{E} f)(x_{t})\) is the expected value of \(f\) one step ahead in the future, given that at time \(t\) the system was found in \(x_t\):

\[ (\mathsf{E} f)(x_t) = \int p(dy | x_{t}) f(y) = \mathbb{E}_{y \sim X_{t + 1} | X_{t}}[f(y) | x_t]. \]

Notice that \(\mathsf{E}\) is an operator because it maps any function \(f\) to another function, \(x_{t} \mapsto (\mathsf{E} f)(x_t)\), and is linear because \(\mathsf{E}(f + \alpha g) = \mathsf{E} f + \alpha \mathsf{E} g\). When the dynamics is deterministic, \(\mathsf{E}\) is known as the Koopman operator [1], while in the stochastic case it is known as the transfer operator [2].

Evolution operators fully characterize the dynamical system because knowing \(\mathsf{E}\) allows us to reconstruct the dynamical law \(p(\cdot | x_{t})\). Indeed, for any subset of the state space \(B \subseteq \mathcal{X}\), applying \(\mathsf{E}\) to the indicator function of \(B\), we have

\[ (\mathsf{E} 1_{B})(x_t) = \int_{B}p(dy|x_{t}) = \mathbb{P}\left[X_{t + 1} \in B | x_t\right]. \]

An advantage of the operator approach over dealing directly with the conditional probability \(p(\cdot | x_{t})\) is that \(\mathsf{E}\) acts linearly on the objects to which it is applied. This means that operators unlock an arsenal of tools from linear algebra and functional analysis, which would be unavailable otherwise. Arguably the most important of them is the spectral decomposition, allowing us to decompose \(\mathsf{E}\), and hence the dynamics, into a linear superposition of dynamical modes. These ideas lie at the core of the celebrated Time-lagged Independent Component Analysis (TICA) [3, 4] and Dynamical Mode Decomposition (DMD) [5, 6].

Learning \(\mathsf{E}\) and its spectral decomposition from data¶

_images/operator_learning.svg — A schematic of the operator learning framework.¶

We now review the main approaches to learn the evolution operator and its spectral decomposition from a finite dataset of observations.

A core idea of operator learning is that operators are defined by how they act on a suitable linear space of functions, similarly to how matrices are defined by their action on a basis of vectors. Of course, not every function \(f\) is interesting, and this nicely parallels with the matrix example, where the most “interesting” directions are those that recover most of the variance in the data.

Learning \(\mathsf{E}\), therefore, is usually cast as the following problem:

Note

The Learning Problem: Letting \(\varphi(x) \in \mathbb{R}^{d}\) be a — learned or fixed — encoder of the state, find the best approximation of \(\mathsf{E}\) restricted to the \(d\)-dimensional linear space of functions generated by \(\varphi\), given the data.

In practice, the data is usually a collection of transitions \(\mathcal{D} = (x_i, y_i)_{i = 1}^{N}\), where it is intended that \(x_{i} \sim \mathbb{P}[X_{t}]\) are sampled from a distribution of initial states, while \(y_{i} \sim p( \cdot | x_{i})\).

Least squares¶

In this approach, the encoder \(\varphi\) is a frozen, non-learnable dictionary of functions, and we are interested in approximating the action of \(\mathsf{E}\) on functions of the form \(f(x) = \langle w,\varphi(x)\rangle\) for every \(w \in \mathbb{R} ^{d}\). To this end, one minimizes the empirical error between the true conditional expectation \(\mathbb{E}_{y \sim X_{t + 1} | X_{t}}[\langle w,\varphi(y)\rangle | x]\), and a linear model \( \langle Ew,\varphi(x)\rangle\), where the matrix \(E \in \mathbb{R} ^{d \times d}\) identifies the restriction of the evolution operator to the linear span of the dictionary:

\[ \frac{1}{N}\sum_{i = 1}^{N} (\langle w,\varphi(y_{i})\rangle - \langle Ew,\varphi(x_i)\rangle)^2 \leq \frac{1}{N}\sum_{i = 1}^{N} \|\varphi(y_i) - E^\top \varphi(x_i)\|^{2} + \lambda \|E\|^{2}. \]

On the right-hand side, we assumed \(\|w\| \leq 1\), used the Cauchy–Schwarz inequality, and added a ridge penalty. The minimizer of the equation above can be computed in closed form [7, 8] as:

\[ \mathsf{E}_{ls} = (\text{cov}_{X} + \lambda\mathsf{I})^{-1}\text{cov}_{XY}, \quad \text{with }\text{cov}_{XY} = \frac{1}{N} \sum_{i = 1}^N\varphi(x_i)\varphi(y_i)^\top \text{and } \text{cov}_{X} = \text{cov}_{XX}. \]

In the limit of infinite data, \(N\to \infty\), and infinitely dimensional encoders, \(d \to \infty\), the least squares estimator converges [7] in the strong operator topology to the evolution operator \(\mathsf{E}\), and similar (but weaker) asymptotic convergence results are proved for its spectrum. When \(\varphi\) is fixed and finite-dimensional, the least squares estimator for evolution operator learning is implemented in kooplearn.linear_model.Ridge.

The brief Getting Started tutorial showcases the Ridge model in action. In this example, data is generated from a linear dynamical system, and a Principal Component Regression (PCR) model—also known as Dynamic Mode Decomposition—is fitted to approximate the system’s evolution.

Dynamical Modes¶

The spectral decomposition of \(\mathsf{E}\) is approximated by expressing the least-squares estimator in its eigenvectors’ basis \(\mathsf{E}_{ls} = Q\Lambda Q^{-1}\), where the columns of \(Q = [q_1, \cdots, q_d]\) are the eigenvectors of \(\mathsf{E}_{ls}\), and \(\Lambda\) is a diagonal matrix of eigenvalues. In this basis, the expected value in the future for a function \(f(x) = \langle w,\varphi(x)\rangle\) is expressed as:

\[ \mathbb{E}_{y \sim X_{t + 1} | X_{t}}[f(y) | x] \approx \langle \mathsf{E}_{ls} w,\varphi(x)\rangle = \langle Q\Lambda Q^{-1}w,\varphi(x)\rangle = \sum_{i = 1}^{d} \lambda_i \langle q_i,\varphi(x)\rangle (Q^{-1}w)_{i}. \]

The spectral decomposition expresses the transition \(x_t \to x_{t+1}\) as a sum of modes of the form \(\lambda_i \langle q_i,\varphi(x)\rangle (Q^{-1}w)_i\), each of which can be broken down into three components:

The eigenvalues \(\lambda_i\) determine the time scales of the transition. Indeed, applying the evolution operator \(s\) times to analyze the transition \(x_t \to x_{t+s}\) leaves the eigenvalue decomposition unchanged, except that each \(\lambda_i\) becomes \(\lambda_i^s\). Writing \(\lambda_i^s = \rho_i^s e^{i s \omega_i}\) in polar coordinates, reveals that the modes decay exponentially over time with rate \(\rho_i\), while oscillating at frequency \(\omega_i\).
The initial state \(x\) influences the decomposition through the factor \(\Psi_{i}(x) = \langle q_i,\varphi(x)\rangle\). This coefficient captures how strongly the state \(x\) aligns with the \(i\)-th mode. When \(q_i\) corresponds to an eigenvalue with slow decay, i.e., \(|\lambda_i| \approx 1\), the term \(\Psi_{i}(x)\) serves as a natural quantity for clustering states into coherent or metastable sets.
The coefficient \((Q^{-1}w)_i\), in turn, indicates how the function represented by the vector \(w\) relates to the \(i\)-th mode. This connection makes it possible to link the dynamical patterns to specific functions — or observables — thereby deepening our understanding of the system.

In kooplearn, extracting the dynamical modes of arbitrary observables is as simple as calling the dynamical_modes() method after fitting the evolution operator. The dynamical modes evaluated at states X are returned as a convenient DynamicalModes object, which provides direct access to eigenvalues \(\lambda_i\), eigenfunctions \(\Psi_{i}\), and their projections \((Q^{-1}w)_i\).

For practical examples showcasing the capabilities of DynamicalModes, see the Fluid Flow and Switching System examples.

Kernel methods¶

Leveraging the kernel trick, one can learn evolution operators by deriving a closed-form solution of the least square loss in terms of kernel matrices whose elements are of the form \(k(x_i, x_j) = \langle \varphi(x_i),\varphi(x_j)\rangle\), where \(k(\cdot, \cdot)\) is a suitable kernel function. Thanks to the theory of reproducing kernel Hilbert spaces, this class of methods is backed up by statistical learning guarantees, such as the ones derived in [8, 9, 10]. Similarly to the least-squares approach, one also approximates the spectral decomposition of \(\mathsf{E}\) via kernel methods [11, 12, 13, 14, 15, 16]. See the kooplearn.kernel module for state-of-the-art kernel methods on evolution operator learning.

The following examples illustrate practical applications of these kernel-based approaches:

Kernel Methods: Learning of Reduced Rank Regression (RRR), Nyström Reduced Rank Regression (Nyström-RRR), and Randomized Reduced Rank Regression (Randomized-RRR) estimators on the Lorenz-63 chaotic system using KernelRidge and NystroemKernelRidge classes.
Analysing Molecular Dynamics Simulations: Spectral analysis of molecular dynamics simulations using the NystroemKernelRidge model.
Prinz Potential: Approximation of leading eigenfunctions of the overdamped Langevin transfer operator using the KernelRidge model.
Switching System: Behavior analysis of a switching system using the KernelRidge model.

Deep learning¶

In contrast to the previous approaches, where the encoder \(\varphi\) is prescribed, a number of methods proposed to approximate \(\mathsf{E}\) from data with end-to-end schemes including \(\varphi\) as a learnable neural network. Since learning \(\mathsf{E}\) ultimately entails learning its action on the linear space spanned by \(\varphi\), it is appealing to choose an encoder capturing the most salient features of the dynamics.

To this end, one can train \(\varphi\) via an encoder-decoder scheme as proposed in [17, 18, 19, 20, 21] or with encoder-only approaches as in [22, 23, 24, 25, 26, 27].

In kooplearn, learning \(\varphi\) is easier than ever. We provide the following losses: AutoEncoderLoss [18], VampLoss [23], and SpectralContrastiveLoss [27], all implemented for use with torch models. We also offer their JAX functional counterparts (autoencoder_loss(), vamp_loss(), and spectral_contrastive_loss()).

In the following examples, we show how straightforward it is to learn \(\varphi\) using these losses:

Ordered MNIST (torch): Comparison of different learned feature maps for the Ordered MNIST stochastic dynamical system (see fetch_ordered_mnist() for details).
Ordered MNIST (jax): JAX version of the above example.
Logistic Map: Evaluation of learned representations approximating spectral properties of the noisy logistic map.

[1]

Bernard O Koopman. Hamiltonian systems and transformation in hilbert space. Proceedings of the National Academy of Sciences, 17(5):315–318, 1931.

[2]

David Applebaum. Lévy Processes and Stochastic Calculus. Cambridge University Press, April 2009. ISBN 9780511809781. URL: http://dx.doi.org/10.1017/CBO9780511809781, doi:10.1017/cbo9780511809781.

[3]

L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23):3634–3637, June 1994. URL: http://dx.doi.org/10.1103/PhysRevLett.72.3634, doi:10.1103/physrevlett.72.3634.

[4]

Guillermo Pérez-Hernández, Fabian Paul, Toni Giorgino, Gianni De Fabritiis, and Frank Noé. Identification of slow molecular order parameters for markov model construction. The Journal of Chemical Physics, 2013.

[5]

Peter J. Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, July 2010. URL: http://dx.doi.org/10.1017/s0022112010001217, doi:10.1017/s0022112010001217.

[6]

J. Nathan Kutz, Steven L. Brunton, Bingni W. Brunton, and Joshua L. Proctor. Dynamic Mode Decomposition. Society for Industrial and Applied Mathematics, 2016.

[7] (1,2)

Milan Korda and Igor Mezić. On convergence of extended dynamic mode decomposition to the Koopman operator. Journal of Nonlinear Science, 28:687–710, 2018.

[8] (1,2)

Vladimir Kostic, Pietro Novelli, Andreas Maurer, Carlo Ciliberto, Lorenzo Rosasco, and Massimiliano Pontil. Learning dynamical systems via koopman operator regression in reproducing kernel hilbert spaces. Advances in Neural Information Processing Systems, 35:4017–4031, 2022.

[9]

Vladimir R Kostic, Karim Lounici, Pietro Novelli, and Massimiliano Pontil. Sharp spectral rates for koopman operator learning. In Thirty-seventh Conference on Neural Information Processing Systems. 2023. URL: https://openreview.net/forum?id=Lt3jqxsbVO.

[10]

Feliks Nüske, Sebastian Peitz, Friedrich Philipp, Manuel Schaller, and Karl Worthmann. Finite-data error bounds for Koopman-based prediction and control. Journal of Nonlinear Science, 33(1):14, 2023.

[11]

Matthew O. Williams, Clarence W. Rowley, and Ioannis G. Kevrekidis. A kernel-based method for data-driven koopman spectral analysis. Journal of Computational Dynamics, 2(2):247–265, 2015. URL: http://dx.doi.org/10.3934/jcd.2015005, doi:10.3934/jcd.2015005.

[12]

Yoshinobu Kawahara. Dynamic Mode Decomposition with Reproducing Kernels for Koopman Spectral Analysis. In Advances in Neural Information Processing Systems, volume 29. 2016.

[13]

Stefan Klus, Ingmar Schuster, and Krikamol Muandet. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. Journal of Nonlinear Science, 30(1):283–315, 2019.

[14]

Suddhasattwa Das and Dimitrios Giannakis. Koopman spectra in reproducing kernel Hilbert spaces. Applied and Computational Harmonic Analysis, 49(2):573–607, 2020.

[15]

Romeo Alexander and Dimitrios Giannakis. Operator-theoretic framework for forecasting nonlinear time series with kernel analog techniques. Physica D: Nonlinear Phenomena, 409:132520, 2020.

[16]

Giacomo Meanti, Antoine Chatalic, Vladimir R Kostic, Pietro Novelli, Massimiliano Pontil, and Lorenzo Rosasco. Estimating koopman operators with sketching to provably learn large scale dynamical systems. In Thirty-seventh Conference on Neural Information Processing Systems. 2023. URL: https://openreview.net/forum?id=GItLpB1vhK.

[17]

Naoya Takeishi, Yoshinobu Kawahara, and Takehisa Yairi. Learning Koopman invariant subspaces for dynamic mode decomposition. Advances in Neural Information Processing Systems, 2017.

[18] (1,2)

Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, November 2018. URL: https://doi.org/10.1038/s41467-018-07210-0, doi:10.1038/s41467-018-07210-0.

[19]

Samuel E Otto and Clarence W Rowley. Linearly recurrent autoencoder networks for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1):558–593, 2019.

[20]

Omri Azencot, N Benjamin Erichson, Vanessa Lin, and Michael Mahoney. Forecasting sequential data using consistent koopman autoencoders. In International Conference on Machine Learning, 475–485. PMLR, 2020.

[21]

Christoph Wehmeyer and Frank Noé. Time-lagged autoencoders: deep learning of slow collective variables for molecular kinetics. The Journal of Chemical Physics, 2018.

[22]

Qianxiao Li, Felix Dietrich, Erik M Bollt, and Ioannis G Kevrekidis. Extended dynamic mode decomposition with dictionary learning: a data-driven adaptive spectral decomposition of the koopman operator. Chaos: An Interdisciplinary Journal of Nonlinear Science, 2017.

[23] (1,2)

Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, January 2018. URL: https://doi.org/10.1038/s41467-017-02388-1, doi:10.1038/s41467-017-02388-1.

[24]

Enoch Yeung, Soumya Kundu, and Nathan Hodas. Learning deep neural network representations for Koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), 4832–4839. IEEE, 2019.

[25]

Vladimir R. Kostic, Pietro Novelli, Riccardo Grazzi, Karim Lounici, and Massimiliano Pontil. Learning invariant representations of time-homogeneous stochastic dynamical systems. In The Twelfth International Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=twSnZwiOIm.

[26]

Marco Federici, Patrick Forré, Ryota Tomioka, and Bastiaan S Veeling. Latent representation and simulation of markov processes via time-lagged information bottleneck. arXiv preprint arXiv:2309.07200, 2023.

[27] (1,2)

Giacomo Turri, Luigi Bonati, Kai Zhu, Massimiliano Pontil, and Pietro Novelli. Self-supervised evolution operator learning for high-dimensional dynamical systems. arXiv preprint arXiv:2505.18671, 2025.