Bayesian Online Change Point Detection for Baseline Shifts

In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.


Introduction
Online change point detection refers to the real-time detection of an acute change in a sequential time series data set. The definition of the term "change point" may vary depending on the application and the analyst's perspective. Change points can be, for instance, mean shifts [1,8,9,12,6], slope (gradient) shifts [1,10], variance shifts [11,6,5,7], anomalous events [3,11,4], or combinations of several of these. Regardless of the definition, however, it is reasonable to assume that the generative parameters before and after a change point are different. In actuality, the changes in the generative parameters at a change point, if substantial, should stem from changes in the parameters of the relevant physical models, which are often not trivial to articulate due to their complexity. In the present work, we concentrate on improving the performance of detecting mean shifts and slope shifts, especially when the baseline of the observed data points is continuously shifting away from the original level.
Since the first practical algorithm for Bayesian online change point detection (BOCPD) was introduced [1], it has received considerable interest for a wide range of real-world applications. These applications include water quality monitoring [3], human-machine interaction analysis [4], medical usage [5,6], fuel management systems in unmanned vehicles [7], and satellite fault prediction [8]. Moreover, efforts have been made to improve the performance and robustness of the BOCPD algorithm itself, such as through hyper-parameter learning [9], robust change point determination [10], and the prediction of change points [11]. Since this paper aims to broaden the applications of BOCPD, the current study is categorized as part of the latter group.
Although it has been proven that BOCPD is a promising technique for various applications, it should be noted that most previous works have dealt only with data sets whose long-term expectation is constant (e.g., a daily financial gain in %) or time series data sets with a fixed baseline (e.g., switching back and forth between normal and abnormal states). In this study, we demonstrate reasons why the original BOCPD algorithm's usage has been limited to such cases and attempt to modify the algorithm by introducing a mechanism of feeding back information on the existence of a change point to guide subsequent detection attempts.
In this paper, the term "baseline" is defined as a certain level of data values that a time series tends to regress back to ((1) and (2) in Fig. 1). For instance, if we are considering an economic metric expressed as a percentage (e.g., a monthly job loss rate), it should naturally regress to a value of 0 (the baseline) over the long term. However, while the original BOCPD algorithm implicitly assumes that the values eventually return to the original baseline, this is not always true in reality if irreversible and unidirectional mean shifts occur. In other words, this paper considers cases in which parameter shifts occur in a time series, and the newly proposed extension of the BOCPD algorithm is able to detect such shifts without losing contrast.
According to a comprehensive study of the categorization of change point detection algorithms [2], the keys to useful online change point detection algorithms can be summarized as follows: 1) minimal delay, 2) minimal false positives, 3) computational efficiency, and 4) robustness. This study primarily aims to improve 1), 2) and 4) without sacrificing 3), particularly in cases in which a time series data set shows unidirectional baseline shifts from a long-term perspective.

System Model
The BOCPD algorithm [1] assumes that a group of sequential data points $x_1, x_2, \ldots, x_T$, where $T \in \mathbb{N}$, in chronological order, can be partitioned into subgroups separated by estimated change points, with $\rho \in \mathbb{N}$ denoting the subgroup or partition label. To determine the interval between two change points (i.e., the length of a partition), the concept of the run length $r_t \in \mathbb{N}$ is introduced. The run length counts the number of time steps for which the current partition has continued without the next change point being observed. In other words, once a change point is observed, the run length $r_t$ no longer grows and is reset to zero. To determine whether the run length $r_t$ should continue to increase, the probability distribution of a certain run length, $P(r_t \mid x_{1:t})$, is calculated to estimate whether a change point has occurred whenever the next data point $x_t$ is observed. The set of subgrouped data points in partition ρ is denoted by $x^{(\rho)}_t$, and the set of data points from the beginning to a certain time $t$ is denoted by $x_{1:t}$. Each data point within a partition is regarded as i.i.d. and sampled from a probability distribution $P(x_t \mid \eta_\rho)$, where the parameters $\eta_\rho$ of the partitions ρ are also i.i.d. In addition, the minimum $r_t$ is defined as 0 in the algorithm; therefore, the minimum number of elements that may be stored in $x^{(\rho)}_t$ is also 0.

BOCPD Algorithm
One of the unique mechanisms of the BOCPD algorithm is the utilization of the run-length distribution $P(r_t \mid x_{1:t})$ to assess the probability of the existence of a change point. Given a set of $t$ data points $x_{1:t}$, the run-length distribution is obtained by normalizing the joint distribution over the run length $r_t$ and all observed data points, which is calculated recursively [1,9].
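For reference, the run-length distribution and the recursion for the joint distribution take the following standard form in the BOCPD literature [1]. The typesetting below is adapted to the symbols used in this paper and should be read as a reconstruction under that assumption, not as a quotation of the original equations.

```latex
\begin{align}
P(r_t \mid x_{1:t}) &= \frac{P(r_t, x_{1:t})}{\sum_{r_t} P(r_t, x_{1:t})}, \\
P(r_t, x_{1:t}) &= \sum_{r_{t-1}} P(r_t \mid r_{t-1})\,
                   P\!\left(x_t \mid r_{t-1}, x^{(\rho)}_t\right)\,
                   P(r_{t-1}, x_{1:t-1}).
\end{align}
```

Here $P(r_t \mid r_{t-1})$ is the change point prior, which is nonzero only when the run length either grows by one or resets to zero.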
For simplicity, a constant hazard function $H$ is used as a prior, $H(r_t) = 1/\lambda$, where λ is a parameter used to adjust the sensitivity of change point detection. This sensitivity is higher with a smaller λ (i.e., a larger $H(r_t)$). The detailed calculation steps are described in Algorithm 1 (excluding the components marked with an asterisk (*)).
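The following is a minimal sketch of one step of the run-length recursion with a constant hazard, assuming $H = 1/\lambda$ and assuming the predictive probabilities for each previous run length have already been computed. The function names `constant_hazard` and `update_run_length` are illustrative and do not come from Algorithm 1.

```python
import numpy as np


def constant_hazard(lam):
    """Constant hazard H = 1 / lambda; a smaller lambda gives a larger hazard."""
    return 1.0 / lam


def update_run_length(joint_prev, pred_probs, lam):
    """One step of the run-length recursion.

    joint_prev[r] : joint probability of run length r at time t-1 and past data
    pred_probs[r] : predictive probability of x_t given previous run length r
    Returns the new joint (one element longer) and the normalized posterior.
    """
    h = constant_hazard(lam)
    weighted = np.asarray(joint_prev, dtype=float) * np.asarray(pred_probs, dtype=float)
    joint = np.empty(len(weighted) + 1)
    joint[0] = h * weighted.sum()        # change point: run length resets to 0
    joint[1:] = (1.0 - h) * weighted     # no change point: run length grows by 1
    posterior = joint / joint.sum()      # run-length distribution P(r_t | x_{1:t})
    return joint, posterior


joint, posterior = update_run_length([1.0], [0.5], lam=10.0)
```

In this toy call, the posterior mass on a reset equals the hazard itself because only one previous run length exists.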

Exponential Family Likelihoods
As suggested in [1], exponential family likelihoods are mathematically convenient for obtaining a posterior predictive distribution $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$ since there exists a conjugate prior to simplify the calculation with finite parameters $\eta_\rho$. Unless a time series data set of interest is known to be based on a discrete distribution, one generic assumption that is typically used for a continuous distribution is to select a normal distribution of unknown mean µ and variance $\sigma^2$ as a likelihood. To address this assumption, in previous studies [9,13], a normal-inverse-gamma (NIG) distribution has been adopted as a conjugate prior. In the same way, we also select an NIG distribution throughout this paper, and according to [14], the posterior predictive distribution $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$ can be written in closed form.
This posterior predictive distribution is a Student's t-distribution at time $t$ parameterized by mean $\mu_t$, variance $\nu_t$, $2\alpha_t$ degrees of freedom, and $\beta_t$ data points. These parameters are updated for every observation of a new data point at time $t + 1$.
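As a hedged illustration, the sketch below uses the textbook normal-inverse-gamma results for a normal likelihood with unknown mean and variance: a Student's t predictive with $2\alpha$ degrees of freedom, location $\mu$, and squared scale $\beta(\nu+1)/(\alpha\nu)$, followed by the single-observation parameter update. Treating $\nu_t$ as the pseudo-observation count of the NIG prior is an assumption made here, and the function names are illustrative.

```python
import math


def student_t_pdf(x, df, loc, scale):
    """Density of a Student's t-distribution with location and scale."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


def predictive_pdf(x, mu, nu, alpha, beta):
    """Posterior predictive density under the NIG parameters (assumed mapping)."""
    scale = math.sqrt(beta * (nu + 1) / (alpha * nu))
    return student_t_pdf(x, 2 * alpha, mu, scale)


def nig_update(x, mu, nu, alpha, beta):
    """Update the NIG parameters after observing a single data point x."""
    mu_new = (nu * mu + x) / (nu + 1)
    beta_new = beta + nu * (x - mu) ** 2 / (2 * (nu + 1))
    return mu_new, nu + 1, alpha + 0.5, beta_new


# Predictive density at the prior mean versus far away from it.
p_near = predictive_pdf(0.0, 0.0, 1, 1, 1)
p_far = predictive_pdf(5.0, 0.0, 1, 1, 1)
params = nig_update(2.0, 0.0, 1, 1, 1)
```

The density far from the current mean is much smaller than at the mean, which is the quantity whose contrast drives the change point decision.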
As an example, the initial parameter values may be set to $\mu_1 = 0$ and $\nu_1 = \alpha_1 = \beta_1 = 1$ under the assumption that the data points in the time series start near zero and, after some fluctuations, will eventually regress back to the original baseline. In other words, this one-time setting of the initial conditions is based on the implicit hope that no irreversible, unidirectional value shifts will occur from a long-term perspective so that the model will continue to work well, as we see in later sections.

Change Point Detection
Regarding change point determination criteria, a variety of decision-making mechanisms have been proposed [3,5,6,10,11], and the original literature [1] did not explicitly specify any methodology for determining change points. However, in this paper, we attempt to be generic and thus simply define the existence of a change point by taking the difference δ between the indices of the highest probability among all run lengths at the current time t and the one-step-previous time t − 1. This straightforward concept of an "argmax" criterion for determining change points has also been adopted in some previous studies [3,5,6].
If δ is positive, this indicates that the run length $r_t$ is continuing to grow ($c_t = 0$), whereas a zero or negative value indicates the existence of a change point ($c_t = 1$), where $c_t \in \{0, 1\}$.
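The "argmax" criterion can be sketched as follows, assuming the run-length distributions at times $t - 1$ and $t$ are available as arrays indexed by run length. The helper name `detect_change` is illustrative.

```python
import numpy as np


def detect_change(post_prev, post_curr):
    """Return (delta, c_t) from two consecutive run-length distributions.

    delta is the difference between the most probable run lengths at
    time t and time t-1; a change point is flagged when delta <= 0.
    """
    delta = int(np.argmax(post_curr)) - int(np.argmax(post_prev))
    c_t = 0 if delta > 0 else 1
    return delta, c_t


growing = detect_change([0.1, 0.9], [0.05, 0.15, 0.8])
reset = detect_change([0.1, 0.2, 0.7], [0.9, 0.05, 0.03, 0.02])
```

In the first call the most probable run length grows from 1 to 2, so no change point is flagged; in the second it collapses from 2 to 0, so one is.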
The flow chart in Fig. 2 describes the steps of the calculation in the BOCPD algorithm, from initializing the parameters $\eta_\rho$ and observing a new data point $x_t$ through calculating the run-length distribution $P(r_t \mid x_{1:t})$, followed by the change point decision mechanism described above.

Qualitative Change Point Detection Sensitivity Analysis
One of the critical components of the present work is the visualization of the change point detection sensitivity over time. To make the discussion of the sensitivity as intuitive as possible, we visualize the predictive probability $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$ accompanying data point $x_t$ to illustrate the interactions between the incoming data points and the underlying probability distribution. This is particularly useful for understanding when BOCPD starts to lose its sensitivity for change point detection. As an example, Fig. 3 illustrates how the predictive probability

Problem Statement
If there is no guarantee that the long-term expectation of a time series is approximately constant or that an anomalous state will regress to a fixed known baseline value, the original BOCPD algorithm struggles to maintain its change point detection sensitivity. This is because as the baseline irreversibly shifts toward arbitrary values and becomes farther from the original mean, the shape of the underlying distributional assumption for determining the predictive distribution of the current data point, $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$, becomes flatter and loses its contrast. Note that in this study, we consider only a generic case in which we can assume a normal distribution of unknown mean µ and standard deviation σ. In the original BOCPD scheme, a reasonable next attempt to cope with this situation might be to take the derivative of the original time series and cast the problem as one of "change point detection of change rates" [1]. However, although it may be useful to use the derivative to detect change points, some information may be lost in this way compared to the original time series. This is because the original time series and a differentiated time series will interact differently with the underlying distributional assumptions when forming the run-length probability distribution. To fully utilize the information from the original time series, it would be ideal to be able to evaluate change points based on the original time series as well as a derived time series, such as a differentiated or integrated series, with minimal restrictions.
To elaborate on the problem, two distinct data sets are used, as summarized in Table 1. The corresponding change point detection results of the BOCPD algorithm are illustrated in Fig. 5. For both cases, each datum in each predetermined partition ρ is sampled from a normal distribution with a variable mean µ and a common standard deviation σ = 1.
The former case is intended to represent a time series without long-term baseline shifts (zero-centered fluctuation), while the latter represents a time series with long-term baseline shifts. Change points are established at constant intervals of t = 10 for both cases. Each subplot in Fig. 4(a) illustrates a case in which the time series tends to regress back to a constant baseline value (zero-centered fluctuation), and in this case, the change points are fully captured without delay by the original BOCPD algorithm. In contrast, the plots in Fig. 4(b) illustrate a case in which the baseline of the time series is constantly increasing. As the baseline shifts farther from its original level, the change points are more poorly captured. The primary cause of this is that the contrast of the predictive distribution of $x_t$ given a previous run length of $r_{t-1} = 1$, namely, $P(x_t \mid r_{t-1} = 1, x^{(\rho)}_t)$, is degraded when the baseline is located farther from the original baseline. This can be observed in the second row (2) of Fig. 4. Here, we have used the predictive distributions of $x_t$ with previous run lengths $r_{t-1} \in \{1, 2, \ldots, 10\}$ to accurately understand the interaction between a specific run length $r_{t-1}$ and an incoming data point $x_t$, which defines the probability of whether the run length $r_t$ will continue to grow in the current time step. The farther the data points are located from the original baseline, the blurrier the color pattern of the $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$ distribution becomes, indicating lower contrast at the decision boundaries. Furthermore, this loss of contrast is articulated in Fig. 5 by comparing the shapes of the posterior distributions with and without baseline shifts. As revealed by the graphs in Fig. 5, the shapes of the predictive probabilities $P(x_t \mid r_{t-1} = 10, x^{(\rho)}_t)$ at the change points are flatter in the case with baseline shifts, especially at t = 81.
By definition, the mechanism of change point detection relies heavily on the contrast of $P(x_t \mid r_{t-1} = 1, x^{(\rho)}_t)$ given the most recent data point $x_t$. In other words, the more abruptly the value of $P(x_t \mid r_{t-1} = 1, x^{(\rho)}_t)$
Table 1. Data sets with and without baseline shifts. Data sets 1 (without baseline shifts) and 2 (with baseline shifts) each consist of 100 data points. Each data point is generated from a normal distribution, with different means µ and the same standard deviation σ = 1 in the ten different partitions ρ. Change points are established at 9 locations, separated by a constant interval of t = 10.
Fig. 5. Predictive probabilities for a run length of $r_{t-1} = 10$ for the two cases presented in Fig. 4 (λ = 30 for both cases).
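Data sets of the kind summarized in Table 1 can be generated with a few lines of code. The sketch below assumes ten partitions of ten points each with σ = 1, as stated in the caption; the partition means are hypothetical placeholders, since the actual values of Table 1 are not reproduced here.

```python
import numpy as np


def make_piecewise_normal(means, segment_length=10, sigma=1.0, seed=0):
    """Sample a piecewise-constant-mean series and return it with its change points.

    Change points are returned as 0-based indices of the first point
    of every partition except the first one.
    """
    rng = np.random.default_rng(seed)
    mu = np.repeat(np.asarray(means, dtype=float), segment_length)
    x = rng.normal(loc=mu, scale=sigma)
    change_points = np.arange(segment_length, len(mu), segment_length)
    return x, change_points


# Hypothetical means: zero-centered fluctuation versus a steadily rising baseline.
means_no_shift = [0, 3, 0, -3, 0, 3, 0, -3, 0, 3]
means_with_shift = [3 * k for k in range(10)]

x1, cps1 = make_piecewise_normal(means_no_shift)
x2, cps2 = make_piecewise_normal(means_with_shift)
```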

BOCPD for Baseline Shifts (BOCPD-BLS)
Our primary interest in this work lies in cases in which the baseline irreversibly shifts to a new level, such that an approximately long-term constant baseline cannot be assumed [7,1,5,9,11,4]. To cope with data sets with baseline shifts, unlike most previous works, we regard all partitions ρ as independent of each other. In other words, all parametric information from prior to the previous change point is discarded, without changing the underlying distributional nature of the observed data points. We call the modified algorithm proposed in the present work "BOCPD for baseline shifts (BOCPD-BLS)".

Parameter Initialization After Each Change Point
Once a change point is observed at $t \in \mathbb{N}$, the following two steps are taken to feed the result ($c_t = 1$) back to a new detection process: 1) the parameters of the posterior predictive distribution, $\mu_t$, $\nu_t$, $\alpha_t$, and $\beta_t$, are reinitialized, and 2) the first observation in each partition, $x^{(\rho)}_{\mathrm{ini}}$, is reset to a value equal to the next observation $x_{t+1}$. Here, $x^{(\rho)}_{\mathrm{ini}}$ is introduced to define the next partition's initial local baseline. Because we discard all the information from the previous partition except the knowledge of the existence of a change point, the run-length distribution and the joint distribution over the run length are rewritten so that they depend only on the data points of the current partition. When making a new decision about the existence of a change point, we marginalize out all past decisions $c_{1:(t-1)}$ in the calculation of the run-length distribution, and $x^{(\rho)}_{\mathrm{ini}}$ is replaced by a new value after each detected change point. An example of a possible initial set of parameter values is $\mu_1 = x'_1$ and $\nu_1 = \alpha_1 = \beta_1 = 1$, and this set of initial parameters is utilized throughout this study. Since we intend to incorporate the result of change point detection after each observation into the beginning of the next partition, δ is now computed from the run-length distribution of the current partition only. This indicates that the run-length distribution from the previous partition, $P(r_t \mid x'^{(\rho-1)}_t)$, no longer affects the decision-making process regarding the existence of a change point in the current partition. As shown in Fig. 6, BOCPD-BLS uses the change point information $c_t$ to determine whether the parameters $\eta_\rho$ will be reinitialized (if δ ≤ 0, then $c_t = 1$) or continuously updated (if δ = 1, then $c_t = 0$) before a newly incoming data point is observed. This additional process is the primary difference between BOCPD-BLS and the original BOCPD algorithm. The detailed calculation steps are described in Algorithm 1. The computational cost of BOCPD-BLS is equivalent to that of the original BOCPD algorithm, which is linear in the data count.
Fig. 6. The BOCPD-BLS algorithm. The white text and dashed line represent the major differences with respect to the original BOCPD algorithm. In the BOCPD-BLS algorithm, all parameters for calculating the predictive probability $P(x_t \mid r_{t-1}, x^{(\rho)}_t)$ are reinitialized when a change point is observed such that the new partition ρ is able to start from a new baseline.
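The reinitialization step can be sketched compactly as below. This is not Algorithm 1: it is a simplified illustration that assumes the textbook NIG update, a constant hazard $1/\lambda$, the argmax criterion, and a prior mean set to the first observation of each partition. The function name `bocpd_bls` and the `bls` switch (which turns the reinitialization off to mimic the original behavior with $\mu_1 = 0$) are hypothetical.

```python
import math

import numpy as np


def _pred(x, mu, nu, alpha, beta):
    """Student's t predictive density, one value per candidate run length."""
    df = 2.0 * alpha
    scale2 = beta * (nu + 1.0) / (alpha * nu)
    log_c = np.array([math.lgamma((d + 1) / 2) - math.lgamma(d / 2) for d in df])
    log_p = (log_c - 0.5 * np.log(df * np.pi * scale2)
             - (df + 1) / 2 * np.log1p((x - mu) ** 2 / (df * scale2)))
    return np.exp(log_p)


def bocpd_bls(xs, lam=30.0, bls=True):
    """Return 0-based indices at which the argmax criterion flags a change point."""
    h = 1.0 / lam
    prior = (1.0, 1.0, 1.0)            # nu_1 = alpha_1 = beta_1 = 1
    detected, need_init = [], True
    for t, x in enumerate(xs):
        x = float(x)
        if need_init:
            # BLS: the new partition's baseline is its first observation.
            mu0 = x if bls else 0.0
            mu, nu, alpha, beta = (np.array([v]) for v in (mu0, *prior))
            post, need_init = np.array([1.0]), False
        w = post * _pred(x, mu, nu, alpha, beta)
        new_post = np.concatenate(([h * w.sum()], (1.0 - h) * w))
        new_post /= new_post.sum()     # normalized run-length distribution
        delta = int(np.argmax(new_post)) - int(np.argmax(post))
        # NIG update for every run length, with the prior prepended for r = 0.
        mu_new = (nu * mu + x) / (nu + 1.0)
        beta_new = beta + nu * (x - mu) ** 2 / (2.0 * (nu + 1.0))
        mu = np.concatenate(([mu0], mu_new))
        beta = np.concatenate(([prior[2]], beta_new))
        nu = np.concatenate(([prior[0]], nu + 1.0))
        alpha = np.concatenate(([prior[1]], alpha + 0.5))
        post = new_post
        if delta <= 0:
            detected.append(t)
            need_init = bls            # BLS: discard all state, restart at x_{t+1}
    return detected


step_series = [0.0] * 10 + [100.0] * 10
found = bocpd_bls(step_series, lam=30.0)
```

On this noiseless step series the sketch flags the single jump at index 10. The per-step cost grows with the number of retained run lengths, and the state is small again after each reinitialization.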

Analysis
We prepared three data sets to assess the characteristics of the BOCPD-BLS algorithm compared to the original BOCPD algorithm in the presence of irreversible baseline shifts. The first data set is the same synthetic data set with multiple upward baseline shifts utilized in previous sections. The second tracks the history of the Bitcoin/USD exchange price, which is logged every minute. The third example is the number of confirmed positive COVID-19 cases in Russia from Jan. 22, 2020, through Oct. 6, 2020, on a daily basis.

Baseline Shifts: Synthetic Data
The generative parameters for this synthetic data set are listed in Table 1 (data set 2 in this table). As seen from Fig. 7, BOCPD-BLS clearly captures all the change points without delay, while the original BOCPD algorithm shows some delay or misses the change point entirely for more than half of the predetermined change points. The primary reason for this difference is illustrated in Fig. 8, which highlights the difference in contrast degradation, especially when $x_t$ is far from the original baseline. By reinitializing the parameters of the predictive distribution after each change point, BOCPD-BLS is able to detect all change points without delay regardless of the $x_t$ values.

Bitcoin vs. USD Price Data
As a real-world example, publicly available Bitcoin/USD price history data [15] are used. This data set displays upward baseline shifts over the long term. As shown in Fig. 9, the value starts near 9000 and exhibits multiple change points with continuous upward baseline shifts over the long run. Both major and minor mean shifts as well as slope shifts appear to exist. The plots in Fig. 9 show that BOCPD is able to detect the major upward mean shifts, while BOCPD-BLS captures not only the major mean shifts and most of the minor mean shifts but also several slope shifts, even at elevated $x_t$ levels. The λ values used in both cases are the same as in the previous synthetic example presented in Fig. 8. For this reason, there may be some false positives or misses in both cases, but the main contrast information used in detecting minor mean and slope shifts with elevated $x_t > 10000$ appears to be valid. Moreover, it is important to note that the reason why the BOCPD algorithm misses the minor mean shifts is not because the magnitude of such a shift itself is small but rather because the values before and after the shift are both far from the original baseline, resulting in lower contrast of the predictive distribution $P(x_t \mid r_{t-1} = 1, x^{(\rho)}_t)$, as shown in Fig. 9(a) (2). We have checked the results with λ values as low as 3, but the BOCPD algorithm still cannot capture these minor shifts at elevated $x_t$ levels.
Fig. 9. Change point detection using Bitcoin/USD price data (July 1, 2020 to August 31, 2020; data interval of 100 minutes). Although the BOCPD algorithm, as seen in (a), is able to capture the major mean shifts and some minor mean shifts in the first half of the data set, before the strongest elevation in $x_t$ occurs, it misses many minor mean shifts after this baseline shift. In contrast, the BOCPD-BLS algorithm, as seen in (b), detects not only the major mean shifts but also most of the minor mean shifts even after the strong elevation in $x_t$. In both cases, the λ values are the same as those in Fig. 8.

Daily Confirmed COVID-19 Cases in Russia
As another real-world example, publicly available daily data on confirmed COVID-19 cases in Russia are used [16,17]. This data set also displays upward baseline shifts over the long term. As seen in Fig. 10, the change points found using the BOCPD algorithm tend to be densely concentrated in certain regions where $x_t$ is either closer to the original baseline or abruptly increasing. In contrast, the BOCPD-BLS algorithm sparsely detects change points regardless of the observed $x_t$ value throughout the entire period. Another distinctive difference between the results of the two algorithms is the existence of locally clustered change points. Because the λ settings are different between the two algorithms in Fig. 10, we have also tested the dependence on λ by varying its value, as shown in Fig. 11. As revealed in Fig. 11, although varying λ causes the number of detected change points to change in both cases, it appears to have little impact on the overall tendencies. This result may imply that the difference cannot be primarily attributed to the λ settings but rather is due to differences in the algorithms themselves.
Furthermore, since the BOCPD algorithm considers the entire run-length distribution from the beginning to the current t, some run-length probabilities carried over from previous partitions still interfere with the current decision-making. However, in the BOCPD-BLS algorithm, this does not occur since this algorithm discards the run-length distribution information from all previous partitions, thus simplifying the decision-making. This robustness to noise may be an additional advantage of BOCPD-BLS, as demonstrated through this example.

Validation
For validation, we utilize six types of synthetic data sets with predetermined change points. These are designed primarily to assess the impacts of BOCPD-BLS compared with BOCPD in the presence of long-term baseline shifts; thus, data sets 2 and 6 are central to our interest (Table 2). However, the other data sets were additionally prepared to evaluate the differences in basic performance in the absence of long-term baseline shifts. Each data set was repeatedly generated with different pseudorandom seeds one hundred times from a normal distribution, and examples from each data set are displayed in Fig. 12. To evaluate the change point detection capabilities of both algorithms, four metrics are adopted: 1) F-score (the F1 score, the harmonic mean of precision and recall), 2) Miss (the number of missed change points), 3) Delay (the amount of delay from the actual change point), and 4) Duplication (the number of duplicated change points detected in a partition, regardless of the delay in detection). For simplicity, for data sets 1 through 4, a true positive (TP) is defined only as a change point that the algorithm detects without a time delay, whereas for data sets 5 and 6, a time delay of up to t = 5 (within the first half of each partition) is allowed since detecting slope shifts appears to be more challenging. In other words, when an algorithm indicates change points outside of the allowed delay period, these are all regarded as false positives (FPs). Furthermore, if multiple change points are detected within the allowed delay period, this is regarded as a single TP detection for the F-score calculation on data sets 5 and 6 to avoid complexity. As seen in Fig. 13, BOCPD-BLS demonstrates significantly better performance, particularly in the presence of the baseline shifts in data sets 2 and 6. Additionally, it shows equal or better performance for most combinations of metrics and data sets.
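The scoring rules above can be illustrated with a small helper. This is a simplified sketch, assuming that a detection counts as a TP when it falls within the allowed delay window after a true change point, that several detections in one window count as a single TP, and that every detection outside all windows is an FP. The name `evaluate` and the returned fields are illustrative, and the Duplication metric is omitted.

```python
def evaluate(true_cps, detected, max_delay=0):
    """Compute F1, the number of misses, and the mean delay of matched detections."""
    detected = sorted(set(detected))
    tp, delays, used = 0, [], set()
    for cp in true_cps:
        hits = [d for d in detected if cp <= d <= cp + max_delay]
        if hits:
            tp += 1                      # several hits in one window count once
            delays.append(hits[0] - cp)  # delay of the earliest matching detection
            used.update(hits)
    fp = len(detected) - len(used)       # detections outside every allowed window
    fn = len(true_cps) - tp              # missed change points
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mean_delay = sum(delays) / len(delays) if delays else None
    return {"f1": f1, "miss": fn, "mean_delay": mean_delay}


strict = evaluate([10, 20, 30], [10, 21, 35], max_delay=0)
tolerant = evaluate([10, 20, 30], [10, 21, 35], max_delay=5)
```

With no allowed delay, only the detection at 10 counts; with a window of 5, all three true change points are matched.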
It is important to note that although our primary focus is to assess whether BOCPD-BLS performs better on data sets 2 and 6 (both with baseline shifts), it also shows better performance and larger λ margins in other cases without baseline shifts. Although the BOCPD-BLS algorithm appears to have the minor drawback that it tends to show a larger standard error of the mean, there are few cases in which BOCPD dominates overall. In summary, the following comments on each metric are offered based on Fig. 13. The mean statistics for 10 ≤ λ ≤ 1000 are also summarized in Table 3 for quick reference.
[Figure caption fragment] Fig. 10. It is confirmed that (a) the BOCPD algorithm is only sensitive before the baseline shift occurs, while (b) the BOCPD-BLS algorithm's detection performance is balanced throughout the entire period.
[Table 2 fragment] (slope) -0,2,-0,2,-0,2,-0,2,-0,2 0.1 Yes
• F-score: The BOCPD-BLS algorithm shows similar or higher scores for all data sets with various λ values. Additionally, its performance degradation in the high λ region is significantly smaller, especially on data set 1.
• Miss: The BOCPD-BLS algorithm shows fewer misses, particularly on the data sets with either baseline shifts or slope shifts (data sets 2, 5, and 6).
• Delay: The BOCPD-BLS algorithm shows less delay on the data sets with baseline mean shifts or the first discrete differences (data sets 2, 3, and 4).
Table 3. The mean values of the F-score, the number of missed change points, the delay, and the number of duplicated detections in each partition. Each value represents 800 detection attempts (eight λ values (10 ≤ λ ≤ 1000) and one hundred iterations with random seeds). The better performance in terms of each metric on each data set is highlighted in bold. The ± sign precedes the standard error of the mean (SEM) multiplied by a factor of three ($3\sigma_{\mathrm{SEM}}$). The F-scores for all data sets show favorable results for the BOCPD-BLS algorithm.
Fig. 13. F-score, Miss, Delay, and Duplication on the six data sets. Each data point consists of 100 sets of change point detection attempts on the synthetic data sets with different random seeds. The error bar on each data point indicates the standard error of the mean (SEM) multiplied by a factor of three ($3\sigma_{\mathrm{SEM}}$).
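The error bars and ± values described in the captions can be computed as follows. This is a minimal sketch that assumes the SEM is the sample standard deviation (with Bessel's correction) divided by the square root of the number of runs; whether the original study used this exact estimator is not stated.

```python
import numpy as np


def mean_and_3sem(values):
    """Return the mean and three times the standard error of the mean."""
    v = np.asarray(values, dtype=float)
    sem = v.std(ddof=1) / np.sqrt(len(v))  # sample standard deviation / sqrt(n)
    return v.mean(), 3.0 * sem


# Toy scores standing in for one metric over repeated runs.
m, err = mean_and_3sem([1, 2, 3, 4, 5])
```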

Conclusions
The present work contributes to extending the application of BOCPD to time series data sets with mean and slope shifts in the presence of long-term baseline shifts in arbitrary directions. The proposed extension of the BOCPD algorithm, called BOCPD-BLS, successfully adapts to such situations by feeding back information on a detected change point to guide the subsequent detection activities. This feedback process enables the algorithm to reinitialize the underlying baseline distribution, allowing it to maintain high detection sensitivity regardless of the offset of the $x_t$ values compared to the original baseline. Through a validation study, the proposed method has been confirmed to be particularly effective in the presence of baseline shifts, but it is also found to make the detection results less sensitive to the λ value, thus enabling better performance even in the absence of baseline shifts.

Appendix: BOCPD-BLS Algorithm
The BOCPD-BLS algorithm is a simple extension of the original BOCPD algorithm. Algorithm 1 describes a batch version for detecting all change points using all data from the beginning of a time series until the most recent datum. For online detection, all parameters need to be cached for continuous reuse as each new datum is observed. As shown in Algorithm 1, the core idea of detecting change points based on the run-length distribution is inherited