Nonconvex Energy Minimization with Unsupervised Line Process Classifier for Efficient Piecewise Constant Signals Reconstruction

In this paper, we focus on the problem of signal smoothing and step detection for piecewise constant signals. This problem is central to several applications such as human activity analysis, speech or image analysis, and anomaly detection in genetics. We present a two-stage approach that approximates the well-known line process energy, which arises from the probabilistic representation of the signal and its segmentation. In the first stage, we minimize a total variation (TV) least-squares problem to detect the majority of the continuous edges. In the second stage, we apply a combinatorial algorithm to filter all the false jumps introduced by the TV solution. The performance of the proposed method was tested on several synthetic examples. In comparison with recent step-preserving denoising algorithms, the proposed scheme is substantially faster while offering competitive step-detection quality.


1. Introduction
The problem of removing noise from piecewise constant (PWC) signals occurs naturally in several fields of applied science such as genomics [1,2], nanotechnology [3], image analysis [4] and finance [5,6]. The main task is to recover the true signal from a noisy measurement without losing the important information carried by the abrupt jumps. The PWC structure implies that the denoising task can be interpreted in two different ways. First, the values of the plateaus constitute a reconstruction of the true signal. Second, the positions of the jumps give a direct segmentation of the true signal. Therefore, the solution is a joint restoration and segmentation, which leads to better performance than executing each task separately [7].
We focus on the numerical implementation of recovering a one-dimensional discrete PWC signal from noisy measurements. The observed signal is defined as $y = u + \varepsilon$, where $y = (y_0, \ldots, y_n)^T \in \mathbb{R}^{n+1}$, $u = (u_0, \ldots, u_n)^T \in \mathbb{R}^{n+1}$ is a PWC signal, and $\varepsilon = (\varepsilon_0, \ldots, \varepsilon_n)^T \in \mathbb{R}^{n+1}$ is i.i.d. Gaussian noise with variance $\sigma^2$. The goal is to efficiently restore the optimal PWC signal from the observed measurement $y$.
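To make this setting concrete, here is a minimal Python sketch (our own illustration, not code from the paper; `make_pwc_signal` is a hypothetical helper) that generates a PWC signal $u$ and the noisy observation $y = u + \varepsilon$:

```python
import numpy as np

def make_pwc_signal(n=1000, n_jumps=10, jump_height=4.0, sigma=4.0, seed=0):
    rng = np.random.default_rng(seed)
    # jump positions chosen uniformly at random (first index of each new plateau)
    positions = np.sort(rng.choice(np.arange(1, n), size=n_jumps, replace=False))
    u = np.zeros(n + 1)
    level, start = 0.0, 0
    for pos in np.append(positions, n + 1):
        u[start:pos] = level
        level += jump_height * rng.choice([-1.0, 1.0])   # jump of height exactly jump_height
        start = pos
    eps = rng.normal(0.0, sigma, size=n + 1)             # i.i.d. Gaussian noise
    return u, u + eps

u_true, y = make_pwc_signal()
```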
The relation between $u$ and $y$ alone is not sufficient to compute a solution [8]. It is mandatory to take advantage of additional prior knowledge about the initial signal in order to fix a set of acceptable solutions. In the MRF framework, the restoration is obtained as the minimizer of an energy of the form
$$E(u) = \sum_{i=0}^{n} (y_i - u_i)^2 + \lambda\,\varphi(u). \tag{1}$$
The first term in the energy $E$ represents the fidelity term, where the choice of the $L_2$ norm corresponds to the assumption that $\varepsilon$ is white Gaussian noise. The second term encodes the prior knowledge and forces the solution $u$ to exhibit a set of given features. Finally, the parameter $\lambda > 0$ controls the trade-off between the fidelity term and the prior knowledge [13]. The choice of $\lambda$ is a difficult problem in itself [14]. Therefore, the energy in (1) needs to be minimized for different values of $\lambda$ in order to find the best restoration.
In the MRF framework, the prior is generally expressed as a sum of Interaction Penalties (IP) over pairwise cliques (higher-order cliques are also useful [15]). For PWC signals, the IP specify a smoothness assumption on the homogeneous plateaus, while this constraint must be switched off at the abrupt jumps so as to avoid over-smoothing. Since the seminal work of Geman and Geman [16], several non-convex IP have been considered for PWC denoising [17,13,18]. We emphasize the truncated quadratic interaction [12]
$$V(u_p, u_q) = \min\{\alpha, (u_p - u_q)^2\},$$
which is equivalent to the Line Process (LP) introduced in [16]. With this choice, the regularizer term $\varphi(u)$ is given as the sum of the IP $V(u_i, u_{i+1})$ over binary cliques of the form $(i, i+1)$:
$$\varphi(u) = \sum_{i=0}^{n-1} V(u_i, u_{i+1}). \tag{2}$$
The LP model adds a boolean variable $l \in \{0,1\}^n$ that explicitly indicates the existence ($l_i = 1$) or absence ($l_i = 0$) of a discontinuity. The state of the variable $l_i$ is defined by a threshold $h$, called the sensitivity, as follows:
$$l_i = \begin{cases} 1 & \text{if } |u_{i+1} - u_i| > h, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
The energy is then given by
$$E(u, l) = \sum_{i=0}^{n}(y_i - u_i)^2 + \lambda \sum_{i=0}^{n-1}(1 - l_i)(u_{i+1} - u_i)^2 + \alpha \sum_{i=0}^{n-1} l_i, \tag{4}$$
where $\alpha = \frac{\lambda h^2}{2}$ is the penalty for introducing a discontinuity [19].
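For illustration, the following sketch (ours, not from the paper) evaluates the line-process energy (4) for a given signal $u$ and line process $l$; the convention $\alpha = \lambda h^2/2$ follows the reconstruction above and should be checked against the original formulation.

```python
import numpy as np

def lp_energy(u, l, y, lam, h):
    """Line-process energy (4): fidelity + switched smoothness + jump penalty."""
    alpha = lam * h ** 2 / 2.0                 # penalty for introducing a discontinuity
    du = np.diff(u)                            # u_{i+1} - u_i, one value per edge
    fidelity = np.sum((y - u) ** 2)
    smoothness = lam * np.sum((1 - l) * du ** 2)
    jump_penalty = alpha * np.sum(l)
    return fidelity + smoothness + jump_penalty
```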
The minimization of the energy in (4) is NP-hard [20], as it is non-convex, non-smooth at the minimizer and involves mixed continuous and binary variables. Several algorithms have been proposed to compute an approximation of the solution. We mention convex relaxation techniques [21,22], the Simulated Annealing approach [23], sequential tree-reweighted message passing [24] and graph-cut-based algorithms [12,25,26]. We refer the reader to [27,28] for a thorough investigation of the literature.
The main contributions of this paper are as follows:
• Minimize a total variation (TV) regularized least-squares problem with the $L_1$ penalty in order to find an approximation $\hat{u}$ to (4).
• Incorporate the findings in $\hat{u}$ to propose an efficient pruning scheme for the Discontinuity Position Sweep (DPS) combinatorial search.

Figure 1. Flowchart of the proposed algorithm.
The main scheme of our method is depicted in Fig. 1. We demonstrate the effectiveness of this scheme through a series of simulations on the synthetic signals proposed in [30,31,32]. The results show a significant reduction in computation time and a gain in robustness for extremely corrupted signals.
2. Optimization algorithm

2.1. Total variation denoising for line process classification

TV denoising is widely used in noisy signal processing [33]. The solution is implicitly defined as the global minimum of a convex energy function similar to (4), involving a data fidelity term and a regularizer. For PWC signals, the most common choice of regularizer is the $L_1$ norm of the gradient, and the resulting convex optimization problem is
$$\min_{u \in \mathbb{R}^{n+1}} \ \frac{1}{2}\sum_{i=0}^{n} (y_i - u_i)^2 + \lambda \sum_{i=0}^{n-1} |u_{i+1} - u_i|. \tag{5}$$
In the MRF framework, the TV energy functional in (5) corresponds to an IP function given by $V(u_p, u_q) = |u_p - u_q|$. Compared to the truncated quadratic function $\min\{\alpha, (u_p - u_q)^2\}$, this interaction potential is simpler: it is convex and leads to a strictly convex energy functional. This energy function therefore admits a unique minimizer $u^*$, whatever the data $y$.
The choice of the regularization parameter $\lambda$ turns out to be a difficult problem in itself [3]. Several works have studied the properties and characteristics of this nonlinear TV denoiser. We refer the reader to [34,35,36] for various insights on the topic.
The energy in (5) is generally tackled by fixed-point methods [37] with optimal theoretical complexity. In contrast, we choose the fast non-iterative TV denoising algorithm of [29], which has a higher theoretical complexity than fixed-point methods. Experimentally, however, it achieves a competitive or even faster, effectively linear, running time and can handle large signals in a few milliseconds.
The algorithm of [29] works on a dual formulation of (5),
$$\min_{v \in \mathbb{R}^{n}} \ \|y - Q^T v\|_2^2 \quad \text{subject to} \quad \|v\|_\infty \le \lambda, \tag{6}$$
where $Q$ is the difference operator defined in Section 2.2. Once the dual solution $v^*$ is computed, we recover the primal solution $u^*$ by
$$u^* = y - Q^T v^*. \tag{7}$$
The minimizer is characterized by the Karush-Kuhn-Tucker conditions [38,29]:
$$|v^*_k| \le \lambda \ \text{for all } k, \qquad |v^*_k| = \lambda \ \text{whenever } u^*_{k+1} \ne u^*_k, \tag{8}$$
with the sign of $v^*_k$ given by the direction of the jump. The algorithm solves (6) by a non-iterative approach that constructs the constant parts of the signal while respecting the constraints in (8). In more detail, the algorithm starts by defining an index $k_0$ representing the start of the current plateau (for the first plateau, $k_0 = 0$). Then, for each point $k = k_0 + 1, \ldots, n$, the algorithm tries to extend the current plateau by adding $k$ to it. There are three possible cases, corresponding to the comparison between $u^*_{k-1}$ and $u^*_k$. If the equality $u^*_{k-1} = u^*_k$ respects the constraints in (8), the algorithm adds $k$ to the plateau and moves to the next index $k + 1$. Otherwise, it declares $[k_0, k-1]$ a plateau of its own, marks $k_0 = k$ as the start of a new plateau and continues from there. The sign and the height of the jump between $u^*_{k-1}$ and $u^*_k$ are computed from an estimated value of $v^*_k$. We refer the reader to [29] for a detailed description of the algorithm.
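For readers who want to experiment, here is a minimal iterative stand-in for the TV denoiser (5); it is not the non-iterative algorithm of [29], but a plain projected-gradient scheme on the dual (6), which converges to the same unique minimizer.

```python
import numpy as np

def tv_denoise(y, lam, n_iter=5000, tol=1e-9):
    """Projected gradient on the dual of (5): v stays in [-lam, lam], u = y - Q^T v."""
    y = np.asarray(y, dtype=float)
    v = np.zeros(len(y) - 1)                 # one dual variable per edge
    tau = 0.25                               # step size <= 1 / ||Q||^2
    u = y.copy()
    for _ in range(n_iter):
        u_new = y + np.diff(v, prepend=0.0, append=0.0)    # primal recovery u = y - Q^T v
        v = np.clip(v + tau * np.diff(u_new), -lam, lam)   # gradient step + box projection
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u
```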
The solution $u^*$ to the problem in (5) is generally overfitted, as it suffers from the well-known staircase effect, which produces multiple jumps around the true discontinuity. To illustrate this effect, we used the algorithm to restore the signal depicted in Fig. 2. The signal has a length of 1000 and 10 plateaux. The minimal jump between these plateaux is defined as $H = \min_k |u_{k+1} - u_k|$ and equals 4. We considered two noisy versions with white Gaussian noise of known standard deviation $\sigma$. In the first example (first row), the noise has a standard deviation $\sigma = 4$, and for the second example (second row) we increased the value of $\sigma$ to 8. For a PWC signal, the Signal-to-Noise Ratio (SNR) is defined as $H/\sigma$, which gives an SNR of 1 in the first case and 0.5 in the second. In Fig. 2, we present the two restorations in the first column. We can see that the solution has multiple false jumps, marked by a red plus in the second column of the figure. We also remark that the number of these false jumps increases considerably for smaller values of the SNR.
The solution $\hat{u}$ of (5) is generally overfitted, as it contains additional false jumps. This is due to the convex $L_1$ regularizer, which produces a stair-casing effect.
In order to filter all the false jumps produced by the TV solution $u^*$, we store its segmentation result and use our algorithm, called Discontinuity Position Sweep (DPS), to refine these jumps. More precisely, we define a binary vector $p = (p_0, \ldots, p_{n-1}) \in \{0,1\}^n$ such that
$$p_i = \begin{cases} 1 & \text{if } u^*_{i+1} \ne u^*_i, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$
Any edge $(u_i, u_{i+1})$ with $p_i = 0$ is considered continuous and its state will never change. In contrast, discontinuous edges $(u_i, u_{i+1})$, with $p_i = 1$, are questionable and need further processing, as they could correspond to a false jump. Hence, when we apply our combinatorial algorithm DPS, we never question the state of a continuous edge but only try to refine the results on the discontinuous ones.

Figure 2. Total variation denoising with the $L_1$ prior. The signal contains 1000 data points, with a minimum jump $H = 4$. In the first example (first row), where the SNR is 1, the denoiser catches all the abrupt jumps but produces additional false ones. All the produced jumps are marked by a red plus in the second column. For a higher noise level (second row), corresponding to an SNR of 0.5, the number of false jumps increases considerably.
This method is efficient because the number of jumps is much smaller than the signal size. Therefore, the number of positions considered by DPS is greatly reduced. In more detail, if we define $K$ as the number of jumps reported by the TV solution $u^*$, then $K$ is small compared to $n$ and the ratio of continuous edges
$$r = \frac{n - K}{n} \tag{10}$$
is close to 1. For example, for the two restorations in Fig. 2, $r$ is 97% for the first row and 93% for the second.
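As an illustration of this classification step, a short sketch (hypothetical helper names, assuming a TV solution `u_star`) that computes $p$ in (9), the number of reported jumps $K$ and the ratio $r$ in (10):

```python
import numpy as np

def classify_edges(u_star, eps=1e-8):
    """Edge classification from the TV solution: p_i = 1 marks a questionable edge."""
    p = (np.abs(np.diff(u_star)) > eps).astype(int)
    K = int(p.sum())                 # number of jumps reported by the TV stage
    r = (len(p) - K) / len(p)        # ratio of continuous edges, eq. (10)
    return p, K, r
```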

2.2. Combinatorial line process search
Once we have obtained the TV denoiser segmentation $p$, we apply our algorithm DPS to filter the false jumps. We transform (4) into a purely combinatorial optimization problem over the LP $l$, and we incorporate the results of the previous classifier as additional constraints to reduce the search domain $L$.

The main idea of DPS is to eliminate the continuous variable $u$ from the mixed problem (4). We observe that, for a fixed line process $l$, the energy becomes
$$E_l(u) = \sum_{i=0}^{n}(y_i - u_i)^2 + \lambda \sum_{i=0}^{n-1}(1 - l_i)(u_{i+1} - u_i)^2 + \alpha \sum_{i=0}^{n-1} l_i. \tag{11}$$
This energy is quadratic and strictly convex. Furthermore, finding its unique minimizer comes at a low price, since it only involves solving a tridiagonal system. Given this fact, we rewrite (4) as a purely combinatorial problem over $l$ and we propose an efficient search strategy (Fig. 3) to seek the optimal configuration.
The global unique minimum of (11) is characterized by the stationarity condition $\nabla E_l(u^*) = 0$ and is given as the solution of the following linear system:
$$\left(I + \lambda\, Q^T L_l Q\right) u^* = y, \tag{12}$$
where $Q$ is the $n \times (n+1)$ matrix representing the difference operator, $(Qu)_i = u_{i+1} - u_i$, and $L_l$ is the $n \times n$ diagonal matrix holding the coefficients of $l' = 1 - l$, such that
$$(L_l)_{i,i} = l'_i = 1 - l_i. \tag{13}$$

Proof. Let us compute the partial derivative of $E_l(u)$ with respect to $u_i$:
$$\frac{\partial E_l}{\partial u_i}(u) = -2(y_i - u_i) + 2\lambda(1 - l_{i-1})(u_i - u_{i-1}) - 2\lambda(1 - l_i)(u_{i+1} - u_i),$$
with the boundary terms involving indices outside $\{0, \ldots, n-1\}$ omitted. Hence each component of the optimal solution $u^*_i$ is characterized by the following equation:
$$u^*_i + \lambda(1 - l_{i-1})(u^*_i - u^*_{i-1}) - \lambda(1 - l_i)(u^*_{i+1} - u^*_i) = y_i. \tag{14}$$
We show that this set of equations can be written in the compact matrix form stated in (12). To this end, let us compute the coefficients of the matrix $R_l = Q^T L_l Q$. We start by computing the coefficients of $Q^T L_l$:
$$(Q^T L_l)_{i,k} = \sum_{j} Q_{j,i}\,(L_l)_{j,k} = Q_{k,i}\,(1 - l_k).$$
For the last equality, we used the fact that $L_l$ is diagonal and the only nonzero element of its $k$-th column is $(L_l)_{k,k} = 1 - l_k$. Finally, we can compute the coefficients of the matrix $Q^T L_l Q$:
$$(Q^T L_l Q)_{i,j} = \sum_{k} Q_{k,i}\,(1 - l_k)\,Q_{k,j}.$$
These coefficients are null except for the three cases $(i, i-1)$, $(i, i)$ and $(i, i+1)$, which take the values $-(1 - l_{i-1})$, $(1 - l_{i-1}) + (1 - l_i)$ and $-(1 - l_i)$, respectively. Hence the proof, as the set of equations in (14) is exactly the matrix equation $(I + \lambda Q^T L_l Q)\,u^* = y$.

We denote by $E^*_l$ the energy associated with $u^*$. The mixed problem (4) is then equivalent to finding the LP that minimizes this reduced energy, $l^* = \arg\min_{l \in L} E^*_l$. An exhaustive search over all configurations of $L = \{0,1\}^n$ has exponential complexity. Therefore, we structure the domain $L$ into an $n$-hypercube. We recall that an $n$-hypercube is a regular graph (each node has exactly $n$ edges) containing $2^n$ nodes and $2^{n-1} n$ edges [39]. The nodes of the $n$-hypercube are encoded as $n$-bit strings using a Gray code, which ensures that two neighboring LP differ in only one bit.
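The inner solve used by DPS can be sketched as follows; this is our own implementation choice (assembling the tridiagonal system (12) explicitly and solving it with SciPy's banded Cholesky solver in $O(n)$), and `fixed_lp_energy` evaluates the reduced energy $E^*_l$ under the $\alpha = \lambda h^2/2$ convention used above.

```python
import numpy as np
from scipy.linalg import solveh_banded

def solve_fixed_line_process(y, l, lam):
    """Solve (I + lam * Q^T L_l Q) u = y, a symmetric positive definite tridiagonal system."""
    y = np.asarray(y, dtype=float)
    lp = 1.0 - np.asarray(l, dtype=float)     # l' = 1 - l, one coefficient per edge
    n1 = len(y)                               # n + 1 unknowns
    diag = np.ones(n1)
    diag[:-1] += lam * lp                     # coupling with the right neighbour
    diag[1:] += lam * lp                      # coupling with the left neighbour
    ab = np.zeros((2, n1))                    # upper banded storage for solveh_banded
    ab[0, 1:] = -lam * lp                     # super-diagonal entries of (12)
    ab[1, :] = diag
    return solveh_banded(ab, y)

def fixed_lp_energy(y, l, lam, h):
    """Reduced energy E*_l of (11) evaluated at the minimizer of the fixed-l problem."""
    u = solve_fixed_line_process(y, l, lam)
    l = np.asarray(l, dtype=float)
    alpha = lam * h ** 2 / 2.0
    energy = (np.sum((y - u) ** 2)
              + lam * np.sum((1 - l) * np.diff(u) ** 2)
              + alpha * np.sum(l))
    return energy, u
```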
The first pruning strategy eliminates all the line processes that are incompatible with the denoising results. That is, we only keep the configurations that are consistent with the previous segmentation $p$. First, we define the set of positions of the continuous edges $(u_i, u_{i+1})$:
$$C = \{\, i \in \{0, \ldots, n-1\} : p_i = 0 \,\}.$$
According to the results of the TV segmentation, those edges are not candidates for a discontinuity. Hence, any LP $l$ that has a one at an index in $C$ is not consistent with the TV findings and is discarded. As a result, we denote by $L_C$ the subset of the $n$-hypercube that regroups the line processes that are null at all the indices in $C$, and we reduce the search space to this subset:
$$l^* = \arg\min_{l \in L_C} E^*_l. \tag{23}$$
Additionally, DPS performs a breadth-first search strategy where the depth of the search tree is defined by the levels of the hypercube. For this purpose, we decompose the $n$-hypercube into disjoint sets $L_k$, called levels. Each level $L_k$ contains the line processes that have exactly $k$ nonzero bits. Ultimately, we rewrite the problem (23) as a sequence of subproblems $(P_k)_{k \in \{0, \ldots, n\}}$ such that
$$(P_k): \quad l^{k,*} = \arg\min_{l \in L_C \cap L_k} E^*_l. \tag{24}$$
The second pruning strategy, depicted in Fig. 3, uses the information provided by the previous level to limit the search to the neighbors of its optimal solution. Suppose we have the optimal configuration $l^{k,*}$ of the problem $P_k$; the solution $l^{k,*}$ already gives us the positions of $k$ discontinuities. So, in the search for the solution $l^{k+1,*}$, we keep those discontinuities and only seek to add a new one. This means that we add the constraint that $l^{k+1,*}$ must be a neighbor of $l^{k,*}$. We denote by $\mathcal{V}(l^{k,*})$ the neighbors of $l^{k,*}$, and we transform the sequence $(P_k)_{k \in \{0, \ldots, n\}}$ defined in (24) into a recursive sequence $(\tilde{P}_k)_{k \in \{0, \ldots, n\}}$ as follows:
$$(\tilde{P}_k): \quad l^{k,*} = \arg\min_{l \in L_C \cap L_k \cap \mathcal{V}(l^{k-1,*})} E^*_l.$$
The resulting algorithm (Alg. 1) follows a top-down paradigm that starts with no discontinuities and, at each level, introduces a new step if the energy decreases. Otherwise, it stops the search and reports the current solution as optimal.

2.3. Complexity analysis
The complexity of the algorithm is $O(emn^2)$, where $n$ is the size of the signal, $m$ is the number of jumps and $e = 1 - r$, with $r$ the ratio of continuous edges classified by the TV denoiser (10). Indeed, to restore such a signal, we first call the TV denoiser to attenuate the noise, which has a quadratic cost $O(n^2)$ in the worst case [29]. After that, DPS checks $en$ nodes in the first level $L_0$, $en - 1$ in the second level $L_1$, and so forth until the level $L_{m+1}$ is reached. The energy of each node involves solving the tridiagonal symmetric system (12) with an adapted solver that costs $O(n)$. The complexity of the two stages is therefore $O(n^2) + O(emn^2)$. As a result, the worst-case complexity of the algorithm is $O(mn^2)$, attained when the TV denoiser fails to classify any continuous edge ($e = 1$). In practical situations, the elimination rate is very significant (Fig. 4), since the number of jumps is small compared to the number of observations. In practice, we can thus achieve a linear complexity $O(mn)$.

Algorithm 1. DPS algorithm with classified line process.
Input: $\lambda$, $h$, $y$.
Find $\hat{u}$ in (5) using the proposed TV denoiser.
Compute the set $C$ of the classified continuous positions.
$l^* = 0$  # initial line process with zeros
Repeat: among the line processes that open one additional discontinuity of $l^*$ at a position outside $C$, select the one with the lowest energy $E^*_l$; if this energy is lower than that of $l^*$, accept it; otherwise stop.
Output: $l^*$ and the associated restoration $u^*$ obtained from (12).
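A compact Python sketch of Algorithm 1, written from the description above (names and the exact stopping rule are our reading of the text, not the authors' code); it reuses the hypothetical helpers `tv_denoise`, `classify_edges` and `fixed_lp_energy` sketched earlier.

```python
import numpy as np

def dps(y, lam, h):
    u_tv = tv_denoise(y, lam)                        # stage 1: TV denoising
    p, K, r = classify_edges(u_tv)                   # edge classification (9)-(10)
    candidates = set(np.flatnonzero(p).tolist())     # only questionable edges may open
    l = np.zeros(len(y) - 1, dtype=int)              # level 0: no discontinuity
    best_energy, best_u = fixed_lp_energy(y, l, lam, h)
    while candidates:
        # level k -> k+1: evaluate every admissible single additional discontinuity
        trials = []
        for i in candidates:
            l[i] = 1
            energy, u = fixed_lp_energy(y, l, lam, h)
            trials.append((energy, i, u))
            l[i] = 0
        energy, i_best, u = min(trials, key=lambda t: t[0])
        if energy >= best_energy:                    # no decrease: stop and report
            break
        best_energy, best_u = energy, u
        l[i_best] = 1                                # keep the new discontinuity
        candidates.remove(i_best)
    return best_u, l
```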

3. Numerical experiments
In this section, we test the performance of our algorithm against a set of state-of-the-art algorithms for jump detection and PWC signal reconstruction.

3.1. Example 1
In the first example, we apply our algorithm to a synthetic example depicted in Fig. 5. The example is taken from [30] to exemplify the stair-casing effect that emerges from using the $L_1$ norm in the prior term. The signal $y \in \mathbb{R}^n$ contains $n = 1024$ observations with additive white Gaussian noise of standard deviation $\sigma = 0.5$. The original PWC signal is produced by the function MakeSignal from the WaveLab framework using the 'blocks' option (Fig. 5a). For the two stages, we set $\lambda = 3$, and for DPS we choose a sensitivity $h = 1$. Figure 5c displays the improvement brought by the DPS algorithm, as all the false jumps produced by the TV denoiser (Fig. 5b) are eliminated. This is reflected in the Root Mean Square Error (RMSE), as DPS obtains a lower RMSE than the TV denoiser. Furthermore, the DPS error is lower than the one reported in [30] with nonconvex penalties.
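A hypothetical end-to-end run mirroring the protocol of this example on the synthetic generator sketched in the introduction (not the actual WaveLab 'blocks' signal), using the helper sketches above:

```python
import numpy as np

u_true, y = make_pwc_signal(n=1023, n_jumps=10, jump_height=4.0, sigma=0.5, seed=1)
u_hat, l_hat = dps(y, lam=3.0, h=1.0)
rmse = np.sqrt(np.mean((u_hat - u_true) ** 2))
print(f"detected jumps: {int(l_hat.sum())}, RMSE = {rmse:.3f}")
```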

3.2. Example 2
In the second example, we consider the signal from [31], where the authors use a class of nonconvex penalties that penalize the sparsity of the signal more tightly than the $L_1$ norm. The signal is synthetically generated to illustrate the stair-casing problem; it contains two monotone positive jumps of height $a$, with $y \in \mathbb{R}^{200}$ (Fig. 6a). The noise is drawn independently from a white Gaussian distribution with standard deviation $\sigma = 1$. DPS successfully filters all the false jumps generated by the TV denoiser (Fig. 6b) and produces the correct value of each plateau. In addition, we successfully repeat the experiment for higher noise levels, up to a standard deviation $\sigma = 12$ (Fig. 6c).

3.3. Example 3
In the third simulation, we apply the DPS algorithm to detect stepping motion in the trajectory of molecular motors in biology [40]. The detection of these steps allows the analysis of physical and chemical properties of single DNA-based molecular machines such as polymerases [32]. The problem consists of finding motion positions from measurements of the length of a molecular machine in base pairs (bp) [41]. A bp is a unit consisting of two nucleobases bound to each other (e.g., A-T or G-C), and a motion is characterized by an abrupt change in the length of the molecular motor. We use the example studied in [32]; it contains experimental data of RNA polymerase II. The initial signal, depicted in Fig. 7, contains $n = 5 \times 10^4$ noisy measurements of the length of the polymerase in bp within a time interval $t \in [0, 2]$ seconds.
We applied DPS to reconstruct and detect steps in this signal. We chose $\lambda = 10$ for the TV denoiser, and $\lambda = 20$ and $h = 0.55$ for DPS. The reconstruction took 45 ms and is depicted in Fig. 7. In the same figure, we present the results of a state-of-the-art algorithm called the Energy Based Scheme (EBS), used in the same paper [32]. As shown in Fig. 7, our algorithm has a higher step-detection quality, as it only misses one hard jump in the middle while EBS misses several of them. In terms of signal restoration, we also compared the Mean Square Error (MSE) of both methods. The results are presented in Table 2 and show that we obtained a lower MSE.

In order to assess the reconstruction and segmentation quality of our algorithm, we compared three classification metrics, precision, recall and $f_1$-score, with a set of state-of-the-art algorithms. To define these metrics, we denote by $TP$ the number of correct jumps detected by the algorithm, $FP$ the number of false jumps, $FN$ the number of missed jumps and finally $TN$ the number of true continuous positions. The three metrics are expressed as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad f_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
Precision represents the ratio of correctly reported jumps; a perfect value of 1 means that all the reported jumps are true. Recall measures the capacity of the algorithm to find all the jumps; a perfect value of 1 means that the algorithm detected all the jumps. Finally, the $f_1$-score combines precision and recall by taking their harmonic mean. The compared algorithms are: PELT [42], a combinatorial algorithm with a pruning strategy that gives the optimal solution with the $L_2$ cost function; a Binary Segmentation scheme [43,44], a recursive algorithm that detects the jumps one by one; a Window-sliding algorithm [45], which considers a local cost function over a window and slides it along the signal to detect the jumps one by one; and a Bottom-Up algorithm [46], which takes the dual approach to binary segmentation, starting from a signal that has all possible jumps and successively deleting the false ones. The results are depicted in Fig. 8 for a sequence of noise standard deviations $\sigma$ varying from $10^{-2}$ to 1.8. The figure shows that our algorithm offers the highest precision while keeping a stable recall. This superiority is reflected in the combined $f_1$-score, where our algorithm keeps a high score $f_1 \approx 0.8$ even at the high noise level $\sigma = 1.8$, whereas the $f_1$-score of the other algorithms drops significantly.

Figure 8. The first panel (left) plots the precision; DPS has the highest precision. The second panel (center) plots the recall; our algorithm has a lower recall but keeps it consistent even with strong noise. The third panel (right) combines the two metrics as the $f_1$-score and shows that DPS offers the best detection quality.
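The evaluation can be sketched as follows; matching detected jumps to true positions within a small tolerance is our own assumption, since the paper does not spell out the matching rule.

```python
import numpy as np

def detection_metrics(true_jumps, detected_jumps, tol=2):
    """Precision, recall and f1-score for jump detection with a position tolerance."""
    detected = [int(d) for d in detected_jumps]
    unmatched = set(int(t) for t in true_jumps)
    tp = 0
    for d in detected:
        close = [t for t in unmatched if abs(t - d) <= tol]
        if close:
            unmatched.remove(min(close, key=lambda t: abs(t - d)))
            tp += 1
    fp = len(detected) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```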

3.4. Example 4
In the fourth example, we compare the classification results of the TV denoiser and the classical implementation of DPS to illustrate the differences between them. We consider the mean-shift signal from the ruptures package [47]. Examples of these synthetic signals are depicted in Fig. 9. Each example has size $n = 500$ and 10 jumps. The goal is to consider several noise levels and measure the classification metrics. Hence, we generated a sequence of values $\sigma_i$ in $[0.1, 2]$ with a step of 0.1. For each case, we measure the classification metrics on 10 corrupted signals. The results of these metrics are reported in Fig. 10.
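A sketch of the data generation for this experiment, using the piecewise-constant (mean-shift) generator of the ruptures package [47]; the exact call signature is an assumption and should be checked against the installed ruptures version.

```python
import numpy as np
import ruptures as rpt   # assumed API: rpt.pw_constant(...)

sigma = 1.0
signal, bkps = rpt.pw_constant(n_samples=500, n_features=1, n_bkps=10, noise_std=sigma)
y = signal.ravel()                      # 1-D noisy observation for both denoisers
true_jumps = np.array(bkps[:-1]) - 1    # edge indices; ruptures appends n_samples as the last breakpoint
```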
We can see that both algorithms have the same recall. This is the essence of our combination method, as a point missed by the TV denoiser will most likely be missed by DPS. For the precision, the algorithms differ: DPS offers a better precision for low and moderate noise, $\sigma \in [0.1, 1.4]$. This is not the case for stronger noise, $\sigma > 1.4$, where the TV denoiser offers a better precision.

Figure 9. Examples of the mean-shift signal from [47]. The signals have a size $n = 500$ and contain 10 jumps. Each one is corrupted with a different noise standard deviation $\sigma$.

3.5. Example 5

In this example, we compare the performance and running time of DPS with and without the LP classifier.
In the first experiment, we study the effect of the noise standard deviation $\sigma$ on the $f_1$-score for both implementations. We considered a regular sequence of values $\sigma_i \in [0.5, 5]$ with a step of 0.2. For each value $\sigma_i$, we generate 10 instances of the mean-shift synthetic signal and measure the $f_1$-score of both algorithms. The results of this simulation are depicted in Fig. 11 (left). We can see that at lower noise levels both algorithms have the same score. Nevertheless, as the noise power increases, the combined version is more robust, with a higher score. This is due to the TV denoiser, which eliminates some false jumps that will never be considered by DPS.
In the second experiment, we fix the noise standard deviation to $\sigma = 1.4$ for a moderate noise level. We also fix the sensitivity $h = 0.8$ to its best value and study the effect of the regularization coefficient $\lambda$ on the $f_1$-score. We thereby consider a sequence $\lambda_i \in [5, 90]$ with a step of 20. As in the previous experiment, for each value $\lambda_i$ we generate 10 examples of the mean-shift synthetic signal and measure the performance of each algorithm. We present the results of this experiment in Fig. 11 (middle). We remark that for higher values of $\lambda$ there is no difference between the two implementations. For lower values, however, the combined version with the TV denoiser is more precise, as the classifier helps to prune false positions from the LP hypercube. For the classical implementation of DPS, a lower $\lambda$ encourages smaller plateaux [19] and therefore increases the number of false jumps.
In the last experiment, we compare the $f_1$-score as a function of the sensitivity parameter $h$. We fix $\lambda = 40$ and $\sigma = 1.4$ and consider a sequence $h_i \in [0.4, 6]$ with a step of 0.2. The results of this experiment are depicted in Fig. 11 (right). Firstly, we see that both curves follow the classical shape (increasing, then decreasing). The first regime (lower sensitivity) corresponds to an overfitted solution, where DPS produces solutions with false jumps (lower precision). In the second regime (higher sensitivity), DPS becomes more conservative and only accepts jumps with a higher gradient, and could therefore miss a true jump (lower recall). Secondly, we observe the same benefit of using the TV denoiser at lower sensitivity values, as it helps to eliminate those false jumps. This is reflected in the $f_1$-score, which is higher in the first regime, while the two scores coincide in the second.
Figure 11. $f_1$-score comparison between DPS and our combined version for different hyperparameters and noise powers. In the first scenario (left), we fixed the best choice of the hyperparameters $\lambda$ and $h$ and measured the $f_1$-score for an increasing noise standard deviation $\sigma$; the combined version is more robust to noise, as the TV denoiser helps to eliminate some false jumps. In the second simulation (middle), we tested the effect of $\lambda$ on the $f_1$-score while fixing the sensitivity $h$ and the noise power $\sigma$; here again, the combined version is more robust, especially for lower values of $\lambda$, where DPS tends to accept false jumps. The main reason for this behavior is that $\lambda$ also controls the length of the plateaux: a lower $\lambda$ increases the number of plateaux and jumps. Finally, in the last simulation (right), we study the robustness of both algorithms with respect to the sensitivity parameter $h$. We observe a classical behavior, where the $f_1$-score increases until the best value $h \approx 0.8$ and then starts to decrease, as larger values of $h$ yield a conservative algorithm that only accepts larger jumps.

In the second part of this example, we compare the running time of DPS, with and without the LP classifier, as a function of the signal size, the number of jumps and the noise power. First, we generate 20 realizations of a PWC signal with sizes $n$ ranging from 100 to $10^3$. Each realization has 10 randomly distributed jumps with a minimal height $h = 10$ and is corrupted with white Gaussian noise of standard deviation $\sigma = 8$. Figure 12a plots the running time of each algorithm as a function of the signal size on a logarithmic scale. For the classical DPS, the experimental slope is $\alpha_1 = 2.002$ and, by incorporating the LP classifier, the slope goes down to $\alpha_2 = 1.5702$. Second, we fix the signal size to $n = 10^3$ and consider 20 examples in which we change the number of jumps from 10 to 50. The jumps are uniformly distributed over the signal and we keep the same noise power $\sigma = 8$. The outcome of the experiment is displayed in Fig. 12b, where the left axis (red line) represents the running time of the classical DPS, with an experimental slope $\alpha_1 = 0.9329$, while the right axis (blue line) shows the running time with the LP classifier, whose slope is reduced to $\alpha_2 = 0.6895$. Finally, Fig. 12c illustrates the running time as a function of the noise power. The figure confirms (left axis) that the running time of the classical DPS is independent of the SNR [26]. Conversely, this property is lost for the new implementation (right axis), as the TV classifier is more sensitive to noise and the rate of predicted null processes decreases for lower SNR (Fig. 4).

4. Conclusion
In this paper, we proposed an efficient two-stage algorithm to restore noisy PWC signals while preserving their edges. The method uses TV denoising in the first stage to reduce the noise and classify the continuous edges. In the second stage, we apply a combinatorial algorithm to refine the TV findings by filtering the false jumps. Compared to existing denoising schemes, our algorithm offers a substantially faster running time while maintaining superior restoration and step-detection quality, especially for high-noise data.
The findings might also be extended to higher-dimensional data such as 2-D images. The MRF graph in this case is the classical 2-D lattice with a four-neighbor system [17]. The generalization of the LP model to these graphs, called the weak membrane, was already studied in [19]. The model suffers from a high computational complexity, as the number of line processes is exactly the number of edges of the graph, which is quadratic in the image size $n$. In order to reduce this complexity, the TV denoising could be applied to each row and column of the image to label the continuous edges. After that, we would solve the weak membrane problem without ever touching the edges labeled as continuous by the TV denoising. We are currently investigating this solution.