A New Hybrid Optimizer for Global Optimization Based on Remarks from a Comparative Study of Classical Gradient Descent Variants

In this paper, we present an empirical comparison of gradient descent variants used to solve global optimization problems over large search domains. The aim is to identify which of them is more suitable for solving an optimization problem regardless of the features of the test function used. Five variants of gradient descent were implemented in the R language and tested on a benchmark of five test functions. Using the chi-square test on a sample of 120 experiments, we proved the dependence between the choice of variant and the obtained performance. The test functions vary in convexity and in the number of local minima, and are classified according to several criteria. We chose a range of values for each algorithm parameter, and results are compared in terms of accuracy and convergence speed. Based on the obtained results, we defined the priority of usage of those variants and contributed a new hybrid optimizer. The new optimizer is tested on a benchmark of well-known test functions, and two real applications are proposed. Except for the classical gradient descent algorithm, only stochastic versions of those variants are considered in this paper.


Introduction
Optimization techniques have common applications in fields such as differential calculus, regression models for prediction, shape optimization, topological optimization, and other applications in logistics and graph theory [1]. Optimization is mono-objective when it consists of finding the best solution that optimizes a single given objective [2]. Multi-objective optimization, on the other hand, concerns multiple contradictory criteria for making a decision [3]. Commonly, numerical methods can provide practical and adaptable solutions in both cases. Since finding exact analytical solutions is challenging because of the dimensionality or the nature of the objective function, algorithms such as gradient descent are used to find acceptable solutions within an error margin [4]. One of the main issues with gradient descent variants is how to select the appropriate algorithm according to the problem's features. When applying gradient descent variants to a real application, a practitioner will prefer to use some criteria for making a quick decision, because not all variants have the same performance. The use of a decision technique will help save time, especially when running simulations. For this purpose, we compare the performance of gradient descent variants on a panel of test functions. After that, we apply a chi-square test to help deploy suitable decisions that match the researchers' goals or understanding of a problem.
The paper is organized as follows. Section 2 provides a review of the related work. Section 3 briefly presents mono-objective optimization. Section 4 describes the five variants used. Section 5 presents the test functions as well as the obtained performance results, where the statistical chi-square test and the AHP technique are deployed on our experimental results.

Vanilla gradient descent
If a multi-variable function f is defined and differentiable on a neighborhood of its minimum x*, then f decreases fastest in the direction opposite to its gradient. The algorithm starts by choosing an initial guess solution x0 ∈ D.
The simplest form of update is to change the parameters along the negative gradient direction. The standard gradient descent update is formulated as follows:

x(j, t + 1) = x(j, t) − α · ∇f(x(j, t))

Where:
• x(j, t + 1) represents the j-th component of the current solution x(t + 1), and t + 1 represents the current iteration.
• ∇f(x(j, t)) represents the j-th component of the gradient vector of the function f(x) at the last iteration t.
• α is the learning rate.
For a sufficiently small positive α, each update does not increase the objective, so that f(x(t + 1)) ≤ f(x(t)). Because we need to calculate the gradients over the whole dataset to perform just one update, traditional vanilla gradient descent performs slowly close to the minimum and is hard to control for large datasets that do not fit in memory. The other problem is how to choose the convergence step α.
The step α must be chosen carefully because it determines both the convergence speed and the accuracy of the estimated solution x*. This parameter determines how large an update we perform [5][16].
This version of the algorithm is also called batch gradient descent. It is based on the naive/full computation of the gradient and uses the entire available dataset. Note that gradient computation can be automated; in R, the well-known rootSolve package provides numerical gradients.
When evaluated on the full dataset with a low enough learning rate, batch gradient descent is guaranteed to make non-negative progress on the loss function. It is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
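As an illustration, a minimal R sketch of the update above; the function names and the two-dimensional sphere objective are ours, not the paper's implementation:

grad_descent <- function(grad, x0, alpha = 0.1, max_iter = 1e4, tol = 1e-10) {
  x <- x0
  for (t in seq_len(max_iter)) {
    g <- grad(x)
    x_new <- x - alpha * g              # x(j, t + 1) = x(j, t) - alpha * grad_j f(x(t))
    if (sum(abs(x_new - x)) < tol) break
    x <- x_new
  }
  x
}

# f(x) = x1^2 + x2^2 (sphere), gradient 2x; starting from (5, 5) as in the experiments
grad_descent(function(x) 2 * x, c(5, 5))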
Stochastic gradient descent
The stochastic gradient is a descent method used to minimize an objective function formulated as a sum of M differentiable functions f_i.
The objective function could be formulated as a sum of M terms:

f(x) = Σ_{i=1}^{M} f_i(x)

Estimators that minimize a sum are called M-estimators; they are used in maximum-likelihood estimation and in empirical risk minimization for supervised learning problems. In supervised learning, each function f_i is associated with an observation of the dataset. Evaluating the gradient over the entire dataset may require a lot of computational resources [8][9].
To resolve this problem, the stochastic gradient samples a subset of the sum terms: we choose uniformly at random K functions f_i from the sum, where K ≤ M, and then evaluate the gradient vector [8][9].
The algorithm starts by choosing an initial vector of parameters x0 and a learning rate α. First, we evaluate the gradient for a sum of only K functions f_i chosen uniformly at random; the chosen functions may therefore differ from one iteration to the next. After that, we execute the update equation (4) until an approximation of the minimum is obtained [8][9][1][20]:

x(j, t + 1) = x(j, t) − α · Σ_{i ∈ S_t} ∇f_i(x(j, t)),  where S_t is a uniform sample of K indices    (4)

Note that we can sample a single function f_i at each iteration thanks to the linearity of the gradient operator. If we sample more than one function f_i simultaneously, the variant is called mini-batch stochastic gradient descent. The problem with the stochastic gradient method is the frequent updates and fluctuations, which complicate convergence even when we choose a small step α.
It has been shown that if we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent. The mini-batch version can mitigate the fluctuations around the optimum. For this, it is preferable at each iteration to evaluate the full gradient value for only K randomly chosen dimensions and to keep the other dimensions unchanged until the next iteration [21].
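A minimal R sketch of update (4) under an assumed least-squares setting, with f(x) = Σ f_i(x) over M synthetic observations (all data and names here are illustrative):

set.seed(1)
M <- 1000
A <- matrix(rnorm(M * 2), ncol = 2)        # one row per observation
b <- as.vector(A %*% c(3, -2) + rnorm(M, sd = 0.1))

sgd <- function(x0, alpha = 0.01, K = 10, max_iter = 5000) {
  x <- x0
  for (t in seq_len(max_iter)) {
    idx <- sample(M, K)                    # K terms f_i chosen uniformly at random
    Ai <- A[idx, , drop = FALSE]
    g <- as.vector(2 * t(Ai) %*% (Ai %*% x - b[idx])) / K  # averaged sampled gradient
    x <- x - alpha * g                     # update (4)
  }
  x
}
sgd(c(0, 0))   # approaches the true coefficients (3, -2)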

Momentum
Among the other variants we find the momentum method, which appeared in an article by Rumelhart, Hinton and Williams on back-propagation learning. SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum is a method that helps accelerate SGD in the relevant direction [11][22][23][24].

This update can be motivated from a physical perspective of the optimization problem: the optimization process is equivalent to simulating the parameter vector as a particle rolling on the landscape. The loss can be interpreted as the height of a hilly terrain, and therefore as proportional to the potential energy, since U = mgh and thus U ∝ h. The force felt by the particle is precisely the (negative) gradient of the loss function (F = −∇U), and since F = ma, the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position; the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position [26][24].

The SGD method with momentum keeps in memory the update ∆x(t) at each step and calculates the following update as a convex combination of the current gradient and the previous change. The name comes from an analogy with momentum in mechanics: the parameter vector x(t), considered as a particle traveling through parameter space (often of large dimension), undergoes an acceleration via the gradient, which acts as a force. Unlike the classic SGD method, this variant tends to keep traveling in the same direction, preventing oscillations; the particle is pulled toward the local minimum by a force comparable to the gradient of the objective function [11][17][27][24].

Unlike classical gradient descent, the momentum approach keeps the vector x traveling in the same direction, which prevents oscillations of the loss function. The new update ∆x is a convex linear combination of the last update and the gradient value [28][24][29][30]:

∆x(j, t + 1) = β · ∆x(j, t) + (1 − β) · ∇f(x(j, t))

The term β · ∆x(j, t) represents the previous update of the component x(j, t) multiplied by a weight β. The new ∆x(j, t + 1) is a moving average of the gradient and the last update [10]. Commonly, we choose a weight β such that 0.8 ≤ β ≤ 0.999; by default, we choose β = 0.9. The term α represents the convergence step. We use the same equation as gradient descent with the new update, as follows:

x(j, t + 1) = x(j, t) − α · ∆x(j, t + 1)

Essentially, we drive a ball down a hill by using momentum. As it moves downhill, the ball accumulates momentum, getting faster and faster on the way (until it reaches its terminal velocity if air resistance is present, i.e. β < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction, and decreases updates for dimensions whose gradients change direction. As a consequence, we achieve faster convergence and reduced oscillation.
On the other hand, a particle that blindly rolls down a hill, simply following the slope, is unsatisfying. We would like a smarter particle, one that acts as a ball with a sense of where it is going, so that it knows to slow down before the hill ramps up again [17].
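A minimal R sketch of the two momentum equations above (names and test function are ours):

momentum_gd <- function(grad, x0, alpha = 0.1, beta = 0.9, max_iter = 1e4) {
  x <- x0
  delta <- rep(0, length(x0))
  for (t in seq_len(max_iter)) {
    delta <- beta * delta + (1 - beta) * grad(x)  # convex combination (moving average)
    x <- x - alpha * delta                        # x(j, t + 1) = x(j, t) - alpha * delta
  }
  x
}
momentum_gd(function(x) 2 * x, c(5, 5))   # sphere gradient; converges to (0, 0)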

Adaptive gradient descent
Adagrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones; for this reason, it is well suited for dealing with sparse data. This version of the gradient descent algorithm takes the sparsity of the parameters into account, using a different learning rate for every parameter x(j, t) at every time step t [12][31][17][32].
An interesting application of Adagrad was made by Dean et al., who found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural nets at Google, which, among other things, learned to recognize cats in YouTube videos [17]. At the current iteration τ + 1, Adagrad reduces the convergence step α at points where the gradient varies strongly and increases the step where the gradient value is low [33][12]. For this reason, the algorithm is suitable for sparse data.
Let G(t) be the outer-product matrix of the gradient vector ∇f(x(t)) at the t-th iteration, G(t) = ∇f(x(t)) · ∇f(x(t))^T. The algorithm divides the step α by the square root of the sum of the diagonal elements G(j, j, t) of the matrices G(t) accumulated up to the current iteration τ ∈ N [34]. The update formula is as follows:

x(j, τ + 1) = x(j, τ) − ( α / sqrt( Σ_{t=1}^{τ} G(j, j, t) + ε ) ) · ∇f(x(j, τ))

where G(i, j, t) is the element in row i and column j of the outer matrix G(t) and τ is the number of iterations. We usually have:

G(j, j, t) = ( ∇f(x(j, t)) )²

and ε is a small smoothing term (typically of the order of 10⁻⁸) that avoids division by zero.
Note that Adagrad also performs well on non-convex problems [12][35].
In the Adagrad update rule, we modify the general learning rate α at each time step t for every parameter x(j, t), based on the past gradients that have been computed for x(j, t − 1); here j denotes the j-th dimension of the parameter vector.
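A minimal R sketch of the Adagrad update above, including the smoothing term ε (names and test function are ours):

adagrad <- function(grad, x0, alpha = 0.5, eps = 1e-8, max_iter = 1e4) {
  x <- x0
  G <- rep(0, length(x0))               # running sum of the squared gradients G(j, j, t)
  for (t in seq_len(max_iter)) {
    g <- grad(x)
    G <- G + g^2                        # accumulate the diagonal of the outer products
    x <- x - alpha * g / sqrt(G + eps)  # per-parameter effective learning rate
  }
  x
}
adagrad(function(x) 2 * x, c(5, 5))   # sphere gradient; converges to (0, 0)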
One of Adagrad's key advantages is that it removes the need to manually calibrate the learning rate. The accumulation of squared gradients in the denominator can also be a drawback: since every added term is positive, the cumulative sum keeps increasing during training [17][34][33].
As a consequence, the learning rate shrinks and eventually becomes infinitesimally small, at which point the algorithm can no longer acquire additional information. The RMSProp algorithm is aimed at fixing this weakness [17][34][33].

Root Mean Square Propagation
RMSProp is a gradient descent version similar to Adagrad; the difference is that RMSProp considers both the last value and the current value of the squared gradient when dividing the learning rate by sqrt(S(j, τ)) [13][36][31].
The RMSProp update is formulated as follows:

S(j, t + 1) = γ · S(j, t) + (1 − γ) · ( ∇f(x(j, t)) )²
x(j, t + 1) = x(j, t) − ( α / sqrt( S(j, t + 1) + ε ) ) · ∇f(x(j, t))

where x(j, t + 1) is the j-th component of the N-dimensional solution x that we want to estimate and t represents the last iteration. The coefficient γ is called the memory factor; we choose a value of γ such that 0.1 ≤ γ ≤ 0.9.
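A minimal R sketch of the update above (names and test function are ours):

rmsprop <- function(grad, x0, alpha = 0.01, gamma = 0.9, eps = 1e-8, max_iter = 1e4) {
  x <- x0
  S <- rep(0, length(x0))
  for (t in seq_len(max_iter)) {
    g <- grad(x)
    S <- gamma * S + (1 - gamma) * g^2    # exponential moving average of squared gradients
    x <- x - alpha * g / sqrt(S + eps)
  }
  x
}
rmsprop(function(x) 2 * x, c(5, 5))   # sphere gradient; converges to (0, 0)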

Comparing performances
After presenting the different variants of the gradient descent algorithm, we now measure their performance. Firstly, we define the convergence rate. After that, we give the results of those algorithms applied to a benchmark of five test functions, denoted F1 to F5 below [39][40]. The convergence rate of a sequence is the speed at which its terms converge to a unique value called the limit of the sequence [41]. In our case, we consider an algorithm fast if the related number of iterations is small. The accuracy is measured as the difference between the exact value of the minimum and the quantity obtained using a gradient descent variant. In practice, some sequences converge quickly to the limit value but lack accuracy for a given algorithm.
To conduct our experiments, the algorithms and the test functions were implemented in the R language. We chose [−6, 6] × [−6, 6] as the study domain and executed 8 × 10^5 iterations for each experiment. The initial point is usually (x, y) = (5, 5). Except for the classical gradient descent, only the stochastic versions of the previously presented algorithms are tested. First, we deduce that the smaller the value of α, the more likely the error diminishes for the classical gradient descent and the more accurate the results.
On the other hand, the convergence speed is slow for small values of α, as shown in Table 2. Gradient descent gives good results for both F1 and F4 but fails for F2 and F3, because F2 and F3 are characterized by a large number of regularly distributed local minima and a large search space.
Secondly, we conclude that the stochastic algorithm can deteriorate the results of the classical gradient descent version whether or not the objective functions are convex, which is the case for F1 and F2. Thus, the stochastic version should be used only for large dimensions or for an objective function represented as a sum with a large number of terms. The deterioration of results is mainly explained by fluctuations around the minimum, because we sample components of the gradient vector uniformly at random. Third, the momentum version is suitable for F3, which is characterized by a flat outer region and a large hole in the center, but a bad choice of β can strongly degrade the accuracy of the results. This remark explains the convergence difference between F2 and F3, even though both share the property of having a large number of regularly distributed local minima. To obtain good results, we should avoid small values of α.
Fourthly, the adaptive version can give results close to the solutions (X, Y) for each test function, even when trying different values of the parameters (which is the case for F2 in Table 5). When we obtain close results for the objective function, we can choose a high step, as shown for F3 with α = 0.9 (this sample is not considered in the chi-square test, to avoid bias). The results of Table 5 illustrate that increasing the value of the parameter α improves the chances of reaching the optimum.
Last of all, RMSProp improves the performance of the adaptive version. The effect of using this version is clear for F1: RMSProp ensured a stable convergence by continuing to make progress in the relevant directions while softening the oscillations in irrelevant directions.
For F2, this version gave close results, the same effect as the Adagrad version. This remark illustrates the ability of this version to deal with the sparsity of data. Increasing α can improve the convergence speed.
On the other hand, both Adagrad and RMSProp provide better performance on F3 than on F2, due to the shape of F3, whose global optimum lies deep in the central hole. Based on our results, we notice that small steps should be avoided for this category of objective functions (F2), because the iterations can get stuck at a local minimum. For a convex problem, which is the case of F4, the convergence results are acceptable, although momentum provided the best performance on the F4 function. Our results using RMSProp are shown in Table 6.

We now define our approach for selecting the best variant depending on several criteria, regardless of the problem features. The chi-square test is a statistical test whose test statistic follows a chi-square law under the null hypothesis; it allows verifying the adequacy of a series of observations to a probability law. We assume that the steps for applying the chi-square test are known to the reader; the short R illustration below shows the mechanics on made-up counts.
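For example (the counts are made-up stand-ins, not the actual contents of Tables 7 and 8):

counts <- matrix(c( 3, 21,
                    2, 22,
                    5, 19,
                    8, 16,
                   11, 13),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("GD", "SGD", "Momentum", "Adagrad", "RMSProp"),
                                 c("fast", "slow")))
chisq.test(counts)   # a 5 x 2 table: (5 - 1) * (2 - 1) = 4 degrees of freedom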
The details of the chi-square test can be found in [42]. The null hypothesis is formulated as follows:

H0: there is no dependence between the choice of the variant type and the obtained convergence speed for the considered test functions.

Table 7 gives, for each variant, the number of parameter choices for which the number of iterations required for convergence is strictly less than 10^4 (that is, a constant × 10^3); we consider a variant very fast when it requires that order of magnitude of iterations. The chi-square score for Table 7 is 12.95 > χ²(4, 0.02) = 11.66, which means that we can reject the null hypothesis. Thus, the obtained convergence speed depends on the choice of the gradient descent variant type.

Similarly, we prove that the accuracy metric depends on the chosen variant type. Table 8 gives the associated contingency table of accuracy based on the conducted experiments. The chi-square score for Table 8 is 10.98 > χ²(4, 0.05) = 9.48, which means that we can reject the null hypothesis and deduce that the accuracy performance depends on the chosen variant. This justifies the need to classify those variants by priority of usage. For this purpose, we applied the AHP technique, a multi-criteria analytical approach for decision support; the process of using AHP is explained in detail in [43]. Note that the sum of each column of the judgment matrix equals 1 and that the sum of the matrix elements equals the number of criteria, which is 4. The priorities of actions for each alternative are presented in the next table. The AHP results are coherent with the chi-square test results: RMSProp has a priority of 50.73% of being chosen in a random situation, based on the elaborated judgment matrix.

In the next section, we will propose and test our hybrid optimizer, which combines a recursive random search with the stochastic RMSProp. This optimizer is built on our understanding of the experimental results obtained for the considered gradient descent variants.

Suggested hybrid optimizer
For large search domains, the accuracy of the compared versions usually depends on the chosen initial position x0 ∈ D and on the characteristics of the test function. We therefore suggest and test an optimizer based on the following algorithms, written in pseudo-code. Algorithm 1 starts by retrieving, for each dimension, the lower and upper boundaries. First, the algorithm generates N points uniformly from the study domain defined by the two boundary vectors, then computes the corresponding objective function values for those N points. Secondly, we keep the p points that are minimal in terms of the objective function. After that, we identify, for each variable/column, the minimum and the maximum values; those values are the boundaries of the new study domain, which is the output of Algorithm 1.
Concerning Algorithm 2, we call the first algorithm recursively for a number of iterations given as input. The purpose is to reduce the study domain as much as possible in order to obtain an initial point. In each iteration of Algorithm 2, we adjust the boundaries by adding a very small quantity p/N to the upper-boundary vector and subtracting the same quantity from the lower-boundary vector; those adjusted vectors are the inputs of the next call to Algorithm 1, and the number of remaining iterations decreases by one. In each iteration, we also retrieve the point where the objective function is minimal; this point is the current initial point, and Algorithm 2 continues until the stopping condition is met. The output of Algorithm 2 is the initial point vector that the stochastic RMSProp (Algorithm 3, whose output is X_sol) uses to solve the optimization problem. Our optimizer thus starts by recursively reducing the study domain over Nbre_iterations iterations (the role of Algorithm 2); once Algorithm 2 reports an initial solution, the hybrid optimizer runs the stochastic version of RMSProp to obtain an accurate solution for the considered objective function. A hedged R transcription of these two stages follows.
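The function names below are ours; the paper's pseudo-code listings remain the authoritative versions:

reduce_domain <- function(objFun, lower, upper, N, p) {
  n <- length(lower)
  X <- sapply(1:n, function(j) runif(N, lower[j], upper[j]))  # N x n uniform sample
  vals <- apply(X, 1, objFun)
  Y <- X[order(vals)[1:p], , drop = FALSE]                    # the p best points
  list(lower = apply(Y, 2, min) - p / N,                      # adjusted lower boundaries
       upper = apply(Y, 2, max) + p / N,                      # adjusted upper boundaries
       best  = Y[1, ])                                        # current best point
}

find_init <- function(objFun, lower, upper, N, p, nbre_iterations) {
  for (k in 1:nbre_iterations) {
    out <- reduce_domain(objFun, lower, upper, N, p)
    lower <- out$lower
    upper <- out$upper
  }
  out$best   # initial point X_init handed to the stochastic RMSProp
}

sphere <- function(x) sum(x^2)
find_init(sphere, rep(-6, 4), rep(6, 4), N = 40, p = 5, nbre_iterations = 3)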
To improve the solution X_sol obtained by the proposed hybrid optimizer, we suggest applying a number of operations, such as replacing all the vector components by the minimum/maximum values, taking the integer part of the components, rounding the values, etc. The purpose of Algorithm 4 (presented as R code) is to compare some possible modifications of the obtained solution; its output is a solution at least as good as X_sol. Parts of the original listing are illegible, so the candidates x1 to x4 below are plausible reconstructions consistent with the surrounding comments:

Improvement <- function(x_sol, objFun, rounds_to_try = 0:2) {
  # x1 to x10 represent the possible improvements we could make.
  # For each of x_sol's components, we replace the component by the
  # integer part, a rounded value, or a component-wise min / max.
  x1 <- rep(min(x_sol), length(x_sol))
  x2 <- rep(max(x_sol), length(x_sol))
  x3 <- round(x_sol)
  x4 <- trunc(x_sol)    # integer part of the components
  x5 <- ceiling(x_sol)  # ceiling() returns the smallest integers not less than
                        # the corresponding elements of x_sol
  x6 <- rep(min(ceiling(x_sol)), length(x_sol))
  x7 <- rep(max(ceiling(x_sol)), length(x_sol))
  x8 <- floor(x_sol)    # floor() returns the largest integers not greater than
                        # the corresponding elements of x_sol
  candidates <- c(list(x_sol, x1, x2, x3, x4, x5, x6, x7, x8),
                  lapply(rounds_to_try, function(d) round(x_sol, d)))
  vals <- sapply(candidates, objFun)
  candidates[[which.min(vals)]]  # best candidate: never worse than x_sol
}
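A possible usage, assuming the reconstruction above and a sphere objective (both illustrative):

sphere <- function(x) sum(x^2)
x_sol <- c(0.021, -0.013, 0.008, -0.027)  # e.g. the output of Algorithm 3
Improvement(x_sol, sphere)                # returns c(0, 0, 0, 0), the exact optimum here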

Principles analysis of the suggested optimizer
To show how efficient our algorithm is, we should first mention that the two-stage algorithm combines a slightly modified random search version with the accuracy features of RMSProp. The input of the proposed optimizer is an n-dimensional volume. First of all, Algorithm 1 samples N points and retrieves the p vectors with the minimum objective values. Secondly, the p vectors are stored in a matrix Y whose columns represent the n dimensions and whose rows represent the p vectors. Then we retrieve, for each column of the matrix Y, its minimum and its maximum; the combination of those minima and maxima defines the new rectangular search domain.
In one dimension, given an interval [a, b], consider a sample of N points drawn uniformly at random such that (without loss of generality):

a ≤ x_1 ≤ x_2 ≤ ... ≤ x_N ≤ b

Then, concerning the global solution x*, and as explained in [45], consider a region covering v = 5% of the domain around the minimum: each drawn point has a 5% chance of landing within that interval. The probability that all of the N points miss the desired interval is (1 − v)^N, so the probability that at least one of the N points lands in the desired interval is 1 − (1 − v)^N. As we want a confidence level of 95%, we should have:

1 − (1 − v)^N ≥ 0.95, i.e. N ≥ ln(0.05) / ln(1 − v) ≈ 59 points for v = 0.05.

Now consider the set S_2 = {x_{N−p}, ..., x_N} of the p points with the lowest objective values from the first iteration, whose componentwise minimum and maximum define the search domain of the second iteration. For the set S_2, we are 95% sure that at least one element belongs to the minimum region. By adding a fluctuation term of p/N, we avoid situations where x* is not within the boundaries derived from S_2, i.e. we take L(2) = x(2)_min − p/N and U(2) = x(2)_max + p/N. We repeat this procedure nbre_iterations times until the last iteration, and then report the solution with the minimum objective value, denoted X_init. This solution will be improved by the RMSProp variant.
For a proof of the convergence accuracy of the stochastic RMSProp, we refer the reader to [46]. We tested our optimizer on several multi-dimensional test functions that vary in complexity; some of the obtained results are given in Table 11.

Illustration example
To illustrate how our algorithm works, we consider the simple 4-dimensional sphere function (n = 4). From the domain [−6, 6]^4, we sample at each iteration N = 7 vectors (belonging to the 4-dimensional space) and report the p = 2 vectors whose objective values are the two lowest of the matrix Y (the first two rows).

The study domain is [−6, 6]^4.
After obtaining a sub-matrix of Y with p rows, we retrieve the minimum and maximum of each column. The next step is to subtract the quantity p/N = 2/7 ≈ 0.286 from the lower boundaries (and add it to the upper boundaries). We obtain the new search domain (used in the next iteration), then we repeat the same procedure N_iter = 3 times.
In the last iteration (N_iter = 3), we consider the current new search domain and report the point with the minimum objective value. This point is the initial solution X_init used by RMSProp (we run Rmsprop(X_init, α, γ) for Rmsprop_iter = 1000 iterations).
After running RMSProp, we obtain X_sol (the output of Algorithm 3). We improve this solution by applying several rounding operations, obtaining the solution X_improved.
The next R output illustrates that our conceived optimizer was able to obtain the exact solution of the 4-dimensional sphere function:

[R console output of the run with p = 2, N = 7, n = 4.]

Figures 7, 8 and 9 present the convergence plots of the used test functions, drawn with the R ggplot2 package. For each test function, the curves show the evolution of each component/dimension against the number of iterations.

To measure the convergence speed and the accuracy of the conceived optimizer, we chose the following performance metrics in addition to the CPU time [47]:

MAE = (1/n) · Σ_{i=1}^{n} | x_i − x_i^opt |

and:

MPE = (100/n) · Σ_{i=1}^{n} ( x_i − x_i^opt ) / x_i^opt

where x_i^opt is the i-th component of the true optimum of the concerned test function and x_i is the i-th component of the solution computed by the algorithm. Note that a drawback of the MPE measure is that it is undefined whenever a single actual value in the denominator is zero, as mentioned in [48]. The next table gives the obtained performance results.

To analyze the obtained convergence speed and accuracy results, we should first remind the reader that the main purpose is to find an approximate solution in large search domains: our hybrid optimizer targets global search optimization, which is quite different from the local search context. Going back to Table 6, the classical RMSProp alone was not able to converge for the n-dimensional Rastrigin function: the obtained objective value was f(x, y) = 31.85, far from the optimum, and that was just for a two-dimensional problem. Our suggested optimizer was able to solve the Rastrigin problem in 8 dimensions with an error of MAE = 0.25610. This error is very small and acceptable relative to the search domain; taking the truncation error into account, the real error is strictly lower than the MAE value for the Rastrigin problem. Concerning the Schaffer and Alpine functions, the objective value at the true optimum is equal to zero; the proposed hybrid optimizer reduced the objective values of those functions to 5.551115123e−15 and 0.02013423150, respectively, over the large study domains [−100, 100]^2 and [0, 10]^10. The proposed hybrid optimizer also consumes fewer resources in CPU time/number of iterations compared to the classical stochastic RMSProp.
The results of Table 12 show that the suggested optimizer is able to find relevant approximate solutions for the considered artificial landscape problems.
After finding an approximate solution by global search, any other local search optimizer, as well as our suggested optimizer, can be used by the practitioner to improve the accuracy of the obtained solution. It should be mentioned that among the used test functions, some vary greatly locally (they are sensitive to small variations of the input space), which results in rapid local amplification of the objective function under small perturbations. Using the suggested Algorithm 4, we were able to improve the accuracy of the obtained solutions; the results are presented in Table 13. Except for the Beale function, Algorithm 4 was able to find the exact solutions. In the next section, we propose two real applications of the proposed hybrid optimizer, in the mechanical and biological fields.

Mechanical application
Consider the mechanical problem of a beam clamped at one end and subjected to a concentrated load P at the other. The beam has a length l, its material has Young's modulus E, and its section has an inertia I. This structure is modeled by a Bernoulli-type beam element of length l, whose degrees of freedom in the xOy plane are the two rotations θ1 and θ2 as well as the two translations V1 and V2. The purpose is to minimize the potential energy stored in the structure, given by the sum of the internal energy and the work of the applied loads, also called the compliance. The objective function is formulated as described in [49][50], with the boundary conditions θ1 = 0 and V1 = 0.
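For concreteness, here is a hedged R sketch of this objective. It uses the standard Bernoulli beam element stiffness matrix from the literature rather than the exact expression of [49][50]; with θ1 = V1 = 0, the free degrees of freedom reduce to x = (V2, θ2), and the numerical values of E, Iz, l and P are illustrative only:

E  <- 2.1e11   # Young's modulus (illustrative)
Iz <- 1e-6     # section inertia (illustrative)
l  <- 2        # beam length (illustrative)
P  <- 1e3      # concentrated load (illustrative)
# Reduced stiffness matrix over (V2, theta2) for a clamped-free Bernoulli element
K <- (E * Iz / l^3) * matrix(c(12,     -6 * l,
                               -6 * l,  4 * l^2),
                             nrow = 2, byrow = TRUE)
# Potential energy (compliance): internal energy minus the work of the load P
compliance <- function(x) as.numeric(0.5 * t(x) %*% K %*% x) - P * x[1]
# Sanity check: the quadratic form is minimized where K x = (P, 0),
# recovering the classical tip deflection V2 = P l^3 / (3 E Iz)
solve(K, c(P, 0))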

Biology application
In a biology experiment, we study the relationship between the substrate concentration [S] and the reaction rate v in an enzymatic reaction, from data reported in the following table [51]. The purpose is to find the optimal non-linear regression parameters of the following model:

v = Vmax · [S] / (K_M + [S])

The least-squares minimization concerns the parameters Vmax and K_M. The true optimal parameters are Vmax = 0.362 and K_M = 0.556.
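A minimal R sketch of the corresponding least-squares objective; the data vectors below are placeholders generated from the true parameters, not the experimental values of the table reported in [51]:

S <- c(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)   # substrate concentrations (assumed)
v <- 0.362 * S / (0.556 + S)              # reaction rates synthesized at the true optimum
# Sum of squared errors for the model above; par = c(Vmax, K_M)
sse <- function(par) sum((v - par[1] * S / (par[2] + S))^2)
sse(c(0.362, 0.556))   # equals 0 by construction; the hybrid optimizer minimizes sse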
Using the suggested hybrid optimizer with the parameters (N = 40, p = 5, α = 0.01, γ = 0.9) and a total of 55 iterations, we found parameter estimates close to these true values. In this application, we used the same study domain as in the previous mechanical problem as input for the suggested hybrid optimizer. The next figures give the convergence plots for those applications.

Discussion and conclusion
In this paper, we compared some variants of the well-known gradient descent algorithm on a benchmark of five test functions. Using a chi-square statistical test on a sample of 120 experiments, we proved that the performance metrics depend on the chosen version of the algorithm regardless of the considered problem features. After that, we ranked those variants by priority of usage using the AHP decision technique in two dimensions; we found that RMSProp has a priority of usage equal to 50.73%. Based on those results, which are coherent with previous related work, we suggested a new hybrid optimizer that combines a recursive random search for initial points with the accuracy features of the stochastic RMSProp. The conceived algorithm takes the given study domain as input and reaches approximate solutions of global optimization problems while using fewer resources. On a benchmark of 11 multi-dimensional test functions that vary in complexity and dimensionality, we showed that our hybrid algorithm can effectively reduce useless iterations of the classical stochastic RMSProp, which results in faster convergence. The simulation results of the newly proposed hybrid algorithm show accurate and significant results for the considered test functions over large search domains. Two real applications, from the mechanical and biology fields, were proposed.

Acknowledgment
This work was supported by the Moroccan CNRST: National Center for Scientific and Technical Research. The authors are grateful to the referees for their valuable comments and suggestions.