Review of Reinforcement Learning for Robotic Grasping: Analysis and Recommendations

This review paper provides a comprehensive analysis of over 100 research papers focused on the challenges of robotic grasping and the effectiveness of various machine learning techniques, particularly those utilizing Deep Neural Networks (DNNs) and Reinforcement Learning (RL). The objective of this review is to simplify the research process for others by gathering different forms of Deep Reinforcement Learning (DRL) grasping tasks in one place. Through a thorough analysis of the literature, the study emphasizes how critical grasping is for robots and how DRL techniques, particularly the Soft Actor-Critic (SAC) strategy, have demonstrated high efficiency in handling this task. The results of this study hold significant implications for the development of more advanced and efficient grasping systems for robots. Continued research in this area is crucial to further enhance the capabilities of robots in handling complex and challenging tasks, such as grasping.


Introduction
Artificial Intelligence (AI) is a source of both excitement and apprehension. In general, it is an overarching concept referring to computer systems that are able to perceive their environment, reason, learn, and manage data so that they can act according to what they perceive and to their goals. AI has recently caused a shift in many industries around the world, from technology to healthcare [1], [2], [3]. This once mysterious field has become a hot topic attracting countless industrial and academic minds. Drastic advances in hardware and data storage, coupled with AI's ability to "self-learn", have put it at the forefront of algorithms for multiple applications such as computer vision [4], [5], [6] and natural language processing [7], [8]. AI takes many forms today, from digital assistants and chat-bots [9] to machine learning, and at present the most prominent topic within AI is machine learning (ML) [10], [11], [12], [13].
There are two primary applications for machine learning techniques, namely classification and regression. Within these applications, recent studies illustrate machine learning's diverse capabilities. Kyrarini et al. investigated robot learning for assistive manipulation tasks through a head-gesture-based interface [14]. Their approach introduced a hands-free robot control system, leveraging optical flow for feature extraction and support vector machines for head-gesture recognition. Similarly, Bahrami et al. applied machine learning to touch localization on ultrasonic-wave touchscreens [15], employing a robotic finger to simulate touch actions and capture data for model training. Such techniques find applications in classification, clustering, and regression tasks, as well as in time-series analysis, anomaly detection, and adaptive (robotic) control. Shafiei et al. contributed to the field by developing machine learning classification models that use electroencephalogram (EEG) and eye-gaze features [16]; their objective was to predict the level of surgical expertise in robot-assisted surgery (RAS). In a different context, Kolaghassi et al. conducted a systematic review of intelligent algorithms in gait analysis and prediction for lower-limb robotic systems [17]; notably, 33.3% of the included papers implemented regression models for the estimation and prediction of kinematic and kinetic parameters in gait analysis. Machine learning algorithms can additionally be categorized into four primary sub-fields: Supervised Learning [18], [19], Semi-Supervised Learning [20], [21], Unsupervised Learning [22], and Reinforcement Learning [23].
The integration of AI and ML is currently a popular and significant subject, with clear potential benefits when applied to robotics. Many researchers have explored this combination, particularly in the area of deep learning (DL). For instance, Bai et al. developed an innovative garbage collection robot that uses a deep neural network to recognize and pick up garbage with high precision and autonomy [24]. DL was employed by Kase et al. to enable a humanoid robot to perform a Put-In-Box task consisting of several separate sub-tasks [25]. DL techniques were utilized by Gu et al. to introduce a robot designed for collecting tennis balls [26]. To teach a parallel-plate gripper to recognize the grasping configurations of different household items, Caldera et al. suggested a transfer learning approach based on deep convolutional neural networks [27]. Onishi et al. developed a robot for automated fruit harvesting by leveraging DL techniques [28]. Kim et al. utilized a DL approach that transfers knowledge between different robots to teach a robot how to perform two different table-cleaning tasks [29]. In an effort to improve the ability of robots to manipulate objects, Yang et al. investigated a DL approach for grasping objects that are initially invisible; specifically, to enable the robot to grasp the target object, the proposed method involves a sequence of pushing and grasping actions [30]. Shang et al. developed a DL technique that employs dexterous hands to grasp new objects [31].
Reinforcement learning (RL) is an ML technique that has shown great potential in robotics, particularly in object grasping [32]. RL is considered the algorithm of choice for building truly intelligent robots [33]. In this comprehensive review, we delve into the current state-of-the-art RL algorithms, covering their methods, types, and potential applications in the domain of robotic grasping. The emphasis of this study lies in exploring the practical applications of RL in robotic grasping scenarios, steering away from intricate mathematical proofs and numerical analyses of RL approaches. Instead, we aim to provide readers with a panoramic view of the evolving landscape, summarizing the history and progression of RL from its early foundations to recent advances. Our motivation is rooted in the recognition of the critical role robotic grasping plays in various applications. To streamline the research process, we meticulously analyzed over 100 research papers, with a particular focus on the effectiveness of machine learning techniques, including Deep Neural Networks (DNNs) and Reinforcement Learning (RL). Our main contribution lies in synthesizing this extensive literature to spotlight the diverse forms of Deep Reinforcement Learning (DRL) grasping tasks and to underscore the efficacy of the Soft Actor-Critic (SAC) strategy within DRL techniques. As we progress through this review, we bridge the explored concepts with practical applications in robotic grasping, adding an intuitive layer to enhance comprehension in Section 3. In the same section, we provide detailed explanations and address open problems, focusing in particular on the most prominent algorithms in robotic grasping: DDPG, TD3, and SAC. Their comparative analysis in diverse state-of-the-art applications unfolds in Section 4. To conclude, Section 5 offers a comprehensive summary, shedding light on both the benefits and drawbacks of RL in the specific context of robotic grasping.

Brief history
The history of reinforcement learning is founded on two important areas of research that were pursued independently before intertwining into modern reinforcement learning: animal psychology and optimal control. The psychology of animal learning was the impetus for the idea of trial-and-error learning. The trial-and-error theory of learning was first introduced by the famous psychologist Edward L. Thorndike [34]; this procedure was implemented in some of the early works in artificial intelligence and resulted in the renaissance of reinforcement learning in the early 1980s. The optimal control problem was originally formulated as the design of a controller that minimizes a loss function of a dynamic system over time [35]. In the mid-1950s, more exactly in 1957, an innovative perspective on Hamilton-Jacobi theory was introduced by Richard Bellman, who also devised an approach to address the optimal control problem known as dynamic programming [36]. Other methods emerged and were combined with the two previously mentioned areas in the late 1980s: these are the temporal difference approaches, and the union of the three domains gave rise to the modern field of RL. Reference [37] gives far more detail about the history of reinforcement learning. To sum it up, RL is a type of ML that allows an agent to learn how to reach a goal through trial and error. This concept, also named the Law of Effect, lets the agent test actions and receive feedback (reinforcement). RL involves adjusting the behavior of the agent to maximize the cumulative reward it receives, and it has broad applications in solving control and optimization problems that entail sequential decision-making. Consequently, it is a subject of great interest; the bar chart in Figure 1 illustrates this interest through the percentage of papers published on the topic from 2014 to date.

Methods of Reinforcement Learning
To estimate value functions and action-value functions, as illustrated in Figure 2, there are three main families of algorithms used in RL: Dynamic Programming (DP), Monte Carlo (MC), and Temporal Difference (TD). DP methods aim to find the optimal policy, but they require a perfect model of the system, and their computational cost makes them unfeasible for non-trivial tasks [38]. Policy iteration and value iteration are the most commonly used DP methods: policy iteration seeks the optimal policy through iterative policy evaluation and improvement [39], but it is rarely used due to its large computational cost, while value iteration determines the optimal policy by identifying the optimal value functions, which is more efficient since it does not evaluate many policies [40]. However, a perfect model of the system is still required to extract the optimal policy from the optimal value function. Monte Carlo (MC) methods are model-free [41] and rely on sampling to estimate mean returns for various policies, by taking samples of state sequences, actions, and rewards under the policy. Since the agent does not have a model of the system, it determines the value of each action through exploration and deduces the optimal policy from it. MC methods therefore estimate action-value functions, since value functions alone are insufficient without a model for selecting high-value states. However, MC methods require waiting until the end of an episode before learning can begin, which is penalizing in long or continuous tasks. Temporal Difference (TD) methods combine ideas from both DP and MC to update the value functions incrementally: unlike MC methods, they do not require waiting until the end of an episode before updating the value functions [42]. Since the main goal of AI is to replicate human behavior, neither DP nor MC alone is sufficient, and TD methods are often used instead [42].
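To make the distinction concrete, the following minimal Python sketch contrasts the two model-free update rules: the Monte Carlo update waits for the complete return at the end of an episode, while the TD(0) update bootstraps from the current value estimate after every transition. The toy 10-state environment, step size, and variable names are illustrative assumptions, not taken from any of the reviewed papers.

```python
import numpy as np

GAMMA, ALPHA = 0.99, 0.1
V = np.zeros(10)  # value estimates for a toy 10-state environment

def mc_update(episode):
    """Monte Carlo: update V(s) toward the full return observed after visiting s.
    `episode` is a list of (state, reward) pairs and must be complete before learning."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + GAMMA * G                  # accumulate the discounted return
        V[state] += ALPHA * (G - V[state])      # move V(s) toward the sampled return

def td0_update(state, reward, next_state, done):
    """TD(0): update V(s) immediately, bootstrapping from the estimate V(s')."""
    target = reward + (0.0 if done else GAMMA * V[next_state])
    V[state] += ALPHA * (target - V[state])     # no need to wait for the episode to end
```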

Types of Reinforcement Learning
It is tricky to present an exhaustive and detailed list of all the RL algorithms applied to robot manipulation. The focus of this discussion will therefore be limited to the major branches of algorithms, namely model-based and model-free algorithms, as well as policy-based and value-based algorithms. Figure 3 presents a more complete list of algorithms [43].

Model-Based and Model-Free
If the agent has a model of the environment, it can predict the outcome of taking a particular action, resulting in improved sample efficiency compared to model-free methods. However, learning the model can introduce bias, leading to sub-optimal behavior in the real environment. In contrast, model-free techniques rely on reward signals and learn value functions solely from the agent's interactions with the environment. They are easier to implement and to tune for hyper-parameters, making them more popular than model-based methods.

Policy-Based and Value-Based
In value-based algorithms, the estimation of the action-value function is carried out with reference to the optimal value Q*(s, a). This is usually achieved through off-policy learning, as detailed in the preceding section. Conversely, policy-based approaches directly identify the optimal action to take in a given state s to maximize the reward, and this is often done on-policy. Nguyen et al. [33] contend that policy-based techniques are more dependable and consistent than Q-learning methods, which estimate Q indirectly based on an objective function. However, policy-based methods may fail due to various factors. By contrast, value-based, off-policy algorithms exhibit higher sample efficiency by effectively reusing past data, a crucial factor for successful implementation on real robots.
In the realm of reinforcement learning, the dichotomy between policy-based and value-based algorithms has long been a focal point of research. While policy-based methods focus on defining the optimal behavior directly, and value-based methods estimate the value of different actions in a given state, recent advancements have extended these paradigms. A noteworthy addition to this landscape is Fuzzy Reinforcement Learning, which introduces a nuanced approach to decision-making in uncertain environments [44]. Unlike traditional methods that rely on precise values and policies, fuzzy reinforcement learning leverages fuzzy logic to navigate the inherent uncertainties of real-world scenarios [45]. This integration of fuzziness into the learning process not only offers a more adaptive and flexible approach but also aligns well with the challenges posed by complex and dynamic robotic grasping tasks [46]. Moreover, the evolution of reinforcement learning has seen the emergence of Reverse Reinforcement Learning (Reverse RL), where the focus shifts from learning optimal behavior to inferring the underlying reward structure from observed behavior [47]. This approach holds promise in scenarios where defining a reward function is challenging or impractical, contributing a unique perspective to the exploration. Additionally, Adversarial Deep Reinforcement Learning (Adversarial DRL) has garnered attention for its ability to address complex tasks, such as robotic cloth manipulation, without explicit reward function design [48]. By introducing adversarial elements into the learning process, this technique allows agents to learn near-optimal behaviors through expert demonstration and self-exploration. The interplay between agent and environment takes on a dynamic and adversarial nature, further enhancing the adaptability of reinforcement learning methodologies [49]. Having surveyed these advanced reinforcement learning paradigms and their potential impact on robotic grasping tasks, our focus now turns specifically to Deep Reinforcement Learning (DRL). In the subsequent sections, we explore the nuances of DRL techniques, their implementation in robotic grasping scenarios, and the state-of-the-art advancements at this intersection of machine learning and robotics.

Deep Reinforcement Learning for robotic grasping
DL is an excellent tool for processing unstructured environments due to its ability to learn from vast amounts of data and identify patterns. However, while this capability is crucial for recognition, it is not equivalent to decision-making. RL, on the other hand, facilitates decision-making, which makes it indispensable. Since robotic tasks, and more precisely robotic grasping tasks, require interaction between the agent and the environment, merging DL and RL (DRL) is crucial to improving robotic tasks. As reported by Haarnoja et al., empirical evidence suggests that model-free DRL is highly effective in various domains, including video games as well as simulated robotic manipulation and locomotion [50]. Ibarz et al. discussed the successful application of DRL techniques to various tasks, including quadrupedal walking, grasping unfamiliar objects, and acquiring a diverse set of intricate manipulation skills [51]; these case studies demonstrate that DRL is a feasible approach for learning directly in the real world, using raw sensory inputs, and tackling physically challenging tasks like dexterous manipulation and walking. This research highlights that policies learned through DRL exhibit effective generalization, as seen in the case of robotic grasping. Before discussing robotic grasping itself, Section 3.1 reviews the different strategies used to learn robotic grasping, while Section 3.2 focuses on the three main DRL algorithms used today for robotic manipulation, and specifically for robotic grasping.

Robotic Grasping
As mentioned before, robotic grasping is a very challenging and interesting task, which is why many reviews have been conducted in this context. Here we cite the latest reviews on the application of DRL to grasping. In 2017, the authors of [52] provided a concise overview of DRL, with a particular emphasis on the primary algorithms used in the field. In 2018, Khanzhahi et al. conducted a review and classification of DRL algorithms, highlighting their advantages and limitations, as well as discussing the challenges that DRL has successfully overcome [53]. Mousavi et al. conducted a review of fundamental DRL algorithms, with a focus on research methodology [54]. In 2019, Chatzilygeroudis et al. described a method for robots to acquire learning using micro-data reinforcement learning [55]. Bhagat et al. reviewed DRL-based intelligent soft robotics [56]. In 2021, Du et al. conducted a comprehensive study on vision-based robotic grasping, which identified the primary tasks required for successful vision-based robotic grasping: object localization, grasp estimation, and object pose estimation [57]. Connolly et al. examined the accuracy and realism of models generated by two simulation platforms for simple robotic grasping tasks, investigating the extent to which the resulting models accurately represent reality [58]. Marwan et al. conducted an extensive review of research approaches from the past five years, covering techniques such as sensing, learning, and gripping [59]. In 2022, Wang et al. classified DRL algorithms and their applications and conducted a thorough evaluation of current DRL methods [60].

Grasping in cluttered environments
Robotic grasping is an important aspect of robotics and automation. It involves the ability of robots to manipulate objects using different mechanisms, such as suction grasping, grasping only, synergies between prehensile and non-prehensile actions, or multi-functional grippers. In recent years, learning robotic grasping policies using DRL has gained significant attention as a promising approach. This subsection provides an overview of current research in the field, with a focus on studies that rely on DRL methods to grasp objects using different mechanisms. In 2017, Mahler et al. presented a robot bin-picking system that uses grasping only, fine-tuning a Convolutional Neural Network (CNN) for grasp quality using Dex-Net; with this approach, the robot achieved high success rates in picking and placing objects from a bin [61]. In 2018, Morrison et al. introduced the Generative Grasping CNN (GG-CNN), a real-time generative grasp synthesis method that uses a CNN to generate grasp candidates for an object [62]; this method achieved better results in terms of both success rate and execution time compared to other state-of-the-art methods. Zeng et al. showed that model-free deep reinforcement learning is capable of learning such synergies from scratch [63]. Integrated grippers that combine different types of gripping mechanisms have also been developed to enable grasping diverse objects in various operational settings.
Silver et al. introduced a pushing and pick-and-place method using Deep Deterministic Policy Gradient (DDPG) and an actor-critic method [64]; their approach learned to push objects into graspable configurations and pick them up with a gripper. In 2019, Kang et al. introduced an integrated gripper that merges a suction gripping system with a linkage-driven underactuated gripper [65]. It may be necessary to perform pre-grasping manipulation, such as shifting or pushing an object, and algorithms have been developed to learn these additional manipulation tasks. Berscheid et al. developed an algorithm that learns how to shift objects to increase their grasp probability [66]. Semantic grasping methods have also been developed to estimate the 6-DOF pose for grasping by robotic manipulators. Zhu et al. introduced a method for robotic semantic grasping that estimates the 6-DOF grasping pose for a robotic manipulator, allowing a perpendicular grip on the object's surface [67]. Murali et al. introduced a technique that, using partial point cloud observations, generates a 6-DOF grasping strategy for any target object in a cluttered environment [68]. Shao et al. proposed a suction grasping method using Q-learning and a ResNet with U-Net (CNN) [69]; their approach learned to grasp objects with suction cups, achieving high success rates in cluttered environments. In 2020, Sarantopoulos et al. presented a pushing and grasping method using Deep Q-Learning (DQN) [70]; their approach learned to push objects into graspable configurations and grasp them with a gripper. Wu et al. introduced a generative attention learning framework that uses a single depth image and circumvents continuous motor control to achieve high-performance multi-fingered grasping in clutter [71]; the approach successfully learned multi-fingered grasping in cluttered settings, allowing objects to be grasped with several fingers. A method based on DRL and visuo-motor feedback was introduced by Joshi et al. [72] to address robotic grasping; their approach learned to grasp objects by taking into account both visual and motor information. Kim et al. developed a deep-learning-based approach for grasping diverse unseen target objects in a cluttered environment [73]. Pose estimation of textureless and textured objects is another important aspect of robotic grasping. A push-grasping policy for grasping a particular object in clutter was learned by Xu et al. in 2021 [74], using a goal-conditioned hierarchical RL framework that makes efficient use of samples; their approach learned to push objects into graspable configurations and grasp them with a gripper. Tang et al. developed a self-supervised approach to train a robot in joint planar pushing and 6-DOF grasping policies [75]; they used two distinct deep neural networks, trained within a Q-learning framework, to map 3D visual observations to actions, and their approach learned to push objects into graspable configurations and grasp them with 6-DOF grasps. Dong et al. proposed a method for estimating the position and orientation of objects with and without texture by leveraging the objects' colors as a crucial feature for object recognition, particularly for grasping tasks [76]. Teaching a robot to identify a desired object by using its color as a distinctive feature, and then to locate and pick it up in an unsupervised manner, is another approach that has been investigated: Mohammed et al. developed a method for training a robot to locate and pick up objects based on their color [77]. Finally, Sundermeyer et al. proposed an end-to-end network that efficiently generates a distribution of 6-DOF parallel-jaw grasps using only depth recordings of a scene, enabling efficient grasping of objects in cluttered environments [78].
In order to offer a thorough overview of state-of-the-art accomplishments and upcoming challenges in this area, Table 1 presents the latest research papers on grasping in cluttered environments.

Simulation-to-real-world transfer
The field of robotic grasping has a high demand for transfer learning from simulation to reality. It is important to first conduct simulations in order to fully understand the training environment. One of the major challenges for robots is to learn the skills necessary to adjust to the properties of grasped objects, and numerous studies have explored this area. James et al. presented a method called Randomized-to-Canonical Adaptation Networks (RCANs), which addresses the visual reality gap without relying on real-world data [83]. Wu et al. proposed an attention mechanism that improves the success rate of grasping objects in cluttered environments by mapping pixel space to Cartesian space [84]. Fang et al. suggested a framework that combines planning and learning for efficient exploration in complex environments [85], while Irpan et al. examined the problem of model selection for DRL in real-world settings [86]. Wu et al. presented a tactile closed-loop method called MAT, enabling the robot to seize an object even when the hand's initial location is coarse [87]. Shao et al. introduced UniGrasp, a method for generating grasping motions that takes into account the geometry of the object and the attributes of the gripper [88]. RL has also been used to acquire skillful in-hand manipulation policies for reorienting objects with a Shadow Dexterous Hand in the physical world, as shown by Andrychowicz et al. [89], and to enable a robot to perform robust object pushing through training, as explored by Clavera et al. [90].
The authors of [95] employed a reverse real-to-sim approach, utilizing a CycleGAN to bridge the reality gap between the simulated and real environments. These studies demonstrate the importance and effectiveness of simulation and transfer learning for robotic grasping in real-world applications. Table 2 presents the latest research papers on grasping with sim-to-real transfer, with the aim of offering a thorough summary of the current state-of-the-art achievements and future challenges in this field.

Robots learning from demonstration
Learning from demonstration (LfD) is a significant concept in robotics, whereby a robot can acquire new skills by reproducing those of an expert. This paradigm is highly significant for developing robots able to perform complex tasks such as grasping. To this end, many studies and reviews have been conducted, among them Zhu et al. [101], who reviewed recent advancements and progress in the domain of LfD. On the other hand, Hussein et al. conducted a review of imitation learning methods and outlined various design options at different stages of the learning process [102]. Imitation learning approaches seek to replicate human behavior in a specific task, wherein an agent acquires the capability to carry out the task by mapping observations to actions based on demonstrations. Finn et al. investigated the potential use of inverse optimal control (IOC) for learning behaviors from demonstrations, particularly for controlling high-dimensional robotic systems with torque [103]. Schoettler et al. examined challenging industrial insertion tasks that involve visual input and different types of natural rewards, including sparse rewards and goal images [104]; they demonstrated that, by combining reinforcement learning (RL) with prior knowledge, these tasks can be effectively solved with a moderate amount of interaction in the real world. Zhu et al. presented a model-free approach to DRL that utilizes a limited amount of demonstration data to support an RL agent; their methodology was applied to robotic manipulation tasks and resulted in the training of policies that involve both visual perception and motor control, using RGB camera inputs to determine joint velocities in an end-to-end manner [105]. Ragaglia et al. suggested a resolution to the Robot Learning from Demonstration (RLfD) challenge in dynamic environments; to demonstrate its efficiency, a set of pick-and-place experiments was performed using an ABB YuMi robot and the system's performance was evaluated accordingly [106]. Recent studies have shown the possibility of training multi-task deep visuomotor policies for robotic manipulation through various forms of LfD and RL. The capabilities of end-to-end LfD architectures have been enhanced by Abolghasemi et al. to encompass object manipulation in cluttered environments [107]. A low-cost hardware interface that can collect grasping demonstrations from individuals in diverse environments has been proposed by Song et al. [108]. A dataset of human-robot demonstrations suitable for training robots for various tasks was presented by Sharma et al. [109]. Kim et al. provided an overview of robotic cleaning tasks utilizing different control methods [110], while Yang et al. suggested a DL model to learn robotic manipulation actions from videos of human demonstrations [111]. In contrast, Kilinc et al. suggested an RL-based approach that does not rely on human demonstrations [112]. Smith et al. conducted research on how automated robotic learning frameworks can help overcome challenges related to defining and scaffolding the learning process for multi-stage tasks [113]. Shahid et al. proposed a learning-based approach that utilizes simulation data to train robots for object manipulation tasks using RL [114]. Sena et al. presented a learning from demonstration model that takes into account the teacher's understanding of and influence on the learner [115].
An effective LfD policy for the secure grasping of compliant food objects by robots was proposed by Misimi et al. [116]; the approach used a blend of RGB-D images and tactile data to estimate the appropriate gripper pose, gripper finger configuration, and object forces. Ravichandar et al. provided a review of machine-learning methods used for robot learning from, and imitation of, a teacher, discussed the mature and emerging application areas of LfD, and highlighted the significant challenges that remain in both theory and practice [117]. Liang et al. investigated the feasibility of using LfD for teaching construction tasks to co-robots [118]. Solak et al. proposed an approach for acquiring in-hand robotic manipulation skills from human demonstrations using Dynamical Movement Primitives (DMPs); they then replicated these tasks using a robust compliant controller based on the Virtual Springs Framework (VSF), which uses real-time feedback from the contact forces recorded on the robot's fingertips [119]. Marzari et al. proposed a multi-subtask reinforcement learning (RL) methodology to overcome the limitations of learning from demonstration [120]. Meanwhile, James et al. discussed a voxel prediction approach for translation prediction in robotic manipulation and proposed a coarse-to-fine resolution increase [121].
Cai et al. presented a deep imitative reinforcement learning approach for agile autonomous racing using visual inputs, highlighting the potential of Learning from Demonstration (LfD) for enabling robots to perform complex tasks by imitating expert behavior [122]. Table 3 provides a comprehensive summary of present-day accomplishments and future hurdles in the domain of robots learning to grasp from demonstrations, showcasing the most recent research papers.

Vision-based robotic grasping
A related approach was presented in [135], while an approach that utilizes a reproducible sensor for precise and haptic grasping was proposed by Song et al. [136]. Muthusamy et al. proposed a novel dynamic finger system that uses vision to detect and suppress object slippage, and presented a baseline and a feature-based method to detect slippage in the presence of illumination and vibration uncertainty [137]. Chen et al. suggested a framework for robotic visual grasping based on DRL, which has demonstrated effectiveness in learning complex control policies independently by training visual perception and the control policy separately instead of end-to-end [138]. A simulated benchmark for evaluating robotic grasping that prioritizes off-policy learning and the ability to generalize to unfamiliar objects was introduced by Quillen et al., highlighting the significance of diversity in facilitating the adaptation of the approach to novel objects not encountered in the training phase, since off-policy learning enables the usage of grasping data across a broad spectrum of objects [139]. Danielsen et al. explored diverse robotic manipulation and grasping techniques and demonstrated, through two PyBullet experiments, the possibility of using DRL techniques to teach a robotic arm with seven degrees of freedom how to grasp objects [140]. In another comprehensive survey, Kleeberger et al. presented a summary of ML techniques utilized for vision-based robotic manipulation and grasping [141]. Liu et al. noted that manipulators still face the challenge of only being able to grasp specific objects, unlike human beings, who can use brain decision-making to pick up unfamiliar objects [142]. Reinforcement learning is often used in academia to train grasping algorithms, but it encounters issues such as insufficient algorithm stability, poor sample utilization, and limited exploration; to solve these problems, Liu et al. proposed using LfD, behavior cloning (BC), and DDPG [142]. Grimm et al. presented a comprehensive system that encompasses stone segmentation, the creation of grasping hypotheses, and the implementation of pushing actions to achieve sturdy stone grasping [143]. Although reinforcement learning techniques have been effective, they are yet to achieve widespread success in various robotic manipulation tasks; to address this issue, James et al. presented an Attention-driven Robotic Manipulation (ARM) algorithm, which can tackle a variety of tasks with sparse rewards while requiring only a few demonstrations [144]. To overcome the difficulty actor-critic deep reinforcement learning methods have with grasping varied objects, especially when learning from raw images and sparse rewards, Kim et al. utilized state representation learning (SRL) to capture crucial information for later use in RL [145]. In the field of robotic grasping, Cao et al. introduced a neuromorphic vision sensor named the dynamic and active-pixel vision sensor (DAVIS) [146]. Wang et al., on the other hand, developed a real-time learning system referred to as the Remote-Local Distributed (ReLoD) system, which distributes the computation of two DRL algorithms between a local and a remote computer [147]. Table 4 presents a comprehensive summary of the current achievements and future challenges in the field of vision-based robotic grasping, including the most recent research papers.

Main model-free off-policy DRL algorithms
The balance between exploring new options and exploiting existing knowledge is a well-known issue in RL. Agents must experiment with various choices in order to discover better options, but as they approach the optimal course of action, they must make use of what they already know. The behavior policy is the policy used to interact with the environment and serves as the exploration tool during training, while the target policy is the policy that the agent tries to learn. This interplay between behavior policy and target policy defines on-policy and off-policy learning: on-policy methods require the agent to act in accordance with the learned policy, whereas off-policy methods can learn the optimal policy regardless of the behavior policy, which makes them best suited to robotics applications [151], [139]. Furthermore, in robotics most action and state spaces are continuous. To handle continuous action spaces efficiently without losing adequate exploration, it is better to merge value-based and policy-based approaches [152]. Value-based and policy-based techniques, commonly known as model-free methods, do not use any model of the environment, which reduces their sample efficiency [152], [153]. Figure 4 summarizes this reasoning. That is why, in this section, we present a thorough study of three of the main model-free off-policy DRL algorithms: Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
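As a concrete, hedged illustration of the behavior-policy/target-policy split, the sketch below shows tabular Q-learning, the prototypical off-policy method: an epsilon-greedy behavior policy gathers experience, while the update target is computed from the greedy (target) policy. The state/action counts and hyper-parameters are arbitrary placeholders rather than values used in the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 25, 4
Q = np.zeros((N_STATES, N_ACTIONS))
GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.2

def behavior_policy(state):
    """Exploratory behavior policy (epsilon-greedy) used to interact with the environment."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))      # explore
    return int(np.argmax(Q[state]))              # exploit current knowledge

def q_learning_update(s, a, r, s_next, done):
    """Off-policy update: the target uses the greedy (target) policy max_a' Q(s', a'),
    regardless of which action the behavior policy actually takes next."""
    target = r + (0.0 if done else GAMMA * np.max(Q[s_next]))
    Q[s, a] += ALPHA * (target - Q[s, a])
```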

Deep Deterministic Policy Gradient
As mentioned earlier, Q-learning (QL) was a real breakthrough in RL. It is an off-policy TD algorithm that aims to learn the optimal action-value function Q(s, a); once Q(s, a) has been learned, the policy can be derived from it. However, this algorithm is not suitable for large state-action spaces, because there may be many unvisited regions and it cannot generalize to state-action pairs that have not been visited; in other words, its effectiveness is limited to small state-action spaces. Deep Q-Learning is preferred when dealing with more complex state and action spaces: deep learning is used as a function approximator to achieve optimal results. Function approximation builds an approximation of the Q-function from examples of the agent's interactions with the environment. This enables the algorithm to generalize from states that the agent has visited to states that it has not, resulting in a substantial decrease in the number of states that need to be visited to reach an approximate solution. Besides being a DRL algorithm, Deep Q-Learning (DQL) is the combination of Q-learning with a deep neural network, and a deep neural network that approximates a Q-function is called a deep Q-network (DQN). It is important to note that in a fast-moving field like AI many terms are not fully established; for instance, DQL is sometimes also referred to as DQN, which can lead to confusion. Thus, before explaining DQL, Table 5 attempts to clarify the differences between Q-learning (QL), deep Q-learning (DQL), and deep Q-networks (DQN) so that no ambiguity remains.
If a neural network is used, the Q-function is represented by a function whose parameters are the weights w. This means that, at each iteration, instead of modifying the Q-values directly, the parameter vector w that specifies the function is updated instead:

w ← w + α [ r + γ max_a' Q(s', a', w) − Q(s, a, w) ] ∇_w Q(s, a, w),

where ∇_w Q(s, a, w) is the gradient, α the learning rate, and γ the discount factor. To select optimal actions with a DQN, the neural network (NN) takes an input state s, and its outputs are the Q-values corresponding to the different actions in the action space, provided the actions are discrete; if they are not, the actions cannot be enumerated in this manner. In the discrete case, to convert the DQN values into a policy and select the optimal action in the environment, all that needs to be done is to take the argmax over a of Q(s, a): taking the maximum over this discrete set of values yields a*, which is also the policy π(s). This policy is employed to choose actions once the Q-network has been trained, but a similar process occurs during the training phase as well: during training, the Bellman targets are used as the Q targets.
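The sketch below is one possible PyTorch rendering of the discrete-action case just described: a small Q-network outputs one Q-value per action, the greedy policy takes the argmax, and the Bellman target is formed from a target network. Network sizes, names, and hyper-parameters are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Minimal Q-network: maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)                       # shape: (batch, n_actions)

q_net, target_net, gamma = DQN(), DQN(), 0.99

def greedy_action(state):
    """Policy derived from the Q-network: pi(s) = argmax_a Q(s, a)."""
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def bellman_target(reward, next_state, done):
    """Q target used during training: r + gamma * max_a' Q_target(s', a')."""
    with torch.no_grad():
        max_next_q = target_net(next_state.unsqueeze(0)).max(dim=1).values
    return reward + gamma * (1.0 - done) * max_next_q
```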
So overall, in the discrete case, all the agent has to do is take the maximum over a discrete set of values. In the continuous case, on the other hand, there is no meaningful way to enumerate the actions, so the Q-network must be modified. The NN can take the state s and the action a as input and output Q(s, a), but there is still a problem: the policy can no longer be obtained simply by taking the argmax over a. This now looks like an optimization problem in which, for each state, the agent has to determine the best action over the action input, which would be too expensive. One potential solution is to train a NN to produce the output of this optimization problem by mapping the input state to the output action a*, the solution of the optimization problem. The network takes s as input and produces the best action a* as output, which is exactly what the optimal policy should do, so this network is called the policy network. To train it to maximize the Q-function, the Q-function is parameterized in a Q-network and trained using the standard squared Bellman error loss L = (Q_target − Q)². It is very common to call this kind of setup an actor-critic setup: the policy is called the actor, because it produces the actions, and the Q-network is called the critic, because it evaluates a state-action tuple and says how good it is, which is exactly what Q does. This actor-critic algorithm, obtained by modifying DQN in this way so that it works well with continuous actions, is called Deep Deterministic Policy Gradient (DDPG), and it is quite often used in robotics.
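A minimal sketch of this actor-critic setup is given below, assuming PyTorch and toy state/action dimensions: the actor maps states to continuous actions, the critic scores (state, action) pairs, the critic is trained on the squared Bellman error, and the actor is trained to maximize the critic's output (implemented by minimizing its negative). It illustrates the idea only and omits the replay buffer, target-network updates, and exploration noise of a full DDPG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action in [-1, 1]."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar Q(s, a)."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = Actor(), Critic()
gamma = 0.99

def ddpg_losses(s, a, r, s_next, done):
    """Squared Bellman error for the critic; the actor is trained to maximize Q(s, actor(s))."""
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * critic_target(s_next, actor_target(s_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), q_target)
    actor_loss = -critic(s, actor(s)).mean()         # gradient ascent on Q via minimizing -Q
    return critic_loss, actor_loss
```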
• Some related work: Kerzel et al. put forward a novel method to tackle the challenge of collecting a vast number of training samples within a reasonable time frame, and demonstrated their method on a reach-for-grasp task that employs the Deep Deterministic Policy Gradient (DDPG) algorithm [154]. The goal-auxiliary DDPG algorithm, introduced by Wang et al., facilitates the effective acquisition of policies for controlling grasping in 6 dimensions (6D) from point cloud data; the approach uses demonstrations from an expert grasp and motion planner and incorporates the prediction of grasping goals as an auxiliary task to enhance the performance of both the critic and the actor [155]. Wang et al. also proposed the experience-based policy gradient method (EBDDPG), which promotes smooth robot movements; results demonstrated that this method improves the success rate of grasping tasks and encourages smoother manipulation [156]. The control of a robot arm's gripping can be improved by using the enhanced DDPG reinforcement learning algorithm introduced by Qi and Li [157]. In addition, Beik Mohammadi et al. presented an online continuous deep reinforcement learning approach for a reach-to-grasp task in a mixed-reality environment [158].
• Open problems: When using reinforcement learning with discrete action spaces, sub-optimal policies can arise due to a problem called overestimation bias. In continuous control settings, deterministic policy gradients can also suffer from overestimation bias [159]. Overestimation bias is a familiar issue in value-estimation-based reinforcement learning algorithms such as DDPG and deep Q-networks; it arises from function approximation and can lead to sub-optimal policies [160]. To overcome this issue, a modified version of the DDPG algorithm, called Twin Delayed Deep Deterministic Policy Gradient (TD3), has been proposed.

Twin Delayed Deep Deterministic Policy Gradient
Twin Delayed DDPG (TD3) is a deep reinforcement learning model that combines several recent techniques, including continuous double deep Q-learning, actor-critics, and policy gradients [161]. As outlined in the previous section, TD3 was introduced to reduce the approximation error [162], [161], [163]. TD3 is a modified version of DDPG that incorporates several techniques to address the overestimation of the value function: target policy smoothing, delayed updates of the target and policy networks, and clipped double Q-learning. In detail, TD3 uses two critics (hence the word "twin"), so each critic holds a different estimate of the Q-value. The TD3 algorithm can be seen in two parts: the Q-learning part of the training process and the policy learning part. In the Q-learning part, the replay memory is first initialized; then, for the actor, two NNs are built, one for the actor model and one for the actor target, while for the critics, two NNs are built for the critic models and two for the critic targets. In total, there are therefore 2 actor NNs and 4 critic NNs. Here is an overview of the training process of these neural networks. After building these NNs, a batch of transitions (s, s_{t+1}, a, r) is sampled from the memory. For each element of the batch, the actor target plays the next action a_{t+1} from the next state s_{t+1}; a Gaussian noise is added to this next action a_{t+1} and clamped within the range of values that the environment accommodates. Afterwards, the two critic targets take (s_{t+1}, a_{t+1}) as input and output two Q-values, Q_1(s_{t+1}, a_{t+1}) and Q_2(s_{t+1}, a_{t+1}). Only the smaller of the two Q-values is kept, representing the estimated value of the following state. This minimum allows us to compute the final target:

Q_t = r + γ min( Q_1(s_{t+1}, a_{t+1}), Q_2(s_{t+1}, a_{t+1}) ).

Each couple (s, a) is then fed into both critic models, which output two Q-values, Q_1(s, a) and Q_2(s, a), that are compared to this minimum-based critic target. The loss of the two critic models is computed as:

L_critic = ( Q_t − Q_1(s, a) )² + ( Q_t − Q_2(s, a) )².

In order to reduce the critic loss, the parameters of the two critic models are updated over the iterations with backpropagation, and the weights are updated through stochastic gradient descent. Moving to the policy learning part, the Q-values of the critic models are used to perform gradient ascent to maximize the return. Once the actor model is updated, the agent returns better actions that maximize the Q-values, and the agent moves closer to the optimal return. In other words, every d iterations the actor model is updated through gradient ascent on the output of the first critic model. Then, every d iterations, the actor target's weights are updated through Polyak averaging:

θ'_i ← τ θ_i + (1 − τ) θ'_i.

This equation consists of four components: the θ'_i on the left-hand side represents the actor target parameters after the update, τ denotes a small number, θ_i represents the actor model parameters, and the θ'_i on the right-hand side represents the actor target parameters before updating. The equation can be interpreted as a gradual transfer of weights from the actor model to the actor target, which brings the actor target closer to the actor model at each iteration; this stabilizes the learning process. Similarly, after every d iterations, the weights of the critic targets are updated in the same manner through Polyak averaging.
In this TD3 algorithm, ϕ denotes the parameters of the critic targets. The delayed aspect of the approach comes from updating the actor and the target networks only every d iterations, which is intended to enhance performance compared to the standard DDPG technique.
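The sketch below illustrates the two TD3-specific ingredients described above, the clipped double-Q target with target-policy smoothing and the Polyak update, assuming actor/critic networks shaped like those in the DDPG sketch earlier; the noise scale, clip value, and tau are placeholder hyper-parameters, not values recommended by the cited papers.

```python
import torch

gamma, noise_std, noise_clip, act_limit, tau = 0.99, 0.2, 0.5, 1.0, 0.005

def td3_target(r, s_next, done, actor_target, critic_target_1, critic_target_2):
    """Clipped double-Q target with target-policy smoothing, as described above."""
    with torch.no_grad():
        noise = (torch.randn_like(actor_target(s_next)) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-act_limit, act_limit)
        q1 = critic_target_1(s_next, a_next)
        q2 = critic_target_2(s_next, a_next)
        return r + gamma * (1 - done) * torch.min(q1, q2).squeeze(-1)

def polyak_update(model, target, tau=tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied every d iterations."""
    with torch.no_grad():
        for p, p_targ in zip(model.parameters(), target.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```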
• Related work: Hou et al. introduced RTD3, a modified version of the TD3 algorithm, to tackle the problem of overestimation bias when teaching multi-degree-of-freedom manipulators through deep reinforcement learning [164]. Overestimation of Q-values by the learned Q-function is a common problem with DDPG, which may cause the policy to break as it exploits the Q-function's errors; to address this issue, M et al. combined TD3 with Hindsight Experience Replay (HER) [165]. Khoi et al. utilized the TD3 algorithm along with a novel reward model to simulate the gait of a 6-DOF biped robot in a Gazebo/ROS environment [166]. Yang and Xu aimed to design a robot that can aid in warehouse object grasping using various DRL algorithms, including TD3 [167].
• Open problems: According to Nguyen and La [33] as well as Nian et al. [168], some recent successful RL algorithms, including Trust Region Policy Optimization (TRPO), Asynchronous Advantage Actor-Critic (A3C), and Proximal Policy Optimization (PPO), are prone to sample inefficiency. In contrast, off-policy methods based on Q-learning, like Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3), are less susceptible to this issue because they use replay buffers to learn efficiently from past samples. However, these off-policy Q-learning-based methods are highly sensitive to hyper-parameters and need a significant amount of tuning to converge. To address this convergence fragility, Soft Actor-Critic (SAC) adopts a similar approach to the aforementioned methods and integrates techniques to combat this challenge.

Soft Actor-Critic
Soft Actor-Critic (SAC) is also a DRL algorithm designed for continuous actions. Its three main components are: an actor-critic architecture with separate policy and value-function networks; an off-policy formulation that is not limited by the policy used to collect the data and therefore enables previously gathered data to be reused for better efficiency; and entropy maximization to ensure stability and promote the exploration of alternative options. SAC uses a modified RL objective function whose goal is to optimize both the policy's rewards and its entropy. Entropy here refers to the level of unpredictability of a random variable. The reasons for wanting the policy to have high entropy are: to encourage exploration; to induce equal probabilities for actions that have equal or almost identical Q-values; and to make sure the policy does not collapse by repeatedly selecting a specific action that could exploit inconsistencies in the estimated Q-function. With all of the above, SAC can overcome the brittleness problem. Its objective function, which maximizes the expected return and the entropy at the same time, is:

J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H( π(· | s_t) ) ],

where α is the temperature parameter weighting the entropy term H against the reward. In order to achieve this optimization, SAC uses three networks: a state-value function V parameterized by ψ, a soft Q-function Q parameterized by θ, and a policy function π parameterized by ϕ.
The value network can be trained by minimizing:

J_V(ψ) = E_{s_t ∼ D} [ ½ ( V_ψ(s_t) − E_{a_t ∼ π_ϕ}[ Q_θ(s_t, a_t) − log π_ϕ(a_t | s_t) ] )² ].

This equation means that, across all states sampled from the experiment's replay buffer D, we reduce the squared difference between the value network's prediction and the expected prediction of the Q-function plus the entropy of the policy function π (measured here by the negative log of the policy function).
To train the Q-network, the following error should be minimized:

J_Q(θ) = E_{(s_t, a_t) ∼ D} [ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}}[ V_target(s_{t+1}) ] ) )² ].

This means that, for all (s, a) pairs in the experiment's replay buffer, we reduce the squared difference between the Q-function's prediction and the immediate reward plus the discounted expected value of the following state; V_target is the target value function here. To train the policy network, the following error should be minimized:

J_π(ϕ) = E_{s_t ∼ D} [ D_KL( π_ϕ(· | s_t) ‖ exp( Q_θ(s_t, ·) ) / Z_θ(s_t) ) ].

Essentially, this objective pushes the policy distribution to more closely resemble the distribution given by the exponentiated Q-function, normalized by the partition function Z.
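The sketch below translates the three losses above into PyTorch for illustration, following the original SAC formulation with a separate value network. It assumes a `policy(s)` callable that returns a sampled action and its log-probability; the network names, shapes, and temperature value are assumptions rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

gamma, alpha = 0.99, 0.2   # alpha: entropy temperature

def sac_losses(s, a, r, s_next, done, value_net, value_target, q_net, policy):
    """The three SAC objectives described above (original formulation with a value network).
    `policy(s)` is assumed to return (sampled_action, log_prob) for the current policy."""
    new_a, log_prob = policy(s)

    # Value loss: V(s) should match E[Q(s, a~pi) - log pi(a|s)] (the soft value).
    with torch.no_grad():
        v_target = q_net(s, new_a).squeeze(-1) - alpha * log_prob
    value_loss = F.mse_loss(value_net(s).squeeze(-1), v_target)

    # Soft Q loss: Q(s, a) should match r + gamma * V_target(s').
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * value_target(s_next).squeeze(-1)
    q_loss = F.mse_loss(q_net(s, a).squeeze(-1), q_target)

    # Policy loss: move pi toward exp(Q)/Z, i.e. minimize E[alpha * log pi(a|s) - Q(s, a)].
    policy_loss = (alpha * log_prob - q_net(s, new_a).squeeze(-1)).mean()
    return value_loss, q_loss, policy_loss
```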
• Some related work: Chen and Lu proposed a system for object grasping that combines object detection techniques and the Soft Actor-Critic (SAC) algorithm, using an approaching-tracking-grasping scheme [169]. Feldman et al. introduced an approach to self-supervised reinforcement learning using a hybrid discrete-continuous adaptation of SAC [170].
• Open problems: SAC was designed to overcome the brittleness observed in the other off-policy model-free DRL algorithms (DDPG, TD3) [172]. But it turned out that the SAC algorithm also suffers from brittleness due, this time, to the alpha temperature that regulates exploration.
To overcome this problem, the authors suggested automatic temperature tuning. Haarnoja et al. adopted this solution [173]; however, another issue then arose, namely a high-variance problem. These limitations remain open issues to this day.

Discussion and quantitative analysis
Determining the best algorithm for the grasping task is still debated by many researchers. On-policy or off-policy? Policy-based or value-based? Model-based or model-free? Based on the current state of the art, this review determined which algorithms are the best fit for continuous control applications such as grasping: all-in-one off-policy, model-free algorithms, including DDPG, TD3, and SAC, have been the most effective ones thus far. To support this statement, Table 6 in the third section synthesizes most of the algorithms used for learning the grasping task. The papers reviewed in the study showed that the SAC algorithm outperforms the other off-policy algorithms; it is worth noting, however, that the papers also evaluated the performance of both off-policy and on-policy algorithms. Haarnoja et al. proposed Soft Actor-Critic, an off-policy actor-critic DRL method built on the framework of maximum entropy [172]. In this approach, the actor strives to maximize both entropy and expected reward. Their strategy delivered state-of-the-art performance on a variety of continuous control benchmark problems by combining off-policy updates with a stable stochastic actor-critic formulation. SAC, as an algorithm based on the maximum entropy principle, showed superior performance compared to baseline methods in terms of both learning time and final performance, particularly in challenging tasks; it also demonstrated better sample efficiency and ultimate performance than state-of-the-art techniques in previous studies. In contrast to PPO, which struggled with complicated and high-dimensional tasks, SAC was able to learn quickly due to its ability to handle large batch sizes. These results suggest that algorithms based on the maximum entropy principle may be more effective in challenging tasks. Based on these findings, the researchers developed a soft actor-critic algorithm that was shown to outperform state-of-the-art model-free deep reinforcement learning techniques such as DDPG and PPO. As a follow-up work, Haarnoja et al.
described SAC and thoroughly assessed SAC on a number of benchmark tasks as well as difficult real-world tasks including quadrupedal robot mobility and manipulating robots with a dexterous hand [173].By making these adjustments, SAC surpassed the performance of earlier on-policy and off-policy approaches in terms of sample efficiency and asymptotic performance, achieving state-of-the-art performance.Additionally, they showed that, in contrast to other algorithms that are off-policy, their method exhibits considerable stability and achieves similar performance across different arbitrary seeds..These findings imply that SAC is a strong contender for learning in practical robotics challenges.Their empirical research demonstrated that SAC, which can be used to train deep neural network policies and does not require any environment-specific hyperparameter tuning, can perform on par with or better than state-of-the-art model-free deep RL methods like the off-policy TD3 algorithm and the onpolicy PPO algorithm.Chen and Lu demonstrated that their developed system, which separates object detection from DRL control, enables autonomous grasping of a moving object with varying trajectories [169].Even though gripping a moving object in an unstructured environment is a challenging task, the actual experiment showed that the recommended intelligent system can produce encouraging outcomes with the SAC algorithm.Better outcomes than with DDPG or TD3 algorithms.Ünal [174] examined the controller strategies for a pick-and-place operation using a bi-rotor aerial manipulator.In addition, they studied how the change in the goal location of the object that the aerial manipulator must transport affects the training of the learning approaches and looked at the implications of manipulator degrees of freedom for DRL approaches.In their experiments, they analyzed the on-policy algorithms first.No matter how little their final mean reward differences were, TRPO outperformed PPO in terms of overall performance.PPO learned more quickly than the TRPO algorithm.Their results indicated that all approaches, with TRPO being the most stable, had similar mean episode lengths at the conclusion of training.PPO obtained high success rates more quickly than any other algorithm.Afterward, They analyzed the off-policy algorithms.When compared to the others, DDPG converged to a somewhat worse mean reward.They all arrived at a similar mean episode duration throughout training, with the SAC algorithm being the finest and the DDPG algorithm being the least efficient as expected, given SAC and TD3 build upon DDPG and attempt to increase its convergence and stability.Additionally, they displayed the mean success rates of the three off-policy algorithms during training, and once more, the results are the same: SAC and TD3 achieve very similar success rates, while DDPG achieves the worst.All of them achieved a respectable success rate almost simultaneously, with DDPG being a little bit slower.Subsequently, they compared the best on-policy and off-policy algorithms (SAC and TRPO).Compared to the SAC algorithm, TRPO was superior.This may be the case since the SAC algorithm's hyperparameters were not specifically tuned for the task at hand given the small difference between them.TRPO was superior in terms of time duration, but in terms of the number of time steps the results show that this is not the case.So the overall result stated that off-policy algorithms are demonstrably considerably more sample-efficient.Here, [171], PPO and SAC were 
In [171], PPO and SAC were studied, and a fine-tuning approach was proposed to speed up learning, demonstrating the continual adaptation of on-policy RL to changing contexts and enabling the learned policy to adjust to and execute the revised task. It was shown that the learned control strategy can be applied to a variety of object geometries and initial robot/part configurations. In principle, SAC should acquire the task in fewer episodes because it is an off-policy algorithm that reuses previously recorded transitions stored in a replay buffer. To validate this notion, the training performance of the SAC and PPO algorithms was compared on the considered gripping task. The mean reward and the number of successful episode steps were plotted for the initial 2 million time steps of both algorithms. SAC learned to accumulate substantially greater average rewards than PPO and to complete the task within those 2M time steps, confirming the premise, while PPO failed to complete the task within the same budget. The picture changed considerably, however, when the average reward and the count of successful episode steps were plotted against wall-clock time: with SAC's off-policy updates, the rewards obtained during each update iteration are lower compared to PPO's on-policy updates. When obtaining new experience incurs significant costs and computational resources are not a concern, off-policy approaches such as SAC may therefore be more favorable. Finally, the approach in [155] improves the performance of both the actor and the critic through demonstrations from an expert motion and grasp planner, together with grasping-goal prediction as an auxiliary task; optimal performance is achieved when both the actor and the critic are assigned the goal-auxiliary task.
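The sample-efficiency argument above hinges on the replay buffer: an off-policy learner such as SAC can keep re-sampling transitions collected under older policies, so each environment step can contribute to many gradient updates, whereas an on-policy learner such as PPO must discard its rollouts after each update. A minimal, illustrative sketch of such a buffer is given below; it is not the implementation used in [171].

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of (state, action, reward, next_state, done) transitions."""

        def __init__(self, capacity=100_000):
            # Oldest transitions are evicted automatically once capacity is reached.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=256):
            # Off-policy updates draw mini-batches from the whole stored history,
            # regardless of which policy generated the data.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)

The trade-off noted in [171] follows directly from this mechanism: reusing old data reduces the number of environment interactions required, but each individual update is computed from partly stale experience, which can lower the per-update reward gain compared to on-policy methods.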

Conclusion
The research paper presents an extensive examination of different Reinforcement Learning (RL) algorithms intended for robotic grasping tasks, concentrating particularly on Deep Reinforcement Learning (DRL) algorithms.
The study highlights the most effective DRL algorithms for handling complex and challenging tasks, such as grasping. Additionally, to simplify the research process for others, the paper provides a collection of different forms of DRL grasping tasks. The analysis indicates that model-free off-policy approaches, such as DDPG, TD3, and SAC, are more suitable for robotic applications, especially those with continuous action spaces. The study concludes with a summary of the benefits and drawbacks of RL, a deep analysis of the most prominent algorithms in robotic grasping, and open problems for further research. The insights presented in this paper emphasize the importance of continued research and development of DRL algorithms to enhance the capabilities of robots in handling complex and challenging tasks. Overall, this study contributes to advancing the field of robotic manipulation and provides a useful resource for researchers seeking to explore the potential of RL in robotic grasping.
A. Nomenclature

Figure 1. Percentage of publications in RL, RL+Robotics, DRL, and DRL+Robotics in the last 10 years, based on the WOS database

Figure 4. The reasoning behind model-free off-policy algorithms

Table 2. Summarizing State-of-the-Art Achievements and Future Challenges in Grasping with Sim-to-Real Transfer

Zhu et al. presented a model-free approach to DRL that utilizes a limited amount of demonstration data to support an RL agent. Their methodology was applied to robotic manipulation tasks and resulted in the training of policies that involve both visual perception and motor control, using RGB camera inputs to determine joint velocities in an end-to-end manner [105]. Ragaglia et al. suggested a resolution to the Robot Learning from Demonstration (RLfD) challenge in dynamic environments; to demonstrate its efficiency, a set of pick-and-place experiments was performed using an ABB YuMi robot and the system's performance was evaluated accordingly [106]. Recent studies have shown the possibility of training multi-task deep visuomotor policies for robotic manipulation through various forms of LfD and RL. The capabilities of end-to-end LfD architectures have been extended by Abolghasemi et al. to encompass object manipulation in cluttered environments [107]. A low-cost hardware interface that can collect grasping demonstrations from individuals in diverse environments has been proposed by Song et al. [108]. A dataset of human-robot demonstrations suitable for training robots on various tasks was presented by Sharma et al. [109]. Kim et al. provided an overview of robotic cleaning tasks utilizing different control methods [110], while Yang et al. suggested a DL model to learn robotic manipulation actions from videos of human demonstrations [111]. Inverse optimal control (IOC) has also been applied to learning behaviors from demonstrations, particularly for controlling high-dimensional, torque-actuated robotic systems [103]. Schoettler et al. examined challenging industrial insertion tasks that involve visual input and different types of natural rewards, including sparse rewards and goal images [104]; they demonstrated that by combining reinforcement learning (RL) with prior knowledge, these tasks can be effectively solved with a moderate amount of interaction in the real world. In contrast, Kilinc et al. suggested an RL-based approach that does not rely on human demonstrations [112]. Smith et al. conducted research on how automated robotic learning frameworks can help overcome challenges related to defining and scaffolding

Table 3. Summarizing State-of-the-Art Achievements and Future Challenges in Grasping with Robots Learning from Demonstrations

Vision-based robotic grasping
Various methods have been explored in several studies aimed at developing vision-based robotic grasping techniques as a means of enabling intelligent robots to perceive and interact with their surroundings. For instance, Sehgal et al. proposed a Genetic Algorithm (GA) that accelerates the learning agent [127]. Similarly, Haarnoja et al. employed Soft Q-Learning (SQL), a maximum entropy reinforcement learning algorithm, to manipulate the robot's gripper and move it to a specific target position in Cartesian space [128]. A scalable reinforcement learning approach for learning vision-based dynamic manipulation skills was developed by Kalashnikov et al. [129]. In a separate study, Du et al. conducted a thorough investigation of vision-guided robotic grasping [57]. Wu et al. proposed a method to mitigate poor performance in stochastic environments using an Actor-duelling-Critic (ADC) algorithm [130]. Lin et al. introduced a UAV vision-based aerial grasping system to capture target objects [131]. Nonetheless, these discrete settings have not yet been investigated in practical applications involving continuous, high-dimensional state-action spaces. To address this issue, Bodnar et al. developed Quantile QT-Opt (Q2-Opt), a distributional variant of the distributed Q-learning algorithm, for continuous domains and evaluated its performance in both simulated and real vision-based robotic grasping tasks [132]. Additionally, Kobayashi et al. proposed a Reward-Punishment Actor-Critic (RP-AC) algorithm to optimize robot trajectories by acquiring suitable rewards [133], while Demura et al. used the You Only Look Once (YOLO) object detection approach to identify the optimal grasp point for stable manipulation in their Q-Learning grasping motion acquisition method [134]. This technique enabled the robot to pick up the uppermost folded towel from a stack and place it on a table. Kim et al. demonstrated that deep learning-based techniques with direct visual input can achieve state-of-the-art results for robotic grasping in a cluttered environment with diverse unseen target objects [73]. Julian et al. introduced a robot learning framework that allows for continuous adaptation

Table 4. Summarizing State-of-the-Art Achievements and Future Challenges in Vision-based Robotic Grasping

Table 6. Summary of the key findings of state-of-the-art (SOA) robotic grasping studies that considered closely related works