N. Mhaisen | TU Delft Repository

DRONE-RL

Dynamic reinforcement learning for online navigation of UAVs in evolving environments

Journal article (2026) - Noor Khial, Mhd Saria Allahham, Naram Mhaisen, Loay Ismail, Mohamed Mabrok, Amr Mohamed

Locating mobile targets in dynamic and cluttered environments, such as disaster zones or adversarial terrains, presents significant challenges due to unknown target mobility and changing environmental conditions. Unmanned Aerial Vehicles (UAVs), equipped with advanced sensing capabilities, offer a viable solution, but require adaptive planning mechanisms to navigate through non-stationary environments effectively. In this paper, we propose a hybrid learning framework for multi-target visitation that combines offline reinforcement learning (RL) and online convex optimization (OCO) to address these challenges. Specifically, we leverage Deep Deterministic Policy Gradient (DDPG) to pre-train various UAV navigation policies across representative scenarios. During deployment, an OCO-based policy selection mechanism adaptively selects the best policy in real-time that ensures responsiveness to environmental changes without retraining. Experimental results demonstrate that our approach consistently adapts to varying levels of non-stationarity and clutter, outperforming benchmark methods in adaptability and mission success. Notably, the online learner exhibits asymptotically vanishing average regret with different levels of non-stationary behaviors. ...

Slicing for AI

An Online Learning Framework for Network Slicing Supporting AI Services

Journal article (2025) - M. Helmy, A. A. Abdellatif, N. Mhaisen, A. Mohamed, A. Erbad

The forthcoming 6G networks will embrace a new realm of AI-driven services that requires innovative network slicing strategies, namely slicing for AI, which involves the creation of customized network slices to meet Quality of Service (QoS) requirements of diverse AI services. This poses challenges due to time-varying dynamics of users’ behavior and mobile networks. Thus, this paper proposes an online learning framework to determine the allocation of computational and communication resources to AI services, to optimize their accuracy as one of their unique key performance indicators (KPIs), while abiding by resources, learning latency, and cost constraints. We define a problem of optimizing the total accuracy while balancing conflicting KPIs, prove its NP-hardness, and propose an online learning framework for solving it in dynamic environments. We present a basic online solution and two variations employing a pre-learning elimination method for reducing the decision space to expedite the learning. Furthermore, we propose a biased decision space subset selection by incorporating prior knowledge to enhance the learning speed without compromising performance and present two alternatives of handling the selected subset. Our results depict the efficiency of the proposed solutions in converging to the optimal decisions, while reducing decision space and improving time complexity. Additionally, our solution outperforms State-of-the-Art techniques in adapting to diverse environmental dynamics and excels under varying levels of resource availability. ...

An online learning framework for UAV search mission in adversarial environments

Journal article (2025) - Noor Khial, Naram Mhaisen, Mohamed Mabrok, Amr Mohamed

The rapid evolution of Unmanned Aerial Vehicles (UAVs) has revolutionized target search operations in various fields, including military applications, search and rescue missions, and post-disaster management. This paper presents the application of a multi-armed bandit algorithm to a UAV search mission. The UAV's mission is to locate a mobile target formation, operating under the assumption of an unknown and potentially non-stationary probability distribution, by learning the formation's strategy over time. To achieve this, we formulate an optimization problem and leverage the Exp3 algorithm (exponential-weighted exploration and exploitation) for its solution. To enhance the learning process, we integrate environment observations as context, resulting in a variant referred to as C-Exp3. However, C-Exp3 is not designed for scenarios where the target formation strategy changes over time. Therefore, AC-Exp3 is proposed as an adaptive solution, featuring a human-centric drift detection mechanism to detect the changes in the formation strategy and adjust the learning process accordingly. Furthermore, the Exp4 algorithm is proposed as a self-adjustment meta-learner to address changes in the formation's strategy. We evaluate the performance of C-Exp3, AC-Exp3, and Exp4 through a series of experiments with a focus on non-stationary environments. Our primary objective is reaching the unknown optimal-in-hindsight policy as the time t approaches the horizon T, thereby reflecting the UAV's capacity to learn the formation's strategy. AC-Exp3 demonstrates enhanced adaptability compared to C-Exp3. Meanwhile, Exp4 emerges as a robust performer, swiftly adapting to new strategies. ...
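
For readers unfamiliar with Exp3, the following is a minimal sketch of the textbook algorithm only, not the C-Exp3, AC-Exp3, or Exp4 variants from the paper. The function names, the exploration rate `gamma`, the toy reward function, and the weight rescaling step are all assumptions made for illustration.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def run_exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Run textbook Exp3 and return the final arm-selection distribution.

    reward_fn(t, arm) is a hypothetical callback returning a reward in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for t in range(horizon):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate: only the pulled arm is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the largest weight is 1 (avoids overflow; the
        # probabilities depend only on weight ratios, so they are unchanged).
        top = max(weights)
        weights = [w / top for w in weights]
    return exp3_probabilities(weights, gamma)


# Toy environment (assumption): arm 2 always pays 1, the others pay 0.
final_probs = run_exp3(lambda t, arm: 1.0 if arm == 2 else 0.0,
                       n_arms=4, horizon=2000)
```

In this toy setting the learner concentrates its probability mass on arm 2 while keeping a floor of `gamma / n_arms` on every arm, which is what lets it keep sampling arms whose rewards may change later.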

Multi-Target Path Planning with Probabilistic Detection in Cluttered Environments

Conference paper (2025) - Noor Khial, Naram Mhaisen, Loay Ismail, Mohamed Mabrok, Amr Mohamed

Autonomous Unmanned Aerial Vehicles (UAVs) offer substantial advantages for tasks such as surveillance, disaster management, and environmental monitoring, where human intervention can be risky. With advancements in their agility and autonomy, UAVs are becoming essential for critical tasks in combat, reconnaissance, wildfire monitoring, and disaster search and rescue. This paper addresses a key challenge in UAV path planning: efficiently visiting multiple unknown mobile targets in complex, obstacle-filled environments. We leverage the Deep Deterministic Policy Gradient (DDPG) framework to continuously control UAV movement to enable effective obstacle avoidance and sequential target visitation. Our approach allows the UAV to learn the unknown distribution of mobile targets and determine optimal paths while navigating around obstacles. With limited environment information, the agent receives rewards based on the confidence of detecting targets within its observation field. We validate the effectiveness of our method through comparison with an optimal benchmark that assumes perfect knowledge of target mobility and obstacle locations. Results indicate that increasing target numbers significantly impacts the agent's performance by requiring additional training time. Moreover, heavily cluttered environments reduce mission success rates for target visitation. ...

Optimistic Learning with Applications to Caching Networks

Doctoral thesis (2025) - N. Mhaisen, K.G. Langendoen, G. Iosifidis

AI/ML-based approaches are at the forefront of resource management in modern communication networks. Deep learning, in particular, enables fast and high-performing decision-making when sufficient representative training data is available to build accurate offline models. Conversely, online learning solutions operate without prior training and make decisions based on real-time observations; however, they tend to be overly conservative to ensure robustness (i.e., worst-case guarantees).

This thesis advocates optimistic learning as a decision-making framework for resource management in networked systems. An optimistic learning algorithm integrates untrusted predictions and assesses their accuracy at runtime. When predictions are accurate, these algorithms achieve performance levels comparable to offline-trained models. Crucially, they maintain the robustness of regular online learning, ensuring reliability even when predictions are inaccurate.

We focus on caching networks and propose new optimistic learning algorithms for coded caching and whole-file caching. These algorithms provably converge to the best fixed caching allocation at an order-optimal rate, independent of prediction accuracy. However, when predictions are accurate, convergence is highly accelerated, achieving the “optimistic” premise.

We then extend our focus to scenarios where the optimization target itself changes over time. In caching, this translates to competing against dynamic caching configurations rather than a single best fixed allocation. We demonstrate that optimism is even more valuable in this setting; accurate predictions help the learner efficiently track moving targets, adapting in real-time without excessive conservatism. Furthermore, we explore the role of predictions in stateful systems, where past decisions influence future costs. In such environments, optimistic learning benefits from horizon-based predictions, leveraging forecasts over extended time windows rather than immediate next-cost predictions.

All proposed algorithms are rigorously analyzed and come with provable performance guarantees under carefully designed and explicitly stated metrics. By integrating optimistic learning into network optimization, this thesis explores the spectrum between prediction-driven and robust approaches, offering a principled framework for leveraging untrusted ML predictions in network resource allocation.
...

On the Dynamic Regret of Following the Regularized Leader

Optimism with History Pruning

Journal article (2025) - Naram Mhaisen, George Iosifidis

We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce “lazy” iterates. However, building on insights showing FTRL’s ability to produce “agile” iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators yields a principled way to interpolate between greedy and agile updates and offers several benefits, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the “lazy” projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary. ...

Optimistic Online Non-stochastic Control via FTRL

Conference paper (2024) - Naram Mhaisen, George Iosifidis

This paper brings the concept of 'optimism' to the new and promising framework of online Non-stochastic Control (NSC). Namely, we study how NSC can benefit from a prediction oracle of unknown quality responsible for forecasting future costs. The posed problem is first reduced to an optimistic learning with delayed feedback problem, which is handled through the Optimistic Follow the Regularized Leader (OFTRL) algorithmic family. This reduction enables the design of OptFTRL-C, the first Disturbance Action Controller (DAC) with optimistic policy regret bounds. These new bounds are commensurate with the oracle's accuracy, ranging from O(1) for perfect predictions to the order-optimal O(√T) even when all predictions fail. By addressing the challenge of incorporating untrusted predictions into online control, this work contributes to the advancement of the NSC framework and paves the way toward effective and robust learning-based controllers. ...

An Online Learning Framework for UAV Target Search Missions in Non-Stationary Environments

Conference paper (2024) - Noor Khial, Naram Mhaisen, Mohamed Mabrok, Amr Mohamed

The rapid evolution of Unmanned Aerial Vehicles (UAVs) has revolutionized target search operations in various fields, including military applications, search and rescue missions, and post-disaster management. In this paper, we propose the use of a multi-armed bandit algorithm for a UAV's search mission in an unknown and adversarial setting. The UAV's objective is to locate a mobile target formation, assuming that their mobility resembles an adversarial behavior. To achieve this, we formulate an optimization problem and leverage the Exp3 (exponential-weighted exploration and exploitation) algorithm to solve it. The targets are assumed to be moving under the assumption of an unknown and potentially non-stationary probability distribution. To enhance the learning process, we integrate environmental observations as contextual information, resulting in a variant called C-Exp3, which optimizes the search process. Finally, we evaluate the performance of C-Exp3 in UAV search missions, focusing on adversarial environments. The primary objective for the UAV is to converge towards an optimal policy as time t approaches the horizon T, reflecting the UAV's capacity to learn the formation's strategy. ...

Adaptive Online Non-stochastic Control

Conference paper (2024) - Naram Mhaisen, George Iosifidis

We tackle the problem of Non-stochastic Control (NSC) with the aim of obtaining algorithms whose policy regret is proportional to the difficulty of the controlled environment. Namely, we tailor the Follow The Regularized Leader (FTRL) framework to dynamical systems by using regularizers that are proportional to the actual witnessed costs. The main challenge arises from using the proposed adaptive regularizers in the presence of a state, or equivalently, a memory, which couples the effect of the online decisions and requires new tools for bounding the regret. Via new analysis techniques for NSC and FTRL integration, we obtain novel disturbance action controllers (DAC) with sub-linear data adaptive policy regret bounds that shrink when the trajectory of costs has small gradients, while staying sub-linear even in the worst case. ...

Online Caching with no Regret

Optimistic Learning via Recommendations

Journal article (2023) - Naram Mhaisen, George Iosifidis, Douglas Leith

The design of effective online caching policies is an increasingly important problem for content distribution networks, online social networks and edge computing services, among other areas. This paper proposes a new algorithmic toolbox for tackling this problem through the lens of optimistic online learning. We build upon the Follow-the-Regularized-Leader (FTRL) framework, which is developed further here to include predictions for the file requests, and we design online caching algorithms for bipartite networks with pre-reserved or dynamic storage subject to time-average budget constraints. The predictions are provided by a content recommendation system that influences the users' viewing activity and hence can naturally reduce the caching network's uncertainty about future requests. We also extend the framework to learn and utilize the best request predictor in cases where many are available. We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions, and maintain the sub-linear regret bound O(√T), which is the best achievable bound for policies that do not use predictions, even for arbitrarily bad predictions. The performance of the proposed algorithms is evaluated with detailed trace-driven numerical tests. ...

Federated Learning for Online Resource Allocation in Mobile Edge Computing

A Deep Reinforcement Learning Approach

Conference paper (2023) - Jingjing Zheng, Kai Li, Naram Mhaisen, Wei Ni, Eduardo Tovar, Mohsen Guizani

Federated learning (FL) is increasingly considered to circumvent the disclosure of private data in mobile edge computing (MEC) systems. Training with large data can enhance FL learning accuracy, which is associated with non-negligible energy use. Scheduling edge devices with small data saves energy but decreases FL learning accuracy. A trade-off between the energy consumption of edge devices and the learning accuracy of FL is formulated in this proposed work. The FL-enabled twin-delayed deep deterministic policy gradient (FL-TD3) framework is proposed as a solution to the formulated problem because its state and action spaces are large in a continuous domain. This framework maximizes the ratio of FL accuracy to the device’s energy consumption. A comparison of the numerical results with the state-of-the-art demonstrates that the ratio has been improved significantly. ...

Reinforcement Learning for Intelligent Healthcare Systems

A Review of Challenges, Applications, and Open Research Issues

Journal article (2023) - Alaa Awad Abdellatif, Naram Mhaisen, Amr Mohamed, Aiman Erbad, Mohsen Guizani

The rise of chronic disease patients and the pandemic pose immediate threats to healthcare expenditure and mortality rates. This calls for transforming healthcare systems away from one-on-one patient treatment into intelligent health systems, leveraging the recent advances of Internet of Things and smart sensors. Meanwhile, reinforcement learning (RL) has witnessed an intrinsic breakthrough in solving a variety of complex problems for distinct applications and services. Thus, this article presents a comprehensive survey of the recent models and techniques of RL that have been developed/used for supporting Intelligent-healthcare (I-health) systems. It can guide the readers to deeply understand the state-of-the-art regarding the use of RL in the context of I-health. Specifically, we first present an overview of the I-health systems' challenges, architecture, and how RL can benefit these systems. We then review the background and mathematical modeling of different RL, deep RL (DRL), and multiagent RL models. We highlight important guidelines on how to select the appropriate RL model for a given problem, and provide quantitative comparisons, showing the results of deploying key RL models in two scenarios that can be followed in monitoring applications. After that, we conduct an in-depth literature review on RL's applications in I-health systems, covering edge intelligence, smart core network, and dynamic treatment regimes. Finally, we highlight emerging challenges and future research directions to enhance RL's success in I-health systems, which opens the door for exploring some interesting and unsolved problems. ...

Optimistic No-regret Algorithms for Discrete Caching

Journal article (2023) - Naram Mhaisen, Abhishek Sinha, Georgios Paschos, George Iosifidis

We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning, where the caching policy has access to a prediction oracle. The successive file requests are assumed to be generated by an adversary, and no assumption is made on the accuracy of the oracle. We provide a universal lower bound for prediction-assisted online caching and proceed to design a suite of policies with a range of performance-complexity trade-offs. All proposed policies offer sublinear regret bounds commensurate with the accuracy of the oracle. In this pursuit, we design, to the best of our knowledge, the first optimistic Follow-the-Perturbed-Leader policy, which generalizes beyond the caching problem. We also study the problem of caching files with different sizes and the bipartite network caching problem. ...

Exploring Deep Reinforcement Learning-Assisted Federated Learning for Online Resource Allocation in Privacy-Preserving EdgeIoT

Journal article (2022) - Jingjing Zheng, Kai Li, Naram Mhaisen, Wei Ni, Eduardo Tovar, Mohsen Guizani

Federated learning (FL) has been increasingly considered to preserve data training privacy from eavesdropping attacks in mobile-edge computing-based Internet of Things (EdgeIoT). On the one hand, the learning accuracy of FL can be improved by selecting the IoT devices with large data sets for training, which gives rise to a higher energy consumption. On the other hand, the energy consumption can be reduced by selecting the IoT devices with small data sets for FL, resulting in a falling learning accuracy. In this article, we formulate a new resource allocation problem for privacy-preserving EdgeIoT to balance the learning accuracy of FL and the energy consumption of the IoT device. We propose a new FL-enabled twin-delayed deep deterministic policy gradient (FL-DLT3) framework to achieve the optimal accuracy and energy balance in a continuous domain. Furthermore, long short-term memory (LSTM) is leveraged in FL-DLT3 to predict the time-varying network state while FL-DLT3 is trained to select the IoT devices and allocate the transmit power. Numerical results demonstrate that the proposed FL-DLT3 achieves fast convergence (less than 100 iterations) while the FL accuracy-to-energy consumption ratio is improved by 51.8% compared to the existing state-of-the-art benchmark. ...

Pervasive AI for IoT Applications

A Survey on Resource-Efficient Distributed Artificial Intelligence

Journal article (2022) - Emna Baccour, Naram Mhaisen, Alaa Awad Abdellatif, Aiman Erbad, Amr Mohamed, Mounir Hamdi, Mohsen Guizani

Artificial intelligence (AI) has witnessed a substantial breakthrough in a variety of Internet of Things (IoT) applications and services, spanning from recommendation systems and speech processing applications to robotics control and military surveillance. This is driven by the easier access to sensory data and the enormous scale of pervasive/ubiquitous devices that generate zettabytes of real-time data streams. Designing accurate models using such data streams, to revolutionize the decision-taking process, inaugurates pervasive computing as a worthy paradigm for a better quality-of-life (e.g., smart homes and self-driving cars). The confluence of pervasive computing and artificial intelligence, namely Pervasive AI, expanded the role of ubiquitous IoT systems from mainly data collection to executing distributed computations with a promising alternative to centralized learning, presenting various challenges, including privacy and latency requirements. In this context, an intelligent resource scheduling should be envisaged among IoT devices (e.g., smartphones, smart vehicles) and infrastructure (e.g., edge nodes and base stations) to avoid communication and computation overheads and ensure maximum performance. In this paper, we conduct a comprehensive survey of the recent techniques and strategies developed to overcome these resource challenges in pervasive AI systems. Specifically, we first present an overview of pervasive computing, its architecture, and its intersection with artificial intelligence. We then review the background, applications and performance metrics of AI, particularly Deep Learning (DL) and reinforcement learning, running in a ubiquitous system. Next, we provide a deep literature review of communication-efficient techniques, from both algorithmic and system perspectives, of distributed training and inference across the combination of IoT devices, edge devices and cloud servers. Finally, we discuss our future vision and research challenges. ...

Online Caching with Optimistic Learning

Conference paper (2022) - Naram Mhaisen, George Iosifidis, Douglas Leith

The design of effective online caching policies is an increasingly important problem for content distribution networks, online social networks and edge computing services, among other areas. This paper proposes a new algorithmic toolbox for tackling this problem through the lens of optimistic online learning. We build upon the Follow-the-Regularized-Leader (FTRL) framework, which is developed further here to include predictions for the file requests, and we design online caching algorithms for bipartite networks with fixed-size caches or elastic leased caches subject to time-average budget constraints. The predictions are provided by a content recommendation system that influences the users' viewing activity, and hence can naturally reduce the caching network's uncertainty about future requests. We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions, and maintain the best achievable regret bound O(√T) even for arbitrarily bad predictions. The performance of the proposed algorithms is evaluated with detailed trace-driven numerical tests. ...

Multi-Agent Reinforcement Learning for Network Selection and Resource Allocation in Heterogeneous Multi-RAT Networks

Journal article (2022) - Mhd Saria Allahham, Alaa Awad Abdellatif, Naram Mhaisen, Amr Mohamed, Aiman Erbad, Mohsen Guizani

The rapid production of mobile devices along with the wireless applications boom is continuing to evolve daily. This motivates the exploitation of wireless spectrum using multiple Radio Access Technologies (multi-RAT) and developing innovative network selection techniques to cope with such intensive demand while improving Quality of Service (QoS). Thus, we propose a distributed framework for dynamic network selection at the edge level, and resource allocation at the Radio Access Network (RAN) level, while taking into consideration diverse applications' characteristics. In particular, our framework employs a deep Multi-Agent Reinforcement Learning (DMARL) algorithm, that aims to maximize the edge nodes' quality of experience while extending the battery lifetime of the nodes and leveraging adaptive compression schemes. Indeed, our framework enables data transfer from the network's edge nodes, with multi-RAT capabilities, to the cloud in a cost and energy-efficient manner, while maintaining QoS requirements of different supported applications. Our results depict that our solution outperforms state-of-the-art techniques of network selection in terms of energy consumption, latency, and cost. ...

On Designing Smart Agents for Service Provisioning in Blockchain-Powered Systems

Journal article (2022) - Naram Mhaisen, Mhd Saria Allahham, Amr Mohamed, Aiman Erbad, Mohsen Guizani

Service provisioning systems assign users to service providers according to allocation criteria that strike an optimal trade-off between users' Quality of Experience (QoE) and the operation cost endured by providers. These systems have been leveraging Smart Contracts (SCs) to add trust and transparency to their criteria. However, deploying fixed allocation criteria in SCs does not necessarily lead to the best performance over time since the blockchain participants join and leave flexibly, and their load varies with time, making the original allocation sub-optimal. Furthermore, updating the criteria manually at every variation in the blockchain jeopardizes the autonomous and independent execution promised by SCs. Thus, we propose a set of light-weight agents for SCs that are capable of optimizing the performance. We also propose using online learning SCs, empowered by a Deep Reinforcement Learning (DRL) agent, that leverage the chained data to continuously self-tune their allocation criteria. We show that the proposed learning-assisted method achieves superior performance on the combinatorial multi-stage allocation problem while still being executable in real-time. We also compare the proposed approach with standard heuristics as well as planning methods. Results show a significant performance advantage over heuristics and better adaptability to the dynamic nature of blockchain networks. ...