Joint Charging Scheduling and Computation Offloading in EV-Assisted Edge Computing: A Safe DRL Approach

Electric Vehicle-assisted Multi-access Edge Computing (EV-MEC) is a promising paradigm in which EVs share their computation resources at the network edge to perform intensive computing tasks while charging. In EV-MEC, a fundamental problem is to jointly decide the charging power of EVs and the allocation of computation tasks to EVs, so as to meet both the diverse charging demands of EVs and the stringent performance requirements of heterogeneous tasks. To address this challenge, we propose a new joint charging scheduling and computation offloading scheme (OCEAN) for EV-MEC. Specifically, we formulate a cooperative two-timescale optimization problem to minimize the charging load and its variance subject to the performance requirements of computation tasks. We then decompose this sophisticated optimization problem into two sub-problems: charging scheduling and computation offloading. For the former, we develop a novel safe deep reinforcement learning (DRL) algorithm and theoretically prove the feasibility of the learned charging scheduling policy. For the latter, we reformulate it as an integer non-linear programming problem to derive the optimal offloading decisions. Extensive experimental results demonstrate that OCEAN achieves performance similar to that of the optimal strategy and realizes up to a 24% improvement in charging load variance over three state-of-the-art algorithms while satisfying the charging demands of all EVs.

driving and vehicular video streaming [2], modern vehicles are expected to be equipped with performant computation and storage devices to effectively run these applications. However, these rich resources on EVs are mostly underutilized when the vehicles are parked somewhere (e.g., in a parking lot), overlooking the fact that their idle resources could be used to support various computing tasks at the network edge [3], [4]. Nevertheless, due to their limited battery capacities, parked EVs may be reluctant to share their computation resources for task execution.
To address this issue, a promising way is to effectively utilize the vast idle resources of EVs to extend the computational and storage capabilities of the network edge while EVs are charging [5], which we refer to as EV-assisted multi-access edge computing (EV-MEC). Given a certain incentive mechanism (e.g., monetary reward or free parking), EVs are generally willing, while charging, to share their computation resources to assist task processing as long as their charging demands can be satisfied before leaving [6]. In EV-MEC, a group of EVs as a whole acts as a static network infrastructure that can provide effective computing and storage services. For example, a pool of EVs charging in the parking lot of a shopping mall can be envisioned as an edge computing platform serving a myriad of customers inside the mall [7]. Thus, EV-MEC can alleviate the huge economic and time costs of deploying massive edge servers to meet the ever-increasing computing demands of various delay-sensitive applications.
In the EV-MEC, a fundamental problem is to jointly determine the charging power of EVs and the allocation of computation tasks among EVs, which has attracted increasing attention in recent years [5], [8], [9]. Specifically, it is crucial to efficiently schedule the charging power of EVs to meet the diverse EV charging demands. Meanwhile, the computation offloading decisions should be jointly optimized to minimize the energy consumption of task processing and satisfy the stringent performance requirements (e.g., service delay) of computation tasks. Therefore, an efficient joint charging scheduling and computation offloading strategy is necessary for the EV-MEC.
However, the design of such a strategy faces two critical challenges. First, the lack of knowledge of the system dynamics, including the diverse EV charging demands, dynamic task arrivals, and heterogeneous performance requirements, brings huge complexity to the joint charging scheduling and offloading
optimization problem in the EV-MEC. Specifically, since these system dynamics are heavily influenced by various uncertain factors (e.g., traffic conditions, vehicle behaviors, and user preferences), it is extremely difficult to acquire their complete knowledge in advance, posing significant challenges in finding the optimal charging scheduling and computation offloading decisions. Second, even if this knowledge were known a priori, the varied target battery levels, charging times, and computation capabilities of EVs, the processing demands of computation tasks (e.g., computation density and service delay), and the mutual coupling between charging scheduling and computation offloading create significant difficulties in solving the problem.
Many efforts have been devoted to the charging scheduling for EVs [10], [11], [12] and computation offloading in MEC [13], [14], [15], respectively. Meanwhile, some recent studies have paid attention to the joint charging scheduling and computation offloading in EV-MEC [5], [8], [9]. However, they did not take the dynamics of both charging demands and computation task arrivals into account, which are important characteristics of practical EV-MEC systems. To fill this gap, we develop a novel joint charging scheduling and computation offloading (OCEAN) algorithm for the EV-MEC based on safe model-free deep reinforcement learning (DRL), which can automatically learn the system dynamics and make reliable and optimal decisions accordingly. Our major contributions are summarized as follows.
• We formulate a cooperative two-timescale optimization problem that is composed of the charging scheduling of EVs at a long timescale (e.g., in minutes) and the offloading of computation tasks at a short timescale (e.g., in seconds). The optimization objective is to minimize the charging load and its variance while satisfying the constraints on the performance requirements of computation tasks and the charging demands of EVs.
• We model the charging scheduling of EVs as a constrained Markov decision process (CMDP), and develop a novel Lyapunov-based safe DRL algorithm to efficiently solve this CMDP. Specifically, we apply the Lyapunov approach to transform the long-term constraints into a sequence of single-step state-wise safety constraints, and devise a new action projection approach to obtain safe actions that satisfy these state-wise safety constraints.
• We theoretically prove the feasibility of the learned charging scheduling strategy, i.e., that it can satisfy the charging demands of EVs. Given the charging scheduling decisions, we formulate the offloading of computation tasks as an integer non-linear programming problem to determine the task assignment and computation resource allocation, with the aim of minimizing the energy consumption of task processing while meeting the strict delay requirements of tasks.
• Extensive simulation results and performance analysis demonstrate that OCEAN can guarantee that the selected charging scheduling and computation offloading decisions meet the charging demands of EVs. Furthermore, OCEAN can achieve performance similar to that of the optimal strategy (which knows all system uncertainties, such as EV charging demands and task arrivals, in advance), and realize up to a 24% improvement in charging load variance over three state-of-the-art algorithms while guaranteeing the charging demands of all EVs.

The rest of this paper is organized as follows. Section II describes the related work. The system model and problem formulation are presented in Section III. In Section IV, we propose the OCEAN algorithm. Simulation results are presented and analyzed in Section V. Finally, Section VI concludes this paper.

II. RELATED WORK
In this section, we discuss the existing studies on charging scheduling for EVs and computation offloading in MEC as well as summarize the differences between this work and previous studies.
There exist many studies focusing on the charging scheduling for EVs. Liu et al. [10] introduced software-defined networking into vehicular edge computing, based on which they proposed a scalable EV charging scheduling approach to jointly optimize the charging station selection and route planning decisions for minimizing the charging time and fares. Yan et al. [11] focused on the dynamic wireless charging problem for EVs. They developed an offline deployment approach for mobile energy disseminators and proposed a DRL method to adjust the deployment decisions in an online manner based on real-time road traffic. Li et al. [12] formulated a constrained EV charging and discharging scheduling problem to minimize the charging cost and proposed a reinforcement learning-based method to learn the scheduling strategy. Cao et al. [16] studied the charging scheduling problem to reduce the electricity bill of EV fleets under unknown future information on EV arrival times, departure times, and charging demands, and designed an actor-critic learning-based charging approach that improves computational efficiency by reducing the state dimension during training.
The computation offloading in MEC has also attracted widespread interest in recent years. Tütüncüoglu et al. [13] proposed a queuing network model of task graph execution and designed an online distributed rate-adaptive task offloading strategy to reach a Nash equilibrium. Yan et al. [14] formulated a mixed non-linear programming problem to jointly optimize the computation offloading and resource allocation decisions in MEC, and developed two imitation learning approaches to learn a near-optimal policy and to quickly adapt to changes in network environments, respectively. Li et al. [15] modelled the quality-of-service-driven task offloading problem as a mixed integer non-linear program and presented a convex optimization and Gibbs sampling-based algorithm that converges to the global optimal solution with high probability. Teng et al. [17] formulated a multi-server multi-task allocation and scheduling problem with the objective of maximizing MEC system profit, and designed both centralized and distributed greedy-based algorithms to solve this problem.
Meanwhile, some efforts have been devoted to joint charging scheduling and computation offloading in EV-MEC. Wei et al. [5] developed a multi-attribute contract-based charging and computation offloading scheme to maximize the utility of charging stations. Huang et al. [8] formulated a bi-level optimization problem to determine the charging/discharging power of EVs under a service delay constraint, and took advantage of duality theory and a linear relaxation method to solve the above problem. Zhang et al. [9] studied the joint optimization of task allocation and charging scheduling of mobile charging vehicles to minimize the total energy consumption. However, these works all focused on a static optimization problem, ignoring the dynamics of charging demands and task arrivals. This paper differs significantly from all the above studies. Both the dynamic charging demands and task arrivals are fully considered in our formulated optimization problem. We formulate a cooperative two-timescale optimization problem, rather than the single-timescale problem considered in all the aforementioned works. In addition, we propose a new safe DRL-based algorithm that can theoretically guarantee that the charging demands of EVs are met.

A. System Overview
We consider a charging station equipped with N charging piles denoted by N = {1, 2, ..., N}, as shown in Fig. 1. The entire time horizon is divided into T time slots T = {0, 1, ..., T − 1} with an equal length of τ. Due to the high time-sensitivity of computation tasks, their offloading is performed at each time slot t ∈ T, e.g., every second. However, as it is impractical to adjust the charging power of EVs at such a short timescale, the charging scheduling decisions are updated every T̄ time slots (1 < T̄ < T), e.g., every 15 minutes, as shown in Fig. 2.
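The two-timescale structure above can be sketched as a simple control loop. In this minimal Python illustration, `decide_charging_power` and `offload_tasks` are placeholders standing in for the OCEAN sub-algorithms developed in Section IV:

```python
# Sketch of the two-timescale control loop described above. The decision
# functions are placeholders for the OCEAN sub-algorithms of Section IV.
def run_horizon(T, T_bar, decide_charging_power, offload_tasks):
    """Update charging power every T_bar slots; offload tasks every slot."""
    charging_power = None
    history = []
    for t in range(T):
        if t % T_bar == 0:                    # long timescale (e.g., every 15 min)
            charging_power = decide_charging_power(t)
        decisions = offload_tasks(t, charging_power)   # short timescale (e.g., every 1 s)
        history.append((t, charging_power, decisions))
    return history

# Toy run: constant charging power, no tasks arriving.
log = run_horizon(T=6, T_bar=3,
                  decide_charging_power=lambda t: 7.0,
                  offload_tasks=lambda t, p: [])
```

The key point is that the charging decision is held fixed between long-timescale updates while offloading decisions are refreshed every slot.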

B. Charging Scheduling of EVs
For the EV using the ith charging pile (i.e., EV i), let t_i^a ∈ T denote its arrival time, and SoC_i^a denote its initial state of charge (SoC). Due to the high mobility of EVs, the arrival rates of incoming EVs at different time slots of the day are different, which means the EV arrival at charging stations has a time-varying distribution. In this paper, we consider that when an EV arrives at the charging station, it reports its departure time and target SoC to the charging station controller [18], [19], which are denoted by t_i^d ∈ T and SoC_i^d, respectively. In addition, we denote the SoC of EV i at the beginning of time slot t as SoC_i^t, which should satisfy 0 ≤ SoC_i^t ≤ 1, where SoC_i^t = 0 indicates that the battery of EV i is empty, while SoC_i^t = 1 represents a full battery. Considering the dynamic arrival times of EVs and their diverse charging demands, the total energy demand of all EVs charging at the charging station also varies over time. At each time slot t, the charging station controller needs to determine the charging power for each EV i, i.e., p_i^t, which is limited by 0 ≤ p_i^t ≤ p_i^max, where p_i^max is the maximum charging power of EV i. Note that, as mentioned before, the charging power of each EV is updated every T̄ time slots.
For each EV i, in order to meet its charging demand before leaving, we have the constraint SoC_i^{t_i^d} ≥ SoC_i^d (3). In this work, we investigate the dynamics of charging demands by taking into account the uncertain EV arrival times along with the varying initial and target SoC levels, and assume that EVs leave as scheduled (as in many prior works, e.g., [18], [21], [22], [23]). It is worth noting that our proposed algorithm is also adaptable to scenarios where EV users change their plans and leave earlier or later. EVs can notify the charging station of a new departure time, and our learned scheduling policy can adjust the charging power accordingly to meet their charging demands. However, if EV users drive away before the predefined time, they need to accept that their desired charging demands may not be satisfied, as in [24], [25].
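As a sanity check on the SoC dynamics and the demand constraint, the following minimal sketch simulates the SoC recursion and tests whether a power schedule reaches the target SoC before departure. Symbol names follow the text (eta: charging efficiency, b_max: battery capacity, tau: slot length); the task-computing energy of Section III-C is omitted here for brevity:

```python
# Minimal sketch of the SoC recursion and the departure-time demand check
# implied by the SoC bound, power bound, and constraint (3). Charging
# energy per slot is eta * p * tau; task-computing energy is ignored.
def soc_trajectory(soc_a, powers, eta, b_max, tau):
    soc = soc_a
    traj = [soc]
    for p in powers:
        soc = min(soc + eta * p * tau / b_max, 1.0)   # keep 0 <= SoC <= 1
        traj.append(soc)
    return traj

def demand_met(soc_a, powers, eta, b_max, tau, soc_d):
    """True iff the final SoC reaches the target SoC_d before departure."""
    return soc_trajectory(soc_a, powers, eta, b_max, tau)[-1] >= soc_d
```

For example, four slots at 7 kW with eta = 0.9 on a 50 kWh battery raise the SoC by about 0.5, so a target of 0.6 from an initial 0.2 is met while a target of 0.8 is not.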

C. Offloading of Computation Tasks
In this paper, we consider the scenario of a roadside charging station, where EVs charging at this station collaboratively share their computation resources to support traffic management for smart cities [26], [27], [28]. The tasks include traffic prediction, video analysis for congestion monitoring, traffic control, etc. The need for offloading computing tasks arises from the rapidly growing computational demands of modern urban traffic systems. A centralized system handling all the above tasks would necessitate vast computation resources. By leveraging the computation capabilities of parked EVs at roadside charging stations, we can distribute the computational load to enhance efficiency and reduce response times, thereby optimizing overall traffic management performance.
At each time slot t, we assume there are in total M^t computation tasks to be computed, each of which is represented by a tuple (b_j^t, χ_j^t, d_{j,max}^t), where b_j^t is the data size of task j ∈ {1, 2, ..., M^t}, χ_j^t is the required number of CPU cycles, which can be obtained by the call-graph analysis method [29], and d_{j,max}^t is the maximal tolerable delay to accomplish this task. In this paper, we consider delay-sensitive computation tasks [30], [31], [32], assuming that each task is required to be completed within one time slot, i.e., d_{j,max}^t ≤ τ. Note that M^t varies across different time slots due to the highly dynamic task arrivals. At the beginning of each time slot, the charging station allocates these computation tasks to different EVs for computing, and the computation results are returned after the task execution is completed. Since the output of a computation task is often much smaller than its input size, the transmission delay and energy consumption of result returning are neglected, which is a common setting in the related literature [33]. Let a binary variable α_{i,j}^t denote the assignment decision of task j at time slot t. If task j is assigned to EV i, α_{i,j}^t = 1; otherwise, α_{i,j}^t = 0.
Since each task can only be assigned to one EV for processing, it yields Σ_{i∈N} α_{i,j}^t = 1. To encourage and compensate EVs for sharing their idle computation resources, we adopt a discounted charging price as the monetary reward, which is modelled as a decreasing function of the total CPU cycles required for all tasks processed by each EV during its charging period, where κ represents the standard price and λ is a positive coefficient. Here, x_i denotes the total CPU cycles required for all tasks processed by EV i throughout its charging period, calculated by x_i = Σ_{t∈T} Σ_j α_{i,j}^t χ_j^t. This pricing function ensures that EVs contributing more computation resources benefit from a lower charging price, thereby guaranteeing fairness across all participating EVs while providing a monetary incentive for joining in resource sharing. Note that the function above is just one example of a discounted charging price. Other decreasing functions with respect to x_i can also be employed to define the discounted price.
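The text notes that any decreasing function of x_i can serve as the discounted price. One plausible example of such a function (a hypothetical illustration, not necessarily the paper's exact formula) is a linear discount in x_i floored at zero, with `kappa` the standard price and `lam` the positive coefficient:

```python
# Hypothetical example of a decreasing discounted-price function of the
# kind described in the text: a linear discount in x_i, floored at zero
# so the price never goes negative. Not the paper's exact formula.
def discounted_price(kappa, lam, x_i):
    return max(kappa - lam * x_i, 0.0)
```

Any strictly decreasing, bounded-below alternative (e.g., an exponential decay) would preserve the same incentive: more contributed CPU cycles, lower charging price.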
Then, the transmission delay d_j^{t,u} of task j is d_j^{t,u} = b_j^t / r_i^t, where r_i^t is the wireless transmission rate of EV i. Since EVs are stationary while charging, we consider a quasi-static wireless channel model, where the transmission rate r_i^t remains unchanged during one time slot but varies across different time slots.
Let f_{i,j}^t denote the amount of computation resource (i.e., CPU frequency) allocated to compute task j. Note that if α_{i,j}^t = 0, then f_{i,j}^t also equals zero. Due to the limited computation capacity of EV i, we have Σ_j f_{i,j}^t ≤ f_i^max, where f_i^max is the maximum CPU frequency of EV i. Then, the task computing delay is d_j^{t,c} = χ_j^t / f_{i,j}^t. Thus, the total service delay of task j is d_j^t = d_j^{t,u} + d_j^{t,c}. To meet the delay requirement of each task, the constraint d_j^t ≤ d_{j,max}^t must be satisfied. In addition, for each EV i, its energy consumption for task computing can be calculated as E_i^t = Σ_j ϑ_i χ_j^t (f_{i,j}^t)^2, where ϑ_i is the effective switched capacitance that depends on the chip architecture of each EV. Based on the above definitions, the dynamics of the EV battery energy from time slot t to t + 1 can be expressed as b_i^{t+1} = min{b_i^t + η_i p_i^t τ − E_i^t, b_i^max}, where b_i^max is the battery capacity of EV i, and η_i is the charging efficiency coefficient. To guarantee that each EV has enough energy to compute the assigned tasks, we have the constraint E_i^t ≤ b_i^t + η_i p_i^t τ.
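The per-task service model above fits in a few lines. This sketch uses the b/r transmission delay, χ/f computing delay, and the standard ϑχf² computing-energy expression from this subsection:

```python
# Sketch of the per-task service model in this subsection: transmission
# delay b/r, computing delay chi/f, and the standard CMOS computing-energy
# model vartheta * chi * f^2 (energy per cycle is vartheta * f^2).
def service_delay(b, r, chi, f):
    return b / r + chi / f          # d^{t,u} + d^{t,c}

def computing_energy(vartheta, chi, f):
    return vartheta * chi * f ** 2

def meets_deadline(b, r, chi, f, d_max):
    return service_delay(b, r, chi, f) <= d_max
```

For instance, a 1 MB task over a 10 Mb-unit link (0.1 s upload) needing 10⁹ cycles on a 2 GHz core (0.5 s compute) has a 0.6 s service delay, illustrating how the two terms trade off when allocating f.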

D. Problem Formulation
With the large-scale integration of EVs into future power grids, we here seek to minimize the long-term load variance of the charging station, so as to avoid sudden load peaks and thus improve the stability of power systems. To this end, for a charging station, we define its load variance function as the sum of the squares of its energy consumption over time [34], i.e., V = Σ_{t∈T} (Σ_{i∈N} p_i^t τ)^2. By squaring the values, we emphasize the significance of large deviations from the average load, while downplaying smaller fluctuations. This approach allows us to effectively evaluate the impact of extreme load peaks and their contribution to the overall load variance. Based on the above definitions and models, we jointly optimize the charging power of EVs (i.e., p_i^t), task assignment (i.e., α_{i,j}^t), and computation resource allocation (i.e., f_{i,j}^t) to minimize the load variance of the charging station while meeting the delay requirements of computation tasks and the charging demands of EVs. Thus, the problem of joint charging scheduling and computation offloading (CSCO) can be formulated as CSCO: min Σ_{t∈T} (Σ_{i∈N} p_i^t τ)^2, s.t. (2), (3), (4), (7), (9), (12).
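The effect of the squared objective can be illustrated directly: summing the squared per-slot charging energy penalizes a peaky schedule far more than a flat one that delivers the same total energy. A minimal sketch:

```python
# Sketch of the load-variance objective: sum over slots of the squared
# total charging energy drawn in that slot, which penalizes load peaks
# much more than small fluctuations around the average.
def load_variance(power_schedule, tau=1.0):
    # power_schedule[t][i] = charging power of EV i in slot t
    return sum((tau * sum(slot)) ** 2 for slot in power_schedule)

flat = load_variance([[1.0, 1.0], [1.0, 1.0]])    # same total energy...
peaky = load_variance([[2.0, 2.0], [0.0, 0.0]])   # ...but concentrated in one slot
```

Both schedules deliver 4 units of energy, yet the peaky one doubles the objective, which is exactly the peak-shaving pressure the formulation exploits.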
From (13), it can be observed that the task assignment decision α_{i,j}^t is a binary variable, whereas the allocated computation resource f_{i,j}^t and charging power p_i^t are continuous variables. Additionally, the constraints defined in (3), (9), and (12) are non-linear. Therefore, our formulated problem CSCO can be classified as a mixed-integer non-linear programming problem, which is typically NP-hard. This implies that it is challenging to find an optimal solution to CSCO in polynomial time. In addition, considering the long-term optimization goal in CSCO, deriving the optimal solutions requires prior knowledge of the system dynamics, such as the charging demands of EVs and the computation task arrivals. However, it is extremely difficult to acquire such knowledge accurately, resulting in significant challenges in solving CSCO. Meanwhile, the charging demand constraint (3) involves charging scheduling and offloading decisions across multiple time slots, which makes the optimization of these decisions tightly coupled. This also brings many difficulties in solving CSCO.

IV. ALGORITHM DESIGN
In this section, we develop a novel safe DRL-based algorithm, called OCEAN, to effectively solve CSCO. We decompose CSCO into two subproblems: long-timescale charging scheduling and short-timescale computation offloading. For the former, we develop a Lyapunov-based safe DRL algorithm to learn the optimal charging scheduling strategy. For the latter, we reformulate it as an integer non-linear programming problem to derive the computation offloading decisions.

A. Long Timescale Charging Scheduling Subproblem
Considering that the charging demand constraint (3) involves decisions across multiple time slots, we first model the long-timescale charging scheduling subproblem as a CMDP, which is an extension of the standard MDP augmented with long-term constraints on the expected cumulative cost. A CMDP can be described by a five-tuple (S, A, P, R, C): S denotes the set of states; A is the set of feasible actions; P represents the state transition probability, i.e., the probability of the next state s′ when selecting action a at state s; R is the reward function defining the immediate reward received after performing action a at state s; and C represents the constraint cost function. Let π denote the policy, which is a decision rule mapping from a state to an action. The goal in a CMDP is to find an optimal policy π that not only maximizes the expected cumulative reward but also satisfies the additional long-term constraints. In this paper, the main components of our formulated CMDP are defined as follows.
• State: At each decision time of charging scheduling t = κT̄ (κ = 0, 1, 2, ...), the system state s_t ∈ S consists of the target SoC of each EV SoC_i^d, the remaining charging duration before leaving t_i^d − t, and the current SoC SoC_i^t. In addition, since the charging scheduling and computation offloading are tightly coupled, information about the computation tasks during the last T̄ time slots [(κ − 1)T̄, κT̄ − 1] is also included. However, it would lead to an extremely large state space if the input data sizes, required CPU cycles, and tolerable delays of all computation tasks during the last T̄ time slots were included one by one. Thus, we use (W^t, b̄^t, χ̄^t, d̄^t) to represent these computation tasks, where W^t is the total number of computation tasks generated during the last T̄ time slots, b̄^t is the average input data size, χ̄^t is the average number of required CPU cycles, and d̄^t is the average maximal tolerable delay. Thus, the system state at each decision time t is s_t = ({SoC_i^d}, {t_i^d − t}, {SoC_i^t}, W^t, b̄^t, χ̄^t, d̄^t).
• Action: The action a_t ∈ A is composed of the charging powers of all EVs, i.e., a_t = {a_{1,t}, a_{2,t}, ..., a_{N,t}} where a_{i,t} = p_i^t. Note that the action a_t should satisfy constraint (2), and a_{i,t} = 0 if the ith charging pile is not in use.
• State Transition: P : S × A × S → [0, 1] denotes the state transition function, which describes the distribution of the next state s_{t+1} given the current state s_t and the selected action a_t.
• Reward: In standard RL, the goal of the RL agent is to maximize the expected cumulative reward. However, in our formulated optimization problem (13), the charging station aims to minimize its load variance. Therefore, in order to model our optimization problem as an RL problem and minimize the load variance, the reward function r_t is defined as the negative of the load variance during the period [κT̄, κT̄ + T̄ − 1], i.e., r_t = −Σ_{t'=κT̄}^{κT̄+T̄−1} (Σ_{i∈N} p_i^{t'} τ)^2.
• Constraint Cost: The immediate constraint cost function c_i(s_t) is defined for each EV i in (16). Then, to satisfy the charging demand of each EV in (3), the long-term cumulative constraint cost should satisfy the safety constraint (17).
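The compact task summary (W^t, b̄^t, χ̄^t, d̄^t) used in the State component above can be computed as follows; a minimal sketch with illustrative names:

```python
# Sketch of the compact task summary (W^t, b_bar, chi_bar, d_bar) used in
# the state: tasks from the last T_bar slots are reduced to a count and
# per-task averages to keep the state space small.
def summarize_tasks(tasks):
    """tasks: list of (b, chi, d_max) tuples from the last T_bar slots."""
    W = len(tasks)
    if W == 0:
        return (0, 0.0, 0.0, 0.0)
    b_bar = sum(t[0] for t in tasks) / W
    chi_bar = sum(t[1] for t in tasks) / W
    d_bar = sum(t[2] for t in tasks) / W
    return (W, b_bar, chi_bar, d_bar)
```

Collapsing per-task attributes into four statistics keeps the state dimension fixed even as the number of tasks per window varies, at the cost of losing per-task detail.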
Compared to a standard MDP, the long-term constraint (17) in the above CMDP, also called the safety constraint, makes the actions of multiple time slots highly coupled, which renders this CMDP difficult to solve directly. To make it more tractable, we first transform the long-term safety constraint (17) into a sequence of single-step state-wise constraints by taking advantage of the Lyapunov approach [35].
1) Lyapunov-Based Safety Constraint Transformation: We denote Δ = {π(·|s) : Σ_{a∈A} π(a|s) = 1, ∀s ∈ S} as the set of Markov stationary policies. Let T_{π,h}[V](s) denote the generic Bellman operator in (18) w.r.t. a policy π ∈ Δ and a generic cost function h. Then, we define a set of Lyapunov functions as follows. Definition 1: Given the immediate constraint cost function c_i(s) in (16) and letting q̄_i = 1 − SoC_i^d denote the safety threshold, a set of Lyapunov functions L_{i,π_B}(q̄_i) is defined in (19), where π_B is a feasible (i.e., safe) policy of the above CMDP that satisfies (17).
For any Lyapunov function L_i, denote F_{L_i}(s) as the set of L_i-induced Markov stationary policies. Due to the contraction mapping property of the Bellman operator T_{π,c_i}, any L_i-induced policy π satisfies L_i(s) ≤ q̄_i. Thus, F_{L_i}(s) is a set of feasible policies of the above CMDP. In this context, we need to find a Lyapunov function L_i ∈ L_{i,π_B}(q̄_i) whose induced policies include the optimal policy π*. To this end, the following Lemma 1 demonstrates that, given an optimal policy π*, with proper cost shaping, D_{i,π*}(s) can be transformed into a Lyapunov function induced by any feasible policy π_B, i.e., L_i ∈ L_{i,π_B}(q̄_i). Lemma 1: For any feasible policy π_B, there exists an auxiliary function ε_i(s) : S → R such that the resulting Lyapunov function L_i(s) belongs to L_{i,π_B}(q̄_i). The proof of Lemma 1 can be found in [35]. In addition, this auxiliary constraint cost ε̂_i(s) is proved to be uniformly bounded, where the bound depends on D_{i,max}, the maximum immediate constraint cost. It is further proved in [35] that, with ε̂_i(s), the set of feasible policies induced by the resulting Lyapunov function contains an optimal policy. However, it is extremely difficult to obtain ε̂_i(s) since it requires calculating the distance between π_B and the optimal policy π*, which is unknown.
To address this issue, we approximate ε̂_i(s) with a constant auxiliary constraint cost ε̃_i. In order for F_{L_{ε̃_i}} to include as many policies as possible, so that it is more likely to contain the optimal policy π*, ε̃_i is chosen as the largest auxiliary constraint cost that satisfies the Lyapunov condition for all s ∈ S and the safety constraint L_{ε̃_i}(s) ≤ q̄_i; ε̃_i is thus calculated by (21). Then, we can construct the Lyapunov function L_{ε̃_i}(s) as in (22). By using the Lyapunov function L_{ε̃_i}(s), the original long-term cumulative safety constraint (17) can be transformed into the state-wise Lyapunov safety constraint (23), so that any policy satisfying (23) is feasible. This is because, based on (23) and the definition of the Lyapunov function, together with (20) and (21), the long-term constraint (17) holds. Consequently, given a feasible policy π_B, the charging scheduling subproblem can be reformulated as SP1-1.
2) Proposed Safe DRL Algorithm: In the following, we devise a novel safe DRL algorithm to solve SP1-1. We first define Q_r(s, a) as the state-action reward function representing the cumulative reward after executing action a at state s. In addition, let Q_c^i(s, a) denote the state-action constraint cost function. Then, according to (22), the state-action Lyapunov function can be constructed, and SP1-1 can be rewritten as SP1-2. In the context of EV charging scheduling, the charging power of EVs is a continuous variable. Thus, we employ a Gaussian distribution to model the continuous charging power of each EV. This modelling approach is widely used in policy gradient and actor-critic methods with continuous action spaces [36], [37], [38]. The Gaussian distribution offers several advantages. First, Gaussian distributions are differentiable, which is important when applying optimization techniques such as gradient descent to update the policy parameters based on the policy gradient. Second, the standard deviation of a Gaussian distribution can effectively characterize the RL agent's uncertainty about the chosen actions, which allows for a principled
way to balance exploration and exploitation. In this case, the charging power of each EV i is sampled from a Gaussian distribution, i.e., π(a_i|s) ∼ N(μ_i, σ²), ∀i ∈ N. In this paper, σ is set to be fixed or state-independent [39], and is used to control the action exploration. Considering that the stochastic Gaussian policy π(a_i|s) is parameterized by the deterministic parameters μ and σ, following the machinery of [40], SP1-2 is equivalent to SP1-3, where Q_c^i(s, µ) = c_i(s) + Σ_{a∈A} π(a|s; µ, σ) Σ_{s′} P(s′|s, a) D_{i,π_B}(s′). For SP1-3, we can find that it requires the accurate Q_r(s, a), Q_c^i(s, a), and µ_B to derive the optimal solution. Since the highly complex non-linear relationships make it extremely hard to derive explicit mathematical expressions, and the state and action spaces are continuous, we adopt deep neural networks (DNNs) to approximate their actual values. Specifically, let Q_r(s, a; θ_r) denote the critic network and Q_c(s, a; θ_c) denote the constraint cost network, where θ_r and θ_c are the network parameters. In addition, let ω(s; θ_ω) denote the policy network, which outputs the mean of the Gaussian policy for each action dimension, i.e., {μ_1, μ_2, ..., μ_N}.
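The Gaussian charging-power policy described above can be sketched as follows: a per-EV mean from the policy network, a fixed σ for exploration, and clipping of sampled actions to the feasible power range of constraint (2). All names are illustrative:

```python
import random

# Sketch of the Gaussian charging-power policy: mu holds the per-EV means
# output by the policy network, sigma is fixed to control exploration, and
# sampled powers are clipped to [0, p_max_i] per constraint (2).
def sample_charging_powers(mu, sigma, p_max, rng):
    actions = []
    for m, pm in zip(mu, p_max):
        a = rng.gauss(m, sigma)
        actions.append(min(max(a, 0.0), pm))   # clip to the feasible range
    return actions

rng = random.Random(0)
powers = sample_charging_powers([3.0, 5.0], 0.5, [7.0, 7.0], rng)
```

Clipping here only enforces the box constraint (2); the Lyapunov safety projection that follows handles the cumulative charging-demand constraint.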
However, it is very challenging to directly derive the optimal solution of SP1-3, since the continuous action space makes it impossible to traverse all feasible actions to find the optimum. To tackle this issue, we apply the idea of a safety layer [41] to solve SP1-3. The unconstrained actions are first computed to maximize the reward through standard policy gradient algorithms, such as deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO), without taking the Lyapunov safety constraint into account. Then, these unconstrained actions are passed through a safety layer (implemented by a DNN) to project them onto the feasibility set induced by the Lyapunov constraints. Specifically, let µ_unc denote the unconstrained action obtained by solving SP1-3 without the Lyapunov constraints. We then project this unconstrained action µ_unc into a constrained (i.e., feasible) one. In order to minimize the impact of the action perturbation on reward degradation, we seek to perturb the unconstrained actions as little as possible. Toward this aim, the feasible action is obtained by solving the action projection problem (26) at each state s ∈ S. To solve (26), Chow et al. [40] proposed to approximate the Lyapunov constraint (27) with its first-order Taylor series at the action µ_B. However, this approximation is inaccurate in practice, as the action may not have a linear relationship with the cost function; hence, it cannot guarantee that the derived solutions satisfy the Lyapunov constraints, as shown in Section V-B.
To overcome this drawback, we develop a novel approach based on the state-cost action function to solve the action projection problem (26). In detail, let A(s, q) denote the state-cost action function representing the action to be executed at state s whose cumulative constraint cost is q. Based on A(s, q), we can derive the feasible actions given a state s and a target cumulative constraint cost q. Thus, to solve (26), we first discretize ε̃_i with an equal interval ε̃_i/K and obtain a set of discrete values, i.e., {0, ε̃_i/K, 2ε̃_i/K, ..., ε̃_i}. Then, a set G of feasible state-action constraint costs can be constructed. Given G, we can derive a set U of safe actions using the state-cost action function A(s, q). Based on the set of safe actions U, the action projection problem (26) can be simplified as SP1-4. SP1-4 is a convex problem that can be easily solved by standard convex optimization techniques such as the CVX solver [42]. Let µ* denote its optimal solution. As with the critic network and constraint cost network above, since it is extremely hard to explicitly model the highly complex non-linear relationship of A(s, q), we also utilize a DNN to approximate its value, denoted by Ã(s, q; θ_A), where θ_A is the network parameter. However, there inevitably exists an approximation error between Ã(s, q; θ_A) and its exact value, so µ* might not satisfy the Lyapunov constraint in some cases. To address this issue, we next give a safe policy π′. Compared to the policy obtained by solving (30), although π′ achieves smaller improvements over π_B, it always guarantees safety. To do this, we first give the following Theorem 1.
Theorem 1 (Consistent Feasibility): Given a safe baseline policy π_B, suppose the successive policy π' satisfies Σ_{a∈A} |π'(a|s) − π_B(a|s)| ≤ ξ(s), ∀s ∈ S; then the policy update is consistently feasible, i.e., if π_B is feasible, so is π'. The proof of Theorem 1 can be found in Appendix A, available online.
According to Theorem 1, when the distance between policies π' and π_B is within the bound ξ(s), the new policy π' is also feasible. Recall that we adopt a Gaussian distribution to represent the probability distribution of the agent's actions. Here, we let N(μ', σ²) denote the parameterized policy π'(a|s), and N(μ_B, σ²) denote the parameterized policy π_B(a|s). In this case, we can further derive the relationship between μ' and μ_B that guarantees the distance between policies π' and π_B stays within the bound ξ(s), which is presented in Theorem 2. The proof of Theorem 2 can be found in Appendix B, available online.
Based on Lemma 1 and Theorem 2, we can transform (26) into problem (31). Problem (31) is convex; thus its optimal solution μ*_i can be derived using the CVX solver [42]. The above analysis proves that the solution μ*_i is guaranteed to satisfy the Lyapunov safety constraint.
Compared to other constrained policy optimization methods, one key benefit of our proposed Lyapunov-based safe reinforcement learning approach is its ability to provide theoretical safety guarantees during both the learning and execution phases, meaning that the long-term constraints in the formulated CMDP can be consistently satisfied in theory. This is accomplished by ensuring the system stays within a safety region defined by the Lyapunov function. In contrast, maintaining safety throughout the learning phase can be challenging for some other constrained policy optimization methods. For instance, the work in [43] employs a risk function to indicate whether the selected actions violate the constraints and aims to learn a policy that minimizes the violation risk. Under such a strategy, during the learning phase, the RL agent performs a variety of actions to collect information about the environment, and some of these actions could be risky and potentially violate the constraints. Furthermore, although this approach strives to minimize the probability of constraint violations, it cannot fully guarantee that the learned policy always meets the safety constraints. In contrast, our proposed Lyapunov-based safe reinforcement learning approach offers a theoretical guarantee of safety for the learned policy, as given by Theorem 1 in Appendix A, available online, presenting a compelling advantage over other methods.

B. Short Timescale Computation Offloading Subproblem
Recall that the objective in (13) is to minimize the energy consumption of EV charging while flattening the load profile. Toward this aim, at each time slot t, based on the EV charging decisions p_i^t obtained by solving the above charging scheduling subproblem, we optimize the task assignment α_{i,j}^t and computation resource allocation f_{i,j}^t to minimize the total energy consumed by task processing across all EVs under the task delay constraints. Thus, we formulate the short timescale computation offloading subproblem SP2-1 subject to constraints (7), (9), (12), and (32). Constraint (32) guarantees that the charging demands of all EVs are satisfied. SP2-1 includes a crucial constraint to ensure the completion of all computation tasks before their deadlines, as detailed in (9). As a result, we can guarantee that all tasks assigned to EVs will be computed and finished on time. We give an example to better illustrate these constraints. Consider a scenario with three EVs sharing their computation resources, namely EV-1, EV-2, and EV-3, equipped with computation capacities of 1.8 GHz, 2 GHz, and 2.3 GHz, respectively [44]. There are 30 computation tasks to be processed; for simplicity, we categorize them into three types, each with characteristics (b_t, χ_t, d_t^max): (300 KB, 60 Megacycles, 0.5 s), (500 KB, 70 Megacycles, 0.6 s), and (700 KB, 80 Megacycles, 0.7 s), respectively [44]. The number of computation tasks of each type is set to 10, and the transmission rate to each EV is set to 3 Mbps [44]. Based on these settings, we can calculate that the minimum computation resources required by the three types of tasks to meet their deadlines are 0.15 GHz, 0.16 GHz, and 0.17 GHz, respectively. Now consider a task allocation scheme where all tasks are assigned to EV-1. It is impossible for EV-1 to complete all tasks before their maximum tolerable delays, since EV-1 does not have enough computing capacity. To address this issue, we introduce a crucial constraint (7) into our optimization problem, ensuring that the tasks allocated to each EV do not exceed its computation capacity. Under this constraint, one feasible solution is to allocate all type-1 tasks to EV-1, type-2 tasks to EV-2, and type-3 tasks to EV-3, so that each EV possesses sufficient computation resources to complete its received tasks within the specified delay requirements. Therefore, solving our optimization problem provides an effective task offloading strategy that considers both the computation capacities of the EVs in (7) and the task performance requirements in (9). As a result, all tasks assigned to the EVs will be computed and finished before their respective deadlines.
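The stated per-type minima can be reproduced with a short calculation: the minimum CPU frequency is the required cycles divided by the time left after transmission, f_min = χ / (d_max − b/R). Note that the stated values (0.15/0.16/0.17 GHz) are recovered when the task data sizes are read in kilobits; this unit reading is an assumption made here to match the arithmetic:

```python
# Worked check of the minimum CPU frequency needed by each task type to
# meet its deadline: f_min = cycles / (deadline - transmission_time).
# Assumption: task sizes are interpreted in kilobits so that the stated
# minima (0.15, 0.16, 0.17 GHz) are reproduced.

RATE_KBPS = 3000.0  # 3 Mbps offloading rate per EV

def min_cpu_ghz(size_kb, megacycles, deadline_s):
    tx_time = size_kb / RATE_KBPS                    # upload delay (s)
    compute_budget = deadline_s - tx_time            # time left to compute
    return megacycles * 1e6 / compute_budget / 1e9   # required GHz

tasks = [(300, 60, 0.5), (500, 70, 0.6), (700, 80, 0.7)]
minima = [round(min_cpu_ghz(*t), 2) for t in tasks]
# -> [0.15, 0.16, 0.17], matching the example in the text
```

With 30 tasks requiring at least 0.15 GHz each, a single 1.8 GHz EV clearly cannot host them all concurrently, which is exactly why constraint (7) is needed.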
Next, according to Lemma 1 in [45], we transform SP2-1 into problem SP2-2 subject to constraints (12) and (32), with the optimal computation resource allocation f*_{i,j}^t fixed accordingly. SP2-2 is an integer nonlinear programming problem, which is solved using CPLEX [46]. In the context of the joint charging scheduling and computation offloading problem addressed in this paper, the dimensionality of the computation offloading subproblem (i.e., SP2-2) primarily depends on the number of EVs involved, which in turn is determined by the number of charging piles within the charging station. In cases where the charging station has a great many charging piles, partitioning methods can be incorporated into our proposed algorithm. By partitioning the large charging station into smaller subregions, we can schedule EV charging and allocate tasks for each subregion independently. This partitioning strategy effectively reduces the complexity of the problem in large-scale charging station scenarios, allowing for efficient problem-solving with CPLEX. However, due to considerations of power grid stability and economic factors, charging stations typically have only a few dozen charging piles [47], which helps ensure power grid stability and prevents excessive strain on the infrastructure. Given this setup, where charging stations have a relatively modest number of charging piles, it is unlikely that extremely high-dimensional spaces will be encountered when solving the task offloading subproblem. Therefore, CPLEX remains a suitable solver for efficiently addressing the optimization problem SP2-2.

Fig. 3 shows the implementation architecture of the proposed task offloading algorithm in a charging station. The system can be divided into two main sections: the Charging Station Controller and the EVs. The Charging Station Controller plays a central role in task offloading and includes the following components:

• Task Profiling Module: This module is in charge of collecting and analyzing the incoming computation tasks to extract their attributes, such as data size, computational requirements, and deadlines. Acquiring these attributes is necessary for task offloading.
• Resource Monitoring Module: It continuously monitors the CPU, memory, battery level, and wireless channel status of each EV, and provides the real-time EV resource status to the Task Scheduler for making the task offloading decisions.

• Task Scheduler: As the core of the system, the Task Scheduler employs our learned task offloading policy to intelligently allocate tasks to the proper EVs. By carefully evaluating both the task profiles and EV resource status, it aims to optimize the task offloading while guaranteeing that all tasks are completed by their deadlines.

• Task Execution Unit: This unit takes charge of carrying out the computation tasks received from the Task Scheduler. It ensures that tasks run securely and efficiently, constrained by the computation capability of the EV.

The task offloading workflow begins with the arrival of computation tasks at the Charging Station Controller. The Task Profiling Module creates detailed profiles for each task, while the Resource Monitoring Module acquires information about the resource status of the EVs. Utilizing this information, the Task Scheduler intelligently allocates tasks to suitable EVs using our learned offloading policy. The Controller then offloads these tasks to the EVs through the Data Transmission Unit. On the EV side, the Data Transmission Unit receives the allocated tasks, and the Task Execution Unit securely executes the computation tasks within their deadlines. Once the tasks are completed, the EVs transmit the computing results back to the Controller, completing the task offloading process.

C. Joint Charging Scheduling and Computation Offloading
Based on the above analysis, we now formally specify the procedure of OCEAN, as shown in Algorithm 1 (Joint Charging Scheduling and Computation Offloading Algorithm). OCEAN can be divided into three major parts: trajectory generation, network training, and action projection.
In the network training stage, as shown in Fig. 4, using the collected Z trajectories, we update the critic network Q_r, constraint cost network Q_c, state-cost action network Ã(s, q), and policy network ω(s; θ_ω,k). Specifically, the parameter θ_r,k of the critic network is updated to minimize a squared-error loss against the target value y_r = Σ_{ι=t}^{T−1} r_ι. Similarly, the parameter θ_c of the constraint cost network is updated by minimizing the corresponding loss against its target value. Then, the parameter θ_A of the state-cost action network is updated to minimize its own loss function. In addition, for the policy network ω(s; θ_ω,k), its parameter θ_ω,k is updated by following the objective gradient, so as to maximize the action-value function. After the network training, we perform action projection to map the actions generated by the policy network into feasible ones that adhere to the given safety constraints. This projection procedure is implemented by the safety layer network φ(s, μ), which takes as input the system states and the unconstrained actions generated by the policy network, and transforms the input actions into feasible ones induced by the Lyapunov constraint in (23), ultimately guaranteeing that the charging demand constraints of EVs are always met. Toward this aim, at any given state s in the trajectories {ξ_z,k}_{z=1}^{Z}, we first compute the auxiliary constraint cost. Based on the derived cost, we obtain a candidate action μ_i^c at state s by solving (30). In order to check the safety of this candidate action, we calculate
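The Monte-Carlo targets used in these updates can be sketched in a few lines. Plain lists stand in for the DNNs here, and the function names and numbers are illustrative:

```python
# Sketch of the Monte-Carlo targets for training the critic and
# constraint cost networks: for a trajectory of per-step rewards r_t
# (or constraint costs c_t), the target at step t is the tail sum from
# t to the end of the trajectory. The network parameters are then
# updated to minimize the squared error against these targets.

def tail_sums(values):
    """y_t = sum_{i=t}^{T-1} values[i] for every step t."""
    targets, acc = [], 0.0
    for v in reversed(values):
        acc += v
        targets.append(acc)
    return targets[::-1]

def mse_loss(predictions, targets):
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)

rewards = [1.0, 2.0, 3.0]
y_r = tail_sums(rewards)               # -> [6.0, 5.0, 3.0]
loss = mse_loss([5.0, 5.0, 3.0], y_r)  # squared error only at step 0
```

The same tail-sum construction applies to both the reward targets y_r and the constraint cost targets, differing only in which per-step signal is summed.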

V. PERFORMANCE EVALUATION
In this section, we first describe the parameter settings in our experiments, and then present simulation results to demonstrate the effectiveness of OCEAN.

A. Experimental Settings
We consider a charging station consisting of N = 25 charging piles. Four different types of EVs are used in the experiments: Nissan Leaf, BMW i3, Tesla Model 3, and Hyundai Ioniq Electric. The electrical specifications of these EVs are listed in Table I. For each EV arriving to charge, its initial SoC is sampled from the truncated normal distribution N(0.5, 0.1²) within the interval [0.2, 0.8], and the target SoC is sampled from N(0.9, 0.1²) within the bound [0.85, 0.95] [36]. The charging efficiency coefficient is set to 0.95. The charging time of each EV follows a Weibull distribution with scale parameter 2.57 and shape parameter 1.27 [21]. The standard charging price κ is 0.10015 $/kWh [21], and the coefficient λ is 5 × 10⁻¹⁴. The charging scheduling decisions are updated every 15 minutes. For the computation offloading, the number of computation tasks is randomly distributed in [70, 110]. The task size, required CPU cycles, and task delay follow the uniform distributions U[100, 900] KB, U[50, 90] Megacycles, and U[0.5, 1] s, respectively [44]. The length of each time slot is τ = 1 s. The wireless transmission rate is set to 3 Mbps [44], and the CPU energy consumption coefficient is ϑ_i = 5 × 10⁻²⁶. In addition, the computation capacities of the four EV types are set to 1.8 GHz (Nissan Leaf), 2 GHz (BMW i3), 2.3 GHz (Tesla Model 3), and 1.8 GHz (Hyundai Ioniq Electric) [48]. The constraint cost network consists of four hidden layers with 128, 256, 128, and 64 neurons, respectively. The safety layer network has three hidden layers with 256, 128, and 64 neurons, respectively. Both the critic and actor networks have two hidden layers, with 128 and 64 neurons. The learning rate of the critic network is set to 0.001, and that of the actor network to 0.0001.
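One way to draw these random experiment parameters with the standard library alone is sketched below: truncated normals via rejection sampling for the SoC values, and `random.weibullvariate` for the charging durations. The parameter values follow the settings above; how the original simulator samples them may of course differ:

```python
# Sketch of sampling the random experiment parameters listed above,
# using only the Python standard library.

import random

def truncated_normal(mu, sigma, low, high):
    """Rejection sampling from N(mu, sigma^2) restricted to [low, high]."""
    while True:
        x = random.gauss(mu, sigma)
        if low <= x <= high:
            return x

random.seed(0)
initial_soc = truncated_normal(0.5, 0.1, 0.2, 0.8)
target_soc = truncated_normal(0.9, 0.1, 0.85, 0.95)
# Weibull charging duration with scale 2.57 and shape 1.27
charging_time = random.weibullvariate(2.57, 1.27)
```

Rejection sampling is adequate here because the truncation intervals cover most of the probability mass, so very few draws are rejected.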

B. Performance of OCEAN
We first study the convergence and safety performance of OCEAN. Specifically, OCEAN is compared with classic Lyapunov-based safe reinforcement learning (LSRL) [40] to show its advantage. We also implement the optimal policy to show the performance upper bound and the safe condition. To obtain the optimal policy, it is assumed that all system uncertainties, such as EV charging demands and task arrivals, are known in advance. In this case, CSCO in (13) can be treated as a deterministic problem, whose optimal solutions are derived by the CPLEX solver [46]. To deal with the various safety constraint thresholds of different EVs, we normalize the cumulative constraint cost in (17); when the normalized cost is no less than 1, the safety constraints of the EVs are satisfied. Fig. 5(a) and (b) plot the accumulative reward and constraint cost achieved by OCEAN, classic LSRL, and Optimal as the number of training episodes increases. From Fig. 5(a), we observe that the accumulative reward of OCEAN rises with the number of training episodes and converges after around 15 episodes. This means that a better policy is learned by OCEAN during training to improve the system reward, and a stable policy is learned by the end of training. It can also be seen that OCEAN gradually approaches the optimal policy, which validates its superior performance. In addition, classic LSRL converges after around 30 episodes, which is much slower than OCEAN. Notice that although classic LSRL can achieve a higher reward than the optimal policy, this comes at the expense of safety violations, as shown in Fig. 5(b). Specifically, OCEAN keeps guaranteeing the safety of the learned policy while gradually approaching the optimum, whereas classic LSRL violates the safety condition after around 16 episodes. This is because, to solve (26), classic LSRL approximates the Lyapunov safety constraint with its first-order Taylor series, but this approximation is inaccurate in practice and cannot guarantee that the derived solutions meet the safety constraint.
To look into the reason behind the improvement of OCEAN over classic LSRL, we plot the loss curves of the safety layer network obtained by OCEAN and classic LSRL at different training episodes in Fig. 6. At episode 1, the loss curves of both OCEAN and classic LSRL rapidly decrease to zero. At episode 5, the loss curve of OCEAN still reduces to around zero, but that of classic LSRL only reduces to about 0.2 and does not decrease any further. Similar results can be observed at episodes 10 and 15. In addition, as the number of training episodes increases, the loss value obtained by classic LSRL also rises. This is because classic LSRL cannot guarantee the safety of the selected actions, causing inefficiency in the training of the safety layer network. In contrast, since OCEAN is able to learn the safe policy during training, the loss value obtained by OCEAN rapidly reduces to around zero.
Then, we study the monetary benefits received by EVs for sharing their idle computation resources. Fig. 7 plots the total CPU cycles required for all tasks processed by each EV together with their charging prices. It can be observed that the charging price is inversely correlated with the total CPU cycles required for the processed tasks, indicating that EVs with a larger computational workload enjoy a lower charging price. In other words, EVs that contribute more to task processing during their charging periods are rewarded with a lower charging price, which serves as an effective incentive and compensation for their resource sharing.
Moreover, we analyze the impact of various task parameters on their scheduling. To facilitate the analysis, we consider four different types of computation tasks and three types of EVs [44], [48], as detailed in Tables II and III. The number of tasks of each type is set to 30, and the number of EVs of each type is set to 7. Fig. 8 shows the scheduling of computation tasks to the respective types of EVs. It can be observed that Type-1 tasks are only assigned to Type-2 and Type-3 EVs, with no allocation to Type-1 EVs. This is because Type-1 tasks have substantial data sizes and stringent latency requirements, which demand short communication latency, making them unfit for offloading to Type-1 EVs, whose communication rates are the lowest. In contrast, Type-2 tasks, having the same latency constraints but smaller data sizes than Type-1 tasks, exhibit diversification in their allocation. Some of them are assigned to Type-1 EVs, owing to the reduced communication requirements resulting from their smaller data sizes. The majority, however, are still scheduled to Type-2 and Type-3 EVs, which are characterized by enhanced communication and computation capacities in line with these tasks' demands for substantial CPU resources and strict latency constraints. Similarly, Type-3 tasks are distributed across all EV types for processing, due to their moderate communication and computation requirements, which make them compatible with all types of EVs. Lastly, Type-4 tasks have more relaxed delay constraints than the other types, which allows for their effective offloading and processing on Type-1 EVs, as they do not need fast responses like the other tasks.

C. Comparison Results
Next, we carry out comparison experiments to further demonstrate the effectiveness of our algorithm. In addition to the optimal policy described above, we implement three state-of-the-art DRL-based algorithms.
• Optimal: As mentioned before, assuming all system uncertainties such as EV charging demands and task arrivals are known in advance, the CPLEX solver is used to derive the optimal solutions to problem CSCO. However, the dynamics of charging demands are influenced by various factors, e.g., user driving habits, traffic conditions, and weather, making their precise prediction difficult. It is also challenging to accurately predict uncertain task arrivals and varying task characteristics, such as data sizes, required computation resources, and delays. Thus, the optimal solution, which requires complete knowledge of these uncertainties, is impractical, as such information cannot be accurately known in advance.

• DDPG-based joint charging scheduling and computation offloading (DDPGSO): Many works in the literature have employed the DDPG algorithm to address charging scheduling for EVs [50] and computation offloading in MEC [51]. We implement this technique to jointly optimize the charging scheduling and computation offloading.

• PPO-based joint charging scheduling and computation offloading (PPOSO): Some recent studies adopted the PPO algorithm to solve the charging scheduling [37] and computation offloading [52] problems. We use the same method as in these studies to learn the joint optimization strategy.

• Primal-dual RL based joint charging scheduling and computation offloading (PDSO): The primal-dual RL approach has also been used recently to address the safe learning problem in many previous works, e.g., [53], [54]. We leverage this approach to solve our joint charging scheduling and computation offloading problem.
To ensure the safety of the learned joint charging scheduling and computation offloading policies, similar to the safe RL algorithms proposed in previous studies, e.g., [55], [56], DDPGSO and PPOSO utilize a penalty-augmented objective, which combines the expected reward objective with a penalty term associated with constraint violations to learn the safe policy.
Fig. 9(a) plots the load variances achieved by all five algorithms when the number of charging piles varies from 15 to 35. We can see that OCEAN achieves a load variance similar to the optimal policy, and always performs better than PDSO, DDPGSO, and PPOSO. For example, when N = 25, OCEAN achieves a load variance of 36165.7, compared to 41264.6 obtained by PDSO (a 12.4% reduction), 47547.2 obtained by DDPGSO (a 23.9% reduction), and 46530.1 obtained by PPOSO (a 22.3% reduction). On average, OCEAN decreases the load variance by 12.7%, 19.3%, and 22.6% over PDSO, DDPGSO, and PPOSO, respectively. This indicates that OCEAN can effectively optimize the charging scheduling and offloading decisions to minimize the load variance.
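The quoted percentage reductions at N = 25 follow directly from the reported variances, as the short check below confirms:

```python
# Worked check of the reported load-variance reductions at N = 25:
# reduction = (baseline - ocean) / baseline, rounded to one decimal.

def reduction_pct(baseline, ocean):
    return round(100 * (baseline - ocean) / baseline, 1)

OCEAN_VAR = 36165.7
reductions = {name: reduction_pct(var, OCEAN_VAR)
              for name, var in [("PDSO", 41264.6),
                                ("DDPGSO", 47547.2),
                                ("PPOSO", 46530.1)]}
# -> {'PDSO': 12.4, 'DDPGSO': 23.9, 'PPOSO': 22.3}, as stated in the text
```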
Fig. 9(b) shows the load variance achieved by all five algorithms when the number of computation tasks varies from 50 to 130. Again, OCEAN always outperforms PDSO, DDPGSO, and PPOSO; on average, it achieves 9.1%, 19.2%, and 22.1% load variance reduction over them, respectively. Fig. 9(c) presents the load variance when the initial SoC of EVs varies from 0.5 to 0.7. Compared to PDSO, DDPGSO, and PPOSO, OCEAN reduces the load variance by 8.3%, 22.7%, and 24.5% on average, respectively. Fig. 9(d) shows the load variance when the required CPU cycles vary from 20 Megacycles to 80 Megacycles. On average, OCEAN achieves 8.4%, 19.1%, and 22.4% load variance reduction over PDSO, DDPGSO, and PPOSO, respectively. To show the safety of OCEAN, Table IV lists the normalized cumulative constraint costs of all five algorithms at convergence under different parameters. The constraint costs of all five algorithms are larger than 1, which means all of them can guarantee the safety of the policy. Nevertheless, the constraint cost of OCEAN is significantly smaller than those of PDSO, DDPGSO, and PPOSO, and OCEAN achieves a constraint cost similar to the optimal policy. These comparison results demonstrate that OCEAN can effectively reduce the load variance while guaranteeing the feasibility of the learned policy.
Additionally, Table V shows the execution times of all five algorithms across varying numbers of charging stations. We can observe that our proposed algorithm, OCEAN, exhibits longer execution times than PDSO, DDPGSO, and PPOSO. The reason is that OCEAN combines an actor network and a safety network for action generation, whereas PDSO, DDPGSO, and PPOSO rely solely on an actor network for decision-making. Despite this, OCEAN still keeps its execution time within a remarkably short duration, not exceeding 7 ms. Moreover, as the number of charging stations increases, all five algorithms show an increase in execution time. Notably, Optimal shows a significantly rapid increase in execution time due to the exponential growth of its computational complexity. In particular, when the number of charging stations is 60, it is unable to find an optimal solution. Although the optimal solutions can be obtained when the number of charging stations is less than 60, the execution time of Optimal is vastly larger than those of the other four algorithms.

VI. CONCLUSION
This paper studies the joint charging scheduling and computation offloading problem for EV-MEC. First, we formulate a two-timescale optimization problem with the objective of minimizing the load variance while satisfying both the charging demands of EVs and the strict performance requirements of computation tasks. Next, we develop a novel safe DRL-based intelligent algorithm, called OCEAN. Specifically, a new safe DRL algorithm is proposed to optimize the charging scheduling for EVs, and the computation offloading optimization is reformulated as an integer nonlinear programming problem. Extensive experiments and performance comparisons show the superiority of OCEAN: it reaches performance similar to the optimal strategy and considerably reduces the charging load variance compared to three state-of-the-art algorithms, while ensuring that the learned policy satisfies the charging demands of all EVs.
For future work, we will extend the proposed approach to account for the uncertainty of EV departure times, which could be modelled as a normal distribution whose mean is the scheduled departure time. We will also explore leveraging chance-constrained planning methods in DRL to effectively manage uncertain vehicle departures.


• Data Transmission Unit (Charging Station Controller): This unit sends tasks to the EVs according to the decisions made by the Task Scheduler and receives the computing results from the EVs. It acts as the communication bridge between the charging station and the EVs.

For the EVs, each is equipped with two main components:

• Data Transmission Unit (EV): It is responsible for receiving computation tasks from the Charging Station Controller and returning the computing results. It ensures smooth communication between the EV and the Controller.

Fig. 4. Training framework of the proposed safe DRL approach.

Fig. 6. Loss curves of the safety layer network obtained by OCEAN and classic LSRL at different training episodes.

Fig. 7. Total CPU cycles required for all tasks processed by each EV and their charging prices.

Fig. 9. Load variances of all five algorithms under different parameters.

TABLE II: PARAMETERS OF COMPUTATION TASKS

TABLE IV: COMPARISON OF ALL FIVE ALGORITHMS ON THE NORMALIZED CUMULATIVE CONSTRAINT COST WITH DIFFERENT PARAMETERS

TABLE V: EXECUTION TIMES (IN SECONDS) OF ALL FIVE ALGORITHMS ACROSS VARYING NUMBERS OF CHARGING STATIONS (N/A denotes cases unable to find the optimum.)