1. Introduction
Urban traffic congestion has long been a global issue, negatively impacting both the economy and the environment. The energy used during instances of traffic congestion contributes to the emission of greenhouse gases such as carbon dioxide, exacerbating the greenhouse effect. In the United States, traffic congestion in just one year has been estimated to result in economic damages totaling USD 121 million, along with the generation of 25,396 tons of carbon dioxide [
1]. Meanwhile, in China, a minimum of 24% extra travel time is necessary for commuting during peak hours in major cities like Beijing, Shijiazhuang, and Chongqing [
2]. Curbing vehicle purchases is not a feasible solution, and expanding traffic infrastructure is costly. There is a pressing need to mitigate pollution emissions from traffic congestion by leveraging technological innovations to improve environmental conditions, and enhancing the efficiency of traffic signal systems appears to be a more manageable approach [
3].
Currently, many intersections operate on a fixed-time traffic signal system that sets the timing of traffic signals based on historical data rather than in response to real-time traffic needs. Conventional traffic signal control techniques typically use predefined rules derived from strategies based on expert experience. These methods lack the capability to dynamically adjust signal phases based on instantaneous traffic flow data. In this context, Adaptive Traffic Signal Control (ATSC) [
4,
5,
6] stands out. ATSC can optimize traffic flow across regional road networks and effectively reduce congestion [
7] by dynamically altering signal phases in response to current traffic conditions. The concept of ATSC was introduced many years ago, and key systems such as the Split Cycle Offset Optimization Technique (SCOOT) [
8] and Sydney Coordinated Adaptive Traffic System (SCATS) [
9] are notable examples.
With the explosive growth of Deep Learning (DL), an increasing array of DL-based ATSC methodologies are emerging. Sun et al. [
10] employed Convolutional Neural Networks (CNNs) to analyze image data from traffic cameras, deriving a method for predicting short-term traffic flow and thereby enabling real-time signal optimization. Similarly, Kong et al. [
11] utilized Recurrent Neural Networks (RNNs) to process historical traffic data, facilitating proactive signal adjustments. In addition, traffic signal control is a sequential decision-making process that can be formulated as a Markov Decision Process (MDP). Thus, it is possible to frame this issue as a problem that can be solved using reinforcement learning (RL) [
12,
13,
14]. One of the earliest contributions of RL to ATSC is based on
Q-learning [
15], and there are some other works applying RL for ATSC in intersection environments [
16,
17,
18], where the focus is on dynamically adjusting traffic signals based on real-time and historical data to enhance the overall efficiency of traffic management systems.
Recently, RL has been enhanced using deep neural networks, giving rise to a subfield known as deep reinforcement learning (DRL). DRL combines powerful hierarchical feature extraction and nonlinear approximation abilities with the interaction between the agent and the environment. This integration offers a range of appealing attributes and has generated significant academic interest in the application of DRL to ATSC [
7,
19,
20]. Li et al. [
21] employed stacked autoencoders in RL for ATSC to ascertain the
Q-function, facilitating the efficient compression and storage of agent inputs. Wei et al. [
22] contributed an advancement to the field by integrating reward and policy interpretation into a DQN-embedded ATSC system. Nishi et al. [
23] modeled the impact of neighboring nodes using a static adjacency matrix within the framework of GCN. Additionally, Wu et al. [
24] proposed using CNNs for the improved extraction of state information features relevant to network topology-related issues, specifically addressing the ATSC problem. Wang et al. [
25] used a decentralized RL approach with a region-aware strategy, incorporating an actor–critic model and graph attention networks to optimize traffic signal control. Zheng et al. [
26] introduced the FRAP model, which optimizes traffic signal control by leveraging phase competition and symmetry invariance. Furthermore, Zhang et al. [
27] applied meta-reinforcement learning using model-agnostic meta-learning (MAML) and flow clustering to generalize traffic signal control across diverse environments with support from a WGAN-generated traffic flow dataset. DRL excels over traditional RL by efficiently handling high-dimensional, complex data and automatically extracting nonlinear features; it also surpasses pure DL by integrating decision-making and control capabilities, thereby enabling continuous learning and an effective balance between exploration and exploitation in dynamic environments.
However, due to the rapidly changing traffic conditions and the high dimensionality of traffic information, current DRL-based ATSC algorithms struggle to effectively capture the relationships between traffic information across different time sequences and to adequately understand traffic scenarios. The primary deficiencies are as follows. Firstly, their reasoning capabilities are insufficient, resulting in non-stationarity during the training process. Secondly, most ATSC algorithms require extensive data to learn the optimal policy, demanding considerable computational resources and training time [
28]. In addition, traditional DRL agents are specialized to their training environment, which limits their generalization and transferability to new environments. These limitations restrict the application of DRL to simple traffic scenarios and hinder its applicability in complex real-world traffic situations.
To address these issues, a novel DRL-based ATSC approach named Sequence Decision Transformer (SDT) is proposed, in which DRL is formulated as a sequence decision model. Given that large language models (LLMs) like the GPT series and BERT have demonstrated outstanding performance in natural language processing, computer vision, and reinforcement learning [
29], the robust understanding and reasoning capabilities of LLMs are applied to the ATSC problem to accommodate the high dynamics of traffic conditions and the high dimensionality of traffic information. This enables SDT to incrementally learn an optimal policy within complex traffic flows. The algorithm employs an encoder–decoder structure, stores historical trajectories in a replay buffer, and utilizes Proximal Policy Optimization (PPO) to update its parameters, effectively addressing challenges in ATSC such as the demand for large amounts of data and low generalization performance.
The main contributions of this paper are summarized as follows:
(1) The ATSC problem is first converted into a DRL formulation using a Markov Decision Process (MDP), where the essential elements, such as the observation space, action space, and reward function, are defined with the goal of improving traffic efficiency and reducing congestion. This article introduces SDT, an extension of the standard MDP with an encoder–decoder structure. This framework serves as a means of deriving a formulation of the ATSC problem.
(2) The SDT is employed to solve the ATSC problem formulated as an MDP. SDT utilizes a transformer to process data in parallel, which can potentially reduce training time and accommodate larger observation and action spaces. The self-attention mechanism in SDT captures dynamic changes in the environment and enhances the representation of general features, addressing issues related to non-stationarity and improving model generalization in new environments. Additionally, PPO is introduced, using a clipped probability ratio to limit the policy update step (see the objective sketched after this list), thereby preventing large policy changes during training and increasing the stability of the training process.
(3) Extensive experiments were carried out across various traffic scenarios. The experimental results demonstrate that the presented ATSC method based on SDT consistently excels at alleviating traffic congestion and enhancing the efficiency of the traffic system. Compared with PPO, a DQN designed for ATSC, and FRAP, the SDT model shows improvements of 26.8%, 150%, and 21.7% over traditional ATSC algorithms, and 18%, 30%, and 15.6% over the state-of-the-art (SOTA) under the most complex conditions.
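For reference, the clipped surrogate objective that PPO uses to limit the policy update step, as mentioned in contribution (2), has the standard form below; this is the generic PPO formulation rather than notation taken from this paper, with $\rho_t(\theta)$ denoting the probability ratio and $\hat{A}_t$ the advantage estimate:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)}
$$

Clipping the ratio to $[1-\epsilon,\, 1+\epsilon]$ removes the incentive for the policy to move far from the behavior policy in a single update, which is the stabilizing effect exploited by SDT.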
The rest of this paper is organized as follows: The ATSC problem and the ATSC framework based on SDT are introduced and defined in
Section 2. In
Section 3, the ATSC problem is formulated as a Markov decision process.
Section 4 details the algorithm, including its structure and update policy. The experiments are presented in
Section 5, followed by the conclusions in
Section 6.
Appendix A explains the notations that appear in this article.
3. Representation of the ATSC Problem as a Markov Decision Process
3.1. Markov Decision Process
The Markov decision process (MDP), widely recognized for formally expressing the process through which an agent traverses its environment, is fundamentally a discrete-time decision-making framework. An MDP is defined by a tuple $(O, A, R, P)$, where:
$O$ is the observation space of the agent, where observations are the pieces of information that the agent receives at each time step $t$.
$A$ represents the action space of the agent in the environment, from which the agent selects the action it performs at discrete time step $t$.
$R$ is the reward function; it maps an observation $o_t$ to a numerical reward $r_t$.
$P$ represents the transition probability function, which defines the dynamics of the environment; $P(o_{t+1} \mid o_t, a_t)$ describes the probability of transitioning from observation $o_t$ to observation $o_{t+1}$, given a particular action $a_t$.
Following the MDP model, the agent interacts with the environment in discrete time steps. At time step $t$, the agent obtains an observation $o_t$ by interacting with the environment and performs an action $a_t$. After the agent has taken the action, it receives a reward $r_t$. The environment then transitions to a new observation $o_{t+1}$ according to the probability function $P$. When the trajectory from an epoch has been gathered, the policy undergoes an update, and the agent continues interacting with the environment using the newly updated policy. The goal of an MDP is to enable the agent to learn the optimal policy that maximizes the total accumulated reward over time through continuous policy updates.
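To make this interaction loop concrete, the following minimal sketch illustrates the observe, act, and reward cycle and the per-epoch policy update described above; the `env` and `policy` interfaces (`reset`, `step`, `act`, `update`) are hypothetical placeholders rather than the actual SDT implementation.

```python
# Minimal sketch of the discrete-time MDP interaction loop described above.
# `env` and `policy` are hypothetical placeholders standing in for the
# SUMO-based environment and the SDT policy.

def run_epoch(env, policy, episode_length=500):
    trajectory = []                       # stores (o_t, a_t, r_t) tuples for the update
    o_t = env.reset()                     # initial observation from the environment
    for _ in range(episode_length):
        a_t = policy.act(o_t)             # the agent selects an action under its policy
        o_next, r_t = env.step(a_t)       # the environment transitions and returns a reward
        trajectory.append((o_t, a_t, r_t))
        o_t = o_next
    policy.update(trajectory)             # the policy is updated once the trajectory is gathered
    return sum(r for _, _, r in trajectory)
```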
3.2. Converting the ATSC Problem to MDP
In this section, the ATSC problem is transformed into an MDP problem by defining its basic elements, including the observation space, the action space, and the reward function.
Table 1 shows the observation and action space in this work.
Observation: The observation space describes the information observed during the interaction between the agent and the environment. A reasonable observation space is crucial for efficient training of the DRL algorithm. In the ATSC system, the capacity to accurately extract and reconstruct observation information from the complex and dynamic environment of intersections is significant, as it determines the ability to output appropriate actions accurately.
The observation space is defined as the collection of the per-lane observations in the current scenario, i.e., $o_t = \{o_t^1, o_t^2, \ldots, o_t^L\}$, where $L$ is the number of lanes. Considering that the traffic conditions at an intersection vary according to the dynamic and static traffic information of each lane, the per-lane observation is set as $o_t^i = \{n_t^i, \bar{w}_t^i, q_t^i, \bar{v}_t^i\}$, where $n_t^i$ represents the total number of vehicles in lane $i$ at time $t$; $\bar{w}_t^i$ represents the average waiting time of the vehicles in lane $i$ at time $t$ whose velocity is below a given threshold, with $v_m$ and $w_m$ denoting the velocity and the waiting time of vehicle $m$, respectively; $q_t^i$ represents the queue length of lane $i$ at time $t$, which can be denoted by the number of queued vehicles; and $\bar{v}_t^i$ represents the average velocity of the vehicles in lane $i$ at time $t$. The above observation space can accommodate complex and randomly changing traffic intersection scenarios, effectively characterizing the traffic conditions at the intersection.
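As an illustration of how such per-lane observations can be gathered from the simulator, the sketch below uses SUMO's TraCI Python API and assumes a TraCI connection to a running simulation; the speed threshold and the helper name are illustrative assumptions and are not taken from the paper's implementation.

```python
import traci  # SUMO's TraCI Python API; assumes traci.start(...) has already been called

SLOW_SPEED_THRESHOLD = 0.5  # m/s; illustrative threshold, not a value from the paper

def lane_observation(lane_id):
    """Collect the per-lane features described above: vehicle count, average
    waiting time of slow vehicles, queue length, and average speed."""
    n = traci.lane.getLastStepVehicleNumber(lane_id)      # total vehicles in the lane
    q = traci.lane.getLastStepHaltingNumber(lane_id)      # queued (halting) vehicles
    v_avg = traci.lane.getLastStepMeanSpeed(lane_id)      # average velocity

    # Average waiting time of vehicles whose speed is below the threshold.
    waits = [traci.vehicle.getWaitingTime(veh_id)
             for veh_id in traci.lane.getLastStepVehicleIDs(lane_id)
             if traci.vehicle.getSpeed(veh_id) < SLOW_SPEED_THRESHOLD]
    w_avg = sum(waits) / len(waits) if waits else 0.0

    return [n, w_avg, q, v_avg]
```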
Action: In this paper, the discrete set of traffic light phases is considered as the action space of the agent. Upon receiving an observation at time step $t$, the agent selects and executes an appropriate action (phase) $p$ from the action space $A = \{p_1, p_2, \ldots, p_{N_p}\}$, where $N_p$ indicates the number of phases. By executing $p$, the agent determines which roads are allowed and disallowed to pass at time $t$ and maintains the phase duration.
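A minimal sketch of how a selected phase can be applied and held for one decision interval in SUMO is shown below; the 6 s interval matches the decision interval used later in the experiments, while the traffic-light ID and the use of TraCI are assumptions for illustration.

```python
import traci

DECISION_INTERVAL = 6  # seconds; one agent decision spans six 1 s simulation steps

def apply_phase(tls_id, phase_index):
    """Switch to (or keep) the chosen signal phase and hold it for one decision interval."""
    traci.trafficlight.setPhase(tls_id, phase_index)                 # set the selected phase
    traci.trafficlight.setPhaseDuration(tls_id, DECISION_INTERVAL)   # keep it from auto-advancing
    for _ in range(DECISION_INTERVAL):
        traci.simulationStep()                                       # advance the simulation by 1 s
```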
Reward: The reward function serves as quantitative feedback received by the agent from the environment after executing an action. In this study, the reward function guides the agent in continuously exploring and learning optimal strategies while maximizing rewards. The design of these functions is crucial for the convergence speed of DRL algorithms. If the reward function is overly optimized for a specific scenario, it may lead to overfitting, reducing the trained model’s ability to adapt to new scenarios. Generally, performance metrics should be designed based on the actual objectives intended to be achieved in the environment. To comprehensively describe the environment, the reward function focuses on holistically enhancing the traffic efficiency at intersections, ensuring that the reward system accurately reflects the desired outcomes in the environment, not just how the agent should behave. Such a design helps avoid overtraining in specific environments while enhancing the model’s adaptability and generality in new scenarios. Therefore, considering all aspects, the reward function comprises the following components:
(1) The total number of vehicles ($N_t$): By focusing on the total number of vehicles, the reward function can directly impact and assess traffic flow at intersections. Reducing the total number of vehicles at intersections aids in alleviating traffic congestion and facilitating smoother traffic. This metric effectively reflects the intersection’s capacity and efficiency in managing traffic flow.
(2) The average speed of vehicles ($\bar{v}_t$): Average speed is one of the key metrics used to assess traffic efficiency. By increasing the average speed of vehicles, traffic congestion can be reduced and travel times shortened, thereby enhancing the overall efficiency of the intersection. Higher average speeds generally indicate good traffic conditions, where vehicles do not need to frequently stop or slow down.
(3) The average queue length of all vehicles ($\bar{q}_t$): This metric reflects the queuing situation of vehicles at the intersection. Shorter queue lengths mean that vehicles can pass through the intersection more quickly, reducing waiting and idling times, which helps to enhance the continuity of traffic flow and reduce congestion. Controlling queue length can effectively optimize the distribution of traffic flow and the scheduling of traffic signals at the intersection.
(4) The average waiting time of all vehicles ($\bar{w}_t$): The average waiting time is an important metric for measuring the efficiency of an intersection. By reducing the average waiting time of vehicles at the intersection, the smoothness and efficiency of traffic flow can be significantly improved.
In conclusion, the reward function is formulated as a weighted sum of the four components defined above:

$$ r_t = w_1 N_t + w_2 \bar{v}_t + w_3 \bar{q}_t + w_4 \bar{w}_t $$

Here, $w_1$, $w_2$, $w_3$, and $w_4$ represent the weights for the respective items of the reward function. The final goal of the proposed algorithm is to optimize this weighted sum of four objectives, which also serves as a normalization method. The weighted summation approach was chosen because the scales of these four factors vary significantly in the initial episodes of each environment. Simply adding them as a reward would cause the agent to focus disproportionately on the factors with larger magnitudes, potentially preventing the algorithm from converging.
Weighting these four metrics ensures that each indicator has an appropriate proportion within the reward function, thereby overcoming issues of differing scales and uneven impact among the indicators. This design helps the algorithm balance the importance of different metrics, avoiding the overoptimization of a single metric while neglecting others that are equally important. Additionally, this weighted approach enhances the stability and convergence of the algorithm, enabling it to effectively learn and adapt in various traffic environments, ultimately achieving the goals of reducing congestion and improving traffic efficiency.
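For illustration, a minimal sketch of the weighted reward computation is given below; the weight values are placeholders chosen so that lower congestion yields a higher reward, not the weights actually used in the paper.

```python
# Illustrative weighted reward following the formulation above.
# The weights are placeholders: negative weights on vehicle count, queue length,
# and waiting time make "less congestion" correspond to a larger reward.
WEIGHTS = {"vehicles": -1.0, "speed": 1.0, "queue": -1.0, "wait": -1.0}

def compute_reward(total_vehicles, avg_speed, avg_queue, avg_wait, weights=WEIGHTS):
    return (weights["vehicles"] * total_vehicles
            + weights["speed"] * avg_speed
            + weights["queue"] * avg_queue
            + weights["wait"] * avg_wait)
```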
5. Experiment
To evaluate the proposed ATSC algorithm based on the Sequence Decision Transformer (SDT), two sets of experiments were conducted. All experiments were performed using Simulation of Urban Mobility (SUMO) [
30], a traffic simulation tool. In the experiments, all intersections were set in a synthetic traffic environment. To ensure the practicality of the proposed model, the experiments focused on standard intersections, the most common type of traffic scenario in modern cities, where vehicles can proceed straight, turn left, or turn right. U-turns were not permitted in the designed synthetic scenarios, as most urban intersections do not allow direct U-turns due to the increased risk of traffic accidents.
Specifically, the first experiment demonstrates the training process of the proposed algorithm, showing the evolution of performance during iterations of policy updates and the final performance of the policy. The second experiment compares the performance of a fixed-time traffic signal, a baseline PPO method, a DQN algorithm specifically designed for ATSC problems [
31], and a FRAP [
26] algorithm to the SDT algorithm under various traffic scenarios. This section begins with the experimental setup and implementation details, followed by an analysis of the training and test results obtained.
5.1. Experiment Setting
The training scenario is an intersection with 32 lanes; each direction has 3 straight lanes, 1 right-turn lane, and 1 left-turn lane, and all lanes are 750 m long. The traffic flow density is approximately 600–5000 veh/h during the training process.
In SUMO, the shortest simulation time interval, namely the time step, is 1 s. The traffic flow, which is randomly generated at various moments within 3600 s, enters the incoming lanes and passes through the intersection. The SDT algorithm was trained for 3 million time steps with an episode length of 500, which corresponds to 6000 episodes in this training scenario. In the synthetic scenario, fixed weights $\{w_1, w_2, w_3, w_4\}$ are assigned to the items of the reward function. Each decision interval $T$ is 6 s, meaning the agent decides whether to maintain the current phase or switch to another phase every 6 s. If $T$ is set too short, such as 1 s, it can cause the traffic lights to switch too often, which is impractical and increases the computational load. Conversely, if $T$ is set too long, it will slow the agent’s learning rate. The parameters of the baseline PPO, DQN, and FRAP algorithms are restored to their original settings. Please refer to Table 2 for the detailed parameters of the SUMO simulator, the baseline PPO method, the DQN, the FRAP algorithm, and the SDT algorithm.
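As a quick arithmetic check of the schedule above, the following lines relate the total number of training steps, the episode length, and the decision interval; the constant names are purely illustrative.

```python
SIM_STEP_S = 1                      # SUMO simulation time step (seconds)
DECISION_INTERVAL_S = 6             # one agent decision every 6 s
EPISODE_LENGTH = 500                # decision steps per episode
TOTAL_TRAINING_STEPS = 3_000_000    # total decision steps during training

num_episodes = TOTAL_TRAINING_STEPS // EPISODE_LENGTH       # = 6000 episodes
sim_steps_per_decision = DECISION_INTERVAL_S // SIM_STEP_S  # = 6 simulation steps per decision
```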
5.2. Analysis of Performance during Training Process
In this experiment, a high-performing ATSC agent was successfully trained using the SDT algorithm.
Figure 3a illustrates the changes in reward values throughout the training process. The curve in the graph represents the variation of the discounted rewards as the number of time steps increases. As the agent explores and the network parameters are updated, the reward curve gradually increases, indicating that the policy has become adept at handling the control problem and that the agent is continuously adapting to varying scenarios. At around 75,000 steps, the reward curve reaches its peak and subsequently fluctuates with minimal variance for an extended period. This suggests that the agent has learned the optimal policy after approximately 75,000 steps, and the subsequent fluctuations are due to randomness in traffic scenarios and vehicle flows at different times. The rapid convergence and superior policy performance can be attributed to the SDT framework, which leverages the encoder–decoder architecture for efficient data processing. The self-attention mechanism within the transformer model enables the SDT to capture intricate dependencies and dynamic changes in traffic conditions, thereby enhancing the learning process. Additionally, the application of PPO ensures stable and continuous policy updates, preventing large policy shifts and maintaining consistent performance improvements.
To validate the convergence of the rewards and each component, the changes in three reward components were recorded with respect to the number of steps.
Figure 3b,c display the total number of vehicles and the average waiting time of vehicles, respectively. The trends in these two reward items are essentially consistent with the reward curve. The total number of vehicles decreases with the update of the policy, reaching a low point around 100,000 steps and subsequently fluctuating within a small range. The average waiting time reaches its lowest point after around 80,000 steps as the policy is updated, and it remains relatively stable thereafter, with only minor fluctuations.
Figure 3d shows the queue length as the number of time steps increases. During the training process, the queue length initially increases slightly and then gradually decreases. As the policy is updated, the minimum queue length occurs around 160,000 steps and maintains relative stability thereafter with only minor fluctuations. Although the reward items converge at different times, the trends and the final outcomes are the same. This indicates that the convergence of the rewards is accompanied by the convergence of the reward items, and the traffic light at the intersection becomes more efficient in managing traffic flows under the influence of the SDT algorithm, leading to improved traffic conditions at the intersection.
5.3. Performance Comparison
This section compares the performance of the policy trained using the SDT algorithm with the fixed-time method, a PPO method, a DQN tailored for ATSC, and a SOTA ATSC algorithm, FRAP. To demonstrate the performance of the SDT algorithm, three typical scenarios were used, which are illustrated in
Figure 4. In all scenarios, traffic flow follows a Poisson distribution; that is, traffic flow is generated randomly, and the probability of arrival is the same within equal time intervals during the simulation period (a sketch of this arrival process is given after the scenario descriptions below). The traffic density for each lane at the intersection remains constant and does not vary under different conditions such as peak hours.
Easy scenario: In this scenario, traffic can only flow from north to south and from west to east, with each direction consisting of two lanes. The traffic density is 600 veh/h.
Medium scenario: This scenario is similar to the training scenario, with each direction consisting of four lanes. The traffic density is 4000 veh/h.
Hard scenario: This intersection has 48 lanes; each direction has 4 straight lanes, 2 right-turn lanes, and 2 left-turn lanes. The traffic flow density is approximately 5000 veh/h.
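As an illustration of this arrival model, the sketch below samples Poisson arrivals for one lane by drawing exponential inter-arrival times with a rate derived from the lane's traffic density; the function name and output format are assumptions, not the generator actually used to build the SUMO route files.

```python
import random

def poisson_arrival_times(density_veh_per_hour, horizon_s=3600, seed=None):
    """Sample vehicle arrival times (in seconds) for one lane as a Poisson process:
    inter-arrival times are exponential with rate density / 3600 vehicles per second."""
    rng = random.Random(seed)
    rate = density_veh_per_hour / 3600.0   # expected arrivals per second
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate)         # exponential inter-arrival time
        if t >= horizon_s:
            break
        arrivals.append(t)
    return arrivals
```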
In the scenarios described above, several baseline methods were set up for comparison with the SDT algorithm:
FIX6/12/18/24: The numbers denote the fixed duration of each green light phase. In the FIXED experiments, the green light phases cycle with a fixed duration, simulating the fixed-time traffic lights commonly found at most real-life intersections (a sketch of such a fixed-time cycle follows this list).
Baseline PPO: This implementation is a standard, unmodified version of PPO without any design specifically tailored for ATSC. The policy model is an MLP, which is used to learn and optimize traffic signal control strategies from the traffic environment. Because of the lack of task-specific adjustments, this PPO algorithm represents a straightforward application of PPO to the ATSC problem.
DQN [
31]: This DQN-based DRL algorithm is designed for ATSC. It leverages a deep convolutional neural network (CNN) to extract useful features from traffic data and utilizes the Q-learning algorithm to identify the optimal traffic signal control strategy.
FRAP [
26]: FRAP models traffic signal control as a phase competition problem, giving priority to the traffic phase with the highest demand. By leveraging symmetry in traffic flow (flipping and rotation), FRAP reduces the problem’s complexity and state space, thus improving learning efficiency.
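For concreteness, a minimal sketch of the fixed-time baseline behavior is given below; it simply cycles through the signal phases with a constant green duration, with the phase count, traffic-light ID, and use of TraCI as placeholders rather than the exact baseline implementation.

```python
import itertools
import traci

def run_fixed_time(tls_id, num_phases, green_duration=6, horizon_s=3600):
    """Fixed-time baseline: cycle through the signal phases with a constant
    duration (6/12/18/24 s in the FIX experiments)."""
    elapsed = 0
    for phase in itertools.cycle(range(num_phases)):
        traci.trafficlight.setPhase(tls_id, phase)
        traci.trafficlight.setPhaseDuration(tls_id, green_duration)
        for _ in range(green_duration):
            traci.simulationStep()          # 1 s per simulation step
            elapsed += 1
            if elapsed >= horizon_s:
                return
```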
Table 3 shows the final results of the comparison. To evaluate the performances of the different methods, three key metrics were used: the total number of vehicles at the intersection (vehicle number), the average speed of all vehicles (speed), and the average queue length at the intersection (queue). These metrics provide a straightforward reflection of the congestion level at the intersection, serving as core indicators for assessing the model’s performance, with their calculation formulas corresponding to Equations (2), (3), and (4), respectively. The results demonstrate that, across all scenarios, the SDT algorithm significantly outperforms the FIXED method and the baseline PPO method in all aspects.
In the easy scenario, SDT achieves improvements of 5% and 2% in vehicle number and speed compared to the FRAP method. These relatively modest improvements can be attributed to the lighter traffic conditions in this scenario, where the traditional ATSC algorithm is still sufficiently effective. However, the encoder–decoder architecture of SDT allows it to capture more nuanced dynamics of traffic flow, leading to slight performance gains.
In the medium scenario, the improvements by SDT become much more pronounced, with gains of 23%, 65%, and 13% in vehicle number, speed, and queue, respectively. This significant enhancement is primarily due to the self-attention mechanism within the SDT’s transformer model, which captures complex dependencies and dynamic interactions among traffic flows, enabling SDT to adjust signal control more precisely than the DQN and FRAP methods. As traffic fluctuations and complexity increase in medium scenarios, FRAP’s limited capacity to model temporal dependencies and interactions among multiple streams becomes more apparent, leading to inferior performance.
In the hard scenario, SDT’s advantages are even more evident, with improvements of 26%, 24%, and 34% in vehicle number, speed, and queue, respectively. In such high-complexity scenarios, FRAP and the DQN struggle to cope with the high traffic volume and intricate interactions, resulting in significantly lower performances. In contrast, SDT not only leverages its encoder–decoder architecture to effectively process complex data but also benefits from the stable and continuous policy updates of the PPO algorithm, preventing large policy shifts and ensuring consistent performance improvements. SDT’s ability to dynamically adjust according to real-time traffic conditions greatly alleviates congestion and enhances throughput efficiency.
However, it is worth noting that the SDT algorithm relies heavily on computational resources and is sensitive to hyperparameter settings. While these factors contribute to its high performance, they also highlight areas for further optimization. Reducing computational demands and enhancing the algorithm’s robustness to hyperparameter variations could broaden SDT’s applicability.
Overall, as the scale of the intersection and traffic flow increases, the SDT algorithm, with its sophisticated architecture and advanced learning mechanisms, effectively addresses the limitations of traditional methods such as FRAP in terms of dynamic adaptability and handling complex traffic conditions, achieving more flexible and efficient traffic management.