Abstract
This paper develops a model of reinforcement learning ramp metering (RLRM) without complete information, which is applied to alleviate traffic congestion on ramps. RLRM consists of prediction tools based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. It is also a dynamic process with automaticity, memory, and performance feedback. Numerical cases demonstrate RLRM by calculating outflow rate, density, average speed, and travel time and comparing them with no control and fixed-time control. The results indicate that the greater the inflow, the larger the effect of control. In addition, RLRM is more stable than fixed-time control.
1. Introduction
Increasing dependence on car-based travel has led to the daily occurrence of recurrent and nonrecurrent freeway congestion, not only in China but around the world. Congestion on highways forms when demand exceeds capacity. Recurrent congestion substantially reduces the available infrastructure capacity at rush hour, that is, at the time when this capacity is most urgently needed. Moreover, congestion causes delays, increases environmental pollution, and reduces traffic safety.
Ramp metering is essential to the efficient operation of highways, particularly when volumes are high. According to Papageorgiou and others, ramp metering strategies are divided roughly into reactive and proactive types [1]. DC (demand-capacity), OCC (occupancy), and ALINEA [2] are among the well-known local responsive ramp metering strategies [3]. In DC, the actual upstream volume is measured at regular short intervals and compared with the downstream capacity, which may be calculated from downstream traffic conditions. OCC uses a predetermined relationship between occupancy rate and lane volume, developed from data previously collected on the highway adjacent to the ramp under consideration. ALINEA sets the metering rate of an on-ramp based on measured values of main line traffic. ALINEA has been applied in several European countries and has been validated more extensively than DC and OCC. Iwata, Tsubota, and Kawashima proposed a ramp metering technique that uses values predicted by a traffic simulator [4]. Wang et al. proposed reinforcement learning ramp metering based on a traffic simulation model with desired speed [5]. The aim of this study is to propose reinforcement learning ramp metering without complete information.
2. Methods
2.1. Traffic Flow Simulation Model
Figure 1 describes car-following behavior. In a microsimulation model, the fundamental modeled behavior is "car-following," in which a driver adjusts to the vehicle ahead according to characteristics such as the distance between two adjacent cars, the relative speed, and so forth.
In 1953, Pipes proposed the following basic differential equation model for car-following behavior:
$$\ddot{x}_n(t) = \lambda\left[\dot{x}_{n-1}(t) - \dot{x}_n(t)\right],$$
where $\ddot{x}_n$, $\dot{x}_n$, and $x_n$ denote the acceleration, speed, and distance from the reference point of vehicle $n$, respectively, and $\lambda$ is a constant. In this model, the acceleration of a vehicle that follows a leading vehicle is proportional to the speed difference between the two vehicles. The model assumes that the time the vehicle takes to respond to the speed difference is so small that it can be neglected. To remove this drawback, Chandler introduced a reaction delay time $T$. Based on the rationale that the acceleration of the following car is also influenced by its own speed and by the distance between the vehicles, Gazis, Herman, and Rothery proposed the general type of car-following model, which in its standard form reads
$$\ddot{x}_n(t+T) = \lambda\,\frac{\dot{x}_n^{m}(t+T)}{\left[x_{n-1}(t) - x_n(t)\right]^{l}}\left[\dot{x}_{n-1}(t) - \dot{x}_n(t)\right],$$
where $m$ and $l$ are constant exponents.
Newell proposed a model, based on real data, in which the acceleration is proportional to an exponential function of the distance between the vehicles. Although the above modifications have made car-following models more realistic, they have two drawbacks. First, when there is no preceding vehicle, the models imply that a car keeps its initial speed. Second, when the speed difference is 0, the acceleration is 0. This implies the unrealistic phenomenon that the following car will not brake even when the distance to the preceding car approaches 0 and will not accelerate even if the distance is very long. To solve these problems, Treiber and Helbing introduced the intelligent driver model (IDM) [6], which introduces a desired speed and a shortest distance between cars. In its standard form [6], the IDM is given as
$$\dot{v}_n = a\left[1 - \left(\frac{v_n}{v_0}\right)^{\delta} - \left(\frac{s^{*}(v_n, \Delta v_n)}{s_n}\right)^{2}\right],$$
$$s^{*}(v_n, \Delta v_n) = s_0 + v_n T + \frac{v_n\,\Delta v_n}{2\sqrt{ab}}, \qquad s_n = x_{n-1} - x_n - l_{n-1}, \qquad \Delta v_n = v_n - v_{n-1},$$
where $x$ is distance; $n$ is the $n$th car; $v$ is the speed; $l$ is the length of car; $s^{*}$ is the desired minimum gap; $s_0$ is the shortest distance between cars; $a$ is the maximum acceleration; $s$ is the effective gap; $b$ is the comfortable deceleration; $\delta$ is the parameter; $T$ is the time gap; $v_0$ is the desired speed.
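As an illustration of how the IDM removes the two drawbacks noted above, the following minimal sketch evaluates the standard IDM acceleration for one following car. The function name `idm_acceleration` and all default parameter values are assumptions chosen for illustration (they are commonly quoted IDM values, not the values of Table 1).

```python
import math


def idm_acceleration(v, dv, s, v0=33.3, T=1.6, a=0.73, b=1.67, s0=2.0, delta=4.0):
    """Acceleration (m/s^2) of a following car under the standard IDM.

    v  : speed of the following car (m/s)
    dv : approach rate, v minus the speed of the preceding car (m/s)
    s  : effective gap to the preceding car (m), must be > 0
    All default parameter values are illustrative assumptions.
    """
    # Desired minimum gap: standstill gap + time-gap term + braking term.
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))
    # Free-road term (v / v0)^delta and interaction term (s_star / s)^2.
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)


# With no effective leader (very large gap) a stopped car accelerates at about a.
free_road = idm_acceleration(v=0.0, dv=0.0, s=1e9)
# With zero speed difference but a small gap, the car still brakes.
close_gap = idm_acceleration(v=20.0, dv=0.0, s=5.0)
```

Unlike the earlier models, the acceleration here is nonzero when the speed difference is 0: it is positive on a free road below the desired speed and negative when the gap is shorter than the desired minimum gap.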
Figure 2 presents lane change behavior. A lane change model is needed to simulate driver behavior in the merging sections of freeways, in weaving sections, and so forth [7]. We propose a new lane change model that describes the driver's behavior by means of judgment functions [8, 9]. We focus on a vehicle approaching a confluence point and describe its behavior with several variables: the relative speed between the car and the cars in the current lane, the locations of both the main line cars and the on-ramp cars, the driver's judgment functions for changing lanes, and the driver's desired speed. The judgment function for free merging differs from that for forced merging. Free merging means that a car on the ramp can merge into the main line without being influenced by, and without interfering with, the cars on the main line. When the psychological condition and the physical condition of the forced merging model are both satisfied, the driver changes lanes. Otherwise, the driver continues car-following without changing lanes.
The physical condition represents the ability to change lanes. The lane change model with the driver's judgment functions is expressed as follows:
where , are judgment function; is the distance from reference point; is the speed; is the length of a vehicle; is the judgment time; is the desired speed, subject to normal distribution; are the adjustment coefficients; is the rapid acceleration with upper bound ; and is the rapid deceleration with upper bound . Parameters and are associated with vehicle 's judgment functions for lane change and decide the free merging or the forced merging. Since vehicle judges to accelerate or decelerate to merge into the main line, two events are mutually exclusive.
The function judges whether vehicle accelerates or decelerates to merge according to the given space and speed conditions between vehicles and . Similarly, the function is applied to judge in the relationship between vehicles and . If both and take 0, the distance between two vehicles and is large enough for vehicle to be accommodated to enter into the main line, then the free merging occurs (no acceleration or deceleration behavior is required for vehicle ). Conversely, in the case of the forced merging, we need to examine whether the solution of inequalities (8) to (11) exists. If and are mutually exclusive, then the following two conditions and are obtained. (1) When a rapid brake event does not exist, then , and only an event could happen. (2) When a rapid acceleration event does not exist, then , and only an event is approved.
The lane changing behavior of vehicle could happen when a solution of or exists.
Psychological constraints describe the driver's motivation for a lane change: the present car has not reached its desired speed, and the predicted speed after a lane change is greater than that without a lane change, that is, the driver gains a speed advantage. The predicted accelerations with and without a lane change are both obtained from the IDM. The psychological constraints can then be given by
If (12) has a solution, the driver changes from the current lane to the target lane. Otherwise, the driver does not change lanes.
Lane change behaviors can be characterized as a sequence of three stages: the ability of lane change (physical condition); the motivation of lane change (psychological constraints); the execution of lane change. When lane change models of psychological condition and physical condition are both satisfied, the driver conducts the above-mentioned three stages. Otherwise, the driver continues the car-following behavior without lane change behaviors.
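The three-stage logic can be sketched as follows. This is only an illustration under stated assumptions: the paper's judgment functions are not reproduced here, so the physical condition is replaced by a plain gap check (`gap_is_sufficient`, with a hypothetical `min_gap`), and the psychological constraint is reduced to "below desired speed and a predicted acceleration advantage." All function and parameter names are hypothetical.

```python
def gap_is_sufficient(gap_front, gap_rear, min_gap):
    """Stand-in for the physical condition (NOT the paper's judgment functions):
    both gaps in the target lane must be at least min_gap (m)."""
    return gap_front >= min_gap and gap_rear >= min_gap


def wants_lane_change(v, v_desired, accel_if_change, accel_if_stay):
    """Psychological constraint: the car is below its desired speed and the
    predicted acceleration with a lane change exceeds that without one."""
    return v < v_desired and accel_if_change > accel_if_stay


def decide_lane_change(v, v_desired, accel_if_change, accel_if_stay,
                       gap_front, gap_rear, min_gap=10.0):
    """Return True to execute the lane change, False to keep car-following."""
    able = gap_is_sufficient(gap_front, gap_rear, min_gap)                  # stage 1
    motivated = wants_lane_change(v, v_desired, accel_if_change, accel_if_stay)  # stage 2
    return able and motivated                                               # stage 3
```

In a simulation step, `accel_if_change` and `accel_if_stay` would be computed from the car-following model for the target lane and the current lane, respectively.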
We develop a traffic flow simulation model consisting of a car-following model and a lane change model [10–12]. The basic concept of car-following theories is the relationship between stimulus and response. In the classic car-following theory, the stimulus is represented by the relative speed of the following and leading vehicles, and the response is represented by the acceleration (or deceleration) rate of the following vehicle. The car-following model describes how drivers follow each other in a traffic stream on a single lane. To reproduce traffic flow in two or more lanes, a lane change model that captures lane change behavior is needed. By using the car-following model and the lane change model together, we express dynamic and complex traffic behavior in two or more lanes. Moreover, traffic flow simulation models are applied to reproduce the traffic congestion represented by Helbing and Kerner [13–16].
2.2. The Reinforcement Learning Ramp Metering
Reinforcement learning is a kind of machine learning that treats the problem of how an agent in a given environment determines its actions. The agent observes the present state and chooses an action, and it receives a reward from the environment for the action chosen. Reinforcement learning learns a policy that obtains the most reward over a series of actions [17]. It is a broad class of optimal control methods based on estimating value functions from experience or simulation [18–21].
The model of reinforcement learning ramp metering (RLRM) is shown in Figure 3. is the inflow of the upstream of the main line; is the metering rate; is the outflow of the downstream of main line; is the density of the main line in merging section; is the density of onramp; is the average speed of the main line; is the average speed of onramp.
According to the volume in merging section, upstream traffic is updated by
where called state variable can be collected by the control variable detector. is set as a choosing action variable. Moreover, is the reward based on the choosing action. is the traffic density in the merging section of long. can be obtained by
According to Figure 4, the framework of RLRM is explained briefly. RLRM consists of metering rate choice model, outflow function, value function, and environmental model. The metering rate choice model is a rule to choose the optimal metering rate. Outflow function describes the data of downstream traffic which can be collected and calculated by detectors. Value function presents the total of volumes of downstream traffic. Environmental model predicts inflow and outflow in the next period of time depending on optimal metering rate and inflow.
2.3. RLRM with Complete Information
The RLRM with complete information faces a Markov decision problem (MDP). In addition, the set of inflows and metering rates, denoted , is finite. We typically use a set of matrices
to describe the transition structure. Traffic outflow at time is obtained by
for all , for all , and for all .
If maximum outflow or is given by Bellman formula, we have
or
We can obtain the transition probability and the next outflow from the MDP's complete information. We also assume that traffic outflow is finite. Moreover, we can compute the traffic outflow.
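When the transition matrices are known, the Bellman equation can be solved by value iteration. The following minimal sketch assumes a discounted formulation: the discount factor `gamma`, the layout `P[a][s][t]` of the transition matrices, and the reward table `R[s][a]` (standing in for outflow) are assumptions for illustration, and the toy numbers are not taken from the paper.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10000):
    """Solve V(s) = max_a [ R[s][a] + gamma * sum_t P[a][s][t] * V(t) ].

    P[a][s][t] : probability of moving from state s to state t under action a
    R[s][a]    : immediate reward (here: outflow) for action a in state s
    gamma      : discount factor (an assumption, not given in the text)
    """
    n_states, n_actions = len(R), len(R[0])

    def q_value(s, a, V):
        return R[s][a] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))

    V = [0.0] * n_states
    for _ in range(max_iter):
        new_V = [max(q_value(s, a, V) for a in range(n_actions))
                 for s in range(n_states)]
        converged = max(abs(x - y) for x, y in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    # Greedy policy with respect to the converged values.
    policy = [max(range(n_actions), key=lambda a: q_value(s, a, V))
              for s in range(n_states)]
    return V, policy


# Toy problem: two traffic states, two metering rates, state never changes.
identity = [[1.0, 0.0], [0.0, 1.0]]
P_toy = [identity, identity]
R_toy = [[1.0, 2.0], [3.0, 0.0]]
V_toy, policy_toy = value_iteration(P_toy, R_toy, gamma=0.5)
```

In this toy problem each state simply repeats its best immediate reward forever, so the values are the geometric sums 2/(1 - 0.5) and 3/(1 - 0.5).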
2.4. RLRM without Complete Information
Section 2.3 assumed a Markov decision process with complete information. In practice, this assumption does not hold. Without complete information, the ramp metering rate can instead be determined by evaluating experience. Since the transition probability is then not needed, we can rewrite (18) as
where is real time outflow at time , and constant is transition probability function of . Equation (19) can be replaced by
If expected value of metering rate is not given, we also replace
by
We get
We suppose that the probability of on-ramp control policy can be obtained in (24). Here, it is difficult to satisfy the initial condition. The values associated with an optimal on-ramp control policy are called the optimal ramp inflow and are often written as . We get
where
In (25), the action value function obtained by learning directly approximates (the optimal action value function) by using the current policy. The state variable can be updated depending on the policy.
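A learned action value function that directly approximates the optimal one is the defining property of tabular Q-learning, so the update can be sketched as below. This is an interpretation, not a transcription of (25): the learning rate `alpha`, the discount factor `gamma`, the epsilon-greedy exploration rule, and the table layout `Q[state][action]` are all assumptions for illustration.

```python
import random


def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]


def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


# Two traffic states, two candidate metering rates (toy numbers).
Q = [[0.0, 0.0], [1.0, 2.0]]
q_update(Q, s=0, a=1, reward=10.0, s_next=1, alpha=0.5, gamma=0.9)
greedy_action = epsilon_greedy(Q, s=1, epsilon=0.0, rng=random.Random(0))
```

No transition probabilities appear in the update: the observed reward and next state from the simulation replace the expectation used in the complete-information case.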
When traffic reaches the jam density, the ramp may stay closed for a long period of time, which must be taken into consideration. A maximum waiting time and its metering rate are therefore given. When , the control is selected. To mitigate the curse of dimensionality, the continuous variable is represented in discrete form. The range between 0 and is divided evenly by . is given by
where is the amount of the metering rate, and cell is the function of the bottom integral function. The metering rate is for .
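The discretization step can be illustrated with a short sketch. It assumes that the range from 0 to a maximum metering rate is split into equal steps and that a continuous rate is rounded down to the nearest level (one reading of the rounding function mentioned above). The names `max_rate` and `n_levels` and the numbers in the example are hypothetical.

```python
import math


def discretize_rate(rate, max_rate, n_levels):
    """Map a continuous metering rate to one of n_levels + 1 equally spaced
    levels 0, step, 2*step, ..., max_rate, rounding down (an assumption).

    Returns (level index, discrete rate)."""
    if max_rate <= 0 or n_levels <= 0:
        raise ValueError("max_rate and n_levels must be positive")
    step = max_rate / n_levels
    clipped = min(max(rate, 0.0), max_rate)      # keep the rate inside [0, max_rate]
    index = min(math.floor(clipped / step), n_levels)
    return index, index * step


# Example with a hypothetical maximum of 900 pcu/h and 6 levels (step 150 pcu/h).
level, discrete_rate = discretize_rate(500.0, max_rate=900.0, n_levels=6)
```

With a finite set of levels, the action value function can be stored as a table rather than approximated over a continuous range.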
The algorithm of reinforcement learning on-ramp metering is shown in Figure 5.
(1) Initialize , , , and .
(2) Determine cycle time of a traffic signal .
(3) Update .
(4) Give metering rate by .
(5) Determine the traffic state .
(6) Generate the density by using traffic simulation and choose the metering rate.
(7) If , then update and go to , and otherwise generate the optimal control .
(8) If one closes the ramp, then update waiting time by , and otherwise initialize the waiting time by . If , then update metering rate by .
(9) Operate the optimal control and update .
When the cycle time is over, decide whether to continue the ramp metering. If yes, then collect the data of inflow , go to , and update , that is, ; otherwise, complete the ramp metering.
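The waiting-time safeguard in step (8) can be sketched in isolation. The sketch assumes that a metering rate of 0 means the ramp is closed, that the waiting time grows by one cycle time per closed cycle, that the cap triggers when the waiting time strictly exceeds the maximum, and that the counter is reset once the ramp is forced open. These details, and the names `max_wait` and `release_rate`, are assumptions rather than the paper's definitions.

```python
def apply_waiting_cap(chosen_rate, waited, cycle_time, max_wait, release_rate):
    """Return (rate to apply, updated waiting time) for one control cycle.

    chosen_rate  : rate proposed by the learning controller (0 = ramp closed)
    waited       : accumulated waiting time so far (s)
    release_rate : rate imposed once the waiting time exceeds max_wait
    """
    if chosen_rate == 0:
        waited += cycle_time          # ramp closed for another cycle
    else:
        waited = 0.0                  # ramp open: reset the waiting time
    if waited > max_wait:
        return release_rate, 0.0      # force the ramp open (assumed reset)
    return chosen_rate, waited


# A controller that keeps closing the ramp is overridden on the fourth cycle.
applied, waited = [], 0.0
for proposed in [0, 0, 0, 0]:
    rate, waited = apply_waiting_cap(proposed, waited, cycle_time=20.0,
                                     max_wait=60.0, release_rate=300)
    applied.append(rate)
```

In a full loop, this check would sit between choosing the metering rate and operating the control in each cycle.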
3. Data Combination and Reduction
Our aim is to design a reinforcement learning control law for the ramp metering controller without complete information. We need to control the inflow from the ramp into the main line, and the metering rate should be given by traffic states. Traffic flow simulation is conducted to demonstrate this control of the ramp metering. In our simulation, we set the main line length on highways to 1000 m, the ramp length to 200 m, and the length of the merging section of the main line and ramp to 100 m. Parameters of RLRM are shown in Table 1, and the metering rate matrix is .
Table 2 shows the inflow of cases A, B, C, D, E, and F. The inflow rate of the main line increases from 1200 pcu/hour in case A to 2500 pcu/hour in case F. Moreover, the inflow rate of the ramp rises from 300 pcu/hour in case A to 900 pcu/hour in case F. The cycle length of the fixed-time control is 20 s, which consists of 15 s of green time and 5 s of red time.
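The fixed-time baseline and the end-point cases of Table 2 can be written down in a few lines. The sketch assumes that each cycle starts with the green phase, which the text does not state; the function and dictionary names are hypothetical, and only cases A and F are included because the intermediate inflows are not given in the text.

```python
CYCLE_S = 20.0   # cycle length of the fixed-time control (s)
GREEN_S = 15.0   # green time per cycle (s); the remaining 5 s are red


def fixed_time_is_green(t, cycle=CYCLE_S, green=GREEN_S):
    """True if the ramp signal is green at time t (s), green phase first."""
    return (t % cycle) < green


# (main line inflow, ramp inflow) in pcu/h for the two end-point cases.
INFLOW = {"A": (1200, 300), "F": (2500, 900)}
total_inflow = {case: main + ramp for case, (main, ramp) in INFLOW.items()}
green_share = GREEN_S / CYCLE_S
```

The signal is green for 75% of the time regardless of demand, which is the key difference from a controller that adapts the metering rate to the traffic state.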
4. Result and Discussion
The results of no control, fixed-time control, and RLRM are shown in Figures 6–9. Total inflow increases from 1500 pcu/h in case A to 3400 pcu/h in case F. Figure 6 presents average speed and its rate compared to no control. In case A, the average speed with no control, about 108 km/h, is faster than with fixed-time control and RLRM. Similar results are shown in case B. In case C, the average speed with no control, about 79 km/h, is faster than with fixed-time control and slower than with RLRM. In case F, the average speed with no control, about 51 km/h, is slower than with fixed-time control and RLRM. In terms of average speed, the congestion relief rates of fixed-time control from case A to case F are -7.80%, -6.65%, -3.77%, 0.26%, 2.70%, and 8.26%, respectively. The congestion relief rates of RLRM from case A to case F are -6.31%, -6.49%, 5.69%, 13.55%, 20.50%, and 18.18%, respectively.
Figure 7 describes density and its rate compared to no control. In case A, the densities of fixed-time control and RLRM are about 38 pcu/km, an increase of about 60%. In case C, the densities of fixed-time control and RLRM are about 52 pcu/km and 45 pcu/km, about 11.46% and 22.60% decreases. Densities of fixed-time control, no control, and RLRM are about 120 pcu/km. In terms of density, the congestion relief rates of fixed-time control from case A to case F are -57.55%, -19.92%, -11.46%, -21.35%, 7.6%, and 0.39%, respectively. The congestion relief rates of RLRM from case A to case F are -59.59%, -22.05%, 22.60%, 8.18%, 9.65%, and 3.42%, respectively.
Figure 8 shows outflow and its rate compared to no control. In case A, the outflow rate rises from 1700 pcu/h without control to 2308 pcu/h with fixed-time control and 1800 pcu/h with RLRM. Moreover, increases of 3.82% and 7.85% in outflow rate are shown in case C, and increases of 18.97% and 30.65% are observed in case F. The congestion relief rates of fixed-time control from case A to case F are 35.76%, -14.25%, 7.82%, 7.93%, 12.51%, and 18.97%, respectively. The congestion relief rates of RLRM from case A to case F are 7.06%, 0.58%, 3.85%, 10.47%, 54.63%, and 30.65%, respectively.
Figure 9 represents travel time and its rate compared to no control. In case A, increases in travel time of 6.25% and 9.38% are observed. In case C, travel time rises from 342 s without control to 370 s with fixed-time control and falls to 330 s with RLRM. In case F, travel time falls from 617 s to 469 s with fixed-time control and to 343 s with RLRM. The congestion relief rates of fixed-time control from case A to case F are -6.25%, -25.26%, -8.19%, 7.36%, 27.06%, and 23.99%, respectively. The congestion relief rates of RLRM from case A to case F are -9.38%, -5.26%, 3.51%, 38.17%, 40.32%, and 44.41%, respectively.
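The congestion relief rate is not defined explicitly in the text. The sketch below uses a convention inferred from the reported numbers: the relative change with respect to no control, signed so that an improvement is positive (a fall in travel time or density, a rise in speed or outflow). The function name and the `higher_is_better` flag are hypothetical; the inputs are the travel times and outflows quoted above.

```python
def relief_rate(no_control, controlled, higher_is_better):
    """Congestion relief rate in percent, positive when control improves things.

    Convention inferred from the reported values, not stated in the text."""
    change = (controlled - no_control) / no_control
    return 100.0 * (change if higher_is_better else -change)


# Travel time (s), lower is better.
case_c_fixed = round(relief_rate(342, 370, higher_is_better=False), 2)
case_c_rlrm = round(relief_rate(342, 330, higher_is_better=False), 2)
case_f_fixed = round(relief_rate(617, 469, higher_is_better=False), 2)
case_f_rlrm = round(relief_rate(617, 343, higher_is_better=False), 2)

# Outflow (pcu/h), higher is better.
case_a_outflow_fixed = round(relief_rate(1700, 2308, higher_is_better=True), 2)
```

Under this convention the travel-time values for cases C and F and the fixed-time outflow value for case A reproduce the percentages listed in the text (-8.19, 3.51, 23.99, 44.41, and 35.76).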
According to Figures 6–9, when traffic inflows are low, the controls are not efficient. The controls become more efficient as traffic inflows increase. When traffic inflows are high, the controls are very efficient, and RLRM is the optimal control. Moreover, the curves in Figures 7–9 show that the assessment indicators of fixed-time control fluctuate around those of no control. Fixed-time control is therefore less stable than RLRM. The automaticity, memory, and performance feedback of RLRM are also shown.
5. Conclusion
On-ramp metering ensures that traffic moves at a speed approximately equal to the optimum speed, which results in maximum flow rates or minimum travel time. This study develops an RLRM model without complete information, which consists of prediction tools based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. Numerical cases are given to compare RLRM with no control and fixed-time control. For cases A, B, C, D, E, and F, fixed-time control and RLRM are assessed in terms of average speed, density, outflow rate, and travel time. When traffic inflow is low, the controls are not efficient, and there is little difference among no control, fixed-time control, and RLRM. When traffic inflow is high, the controls are very efficient, and RLRM is the optimal control. Moreover, the greater the inflow, the larger the effect of control. In addition, RLRM is more stable than fixed-time control.
Acknowledgments
This research is funded by the National Natural Science Foundation of China (Grant no. 51008201). It is also sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China. Moreover, it is a key project supported by the Scientific Research Foundation, Education Department of Hebei Province of China (Grant no. GD2010235) and the Society Science Development Program of Hebei Province of China (Grant no. 201004068).