Abstract

This paper develops a model of reinforcement learning ramp metering (RLRM) without complete information, which is applied to alleviate traffic congestion at on-ramps. RLRM consists of a prediction tool based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. It is a dynamic process with automaticity, memory, and performance feedback. Numerical cases are given to demonstrate RLRM, comparing outflow rate, density, average speed, and travel time against no control and fixed-time control. Results indicate that the greater the inflow, the greater the effect of control, and that RLRM is more stable than fixed-time control.

1. Introduction

Increasing dependence on car-based travel has led to the daily occurrence of recurrent and nonrecurrent freeway congestion, not only in China but around the world. Congestion on highways forms when demand exceeds capacity. Recurrent congestion substantially reduces the available infrastructure capacity at rush hour, that is, at the time this capacity is most urgently needed. Congestion also causes delays, increases environmental pollution, and reduces traffic safety.

Ramp metering is essential to the efficient operation of highways, particularly when volumes are high. According to Papageorgiou and others, ramp metering strategies can be divided roughly into reactive and proactive types [1]. DC (demand-capacity), OCC (occupancy), and ALINEA [2] are among the well-known local responsive ramp metering strategies [3]. In DC, the actual upstream volume is measured at regular short intervals and is then compared to the downstream capacity, which may be estimated from downstream traffic conditions. OCC uses a predetermined relationship between occupancy rate and lane volume, developed from data previously collected on the highway adjacent to the ramp being considered. ALINEA sets the on-ramp metering rate from measured mainline traffic. ALINEA has been applied in several European countries and has been validated more extensively than DC and OCC. Iwata, Tsubota, and Kawashima proposed a ramp metering technique using values predicted by a traffic simulator [4]. Reinforcement learning ramp metering based on a traffic simulation model with desired speed was proposed by Wang et al. [5]. The aim of this study is to propose reinforcement learning ramp metering without complete information.

2. Methods

2.1. Traffic Flow Simulation Model

Figure 1 describes car-following behavior. In a microsimulation model, the fundamental modeled behavior is car-following, which captures the driver's response to characteristics such as the distance to the adjacent vehicle ahead and the relative speed.

In 1953, Pipes proposed the following basic differential equation model for car-following behavior:
$$\ddot{x}_{n+1}(t) = a\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right], \quad (1)$$
where $\ddot{x}$, $\dot{x}$, and $x$ denote the acceleration, speed, and distance from the reference point of vehicle $n$, respectively, and $a$ is a constant. In this model, the acceleration of the vehicle that follows a leading vehicle is proportional to the speed difference between the two vehicles. It is assumed that the time delay with which the vehicle responds to the speed difference is small enough to be neglected. To remove this drawback, Chandler introduced a reaction delay time $T$. Based on the rationale that the acceleration of the following car is also influenced by its own speed and by the distance between the vehicles, Gazis, Herman, and Rothery proposed the general form of the car-following model:
$$\ddot{x}_{n+1}(t+T) = a\,\frac{\left[\dot{x}_{n+1}(t+T)\right]^{m}\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right]}{\left[x_{n}(t) - x_{n+1}(t)\right]^{l}}. \quad (2)$$

Newell proposed, based on real data, the following model in which the acceleration is proportional to an exponential function of the distance between the vehicles:
$$\ddot{x}_{n+1}(t+T) = a_{1}\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right] e^{-a_{2}/\left[x_{n}(t) - x_{n+1}(t) - a_{3}\right]}. \quad (3)$$
Although the above modifications have improved the realism of car-following models, they have two drawbacks. When no preceding vehicle exists, the model implies that a car simply maintains its initial speed. Furthermore, when the speed difference is 0, the acceleration is 0; this implies the unrealistic behavior that the following car does not brake even when the distance to the preceding car approaches 0 and does not accelerate even when the distance is very long. To solve these problems, Treiber and Helbing introduced the intelligent driver model (IDM) [6], which adds a desired speed and a minimum gap between cars. The IDM is given as
$$\dot{v}_{n} = a\left[1 - \left(\frac{v_{n}}{v_{0}}\right)^{\delta} - \left(\frac{s^{*}(v_{n}, \Delta v_{n})}{s_{n}}\right)^{2}\right], \quad (4)$$
$$s^{*}(v, \Delta v) = s_{0} + \max\left(Tv + \frac{v\,\Delta v}{2\sqrt{ab}},\, 0\right), \quad (5)$$
$$s_{n}(t) = \left[x_{n-1} - x_{n}\right] - l, \quad (6)$$
$$\Delta v_{n}(t) = v_{n}(t) - v_{n-1}(t), \quad (7)$$

where π‘₯ is distance; 𝑛 is the 𝑛th car; 𝑣 is the speed; 𝑙 is the length of car; 𝑠0 is the desired minimum gap; π‘Ž is the maximum acceleration; π‘ βˆ— is the effective gap; 𝑏 is the comfortable deceleration (π‘Žβ‰€π‘); 𝛿 is the parameter; 𝑇 is the time gap; 𝜈0 is the desired speed.

Figure 2 presents lane change behavior. A lane change model is needed to simulate driver behavior in merging sections of freeways, merging behavior in weaving sections, and so forth [7]. We propose a new lane change model that describes driver behavior through judgment functions [8, 9]. We focus on a vehicle approaching a merge point and describe its behavior with several variables: the relative speed between the vehicle and the vehicles in the target lane, the locations of both the mainline vehicles and the on-ramp vehicle, the driver's judgment functions for changing lane, and the driver's desired speed. The driver's judgment function for free merging differs from the judgment function for forced merging. Free merging means that a vehicle on the ramp can merge into the main line without influence on, or interference from, vehicles on the main line. When both the psychological condition and the physical condition of the forced merging model are satisfied, the driver performs the lane change; otherwise, the driver continues car-following without changing lane.

The physical condition represents the ability to change lane. The lane change model with the driver's judgment functions is expressed as follows:
$$h = x_{f} - x_{c} - L + \left(v_{f} - v_{c}\right)t + \frac{(-A+B)t^{2}}{2} + \delta\,\frac{v_{0f} - v_{f}}{v_{0f}}S + \zeta\,\frac{v_{0c} - v_{c}}{v_{0c}}S \ge S, \quad (8)$$
$$g = x_{c} - x_{b} - L + \left(v_{c} - v_{b}\right)t + \frac{(A-B)t^{2}}{2} - \theta\,\frac{v_{0c} - v_{f}}{v_{0c}}S - \xi\,\frac{v_{0c} - v_{b}}{v_{0c}}S \ge S, \quad (9)$$
$$0 \le A \le e, \quad (10)$$
$$0 \le B \le d, \quad (11)$$

where $h$ and $g$ are judgment functions; $x$ is the distance from the reference point; $v$ is the speed; $L$ is the length of a vehicle; $t$ is the judgment time; $v_{0}$ is the desired speed, which follows a normal distribution; $\delta, \zeta, \theta, \xi \in [0,1]$ are adjustment coefficients; $A$ is the rapid acceleration with upper bound $e$; and $B$ is the rapid deceleration with upper bound $d$. Parameters $A$ and $B$ are associated with vehicle $c$'s judgment functions for lane change and decide between free merging and forced merging. Since vehicle $c$ decides either to accelerate or to decelerate in order to merge into the main line, the two events are mutually exclusive.

The function $h$ judges whether vehicle $c$ can accelerate or decelerate to merge, given the spacing and speed conditions between vehicles $f$ and $c$. Similarly, the function $g$ judges the relationship between vehicles $c$ and $b$. If both $A$ and $B$ are 0, the gap between vehicles $f$ and $b$ is large enough to accommodate vehicle $c$ entering the main line, and free merging occurs (no acceleration or deceleration is required of vehicle $c$). Conversely, in the case of forced merging, we need to examine whether a solution of inequalities (8) to (11) exists. Since $A$ and $B$ are mutually exclusive, the following two conditions are obtained.
(1) When a rapid braking event $B$ does not occur, then $B = 0$, and only an acceleration event $A$ can happen.
(2) When a rapid acceleration event $A$ does not occur, then $A = 0$, and only a braking event $B$ can happen.

Vehicle $c$ can change lane when a solution of condition (1) or (2) exists.
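To make the physical condition concrete, the following sketch evaluates the judgment functions (8) and (9) for an on-ramp vehicle $c$ between mainline vehicles $f$ (ahead) and $b$ (behind), and searches a small grid of acceleration values $A$ and deceleration values $B$ subject to (10) and (11); the case $A = B = 0$ corresponds to free merging, any other solution to forced merging. All numeric defaults and the grid search are illustrative assumptions.

```python
def merge_physically_feasible(xf, vf, xc, vc, xb, vb, v0f, v0c,
                              L=5.0, t=2.0, S=10.0, e=2.0, d=3.0,
                              delta=0.5, zeta=0.5, theta=0.5, xi=0.5, steps=10):
    """Evaluate the physical condition (8)-(11) for on-ramp vehicle c merging
    between mainline vehicles f (ahead) and b (behind). Returns (feasible, A, B);
    A == B == 0 corresponds to free merging, otherwise forced merging."""
    def h(A, B):  # judgment towards the leader f, inequality (8)
        return (xf - xc - L + (vf - vc) * t + (-A + B) * t ** 2 / 2
                + delta * (v0f - vf) / v0f * S + zeta * (v0c - vc) / v0c * S)

    def g(A, B):  # judgment towards the follower b, inequality (9)
        return (xc - xb - L + (vc - vb) * t + (A - B) * t ** 2 / 2
                - theta * (v0c - vf) / v0c * S - xi * (v0c - vb) / v0c * S)

    # A and B are mutually exclusive: try free merging first, then A alone, then B alone
    candidates = [(0.0, 0.0)]
    candidates += [(e * k / steps, 0.0) for k in range(1, steps + 1)]
    candidates += [(0.0, d * k / steps) for k in range(1, steps + 1)]
    for A, B in candidates:
        if h(A, B) >= S and g(A, B) >= S:
            return True, A, B
    return False, None, None
```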

Psychological constraints describe the driver's motivation for a lane change: the driver is motivated to change lane if the current vehicle has not reached its desired speed and the predicted speed after a lane change is greater than that without a change, that is, if a speed advantage is gained. Let $a_{1}$ and $a_{2}$ denote the predicted accelerations with and without a lane change, respectively; both are obtained from the IDM. The psychological constraint is then given by
$$a_{1} < a_{2}. \quad (12)$$

If (12) is satisfied, the driver attempts to change from the current lane to the target lane; otherwise, the driver does not perform the lane change maneuver.

Lane change behavior can be characterized as a sequence of three stages: the ability to change lane (the physical condition), the motivation to change lane (the psychological constraint), and the execution of the lane change. When both the physical condition and the psychological constraint are satisfied, the driver carries out these three stages; otherwise, the driver continues car-following without changing lane.

We develop a traffic flow simulation model consisting of the car-following model and the lane change model [10–12]. The basic concept of car-following theory is the relationship between stimulus and response. In the classic car-following theory, the stimulus is the relative speed of the following and leading vehicles, and the response is the acceleration (or deceleration) of the following vehicle. The car-following model describes how drivers follow each other in a traffic stream on a single lane. To reproduce traffic flow on two or more lanes, a lane change model that captures lane change behavior is needed. By combining the car-following model and the lane change model, we express dynamic and complex traffic behavior on two or more lanes. Moreover, traffic flow simulation models of this kind have been applied to reproduce the traffic congestion patterns described by Helbing and Kerner [13–16].
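A minimal single-lane update step of such a simulation might look like the sketch below, which applies a car-following law (for example the IDM sketch above) with explicit Euler integration; the time step, car length, and boundary handling are assumptions, and a lane-change check (physical condition, then psychological constraint) would be applied per vehicle before this update.

```python
def simulate_lane(vehicles, accel_fn, dt=0.5, length_m=1000.0, car_length=5.0):
    """One Euler step of single-lane car-following on a road of length_m metres.
    vehicles: list of dicts {'x': position [m], 'v': speed [m/s]} ordered front to back;
    accel_fn(v, delta_v, s) is a car-following law such as the IDM sketch above."""
    for i, veh in enumerate(vehicles):
        if i == 0:
            acc = accel_fn(veh['v'], 0.0, 1e6)         # no leader: effectively a free road
        else:
            lead = vehicles[i - 1]
            gap = lead['x'] - veh['x'] - car_length    # net gap, as in (6)
            acc = accel_fn(veh['v'], veh['v'] - lead['v'], gap)  # speed difference, as in (7)
        veh['v'] = max(veh['v'] + acc * dt, 0.0)
        veh['x'] += veh['v'] * dt
    return [veh for veh in vehicles if veh['x'] <= length_m]      # vehicles still on the road
```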

2.2. The Reinforcement Learning Ramp Metering

Reinforcement learning is a kind of machine learning that treats the problem of an agent choosing actions in an environment, where each action is taken after observing the present state. The agent receives a reward from the environment for the chosen action. Reinforcement learning seeks a policy from which the most reward is obtained through a series of actions [17]. It is a broad class of optimal control methods that estimate value functions from experience or simulation [18–21].
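Read in ramp metering terms, the state is the measured inflow, the action is the metering rate, and the reward is the resulting outflow. A generic interaction loop could be sketched as follows, with `env` and `policy` as placeholders for a traffic simulator and a learned policy.

```python
def run_metering_episode(env, policy, cycles=180):
    """Generic agent-environment loop for ramp metering: observe the state (inflow),
    choose an action (metering rate), receive a reward (outflow). `env` and `policy`
    are placeholders, not part of the paper's formulation."""
    state = env.reset()                      # initial inflow reading
    total_outflow = 0.0
    for _ in range(cycles):
        rate = policy(state)                 # choose a metering rate for this cycle
        state, outflow = env.step(rate)      # next inflow and the outflow obtained
        total_outflow += outflow
    return total_outflow
```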

The model of reinforcement learning ramp metering (RLRM) is shown in Figure 3. Here $q_{\mathrm{in}}$ is the inflow upstream of the main line; $r$ is the metering rate; $q_{\mathrm{out}}$ is the outflow downstream of the main line; $d_{m}$ is the density of the main line in the merging section; $d_{r}$ is the density of the on-ramp; $v_{m}$ is the average speed of the main line; and $v_{r}$ is the average speed of the on-ramp. The net volume in the merging section is
$$q = q_{\mathrm{in}} + r - q_{\mathrm{out}}. \quad (13)$$

According to the volume $q$ in the merging section, the upstream traffic $q_{\mathrm{in}}$ is updated by
$$q_{\mathrm{in}}^{t+1} \leftarrow q_{\mathrm{in}}^{t+1} + q, \quad (14)$$

where π‘žin called state variable can be collected by the control variable detector. π‘Ÿ is set as a choosing action variable. Moreover, π‘žout is the reward based on the choosing action. 𝜌𝐿 is the traffic density in the merging section of 𝐿 long. 𝜌𝐿 can be obtained by𝜌𝐿=π‘žin𝑑+1𝐿.(15)

Figure 4 explains the framework of RLRM briefly. RLRM consists of a metering rate choice model, an outflow function, a value function, and an environmental model. The metering rate choice model is the rule for choosing the optimal metering rate. The outflow function describes the downstream traffic data that can be collected and calculated by detectors. The value function represents the total downstream traffic volume. The environmental model predicts the inflow and outflow in the next period from the optimal metering rate and the current inflow.

2.3. RLRM with Complete Information

The RLRM with complete information faces a Markov decision problem (MDP). Let $S$ denote the set of inflow states and $A(q_{\mathrm{in}})$ the set of metering rates available in state $q_{\mathrm{in}}^{t} \in S$; both are finite. We typically use a set of matrices
$$P^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'} = \Pr\left\{q_{\mathrm{in}}^{t+1} = q_{\mathrm{in}}' \mid q_{\mathrm{in}}^{t} = q_{\mathrm{in}},\, r_{t} = r\right\} \quad (16)$$

to describe the transition structure. The expected traffic outflow at time $t$ is obtained by
$$R^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'} = E\left\{q_{\mathrm{out}}^{t+1} \mid q_{\mathrm{in}}^{t} = q_{\mathrm{in}},\, r_{t} = r,\, q_{\mathrm{in}}^{t+1} = q_{\mathrm{in}}'\right\}, \quad (17)$$

for all π‘žinβˆˆπ‘†, for all π‘Ÿβˆˆπ΄(π‘žin), and for all π‘žinξ…žβˆˆπ‘†+.

If the maximum outflow $V^{*}$ or $Q^{*}$ is given by the Bellman equation, we have
$$V^{\pi}(q_{\mathrm{in}}) = \max_{r} E\left\{q_{\mathrm{out}}^{t+1} + \lambda V^{*}\left(q_{\mathrm{in}}^{t+1}\right) \mid q_{\mathrm{in}}^{t} = q_{\mathrm{in}},\, r_{t} = r\right\} = \max_{r} \sum_{q_{\mathrm{in}}'} P^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'}\left[R^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'} + \lambda V^{\pi}\left(q_{\mathrm{in}}'\right)\right], \quad (18)$$

or
$$Q^{*}(q_{\mathrm{in}}, r) = E\left\{q_{\mathrm{out}}^{t+1} + \lambda \max_{r'} Q^{*}\left(q_{\mathrm{in}}^{t+1}, r'\right) \mid q_{\mathrm{in}}^{t} = q_{\mathrm{in}},\, r_{t} = r\right\} = \sum_{q_{\mathrm{in}}'} P^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'}\left[R^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'} + \lambda \max_{r'} Q^{*}\left(q_{\mathrm{in}}', r'\right)\right]. \quad (19)$$

With the MDP's complete information, we can obtain the transition probability $P^{r}_{q_{\mathrm{in}} q_{\mathrm{in}}'}$ and the next value $V^{\pi}(q_{\mathrm{in}})$, and we assume that the traffic outflow is finite. The traffic outflow can then be computed.
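For illustration, the complete-information case can be solved by iterating the Bellman equation (19) over a discretised inflow and metering-rate grid; the sketch below assumes the transition probabilities and expected outflows are given as arrays, which is exactly the information that is unavailable in the next subsection.

```python
import numpy as np

def q_value_iteration(P, R, lam=0.9, tol=1e-6):
    """Solve the Bellman optimality equation (19) when transition probabilities
    P[r, s, s'] and expected outflows R[r, s, s'] are fully known.
    s indexes discretised inflow states, r indexes metering rates, lam is the
    discount factor; shapes and discretisation are illustrative assumptions."""
    n_rates, n_states, _ = P.shape
    Q = np.zeros((n_states, n_rates))
    while True:
        V = Q.max(axis=1)                          # best value of each next state
        Q_new = np.empty_like(Q)
        for r in range(n_rates):
            # expected immediate outflow plus discounted value of the next inflow state
            Q_new[:, r] = (P[r] * (R[r] + lam * V[None, :])).sum(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```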

2.4. RLRM without Complete Information

Section 2.3 supposes a Markov decision process with complete information, but this assumption does not hold in practice. We can instead set the ramp metering rate by evaluating experience without complete information. Since the transition probabilities are no longer needed, we can rewrite (18) as
$$V^{\pi}\left(q_{\mathrm{in}}^{t}\right) \leftarrow V^{\pi}\left(q_{\mathrm{in}}^{t}\right) + a_{t}\left[q_{\mathrm{out}}^{t} - V^{\pi}\left(q_{\mathrm{in}}^{t}\right)\right], \quad (20)$$

where π‘žout𝑑 is real time outflow at time 𝑑, and constant π‘Žπ‘‘ is transit probability function of 𝑑. Equation (19) can be replaced byπ‘„ξ€·π‘žin𝑑,π‘Ÿπ‘‘ξ€Έξ€·βŸ΅π‘„π‘žin𝑑,π‘Ÿπ‘‘ξ€Έ+π‘Žπ‘‘ξ€Ίξ€½π‘„ξ€·π‘žout+πœ†πΈπ‘žin𝑑+1,π‘Ÿπ‘‘+1βˆ£π‘ π‘‘ξ€·ξ€Έξ€Ύβˆ’π‘„π‘žin𝑑,π‘Ÿπ‘‘.ξ€Έξ€»(21)

If the expected value under the metering rate is not given, we also replace
$$q_{\mathrm{out}} + \lambda E\left\{Q\left(q_{\mathrm{in}}^{t+1}, r_{t+1}\right) \mid s_{t}\right\} - Q\left(q_{\mathrm{in}}^{t}, r_{t}\right) \quad (22)$$

by
$$q_{\mathrm{out}} + \lambda \sum_{a} \pi\left(q_{\mathrm{in}}^{t}, r_{t}\right) Q\left(q_{\mathrm{in}}^{t+1}, r_{t}\right) - Q\left(q_{\mathrm{in}}^{t}, r_{t}\right). \quad (23)$$

We get
$$Q\left(q_{\mathrm{in}}^{t}, r_{t}\right) \leftarrow Q\left(q_{\mathrm{in}}^{t}, r_{t}\right) + a_{t}\left[q_{\mathrm{out}} + \lambda \sum_{a} \pi\left(q_{\mathrm{in}}^{t}, r_{t}\right) Q\left(q_{\mathrm{in}}^{t+1}, r_{t}\right) - Q\left(q_{\mathrm{in}}^{t}, r_{t}\right)\right]. \quad (24)$$

We suppose that the probability of the on-ramp control policy $\pi$ can be obtained in (24); here, however, it is difficult to satisfy the initial condition. The values $\sum_{a}\pi(q_{\mathrm{in}}^{t}, r_{t})\,Q(q_{\mathrm{in}}^{t+1}, r_{t})$ associated with an optimal on-ramp control policy are called the optimal ramp inflow and are often written as $\max_{r} Q(q_{\mathrm{in}}^{t+1}, r)$. We get
$$Q\left(q_{\mathrm{in}}^{t}, r_{t}\right) \leftarrow Q\left(q_{\mathrm{in}}^{t}, r_{t}\right) + a_{t}\left[q_{\mathrm{out}} + \lambda \max_{r} Q\left(q_{\mathrm{in}}^{t+1}, r\right) - Q\left(q_{\mathrm{in}}^{t}, r_{t}\right)\right], \quad (25)$$

where
$$\sum_{t=1}^{\infty} a_{t} = \infty, \quad (26)$$
$$\sum_{t=1}^{\infty} a_{t}^{2} < \infty. \quad (27)$$

In (25), the action value function $Q$ learned under the current policy directly approximates $Q^{*}$ (the optimal action value function), and the state variable is updated according to the policy.
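A minimal tabular sketch of the update rule (25), together with an exploration scheme, is given below; the epsilon-greedy action choice, the dictionary representation of $Q$, and the constant step size are assumptions not fixed by the text.

```python
import random

def q_learning_update(Q, q_in, r, q_out, q_in_next, rates, alpha, lam=0.9):
    """One application of update rule (25). Q is a dict mapping
    (inflow_state, metering_rate) -> value; q_out is the observed outflow (reward);
    alpha is the step size a_t, assumed to satisfy (26)-(27)."""
    best_next = max(Q.get((q_in_next, k), 0.0) for k in rates)   # max_r Q(q_in^{t+1}, r)
    old = Q.get((q_in, r), 0.0)
    Q[(q_in, r)] = old + alpha * (q_out + lam * best_next - old)

def choose_rate(Q, q_in, rates, epsilon=0.1):
    """Epsilon-greedy choice of the metering rate; the exploration scheme is an assumption."""
    if random.random() < epsilon:
        return random.choice(rates)
    return max(rates, key=lambda k: Q.get((q_in, k), 0.0))
```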

When the traffic reaches jam density, the ramp may be closed for a long period of time, which must be taken into account. A maximum waiting time $T_{\max}$ and its associated metering rate $r_{T}$ are given; when $\sum_{n=1}^{m} T_{S_{n}} > T_{\max}$, the control $(q_{\mathrm{in}}^{t}, r_{T})$ is selected. To avoid the curse of dimensionality, the continuous metering rate $r_{t}$ is discretised: the interval between 0 and $r_{\max}$ is divided into steps of $r_{n}$, and the number of discrete rates is given by
$$N_{r} = \mathrm{cell}\left(\frac{r_{\max}}{r_{n}}\right), \quad (28)$$

where π‘π‘Ÿ is the amount of the metering rate, and cell is the function of the bottom integral function. The metering rate is max(π‘˜π‘Ÿπ‘›,π‘Ÿmax)for π‘˜βˆˆπ‘.

The algorithm of reinforcement learning on-ramp metering is shown in Figure 5.
(1) Initialize $Q$, $q_{\mathrm{out}}$, $q_{\mathrm{in}}$, and $k$.
(2) Determine the cycle time $t$ of the traffic signal.
(3) Update $q_{\mathrm{in}}^{t}$.
(4) Set the metering rate $r_{t} = k \times r_{n}$.
(5) Determine the traffic state $(q_{\mathrm{in}}^{t}, r_{t})$.
(6) Generate the density $\rho_{L}$ by traffic simulation and evaluate the metering rate.
(7) If $r_{t} < r_{\max}$, update $k = k + 1$ and go to (4); otherwise generate the optimal control $(q_{\mathrm{in}}^{t}, r^{*})$.
(8) If the ramp is closed, update the waiting time by $T = T + t$; otherwise reset $T = 0$. If $T > T_{\max}$, replace the metering rate by $r_{T}$, that is, $r_{T} \to r^{*}$.
(9) Operate the optimal control $(q_{\mathrm{in}}^{t}, r^{*})$ and update $Q$.

When the cycle time $t$ is over, decide whether to continue the ramp metering. If so, collect the inflow data $q_{\mathrm{in}}^{t+1}$, go to (3), and update $q_{\mathrm{in}}^{t}$, that is, $q_{\mathrm{in}}^{t+1} \to q_{\mathrm{in}}^{t}$; otherwise, terminate the ramp metering.
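A compact sketch of this loop, with the traffic simulator left as a placeholder, could look as follows; the waiting-time limit, exploration rate, and step size are illustrative assumptions.

```python
import random

def rlrm_control_loop(sim, Q, rates, cycle_s=20, t_max_s=120,
                      epsilon=0.1, alpha=0.1, lam=0.9):
    """Sketch of the control loop of Figure 5. `sim` is a placeholder simulator
    exposing measure_inflow(), run_cycle(rate) -> outflow, and keep_metering()."""
    waiting = 0.0
    q_in = sim.measure_inflow()                                   # step (3)
    while sim.keep_metering():
        # steps (4)-(7): pick the metering rate with the best learned value (epsilon-greedy)
        if random.random() < epsilon:
            r_star = random.choice(rates)
        else:
            r_star = max(rates, key=lambda r: Q.get((q_in, r), 0.0))
        # step (8): do not keep the ramp closed longer than t_max_s
        waiting = waiting + cycle_s if r_star == 0 else 0.0
        if waiting > t_max_s:
            r_star = min(r for r in rates if r > 0)               # release the ramp queue
        # step (9): apply the control for one cycle, observe the outflow, update Q
        q_out = sim.run_cycle(r_star)
        q_in_next = sim.measure_inflow()
        best_next = max(Q.get((q_in_next, r), 0.0) for r in rates)
        old = Q.get((q_in, r_star), 0.0)
        Q[(q_in, r_star)] = old + alpha * (q_out + lam * best_next - old)
        q_in = q_in_next                                          # back to step (3)
    return Q
```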

3. Data Combination and Reduction

Our aim is to design a reinforcement learning control law for the ramp metering controller without complete information. We need to control the inflow from the ramp into the main line, with the metering rate determined by the traffic state. Traffic flow simulation is conducted to demonstrate this ramp metering control. In the simulation, the main line length is set to 1000 m, the ramp length to 200 m, and the length of the merging section of the main line and ramp to 100 m. The parameters of RLRM are shown in Table 1, and the set of metering rates is {0, 100, 200, 300, ..., 900, 1000, 1100}.

Table 2 shows the inflows of cases A, B, C, D, E, and F. The inflow rate of the main line increases from 1200 pcu/hour in case A to 2500 pcu/hour in case F, and the inflow rate of the ramp rises from 300 pcu/hour in case A to 900 pcu/hour in case F. The cycle length of the fixed-time control is 20 s, consisting of 15 s of green time and 5 s of red time.
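For reference, the scenario values stated in this section can be collected in a single configuration; everything not given in the text (in particular the detailed parameters of Table 1 and the intermediate inflows of cases B–E) is deliberately left out.

```python
# Geometry and demand for the numerical cases, using only the values stated above.
SCENARIO = {
    "main_line_length_m": 1000,
    "ramp_length_m": 200,
    "merge_section_length_m": 100,
    "metering_rates_pcu_per_h": list(range(0, 1200, 100)),   # {0, 100, ..., 1100}
    "fixed_time_signal_s": {"cycle": 20, "green": 15, "red": 5},
    "inflow_cases_pcu_per_h": {   # (main line, ramp); cases B-E lie between these endpoints
        "A": (1200, 300),
        "F": (2500, 900),
    },
}
```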

4. Result and Discussion

The results of no control, fixed-time control, and RLRM are shown in Figures 6–9. Total inflow increases from 1500 pcu/h in case A to 3400 pcu/h in case F. Figure 6 presents the average speed and its rate of change relative to no control. The average speed under no control, about 108 km/h, is faster than under fixed-time control and RLRM in case A; similar results are obtained in case B. In case C, the average speed under no control, about 79 km/h, is faster than under fixed-time control but slower than under RLRM. In case F, the average speed under no control, about 51 km/h, is slower than under both fixed-time control and RLRM. In terms of average speed, the congestion relief rates of fixed-time control from case A to case F are βˆ’7.80%, βˆ’6.65%, βˆ’3.77%, 0.26%, 2.70%, and 8.26%, respectively; those of RLRM are βˆ’6.31%, βˆ’6.49%, 5.69%, 13.55%, 20.50%, and 18.18%, respectively.

Figure 7 describes the density and its rate of change relative to no control. In case A, the densities under fixed-time control and RLRM are about 38 pcu/km, an increase of about 60%. In case C, the densities under fixed-time control and RLRM are about 52 pcu/km and 45 pcu/km, decreases of about 11.46% and 22.60%, respectively. The densities under fixed-time control, no control, and RLRM are all about 120 pcu/km. In terms of density, the congestion relief rates of fixed-time control from case A to case F are βˆ’57.55%, βˆ’19.92%, βˆ’11.46%, βˆ’21.35%, 7.6%, and 0.39%, respectively; those of RLRM are βˆ’59.59%, βˆ’22.05%, 22.60%, 8.18%, 9.65%, and 3.42%, respectively.

Figure 8 shows the outflow and its rate of change relative to no control. In case A, the outflow rate rises from 1700 pcu/h with no control to 2308 pcu/h with fixed-time control and 1800 pcu/h with RLRM. In case C, increases of 3.82% and 7.85% in outflow rate are observed; in case F, increases of 18.97% and 30.65% are observed. The congestion relief rates of fixed-time control from case A to case F are 35.76%, βˆ’14.25%, 7.82%, 7.93%, 12.51%, and 18.97%, respectively; those of RLRM are 7.06%, 0.58%, 3.85%, 10.47%, 54.63%, and 30.65%, respectively.

Figure 9 represents the travel time and its rate of change relative to no control. In case A, increases of 6.25% and 9.38% in travel time are observed. In case C, the travel time rises from 342 s with no control to 370 s with fixed-time control and falls to 330 s with RLRM. In case F, the travel time falls from 617 s with no control to 469 s with fixed-time control and 343 s with RLRM. The congestion relief rates of fixed-time control from case A to case F are βˆ’6.25%, βˆ’25.26%, βˆ’8.19%, 7.36%, 27.06%, and 23.99%, respectively; those of RLRM are βˆ’9.38%, βˆ’5.26%, 3.51%, 38.17%, 40.32%, and 44.41%, respectively.

According to Figures 6–9, when the traffic inflows are low, the controls are not efficient; they become more efficient as the traffic inflows increase. When the traffic inflows are high, the controls are very efficient and RLRM is the best control. Moreover, the curves in Figures 7–9 show that the assessment indicators of fixed-time control fluctuate around those of no control, so fixed-time control is less stable than RLRM. The automaticity, memory, and performance feedback of RLRM are also evident.

5. Conclusion

On-ramp metering helps traffic move at a speed close to the optimum speed that yields the maximum flow rate or minimum travel time. This study develops an RLRM model without complete information, which consists of a prediction tool based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. Numerical cases are given to compare RLRM with no control and fixed-time control: densities, outflow rates, average speeds, and travel times are calculated. Across cases A to F, fixed-time control and RLRM are assessed in terms of these indicators. When traffic inflow is low, control is not efficient, and there is little difference among no control, fixed-time control, and RLRM. When traffic inflow is high, control is very efficient and RLRM performs best. Moreover, the greater the inflow, the greater the effect, and RLRM is more stable than fixed-time control.

Acknowledgments

This research is funded by the National Natural Science Foundation of China (Grant no. 51008201). It is also sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China, and supported as a key project by the Scientific Research Foundation of the Education Department of Hebei Province of China (Grant no. GD2010235) and the Society Science Development Program of Hebei Province of China (Grant no. 201004068).