1. Introduction
With the growing popularity of mobile phones, transportation network companies (TNCs) that offer app-based services, such as Uber, DiDi, and Lyft, claim to provide stability and convenience through peer-to-peer (p2p) processes that connect passengers and private drivers online and in real time [1]. Because TNCs are an emerging form of transportation based on network and mobile technology, the analysis of TNC ridership has become a hot topic in urban transportation research. Much evidence has shown that the rapid development of TNCs has had a huge impact on the traditional taxi (TT) industry, which has experienced significant losses in market share, revenue, labor, and facilities [2]. This is particularly obvious in large modern cities such as New York City (NYC), where annual taxi ridership decreased from 145 million in 2015 to 113 million in 2017, a decline of about 22%. In contrast, TNC ridership increased from 37 million to 110 million. The reduction in the market share of the taxi industry will inevitably reduce the income of taxi drivers and shrink the scale of the taxi business, leading to economic difficulties and even the bankruptcy of taxi companies. In May 2013, although the price of a yellow cab license plate in NYC had been cut in half, the licenses of many taxi company vehicles sat idle because of a lack of new drivers [3].
Nevertheless, many researchers insist that it is premature to announce the inevitable demise of the taxi industry based on the current success of TNCs. For example, Wang et al. reported that the success of TNCs lies in an aggressive but unsustainable price subsidy strategy [4]. Cramer and Krueger’s study [5] observed that most TNC trips are concentrated in daily traffic peak periods. In off-peak periods, traditional taxis still account for a large proportion of transportation and thus cannot be replaced. Furthermore, according to the statistical results from [6], the average number of working hours per week of Uber drivers was approximately half that of many taxi drivers in the U.S.
Regardless of these debates, it is indisputable that the taxi industry currently faces a huge challenge and competition from TNCs in many respects. Therefore, analyzing the differences between these two modes, such as the characteristics of the target passengers and travel patterns, is conducive to a better understanding of the competitive relationship between them. However, because these differences are not uniform within a city and are driven by diverse factors, the widely used global statistical models are limited in their ability to account for spatiotemporal heterogeneity and autocorrelation. The spatiotemporal analysis of the relationship between taxi/TNC ridership and the built environment is still an open issue.
This paper presents the results of our research, which uses an improved geographically and temporally weighted regression (GTWR) model based on parallel computation to efficiently explore the spatiotemporal relationships between TT/TNC ridership and the built environment in NYC, where about 659 million trips occurred from 2015 to 2017. The rest of this paper is arranged as follows. Section 2 provides a brief review of the relevant research progress, and Section 3 presents the details of the parallel-based GTWR model adopted in this study. Section 4 introduces the related dataset and describes how the data were processed. Section 5 analyzes the model accuracy and findings. Section 6 discusses the spatiotemporal patterns of taxi and TNC ridership. The last section presents the conclusions of this paper, as well as future research directions.
2. Related Literature
Taxis have historically accounted for a far lower share and geographical coverage of urban transportation than other transport modes, such as buses and subways; therefore, far fewer studies have been devoted to taxis than to other transport modes. In general, researchers have found taxis to be both a complement to and a substitute for public transit [2]. Despite their small share in urban transportation, taxis fill a critical gap by providing mobility service and all-day operation, which are not available in other transportation modes. More importantly, with the popularization of GPS auto-collection devices, the spatiotemporal characteristics of taxi ridership and trajectories provide a valuable reference for mining the travel patterns of citizens and for traffic optimization [7]. Therefore, the spatiotemporal analysis of taxis has become a research hotspot in recent years.
Early research on taxis mainly focused on market demand components based on the inherent attributes of the taxi industry, such as price, tips, labor costs, and other factors [8]. Because measurements of cost, waiting time, and convenience were usually derived from surveys or relevant departments, those data are biased and lack objectivity. With the GPS devices carried by taxis, the spatiotemporal data of taxis can be tracked and collected in real time. Unlike the earlier data, these data carry spatiotemporal information and can be integrated with external geographic factors, such as land use [9] and weather [10,11]. For example, Liu et al. used taxi GPS data and urban land use factors to identify ‘source-sink areas’ in Shanghai [12]. Nevertheless, previous studies mostly adopted the ordinary least squares (OLS) method [13,14]. In the OLS model, the aggregated pickup (PU) and drop-off (DO) locations of taxis are used as dependent variables, and the relevant influencing factors, such as weather and land use, are selected as independent variables. Given that spatial autocorrelation and heterogeneity exist in the distribution of PU and DO locations for TT and TNC, the precondition of the OLS model that the observations should be independent of each other is difficult to satisfy.
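To make the global OLS setup described above concrete, the following minimal Python sketch regresses aggregated pickup counts on two influencing factors. The data are synthetic, and the function name fit_ols and the two factors are assumptions made for illustration only; they are not taken from the cited studies.

```python
import numpy as np

def fit_ols(X, y):
    """Global OLS fit of y on X with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # one coefficient vector for all zones
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return beta, float(r2)

# Synthetic example: 200 zones and two hypothetical factors (e.g. land use, weather).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta, r2 = fit_ols(X, y)
print("coefficients:", beta, "R^2:", r2)
```

Note that a single coefficient vector is estimated for the whole study area, which is exactly the property that becomes limiting when the relationships vary across zones.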
To address this issue, Fotheringham et al. proposed a local regression model called Geographically Weighted Regression (GWR) [15], which improves the accuracy of regression results by constructing a local spatial weight matrix for estimating variation in space. Furthermore, the GWR model extends the traditional regression framework by allowing parameter estimates to vary in space and is therefore capable of capturing local effects. The GWR model has been widely applied in transit ridership analysis [16,17]. For example, based on NYC’s taxi data, Qian et al. [18] used the GWR model to analyze the relationship between taxi locations and urban environmental factors. The results show that the GWR model can provide better model accuracy and interpretation than the OLS model. One remaining problem is that the GWR model only obtains variable coefficients in the spatial dimension. When dealing with time series datasets, the data often need to be aggregated or separated based on their timestamps, which ignores the fact that the distribution of taxis or TNCs varies across different time scales. Recently, scholars have put forward many improved strategies to account for both temporal and spatial variability, such as the GWR-TS [19] and linear mixed effect (LME) + GWR models [20]; still, these models are generally based on two-stage least squares regression [21], first fitting the temporal effect using the LME model and then evaluating the spatial heterogeneity effects with the GWR model. Those models cannot consider temporal and spatial effects simultaneously.
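The local weighting idea behind GWR can be sketched as a weighted least squares fit repeated at every observation. The sketch below assumes a fixed Gaussian kernel and a hand-picked bandwidth; the function name gwr_coefficients, the kernel choice, and the synthetic data are illustrative assumptions, not the implementation used in the cited works.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least squares model per location (fixed Gaussian kernel)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    betas = np.empty((n, A.shape[1]))
    for i in range(n):
        d2 = ((coords - coords[i]) ** 2).sum(axis=1)  # squared distance to location i
        w = np.exp(-d2 / bandwidth ** 2)              # nearby points get larger weights
        Aw = A * w[:, None]                           # rows of A scaled by weights (W A)
        betas[i] = np.linalg.solve(A.T @ Aw, Aw.T @ y)  # (A'WA)^-1 A'Wy
    return betas

# Synthetic example: the slope of x grows from west (1.0) to east (2.0).
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1.0, size=(300, 2))
x = rng.normal(size=300)
true_slope = 1.0 + coords[:, 0]
y = true_slope * x + rng.normal(scale=0.05, size=300)

betas = gwr_coefficients(coords, x[:, None], y, bandwidth=0.2)
print("local slope range:", betas[:, 1].min(), betas[:, 1].max())
```

In this toy setting a global OLS fit would report a single average slope, whereas the local estimates follow the west-to-east gradient. In practice the bandwidth would be chosen by cross-validation or an information criterion rather than fixed by hand.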
To model temporal and spatial effects simultaneously, Huang et al. proposed an improved GWR-based model, named Geographically and Temporally Weighted Regression (GTWR) [22], which is designed to apply spatial and temporal weighting at the same time. Thus, the GTWR model can reflect continuous variations for each location at each time. The initial implementation of the GTWR model was carried out for house-price estimation, and the results showed that the accuracy of the GTWR model was superior to that of the OLS and GWR models. Recently, the GTWR model has been applied in many fields, such as air quality [23] and environmental research [24]. Moreover, several improved GTWR schemes have been proposed. For example, Wu et al. proposed the Geographically and Temporally Weighted Autoregressive (GTWAR) model to estimate spatial autocorrelations [25], and Du et al. proposed a Geographically and circle-Temporally Weighted Regression (GcTWR) model to better capture the seasonal cycle of long-term observed data [26].
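The simultaneous weighting can be illustrated by replacing the purely spatial distance of GWR with a combined space-time distance. The sketch below assumes one common form, in which the temporal gap is converted to a distance by a scale factor and added in quadrature to the spatial distance before a Gaussian kernel is applied. The names gtwr_coefficients, bandwidth, and scale are hypothetical, and this parameterization is an assumption that may differ from the one used in the cited work or in Section 3.

```python
import numpy as np

def gtwr_coefficients(coords, times, X, y, bandwidth, scale):
    """Local weighted least squares with a combined space-time Gaussian kernel.

    scale converts a time gap into an equivalent distance (assumed form).
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    betas = np.empty((n, A.shape[1]))
    for i in range(n):
        ds2 = ((coords - coords[i]) ** 2).sum(axis=1)   # squared spatial distance
        dt2 = (scale * (times - times[i])) ** 2          # squared scaled time gap
        w = np.exp(-(ds2 + dt2) / bandwidth ** 2)        # space-time weight
        Aw = A * w[:, None]
        betas[i] = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return betas

# Synthetic example: 10 zones observed over 36 months; the slope of x
# drifts from -1 in the first month to +1 in the last month.
rng = np.random.default_rng(2)
n_zones, n_months = 10, 36
coords = np.repeat(rng.uniform(0.0, 1000.0, size=(n_zones, 2)), n_months, axis=0)
times = np.tile(np.arange(n_months, dtype=float), n_zones)
x = rng.normal(size=n_zones * n_months)
true_slope = (times - 17.5) / 17.5
y = true_slope * x + rng.normal(scale=0.05, size=x.size)

betas = gtwr_coefficients(coords, times, x[:, None], y, bandwidth=500.0, scale=100.0)
print("mean early slope:", betas[times < 6, 1].mean())
print("mean late slope:", betas[times > 29, 1].mean())
```

Because each observation receives its own coefficient vector, the sign change of the slope over time is recovered directly, without splitting the data into separate time slices.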
The above research shows that the GTWR model has great advantages in spatiotemporal modeling. Ma et al. applied the GTWR model to public transit and achieved good modeling results [27]. Zhang et al. also applied the GTWR model to taxi ridership analysis and reached a similar conclusion [28]. Nevertheless, because the spatiotemporal nonstationarity of taxis is more complicated than that of other transit modes with preset routes, such as buses, previous studies have generally been limited to taxis or TNCs separately, and few studies have considered the difference between the two. Research on TNCs remains relatively scarce, although their data structure is similar to that of taxis. Thus, applying the GTWR model to the simultaneous analysis of both taxis and TNCs is still an unsolved issue.
5. Model Estimations and Performance
5.1. Selection of Independent Variables
Multicollinearity among the independent variables can cause bias and affect the credibility of the modeling results. To eliminate collinearity between the factors, we calculate the Pearson correlation coefficients of the factors in this study. According to Qian’s suggestion [18], if the pairwise correlation coefficient of two factors is greater than 0.7, then at most one of them can be included in the model.
Table 3 shows the test results for every pair of factors. Most of the pairwise correlation coefficients were below 0.7. However, in the weather-related group, all four factors (W1-W4) are highly correlated, so only one of them needs to be retained. Meanwhile, among the socioeconomic factors, the density of residents with at least a Bachelor’s degree (SE1) is correlated with the density of employed residents (SE2, 0.92), high income (SE3, 0.99), adult age (SE5, 0.84), and employees (SE6, 0.84); thus, these factors (SE2, SE3, SE5 and SE6) need to be removed. Moreover, considering the complex flow patterns at airports, we add a dummy factor to denote whether a TAZ contains an airport. In this study, the three TAZs containing the JFK, EWR, and LGA airports were set to 1, and the others were set to 0. As a result, fifteen factors, including thirteen independent variables, the month number (T), and an airport dummy variable (AP), are collected and normalized from the initial set of variables.
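The screening rule above can be sketched as a simple greedy filter over the Pearson correlation matrix. In the sketch below, the function name drop_correlated, the use of the absolute correlation, and the keep-the-first-listed policy are assumptions for illustration; the factor labels are borrowed from the text, but the data are synthetic.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.7):
    """Keep a factor only if |r| <= threshold with every factor kept before it."""
    corr = np.corrcoef(X, rowvar=False)   # pairwise Pearson correlations of columns
    kept = []
    for j in range(len(names)):
        if all(abs(corr[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic example: SE2 is nearly a copy of SE1, SE4 is unrelated.
rng = np.random.default_rng(3)
se1 = rng.normal(size=500)
se2 = se1 + 0.1 * rng.normal(size=500)
se4 = rng.normal(size=500)
X = np.column_stack([se1, se2, se4])

selected = drop_correlated(X, ["SE1", "SE2", "SE4"])
print(selected)
```

The outcome of a greedy filter depends on the order in which factors are listed, so the factor judged most informative should be placed first.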
5.2. Comparison of Model Accuracy
The OLS model is first calibrated to explore significant factors that influence the three dependent variables, and the results are presented in Table 4. It shows the estimated coefficients and t-probability for each independent variable, together with goodness-of-fit indicators for the model. Most of the factors are significant at the 0.01 level, revealing that these factors are highly related to ridership in the three models. However, several factors are not statistically significant, including W1, T2, and AP for the TT model, T2 for the TNC model, and LU2 for the PoT model. The variance inflation factor (VIF) values of most factors are within a reasonable range (<7.5), indicating that those factors are well selected and that the multicollinearity problem is avoided [33].
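For reference, the VIF of a factor is 1/(1 − R²), where R² is obtained by regressing that factor on all the other factors. The following minimal sketch computes it on synthetic data; the function name vif and the data are assumptions for illustration, while the 7.5 threshold is the one quoted in the text.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2 on the other columns)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: columns 0 and 1 are nearly collinear, column 2 is independent.
rng = np.random.default_rng(4)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = rng.normal(size=1000)
values = vif(np.column_stack([a, b, c]))
print(values)
```

In this example the two collinear columns exceed the 7.5 threshold by a wide margin, while the independent column stays close to 1.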
According to the adjusted R², the models explain 80.67% and 67.29% of the variation in TT and TNC ridership, respectively, and 73.33% of the variation in the proportion change of TNC. Based on the coefficient values, most of the factors in our study show an intuitive relation with taxi and TNC ridership. For example, three factors, namely the number of snowy days (W1), length of roads (T1), and commuting time (SE7), are negatively correlated with the variation of ridership in both modes. In addition, eight factors, namely LU2, LU3, T3, T4, T5, SE4, SE8 and AP, have a positive effect on TT and TNC ridership. The remaining four factors exhibit different correlations in the two models. For example, the time factor (T) is negatively correlated with TT ridership (−0.721) but positively correlated with TNC ridership (2.545), which is consistent with the opposite temporal variation presented in Figure 2.
However, the OLS model has difficulty explaining factors whose effects are not homogeneous over space and time. For instance, the negative sign of LU1 in the TT and TNC ridership models implies that a low percentage of residential land use in a TAZ may increase the number of PU points. This is contrary to our intuitive understanding. A possible reason is that taxi trips are asymmetric [34] and are used more heavily for trips to residential areas than for trips from them. We therefore conduct further investigations using GTWR models.
The GTWR model needs to estimate each sample independently to obtain coefficients, resulting in voluminous coefficients that vary according to time and place. Table 5 presents the distribution of the coefficients of each factor for the three dependent variables. The optimal value of the parameter q is set to 400 and the space-time scale parameter (unit: meter/month) is set to 350; both were selected through a cross-validation (CV) process that optimizes the fit in terms of R². As shown in Table 5, the adjusted R² is 0.9787 for the TT model, 0.9403 for the TNC model, and 0.9329 for the PoT model, which corresponds to improvements of 0.1723 (21%), 0.2679 (39%) and 0.20 (27%) in the amount of variation explained compared with the OLS models. Moreover, significant improvements are also achieved for two other indicators, the residual sum of squares (RSS) and the root mean square error (RMSE). The reductions in RSS and RMSE show that, by addressing spatiotemporal heterogeneity, the GTWR model is superior to the global OLS model in explanatory power and goodness of fit on the same dataset.
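The indicators used in this comparison can be computed as in the following minimal sketch. The function name fit_metrics is hypothetical, RMSE is assumed to be the square root of RSS divided by the number of samples, and the adjusted R² uses the classical formula with a fixed number of predictors; for local models such as GTWR, the number of parameters is often replaced by an effective number derived from the hat matrix, which is not shown here. The second part recomputes the gains from the rounded adjusted R² values quoted in the text, so the last digits may differ slightly from the reported differences.

```python
import math

def fit_metrics(y_true, y_pred, n_predictors):
    """Return RSS, RMSE, R^2 and classical adjusted R^2 for a set of predictions."""
    n = len(y_true)
    rss = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    mean = sum(y_true) / n
    tss = sum((a - mean) ** 2 for a in y_true)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"rss": rss, "rmse": math.sqrt(rss / n), "r2": r2, "adj_r2": adj_r2}

# Toy predictions, only to show the call.
metrics = fit_metrics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_predictors=1)
print(metrics)

# Adjusted R^2 values as quoted in the text (rounded).
ols_adj_r2 = {"TT": 0.8067, "TNC": 0.6729, "PoT": 0.7333}
gtwr_adj_r2 = {"TT": 0.9787, "TNC": 0.9403, "PoT": 0.9329}
gain = {k: gtwr_adj_r2[k] - ols_adj_r2[k] for k in ols_adj_r2}
relative_gain = {k: gain[k] / ols_adj_r2[k] for k in ols_adj_r2}
for k in gain:
    print(k, round(gain[k], 4), f"{relative_gain[k]:.1%}")
```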
Moreover, the GTWR model also provides an in-depth understanding of how the influencing factors vary locally. The coefficients of the regression model can be used to quantitatively analyze the relationship between the influencing factors and the dependent variable. Specifically, if the sign of a coefficient is negative, the factor and the dependent variable are negatively correlated, which reflects a suppressing effect; otherwise, the factor and the dependent variable are positively correlated, indicating a mutually reinforcing relationship. According to the three-column summary, i.e., the lower quartile (LQ), the median (MED), and the upper quartile (UQ), the median coefficients of W1, T1, and SE7 are negative for both TT and TNC ridership, which implies that snowy weather, high-density roads, and lengthy commuting times probably decrease taxi ridership. A plausible explanation is that taxi drivers are less willing to operate on snowy days or in the traffic congestion caused by high-density roads, resulting in a drop in ridership. Meanwhile, since the TAZs with lengthy commuting times are mainly located far from the central city, the coefficients are consistent with the actual spatial distribution of taxi/TNC ridership, which decreases as the distance from the central zone increases.
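A three-column summary of this kind can be produced directly from the matrix of local coefficients, as in the minimal sketch below. The function name summarize_coefficients and the extra share_positive field are assumptions for illustration, and the coefficient values are toy numbers, not results from Table 5.

```python
import numpy as np

def summarize_coefficients(betas, names):
    """Lower quartile, median, upper quartile and share of positive local coefficients."""
    lq, med, uq = np.percentile(betas, [25, 50, 75], axis=0)
    share_pos = (betas > 0).mean(axis=0)
    return {
        name: {
            "LQ": float(lq[j]),
            "MED": float(med[j]),
            "UQ": float(uq[j]),
            "share_positive": float(share_pos[j]),
        }
        for j, name in enumerate(names)
    }

# Toy local coefficients: one row per sample, one column per factor.
betas = np.array([[-3.0, 1.0], [-2.0, 2.0], [-1.0, 3.0], [0.0, 4.0], [1.0, 5.0]])
summary = summarize_coefficients(betas, ["W1", "T2"])
for name, row in summary.items():
    print(name, row)
```

A factor whose lower quartile and upper quartile have opposite signs, like the first toy column, has an effect that changes direction across space and time, whereas a factor with a positive share of 1.0 is positive everywhere.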
The parameter estimate for the number of subway stations (T2) is always positive in the TT and TNC models, which suggests that an increase in subway stations will generate more TT and TNC trips. The positive correlation can be explained in two ways. First, subway stations are usually crowded; thus, there is a large passenger volume, which will attract and generate more TT and TNC trips. Second, TT and TNC may be widely used for last-mile trips, when passengers get off the subway and travel by TT/TNC to their final destinations. Apart from the factors mentioned above, the other factors show moderate disparity, suggesting that their effects may be positive or negative and vary significantly over space and time.
7. Conclusions
The rapid development of TNCs was indeed a useful supplement to the traditional taxi industry in its early stage, but the growth of urban demand for taxis has been relatively stable. As a result, the relationship between the two modes will inevitably become competitive, and this competition will exhibit nonstationarity in time and space. In response to this problem, we select NYC as a case study to illustrate that the GTWR model can be an effective tool for analyzing spatiotemporal heterogeneity. Moreover, the effects of the influencing factors on TT and TNC can be quantitatively evaluated over time, and spatial variations can also be analyzed through the coefficients at different spatial units (i.e., administrative division-based or grid-based).
This study compares the GTWR model with the OLS model in exploring the relationships between the built environment and PU ridership. The global coefficients of the OLS models are found to be deficient when dealing with spatial problems. The GTWR model, by contrast, performs better than the OLS model, particularly because it can help to eliminate potential bias from spatiotemporal heterogeneity and provides localized regression statistics at each location. By visualizing the distributions of the median coefficient values for each factor, the spatiotemporal variations of the factors can be better interpreted. Our study demonstrates that the relationships between ridership and the influencing factors of the built environment vary over space and time in NYC. Moreover, the effects of the influencing factors on TT and TNC differ significantly in both spatial and temporal terms. For example, the model results reveal that the TNCs’ surge pricing policy has a significant effect on increasing TNC trips in snowy conditions, especially in western Manhattan. While TTs have always been dominant in downtown Manhattan, the share of TNC has risen significantly in the adjacent neighborhoods due to the availability of transit alternatives, such as subways, buses, and private cars, which is probably correlated with commuting time (SE7). Meanwhile, increases in TNC ridership are also observed in remote places, and these increases are positively correlated with the densities of multiple land uses, educated populations, and levels of public transportation usage. Compared with the current saturation of demand in the central city, future competition between TT and TNC might be concentrated in remote areas, such as eastern Queens, which are not adequately covered by public transportation. We believe these findings on the spatial variations of taxi demand could provide useful scientific guidelines for the taxi industry and TNCs to optimize their existing resources and thus improve efficiency.
Furthermore, the basic modeling steps described in this paper, such as data aggregation, factor selection, parameter optimization, modeling analysis, and visual presentation, can also be applied to spatiotemporal modeling in other research fields. For example, considering the recent outbreak of Coronavirus Disease 2019 (COVID-19), the GTWR model might be an appropriate approach to assess the local relationships between the contagiousness of the virus and the influencing factors of the urban environment.
Several challenges remain when applying GTWR models to explore detailed variations in the relationships between taxis and the built environment. As transportation environments vary enormously between cities, the results of GTWR apply only to the specific city studied. In a follow-up study, we will apply the GTWR model to other large cities for comparison and evaluation. The model will incorporate different types of influencing factors, such as POI, real-time population flow, and Internet of Things data, which will help to improve the interpretation of how urban space and time shape taxi demand.
We also note that a four-core CPU might be insufficient to fully evaluate the performance of the proposed parallel-based GTWR model. In fact, optimizing the computational performance of GWR-based models has always been a technical bottleneck that hinders the widespread application of spatiotemporal modeling, especially for massive spatial and temporal datasets. In this study, our main focus is applying the GTWR model to evaluate the relationship between taxi ridership and the influencing factors of the built environment. Due to the length limitation of this paper, we only provide a simple technical outline of the design of the parallel-based GTWR model and run it on a small-scale case study (fewer than 10,000 samples). The efficiency achieved with a four-core CPU already meets the needs of this study. According to some literature [39], the efficiency of parallel computing is correlated with many factors, such as the structure of the algorithm, the selection of software, and the size of the data, and in some cases parallel computation is even less efficient than serial computation. Therefore, evaluating the efficiency of the parallel-based GTWR model could be a complex technical problem, which we believe should be addressed in future work.
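Because each local regression is independent of the others, the per-sample fits can in principle be distributed across workers. The sketch below illustrates this with a thread pool from the Python standard library and checks that the parallel result matches the serial one. It is not the parallel design used in this paper: the function names, the space-time kernel, the parameters bandwidth and scale, and the choice of threads rather than processes are all assumptions made for illustration, and no speed-up is claimed.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def local_fit(i, A, y, coords, times, bandwidth, scale):
    """Weighted least squares fit centred on sample i (assumed Gaussian space-time kernel)."""
    ds2 = ((coords - coords[i]) ** 2).sum(axis=1)
    dt2 = (scale * (times - times[i])) ** 2
    w = np.exp(-(ds2 + dt2) / bandwidth ** 2)
    Aw = A * w[:, None]
    return np.linalg.solve(A.T @ Aw, Aw.T @ y)

def gtwr_serial(A, y, coords, times, bandwidth, scale):
    return np.array([local_fit(i, A, y, coords, times, bandwidth, scale)
                     for i in range(len(y))])

def gtwr_parallel(A, y, coords, times, bandwidth, scale, workers=4):
    # Each sample is an independent task; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(
            lambda i: local_fit(i, A, y, coords, times, bandwidth, scale),
            range(len(y))))
    return np.array(rows)

# Synthetic data: 200 samples, one predictor plus intercept.
rng = np.random.default_rng(5)
n = 200
coords = rng.uniform(0.0, 1000.0, size=(n, 2))
times = rng.integers(0, 36, size=n).astype(float)
x = rng.normal(size=n)
A = np.column_stack([np.ones(n), x])
y = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=n)

serial = gtwr_serial(A, y, coords, times, bandwidth=500.0, scale=100.0)
parallel = gtwr_parallel(A, y, coords, times, bandwidth=500.0, scale=100.0)
print("identical results:", np.allclose(serial, parallel))
```

Whether such a scheme is faster than the serial loop depends on the sample size, the cost of each local fit, and the scheduling overhead, which is consistent with the caveat above that parallel computation can in some cases be slower than serial computation.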
Another issue that must be considered is that the GTWR model uses local least squares to estimate the coefficients of the variables; therefore, the model’s accuracy depends on the independence of the observed samples. When there is strong autocorrelation between the data samples and this autocorrelation is not properly accounted for, it will cause overfitting and affect the final explanatory results of the model. In this respect, the GTWAR model, which can estimate the spatial autocorrelation for each variable, might be a better solution. However, the GTWAR model increases the computational complexity of the algorithm; therefore, whether this model is necessary for transportation analysis needs to be further evaluated. Regarding the time dimension, seasonal change might also be considered. More taxis operate in summer than in winter, and in our study we rely on weather factors to reflect seasonal changes. The GcTWR model proposed in [26] might be an alternative way to improve the general GTWR model. One limitation is that the seasonal span of the GcTWR model is preset manually, whereas in actual situations the periodicity varies. Adopting an adaptive method to automatically identify the seasonal span of the transportation distribution is another problem that must be considered carefully.