An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis

Mao, Yingchi; Zhong, Haishi; Qi, Hai; Ping, Ping; Li, Xiaofang

doi:10.3390/s17092013

Open AccessArticle

An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis

¹

College of Computer and Information, Hohai University, Nanjing 210098, China

²

School of Computer Information & Engineering, Changzhou Institute of Technology, Changzhou 213032, China

^*

Author to whom correspondence should be addressed.

Sensors 2017, 17(9), 2013; https://0-doi-org.brum.beds.ac.uk/10.3390/s17092013

Submission received: 14 July 2017 / Revised: 29 August 2017 / Accepted: 30 August 2017 / Published: 2 September 2017

(This article belongs to the Special Issue Sensors for Transportation)

Download

Browse Figures

Versions Notes

Abstract

:

Clustering analysis is one of the most important issues in trajectory data mining. Trajectory clustering can be widely applied in the detection of hotspots, mobile pattern analysis, urban transportation control, and hurricane prediction, etc. To obtain good clustering performance, the existing trajectory clustering approaches need to input one or more parameters to calibrate the optimal values, which results in a heavy workload and computational complexity. To realize adaptive parameter calibration and reduce the workload of trajectory clustering, an adaptive trajectory clustering approach based on the grid and density (ATCGD) is proposed in this paper. The proposed ATCGD approach includes three parts: partition, mapping, and clustering. In the partition phase, ATCGD applies the average angular difference-based MDL (AD-MDL) partition method to ensure the partition accuracy on the premise that it decreases the number of the segments after the partition. During the mapping procedure, the partitioned segments are mapped into the corresponding cells, and the mapping relationship between the segments and the cells are stored. In the clustering phase, adopting the DBSCAN-based method, the segments in the cells are clustered on the basis of the calibrated values of parameters from the mapping procedure. The extensive experiments indicate that although the results of the adaptive parameter calibration are not optimal, in most cases, the difference between the adaptive calibration and the optimal is less than 5%, while the run time of clustering can reduce about 95%, compared with the TRACLUS algorithm.

Keywords:

mobile pattern analysis; spatio-temporal data; trajectory clustering; adaptive parameter calibration; grid

1. Introduction

In recent years, with the rapid development of sensor technology and smart phones, GPS devices are widely applied to track moving objects, e.g., humans, vehicles, and animals, which can produce huge amounts of trajectory data every day. The trajectory data is the spatial-temporal data series from the moving objects with different timestamps. They contain a lot of information and help us understand the behaviors of the moving objects more directly. For example, zoologists can cluster the paths of animals to study the migration of animals [1]. Meteorologists explore the movement path of hurricanes through clustering and correlation analysis to improve the capabilities in disaster early warning and prevention [2]. Based on the clustering analysis of the movement patterns of vehicles, traffic managers can plan urban roads to mitigate the traffic jams [3,4]. For example, Yue et al. proposed the single-linkage clustering method to analyze taxi trajectory data to detect the time-dependent hot spots and movement patterns for urban traffic planning [5]. Moreover, a mobility-based clustering of vehicle trajectories was presented to detect hotspots and avoid the traffic jams [6].

Clustering analysis is one of the most important methods used in trajectory data mining. Trajectory clustering approaches can be applied in hotspot path analysis, mobility pattern analysis, and urban planning. At present, the trajectory clustering approaches include two types [7]: the first cluster the trajectory data based on the similarity of the full sequences. In other words, they take the whole trajectory as a unit to cluster the trajectory data. Those approaches have good effects on the clustering for the simple trajectories, however, they have negative effects for complex trajectories due to the fact they ignore the local detail sequences. The second type cluster the trajectory data based on the similarity of the sub-sequences. This means that the whole complex trajectory sequence is divided into several segments, which can be clustered with one segment as a unit. The second approaches have the ability to recognize the local features of complex trajectories.

Nonetheless, most available trajectory clustering algorithms depend on the calibration of one or multiple parameters. Meanwhile, the parameter values have a great influence on the effect of clustering. To reduce the complexity and workload of parameter calibration in trajectory clustering, a method called Adaptive Trajectory Clustering approach based on Grid and Density (ATCGD) is proposed in this paper. ATCGD firstly divides the trajectory data into multiple discrete segments through the average angular difference-based MDL (AD-MDL) algorithm. All of the discrete segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each cell, and the average number of the trajectory segments in each cell. Finally, adopting a DBSCAN-based approach, ATCGD carries out the adaptive parameter calibration based on the above data to realize effective and accurate trajectory clustering. As an illustration of the capabilities of the proposed method, we evaluate the performance of ATCGD approach on clustering quality and cost using two data sets from the random trajectories and hurricane trajectories in the Atlantic Ocean. The experimental results indicate that although the results of the adaptive parameter calibration are not optimal, in most cases, the difference between the adaptive calibration and the optimal one is less than 5%, while the run time of clustering can be reduced by about 95%.

The remainder of this paper is organized as follows: Section 2 discusses the related works and analyzes their drawbacks. The discrete trajectory partition algorithm, that is the average angular difference-based MDL (AD-MDL), is discussed in Section 3. Section 4 presents the proposed ATCGD approach, and the performance evaluations are given in Section 5. Discussion and conclusions are given in Section 6.

2. Related Work

2.1. Trajectory Clustering Approaches

Trajectory data can be regarded as time sequence data. Trajectory clustering is an important part of clustering analysis. To study the trajectory clustering of mobile objects, Gaffney et al. presented the mixture regression model-based trajectory clustering algorithm [8]. Furthermore, considering the temporal feature of trajectories, the spatial distance of the mobile objects was expanded to the spatial-temporal distance of the trajectories [9]. The time-focused trajectories clustering of moving objects algorithm, TFCTMO, was proposed based on the spatial-temporal distance. To obtain the moving cluster in the spatial-temporal trajectory data, the filter-based spatial-temporal clustering algorithm was discussed [10]. The filter-based cluster algorithm first filtered the trajectory data in the different time-scale ranges, and then clustered the data in the spatial-scale range within the same timestamp. All of the above clustering algorithms are based on the similarity of the full sequences.

Lee et al. thought that the clustering approaches based on the full sequences may have negative effects for complex trajectories due to the fact they ignore the local partial similarity [11]. Moreover, they put forward a partition-and-group framework and clustering algorithm—TRACLUS—that divides the whole trajectory into several segments and clusters them through the DBSCAN method [12,13,14]. The TRACLUS algorithm can recognize the local partial similarity of trajectories, however, in order to obtain good clustering quality, TRACLUS requires a large amount of workload to calibrate two parameters (the scanning range

e p s

and the density

m i n P t s

of each group). At the same time, the values of the two parameters are sensitive to the different data sets. In order to reduce the complexity and workload of parameter calibration, some parameter adaptive clustering algorithms based on the DBSCAN were put forward. For example, a self-adaptive density-based clustering algorithm (SA-DBSCAN) was presented in [15]. In the SA-DBSCAN approach, the distance of every object-pair in the data set is calculated as the input of two parameters

e p s

and

m i n P t s

. Although SA-DBSCAN can achieve good accuracy, it results in high computational complexity O(n²). Furthermore, through integrating the Affinity Propagation (AP) clustering method with DBSCAN, an AP-based clustering algorithm (APSCAN) was presented to cluster the objects without parameters [16]. However, the APSCAN algorithm still needs to compute the distance of every object-pair and thus exhibits high complexity. To further realize adaptive parameter calibration, the GCMDDBSCAN clustering algorithm established grid cells based on the various data, and then clustered the data based on optimal values of parameters

e p s

and

m i n P t s

with the cell as a unit [17].

From the above analysis, all of the DBSCAN-based clustering algorithms can achieve the adaptive parameter clustering for the simple object data. Considering the spatial and temporal characteristics of trajectory data, which differs from that of the simple object data, the trajectory clustering algorithm should reduce the computation complexity of clustering algorithms, especially in large-scale vehicle trajectories from intelligent systems. Based on the analysis of the DBSCAN-based clustering algorithms with adaptive parameter calibration, an Adaptive Trajectory Clustering approach based on Grid and Density (ATCGD) is proposed in this paper. ATCGD firstly divides the trajectory data into discrete trajectory segments based on the MDL-based method. All of the segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each grid cell, and the average number of the trajectory segments in each cell. Finally, adopting the idea of the DBSCAN-based method, ATCGD carries out the adaptive parameter calibration based on the above data to realize effective and accurate trajectory clustering.

Li et al. found that the existing trajectory algorithms focused on the static data and cannot deal with the problem of the data dynamic growth [18], so an incremental clustering framework of the trajectory, TCMM, was presented. In the TCMM framework, the whole trajectory was divided into several sequences and micro-clusters were established and dynamically maintained. The K-means method was also applied to the trajectory clustering problem [19]. However, it needed to determine the value of K in advance and cannot deal with noisy data, which results in poor performance in actual applications. Furthermore, the space covered by the trajectories was divided into cells. The trajectory clustering based on cells was proposed to cluster the grids when each cell is an object [20]. The cells-based clustering algorithm can exhibit good processing performance, while it ignores the differences among the sequences and leads to the poor clustering accuracy.

2.2. Trajectory Partition Methods

The proposed ATCGD algorithm includes three parts: partition, mapping, and clustering, as shown in Figure 1. In the partition phase, ATCGD applies the average angular difference-based MDL (AD-MDL) partition method to ensure the partition accuracy on the premise that it decreases the number of the segments after the partition. During the mapping procedure, the partitioned segments are mapped into the corresponding cells, and the mapping relationship between the segment and the cell are stored. In the clustering phase, adopting the DBSCAN-based method, the segments in the cells are clustered on the basis of the computed values of parameters from the mapping procedure. The clustering results can be applied in hotspot paths analysis, mobility pattern analysis, and urban planning.

In the field of trajectory partition, most of trajectory partition approaches rely on trajectory compression algorithms. The classical one is the Douglas-Peucker (DP) algorithm [21]. It detects some unnecessary points by calculating the information loss. Through introducing the concept of “window”, that is the segment, into the information loss computation, the OPening Window algorithm (OPW) was proposed [22]. OPW uses iterations to compress the trajectories with one “window” as one unit, instead of one whole trajectory as one unit. Using an iterations method, OPW can greatly reduce the computation cost. Afterwards, taking the time dimension into consideration, the Top-Down Time Ratio (TD-TR) algorithm was presented [12], and the optimal upper bound of errors compression algorithm (SQUISH-E) was proposed [23].

These further improved the applicability of the compression algorithms in the GPS trajectory data. Lee et al. put forward the trajectory partition algorithm based on the Minimum Description Length (MDL) [11], which can effectively compress data as well as ensure the accuracy of the compressed data. From the above analysis of compression algorithms, it can be found that most of the compression algorithms try to obtain successive sequences of trajectory, which means all segments are end-to-end. However, continuity is unnecessary to the clustering of the trajectory segments. We can improve the accuracy of the compressed data when dealing with the discrete segments of trajectory. As shown in Figure 2a,b,

T S_{c - r e p 1}

and

T S_{c - r e p 2}

marked with the red line, are the continuous representative segments of the original trajectory data

T S_{o r i g i n a l}

; and

T S_{d - r e p 1}

and

T S_{d - r e p 2}

, marked with the green line, are the discrete representative segments, respectively. The dash area represents the area difference between the representative segments and the original trajectory. It is obvious that the area difference between

T S_{c - r e p}

and

T S_{o r i g i n a l}

is greater than that between

T S_{d - r e p}

and

T S_{o r i g i n a l}

.

In this paper, adopting the AD-MDL discrete trajectories partition method, the proposed ATCGD trajectory clustering approach can map all of the segments into the corresponding cells. Then, based on the idea of the DBSCAN-based method, the segments are clustered through the calibration of adaptive parameters with the mapping relationship. The experimental results illustrate that the ATCGD approach can improve the effectiveness of clustering as well as ensure the accuracy.

3. Discrete Partitioning of Trajectories

3.1. Distance Measure Between the Trajectory Segments

Definition 1 (trajectory).

With a given Euclidean space, a trajectory is composed of a series of trajectory points, expressed as

T R = {P_{1}, P_{2}, \dots, P_{n}}

, where the discrete trajectory points are sorted by timestamp, P_i refers to the trajectory point i,

P_{i} = (x_{i}, y_{i})

, and n represents the number of points in the trajectory.

Definition 2 (sub-trajectory segment).

Two adjacent discrete trajectory points

P_{i}

and

P_{i + 1}

are connected to form a trajectory segment

P_{i} P_{i + 1}

, which is a sub-trajectory segment, denoted as

T S_{i}

.

A trajectory sequence consists of a series of discrete points. Two adjacent discrete points are connected to form a sub-trajectory segment. Due to the massive amount of trajectory data generated by mobile phones and other GPS equipment, the trajectory data compression is an important task for the sub-trajectory segments clustering. To reduce the workload of clustering all of the trajectory data, it should first partition the trajectory

T R = {P_{1}, P_{2}, \dots, P_{n}}

into the multiple sub-trajectory segments

T S = {T S_{1}, T S_{2}, \dots, T S_{n - 1}}

by adopting the appropriate compression algorithm.

Lee et al. proposed a method to calculate the distance between two sub-trajectory segments with the weighted sum of the horizontal distance, vertical distance, and angular distance [10]. That distance of trajectory segments is suitable to the trajectory clustering. The horizontal distance can effectively avoid the noisy data problem when the distance between the two long trajectory segments is long. However, the angular distance may cause the problem of the short trajectory segments priority, which means that the shorter the trajectory segment is, the smaller the angular distance is. To solve the problem of the short trajectory segments priority, a new method to calculate the distance between the different segments is presented in this paper. As shown in Figure 3,

T S_{1}

is the shorter trajectory segments and

T S_{2}

is the longer one.

l_{⊥ 1}

and

l_{⊥ 2}

are the minimum and maximum vertical distance from any point in

T S_{1}

to the segment

T S_{2}

, respectively.

l_{‖ 1}

and

l_{‖ 2}

are the distance from the corresponding intersection to the endpoint, respectively.

d_{⊥}

is the vertical distance between the two segments calculated with

l_{⊥ 1}

and

l_{⊥ 2}

.

d_{‖}

is the horizontal distance between the two segments calculated with

l_{‖ 1}

and

l_{‖ 2}

,

θ

is the angle between the two segments

T S_{1}

and

T S_{2}

, as shown in Equations (1) and (2):

d_{⊥} = {\begin{cases} \begin{matrix} (\frac{l_{⊥ 1} + l_{⊥ 2}}{2}) \times (1 + \sin θ) & i f & θ < \frac{π}{2} \end{matrix} \\ \begin{matrix} (\frac{l_{⊥ 1} + l_{⊥ 2}}{2}) \times (2 + \sin (θ - π / 2)) & i f & θ \geq \frac{π}{2} \end{matrix} \end{cases}

(1)

d_{‖} = {\begin{cases} \begin{matrix} \min (l_{‖ 1}, l_{‖ 2}) \times (1 + \sin θ) & i f & θ < \frac{π}{2} \end{matrix} \\ \begin{matrix} \min (l_{‖ 1}, l_{‖ 2}) \times (2 + \sin θ) & i f & θ \geq \frac{π}{2} \end{matrix} \end{cases}

(2)

The distance between the two segments

T S_{1}

and

T S_{2}

can be computed as shown in Equation (3):

d i s t (T S_{1}, T S_{2}) = d i s t (T S_{2}, T S_{1}) = d_{⊥} + d_{‖}

(3)

3.2. Discrete Representative Trajectory Segments

Definition 3 (representative trajectory segments).

Given a set of the trajectory segments

T S = {T S_{1}, T S_{2}, \dots, T S_{n}}

,

T S

can be represented with a trajectory segment

T S_{r e p}

as the representative trajectory segment.

According to the discussion in Section 2, from Figure 2a,b, it is obvious that the area difference between

T S_{c - r e p}

and

T S_{o r i g i n a l}

is greater than that between

T S_{d - r e p}

and

T S_{o r i g i n a l}

. In order to reduce the area difference between the set of the partitioned segments and the original whole trajectory, ATCGD approach applies the discrete representative trajectory segments to replace the original whole trajectory, instead of the continuous representative trajectory segments. Figure 4 illustrates the discrete representative segments. As shown in Figure 4,

P_{i}, i = 1, \dots, 5

denotes the trajectory point in the original trajectory.

P_{m i d}

is the middle point in the original trajectory, where

x_{m i d} = \sum_{i = 1}^{5} x_{i} / 5

,

y_{m i d} = \sum_{i = 1}^{5} y_{i} / 5

.

In Figure 4,

T S_{m i d}

is the trajectory line through the middle point

P_{m i d}

. Suppose that

T S_{i} \cdot θ

represents the clockwise angle between the trajectory segment

T S_{i}

and the horizontal line, where

0 \leq T S_{i} \cdot θ < π

.

T S_{m i d} \cdot θ

is the clockwise angle between the trajectory segment

T S_{m i d}

and the horizontal line.

T S_{m i d} \cdot θ

can be calculated as follows:

T S_{m i d} \cdot θ = \frac{\sum_{i = 1}^{4} T S_{i} \cdot θ}{4}

(4)

Then, it makes two vertical lines from two endpoints of original trajectory

P_{1}

and

P_{5}

to the line

T S_{m i d}

, and intersects at the points

P_{s}

and

P_{e}

, respectively. The trajectory segment

P_{s} P_{e}

is just the representative trajectory segment of the original trajectory

{P_{1}, P_{2}, P_{3}, P_{4}, P_{5}}

, denoted as

T S_{r e p}

. The coordinate values of the intersection

P_{s}

can be calculated with Equation (5):

\begin{array}{l} x_{s} = \frac{y_{1} + \tan (π / 2 - T S_{m i d} \cdot θ) \cdot x_{1} - y_{m i d} + \tan (T S_{m i d} \cdot θ) \cdot x_{m i d}}{\tan (T S_{m i d} \cdot θ) + \tan (π / 2 - T S_{m i d} \cdot θ)} \\ y_{s} = \tan (T S_{m i d} \cdot θ) \cdot (x_{s} - x_{m i d}) + y_{m i d} \end{array}

(5)

In the same way, the coordinate values of the intersection

P_{e}

can be calculated. It is obvious that the representative trajectory segments via the above method are discrete and cannot be end-to-end.

From Figure 2, the area difference between

T S_{c - r e p}

and

T S_{o r i g i n a l}

is greater than that between

T S_{d - r e p}

and

T S_{o r i g i n a l}

. Therefore, this discreteness cannot take negative effect on the clustering results, instead it can generate the more accurate representative segments of the original trajectory.

To evaluate the accuracy of the representative trajectory segments, the cumulative distance difference between the discrete representative trajectory segment

T S_{r e p}

and the set of the original continuous segments

T S = {T S_{1}, T S_{2}, \dots, T S_{n}}

is introduced, which is represented as

φ

. Because the vertical distance is one major impact factor on the difference between the representative trajectory segment and the original ones, the vertical distance is adopted to compute the cumulative distance difference, as shown in Equation (6):

φ = \sum_{i = 1}^{n} d_{⊥} (T S_{r e p}, T S_{i})

(6)

where

n

is the number of the original segments. The smaller

φ

is, the more accurate the representative trajectory segment is. Meanwhile, in order to verify the accuracy of the discrete representative trajectory segments, 1000 trajectories from the GeoLife data sets [24] are randomly selected. Assume that

φ_{d i s c r e t e}

represents the cumulative distance difference between the discrete representative segment and the original trajectory segment.

φ_{c o n t i n u o u s}

represent the cumulative distance difference between the continuous representative segment and the original one. The experimental results are that there are

φ_{d i s c r e t e} \leq φ_{c o n t i n u o u s}

in the 982 trajectories from the 1000 trajectories, while there is only

φ_{d i s c r e t e} > φ_{c o n t i n u o u s}

in the 18 trajectories. The experimental results indicate that the discrete representative trajectory segment can substitute the original one more accurately.

3.3. Discrete Trajectory Partition Algorithm

From daily life experience, we know that the trajectory variations of people’s or vehicle’s movements are always relatively smooth. That is to say, there are very small changes in the angle between the two adjacent trajectory segments. To further quantify the variations of trajectories, the average angular difference

A v g_{a n g l e - d i f f}

is introduced. Given a trajectory data

T R = {P_{1}, P_{2}, \dots, P_{m}}

, the average angular difference

A v g_{a n g l e - d i f f}

can be calculated as shown in Equation (7):

A v g_{a n g l e - d i f f} (1, n) = \sum_{i = 2}^{n} \frac{| T S_{1} \cdot θ - T S_{i} \cdot θ |}{n - 2}, where n > 2

(7)

Lee et al. put forward the trajectory partition algorithm based on the Minimum Description Length (MDL) to compress data [8]. MDL is derived from Information Theory, which can be used to describe a given data set using fewer symbols than needed to describe the data literally. In essence, MDL can be applied to data compression. In the trajectory data compression, MDL can obtain a tradeoff between the number of sub-trajectory segments and the accuracy of the trajectories partition results, but MDL has high computational complexity to obtain the partitioned segments. In order to reduce the complexity of the trajectories partition, the average Angular Difference-based MDL (AD-MDL) is proposed to compress the trajectory data and partition the trajectories. AD-MDL consists of two phases: data filtering and trajectory partition.

In the data filtering phase, it eliminates the obvious outliers with the minimum cost based on the average angular difference

A v g_{a n g l e - d i f f}

, which can reduce the computation workload during the trajectory partitioning. At first, the original trajectory data can be partitioned into multiple continuous segments. During the procedure of data filtering, the average angular difference

A v g_{a n g l e - d i f f}

is considered as the filtering factor. The filter threshold is

θ_{t h r e s h o l d}

. For each continuous sub-trajectory segment, if its average angular difference is greater than the threshold

θ_{t h r e s h o l d}

, the starting point of that sub-trajectory segment should be added into the set of the candidate trajectory points

T R_{c}

. Otherwise, the starting point of the segment is considered as an outlier and cannot be processed in the trajectory partition phase. After the data filtering, it can get the set of the candidate trajectory points

T R_{c} = {P c_{1}, P c_{2}, \dots, P c_{n}}

. A GeoLife data set is introduced as an example to evaluate the performance on data compression. Based on the experimental results, it can be found that the AD-MDL can realize the 39% compression rate when the threshold value is

θ_{t h r e s h o l d} = π / 64

. Thus, it can greatly reduce the computation overhead in the trajectory partition phase.

In the trajectory partition phase, MDL method is still adopted to partition the compressed trajectories into discrete representative trajectories. During the data compression procedure, the overhead of MDL usually includes two parts:

L (H)

and

L (D | H)

.

H

is the hypothesis, and

D

is the described data.

L (H)

is the overhead of describing the hypothesis and

L (D | H)

is the overhead to describe the

D

under the hypothesis

H

. MDL aims to find the optimal H to describe D to minimize the sum of

L (H)

and

L (D | H)

. As to the trajectory partition,

H

is the set of discrete representative trajectory segments, and

D

is the original trajectory data.

L (H)

represents the total length of the all discrete representative segments.

L (D | H)

represents the difference between the discrete representative segments and the original trajectory. It is obvious that the greater number of the selected candidate points is, the more accuracy of the partition is. The greater

L (H)

is and the smaller

L (D | H)

is, which results in the high accuracy and high computation cost. Otherwise, it results in the low overhead and poor accuracy. When the sum of

L (H)

and

L (D | H)

is minimum, the trajectory partition can reach the tradeoff between the accuracy and computation cost.

L (H)

and

L (D | H)

can be computed as follows Equation (8):

\begin{array}{l} L (H) = \sum_{i = 1}^{m - 1} \log (l e n (T S_{c_{i} - c_{i + 1}})) \\ L (D | H) = \sum_{i = 1}^{m - 1} \sum_{j = c_{i}}^{c_{i}_{+ 1} - 1} \log (d_{⊥} (T S_{c_{i} - c_{i + 1}}, P_{j} P_{j + 1})) \end{array}

(8)

where

T S_{c_{i} - c_{i + 1}}

represents the discrete trajectory segment from the candidate point

P c_{i}

to

P c_{i + 1}

,

P_{j} P_{j + 1}

is the original trajectory segment in the

T S_{c_{i} - c_{i + 1}}

, and

l e n (T S_{c_{i} - c_{i + 1}})

means the length of the discrete trajectory segment from the point

P c_{i}

to

P c_{i + 1}

.

To obtain the optimal trajectory partition, it should compute the global optimal solution to the minimum sum of

L (H)

and

L (D | H)

, which results in the high computation overhead. To reduce the computation cost, we adopt a greedy solution to find the local optimal results to replace the global optimal results.

Suppose

P c_{i}

and

P c_{j}

are two candidate points from

T R_{c} = {P c_{1}, P c_{2}, \dots, P c_{n}}

.

M D L (c_{i}, c_{j}) = L (H) + L (D | H)

represents the minimum description length of part of trajectory segment

{P c_{i}, P c_{i + 1}, \dots, P c_{j}}

and

c_{i} < c_{j}

.

L_{D} (c_{i}, c_{j})

represents the original trajectory length of the segment

{P c_{i}, P c_{i + 1}, \dots, P c_{j}}

, that is,

L_{D} (c_{i}, c_{j}) = \sum_{x = i}^{j - 1} l e n (P c_{x} P c_{x + 1})

. From the point

P c_{i}

as the starting point, if

M D L (c_{i}, c_{j}) < L_{D} (c_{i}, c_{j})

, it reveals that all of the trajectory points in the segment

{P c_{i}, P c_{i + 1}, \dots, P c_{j}}

are not trajectory characteristic points and the corresponding trajectory segment

{P c_{i}, P c_{i + 1}, \dots, P c_{j}}

cannot be added into the set of the discrete representative segment, denoted as

D_{T S}

. Otherwise, the points in the segment

{P c_{i}, P c_{i + 1}, \dots, P c_{j}}

are trajectory characteristic points and the corresponding segment can be transformed into the discrete representative segment with the Equations (4) and (5) discussed in Section 3.2.

According to the above discussion, the average Angular Difference-based MDL (AD-MDL) algorithm can be used to compress the trajectory data and create the discrete representative segments. The pseudo-code of the AD-MDL algorithm (Algorithm 1) is as given below. The AD-MDL trajectory partition algorithm contains two phases, the first one is the data filtering and the second one is to create the discrete representative trajectory segments. In the data filtering phase, part of the trajectory points is selected as the candidate point for the trajectory partition phase, based on the average angular difference. Thus, it can reduce the number of trajectory points to create the discrete representative segments and reduce the computation time in the second phase.

Algorithm 1. AD-MDL: The Average Angular Difference-Based MDL Trajectories Partition Algorithm.

Input: Trajectory sequences $T R = {P_{1}, P_{2}, \dots, P_{n}}$ , and the threshold of the average Angular Difference $θ_{t h r e s h o l d}$
Output: the set of discrete representative trajectory segments $D_{T S}$
// data filter phase
1: index = 1; $p_{s t a r t} = p_{1}$ ; $p_{s t a r t}$ is added into the set of candidate trajectory points $T R_{c}$
2: for $j = 2$ to $n$ in the $T R$
3: if $A v g_{a n g l e - d i f f} (i n d e x, j) > θ_{t h r e s h o l d}$ then
4: $p_{j}$ is added into the set $T R_{c}$
5: $i n d e x = j$ ; $j = j + 1$ ;
6: $p_{e n d} = p_{n}$ ;
7: $p_{e n d}$ is added into the set $T R_{c}$
// trajectory partition phase
8: index =1;
9: for $j = 2$ to $m$ in the $T R_{c}$
10: if $M D L (c_{i n d e x}, c_{j}) > L_{D} (c_{i n d e x}, c_{j})$
11: $T S_{c_{i n d e x} - c_{j}}$ is a discrete representative trajectory segment, and added into the set $D_{T S}$
12: $i n d e x = j$ ; $j = j + 1$ ;
13: end for
14: return the set $D_{T S}$ .

As shown in lines 1 to 7 of the AD-MDL algorithm, if the average angular difference is not greater than the threshold

θ_{t h r e s h o l d}

, a new trajectory point is added. Otherwise, the new added trajectory point is the characteristic point and is added into the set of candidate points

T R_{c}

. In the trajectory partition phase, in order to obtain the clustering accuracy as well as the low complexity, the MDL-based method is adopted to create the discrete representative segments. As shown in the line 8 to line 14 of the AD-MDL algorithm, if there is

M D L (c_{i n d e x}, c_{j}) \leq L_{D} (c_{i n d e x}, c_{j})

, the trajectory points between the

p_{c_{i n d e x}}

and

p_{c_{j}}

are non-characteristic points, and the successive point is included. If

M D L (c_{i n d e x}, c_{j}) > L_{D} (c_{i n d e x}, c_{j})

, the trajectory points between the

p_{c_{i n d e x}}

and

p_{c_{j}}

are characteristic points and the corresponding segment

T S_{c_{i n d e x} - c_{j}}

is added into the set of discrete representative segments

D_{T S}

. The AD-MDL algorithm traverses all of the trajectory points twice, so the computation complexity is

O (n)

, where

n

is the total number of trajectory points.

4. Trajectory Clustering Based on Grid and Density

4.1. Grid Partition

We can get the discrete representative trajectory segments with the AD-MDL algorithm. After the trajectory partitioning, the partitioned segments should be mapped into the appropriate cells with the clustering method based on the grid and density, which is the task of the grid partition phase. The trajectory clustering based on the density should follow the principle of the cluster size from small to big. Suppose that the average number of the trajectory segments in each cell is represented as

N u m_{a v g}

. The value of

N u m_{a v g}

should be as small as possible, which means that the average number of the trajectory segments should be minimum in each cell. However, in order to conduct the trajectories clustering based on the density, it needs to compute the distances among the different trajectory segments for each cell, which results in the heavy overhead of computation. In the experiments of Section 5.3, it can be found that the minimum value of

N u m_{a v g}

cannot obtain the optimum of clustering. Through a lot of experiments, when

N u m_{a v g} = 2

, it can obtain the best clustering quality.

Definition 4 (belonging cell).

The cells are passed by the trajectory segment

T S_{i}

are defined as the belonging cells of

T S_{i}

, represented as

B e l o n g_C e l l . T S_{i}

. The number of belonging cells of

T S_{i}

is

| B e l o n g_C e l l . T S_{i} |

. As shown in the Figure 5, the cells with point shaded are the belonging cells of the trajectory segment

T S_{1}

, and

| B e l o n g_C e l l . T S_{1} | = 4

.

Definition 5 (adjacent cell).

The cells are adjacent to one of the belonging cells of the trajectory segment

T S_{i}

are defined as the adjacent cells of

T S_{i}

, represented as

A d j a c e n t_C e l l . T S_{i}

. The number of adjacent cells of

T S_{i}

is

| A d j a c e n t_C e l l . T S_{i} |

. As shown in the Figure 5, the cells with oblique lines shadow are the adjacent cells of the trajectory segment

T S_{1}

, and

| A d j a c e n t_C e l l . T S_{1} | = 14

.

Definition 6 (cell density).

Suppose a certain cell

c e l l_{i}

, the number of the trajectory segments passing through the

c e l l_{i}

is defined as the cell density of

c e l l_{i}

, denoted as

c e l l_{i} . s e g

.

During the procedure of the grid partition and mapping the trajectory segments into the corresponding cells, it needs to traverse every trajectory segment and recognize all of the belonging cells and adjacent cells of every segment, as well as every cell’s density. Those computation results are the inputs for the trajectory clustering.

4.2. Trajectory Clustering Algorithm

The DBSCAN-based clustering approaches should calibrate the values of two parameters

e p s

and

m i n P t s

.

e p s

and

m i n P t s

denote the radius of neighbor cells and the threshold of density of the trajectory segments, respectively. In Section 4.1, we could obtain the average distance among the different segments in each cell, and the average number of the trajectory segments in each cell. With the DBSCAN-based clustering approach, the ATCGD trajectory clustering approach carries out the adaptive parameters calibration

e p s

and

m i n P t s

, based on the above data to realize the effective and accurate trajectory clustering.

Definition 7 (neighborhood of trajectory segment).

Suppose there are two trajectory segments

T S_{x}

and

T S_{y}

in

D_{T S}

, that is

T S_{x} \in D_{T S}

and

T S_{y} \in D_{T S}

, where

D_{T S}

is the set of the discrete partitioned trajectory segments. If there has

N_{e p s} (T S_{x}) = {T S_{y} \in D_{T S} : d i s t (T S_{x}, T S_{y}) \leq e p s}

, where

e p s

is the radius of the neighbor cells,

N_{e p s} (T S_{x})

is the neighborhood of trajectory segment

T S_{x}

with

e p s

, denoted as

N_{e p s} (T S_{x})

.

From Definition 7, all of the trajectory segments, whose distance from the segment

T S_{x}

is less than

e p s

in the set

D_{T S}

, are the neighborhood of trajectory segment

T S_{x}

with

e p s

. The size of radius of the neighbor cells

e p s

can determine the size of

N_{e p s} (T S_{x})

for the trajectory segment

T S_{x}

. Next, we will discuss the procedure of adaptive parameter calibration for

e p s

.

It selects the cells with density greater than 1, that is

c e l l_{i} . s e g > 1

, where

i = 1, \dots, n

,

n

is the number of cells. Suppose the number of the selected cells with

c e l l_{i} . s e g > 1

is

M

,

c e l l_{i} . s e g

is the cell density of

c e l l_{i}

,

c e l l_{i} . T S_{x}

is the trajectory segments

T S_{x}

passing through

c e l l_{i}

. The radius of the neighbor cells

e p s

can be computed as follows:

\begin{array}{l} E X P_{e p s} (i) = \max (\sum_{p = 1}^{c e l l_{i} . s e g} \sum_{q = p + 1}^{c e l l_{i} . s e g} d i s t (c e l l_{i} . T S_{p}, c e l l_{i} . T S_{q})) \\ E X P_{a v g} = \sum_{i = 1}^{M} \frac{E X P_{e p s} (i)}{M} \\ e p s = E X P_{a v g} + \sum_{i = 1}^{M} \sqrt{\frac{(E X P_{e p s} (i) - E X P_{a v g})}{M}} \end{array}

(9)

where

E X P_{e p s} (i)

is the expected value of

e p s

for the

c e l l_{i}

, and

E X P_{a v g}

represents the average expected value of

e p s

for all of the cells. From the discussion in Section 4.1, we set

N u m_{a v g} = 2

to obtain good clustering quality. Due to the value of

N u m_{a v g}

is enough small, the distances among the different trajectory segments in each cell are very short. The maximum distance among the trajectory segments in the

c e l l_{i}

is selected as the expected value of

e p s

of the

c e l l_{i}

. The radius of the neighbor cells is the sum of the average expected value

E X P_{a v g}

and the standard deviation of all cells’ expected values. For any one cell

c e l l_{i}

, its cell density

c e l l_{i} . s e g

is constant. The computation complexity of

e p s

is

O (\log n)

, where

n

is the number of the cells.

Definition 8 (segment density).

Suppose there is one trajectory segment

T S_{x}

in

D_{T S}

, the density of

T S_{x}

is defined as the number of trajectory segments in its neighborhood, denoted as

ρ (T S_{x})

. That is

ρ (T S_{x}) = | N_{e p s} (T S_{x}) |

.

Definition 9 (core segment).

Suppose there is one trajectory segment

T S_{x}

in

D_{T S}

, and

m i n P t s

is the threshold of density of the trajectory segments. If

ρ (T S_{x}) \geq m i n P t s

, the trajectory segment

T S_{x}

is defined as the core segment of

D_{T S}

. Otherwise,

T S_{x}

is non-core segment of

D_{T S}

. The set of core segments is denoted as

D_{c o r e}

and the set of non-core segments is denoted as

D_{n o n - c o r e}

.

In the ATCGD trajectory clustering approach, the threshold value of

m i n P t s

is not fixed and may vary with the different number of the belonging cells of the trajectory segments. In the applications, if the density of the trajectory segment

T S_{x}

is not less than the mean value through the statistical results, it can be considered that the density of the segment

T S_{x}

,

ρ (T S_{x})

, can meet the requirements of trajectory clustering. For the trajectory segment

T S_{x}

, the corresponding threshold

m i n P t s

is set to

m i n P t s = N u m_{a v g} \times | B e l o n g_C e l l . T S_{i} |

. On the other hand, one trajectory segment may pass through one or more cells, and one cell can be covered by one or more trajectory segments.

N u m_{a v g}

is the average number of the trajectory segments in each cell.

N u m_{a v g}

can be further improved considering the many-to-many relationship between the

| B e l o n g_C e l l . T S_{i} |

and the

c e l l_{i} . s e g

for each segment and grid cell. The modified

N u m_{a v g}

is denoted as

N_{a v g}

and can be computed as Equation (10):

N_{a v g} = \frac{\sum_{j = 1}^{n} \frac{| B e l o n g_C e l l . T S_{j} |}{n}}{\frac{\sum_{i = 1}^{C_{n u m}} c e l l_{i} . s e g}{C_{n u m}}}

(10)

where

C_{n u m}

is the number of the cells, and

n

is the total number of the trajectory segments.

Definition 10 (directly density-reachable).

Suppose there are two trajectory segments

T S_{x}

and

T S_{y}

in

D_{T S}

, that is

T S_{x} \in D_{T S}

and

T S_{y} \in D_{T S}

. If

T S_{x} \in D_{c o r e}

and

T S_{y} \in N_{e p s} (T S_{x})

,

T S_{y}

are said to be directly density-reachable from

T S_{x}

. By Definition 10, no trajectory segments are directly density-reachable from a non-core segment.

Definition 11 (density-reachable).

Suppose there are

m

trajectory segments in

D_{T S}

, that is

T S_{1}, T S_{2}, \dots, T S_{m} \in D_{T S}

, where

m \geq 2

and

T S_{1}, T S_{2}, \dots, T S_{m - 1} \in D_{c o r e}

. If

T S_{i}

is the directly density-reachable from

T S_{i - 1}

, then

T S_{m}

is the density-reachable from

T S_{1}

.

The density-based trajectory clustering procedure includes three phases. The first phase is to map the discrete trajectory segments into the cell. Suppose there are

n

discrete representative trajectory segments obtained with the discrete trajectory partition algorithm (AD-MDL).

N u m_{a v g}

is set to 2, and the area can be divided into

n / N u m_{a v g}

cells. Then, it can calibrate two parameters

e p s

and

m i n P t s

to set the scanning radius of cells and the threshold of density of the trajectory segments and form a cluster, respectively, based on the Equations (9) and (10). The second phase is to execute the grid and density-based clustering with DBSCAN-based method. It starts with an arbitrary trajectory segment

T S_{i}

that has not been visited. The

T S_{i}

’s neighborhood is retrieved, and if its density

ρ (T S_{i})

is greater than

m i n P t s

, a cluster is started. Otherwise, the trajectory segment is labeled as noise. If the trajectory segment

T S_{i}

is found to be a dense part of a cluster, its neighborhood

N_{e p s} (T S_{i})

is also part of that cluster. All of the trajectory segments that are found within the neighborhood

N_{e p s} (T S_{i})

are added, as is their own neighborhood when they are also dense. This process continues until the density-reachable cluster is completely found. Then, a new unvisited trajectory segment

T S_{j}

is retrieved and processed, leading to the discovery of a further cluster or noise. After the trajectory clustering, the set of the candidate clusters,

S_{c l u s t e r}

, are created. However, if one candidate cluster

C_{i}

is not dense, which cannot meet the application’s requirement for the clustering quality. The last phase is to check the cardinality for each cluster. For one candidate cluster

C_{i}

, if the number of trajectory segments in the cluster

C_{i}

is not greater than

\sum_{j = 1}^{C_{n u m}} c e l l_{j} . s e g / C_{n u m}

, where

C_{n u m}

is the number of the cells, the cluster

C_{i}

should be the final cluster and be removed from the set of the candidate clusters.

Based on the procedure of the density-based trajectory clustering, it can be found that a trajectory that is neither a core segment nor directly-reachable is called as a noise segment. A cluster should satisfy two properties: all trajectory segments within the cluster are mutually density-reachable; and if a trajectory segment is density-reachable from any segment of the cluster, it is part of the cluster as well.

The density-based trajectory clustering algorithm can be expressed in pseudo-code as follows.

Algorithm 2. The Density-Based Trajectory Clustering Algorithm.

Input: The set of the discrete trajectory segments $D_{T S} = {T S_{1}, T S_{2}, \dots, T S_{n}}$
Output: the clustering results
// map the trajectory segments into the cells
1: the area is divided into $n / N u m_{a v g}$ cells and the cells are covered by one or more than trajectory segments
2: calibrate two parameters $e p s$ and $N_{a v g}$ to set the radius and the threshold of segments density based on Equations (9) and (10)
// trajectory clustering based on the density
3: all of the trajectory segments in the $D_{T S}$ as unclassified and $k = 0$
4: while ( $D_{T S} \neq \emptyset$ )
5: select any one trajectory segment $T S_{i}$ in the $D_{T S}$ , and delete it
6: if ( $T S_{i}$ is unclassified) then
7: $D_{t m p} = N_{e p s} (T S_{i})$
8: if ( $| N_{e p s} (T S_{i}) | \geq N_{a v g} \times | B e l o n g_C e l l . T S_{i} |$ ) then
9: $T S_{i} . c i d = k$
10: while ( $D_{t m p} \neq \emptyset$ ) // expand the current cluster
11: select any one trajectory segment $T S_{j}$ in the $D_{t m p}$ and delete it
12: if ( $| N_{e p s} (T S_{j}) | \geq N_{a v g} \times | B e l o n g_C e l l . T S_{j} |$ ) then
13: $D_{t m p} = D_{t m p} \cup N_{e p s} (T S_{j})$
14: if ( $T S_{j}$ is unclassified or noise) then
15: $T S_{j} . c i d = k$
16: else
17: Mark $T S_{i}$ is noise
18: $k = k + 1$
// check the trajectory cardinality
19: for each cluster $C_{i}$ in the set of clusters $S_{c l u s t e r}$
20: if ( $| C_{i} | \leq \sum_{j = 1}^{C_{n u m}} c e l l_{j} . s e g / C_{n u m}$ )
21: remove $C_{i}$ from the set $S_{c l u s t e r}$
22: Return the final set of clusters.

From Algorithm 2, the density-based trajectory cluster algorithm includes three phases. From line 1 to line 2, the area is divided into the appropriate number of cells and the segments are mapped into the corresponding cells. Meanwhile, it executes the adaptive parameter calibration for

e p s

and

m i n P t s

. The complexity of the first phase is

O (n)

. The clustering phase is from the line 3 to line 18, which adopts the DBSCAN-based method to cluster the discrete segments with the values of adaptive calibrated parameters, and get the candidate clusters. The complexity of clustering procedure is

O (n \log n)

. To further check the results of clustering, it checks the density of each cluster. If the density of cluster is not greater than the average density, the candidate cluster should be removed, as shown from line 19 to line 22. As a whole, the complexity of the trajectory clustering based on the density is

O (n \log n)

.

5. Performance Evaluation

5.1. Experimental Setup

To evaluate the clustering performance of proposed trajectory cluster approach-ATCGD, two data sets are introduced. One is a series of randomly generated trajectories (hereafter referred to as Random Trajectory, RT), as shown in Figure 6a,b. The other is hurricane trajectory data in the Atlantic Ocean provided by American Weather Information System Company, referred to as Hurricane Track (HT) as shown in Figure 6c. RT data includes two patterns: RT1 and RT2. RT1 has about 100 trajectories and 2000 trajectory segments. Those trajectories can be clearly divided into four groups from top to bottom. RT2 has about 100 trajectories and 7000 trajectory segments, and is more complicated than RT1. The trajectories in RT2 are also divided into four groups. The trajectories in the RT1 and RT2 sets are similar to the trajectory data from vehicle movement, thus, RT1 and RT2 can represent a data set from a real application. The HT data set includes the hurricane track information about latitude, longitude, and the highest wind speed from 1851. The frequency of sampling is once for every 6 h. The experiments extract 100 hurricane trajectories with 2465 trajectory segments from 1940, which includes the latitude and longitude of the hurricane track.

To further evaluate the clustering quality of the proposed ATCGD approach, one metric

Q M e a s u r e

is introduced as the standard to evaluate the clustering effect [10].

Q M e a s u r e

includes two parts: one is the sum of squared error (

S S E

) and the other is the penalty value of noise. The

Q M e a s u r e

can be calculated as follows:

Q M e a s u r e = \sum_{i = 1}^{N c l u s t e r} (\frac{1}{2 | C_{i} |} \sum_{x \in C_{i}} \sum_{y \in C_{i}} d i s t {(x, y)}^{2}) + \frac{1}{2 | D_{n} |} \sum_{p \in D_{n}} \sum_{q \in D_{n}} d i s t {(p, q)}^{2}

(11)

where

D_{n}

is the noise set,

N_{c l u s t e r}

is the number of the cluster of the trajectory segments, and

C_{i}

represents the

i^{t h}

cluster of trajectory segments.

| C_{i} |

is the number of the trajectory segments in the

i^{t h}

cluster.

| D_{n} |

is the number of the noise trajectories. The sum of squared error (

S S E

) can be calculated with

\sum_{i = 1}^{N c l u s t e r} (\frac{1}{2 | C_{i} |} \sum_{x \in C_{i}} \sum_{y \in C_{i}} d i s t {(x, y)}^{2})

, which reflects the distances between the different trajectory segments in each cluster. The smaller value of

e p s

is and the greater value of

m i n P t s

is, it can obtain smaller

S S E

. In the applications, if it can calibrate appropriate values of two parameters

e p s

and

m i n P t s

, it can exhibit good cluster quality. At the same time, the noise trajectory data are considered when calculating the value of

Q M e a s u r e

.

\frac{1}{2 | D_{n} |} \sum_{p \in D_{n}} \sum_{q \in D_{n}} d i s t {(p, q)}^{2}

is used to calculate the sum of squared distances between the any noise trajectory segments, which is as the penalty. Therefore, the value of

Q M e a s u r e

and the quality of the clustering exhibits the negative correlation. The smaller the metric value of

Q M e a s u r e

is, the higher quality of the clustering is.

5.2. Clustering Performance

Figure 7 shows the clustering results with the RT1, RT2 and HT data sets, respectively. As shown in Figure 7, the different clusters are represented with different colors. From Figure 7a, the proposed ATCGD approach can cluster those trajectory data into four groups with high accuracy, which is in accordance with the expectation. Compared to the original trajectory data, it can be found that some trajectory segments are recognized as the noise. Figure 7b illustrates the clustering results with RT2. In contrast with RT1, the trajectories of RT2 exhibit apparent non-smoothness. This reveals that RT2 has greater difficulty than RT1 in clustering, but the ATCGD approach can still cluster those trajectories into four different groups. Therefore, the ATCGD can effectively be applied to the vehicle trajectory data, which has high similarity to the RT data set. Figure 7c shows the HT clustering results. From Figure 6c, the trajectories in the HT data set are much more complicated than those in the RT data set. The ATCGD approach can classify those hurricane data into two clusters, which conforms to the expectation. It implies that the ATCGD approach can also provide effective clustering for complex trajectory data.

5.3. Comparison Analysis

To further quantify the accuracy of the ATCGD clustering approach, we compare the ATCGD approach with TRACLUS in terms of

Q M e a s u r e

. Due to the slight differences about the distance calculation of the trajectory segments between the ATCGD and TRACLUS, it adopts the proposed distance computation equation between the trajectory segments in this paper, shown in Equation (3), to calculate

Q M e a s u r e

.

In the experiments, the different number of trajectories from RT and HT data sets are selected to evaluate the clustering quality. Thus 100, 200, 300, 400 hurricane track trajectories since 1940 from the HT data set are randomly selected and denoted as HT-100, HT-200, HT-300, and HT-400, respectively. Meanwhile, we apply the parameter calibration method proposed in the TRACLUS algorithm to conduct the experiments for twenty times and get the 20 different combination results of the two parameters

e p s

and

m i n P t s

. The minimum of combination results, that is the minimum

Q M e a s u r e

, is taken as the results of the TRACLUS algorithm. The experimental results of the ATCGD and TRACLUS algorithm are listed in Table 1.

From Table 1, it can be seen that the run time of TRACLUS algorithm is much higher than that of the ATCGD method. Meanwhile, the difference in the run times becomes greater between the two algorithms as the data size increases. The reason is that the ATCGD approach adopts the belonging cells and adjacent cells to determine the candidate set, which can be used to compute the neighborhood of

e p s

. That method can greatly improve the efficiency and reduce the execution time of the trajectory clustering. The computation complexity of the ATCGD approach is

O (n \log n)

based on the analysis in the Section 4.2. On the contrary, without the index scheme, the computation complexity of the TRACLUS algorithm is up to

O (n^{2})

, where

n

is the number of trajectory points. On the other hand, as to the metric of the clustering quality

Q M e a s u r e

, the ATCGD approach does not appear to be much different from the TRACLUS algorithm. The ATCGD can obtain slightly better

Q M e a s u r e

than the TRACLUS algorithm. In most cases, the value of

Q M e a s u r e

in the ATCGD is smaller than that in the TRACLUS, except the HT-400 and RT2 data sets. The reason is that the ATCGD approach adopts the adaptive parameters calibration method to obtain the values close to the optimum, thus it can exhibit the good quality of clustering with the lower computation cost. While the TRACLUS algorithm can obtain the near-optimal combination results of two parameters

e p s

and

m i n P t s

through the large number of parameters calibrations, which results in the high accuracy and high computation complexity. If the combination results of two parameters are inappropriate, the TRACLUS algorithm will obtain the poor quality of trajectory clustering.

5.4. Parameter Sensitive Analysis

In order to further provide the quantitative analysis of the parameter values of

N u m_{a v g}

, the HT-100, HT-200, HT-300, and HT-400 data sets are used to compute the quality of clustering metric

Q M e a s u r e

with the different values of

N u m_{a v g}

. The experimental results are shown in the Figure 8. When

N u m_{a v g} = 2

, the value of

Q M e a s u r e

is minimum for all of the data sets. When

N u m_{a v g} < 2

, the value of

Q M e a s u r e

decreases with the increase of

N u m_{a v g}

. On the contrary, when

N u m_{a v g} > 2

, the value of

Q M e a s u r e

increases with the increase of

N u m_{a v g}

. Based on the experimental results, when the ATCGD approach sets

N u m_{a v g}

to 2, it can get better quality of the trajectory clustering.

To verify the correctness of the parameters calibration, two parameters

e p s

and

N_{a v g}

(

m i n P t s

can be computed based on

N_{a v g}

) are selected for the sensitivity analysis. The data sets are still HT-100, HT-200, HT-300, and HT-400. We compare the different values of

Q M e a s u r e

with different combination of

e p s

and

N_{a v g}

as well as the adaptive calibration values of those two parameters

e p s_{a}

and

N_{a v g_a}

. The value range of is

[⌊ e p s_{a} - 3 ⌋, ⌊ e p s_{a} + 3 ⌋]

and the step is 1. The value range of

N_{a v g}

is

[N_{a v g_a} - 0.6, N_{a v g_a} + 0.6]

and the step is 0.2. Figure 9 illustrates the distributions of

Q M e a s u r e

with different values of

e p s

and

N_{a v g}

in the data sets of HT-100, HT-200, HT-300, and HT-400. As shown in Figure 9, the red points are the results of adaptive parameter calibration for

e p s_{a}

and

N_{a v g_a}

; the green points are the results of different combinations with different values of

e p s

and

N_{a v g}

From Figure 9, it can be found that there is a large variation range of

Q M e a s u r e

with the different combinations of two parameters’ values, when adopting the TRACLUS algorithm. While the ATCGD approach can get the small value of

Q M e a s u r e

. The reason is that it adopts the adaptive parameters calibration method to compute the value of

Q M e a s u r e

. On the other hand, if the difference between the values of

Q M e a s u r e

by adopting the adaptive parameters calibration and the optimal combination is smaller, the ATCGD approach can obtain higher quality of trajectories clustering. Moreover, although the results of the adaptive parameter calibration are not optimal, in most cases, the difference between the values of

Q M e a s u r e

with the adaptive calibration and the optimal combination is less than 5%. It indicates that the adaptive calibrated parameters

e p s

and

N_{a v g}

can gain good clustering effects.

6. Conclusions

Clustering analysis is one of the most important issues in trajectory data mining. Trajectory clustering can be widely applied in hotspots detection, mobile pattern analysis, urban transportation control, hurricane prediction, etc. Many trajectory clustering algorithms have been proposed to obtain good clustering performance. Nonetheless, most available trajectory clustering algorithms depend on calibration of one or multiple parameters. Meanwhile, the values of these parameters have a great influence on the effect of clustering. To reduce the complexity and overhead of parameter calibration in trajectory clustering, an Adaptive Trajectory Clustering approach based on Grid and Density, ATCGD, was proposed in this paper. ATCGD firstly divides the trajectory data into multiple discrete segments through the proposed the average angular difference-based MDL (AD-MDL) algorithm. All of the discrete segments are mapped into the corresponding cells. Then, it calculates the average distance among the different segments in each cell, and the average number of the trajectory segments in each cell. Finally, adopting a DBSCAN-based approach, ATCGD carries out an adaptive parameter calibration based on the above data to realize effective and accurate trajectory clustering. With two data sets from random trajectories and hurricane trajectories on the Atlantic Ocean, we evaluate the performance of the ATCGD approach on clustering quality and cost. The experimental results indicate that although the results of the adaptive parameter calibration are not optimal, in most cases, the difference between the adaptive calibration and the optimal is less than 5%, while the run time of clustering can be reduced by about 95%.

Acknowledgments

This study was supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2016YFC0400910, 2017ZX07104001, 2013BAB06B04; the Fundamental Research Funds for the Central Universities under Grant No. 2015B22214, 2017B20914, 2017B16814, 2017B42214; NSF-China and Guangdong Province Joint Project: U1301252. The authors are grateful to the reviewers for their comments which greatly improved the quality of the paper.

Author Contributions

All four authors have contributed to the work presented in this paper. Yingchi Mao and Haishi Zhong, Hai Qi formed the initial idea. Yingchi Mao, Haishi Zhong, and Ping Ping developed the overall theoretical structure of the research. Yingchi Mao and Haishi Zhong conceived and designed the experiments. Haishi Zhong performed the experiments; Yingchi Mao, Haishi Zhong and Xiaofang Li analyzed the experimental results. Hai Qi and Xiaofang Li provided the experiment dataset. All authors worked collaboratively on writing main text paragraph.

Conflicts of Interest

The authors declare no conflict of interest.

References

Melnychuk, M.C.; Welch, D.W.; Walters, C.J. Spatio-temporal migration patterns of Pacific salmon smolts in rivers and coastal marine waters. PLoS ONE 2010, 5, e12916. [Google Scholar] [CrossRef] [PubMed]
Andraca-Gómez, G.; Ordano, M.; Boege, K.; Domínguez, C.A.; Piñero, D.; Pérez-Ishiwara, R.; Pérez-Camacho, J.; Cañizares, M.; Fornoni, J. A potential invasion route of Cactoblastis cactorum within the Caribbean region matches historical hurricane trajectories. Biol. Invasions 2015, 17, 1397–1406. [Google Scholar] [CrossRef]
Yuan, J.; Zheng, Y.; Zhang, C.; Xie, W.; Xie, X.; Sun, G.; Huang, Y. T-drive: Driving directions based on taxi trajectories. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; pp. 99–108. [Google Scholar]
Serna, C.G.; Ruichek, Y. Dynamic Speed Adaptation for Pat Tracking Based on Curvature Information and Speed Limits. Sensors 2017, 17, 1383. [Google Scholar] [CrossRef] [PubMed]
Yue, Y.; Zhuang, Y.; Li, Q.; Mao, Q. Mining Time-dependent Attractive Areas and Movement Patterns from Taxi Trajectory Data. In Proceedings of the 2009 International Conference on Geoinformatics, Fairfax, VA, USA, 12–14 August 2009; pp. 1–6. [Google Scholar]
Liu, S.; Liu, Y.; Ni, L.M.; Fan, J.; Li, M. Towards mobility-based clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 919–928. [Google Scholar]
Ma, L.B.; Li, P. Spatio-temporal Trajectory Clustering Based on Automatic Subspace Clustering Algorithm. Geogr. Geo-Inf. Sci. 2014, 30, 7–11. [Google Scholar]
Gaffney, S.; Smyth, P. Trajectory clustering with mixtures of regression models. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 63–72. [Google Scholar]
Nanni, M.; Pedreschi, D. Time-focused clustering of trajectories of moving objects. J. Intell. Inf. Syst. 2006, 27, 267–289. [Google Scholar] [CrossRef]
Kalnis, P.; Mamoulis, N.; Bakiras, S. On discovering moving clusters in spatio-temporal data. In Proceedings of the International Symposium on Spatial and Temporal Databases, Angra dos Reis, Brazil, 22–24 August 2005; pp. 364–381. [Google Scholar]
Lee, J.G.; Han, J.; Whang, K.Y. Trajectory clustering: A partition-and-group framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 11–14 June 2007; pp. 593–604. [Google Scholar]
Ram, A.; Sharma, A.; Jalal, A.S.; Agrawal, A.; Singh, R. An Enhanced Density Based Spatial Clustering of Applications with Noise. In Proceedings of the IACC 2009, IEEE International Advance Computing Conference, Patiala, India, 6–7 March 2009; pp. 1475–1478. [Google Scholar]
Peng, C.; Guoyou, S.; Shuang, L.; Jian, Y. An incremental density based spatial clustering of application with noise algorithm based on partition index. J. Comput. Theor. Nanosci. 2016, 13, 10273–10280. [Google Scholar] [CrossRef]
Hassanin, M.F.; Hassan, M.; Shoeb, A. DDBSCAN: Different Densities-Based Spatial Clustering of Applications with Noise. In Proceedings of the International Conference on Control, Instrumentation, Communication and Computational Technologies, Kumaracoil, India, 18–19 December 2015. [Google Scholar]
Xia, L.N.; Jing, J.W. SA-DBSCAN: A self-adaptive density-based clustering algorithm. J. Grad. Sch. Chin. Acad. Sci. 2009, 26, 530–538. [Google Scholar]
Chen, X.; Liu, W.; Qiu, H.; Lai, J. APSCAN: A parameter free algorithm for clustering. Pattern Recognit. Lett. 2011, 32, 973–986. [Google Scholar] [CrossRef]
Zhang, L.; Xu, Z.; Si, F. GCMDDBSCAN: Multi-density DBSCAN Based on Grid and Contribution. In Proceedings of the International Conference on Dependable, Autonomic and Secure Computing, Chengdu, China, 21–22 December 2013; pp. 502–507. [Google Scholar]
Li, Z.; Lee, J.G.; Li, X.; Han, J. Incremental clustering for trajectories. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications, Tsukuba, Japan, 1–4 April 2010; pp. 32–46. [Google Scholar]
Khairat, H.; Sitanggang, I.S.; Nuryanto, D.E. Clustering Haze Trajectory of Peatland Fires in Riau Province Using K-Means Algorithm. EES 2017, 58, 012059. [Google Scholar] [CrossRef]
Li, J.; Yang, M.; Liu, N.; Wang, Z.; Yu, L. A Trajectory Data Clustering Method Based on Dynamic Grid Density. Int. J. Grid Distrib. Comput. 2015, 8, 1–8. [Google Scholar] [CrossRef]
Ebisch, K. Short Note: A Correction to Douglas-Peucker Line Generalization Algorithm. Comput. Geosci. 2002, 28, 995–997. [Google Scholar] [CrossRef]
Meratnia, N.; Rolf, A. Spatiotemporal compression techniques for moving point objects. In Proceedings of the 9th International Conference on Extending Database Technology, Crete, Greece, 14–18 March 2004; pp. 765–782. [Google Scholar]
Muckell, J.; Olsen, P.W.; Hwang, J.H.; Lawson, C.T.; Ravi, S.S. Compression of trajectory data: A comprehensive evaluation and new approach. GeoInformatica 2014, 18, 435–460. [Google Scholar] [CrossRef]
Zheng, Y.; Xie, X.; Ma, W.Y. GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Eng. Bull. 2010, 33, 32–39. [Google Scholar]

Figure 1. The illustration of the proposed ATCGD approach.

Figure 2. Illustration of the continuous representative segments and the discrete representative segments. (a) The continuous trajectory segments; and (b) The discrete trajectory segments.

Figure 3. The illustration of distance measure between the two trajectory segments.

Figure 4. The diagram of the discrete representative trajectory segment.

Figure 5. Schematic diagram of the belonged Cell and adjacent Cell.

Figure 6. The trajectory in the RT and HT datasets. (a) RT1; (b) RT2; and (c) HT.

Figure 7. The clustering results on the RT and HT dataset, (a) RT1; (b) RT2; and (c) HT.

Figure 8.

Q M e a s u r e

values under different

N u m_{a v g}

.

Figure 8.

Q M e a s u r e

values under different

N u m_{a v g}

.

Figure 9. Experimental results of parameter adaptive analysis, (a) HT-100; (b) HT-200; (c) HT-300; and (d) HT-400.

Table 1. Comparison of clustering quality between ATCGD and TRACLUS.

	TRACLUS		ATCGD
	QMeasure	Run Time (s)	QMeasure	Run Time (s)
HT-100	1,486,875	1.25	1,140,856	0.14
HT-200	5,416,222	5.84	4,327,626	0.23
HT-300	8,164,510	15.75	7,602,455	0.44
HT-400	9,741,195	26.34	10,682,513	0.61
RT1	461,437	1.07	39,426	0.09
RT2	164,351	21.75	176,269	0.57

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mao, Y.; Zhong, H.; Qi, H.; Ping, P.; Li, X. An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis. Sensors 2017, 17, 2013. https://0-doi-org.brum.beds.ac.uk/10.3390/s17092013

AMA Style

Mao Y, Zhong H, Qi H, Ping P, Li X. An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis. Sensors. 2017; 17(9):2013. https://0-doi-org.brum.beds.ac.uk/10.3390/s17092013

Chicago/Turabian Style

Mao, Yingchi, Haishi Zhong, Hai Qi, Ping Ping, and Xiaofang Li. 2017. "An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis" Sensors 17, no. 9: 2013. https://0-doi-org.brum.beds.ac.uk/10.3390/s17092013

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

An Adaptive Trajectory Clustering Method Based on Grid and Density in Mobile Pattern Analysis

Abstract

1. Introduction

2. Related Work

2.1. Trajectory Clustering Approaches

2.2. Trajectory Partition Methods

3. Discrete Partitioning of Trajectories

3.1. Distance Measure Between the Trajectory Segments

3.2. Discrete Representative Trajectory Segments

3.3. Discrete Trajectory Partition Algorithm

4. Trajectory Clustering Based on Grid and Density

4.1. Grid Partition

4.2. Trajectory Clustering Algorithm

5. Performance Evaluation

5.1. Experimental Setup

5.2. Clustering Performance

5.3. Comparison Analysis

5.4. Parameter Sensitive Analysis

6. Conclusions

Acknowledgments

Author Contributions

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI