4.1. Preliminaries
Let us illustrate some basic concepts and notations, partially from [
1]:
Definition 1 (Power set). The power set of a set A is the set of all possible subsets of A, and is denoted with . denotes the set of all the non-empty subsets of A: .
Definition 2 (Multiset). A multiset is an extension of the concept of set that keeps track of the cardinality of each element. is the set of all multisets over some set A. Multisets are denoted with square brackets, e.g., , or with the cardinality of the elements as superscript, e.g., . We denote the empty multiset with . The operator retrieves the cardinality of an element of the multiset, e.g., , , . Over multisets we define , and . The multiset union is the multiset b such that for all x we have .
Definition 3 (Sequence and permutation). Given a set X, a finite sequence over X of length n is a function , and is written as . For any sequence s we define , , and . A permutation of the set X is a sequence that contains all elements of X without duplicates: , , and for all and for all , . We denote with all such permutations of set X. We overload the notation for sequences: given a sequence , we will write in place of .
Definition 4 (Transitive relation and correct evaluation order). Let X be a set of objects and R be a binary relation . R is transitive if and only if for all we have that . A correct evaluation order is a permutation of the elements of the set X such that for all we have that .
Definition 5 (Strict partial order). Let S be a set of objects. Let . A strict partial order is a binary relation that have the following properties:
Definition 6 (Directed graph). A directed graph is a tuple where V is the set of vertices and is the set of directed edges. The set is the graph universe. A path in a directed graph is a sequence of vertices p such that for all we have that . We denote with the set of all such possible paths over the graph G. Given two vertices , we denote with the set of all paths beginning in v and ending in : . v and are connected (and is reachable from v), denoted by , if and only if there exists a path between them in G: . Conversely, . We drop the superscript G if it is clear from the context. A directed graph G is acyclic if there exists no path satisfying .
Definition 7 (Topological sorting)
. Let be an acyclic directed graph. A topological sorting [11] is a permutation of the vertices of G such that for all we have that . We denote with all such possible topological sortings over G. Definition 8 (Transitive reduction)
. A transitive reduction [12] of a graph is a graph with where every pair of vertices connected in is not connected by any other path: for all , . is the graph with the minimal number of edges that maintain the reachability between edges of G. The transitive reduction of a directed acyclic graph always exists and is unique [12]. This paper proposes an analysis technique on
uncertain event logs. These execution logs contain information about uncertainty explicitly associated with event data. A taxonomy of different types of uncertain event logs and attribute uncertainty has been described in [
2]; we will refer to the notion of
simple uncertainty, which includes uncertainty without probabilistic information on the control-flow perspective: activities, timestamps, and indeterminate events.
Definition 9 (Universes). Let be the set of all the event identifiers. Let be the set of all case ID identifiers. Let be the set of all the activity identifiers. Let be the totally ordered set of all the timestamp identifiers. Let , where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.
Definition 10 (Simple uncertain events). is a simple uncertain event, where is its event identifier, is the set of possible activity labels for e, and are the lower and upper bounds for the value of its timestamp, and o indicates if it is an indeterminate event. Let be the set of all simple uncertain events. Over the uncertain event we define the projection functions , , and .
Definition 11 (Simple uncertain traces and logs). is a simple uncertain trace if for any , and all the event identifiers are unique. denotes the universe of simple uncertain traces. is a simple uncertain log if all the event identifiers in the log are unique.
Definition 12 (Strict partial order over simple uncertain events)
. Let be two simple uncertain events. is an order defined on the universe of strongly uncertain events as: Definition 13 (Order-realizations of simple uncertain traces). Let be a simple uncertain trace. An order-realization is a permutation of the events in σ such that for all we have that , i.e., is a correct evaluation order for σ over , and the (total) order in which events are sorted in is a linear extension of the strict partial order . We denote with the set of all such order-realizations of the trace σ.
A necessary step to allow for analysis of simple uncertain traces is to obtain their behavior graph. A behavior graph is a directed acyclic graph that synthesizes the information regarding the uncertainty on timestamps contained in the trace.
Definition 14 (Behavior graph). Let be a simple uncertain trace. Let the identification function be a bijection between the events in σ and the first natural numbers. A behavior graph is the transitive reduction of a directed graph , where is defined as:
The set of topological sortings of a behavior graph corresponds to the set of all the order-realizations of the trace σ:
A technical note: this definition for the nodes of the behavior graph is slightly different from the one in [
2], to simplify the notation in algorithms. The two definitions are functionally identical.
Figure 6 and
Figure 7 show the transitive reduction operation on the running example.
The semantics of a behavior graph can efficaciously communicate time and order information concerning the time relationships among events in the corresponding uncertain trace in a compact manner. For a behavior graph and two events , , holds if and only if is immediately followed by for some possible values of the timestamps of the events in the trace. A consequence of this fact is that if a pair of events in the graph are unreachable, they might have occurred in any order.
Definition 14 is meaningful and clear from a theoretical point of view. It rigorously defines a behavior graph and the semantics of its parts. Although helpful to understand the function of behavior graphs, obtaining them from process traces following this definition—that is, using the transitive reduction—is inefficient and slow. This hinders the analysis of logs with a large number of events, and with longer traces. It is nonetheless possible to build behavior graphs from process traces in a faster and more efficient way.
4.2. Efficient Construction of Behavior Graphs
The set of steps to efficiently create a behavior graph from an uncertain trace is separated into two distinct phases, described by Algorithms 1 and 2. An uncertain event e is associated with a time interval which is determined by two values: minimum and maximum timestamp of that event and . If an event e has a certain timestamp, we have that .
We will examine here the effect of Algorithms 1 and 2 on a running example, the process trace shown in
Table 4. Notice that in this running example, no uncertainty on activity labels nor indeterminate events are present: this is because of the fact that the topology of a behavior graph only depends on the (uncertain) timestamps in the events belonging to the corresponding trace.
Algorithm 1:TimestampList()
|
|
Algorithm 2:BehaviorGraph(TimestampList())
|
|
The construction of the graph relies on a preprocessing step shown in Algorithm 1, where a support list is created (lines 4–8). Every entry in this list is a tuple of four elements. For each event e in the trace, we insert two entries in the list—one for each timestamp and appearing in a trace. The four elements in each tuple contained in the list are:
an identifier, which in the list construction is an integer representing the rank of the uncertain event by minimum timestamp (computed in line 3);
the activity labels associated with the event ;
the attribute , which will carry the information regarding indeterminate events;
the type of timestamp that generated this entry—if it is a minimum or maximum of an interval.
As we can see, the list is designed to contain all information about an uncertain event except the values of minimum and maximum timestamps, which we use to sort the list (line 9) and then discard prior to returning the list (lines 10–15).
The events of the trace in
Table 4 are represented in the list
by entries shown in
Table 5. These entries are then sorted by Algorithm 1 yielding the following list
:
One of the purposes the list
serves is gathering the structural information to create the behavior graph; in fact, visiting the list in order is equivalent of sweeping the events of the trace on the time dimension, encountering each timestamp (minimum or maximum) sorted through time. We can visualize this on the Gantt diagram representation of the trace of
Table 4, visible in
Figure 8.
Every segment representing an uncertain event in the diagram is translated by TimestampList into two entries in a sorted list, representing the two extremes of the segment. Events without an uncertain timestamp collapse into a single point in the diagram, and their corresponding two entries in the list are characterized by the same timestamp.
Now, let us examine Algorithm 2. The idea leading the algorithm is to analyze the time relationship among uncertain events in a more precise manner, as opposed to adding a large number of edges to the graph and then removing them via transitive reduction. This is attained by searching all the viable successors of each event in the sorted timestamp list . We scan the list with two nested loops, and we use the inner loop to look for successors of the entry selected by the outer loop. According to the semantics of behavior graphs, events with overlapping intervals as timestamps must not be connected by a path; thus, we draw outgoing edges from an event only when, reading the list, we arrive at a point in time in which the event has certainly occurred. This is the reason outgoing edges are not drawn when inspecting minimum timestamps (line 6) and incoming edges are not drawn when inspecting maximum timestamps (line 10).
First, we initialize the set of nodes with all the triples in the entries of , and we initialize the edges with an empty set (lines 1–2). For each maximum timestamp that we encounter in the list, we start searching for successors in the following entries (lines 3–9), so we proceed in looking for the successors of only if .
If, while searching for successors of the entry
, we encounter the entry
corresponding to a minimum timestamp (
), we connect
and
in the graph, since their timestamps do not have any possible value in common. The search for successors must continue, since it is possible that other events took place before the maximum timestamp of the event corresponding to
. This configuration occurs for events
and
in
Table 4. As can be seen in
Figure 8,
can indeed follow
, but the still undiscovered event
is another possible successor for
.
If the entry
corresponds to a maximum timestamp (line 12), so
, there are two separate situations to consider. Case 1:
was not already connected to
. Then, the timestamps of the events corresponding to
and
overlap with each other—if they did not, the two nodes would have already been connected, since we would have encountered
from
before encountering
. Thus,
must not be connected to
and the search must continue. Events
and
are an example: when the maximum timestamp of
is encountered during the search for the successor of
, the two are not connected, so the search for a viable successor of
has to continue. Case 2:
and
are already connected. This means that we had already encountered
during the search for the successors of
. Since the entire time interval representing the possible timestamp of the event associated with
is detected after the occurrence of
, there are no further events to consider as successors of
and the search stops (line 13). In the running example, this happens between
and
: when searching for the successors of
, we first connect it with
when we encounter its minimum timestamp; we then encounter its maximum timestamp, so no other successive event can be a successor for
. This concludes the walkthrough of the procedure, which shows why Algorithms 1 and 2 can be used to correctly compute the behavior graph of a trace. The behavior graph of the trace in
Table 4 obtained through this procedure is shown in
Figure 9.
Let us now prove, in more formal terms, the correctness of these algorithms. We will show that the procedures BehaviorGraph and TimestampList are able to construct a behavior graph with the semantics illustrated in Definition 14.
Theorem 1 (Correctness of the behavior graph construction). Let be an uncertain trace. Let BehaviorGraph(TimestampList(σ)) be the behavior graph of σ obtained through Algorithms 1 and 2. The graph follows the behavior graph semantics: for all pairs of events and such that , , , , , , we have that the node is connected to the node if and only if and there exists no event such that . Thus, .
Proof. Let us first define a suitable function for the behavior graph using the list created in TimestampList(σ). For all events and for such that , we define . Since is just an enumeration of the events in , it is trivially bijective.
Assume . By construction, we have that . The checks in line 6 and line 10 only allow for edges to be linked from entries of type ‘MAX’ to entries of type ‘MIN’ that only appear in a later position in the list . Thus, the configuration is a strict prerequisite for and to be connected: .
Assume
, and that the algorithm is currently searching the successors for the entry
. Eventually, the inner loop will consider as a successor the entry
, and since it is of type ‘MIN’,
and
will necessarily be connected unless the algorithm executes the
break at line 13. To execute it, the algorithm needs to find a list entry
such that there already exist an arc between
and
, and this is only possible if
has been encountered while searching for successors of
. This implies that
which by construction of
, is only possible if there exist some
such that
□
As mentioned earlier, the procedure of constructing a behavior graph has been structured in two different algorithms specifically to enable further optimization in processing uncertain process trace. This becomes evident once we consider the problem of converting in behavior graphs all the traces in an event log, as opposed as one single uncertain trace.
First, it is important to notice that different uncertain traces can have the same list . Similarly to directly-follows relationships in more classical process mining, which can ignore the amount of time in absolute terms elapsed between two consecutive events, specific values of timestamps in an uncertain trace are not necessarily meaningful with respect to the connection in the behavior graph; their order, conversely, is crucial.
This fact enables further optimization at the log level. The construction of the list in (TimestampList(σ)) is engineered in a way that allows for computing the behavior graph without direct lookup to the events in the trace. This implies that it is possible to extract a multiset of lists from the event log, and to compute the conversion to behavior graph only for the set of lists induced by this multiset. This allows the saving of computation time in converting an entire event log to behavior graphs; furthermore, it enables a more compact representation of the log in memory, since we only need to store a smaller number of graphs to represent the whole log.
The procedure to efficiently convert an event log into graphs is detailed in Algorithm 3.
These considerations allow us to extend to the uncertain scenario some concepts that are essential in classical process mining. First, we can now derive the definition of variant, highly important for preexisting process mining techniques, to uncertain event data.
Algorithm 3:ProcessUncertainLog |
|
Definition 15 (Uncertain variants). Let be a simple uncertain event log. The variants of L denoted by , are the multisets of behavior graphs for the uncertain traces in L, and are computed with ProcessUncertainLog(L.).
The computational advantage in representing a log through a multiset of behavior graphs is evident in the procedure described in Algorithm 2. We see that all data necessary to the creation of a behavior graph is contained in the list , fact that justifies the log representation method illustrated in Algorithm 3.
Lemma 1. Two uncertain traces and belong to the same variant, and share the same behavior graph, if and only if they result in the same timestamp list :TimestampListTimestampList.
Another central concept in process mining is the so-called control-flow perspective of event data. In certain process traces, where timestamps have a total order, events have a single activity label and no event is indeterminate, the control-flow information is represented by a sequence of activity labels sorted by timestamp. Although there are many analysis approaches that also account for other perspectives (e.g., the performance perspective that considers the duration of events and their distance in time, or the resource perspective that accounts for the agents that execute the activities), a vast amount of process mining techniques, including most popular algorithms for process discovery and conformance checking, rely only on the control-flow perspective of a process. Analogously, behavior graphs carry over the control-flow information of an uncertain trace: instead of describing the flow of events like their certain counterpart, the behavior graph describes all possible flows of events in the uncertain trace.