After carefully exploring some of the previous watermarking schemes [
17,
18,
22,
25], we observed that, in most of the numerical watermarking schemes, the fractional values of the numerical attributes in the relational databases are modified to carry the watermark [
17]. However, when the numerical attributes do not exist in the relational database, no watermarking can be applied. Some watermarking schemes have been proposed for categorical attributes [
15]. Moreover, the existing schemes can identify the hidden watermark successfully when only 10% of the tuples of the watermarked relational database
R’W were altered. In this section, we propose a new relational database-based watermarking scheme, called LRW-CRDB, that uses categorical attributes. First, the watermark is calculated based on the information in the relational database, while referring to two embedding keys,
K1 and
K2. Then, the generated watermark is hidden in the chosen categorical attributes, so that later they can be used to claim ownership of the relational database by authorized users. In the proposed scheme, we embed only the watermark into the categorical attributes of the relational database, instead of into numerical attributes. Following the same requirements used in previous schemes [
17,
18,
22,
25], a small change in the value of the categorical attributes is acceptable in the proposed scheme, and the slight modification will not influence the entire relational database’s usability.
Figure 1 shows a flowchart of our proposed LRW-RDB scheme.
Assume that the relational database is the data set
R and is defined as
R(
PK,
A0,…,
Aα−1). In the data set
R,
PK is defined as the primary key attribute, and the other attributes,
A0,
A1 …,
Aα−1, are candidates for watermark embedding. For simplicity, assume that all of the
α attributes are categorical attributes and that their values are size values of fashion products, i.e., T-shirts, shoes. These size values are in the range of the set
OriginalSet = {S, M, X, XL}. Let us set the index of each value in the
OriginalSet to be from 0 to 3, respectively. There is a strong assumption in our proposed LRW-RDB scheme, that is that the primary key
PK cannot be modified by malicious attackers because the primary key contains valuable information, and once it is changed, the integrity of the relational database is compromised. This assumption is reasonable, especially for a relational database. To prevent the embedded watermark from being extracted by malicious attackers, the result of a one-way hash function, which is derived from the corresponding primary key
PK and the embedding secret keys
K1 and
K2, is used to determine which tuples are selected for watermark embedding. As the embedding keys
K1 and
K2 play a crucial role in both watermark insertion and extraction phases, they should be chosen from a large key space, and they are only known by the database owner; as a result, it is difficult for malicious attackers to guess these two keys when determining which tuples are to be chosen for watermark embedding. In our scheme, the secret keys
K1 and
K2 are computed by using Equations (2) and (3), respectively, and the watermark
W is generated by using Equation (4), to withstand guessing attacks for the hidden watermark
W. Here, the one-way hash function that is used is defined as
, where
M is a message, i.e., information from the relational database, and
M has three characteristics, i.e., (1) given
M, it is easy to compute
h; (2) given
h, it is hard to obtain
M, such that
H(
M) =
h; and (3) given
M, it is also hard to find any input message
M′, so that
H(
M) =
H(
M′). To meet the above three requirements, some hash functions [
19] can be considered for the proposed scheme, i.e., MD5 [
19] and SHA [
19]. Equation (1) is used to compute the message authentication code (MAC) of a primary key and two embedding keys,
K1 and
K2. This MAC value is used to generate two reference sets and then is used to decide the chosen tuples in the watermark insertion phase.
Table 1 presents several important parameters of our scheme.
where
K1 and
K2 are two embedding keys, || denotes the concatenation function, and
ti.
PK is the primary key attribute of the tuple
ti in the relational database
R.
where
DB_name represents the name of a database;
Version represents the version of the database;
DOI represents the database owner’s identity;
DB inf. represents the database information (i.e., the number of attributes, the number of tuples of the relational database
R); and
H() is a one-way hash function. To help readers understand the proposed scheme presented in the following subsections more clearly, the related notations are given in
Table 2:
3.1. Reference Set Generation
In this section, we demonstrate how our method generates two reference sets, i.e.,
RS1 and
RS2, which are based on the secret embedding keys
K1 and
K2. These two reference sets of each selected tuple are generated to refer to when a secret bit is hidden in the chosen tuple. They are used for the watermark phases of both insertion and detection. The reference set generation algorithm (Algorithm 1) is shown in the following:
Algorithm 1: Reference set generation |
Input: MAC value of primary key ti.PK, two embedding keys K1 and K2 |
Output: two reference sets, RS1 and RS2 |
Generate two random sequences S1 and S2;//using a pseudo-random generator with the seed value *K1 and *K2, respectively. S1 and S2 have four distinct elements, and their values are from 0 to 3. |
OriginalSet1 = {S, M, L, XL}; |
OriginalSet2 = {s, m, l, xl}; |
for i = 0 to 3 do |
RS1[i] = OriginalSet1[S1[i]]; |
RS2[i] = OriginalSet2[S2[i]]; |
end for; |
end |
For each tuple
ti, the MAC value,
, of the primary key is calculated using Equation (1). Then, two random sequences for each tuple,
S1 and
S2, are generated using a pseudo-random generator with the corresponding seed values,
*
K1 and
*
K2, respectively. Each sequence contains four distinct values in the range (0, 3). For a better explanation, we can demonstrate the above algorithm by using the example given in the second paragraph in
Section 3. Following our example, all of the
α attributes are categorical attributes, and their values are the size values of fashion products, i.e., T-shirts, shoes. Their size values are in the range of the set
OriginalSet = {S, M, X, XL}. Define the sets
OriginalSet1 = {S, M, X, XL} and
OriginalSet2 = {s, m, x, xl}. Assume that, with the two seeds
*
K1 and
*
K2, two random sequences,
S1 = {1, 0, 3, 2} and
S2 = {3, 2, 0, 1}, are generated using a pseudo-random generator. For each value in the random sequence
S1, the corresponding value of the reference set
RS1 is set as
RS1[i] =
OriginalSet1[
S1[
i]]. As a result, the reference set
RS1 is generated as
RS1 = {M, S, XL, X}. Following the same procedure, the reference set
RS2 is constructed as
RS2 = {xl, x, s, m}. In this algorithm, reference sets
RS1 and
RS2 are generated in uppercase and in lowercase, respectively, to be used for embedding the secret bit “0” or “1”. To choose which reference set will be used to represent the secret bit 0 or 1 for the corresponding tuple, the choice is based on the MAC value that is computed in Equation (1).
3.2. Watermark Insertion
In this section, the details of the watermark insertion process of our proposed LRW-RDB scheme are given. Our proposed LRW-RDBhiding strategy, implemented in the watermark insertion algorithm, is based on the symmetry concept, which means the pattern of the hidden watermark should be random and the distribution of the hidden watermark remains uniform. The watermark insertion algorithm describes the process of embedding one watermark bit into the categorical attributes. The two reference sets
RS1 and
RS2 for each tuple, based on the MAC value
of the primary key and two predetermined secret embedding keys
K1 and
K2, are required for the watermark insertion. A
ReferenceSetGeneration() function, which is presented in
Section 3.1, is used to generate the two reference sets
RS1 and
RS2. Therefore, we skip the related description here. To embed a watermark into the selected attribute of the selected tuple in the relational database
R, two algorithms are designed in this section. The watermark insertion algorithm describes how to determine which tuples and which attributes will be used for carrying watermark bits. The bit-encoding algorithm describes our embedding rule for hiding the watermark bit in the selected attribute. The watermark insertion algorithm (Algorithm 2) is first given as follows:
Algorithm 2: Watermark embedding |
Input: Relational database R, parameter g |
Output: Watermarked relational database RW, embedding keys K1 and K2 |
Generate the embedding keys K1 and K2//using Equations (2) and (3) |
Generate watermark W;//using Equation (4) |
for each tuple ti ∈ R do |
if mod g is equal to 0, then//the current tuple is selected |
attribute_index j = mod α//the current attribute Aj is selected |
mark_bit_idx = mod L |
b = W[mark_bit_idx]; |
bit_encoding(, K1, K2, b, tiAj); |
end if; |
end for each tuple; |
end |
Based on the watermark embedding algorithm mentioned above, it is obvious that each tuple must conduct (
modulo
g) to derive its MAC value first, and only the tuple whose MAC value equals 0 is selected for the concealed watermark. Thus, in
g continuous tuples, only one tuple is selected. For example, if
g = 6, then in 6 tuples, there is only one selected tuple used to embed the watermark bit. It is noted, that among α attributes of the selected tuples, only one attribute will be used to conceal the watermark bits. To modify the attribute values of the selected tuple, a bit-encoding algorithm (Algorithm 3) is presented below:
Algorithm 3: Bit-encoding |
Input: , embedding keys, K1 and K2, watermark bit b, the value of attribute j of tuple i: tiAj |
Output: the update value of attribute j of tuple i: tiAj |
[RS1 RS2] = ReferenceSetGeneration(, K1, K2)//determine two reference sets, RS1 and RS2 |
Get index idx of attribute value tiAj//index of attribute value in OriginalSet set |
if %2 = 0, then |
if b = 0, then tiA’j = RS1[idx]; |
else tiA’j = RS2[idx]; |
end if; |
else |
if b = 0, then tiA’j = RS2[idx]; |
else tiA’j = RS1[idx]; |
end if; |
end if; |
Update_Attr(); //Update attribute value tiAj by value tiA’j |
In our proposed bit-encoding algorithm, the original attribute value tiAj is used to extract its index idx from the set OriginalSet. Then, the MAC value derived by Equation (1) is used to determine which reference set is assigned to conceal the watermark, bit 0 or bit 1. The watermark attribute value tiA’j can be extracted from reference sets RS1 or RS2, such as RS1[idx] or RS2[idx], respectively, and based on the watermark bit b and the index idx, for the selected attribute value j of the selected tuple i in the relational database R. After completing the process of bit-encoding, the Update_Attr() function is used to update the attribute value j of the tuple i in the relational database R for the given watermark attribute value tiA’j.
To clarify the process of the bit-encoding algorithm, an example is presented in this paragraph. Let us assume the original attribute value of the selected tuple is tiAj = “X”, and the watermark bit b = 0, two reference sets are constructed as RS1 = {M, S, XL, X} and RS2 = {xl, x, s, m}, respectively. Here, we assume that the corresponding value of %2 = 0, which means that RS1 is assigned to carry the watermark bit 0 and RS2 is assigned to carry watermark bit 1, respectively. Based on OriginalSet = {S, M, X, XL}, the index of the current attribute value “X” is extracted as idx = 2. Then, to embed the watermark bit b = 0, the reference set RS1 is utilized. Following the proposed bit-encoding rule, the corresponding value of index, idx = 2, is extracted from the reference set RS1 as “XL”. Finally, the watermark attribute value is set as “XL”.
3.3. Watermark Detection
Assume that Kait suspects that a published relational database has been illegally copied from her watermarked relational database RW. Here, we assume that the attacker does not drop the primary key attributes and does not modify the values of the primary key. This assumption is made because the primary key values contain valuable information, and once the primary key values in a database have been modified, the integrity of the database is compromised.
To extract and verify the hidden watermark, the parameters
g and
α are required in addition to two secret embedding keys,
K1 and
K2; as they use the same one-way hash function, with the same parameters
g and
α and the embedding secret keys,
K1 and
K2, and the same tuples and attributes are selected as in the watermark insertion algorithm. The proposed watermark detection procedure contains two algorithms: one is the watermark detection algorithm (Algorithm 4), and the other is the bit-decoding algorithm (Algorithm 5). The prior algorithm not only performs the inverse operations of the watermark insertion algorithm introduced in
Section 3.2, but also conducts a majority voting strategy to determine the final watermark, because the watermark
W is hidden in the relational database
R several times. The latter algorithm performs the inverse operations of the bit encoding algorithm mentioned in
Section 3.2. After the watermark detection is finished, the relational database owner can make use of the extracted watermark to prove copyright ownership. The watermark detection algorithm shows details of the watermark detection process for the watermark relational database
RW, as follows:
Algorithm 4: Watermark detection |
Input: Relational database RW, parameters g, and α |
Output: Watermarked status ∈ {true, false}, and recover relational database RGenerate embedding keys, K1 and K2//using Equations (2) and (3) |
Generate watermark W;//using Equation (4) |
for i = 0 to L − 1 do |
count[i][0] = 0; count[i][1] = 0; //reset the votes of the watermark |
end for; |
for each tuple ti ∈ R do |
if mod g equal 0, then//select this tuple |
attribute_index j = mod α//select this attribute Aj |
mark_bit_idx = mod L |
b = bit_decoding(, K1, K2, tiAj); |
count[mark_bit_idx][b] = count[mark_bit_idx][b] + 1; |
end if; |
end for each tuple; |
for i = 0 to L − 1 do |
if count[i][0] + count[i][1] = 0 then W′[i] = −1; |
end if; |
if count[i][1] > count[i][1] then W′[i] = 1; |
else W′[i] = 0; |
end if; |
end for; |
for i = 0 to L − 1 do//find the match result between the original and the detected watermarks |
if W′ [i] = W[i] then matchamount + = 1; |
end if; |
end for; |
if matchamount = L, then |
return true;//relational database R is restored successfully |
else |
return false;//relational database R can not be restored |
end if; |
The details of the bit-decoding algorithm is depicted as follows:
Algorithm 5: Bit-decoding |
Input: , embedding keys, K1 and K2, the value of the watermarked attribute j of tuple i: tiAj |
Output: watermark bit b and the update value of attribute j of tuple i: tiAj |
[RS1 RS2] = ReferenceSetGeneration(, K1, K2)//determine two reference sets, RS1 and RS2 |
if%2 = 0, then |
if tiA’j ∈ RS1, then |
b = 0; |
Get index idx of attribute value tiAj from the set RS1; |
else |
b = 1; |
Get index idx of attribute value tiAj from the set RS1; |
end if; |
else |
if tiA’j ∈ RS2 then |
b = 0; |
Get index idx of attribute value tiAj from the set RS1; |
else |
b = 1; |
Get index idx of attribute value tiAj from the set RS1; |
end if; |
recon_value = OriginalSet[idx]; |
end if; |
Update_Attr(); //Update attribute value tiAj by value recon_value |
From the above steps, we can see that the index value idx shall be determined to reconstruct the original categorical attribute value. If tiA’j ∈ RS1, then the index idx of tiA’j is extracted from the set RS1, and the watermark bit is recovered as an assigned bit of the set RS1. Otherwise, the index idx of tiA’j is extracted from the set RS2, and the watermark bit is recovered as the assigned bit of the set RS2. Finally, the original attribute value tiAj is reconstructed from the set OriginalSet, such as OriginalSet [idx].