Compressed Communication Complexity of Hamming Distance

Mitsuya, Shiori; Nakashima, Yuto; Inenaga, Shunsuke; Bannai, Hideo; Takeda, Masayuki

doi:10.3390/a14040116

Open AccessArticle

Compressed Communication Complexity of Hamming Distance

¹

Department of Informatics, Kyushu University, 744, Motooka, Nishiku, Fukuoka 819-0395, Japan

²

PRESTO, Japan Science and Technology Agency, 4-1-8, Honcho, Kawaguchi, Saitama 332-0012, Japan

³

M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 101-0062, Japan

^*

Author to whom correspondence should be addressed.

Algorithms 2021, 14(4), 116; https://0-doi-org.brum.beds.ac.uk/10.3390/a14040116

Submission received: 3 March 2021 / Revised: 31 March 2021 / Accepted: 31 March 2021 / Published: 3 April 2021

(This article belongs to the Special Issue Algorithms and Data-Structures for Compressed Computation)

Download Versions Notes

Abstract

:

We consider the communication complexity of the Hamming distance of two strings. Bille et al. [SPIRE 2018] considered the communication complexity of the longest common prefix (LCP) problem in the setting where the two parties have their strings in a compressed form, i.e., represented by the Lempel-Ziv 77 factorization (LZ77) with/without self-references. We present a randomized public-coin protocol for a joint computation of the Hamming distance of two strings represented by LZ77 without self-references. Although our scheme is heavily based on Bille et al.’s LCP protocol, our complexity analysis is original which uses Crochemore’s C-factorization and Rytter’s AVL-grammar. As a byproduct, we also show that LZ77 with/without self-references are not monotonic in the sense that their sizes can increase by a factor of 4/3 when a prefix of the string is removed.

Keywords:

communication complexity; Hamming distance; Lempel-Ziv compression

1. Introduction

Communication complexity, first introduced by Yao [1], is a well-studied sub-field of complexity theory which aims at quantifying the total amount of communication (bits) between the multiple parties who separately hold partial inputs of a function f. The goal of the

k (\geq 2)

parties is to jointly compute the value of

f (X_{1}, \dots, X_{k})

, where

X_{i}

denotes the partial input that the ith party holds. Communication complexity studies lower bounds and upper bounds for the communication cost of a joint computation of a function f. Due to the rapidly increasing amount of distributed computing tasks, communication complexity has gained its importance in the recent highly digitalized society. This paper deals with the most basic and common setting where the two parties, Alice and Bob, separately hold partial inputs A and B and they perform a joint computation of

f (A, B)

for a function f following a specified protocol.

We pay our attention to communication complexity of string problems where the inputs A and B are strings over an alphabet

Σ

. Communication complexity of string problems has played a critical role in the space lower bound analysis of several streaming processing problems including Hamming/edit/swap distances [2], pattern matching with k-mismatches [3], parameterized pattern matching [4], dictionary matching [5], and quasi-periodicity [6].

Bille et al. [7] were the first to consider the communication complexity of the longest common prefix (LCP) problem in the setting where the two parties have their strings in a compressed form, i.e., represented by the Lempel-Ziv 77 factorization (LZ77) [8] with/without self-references. Bille et al. [7] proposed a randomized public-coin protocol for the LCP problem with

O (log z_{ℓ})

communication rounds and

O (log ℓ)

total bits of communication, where ℓ denotes the length of the LCP of the two strings A and B and

z_{ℓ}

denotes the size of the non-self-referencing LZ77 factorization of the LCP

A [1 . . ℓ]

. In addition, Bille et al. [7] showed a randomized public-coin protocol for the LCP problem with

(i): $O (log z_{ℓ}^{'} + log log ℓ)$ communication rounds and $O (log ℓ)$ total bits of communication, or
(ii): $O (log z_{ℓ}^{'})$ communication rounds and $O (log ℓ + log log log n)$ total bits of communication,

where

z_{ℓ}^{'}

denotes the size of the self-referencing LZ77 factorization of the LCP

A [1 . . ℓ]

and

n = | A |

.

In this paper, we consider the communication complexity of the Hamming distance of two strings of equal length, which are represented in a compressed form. We present a randomized public-coin protocol for a joint computation of the Hamming distance of two strings represented by non-self-referencing LZ77, with

O (d log z)

communication rounds and

O (d log ℓ_{max})

total bits of communication, where d is the Hamming distance between A and B, z is the size of the LZ77 factorization of string A, and

ℓ_{max}

is the largest gap between two adjacent mismatching positions between A and B (if the first/last characters of A and B are equal, then we can add terminal symbols as

# A $

and

$ B #

and subtract 2 from the computed distance). Although our scheme is heavily based on Bille et al.’s LCP protocol, our complexity analysis is original which uses Crochemore’s C-factorization [9] and Rytter’s AVL-grammar [10].

Furthermore, as a byproduct of our result for the Hamming distance problem, we also show that LZ77 with/without self-references are non-monotonic. For a compression algorithm

A

let

A (S)

denote the size of the compressed representation of string S by

A

. We say that compression algorithm

A

is monotonic if

A (S [1 . . j]) \leq A (S)

for any

1 \leq j < | S |

and

A (S [i . . | S |]) \leq A (S)

for any

1 < i \leq | S |

, and we say it is non-monotonic otherwise. It is clear that LZ77 with/without self-references satisfy the first property, however, to our knowledge the second property has not been studied for the LZ77 factorizations. We prove that LZ77 with/without self-references is non-monotonic by giving a family of strings such that removing each prefix of length from 1 to

\sqrt{n}

increases the number of factors in the LZ77 factorization by a factor of 4/3, where n denotes the string length. We also show that in the worst-case the number of factors in the non-self-referencing LZ77 factorization of any suffix of any string S of length n can be larger than that of S by at most a factor of

O (log n)

.

Monotonicity of compression algorithms and string repetitive measures has gained recent attention. Lagarde and Perifel [11] showed that Lempel-Ziv 78 compression [12] is non-monotonic by showing that removing the first character of a string can increase the size of the compression by a factor of

Ω (log n)

. The recently proposed repetitive measure called the substring complexity

δ

is known to be monotonic [13]. Kociumaka et al. [13] posed an open question whether the smallest bidirectional macro scheme size b [14] or the smallest string attractor size

γ

[15] is monotonic. It was then answered by Mantaci et al. [16] that

γ

is non-monotonic.

2. Preliminaries

2.1. Strings

Let

Σ

be an alphabet of size

σ

. An element of

Σ^{*}

is called a string. The length of a string S is denoted by

| S |

. The empty string

ε

is the string of length 0, namely

| ε | = 0

. The i-th character of a string S is denoted by

S [i]

for

1 \leq i \leq | S |

, and the substring of a string S that begins at position i and ends at position j is denoted by

S [i . . j]

for

1 \leq i \leq j \leq | S |

. For convenience, let

S [i . . j] = ε

if

j < i

. Substrings

S [1 . . j]

and

S [i . . | S |]

are respectively called a prefix and a suffix of S. For simplicity, let

S [. . j]

denote the prefix of S ending at position j and

S [i . .]

the suffix

S [i . . | S |]

of S beginning at position i. *fix A suffix

S [j . .]

with

j > 1

is called a proper suffix of S.

For string X and Y, let

lcp (X, Y)

denote the length of the longest common prefix (LCP) of strings

X, Y

, namely

lcp (X, Y) = max ({ℓ ∣ X [. . ℓ] = Y [. . ℓ], 1 \leq ℓ \leq min {| X |, | Y |}} \cup {0})

. The Hamming distance

d_{H} (X, Y)

of two strings

X, Y

of equal length is the number of positions where the underlying characters differ between X and Y, namely

d_{H} (X, Y) = | {i ∣ X [i] \neq Y [i], 1 \leq i \leq | X |} |

.

2.2. Lempel-Ziv 77 Factorizations

Of many versions of Lempel-Ziv 77 factorization [8] which divide a given string in a greedy left-to-right manner, the main tool we use is the non-self-referencing LZ77, which is formally defined as follows:

Definition 1

(Non-self-referencing LZ77 factorization). The non-self-referencing LZ77 factorizationof string S, denoted

LZN (S)

, is a factorization

S = f_{1} \dots f_{zn}

that satisfies the following: Let

u_{i}

denote the beginning position of each factor

f_{i}

in the factorization

f_{1} \dots f_{zn}

, that is,

u_{i} = | f_{1} \dots f_{i - 1} | + 1

. (1) If

i > 1

and

{max}_{1 \leq j < u_{i}} {lcp (S [u_{i} . .], S [j . . u_{i} - 1])} \geq 1

, then for any position

s_{i} \in \underset{1 \leq j < u_{i}}{\arg \max} lcp (S [u_{i} . .], S [j . . u_{i} - 1])

in S, let

p_{i} = lcp (S [u_{i} . .], S [s_{i} . . u_{i} - 1])

. (2) Otherwise, let

p_{i} = 0

. Then,

f_{i} = S [s_{i} . . u_{i} + p_{i}]

for each

1 \leq i \leq zn

.

Intuitively, each factor

f_{i}

in

LZN (S)

is either a fresh letter, or the shortest prefix of

f_{i} \dots f_{zn}

that does not have a previous occurrence in

f_{1} \dots f_{i - 1}

. This means that self-referencing is not allowed in

LZN (S)

, namely no previous occurrences

S [s_{i} . . s_{i} + p_{i}]

of each factor

f_{i}

can overlap with itself.

The size

zn (S)

of

LZN (S)

is the number

zn

of factors in

LZN (S)

.

We encode each factor

f_{i}

by a triple

(s_{i}, p_{i}, α_{i}) \in ([1 . . n] \times [1 . . n] \times Σ)

, where

s_{i}

is the left-most previous occurrence of

f_{i}

,

p_{i}

is the length of

f_{i}

, and

α_{i}

is the last character of

f_{i}

.

Example 1.

For

S = abaababaabaabaabaabaabb

,

LZS (S) = a ∣ b ∣ aa ∣ bab ∣ aabaa ∣ baabaab

∣ aabb ∣

and it can be represented as

(0, 0, a), (0, 0, b), (1, 2, a), (2, 3, b), (3, 5, a), (7, 7, b),

(3, 4, b)

. The size of

LZS (S)

is 7.

The self-referencing counterpart is defined as follows:

Definition 2

(Self-referencing LZ77 factorization). Theself-referencing LZ77 factorizationof string S, denoted

LZS (S)

, is a factorization

S = g_{1} \dots g_{zs}

that satisfies the following: Let

v_{i}

denote the beginning position of each factor

g_{i}

in the factorization

g_{1} \dots g_{zs}

, that is,

v_{i} = | g_{1} \dots g_{i - 1} | + 1

. (1) If

i > 1

and

{max}_{1 \leq j < v_{i}} {lcp (S [v_{i} . .], S [j . .])} \geq 1

, then for any position

t_{i} \in \underset{1 \leq j < v_{i}}{\arg \max} lcp (S [v_{i} . .], S [j . .])

in S, let

q_{i} = lcp (S [v_{i} . .], S [t_{i} . .])

. (2) Otherwise, let

q_{i} = 0

. Then,

g_{i} = S [v_{i} . . v_{i} + q_{i}]

for each

1 \leq i \leq zs

.

Intuitively, each factor

g_{i}

of

LZS (S)

is either a fresh letter, or the shortest prefix of

g_{i} \dots g_{zs}

that does not have a previous occurrence beginning in

g_{1} \dots g_{i - 1}

. This means that self-referencing is allowed in

LZS (S)

, namely the left-most previous occurrence with smallest

t_{i}

of each factor

g_{i}

may overlap with itself.

The size

zs (S)

of

LZS (S)

is the number

zs

of factors in

LZS (S)

.

Likewise, we encode each factor

g_{i}

by a triple

(t_{i}, q_{i}, β_{i}) \in ([1 . . n] \times [1 . . n] \times Σ)

, where

t_{i}

is the left-most previous occurrence of

g_{i}

,

q_{i}

is the length of

g_{i}

, and

β_{i}

is the last character of

g_{i}

.

Example 2.

For

S = abaababaabaabaabaabaabb

,

LZN (S) = a ∣ b ∣ aa ∣ bab ∣ aabaa ∣

baabaabaabb ∣

and it can be represented as

(0, 0, a), (0, 0, b), (1, 2, a), (2, 3, b), (3, 5, a),

(7, 11, b)

. The size of

LZS (S)

is 6.

2.3. Communication Complexity Model

Our approach is based on the standard communication complexity model of Yao [1] between two parties:

The parties are Alice and Bob;
The problem is a function $f : X \times Y \to Z$ for arbitrary sets $X, Y, Z$ ;
Alice has instance $x \in X$ and Bob has instance $y \in Y$ ;
The goal of the two parties is to output $f (x, y)$ for a pair $(x, y)$ of instances by a joint computation;
The joint computation (i.e., the communication between Alice and Bob) follows a specified protocol $P$ .

The communication complexity [1] usually refers merely to the total amount of bits that need to be transferred between Alice and Bob to compute

f (x, y)

. In this paper, we follow Bille et al.’s model [7] where the communication complexity is evaluated by a pair

〈 r, b 〉

of the number of communication rounds r and the total amount of bits b exchanged in the communication.

In a (Monte-Carlo) randomized public-coin protocol, each party (Alice/Bob) can access a shared infinitely long sequence of independent random coin tosses. The requirement is that the output must be correct for every pair of inputs with probability at least

1 - ϵ

for some

0 < ϵ < 1 / 2

, which is based on the shared random sequence of coin tosses. We remark that one can amplify the error rate to an arbitrarily small constant by paying a constant factor penalty in the communication complexity. Please note that the public-coin model differs from a randomized private-coin model, where in the latter the parties do not share a common random sequence and they can only use their own random sequence. In a deterministic protocol, every computation is performed without random sequences.

2.4. Joint Computation of Compressed String Problems

In this paper, we also consider the communication complexity of the Hamming distance problem between two compressed strings of equal length, which are compressed by LZ77 without self-references.

Problem 1

(Hamming distance with non-self-referencing LZ77).

Alice’s input:

LZN (A)

for string A of length n.

Bob’s input:

LZN (B)

for string B of length n.

Goal:Both Alice and Bob obtain the value of

d_{H} (A, B)

.

The following LCP problem for two strings compressed by non-self-referencing LZ77 has been considered by Bille et al. [7].

Problem 2

(LCP with non-self-referencing LZ77).

Alice’s input:

LZN (A)

for string A.

Bob’s input:

LZN (B)

for string B.

Goal:Both Alice and Bob obtain the value of

lcp (A, B)

.

Bille et al. proposed the following protocol for a joint computation of the LCP of two strings compressed by non-self-referencing LZ77:

Theorem 1

([7]). Suppose that the alphabet Σ and the length n of string A are known to both Alice and Bob. Then, there exists a randomized public-coin protocol which solves Problem 2 with communication complexity

〈 O (log z_{ℓ}), O (log ℓ) 〉

, where

ℓ = lcp (A, B)

and

z_{ℓ} = zn (A [1 . . ℓ])

.

The basic idea of Bille et al.’s protocol [7] is as follows: In their protocol, the sequences of factors in the non-self-referencing LZ77 factorizations

LZN (A)

and

LZN (B)

are regarded as strings of respective lengths

zn (A)

and

zn (B)

over an alphabet

[1 . . n] \times [1 . . n] \times Σ

. Then, Alice and Bob jointly compute the LCP of

LZN (A)

and

LZN (B)

, which gives them the first mismatching factors between

LZN (A)

and

LZN (B)

. This LCP of

LZN (A)

and

LZN (B)

is computed by a randomized protocol for doubling-then-binary searches with

O (log z_{ℓ})

communication rounds. Finally, Alice sends the information about her first mismatching factor to Bob, and he internally computes the LCP of A and B. The total number of bits exchanged is bounded by

O (log ℓ)

.

In Section 3, we present our protocol for Problem 1 of jointly computing the Hamming distance of two strings compressed by non-self-referencing LZ77. The scheme itself is a simple application of the LCP protocol of Theorem 1 for non-self-referencing LZ77, but our communication complexity analysis is based on non-trivial combinatorial properties of LZ77 factorization which, to our knowledge, were not previously known.

3. Compressed Communication Complexity of Hamming Distance

In this section, we show a Monte-Carlo randomized protocol for *fixed Problem 1 that asks for the Hamming distance

d_{H} (A, B)

of strings A and B that are compressed by non-self-referencing LZ77. Our protocol achieves

〈 O (d log z), O (d log ℓ_{max}) 〉

communication complexity, where

d = d_{H} (A, B)

,

z = zn (A)

, and

ℓ_{max}

is the largest value returned by the sub-protocol of the LCP problem for two strings compressed by non-self-referencing LZ77.

The basic idea is to apply the so-called Kangaroo jumping method, namely if d is the number of mismatching positions between A and B, then one can compute

d = d_{H} (A, B)

with at most

d + 1

LCP queries. More specifically, let

1 \leq i_{1} < \dots < i_{d} \leq n

be the sequence of mismatching positions between A and B. By using the protocol of Theorem 1 as a black box, and using the fact that

zn (S) \geq zn (S [1 . . j])

for any prefix

S [1 . . j]

of any string S, we immediately obtain the following:

Lemma 1.

Suppose that the alphabet Σ and the length n of strings A and B are known to both Alice and Bob. Then, there exists a randomized public-coin protocol which solves Problem 1 with communication complexity

〈 O (\sum_{k = 1}^{d} log zn (A [i_{k} + 1 . .])), O (d log ℓ_{max}) 〉

, where

ℓ_{max} = {max}_{1 < k \leq d} {i_{k} - i_{k - 1} + 1}

.

3.1. On the Sizes of Non-Self-Referencing LZ77 Factorization of Suffixes

Our next question is how large the

zn (A [i_{k} + 1 . .])

term in Lemma 1 can be in comparison to

zn (A)

. To answer this question, we consider the following general measure: For any string of length n, let

ζ (n) = max {zn (S [i . .]) / zn (S) ∣ S \in Σ^{n}, 1 < i \leq n} .

3.1.1. Lower Bound for $ζ (n)$

In this subsection, we present a family of strings S such that

zn (S [i . .]) > zn (S)

for some suffix

S [i . .]

, namely

ζ (n) > 1

. More specifically, we show the following:

Lemma 2.

ζ (n)

is asymptotically lower bounded by

4 / 3

.

Proof.

For simplicity, we consider an integer alphabet

{0, 1, \dots, σ}

of size

σ + 1

. Consider the string

S = (012 \dots σ - 1 σ) (0124) (012346) (01234568) \dots (012 \dots σ - 2 σ)

and its proper suffix

S [2 . .] = (12 \dots σ - 1 σ) (0124) (012346) (01234568) \dots (012 \dots σ - 2 σ) .

The non-self-referencing LZ77 factorization of S and

S [2 . .]

are:

\begin{matrix} LZN (S) & = & 0 ∣ 1 ∣ 2 ∣ \dots ∣ σ - 1 ∣ σ ∣ 0124 ∣ 012346 ∣ 01234568 ∣ \dots ∣ 012 \dots σ - 2 σ ∣ \\ LZN (S [2 . .]) & = & 1 ∣ 2 ∣ \dots ∣ σ - 1 ∣ σ ∣ 01 ∣ 24 ∣ 0123 ∣ 46 ∣ 012345 ∣ 68 ∣ \dots ∣ 012 \dots ∣ σ - 2 σ ∣ \end{matrix}

Observe that after the first occurrence of character

σ

, each factor of

LZN (S)

is divided into two smaller factors in

LZN (S [2 . .])

. Since

zn (S) = | LZN (S) | = (σ + 1) + (\frac{σ}{2} - 1) = \frac{3 σ}{2}

and

zn (S [2 . .]) = | LZN (S [2 . .]) | = (σ) + (σ - 2) = 2 σ - 2

,

zn (S [2 . .]) / zn (S) = \frac{2 σ - 2}{(3 σ / 2)} = \frac{4}{3} - \frac{2}{3 σ}

, which tends to

4 / 3

as

σ

goes to infinity. We finally remark that

| S | = n = Θ (σ^{2})

which in turn means that

σ = Θ (\sqrt{n})

. □

Remark 1.

One can generalize the string S of Lemma 2 by replacing 0 with

0^{h}

for arbitrarily fixed

1 < h \leq a \cdot σ

for any constant a. The upper limit

a \cdot σ

comes from the fact that the number of 0’s in the original string S is exactly

\frac{σ}{2}

. Since

| S | = n = Θ (σ^{2})

, replacing 0 by

0^{h}

with

h < a \cdot σ

keeps the string length within

O (n)

. This implies that one can obtain the asymptotic lower bound

4 / 3

for any suffix

S [h . .]

of length roughly up to

n - \sqrt{n}

.

Note also that the factorizations shown in Lemma 2 coincide with the self-referencing counterparts

LZS (S)

and

LZS (S [2 . .])

, respectively. The next corollary immediately follows from Lemma 2 and Remark 1.

Corollary 1.

The Lempel-Ziv 77 factorization with/without self-references is non-monotonic.

3.1.2. Upper Bound for $ζ (n)$

Next, we consider an upper bound for

ζ (n)

. The tools we use here are the C-factorization [9] without self-references, and a grammar compression called AVL-grammar [10].

Definition 3

(Non-self-referencing C-factorization). The non-self-referencing C-factorizationof string S, denoted

CN (S)

, is a factorization

S = c_{1} \dots c_{cn}

that satisfies the following: Let

w_{i}

denote the beginning position of each factor

c_{i}

in the factorization

c_{1} \dots c_{cn}

, that is,

w_{i} = | c_{1} \dots c_{i - 1} | + 1

. (1) If

i > 1

and

{max}_{1 \leq j < w_{i}} {lcp (S [w_{i} . .], S [j . . w_{i} - 1])} \geq 1

, then for any position

r_{i} \in \underset{1 \leq j < w_{i}}{\arg \max} lcp (S [w_{i} . .], S [j . . w_{i} - 1])

in S, let

y_{i} = lcp (S [w_{i} . .], S [r_{i} . . w_{i} - 1]) - 1

. (2) Otherwise, let

y_{i} = 0

. Then,

c_{i} = S [w_{i} . . w_{i} + y_{i}]

for each

1 \leq i \leq cn

.

The size

cn (S)

of

CN (S)

is the number

cn

of factors in

CN (S)

.

Example 3.

For

S = abaababaabaabaabaabaabb

,

CN (S) = a ∣ b ∣ a ∣ ab ∣ abaab ∣ aaba ∣

abaabaab ∣ b ∣

and its size is 8.

The difference between

LZN (S)

and

CN (S)

is that while each factor

f_{i}

in

LZN (S)

is the shortest prefix of

S [u_{i} . .]

that does not occur in

S [1 . . u_{i} - 1]

, each factor

c_{i}

in

CN (S)

is the longest prefix of

S [w_{i} . .]

that occurs in

S [1 . . w_{i} - 1]

. This immediately leads to the next lemma.

Lemma 3.

For any string S,

cn (S) \geq zn (S)

.

We also use the next lemma in our upper bound analysis for

ζ (n)

.

Lemma 4.

For any string S,

cn (S) \leq 2 zn (S)

.

Proof.

*removed “on the contrary”Suppose that there are two consecutive factors

c_{i}, c_{i + 1}

of

CN (S)

and a factor

f_{j}

of

LZN (S)

such that

c_{i}, c_{i + 1}

are completely contained in

f_{i}

and the ending position of

c_{i + 1}

is less than the ending position of

f_{j}

. Since

c_{i} c_{i + 1}

is a substring of

f_{j} [. . | f_{j} | - 1]

and

f_{j} [. . | f_{j} | - 1]

has a previous occurrence in

f_{1} \dots f_{j - 1}

, this contradicts that

c_{i}

terminated inside

f_{j} [. . | f_{j} | - 1]

.

Thus, the only possible case is that

c_{i} c_{i + 1}

occurs as a suffix of

f_{j}

. Please note that in this case

c_{i - 1}

cannot occur inside

f_{j}

by the same reasoning as above. Therefore, at most two consecutive factors of

CN (S)

can occur completely inside of each factor of

LZN (S)

. This leads to

cn (S) \leq 2 zn (S)

. □

An AVL-grammar of a string S is a kind of a straight-line program (SLP), which is a context-free grammar in the Chomsky-normal form which generates only S. The parse-tree of the AVL-grammar is an AVL-tree [17] and therefore, its height is

O (log n)

if n is the length of S. Let

avl (S)

denote the size (i.e., the number of productions) in the AVL-grammar for S. Basically, the AVL-grammar for S is constructed from the C-factorization of S, by introducing at most

O (log n)

new productions for each factor in the C-factorization. Thus, the next lemma holds.

Lemma 5

([10]). For any string S of length n,

avl (S) = O (cn (S) log n)

.

Now we show our upper bound for

ζ (n)

.

Lemma 6.

ζ (n) = O (log n)

.

Proof.

Suppose we have two AVL-grammars for strings X and Y of respective sizes

avl (X)

and

avl (Y)

. Rytter [10] showed how to build an AVL-grammar for the concatenated string

X Y

of size

avl (X) + avl (Y) + O (h)

, where h is the height of the taller parse-tree of the two AVL-grammars before the concatenation. This procedure is based on a folklore algorithm (cf [18]) that concatenates two given AVL-trees of height h with

O (h)

node rotations. In the concatenation procedure of AVL-grammars,

O (1)

new productions are produced per node rotation. Therefore,

O (h)

new productions are produced in the concatenation operation.

Suppose we have the AVL-grammar of a string S of length n. It contains

avl (S)

productions and the height of its parse-tree is

h = O (log n)

since an AVL-tree is a balanced binary tree. For any proper suffix

S^{'} = S [i . .]

of S with

1 < i \leq n

, we split the AVL-grammar into two AVL-grammars, one for the prefix

S [1 . . i]

and the other for the suffix

S [i . . n]

. We ignore the former and concentrate on the latter for our analysis. Since split operations on a given AVL-grammar can be performed in a similar manner to the afore-mentioned concatenation operations, we have that

avl (S^{'}) \leq avl (S) + a log n

for some constant

a > 0

. Now it follows from Lemma 3, Lemma 4, Lemma 5, and that the size

cn

of the C-factorization of any string is no more than the number of productions in any SLP generating the same string [10], we have

zn (S^{'}) \leq cn (S^{'}) \leq avl (S^{'}) \leq avl (S) + a log n \leq a^{'} cn (S) log n \leq 2 a^{'} zn (S) log n

where

a^{'} > 0

is a constant. This gives us

zn (S^{'}) / zn (S) = O (log n)

for any string S of length n and any of its proper suffix

S^{'}

. □

Since the size

zn (S)

of the non-self-referencing LZ77 factorization of any string S of length n is at least

log n

, the next corollary is immediate from Lemma 6:

Corollary 2.

For any string S and its proper suffix

S^{'}

,

zn (S^{'}) / zn (S) = O (zn (S))

.

3.2. Compressed Communication Complexity of Hamming Distance

Now we have the main result of this section.

Theorem 2.

Suppose that the alphabet Σ and the length n of strings A and B are known to both Alice and Bob. Then, there exists a randomized public-coin protocol which solves Problem 1 with communication complexity

〈 O (d log zn), O (d log ℓ_{max}) 〉

, where

zn = zn (A)

and

ℓ_{max} = {max}_{1 < k \leq d} {i_{k} - i_{k - 1} + 1}

.

Proof.

The protocol of Lemma 1 has

O (\sum_{k = 1}^{d} log zn (A [i_{k} + 1 . .]))

rounds. By Corollary 2, we have that

zn (A [i_{k} + 1 . .]) = O (zn {(A)}^{2})

. Therefore,

\sum_{k = 1}^{d} log zn (A [i_{k} + 1 . .]) = O (d log zn (A))

, which proves the theorem. □

4. Conclusions and Open Questions

This paper showed a randomized public-coin protocol for a joint computation of the Hamming distance of two compressed strings. Our Hamming distance protocol relies on Bille et al.’s LCP protocol for two strings that are compressed by non-self-referencing LZ77, while our communication complexity analysis is based on new combinatorial properties of non-self-referencing LZ77 factorization.

As a further research, it would be interesting to consider the communication complexity of the Hamming distance problem using self-referencing LZ77. The main question to this regard is whether

zs (S [i . .]) = O (poly (zs (S)))

holds for any suffix

S [i . .]

of any string S. In the case of non-self-referencing LZ77,

zn (S [i . .]) = O (zn {(S)}^{2})

holds due to Lemma 2.

Author Contributions

Conceptualization, Y.N, S.I., H.B. and M.T.; methodology, S.M., Y.N., and S.I.; writing—original draft preparation, S.I.; writing—review and editing, Y.N. and H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN), JP20H04141 (HB), JP18H04098 (MT), and JST PRESTO Grant Number JPMJPR1922 (SI).

Conflicts of Interest

The authors declare no conflict of interest.

References

Yao, A.C. Some Complexity Questions Related to Distributive Computing (Preliminary Report). In Proceedings of the Eleventh Annual ACM-SIAM Symposium onTheory of computing, Atlanta, Georgia, 30 April –2 May 1979; pp. 209–213. [Google Scholar]
Clifford, R.; Jalsenius, M.; Porat, E.; Sach, B. Space lower bounds for online pattern matching. Theor. Comput. Sci. 2013, 483, 68–74. [Google Scholar] [CrossRef]
Radoszewski, J.; Starikovskaya, T. Streaming k-mismatch with error correcting and applications. Inf. Comput. 2020, 271, 104513. [Google Scholar] [CrossRef]
Jalsenius, M.; Porat, B.; Sach, B. Parameterized Matching in the Streaming Model. STACS 2013, 400–411. [Google Scholar] [CrossRef]
Gawrychowski, P.; Starikovskaya, T. Streaming Dictionary Matching with Mismatches. In Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019), Pisa, Italy, 18–20 June 2019; pp. 21:5–21:15. [Google Scholar]
Gawrychowski, P.; Radoszewski, J.; Starikovskaya, T. Quasi-Periodicity in Streams. In Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019), Pisa, Italy, 18–20 June 2019; pp. 22:1–22:14. [Google Scholar]
Bille, P.; Ettienne, M.B.; Grossi, R.; Gørtz, I.L.; Rotenberg, E. Compressed Communication Complexity of Longest Common Prefixes. In International Symposium on String Processing and Information Retrieval, Proceedings of the 25th International Symposium, SPIRE 2018, Lima, Peru, 9–11 October 2018; Springer: Cham, Switzerland, 2018; pp. 74–87. [Google Scholar]
Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343. [Google Scholar] [CrossRef] [Green Version]
Crochemore, M. Linear searching for a square in a word. Bull. Eur. Assoc. Theor. Comput. Sci. 1984, 24, 66–72. [Google Scholar]
Rytter, W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 2003, 302, 211–222. [Google Scholar] [CrossRef] [Green Version]
Lagarde, G.; Perifel, S. Lempel-Ziv: A “one-bit catastrophe” but not a tragedy. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–10 January 2018; pp. 1478–1495. [Google Scholar]
Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536. [Google Scholar] [CrossRef] [Green Version]
Kociumaka, T.; Navarro, G.; Prezza, N. Towards a Definitive Measure of Repetitiveness. In Latin American Symposium on Theoretical Informatics, Proceedings of the 14th Latin American Symposium, São Paulo, Brazil, 5–8 January 2020; Springer: Cham, Switzerland, 2020; pp. 207–219. [Google Scholar]
Storer, J.A.; Szymanski, T.G. Data compression via textual substitution. J. ACM 1982, 29, 928–951. [Google Scholar] [CrossRef]
Kempa, D.; Prezza, N. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, USA, 25–29 June 2018; pp. 827–840. [Google Scholar]
Mantaci, S.; Restivo, A.; Romana, G.; Rosone, G.; Sciortino, M. A combinatorial view on string attractors. Theor. Comput. Sci. 2021, 850, 236–248. [Google Scholar] [CrossRef]
Adelson-Velskii, G.; Landis, E. An algorithm for the organization of information. Sov. Math. Dokl. 1962, 3, 1259–1263. [Google Scholar]
Knuth, D.E. The Art of Computer Programming, 2nd ed.; Addison-Wesley: Boston, MA, USA, 1998; Volume III. [Google Scholar]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mitsuya, S.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Compressed Communication Complexity of Hamming Distance. Algorithms 2021, 14, 116. https://0-doi-org.brum.beds.ac.uk/10.3390/a14040116

AMA Style

Mitsuya S, Nakashima Y, Inenaga S, Bannai H, Takeda M. Compressed Communication Complexity of Hamming Distance. Algorithms. 2021; 14(4):116. https://0-doi-org.brum.beds.ac.uk/10.3390/a14040116

Chicago/Turabian Style

Mitsuya, Shiori, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. 2021. "Compressed Communication Complexity of Hamming Distance" Algorithms 14, no. 4: 116. https://0-doi-org.brum.beds.ac.uk/10.3390/a14040116

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Compressed Communication Complexity of Hamming Distance

Abstract

1. Introduction

2. Preliminaries

2.1. Strings

2.2. Lempel-Ziv 77 Factorizations

2.3. Communication Complexity Model

2.4. Joint Computation of Compressed String Problems

3. Compressed Communication Complexity of Hamming Distance

3.1. On the Sizes of Non-Self-Referencing LZ77 Factorization of Suffixes

3.1.1. Lower Bound for $ζ (n)$

3.1.2. Upper Bound for $ζ (n)$

3.2. Compressed Communication Complexity of Hamming Distance

4. Conclusions and Open Questions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Compressed Communication Complexity of Hamming Distance

Abstract

1. Introduction

2. Preliminaries

2.1. Strings

2.2. Lempel-Ziv 77 Factorizations

2.3. Communication Complexity Model

2.4. Joint Computation of Compressed String Problems

3. Compressed Communication Complexity of Hamming Distance

3.1. On the Sizes of Non-Self-Referencing LZ77 Factorization of Suffixes

3.1.1. Lower Bound for ζ ( n )

3.1.2. Upper Bound for ζ ( n )

3.2. Compressed Communication Complexity of Hamming Distance

4. Conclusions and Open Questions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1.1. Lower Bound for $ζ (n)$

3.1.2. Upper Bound for $ζ (n)$