Measuring the size of a regular language is NP-Hard

In this post we will examine an interesting connection between the $\mathcal{NP}$-complete CNF-SAT problem and the problem of computing the number of words generated by some efficient representation of a formal language. Here, I call a way of representing a formal language efficient if it has a polynomially long encoding but can possibly describe an exponentially large formal language. Examples of such efficient representations are nondeterministic finite automata and regular expressions.

Formal definition of the problem

We can formally define the problem of determining the language size in the following way:

Name: NFA Language Size

Input: A nondeterministic finite automaton (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a number $m \in \mathbb{N}$.

Question: Is $|L(M)| \le m$?

Proof of NP-hardness

We will construct a reduction from the $\mathcal{NP}$-complete CNF-SAT problem. Consider a formula $\varphi$ with $k$ clauses and $n$ variables (a code sketch of the construction follows the list):

  • Create an NFA $M := (Z, \Sigma, \delta, S, E)$ with the alphabet $\Sigma := \{0, 1\}$.
  • For every variable $0 \le v \le n$ and clause $1 \le c \le k$, create a state $z(v, c) \in Z$. Intuitively, if the NFA is in the state $z(v, c)$, it means that it has already read $v$ symbols of some variable assignment under which clause $c$ is zero.
  • For each clause $1 \le c \le k$ of the formula, construct a boolean cube $\psi : V \to \{0, 1, *\}$ with $\psi^{-1}(1)$ containing the variables that are negated in the clause and $\psi^{-1}(0)$ containing the positive variables (all other variables are mapped to $*$). By doing this, we essentially construct a cube $C(\psi) := (\bigwedge_{\psi(x) = 1} x) \wedge (\bigwedge_{\psi(x) = 0} \bar{x})$ that is the negation of the clause. In other words, every full variable assignment $\psi' : V \to \{0, 1\}$ with $\psi(x) \neq * \Rightarrow \psi'(x) = \psi(x)$ for all $x \in V$ makes the clause (and the whole formula $\varphi$) false. For all $0 \le v < n$, create the following transitions:
    • Set $\delta(z(v,c), 0) := \{z(v + 1,c)\}$ if $\psi(v + 1) = 0$.
    • Set $\delta(z(v,c), 1) := \{z(v + 1,c)\}$ if $\psi(v + 1) = 1$.
    • Set $\delta(z(v,c), 0) := \{z(v + 1,c)\}$ and $\delta(z(v,c), 1) := \{z(v + 1,c)\}$ if $\psi(v + 1) = *$.
  • Mark all states with $v = 0$ as initial: $S := \{z(0, c) : 1 \le c \le k\}$.
  • Mark all states with $v = n$ as final: $E := \{z(n, c) : 1 \le c \le k\}$.
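
As promised, here is a minimal Python sketch of this construction. The clause format (a list of signed 1-based variable indices) and all names are my own choices for illustration, not notation from the problem definition:

```python
def cnf_to_nfa(clauses, n):
    """Build the NFA M for a CNF formula with n variables.

    A clause is a list of signed 1-based variable indices,
    e.g. [-3, 4] encodes the clause (NOT x3 OR x4).
    Returns (states, initial, final, delta), where delta maps
    (state, symbol) to the set of successor states.
    """
    delta = {}
    initial, final = set(), set()
    for c, clause in enumerate(clauses):
        # The cube psi negating the clause: a negated variable must be 1,
        # a positive variable must be 0, all other variables are free (*).
        psi = {abs(lit): (1 if lit < 0 else 0) for lit in clause}
        initial.add((0, c))   # state z(0, c)
        final.add((n, c))     # state z(n, c)
        for v in range(n):
            allowed = [psi[v + 1]] if (v + 1) in psi else [0, 1]
            for a in allowed:
                delta.setdefault(((v, c), a), set()).add((v + 1, c))
    states = {(v, c) for c in range(len(clauses)) for v in range(n + 1)}
    return states, initial, final, delta
```

For the example formula used below, the call would be cnf_to_nfa([[-3, 4], [-1, 3, -4], [-1, -2, 4]], 4).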

By construction, $M$ will accept exactly the variable assignments $a_1, \dots, a_n \in \Sigma^n$ with $\varphi(a_1, \dots, a_n) = 0$. As $\varphi$ has exactly $2^n$ variable assignments, it is satisfiable if and only if the set of accepted variable assignments $L(M)$ has at most $m := 2^n - 1$ elements.

This reduction can clearly be done in polynomial time.

Intuition & Example

Consider the following function: $\varphi := (\bar{c} + d)(\bar{a} + c + \bar{d})(\bar{a} + \bar{b} + d)$

The cubes that describe the variable assignments where $\varphi$ is false are **10, 1*01 and 11*0 (variable order: $a, b, c, d$). These three cubes make the first, the second and the third clause false, respectively.

We can visualize all possible variable assignments with a 4-dimensional cube (as $\varphi$ has 4 variables):

Boolean cube visualization

In this cube, every edge connects assignments that differ in exactly one bit (i.e. Hamming distance 1). Any lower-dimensional cube that is a subgraph of this cube is an implicant if the vertices of the subgraph cover only assignments where the function is true. If a lower-dimensional cube isn't part of some higher-dimensional cube that still covers only assignments where the function is true, such a cube is a prime implicant. In this case, if we think of CNF clauses as implicants of the negated function, then we can visualize them in the following way:

CNF cube cover visualization

The idea of the reduction is the following: if we can count the number of vertices covered by these cubes, then we can compare this number to the total number of vertices; if it is less than $2^n$, where $n$ is the number of variables, then the function is satisfiable, otherwise not. The problem is that implicants aren't always disjoint. So, the satisfiability problem is essentially the problem of comparing $2^n$ with the size of the union of the variable assignments described by the cubes.

The variable assignments that are part of a given cube can be accepted by a nondeterministic finite automaton (NFA) with $n + 1$ states. We can create such an NFA for each clause and then take their union by marking multiple states as initial. In this example, we get the following NFA:

NFA accepting variable assignments described by cubes from the CNF of the formula

The top row of the NFA accepts the variable assignments generated by the 11*0 cube, the middle row corresponds to the 1*01 cube and the bottom one to **10. $\varphi$ is satisfiable if and only if there is at least one word of length 4 that this NFA doesn't accept.

Converting the NFA to a BDD

The idea of this reduction can be used to compute all the satisfying assignments of $\varphi$. Consider the example above: we can apply the Rabin-Scott algorithm to convert the NFA to a DFA:

DFA accepting variable assignments described by cubes from the CNF of the formula

The computed DFA will always be acyclic (not counting the $\varnothing$-node), because the initial NFA had no cycles in it. The satisfying assignments are the ones that are not accepted by the NFA. Therefore, they are exactly the paths of length $n$ leading to $\varnothing$. A program can easily output all satisfying assignments with a DFS or BFS search in this graph.
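
As a rough sketch (reusing the hypothetical cnf_to_nfa from above), this search can be implemented by running the subset construction lazily and collecting the words of length $n$ whose reached state set contains no final state:

```python
def satisfying_assignments(states, initial, final, delta, n):
    """Enumerate all words of length n rejected by the NFA, i.e. all
    satisfying assignments of the original CNF formula."""
    results = []

    def dfs(subset, word):
        if len(word) == n:
            if not (subset & final):   # no final state reachable: word rejected
                results.append(word)
            return
        for a in (0, 1):
            # One step of the subset construction for symbol a.
            successors = frozenset(
                q2 for q in subset for q2 in delta.get((q, a), set()))
            dfs(successors, word + (a,))

    dfs(frozenset(initial), ())
    return results
```

For the example, satisfying_assignments(*cnf_to_nfa([[-3, 4], [-1, 3, -4], [-1, -2, 4]], 4), 4) should enumerate the nine satisfying assignments of $\varphi$.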

If we replace $\varnothing$ with the positive leaf and all final states with the negative leaf (the transitions after them can be removed), then the graph becomes an ordered binary decision diagram (OBDD) of $\varphi$:

Ordered binary decision diagram for phi

Of course, the nodes should be renamed to variable names:

Ordered binary decision diagram with variable names

The Rabin-Scott algorithm doesn't always output minimal deterministic automata, and therefore in most cases the OBDD will also not be minimal, as in this case, where the redundant checks are clearly visible. In order to get an ROBDD, we still need to apply the elimination and the isomorphism rules (remove redundant checks and reuse subtrees):

Reduced ordered binary decision diagram for the function
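
A minimal sketch of these two rules, under my own node representation (a node is either a terminal 0 or 1, or a tuple (var, low, high)); this is an illustration, not the standard reduce algorithm from the BDD literature verbatim:

```python
def reduce_obdd(node, unique=None):
    """Reduce an OBDD bottom-up: the elimination rule drops tests whose
    branches agree, the isomorphism rule shares structurally equal
    subgraphs via the `unique` hash-consing table."""
    if unique is None:
        unique = {}
    if node in (0, 1):                   # terminal leaf
        return node
    var, low, high = node
    low = reduce_obdd(low, unique)
    high = reduce_obdd(high, unique)
    if low == high:                      # elimination rule
        return low
    key = (var, low, high)
    return unique.setdefault(key, key)   # isomorphism rule
```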

NP-Complete special case

We can tweak the problem defined above to make it belong to $\mathcal{NP}$:

Name: NFA Rejected m-string

Input: A nondeterministic finite automaton (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a unary-encoded number $m \in \mathbb{N}$.

Question: Does there exist a string $\alpha \in \Sigma^*$ of length $m$ such that $\alpha \notin L(M)$?

This problem is in $\mathcal{NP}$, because a string of length $m$ can be used as a certificate. By using the idea of the Rabin-Scott theorem, we can compute all potential states after $m$ transitions and therefore test whether the given string is rejected or not in time $O(|Z|^2 \cdot m)$, which is polynomial in the length of the input.
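
This certificate check is a simple state-set simulation. A sketch, reusing the (initial, final, delta) representation from the earlier code (my own convention, not part of the problem statement):

```python
def rejects(initial, final, delta, word):
    """Check whether the NFA rejects the given word by tracking the set
    of currently reachable states; O(|Z|^2 * len(word)) overall."""
    current = set(initial)
    for a in word:
        current = {q2 for q in current for q2 in delta.get((q, a), set())}
    return not (current & final)
```

The $\mathcal{NP}$-hardness can be shown with a reduction from CNF-SAT as follows: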

  • Consider a formula $\varphi$ with $k$ clauses and $n$ variables.
  • Construct an NFA $M = (Z, \Sigma, \delta, S, E)$ by following the same steps as in the proof of $\mathcal{NP}$-hardness of NFA Language Size.
  • Set $m := n$.

$\varphi$ is satisfiable if and only if $M$ rejects some string of length $m$; such a string is exactly a satisfying assignment for $\varphi$. Clearly, this is a polynomial-time reduction.

Finding any rejected string is NP-Complete

We can also define another version of the problem and prove its $\mathcal{NP}$-completeness:

Name: NFA Rejected String

Input: A nondeterministic finite automaton (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a unary-encoded number $m \in \mathbb{N}$.

Question: Does there exist a string $\alpha \in \Sigma^*$ with $|\alpha| \le m$ such that $\alpha \notin L(M)$?

This problem is in $\mathcal{NP}$, because a string of length at most $m$ can be used as a certificate, as in the NFA Rejected m-string problem.

We will prove $\mathcal{NP}$-hardness by reducing NFA Rejected m-string to this problem. Consider an NFA $M := (Z, \Sigma, \delta, S, E)$ and a number $m' \in \mathbb{N}$:

  • Create $m'$ new states $q_1, \dots, q_{m'} \in Z$.
  • For all $1 \le s < m'$ and $a \in \Sigma$, set $\delta(q_s, a) := \{q_{s+1}\}$.
  • Mark $q_1$ as initial: $S := S \cup \{q_1\}$.
  • Mark $q_1, \dots, q_{m'}$ as final: $E := E \cup \{q_i : 1 \le i \le m'\}$.
  • Set $m := m'$.

By construction, the new NFA will accept all strings of length $m' - 1$ or less. Thus, the original NFA rejects some string of length exactly $m'$ if and only if the new NFA rejects some string of length at most $m = m'$. Obviously, this is a polynomial-time reduction.
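
A sketch of this padding step, again in the tuple representation used by the earlier sketches (the state names ("q", s) are an arbitrary choice to avoid collisions with existing states):

```python
def pad_nfa(states, initial, final, delta, m_prime, alphabet=(0, 1)):
    """Add an accepting chain q_1 .. q_{m'} so that every string shorter
    than m' is accepted; a rejected string must then have length exactly
    m'. Assumes m_prime >= 1."""
    chain = [("q", s) for s in range(1, m_prime + 1)]
    states = states | set(chain)
    for s in range(m_prime - 1):
        for a in alphabet:
            delta.setdefault((chain[s], a), set()).add(chain[s + 1])
    initial = initial | {chain[0]}      # q_1 becomes initial
    final = final | set(chain)          # q_1 .. q_{m'} become final
    return states, initial, final, delta
```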

A few words about complexity

These problems illustrate an interesting connection between $\mathcal{NP}$-complete problems, which can be solved in polynomial time by nondeterministic Turing machines, and the problem of just counting the size of the language described by an NFA. By the Rabin-Scott theorem, it is possible to convert any NFA to an equivalent DFA, but the worst-case complexity of the algorithm is $\Theta(2^n)$, because a set with $n$ elements has $2^n$ subsets and all of these subsets may be reachable in the worst case. If the complexity of the subset construction were polynomial, it would mean that $\mathcal{P} = \mathcal{NP}$, as we could just search for all paths in the DFA that lead to $\varnothing$ after exactly $n$ transitions (these paths are exactly the variable assignments that satisfy $\varphi$).

The reduction discussed above can also be done in reverse: any set of cubes can be transformed into a boolean function in conjunctive normal form. So, if $\mathcal{P} = \mathcal{NP}$ and the CNF-SAT problem is solvable in polynomial time, then it is also possible to compare $2^n$ with the size of the union of arbitrary boolean cubes. The cube union size problem is essentially a special case of the inclusion–exclusion principle, where the formula for the size of a union of $n$ sets is also exponentially long, because the number of possible intersections grows exponentially.
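
To make this concrete: the intersection of several cubes is again a cube (or empty), so the union size can in principle be computed with the exponentially long inclusion–exclusion formula. A sketch, representing a cube as a dict from variable index to its fixed bit (my own convention):

```python
from itertools import combinations

def cube_union_size(cubes, n):
    """|C_1 ∪ ... ∪ C_k| over n variables via inclusion-exclusion.
    Runs in time exponential in k, the number of cubes."""
    total = 0
    for r in range(1, len(cubes) + 1):
        for subset in combinations(cubes, r):
            merged, empty = {}, False
            for cube in subset:
                for var, bit in cube.items():
                    if merged.setdefault(var, bit) != bit:
                        empty = True    # contradictory requirements
            if not empty:
                # A cube fixing f variables covers 2^(n - f) assignments.
                total += (-1) ** (r + 1) * 2 ** (n - len(merged))
    return total
```

For the example cubes **10, 1*01 and 11*0 this yields $4 + 2 + 2 - 1 = 7$ falsified assignments, so $\varphi$ has $2^4 - 7 = 9$ satisfying ones.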

It seems impossible to compute the size of the union of arbitrary cubes in polynomial time, because the number of variable assignments that a cube describes is exponential in the length of the encoding of the cube, and the number of ways in which cubes can intersect is also exponential. Any polynomial encoding for a cube would allow us to distinguish between an exponential number of cubes. The number of ways to intersect $k$ $n$-variable cubes is in $\Omega(3^{(3^{n \cdot (k - 1)})})$, so it is impossible to precisely encode the union of these $k$ cubes with a polynomially long encoding. However, it could be possible to divide the search space so that comparing $2^n$ with the size of the union of the cubes is still possible in polynomial time. Of course, these thoughts aren't even close to a formal proof that $\mathcal{P} \neq \mathcal{NP}$.


Copyright © 2019 — 2023 Alexander Mayorov. All rights reserved.