# Measuring the size of a regular language is NP-Hard

In this post we will examine an interesting connection between the $\mathcal{NP}$-complete `CNF-SAT`

problem and the problem of computing the amount of words generated by some efficient representation of a formal language. Here, I call a way of representing a formal language efficient, if it has a polynomially-long encoding but can possibly describe an exponentially-large formal language. Examples of such efficient representations are nondeteministic finite automations and regular expressions.

## Formal definition of the problem

We can formally define the problem of determining the language size the following way:

**Name**: `NFA Language Size`

**Input**: A nondeterministic finite automata (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a number $m \in \mathbb{N}$.

**Question**: Is $\st L(M) \st \le m$?

## Proof of NP-hardness

We will construct a reduction from the $\mathcal{NP}$-complete `CNF-SAT`

problem. Consider a formula $\varphi$ with $k$ clauses and $n$ variables:

- Create an NFA $M := (Z, \Sigma, \delta, S, E)$ with the alphabet $\Sigma := \{0, 1\}$.
- For every variable $0 \le v \le n$ and clause $1 \le c \le k$ create a state $z(v, c) \in Z$. Intuitively, if the NFA is in the state $z(v, c)$, it means that it has already read $v$ symbols of some variable assignment where clause $c$ is zero.
- For each clause $1 \le c \le k$ of the formula, construct a boolean cube $\psi : V \to \{0, 1, *\}$ with $\psi^{-1}(1)$ containing variables that are negated in the clause and $\psi^{-1}(0)$ containing positive variables (other variables are mapped to $*$). By doing this, we essentially construct a cube $C(\psi) := (\bigwedge_{\psi(x) = 1}{x}) \wedge (\bigwedge_{\psi(x) = 0}{\bar{x}})$ that is the negation of the clause. In other words, all full variable assignments $\psi' : V \to \{0, 1\}$ such that $\psi(x) \neq * \Rightarrow \psi'(x) = \psi(x) \quad\forall x \in V$ will make the clause (and the whole formula $\varphi$) false. For all $0 \le v < n$ create the following transitions:
- Set $\delta(z(v,c), 0) := \{z(v + 1,c)\}$ if $\psi(v + 1) = 0$.
- Set $\delta(z(v,c), 1) := \{z(v + 1,c)\}$ if $\psi(v + 1) = 1$.
- Set $\delta(z(v,c), 0) := \{z(v + 1,c)\}$ and $\delta(z(v,c), 1) := \{z(v + 1,c)\}$ if $\psi(v + 1) = *$.

- Mark all states with $v = 0$ as initial: $S := \{z(0, c) : 1 \le c \le k\}$.
- Mark all states with $v = n$ as final: $E := \{z(n, c) : 1 \le c \le k\}$.

By construction, $M$ will accept any variable assignment $a_1,\dots,a_n \in \Sigma^n$ where $f(a_1, \dots, a_n) = 0$. As $\varphi$ has exactly $2^n$ variable assignments, it is satisfiable if and only if the set of accepted variable assignments $L(M)$ has at most $m := 2^n - 1$ elements.

This reduction can clearly be done in polynomial time.

## Intuition & Example

Consider the following function: $\varphi := (\bar{c} + d)(\bar{a} + c + \bar{d})(\bar{a} + \bar{b} + d)$

The cubes that describe variable assignments where $\varphi$ is false are `**10`

, `1*01`

and `11*0`

(variable order: $a, b, c, d$). These three cubes make the first, second and the third clauses false, respectively.

We can visualize all possible variable assignments with a 4-dimensional cube (as $\varphi$ has 4 variables):

In this cube, every edge connects assignments that differ in exactly one bit (i.e. hamming distance = 1). Any lower dimensional cubes that are subgraphs of this cube are implicants if the vertices in the subgraph cover only assignments where the function is true. If a lower dimensional cube isn’t a part of some higher dimensional cube that still covers only assignments where the function is true, such a cube is a prime implicant. In this case, if we think of CNF-clauses as implicants of the negated function, then we can visualize them the following way:

The idea of the reduction is that if we can count the amount of vertices covered by these cubes, then we can compare this amount to the total number of vertices and if it is less than $2^n$ where $n$ is the number of variables, then the function is satisfiable, otherwise not. The problem is that implicants aren’t always disjoint. So, the satisfiability problem is essentially the problem of comparing $2^n$ with the size of the union of variable assignments described by cubes.

These variable assignments that are parts of some cube(s) can be accepted with a nondeterministic finite automata (NFA) with $n$ states. We can create such an NFA for each clause and then union them by marking multiple states as initial. In this example, we get the following NFA:

The top row of the NFA accepts variable assignments generated by the `11*0`

cube, the middle row corresponds to the `1*01`

cube and the bottom one - to `**10`

. $\varphi$ is satisfiable if and only if there is at least one word of length 4, such that this NFA doesn’t accept it.

## Converting the NFA to a BDD

The idea of this reduction can be used to compute all the satisfying assignments of $\varphi$. Consider the example above - we can apply the Rabin-Scott algorithm to convert the NFA to a DFA:

The computed DFA will always be acyclic (without the $\varnothing$-node), because the initial NFA had no cycles in it. The satisfying assignments are the ones that are not accepted by the NFA. Therefore, they are exactly the paths of length $n$, leading to $\varnothing$. A program can easily output all satisfying assignments with a DFS or BFS-search in this tree.

If we replace $\varnothing$ with the positive leaf and all final states with a negative leaf (the transitions after them can be removed), then the graph will become an ordered binary decision diagram (OBDD) of $\varphi$:

Of course, the nodes should be renamed to variable names:

The Rabin-Scott algorithm doesn’t always output minimal deterministic automations and therefore in most cases the OBDD will also not be minimal, like in this case where the redundant checks are clearly seen. In order to get a ROBDD we will still need to apply the elimination and the isomorphism rules (remove redundant checks and reuse subtrees):

## NP-Complete special case

We can tweak the problem defined above to make it belong to $\mathcal{NP}$:

**Name**: `NFA Rejected m-string`

**Input**: A nondeterministic finite automata (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a unary-encoded number $m \in \mathbb{N}$.

**Question**: Exists such a string $\alpha \in \Sigma^*$ of length $m$ such that $\alpha \notin L(M)$?

This problem is in $\mathcal{NP}$, because a string of length $m$ can be used as a certificate. Then, by using the idea of the Rabin-Scott theorem, we can compute all potential states after $m$ transitions and therefore test whether the given string is rejected or not in time in $O(\st Z \st^2 \cdot m)$, which is polynomial in the length of the input. The $\mathcal{NP}$-hardness can be shown with a reduction from `CNF-SAT`

as follows:

- Consider a formula $\varphi$ with $k$ clauses and $n$ variables.
- Construct an NFA $M = (Z, \Sigma, \delta, S, E)$ by following the same steps as in the proof of $\mathcal{NP}$-hardness of
`NFA Language Size`

. - Set $m := n$.

$\varphi$ is satisfiable if and only if there is a rejected string of length $m$ in $L(M)$ which is exactly some satisfying assignment for $\varphi$. Clearly, this is a polynomial-time reduction.

## Finding any rejected string is NP-Complete

We can also define another version of the problem and proof it’s $\mathcal{NP}$-completeness:

**Name**: `NFA Rejected String`

**Input**: A nondeterministic finite automata (NFA) $M := (Z, \Sigma, \delta, S, E)$ and a unary-encoded number $m \in \mathbb{N}$.

**Question**: Exists such a string $\alpha \in \Sigma^*$ with $\st \alpha \st \le m$ such that $\alpha \notin L(M)$?

This problem is in $\mathcal{NP}$, because a string of length $m$ can be used as a certificate, like in the `NFA Rejected m-string`

problem.

We will prove $\mathcal{NP}$-hardness by reducing `NFA Rejected m-string`

to this problem. Consider an NFA $M := (Z, \Sigma, \delta, S, E)$ and a number $m' \in \mathbb{N}$:

- Create $m$ new nodes $q_1, \dots, q_m \in Z$.
- For all $1 \le s < m$ and $a \in \Sigma$, set $\delta(q_s, a) := \{q_{s+1}\}$.
- Mark $q_1$ as initial: $S := S \cup \{q_1\}$.
- Mark $q_1, \dots, q_m$ as final: $E := E \cup \{q_i : 1 \le i \le m\}$.
- Set $m := m'$.

By construction, the new NFA will accept all strings of length $m - 1$ or less. Thus, there is a rejected string of length $m$ if and only if there is a rejected string of length at most $m$ in the new NFA. Obviously, this is a polynomial-time reduction.

## A few words about complexity

These problems illustrate an interesting connection between $\mathcal{NP}$-complete problems, that can be solved in polynomial time by nondeterministic turing machines and the problem of just counting the language size described by an NFA. By the Rabin-Scott theorem, it is possible to convert any NFA to an equivalent DFA, but the worst-case complexity of the algorithm is $\Theta(2^n)$, because a set with $n$ elements has $2^n$ subsets and all of these subsets will be reachable in the worst-case. Would the complexity of the subset-construction be polynomial, then it would mean that $\mathcal{P} = \mathcal{NP}$ as we can just search for all paths in the DFA that lead to $\varnothing$ after exactly $n$ transitions (these paths are exactly the variable assignments that satisfy $\varphi$).

The reduction discussed above can also be done in reverse. Any set of cubes can be transformed to a boolean function in conjunctive normal form. So, if $\mathcal{P} = \mathcal{NP}$ and the `CNF-SAT`

problem is solvable in polynomial time, then it is also possible to compare $2^n$ with the size of the union of some arbitrary boolean cubes. The cube union size problem is essentially a special case of the inclusion–exclusion principle, where the formula to compute the size of a union of $n$ sets is also exponentially long, because the amount of possible intersections grows exponentially.

It seems impossible to compute the size of the union of some arbitrary cubes in polynomial time, because the variable assignments that some cube describes is exponential in the length of the encoding of the cube. And the amount of ways to intersect some cubes is also exponential. Any polynomial encoding for a cube will allow us to distinguish between an exponential amount of cubes. However, the amount of ways to intersect some $k$ $n$-variable-cubes is in $\Omega(3^{(3^{n \cdot (k - 1)})})$, so it is impossible to precisely encode the union of these $k$ cubes with a polynomially long encoding. However, it could be possible to divide the search space so that it is still possible to compare $2^n$ with the size of the union of the cubes in polynomial time. Of course, these thoughts aren’t even close to a formal proof that $\mathcal{P} \neq \mathcal{NP}$.