As mentioned in Sect. 1.4, DFAs are often implemented by tables
where the current state is cross-indexed by the next symbol to find the next state. If
the alphabet is large, such a table can take up quite a lot of room. If, for example,
16-bit Unicode is used as the alphabet, there are 2^16 = 65536 entries in each row
of the table. Even if each entry in the table is only one byte, each row will take up
64 KB of memory, which may be a problem.
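As a concrete sketch (not part of the exercise text), a table-driven DFA step is just a row lookup; the two-state parity DFA below is invented for illustration:

```python
# Hypothetical table-driven DFA: accepts strings with an even number of '1's.
# One row per state; each row is indexed by the symbol's code to give the
# next state. With a 16-bit alphabet, each row would instead need
# 2**16 = 65536 entries, i.e. 64 KB per state at one byte per entry.
EVEN, ODD = 0, 1
table = [
    # on '0'  on '1'
    [EVEN,   ODD],   # row for state EVEN
    [ODD,    EVEN],  # row for state ODD
]

def run(table, s, start=EVEN):
    state = start
    for c in s:
        # Cross-index the current state by the next symbol's code.
        state = table[state][int(c)]
    return state

print(run(table, "1101"))  # three '1's, so the DFA ends in state ODD (1)
```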
A possible solution is to split each 16-bit Unicode character c into two 8-bit characters c1 and c2. In the regular expressions, each occurrence of a character c is hence
replaced by the regular expression c1c2. This regular expression is then converted to
an NFA and then to a DFA in the usual way. The DFA may (and probably will) have
more states than the DFA using 16-bit characters, but each state in the new DFA uses
only 1/256th of the space used by a state in the original DFA.
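The splitting step can be sketched as follows; the helper name `split16` is an invention for illustration, and the split simply takes the high byte followed by the low byte:

```python
def split16(c):
    """Split a 16-bit code point c into two 8-bit characters c1, c2,
    so each occurrence of c in a regular expression becomes c1 c2."""
    code = ord(c)
    assert code < 0x10000  # the text assumes a 16-bit alphabet
    return code >> 8, code & 0xFF

# Example: U+20AC (the euro sign) splits into 0x20 followed by 0xAC,
# so the new DFA reads two 8-bit symbols where the old one read one.
print(split16('\u20ac'))  # (32, 172)
```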
a) How much larger is the new NFA compared to the old?
b) Estimate the expected size (measured as the number of states) of the new DFA
compared to the old. Hint: Some states in the NFA can be reached only after
an even number of 8-bit characters are read and the rest only after an odd number
of 8-bit characters are read. What does this imply for the sets constructed during
the subset construction?
c) Roughly, how much time does the new DFA require to analyse a string compared
to the old?
d) If space is a problem for a DFA over an 8-bit alphabet, do you expect that a
similar trick (splitting each 8-bit character into two 4-bit characters) will help
reduce the space requirements? Justify your answer.