
Comparison & Compression



I wrote:

> ....  However, I believe that if you
> allow rearrangement (not just insertion/deletion) the problem becomes
> exponential (like the travelling salesman problem).

I have convinced myself that it's not exponential.  Below is pseudo-code
for a program which does it in at most cubic time.  There may be room
for further enhancements by sorting the pointers as Ted suggested, but
cubic is an upper bound.

Of course I agree that it's better if the documents are authored with
references in the first place, instead of having to discover them later
on.  Then we don't have to do this at all.

First, some definitions; the pseudo-code follows them:

A reference replaces a substring with a pointer to its definition in
some other location.

I want to bound the complexity of a program which, given two strings
(documents), finds all common text segments and replaces them with
references.

Suppose we have a procedure which compares two strings and finds their
longest common substring (LCS).  We enhance this procedure to work with
strings having embedded references.  In this case, the procedure does
not follow the references to compare the pointed-to text; it simply
compares the pointers.
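
To make that concrete, here is one way such strings might be
represented (a minimal sketch in Python; the Ref type, the token-list
representation, and all the names are mine, for illustration only):

  from dataclasses import dataclass
  from typing import List, Union

  @dataclass(frozen=True)
  class Ref:
      """A pointer to an entry in a shared table of definitions."""
      def_id: int

  # A string with embedded references becomes a list of tokens, where
  # each token is either a literal one-character string or a Ref.
  Token = Union[str, Ref]

  def tokenize(s: str) -> List[Token]:
      """A plain string is just a list of single-character tokens."""
      return list(s)

  # Two tokens match when they are equal: the same character, or Refs
  # naming the same definition.  The pointed-to text is never re-read;
  # Ref is frozen, so == compares the def_id fields directly.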

Let Lmin be the length of the shortest string which is worth replacing
by a pointer.  In some environments, this might be a single character;
in others, it might be a line, a sentence or a paragraph.

The main loop is as follows:

Given two strings, repeatedly:
  1) find the LCS
  2) if the LCS found is shorter than Lmin, then stop
  3) edit both of the original strings to:
     a) save the LCS text to a new, common, place and
     b) replace each occurrence with a reference to that new place

Note: to ensure that our program stops at all, it is necessary to
exclude matching substrings consisting of exactly one pointer, either
by counting such a match as shorter than Lmin or in the definition of
LCS itself; otherwise we would keep replacing a pointer with a pointer
to it, building infinite chains of pointers to pointers...
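
In that representation, the main loop might be sketched like this
(again illustrative Python, not a definitive implementation; find_lcs
is the pointer-aware LCS procedure, and one possible implementation of
it is given near the end of this message):

  def factor_out_common_text(doc1, doc2, definitions, l_min):
      """Repeatedly pull the longest common run of tokens out of two
      token lists, saving it in the shared definitions table and
      splicing a Ref to it into each document."""
      while True:
          # 1) find the LCS and where it starts in each document
          match, i, j = find_lcs(doc1, doc2)
          # 2) stop if it is shorter than Lmin...
          if len(match) < l_min:
              return
          # ...or if it is exactly one pointer (see the note above:
          # replacing a pointer with a pointer to it never terminates)
          if len(match) == 1 and isinstance(match[0], Ref):
              return
          # 3a) save the LCS text to a new, common place...
          definitions.append(match)
          ref = Ref(len(definitions) - 1)
          # 3b) ...and replace each occurrence with a reference to it
          doc1[i:i + len(match)] = [ref]
          doc2[j:j + len(match)] = [ref]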

Since each pass through the main loop shortens both strings by at
least one character (a match of Lmin or more characters is replaced by
a single pointer), the number of passes grows as O(n), where n is the
length of the shorter string.

One possible implementation of the procedure to find the LCS is as
follows:

  initialize LCS = the null string
  for i from 1 to length(first string) do
    for j from 1 to length(second string) do
      start at the i-th char of the first string and the j-th char of
        the second string
      compare char-by-char until the end of either string or a
        mismatch is found
      if this match is longer than LCS, then replace LCS with it
    end for
  end for

The double loop visits O(m*n) pairs of starting positions, where m is
the length of the longer string and n the length of the shorter.  Each
char-by-char scan, however, can itself run for up to n characters (two
strings consisting of the same repeated character force this), so the
naive version above is really O(m*n*n) in the worst case.  The
standard dynamic-programming formulation of longest common substring
avoids the rescanning and runs in O(m*n) time, and that is the bound I
use below.
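
For completeness, here is a sketch of that dynamic-programming version,
written against the token lists from the earlier fragments; it returns
the match together with its starting position in each input:

  def find_lcs(s, t):
      """Longest common contiguous run of tokens, by the standard
      O(m*n) dynamic programme: cur[j] is the length of the longest
      common run ending at s[i-1] and t[j-1]."""
      best_len, best_i, best_j = 0, 0, 0
      prev = [0] * (len(t) + 1)
      for i in range(1, len(s) + 1):
          cur = [0] * (len(t) + 1)
          for j in range(1, len(t) + 1):
              if s[i - 1] == t[j - 1]:
                  cur[j] = prev[j - 1] + 1
                  if cur[j] > best_len:
                      best_len = cur[j]
                      best_i = i - best_len
                      best_j = j - best_len
          prev = cur
      # the match itself plus where it starts in each input
      return s[best_i:best_i + best_len], best_i, best_j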

OK, so if the number of passes through the main loop grows as O(n) and
each LCS computation costs O(m*n), then the complexity of our program
to replace all common text segments with references is O(m*n*n), which
is cubic, much better than the exponential I had feared.

     - Rich