
*To*: xanadu@xxxxxxxxxxxxx
*Subject*: Comparison & Compression
*From*: Rich Pasco <pasco@xxxxxxxx>
*Date*: Tue, 12 Aug 1997 09:28:18 -0700
*References*: <l03102802b01079480efd@[133.27.108.156]> <l03102800b0132472283c@[204.179.134.60]> <l03102800b013eaa1522d@[204.179.130.94]> <33EE8909.47E8@xxxxxxxx>
*Reply-to*: xanadu@xxxxxxxxxxxxxxxxx
*Sender*: xanni@xxxxxxxxxxxxxxxxx

I wrote:

> .... However, I believe that if you allow rearrangement (not just
> insertion/deletion) the problem becomes exponential (like the
> travelling salesman problem).

I have convinced myself that it's not exponential. Below is pseudo-code for a program which does it in at most cubic time. There may be room for further improvement by sorting the pointers as Ted suggested, but cubic is an upper bound.

Of course I agree that it's better if the documents are authored with references in the first place, instead of having to discover them later on. Then we don't have to do this at all.

The pseudo-code follows some definitions. A reference replaces a substring with a pointer to its definition in some other location. I want to bound the complexity of a program which, given two strings (documents), finds all common text segments and replaces them with references.

Suppose we have a procedure which compares two strings and finds their longest common substring (LCS). We enhance this procedure to work with strings having embedded references: in that case, the procedure does not follow the references to compare the pointed-to text; it simply compares the pointers. Let Lmin be the length of the shortest string which is worth replacing by a pointer. In some environments, this might be a single character; in others, it might be a line, a sentence or a paragraph.

The main loop is as follows. Given two strings, repeatedly:

  1) find the LCS
  2) if the LCS found is shorter than Lmin, then stop
  3) edit both of the original strings to:
     a) save the LCS text to a new, common place, and
     b) replace the matched text in each string with a reference
        to that new place

Note: to ensure that our program stops at all, it is necessary to exclude, via Lmin or the definition of LCS, matching substrings consisting of exactly one pointer; otherwise we would get infinite chains of pointers to pointers.
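The main loop above can be sketched in Python. This is only an illustrative sketch, not code from the post: strings are modeled as lists of tokens (single characters or `("ref", id)` pointer tuples, so pointers are compared as opaque tokens rather than dereferenced), and the names `find_lcs`, `share_common_text` and `L_MIN` are my own inventions.

```python
L_MIN = 3  # Lmin: shortest match worth replacing by a pointer (assumed value)

def find_lcs(a, b):
    """Longest common run of tokens between a and b, by the naive scan
    over every pair of starting positions. Pointer tuples are compared
    as opaque tokens, never followed."""
    best_i = best_j = best_len = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_i, best_j, best_len = i, j, k
    return best_i, best_j, best_len

def share_common_text(a, b):
    """Repeatedly pull the LCS out into a common definitions table and
    replace the matched text in each string with a pointer to it."""
    definitions = {}
    next_id = 0
    while True:
        i, j, n = find_lcs(a, b)
        # Stop below Lmin; also exclude a match that is exactly one
        # pointer, or we would chain pointers to pointers forever.
        if n < L_MIN or (n == 1 and isinstance(a[i], tuple)):
            break
        definitions[next_id] = a[i:i + n]
        a[i:i + n] = [("ref", next_id)]
        b[j:j + n] = [("ref", next_id)]
        next_id += 1
    return a, b, definitions
```

For example, `share_common_text(list("the quick brown fox"), list("a quick brown dog"))` factors the shared run `" quick brown "` out into the definitions table and leaves a `("ref", 0)` token in each string.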
Since each iteration of the main loop shortens the master strings by at least one character, the number of iterations grows as O(n), where n is the length of the (shorter) string.

One possible implementation of the procedure to find the LCS is as follows:

    initialize LCS = the null string
    for i from 1 to length(first string) do
      for j from 1 to length(second string) do
        starting at the i-th char of the first string and the j-th char
        of the second string, compare char-by-char until the end of a
        string or a mismatch is found
        if this match is longer than LCS, replace LCS by the current match
      end for
    end for

The complexity is O(m*n) pairs of starting positions, where m is the length of the longer string and n is the length of the shorter string. (Strictly, the char-by-char comparison adds a further factor in the worst case, but the standard dynamic-programming formulation of longest common substring runs in O(m*n) time, so O(m*n) stands for this step.) There may be an implementation of LCS which grows less than O(m*n), but I didn't bother to find it; this is an upper bound.

OK, so if the number of main-loop iterations grows as O(n) and the implemented LCS grows as O(m*n), then the complexity of our program to replace all common text segments with references is O(m*n*n), which is cubic: much better than the exponential I had feared.

 - Rich
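The dynamic-programming variant of longest common substring runs in O(m*n) time: a table entry T[i][j] holds the length of the longest common run ending at position i of the first string and j of the second, and only the previous row need be kept. A minimal sketch (mine, not from the post):

```python
def longest_common_substring(a, b):
    """O(m*n) longest common substring via dynamic programming.
    cur[j] = length of the longest common run ending at a[i-1], b[j-1]."""
    m, n = len(a), len(b)
    best_len, best_end = 0, 0
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For instance, `longest_common_substring("xabcy", "zabcq")` yields `"abc"`. (A generalized suffix tree can reduce this further, to O(m+n), though that is well beyond what the cubic bound requires.)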

**Follow-Ups**:
**Transclusion Issues**, *From:* Art Pollard
