[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]
Version Compare/Merge: Definition of Terms
- To: <xanatech>
- Subject: Version Compare/Merge: Definition of Terms
- From: Marc Stiegler <marcs>
- Date: Sat, 7 Apr 90 23:03:09 PDT
The algorithm is actually embarassingly simple once you
view the problem through the right set of concepts. Hence
this message is a series of definitions of terms. These
terms map more or less directly into the objects and
operations used in the algorithm. Most of the terms are
pretty self-evident; however, the definitions of "movement
runs" and "composite runs" are worth reading carefully. The
concept of the "movement run" is the closest thing to a
conceptual breakthrough in this analysis.
The list of definitions here is pretty long; the
algorithm itself is ironically rather short. You can
probably skip the definitions and read the algorithm and
get a good idea of what's going on if you're in a hurry;
the algorithm itself is in the message following this one.
------
A RUN is a sequence of objects, often sequences of
characters, often smaller runs, which are correctly
sequenced by some criterion (such as, they are in the
sequence they appear in when viewed in the final document).
Two objects are ADJACENT inside a run if they are next to
each other.
-The ORIGINAL DOCUMENT is a run of characters sequenced
as they appear in the document at the beginning of the
transformation.
-The FINAL DOCUMENT is a run of characters sequenced as
they appear at the end of the transformation.
-A PRIMITIVE RUN is a run of characters that have the
same sequence in both the original document and the final
document. The SIZE of a primitive run is the number of
characters it contains. Primitive runs come in 3 flavors:
there are PRIMITIVE MOVES, which appear in both the
original and final documents; there are INSERTIONS, which
appear only in the final document (since they don't appear
in the original document at all, they do have the same
sequence in the original document, sort of :-). And there
are DELETIONS, which appear only in the original document.
Please note that primitive runs are approximately the
kinds of objects which the backend returns to the
application in its comparison operation. Primitive runs are
therefore the starting material for the transformation.
One way in which these primitive runs differ from
Xanadu is that I don't account for vcopies, i.e., the case
where a single run appears in several places in the
document. These can be handled with a simple add-on to the
algorithm here, and are discussed in the section on
deficiencies.
- The WORKING DOCUMENT is a run of runs that includes
all the primitive runs from both the original and final
documents, i.e., both insertions and deletions are
included. The working document is transformed by successive
operations, from a state in which the primitive moves and
deletions are all sequenced according to the original
document, to a state in which the primitive moves and
insertions are all sequenced according to the final
document.
A pair of runs in the working document are PROPERLY
ADJACENT if
- they are adjacent in both the working
document and the final document, AND if
- they are in the same sequence in the working
document, as they are in the final document.
Such properly adjacent runs are ripe to be melded
together into a single run, because they will remain
together for the duration of the transformation.
To construct the working document in its initial state,
start out with all the primitive runs from the original
document in their proper sequence. Now insert the primitive
insertions from the final document by placing each
insertion properly adjacent to the larger of the 2
primitive moves which would be its neighbors in the final
document.
Please note that the initial state of the working
document is exactly the document which should appear in
Tapestry's version compare/merge display.
-A MOVEMENT RUN is a run composed of primitive runs;
the movement run has 3 interesting properties:
1) Neither of its neighbors are insertions or
deletions.
2) Neither of its neighbors are properly
adjacent to it.
3) It contains at least one primitive move run.
In other words, a movement run is a run which must
somehow be moved with respect to both its neighbors in
order to reach the arrangement found in the final document.
The entire working document can, and must, be broken into
movement runs. In a simple case, "A B C" becoming "B C A",
the primitive runs are A and B and C, while the movement
runs are (A) and (B C).
The SIZE of a movement run is the sum of the sizes of
its primitive runs.
To transform the working document into a sequence of
movement runs, run through the following operations until
none of these operations is still possible:
- Combine each insertion with the larger of the
two primitive moves which are its neighbors in the final
document, to form a proto-movement-run.
- Combine each deletion with the larger of the
two primitive moves which are its neighbors in the
original document, to form a proto-movement-run.
- Combine all pairs of proto-movement-runs
which are properly adjacent.
For example, in the original document "A del1 B del2 C"
being transformed into final document "B insert1 C A", the
working document initial state might be "A del1 B insert1
del2 C", and the movement runs might be (A) (del1 B insert1
del2 C).
Please note that once you have created movement runs, a
valid (though poor) description of transformations would be
to state that every movement run had moved to a new
location.
It's also worth noting that, following the above
algorithm for constructing movement runs, insertions and
deletions are "attracted" to the largest primitive moves.
With this clustering, movement runs will tend to be either
very large, or very small. This will be used in the
algorithm to maximize the extent to which small objects are
moved large distances, following the heuristic
specification.
- A COMPOSITE RUN is a run of movement runs which have
been rearranged such that they are properly adjacent. The
SIZE of a composite run has 2 components: first is the
number of movement runs it contains. Second is the sum of
the sizes of the contained movement runs. A composite run
is always considered to be "larger than" a single movement
run. One composite run is larger than the other if the
first part of the size is greater; if the first part of the
size is the same, it is larger if the second part of its
size is greater.
Please note that the size of a composite run reflects
the number of movements required to compose it: the number
of movements required is equal to N-1, where N is the
number of movement runs it contains. This complex
definition of size reduces the frequency with which moved
objects will be moved again, thus reducing the amount of
nesting.
Note also that composite runs do NOT contain other
composite runs. If two composite runs are melded together,
the resulting composite run contains all the movement runs
from both of the originals.
- A REVERSED SEQUENCE is a series of movement runs
which are in inverted order compared to their sequencing in
the final document. "A B" transformed to "B A" is the
simplest reversed sequence. "A B C D E" transformed to "E D
C B A" is a more complex example. Reversed sequences of up
to 3 elements are very common in editing; higher order
reverses are rare. Indeed, it is surprising just how much
common editing can be described with reverses of 2 or 3
elements, as you may discover while reading the examples
later (as an aside, all reverses of any length can be
described as a series of 2-component reverses, though again
this gives intuitively poor results).
- The ANCHOR is the largest single run in the working
document. The anchor may change every time the working
document is transformed.
If you understand all these terms, you are now equipped
to encounter the mysteries of version comparison and
merging. Prepare yourself for a breathtaking experience
:-)
--marcs