Methods for implementing transclusion of text into HTML pages

Copyright (c) 13/06/1996 Andrew Pam of Xanadu Australia
Second draft 28/06/1996 by Andrew Pam
Third draft 04/08/96 by Andrew Pam
This is a draft document and not to be regarded as final.

Introduction

An important requirement of Xanalogical systems is the ability to transclude, or virtually include, portions of one or more documents into another. This enables composite documents to be constructed where each reader obtains the pieces from the original publisher. HTML already permits images in formats including X bitmaps, GIF and JFIF (JPEG) to be transcluded with the <IMG> tag, and other document types with the <EMBED> tag, but unfortunately does not support the transclusion of text. This document examines methods by which support for text transclusion can be implemented with currently available technologies on the WWW:

  1. Java applet
    The applet would take the URL and range to transclude as parameters and would attempt to retrieve the requested text and display it within the rectangular space reserved for use by the applet.

    Note: JavaScript cannot be used: it can cause a URL to be loaded into a window, but it cannot access data from a URL without visibly loading the entire page.

    HTML:
    <APPLET>
    PROS:
    Allows the remainder of the page to continue loading and rendering while text transclusions are being retrieved. Doesn't require any changes to the server. Supported by many browsers with more to follow. The code can automatically and transparently be downloaded when required.
    CONS:
    Not supported by older browsers that don't have Java. Doesn't use a distinct <TEXT> HTML tag. Japanese and other fonts are probably not implemented yet unless we do it from scratch. Transcluded text must appear within the reserved rectangular applet window and probably can't be richly formatted as with HTML.

  2. Netscape plug-in
    Documents containing text transclusions would have to be given a particular file extension (for example *.thtml) and MIME-type (for example text/thtml). When such a document is requested the plug-in would be invoked in the background and would parse the incoming HTML stream looking for <TEXT> tags and forwarding the rest, and the result of retrieving the transclusions, to the invoking Netscape window.

    HTML:
    <TEXT>
    PROS:
    Supports full HTML markup including Japanese and other foreign fonts supported by the browser.
    CONS:
    Not supported by the many browsers that don't implement Netscape plug-ins. Requires the user to download the plug-in before they can correctly view pages containing text transclusions. Requires server administrators to configure a new MIME-type and publishers to name files containing text transclusions with a different file extension. Finally, the necessary features are not yet implemented by Netscape!

  3. Server Side Include with a CGI script
    Documents would invoke a CGI script which would attempt to retrieve the URL and range specified as parameters to the script.

    HTML:
    <!-- #include --> or <!-- #exec -->
    PROS:
    Supports full HTML markup.
    CONS:
    Not supported by servers that don't implement server side included CGI scripts. Doesn't use a distinct <TEXT> HTML tag. Transcluded text is retrieved by the server on behalf of the user, rather than directly by the user.

  4. Parser/filter CGI script
    The script would retrieve each HTML page at the URL specified as its parameter and parse it, looking for <TEXT> tags and attempting to retrieve the requested URL and range, then inserting the retrieved text into the HTML page as it is output. The script should also parse hyperlinks (<A HREF> tags) and change the destination to lead back to the script itself with the original link destination as a script parameter, so that the script will continue to be invoked to parse all HTML pages retrieved even when links are followed.

    HTML:
    <TEXT>
    PROS:
    Supports full HTML markup. Should work with all known servers and browsers.
    CONS:
    Requires that documents containing text transclusions be accessed via the CGI script. Transcluded text is retrieved by the server on behalf of the user, rather than directly by the user.

  5. Browser implementation
    The browser would directly recognise and interpret <TEXT> tags in HTML to request the specified range and URL and insert it inline.

    HTML:
    <TEXT>
    PROS:
    Probably the most efficient solution.
    CONS:
    Doesn't support users of other browsers.

  6. Server gateway
    The server would parse all HTML files and request the necessary transcluded material while serving each document. Mr. Yousuke Igarashi <yousuke@crew.sfc.keio.ac.jp> suggested making this a proxy module for a web server such as Apache, which would allow users of other servers to set a transclusion supporting server as their proxy.

    HTML:
    <TEXT>
    PROS:
    Supports full HTML markup. Should work with all known browsers.
    CONS:
    Doesn't support users of other servers. Transcluded text is retrieved by the server on behalf of the user, rather than directly by the user.
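
As a rough illustration of the "Server Side Include with a CGI script" method, the script's core could look something like the Python sketch below. It is only a sketch under stated assumptions: the parameter names url, start and end, the function name transclude and the fetch callback are all hypothetical and are not defined anywhere in this draft. A CGI program receives its parameters in the QUERY_STRING environment variable; how the include directive passes them depends on the server.

```python
from urllib.parse import parse_qs


def transclude(query_string, fetch):
    """Return the requested byte range of a document as text.

    query_string -- e.g. "url=...&start=7&end=12" (hypothetical names)
    fetch        -- callable taking a URL and returning the document bytes
    """
    params = parse_qs(query_string)
    url = params["url"][0]
    data = fetch(url)
    # Assumption: zero-based byte offsets, end exclusive, both optional.
    start = int(params["start"][0]) if "start" in params else 0
    end = int(params["end"][0]) if "end" in params else len(data)
    # Assumption: the source is ISO-8859-1 text.
    return data[start:end].decode("latin-1")
```

In a real script, fetch could be something like lambda url: urllib.request.urlopen(url).read(), and the returned text would be printed after a Content-Type header.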

Because all of these methods have their strengths and weaknesses, we will probably want to implement more than one. I believe we decided to start with method 4 (the parser/filter CGI script) and probably methods 5 and 6 later.
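
To make the parser/filter CGI script described above more concrete, here is a minimal Python sketch of its two jobs: expanding <TEXT> tags and pointing hyperlinks back at the script. The names filter_page, script_url and fetch, and the url= script parameter, are assumptions made for illustration. The sketch only handles RANGE with numeric byte offsets and double-quoted HREF values, assumes ISO-8859-1 sources, and leaves links inside transcluded text untouched.

```python
import re
from urllib.parse import quote, urljoin

TEXT_TAG = re.compile(r"<TEXT\s+([^>]*)>", re.IGNORECASE)
ATTR = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|(\S+))')
HREF = re.compile(r'(<A\s[^>]*?HREF\s*=\s*")([^"]*)(")', re.IGNORECASE)


def filter_page(html, base_url, script_url, fetch):
    """Expand <TEXT> tags and route hyperlinks back through the script."""

    def rewrite_link(match):
        target = urljoin(base_url, match.group(2))
        if not target.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto:, ftp: and similar alone
        # The original destination becomes a parameter of the script.
        return (match.group(1) + script_url + "?url="
                + quote(target, safe="") + match.group(3))

    def expand(match):
        attrs = {name.upper(): quoted or bare
                 for name, quoted, bare in ATTR.findall(match.group(1))}
        if "SRC" not in attrs:
            return match.group(0)
        try:
            data = fetch(urljoin(base_url, attrs["SRC"]))
        except OSError:
            return match.group(0)  # retrieval failed: leave the tag in place
        if "RANGE" in attrs:
            start, _, end = attrs["RANGE"].partition(",")
            data = data[int(start or 0):(int(end) if end else None)]
        return data.decode("latin-1")

    # Rewrite the page's own links first, then insert the transclusions.
    return TEXT_TAG.sub(expand, HREF.sub(rewrite_link, html))
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library. A real script would also have to deal with malformed tags and with PLAIN ranges.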

I propose that the new HTML tag should be something like this:

<TEXT SRC=[URL] {(PLAIN|RANGE)={[start]},{[end]}} {WIDTH=[X] HEIGHT=[Y]}>

Where braces {} enclose optional elements, brackets [] enclose variable parameters and parentheses () contain mutually exclusive alternatives separated by vertical bars |.

Parameters:

SRC=[URL]
Mandatory. Specifies the source document from which text is to be transcluded. [URL] must be the URL of a text document of some kind, HTML or otherwise.

(PLAIN|RANGE)={[start]},{[end]}
Optional. If this parameter is omitted, the source document will be transcluded in its entirety. [start] and [end] must be byte offsets within the file. If PLAIN is used, only the text of the source document is parsed; all tags are omitted both in determining the specified offsets and in the transclusion. If RANGE is used, the source document is transcluded verbatim. It is probably an error for both [start] and [end] to be omitted. If either or both are out of range, any in-range portion selected should probably still be delivered.
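
The PLAIN and RANGE rules above might be implemented roughly as in the following Python sketch. Several details are assumptions because the draft does not settle them: offsets are taken to be zero-based with [end] exclusive, a tag is taken to be anything between < and >, and omitting both offsets is treated as an error. The function name extract_range is hypothetical.

```python
import re

TAG = re.compile(rb"<[^>]*>")


def extract_range(source, mode="RANGE", start=None, end=None):
    """Return the bytes selected by a PLAIN or RANGE parameter."""
    mode = mode.upper()
    if mode == "PLAIN":
        # PLAIN: tags are dropped before the offsets are counted.
        source = TAG.sub(b"", source)
    elif mode != "RANGE":
        raise ValueError("mode must be PLAIN or RANGE")
    if start is None and end is None:
        raise ValueError("at least one of start and end is required")
    # Clamp to the document so any in-range portion is still delivered.
    lo = 0 if start is None else max(0, start)
    hi = len(source) if end is None else max(0, min(len(source), end))
    return source[lo:hi] if lo < hi else b""
```

With the source <p>Hello <b>world</b></p>, RANGE=3,8 selects "Hello" from the verbatim bytes, while PLAIN=6, selects "world" from the tag-free text.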

WIDTH=[X] HEIGHT=[Y]
Optional. [X] and [Y] are in pixels and would allow the browser to reserve a rectangular space to present the transcluded text rather than having to wait for it to arrive before continuing the layout of the page, in exactly the same fashion as the <IMG> tag is typically implemented. If the transcluded text will not fit in the reserved space, scroll bars could be displayed. It probably doesn't make sense to implement support for these parameters in the server-side implementations (3, 4 and 6).

The intention is to have a facility in authoring programs that permits the author to create transclusions by indicating an insertion point, viewing the document from which they wish to transclude, and marking the region to be transcluded, much in the manner of a traditional "cut and paste" operation except that what is actually pasted is the reference to the transcluded portion rather than the literal text.

Initially, this could be a small editing program purely for adding transclusions to existing documents. It has also been suggested that people might wish to add transclusions by hand, in which case it might be desirable to have other ways of specifying the start and end of the range besides just the byte offsets, which are inconvenient to determine by hand.

Possible extensions to [start] and [end] values (for discussion):

  1. HTML target anchors <A NAME="target">, indicated by prefixing the target name with a hash mark. This should reference the start of an anchor. Byte offsets from this position could be permitted by appending a plus or minus and the number of bytes. Examples: RANGE=#start,#end RANGE=#start+5,#end-1
  2. Paragraphs (<P> in HTML), indicated by the letter P and the paragraph number counting from the beginning of the document. Sentences and words could also be supported similarly to paragraphs, but at additional computational expense. Example: RANGE=P5,P9-3
  3. Offsets from pattern matches, suggested by Paul Haeberli <paul@sgi.com> in his "Merge" script. This could be signified by enclosing the pattern with slashes, single or double quotes as delimiters which may not appear within the pattern. Examples: RANGE='<img src',"</center>"-15 RANGE=105,/day./+6
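
To help the discussion, the three proposed forms could be resolved to byte offsets along the lines of the Python sketch below. It resolves a single [start] or [end] value; splitting the RANGE value at the comma is left out. The draft leaves several points open, so the sketch assumes that patterns are literal strings rather than regular expressions, that a pattern offset is measured from the start of the match, that paragraphs are counted by <P> tags starting from 1, and that anchor names contain no plus, minus or comma. The function name resolve is hypothetical.

```python
import re

ENDPOINT = re.compile(r"""
    (?: (?P<byte>\d+)                               # plain byte offset
      | \#(?P<anchor>[^+\-,]+)                      # target anchor name
      | P(?P<para>\d+)                              # paragraph number
      | (?P<delim>[/'"])(?P<pattern>.*?)(?P=delim)  # delimited pattern
    )
    (?P<adjust>[+-]\d+)?                            # optional byte adjustment
""", re.VERBOSE)


def resolve(spec, doc):
    """Turn one [start] or [end] value into a byte offset within doc."""
    m = ENDPOINT.fullmatch(spec)
    if m is None:
        raise ValueError("unrecognised range value: " + spec)
    if m["byte"] is not None:
        base = int(m["byte"])
    elif m["anchor"] is not None:
        name = re.escape(m["anchor"].encode("latin-1"))
        found = re.search(rb'<A\s+NAME\s*=\s*"?' + name + rb'"?\s*>',
                          doc, re.IGNORECASE)
        if found is None:
            raise ValueError("no such anchor: " + m["anchor"])
        base = found.start()  # the start of the anchor tag
    elif m["para"] is not None:
        starts = [p.start()
                  for p in re.finditer(rb"<P\b[^>]*>", doc, re.IGNORECASE)]
        number = int(m["para"])
        if not 1 <= number <= len(starts):
            raise ValueError("no such paragraph: " + spec)
        base = starts[number - 1]
    else:
        base = doc.find(m["pattern"].encode("latin-1"))
        if base < 0:
            raise ValueError("pattern not found: " + m["pattern"])
    return base + int(m["adjust"] or 0)
```

For example, resolve("#start+5", doc) gives the offset five bytes past the start of <A NAME="start">, and resolve("P5", doc) gives the offset of the fifth <P> tag.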

Comments welcome!