Proof of Transclusion
Introduction
Transclusion was first posed by Ted Nelson in 1980 in his book Literary Machines.1 It was key to his vision of the internet but was one of the things that slipped through. We realize it here and introduce Proof of Transclusion (PoTC)2: A cryptographic construction that verifies a bitstring occurs at a specific point in one or more source documents, without needing a copy of the original document. PoTC gives same-cost inclusion verification regardless of length, and works for intra- and inter-document sequences.
Preliminaries
It can be demonstrated that any amount of parts of a whole bitstream, segmented anywhere, are in fact parts of this whole by combining these parts in a homomorphic hash.3 Hash the whole, then hash the parts in order, and if the hash of the parts matches the hash of the whole then they are the same.
A quote is part of a document that has another part of the document precede it, and another part of the document succeed it. We will consider the preceding part null if the quote starts at the start of the document, and the succeeding part null if the quote ends at the end of the document. This structure is not just inclusion of three parts; they also have to be in sequential order, or a sequential inclusion.
Proofs
We will expand on “sequential inclusion” here to mean any set where both presence and ordinality are conditions that must be satisfied. Since we will be referring to this concept a lot, we will call this a sequinclusion. A sequinclusion can be represented and re-verified with computers by having a representation of the elements of the set as hashed by the homomorphic hashing algorithm, alongside either the hash of the cumulative set or simply the original set.
A quotation within a document can be represented with four hashes and a pair of numbers that are the quote boundaries. The things hashed are:
- the whole document
- the part of the document preceding the quote
- the quote
- the part of the document succeeding the quote
or, Cumulative, Prefix, Inclusion, Suffix (). The quote boundaries are either a start and end index, or a start index and length. The boundaries between the datastreams hashed for , , must be the same as the start and end index, or alternatively the start index plus the length. The quote boundaries are necessary to avoid scanning over the datastream in quadratic time to see if there is a set of three hashes that matches.
The sequinclusion representation is verified by these steps: Hash the hashes , , in that order and compare the result to the hash of the whole . Hash the document parts , , according to the quote boundaries. If the re-computed hashes are the same as the stored ones, and if the hash of the hashes result is the same as the cumulative hash, then it is valid.
A proof of transclusion is set of two or more of sequinclusion proofs. In practice it is stored with identical hashes de-duplicated. In a set of two or more documents, since the quote texts are expected to be identical, their hashes will be identical, so the proof can just store it once, which would look like . A transclusion proof between a pair of documents will have 7 hashes. A document can also transclude from itself, in which case it would store , with cumulative de-duped as well since the cumulative is the same for the same document. Transclusion verification is valid wherever multiple sequinclusions are valid with the same inclusion part.
Implementation
PoTC is most useful when used by identifying source documents by their content hashes in a CAN, which is closer to Ted Nelson’s original vision of the internet. It works as follows: at each document’s creation, hash the text content according to the homomorphic hash and make it identifiable by that hash. When a document transcludes from another document or itself, the document will generate a proof that is a set of hashes all of equal length and put it in metadata.* (preferable if data transmission is more expensive than computation, which for a world-wide network it is). Transclusions do not have to stop at two sequinclusions, it can be more than that. This only practically works if you store the indexes of your quotes alongside your transclusion metadata, otherwise you have to scan your documents in quadratic time in its document’s length to find the quote.
Footnotes
*: data beyond the data - stored separately from the data, not as inline attributes like how HTML/XML does it
Nelson, T. H. Literary Machines. Swarthmore, PA: Self-published, 1980. https://archive.org/details/literarymachines00nels↩︎
Bromberg, L., Shpilrain, V. & Vdovina, A. Navigating in the Cayley graph of and applications to hashing. Semigroup Forum 94, 314–324 (2017). https://doi.org/10.1007/s00233-015-9766-5↩︎