HTML to Repository Serialization

To store meaningful changes in git, PSO serializes HTML to a folder/file structure. This makes change and move detection behave the way git expects and reduces the chance of merge conflicts.

Note

Almost all of these would make great test cases.

Design Issues

How do we represent text nodes?

A plain text file per text node is enough to store what is necessary. Filenames can then be generated arbitrarily, likely from any id attributes that already exist.

So, if we had:

<p GUID=1>hello <span GUID=2>stuff</span> goodbye </p>

This would become something organized like this:

* GUID1
    * metadata.json
    * 1.txt
    * GUID2 (this is a dir)
        * 1.txt
    * 2.txt

There is the issue of:

  • What if “goodbye” was moved to be before the span?

There isn’t much that can be done here: the HTML parser will likely produce one file for each of “hello” and “goodbye” so long as the span separates them.

Without the span, the text would be treated as one file. Moving part of it would then show up as one removed file and one modified file.

To solve this, the parser would have to decide where to stop when splitting text so that a file doesn’t get removed, and that requires quite a bit of extra information.

This isn’t insurmountable: the old version of the file could be read to find where it “ends”, and the new text could then be parsed up to that point, or up to the first child node, and split into files accordingly.

That being said, this could result in some file directories that look very strange, but would likely work better in diffs.
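The layout discussed in this section can be sketched with Python's standard `html.parser`. The sketch builds a nested dict in place of real directories, assumes every element carries a `GUID` attribute, and gives every directory a metadata entry; the class name `TreeSerializer` and the numbering of text files are illustrative choices.

```python
from html.parser import HTMLParser


class TreeSerializer(HTMLParser):
    """Builds a nested dict that mirrors the proposed directory layout."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self.stack = [self.root]  # directories currently open

    def handle_starttag(self, tag, attrs):
        guid = dict(attrs).get("guid")  # html.parser lowercases attribute names
        directory = {"metadata.json": {"tag": tag}}
        self.stack[-1][f"GUID{guid}"] = directory
        self.stack.append(directory)

    def handle_endtag(self, tag):
        self.stack.pop()

    def handle_data(self, data):
        # Each run of text becomes the next numbered .txt file in its directory.
        current = self.stack[-1]
        index = sum(1 for name in current if name.endswith(".txt")) + 1
        current[f"{index}.txt"] = data


def serialize(html):
    parser = TreeSerializer()
    parser.feed(html)
    parser.close()
    return parser.root


before = serialize("<p GUID=1>hello <span GUID=2>stuff</span> goodbye </p>")
after = serialize("<p GUID=1>hello goodbye <span GUID=2>stuff</span></p>")
print(list(before["GUID1"]))  # ['metadata.json', '1.txt', 'GUID2', '2.txt']
print(list(after["GUID1"]))   # ['metadata.json', '1.txt', 'GUID2']
```

Moving “goodbye” in front of the span leaves `1.txt` modified and `2.txt` gone, which is the removed-plus-modified pattern described above.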

How do we represent ordering of text nodes mixed with other nodes?

This could be done with naming in the lexicographic-based construction, and with the table in the table-based construction. Basically, the names of the directories and of the files that hold content determine what goes where. That is, if a table says:

  • GUID1
  • GUID2
  • GUID3

Regardless of whether any of those are directories, just recursively build each directory into a node, and then paste it in during construction.

For name-based construction, the idea is the same, but the order comes from the names themselves.
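The two orderings can be compared with a small sketch. It models a directory as a nested dict, a text file as a string, and assumes the metadata file carries a `tag` and an `order` list; those key names and the `build` helper are hypothetical, chosen only for illustration.

```python
def build(directory, ordering):
    """Recursively rebuild an HTML string from a nested-dict 'directory'."""
    meta = directory["metadata.json"]
    parts = []
    for name in ordering(directory):
        child = directory[name]
        # A dict stands in for a sub-directory (an element), a str for a text file.
        parts.append(build(child, ordering) if isinstance(child, dict) else child)
    return f"<{meta['tag']}>{''.join(parts)}</{meta['tag']}>"


def table_order(directory):
    # Table-based: the metadata file lists the children explicitly.
    return directory["metadata.json"]["order"]


def lexicographic_order(directory):
    # Name-based: sort every entry except the metadata file by name.
    return sorted(name for name in directory if name != "metadata.json")


paragraph = {
    "metadata.json": {"tag": "p", "order": ["GUID1", "GUID2", "GUID3"]},
    "GUID1": "hello ",
    "GUID2": {
        "metadata.json": {"tag": "span", "order": ["GUID4"]},
        "GUID4": "stuff",
    },
    "GUID3": " goodbye ",
}

print(build(paragraph, table_order))
```

Here both orderings give the same result because the names happen to sort in table order. With the table, names are free to be arbitrary GUIDs; with names alone, every name has to encode its position.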

How do we store node type?

This could be done in the metadata file, since there will be one either way.

How do we store node attributes (src, href, etc)?

Same as node type: the metadata file would work well for this.
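As a sketch of what such a metadata file might contain, the snippet below writes the node type and attributes as JSON. The key names `tag` and `attributes` and the helper `make_metadata` are assumptions for illustration, not a settled format.

```python
import json


def make_metadata(tag, attributes):
    """Serialize a node's type and attributes as the text of a metadata.json file."""
    document = {"tag": tag, "attributes": dict(attributes)}
    # One key per line and a stable key order keep git diffs small and local.
    return json.dumps(document, indent=2, sort_keys=True)


print(make_metadata("a", [("href", "https://example.com"), ("title", "Example")]))
```

Sorting keys and indenting is a design choice rather than a requirement: it means that changing one attribute touches one line of the file.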

Simple Paragraph Addition

Initial Content

<p>p1</p>
<p>p2</p>

After GUID-ization

The first pass adds GUIDs to any nodes that don’t have them. Since this is the initial content, every node gets a new GUID. Later examples omit the GUIDs; assume they are there.

<p data-por-guid="GUID1">p1</p>
<p data-por-guid="GUID2">p2</p>
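A minimal sketch of such a first pass, assuming the fragment is well-formed enough to parse as XML (real-world HTML would need an HTML parser instead). The function name `add_guids` and the injectable `new_guid` parameter are hypothetical; only the `data-por-guid` attribute name comes from the example above.

```python
import uuid
import xml.etree.ElementTree as ET

GUID_ATTR = "data-por-guid"


def add_guids(fragment, new_guid=lambda: str(uuid.uuid4())):
    """Return the fragment with a GUID attribute on every element that lacks one."""
    # Wrap the fragment so that several top-level elements parse as one tree.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for element in root.iter():
        if element is not root and GUID_ATTR not in element.attrib:
            element.set(GUID_ATTR, new_guid())
    # Drop the wrapper again; tostring() includes the text that follows each child.
    children = "".join(ET.tostring(child, encoding="unicode") for child in root)
    return (root.text or "") + children


print(add_guids("<p>p1</p>\n<p>p2</p>"))
```

Nodes that already carry a GUID are left alone, so running the pass on later edits only labels the newly added nodes.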

Lexicographical Representation

Table Representation

Edit 1: Additional 2nd Paragraph

<p>p1</p>
<p>p3</p>
<p>p2</p>

Lexicographical Representation

Conflicts: No

Table Representation

Conflicts: No

Edit 2: Additional Last Paragraph

From the initial content, we’ll add a 4th paragraph.

<p>p1</p>
<p>p2</p>
<p>p4</p>

Lexicographical Representation

Conflicts: No

Table Representation

Conflicts: No

Merged Edit 1 and Edit 2

Those two edits should merge in without conflicts.

<p>p1</p>
<p>p3</p>
<p>p2</p>
<p>p4</p>

Lexicographical Representation

Conflicts: No

Table Representation

Conflicts: Yes. In metadata.json.
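One way to see why the table representation is exposed here: git can only report a content conflict in a path that both sides changed. The sketch below models each snapshot as a dict from path to content and checks for such an overlap. The file names and the one-GUID-per-line table format are assumptions for illustration, and an overlap is only a necessary condition; whether git actually conflicts depends on how close the changed lines are.

```python
def changed_paths(base, edited):
    """Paths added, removed, or modified between two snapshots (path -> content)."""
    return {
        path
        for path in base.keys() | edited.keys()
        if base.get(path) != edited.get(path)
    }


def may_conflict(base, ours, theirs):
    """True when both edits touch at least one common path."""
    return bool(changed_paths(base, ours) & changed_paths(base, theirs))


# Lexicographic (hypothetical names): order comes from the file names themselves.
lex_base = {"a.txt": "p1", "c.txt": "p2"}
lex_edit1 = {**lex_base, "b.txt": "p3"}  # p3 sorts between p1 and p2
lex_edit2 = {**lex_base, "d.txt": "p4"}  # p4 sorts last

# Table (hypothetical format): order comes from a list in metadata.json.
tab_base = {"GUID1.txt": "p1", "GUID2.txt": "p2", "metadata.json": "GUID1\nGUID2\n"}
tab_edit1 = {**tab_base, "GUID3.txt": "p3", "metadata.json": "GUID1\nGUID3\nGUID2\n"}
tab_edit2 = {**tab_base, "GUID4.txt": "p4", "metadata.json": "GUID1\nGUID2\nGUID4\n"}

print(may_conflict(lex_base, lex_edit1, lex_edit2))  # False
print(may_conflict(tab_base, tab_edit1, tab_edit2))  # True
```

In the lexicographic case each edit only adds its own file, while in the table case both edits rewrite `metadata.json`.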

Nested Nodes

Initial Content

<p>p1<span>s1</span>stillp1</p>
<p>p2</p>

Node Moves

Initial Content

<p>p1</p>
<p>p2</p>
<p>p3</p>
<p>p4</p>

Lexicographical Representation

Table Representation

Edit 1: Last to First

<p>p4</p>
<p>p1</p>
<p>p2</p>
<p>p3</p>

Lexicographical Representation

Conflicts: No

Table Representation

Conflicts: No

Edit 2: Last to First with content change

<p>p4new</p>
<p>p1</p>
<p>p2</p>
<p>p3</p>

Lexicographical Representation

Conflicts: No

Table Representation

Conflicts: No

Merged Edit 1 and Edit 2

Those two edits should merge in without conflicts.

<p>p4new</p>
<p>p1</p>
<p>p2</p>
<p>p3</p>

Lexicographical Representation

Conflicts: No, provided the content change and the move are separate commits.

Table Representation

Conflicts: No

Design Decision

Gitlit tracks the order in which nodes should be constructed using the table method. This has the simplest base case, and stays simple even when complexity is added through commits.

For an in-depth discussion of why, see this git issue.