Mutability, Recoverability, and Traceability

This document uses the term mutable to mean changeable, or subject to change. This topic builds a reference model for reasoning about data, recoverability, and history tracking.

Versioned versus mutable data. The scope of the Activity model does not include providing a general versioning mechanism for all entity types. However, any data item needs to have a traceable history, whether it is versioned or not. The conflicting needs of backtracking interpretation-type processing and data lifecycle management lead to two very different kinds of data: mutable and versioned.

Mutable data is created and then changed in place. Most data in a database is inherently mutable, and SQL provides operators to modify data, namely the UPDATE statement. If we were to keep track of the update history for such an object, it would look like a list of Activity object references, starting with an initial data acquisition or creation step and continuing through quality control steps and all subsequent alterations. We might represent this by saying that the current state of the object at some point (say, state i) is the result of actions A1, A2, and A3, and write it like this:

(A1, A2, A3)-->state i (writable)

This is similar to the maintenance records for an automobile. Once data is modified, a prior state of mutable data is not inherently recoverable, so if a user wishes to be able to get back to a state at a later time, that can only be done by creating and safeguarding a copy of the data. The lack of recoverability is the main drawback with mutable data.

As mentioned above, individual applications may implement an undo action, but application undo has a number of disadvantages that prevent a system from depending on it for the general problem of reverting to an earlier state. Not all applications implement it. The granularity of an undo is tied to an application action, so a user-level revert may require undoing actions in multiple applications. Databases also provide the ability to revert, since the database state may be rolled back to an earlier saved state, or selected transactions may be undone. This suffers from the same granularity mismatch as application undo: the user would like to capture the state of their activities and later revert to that state with simple operations that do not, for example, require DBA expertise.

Creation of a recoverable (and unchangeable) state i and subsequent application of additional actions might result in something like this:

(A1, A2, A3)-->state i (read-only)

state i-->(A12, A13, A14)-->state j (writable)

It will probably be easy to determine that the object at state j is a later version of the object at state i, but this will be a property of the data versioning model. If we then delete state j and want state i to be mutable again, we are left with:

(A1, A2, A3)-->state i (writable)

Alternatively, if we decided state i was not needed after all, the result might be:

(A1, A2, A3, A12, A13, A14)-->state j (writable)
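The state transitions above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not part of the Activity model: the State class and the functions freeze_and_continue, drop_successor, and drop_predecessor are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class State:
    """One state of a data object and the activities that led to it from its parent."""
    name: str
    activities: List[str] = field(default_factory=list)
    writable: bool = True
    parent: Optional["State"] = None


def freeze_and_continue(state: State, new_name: str) -> State:
    """Make `state` read-only (recoverable) and start a writable successor."""
    state.writable = False
    return State(name=new_name, parent=state)


def drop_successor(successor: State) -> State:
    """Delete the later state; the earlier state becomes writable again."""
    earlier = successor.parent
    if earlier is None:
        raise ValueError("no earlier state to fall back to")
    earlier.writable = True
    return earlier


def drop_predecessor(successor: State) -> State:
    """Discard the earlier state and fold its activities into the later one."""
    earlier = successor.parent
    if earlier is None:
        raise ValueError("no earlier state to discard")
    successor.activities = earlier.activities + successor.activities
    successor.parent = earlier.parent
    return successor


# (A1, A2, A3)-->state i, then state i-->(A12, A13, A14)-->state j
state_i = State("state i", ["A1", "A2", "A3"])
state_j = freeze_and_continue(state_i, "state j")
state_j.activities += ["A12", "A13", "A14"]

# Decide state i is not needed after all: state j carries the full history.
merged = drop_predecessor(state_j)
print(merged.activities)
```

Calling drop_successor(state_j) instead of drop_predecessor would return state i with its writable flag restored, matching the first of the two outcomes described above.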

The participation of mutable data in a dependency graph depends on maintaining timestamps that indicate when the data was last written and when it was read. A subsequent check of whether a downstream piece of data is up to date requires comparing the original time of use (that is, the time at which the data was read) with the most recent modification time. If the data was modified after the dependency was registered, the dependent data is out of date, or stale.
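As a minimal sketch of this check, assume each registered dependency records the identifier of the source data and the time at which it was read, and that the last-written time of each data item is available in a lookup table. The names Dependency, read_at, and is_stale are illustrative assumptions, not Activity model names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Dependency:
    """Records that some downstream data read `source_id` at `read_at`."""
    source_id: str
    read_at: datetime


def is_stale(dep: Dependency, last_modified: Dict[str, datetime]) -> bool:
    """Downstream data is stale if the source was written after it was read."""
    return last_modified[dep.source_id] > dep.read_at


# Hypothetical example: the log was read on 5 January and modified on 10 January.
last_modified = {"log-1": datetime(2024, 1, 10)}
dep = Dependency(source_id="log-1", read_at=datetime(2024, 1, 5))
print(is_stale(dep, last_modified))  # True
```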

Versioned data, on the other hand, exists as a set of more or less static snapshots. When a user modifies a piece of versioned data, a new copy or version is created. Any version of a piece of data that has not been deleted (for example, through a project cleanup or pruning) is by default recoverable, in the sense that it can be examined or serve as an input to an application. Such an application may create another version, giving rise to a tree of versions rather than a strict version sequence. The main drawback of versioned data is that many intermediate copies persist and may require pruning to recover space and keep projects uncluttered.

History for versioned data generally consists of a description of the activity that created the data and the inputs to that activity. When pruning occurs, it will probably result in a list of activities being associated with a saved later version. Let Vi mean the i-th version of a data object V.

A1-->V1 (read-only)

V1-->A2-->V2 (read-only)

V2-->A3-->V3 (read-only)

If you delete V2, you probably want to augment the activity history of V3 so it reflects the application of both A2 and A3 in that order. Hence, we could say the composite activity of A2 and A3 took V1 as input and produced V3:

V1-->(A2, A3)-->V3 (read-only)
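The pruning step can be sketched as follows. This assumes each version records only the activities that produced it from its parent version; the Version class and the prune function are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    """A read-only version and the activities that produced it from its parent."""
    activities: List[str]
    parent: Optional[str] = None  # name of the parent version, None for the first


def prune(versions: Dict[str, Version], name: str) -> None:
    """Delete a version and fold its activities into each of its direct children."""
    removed = versions.pop(name)
    for version in versions.values():
        if version.parent == name:
            version.activities = removed.activities + version.activities
            version.parent = removed.parent


versions = {
    "V1": Version(["A1"]),
    "V2": Version(["A2"], parent="V1"),
    "V3": Version(["A3"], parent="V2"),
}
prune(versions, "V2")
print(versions["V3"])  # V3 now derives from V1 through (A2, A3)
```

Because the loop visits every child of the deleted version, the same function also works when the versions form a tree rather than a strict sequence.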

In practice, pruning of versioned data and making recoverable copies of mutable data can make it hard to determine whether a given data item is inherently mutable or versioned. Either way, a data item may have a list of activities that collectively created it, and there is a (possibly degenerate) tree of recoverable versions, a way to prune it, and a way to extend it.

Mutability or versionability is treated here as a property of data. Clearly, it is related to application actions and, in particular, to the provision of undo methods, which could be implemented directly by the application (for individual user actions) or could be a behavior of the DBMS (for backing out whole transactions).

Identity-preserving operations. Applications take data items as input and create data items. In some cases, there is a strong identity shared by a given input and a given output. For example, a program that applies a cable-stretch correction to a well log takes a log as input and either modifies it in place or creates a modified copy. The output can be considered an update to, or a modified version of, the input, and a user views this as applying an operation to the data without altering its identity. This contrasts with the case of a program that simply computes the min, max, and mean of the numeric values in a well log. The min is not an update to the log, and it is not even a log. Preservation of identity is a property of an application and relates an input to one or more outputs. In the Activity model, this fact is registered in Act_Tmpl_Data_Dependency.Inherits_Identity_Flag.

This is mentioned here because there are benefits to deciding whether data should be modified in place or replaced with a modified version as a matter of policy for the overall deployed product, rather than leaving that decision to individual applications. Ideally, the application can flag that an output could potentially be a revised version of an input, but leave the actual decision to the product as a whole. A similar argument applies to recoverability. Many applications implement an undo operation that lets an operator get back to an earlier state. However, support for and implementation of data recoverability is a decision for the overall product, and cannot be subject to decisions taken by individual applications.

Input-output data. The strongest type of identity preservation is the idea of input-output data, that is, data that is both read and updated by an application program. The Activity model supports the notion of input-output data. This means that the application can assume that the same instance of data is involved in both the read and the write. Although this may have its uses, the recommended practice is to treat all data as either strictly input or strictly output, and to be aware that an input and an output may or may not actually be the same instance of data. Use of distinct input and output data roles allows the application framework to decide whether the same data instance will be bound in both roles, or whether the output will be a newly created version as described above. When an application declares data to be both input and output, it forces all associated data to be mutable, and this may run contrary to the versioning strategy of the platform.
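The recommended practice can be illustrated with a small sketch in which the application only declares that an output inherits the identity of an input, and a platform-level policy decides how the output is bound. Everything here is an assumption made for illustration: Policy, OutputRole, bind_output, and the instance identifiers are hypothetical, and only the idea of an inherits-identity flag comes from the Activity model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Policy(Enum):
    MUTABLE = "mutable"      # update the input instance in place
    VERSIONED = "versioned"  # write the output as a new version


@dataclass(frozen=True)
class OutputRole:
    """An output declared by an application, optionally tied to an input role."""
    name: str
    inherits_identity_from: Optional[str] = None  # input role name, if any


def bind_output(role: OutputRole, inputs: Dict[str, str],
                policy: Policy, new_id: str) -> str:
    """Return the id of the data instance to bind to the output role."""
    source = role.inherits_identity_from
    if source is not None and policy is Policy.MUTABLE:
        return inputs[source]  # the same instance plays both roles
    return new_id              # a fresh instance: a new version or an unrelated output


inputs = {"raw_log": "log-17"}
corrected = OutputRole("corrected_log", inherits_identity_from="raw_log")
stats = OutputRole("log_stats")  # min/max/mean: not a version of the log

print(bind_output(corrected, inputs, Policy.MUTABLE, "log-18"))    # log-17
print(bind_output(corrected, inputs, Policy.VERSIONED, "log-18"))  # log-18
print(bind_output(stats, inputs, Policy.MUTABLE, "stats-1"))       # stats-1
```

The application code is the same under either policy, so the choice between in-place modification and versioning stays with the platform.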