

 


 
 

Conventions:

CTX - Creativyst® Table Exchange Format

A low-overhead alternative for exchanging tabular data v1.0e








Reasons to use CTX:
  1. More functional than CSV.
  2. Less overhead than XML.
  3. Simplicity.

Overview

This article specifies CTX, a simple, shared, low-overhead exchange format. It can be used for simple tasks, such as exchanging rows of a single table without header information, up to more complex tasks, like exchanging multiple tables, along with their field names, types and comments. It will also facilitate the exchange of complex hierarchical data structures.

CTX is a more precisely defined and functional alternative to CSV, and a lower overhead alternative to many applications of XML. The CTX exchange format embodies the simplicity of CSV, while permitting, via optional secondary mechanisms, the exchange of data with complex structural hierarchy.

CTX achieves low overhead by permitting data-types to be tagged with type, display, structure, help, and other information. By tagging types instead of data, CTX vastly improves bandwidth-usage over the strategy of tagging every value-instance with two identical nametags. This also permits rich meta-information to be shared about the data being exchanged. The added metadata is optional for writers, and, if present, may be used or ignored at the option of reading applications.



Philosophy and Goals

  1. Simple - Low complexity entry level, similar to the simplicity of writing and reading a CSV file. The exchange format WILL be easy to understand and implement. It will be specifically designed for the sharing and exchange of flat, relational, or hierarchical tabular data stores.

  2. Small - The exchange format WILL take bandwidth costs and restrictions seriously and WILL strive to have relatively low overhead.

  3. Field-opaque - The core exchange format itself WILL NOT be required to be concerned with what is represented in or by the data fields it carries.

  4. Binary or Text - The transport mechanism WILL BE expected to be able to transport all 256 octet codes by default.

  5. Durable - The exchange format writer MAY optionally provide special-case (non-default) encoding extensions to respond to any transport mechanism that is expected to be limited to text-only. CTX readers MUST read and translate such extensions if present in the CTX data.

  6. Robust -
    1. Able optionally to exchange meta field information such as labels, names, types, and comments with reading applications.

    2. Able optionally to exchange meta table information such as labels, names, types, and comments with reading applications.

    3. Reading applications may choose to use or ignore as much provided meta-data as they wish

    4. Able optionally to exchange multiple tables, even multiple databases in a single transmission, file, or chunk.

    5. Able optionally to represent hierarchy within data by providing facilities to embed a complete CTX encoded file, record, table, or database into single fields.

  7. Orthogonal - The format definition WILL be expressed as a read-function state table in order to be relatively precisely defined, and have predictable attributes and characteristics.



The CTX Convention

Convention Name: Creativyst Table Exchange Format (CTX)
File Extension (if applicable): '.ctx' (case insensitive)
Recommended mime type: application/ctx (not registered as of writing)


. . .
PRIMARY (CORE):
  • One record WILL be one line in CTX.
    Records are separated with CR(0x0d) or LF(0x0a).

    If lines have been broken with secondary methods, they must be reconstructed effectively PRIOR TO reading and translation by a CTX reader.

    If line lengths must be restricted, they must be restricted using prescribed secondary methods, effectively AFTER they have been generated by a CTX writer.

  • CTX readers MUST ignore blank lines in CTX transmissions (lines having nothing but CR or LF).

  • Fields WILL BE separated with an ASCII pipe character '|' (0x7c)

  • Escape sequences WILL BE introduced with a back-slash character '\' (0x5c)
    Called the "escape sequence introducer."

  • If source field data contains the special characters CR (0x0d), LF (0x0a), '\' (0x5c), or '|' (0x7c), they MUST BE escaped with '\r', '\n', '\i', or '\p' respectively.
    These special characters represent, and are sometimes called, CTX overhead.

    Optional secondary processes also include '\s', which replaces ';' but ONLY when inside an optional multi-byte sequence (\m...;).

  • Numeric types SHOULD be represented with ASCII digits and punctuation.
    This is STRONGLY RECOMMENDED. If numbers are represented with ASCII digits, they SHOULD be formatted in accordance with the standard C library fprintf() function formatters 'f','i', 'x', 'e', or 'd'.

    Writers that provide '\P' records (a secondary process) MUST format fields that are denoted in \P record as Numbers ('N') as ASCII digits using one of the fprintf() 'f','i','x', 'e', or 'd' format strings as described here.

  • Trailing blank fields MAY be truncated
    A reader expecting more fields MUST fill the extra fields as if the CTX record had contained blank fields (||) in those field positions.
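The core rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names (`escape_field`, `unescape_field`, `write_record`, `read_record`) and the `expected` parameter are hypothetical, fields are treated as text strings, and none of the secondary features ('\m...;', '\l', function records) are handled.

```python
# Core CTX translation only: \r, \n, \i and \p.
ESCAPES = {"\\": "\\i", "\r": "\\r", "\n": "\\n", "|": "\\p"}
UNESCAPES = {"i": "\\", "r": "\r", "n": "\n", "p": "|"}


def escape_field(raw):
    """Replace the four CTX overhead characters with their escape sequences."""
    # One pass over the source translates each character exactly once, so the
    # backslashes introduced by the escapes themselves are never re-escaped.
    return "".join(ESCAPES.get(ch, ch) for ch in raw)


def unescape_field(field):
    """Translate core escape sequences back into raw characters."""
    out = []
    chars = iter(field)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)
        if code not in UNESCAPES:
            raise ValueError(f"escape not handled by this sketch: \\{code}")
        out.append(UNESCAPES[code])
    return "".join(out)


def write_record(fields):
    """Build one record line; trailing blank fields are truncated."""
    fields = list(fields)
    while fields and fields[-1] == "":
        fields.pop()
    return "|".join(escape_field(f) for f in fields)


def read_record(line, expected=0):
    """Split a record line and pad it with blank fields up to `expected`."""
    # Pipes inside field data are always escaped as \p, so every literal '|'
    # in the line is a field separator.
    fields = [unescape_field(f) for f in line.split("|")]
    fields += [""] * (expected - len(fields))
    return fields
```

For example, `write_record(["1", "Smythe", "Jane", "", ""])` produces `1|Smythe|Jane`, and `read_record("1|Smythe|Jane", expected=5)` restores the two blank trailing fields.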



. . .
SECONDARY:
(non-core, but readers must handle CTX files with these features)
  • Characters MAY be converted to hex digits with the \m[#]x...; sequence.
    Generally, this would be used to convert files for transmission over channels that are limited to 7-bit ASCII (email for example). This translation may be performed back and forth by mid-stream processes. This may also be used for some limited run-length compression. The '#' here represents any integer greater than zero.

  • Characters MAY be converted to modified base64 with the \m[#]b...; sequence.
    Generally, this would be used to convert files for transmission over channels that are limited to 7-bit ASCII (email for example). This translation may be performed back and forth by mid-stream processes. This may also be used for some limited run-length compression.

    The term "modified base64" means base64 encoding without line-length restrictions and without line-breaks added. That is, a modified base64 representation of any length will always occupy a single line.

  • CTX Record lines MAY be wrapped with the \l sequence to limit their length
    This processing may be performed back and forth by mid-stream processes.
    • If line-wrap is used, CTX WRITERS MUST 'effectively' perform line wrap operations (wrap lines) AFTER core CTX and '\m...' conversions have been completed.
    • If line-wrap is present, CTX READERS MUST 'effectively' perform line wrap operations (remove line-wrapping) BEFORE core CTX and '\m...' conversions begin.

  • Field name and type information MAY be imparted through special function records within the CTX file.
    These include records beginning with the sequences: '\L', '\N', '\R', '\H' (Labels, Names, Remarks, Hovers); '\P', '\M', '\E', '\Q', '\Y', '\K' (Primary types, mime types, encoding, SQL types, Application specific types, key types); '\X', '\D' (maX size, Display). If used, special function record types WILL apply to the records that follow them up to the next identical function record or the end of the file, database, or table.

  • Numeric fields MAY be denoted within records using primary CTX type function records ('\P').
    This is a good way for applications to avoid the leading zeros problems that often accompany key fields that just happen to be numeric (U.S. zip codes for example). However, reading applications are always free to ignore these.

  • Hierarchical data MAY be represented and denoted using field-type: 'C', within primary CTX type function records ('\P').
    An individual field within a CTX encoded record may itself contain a complete CTX record, an entire table of CTX records, or an entire database of tables containing records. It can even contain a group of databases. In short, a field within a CTX record may contain an entire independent CTX file. That CTX file within a field can contain any data you can represent in a CTX file, including its own fields containing their own embedded CTX data.

  • Multiple tables and multiple databases MAY be delimited within a single CTX transmission/file with '\T' and '\G' (group) records.
    Records starting with these sequences are special records denoting that all records following pertain to the named table ('\T'), or the named database or group of tables ('\G'). Record groups are delimited in CTX with these two record-types, the start of the file or transmission, and the end of the file or transmission. If used, special function record types WILL apply to the records that follow them up to the next identical function record or the end of the file.

  • The first occurrence of a field directive record (size, name, comment, type, etc.) WILL apply to all the preceding data records in the group up to the start of the group (delimited with \T, \G, start-of-data, or end-of-data).
    This MUST only apply to the FIRST occurrence of each record type.



Escape Sequences

Escape sequences are used to replace specific bytes, or to denote specific functions within the data exchange stream.

Escape sequences are classified by their function and width:

  • Single byte replacement/function sequences:
    An introducer (escape character) followed by a single byte (character) denoting its function or the byte code it is standing in for.

  • Optional first-character record-type/function sequences:
    A single-byte escape sequence that is only legal if it appears starting at the first byte position of a record. It denotes that the record that follows is a special function and is not part of plain exchange data.

  • Optional multi-byte replacement sequences:
    A multi-byte introducer ('\m'), followed by an optional number, followed by a single byte denoting function, followed by a parameter or set of parameters and terminated by a semi-colon (';');

  • Multi-byte terminator replacement sequence:
    A special single-byte replacement sequence that is only legal when it appears inside a multi-byte replacement sequence. Specifically, it is the '\s' sequence, which is replaced with a semi-colon in raw data.

All escape sequences are introduced by a backslash ('\'), which is called an escape sequence introducer. If a file contains a backslash followed by a character that is not defined here, it is an error for this version of CTX.

Case matters. For example, a '\P' (primary type record follows) is not the same as a '\p' (pipe character escape). Also, a '\T' is a legal code at the first character position of a record, while a '\t' is always illegal.

. . . . .
Single byte replacement/function characters.

CR(0x0d), LF (0x0a), '\' (0x5c), and '|' (0x7c) within field data MUST be replaced with \r, \n, \i, and \p sequences respectively. Also, when ';' (0x3b) exists within a multi-byte replacement sequence, it MUST be replaced with \s.

Escape sequences other than those defined here may not appear in CTX encoded data.


. . . . .
Core:

CTX Raw Comments/notes
\r CR 0x0d
\n LF 0x0a
   
\i '\' Backslash character (0x5c), used to 'i'ntroduce the escape sequence
  • Writer:
    • This character must be the FIRST one translated within each field during CORE translations.
  • Reader:
    • This character must be the LAST one translated within each field during CORE translations.
   
\p '|' Vertical Bar or "pipe" (0x7c), the field separator character
   


. . . . .
Optional:

CTX Raw Comments/notes
\l (ell) (ln-cat) Optional line extended indicator (lowercase L)
placed at end of line to extend current line (record) into next non-empty line.
  1. MUST NOT come between an introducer and its first character. That is, the string '\\l' is illegal and should never appear in a ctx transmission. For that matter so is '\\'
  2. '\l' MUST NOT immediately precede a CR or LF (0x0D or 0x0A) in the source CTX formatted data.
  3. If CTX WRITERS use the \l function to break lines they MUST 'effectively' break lines using '\l' as if they have been broken AFTER all other CTX output processing of the line has occurred.
  4. If the '\l' function has been used by CTX writers, CTX Readers must reconstruct each record line using the \l function as if the line has been fully reconstructed BEFORE any other line processing occurs.
  5. Processing for '\l' may be performed by outside processes, downstream of the Writer or anywhere between the writer and the reader.
   
\m...;   Optional multi-byte sequence introducer.
   
. . .
Special single-byte replacement sequence used ONLY inside Multi-byte sequences - It is an error for this replacement sequence to occur outside of a multi-byte replacement sequence.
   
\s ; Semi-colon (0x3b) ONLY VALID INSIDE a multi-byte sequence. If this appears anywhere else in a record it is an error.
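As a rough illustration of the '\l' rules above, a reader (or any mid-stream process) could rebuild logical records before doing any other translation. The sketch below assumes that '\l' appears as the last two characters of a physical line and that the continuation is the next non-empty line; the function name `join_wrapped_lines` is hypothetical.

```python
def join_wrapped_lines(physical_lines):
    """Rebuild logical CTX records from lines wrapped with a trailing \\l."""
    records = []
    pending = None  # text of a record that is still being extended
    for line in physical_lines:
        line = line.rstrip("\r\n")
        if line == "":
            continue  # blank lines are ignored, also between wrapped pieces
        # '\\' never appears in CTX, so a line ending in '\l' is always a
        # wrap marker and never the tail of an escaped backslash ('\i').
        wrapped = line.endswith("\\l")
        piece = line[:-2] if wrapped else line
        pending = (pending or "") + piece
        if not wrapped:
            records.append(pending)
            pending = None
    if pending is not None:
        raise ValueError("transmission ended inside a wrapped record")
    return records
```

Because the joining happens on whole lines and touches nothing else, it can run as a separate pass ahead of the core and '\m...;' translations, which is the ordering the rules require.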

. . . . .
Optional first character record-type sequences

First character record-type sequences are only valid when they start in the first character position of a record (the first character of a line, or the first character following CR or LF).

They denote that the record on the line is not data, but a special type that performs a CTX function. These records generally apply to the data records that follow them up to the end of the file, or to the next record-type record of the same type.



Table and database groupings:

  \T Table info \TLabel|Name|Comment|Hover|Path|Endian|Enc|Res1|...
Provides information about the table whose records are being shared.
  OPTIONAL If used, applies to all following records up to the next \T or \G record, or to the end of the file or transmission (whichever comes first).
  \G Globals or Group info rec \GLabel|Name|Comment|Path|Endian|Enc|Res1|...

  OPTIONAL If used, it will apply to all following tables, up to the next \G record.

Field Names and descriptions:

  \L Field labels record \LLabel1|Label2|Label3|Label4...
Provides field labels for fields in the records that follow. Labels SHOULD be alpha-numeric with no embedded whitespace. They can be numbers only. They SHOULD be kept relatively short. For more descriptive field names see the \N record below.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \L record.
  \N Field names record \NName 1|Name 2|Name 3...
Provides field names, or titles for fields in the records that follow. These SHOULD be kept to a reasonable length. For more descriptive information about fields see the \R (remarks/comments) record below.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \N record.
  \R Field remarks record \RComment 1|Comment 2|Comment 3...
Provides comments (remarks/reference) about the fields in the records that follow. These MAY contain relatively long descriptive information about the field and the purpose of the field.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \R record.
  \H Field Hovers record \HFly-over 1|Fly-over 2|...
Provides special case short comments or help hints about the fields in the records that follow. They SHOULD be short enough to look OK in a hover balloon.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \H record.
 
 

Field types:

  \P Field Primary CTX type record \PB|N|N||B|C|...
This is a special case record. See section about Optional Primary Types below.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \P record.
  \M Field Mime type record \Mtext/html|||text/javascript|...
All CTX overhead is plain ASCII. Fields may be defined on a field-by-field basis.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \M record.
  \E Field Encoding type record \Eutf-8|||utf-8|...
All CTX overhead is plain ASCII. Fields may be defined on a field-by-field basis.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \E record.
  \C C-language field-types record \Cint|char *||float|...
Intrinsic C language type specifiers. All CTX overhead is plain ASCII. Fields may be defined on a field-by-field basis.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \C record.
  \Q Field SQL-types record \QVARCHAR(20)|DECIMAL(10,2)|CHAR(4)|...
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \Q record.
  \Y Field app-type record \Ychar *|char *|double...
This is a specifically ambiguous feature. Field types related in a \Y record are application dependent. C elemental types are halfheartedly recommended but not necessary, and should not be counted on. An \S record is soon to be introduced to handle standard C scalar types.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \Y record.
  \K Field KEY-type record \KP|F|F:Persons.number|F||||....
Provides key types for fields in the records that follow. SHOULD be 'P' for primary field, 'F' for foreign key, 'I' for index or candidate, or '' for not a key.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \K record.

Field storage size and display suggestions:

  \X Field maX-size record \X20|15|4...
Holds field sizes.
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \X record.
  \D Field display record \D15|10|4R|...
Suggested display size of each field. May also include an optional trailing 'L', 'R', or 'C' to suggest left, right or center alignment. MAY also contain a C, standard fprintf() library function styled format string (RECOMMENDED).
  OPTIONAL If used, will apply to all following records up to the next \T, \G, or \D record.
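To make the first-character rule concrete, here is a small sketch that separates function records from data records. The table of type letters is taken from the record types listed above; the function name `classify_record` and the descriptive strings are assumptions for illustration, and the fields are returned still escaped.

```python
# Record-type letters described in this section (case matters).
RECORD_TYPES = {
    "T": "table info", "G": "group info",
    "L": "labels", "N": "names", "R": "remarks", "H": "hovers",
    "P": "primary types", "M": "mime types", "E": "encodings",
    "C": "C types", "Q": "SQL types", "Y": "application types",
    "K": "key types", "X": "max sizes", "D": "display",
}


def classify_record(line):
    """Return (type_letter, fields); type_letter is None for a data record."""
    # A record-type sequence is only recognised at the first character
    # position of the record. A data record whose first field starts with an
    # escaped character (for example '\p') is not mistaken for one, because
    # lower-case codes are not in the table.
    if len(line) >= 2 and line[0] == "\\" and line[1] in RECORD_TYPES:
        return line[1], line[2:].split("|")
    return None, line.split("|")
```

With this, `classify_record("\\LNumber|LastName|FirstName")` yields the letter `"L"` with three label fields, while `classify_record("1|Smythe|Jane")` yields `None` with three data fields.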

 
 

. . . . .
Optional multi-byte sequences:

Multi-byte sequences MAY be used by a CTX WRITER to translate difficult byte codes to more transportable representations. This MAY be used to transport a table or set of tables over an ASCII-only email link, for example.

Multi-byte sequences are completely optional for WRITERS, but READERS are REQUIRED to handle them and translate them.

They may be used by a CTX WRITER to accommodate channels that are known to be text-only (non-binary). A binary channel is assumed by default.

Hex Sequence (\m[#]x...;):

  \m[##]x...; Sequence of bytes represented in hexadecimal digits with an optional repetitions number.
  \mx00;-\mxff; Any byte value in hex, from x00 to xff. Upper-case alpha characters (A-F) are permitted.
  \m100x00; 100 x00 characters.
  \mx48692e; The sequence: "Hi." (no quotes)
  \m2x48692E; The sequence: "Hi.Hi." (no quotes)

Base64 sequence (\m[#]b...;)

  \m[#]bmL6g; Sequence of bytes represented as modified base-64. This is base-64 but with no line-length limitations.
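The hex and base64 sequences can be expanded with a short routine such as the one below. It is a sketch under several assumptions: the names `expand_multibyte` and `_expand` are hypothetical, the data is handled as Python `bytes`, "modified base64" is taken to mean the standard alphabet with '=' padding on a single line, and the '\s' replacement is not handled because ';' cannot occur in hex or base64 text.

```python
import base64
import binascii
import re

# \m, optional repeat count, type letter (x or b), payload, terminating ';'
_MB_SEQ = re.compile(rb"\\m(\d*)([xb])([^;]*);")


def _expand(match):
    """Decode one \\m[#]x...; or \\m[#]b...; sequence into raw bytes."""
    count = int(match.group(1) or b"1")  # a missing count means once
    kind, payload = match.group(2), match.group(3)
    if kind == b"x":
        data = binascii.unhexlify(payload)  # accepts a-f and A-F
    else:
        data = base64.b64decode(payload, validate=True)
    return data * count


def expand_multibyte(field):
    """Replace every multi-byte sequence in `field` with its raw bytes."""
    return _MB_SEQ.sub(_expand, field)
```

Applied to the examples in this section, `expand_multibyte(b"\\mx48692e;")` gives `b"Hi."` and `expand_multibyte(b"\\m2x48692E;")` gives `b"Hi.Hi."`. Malformed hex or base64 raises `binascii.Error`, which corresponds to the conversion error case.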



Backtracking (and Redundancy) Feature

The first time a header type, size, name, or description record occurs within a group of records, that header type's directives WILL be applied to all the records that came before it, up to the group delimiter (A “\T,” a “\G,” or the start of the file or transmission).

The term “header” is a bit of a misnomer when applied to these field type records (everything but \T and \G). They apply to the records within their group, regardless of where they first appear within the group.

Aside from allowing type records to apply to all the fields in their group without regard for their sequential location within the group, this also provides for the use of redundancy in transmissions. This can be used to ensure that types and other field directives are properly applied when long transmissions are broken into multiple chunks (due to time or medium constraints, or transmission errors, for example).

Important considerations regarding backtracking feature:

  • Backtracking MUST only apply to the first occurrence of a given header type within a group. — If another header record of the same type occurred earlier in the group, the directives from the earlier occurrence apply to all the fields after it, up to the new header type. Only the first occurrence of a given header type has backtracking applied.

  • If used for redundancy, the type headers should be retransmitted at the end of each record group — This will assure the headers are all applied to any remaining preceding records in the event of a transmission interruption near the end of the group.

  • Group delimiters are \T, \G, or the beginning or end of the file or transmission — The end of the file is also considered a group delimiter. It may or may not be germane to this behavior. A transmission mode that tracks the last fully committed record to be received may also maintain the headers, or it may expect the headers to be retransmitted by the sender.
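The backtracking rule is easier to follow with a small sketch. The code below assumes one group has already been split into `(type_letter, fields)` tuples, with `None` as the type of a data record; the function name `apply_headers` and that tuple layout are hypothetical choices for this example.

```python
def apply_headers(records, header_type="L"):
    """Pair each data record of one group with the header that governs it.

    Returns a list of (header_fields_or_None, data_fields) tuples.
    """
    result = []
    current = None       # header currently in force
    seen_header = False  # becomes True at the first header of this type
    for kind, fields in records:
        if kind == header_type:
            if not seen_header:
                # Backtracking: only the FIRST occurrence also governs the
                # data records that came before it in the group.
                result = [(fields, data) for _, data in result]
                seen_header = True
            current = fields
        elif kind is None:
            result.append((current, fields))
    return result


group = [
    (None, ["1", "Smythe"]),
    ("L", ["Number", "LastName"]),   # first \L: applies backwards and forwards
    (None, ["2", "Doe"]),
    ("L", ["Id", "Surname"]),        # later \L: applies forwards only
    (None, ["3", "Mellonhead"]),
]
labelled = apply_headers(group)
```

Here the first two data records end up governed by `Number|LastName` and the third by `Id|Surname`. Retransmitting the same header at the end of a group changes nothing when the earlier copy arrived, and rescues the whole group when it did not.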



State Machines

The state tables for the READER are presented here (see below) in order to precisely define the CTX format and convention.

Those who've cut their teeth on Object Oriented Programming concepts may find state tables a bit confusing. The key to understanding state tables from an OOP perspective is to understand first, that they take the opposite extreme of the OOP (and functional-programming for that matter) "abstraction through complexity hiding" strategy.

Instead of hiding lesser-used or normally unused constructs, state tables' strength comes from equally disclosing, in a visually graspable way, all trajectories through a problem space. That includes trajectories that would be considered abnormal, unusual, or erroneous.

For many problem types, state table representations are the best choice for bridging specification to finished software code. Use of state tables permits a thorough visualization and understanding of the problem domain that is often not possible with more traditional methods. More importantly, you can code directly off of (from) these state tables. :-)

  • READERS:
    • must implement the state tables and bold functions shown below.
    • Records that have fewer fields than expected may be handled by the READER in either of two ways:
      1. Remaining fields may be set to NULL
      2. Remaining fields may be set to blank or zero
    • Records that have more fields than expected may be handled by the READER in either of two ways:
      1. Excess fields may be discarded
      2. An ERROR may be raised.
    • A post-reader process SHOULD also remove or translate any field and byte-values that will cause difficulty to the application being serviced by the READER.

  • WRITERS:
    • All CTX tables encoded by WRITER functions must be readable by the Reader represented in these state tables without ERROR, and translated back to the data encoded by the writer.
    • The '\l' and '\m...;' functions are optional for WRITERS. Both of these functions may be performed anywhere in the route.

Also, in these state tables, I've included the '\l' function directly in the processing stream. Processing of the '\l' function MAY be done in a separate pass prior to performing the main CTX READER translations. In fact, '\l' handling may be performed upstream and in a different process altogether from the final reader. I handled them in a single pass at the reader:

  1. to show the entire READER state table;
  2. to produce a READER with slightly better performance and,
  3. to provide a more generic READER capable of translating streams as each byte (or 2-byte escape sequence) is received. This is more generic than a two-pass representation because it is capable of handling complete files (like the upstream strategy) as well.


Special Note: The state-tables presented below also show some primary aspects of another project I'm currently working on.

In that project, I define a programming language where you not only call functions from within other functions, you can also call state-machines from within other state-machines.

You can see this represented in the state-machines presented here. Functions are called like most function calls. State machine calls are denoted with a '$' to discern them from function calls.

Each state-machine has a special object called 'tin', which returns the value (or event) that caused the last transition.

NOTE: The first state machine displayed below takes a shortcut that puts all the function record specifier sequences in a single column. That is, instead of specifying a separate transition column for each of the escape sequences: '\T', '\G', etc., it simply labels one column with "\[TLNCHXZQYPK]". This was done primarily to save horizontal space, but it also makes the state table clearer, by showing that all these transitions are handled identically in every state.

 



graphic of state-table for CTX record reader - real table soon, contact me


state-table for CTX multi-byte sequence reader - real table soon, contact me


   
StateMachine: Inp() // provides input to $ReadRecord(); $ReadMBSeq();

 vState / InChar() ->   InChar-other     [lipmsrnTLNCHXZOYPK]   '\'
 ---------------------------------------------------------------------------
 010 First InChar       rval = tin       rval = tin             rval = '\'
                        return(rval)     return(rval)           -> 020

 020 Ongoing InChar     ERROR            rval += tin            ERROR
                                         return(rval)

 
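One possible reading of the Inp() table above, written as a Python generator, is sketched here. It is an interpretation rather than the author's code: the name `inp`, the numeric state values, and the use of `ValueError` for the ERROR cells are assumptions, and the set of escape codes is copied verbatim from the column heading of the table.

```python
# Characters that may follow the escape introducer, as listed in the
# column heading of the Inp() table.
ESCAPE_CODES = set("lipmsrnTLNCHXZOYPK")


def inp(chars):
    """Yield one input event ('tin') at a time.

    An event is either a single plain character or a complete
    two-character escape sequence such as '\\p'.
    """
    state = 10  # 010: First InChar
    rval = ""
    for ch in chars:
        if state == 10:
            if ch == "\\":
                rval = ch      # remember the introducer and wait for its code
                state = 20     # 020: Ongoing InChar
            else:
                yield ch       # plain character, returned as is
        else:
            if ch not in ESCAPE_CODES:
                raise ValueError(f"illegal escape sequence: \\{ch}")
            yield rval + ch    # complete two-character escape sequence
            state = 10
    if state == 20:
        raise ValueError("input ended after an escape introducer")
```

For instance, `list(inp("a\\pb"))` produces the three events `a`, `\p` and `b`. Delivering escape sequences as single events lets the record and multi-byte state machines treat them as single transitions.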



Function mCvrt(Type, N, tmp)
{
    1. if(Type == 'x') {
            ConvertHexToBytes(tmp)
        }
        if(Type == 'b') {
            ConvertBase64ToBytes(tmp)
        }
    2. if(Error during conversion) {
            return(Convert Error)
        }
    3. copy conversion output into OutputBuffer N times
    4. return OutputBuffer
}

Function getFuncRec(F)
{
    1. if(RecursionCheckAlreadyHere()) {
            ERROR
        }
    2. FRec = $ReadRecord();  // call state machine
    3. if(F == '\T') {
            PerformAnyTableRelatedProcessing();
        }
    4. if(F == '\G') {
            PerformAnyGroupRelatedProcessing();
        }

    5. if(IgnoreFuncRec(F)) {
            return
        }
    6. Store FRec
    7. return
}


/* . . . . .
 * These just give an example of some
 * of the kinds of processing you might
 * want to do for table/group boundaries.
 *
 * Of course, you could just ignore
 * the record and return.
 *
*/
Function PerformAnyTableRelatedProcessing()
{
    1. if(!DoTableProcessing) {
            return;
       }
    2. Store Previous Table.
    3. Allocate new table
    4. OtherProcessing()
    5. return; 
}

/* Likewise for Group (Database) records
*/
Function PerformAnyGroupRelatedProcessing()
{
    1. if(!DoGroupProcessing) {
            return;
       }
    2. Store Previous DataBase.
    3. Allocate new DataBase
    4. OtherProcessing()
    5. return; 
}



Optional Primary Types (\P)

The optional primary field-type record (\P) is used to relate basic field types, or hierarchy information that MAY be handled directly by the CTX reader.

. . . . .
Basic Types ('N' and 'B')

Aside from the CTX/hierarchy type ('C'), CTX has only two explicit, elemental, primary field types. They are:

  • Numbers ('N')
    The Number format is also called 'fixed', because its allowable representations are defined as the output formats specified by the standard C library's fprintf() function when using the format specifiers 'f', 'i', 'x', 'e', or 'd'.

  • Non-numbers ('B')
    The Non-number format is also called 'Binary' format, because it can (technically) hold any set of byte values. It MAY be interpreted as "anything but ASCII represented numbers".

. . . . .
The Hierarchy Field Type ('C')

CTX includes a special primary type ('C') that lets you embed CTX encoded records, tables, and even entire databases directly within individual fields of the current-level CTX file.

In my opinion (and it is just my opinion) there is a considerable amount of data in the world for which the type-level hierarchical partitioning supported by this mechanism may be better suited than the value-level hierarchical partitioning specified in XML.

  • 'C' The field contains a CTX file
    The relative field holds an entire embedded CTX file. This simply means there is embedded CTX encoded information in the field. The CTX data could be as simple as a single CTX record, or as complex as an entire SET of databases, each with multiple tables, its own type, help, and label identifiers, and anything else that COULD be included within a CTX file.
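Embedding works because the inner CTX text is simply escaped like any other field data. The sketch below illustrates this with only the core escapes; the helper names `escape_field` and `unescape_field` are hypothetical, and the sample values are borrowed loosely from the Persons and Pets tables used in the examples later in this article.

```python
_ESC = {"\\": "\\i", "\r": "\\r", "\n": "\\n", "|": "\\p"}
_UNESC = {"i": "\\", "r": "\r", "n": "\n", "p": "|"}


def escape_field(raw):
    """Escape the CTX overhead characters in one field."""
    return "".join(_ESC.get(ch, ch) for ch in raw)


def unescape_field(field):
    """Undo escape_field(); raises KeyError on an unknown escape."""
    out, chars = [], iter(field)
    for ch in chars:
        out.append(_UNESC[next(chars, "")] if ch == "\\" else ch)
    return "".join(out)


# A complete two-record CTX table that will live inside a single field.
pets = "1|Fluffy|Dog\n2|Sharp|Dog"

# Outer record: number, last name, and the embedded table as a 'C' field.
record = "|".join(escape_field(f) for f in ["1", "Smythe", pets])

# The reader splits the outer record, then unescapes the third field to
# recover the embedded CTX text, which it can parse like any CTX file.
inner = unescape_field(record.split("|")[2])
```

The outer record is a single line, `1|Smythe|1\pFluffy\pDog\n2\pSharp\pDog`, and `inner` is identical to `pets`. Each further level of nesting escapes the previous level's backslashes again, so a pipe two levels down is carried as `\ip`.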

. . . . .
The Specifically Ambiguous Type ('')

If a field in a \P record is blank, its type is specified as ambiguous, a sort of deliberate 'don't care' condition. The same is true for trailing fields in records that contain more fields than their respective \P record. Fields in tables for which no \P record is transmitted may also be considered specifically ambiguous.

. . . . .
More About The \P record

  1. Each field in the \P record corresponds to the respective field in each data record.

  2. A \P record field may contain 'N' for 'Number', 'B' for 'Binary', or '' for Specifically Ambiguous (which essentially means 'don't care' or 'undefined'), as well as 'C' for CTX encoded data (hierarchy).

  3. If a \P record field contains 'N', the respective field MUST be represented as a Number field in the transmission or file.

  4. For situations where a numeric field must be interpreted by end applications as a string (such as U.S. zip codes, where leading zeros may have meaning) the field should be defined as 'B' (Binary, non-number)

  5. Leaving a particular \P record field blank or not providing a \P record at all MAY be considered ambiguous. In such cases a-priori agreements on how data is interpreted are outside of the scope of this convention.

. . . . .
Implementation Notes

The following box contains some notes about optional primary types that may be important to implementers of CTX writers and readers, as well as of applications that employ CTX.



Writer:
    N   Numbers or 'fixed' fields
        Numerically typed fields, if used, MUST be represented in ASCII
        digits and punctuation.  If numeric application types are
        represented in ASCII they MUST be formatted according to the
        format specifications for the standard C library fprintf() function
        for format codes 'f', 'i', 'x', 'e', or 'd' ("fixed").
        No other ASCII formats are permitted.  Leading zeros, and
        leading and trailing spaces may be included only as recommended
        for the standard C printf() function format specifications for
        fixed formatted numbers.

        If there is a \P field definition of 'N', a numeric field must be
        written as a number type as described above.

        Native types SHOULD be exchanged as 'fixed' types, even if primary
        types are not used.


    B   Non-number or 'B'inary fields
        Non-number or Binary fields should be translated byte-for-byte
        into their escaped values for the CTX field.  Each byte in the
        source field must be represented in the CTX field in a way that
        will be translated exactly back into the source byte by the reader.

        The writer MAY choose to escape the bytes of the field so that it
        can be safely transported over restricted channels (such as ASCII
        only email).  The writer may use \mx..;, \mb...;, and \l functions
        to perform this.  The \l function must be applied (effectively) 
        after all other translations have been performed.



Reader:
    See the state-tables.


Application:

    1. If a \P record exists a reading application MAY choose to
       use it or ignore it.

    2. If a \P record exists and the respective \P record field
       value is 'N' a reader application MAY choose (as per 1.) to
       interpret the data-record field as a number in ASCII 
       representation, and convert it into a native numeric
       representation for the application or platform the
       reader is serving.

    3. If a \P record exists and the respective \P record
       field value is 'B' a reader MAY choose (see 1.) to interpret
       the data-record field as a string, even if it contains properly
       formatted ASCII representations of numbers.  This can be useful
       (for example) for representing numbers where leading zeros
       have meaning and should be preserved (such as U.S. zip codes).
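
A minimal sketch of application rules 1 to 3, in Python. The function name `apply_primary_types` and the choice of `int` and `float` as native numeric types are illustrative assumptions, not part of CTX. Hex ('x') values are left as strings here because, without a prefix, they cannot be told apart from decimal digits.

```python
def apply_primary_types(fields, ptypes):
    # fields: the field strings of one data record
    # ptypes: the field values of the \P record ('N', 'B', ...), possibly shorter
    converted = []
    for i, raw in enumerate(fields):
        ptype = ptypes[i] if i < len(ptypes) else ""
        if ptype != "N":
            # 'B' (or no \P entry): keep the string, so leading zeros survive
            converted.append(raw)
            continue
        text = raw.strip()
        try:
            converted.append(int(text))
        except ValueError:
            try:
                converted.append(float(text))
            except ValueError:
                converted.append(raw)  # not a number this sketch understands
    return converted


print(apply_primary_types(["7", "02134", "1.5e3"], ["N", "B", "N"]))
# prints [7, '02134', 1500.0]
```

The zip-code style field keeps its leading zero because its \P value is 'B', while the two 'N' fields become native numbers.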





Examples

I'll use the following database to show some basic features and answer some basic questions about the workings of CTX.

. . . . .
A Faux Database

---------           ---------
Persons             Pets
---------           ---------             
Number(P)           Number(P)
LastName            Owner(F:Persons.Number)
FirstName           Name                  
                    Species               
                    Breed                 
. . .
Persons data:
----------------------------------------
Number  LastName            FirstName
----------------------------------------
1       Smythe              Jane
2       Doe                 John
3       Mellonhead          Creg

. . .
Pets data:
---------------------------------------------------------------
Number  Owner   Name        Species         Breed
---------------------------------------------------------------
1       1       Fluffy      Dog             Poodle
2       1       Sharp       Dog             German Shepard
3       2       Silo        Cat             Mix
4       3       Doggie      Dog             Mix

. . . . .
Essence

At its simplest, CTX will transmit just the data records of a given table much like CSV. For example, to share the data from the "Persons" table we could simply send

1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

This is a complete and correct CTX transmission, as are all the following CTX examples.
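
As a rough illustration, a data-only transmission like the one above could be read with a few lines of Python. This is a sketch under assumptions: the function name `read_plain_records` is hypothetical, and escape sequences are not decoded (a full reader follows the state-tables).

```python
import re


def read_plain_records(text):
    # Split on runs of CR/LF, skip blank lines, then split fields on '|'.
    records = []
    for line in re.split(r"[\r\n]+", text):
        if line == "":
            continue  # blank lines are ignored
        records.append(line.split("|"))
    return records


persons = "1|Smythe|Jane\n2|Doe|John\n3|Mellonhead|Creg\n"
for record in read_plain_records(persons):
    print(record)
```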

. . . . .
Field Labels and Names

In CSV, the header (column names) is often sent as the first record in the file. Problems arise, however, because reading applications have no way of knowing whether the first record is a header row or the first data record.

CTX solves this problem with special record types that start with '\L' (Labels), and '\N' (Names). Labels should consist of only characters from the set: [A-Za-z0-9_].

\LNumber|LastName|FirstName
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

There's no reason why you can't WRITE the labels as field NAMES as well; you can even make an '\L' and an '\N' record that are exactly the same in every other respect. But you may want to use the less restricted '\N' (Names) record to convey slightly more readable (but still short) names for the fields.

\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

CTX provides other meta-record types to describe field Comments ('\R'), display sizes and alignments ('\D'), even fly-over help ('\H') among others.
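
A minimal Python sketch of how a reader might tell '\L' and '\N' records apart from data records. The function name `read_labeled` is hypothetical; other function records are simply skipped, and escape sequences are not decoded.

```python
import re


def read_labeled(text):
    labels, names, rows = [], [], []
    for line in re.split(r"[\r\n]+", text):
        if not line:
            continue  # blank lines are ignored
        if line.startswith("\\L"):
            labels = line[2:].split("|")
        elif line.startswith("\\N"):
            names = line[2:].split("|")
        elif re.match(r"\\[A-Z]", line):
            continue  # some other function record; ignored in this sketch
        else:
            rows.append(line.split("|"))
    # Pair each data field with its label once all lines have been seen.
    records = [dict(zip(labels, row)) for row in rows]
    return labels, names, records


sample = (
    "\\LNumber|LastName|FirstName\n"
    "\\NPerson Number|Last Name|First Name\n"
    "1|Smythe|Jane\n"
    "2|Doe|John\n"
)
labels, names, records = read_labeled(sample)
print(names)
print(records[0])
```

Because the label record is recognized by its leading '\L', the reader never has to guess whether the first line is a header.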

. . . . .
Field Types

CTX also provides meta-record types to define field types. One example might be to use the SQL-Type record ('\Q') to impart field types for our Persons table.

\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

Other meta-records that impart field TYPE information include application specific type names ('\Y'), primary CTX types ('\P'), and key indicators ('\K').
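
One possible use of the '\L' and '\Q' records together, sketched in Python: building a CREATE TABLE statement for the receiving database. The function name `create_table_sql` is hypothetical, and the field-count check is this sketch's own choice, not a rule taken from CTX.

```python
def create_table_sql(table, labels_record, sqltypes_record):
    # Strip the two-character record type ('\L' or '\Q') and split the fields.
    labels = labels_record[2:].split("|")
    sqltypes = sqltypes_record[2:].split("|")
    if len(labels) != len(sqltypes):
        raise ValueError("\\L and \\Q records have different field counts")
    columns = ", ".join(f"{name} {sqltype}" for name, sqltype in zip(labels, sqltypes))
    return f"CREATE TABLE {table} ({columns});"


print(create_table_sql(
    "Persons",
    r"\LNumber|LastName|FirstName",
    r"\QNUMBER(7)|VARCHAR(65)|CHAR(35)",
))
# prints CREATE TABLE Persons (Number NUMBER(7), LastName VARCHAR(65), FirstName CHAR(35));
```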

. . . . .
Tables

Want to convey some information about the table you're sending? Use the '\T' (Table) record type.

\TPersons|People Table|Pet owners in our example db|Pet owners|||
\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

. . . . .
Multiple Tables

You can also use '\T' records alone to delimit a group of multiple tables.

\TPersons|People Table|Pet owners in our example db|Pet owners|||
\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg

\TPets|Pets owned by people|Pet list in our example db|Pet list|||
\LNumber|Owner|Name|Species|Breed
\NPet Number|Owner Number|Pet Name|Species|Breed
\QNUMBER(7)|NUMBER(7)|CHAR(35)|CHAR(35)|CHAR(35)
1|1|Fluffy|Dog|Poodle
2|1|Sharp|Dog|German Shepard
3|2|Silo|Cat|Mix
4|3|Doggie|Dog|Mix

Here, our entire database has been sent in a single CTX file.

Note also that this continues to be a legal CTX file. The extra blank line we've put between tables for clarity here is perfectly acceptable in a CTX file because blank lines are ignored. This same behavior means CTX implementations need not be concerned with platform-dependent differences between line endings (CR, LF, CRLF, LFCR).
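
The line-ending point can be sketched in Python. Assuming a hypothetical helper named `ctx_lines`, treating any run of CR and LF characters as a single record boundary makes all four conventions, and blank lines, behave alike.

```python
import re


def ctx_lines(text):
    # Any run of CR/LF characters is one boundary; empty pieces are dropped.
    return [line for line in re.split(r"[\r\n]+", text) if line]


records = ["\\TPersons|People Table", "1|Smythe|Jane", "", "\\TPets|Pet list", "1|1|Fluffy|Dog|Poodle"]
for ending in ("\r", "\n", "\r\n", "\n\r"):
    sample = ending.join(records) + ending
    print(repr(ending), len(ctx_lines(sample)))
```

Each of the four endings yields the same four non-blank lines, even though the sample contains an empty line between the two tables.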

. . . . .
Databases

You can also add a '\G' (Group) record to impart information about a group of multiple tables (a database).

\GFauxDB|A Faux Database|An entire (if contrived) example db||||
\TPersons|People Table|Pet owners in our example db|Pet owners|||
\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg
\TPets|Pets owned by people|Pet list in our example db|Our furry friends|||
\LNumber|Owner|Name|Species|Breed
\NPet Number|Owner Number|Pet Name|Species|Breed
\QNUMBER(7)|NUMBER(7)|CHAR(35)|CHAR(35)|CHAR(35)
1|1|Fluffy|Dog|Poodle
2|1|Sharp|Dog|German Shepard
3|2|Silo|Cat|Mix
4|3|Doggie|Dog|Mix

. . . . .
Multiple Databases

The '\G' records in CTX permit you to send multiple databases in a single CTX file in the same way '\T' records permit multiple tables. Just delimit each database with a '\G' record...
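
A minimal Python sketch of a reader that uses '\G' and '\T' records as delimiters. It assumes the first field of each '\G' and '\T' record is the group or table name, as in the examples above; the function name `split_groups` is hypothetical, and other function records are skipped.

```python
import re


def split_groups(text):
    groups = {}
    group = table = None
    for line in re.split(r"[\r\n]+", text):
        if not line:
            continue  # blank lines are ignored
        if line.startswith("\\G"):
            group = line[2:].split("|")[0]   # assumed: first field is the name
            groups[group] = {}
            table = None
        elif line.startswith("\\T"):
            table = line[2:].split("|")[0]   # assumed: first field is the name
            groups.setdefault(group, {})[table] = []
        elif re.match(r"\\[A-Z]", line):
            continue  # other function records are ignored in this sketch
        else:
            groups.setdefault(group, {}).setdefault(table, []).append(line.split("|"))
    return groups


sample = (
    "\\GFauxDB|A Faux Database\n"
    "\\TPersons|People Table\n"
    "1|Smythe|Jane\n"
    "\\GOtherDB|A second database\n"
    "\\TPets|Pet list\n"
    "1|1|Fluffy|Dog|Poodle\n"
)
print(split_groups(sample))
```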

. . . . . . . . . . .







FAQ

If you have questions, contact me or ask them in the forums.



. . . . .
Q. Should I support CTX as the primary means of sharing my application's data with other outside applications?

No. If your primary concern is the ability to have users exchange data between your application and their general purpose productivity applications (like Microsoft™ Office, or OpenOffice), you should first support CSV, not CTX. The reason? CSV is the format used by the world's biggest software provider, and so CSV is the one everybody else in the world uses.

That said, CSV has many problems and shortcomings, so once you've added CSV support for interoperability, you may want to add CTX support for vastly improved functionality between your own applications.



. . . . .
Q. What about XML?

These days, of course, the de facto choice is XML. The best reason to select XML is that nobody ever gets fired for choosing XML. But since you've already taken care of most interoperability issues by supporting CSV, you may be freer to choose your next exchange format based on actual technical merit.

If that is the case, consider this: Basic XML, other than attaching two identical name-tags to EVERY instance of EVERY value, has very little inherent functionality. CTX, on the other hand, provides a great deal of inherent, expandable functionality, while requiring very little overhead.

Also, (unlike XML) much of the functionality built into CTX is generalized, and doesn't require you to write specialized support code for each new XML application (sometimes called dialects) you wish to interact with. How many XML dialects could there be out there? There are tens of thousands.



. . . . .
Q. Why not just use SQL insert statements for data-exchange?

This was actually considered for the first clients who needed a secondary exchange format (after CSV had been added). It was rejected after analysis, mainly because SQL was designed to be a VERY good database maintenance and query language, which doesn't necessarily translate to a good exchange format. The issues included high bandwidth overhead, and slight but insurmountable syntactical differences in the formats and features of insert statements.



. . . . .
Q. CTX ignores blank records, so how do I send them?

While CTX ignores lines with only CR or LF in CTX files, it doesn't prevent you from conveying blank records. There may be times when you need to convey blank records between sending and receiving applications. One possible reason for this might be for backups where it is desirable to convey an exact image of the records in a table.

Do it this way:

|

A record sent with a single field separator will not be interpreted as a blank line to be ignored by CTX. Instead it will convey a record with all blank fields to receiving applications.

Do NOT do this:

[space-character: 0x20]

Sending a CTX line with a single space character so that CTX readers will not interpret it as a blank line won't work. It will convey a record in which the first field is filled with a space character (which is not a blank record).
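
The three cases can be told apart with a very small Python sketch. The function name `parse_line` is hypothetical, and how a short record is mapped onto a wider table is left to the receiving application.

```python
def parse_line(line):
    if line == "":
        return None          # blank line: ignored, no record conveyed
    return line.split("|")   # anything else is a record


print(parse_line(""))    # None
print(parse_line("|"))   # ['', '']  -> a record whose fields are all blank
print(parse_line(" "))   # [' ']     -> first field holds a space, not blank
```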



. . . . .
Q. Are there limits to which transports can carry CTX files?

Yes. Channels with encoding schemes that only use a single letter-case, such as BAUDOT, and Vail Codes (Morse Code), will not transport CTX directly because CTX uses case differences in the optional portions of its exchange format. Also, transports with encodings that don't include the CTX overhead characters CR, LF, '\', '|', and ';' will be a problem because CTX needs these for its own encoding. Certainly, some form of secondary escaping scheme could be applied "below" the CTX translator (closer to the PHY) in order to permit transport over channels with these character limitations. But such secondary escaping schemes are not currently part of CTX in any way.



. . . . .
Q. Can CTX be carried within XML documents?

Yes. It can be transported as pure binary data or it can be transported as a text block if writers (or intermediate translators) use \m...; sequences to convert the unacceptable code points. I recommend the latter. If you need a MIME type, use the one recommended by the convention.

  • application/ctx
    CTX files transported over transports that can carry all 256 octet code points MAY be sent without explicit conversion using secondary methods. In these cases, the suggested MIME type is "application/ctx".





Miscellaneous

Following are some random notes and thoughts that came up during the specification that may or may not be useful to those implementing this convention.

. . . . .
Structure

At its simplest, only rows of data, with no function records (records that start with a '\' and a capital letter) and no multi-byte (\m...;) encodings are included. This is approximately equal in complexity to a CSV file.

At its more complex, a CTX file or transmission may carry multiple tables along with detailed meta-data about each table and each field. All readers MUST read and gracefully deal with this complexity, even if it is simply to discard the extra information if it will not be used by the reader's application.




. . . . .
Reader

The reader must be able to read all byte values (x00-xff). If the application is unable to deal with some byte values, the next level above the CTX reader must take appropriate measures to produce an error, remove them, or alter them in the output of the CTX reader before passing them on to the application. How byte codes that are not acceptable to a given application are dealt with is not part of the standard and is undefined by the standard.




. . . . .
Compression

Some primitive run-length compression of field data may be performed using special optional encoding provided by the \m[#]...; sequences. For example, a group of 1000 0x00 bytes can be represented in a CTX field with the sequence "\m1000x00;" (a 100:1 compression ratio :-) wow).
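
The arithmetic behind that example can be checked with a short Python sketch. The helper name `encode_run` is hypothetical, and the exact form of the sequence (decimal count, 'x', two hex digits) is inferred only from the single example above, so treat it as an assumption.

```python
def encode_run(byte_value, count):
    # Build a run-length sequence of the assumed form \m<count>x<hh>;
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("byte_value must fit in one byte")
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"\\m{count}x{byte_value:02x};"


sequence = encode_run(0x00, 1000)
print(sequence)               # \m1000x00;
print(1000 / len(sequence))   # 100.0
```

The sequence is 10 characters long, which is where the 100:1 ratio for 1000 source bytes comes from.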

No other compression is defined to be part of the CTX recommendation. If an external compression scheme like zip is used, it is suggested that it SHOULD be applied to the entire CTX record after translation, as the CTX overhead will be relatively more redundant, and therefore more compressible, than most other data.

Files that have had compression other than that provided by CTX's '\m[#]...;' sequences applied to them MUST NOT be referred to as CTX files. Only files which can be read by the standard CTX READER (see State Machines above) should be considered CTX files.




. . . . .
Notes from the Future

Future releases are likely to include:

  • Create accompanying converters for CTX->CSV and CSV->CTX.
    The CTX->CSV converter will include conversion to normal CSV and to CTC flavors. It is moot in the other direction.

  • Create accompanying converter for CTX->XML

  • The ('\K') types will include a standard that will encompass most of the key and index conventions being used.
    I'm thinking something like this:
    • P Primary key (one to many-OR-one)
    • F Foreign key
      F:TableName.PrimaryKeyField (many-to-one)
      F:TableName.KeyField1,KeyField2,KeyField3 (m->1 multi-field key)
      ?F:TableName.ForeignKeyField (m->m)
    • I Index key or key candidate
    • ...needs more research.

  • Add a section that attempts to list "planned ambiguity"
    I would like to add a section that at least lists these. Eventually, perhaps it can go into explaining "why" the ambiguity was specified.

  • Complete early work on CTXc and give it its own section.




[PRELIMINARY]
CTC / CTX-c: Comma Delimited CTX (Spreadsheet Compatible)

[Note: Preliminary - The following is very preliminary and has not yet been fully designed or finalized.]

CTC (or CTX-c, or CTXc) is a CSV compatible CTX format that serves as a "bridge" format for simple data sharing with desktop productivity tools. It is essentially a way to produce CTX files that can be read by popular desktop productivity tools (such as spreadsheets) for the express purpose of doing graphing and other types of analysis on numerical and other forms of data-sets.

. . . . . . .
Converting from CTX --> CTC

  1. Double (add an extra double quote to) all double-quote characters within CTX fields.
  2. Add double-quotes around ALL fields that contain the following character sequences (note that CR and LF don't exist in CTX so need not be handled directly):
    • Multi-byte escape sequences (\m...;) - Note: this is slight overkill, but makes the conversion a lot simpler.
    • Comma (0x2c)
    • Double-quotes (0x22)
  3. Convert pipe characters (|) to commas (,)
  4. Un-escape all single-byte escape sequences that are NOT function sequences (\r, \n, \p, and \i, with \i converted last).
  5. OPTIONAL: Un-escape all multi-byte escape sequences. (If this is done, any characters outside of 7-bit ASCII may need to be converted to spaces in the CTX-c file)
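
Steps 1 to 3 above can be sketched in Python as follows. The function name `ctx_record_to_ctc` is hypothetical. The sketch assumes literal pipes inside fields are already escaped in the CTX input, so splitting on '|' is safe, and it deliberately leaves out steps 4 and 5 because the escape definitions are outside this section.

```python
def ctx_record_to_ctc(record):
    out = []
    for field in record.split("|"):
        # Decide on quoting before any characters are doubled (step 2 criteria).
        needs_quotes = ("\\m" in field) or ("," in field) or ('"' in field)
        field = field.replace('"', '""')       # step 1: double the double-quotes
        if needs_quotes:
            field = '"' + field + '"'          # step 2: quote the whole field
        out.append(field)
    return ",".join(out)                       # step 3: pipes become commas


print(ctx_record_to_ctc('1|Smythe, Jr.|Jane "JJ"'))
# prints 1,"Smythe, Jr.","Jane ""JJ"""
```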

. . . . . . .
Converting from CTC --> CTX

When producing a CTX file from a CTC file, simply reverse the above conversions.

  1. Escape all characters that have a respective single-byte CTX escape sequence (\i, \p, \r, \n, with \i converted first).
  2. Convert field-delimiter commas (,) to pipe-characters.
  3. Remove double quotes from around fields that are delimited by double-quotes.
  4. Convert double-double quotes in fields to single double-quotes.



Reading and writing CTX-c with a spreadsheet program

To produce a CTX-c file from within your desktop spreadsheet, simply save it as a CSV file using your spreadsheet's save facility. If your data fields have embedded commas, double-quotes, or new-line characters, you will have to first convert them to their CTX \mx...; equivalents. You can perform this manually, or write a small macro to perform the conversions for you.

If your data has a basic header-row containing column names, you may (optionally) add a CTX header type signifier starting at the first character of the first field of your spreadsheet's header-row (\L for labels, or \N for names). It is also ok to add other CTX type signifiers as header-rows while the data is displayed in your spreadsheet. These might include types, such as \P, \M, \Q, and \Y, or rows that contain field descriptions, such as hovers (\H) and comments (\R).

When you read a CTX-c file into a spreadsheet, read it in as a CSV file. You may have to change the extension to ".csv". Any fields with embedded commas, double-quotes, or new-line characters will display as their CTX \mx...; replacements. If you need these fields for your analysis, you will have to either convert them manually, or write a small macro to convert them for you.





Permissions

The CTX recommendation is © Copyright, Creativyst, Inc. 2005-2011 ALL RIGHTS RESERVED.

Permission to use the functionality described in this unmodified convention as a component in the designs of commercial or personal products and software offerings is hereby granted provided that this copyright notice is included in its entirety in a form and location that is easily readable by the user. It is important to note that the above permission does NOT include or convey permission to copy this article describing CTX (see below).


. . . . .
This article is © Copyright, Creativyst, Inc. 2005-2011 ALL RIGHTS RESERVED.

Links to this article are always welcome.

However, you may not copy, modify, or distribute this work or any part of it without first obtaining express written permission from Creativyst, Inc. Production and distribution of derivative products, such as displaying this content along with directly related content in a common browser view are expressly forbidden!

Those wishing to obtain permission to distribute copies of this article or derivatives in any form should contact Creativyst.

Permissions printed over any code, DTD, or schema files are supported as Creativyst's permission statement for those constructs.



Version History

  • v1.0e [A] 2011.10.6 - Add backtracking (redundancy) features

    This has been added to define the behavior when header files show up for the first time within a group of previously transmitted regular data records. It has been designed to also (intrinsically) provide redundancy features, so that the identical headers can be repeated occasionally in a large group and they can be applied to previous records that may have come in sans headers (due to transmission interruptions, for example).

  • v1.0d [A] 2008.07.23 - Bug Fixes, CTX-c, DEEP STRUCTURAL FUNCTION-RECORD CHANGE: (C->R and C=C-types). More feature restriction and clarification, better state-table graphics.

    Also: Beginning with this version, Software Stability Rating labels will be included. For now, they will be A-Alpha, meaning the convention is still in development. The convention should be considered changeable until it gets at least a [B] rating.

    • BIG CHANGE - The function-record specifier: \C will be changed. It will now be a field-TYPE specifier record containing field-type specifiers of standard C-language intrinsic types (\C).

    • BIG CHANGE - The comment function has been changed. It is now denoted by (\R) (Reference/Remark) (OR MAYBE \A for Annotate?).

    • Now that the \C type-record is in, the \Y type will be specifically restricted, and may NOT be used to denote standard C types, or "standard" SQL types.

    • CTX-c will be defined as an auxiliary part of the spec. It will act as a bridge to desktop spreadsheets and such.

    • Still no clarification on the \K types. Still no text table for the state-tables. This is the first DATED and RATED release. Though it is currently rated [A], if implementations stay away from the \K types and the NON-intentional ambiguity they should be ok going forward.

  • v1.0c Bug Fixes, Feature Restriction and Clarification

    • Remove the superfluous hierarchy types ('R', 'T', and 'G')
      Three times the complexity for only a little efficiency improvement. Score one for simplicity and orthogonal design.

    • Change permitted number formats from 'dine' fprintf() formats to 'fixed' fprintf() formats in fields that have been denoted as numbers ('N') in \P records.
      I had 'n'? in there, 'n'?!?. What was I thinking? Added 'f' (float) and 'x' (hex integer).

    • Remove (for now) OpenDocument styled formatting strings from \D records
      Not a lot more flexibility, for much greater complexity. (?) If I'm wrong about this I'd be very appreciative of anyone who could explain it to me. Also, this may change as the OpenDocument standard takes over the world :-).

    • Add a section that maintains a version history and wish list.
      That'd be this list. :-)

  • v1.0b First Full Published Release

  • v1.0a First (preliminary) Publish











 
© Copyright 2005-2011, Creativyst, Inc.
ALL RIGHTS RESERVED

Written by: Dominic John Repici


v1.0e