Wednesday, October 16, 2013

Data file formats for interchanging and sharing data from user applications

When I wrote my soil heave program I based its data file format on what I had otherwise learnt about DXF files. I did this to keep the reading of the data simple, as I hadn't otherwise written any code to process a comma delimited file (.cdf), and I had both numbers and text and all kinds of problems were encountered if text and numbers are on the same line and use the standard input/output routines with languages like Pascal and C.

The format of the data file adopted therefore comprises of a line with the tag name followed by another line with the tag value. However I didn't include all the complicated nesting and sections typically found in a DXF file. However the soil heave program is based on multiple soil cores being split into multiple strata, and this introduced some inconsistency into reading and writing the data format. Whilst the Core tag is followed by a count of the reference number of the core, the end core tag has no tag value following it. Similarly the end of file tag (EOF) also has no tag value following it. But for simple data sets, this poses no real problem for reading and writing the file. For more complex data sets however, a more consistent data format is required, with a simple parser which can be called repetitively.

When I recently came to revisit the data file format for exporting data from my spreadsheets, I discovered XML files.

No need to re-invent the wheel, nor particularly any need to write a file parser as such, such file readers are built into various programming environments. The only task is having got the data, to sort it and assign to appropriate variables inside the user application: well programming wise. The bigger task is to decide on the format of the data.

I realise that MS Excel exports data to an XML file, newer may even use such as the default format, however that contains more information than I am interested in, and is about data defining a spreadsheet, not the structure of the data stored in the spreadsheet.

So at present I have wrapped the whole data set in the tag , namely its my data, formatted my way. Then I have divided the data into parameters and variables. Parameters are named ranges in an Excel spreadsheet which contain a simple value and have no formula. Variables are named ranges which contain formula. So parameters are wrapped in tags, and variables are wrapped in tags. Nested inside these tags, are tags for the name of the range, and tags for the stored value.

Whilst writing this code in MS Excel/vba I also had problems dealing with Unicode characters, as Excel cells contained values which disappeared when written to a text file using the ordinary print functions. Having got the data into a text file also had problems reading them back in. But resolved the problems using the VBScript file system objects for reading and writing files.

The other problem encountered in the export of data to an XML file was that not all the named ranges were simple scalar values, some were lists and tables. So also added additional tags, if a range only lay on a single row or column, then the range classified as a list and tagged , if comprised of rows and columns then classified as a table and tagged . A list contains elements , whilst a table contains rows and elements . Though may have been better to use lists in place of rows, except that a list container has  a name whilst a row doesn't.

An alternative format, or schema, I considered was to consider using the MS Excel range name as the XML tag, and then wrapping the value in these tags. The problem I had with that approach is that it is less generic. Whilst it may be the more valid way of defining the fields for a database in XML, it doesn't match what I already have in MS Excel and MS Access. In Excel all my variables are typically defined on one worksheet which is compatible with import into an MS Access table. In this way I can add new variables as I wish if I don't have need for a variable or more to the point if a program doesn't know what to do with the variable it can be ignored.

Whilst XML provides flexibility, and can easily write a new variable as a new XML tag, the problem is reading the XML file and then knowing what to do next with the data which comes after the tag. In effect an XML data file has to be parsed twice when using XML objects. The first level of parsing is to grab all the XML tags, any contained attributes, and then the data contained between the tags. The second level of parsing is determining what to do with the tags and data with respect to some specific application.

A book database with a fixed data structure is relatively simple: just read the XML file, and then permit searching and listing information about the list of books contained in the database. Throw variable data structures into the database, then need different processors for each data set found. For example with respect to CAD data, may have a circle and a line. A data structure for a circle contains say the diameter of the circle and may be its insertion point. The data structure for a line contains say the coordinates of the two ends of the line. Displaying a line on the screen is different than displaying a circle. So as the program works its way through all the data, it calls different commands to execute based on the data encountered at any given point in reading the data.

At the time of this experimenting, it seemed to me that if the variable name was used as a tag, then when read a variable unknown to the program, it would loose synchronisation with reading the rest of the file. {This may have had more to do with my writing a parser from scratch rather than using XML reader objects. Which is also why I didn't use parameter and variable as tag attributes. Something like isparameter=1}

If can read all the data into a XML tree object, then if traverse the tree and find a tag do not know, then can ignore the entire branch starting from that node. But if cannot read the entire data into an XML tree object, then seems to me, need to decide what to do as progress through the data: since cannot ignore an entire branch until identified the branch: found the end tag.

I don't know really, probably need to experiment with both approaches. Another approach adopted by some software is to use the tag attributes to contain data. This approach I don't like as it confuses things. Data is supposed to be between the tags, and the tags can contain attributes descriptive of the data. However storing data in the attributes can make the data file more compact and easier to read. For example, the definition of a node could be as follows:

<node id="1" x="2" y="5" z="0" />
And similarly I could have for simple variables or parameters.
 <Height isparameter="1">5</Height >
 Compared to approach proposed above and otherwise experimented with:

Parameters are input variables and can be changed, whilst those items I have named variables are dependent variables their value is dependent on the value of input parameters, and the process which relates the input to output. Change the process which relates the parameter to variable and change the output of the program. For example height and span could relate to a gable roof building or a skillion roof building or for that matter a balustrade. Thus two basic parameters can get transformed into a variety of structural forms. Further I can define a building as a series of bents. Each bent comprises one column and one rafter. To define a monoslope roof from bents, then the rafter would be split at mid point or any other convenient location along the rafter. Alternatively permit the bent to only comprise of a column. Any case the parameters height and span could occur multiple times and their meaning would vary depending on the object in which they are contained.

Say I throw another parameter in such as eavesHeight, then could have either of the following:

Method 1:
<eavesHeight isparameter="1">5</eavesHeight>
Method 2 (preferred):
Now why I prefer method 2, I'm not entirely sure. Method 2, is less compact and more cumbersome for a person to read than method 1. But ?

I guess from my viewpoint, when I set up my spreadsheets I didn't know what the input parameters were. I simply worked through the calculations and unprotected certain values as parameters. I then shifted the input of these values to a worksheet used for input values. In Quatro Pro it was necessary to have all the input values to a dialogue box in a continuous block, and since I also wanted the data to be compatible with Paradox it also fell into place as one or more tables. Without the dialogue boxes it also seemed appropriate to collect similar data in sets, with one worksheet per set of data. So ended up with multiple tabbed sheets each of which was compatible with export to a database. When moved over to MS Excel and MS Access, the concept was maintained, even though can feed data from any where into a dialogue box.

So the tables or worksheets are not just a convenient place for named cells/ranges, its not just a simple matter of getting a value for a variable. The tables are defining the parameters which define the object which the spreadsheet is used to design.

So that the list of parameters and their names are just as important as the values of the parameters. Still doesn't give me a reason for putting the names between the XML tags rather than making them the XML tags. {I guess my reasoning is the parameters are data describing an object, change the object the parameters change. Therefore I have added another level of abstraction for the data, which may not be necessary.}

Any case that's the idea at present, if there is a named range in MS Excel then its value is exported to an XML file. If the data in an XML file is a parameter then it can be imported into an MS Excel worksheet. This way large spreadsheets do not have to be stored for individual projects, the MS Excel workbook just gets used as an application, and the required data is stored in simple XML files. This data can also be shared across different Excel workbooks which fulfil different functions. So all the data for a project can be stored in a central XML file, and used by a variety of applications. Each application just reads and writes the data it needs without messing up the data file for other applications.

Of course I could just use a simple Excel spreadsheet to get data to and from complex Excel workbooks, but that supposes Excel is available. Further not every suitable application will be built using Excel, plus XML is also useful for web based data storage. Which means faster development of a web based application at future date.

Looks like I need to figure out how to display the XML brackets. &lt; and &gt; is not working, and neither does using the actual characters. Take that back, it doesn't display correctly in compose. Now for the other problem: none of the sites I found providing a solution, actually explain properly what is required because the codes they recommend are translated and therefore see the results not the recommended codes. The results and codes are:

< is displayed by &lt;
> is displayed by &gt;