Showing posts with label Data Structures. Show all posts

Saturday, December 14, 2013

Exporting Blogger Posts to individual html files

A few months back I exported my Blogger blog, producing an Atom XML file. I looked at this in XML Notepad 2007 and it didn't seem very helpful. I also opened it in MS Excel; from there at least I was able to extract a list of titles and keywords, but it was otherwise a bit messy. I had been experimenting with writing my own parser for my own simple XML files, but the Atom XML seemed too complicated, especially the <content> element which contains the HTML formatted post.

I looked around for off-line editors, but none of the ones I could find seemed worthy of the title editor, as they are mostly for creating and posting new posts, rather than editing and managing existing ones. What I wanted was to extract the individual posts, edit them, and possibly update the original posts or post again as new posts: noting that this blog is catharsis and a basic brain dump of ideas.

In any case, more recently I have been experimenting with the MSXML COM automation object, which I used in my simple barrier program to provide the data for the drop-down list and associated load tables. So having had some success programming MSXML in MS Excel/vba and then translating to vb.net, I thought I would take a look at ripping the Blogger Atom XML file apart with MSXML. {yeh! I'm now aware that I probably shouldn't use MSXML in vb.net}

This turned out not to be all that complicated in the first instance. The XML file contains a lot of elements called <entry>, but not all of these are posts. From looking at the file with XML Notepad, as far as I could tell the entries whose <content> element has a type attribute with the value "html" are the posts and pages of the blog. The following code extracts the appropriate nodes.

Set nList = xmldoc.selectNodes("//entry[content/@type='html']")

Admittedly, working out how to select the appropriate nodes wasn't so simple. At the time I was also having problems understanding what I was actually grabbing from the XML data, and what the text property or the value property would return.
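For comparison, the same selection can be sketched outside VBA. Below is a minimal Python equivalent using the standard xml.etree.ElementTree module; the feed snippet is made up, and note that a real Blogger export declares the Atom namespace, which ElementTree requires you to spell out (MSXML's handling of default namespaces differs).

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom feeds are namespaced

# Tiny stand-in for a Blogger export: one html entry, one that isn't.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><content type="html">&lt;p&gt;A post&lt;/p&gt;</content></entry>
  <entry><content type="text">a setting, not a post</content></entry>
</feed>"""

root = ET.fromstring(feed)

# Equivalent of selectNodes("//entry[content/@type='html']"):
# keep entries whose <content> child has type="html".
posts = [e for e in root.findall(ATOM + "entry")
         if e.find(ATOM + "content") is not None
         and e.find(ATOM + "content").get("type") == "html"]
print(len(posts))
```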

Having got these nodes, the next step was to find the child nodes with the required data and then write them to either a plain text file or an HTML file. For the initial test I just grabbed the data and wrote it to a plain text file. Before writing the program I had also copied the content node data in XML Notepad, pasted it into Notepad, saved it as simple text, renamed it, and viewed it in Google Chrome. The whole XML file can also be viewed in Google Chrome: writing a Chrome application may be an alternative possibility for extracting the blog posts, but creating local files with a browser is slightly problematic.
Once I had confirmed that I could grab the posts, I added a few lines of additional code to create a more completely formatted individual HTML file. Only I was missing the post titles, and getting those was more problematic. Well, initially I had everything in the <entry> element printed to the individual files, or otherwise just grabbed the content. So more research on XPath at w3schools.com, and more experimenting with selectNodes, to arrive at the search string shown above.

With that search string I have to process the child nodes of the node list to get to the <content> and <title> elements. Once I got the titles back into the posts, I figured I had a lot of files and didn't know what they were, as I had merely given them numbers. So I further modified the program to create an index file. Having got all that working on Friday, I figured the next step would be to do something with the categories or keywords assigned to each post. This posed another problem, as not every <category> element contains a keyword, nor does it have a value; it only has attributes. Trying to find information on how to access the attributes wasn't easy.

In any case, there are two attributes, scheme and term. It seems that "term" contains the keyword if the "scheme" attribute contains "http://www.blogger.com/atom/ns#". I therefore needed to be able to test the value of the attributes and extract the keywords.
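That attribute test can be sketched in Python as follows; the <entry> fragment is invented for illustration (and left without namespace prefixes for brevity), but the scheme/term logic is the same as described above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The scheme value that marks a keyword-carrying <category> element.
BLOGGER_NS = "http://www.blogger.com/atom/ns#"

entry = ET.fromstring("""<entry>
  <category scheme="http://www.blogger.com/atom/ns#" term="Data Structures"/>
  <category scheme="http://www.blogger.com/atom/ns#" term="XML"/>
  <category scheme="http://schemas.google.com/g/2005#kind"
            term="http://schemas.google.com/blogger/2008/kind#post"/>
</entry>""")

# Keep only the categories whose scheme marks them as keywords,
# and count frequency of use (per the Excel table described below).
keywords = Counter(c.get("term") for c in entry.findall("category")
                   if c.get("scheme") == BLOGGER_NS)
print(sorted(keywords))
```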

Having figured out how to do this, I wrote the keywords to a table in Excel and counted their frequency of use in the blog posts. The code I used is as follows; it will need modifying to use elsewhere, but it may be useful as an outline of the basic process of getting information from the Blogger Atom XML. Note that some of the code has wrapped to fit the width of the blog post, and such code needs to be on a single line.
        
Sub xmlSimpleList2()
' Relies on helpers defined elsewhere in the workbook: fopen, StrLPadc
' and updateKeyWordList, plus the module-level values myDocName (path to
' the exported Atom xml file) and bloggerNameSpace
' ("http://www.blogger.com/atom/ns#").
  Dim wrkbk As Workbook
  Dim wrkSht As Worksheet
  Dim WrkRng As Range
  Dim fpath1 As String
  Dim keyWordTable As Range
  Dim keyWord As String

  Dim xmldoc As MSXML2.DOMDocument
  Dim xmlNode As MSXML2.IXMLDOMNode

  Dim tmpNode As MSXML2.IXMLDOMNode
  
  Dim nList As MSXML2.IXMLDOMNodeList
  Dim tmpList As MSXML2.IXMLDOMNodeList
  
  Dim oNode As MSXML2.IXMLDOMNode
  Dim oAttributes As MSXML2.IXMLDOMNamedNodeMap
  
  Dim tagName As String
  Dim tagValue As String
  Dim i As Integer, j As Integer
  Dim fp As Integer, fp1 As Integer
      
  Set wrkbk = ThisWorkbook
  Set wrkSht = wrkbk.Worksheets("xmlData")
  Set WrkRng = wrkSht.Range("A1:A1")
  Set keyWordTable = ThisWorkbook.Worksheets("keyWords").Range("A1:A1") 'row zero is header
   
    Set xmldoc = New MSXML2.DOMDocument

    xmldoc.async = False
    If xmldoc.Load(myDocName) Then
      fp1 = fopen(ThisWorkbook.Path & "\BlogPosts\" & "index.htm", "wt")
      Print #fp1, "<html>"
      Print #fp1, "<body>"
      Print #fp1, "<ol>"

        'Test XPath
        i = 1
        j = 1
' Set nList = xmldoc.selectNodes("//content[@type='html']")
        Set nList = xmldoc.selectNodes("//entry[content/@type='html']")
        
' Final text entity is part of the tree, but attributes are not
        For Each oNode In nList
          tagName = oNode.nodeName
' Debug.Print i, tagName
          WrkRng.Offset(i, 0).Value = tagName

          fpath1 = ThisWorkbook.Path & "\BlogPosts\" & "post" & StrLPadc(CStr(i), 4, "0") & ".htm"
          fp = fopen(fpath1, "wt")
          Print #fp, "<html>"
          Print #fp, "<body>"

          If oNode.hasChildNodes Then
' Debug.Print "Number of child Nodes: " & oNode.childNodes.Length
            For Each xmlNode In oNode.childNodes
            
              Select Case UCase(xmlNode.nodeName)
                Case UCase("title")
                  Debug.Print xmlNode.Text
                  WrkRng.Offset(i, 1).Value = xmlNode.Text
                  Print #fp, "<h1>"
                  Print #fp, xmlNode.Text
                  Print #fp, "</h1>"
                  
                  Print #fp1, "<li><a href=" & """" & fpath1 & """" & ">" & xmlNode.Text & "</a></li>"
                  
                Case UCase("content")
                  Debug.Print "content"
                  Print #fp, xmlNode.Text
                  
                Case UCase("category")
                  Debug.Print "category"
                  Set oAttributes = xmlNode.Attributes
                  Set tmpNode = oAttributes.getNamedItem("scheme")
                  
                  If tmpNode.Text = bloggerNameSpace Then
                    j = j + 1
                    Set tmpNode = oAttributes.getNamedItem("term")
                    keyWord = tmpNode.Text
                    WrkRng.Offset(i, j).Value = keyWord
                    Call updateKeyWordList(keyWordTable, keyWord)
                  End If
                  
              End Select
              
            Next xmlNode
          End If
           
          Print #fp, "</body>"
          Print #fp, "</html>"
          Close #fp
          i = i + 1
          j = 1
        Next oNode
        
        Print #fp1, "</ol>"
        Print #fp1, "</body>"
        Print #fp1, "</html>"
        Close #fp1
        
    End If  'xml doc loaded
    
    Set xmldoc = Nothing
    
    Debug.Print "All Done!"
    
End Sub

Wednesday, October 16, 2013

Data file formats for interchanging and sharing data from user applications

When I wrote my soil heave program I based its data file format on what I had otherwise learnt about DXF files. I did this to keep the reading of the data simple, as I hadn't otherwise written any code to process a comma delimited file (.cdf), and I had both numbers and text: all kinds of problems are encountered if text and numbers are on the same line and you use the standard input/output routines of languages like Pascal and C.

The format of the data file adopted therefore comprises a line with the tag name followed by another line with the tag value. However, I didn't include all the complicated nesting and sections typically found in a DXF file. The soil heave program is based on multiple soil cores being split into multiple strata, and this introduced some inconsistency into reading and writing the data format. Whilst the Core tag is followed by the reference number of the core, the end core tag has no tag value following it. Similarly the end of file tag (EOF) also has no tag value following it. For simple data sets this poses no real problem for reading and writing the file. For more complex data sets, however, a more consistent data format is required, with a simple parser which can be called repetitively.
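A reader for such tag-name/tag-value line pairs can be sketched as follows. This is a guess at the layout from the description above, not the soil heave program's actual code, and the spellings of the value-less tags ("EndCore", "EOF") are assumptions.

```python
def read_tag_pairs(lines):
    """Yield (tag, value) pairs from alternating tag/value lines.
    Tags in NO_VALUE have no value line following them, which is the
    inconsistency described above: the parser must special-case them."""
    NO_VALUE = {"EndCore", "EOF"}  # assumed spellings of the value-less tags
    it = iter(lines)
    for tag in it:
        tag = tag.strip()
        if tag in NO_VALUE:
            yield tag, None
        else:
            yield tag, next(it).strip()

# A made-up fragment in the style described: Core tag, then its number.
data = ["Core", "1", "Depth", "2.5", "EndCore", "EOF"]
print(list(read_tag_pairs(data)))
```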

When I recently came to revisit the data file format for exporting data from my spreadsheets, I discovered XML files.

No need to re-invent the wheel, nor particularly any need to write a file parser as such: such file readers are built into various programming environments. The only task, programming wise, having got the data, is to sort it and assign it to the appropriate variables inside the user application. The bigger task is to decide on the format of the data.

I realise that MS Excel exports data to an XML file, and newer versions may even use such as the default format; however, that contains more information than I am interested in, and is about data defining a spreadsheet, not the structure of the data stored in the spreadsheet.

So at present I have wrapped the whole data set in a single root tag: namely, it's my data, formatted my way. Then I have divided the data into parameters and variables. Parameters are named ranges in an Excel spreadsheet which contain a simple value and have no formula. Variables are named ranges which contain formula. So parameters are wrapped in <parameter> tags, and variables are wrapped in <variable> tags. Nested inside these tags are <name> tags for the name of the range, and <value> tags for the stored value.

Whilst writing this code in MS Excel/vba I also had problems dealing with Unicode characters, as Excel cells contained values which disappeared when written to a text file using the ordinary print functions. Having got the data into a text file, I also had problems reading it back in. But I resolved the problems using the VBScript file system objects for reading and writing files.

The other problem encountered in the export of data to an XML file was that not all the named ranges were simple scalar values; some were lists and tables. So I also added additional tags: if a range only lay on a single row or column, then the range was classified as a list and tagged <list>; if comprised of rows and columns, then it was classified as a table and tagged <table>. A list contains <element> tags, whilst a table contains <row> tags which in turn contain <element> tags. Though it may have been better to use lists in place of rows, except that a list container has a name whilst a row doesn't.
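As a sketch, a data file in this scheme might look like the following. Only the <parameter>, <name> and <value> tags appear verbatim later in this post; the root tag, the range names and the exact list/table layout are illustrative guesses.

```xml
<mydata>
  <parameter>
    <name>eavesHeight</name>
    <value>5</value>
  </parameter>
  <list>
    <name>spans</name>
    <element>6</element>
    <element>8</element>
  </list>
  <table>
    <name>loadCases</name>
    <row>
      <element>1</element>
      <element>0.9</element>
    </row>
  </table>
</mydata>
```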


An alternative format, or schema, I considered was to use the MS Excel range name as the XML tag, and then wrap the value in these tags. The problem I had with that approach is that it is less generic. Whilst it may be the more valid way of defining the fields for a database in XML, it doesn't match what I already have in MS Excel and MS Access. In Excel all my variables are typically defined on one worksheet which is compatible with import into an MS Access table. In this way I can add new variables as I wish, and if a program has no need for a variable, or more to the point doesn't know what to do with a variable, it can be ignored.

Whilst XML provides flexibility, and I can easily write a new variable as a new XML tag, the problem is reading the XML file and then knowing what to do with the data which comes after the tag. In effect an XML data file has to be parsed twice, even when using XML objects. The first level of parsing grabs all the XML tags, any contained attributes, and the data contained between the tags. The second level of parsing determines what to do with the tags and data with respect to some specific application.

A book database with a fixed data structure is relatively simple: just read the XML file, and then permit searching and listing of information about the books contained in the database. Throw variable data structures into the database, and you then need a different processor for each data set found. For example, with respect to CAD data, you may have a circle and a line. A data structure for a circle contains, say, the diameter of the circle and maybe its insertion point. The data structure for a line contains, say, the coordinates of the two ends of the line. Displaying a line on the screen is different from displaying a circle. So as the program works its way through all the data, it calls different commands to execute based on the data encountered at any given point in reading the data.
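That second level of parsing amounts to a dispatch table keyed on the tag name. A minimal Python sketch, with an invented CAD fragment (the tag names circle, line and spline are illustrative, not from any real schema):

```python
import xml.etree.ElementTree as ET

doc = """<drawing>
  <circle><x>2</x><y>5</y><diameter>3</diameter></circle>
  <line><x1>0</x1><y1>0</y1><x2>4</x2><y2>4</y2></line>
  <spline/>
</drawing>"""

def draw_circle(e):
    return "circle d=%s at (%s,%s)" % (
        e.findtext("diameter"), e.findtext("x"), e.findtext("y"))

def draw_line(e):
    return "line (%s,%s)-(%s,%s)" % (
        e.findtext("x1"), e.findtext("y1"), e.findtext("x2"), e.findtext("y2"))

# Second-level parsing: dispatch on the tag name.
handlers = {"circle": draw_circle, "line": draw_line}

shapes = []
for child in ET.fromstring(doc):
    handler = handlers.get(child.tag)
    if handler is None:
        continue  # unknown entity (spline): ignore the whole branch
    shapes.append(handler(child))
print(shapes)
```

Because the whole file is read into a tree first, the unknown <spline/> branch is simply skipped, which bears on the synchronisation problem discussed next.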

At the time of this experimenting, it seemed to me that if the variable name was used as a tag, then when the program read a variable unknown to it, it would lose synchronisation with reading the rest of the file. {This may have had more to do with my writing a parser from scratch rather than using XML reader objects. Which is also why I didn't use parameter and variable as tag attributes. Something like isparameter=1}

If you can read all the data into an XML tree object, then if you traverse the tree and find a tag you do not know, you can ignore the entire branch starting from that node. But if you cannot read the entire data set into an XML tree object, then it seems to me you need to decide what to do as you progress through the data: since you cannot ignore an entire branch until you have identified the branch, that is, found the end tag.

I don't know really; I probably need to experiment with both approaches. Another approach, adopted by some software, is to use the tag attributes to contain data. This approach I don't like, as it confuses things: data is supposed to be between the tags, and the tags can contain attributes descriptive of the data. However, storing data in the attributes can make the data file more compact and easier to read. For example, the definition of a node could be either of the following:

<node id="1" x="2" y="5" z="0" />
OR
<node>
    <id>1</id>
    <x>2</x>
    <y>5</y>
    <z>0</z>
</node>
And similarly I could have, for simple variables or parameters:

<Height isparameter="1">5</Height>

Compared to the approach proposed above and otherwise experimented with:

<parameter>
    <name>Height</name>
    <value>5</value>
</parameter>

Parameters are input variables and can be changed, whilst those items I have named variables are dependent variables: their value depends on the value of the input parameters, and on the process which relates the input to the output. Change the process which relates parameter to variable and you change the output of the program. For example, height and span could relate to a gable roof building, a skillion roof building, or for that matter a balustrade. Thus two basic parameters can be transformed into a variety of structural forms. Further, I can define a building as a series of bents. Each bent comprises one column and one rafter. To define a monoslope roof from bents, the rafter would be split at the mid point or any other convenient location along the rafter. Alternatively, permit the bent to comprise only a column. In any case, the parameters height and span could occur multiple times, and their meaning would vary depending on the object in which they are contained.

Say I throw another parameter in such as eavesHeight, then could have either of the following:

Method 1:
<eavesHeight isparameter="1">5</eavesHeight>
Method 2 (preferred):
<parameter>
    <name>eavesHeight</name>
    <value>5</value>
</parameter>
Now why I prefer method 2, I'm not entirely sure. Method 2 is less compact and more cumbersome for a person to read than method 1. But?

I guess, from my viewpoint, when I set up my spreadsheets I didn't know what the input parameters were. I simply worked through the calculations and unprotected certain values as parameters. I then shifted the input of these values to a worksheet used for input values. In Quattro Pro it was necessary to have all the input values for a dialogue box in a continuous block, and since I also wanted the data to be compatible with Paradox, it also fell into place as one or more tables. Without the dialogue boxes it also seemed appropriate to collect similar data in sets, with one worksheet per set of data. So I ended up with multiple tabbed sheets, each of which was compatible with export to a database. When I moved over to MS Excel and MS Access, the concept was maintained, even though data can be fed from anywhere into a dialogue box.

So the tables or worksheets are not just a convenient place for named cells/ranges; it's not just a simple matter of getting a value for a variable. The tables define the parameters which define the object which the spreadsheet is used to design.

So the list of parameters and their names are just as important as the values of the parameters. That still doesn't give me a reason for putting the names between the XML tags rather than making them the XML tags. {I guess my reasoning is that the parameters are data describing an object: change the object and the parameters change. Therefore I have added another level of abstraction for the data, which may not be necessary.}

In any case, that's the idea at present: if there is a named range in MS Excel, then its value is exported to an XML file, and if the data in an XML file is a parameter, then it can be imported into an MS Excel worksheet. This way large spreadsheets do not have to be stored for individual projects; the MS Excel workbook just gets used as an application, and the required data is stored in simple XML files. This data can also be shared across different Excel workbooks which fulfil different functions. So all the data for a project can be stored in a central XML file and used by a variety of applications. Each application just reads and writes the data it needs, without messing up the data file for the other applications.

Of course I could just use a simple Excel spreadsheet to get data to and from complex Excel workbooks, but that supposes Excel is available. Further, not every suitable application will be built using Excel, plus XML is also useful for web based data storage. Which means faster development of a web based application at a future date.


Looks like I need to figure out how to display the XML brackets. &lt; and &gt; were not working, and neither was using the actual characters. Take that back: it just doesn't display correctly in compose. Now for the other problem: none of the sites I found providing a solution actually explain properly what is required, because the codes they recommend get translated, and therefore you see the results, not the recommended codes. The results and codes are:

< is displayed by &lt;
> is displayed by &gt;
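The same entity encoding is available programmatically; for instance, Python's standard html module will escape a fragment for display in a post and decode it again (sketched here on the node example from above):

```python
from html import escape, unescape

snippet = '<node id="1" x="2" y="5" z="0" />'
encoded = escape(snippet)  # < -> &lt;, > -> &gt;, " -> &quot;
print(encoded)

# Decoding the entities round-trips back to the original markup.
assert unescape(encoded) == snippet
```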