I looked around for off-line editors, but none of the ones I could find seemed to be worthy of the title editor, as they are mostly for creation and posting of new posts, rather than editing and managing existing. What I wanted was to extract the individual posts and edit and possibly update the original posts or post again as a new post: noting that this blog is catharsis and basic brain dump of ideas.
Any case more recently I have been experimenting with the MSXML COM automation object, this I used in my simple barrier program to provide the data for the drop down list and associated load tables. So having had some success programming MSXML in both MS Excel/vba and then translating to vb.net, I thought I would take a look at ripping the blogger atom xml file apart with MSXML. {yeh! I'm now aware that I probably shouldn't use MSXML in vb.net}
Which turned out not to be all that complicated in the first instance. The xml file contains a lot of elements called <entry>, not all of these are posts. From looking at the file with XML notepad as far as I could tell those posts which have element <type> with value "html" are the posts and pages of the blog. The following code extracts the appropriate nodes.
Set nList = xmldoc.selectNodes("//entry[content/@type='html']")
Admittedly working out how to select the appropriate nodes wasn't so simple. Currently having problems understanding what I actually grab from the xml data, and what the text property is going to return, or value property returns.
Having got these nodes the next step was to find the child nodes with the required data and then write to either a plain text file or html file. For the initial test I just grabbed the data and wrote to a plain text file. Before writing the program I also copy/pasted the content node data using XML notepad and pasted to notepad and saved as simple text, renamed and viewed in google chrome. The whole xml file can also be viewed in Google chrome: writing a chrome application may be a alternative possibility for extracting the blog posts, but creating local files with a browser is slightly problematic
.
Once confirmed that could grab the posts, I added a few lines of additional code to create a more completely formatted individual html file. Only I was missing the post titles and getting those was more problematic. Well! initially I had everything for the <entry> element printed in the individual files, or otherwise just grabbed the content. So more research on XPath at w3chools.com, and more experimenting with select nodes, to arrive at the search string shown above.
With that search string have to process the child nodes of the node list to get to the <content> and <title> elements. Once I got the titles back into the posts, I figured I had a lot of files and I didn't know what they were as I merely gave them numbers. So I further modified the program to create an index file. So having got all that working on Friday, I figured the next step would be to do something with the categories or keywords assigned to each post. This posed another problem as not every <category> element contains a keyword, nor does it have a value, it only has attributes. Trying to find information on how to access the attributes wasn't easy.
Anycase there are two attributes scheme and term. It seems that "term" contains the keyword if the "scheme" attribute contains "http://www.blogger.com/atom/ns#". Therefore needed to be able to test the value of the attributes and extract the keywords.
Having figured out how to do this, I wrote the keywords to a table in Excel and counted the frequency of use in the blog posts. The code I used is as follows, it will need modifying to use elsewhere, but it my be useful as outline of basic process of getting information from the blogger atom xml. Note that some of the code has wrapped to display into the width of the blog post, and the code needs to be on a single line.
Sub xmlSimpleList2() Dim wrkbk As Workbook Dim wrkSht As Worksheet Dim WrkRng As Range Dim fpath1 As String Dim keyWordTable As Range Dim keyWord As String Dim xmldoc As MSXML2.DOMDocument Dim xmlNode As MSXML2.IXMLDOMNode Dim tmpNode As MSXML2.IXMLDOMNode Dim nList As MSXML2.IXMLDOMNodeList Dim tmpList As MSXML2.IXMLDOMNodeList Dim oNode As MSXML2.IXMLDOMNode Dim oAttributes As MSXML2.IXMLDOMNamedNodeMap Dim tagName As String Dim tagValue As String Dim i As Integer, j As Integer Dim fp As Integer, fp1 As Integer Set wrkbk = ThisWorkbook Set wrkSht = wrkbk.Worksheets("xmlData") Set WrkRng = wrkSht.Range("A1:A1") Set keyWordTable = ThisWorkbook.Worksheets("keyWords").Range("A1:A1") 'row zero is header Set xmldoc = New MSXML2.DOMDocument xmldoc.async = False If xmldoc.Load(myDocName) Then fp1 = fopen(ThisWorkbook.Path & "\BlogPosts\" & "index.htm", "wt") Print #fp1, "<html>" Print #fp1, "<body>" Print #fp1, "<ol>" 'Test XPath i = 1 j = 1 ' Set nList = xmldoc.selectNodes("//content[@type='html']") Set nList = xmldoc.selectNodes("//entry[content/@type='html']") ' Final text entity is part of the tree, but attributes are not For Each oNode In nList tagName = oNode.nodeName ' Debug.Print i, tagName WrkRng.Offset(i, 0).Value = tagName fpath1 = ThisWorkbook.Path & "\BlogPosts\" & "post" & StrLPadc(CStr(i), 4, "0") & ".htm" fp = fopen(fpath1, "wt") Print #fp, "<html>" Print #fp, "<body>" If oNode.hasChildNodes Then ' Debug.Print "Number of child Nodes: " & oNode.childNodes.Length For Each xmlNode In oNode.childNodes Select Case UCase(xmlNode.nodeName) Case UCase("title") Debug.Print xmlNode.Text WrkRng.Offset(i, 1).Value = xmlNode.Text Print #fp, "<h1>" Print #fp, xmlNode.Text Print #fp, "</h1>" Print #fp1, "<li><a href=" & """" & fpath1 & """" & ">" & xmlNode.Text & "</a></li>" Case UCase("content") Debug.Print "content" Print #fp, xmlNode.Text Case UCase("category") Debug.Print "category" Set oAttributes = xmlNode.Attributes Set tmpNode = oAttributes.getNamedItem("scheme") If tmpNode.Text = bloggerNameSpace Then j = j + 1 Set tmpNode = oAttributes.getNamedItem("term") keyWord = tmpNode.Text WrkRng.Offset(i, j).Value = keyWord Call updateKeyWordList(keyWordTable, keyWord) End If End Select Next xmlNode End If Print #fp, "</body>" Print #fp, "</html>" Close #fp i = i + 1 j = 1 Next oNode Print #fp1, "</ol>" Print #fp1, "</body>" Print #fp1, "</html>" Close #fp1 End If 'xml doc loaded Set xmldoc = Nothing Debug.Print "All Done!" End Sub