XML was not always a silver bullet.

I wrote a simulator for another class this semester and tried to use an XML format for its input file. The simulator first reads the graph topology and then has to parse it. Compare the two formats below, which describe the same graph.

:: GraphML (Standard XML format for describing graph data structure) ::

<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<graph edgedefault="undirected" parse.nodes="10000" parse.edges="20000">
<node id="0" />
<node id="1" />
<node id="2" />
…..
<node id="9997" />
<node id="9998" />
<node id="9999" />
<edge source="2" target="1" />
<edge source="2" target="0" />
<edge source="3" target="1" />
…..
<edge source="0" target="8068" />
<edge source="1" target="9731" />
<edge source="1" target="5549" />
</graph>
</graphml>

:: Normal text format ::

Topology: ( 10000 Nodes, 20000 Edges )
Model (1 - RTWaxman)

Nodes: ( 10000 )
0
1
2
…..
9997
9998
9999

Edges: ( 20000 )
0    2    1
1    2    0
2    3    1
…..
19997    0    8068
19998    1    9731
19999    1    5549

The second format was much better in terms of both file size and parsing speed. The XML format spent too much of its bulk on structural metadata wrapped around the data. When data will only be used within a limited domain, the cost of structuring and standardizing it can outweigh the benefit of doing so.
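To make the size difference concrete, here is a minimal Python sketch that writes the same tiny graph in both styles and reads each one back. It is not my simulator's code: the function names (`to_graphml`, `to_plain`, `parse_graphml`, `parse_plain`) and the four-node graph are made up for illustration, and the plain-text writer only imitates the Nodes/Edges sections shown above.

```python
import xml.etree.ElementTree as ET

# GraphML namespace, as it appears in the listing above.
NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(n_nodes, edges):
    """Serialize a graph as GraphML-style XML (hypothetical helper)."""
    lines = [
        f'<graphml xmlns="{NS}">',
        f'<graph edgedefault="undirected" parse.nodes="{n_nodes}" '
        f'parse.edges="{len(edges)}">',
    ]
    lines += [f'<node id="{i}" />' for i in range(n_nodes)]
    lines += [f'<edge source="{s}" target="{t}" />' for s, t in edges]
    lines += ["</graph>", "</graphml>"]
    return "\n".join(lines)


def to_plain(n_nodes, edges):
    """Serialize the same graph in the plain-text layout (hypothetical helper)."""
    lines = [f"Nodes: ( {n_nodes} )"]
    lines += [str(i) for i in range(n_nodes)]
    lines += ["", f"Edges: ( {len(edges)} )"]
    lines += [f"{i}    {s}    {t}" for i, (s, t) in enumerate(edges)]
    return "\n".join(lines)


def parse_graphml(text):
    """Return the edge list from GraphML text."""
    root = ET.fromstring(text)
    return [
        (int(e.get("source")), int(e.get("target")))
        for e in root.iter(f"{{{NS}}}edge")
    ]


def parse_plain(text):
    """Return the edge list from the plain-text layout."""
    edges = []
    in_edges = False
    for line in text.splitlines():
        if line.startswith("Edges:"):
            in_edges = True
        elif in_edges and line.strip():
            # Each edge line is: edge id, source, target.
            _, s, t = line.split()
            edges.append((int(s), int(t)))
    return edges


# A tiny example graph (assumed data, not taken from the real topology file).
edges = [(2, 1), (2, 0), (3, 1)]
graphml_text = to_graphml(4, edges)
plain_text = to_plain(4, edges)

print("GraphML bytes:", len(graphml_text.encode("utf-8")))
print("Plain bytes:  ", len(plain_text.encode("utf-8")))
```

Both parsers recover the same edge list, but every node and edge in the XML version carries its tag and attribute names along with it, which is where the extra bytes come from.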

This case reminded me of Svenonius's warning that “putting infinite number of metadata to data is economically impossible,” although my case did not involve an “infinite” amount of metadata. Either way, I experienced the tradeoff between IO and IR again.


Bad IO/IR Case That I Found

I recently developed a Ruby wrapper for the API of one of the largest internet portals in Korea. Most of their RESTful APIs are well organized as RSS or XML, but there was one interestingly bad case.

http://dev.naver.com/openapi/sample/rank.xml

If you open the link above (please ignore the Korean parts), you will see an XML file. It lists the keywords people are searching for most right now, ordered by rank. Look at the tags wrapping each keyword: the element names are “R1”, “R2”, “R3”, and so on, so the rank is encoded in the element name itself. From the perspective of a 202er, it should be restructured into something like the following.

<result>
<items>
<item>
<rank>1</rank>
<keyword>ischool</keyword>
<change>+32</change>
</item>
…..
</items>
</result>

Or, at the very least, they should describe the rank with an attribute instead of encoding it in the element name.
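To show why this matters for whoever consumes the feed, here is a rough Python sketch comparing the two styles. Both XML strings are hypothetical samples I made up to imitate the shapes discussed above. They are not the actual API response, and the function names are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: rank encoded in the element name (R1, R2, ...).
RANK_IN_TAG = "<result><R1>ischool</R1><R2>graphml</R2></result>"

# Hypothetical sample: rank carried as an attribute on a uniform element.
RANK_AS_ATTR = (
    '<result><item rank="1">ischool</item>'
    '<item rank="2">graphml</item></result>'
)


def parse_rank_in_tag(text):
    """The consumer has to pattern-match tag names to recover the rank."""
    out = []
    for el in ET.fromstring(text):
        m = re.fullmatch(r"R(\d+)", el.tag)
        if m:
            out.append((int(m.group(1)), el.text))
    return out


def parse_rank_as_attr(text):
    """One element name, so a single query finds every entry."""
    return [
        (int(el.get("rank")), el.text)
        for el in ET.fromstring(text).iter("item")
    ]


print(parse_rank_in_tag(RANK_IN_TAG))
print(parse_rank_as_attr(RANK_AS_ATTR))
```

With rank-bearing element names, the set of valid tags is open-ended, so a schema cannot enumerate them and the parser needs string matching on tag names. With a single `item` element, the rank is ordinary data.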
