xml.html revision 4c3a2030db396850607fca7e975f31b34ac2b423
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
2                      "http://www.w3.org/TR/REC-html40/loose.dtd">
3<html>
4<head>
5  <title>The XML library for Gnome</title>
6  <meta name="GENERATOR" content="amaya V2.2">
7</head>
8
9<body bgcolor="#ffffff">
10<h1 align="center">The XML library for Gnome</h1>
11
12<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2>
13
14<p></p>
15<ul>
16  <li><a href="#Introducti">Introduction</a></li>
17  <li><a href="#Documentat">Documentation</a></li>
18  <li><a href="#News">News</a></li>
19  <li><a href="#XML">XML</a></li>
20  <li><a href="#tree">The tree output</a></li>
21  <li><a href="#interface">The SAX interface</a></li>
22  <li><a href="#library">The XML library interfaces</a>
23    <ul>
24      <li><a href="#Invoking">Invoking the parser</a></li>
25      <li><a href="#Building">Building a tree from scratch</a></li>
26      <li><a href="#Traversing">Traversing the tree</a></li>
27      <li><a href="#Modifying">Modifying the tree</a></li>
28      <li><a href="#Saving">Saving the tree</a></li>
29      <li><a href="#Compressio">Compression</a></li>
30    </ul>
31  </li>
32  <li><a href="#Entities">Entities or no entities</a></li>
33  <li><a href="#Namespaces">Namespaces</a></li>
34  <li><a href="#Validation">Validation</a></li>
35  <li><a href="#Principles">DOM principles</a></li>
36  <li><a href="#real">A real example</a></li>
37</ul>
38
39<h2><a name="Introducti">Introduction</a></h2>
40
<p>This document describes the <a href="http://www.w3.org/XML/">XML</a>
library provided in the <a href="http://www.gnome.org/">Gnome</a> framework.
XML is a standard for building tag-based structured documents and data.</p>
44
<p>The internal document representation is as close as possible to the <a
href="http://www.w3.org/DOM/">DOM</a> interfaces.</p>
47
<p>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX
interface</a>. <a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
documentation</a> explaining how to use it. The interface is as compatible as
possible with that of <a href="http://www.jclark.com/xml/expat.html">Expat</a>.</p>
54
<p>There is also a mailing list, <a
href="mailto:xml@rufus.w3.org">xml@rufus.w3.org</a>, for libxml, with an <a
href="http://rpmfind.net/veillard/XML/messages">on-line archive</a>. To
subscribe to this majordomo-based list, send a mail to <a
href="mailto:majordomo@rufus.w3.org">majordomo@rufus.w3.org</a> with
"subscribe xml" in the <strong>content</strong> of the message.</p>
61
<p>This library is released under both the W3C Copyright and the GNU LGPL.
Basically everybody should be happy; if not, drop me a mail.</p>
64
<p>People are invited to use the <a
href="http://cvs.gnome.org/lxr/source/gdome/">gdome Gnome module</a> to get a
full DOM interface, thanks to <a href="mailto:raph@levien.com">Raph
Levien</a>; check his <a
href="http://www.levien.com/gnome/domination.html">DOMination paper</a>. He
uses it for his implementation of <a
href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a
href="http://www.levien.com/svg/">gill</a>.</p>
73
74<h2><a name="Documentat">Documentation</a></h2>
75
<p>The code is commented in a way that allows <a
href="http://rpmfind.net/veillard/XML/libxml.html">extensive documentation</a>
to be automatically extracted.</p>
79
80<p>At some point I will change the back-end to produce XML documentation in
81addition to SGML Docbook and HTML.</p>
82
83<h3>Reporting bugs</h3>
84
<p>Bugs and missing features are always possible, and I will make a point
of fixing them in a timely fashion. The best way to report them is to <a
href="http://bugs.gnome.org/db/pa/lgnome-xml.html">use the Gnome bug tracking
database</a>. I look at reports there regularly, and it's good to have a
reminder when a bug is still open. Check the <a
href="http://bugs.gnome.org/Reporting.html">instructions on reporting bugs</a>
and be sure to specify that the bug is for the package gnome-xml.</p>
92
<p>Alternatively, you can just send the bug to the <a
href="mailto:xml@rufus.w3.org">xml@rufus.w3.org</a> list.</p>
95
96<h2><a name="News">News</a></h2>
97
<p>The latest version is 1.7.3. You can find it on <a
href="ftp://rpmfind.net/pub/veillard/">rpmfind.net</a> or on the <a
href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a>, either
as a <a href="ftp://ftp.gnome.org/pub/GNOME/sources/libxml/">source
archive</a> or as <a href="ftp://ftp.gnome.org/pub/GNOME/contrib/rpms/">RPM
packages</a>.</p>
104
105<h3>CVS only</h3>
106<ul>
107  <li>Attribute normalization fix</li>
108  <li>configure patched to compile properly on HP-UX with the native
109  compiler</li>
110</ul>
111
112<h3>1.7.4: Oct 25 1999</h3>
113<ul>
114  <li>Lots of HTML improvement</li>
115  <li>Fixed some errors when saving both XML and HTML</li>
116  <li>More examples, the regression tests should now look clean</li>
117  <li>Fixed a bug with contiguous charref</li>
118</ul>
119
120<h3>1.7.3: Sep 29 1999</h3>
121<ul>
122  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available; fixed</li>
125</ul>
126
127<h3>1.7.1: Sep 24 1999</h3>
128<ul>
  <li>The basic type for strings manipulated by libxml has been renamed in
    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
    is that CHAR was conflicting with a predefined type on Windows. However, on
    non-WIN32 environments, compatibility is provided by way of a
    <strong>#define</strong>.</li>
  <li>Fixed another error: the use of a structure field called errno, which
    led to trouble on platforms where it's a macro</li>
136</ul>
137
<h3>1.7.0: Sep 23 1999</h3>
139<ul>
140  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
141    href="gnome-xml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by means other than a simple
    printf-like callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
145  <li>Serious memory leaks fixed (there is now a <a
146    href="gnome-xml-xmlmemory.html">memory wrapper</a> module)</li>
147  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
148    implementation</li>
149  <li>Added an HTML parser front-end</li>
150</ul>
151
152<h2><a name="XML">XML</a></h2>
153
154<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for markup
155based structured documents, here is <a name="example">an example</a>:</p>
156<pre>&lt;?xml version="1.0"?>
157&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too">
158  &lt;head>
159   &lt;title>Welcome to Gnome&lt;/title>
160  &lt;/head>
161  &lt;chapter>
162   &lt;title>The Linux adventure&lt;/title>
163   &lt;p>bla bla bla ...&lt;/p>
164   &lt;image href="linus.gif"/>
165   &lt;p>...&lt;/p>
166  &lt;/chapter>
167&lt;/EXAMPLE></pre>
168
<p>The first line specifies that it's an XML document and gives useful
information about its encoding. The rest of the document is text whose
structure is specified by tags between angle brackets. <strong>Each opened tag
has to be closed</strong>; XML is pedantic about this. Note that, for example,
the image tag has no content (just an attribute) and is closed by ending the
tag with <code>/></code>.</p>
175
<p>XML can be applied successfully to a wide range of uses, from long-term
structured document maintenance, where it follows in the steps of SGML, to
simple data encoding mechanisms like configuration file formats (glade),
spreadsheets (gnumeric), or even shorter-lived documents like those in WebDAV,
where it is used to encode remote calls between a client and a server.</p>
181
182<h2><a name="tree">The tree output</a></h2>
183
<p>The parser returns a tree built during the document analysis. The value
returned is an <strong>xmlDocPtr</strong> (i.e. a pointer to an
<strong>xmlDoc</strong> structure). This structure contains information like
the file name, the document type, and a <strong>root</strong> pointer which
is the root of the document (or more exactly the first child under the root,
which is the document). The tree is made of <strong>xmlNode</strong>s, chained
in doubly linked lists of siblings and with child&lt;->parent relationships.
An xmlNode can also carry properties (a chain of xmlAttr structures). An
attribute may have a value which is a list of TEXT or ENTITY_REF nodes.</p>
193
194<p>Here is an example (erroneous w.r.t. the XML spec since there should be
195only one ELEMENT under the root):</p>
196
197<p><img src="structure.gif" alt=" structure.gif "></p>
198
<p>In the source package there is a small program (not installed by default)
called <strong>tester</strong> which parses XML files given as arguments and
prints them back as parsed. This is useful to detect errors both in XML code
and in the XML parser itself. It has an option <strong>--debug</strong> which
prints the actual in-memory structure of the document. Here is the result with
the <a href="#example">example</a> given before:</p>
<pre>DOCUMENT
version=1.0
standalone=true
  ELEMENT EXAMPLE
    ATTRIBUTE prop1
      TEXT
      content=gnome is great
    ATTRIBUTE prop2
      ENTITY_REF
      TEXT
      content= too
    ELEMENT head
      ELEMENT title
        TEXT
        content=Welcome to Gnome
    ELEMENT chapter
      ELEMENT title
        TEXT
        content=The Linux adventure
      ELEMENT p
        TEXT
        content=bla bla bla ...
      ELEMENT image
        ATTRIBUTE href
          TEXT
          content=linus.gif
      ELEMENT p
        TEXT
        content=...</pre>
234
235<p>This should be useful to learn the internal representation model.</p>
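<p>As a rough illustration of the linking described above, the following C
sketch builds a tiny tree by hand. The <code>toyNode</code> type and the
<code>toyAddChild</code> and <code>toyCountChildren</code> helpers are
hypothetical names invented for this example. They only mimic the parent,
childs, next and prev links and are not the real xmlNode layout.</p>

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for xmlNode: not the real libxml layout. */
typedef struct toyNode {
    const char *name;          /* element name, or NULL for a text node */
    const char *content;       /* text content, or NULL for an element */
    struct toyNode *parent;    /* owning element */
    struct toyNode *childs;    /* first child */
    struct toyNode *next;      /* next sibling */
    struct toyNode *prev;      /* previous sibling */
} toyNode;

/* Append child at the end of parent's sibling chain, keeping all links in sync. */
static void toyAddChild(toyNode *parent, toyNode *child) {
    toyNode *last;

    child->parent = parent;
    child->next = NULL;
    if (parent->childs == NULL) {
        child->prev = NULL;
        parent->childs = child;
        return;
    }
    last = parent->childs;
    while (last->next != NULL)
        last = last->next;
    last->next = child;
    child->prev = last;
}

/* Count the direct children of a node by walking the sibling chain. */
static int toyCountChildren(const toyNode *node) {
    const toyNode *cur;
    int n = 0;

    for (cur = node->childs; cur != NULL; cur = cur->next)
        n++;
    return n;
}
```

<p>Once two children have been appended to a root in order, the root's
<code>childs</code> field is the first child, and following <code>next</code>
from it reaches the second one.</p>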
236
237<h2><a name="interface">The SAX interface</a></h2>
238
<p>Sometimes the DOM tree output is just too large to fit reasonably into
memory. In that case, and if you don't expect to save back the XML document
loaded using libxml, it's better to use the SAX interface of libxml. SAX is a
<strong>callback-based interface</strong> to the parser. Before parsing, the
application layer registers a customized set of callbacks which will be called
by the library as it progresses through the XML input.</p>
245
<p>For more detailed step-by-step guidance on using the SAX interface of
libxml, see the <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice
documentation</a> written by <a href="mailto:james@daa.com.au">James
Henstridge</a>.</p>
250
<p>You can debug the SAX behaviour by using the <strong>testSAX</strong>
program located in the gnome-xml module (it's usually not shipped in the
binary packages of libxml, but you can find it in the tar source
distribution). Here is the sequence of callbacks that would be generated when
parsing the example given before, as reported by testSAX:</p>
<pre>SAX.setDocumentLocator()
SAX.startDocument()
SAX.getEntity(amp)
SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp;amp; linux too')
SAX.characters(   , 3)
SAX.startElement(head)
SAX.characters(    , 4)
SAX.startElement(title)
SAX.characters(Welcome to Gnome, 16)
SAX.endElement(title)
SAX.characters(   , 3)
SAX.endElement(head)
SAX.characters(   , 3)
SAX.startElement(chapter)
SAX.characters(    , 4)
SAX.startElement(title)
SAX.characters(The Linux adventure, 19)
SAX.endElement(title)
SAX.characters(    , 4)
SAX.startElement(p)
SAX.characters(bla bla bla ..., 15)
SAX.endElement(p)
SAX.characters(    , 4)
SAX.startElement(image, href='linus.gif')
SAX.endElement(image)
SAX.characters(    , 4)
SAX.startElement(p)
SAX.characters(..., 3)
SAX.endElement(p)
SAX.characters(   , 3)
SAX.endElement(chapter)
SAX.characters( , 1)
SAX.endElement(EXAMPLE)
SAX.endDocument()</pre>
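<p>To give a feel for the callback style, here is a minimal C sketch that
gathers statistics from a stream of events without building any tree. All the
names (<code>toyHandler</code>, <code>toyStats</code>, <code>toyReplay</code>
and the three <code>count</code> callbacks) are hypothetical. The real libxml
SAX handler has many more entries and different signatures, and
<code>toyReplay</code> merely stands in for the parser by replaying a fixed
sequence of events.</p>

```c
#include <stddef.h>

/* Hypothetical callback table; the real libxml SAX handler has many more entries. */
typedef struct toyHandler {
    void (*startElement)(void *ctx, const char *name);
    void (*endElement)(void *ctx, const char *name);
    void (*characters)(void *ctx, const char *text, int len);
} toyHandler;

/* Application state updated by the callbacks: no tree is ever built. */
typedef struct toyStats {
    int elements;   /* number of startElement events seen */
    int depth;      /* current nesting depth */
    int maxDepth;   /* deepest nesting reached */
    int textBytes;  /* total length reported to characters() */
} toyStats;

static void countStart(void *ctx, const char *name) {
    toyStats *s = (toyStats *) ctx;
    (void) name;
    s->elements++;
    if (++s->depth > s->maxDepth)
        s->maxDepth = s->depth;
}

static void countEnd(void *ctx, const char *name) {
    toyStats *s = (toyStats *) ctx;
    (void) name;
    s->depth--;
}

static void countChars(void *ctx, const char *text, int len) {
    toyStats *s = (toyStats *) ctx;
    (void) text;
    s->textBytes += len;
}

/* Stand-in for the parser: replays the events for a head element holding a title. */
static void toyReplay(const toyHandler *h, void *ctx) {
    h->startElement(ctx, "head");
    h->startElement(ctx, "title");
    h->characters(ctx, "Welcome to Gnome", 16);
    h->endElement(ctx, "title");
    h->endElement(ctx, "head");
}
```

<p>The application fills a <code>toyHandler</code> with its three callbacks,
passes a zeroed <code>toyStats</code> as the context, and reads the counters
once the replay is over. Memory use stays constant whatever the size of the
input, which is the point of the SAX approach.</p>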
290
<p>Most of the other functionality of libxml is based on the DOM tree
building facility, so nearly everything up to the end of this document
presupposes the use of the standard DOM tree build. Note that the DOM tree
itself is built by a set of registered default callbacks, without any
internal specific interface.</p>
296
297<h2><a name="library">The XML library interfaces</a></h2>
298
<p>This section is intended to help programmers get bootstrapped
using the XML library from the C language. It isn't intended to be exhaustive;
I hope the automatically generated docs will provide the completeness
required, but as a separate set of documents. The interfaces of the XML
library are by principle low level; there is nearly zero abstraction. Those
interested in a higher level API should <a href="#DOM">look at DOM</a>.</p>
305
306<h3><a name="Invoking">Invoking the parser</a></h3>
307
<p>Usually, the first thing to do is to read an XML input. The parser can
parse both documents held in memory and files. The functions are
defined in "parser.h":</p>
311<dl>
312  <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt>
313    <dd><p>parse a zero terminated string containing the document</p>
314    </dd>
315</dl>
316<dl>
317  <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt>
318    <dd><p>parse an XML document contained in a file (possibly compressed)</p>
319    </dd>
320</dl>
321
322<p>This returns a pointer to the document structure (or NULL in case of
323failure).</p>
324
<p>A couple of comments can be made. This means that the parser is
memory-hungry, first to load the document in memory, second to build the tree.
Reading a document without building the tree will be possible in the future by
plugging the code into the SAX interface (see SAX.c).</p>
329
330<h3><a name="Building">Building a tree from scratch</a></h3>
331
<p>The other way to get an XML tree in memory is by building it. There is a
set of functions dedicated to building new elements, described in "tree.h".
Here is, for example, the piece of code producing the example used before:</p>
<pre>    xmlDocPtr doc;
    xmlNodePtr tree, subtree;

    doc = xmlNewDoc("1.0");
    doc->root = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL);
    xmlSetProp(doc->root, "prop1", "gnome is great");
    xmlSetProp(doc->root, "prop2", "&amp;linux; too");
    tree = xmlNewChild(doc->root, NULL, "head", NULL);
    subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome");
    tree = xmlNewChild(doc->root, NULL, "chapter", NULL);
    subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure");
    subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ...");
    subtree = xmlNewChild(tree, NULL, "image", NULL);
    xmlSetProp(subtree, "href", "linus.gif");</pre>
350
351<p>Not really rocket science ...</p>
352
353<h3><a name="Traversing">Traversing the tree</a></h3>
354
<p>By including "tree.h" your code has access to the internal
structure of all the elements of the tree. The field names should be fairly
self-explanatory, like <strong>parent</strong>, <strong>childs</strong>,
<strong>next</strong>, <strong>prev</strong>, <strong>properties</strong>,
etc. For example, still with the previous example:</p>
<pre><code>doc->root->childs->childs</code></pre>

<p>points to the title element,</p>
<pre>doc->root->childs->next->childs->childs</pre>

<p>points to the text node containing the chapter title "The Linux adventure",
and</p>
<pre>doc->root->properties->next->val</pre>
368
369<p>points to the entity reference containing the value of "&amp;linux" at the
370beginning of the second attribute of the root element "EXAMPLE".</p>
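<p>Fixed pointer chains like the ones above depend on the exact position of
each node, so they break as soon as an extra node (a whitespace text node, for
instance) appears among the siblings. A more tolerant approach is to walk the
sibling list and match on the name. The sketch below assumes a hypothetical
<code>walkNode</code> type that reuses the <code>childs</code> and
<code>next</code> field names; <code>walkFindChild</code> and
<code>walkFirstText</code> are invented helpers, not libxml functions.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node reusing the link names from the text; not the real xmlNode. */
typedef struct walkNode {
    const char *name;            /* element name, NULL for text nodes */
    const char *content;         /* text, NULL for elements */
    struct walkNode *childs;     /* first child */
    struct walkNode *next;       /* next sibling */
} walkNode;

/* Return the first direct child element of node called name, or NULL. */
static walkNode *walkFindChild(const walkNode *node, const char *name) {
    walkNode *cur;

    if (node == NULL)
        return NULL;
    for (cur = node->childs; cur != NULL; cur = cur->next) {
        if (cur->name != NULL && strcmp(cur->name, name) == 0)
            return cur;
    }
    return NULL;
}

/* Return the content of the first text child of node, or NULL if there is none. */
static const char *walkFirstText(const walkNode *node) {
    const walkNode *cur;

    if (node == NULL)
        return NULL;
    for (cur = node->childs; cur != NULL; cur = cur->next) {
        if (cur->name == NULL)
            return cur->content;
    }
    return NULL;
}
```

<p>Because <code>walkFindChild</code> accepts a NULL node, lookups can be
chained: a missing intermediate element simply yields NULL at the end instead
of a crash.</p>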
371
372<h3><a name="Modifying">Modifying the tree</a></h3>
373
<p>Functions are provided to read and write the document content:</p>
375<dl>
376  <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const
377  xmlChar *value);</code></dt>
    <dd><p>This sets (or changes) an attribute carried by an ELEMENT node. The
      value can be NULL.</p>
380    </dd>
381</dl>
382<dl>
383  <dt><code>const xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar
384  *name);</code></dt>
385    <dd><p>This function returns a pointer to the property content, note that
386      no extra copy is made</p>
387    </dd>
388</dl>
389
<p>Two functions must be used to read and write the text associated with
elements:</p>
392<dl>
393  <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar
394  *value);</code></dt>
    <dd><p>This function takes an "external" string and converts it to one text
      node or possibly to a list of entity and text nodes. All non-predefined
      entity references like &amp;Gnome; will be stored internally as
      entity nodes, hence the result of the function may not be a single
      node.</p>
400    </dd>
401</dl>
402<dl>
403  <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int
404  inLine);</code></dt>
    <dd><p>This is the dual function, which generates a new string containing
      the content of the text and entity nodes. Note the extra argument
      inLine: if set to 1, instead of returning the &amp;Gnome; XML encoding in
      the string it will substitute it with its value, say "GNU Network Object
      Model Environment". Set it if you want to use the string for non-XML
      usage like a user interface.</p>
411    </dd>
412</dl>
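<p>The effect of the inLine argument can be pictured with the following C
sketch. It is only a model of the behaviour described above, not the libxml
implementation: <code>toyItem</code>, <code>toyKind</code> and
<code>toyListGetString</code> are hypothetical names, and the list is a plain
chain of text chunks and entity references.</p>

```c
#include <stdlib.h>
#include <string.h>

typedef enum { TOY_TEXT, TOY_ENTITY_REF } toyKind;

/* Hypothetical list item: a text chunk, or a reference to a named entity. */
typedef struct toyItem {
    toyKind kind;
    const char *name;      /* entity name, for TOY_ENTITY_REF only */
    const char *content;   /* text, or the entity's replacement value */
    struct toyItem *next;
} toyItem;

/*
 * Build a newly allocated string from the list. With inLine set, entity
 * references are replaced by their value; otherwise they are written back
 * as &name;. The caller frees the result. Returns NULL if malloc fails.
 */
static char *toyListGetString(const toyItem *list, int inLine) {
    const toyItem *cur;
    size_t len = 0;
    char *ret;

    /* First pass: compute the exact length needed. */
    for (cur = list; cur != NULL; cur = cur->next) {
        if (cur->kind == TOY_ENTITY_REF && !inLine)
            len += strlen(cur->name) + 2;      /* '&' and ';' */
        else
            len += strlen(cur->content);
    }
    ret = (char *) malloc(len + 1);
    if (ret == NULL)
        return NULL;
    ret[0] = '\0';
    /* Second pass: concatenate the pieces. */
    for (cur = list; cur != NULL; cur = cur->next) {
        if (cur->kind == TOY_ENTITY_REF && !inLine) {
            strcat(ret, "&");
            strcat(ret, cur->name);
            strcat(ret, ";");
        } else {
            strcat(ret, cur->content);
        }
    }
    return ret;
}
```

<p>Called on a reference to Gnome followed by some text, the function gives
back the &amp;Gnome; form when inLine is 0 and the expanded text when inLine
is 1. The first form is the one to save back to XML, the second the one to
show to a user.</p>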
413
414<h3><a name="Saving">Saving a tree</a></h3>
415
416<p>Basically 3 options are possible:</p>
417<dl>
418  <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar**mem, int
419  *size);</code></dt>
420    <dd><p>returns a buffer where the document has been saved</p>
421    </dd>
422</dl>
423<dl>
424  <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt>
    <dd><p>dumps the document to an open file stream</p>
426    </dd>
427</dl>
428<dl>
429  <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt>
    <dd><p>saves the document to a file. In that case the compression interface
      is triggered if turned on</p>
432    </dd>
433</dl>
434
435<h3><a name="Compressio">Compression</a></h3>
436
<p>The library transparently handles compression when doing file-based
accesses. The level of compression on saves can be tuned either globally or
individually for one file:</p>
440<dl>
441  <dt><code>int  xmlGetDocCompressMode (xmlDocPtr doc);</code></dt>
442    <dd><p>Get the document compression ratio (0-9)</p>
443    </dd>
444</dl>
445<dl>
446  <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt>
447    <dd><p>Set the document compression ratio</p>
448    </dd>
449</dl>
450<dl>
451  <dt><code>int  xmlGetCompressMode(void);</code></dt>
452    <dd><p>Get the default compression ratio</p>
453    </dd>
454</dl>
455<dl>
456  <dt><code>void xmlSetCompressMode(int mode);</code></dt>
457    <dd><p>set the default compression ratio</p>
458    </dd>
459</dl>
460
461<h2><a name="Entities">Entities or no entities</a></h2>
462
<p>The principle behind entities is similar to simple C macros. An entity
defines an abbreviation for a given string that you can reuse many times
throughout the content of your document. Entities are especially useful when a
given string occurs frequently within a document, or to confine the changes
needed to a document to a restricted area in the internal subset of the
document (at the beginning). Example:</p>
<pre>1 &lt;?xml version="1.0"?>
2 &lt;!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
3 &lt;!ENTITY xml "Extensible Markup Language">
4 ]>
5 &lt;EXAMPLE>
6    &amp;xml;
7 &lt;/EXAMPLE></pre>
476
<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing
its name with '&amp;' and following it by ';' without any spaces added. There
are 5 predefined entities in libxml, allowing you to escape characters with
predefined meaning in some parts of the xml document content:
<strong>&amp;lt;</strong> for the character '&lt;', <strong>&amp;gt;</strong> for
the character '>', <strong>&amp;apos;</strong> for the character ''',
<strong>&amp;quot;</strong> for the character '"', and <strong>&amp;amp;</strong>
for the character '&amp;'.</p>
485
<p>One of the problems related to entities is that you may want the parser to
substitute an entity's content so that you can see the replacement text in
your application, or you may prefer to keep entity references as such in the
content to be able to save the document back without losing this usually
precious information (if the user went through the pain of explicitly defining
entities, he may have a rather negative attitude if you blindly substitute
them at saving time). The function <a
href="gnome-xml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a>
allows you to check and change the behaviour, which is to not substitute
entities by default.</p>
496
497<p>Here is the DOM tree built by libxml for the previous document in the
498default case:</p>
<pre>/gnome/src/gnome-xml -> /tester --debug test/ent1
DOCUMENT
version=1.0
   ELEMENT EXAMPLE
     TEXT
     content=
     ENTITY_REF
       INTERNAL_GENERAL_ENTITY xml
       content=Extensible Markup Language
     TEXT
     content=</pre>
510
511<p>And here is the result when substituting entities:</p>
<pre>/gnome/src/gnome-xml -> /tester --debug --noent test/ent1
DOCUMENT
version=1.0
   ELEMENT EXAMPLE
     TEXT
     content=     Extensible Markup Language</pre>
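<p>What substitution amounts to can be sketched in a few lines of C. This is
an illustration only and says nothing about how libxml does it internally: the
<code>toyEntity</code> table and the <code>toySubstitute</code> function are
hypothetical, and the sketch works on a flat string rather than on the parser
input.</p>

```c
#include <string.h>

/* Hypothetical entity table entry; libxml keeps its own internal structures. */
typedef struct toyEntity {
    const char *name;
    const char *value;
} toyEntity;

/*
 * Copy src into dst, replacing each &name; whose name is in the table by its
 * value. Unknown references are copied through unchanged. dst receives at
 * most size-1 characters plus the terminator.
 * Returns 0 on success, -1 if dst is too small.
 */
static int toySubstitute(const char *src, const toyEntity *table, int count,
                         char *dst, size_t size) {
    size_t out = 0;

    if (size == 0)
        return -1;
    while (*src != '\0') {
        const char *chunk = src;   /* by default, copy one character as is */
        size_t len = 1;

        if (*src == '&') {
            const char *end = strchr(src, ';');
            if (end != NULL) {
                size_t nameLen = (size_t) (end - src - 1);
                int i;
                for (i = 0; i < count; i++) {
                    if (strlen(table[i].name) == nameLen &&
                        strncmp(table[i].name, src + 1, nameLen) == 0) {
                        chunk = table[i].value;
                        len = strlen(chunk);
                        src = end;   /* the ';' is skipped by src++ below */
                        break;
                    }
                }
            }
        }
        if (out + len >= size)
            return -1;
        memcpy(dst + out, chunk, len);
        out += len;
        src++;
    }
    dst[out] = '\0';
    return 0;
}
```

<p>With a table holding the xml entity from the example, a string containing
&amp;xml; comes out with "Extensible Markup Language" in its place, while a
reference to an entity missing from the table is left untouched.</p>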
518
<p>So, entities or no entities? Basically it depends on your use case. I
suggest keeping the non-substituting default behaviour and avoiding
entities in your XML document or data if you are not willing to handle the
entity reference elements in the DOM tree.</p>
523
<p>Note that at save time libxml enforces the conversion of the predefined
entities where necessary to prevent well-formedness problems, and will also
transparently replace those with chars (i.e. it will not generate entity
reference elements in the DOM tree nor call the reference() SAX callback when
finding them in the input).</p>
529
530<h2><a name="Namespaces">Namespaces</a></h2>
531
<p>The libxml library implements namespace support by recognizing namespace
constructs in the input, and does namespace lookup automatically when building
the DOM tree. A namespace declaration is associated with an in-memory
structure and all elements or attributes within that namespace point to it.
Hence testing the namespace is a simple and fast equality operation at the
user level.</p>
538
<p>I suggest that people using libxml use a namespace, and declare it on
the root element of their document as the default namespace. Then they don't
need to add the prefix in the content, but we will have a basis for future
semantic refinement and merging of data from different sources. This doesn't
significantly increase the size of the XML output, but significantly increases
its value in the long term.</p>
545
<p>Concerning the namespace value, this has to be a URL, but it doesn't
have to point to any existing resource on the Web. I suggest using a URL
within a domain you control, which makes sense and, if possible, holds some
kind of versioning information. For example,
<code>"http://www.gnome.org/gnumeric/1.0"</code> is a good namespace scheme.
Then when you load a file, make sure that a namespace carrying the
version-independent prefix is installed on the root element of your document,
and if the version information doesn't match something you know, warn the user
and be liberal in what you accept as the input. Also do *not* try to base
namespace checking on the prefix value: &lt;foo:text> may be exactly the same
as &lt;bar:text> in another document. What really matters is the URI
associated with the element or the attribute, not the prefix string, which is
just a shortcut for the full URI.</p>
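<p>The difference between comparing prefixes and comparing namespaces can be
sketched as follows. The <code>toyNs</code> and <code>toyElem</code> types and
the two helpers are hypothetical and much simpler than the structures libxml
uses. They only model the idea that every element points to its namespace
declaration and that the URI, not the prefix, identifies the namespace.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical namespace record: a prefix is only a local shortcut for the URI. */
typedef struct toyNs {
    const char *prefix;   /* may differ from one document to the next */
    const char *href;     /* the URI that actually identifies the namespace */
} toyNs;

typedef struct toyElem {
    const char *name;     /* local name, without prefix */
    const toyNs *ns;      /* namespace declaration the element points to */
} toyElem;

/*
 * Within one document: elements of the same namespace share one declaration,
 * so a pointer comparison is enough.
 */
static int toySameNs(const toyElem *a, const toyElem *b) {
    return a->ns == b->ns;
}

/* Across documents: compare the URI, never the prefix. */
static int toyIsInNamespace(const toyElem *e, const char *href) {
    return e->ns != NULL && e->ns->href != NULL &&
           strcmp(e->ns->href, href) == 0;
}
```

<p>Two elements written as foo:text and bar:text in different documents fail
the pointer test, since they hang off different declarations, yet both pass
the URI test when foo and bar are bound to the same URI.</p>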
559
560<p>@@Interfaces@@</p>
561
562<p>@@Examples@@</p>
563
<p>Usually people object to using namespaces in the case of validation. I
disagree, and will make sure that using namespaces won't break validity
checking, so even if you plan to use or are using validation I strongly
suggest adding namespaces to your document. A default namespace scheme
<code>xmlns="http://...."</code> should not break validity even on less
flexible parsers. Now, using namespaces to mix and differentiate content coming
from multiple DTDs will certainly break current validation schemes. I will try
to provide ways to do this, but this may not be portable or standardized.</p>
572
573<h2><a name="Validation">Validation, or are you afraid of DTDs ?</a></h2>
574
575<p>Well what is validation and what is a DTD ?</p>
576
577<p>Validation is the process of checking a document against a set of
578construction rules, a <strong>DTD</strong> (Document Type Definition) is such
579a set of rules.</p>
580
<p>The validation process and building DTDs are the two most difficult parts
of the XML life cycle. Briefly, a DTD defines all the possible elements to be
found within your document and the formal shape of your document tree (by
defining the allowed content of an element: either text, a regular expression
for the allowed list of children, or mixed content, i.e. both text and
children). The DTD also defines the allowed attributes for all elements and
the types of the attributes. For more detailed information, I suggest reading
the related parts of the XML specification, the examples found under
gnome-xml/test/valid/dtd and the large number of books available on XML. The
dia example in gnome-xml/test/valid should be both simple and complete enough
to allow you to build your own.</p>
592
<p>A word of warning: building a good DTD which will fit the needs of your
application in the long term is far from trivial. However, the extra level of
quality it can ensure is well worth the price for some sets of applications or
if you already have a DTD defined for your application field.</p>
597
<p>The validation is not completely finished but is, in my (very humble)
opinion, in a usable state. Until a real validation interface is defined, the
way to do it is to declare the
<strong>xmlDoValidityCheckingDefaultValue</strong> external variable and set
it to 1. This will of course be changed at some point:</p>
602
<pre>extern int xmlDoValidityCheckingDefaultValue;

...

xmlDoValidityCheckingDefaultValue = 1;</pre>
610
<p>To handle external entities, use the function
<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to
link your HTTP/FTP/Entities database library into the standard libxml
core.</p>
615
616<p>@@interfaces@@</p>
617
618<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2>
619
<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object
Model</em>; this is an API for accessing XML or HTML structured documents.
Native support for DOM in Gnome is on the way (module gnome-dom), and it will
be based on gnome-xml. This will be a far cleaner interface to manipulate XML
files within Gnome since it won't expose the internal structure. DOM defines a
set of IDL (or Java) interfaces allowing you to traverse and manipulate a
document. The DOM library will allow accessing and modifying "live" documents
present in other programs like this:</p>
628
629<p><img src="DOM.gif" alt=" DOM.gif "></p>
630
<p>This should greatly help with things like modifying a gnumeric spreadsheet
embedded in a GWP document, for example.</p>
633
<p>The current DOM implementation on top of libxml is the <a
href="http://cvs.gnome.org/lxr/source/gdome/">gdome Gnome module</a>. This is
a full DOM interface, thanks to <a href="mailto:raph@levien.com">Raph
Levien</a>.</p>
638
<p>The gnome-dom module in the Gnome CVS base is obsolete.</p>
640
641<h2><a name="Example"></a><a name="real">A real example</a></h2>
642
<p>Here is a real-size example, where the actual content of the application
data is not kept in the DOM tree but uses internal structures. It is based on
a proposal to keep a database of jobs related to Gnome, with an XML-based
storage structure. Here is an <a href="gjobs.xml">XML encoded jobs
base</a>:</p>
<pre>&lt;?xml version="1.0"?>
&lt;gjob:Helping xmlns:gjob="http://www.gnome.org/some-location">
  &lt;gjob:Jobs>

    &lt;gjob:Job>
      &lt;gjob:Project ID="3"/>
      &lt;gjob:Application>GBackup&lt;/gjob:Application>
      &lt;gjob:Category>Development&lt;/gjob:Category>

      &lt;gjob:Update>
        &lt;gjob:Status>Open&lt;/gjob:Status>
        &lt;gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST&lt;/gjob:Modified>
        &lt;gjob:Salary>USD 0.00&lt;/gjob:Salary>
      &lt;/gjob:Update>

      &lt;gjob:Developers>
        &lt;gjob:Developer>
        &lt;/gjob:Developer>
      &lt;/gjob:Developers>

      &lt;gjob:Contact>
        &lt;gjob:Person>Nathan Clemons&lt;/gjob:Person>
        &lt;gjob:Email>nathan@windsofstorm.net&lt;/gjob:Email>
        &lt;gjob:Company>
        &lt;/gjob:Company>
        &lt;gjob:Organisation>
        &lt;/gjob:Organisation>
        &lt;gjob:Webpage>
        &lt;/gjob:Webpage>
        &lt;gjob:Snailmail>
        &lt;/gjob:Snailmail>
        &lt;gjob:Phone>
        &lt;/gjob:Phone>
      &lt;/gjob:Contact>

      &lt;gjob:Requirements>
      The program should be released as free software, under the GPL.
      &lt;/gjob:Requirements>

      &lt;gjob:Skills>
      &lt;/gjob:Skills>

      &lt;gjob:Details>
      A GNOME based system that will allow a superuser to configure
      compressed and uncompressed files and/or file systems to be backed
      up with a supported media in the system.  This should be able to
      perform via find commands generating a list of files that are passed
      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine
      or via operations performed on the filesystem itself. Email
      notification and GUI status display very important.
      &lt;/gjob:Details>

    &lt;/gjob:Job>

  &lt;/gjob:Jobs>
&lt;/gjob:Helping></pre>
704
<p>While loading the XML file into an internal DOM tree is a matter of calling
only a couple of functions, browsing the tree to gather the information and
generate the internal structures is harder and more error-prone.</p>
708
<p>The suggested principle is to be tolerant with respect to the input
structure. For example, the ordering of the attributes is not significant; the
XML specification is clear about it. It's also usually a good idea not to
depend on the order of the children of a given node, unless doing so really
makes things harder. Here is some code to parse the information for a person:</p>
<pre>#include &lt;stdio.h>
#include &lt;stdlib.h>
#include &lt;string.h>
#include "parser.h"
#include "tree.h"

/*
 * A person record
 */
typedef struct person {
    char *name;
    char *email;
    char *company;
    char *organisation;
    char *smail;
    char *webPage;
    char *phone;
} person, *personPtr;

/*
 * And the code needed to parse it
 */
personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    personPtr ret = NULL;

    /* DEBUG is assumed to be a tracing macro defined by the application */
    DEBUG("parsePerson\n");
    /*
     * allocate the struct
     */
    ret = (personPtr) malloc(sizeof(person));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(person));

    /* We don't care what the top level element name is */
    cur = cur->childs;
    while (cur != NULL) {
        /* match on name and namespace, whatever the order of the children */
        if ((!strcmp(cur->name, "Person")) &amp;&amp; (cur->ns == ns))
            ret->name = xmlNodeListGetString(doc, cur->childs, 1);
        if ((!strcmp(cur->name, "Email")) &amp;&amp; (cur->ns == ns))
            ret->email = xmlNodeListGetString(doc, cur->childs, 1);
        cur = cur->next;
    }

    return(ret);
}</pre>

<p>Here are a couple of things to notice:</p>
<ul>
  <li>A recursive parsing style is usually the most convenient one, since XML
    data is by nature subject to repetitive constructs and usually exhibits
    highly structured patterns.</li>
  <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em> are
    the pointer to the global XML document and the namespace reserved for the
    application. Document-wide information is needed, for example, to decode
    entities, and it's good coding practice to define a namespace for your
    application's set of data and to test that the elements and attributes
    you're analyzing actually belong to your application space. This is done
    by a simple equality test (cur->ns == ns).</li>
  <li>To retrieve text and attribute values, it is suggested to use the
    function <em>xmlNodeListGetString</em> to gather all the text and entity
    reference nodes generated by the DOM output and produce a single text
    string.</li>
</ul>
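
<p>The last point can be pictured with a toy model. The sketch below is not
libxml code: <em>toyNode</em> and <em>toyListGetString</em> are hypothetical
names, and the node structure is deliberately much simpler than the real one.
It only illustrates the idea of walking a list of text and entity reference
nodes and producing one string, assuming the last argument selects whether
entity references are replaced by their content or kept in their
<em>&amp;name;</em> form.</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node: NOT the libxml node structure. */
typedef struct toyNode {
    int isEntityRef;        /* 0: text node, 1: entity reference node */
    const char *content;    /* the text, or the entity replacement text */
    const char *name;       /* entity name, only used for entity references */
    struct toyNode *next;   /* next sibling */
} toyNode;

/*
 * Concatenate a sibling list into one newly allocated string.
 * If inLine is non-zero, entity references are replaced by their content,
 * otherwise they are written back as &name;. The caller frees the result.
 */
char *toyListGetString(const toyNode *list, int inLine) {
    const toyNode *cur;
    size_t len = 0;
    char *ret, *out;

    /* First pass: compute the size of the result. */
    for (cur = list; cur != NULL; cur = cur->next) {
        if (cur->isEntityRef && !inLine)
            len += strlen(cur->name) + 2;   /* '&' name ';' */
        else
            len += strlen(cur->content);
    }
    ret = malloc(len + 1);
    if (ret == NULL)
        return NULL;

    /* Second pass: copy each piece in document order. */
    out = ret;
    for (cur = list; cur != NULL; cur = cur->next) {
        if (cur->isEntityRef && !inLine)
            out += sprintf(out, "&%s;", cur->name);
        else
            out += sprintf(out, "%s", cur->content);
    }
    *out = '\0';
    return ret;
}
```

<p>With a list made of the text "Smith ", a reference to the <em>amp</em>
entity and the text " Inc.", the inline form yields a single string with the
entity replaced, while the other form keeps the reference as written in the
document.</p>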

<p>Here is another piece of code used to parse another level of the
structure:</p>
<pre>/*
 * a Description for a Job
 */
typedef struct job {
    char *projectID;
    char *application;
    char *category;
    personPtr contact;
    int nbDevelopers;
    personPtr developers[100]; /* using dynamic alloc is left as an exercise */
} job, *jobPtr;

/*
 * And the code needed to parse it
 */
jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    jobPtr ret = NULL;

    DEBUG("parseJob\n");
    /*
     * allocate the struct
     */
    ret = (jobPtr) malloc(sizeof(job));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(job));

    /* We don't care what the top level element name is */
    cur = cur->childs;
    while (cur != NULL) {
        /* The project identifier is stored as an attribute */
        if ((!strcmp(cur->name, "Project")) &amp;&amp; (cur->ns == ns)) {
            ret->projectID = xmlGetProp(cur, "ID");
            if (ret->projectID == NULL) {
                fprintf(stderr, "Project has no ID\n");
            }
        }
        if ((!strcmp(cur->name, "Application")) &amp;&amp; (cur->ns == ns))
            ret->application = xmlNodeListGetString(doc, cur->childs, 1);
        if ((!strcmp(cur->name, "Category")) &amp;&amp; (cur->ns == ns))
            ret->category = xmlNodeListGetString(doc, cur->childs, 1);
        /* The contact is a nested structure, parsed recursively */
        if ((!strcmp(cur->name, "Contact")) &amp;&amp; (cur->ns == ns))
            ret->contact = parsePerson(doc, ns, cur);
        cur = cur->next;
    }

    return(ret);
}</pre>
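
<p>The fixed-size <em>developers</em> array in the job structure is marked as
an exercise. One possible way to do it, shown here as a minimal sketch, is a
growable array that doubles its capacity when full. The names
<em>devList</em>, <em>devListAdd</em> and <em>devListFree</em> are
hypothetical and are not part of the example code; the person type is left
opaque so the fragment stands alone.</p>

```c
#include <stdlib.h>

/* Opaque here; the real definition is the person record of the example. */
typedef struct person person, *personPtr;

/* Hypothetical growable list replacing the fixed developers[100] array. */
typedef struct devList {
    int nb;             /* number of entries in use */
    int max;            /* number of entries allocated */
    personPtr *tab;     /* the entries themselves */
} devList;

/*
 * Append p to the list, growing the array when it is full.
 * Returns 0 on success, -1 if memory is exhausted (list left unchanged).
 */
int devListAdd(devList *l, personPtr p) {
    if (l->nb >= l->max) {
        int newMax = (l->max == 0) ? 4 : l->max * 2;
        personPtr *tmp = realloc(l->tab, newMax * sizeof(personPtr));

        if (tmp == NULL)
            return -1;
        l->tab = tmp;
        l->max = newMax;
    }
    l->tab[l->nb++] = p;
    return 0;
}

/* Release the array only; the persons it points to are freed elsewhere. */
void devListFree(devList *l) {
    free(l->tab);
    l->tab = NULL;
    l->nb = 0;
    l->max = 0;
}
```

<p>A list must start zeroed (for instance through the <em>memset</em> already
done on the job structure) so that the first call allocates the initial
array.</p>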

<p>Once you are used to it, writing this kind of code is quite simple, but
boring. Ultimately, it could be possible to write stubbers that take either C
data structure definitions, a set of XML examples, or an XML DTD, and produce
the code needed to import and export the content between C data and XML
storage. This is left as an exercise to the reader :-)</p>
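
<p>Short of a full stub generator, part of the repetition can be removed with
a table that maps element names to structure fields. The following is only a
sketch of that idea under simplifying assumptions: it does not use libxml,
the names <em>fieldMap</em>, <em>personMap</em> and <em>setField</em> are
hypothetical, the person record is cut down to two fields, and the element
name and its text are passed in as plain C strings.</p>

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Reduced person record, for illustration only. */
typedef struct person {
    char *name;
    char *email;
} person;

/* One row per child element: its name and where its text is stored. */
typedef struct fieldMap {
    const char *element;
    size_t offset;
} fieldMap;

static const fieldMap personMap[] = {
    { "Person", offsetof(person, name)  },
    { "Email",  offsetof(person, email) },
};

/*
 * Store a copy of value in the field of rec associated with element.
 * rec must have been zeroed before the first call.
 * Returns 1 if a field was set, 0 if the element is unknown,
 * -1 if memory is exhausted.
 */
int setField(void *rec, const fieldMap *map, size_t n,
             const char *element, const char *value) {
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(map[i].element, element) == 0) {
            char **slot = (char **) ((char *) rec + map[i].offset);
            char *copy = malloc(strlen(value) + 1);

            if (copy == NULL)
                return -1;
            strcpy(copy, value);
            free(*slot);        /* drop an earlier value, if any */
            *slot = copy;
            return 1;
        }
    }
    return 0;
}
```

<p>The parsing loop then reduces to one <em>setField</em> call per child, and
unknown elements are simply skipped, which keeps the tolerant behaviour
suggested above.</p>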

<p>Feel free to use <a href="gjobread.c">the code for the full C parsing
example</a> as a template. It is also available, with a Makefile, in the Gnome
CVS base under gnome-xml/example.</p>

<p></p>

<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>

<p>$Id: xml.html,v 1.11 1999/10/25 13:15:50 veillard Exp $</p>
</body>
</html>