1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
2                      "http://www.w3.org/TR/REC-html40/loose.dtd">
3<html>
4<head>
5  <title>The XML C library for Gnome</title>
6  <meta name="GENERATOR" content="amaya V3.2.1">
7  <meta http-equiv="Content-Type" content="text/html">
8</head>
9
10<body bgcolor="#ffffff">
11<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
12alt="Gnome Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png"
13alt="W3C Logo"></a></p>
14
15<h1 align="center">The XML C library for Gnome</h1>
16
17<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2>
18
19<p></p>
20<ul>
21  <li><a href="#Introducti">Introduction</a></li>
22  <li><a href="#Documentat">Documentation</a></li>
23  <li><a href="#Reporting">Reporting bugs and getting help</a></li>
  <li><a href="#help">How to help</a></li>
25  <li><a href="#Downloads">Downloads</a></li>
26  <li><a href="#News">News</a></li>
27  <li><a href="#XML">XML</a></li>
28  <li><a href="#tree">The tree output</a></li>
29  <li><a href="#interface">The SAX interface</a></li>
30  <li><a href="#library">The XML library interfaces</a>
31    <ul>
32      <li><a href="#Invoking">Invoking the parser: the pull way</a></li>
      <li><a href="#Invoking1">Invoking the parser: the push way</a></li>
34      <li><a href="#Invoking2">Invoking the parser: the SAX interface</a></li>
35      <li><a href="#Building">Building a tree from scratch</a></li>
36      <li><a href="#Traversing">Traversing the tree</a></li>
37      <li><a href="#Modifying">Modifying the tree</a></li>
38      <li><a href="#Saving">Saving the tree</a></li>
39      <li><a href="#Compressio">Compression</a></li>
40    </ul>
41  </li>
42  <li><a href="#Entities">Entities or no entities</a></li>
43  <li><a href="#Namespaces">Namespaces</a></li>
44  <li><a href="#Validation">Validation</a></li>
45  <li><a href="#Principles">DOM principles</a></li>
46  <li><a href="#real">A real example</a></li>
47  <li><a href="#Contributi">Contributions</a></li>
48</ul>
49
50<p>Separate documents:</p>
51<ul>
52  <li><a href="upgrade.html">upgrade instructions for migrating to
53  libxml2</a></li>
54  <li><a href="encoding.html">libxml Internationalization support</a></li>
55  <li><a href="xmlio.html">libxml Input/Output interfaces</a></li>
56  <li><a href="xmlmem.html">libxml Memory interfaces</a></li>
57</ul>
58
59<h2><a name="Introducti">Introduction</a></h2>
60
<p>This document describes libxml, the <a
href="http://www.w3.org/XML/">XML</a> C library developed for the <a
63href="http://www.gnome.org/">Gnome</a> project. <a
64href="http://www.w3.org/XML/">XML is a standard</a> for building tag-based
65structured documents/data.</p>
66
67<p>Here are some key points about libxml:</p>
68<ul>
69  <li>Libxml exports Push and Pull type parser interfaces for both XML and
70    HTML.</li>
  <li>Libxml can do DTD validation at parse time, using a parsed document
    instance, or with an arbitrary DTD.</li>
  <li>Libxml now includes nearly complete <a
    href="http://www.w3.org/TR/xpath">XPath</a> and <a
    href="http://www.w3.org/TR/xptr">XPointer</a> implementations.</li>
76  <li>It is written in plain C, making as few assumptions as possible, and
77    sticking closely to ANSI C/POSIX for easy embedding. Works on
78    Linux/Unix/Windows, ported to a number of other platforms.</li>
  <li>Basic HTTP and FTP client support for fetching remote
  resources.</li>
  <li>The design is modular; most of the extensions can be compiled out.</li>
  <li>The internal document representation is as close as possible to the <a
83    href="http://www.w3.org/DOM/">DOM</a> interfaces.</li>
84  <li>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX
85    like interface</a>; the interface is designed to be compatible with <a
86    href="http://www.jclark.com/xml/expat.html">Expat</a>.</li>
87  <li>This library is released both under the <a
88    href="http://www.w3.org/Consortium/Legal/copyright-software-19980720.html">W3C
89    IPR</a> and the <a href="http://www.gnu.org/copyleft/lesser.html">GNU
    LGPL</a>. Use either at your convenience; this should make
    everybody happy. If not, drop me a mail.</li>
92</ul>
93
94<h2><a name="Documentat">Documentation</a></h2>
95
96<p>There are some on-line resources about using libxml:</p>
97<ol>
98  <li>Check the <a href="FAQ.html">FAQ</a></li>
99  <li>Check the <a href="http://xmlsoft.org/html/libxml-lib.html">extensive
100    documentation</a> automatically extracted from code comments (using <a
101    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gtk-doc">gtk
102    doc</a>).</li>
103  <li>Look at the documentation about <a href="encoding.html">libxml
104    internationalization support</a></li>
105  <li>This page provides a global overview and <a href="#real">some
106    examples</a> on how to use libxml.</li>
107  <li><a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
108    href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
109    documentation</a> explaining how to use the libxml SAX interface.</li>
110  <li>George Lebl wrote <a
111    href="http://www-4.ibm.com/software/developer/library/gnome3/">an article
112    for IBM developerWorks</a> about using libxml.</li>
  <li>It is also a good idea to check <a href="mailto:raph@levien.com">Raph
    Levien</a>'s <a href="http://levien.com/gnome/">web site</a>, since he is
    building the <a href="http://levien.com/gnome/gdome.html">DOM interface
    gdome</a> on top of the libxml result tree, as well as an implementation of <a
117    href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a
118    href="http://www.levien.com/svg/">gill</a>. Check his <a
119    href="http://www.levien.com/gnome/domination.html">DOMination
120  paper</a>.</li>
121  <li>Check <a href="http://cvs.gnome.org/lxr/source/gnome-xml/TODO">the TODO
122    file</a></li>
123  <li>Read the <a href="upgrade.html">1.x to 2.x upgrade path</a>. If you are
124    starting a new project using libxml you should really use the 2.x
125  version.</li>
126  <li>And don't forget to look at the <a href="/messages/">mailing-list
127    archive</a>.</li>
128</ol>
129
130<h2><a name="Reporting">Reporting bugs and getting help</a></h2>
131
132<p>Well, bugs or missing features are always possible, and I will make a point
133of fixing them in a timely fashion. The best way to report a bug is to use the
134<a href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug tracking
135database</a> (make sure to use the "gnome-xml" module name, not libxml or
136libxml2). I look at reports there regularly and it's good to have a reminder
137when a bug is still open. Check the <a
138href="http://bugs.gnome.org/Reporting.html">instructions on reporting bugs</a>
139and be sure to specify that the bug is for the package gnome-xml.</p>
140
141<p>There is also a mailing-list <a
142href="mailto:xml@rpmfind.net">xml@rpmfind.net</a> for libxml, with an <a
143href="http://xmlsoft.org/messages">on-line archive</a>. To subscribe to this
144majordomo based list, send a mail message to <a
145href="mailto:majordomo@rpmfind.net">majordomo@rpmfind.net</a> with "subscribe
146xml" in the <strong>content</strong> of the message.</p>
147
<p>Alternatively, you can just send the bug to the <a
href="mailto:xml@rpmfind.net">xml@rpmfind.net</a> list; if it's really libxml
related I will approve it.</p>

<p>Of course, bug reports with a suggested patch for fixing them will
probably be processed faster.</p>
154
<p>If you're looking for help, a quick look at <a
href="http://xmlsoft.org/messages/#407">the list archive</a> may actually
provide the answer; I usually send source samples when answering libxml usage
questions. The <a href="http://xmlsoft.org/html/book1.html">auto-generated
documentation</a> is not as polished as I would like (I need to learn more
about Docbook), but it's a good starting point.</p>
161
162<h2><a name="help">How to help</a></h2>
163
<p>You can help the project in various ways. The best thing to do first is to
subscribe to the mailing-list as explained before, and check the <a
href="http://xmlsoft.org/messages/">archives</a> and the <a
href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug
database</a>. After that, you can:</p>
<ol>
  <li>provide patches when you find problems</li>
  <li>provide the diffs when you port libxml to a new platform. They may not
    be integrated in all cases, but they help pinpoint portability
    problems.</li>
  <li>provide documentation fixes (either as patches to the code comments or
    as HTML diffs).</li>
  <li>provide new documentation pieces (translations, examples, etc.)</li>
  <li>check the TODO file and try to close one of the items</li>
  <li>take one of the points raised in the archive or the bug database and
    provide a fix. <a href="mailto:Daniel.Veillard@w3.org">Get in touch with
    me</a> beforehand to avoid synchronization problems and to check that the
    suggested fix will fit in nicely :-)</li>
182</ol>
183
184<h2><a name="Downloads">Downloads</a></h2>
185
186<p>The latest versions of libxml can be found on <a
187href="ftp://rpmfind.net/pub/libxml/">rpmfind.net</a> or on the <a
188href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a> either
189as a <a href="ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/">source
190archive</a> or <a
191href="ftp://ftp.gnome.org/pub/GNOME/contrib/redhat/SRPMS/">RPM packages</a>.
192(NOTE that you need both the <a
193href="http://rpmfind.net/linux/RPM/libxml2.html">libxml(2)</a> and <a
194href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml(2)-devel</a>
195packages installed to compile applications using libxml.)</p>
196
197<p><a name="Snapshot">Snapshot:</a></p>
198<ul>
199  <li>Code from the W3C cvs base libxml <a
200    href="ftp://rpmfind.net/pub/libxml/cvs-snapshot.tar.gz">cvs-snapshot.tar.gz</a></li>
201  <li>Docs, content of the web site, the list archive included <a
202    href="ftp://rpmfind.net/pub/libxml/libxml-docs.tar.gz">libxml-docs.tar.gz</a></li>
203</ul>
204
205<p><a name="Contribs">Contribs:</a></p>
206
<p>I do accept external contributions, especially packages compiled on another
platform; get in touch with me to upload the package. I will keep them in the
<a href="ftp://rpmfind.net/pub/libxml/contribs/">contrib directory</a>.</p>
210
211<p>Libxml is also available from 2 CVS bases:</p>
212<ul>
  <li><p>The <a href="http://dev.w3.org/cvsweb/XML/">W3C CVS base</a>,
    available read-only using the CVS pserver authentication (I tend to use
215    this base for my own development, so it's updated more regularly, but the
216    content may not be as stable):</p>
217    <pre>CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
218        password: anonymous
219        module: XML</pre>
220  </li>
221  <li><p>The <a
222    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gnome-xml">Gnome
223    CVS base</a>. Check the <a
224    href="http://developer.gnome.org/tools/cvs.html">Gnome CVS Tools</a> page;
225    the CVS module is <b>gnome-xml</b>.</p>
226  </li>
227</ul>
228
229<h2><a name="News">News</a></h2>
230
231<h3>CVS only : check the <a
232href="http://cvs.gnome.org/lxr/source/gnome-xml/ChangeLog">Changelog</a> file
233for a really accurate description</h3>
234
<p>Items floating around but not actively worked on; get in touch with me if
you want to test them:</p>
237<ul>
238  <li>working on HTML and XML links recognition layers</li>
239  <li>parsing/import of Docbook SGML docs</li>
240</ul>
241
242<h3>2.2.6: Oct 25 2000:</h3>
243<ul>
  <li>Added a hash table module, migrated a number of internal structures to
    it</li>
246  <li>Fixed a posteriori validation problems</li>
247  <li>HTTP module cleanups</li>
248  <li>HTML parser improvements (tag errors, script/style handling, attribute
249    normalization)</li>
250  <li>coalescing of adjacent text nodes</li>
251  <li>couple of XPath bug fixes, exported the internal API</li>
252</ul>
253
254<h3>2.2.5: Oct 15 2000:</h3>
255<ul>
256  <li>XPointer implementation and testsuite</li>
  <li>Lots of XPath fixes, added variable and function registration, more
    tests</li>
259  <li>Portability fixes, lots of enhancements toward an easy Windows build and
260    release</li>
261  <li>Late validation fixes</li>
262  <li>Integrated a lot of contributed patches</li>
263  <li>added memory management docs</li>
  <li>a performance problem when using large buffers seems fixed</li>
265</ul>
266
267<h3>2.2.4: Oct 1 2000:</h3>
268<ul>
269  <li>main XPath problem fixed</li>
270  <li>Integrated portability patches for Windows</li>
271  <li>Serious bug fixes on the URI and HTML code</li>
272</ul>
273
274<h3>2.2.3: Sep 17 2000</h3>
275<ul>
276  <li>bug fixes</li>
277  <li>cleanup of entity handling code</li>
278  <li>overall review of all loops in the parsers, all sprintf usage has been
279    checked too</li>
  <li>Far better handling of large DTDs. Validating against the Docbook XML
    DTD works smoothly now.</li>
282</ul>
283
284<h3>1.8.10: Sep 6 2000</h3>
285<ul>
286  <li>bug fix release for some Gnome projects</li>
287</ul>
288
289<h3>2.2.2: August 12 2000</h3>
290<ul>
291  <li>mostly bug fixes</li>
292  <li>started adding routines to access xml parser context options</li>
293</ul>
294
295<h3>2.2.1: July 21 2000</h3>
296<ul>
  <li>a purely bug-fix release</li>
  <li>fixed an encoding support problem when parsing from a memory block</li>
  <li>fixed a DOCTYPE parsing problem</li>
  <li>removed a bug in the function that allows overriding the memory
    allocation routines</li>
302</ul>
303
304<h3>2.2.0: July 14 2000</h3>
305<ul>
306  <li>applied a lot of portability fixes</li>
307  <li>better encoding support/cleanup and saving (content is now always
308    encoded in UTF-8)</li>
309  <li>the HTML parser now correctly handles encodings</li>
310  <li>added xmlHasProp()</li>
311  <li>fixed a serious problem with &amp;#38;</li>
312  <li>propagated the fix to FTP client</li>
313  <li>cleanup, bugfixes, etc ...</li>
314  <li>Added a page about <a href="encoding.html">libxml Internationalization
315    support</a></li>
316</ul>
317
318<h3>1.8.9:  July 9 2000</h3>
319<ul>
  <li>fixed the spec file; the RPMs should be better</li>
  <li>fixed a serious bug in the FTP implementation, released 1.8.9 to solve
    rpmfind users' problem</li>
323</ul>
324
325<h3>2.1.1: July 1 2000</h3>
326<ul>
327  <li>fixes a couple of bugs in the 2.1.0 packaging</li>
328  <li>improvements on the HTML parser</li>
329</ul>
330
331<h3>2.1.0 and 1.8.8: June 29 2000</h3>
332<ul>
  <li>1.8.8 is mostly a convenience package for upgrading to libxml2 according
    to <a href="upgrade.html">new instructions</a>. It fixes a nasty problem
    with &amp;#38; charref parsing</li>
  <li>2.1.0 also eases the upgrade from libxml v1 to the recent version. It
    also contains numerous fixes and enhancements:
    <ul>
      <li>added xmlStopParser() to stop parsing</li>
      <li>greatly improved parsing speed when there are large CDATA blocks</li>
      <li>includes XPath patches provided by Picdar Technology</li>
      <li>tried to fix as many DTD validation and namespace related problems
        as possible</li>
      <li>output to a given encoding has been added/tested</li>
      <li>lots of various fixes</li>
346    </ul>
347  </li>
348</ul>
349
350<h3>2.0.0: Apr 12 2000</h3>
351<ul>
  <li>First public release of libxml2. If you are using libxml, it's a good
    idea to check the 1.x to 2.x upgrade instructions. NOTE: while initially
    scheduled for Apr 3, the release occurred only on Apr 12 due to massive
    workload.</li>
  <li>The include files are now located under $prefix/include/libxml (instead
    of $prefix/include/gnome-xml). They are also referenced by
358    <pre>#include &lt;libxml/xxx.h&gt;</pre>
359    <p>instead of</p>
360    <pre>#include "xxx.h"</pre>
361  </li>
362  <li>a new URI module for parsing URIs and following strictly RFC 2396</li>
363  <li>the memory allocation routines used by libxml can now be overloaded
364    dynamically by using xmlMemSetup()</li>
365  <li>The previously CVS only tool tester has been renamed
366    <strong>xmllint</strong> and is now installed as part of the libxml2
367    package</li>
  <li>The I/O interface has been revamped. There are now ways to plug in
369    specific I/O modules, either at the URI scheme detection level using
370    xmlRegisterInputCallbacks()  or by passing I/O functions when creating a
371    parser context using xmlCreateIOParserCtxt()</li>
372  <li>there is a C preprocessor macro LIBXML_VERSION providing the version
373    number of the libxml module in use</li>
374  <li>a number of optional features of libxml can now be excluded at configure
375    time (FTP/HTTP/HTML/XPath/Debug)</li>
376</ul>
377
378<h3>2.0.0beta: Mar 14 2000</h3>
379<ul>
380  <li>This is a first Beta release of libxml version 2</li>
  <li>It's available only from <a
    href="ftp://rpmfind.net/pub/libxml/">rpmfind.net FTP</a>; it's packaged as
    libxml2-2.0.0beta and available as tar and RPMs</li>
384  <li>This version is now the head in the Gnome CVS base, the old one is
385    available under the tag LIB_XML_1_X</li>
  <li>This includes a very large set of changes. From a programmatic point of
    view applications should not have to be modified too much, check the <a
    href="upgrade.html">upgrade page</a></li>
  <li>Some interfaces may change (especially a bit about encoding).</li>
  <li>the updates include:
    <ul>
      <li>fix I18N support. ISO-Latin-x/UTF-8/UTF-16 seem (nearly) correctly
        handled now</li>
      <li>Better handling of entities, especially well formedness checking and
        proper PEref extensions in external subsets</li>
      <li>DTD conditional sections</li>
      <li>Validation now correctly handles entity content</li>
      <li><a href="http://rpmfind.net/tools/gdome/messages/0039.html">change
        structures to accommodate DOM</a></li>
400    </ul>
401  </li>
  <li>Serious progress was made toward compliance; <a
    href="conf/result.html">here are the results of the tests</a> against the
    OASIS testsuite (except the Japanese tests since I don't support that
405    encoding yet). This URL is rebuilt every couple of hours using the CVS
406    head version.</li>
407</ul>
408
409<h3>1.8.7: Mar 6 2000</h3>
410<ul>
411  <li>This is a bug fix release:</li>
  <li>It is possible to disable the ignorable blanks heuristic used by
    libxml-1.x: a new function, xmlKeepBlanksDefault(0), allows this. Note
    that for adherence to the XML spec, this behaviour will be disabled by
    default in 2.x. The same function can be used to keep compatibility for
    old code.</li>
  <li>Blanks in &lt;a&gt;  &lt;/a&gt; constructs are not ignored anymore,
    avoiding heuristics is really the Right Way :-\</li>
419  <li>The unchecked use of snprintf which was breaking libxml-1.8.6
420    compilation on some platforms has been fixed</li>
421  <li>nanoftp.c nanohttp.c: Fixed '#' and '?' stripping when processing
422  URIs</li>
423</ul>
424
425<h3>1.8.6: Jan 31 2000</h3>
426<ul>
427  <li>added a nanoFTP transport module, debugged until the new version of <a
428    href="http://rpmfind.net/linux/rpm2html/rpmfind.html">rpmfind</a> can use
429    it without troubles</li>
430</ul>
431
432<h3>1.8.5: Jan 21 2000</h3>
433<ul>
434  <li>adding APIs to parse a well balanced chunk of XML (production <a
435    href="http://www.w3.org/TR/REC-xml#NT-content">[43] content</a> of the XML
436    spec)</li>
  <li>fixed a hideous bug in xmlGetProp pointed out by Rune.Djurhuus@fast.no</li>
438  <li>Jody Goldberg &lt;jgoldberg@home.com&gt; provided another patch trying
439    to solve the zlib checks problems</li>
440  <li>The current state in gnome CVS base is expected to ship as 1.8.5 with
441    gnumeric soon</li>
442</ul>
443
444<h3>1.8.4: Jan 13 2000</h3>
445<ul>
446  <li>bug fixes, reintroduced xmlNewGlobalNs(), fixed xmlNewNs()</li>
  <li>all exit() calls should have been removed from libxml</li>
448  <li>fixed a problem with INCLUDE_WINSOCK on WIN32 platform</li>
449  <li>added newDocFragment()</li>
450</ul>
451
452<h3>1.8.3: Jan 5 2000</h3>
453<ul>
454  <li>a Push interface for the XML and HTML parsers</li>
455  <li>a shell-like interface to the document tree (try tester --shell :-)</li>
  <li>lots of bug fixes and improvements added over the Xmas holidays</li>
457  <li>fixed the DTD parsing code to work with the xhtml DTD</li>
458  <li>added xmlRemoveProp(), xmlRemoveID() and xmlRemoveRef()</li>
459  <li>Fixed bugs in xmlNewNs()</li>
460  <li>External entity loading code has been revamped, now it uses
    xmlLoadExternalEntity(), some fixes to entity processing were added</li>
462  <li>cleaned up WIN32 includes of socket stuff</li>
463</ul>
464
465<h3>1.8.2: Dec 21 1999</h3>
466<ul>
467  <li>I got another problem with includes and C++, I hope this issue is fixed
468    for good this time</li>
469  <li>Added a few tree modification functions: xmlReplaceNode,
470    xmlAddPrevSibling, xmlAddNextSibling, xmlNodeSetName and
471    xmlDocSetRootElement</li>
472  <li>Tried to improve the HTML output with help from <a
473    href="mailto:clahey@umich.edu">Chris Lahey</a></li>
474</ul>
475
476<h3>1.8.1: Dec 18 1999</h3>
477<ul>
  <li>various patches to avoid trouble when using libxml with C++ compilers:
    the "namespace" keyword and C escaping in include files</li>
480  <li>a problem in one of the core macros IS_CHAR was corrected</li>
481  <li>fixed a bug introduced in 1.8.0 breaking default namespace processing,
482    and more specifically the Dia application</li>
483  <li>fixed a posteriori validation (validation after parsing, or by using a
484    Dtd not specified in the original document)</li>
485  <li>fixed a bug in</li>
486</ul>
487
488<h3>1.8.0: Dec 12 1999</h3>
489<ul>
490  <li>cleanup, especially memory wise</li>
491  <li>the parser should be more reliable, especially the HTML one, it should
492    not crash, whatever the input !</li>
493  <li>Integrated various patches, especially a speedup improvement for large
494    dataset from <a href="mailto:cnygard@bellatlantic.net">Carl Nygard</a>,
495    configure with --with-buffers to enable them.</li>
496  <li>attribute normalization, oops should have been added long ago !</li>
  <li>attributes defaulted from DTDs should be available, xmlSetProp() now
    does entity escaping by default.</li>
499</ul>
500
501<h3>1.7.4: Oct 25 1999</h3>
502<ul>
  <li>Lots of HTML improvements</li>
504  <li>Fixed some errors when saving both XML and HTML</li>
505  <li>More examples, the regression tests should now look clean</li>
506  <li>Fixed a bug with contiguous charref</li>
507</ul>
508
509<h3>1.7.3: Sep 29 1999</h3>
510<ul>
511  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available, fixed</li>
514</ul>
515
516<h3>1.7.1: Sep 24 1999</h3>
517<ul>
518  <li>The basic type for strings manipulated by libxml has been renamed in
519    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
    is that CHAR was conflicting with a predefined type on Windows. However, in
    non-WIN32 environments, compatibility is provided by way of a
    <strong>#define</strong>.</li>
  <li>Fixed another error: the use of a structure field called errno, which
    led to trouble on platforms where it's a macro</li>
525</ul>
526
<h3>1.7.0: Sep 23 1999</h3>
528<ul>
529  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
530    href="html/gnome-xml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by means other than a simple
    printf-like callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
534  <li>Serious memory leaks fixed (there is now a <a
535    href="html/gnome-xml-xmlmemory.html">memory wrapper</a> module)</li>
536  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
537    implementation</li>
538  <li>Added an HTML parser front-end</li>
539</ul>
540
541<h2><a name="XML">XML</a></h2>
542
543<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for
544markup-based structured documents. Here is <a name="example">an example XML
545document</a>:</p>
546<pre>&lt;?xml version="1.0"?&gt;
547&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too"&gt;
548  &lt;head&gt;
549   &lt;title&gt;Welcome to Gnome&lt;/title&gt;
550  &lt;/head&gt;
551  &lt;chapter&gt;
552   &lt;title&gt;The Linux adventure&lt;/title&gt;
553   &lt;p&gt;bla bla bla ...&lt;/p&gt;
554   &lt;image href="linus.gif"/&gt;
555   &lt;p&gt;...&lt;/p&gt;
556  &lt;/chapter&gt;
557&lt;/EXAMPLE&gt;</pre>
558
559<p>The first line specifies that it's an XML document and gives useful
560information about its encoding. Then the document is a text format whose
561structure is specified by tags between brackets. <strong>Each tag opened has
562to be closed</strong>. XML is pedantic about this. However, if a tag is empty
563(no content), a single tag can serve as both the opening and closing tag if it
564ends with <code>/&gt;</code> rather than with <code>&gt;</code>. Note that,
565for example, the image tag has no content (just an attribute) and is closed by
566ending the tag with <code>/&gt;</code>.</p>
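
<p>The closing rule above can be illustrated with a small, stack-based check.
This is only a sketch: <code>tags_are_balanced()</code> is a hypothetical helper,
not a libxml function, and it assumes that attribute values contain no
<code>&gt;</code> character and that the input has no comments, CDATA sections
or DOCTYPE declaration. A real parser checks far more than this.</p>

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64   /* deepest nesting this sketch accepts */
#define MAX_NAME  64   /* longest tag name this sketch accepts */

/* Hypothetical helper, not part of libxml.
 * Returns 1 if every opened tag in `xml` is closed in the right order,
 * 0 otherwise. */
int tags_are_balanced(const char *xml)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;
    const char *p = xml;

    assert(xml != NULL);
    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            return 0;                      /* unterminated tag */
        p++;                               /* skip '<' */

        if (*p == '?') {                   /* <?xml ... ?> declaration */
            p = end + 1;
            continue;
        }

        int closing = (*p == '/');
        if (closing)
            p++;

        /* The tag name runs up to whitespace, '/' or '>'. */
        size_t len = strcspn(p, " \t\r\n/>");
        if (len == 0 || len >= MAX_NAME)
            return 0;

        if (closing) {
            /* A closing tag must match the most recently opened one. */
            if (depth == 0 || strlen(stack[depth - 1]) != len ||
                strncmp(stack[depth - 1], p, len) != 0)
                return 0;
            depth--;
        } else if (end[-1] != '/') {       /* "<x/>" opens and closes at once */
            if (depth == MAX_DEPTH)
                return 0;
            memcpy(stack[depth], p, len);
            stack[depth][len] = '\0';
            depth++;
        }
        p = end + 1;
    }
    return depth == 0;                     /* nothing may remain open */
}
```

<p>Calling the helper on the example document shown earlier would report it as
balanced, while dropping the final <code>&lt;/EXAMPLE&gt;</code> would make it
fail.</p>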
567
<p>XML can be applied successfully to a wide range of uses, from long term
structured document maintenance (where it follows in the footsteps of SGML) to
simple data encoding mechanisms like configuration file formatting (glade),
spreadsheets (gnumeric), or even shorter lived documents such as those in
WebDAV, where it is used to encode remote calls between a client and a
server.</p>
573
574<h2>An overview of libxml architecture</h2>
575
<p>Libxml is made of multiple components, some of them optional, and most of
the block interfaces are public. The main components are:</p>
<ul>
  <li>an Input/Output layer</li>
  <li>FTP and HTTP client layers (optional)</li>
  <li>an Internationalization layer managing the encodings support</li>
  <li>a URI module</li>
  <li>the XML parser and its basic SAX interface</li>
  <li>an HTML parser using the same SAX interface (optional)</li>
  <li>a SAX tree module to build an in-memory DOM representation</li>
  <li>a tree module to manipulate the DOM representation</li>
  <li>a validation module using the DOM representation (optional)</li>
  <li>an XPath module for global lookup in a DOM representation
  (optional)</li>
  <li>a debug module (optional)</li>
591</ul>
592
593<p>Graphically this gives the following:</p>
594
595<p><img src="libxml.gif" alt="a graphical view of the various"></p>
596
597<p></p>
598
599<h2><a name="tree">The tree output</a></h2>
600
601<p>The parser returns a tree built during the document analysis. The value
602returned is an <strong>xmlDocPtr</strong> (i.e., a pointer to an
603<strong>xmlDoc</strong> structure). This structure contains information such
604as the file name, the document type, and a <strong>children</strong> pointer
605which is the root of the document (or more exactly the first child under the
606root which is the document). The tree is made of <strong>xmlNode</strong>s,
607chained in double-linked lists of siblings and with children&lt;-&gt;parent
608relationship. An xmlNode can also carry properties (a chain of xmlAttr
609structures). An attribute may have a value which is a list of TEXT or
610ENTITY_REF nodes.</p>
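
<p>To make the linking concrete, here is a minimal sketch of such a tree in
plain C. The <code>node</code> and <code>attr</code> types are simplified,
hypothetical stand-ins for xmlNode and xmlAttr (the real structures carry more
fields), and <code>node_add_child()</code> and <code>node_count()</code> are
illustrative helpers, not libxml functions.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for xmlAttr: a chain of name/value pairs. */
typedef struct attr {
    const char  *name;
    const char  *value;
    struct attr *next;
} attr;

/* Simplified stand-in for xmlNode: siblings form a doubly-linked list,
 * and each node points to its parent and to its first and last child. */
typedef struct node {
    const char  *name;
    struct node *parent;
    struct node *children;    /* first child, or NULL */
    struct node *last;        /* last child, or NULL */
    struct node *next;        /* next sibling, or NULL */
    struct node *prev;        /* previous sibling, or NULL */
    attr        *properties;  /* attribute chain, or NULL */
} node;

/* Append `child` at the end of `parent`'s child list. */
void node_add_child(node *parent, node *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last != NULL)
        parent->last->next = child;
    else
        parent->children = child;
    parent->last = child;
}

/* Count the nodes in the subtree rooted at `n`, including `n` itself. */
int node_count(const node *n)
{
    int total = 1;
    for (const node *c = n->children; c != NULL; c = c->next)
        total += node_count(c);
    return total;
}
```

<p>Walking <code>children</code> and then following <code>next</code> is the
basic traversal pattern; <code>parent</code> and <code>prev</code> allow moving
back up or sideways without keeping a separate stack.</p>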
611
612<p>Here is an example (erroneous with respect to the XML spec since there
613should be only one ELEMENT under the root):</p>
614
615<p><img src="structure.gif" alt=" structure.gif "></p>
616
617<p>In the source package there is a small program (not installed by default)
618called <strong>xmllint</strong> which parses XML files given as argument and
619prints them back as parsed. This is useful for detecting errors both in XML
620code and in the XML parser itself. It has an option <strong>--debug</strong>
which prints the actual in-memory structure of the document. Here is the
result with the <a href="#example">example</a> given before:</p>
623<pre>DOCUMENT
624version=1.0
625standalone=true
626  ELEMENT EXAMPLE
627    ATTRIBUTE prop1
628      TEXT
629      content=gnome is great
630    ATTRIBUTE prop2
631      ENTITY_REF
632      TEXT
633      content= linux too 
634    ELEMENT head
635      ELEMENT title
636        TEXT
637        content=Welcome to Gnome
638    ELEMENT chapter
639      ELEMENT title
640        TEXT
641        content=The Linux adventure
642      ELEMENT p
643        TEXT
644        content=bla bla bla ...
645      ELEMENT image
646        ATTRIBUTE href
647          TEXT
648          content=linus.gif
649      ELEMENT p
650        TEXT
651        content=...</pre>
652
653<p>This should be useful for learning the internal representation model.</p>
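
<p>A dump of this shape comes from a simple recursive walk that indents by
depth. The sketch below shows the idea on a hypothetical, simplified
<code>node</code> type; it is not the xmllint implementation, and it handles
only ELEMENT and TEXT nodes.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified node: `kind` is "ELEMENT" or "TEXT", and
 * `text` holds the element name or the text content respectively. */
typedef struct node {
    const char  *kind;
    const char  *text;
    struct node *children;   /* first child, or NULL */
    struct node *next;       /* next sibling, or NULL */
} node;

/* Append an indented dump of `n`, its following siblings and their
 * subtrees to `out`, starting at offset `used`. Returns the new offset;
 * output that does not fit in `size` bytes is truncated. */
size_t dump_nodes(const node *n, int depth, char *out, size_t size,
                  size_t used)
{
    for (; n != NULL; n = n->next) {
        int indent = depth * 2;          /* two spaces per level */
        int w;

        if (used + 1 >= size)
            return used;                 /* buffer is full */
        if (strcmp(n->kind, "TEXT") == 0)
            w = snprintf(out + used, size - used,
                         "%*sTEXT\n%*scontent=%s\n",
                         indent, "", indent, "", n->text);
        else
            w = snprintf(out + used, size - used, "%*s%s %s\n",
                         indent, "", n->kind, n->text);
        if (w < 0)
            return used;                 /* formatting error */
        if ((size_t)w >= size - used)
            return size - 1;             /* output was truncated */
        used += (size_t)w;

        /* Children are printed one level deeper than their parent. */
        used = dump_nodes(n->children, depth + 1, out, size, used);
    }
    return used;
}
```

<p>Writing into a caller-supplied buffer rather than directly to a stream is a
choice made for this sketch only, so that the result is easy to inspect.</p>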
654
655<h2><a name="interface">The SAX interface</a></h2>
656
657<p>Sometimes the DOM tree output is just too large to fit reasonably into
658memory. In that case (and if you don't expect to save back the XML document
659loaded using libxml), it's better to use the SAX interface of libxml. SAX is a
660<strong>callback-based interface</strong> to the parser. Before parsing, the
661application layer registers a customized set of callbacks which are called by
662the library as it progresses through the XML input.</p>
663
664<p>To get more detailed step-by-step guidance on using the SAX interface of
665libxml, see the <a
666href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice
documentation</a> written by <a href="mailto:james@daa.com.au">James
668Henstridge</a>.</p>
669
670<p>You can debug the SAX behaviour by using the <strong>testSAX</strong>
671program located in the gnome-xml module (it's usually not shipped in the
672binary packages of libxml, but you can find it in the tar source
673distribution). Here is the sequence of callbacks that would be reported by
674testSAX when parsing the example XML document shown earlier:</p>
675<pre>SAX.setDocumentLocator()
676SAX.startDocument()
677SAX.getEntity(amp)
678SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp;amp; linux too')
679SAX.characters(   , 3)
680SAX.startElement(head)
681SAX.characters(    , 4)
682SAX.startElement(title)
683SAX.characters(Welcome to Gnome, 16)
684SAX.endElement(title)
685SAX.characters(   , 3)
686SAX.endElement(head)
687SAX.characters(   , 3)
688SAX.startElement(chapter)
689SAX.characters(    , 4)
690SAX.startElement(title)
691SAX.characters(The Linux adventure, 19)
692SAX.endElement(title)
693SAX.characters(    , 4)
694SAX.startElement(p)
695SAX.characters(bla bla bla ..., 15)
696SAX.endElement(p)
697SAX.characters(    , 4)
698SAX.startElement(image, href='linus.gif')
699SAX.endElement(image)
700SAX.characters(    , 4)
701SAX.startElement(p)
702SAX.characters(..., 3)
703SAX.endElement(p)
704SAX.characters(   , 3)
705SAX.endElement(chapter)
706SAX.characters( , 1)
707SAX.endElement(EXAMPLE)
708SAX.endDocument()</pre>
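
<p>The callback pattern behind this trace can be sketched without the library.
In the example below, <code>sax_handler</code> is a hypothetical table of
function pointers loosely modelled on the idea of a SAX handler, and
<code>replay_events()</code> stands in for the parser by dispatching a recorded
list of events. None of these names belong to libxml; the application-side
callbacks simply gather statistics and never build a tree.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table filled in by the application. */
typedef struct sax_handler {
    void (*start_element)(void *user_data, const char *name);
    void (*end_element)(void *user_data, const char *name);
    void (*characters)(void *user_data, const char *text, int len);
} sax_handler;

typedef enum { EV_START, EV_END, EV_CHARS } event_kind;

/* One recorded event: an element name or a run of characters. */
typedef struct sax_event {
    event_kind  kind;
    const char *text;
    int         len;     /* only meaningful for EV_CHARS */
} sax_event;

/* Stand-in for the parser: report each event through the registered
 * callbacks. Callbacks left NULL are simply skipped. */
void replay_events(const sax_handler *h, void *user_data,
                   const sax_event *ev, size_t count)
{
    assert(h != NULL);
    for (size_t i = 0; i < count; i++) {
        switch (ev[i].kind) {
        case EV_START:
            if (h->start_element) h->start_element(user_data, ev[i].text);
            break;
        case EV_END:
            if (h->end_element) h->end_element(user_data, ev[i].text);
            break;
        case EV_CHARS:
            if (h->characters) h->characters(user_data, ev[i].text, ev[i].len);
            break;
        }
    }
}

/* Application state updated by the callbacks. */
typedef struct stats {
    int elements;     /* number of startElement events seen */
    int depth;        /* current nesting depth */
    int max_depth;    /* deepest nesting seen so far */
    int text_bytes;   /* total length reported by characters events */
} stats;

void count_start(void *user_data, const char *name)
{
    stats *s = user_data;
    (void)name;
    s->elements++;
    if (++s->depth > s->max_depth)
        s->max_depth = s->depth;
}

void count_end(void *user_data, const char *name)
{
    stats *s = user_data;
    (void)name;
    s->depth--;
}

void count_chars(void *user_data, const char *text, int len)
{
    stats *s = user_data;
    (void)text;
    s->text_bytes += len;
}
```

<p>Because the callbacks keep only a few counters, memory use stays constant
however large the input is, which is the main reason to prefer this style over
building the full tree.</p>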
709
<p>Most of the other functionality of libxml is based on the DOM
tree-building facility, so nearly everything up to the end of this document
presupposes the use of the standard DOM tree build. Note that the DOM tree
itself is built by a set of registered default callbacks, without any
specific internal interface.</p>
715
716<h2><a name="library">The XML library interfaces</a></h2>
717
<p>This section is intended to help programmers get started
using the XML library from the C language. It is not intended to be extensive.
I hope the automatically generated documents will provide the completeness
required, but as a separate set of documents. The interfaces of the XML
library are low level by design; there is nearly zero abstraction. Those
interested in a higher level API should <a href="#DOM">look at DOM</a>.</p>
724
725<p>The <a href="html/gnome-xml-parser.html">parser interfaces for XML</a> are
726separated from the <a href="html/gnome-xml-htmlparser.html">HTML parser
727interfaces</a>.  Let's have a look at how the XML parser can be called:</p>
728
<h3><a name="Invoking">Invoking the parser: the pull method</a></h3>
730
731<p>Usually, the first thing to do is to read an XML input. The parser accepts
732documents either from in-memory strings or from files.  The functions are
733defined in "parser.h":</p>
734<dl>
735  <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt>
    <dd><p>Parse a document held in an in-memory buffer of <code>size</code> bytes.</p>
737    </dd>
738</dl>
739<dl>
740  <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt>
741    <dd><p>Parse an XML document contained in a (possibly compressed)
742      file.</p>
743    </dd>
744</dl>
745
746<p>The parser returns a pointer to the document structure (or NULL in case of
747failure).</p>
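
<p>As a minimal sketch of the pull method, the helper below parses a string
and reports whether a tree could be built. The name
<code>is_well_formed</code> is made up for this illustration, and
<code>xmlFreeDoc()</code>, which releases the tree, is not otherwise discussed
in this section:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: return 1 if text parses as well-formed XML, else 0. */
int is_well_formed(const char *text)
{
    /* xmlParseMemory() returns NULL when the document cannot be parsed. */
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));

    if (doc == NULL)
        return 0;
    xmlFreeDoc(doc);    /* the caller owns the tree and must release it */
    return 1;
}</pre>

<p>A real application would of course keep the returned document around
instead of freeing it immediately.</p>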
748
749<h3 id="Invoking1">Invoking the parser: the push method</h3>
750
<p>So that the application can keep control while the document is being
fetched (which is common for GUI-based programs), libxml also provides a push
interface, as of version 1.8.3. Here are the interface functions:</p>
754<pre>xmlParserCtxtPtr xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax,
755                                         void *user_data,
756                                         const char *chunk,
757                                         int size,
758                                         const char *filename);
759int              xmlParseChunk          (xmlParserCtxtPtr ctxt,
760                                         const char *chunk,
761                                         int size,
762                                         int terminate);</pre>
763
764<p>and here is a simple example showing how to use the interface:</p>
<pre>            FILE *f;
            xmlDocPtr doc = NULL;

            f = fopen(filename, "r");
            if (f != NULL) {
                int res, size = 1024;
                char chars[1024];
                xmlParserCtxtPtr ctxt;

                /* read the first 4 bytes so the parser can detect the encoding */
                res = fread(chars, 1, 4, f);
                if (res &gt; 0) {
                    ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                chars, res, filename);
                    if (ctxt != NULL) {
                        /* feed the rest of the file chunk by chunk */
                        while ((res = fread(chars, 1, size, f)) &gt; 0) {
                            xmlParseChunk(ctxt, chars, res, 0);
                        }
                        /* an empty final chunk signals the end of the input */
                        xmlParseChunk(ctxt, chars, 0, 1);
                        doc = ctxt-&gt;myDoc;
                        xmlFreeParserCtxt(ctxt);
                    }
                }
                fclose(f);
            }</pre>
785
<p>Note that the HTML parser embedded into libxml also has a push interface;
the functions are just prefixed by "html" rather than "xml".</p>
788
789<h3 id="Invoking2">Invoking the parser: the SAX interface</h3>
790
<p>A couple of comments can be made. First, building a tree means that the
parser is memory-hungry: memory is needed first to load the document, then to
build the tree. Reading a document without building the tree is possible
using the SAX interfaces (see SAX.h and <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">James
Henstridge's documentation</a>). Note also that the push interface can be
limited to SAX: just use the first two arguments of
<code>xmlCreatePushParserCtxt()</code>.</p>
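
<p>Here is a hedged sketch of a SAX-only push parse: a handler with a single
<code>startElement</code> callback counts elements while the text is fed in
small chunks, and no tree is built. The names <code>count_start</code> and
<code>count_elements_pushed</code> are invented for this example, and the
chunk size only simulates data arriving piecemeal:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;

/* SAX callback: invoked once per start tag; ctx is our user_data pointer. */
static void count_start(void *ctx, const xmlChar *name, const xmlChar **atts)
{
    (void) name;
    (void) atts;
    (*(int *) ctx)++;
}

/* Hypothetical helper: push text to the parser chunk bytes at a time and
 * return the number of elements seen, or -1 on setup failure. */
int count_elements_pushed(const char *text, int chunk)
{
    xmlSAXHandler sax;
    xmlParserCtxtPtr ctxt;
    int count = 0;
    int len = (int) strlen(text);
    int pos = 0;

    if (chunk &lt;= 0)
        return -1;

    /* All callbacks unset except startElement. */
    memset(&amp;sax, 0, sizeof(sax));
    sax.startElement = count_start;

    /* The first two arguments select the SAX handler and its user data. */
    ctxt = xmlCreatePushParserCtxt(&amp;sax, &amp;count, NULL, 0, NULL);
    if (ctxt == NULL)
        return -1;

    while (pos &lt; len) {
        int n = (len - pos &lt; chunk) ? len - pos : chunk;
        xmlParseChunk(ctxt, text + pos, n, 0);
        pos += n;
    }
    xmlParseChunk(ctxt, NULL, 0, 1);    /* terminate the parse */
    xmlFreeParserCtxt(ctxt);
    return count;
}</pre>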
799
800<h3><a name="Building">Building a tree from scratch</a></h3>
801
802<p>The other way to get an XML tree in memory is by building it. Basically
803there is a set of functions dedicated to building new elements. (These are
804also described in &lt;libxml/tree.h&gt;.) For example, here is a piece of code
805that produces the XML document used in the previous examples:</p>
806<pre>    #include &lt;libxml/tree.h&gt;
807    xmlDocPtr doc;
808    xmlNodePtr tree, subtree;
809
810    doc = xmlNewDoc("1.0");
811    doc-&gt;children = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL);
812    xmlSetProp(doc-&gt;children, "prop1", "gnome is great");
813    xmlSetProp(doc-&gt;children, "prop2", "&amp; linux too");
814    tree = xmlNewChild(doc-&gt;children, NULL, "head", NULL);
815    subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome");
816    tree = xmlNewChild(doc-&gt;children, NULL, "chapter", NULL);
817    subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure");
818    subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ...");
819    subtree = xmlNewChild(tree, NULL, "image", NULL);
820    xmlSetProp(subtree, "href", "linus.gif");</pre>
821
822<p>Not really rocket science ...</p>
823
824<h3><a name="Traversing">Traversing the tree</a></h3>
825
<p>By <a href="html/gnome-xml-tree.html">including "tree.h"</a>, your code
has access to the internal structure of all the elements of the tree. The
field names are meant to be self-explanatory: <strong>parent</strong>,
<strong>children</strong>, <strong>next</strong>, <strong>prev</strong>,
<strong>properties</strong>, etc. For example, still with the previous
example:</p>
832<pre><code>doc-&gt;children-&gt;children-&gt;children</code></pre>
833
834<p>points to the title element,</p>
<pre>doc-&gt;children-&gt;children-&gt;next-&gt;children-&gt;children</pre>
836
837<p>points to the text node containing the chapter title "The Linux
838adventure".</p>
839
<p><strong>NOTE</strong>: XML allows <em>PI</em>s and <em>comments</em> to be
present before the document root, so <code>doc-&gt;children</code> may point
to a node which is not the document root element. The function
<code>xmlDocGetRootElement()</code> was added for this purpose.</p>
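
<p>A more defensive traversal, sketched below, starts from
<code>xmlDocGetRootElement()</code> and checks the node type, so that
comments, PIs and whitespace text nodes are skipped. The helper name
<code>count_named_children</code> is hypothetical:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: parse text and count the element children of the
 * root element that are called name. Returns -1 if parsing fails. */
int count_named_children(const char *text, const char *name)
{
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));
    xmlNodePtr root, cur;
    int count = 0;

    if (doc == NULL)
        return -1;

    /* Skips any comment or PI placed before the root element. */
    root = xmlDocGetRootElement(doc);
    for (cur = (root != NULL) ? root-&gt;children : NULL;
         cur != NULL; cur = cur-&gt;next) {
        /* Text nodes holding only whitespace sit between the elements. */
        if (cur-&gt;type == XML_ELEMENT_NODE &amp;&amp;
            xmlStrcmp(cur-&gt;name, (const xmlChar *) name) == 0)
            count++;
    }
    xmlFreeDoc(doc);
    return count;
}</pre>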
844
845<h3><a name="Modifying">Modifying the tree</a></h3>
846
847<p>Functions are provided for reading and writing the document content. Here
848is an excerpt from the <a href="html/gnome-xml-tree.html">tree API</a>:</p>
849<dl>
850  <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const
851  xmlChar *value);</code></dt>
852    <dd><p>This sets (or changes) an attribute carried by an ELEMENT node. The
853      value can be NULL.</p>
854    </dd>
855</dl>
856<dl>
  <dt><code>xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar
  *name);</code></dt>
    <dd><p>This function returns a pointer to a new copy of the property
      content. Note that the user must deallocate the result.</p>
861    </dd>
862</dl>
863
864<p>Two functions are provided for reading and writing the text associated with
865elements:</p>
866<dl>
867  <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar
868  *value);</code></dt>
    <dd><p>This function takes an "external" string and converts it to one text
870      node or possibly to a list of entity and text nodes. All non-predefined
871      entity references like &amp;Gnome; will be stored internally as entity
872      nodes, hence the result of the function may not be a single node.</p>
873    </dd>
874</dl>
875<dl>
876  <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int
877  inLine);</code></dt>
878    <dd><p>This function is the inverse of
879      <code>xmlStringGetNodeList()</code>. It generates a new string
880      containing the content of the text and entity nodes. Note the extra
881      argument inLine. If this argument is set to 1, the function will expand
882      entity references.  For example, instead of returning the &amp;Gnome;
883      XML encoding in the string, it will substitute it with its value (say,
884      "GNU Network Object Model Environment"). Set this argument if you want
885      to use the string for non-XML usage like User Interface.</p>
886    </dd>
887</dl>
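
<p>The ownership rules above can be illustrated with a small sketch. Both
helpers are invented for this example; the assumption is that strings
returned by <code>xmlGetProp()</code> and <code>xmlNodeListGetString()</code>
are released with <code>xmlFree()</code>:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: 1 if the root element carries name="value". */
int root_attr_equals(const char *text, const char *name, const char *value)
{
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));
    xmlNodePtr root;
    xmlChar *prop = NULL;
    int same = 0;

    if (doc == NULL)
        return 0;
    root = xmlDocGetRootElement(doc);
    if (root != NULL)
        prop = xmlGetProp(root, (const xmlChar *) name);
    if (prop != NULL) {
        same = (xmlStrcmp(prop, (const xmlChar *) value) == 0);
        xmlFree(prop);      /* xmlGetProp() handed us a copy */
    }
    xmlFreeDoc(doc);
    return same;
}

/* Hypothetical helper: 1 if the text under the root element equals expected. */
int root_text_equals(const char *text, const char *expected)
{
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));
    xmlNodePtr root;
    xmlChar *str = NULL;
    int same = 0;

    if (doc == NULL)
        return 0;
    root = xmlDocGetRootElement(doc);
    if (root != NULL)
        str = xmlNodeListGetString(doc, root-&gt;children, 1);
    if (str != NULL) {
        same = (xmlStrcmp(str, (const xmlChar *) expected) == 0);
        xmlFree(str);       /* the string is newly allocated as well */
    }
    xmlFreeDoc(doc);
    return same;
}</pre>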
888
889<h3><a name="Saving">Saving a tree</a></h3>
890
891<p>Basically 3 options are possible:</p>
892<dl>
893  <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar**mem, int
894  *size);</code></dt>
    <dd><p>Saves the document into a newly allocated buffer, returned through <code>mem</code> together with its length in <code>size</code>.</p>
896    </dd>
897</dl>
898<dl>
899  <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt>
    <dd><p>Dumps a document to an open file stream.</p>
901    </dd>
902</dl>
903<dl>
904  <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt>
905    <dd><p>Saves the document to a file. In this case, the compression
906      interface is triggered if it has been turned on.</p>
907    </dd>
908</dl>
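
<p>As an illustration of the in-memory option, the sketch below builds a
one-element document and serializes it with <code>xmlDocDumpMemory()</code>.
The helper name <code>dump_example</code> is hypothetical, and
<code>xmlDocSetRootElement()</code> is used here instead of assigning
<code>doc-&gt;children</code> directly; the buffer is assumed to be released
with <code>xmlFree()</code>:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: build a tiny document and return its serialization.
 * The caller releases the returned buffer with xmlFree(). */
xmlChar *dump_example(void)
{
    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    xmlNodePtr root;
    xmlChar *mem = NULL;
    int size = 0;

    if (doc == NULL)
        return NULL;
    root = xmlNewDocNode(doc, NULL, BAD_CAST "EXAMPLE", NULL);
    if (root == NULL) {
        xmlFreeDoc(doc);
        return NULL;
    }
    xmlDocSetRootElement(doc, root);
    xmlSetProp(root, BAD_CAST "prop1", BAD_CAST "gnome is great");

    /* mem receives the buffer, size its length in bytes. */
    xmlDocDumpMemory(doc, &amp;mem, &amp;size);
    xmlFreeDoc(doc);
    return mem;
}</pre>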
909
910<h3><a name="Compressio">Compression</a></h3>
911
<p>The library transparently handles compression when doing file-based
accesses. The level of compression on saves can be set either globally or
individually for one file:</p>
915<dl>
916  <dt><code>int  xmlGetDocCompressMode (xmlDocPtr doc);</code></dt>
917    <dd><p>Gets the document compression ratio (0-9).</p>
918    </dd>
919</dl>
920<dl>
921  <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt>
922    <dd><p>Sets the document compression ratio.</p>
923    </dd>
924</dl>
925<dl>
926  <dt><code>int  xmlGetCompressMode(void);</code></dt>
927    <dd><p>Gets the default compression ratio.</p>
928    </dd>
929</dl>
930<dl>
931  <dt><code>void xmlSetCompressMode(int mode);</code></dt>
932    <dd><p>Sets the default compression ratio.</p>
933    </dd>
934</dl>
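
<p>A short sketch of the per-document setting follows. The helper
<code>roundtrip_compress_mode</code> is made up for the example; it only
shows that the value set on a document can be read back, and does not write
any file:</p>
<pre>#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: set the compression level on a fresh document and
 * return the value read back, or -1 if the document cannot be created. */
int roundtrip_compress_mode(int mode)
{
    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    int got;

    if (doc == NULL)
        return -1;
    xmlSetDocCompressMode(doc, mode);   /* 0 means no compression */
    got = xmlGetDocCompressMode(doc);
    xmlFreeDoc(doc);
    return got;
}</pre>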
935
936<h2><a name="Entities">Entities or no entities</a></h2>
937
938<p>Entities in principle are similar to simple C macros. An entity defines an
939abbreviation for a given string that you can reuse many times throughout the
940content of your document. Entities are especially useful when a given string
941may occur frequently within a document, or to confine the change needed to a
942document to a restricted area in the internal subset of the document (at the
943beginning). Example:</p>
944<pre>1 &lt;?xml version="1.0"?&gt;
9452 &lt;!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
9463 &lt;!ENTITY xml "Extensible Markup Language"&gt;
9474 ]&gt;
9485 &lt;EXAMPLE&gt;
9496    &amp;xml;
9507 &lt;/EXAMPLE&gt;</pre>
951
<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing
its name with '&amp;' and following it by ';' without any spaces added. There
are 5 predefined entities in libxml allowing you to escape characters with
predefined meaning in some parts of the xml document content:
<strong>&amp;lt;</strong> for the character '&lt;', <strong>&amp;gt;</strong>
for the character '&gt;',  <strong>&amp;apos;</strong> for the character ''',
<strong>&amp;quot;</strong> for the character '"', and
<strong>&amp;amp;</strong> for the character '&amp;'.</p>
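
<p>To make the five predefined entities concrete, here is a plain C sketch
that escapes them by hand. <code>escape_predefined</code> is a hypothetical
helper written for this page, not a libxml function; libxml performs the
equivalent conversion itself when saving a document:</p>
<pre>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Hypothetical helper, not part of libxml: return a malloc'ed copy of in
 * with the five special characters replaced by predefined entities. */
char *escape_predefined(const char *in)
{
    /* Worst case: every character becomes a 6-byte reference like &amp;quot; */
    char *out = malloc(strlen(in) * 6 + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    for (; *in != '\0'; in++) {
        const char *rep = NULL;

        switch (*in) {
            case '&lt;':  rep = "&amp;lt;";   break;
            case '&gt;':  rep = "&amp;gt;";   break;
            case '&amp;':  rep = "&amp;amp;";  break;
            case '\'': rep = "&amp;apos;"; break;
            case '"':  rep = "&amp;quot;"; break;
            default:   break;
        }
        if (rep != NULL) {
            strcpy(p, rep);
            p += strlen(rep);
        } else {
            *p++ = *in;
        }
    }
    *p = '\0';
    return out;
}</pre>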
960
<p>One of the problems related to entities is that you may want the parser to
substitute an entity's content so that you can see the replacement text in
your application. Or you may prefer to keep entity references as such in the
content to be able to save the document back without losing this usually
precious information (if the user went through the pain of explicitly defining
entities, they may have a rather negative attitude if you blindly substitute
them at save time). The <a
href="html/gnome-xml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a>
function allows you to check and change the behaviour, which is to not
substitute entities by default.</p>
971
972<p>Here is the DOM tree built by libxml for the previous document in the
973default case:</p>
974<pre>/gnome/src/gnome-xml -&gt; /xmllint --debug test/ent1
975DOCUMENT
976version=1.0
977   ELEMENT EXAMPLE
978     TEXT
979     content=
980     ENTITY_REF
981       INTERNAL_GENERAL_ENTITY xml
982       content=Extensible Markup Language
983     TEXT
984     content=</pre>
985
986<p>And here is the result when substituting entities:</p>
987<pre>/gnome/src/gnome-xml -&gt; /tester --debug --noent test/ent1
988DOCUMENT
989version=1.0
990   ELEMENT EXAMPLE
991     TEXT
992     content=     Extensible Markup Language</pre>
993
<p>So, entities or no entities? Basically, it depends on your use case. I
suggest that you keep the non-substituting default behaviour and avoid using
entities in your XML document or data if you are not willing to handle the
entity reference nodes in the DOM tree.</p>
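
<p>The difference between the two modes can be checked programmatically. The
sketch below counts the entity reference nodes under the root element with
and without substitution; <code>count_entity_refs</code> is a hypothetical
helper, and it assumes the entity is declared in the internal subset so that
no external DTD has to be loaded:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: parse text with entity substitution switched on or
 * off and count the ENTITY_REF children of the root element.
 * Returns -1 if parsing fails. */
int count_entity_refs(const char *text, int substitute)
{
    /* xmlSubstituteEntitiesDefault() returns the previous setting. */
    int previous = xmlSubstituteEntitiesDefault(substitute);
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));
    xmlNodePtr root, cur;
    int count = 0;

    xmlSubstituteEntitiesDefault(previous);     /* restore the global */
    if (doc == NULL)
        return -1;
    root = xmlDocGetRootElement(doc);
    for (cur = (root != NULL) ? root-&gt;children : NULL;
         cur != NULL; cur = cur-&gt;next) {
        if (cur-&gt;type == XML_ENTITY_REF_NODE)
            count++;
    }
    xmlFreeDoc(doc);
    return count;
}</pre>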
998
<p>Note that at save time libxml enforces the conversion of the predefined
entities where necessary to prevent well-formedness problems. When finding
them in the input, it transparently replaces them with the corresponding
characters (i.e., it will not generate entity reference elements in the DOM
tree or call the reference() SAX callback).</p>
1004
<p><span style="background-color: #FF0000">WARNING</span>: handling entities
on top of the libxml SAX interface is difficult! If you plan to use
non-predefined entities in your documents, then the learning curve to handle
them using the SAX API may be long. If you plan to use complex documents, I
strongly suggest you consider using the DOM interface instead and let libxml
deal with the complexity rather than trying to do it yourself.</p>
1011
1012<h2><a name="Namespaces">Namespaces</a></h2>
1013
<p>The libxml library implements <a
href="http://www.w3.org/TR/REC-xml-names/">XML namespaces</a> support by
recognizing namespace constructs in the input, and does namespace lookup
1017automatically when building the DOM tree. A namespace declaration is
1018associated with an in-memory structure and all elements or attributes within
1019that namespace point to it. Hence testing the namespace is a simple and fast
1020equality operation at the user level.</p>
1021
<p>I suggest that people using libxml use a namespace, and declare it in the
root element of their document as the default namespace. Then they don't need
to use the prefix in the content, but we will have a basis for future semantic
refinement and merging of data from different sources. This doesn't
significantly increase the size of the XML output, but it significantly
increases its value in the long term. Example:</p>
1028<pre>&lt;mydoc xmlns="http://mydoc.example.org/schemas/"&gt;
1029   &lt;elem1&gt;...&lt;/elem1&gt;
1030   &lt;elem2&gt;...&lt;/elem2&gt;
1031&lt;/mydoc&gt;</pre>
1032
<p>The namespace value has to be a URL, but the URL doesn't
have to point to any existing resource on the Web. It will bind all the
elements and attributes with that URL. I suggest using a URL within a domain
you control, and that the URL should contain some kind of version information
if possible. For example, <code>"http://www.gnome.org/gnumeric/1.0/"</code> is
a good namespace scheme.</p>
1039
<p>Then when you load a file, make sure that a namespace carrying the
version-independent prefix is installed on the root element of your document,
and if the version information doesn't match something you know, warn the user
and be liberal in what you accept as the input. Also do *not* try to base
namespace checking on the prefix value. &lt;foo:text&gt; may be exactly the
same as &lt;bar:text&gt; in another document. What really matters is the URI
associated with the element or the attribute, not the prefix string (which is
just a shortcut for the full URI). In libxml, elements and attributes have an
<code>ns</code> field pointing to an xmlNs structure detailing the namespace
prefix and its URI.</p>
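
<p>A hedged sketch of a URI-based check: the helper below looks at the
<code>href</code> field of the xmlNs structure attached to the root element
and ignores the prefix entirely. The helper name
<code>root_in_namespace</code> and the example.org URI used to exercise it
are assumptions made for this illustration:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: 1 if the root element of text is bound to the
 * namespace identified by uri, whatever prefix the document uses. */
int root_in_namespace(const char *text, const char *uri)
{
    xmlDocPtr doc = xmlParseMemory(text, (int) strlen(text));
    xmlNodePtr root;
    int match = 0;

    if (doc == NULL)
        return 0;
    root = xmlDocGetRootElement(doc);
    /* Compare the namespace URI (ns-&gt;href), never ns-&gt;prefix. */
    if (root != NULL &amp;&amp; root-&gt;ns != NULL)
        match = xmlStrEqual(root-&gt;ns-&gt;href, (const xmlChar *) uri);
    xmlFreeDoc(doc);
    return match;
}</pre>

<p>Within a single document, comparing the <code>ns</code> pointers directly
is enough; comparing URIs is what matters when the namespace structure to
test against is not at hand.</p>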
1050
1054
<p>People usually object to using namespaces when validation is involved. I
disagree, and will make sure that using namespaces won't break validity
checking, so even if you plan to use or currently are using validation I
strongly suggest adding namespaces to your document. A default namespace
scheme <code>xmlns="http://...."</code> should not break validity even on less
flexible parsers. However, using namespaces to mix and differentiate content
coming from multiple DTDs will certainly break current validation schemes. I
will try to provide ways to do this, but this may not be portable or
standardized.</p>
1063
<h2><a name="Validation">Validation, or are you afraid of DTDs?</a></h2>

<p>Well, what is validation and what is a DTD?</p>

<p>Validation is the process of checking a document against a set of
construction rules; a <strong>DTD</strong> (Document Type Definition) is such
a set of rules.</p>
1071
<p>The validation process and building DTDs are the two most difficult parts
of the XML life cycle. Briefly, a DTD defines all the possible elements to be
found within your document and the formal shape of your document tree (by
defining the allowed content of an element: either text, a regular expression
for the allowed list of children, or mixed content, i.e. both text and
children). The DTD also defines the allowed attributes for all elements and
the types of the attributes. For more detailed information, I suggest reading
the related parts of the XML specification, the examples found under
gnome-xml/test/valid/dtd and the large number of books available on XML. The
dia example in gnome-xml/test/valid should be both simple and complete enough
to allow you to build your own.</p>
1083
<p>A word of warning: building a good DTD which will fit the needs of your
application in the long term is far from trivial. However, the extra level of
quality it can ensure is well worth the price for some sets of applications or
if you already have a DTD defined for your application field.</p>
1088
<p>The validation is not completely finished but is in a (very IMHO) usable
state. Until a real validation interface is defined, the way to do it is to
declare the <strong>xmlDoValidityCheckingDefaultValue</strong> external
variable and set it to 1. This will of course be changed at some point:</p>
1093
<pre>extern int xmlDoValidityCheckingDefaultValue;

...

xmlDoValidityCheckingDefaultValue = 1;</pre>
1101
<p>To handle external entities, use the function
<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to
link your HTTP/FTP/Entities database library into the standard libxml
core.</p>
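
<p>The validity flag described earlier in this section can be exercised with
a sketch along the following lines. It is only an illustration under
assumptions: the helper name <code>is_valid</code> is invented, the DTD is
expected in the internal subset, and the parser context calls
<code>xmlCreateMemoryParserCtxt()</code> and <code>xmlParseDocument()</code>
as well as the <code>valid</code> and <code>wellFormed</code> context fields
are used here although this page does not otherwise describe them:</p>
<pre>#include &lt;string.h&gt;
#include &lt;libxml/parser.h&gt;
#include &lt;libxml/tree.h&gt;

/* Hypothetical helper: 1 if text is well-formed and valid against the DTD
 * it carries, 0 otherwise. */
int is_valid(const char *text)
{
    int previous = xmlDoValidityCheckingDefaultValue;
    xmlParserCtxtPtr ctxt;
    int valid;

    /* New parser contexts pick up the global default when created. */
    xmlDoValidityCheckingDefaultValue = 1;
    ctxt = xmlCreateMemoryParserCtxt(text, (int) strlen(text));
    xmlDoValidityCheckingDefaultValue = previous;
    if (ctxt == NULL)
        return 0;

    xmlParseDocument(ctxt);
    valid = ctxt-&gt;wellFormed &amp;&amp; ctxt-&gt;valid;

    /* The tree is not freed along with the context. */
    if (ctxt-&gt;myDoc != NULL)
        xmlFreeDoc(ctxt-&gt;myDoc);
    xmlFreeParserCtxt(ctxt);
    return valid;
}</pre>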
1106
1108
1109<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2>
1110
<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object
Model</em>; this is an API for accessing XML or HTML structured documents.
Native support for DOM in Gnome is on the way (module gnome-dom), and it will
be based on gnome-xml. This will be a far cleaner interface to manipulate XML
files within Gnome since it won't expose the internal structure. DOM defines a
set of IDL (or Java) interfaces allowing you to traverse and manipulate a
document. The DOM library will allow accessing and modifying "live" documents
present in other programs like this:</p>
1119
1120<p><img src="DOM.gif" alt=" DOM.gif "></p>
1121
<p>This should greatly help with things like modifying a gnumeric spreadsheet
embedded in a GWP document, for example.</p>
1124
<p>The current DOM implementation on top of libxml is the <a
href="http://cvs.gnome.org/lxr/source/gdome/">gdome Gnome module</a>. This is
a full DOM interface, thanks to <a href="mailto:raph@levien.com">Raph
Levien</a>.</p>

<p>The gnome-dom module in the Gnome CVS base is obsolete.</p>
1131
1132<h2><a name="Example"></a><a name="real">A real example</a></h2>
1133
<p>Here is a real-size example, where the actual content of the application
1135data is not kept in the DOM tree but uses internal structures. It is based on
1136a proposal to keep a database of jobs related to Gnome, with an XML based
1137storage structure. Here is an <a href="gjobs.xml">XML encoded jobs
1138base</a>:</p>
1139<pre>&lt;?xml version="1.0"?&gt;
1140&lt;gjob:Helping xmlns:gjob="http://www.gnome.org/some-location"&gt;
1141  &lt;gjob:Jobs&gt;
1142
1143    &lt;gjob:Job&gt;
1144      &lt;gjob:Project ID="3"/&gt;
1145      &lt;gjob:Application&gt;GBackup&lt;/gjob:Application&gt;
1146      &lt;gjob:Category&gt;Development&lt;/gjob:Category&gt;
1147
1148      &lt;gjob:Update&gt;
1149        &lt;gjob:Status&gt;Open&lt;/gjob:Status&gt;
1150        &lt;gjob:Modified&gt;Mon, 07 Jun 1999 20:27:45 -0400 MET DST&lt;/gjob:Modified&gt;
1151        &lt;gjob:Salary&gt;USD 0.00&lt;/gjob:Salary&gt;
1152      &lt;/gjob:Update&gt;
1153
1154      &lt;gjob:Developers&gt;
1155        &lt;gjob:Developer&gt;
1156        &lt;/gjob:Developer&gt;
1157      &lt;/gjob:Developers&gt;
1158
1159      &lt;gjob:Contact&gt;
1160        &lt;gjob:Person&gt;Nathan Clemons&lt;/gjob:Person&gt;
1161        &lt;gjob:Email&gt;nathan@windsofstorm.net&lt;/gjob:Email&gt;
1162        &lt;gjob:Company&gt;
1163        &lt;/gjob:Company&gt;
1164        &lt;gjob:Organisation&gt;
1165        &lt;/gjob:Organisation&gt;
1166        &lt;gjob:Webpage&gt;
1167        &lt;/gjob:Webpage&gt;
1168        &lt;gjob:Snailmail&gt;
1169        &lt;/gjob:Snailmail&gt;
1170        &lt;gjob:Phone&gt;
1171        &lt;/gjob:Phone&gt;
1172      &lt;/gjob:Contact&gt;
1173
1174      &lt;gjob:Requirements&gt;
1175      The program should be released as free software, under the GPL.
1176      &lt;/gjob:Requirements&gt;
1177
1178      &lt;gjob:Skills&gt;
1179      &lt;/gjob:Skills&gt;
1180
1181      &lt;gjob:Details&gt;
1182      A GNOME based system that will allow a superuser to configure 
1183      compressed and uncompressed files and/or file systems to be backed 
1184      up with a supported media in the system.  This should be able to 
1185      perform via find commands generating a list of files that are passed 
1186      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine 
1187      or via operations performed on the filesystem itself. Email 
1188      notification and GUI status display very important.
1189      &lt;/gjob:Details&gt;
1190
1191    &lt;/gjob:Job&gt;
1192
1193  &lt;/gjob:Jobs&gt;
1194&lt;/gjob:Helping&gt;</pre>
1195
<p>While loading the XML file into an internal DOM tree is a matter of calling
only a couple of functions, browsing the tree to gather the information and
generate the internal structures is harder, and more error prone.</p>
1199
<p>The suggested principle is to be tolerant with respect to the input
structure. For example, the ordering of the attributes is not significant;
the XML specification is clear about it. It's also usually a good idea not to
depend on the order of the children of a given node, unless it really
makes things harder. Here is some code to parse the information for a
person:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/* trace macro (assumed definition; the full example may differ) */
#define DEBUG(x) printf(x)

/*
 * A person record
 */
typedef struct person {
    char *name;
    char *email;
    char *company;
    char *organisation;
    char *smail;
    char *webPage;
    char *phone;
} person, *personPtr;

/*
 * And the code needed to parse it
 */
personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    personPtr ret = NULL;

    DEBUG("parsePerson\n");
    /*
     * allocate the struct
     */
    ret = (personPtr) malloc(sizeof(person));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(person));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        /* node names are xmlChar strings, so compare them with xmlStrcmp() */
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Person")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;name = (char *) xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Email")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;email = (char *) xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>
1248
<p>Here are a couple of things to notice:</p>
<ul>
  <li>Usually a recursive parsing style is the more convenient one: XML data
    is by nature subject to repetitive constructs and usually exhibits highly
    structured patterns.</li>
  <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em>, i.e.
    the pointer to the global XML document and the namespace reserved to the
    application. Document-wide information is needed, for example, to decode
    entities, and it's good coding practice to define a namespace for your
    application's set of data and test that the elements and attributes you're
    analyzing actually pertain to your application space. This is done by a
    simple equality test (cur-&gt;ns == ns).</li>
  <li>To retrieve text and attribute values, it is suggested to use the
    function <em>xmlNodeListGetString</em> to gather all the text and entity
    reference nodes generated by the DOM output and produce a single text
    string.</li>
</ul>
1266
1267<p>Here is another piece of code used to parse another level of the
1268structure:</p>
<pre>#include &lt;libxml/tree.h&gt;
/*
 * a Description for a Job
 */
typedef struct job {
    char *projectID;
    char *application;
    char *category;
    personPtr contact;
    int nbDevelopers;
    personPtr developers[100]; /* using dynamic alloc is left as an exercise */
} job, *jobPtr;

/*
 * And the code needed to parse it
 */
jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    jobPtr ret = NULL;

    DEBUG("parseJob\n");
    /*
     * allocate the struct
     */
    ret = (jobPtr) malloc(sizeof(job));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(job));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        /* node names are xmlChar strings, so compare them with xmlStrcmp() */
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Project")) &amp;&amp; (cur-&gt;ns == ns)) {
            ret-&gt;projectID = (char *) xmlGetProp(cur, (const xmlChar *) "ID");
            if (ret-&gt;projectID == NULL) {
                fprintf(stderr, "Project has no ID\n");
            }
        }
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Application")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;application = (char *) xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Category")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;category = (char *) xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        /* nested structures are handled by the matching parse function */
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Contact")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;contact = parsePerson(doc, ns, cur);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>
1320
<p>One can notice that, once used to it, writing this kind of code is quite
simple, but boring. Ultimately, it could be possible to write stubbers taking
either C data structure definitions, a set of XML examples or an XML DTD and
producing the code needed to import and export the content between C data and
XML storage. This is left as an exercise to the reader :-)</p>
1326
<p>Feel free to use <a href="example/gjobread.c">the code for the full C
parsing example</a> as a template. It is also available with a Makefile in the
Gnome CVS base under gnome-xml/example.</p>
1330
1331<h2><a name="Contributi">Contributions</a></h2>
1332<ul>
1333  <li><a href="mailto:ari@btigate.com">Ari Johnson</a> provides a  C++ wrapper
1334    for libxml:
1335    <p>Website: <a
1336    href="http://lusis.org/~ari/xml++/">http://lusis.org/~ari/xml++/</a></p>
1337    <p>Download: <a
1338    href="http://lusis.org/~ari/xml++/libxml++.tar.gz">http://lusis.org/~ari/xml++/libxml++.tar.gz</a></p>
1339  </li>
1340  <li><a href="mailto:doolin@cs.utk.edu">David Doolin</a> provides a
1341    precompiled Windows version
1342    <p><a
1343    href="http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/">http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/</a></p>
1344  </li>
1345  <li><a href="mailto:fnatter@gmx.net">Felix Natter</a> provided <a
1346    href="libxml-doc.el">an emacs module</a> to lookup libxml functions
1347    documentation</li>
1348  <li><a href="mailto:sherwin@nlm.nih.gov">Ziying Sherwin</a> provided <a
1349    href="http://xmlsoft.org/messages/0488.html">man pages</a> (not yet
1350    integrated in the distribution)</li>
1351</ul>
1352
1353<p></p>
1354
1355<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>
1356
1357<p>$Id: xml.html,v 1.56 2000/10/21 09:25:52 veillard Exp $</p>
1358</body>
1359</html>
1360