<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
3<html>
4<head>
5  <title>The XML C library for Gnome</title>
6  <meta name="GENERATOR" content="amaya V4.1">
7  <meta http-equiv="Content-Type" content="text/html">
8</head>
9
10<body bgcolor="#ffffff">
11<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
12alt="Gnome Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png"
13alt="W3C Logo"></a></p>
14
15<h1 align="center">The XML C library for Gnome</h1>
16
17<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2>
18
19<p></p>
20<ul>
21  <li><a href="#Introducti">Introduction</a></li>
22  <li><a href="#Documentat">Documentation</a></li>
23  <li><a href="#Reporting">Reporting bugs and getting help</a></li>
  <li><a href="#help">How to help</a></li>
25  <li><a href="#Downloads">Downloads</a></li>
26  <li><a href="#News">News</a></li>
27  <li><a href="#XML">XML</a></li>
28  <li><a href="#XSLT">XSLT</a></li>
29  <li><a href="#tree">The tree output</a></li>
30  <li><a href="#interface">The SAX interface</a></li>
31  <li><a href="#library">The XML library interfaces</a>
32    <ul>
33      <li><a href="#Invoking">Invoking the parser: the pull way</a></li>
34      <li><a href="#Invoking">Invoking the parser: the push way</a></li>
35      <li><a href="#Invoking2">Invoking the parser: the SAX interface</a></li>
36      <li><a href="#Building">Building a tree from scratch</a></li>
37      <li><a href="#Traversing">Traversing the tree</a></li>
38      <li><a href="#Modifying">Modifying the tree</a></li>
39      <li><a href="#Saving">Saving the tree</a></li>
40      <li><a href="#Compressio">Compression</a></li>
41    </ul>
42  </li>
43  <li><a href="#Entities">Entities or no entities</a></li>
44  <li><a href="#Namespaces">Namespaces</a></li>
45  <li><a href="#Validation">Validation</a></li>
46  <li><a href="#Principles">DOM principles</a></li>
47  <li><a href="#real">A real example</a></li>
48  <li><a href="#Contributi">Contributions</a></li>
49</ul>
50
51<p>Separate documents:</p>
52<ul>
53  <li><a href="upgrade.html">upgrade instructions for migrating to
54  libxml2</a></li>
55  <li><a href="encoding.html">libxml Internationalization support</a></li>
56  <li><a href="xmlio.html">libxml Input/Output interfaces</a></li>
57  <li><a href="xmlmem.html">libxml Memory interfaces</a></li>
58  <li><a href="xmldtd.html">a short introduction about DTDs and
59  libxml</a></li>
60  <li><a href="http://xmlsoft.org/XSLT/">the libxslt page</a></li>
61</ul>
62
63<h2><a name="Introducti">Introduction</a></h2>
64
65<p>This document describes libxml, the <a
href="http://www.w3.org/XML/">XML</a> C library developed for the <a
67href="http://www.gnome.org/">Gnome</a> project. <a
68href="http://www.w3.org/XML/">XML is a standard</a> for building tag-based
69structured documents/data.</p>
70
71<p>Here are some key points about libxml:</p>
72<ul>
73  <li>Libxml exports Push and Pull type parser interfaces for both XML and
74    HTML.</li>
75  <li>Libxml can do DTD validation at parse time, using a parsed document
76    instance, or with an arbitrary DTD.</li>
77  <li>Libxml now includes nearly complete <a
78    href="http://www.w3.org/TR/xpath">XPath</a> and <a
79    href="http://www.w3.org/TR/xptr">XPointer</a> implementations.</li>
80  <li>It is written in plain C, making as few assumptions as possible, and
81    sticking closely to ANSI C/POSIX for easy embedding. Works on
82    Linux/Unix/Windows, ported to a number of other platforms.</li>
  <li>Basic HTTP and FTP client support, allowing applications to fetch
    remote resources.</li>
  <li>The design is modular; most of the extensions can be compiled out.</li>
  <li>The internal document representation is as close as possible to the <a
87    href="http://www.w3.org/DOM/">DOM</a> interfaces.</li>
  <li>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX-like
    interface</a>; the interface is designed to be compatible with <a
90    href="http://www.jclark.com/xml/expat.html">Expat</a>.</li>
91  <li>This library is released both under the <a
92    href="http://www.w3.org/Consortium/Legal/copyright-software-19980720.html">W3C
93    IPR</a> and the <a href="http://www.gnu.org/copyleft/lesser.html">GNU
    LGPL</a>. Use either at your convenience; this should make everybody
    happy. If not, drop me a mail.</li>
96</ul>
97
98<p>Warning: unless you are forced to because your application links with a
99Gnome library requiring it,  <strong><span
100style="background-color: #FF0000">Do Not Use libxml1</span></strong>, use
libxml2 instead.</p>
102
103<h2><a name="Documentat">Documentation</a></h2>
104
105<p>There are some on-line resources about using libxml:</p>
106<ol>
107  <li>Check the <a href="FAQ.html">FAQ</a></li>
108  <li>Check the <a href="http://xmlsoft.org/html/libxml-lib.html">extensive
109    documentation</a> automatically extracted from code comments (using <a
110    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gtk-doc">gtk
111    doc</a>).</li>
112  <li>Look at the documentation about <a href="encoding.html">libxml
113    internationalization support</a></li>
114  <li>This page provides a global overview and <a href="#real">some
115    examples</a> on how to use libxml.</li>
116  <li><a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
117    href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
118    documentation</a> explaining how to use the libxml SAX interface.</li>
119  <li>George Lebl wrote <a
120    href="http://www-4.ibm.com/software/developer/library/gnome3/">an article
121    for IBM developerWorks</a> about using libxml.</li>
  <li>It is also a good idea to check <a href="mailto:raph@levien.com">Raph
    Levien</a>'s <a href="http://levien.com/gnome/">web site</a> since he is
    building the <a href="http://levien.com/gnome/gdome.html">DOM interface
    gdome</a> on top of the libxml result tree and an implementation of <a
126    href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a
127    href="http://www.levien.com/svg/">gill</a>. Check his <a
128    href="http://www.levien.com/gnome/domination.html">DOMination
129  paper</a>.</li>
130  <li>Check <a href="http://cvs.gnome.org/lxr/source/gnome-xml/TODO">the TODO
131    file</a></li>
132  <li>Read the <a href="upgrade.html">1.x to 2.x upgrade path</a>. If you are
133    starting a new project using libxml you should really use the 2.x
134  version.</li>
135  <li>And don't forget to look at the <a href="/messages/">mailing-list
136    archive</a>.</li>
137</ol>
138
139<h2><a name="Reporting">Reporting bugs and getting help</a></h2>
140
141<p>Well, bugs or missing features are always possible, and I will make a point
142of fixing them in a timely fashion. The best way to report a bug is to use the
143<a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">Gnome bug
144tracking database</a> (make sure to use the "libxml" module name). I look at
145reports there regularly and it's good to have a reminder when a bug is still
146open. Check the <a
147href="http://bugzilla.gnome.org/bugwritinghelp.html">instructions on reporting
148bugs</a> and be sure to specify that the bug is for the package libxml.</p>
149
150<p>There is also a mailing-list <a
151href="mailto:xml@gnome.org">xml@gnome.org</a> for libxml, with an  <a
152href="http://mail.gnome.org/archives/xml/">on-line archive</a> (<a
153href="http://xmlsoft.org/messages">old</a>). To subscribe to this list, please
154visit the <a href="http://mail.gnome.org/mailman/listinfo/xml">associated
155Web</a> page and follow the instructions.</p>
156
<p>Alternatively, you can just send the bug to the <a
href="mailto:xml@gnome.org">xml@gnome.org</a> list; if it's really libxml
related I will approve it. Please do not send me mail directly, especially for
portability problems: it makes things much harder to track, and in some cases
I'm not the best person to answer a given question. Ask the list instead.</p>
162
163<p>Of course, bugs reported with a suggested patch for fixing them will
164probably be processed faster.</p>
165
<p>If you're looking for help, a quick look at <a
href="http://xmlsoft.org/messages/#407">the list archive</a> may actually
provide the answer; I usually send source samples when answering libxml usage
questions. The <a href="http://xmlsoft.org/html/book1.html">auto-generated
documentation</a> is not as polished as I would like (I need to learn more
about Docbook), but it's a good starting point.</p>
172
173<h2><a name="help">How to help</a></h2>
174
<p>You can help the project in various ways. The best thing to do first is to
subscribe to the mailing-list as explained before, then check the <a
href="http://xmlsoft.org/messages/">archives</a> and the <a
href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug
database</a>:</p>
180<ol>
181  <li>provide patches when you find problems</li>
  <li>provide the diffs when you port libxml to a new platform. They may not
    be integrated in all cases but help pinpoint portability problems.</li>
  <li>provide documentation fixes (either as patches to the code comments or
    as HTML diffs).</li>
  <li>provide new documentation pieces (translations, examples, etc.)</li>
188  <li>Check the TODO file and try to close one of the items</li>
189  <li>take one of the points raised in the archive or the bug database and
190    provide a fix. <a href="mailto:Daniel.Veillard@imag.fr">Get in touch with
    me</a> beforehand to avoid synchronization problems and to check that the
    suggested fix will fit in nicely :-)</li>
193</ol>
194
195<h2><a name="Downloads">Downloads</a></h2>
196
197<p>The latest versions of libxml can be found on <a
198href="ftp://xmlsoft.org/">xmlsoft.org</a> or on the <a
199href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a> either
200as a <a href="ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/">source
201archive</a> or <a
202href="ftp://ftp.gnome.org/pub/GNOME/stable/redhat/i386/libxml/">RPM
203packages</a>. (NOTE that you need both the <a
204href="http://rpmfind.net/linux/RPM/libxml2.html">libxml(2)</a> and <a
205href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml(2)-devel</a>
206packages installed to compile applications using libxml.)</p>
207
208<p><a name="Snapshot">Snapshot:</a></p>
209<ul>
210  <li>Code from the W3C cvs base libxml <a
211    href="ftp://xmlsoft.org/cvs-snapshot.tar.gz">cvs-snapshot.tar.gz</a></li>
212  <li>Docs, content of the web site, the list archive included <a
213    href="ftp://xmlsoft.org/libxml-docs.tar.gz">libxml-docs.tar.gz</a></li>
214</ul>
215
216<p><a name="Contribs">Contribs:</a></p>
217
<p>I do accept external contributions, especially if compiled on another
platform; get in touch with me to upload the package. I will keep them in the
<a href="ftp://xmlsoft.org/contribs/">contrib directory</a>.</p>
221
222<p>Libxml is also available from CVS:</p>
223<ul>
224  <li><p>The <a
225    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gnome-xml">Gnome
226    CVS base</a>. Check the <a
227    href="http://developer.gnome.org/tools/cvs.html">Gnome CVS Tools</a> page;
228    the CVS module is <b>gnome-xml</b>.</p>
229  </li>
230  <li>The <strong>libxslt</strong> module is also present there</li>
231</ul>
232
233<h2><a name="News">News</a></h2>
234
235<h3>CVS only : check the <a
236href="http://cvs.gnome.org/lxr/source/gnome-xml/ChangeLog">Changelog</a> file
237for a really accurate description</h3>
238
<p>Items floating around but not actively worked on; get in touch with me if
you want to test those:</p>
241<ul>
242  <li>Implementing <a href="http://xmlsoft.org/XSLT">XSLT</a>, this is done as
243    a separate C library on top of libxml called libxslt, not released yet but
244    available from CVS</li>
245  <li>Finishing up <a href="http://www.w3.org/TR/xptr">XPointer</a> and <a
246    href="http://www.w3.org/TR/xinclude">XInclude</a></li>
  <li>(seems to be working, but delayed from release) parsing/import of
    Docbook SGML docs</li>
249</ul>
250
251<h3>2.3.10: June 1 2001</h3>
252<ul>
253  <li>fixed the SGML catalog support</li>
254  <li>a number of reported bugs got fixed, in XPath, iconv detection, XInclude
255    processing</li>
  <li>XPath string functions should now handle Unicode correctly</li>
257</ul>
258
259<h3>2.3.9: May 19 2001</h3>
260
261<p>Lots of bugfixes, and added a basic SGML catalog support:</p>
262<ul>
  <li>HTML push bugfix #54891 and another patch from Jonas Borgström</li>
264  <li>some serious speed optimisation again</li>
265  <li>some documentation cleanups</li>
266  <li>trying to get better linking on solaris (-R)</li>
267  <li>XPath API cleanup from Thomas Broyer</li>
268  <li>Validation bug fixed #54631, added a patch from Gary Pennington, fixed
269    xmlValidGetValidElements()</li>
270  <li>Added an INSTALL file</li>
271  <li>Attribute removal added to API: #54433</li>
272  <li>added a basic support for SGML catalogs</li>
273  <li>fixed xmlKeepBlanksDefault(0) API</li>
274  <li>bugfix in xmlNodeGetLang()</li>
275  <li>fixed a small configure portability problem</li>
276  <li>fixed an inversion of SYSTEM and PUBLIC identifier in HTML document</li>
277</ul>
278
279<h3>1.8.13: May 14 2001</h3>
280<ul>
  <li>bug-fix release of the old libxml1 branch used by Gnome</li>
282</ul>
283
284<h3>2.3.8: May 3 2001</h3>
285<ul>
286  <li>Integrated an SGML DocBook parser for the Gnome project</li>
287  <li>Fixed a few things in the HTML parser</li>
288  <li>Fixed some XPath bugs raised by XSLT use, tried to fix the floating
289    point portability issue</li>
290  <li>Speed improvement (8M/s for SAX, 3M/s for DOM, 1.5M/s for DOM+validation
291    using the XML REC as input and a 700MHz celeron).</li>
292  <li>incorporated more Windows cleanup</li>
293  <li>added xmlSaveFormatFile()</li>
294  <li>fixed problems in copying nodes with entities references (gdome)</li>
295  <li>removed some troubles surrounding the new validation module</li>
296</ul>
297
298<h3>2.3.7: April 22 2001</h3>
299<ul>
300  <li>lots of small bug fixes, corrected XPointer</li>
  <li>Non-deterministic content model validation support</li>
302  <li>added xmlDocCopyNode for gdome2</li>
303  <li>revamped the way the HTML parser handles end of tags</li>
  <li>XPath: corrections of namespace support and number formatting</li>
  <li>Windows: Igor Zlatkovic patches for MSC compilation</li>
  <li>HTML output fixes from P C Chow and William M. Brack</li>
  <li>Improved validation speed, noticeable for DocBook</li>
308  <li>fixed a big bug with ID declared in external parsed entities</li>
309  <li>portability fixes, update of Trio from Bjorn Reese</li>
310</ul>
311
312<h3>2.3.6: April 8 2001</h3>
313<ul>
  <li>Code cleanup using extreme gcc compiler warning options, found and
    cleared half a dozen potential problems</li>
  <li>the Eazel team found an XML parser bug</li>
  <li>cleaned up the use of some of the string formatting functions; used the
    trio library code to provide the ones needed when the platform is missing
    them</li>
320  <li>xpath: removed a memory leak and fixed the predicate evaluation problem,
321    extended the testsuite and cleaned up the result. XPointer seems broken
322    ...</li>
323</ul>
324
325<h3>2.3.5: Mar 23 2001</h3>
326<ul>
  <li>Biggest change is separate parsing and evaluation of XPath expressions;
    there are some new APIs for this too</li>
  <li>included a number of bug fixes (XML push parser, 51876, notations,
330  52299)</li>
331  <li>Fixed some portability issues</li>
332</ul>
333
334<h3>2.3.4: Mar 10 2001</h3>
335<ul>
336  <li>Fixed bugs #51860 and #51861</li>
337  <li>Added a global variable xmlDefaultBufferSize to allow default buffer
338    size to be application tunable.</li>
339  <li>Some cleanup in the validation code, still a bug left and this part
340    should probably be rewritten to support ambiguous content model :-\</li>
341  <li>Fix a couple of serious bugs introduced or raised by changes in 2.3.3
342    parser</li>
343  <li>Fixed another bug in xmlNodeGetContent()</li>
344  <li>Bjorn fixed XPath node collection and Number formatting</li>
345  <li>Fixed a loop reported in the HTML parsing</li>
  <li>blank spaces are reported even if the Dtd content model proves that they
    are formatting spaces; this is for XML conformance</li>
348</ul>
349
350<h3>2.3.3: Mar 1 2001</h3>
351<ul>
352  <li>small change in XPath for XSLT</li>
353  <li>documentation cleanups</li>
354  <li>fix in validation by Gary Pennington</li>
  <li>serious parsing performance improvements</li>
356</ul>
357
358<h3>2.3.2: Feb 24 2001</h3>
359<ul>
360  <li>chasing XPath bugs, found a bunch, completed some TODO</li>
361  <li>fixed a Dtd parsing bug</li>
362  <li>fixed a bug in xmlNodeGetContent</li>
363  <li>ID/IDREF support partly rewritten by Gary Pennington</li>
364</ul>
365
366<h3>2.3.1: Feb 15 2001</h3>
367<ul>
368  <li>some XPath and HTML bug fixes for XSLT</li>
369  <li>small extension of the hash table interfaces for DOM gdome2
370    implementation</li>
371  <li>A few bug fixes</li>
372</ul>
373
<h3>2.3.0: Feb 8 2001 (2.2.12 was on 25 Jan but I didn't keep track)</h3>
375<ul>
376  <li>Lots of XPath bug fixes</li>
377  <li>Add a mode with Dtd lookup but without validation error reporting for
378    XSLT</li>
379  <li>Add support for text node without escaping (XSLT)</li>
380  <li>bug fixes for xmlCheckFilename</li>
381  <li>validation code bug fixes from Gary Pennington</li>
382  <li>Patch from Paul D. Smith correcting URI path normalization</li>
383  <li>Patch to allow simultaneous install of libxml-devel and
384  libxml2-devel</li>
385  <li>the example Makefile is now fixed</li>
386  <li>added HTML to the RPM packages</li>
387  <li>tree copying bugfixes</li>
388  <li>updates to Windows makefiles</li>
389  <li>optimisation patch from Bjorn Reese</li>
390</ul>
391
392<h3>2.2.11: Jan 4 2001</h3>
393<ul>
394  <li>bunch of bug fixes (memory I/O, xpath, ftp/http, ...)</li>
395  <li>added htmlHandleOmittedElem()</li>
396  <li>Applied Bjorn Reese's IPV6 first patch</li>
397  <li>Applied Paul D. Smith patches for validation of XInclude results</li>
398  <li>added XPointer xmlns() new scheme support</li>
399</ul>
400
401<h3>2.2.10: Nov 25 2000</h3>
402<ul>
403  <li>Fix the Windows problems of 2.2.8</li>
404  <li>integrate OpenVMS patches</li>
405  <li>better handling of some nasty HTML input</li>
406  <li>Improved the XPointer implementation</li>
407  <li>integrate a number of provided patches</li>
408</ul>
409
410<h3>2.2.9: Nov 25 2000</h3>
411<ul>
412  <li>erroneous release :-(</li>
413</ul>
414
415<h3>2.2.8: Nov 13 2000</h3>
416<ul>
417  <li>First version of <a href="http://www.w3.org/TR/xinclude">XInclude</a>
418    support</li>
419  <li>Patch in conditional section handling</li>
420  <li>updated MS compiler project</li>
421  <li>fixed some XPath problems</li>
422  <li>added an URI escaping function</li>
423  <li>some other bug fixes</li>
424</ul>
425
426<h3>2.2.7: Oct 31 2000</h3>
427<ul>
428  <li>added message redirection</li>
429  <li>XPath improvements (thanks TOM !)</li>
430  <li>xmlIOParseDTD() added</li>
431  <li>various small fixes in the HTML, URI, HTTP and XPointer support</li>
432  <li>some cleanup of the Makefile, autoconf and the distribution content</li>
433</ul>
434
<h3>2.2.6: Oct 25 2000</h3>
436<ul>
  <li>Added a hash table module, migrated a number of internal structures to
    it</li>
439  <li>Fixed a posteriori validation problems</li>
440  <li>HTTP module cleanups</li>
441  <li>HTML parser improvements (tag errors, script/style handling, attribute
442    normalization)</li>
443  <li>coalescing of adjacent text nodes</li>
444  <li>couple of XPath bug fixes, exported the internal API</li>
445</ul>
446
<h3>2.2.5: Oct 15 2000</h3>
448<ul>
449  <li>XPointer implementation and testsuite</li>
  <li>Lots of XPath fixes, added variable and function registration, more
    tests</li>
452  <li>Portability fixes, lots of enhancements toward an easy Windows build and
453    release</li>
454  <li>Late validation fixes</li>
455  <li>Integrated a lot of contributed patches</li>
456  <li>added memory management docs</li>
  <li>a performance problem when using large buffers seems fixed</li>
458</ul>
459
<h3>2.2.4: Oct 1 2000</h3>
461<ul>
462  <li>main XPath problem fixed</li>
463  <li>Integrated portability patches for Windows</li>
464  <li>Serious bug fixes on the URI and HTML code</li>
465</ul>
466
467<h3>2.2.3: Sep 17 2000</h3>
468<ul>
469  <li>bug fixes</li>
470  <li>cleanup of entity handling code</li>
471  <li>overall review of all loops in the parsers, all sprintf usage has been
472    checked too</li>
  <li>Far better handling of large Dtds. Validating against the Docbook XML
    Dtd works smoothly now.</li>
475</ul>
476
477<h3>1.8.10: Sep 6 2000</h3>
478<ul>
479  <li>bug fix release for some Gnome projects</li>
480</ul>
481
482<h3>2.2.2: August 12 2000</h3>
483<ul>
484  <li>mostly bug fixes</li>
485  <li>started adding routines to access xml parser context options</li>
486</ul>
487
488<h3>2.2.1: July 21 2000</h3>
489<ul>
  <li>a pure bug-fix release</li>
491  <li>fixed an encoding support problem when parsing from a memory block</li>
492  <li>fixed a DOCTYPE parsing problem</li>
  <li>removed a bug in the function that allows overriding the memory
    allocation routines</li>
495</ul>
496
497<h3>2.2.0: July 14 2000</h3>
498<ul>
499  <li>applied a lot of portability fixes</li>
500  <li>better encoding support/cleanup and saving (content is now always
501    encoded in UTF-8)</li>
502  <li>the HTML parser now correctly handles encodings</li>
503  <li>added xmlHasProp()</li>
504  <li>fixed a serious problem with &amp;#38;</li>
505  <li>propagated the fix to FTP client</li>
506  <li>cleanup, bugfixes, etc ...</li>
507  <li>Added a page about <a href="encoding.html">libxml Internationalization
508    support</a></li>
509</ul>
510
511<h3>1.8.9:  July 9 2000</h3>
512<ul>
  <li>fixed the spec, so the RPMs should be better</li>
  <li>fixed a serious bug in the FTP implementation, released 1.8.9 to solve
    a problem for rpmfind users</li>
516</ul>
517
518<h3>2.1.1: July 1 2000</h3>
519<ul>
520  <li>fixes a couple of bugs in the 2.1.0 packaging</li>
521  <li>improvements on the HTML parser</li>
522</ul>
523
524<h3>2.1.0 and 1.8.8: June 29 2000</h3>
525<ul>
  <li>1.8.8 is mostly a convenience package for upgrading to libxml2 according
    to <a href="upgrade.html">new instructions</a>. It fixes a nasty problem
    with &amp;#38; charref parsing</li>
  <li>2.1.0 also eases the upgrade from libxml v1 to the recent version. It
    also contains numerous fixes and enhancements:
531    <ul>
532      <li>added xmlStopParser() to stop parsing</li>
      <li>greatly improved parsing speed when there are large CDATA blocks</li>
      <li>includes XPath patches provided by Picdar Technology</li>
      <li>tried to fix as many Dtd validation and namespace related problems
        as possible</li>
      <li>output to a given encoding has been added/tested</li>
      <li>lots of various fixes</li>
539    </ul>
540  </li>
541</ul>
542
543<h3>2.0.0: Apr 12 2000</h3>
544<ul>
545  <li>First public release of libxml2. If you are using libxml, it's a good
    idea to check the 1.x to 2.x upgrade instructions. NOTE: while initially
    scheduled for Apr 3, the release occurred only on Apr 12 due to massive
    workload.</li>
  <li>The includes are now located under $prefix/include/libxml (instead of
550    $prefix/include/gnome-xml), they also are referenced by
551    <pre>#include &lt;libxml/xxx.h&gt;</pre>
552    <p>instead of</p>
553    <pre>#include "xxx.h"</pre>
554  </li>
555  <li>a new URI module for parsing URIs and following strictly RFC 2396</li>
556  <li>the memory allocation routines used by libxml can now be overloaded
557    dynamically by using xmlMemSetup()</li>
558  <li>The previously CVS only tool tester has been renamed
559    <strong>xmllint</strong> and is now installed as part of the libxml2
560    package</li>
  <li>The I/O interface has been revamped. There are now ways to plug in
    specific I/O modules, either at the URI scheme detection level using
    xmlRegisterInputCallbacks() or by passing I/O functions when creating a
    parser context using xmlCreateIOParserCtxt()</li>
565  <li>there is a C preprocessor macro LIBXML_VERSION providing the version
566    number of the libxml module in use</li>
567  <li>a number of optional features of libxml can now be excluded at configure
568    time (FTP/HTTP/HTML/XPath/Debug)</li>
569</ul>
570
571<h3>2.0.0beta: Mar 14 2000</h3>
572<ul>
573  <li>This is a first Beta release of libxml version 2</li>
  <li>It's available only from <a href="ftp://xmlsoft.org/">xmlsoft.org
    FTP</a>; it's packaged as libxml2-2.0.0beta and available as tar and
  RPMs</li>
577  <li>This version is now the head in the Gnome CVS base, the old one is
578    available under the tag LIB_XML_1_X</li>
  <li>This includes a very large set of changes. From a programmatic point of
    view applications should not have to be modified too much, check the <a
    href="upgrade.html">upgrade page</a></li>
  <li>Some interfaces may change (especially a bit about encoding).</li>
  <li>the updates include:
584    <ul>
      <li>fix I18N support. ISO-Latin-x/UTF-8/UTF-16 (nearly) seem correctly
        handled now</li>
      <li>Better handling of entities, especially well formedness checking and
        proper PEref extensions in external subsets</li>
      <li>DTD conditional sections</li>
      <li>Validation now correctly handles entities content</li>
      <li><a href="http://rpmfind.net/tools/gdome/messages/0039.html">change
        structures to accommodate DOM</a></li>
593    </ul>
594  </li>
  <li>Serious progress was made toward compliance; <a
    href="conf/result.html">here are the results of the tests</a> against the
    OASIS testsuite (except the Japanese tests since I don't support that
    encoding yet). This URL is rebuilt every couple of hours using the CVS
    head version.</li>
600</ul>
601
602<h3>1.8.7: Mar 6 2000</h3>
603<ul>
604  <li>This is a bug fix release:</li>
  <li>It is possible to disable the ignorable blanks heuristic used by
    libxml-1.x; a new function xmlKeepBlanksDefault(0) will allow this. Note
    that for adherence to the XML spec, this behaviour will be disabled by
    default in 2.x. The same function will allow old code to keep
    compatibility.</li>
  <li>Blanks in &lt;a&gt;  &lt;/a&gt; constructs are not ignored anymore;
    avoiding heuristics is really the Right Way :-\</li>
612  <li>The unchecked use of snprintf which was breaking libxml-1.8.6
613    compilation on some platforms has been fixed</li>
614  <li>nanoftp.c nanohttp.c: Fixed '#' and '?' stripping when processing
615  URIs</li>
616</ul>
617
618<h3>1.8.6: Jan 31 2000</h3>
619<ul>
620  <li>added a nanoFTP transport module, debugged until the new version of <a
621    href="http://rpmfind.net/linux/rpm2html/rpmfind.html">rpmfind</a> can use
622    it without troubles</li>
623</ul>
624
625<h3>1.8.5: Jan 21 2000</h3>
626<ul>
627  <li>adding APIs to parse a well balanced chunk of XML (production <a
628    href="http://www.w3.org/TR/REC-xml#NT-content">[43] content</a> of the XML
629    spec)</li>
  <li>fixed a hideous bug in xmlGetProp pointed out by Rune.Djurhuus@fast.no</li>
631  <li>Jody Goldberg &lt;jgoldberg@home.com&gt; provided another patch trying
632    to solve the zlib checks problems</li>
633  <li>The current state in gnome CVS base is expected to ship as 1.8.5 with
634    gnumeric soon</li>
635</ul>
636
637<h3>1.8.4: Jan 13 2000</h3>
638<ul>
639  <li>bug fixes, reintroduced xmlNewGlobalNs(), fixed xmlNewNs()</li>
  <li>all exit() calls should have been removed from libxml</li>
641  <li>fixed a problem with INCLUDE_WINSOCK on WIN32 platform</li>
642  <li>added newDocFragment()</li>
643</ul>
644
645<h3>1.8.3: Jan 5 2000</h3>
646<ul>
647  <li>a Push interface for the XML and HTML parsers</li>
648  <li>a shell-like interface to the document tree (try tester --shell :-)</li>
  <li>lots of bug fixes and improvements added over the Christmas holidays</li>
650  <li>fixed the DTD parsing code to work with the xhtml DTD</li>
651  <li>added xmlRemoveProp(), xmlRemoveID() and xmlRemoveRef()</li>
652  <li>Fixed bugs in xmlNewNs()</li>
  <li>External entity loading code has been revamped, now it uses
    xmlLoadExternalEntity(); some fixes to entity processing were added</li>
655  <li>cleaned up WIN32 includes of socket stuff</li>
656</ul>
657
658<h3>1.8.2: Dec 21 1999</h3>
659<ul>
660  <li>I got another problem with includes and C++, I hope this issue is fixed
661    for good this time</li>
662  <li>Added a few tree modification functions: xmlReplaceNode,
663    xmlAddPrevSibling, xmlAddNextSibling, xmlNodeSetName and
664    xmlDocSetRootElement</li>
665  <li>Tried to improve the HTML output with help from <a
666    href="mailto:clahey@umich.edu">Chris Lahey</a></li>
667</ul>
668
669<h3>1.8.1: Dec 18 1999</h3>
670<ul>
  <li>various patches to avoid trouble when using libxml with C++ compilers:
    the "namespace" keyword and C escaping in include files</li>
673  <li>a problem in one of the core macros IS_CHAR was corrected</li>
674  <li>fixed a bug introduced in 1.8.0 breaking default namespace processing,
675    and more specifically the Dia application</li>
676  <li>fixed a posteriori validation (validation after parsing, or by using a
677    Dtd not specified in the original document)</li>
678  <li>fixed a bug in</li>
679</ul>
680
681<h3>1.8.0: Dec 12 1999</h3>
682<ul>
683  <li>cleanup, especially memory wise</li>
684  <li>the parser should be more reliable, especially the HTML one, it should
685    not crash, whatever the input !</li>
  <li>Integrated various patches, especially a speedup improvement for large
    datasets from <a href="mailto:cnygard@bellatlantic.net">Carl Nygard</a>,
    configure with --with-buffers to enable them.</li>
  <li>attribute normalization, oops should have been added long ago !</li>
  <li>attributes defaulted from Dtds should be available, xmlSetProp() now
    does entity escaping by default.</li>
692</ul>
693
694<h3>1.7.4: Oct 25 1999</h3>
695<ul>
696  <li>Lots of HTML improvement</li>
697  <li>Fixed some errors when saving both XML and HTML</li>
698  <li>More examples, the regression tests should now look clean</li>
699  <li>Fixed a bug with contiguous charref</li>
700</ul>
701
702<h3>1.7.3: Sep 29 1999</h3>
703<ul>
704  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available, fixed</li>
707</ul>
708
709<h3>1.7.1: Sep 24 1999</h3>
710<ul>
711  <li>The basic type for strings manipulated by libxml has been renamed in
712    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
    is that CHAR was conflicting with a predefined type on Windows. However, on
    non-WIN32 environments, compatibility is provided by way of a
    <strong>#define</strong>.</li>
  <li>Fixed another error: the use of a structure field called errno, which
    led to trouble on platforms where it's a macro</li>
718</ul>
719
<h3>1.7.0: Sep 23 1999</h3>
721<ul>
722  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
723    href="html/libxml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by means other than a simple printf-like
    callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
727  <li>Serious memory leaks fixed (there is now a <a
728    href="html/libxml-xmlmemory.html">memory wrapper</a> module)</li>
729  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
730    implementation</li>
731  <li>Added an HTML parser front-end</li>
732</ul>
733
734<h2><a name="XML">XML</a></h2>
735
736<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for
737markup-based structured documents. Here is <a name="example">an example XML
738document</a>:</p>
739<pre>&lt;?xml version="1.0"?&gt;
740&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too"&gt;
741  &lt;head&gt;
742   &lt;title&gt;Welcome to Gnome&lt;/title&gt;
743  &lt;/head&gt;
744  &lt;chapter&gt;
745   &lt;title&gt;The Linux adventure&lt;/title&gt;
746   &lt;p&gt;bla bla bla ...&lt;/p&gt;
747   &lt;image href="linus.gif"/&gt;
748   &lt;p&gt;...&lt;/p&gt;
749  &lt;/chapter&gt;
750&lt;/EXAMPLE&gt;</pre>
751
752<p>The first line specifies that it's an XML document and gives useful
753information about its encoding. Then the document is a text format whose
754structure is specified by tags between brackets. <strong>Each tag opened has
755to be closed</strong>. XML is pedantic about this. However, if a tag is empty
756(no content), a single tag can serve as both the opening and closing tag if it
757ends with <code>/&gt;</code> rather than with <code>&gt;</code>. Note that,
758for example, the image tag has no content (just an attribute) and is closed by
759ending the tag with <code>/&gt;</code>.</p>
760
761<p>XML can be applied successfully to a wide range of uses, from long term
762structured document maintenance (where it follows the steps of SGML) to simple
763data encoding mechanisms like configuration file formatting (glade),
764spreadsheets (gnumeric), or even shorter lived documents such as WebDAV where
765it is used to encode remote calls between a client and a server.</p>
766
767<h2><a name="XSLT">XSLT</a></h2>
768
769<p>Check <a href="http://xmlsoft.org/XSLT">the separate libxslt page</a></p>
770
771<p><a href="http://www.w3.org/TR/xslt">XSL Transformations</a> is a language
772for transforming XML documents into other XML documents (or HTML/textual
773output).</p>
774
775<p>A separate library called libxslt is being built on top of libxml2. This
776module "libxslt" can be found in the Gnome CVS base too.</p>
777
778<p>You can check the <a
779href="http://cvs.gnome.org/lxr/source/libxslt/FEATURES">features</a> supported
780and the progress in the <a
781href="http://cvs.gnome.org/lxr/source/libxslt/ChangeLog">Changelog</a></p>
782
783<h2>An overview of libxml architecture</h2>
784
785<p>Libxml is made of multiple components; some of them are optional, and most
786of the block interfaces are public. The main components are:</p>
787<ul>
788  <li>an Input/Output layer</li>
789  <li>FTP and HTTP client layers (optional)</li>
790  <li>an Internationalization layer managing the encodings support</li>
791  <li>a URI module</li>
792  <li>the XML parser and its basic SAX interface</li>
793  <li>an HTML parser using the same SAX interface (optional)</li>
794  <li>a SAX tree module to build an in-memory DOM representation</li>
795  <li>a tree module to manipulate the DOM representation</li>
796  <li>a validation module using the DOM representation (optional)</li>
797  <li>an XPath module for global lookup in a DOM representation
798  (optional)</li>
799  <li>a debug module (optional)</li>
800</ul>
801
802<p>Graphically this gives the following:</p>
803
804<p><img src="libxml.gif" alt="a graphical view of the various"></p>
805
806<p></p>
807
808<h2><a name="tree">The tree output</a></h2>
809
810<p>The parser returns a tree built during the document analysis. The value
811returned is an <strong>xmlDocPtr</strong> (i.e., a pointer to an
812<strong>xmlDoc</strong> structure). This structure contains information such
813as the file name, the document type, and a <strong>children</strong> pointer
814which is the root of the document (or more exactly the first child under the
815root which is the document). The tree is made of <strong>xmlNode</strong>s,
816chained in double-linked lists of siblings and with a children&lt;-&gt;parent
817relationship. An xmlNode can also carry properties (a chain of xmlAttr
818structures). An attribute may have a value which is a list of TEXT or
819ENTITY_REF nodes.</p>
820
821<p>Here is an example (erroneous with respect to the XML spec since there
822should be only one ELEMENT under the root):</p>
823
824<p><img src="structure.gif" alt=" structure.gif "></p>
825
826<p>In the source package there is a small program (not installed by default)
827called <strong>xmllint</strong> which parses XML files given as argument and
828prints them back as parsed. This is useful for detecting errors both in XML
829code and in the XML parser itself. It has an option <strong>--debug</strong>
830which prints the actual in-memory structure of the document; here is the
831result with the <a href="#example">example</a> given before:</p>
832<pre>DOCUMENT
833version=1.0
834standalone=true
835  ELEMENT EXAMPLE
836    ATTRIBUTE prop1
837      TEXT
838      content=gnome is great
839    ATTRIBUTE prop2
840      ENTITY_REF
841      TEXT
842      content= linux too 
843    ELEMENT head
844      ELEMENT title
845        TEXT
846        content=Welcome to Gnome
847    ELEMENT chapter
848      ELEMENT title
849        TEXT
850        content=The Linux adventure
851      ELEMENT p
852        TEXT
853        content=bla bla bla ...
854      ELEMENT image
855        ATTRIBUTE href
856          TEXT
857          content=linus.gif
858      ELEMENT p
859        TEXT
860        content=...</pre>
861
862<p>This should be useful for learning the internal representation model.</p>
863
864<h2><a name="interface">The SAX interface</a></h2>
865
866<p>Sometimes the DOM tree output is just too large to fit reasonably into
867memory. In that case (and if you don't expect to save back the XML document
868loaded using libxml), it's better to use the SAX interface of libxml. SAX is a
869<strong>callback-based interface</strong> to the parser. Before parsing, the
870application layer registers a customized set of callbacks which are called by
871the library as it progresses through the XML input.</p>
872
873<p>To get more detailed step-by-step guidance on using the SAX interface of
874libxml, see the <a
875href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice
876documentation</a> written by <a href="mailto:james@daa.com.au">James
877Henstridge</a>.</p>
878
879<p>You can debug the SAX behaviour by using the <strong>testSAX</strong>
880program located in the gnome-xml module (it's usually not shipped in the
881binary packages of libxml, but you can find it in the tar source
882distribution). Here is the sequence of callbacks that would be reported by
883testSAX when parsing the example XML document shown earlier:</p>
884<pre>SAX.setDocumentLocator()
885SAX.startDocument()
886SAX.getEntity(amp)
887SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp;amp; linux too')
888SAX.characters(   , 3)
889SAX.startElement(head)
890SAX.characters(    , 4)
891SAX.startElement(title)
892SAX.characters(Welcome to Gnome, 16)
893SAX.endElement(title)
894SAX.characters(   , 3)
895SAX.endElement(head)
896SAX.characters(   , 3)
897SAX.startElement(chapter)
898SAX.characters(    , 4)
899SAX.startElement(title)
900SAX.characters(The Linux adventure, 19)
901SAX.endElement(title)
902SAX.characters(    , 4)
903SAX.startElement(p)
904SAX.characters(bla bla bla ..., 15)
905SAX.endElement(p)
906SAX.characters(    , 4)
907SAX.startElement(image, href='linus.gif')
908SAX.endElement(image)
909SAX.characters(    , 4)
910SAX.startElement(p)
911SAX.characters(..., 3)
912SAX.endElement(p)
913SAX.characters(   , 3)
914SAX.endElement(chapter)
915SAX.characters( , 1)
916SAX.endElement(EXAMPLE)
917SAX.endDocument()</pre>
918
919<p>Most of the other interfaces of libxml are based on the DOM tree-building
920facility, so nearly everything up to the end of this document presupposes the
921use of the standard DOM tree build. Note that the DOM tree itself is built by
922a set of registered default callbacks, without internal specific
923interface.</p>
924
925<h2><a name="library">The XML library interfaces</a></h2>
926
927<p>This section is directly intended to help programmers getting bootstrapped
928using the XML library from the C language. It is not intended to be extensive.
929I hope the automatically generated documents will provide the completeness
930required, but as a separate set of documents. The interfaces of the XML
931library are by principle low level, there is nearly zero abstraction. Those
932interested in a higher level API should <a href="#DOM">look at DOM</a>.</p>
933
934<p>The <a href="html/libxml-parser.html">parser interfaces for XML</a> are
935separated from the <a href="html/libxml-htmlparser.html">HTML parser
936interfaces</a>.  Let's have a look at how the XML parser can be called:</p>
937
938<h3><a name="Invoking">Invoking the parser: the pull method</a></h3>
939
940<p>Usually, the first thing to do is to read an XML input. The parser accepts
941documents either from in-memory strings or from files.  The functions are
942defined in "parser.h":</p>
943<dl>
944  <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt>
945    <dd><p>Parse a null-terminated string containing the document.</p>
946    </dd>
947</dl>
948<dl>
949  <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt>
950    <dd><p>Parse an XML document contained in a (possibly compressed)
951      file.</p>
952    </dd>
953</dl>
954
955<p>The parser returns a pointer to the document structure (or NULL in case of
956failure).</p>
957
958<h3 id="Invoking1">Invoking the parser: the push method</h3>
959
960<p>In order for the application to keep the control when the document is being
961fetched (which is common for GUI based programs) libxml provides a push
962interface, too, as of version 1.8.3. Here are the interface functions:</p>
963<pre>xmlParserCtxtPtr xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax,
964                                         void *user_data,
965                                         const char *chunk,
966                                         int size,
967                                         const char *filename);
968int              xmlParseChunk          (xmlParserCtxtPtr ctxt,
969                                         const char *chunk,
970                                         int size,
971                                         int terminate);</pre>
972
973<p>and here is a simple example showing how to use the interface:</p>
974<pre>            FILE *f;
975
976            f = fopen(filename, "r");
977            if (f != NULL) {
978                int res, size = 1024;
979                char chars[1024];
980                xmlParserCtxtPtr ctxt;
981
982                res = fread(chars, 1, 4, f);
983                if (res &gt; 0) {
984                    ctxt = xmlCreatePushParserCtxt(NULL, NULL,
985                                chars, res, filename);
986                    while ((res = fread(chars, 1, size, f)) &gt; 0) {
987                        xmlParseChunk(ctxt, chars, res, 0);
988                    }
989                    xmlParseChunk(ctxt, chars, 0, 1);
990                    doc = ctxt-&gt;myDoc;
991                    xmlFreeParserCtxt(ctxt);
992                }
993            }</pre>
994
995<p>The HTML parser embedded into libxml also has a push interface; the
996functions are just prefixed by "html" rather than "xml".</p>
997
998<h3 id="Invoking2">Invoking the parser: the SAX interface</h3>
999
1000<p>The tree-building interface makes the parser memory-hungry, first loading
1001the document in memory and then building the tree itself. Reading a document
1002without building the tree is possible using the SAX interfaces (see SAX.h and
1003<a href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">James
1004Henstridge's documentation</a>). Note also that the push interface can be
1005limited to SAX: just use the first two arguments of
1006<code>xmlCreatePushParserCtxt()</code>.</p>
1007
1008<h3><a name="Building">Building a tree from scratch</a></h3>
1009
1010<p>The other way to get an XML tree in memory is by building it. Basically
1011there is a set of functions dedicated to building new elements. (These are
1012also described in &lt;libxml/tree.h&gt;.) For example, here is a piece of code
1013that produces the XML document used in the previous examples:</p>
1014<pre>    #include &lt;libxml/tree.h&gt;
1015    xmlDocPtr doc;
1016    xmlNodePtr tree, subtree;
1017
1018    doc = xmlNewDoc("1.0");
1019    doc-&gt;children = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL);
1020    xmlSetProp(doc-&gt;children, "prop1", "gnome is great");
1021    xmlSetProp(doc-&gt;children, "prop2", "&amp; linux too");
1022    tree = xmlNewChild(doc-&gt;children, NULL, "head", NULL);
1023    subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome");
1024    tree = xmlNewChild(doc-&gt;children, NULL, "chapter", NULL);
1025    subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure");
1026    subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ...");
1027    subtree = xmlNewChild(tree, NULL, "image", NULL);
1028    xmlSetProp(subtree, "href", "linus.gif");</pre>
1029
1030<p>Not really rocket science ...</p>
1031
1032<h3><a name="Traversing">Traversing the tree</a></h3>
1033
1034<p>Basically by <a href="html/libxml-tree.html">including "tree.h"</a> your
1035code has access to the internal structure of all the elements of the tree. The
1036names should be somewhat simple like <strong>parent</strong>,
1037<strong>children</strong>, <strong>next</strong>, <strong>prev</strong>,
1038<strong>properties</strong>, etc... For example, still with the previous
1039example:</p>
1040<pre><code>doc-&gt;children-&gt;children-&gt;children</code></pre>
1041
1042<p>points to the title element,</p>
1043<pre>doc-&gt;children-&gt;children-&gt;next-&gt;children-&gt;children</pre>
1044
1045<p>points to the text node containing the chapter title "The Linux
1046adventure".</p>
1047
1048<p><strong>NOTE</strong>: XML allows <em>PI</em>s and <em>comments</em> to be
1049present before the document root, so <code>doc-&gt;children</code> may point
1050to a node which is not the document Root Element; a function
1051<code>xmlDocGetRootElement()</code> was added for this purpose.</p>
1052
1053<h3><a name="Modifying">Modifying the tree</a></h3>
1054
1055<p>Functions are provided for reading and writing the document content. Here
1056is an excerpt from the <a href="html/libxml-tree.html">tree API</a>:</p>
1057<dl>
1058  <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const
1059  xmlChar *value);</code></dt>
1060    <dd><p>This sets (or changes) an attribute carried by an ELEMENT node. The
1061      value can be NULL.</p>
1062    </dd>
1063</dl>
1064<dl>
1065  <dt><code>xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar
1066  *name);</code></dt>
1067    <dd><p>This function returns a pointer to a new copy of the property
1068      content. Note that the user must deallocate the result.</p>
1069    </dd>
1070</dl>
1071
1072<p>Two functions are provided for reading and writing the text associated with
1073elements:</p>
1074<dl>
1075  <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar
1076  *value);</code></dt>
1077    <dd><p>This function takes an "external" string and converts it to one
1078      text node or possibly to a list of entity and text nodes. All
1079      non-predefined entity references like &amp;Gnome; will be stored
1080      internally as entity nodes, hence the result of the function may not be
1081      a single node.</p>
1082    </dd>
1083</dl>
1084<dl>
1085  <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int
1086  inLine);</code></dt>
1087    <dd><p>This function is the inverse of
1088      <code>xmlStringGetNodeList()</code>. It generates a new string
1089      containing the content of the text and entity nodes. Note the extra
1090      argument inLine. If this argument is set to 1, the function will expand
1091      entity references.  For example, instead of returning the &amp;Gnome;
1092      XML encoding in the string, it will substitute it with its value (say,
1093      "GNU Network Object Model Environment").</p>
1094    </dd>
1095</dl>
1096
1097<h3><a name="Saving">Saving a tree</a></h3>
1098
1099<p>Basically 3 options are possible:</p>
1100<dl>
1101  <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar**mem, int
1102  *size);</code></dt>
1103    <dd><p>Returns a buffer into which the document has been saved.</p>
1104    </dd>
1105</dl>
1106<dl>
1107  <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt>
1108    <dd><p>Dumps a document to an open file descriptor.</p>
1109    </dd>
1110</dl>
1111<dl>
1112  <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt>
1113    <dd><p>Saves the document to a file. In this case, the compression
1114      interface is triggered if it has been turned on.</p>
1115    </dd>
1116</dl>
1117
1118<h3><a name="Compressio">Compression</a></h3>
1119
1120<p>The library transparently handles compression when doing file-based
1121accesses. The level of compression on saves can be turned on either globally
1122or individually for one file:</p>
1123<dl>
1124  <dt><code>int  xmlGetDocCompressMode (xmlDocPtr doc);</code></dt>
1125    <dd><p>Gets the document compression ratio (0-9).</p>
1126    </dd>
1127</dl>
1128<dl>
1129  <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt>
1130    <dd><p>Sets the document compression ratio.</p>
1131    </dd>
1132</dl>
1133<dl>
1134  <dt><code>int  xmlGetCompressMode(void);</code></dt>
1135    <dd><p>Gets the default compression ratio.</p>
1136    </dd>
1137</dl>
1138<dl>
1139  <dt><code>void xmlSetCompressMode(int mode);</code></dt>
1140    <dd><p>Sets the default compression ratio.</p>
1141    </dd>
1142</dl>
1143
1144<h2><a name="Entities">Entities or no entities</a></h2>
1145
1146<p>Entities in principle are similar to simple C macros. An entity defines an
1147abbreviation for a given string that you can reuse many times throughout the
1148content of your document. Entities are especially useful when a given string
1149may occur frequently within a document, or to confine the change needed to a
1150document to a restricted area in the internal subset of the document (at the
1151beginning). Example:</p>
1152<pre>1 &lt;?xml version="1.0"?&gt;
11532 &lt;!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
11543 &lt;!ENTITY xml "Extensible Markup Language"&gt;
11554 ]&gt;
11565 &lt;EXAMPLE&gt;
11576    &amp;xml;
11587 &lt;/EXAMPLE&gt;</pre>
1159
1160<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing
1161its name with '&amp;' and following it by ';' without any spaces added. There
1162are 5 predefined entities in libxml allowing you to escape characters with
1163predefined meaning in some parts of the xml document content:
1164<strong>&amp;lt;</strong> for the character '&lt;', <strong>&amp;gt;</strong>
1165for the character '&gt;',  <strong>&amp;apos;</strong> for the character ''',
1166<strong>&amp;quot;</strong> for the character '"', and
1167<strong>&amp;amp;</strong> for the character '&amp;'.</p>
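<p>To make the five predefined entities concrete, here is a small stand-alone sketch in plain C. It does not use libxml at all and is not how the library implements escaping; <code>predefined_entity()</code> and <code>escape_predefined()</code> are hypothetical names used only for this illustration.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the predefined entity reference for c,
 * or NULL if c needs no escaping. */
static const char *predefined_entity(char c) {
    switch (c) {
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '&':  return "&amp;";
    case '\'': return "&apos;";
    case '"':  return "&quot;";
    default:   return NULL;
    }
}

/* Return a malloc()ed copy of `text` with the five special characters
 * replaced by their entity references, or NULL if out of memory. */
static char *escape_predefined(const char *text) {
    size_t len = 0;
    const char *p;
    char *out, *w;

    assert(text != NULL);
    /* First pass: compute the length of the escaped string. */
    for (p = text; *p != '\0'; p++) {
        const char *ent = predefined_entity(*p);
        len += (ent != NULL) ? strlen(ent) : 1;
    }
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    /* Second pass: write the escaped string. */
    w = out;
    for (p = text; *p != '\0'; p++) {
        const char *ent = predefined_entity(*p);
        if (ent != NULL) {
            size_t n = strlen(ent);
            memcpy(w, ent, n);
            w += n;
        } else {
            *w++ = *p;
        }
    }
    *w = '\0';
    return out;
}
```

<p>In practice this step is rarely needed by hand, since libxml performs the conversion itself when saving.</p>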
1168
1169<p>One of the problems related to entities is that you may want the parser to
1170substitute an entity's content so that you can see the replacement text in
1171your application. Or you may prefer to keep entity references as such in the
1172content to be able to save the document back without losing this usually
1173precious information (if the user went through the pain of explicitly defining
1174entities, he may have a rather negative attitude if you blindly substitute
1175them at save time). The <a
1176href="html/libxml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a>
1177function allows you to check and change the behaviour, which is to not
1178substitute entities by default.</p>
1179
1180<p>Here is the DOM tree built by libxml for the previous document in the
1181default case:</p>
1182<pre>/gnome/src/gnome-xml -&gt; /xmllint --debug test/ent1
1183DOCUMENT
1184version=1.0
1185   ELEMENT EXAMPLE
1186     TEXT
1187     content=
1188     ENTITY_REF
1189       INTERNAL_GENERAL_ENTITY xml
1190       content=Extensible Markup Language
1191     TEXT
1192     content=</pre>
1193
1194<p>And here is the result when substituting entities:</p>
1195<pre>/gnome/src/gnome-xml -&gt; /tester --debug --noent test/ent1
1196DOCUMENT
1197version=1.0
1198   ELEMENT EXAMPLE
1199     TEXT
1200     content=     Extensible Markup Language</pre>
1201
1202<p>So, entities or no entities? Basically, it depends on your use case. I
1203suggest that you keep the non-substituting default behaviour and avoid using
1204entities in your XML document or data if you are not willing to handle the
1205entity reference elements in the DOM tree.</p>
1206
1207<p>Note that at save time libxml enforces the conversion of the predefined
1208entities where necessary to prevent well-formedness problems, and will also
1209transparently replace those with chars (i.e. it will not generate entity
1210reference elements in the DOM tree or call the reference() SAX callback when
1211finding them in the input).</p>
1212
1213<p><span style="background-color: #FF0000">WARNING</span>: handling entities
1214on top of the libxml SAX interface is difficult!!! If you plan to use
1215non-predefined entities in your documents, then the learning curve to handle
1216them using the SAX API may be long. If you plan to use complex documents, I
1217strongly suggest you consider using the DOM interface instead and let libxml
1218deal with the complexity rather than trying to do it yourself.</p>
1219
1220<h2><a name="Namespaces">Namespaces</a></h2>
1221
1222<p>The libxml library implements <a
1223href="http://www.w3.org/TR/REC-xml-names/">XML namespaces</a> support by
1224recognizing namespace constructs in the input, and does namespace lookup
1225automatically when building the DOM tree. A namespace declaration is
1226associated with an in-memory structure and all elements or attributes within
1227that namespace point to it. Hence testing the namespace is a simple and fast
1228equality operation at the user level.</p>
1229
1230<p>I suggest that people using libxml use a namespace, and declare it in the
1231root element of their document as the default namespace. Then they don't need
1232to use the prefix in the content but we will have a basis for future semantic
1233refinement and  merging of data from different sources. This doesn't increase
1234the size of the XML output significantly, but significantly increases its
1235value in the long-term. Example:</p>
1236<pre>&lt;mydoc xmlns="http://mydoc.example.org/schemas/"&gt;
1237   &lt;elem1&gt;...&lt;/elem1&gt;
1238   &lt;elem2&gt;...&lt;/elem2&gt;
1239&lt;/mydoc&gt;</pre>
1240
1241<p>The namespace value has to be an absolute URL, but the URL doesn't have to
1242point to any existing resource on the Web. It will bind all the elements and
1243attributes with that URL. I suggest using a URL within a domain you control,
1244and that the URL should contain some kind of version information if possible.
1245For example, <code>"http://www.gnome.org/gnumeric/1.0/"</code> is a good
1246namespace scheme.</p>
1247
1248<p>Then when you load a file, make sure that a namespace carrying the
1249version-independent prefix is installed on the root element of your document,
1250and if the version information doesn't match something you know, warn the user
1251and be liberal in what you accept as the input. Also do *not* try to base
1252namespace checking on the prefix value. &lt;foo:text&gt; may be exactly the
1253same as &lt;bar:text&gt; in another document. What really matters is the URI
1254associated with the element or the attribute, not the prefix string (which is
1255just a shortcut for the full URI). In libxml, elements and attributes have an
1256<code>ns</code> field pointing to an xmlNs structure detailing the namespace
1257prefix and its URI.</p>
1258
1259<p>@@Interfaces@@</p>
1260
1261<p>@@Examples@@</p>
1262
1263<p>Usually people object to using namespaces together with validity checking.
1264I will try to make sure that using namespaces won't break validity checking,
1265so even if you plan to use or currently are using validation I strongly
1266suggest adding namespaces to your document. A default namespace scheme
1267<code>xmlns="http://...."</code> should not break validity even on less
1268flexible parsers. Using namespaces to mix and differentiate content coming
1269from multiple DTDs will certainly break current validation schemes. I will try
1270to provide ways to do this, but this may not be portable or standardized.</p>
1271
1272<h2><a name="Validation">Validation, or are you afraid of DTDs ?</a></h2>
1273
1274<p>Well what is validation and what is a DTD ?</p>
1275
1276<p>Validation is the process of checking a document against a set of
1277construction rules; a <strong>DTD</strong> (Document Type Definition) is such
1278a set of rules.</p>
1279
1280<p>The validation process and building DTDs are the two most difficult parts
1281of the XML life cycle. Briefly, a DTD defines all the possible elements to be
1282found within your document and the formal shape of your document tree (by
1283defining the allowed content of an element, either text, a regular expression
1284for the allowed list of children, or mixed content i.e. both text and
1285children). The DTD also defines the allowed attributes for all elements and
1286the types of the attributes. For more detailed information, I suggest that you
1287read the related parts of the XML specification, the examples found under
1288gnome-xml/test/valid/dtd and any of the large number of books available on
1289XML. The dia example in gnome-xml/test/valid should be both simple and
1290complete enough to allow you to build your own.</p>
1291
1292<p>A word of warning, building a good DTD which will fit the needs of your
1293application in the long-term is far from trivial; however, the extra level of
1294quality it can ensure is well worth the price for some sets of applications or
1295if you already have a DTD defined for your application field.</p>
1296
1297<p>The validation is not completely finished but in a (very IMHO) usable
1298state. Until a real validation interface is defined the way to do it is to
1299define and set the <strong>xmlDoValidityCheckingDefaultValue</strong> external
1300variable to 1, this will of course be changed at some point:</p>
1301
1302<p><code>extern int xmlDoValidityCheckingDefaultValue;</code></p>
1303
1304<p>...</p>
1305
1306<p><code>xmlDoValidityCheckingDefaultValue = 1;</code></p>
1307
1308<p></p>
1309
1310<p>To handle external entities, use the function
1311<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to
1312link in your HTTP/FTP/Entities database library to the standard libxml
1313core.</p>
1314
1315<p>@@interfaces@@</p>
1316
1317<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2>
1318
1319<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object
1320Model</em>; this is an API for accessing XML or HTML structured documents.
1321Native support for DOM in Gnome is on the way (module gnome-dom), and will be
1322based on gnome-xml. This will be a far cleaner interface to manipulate XML
1323files within Gnome since it won't expose the internal structure.</p>
1324
1325<p>The current DOM implementation on top of libxml is the <a
1326href="http://cvs.gnome.org/lxr/source/gdome2/">gdome2 Gnome module</a>, this
1327is a full DOM interface, thanks to Paolo Casarini, check the <a
1328href="http://www.cs.unibo.it/~casarini/gdome2/">Gdome2 homepage</a> for more
1329information.</p>
1330
1331<p>The gnome-dom and gdome modules in the Gnome CVS base are obsolete.</p>
1332
1333<h2><a name="Example"></a><a name="real">A real example</a></h2>
1334
1335<p>Here is a real size example, where the actual content of the application
1336data is not kept in the DOM tree but uses internal structures. It is based on
1337a proposal to keep a database of jobs related to Gnome, with an XML based
1338storage structure. Here is an <a href="gjobs.xml">XML encoded jobs
1339base</a>:</p>
1340<pre>&lt;?xml version="1.0"?&gt;
1341&lt;gjob:Helping xmlns:gjob="http://www.gnome.org/some-location"&gt;
1342  &lt;gjob:Jobs&gt;
1343
1344    &lt;gjob:Job&gt;
1345      &lt;gjob:Project ID="3"/&gt;
1346      &lt;gjob:Application&gt;GBackup&lt;/gjob:Application&gt;
1347      &lt;gjob:Category&gt;Development&lt;/gjob:Category&gt;
1348
1349      &lt;gjob:Update&gt;
1350        &lt;gjob:Status&gt;Open&lt;/gjob:Status&gt;
1351        &lt;gjob:Modified&gt;Mon, 07 Jun 1999 20:27:45 -0400 MET DST&lt;/gjob:Modified&gt;
1352        &lt;gjob:Salary&gt;USD 0.00&lt;/gjob:Salary&gt;
1353      &lt;/gjob:Update&gt;
1354
1355      &lt;gjob:Developers&gt;
1356        &lt;gjob:Developer&gt;
1357        &lt;/gjob:Developer&gt;
1358      &lt;/gjob:Developers&gt;
1359
1360      &lt;gjob:Contact&gt;
1361        &lt;gjob:Person&gt;Nathan Clemons&lt;/gjob:Person&gt;
1362        &lt;gjob:Email&gt;nathan@windsofstorm.net&lt;/gjob:Email&gt;
1363        &lt;gjob:Company&gt;
1364        &lt;/gjob:Company&gt;
1365        &lt;gjob:Organisation&gt;
1366        &lt;/gjob:Organisation&gt;
1367        &lt;gjob:Webpage&gt;
1368        &lt;/gjob:Webpage&gt;
1369        &lt;gjob:Snailmail&gt;
1370        &lt;/gjob:Snailmail&gt;
1371        &lt;gjob:Phone&gt;
1372        &lt;/gjob:Phone&gt;
1373      &lt;/gjob:Contact&gt;
1374
1375      &lt;gjob:Requirements&gt;
1376      The program should be released as free software, under the GPL.
1377      &lt;/gjob:Requirements&gt;
1378
1379      &lt;gjob:Skills&gt;
1380      &lt;/gjob:Skills&gt;
1381
1382      &lt;gjob:Details&gt;
1383      A GNOME based system that will allow a superuser to configure 
1384      compressed and uncompressed files and/or file systems to be backed 
1385      up with a supported media in the system.  This should be able to 
1386      perform via find commands generating a list of files that are passed 
1387      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine 
1388      or via operations performed on the filesystem itself. Email 
1389      notification and GUI status display very important.
1390      &lt;/gjob:Details&gt;
1391
1392    &lt;/gjob:Job&gt;
1393
1394  &lt;/gjob:Jobs&gt;
1395&lt;/gjob:Helping&gt;</pre>
1396
1397<p>While loading the XML file into an internal DOM tree is a matter of calling
1398only a couple of functions, browsing the tree to gather the data and generate
1399the internal structures is harder, and more error prone.</p>
1400
1401<p>The suggested principle is to be tolerant with respect to the input
1402structure. For example, the ordering of the attributes is not significant, the
1403XML specification is clear about it. It's also usually a good idea not to
1404depend on the order of the children of a given node, unless it really makes
1405things harder. Here is some code to parse the information for a person:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/* tracing helper used by the examples */
#define DEBUG(x) printf(x)

/*
 * A person record
 */
typedef struct person {
    char *name;
    char *email;
    char *company;
    char *organisation;
    char *smail;
    char *webPage;
    char *phone;
} person, *personPtr;

/*
 * And the code needed to parse it
 */
personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    personPtr ret = NULL;

    DEBUG("parsePerson\n");
    /*
     * allocate the struct
     */
    ret = (personPtr) malloc(sizeof(person));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(person));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        /* element names are xmlChar strings, so compare with xmlStrcmp */
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Person")) &amp;&amp;
            (cur-&gt;ns == ns))
            ret-&gt;name = (char *)
                xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Email")) &amp;&amp;
            (cur-&gt;ns == ns))
            ret-&gt;email = (char *)
                xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>

<p>Here are a couple of things to notice:</p>
<ul>
  <li>A recursive parsing style is usually the most convenient one: XML data
    is by nature subject to repetitive constructs and usually exhibits highly
    structured patterns.</li>
  <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em> are
    the pointer to the global XML document and the namespace reserved for the
    application. Document-wide information is needed, for example, to decode
    entities, and it is good coding practice to define a namespace for your
    application's set of data and to test that the elements and attributes
    you are analyzing actually belong to your application space. This is done
    by a simple equality test (cur-&gt;ns == ns).</li>
  <li>To retrieve text and attribute values, you can use the function
    <em>xmlNodeListGetString</em> to gather all the text and entity reference
    nodes generated by the DOM output and produce a single text string.</li>
</ul>
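
<p>The tolerance principle and the namespace test can be illustrated without
the library at all. The sketch below is only an illustration under stated
assumptions: <em>demoNs</em>, <em>demoNode</em>, <em>demoPerson</em> and
<em>demoFillPerson</em> are hypothetical stand-ins, not libxml types or
functions, reduced to the few fields the scan needs. It shows a sibling scan
that accepts children in any order and skips anything it does not
recognise.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a namespace and an element node (NOT libxml types). */
typedef struct demoNs { const char *href; } demoNs;
typedef struct demoNode {
    const char *name;        /* element name */
    const demoNs *ns;        /* namespace the element belongs to */
    const char *text;        /* text content, already flattened */
    struct demoNode *next;   /* next sibling */
} demoNode;

typedef struct demoPerson {
    const char *name;
    const char *email;
} demoPerson;

/* Scan siblings in whatever order they appear; skip unknown or foreign ones. */
void demoFillPerson(demoPerson *p, const demoNode *cur, const demoNs *ns) {
    for (; cur != NULL; cur = cur->next) {
        if (cur->ns != ns)
            continue;                 /* not in the application namespace */
        if (strcmp(cur->name, "Person") == 0)
            p->name = cur->text;
        else if (strcmp(cur->name, "Email") == 0)
            p->email = cur->text;
        /* any other element is ignored rather than treated as an error */
    }
}
```

<p>Because the namespace is compared by pointer, an element with the right
name but attached to a different namespace structure is left alone, which
mirrors the (cur-&gt;ns == ns) test described above.</p>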

<p>Here is another piece of code used to parse another level of the
structure:</p>
<pre>#include &lt;libxml/tree.h&gt;
/*
 * a Description for a Job
 */
typedef struct job {
    char *projectID;
    char *application;
    char *category;
    personPtr contact;
    int nbDevelopers;
    personPtr developers[100]; /* using dynamic alloc is left as an exercise */
} job, *jobPtr;

/*
 * And the code needed to parse it
 */
jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    jobPtr ret = NULL;

    DEBUG("parseJob\n");
    /*
     * allocate the struct
     */
    ret = (jobPtr) malloc(sizeof(job));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(job));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Project")) &amp;&amp;
            (cur-&gt;ns == ns)) {
            /* the project identifier is carried by the ID attribute */
            ret-&gt;projectID = (char *) xmlGetProp(cur, (const xmlChar *) "ID");
            if (ret-&gt;projectID == NULL) {
                fprintf(stderr, "Project has no ID\n");
            }
        }
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Application")) &amp;&amp;
            (cur-&gt;ns == ns))
            ret-&gt;application = (char *)
                xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Category")) &amp;&amp;
            (cur-&gt;ns == ns))
            ret-&gt;category = (char *)
                xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        /* a Contact is itself a person record: recurse one level down */
        if ((!xmlStrcmp(cur-&gt;name, (const xmlChar *) "Contact")) &amp;&amp;
            (cur-&gt;ns == ns))
            ret-&gt;contact = parsePerson(doc, ns, cur);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>
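
<p>The job structure above stores developers in a fixed array of 100 entries
and leaves dynamic allocation as an exercise. One possible way to do that
exercise is sketched below. The names <em>devList</em>,
<em>devListAdd</em> and <em>devListFree</em> are hypothetical and not part of
libxml or of the example program; the person type is only forward-declared
here so that the block stands alone.</p>

```c
#include <stdlib.h>

/* Forward declaration only; the full person record is defined in the text above. */
typedef struct person person, *personPtr;

/* Hypothetical growable replacement for the fixed developers[100] array. */
typedef struct devList {
    personPtr *items;   /* heap-allocated array of person pointers */
    int count;          /* number of slots in use */
    int capacity;       /* number of slots allocated */
} devList;

/* Append one developer; returns 0 on success, -1 if memory runs out. */
int devListAdd(devList *list, personPtr dev) {
    if (list->count == list->capacity) {
        /* start with 4 slots, then double each time the array is full */
        int newCap = (list->capacity == 0) ? 4 : list->capacity * 2;
        personPtr *tmp = realloc(list->items, (size_t) newCap * sizeof(*tmp));
        if (tmp == NULL)
            return -1;              /* the old array is still valid */
        list->items = tmp;
        list->capacity = newCap;
    }
    list->items[list->count++] = dev;
    return 0;
}

/* Release the array itself; the persons it points to are owned elsewhere. */
void devListFree(devList *list) {
    free(list->items);
    list->items = NULL;
    list->count = 0;
    list->capacity = 0;
}
```

<p>A list starts out zero-initialised (no items, count and capacity both 0),
and each parsed developer is then appended with <em>devListAdd</em>. Doubling
the capacity keeps the number of reallocations small even for long lists.</p>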

<p>Once you are used to it, writing this kind of code is quite simple, but
boring. Ultimately, it could be possible to write stub generators that take
either C data structure definitions, a set of XML examples, or an XML DTD and
produce the code needed to import and export the content between C data and
XML storage. This is left as an exercise for the reader :-)</p>
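
<p>As a rough idea of what generated import code might look like, the sketch
below replaces the chain of name tests with a table that maps element names to
structure fields. This is only one possible design, not something the library
provides: <em>fieldMap</em>, <em>personMap</em> and <em>setPersonField</em>
are hypothetical names, and the element names other than Person and Email are
guesses made for illustration.</p>

```c
#include <stddef.h>
#include <string.h>

typedef struct person {
    char *name;
    char *email;
    char *company;
    char *organisation;
    char *smail;
    char *webPage;
    char *phone;
} person;

/* One table row: an XML element name and the offset of the matching char* field. */
typedef struct fieldMap {
    const char *element;
    size_t offset;
} fieldMap;

/* Hypothetical mapping; "Company" and "Phone" are assumed element names. */
static const fieldMap personMap[] = {
    { "Person",  offsetof(person, name) },
    { "Email",   offsetof(person, email) },
    { "Company", offsetof(person, company) },
    { "Phone",   offsetof(person, phone) },
};

/* Store value in the field mapped to element; returns 1 if stored, 0 if unknown. */
int setPersonField(person *p, const char *element, char *value) {
    size_t i;
    for (i = 0; i < sizeof(personMap) / sizeof(personMap[0]); i++) {
        if (strcmp(personMap[i].element, element) == 0) {
            /* locate the char* member inside *p from its byte offset */
            char **slot = (char **) ((char *) p + personMap[i].offset);
            *slot = value;
            return 1;
        }
    }
    return 0;
}
```

<p>With such a table, supporting a new element means adding one row instead
of another test in the parsing loop, and the same table could drive the export
direction as well.</p>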

<p>Feel free to use <a href="example/gjobread.c">the code for the full C
parsing example</a> as a template. It is also available, with a Makefile, in
the Gnome CVS base under gnome-xml/example.</p>

<h2><a name="Contributi">Contributions</a></h2>
<ul>
  <li><a href="mailto:ari@lusis.org">Ari Johnson</a> provides a C++ wrapper
    for libxml:
    <p>Website: <a
    href="http://lusis.org/~ari/xml++/">http://lusis.org/~ari/xml++/</a></p>
    <p>Download: <a
    href="http://lusis.org/~ari/xml++/libxml++.tar.gz">http://lusis.org/~ari/xml++/libxml++.tar.gz</a></p>
  </li>
  <li><a href="mailto:doolin@cs.utk.edu">David Doolin</a> provides a
    precompiled Windows version:
    <p><a
    href="http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/">http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/</a></p>
  </li>
  <li><a
    href="http://mail.gnome.org/archives/xml/2001-March/msg00014.html">Matt
    Sergeant</a> developed <a
    href="http://axkit.org/download/">XML::LibXSLT</a>, a Perl wrapper for
    libxml2/libxslt, as part of the <a href="http://axkit.com/">AxKit XML
    application server</a></li>
  <li><a href="mailto:fnatter@gmx.net">Felix Natter</a> provided <a
    href="libxml-doc.el">an Emacs module</a> to look up libxml function
    documentation</li>
  <li><a href="mailto:sherwin@nlm.nih.gov">Ziying Sherwin</a> provided <a
    href="http://xmlsoft.org/messages/0488.html">man pages</a> (not yet
    integrated in the distribution)</li>
</ul>

<p></p>

<p><a href="mailto:Daniel.Veillard@imag.fr">Daniel Veillard</a></p>

<p>$Id: xml.html,v 1.85 2001/06/01 10:11:57 veillard Exp $</p>
</body>
</html>