1<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2    "http://www.w3.org/TR/html4/loose.dtd">
3<html>
4<head>
5  <title>The XML C library for Gnome</title>
6  <meta name="GENERATOR" content="amaya 5.1">
7  <meta http-equiv="Content-Type" content="text/html">
8</head>
9
10<body bgcolor="#ffffff">
11<h1 align="center">The XML C library for Gnome</h1>
12
13<h1>Note: this is the flat content of the <a href="index.html">web
14site</a></h1>
15
16<h1 style="text-align: center">libxml, a.k.a. gnome-xml</h1>
17
18<p></p>
19
<p>Libxml is the XML C library developed for the Gnome project. XML itself
is a metalanguage for designing markup languages, i.e. text languages where
semantics and structure are added to the content using extra "markup"
information enclosed between angle brackets. HTML is the best-known
markup language. Though the library is written in C, <a href="python.html">a
variety of language bindings</a> make it available in other environments.</p>
26
<p>Libxml2 is known to be very portable; the library should build and work
without serious trouble on a variety of systems (Linux, Unix, Windows,
Cygwin, MacOS, MacOS X, RISC OS, OS/2, VMS, QNX, MVS, ...).</p>
30
31<p>Libxml2 implements a number of existing standards related to markup
32languages:</p>
33<ul>
34  <li>the XML standard: <a
35    href="http://www.w3.org/TR/REC-xml">http://www.w3.org/TR/REC-xml</a></li>
36  <li>Namespaces in XML: <a
37    href="http://www.w3.org/TR/REC-xml-names/">http://www.w3.org/TR/REC-xml-names/</a></li>
38  <li>XML Base: <a
39    href="http://www.w3.org/TR/xmlbase/">http://www.w3.org/TR/xmlbase/</a></li>
40  <li><a href="http://www.cis.ohio-state.edu/rfc/rfc2396.txt">RFC 2396</a> :
41    Uniform Resource Identifiers <a
42    href="http://www.ietf.org/rfc/rfc2396.txt">http://www.ietf.org/rfc/rfc2396.txt</a></li>
43  <li>XML Path Language (XPath) 1.0: <a
44    href="http://www.w3.org/TR/xpath">http://www.w3.org/TR/xpath</a></li>
45  <li>HTML4 parser: <a
46    href="http://www.w3.org/TR/html401/">http://www.w3.org/TR/html401/</a></li>
47  <li>most of XML Pointer Language (XPointer) Version 1.0: <a
48    href="http://www.w3.org/TR/xptr">http://www.w3.org/TR/xptr</a></li>
49  <li>XML Inclusions (XInclude) Version 1.0: <a
50    href="http://www.w3.org/TR/xinclude/">http://www.w3.org/TR/xinclude/</a></li>
51  <li>[ISO-8859-1], <a
52    href="http://www.cis.ohio-state.edu/rfc/rfc2044.txt">rfc2044</a> [UTF-8]
53    and <a href="http://www.cis.ohio-state.edu/rfc/rfc2781.txt">rfc2781</a>
54    [UTF-16] core encodings</li>
55  <li>part of SGML Open Technical Resolution TR9401:1997</li>
56  <li>XML Catalogs Working Draft 06 August 2001: <a
57    href="http://www.oasis-open.org/committees/entity/spec-2001-08-06.html">http://www.oasis-open.org/committees/entity/spec-2001-08-06.html</a></li>
58  <li>Canonical XML Version 1.0: <a
59    href="http://www.w3.org/TR/xml-c14n">http://www.w3.org/TR/xml-c14n</a>
60    and the Exclusive XML Canonicalization CR draft <a
61    href="http://www.w3.org/TR/xml-exc-c14n">http://www.w3.org/TR/xml-exc-c14n</a></li>
62  <li>Relax NG Committee Specification 3 December 2001 <a
63    href="http://www.oasis-open.org/committees/relax-ng/spec-20011203.html">http://www.oasis-open.org/committees/relax-ng/spec-20011203.html</a></li>
64</ul>
65
66<p>In most cases libxml tries to implement the specifications in a relatively
67strictly compliant way. As of release 2.4.16, libxml2 passes all 1800+ tests
68from the <a
href="http://www.oasis-open.org/committees/xml-conformance/">OASIS XML Test
70Suite</a>.</p>
71
72<p>To some extent libxml2 provides support for the following additional
73specifications but doesn't claim to implement them completely:</p>
74<ul>
  <li>Document Object Model (DOM) <a
    href="http://www.w3.org/TR/DOM-Level-2-Core/">http://www.w3.org/TR/DOM-Level-2-Core/</a>:
    libxml2 doesn't implement the API itself; gdome2 does this on top of
    libxml2</li>
  <li><a href="http://www.cis.ohio-state.edu/rfc/rfc959.txt">RFC 959</a>:
    libxml implements basic FTP client code</li>
  <li><a href="http://www.cis.ohio-state.edu/rfc/rfc1945.txt">RFC 1945</a>:
    HTTP/1.0, again basic HTTP client code</li>
83  <li>SAX: a minimal SAX implementation compatible with early expat
84  versions</li>
85  <li>DocBook SGML v4: libxml2 includes a hackish parser to transition to
86  XML</li>
87</ul>
88
89<p>A partial implementation of XML Schemas is being worked on but it would be
90far too early to make any conformance statement about it at the moment.</p>
91
92<p>Separate documents:</p>
93<ul>
94  <li><a href="http://xmlsoft.org/XSLT/">the libxslt page</a> providing an
95    implementation of XSLT 1.0 and common extensions like EXSLT for
96  libxml2</li>
97  <li><a href="http://www.cs.unibo.it/~casarini/gdome2/">the gdome2 page</a>
98    : a standard DOM2 implementation for libxml2</li>
99  <li><a href="http://www.aleksey.com/xmlsec/">the XMLSec page</a>: an
100    implementation of <a href="http://www.w3.org/TR/xmldsig-core/">W3C XML
101    Digital Signature</a> for libxml2</li>
102  <li>also check the related links section below for more related and active
103    projects.</li>
104</ul>
105
106<p>Results of the <a
107href="http://xmlbench.sourceforge.net/results/benchmark/index.html">xmlbench
108benchmark</a> on sourceforge 19 March 2003 (smaller is better):</p>
109
110<p align="center"><img src="benchmark.gif"
111alt="benchmark results for Expat Xerces libxml2 Oracle and Sun toolkits"></p>
112
113<p>Logo designed by <a href="mailto:liyanage@access.ch">Marc Liyanage</a>.</p>
114
115<h2><a name="Introducti">Introduction</a></h2>
116
117<p>This document describes libxml, the <a
118href="http://www.w3.org/XML/">XML</a> C library developed for the <a
119href="http://www.gnome.org/">Gnome</a> project. <a
120href="http://www.w3.org/XML/">XML is a standard</a> for building tag-based
121structured documents/data.</p>
122
123<p>Here are some key points about libxml:</p>
124<ul>
125  <li>Libxml exports Push (progressive) and Pull (blocking) type parser
126    interfaces for both XML and HTML.</li>
127  <li>Libxml can do DTD validation at parse time, using a parsed document
128    instance, or with an arbitrary DTD.</li>
129  <li>Libxml includes complete <a
130    href="http://www.w3.org/TR/xpath">XPath</a>, <a
131    href="http://www.w3.org/TR/xptr">XPointer</a> and <a
132    href="http://www.w3.org/TR/xinclude">XInclude</a> implementations.</li>
133  <li>It is written in plain C, making as few assumptions as possible, and
134    sticking closely to ANSI C/POSIX for easy embedding. Works on
135    Linux/Unix/Windows, ported to a number of other platforms.</li>
136  <li>Basic support for HTTP and FTP client allowing applications to fetch
137    remote resources.</li>
  <li>The design is modular; most of the extensions can be compiled out.</li>
139  <li>The internal document representation is as close as possible to the <a
140    href="http://www.w3.org/DOM/">DOM</a> interfaces.</li>
141  <li>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX
142    like interface</a>; the interface is designed to be compatible with <a
143    href="http://www.jclark.com/xml/expat.html">Expat</a>.</li>
144  <li>This library is released under the <a
145    href="http://www.opensource.org/licenses/mit-license.html">MIT
146    License</a>. See the Copyright file in the distribution for the precise
147    wording.</li>
148</ul>
149
150<p>Warning: unless you are forced to because your application links with a
151Gnome-1.X library requiring it,  <strong><span
152style="background-color: #FF0000">Do Not Use libxml1</span></strong>, use
153libxml2</p>
154
155<h2><a name="FAQ">FAQ</a></h2>
156
157<p>Table of Contents:</p>
158<ul>
159  <li><a href="FAQ.html#License">License(s)</a></li>
160  <li><a href="FAQ.html#Installati">Installation</a></li>
161  <li><a href="FAQ.html#Compilatio">Compilation</a></li>
162  <li><a href="FAQ.html#Developer">Developer corner</a></li>
163</ul>
164
165<h3><a name="License">License</a>(s)</h3>
166<ol>
167  <li><em>Licensing Terms for libxml</em>
168    <p>libxml is released under the <a
169    href="http://www.opensource.org/licenses/mit-license.html">MIT
170    License</a>; see the file Copyright in the distribution for the precise
171    wording</p>
172  </li>
173  <li><em>Can I embed libxml in a proprietary application ?</em>
    <p>Yes. The MIT License allows you to keep your changes to libxml
    proprietary, but it would be gracious to send back bug fixes and
    improvements as patches for possible incorporation in the main
    development tree.</p>
178  </li>
179</ol>
180
181<h3><a name="Installati">Installation</a></h3>
182<ol>
183  <li>Unless you are forced to because your application links with a Gnome
184    library requiring it,  <strong><span style="background-color: #FF0000">Do
185    Not Use libxml1</span></strong>, use libxml2</li>
  <li><em>Where can I get libxml?</em>
    <p>The original distribution comes from <a
    href="ftp://rpmfind.net/pub/libxml/">rpmfind.net</a> or <a
    href="ftp://ftp.gnome.org/pub/GNOME/sources/libxml2/2.5/">gnome.org</a>.</p>
    <p>Most Linux and BSD distributions include libxml; this is probably the
    safer way for end-users to get libxml.</p>
    <p>David Doolin provides precompiled Windows versions at <a
    href="http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/">http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/</a>.</p>
194  </li>
195  <li><em>I see libxml and libxml2 releases, which one should I install ?</em>
196    <ul>
197      <li>If you are not constrained by backward compatibility issues with
198        existing applications, install libxml2 only</li>
199      <li>If you are not doing development, you can safely install both.
200        Usually the packages <a
201        href="http://rpmfind.net/linux/RPM/libxml.html">libxml</a> and <a
202        href="http://rpmfind.net/linux/RPM/libxml2.html">libxml2</a> are
203        compatible (this is not the case for development packages).</li>
204      <li>If you are a developer and your system provides separate packaging
205        for shared libraries and the development components, it is possible
206        to install libxml and libxml2, and also <a
207        href="http://rpmfind.net/linux/RPM/libxml-devel.html">libxml-devel</a>
208        and <a
209        href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml2-devel</a>
210        too for libxml2 &gt;= 2.3.0</li>
211      <li>If you are developing a new application, please develop against
212        libxml2(-devel)</li>
213    </ul>
214  </li>
215  <li><em>I can't install the libxml package, it conflicts with libxml0</em>
216    <p>You probably have an old libxml0 package used to provide the shared
217    library for libxml.so.0, you can probably safely remove it. The libxml
218    packages provided on <a
219    href="ftp://rpmfind.net/pub/libxml/">rpmfind.net</a> provide
220    libxml.so.0</p>
221  </li>
222  <li><em>I can't install the libxml(2) RPM package due to failed
223    dependencies</em>
    <p>The most generic solution is to re-fetch the latest src.rpm and
    rebuild it locally with</p>
    <p><code>rpm --rebuild libxml(2)-xxx.src.rpm</code>.</p>
227    <p>If everything goes well it will generate two binary rpm packages (one
228    providing the shared libs and xmllint, and the other one, the -devel
229    package, providing includes, static libraries and scripts needed to build
230    applications with libxml(2)) that you can install locally.</p>
231  </li>
232</ol>
233
234<h3><a name="Compilatio">Compilation</a></h3>
235<ol>
236  <li><em>What is the process to compile libxml ?</em>
    <p>Like most UNIX libraries, libxml follows the "standard" procedure:</p>
    <p><code>gunzip -c xxx.tar.gz | tar xvf -</code></p>
    <p><code>cd libxml-xxxx</code></p>
    <p><code>./configure --help</code></p>
    <p>to see the options, then the compilation/installation proper</p>
    <p><code>./configure [possible options]</code></p>
    <p><code>make</code></p>
    <p><code>make install</code></p>
245    <p>At that point you may have to rerun ldconfig or a similar utility to
246    update your list of installed shared libs.</p>
247  </li>
248  <li><em>What other libraries are needed to compile/install libxml ?</em>
    <p>Libxml does not require any other library; the normal ANSI C API
    should be sufficient (please report any violation of this rule you may
    find).</p>
252    <p>However if found at configuration time libxml will detect and use the
253    following libs:</p>
254    <ul>
      <li><a href="http://www.info-zip.org/pub/infozip/zlib/">libz</a>: a
        highly portable and widely available compression library.</li>
      <li>iconv: a powerful character encoding conversion library. It is
        included by default in recent glibc libraries, so it doesn't need to
        be installed specifically on Linux. It now seems to be a <a
        href="http://www.opennc.org/onlinepubs/7908799/xsh/iconv.html">part
        of the official UNIX</a> specification. Here is one <a
        href="http://www.gnu.org/software/libiconv/">implementation of the
        library</a> whose source can be found <a
        href="ftp://ftp.ilog.fr/pub/Users/haible/gnu/">here</a>.</li>
265    </ul>
266  </li>
267  <li><em>Make check fails on some platforms</em>
268    <p>Sometimes the regression tests' results don't completely match the
269    value produced by the parser, and the makefile uses diff to print the
270    delta. On some platforms the diff return breaks the compilation process;
271    if the diff is small this is probably not a serious problem.</p>
272    <p>Sometimes (especially on Solaris) make checks fail due to limitations
273    in make. Try using GNU-make instead.</p>
274  </li>
275  <li><em>I use the CVS version and there is no configure script</em>
276    <p>The configure script (and other Makefiles) are generated. Use the
277    autogen.sh script to regenerate the configure script and Makefiles,
278    like:</p>
    <p><code>./autogen.sh --prefix=/usr --disable-shared</code></p>
280  </li>
281  <li><em>I have troubles when running make tests with gcc-3.0</em>
282    <p>It seems the initial release of gcc-3.0 has a problem with the
283    optimizer which miscompiles the URI module. Please use another
284    compiler.</p>
285  </li>
286</ol>
287
288<h3><a name="Developer">Developer</a> corner</h3>
289<ol>
290  <li><em>xmlDocDump() generates output on one line.</em>
291    <p>Libxml will not <strong>invent</strong> spaces in the content of a
292    document since <strong>all spaces in the content of a document are
293    significant</strong>. If you build a tree from the API and want
294    indentation:</p>
295    <ol>
296      <li>the correct way is to generate those yourself too.</li>
297      <li>the dangerous way is to ask libxml to add those blanks to your
298        content <strong>modifying the content of your document in the
299        process</strong>. The result may not be what you expect. There is
300        <strong>NO</strong> way to guarantee that such a modification won't
301        affect other parts of the content of your document. See <a
302        href="http://xmlsoft.org/html/libxml-parser.html#XMLKEEPBLANKSDEFAULT">xmlKeepBlanksDefault
303        ()</a> and <a
304        href="http://xmlsoft.org/html/libxml-tree.html#XMLSAVEFORMATFILE">xmlSaveFormatFile
305        ()</a></li>
306    </ol>
307  </li>
308  <li>Extra nodes in the document:
    <p><em>For an XML file as below:</em></p>
    <pre>&lt;?xml version="1.0"?&gt;
&lt;PLAN xmlns="http://www.argus.ca/autotest/1.0/"&gt;
&lt;NODE CommFlag="0"/&gt;
&lt;NODE CommFlag="1"/&gt;
&lt;/PLAN&gt;</pre>
    <p><em>after parsing it with the function
    pxmlDoc=xmlParseFile(...);</em></p>
    <p><em>I want to get the content of the first node (the node with
    CommFlag="0")</em></p>
    <p><em>so I did the following:</em></p>
    <pre>xmlNodePtr pnode;
pnode=pxmlDoc-&gt;children-&gt;children;</pre>
    <p><em>but it does not work. If I change it to</em></p>
    <pre>pnode=pxmlDoc-&gt;children-&gt;children-&gt;next;</pre>
    <p><em>then it works. Can someone explain it to me?</em></p>
325    <p></p>
326    <p>In XML all characters in the content of the document are significant
327    <strong>including blanks and formatting line breaks</strong>.</p>
    <p>The extra nodes you are wondering about are just that: text nodes with
    the formatting spaces, which are part of the document but which people tend
    to forget. There is a function <a
    href="http://xmlsoft.org/html/libxml-parser.html">xmlKeepBlanksDefault
    ()</a> to remove those at parse time, but that's a heuristic, and its
    use should be limited to cases where you are certain there is no
    mixed content in the document.</p>
335  </li>
  <li><em>I get compilation errors in existing code, for example when
    accessing the <strong>root</strong> or <strong>child fields</strong> of
    nodes.</em>
    <p>You are compiling code developed for libxml version 1 in a
    libxml2 development environment. Either switch back to libxml v1 devel or,
    even better, fix the code to compile with libxml2 (or both) by <a
    href="upgrade.html">following the instructions</a>.</p>
342  </li>
343  <li><em>I get compilation errors about non existing
344    <strong>xmlRootNode</strong> or <strong>xmlChildrenNode</strong>
345    fields.</em>
346    <p>The source code you are using has been <a
347    href="upgrade.html">upgraded</a> to be able to compile with both libxml
348    and libxml2, but you need to install a more recent version:
349    libxml(-devel) &gt;= 1.8.8 or libxml2(-devel) &gt;= 2.1.0</p>
350  </li>
351  <li><em>XPath implementation looks seriously broken</em>
    <p>The XPath implementation prior to 2.3.0 was really incomplete. Upgrade
    to a recent version; there are no known bugs in the current version.</p>
354  </li>
355  <li><em>The example provided in the web page does not compile.</em>
356    <p>It's hard to maintain the documentation in sync with the code
357    &lt;grin/&gt; ...</p>
358    <p>Check the previous points 1/ and 2/ raised before, and please send
359    patches.</p>
360  </li>
  <li><em>Where can I get more examples and information than provided on the
    web page?</em>
363    <p>Ideally a libxml book would be nice. I have no such plan ... But you
364    can:</p>
365    <ul>
      <li>look more closely at the <a href="html/libxml-lib.html">existing
        generated doc</a></li>
      <li>look for examples of use of libxml functions in the Gnome code.
        For example, the following will query the full Gnome CVS base for
        uses of the <strong>xmlAddChild()</strong> function:
371        <p><a
372        href="http://cvs.gnome.org/lxr/search?string=xmlAddChild">http://cvs.gnome.org/lxr/search?string=xmlAddChild</a></p>
373        <p>This may be slow, a large hardware donation to the gnome project
374        could cure this :-)</p>
375      </li>
      <li><a
        href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&amp;dir=gnome-xml">Browse
        the libxml source</a>. I try to write code as clean and documented
        as possible, so looking at it may be helpful. In particular the code
        of xmllint.c and of the various testXXX.c test programs should
        provide good examples of how to do things with the library.</li>
382    </ul>
383  </li>
384  <li>What about C++ ?
385    <p>libxml is written in pure C in order to allow easy reuse on a number
386    of platforms, including embedded systems. I don't intend to convert to
387    C++.</p>
388    <p>There are however a few C++ wrappers which may fulfill your needs:</p>
389    <ul>
390      <li>by Ari Johnson &lt;ari@btigate.com&gt;:
391        <p>Website: <a
392        href="http://lusis.org/~ari/xml++/">http://lusis.org/~ari/xml++/</a></p>
393        <p>Download: <a
394        href="http://lusis.org/~ari/xml++/libxml++.tar.gz">http://lusis.org/~ari/xml++/libxml++.tar.gz</a></p>
395      </li>
396      <li>by Peter Jones &lt;pjones@pmade.org&gt;
397        <p>Website: <a
398        href="http://pmade.org/pjones/software/xmlwrapp/">http://pmade.org/pjones/software/xmlwrapp/</a></p>
399      </li>
400    </ul>
401  </li>
402  <li>How to validate a document a posteriori ?
403    <p>It is possible to validate documents which had not been validated at
404    initial parsing time or documents which have been built from scratch
405    using the API. Use the <a
406    href="http://xmlsoft.org/html/libxml-valid.html#XMLVALIDATEDTD">xmlValidateDtd()</a>
407    function. It is also possible to simply add a DTD to an existing
408    document:</p>
    <pre>xmlDocPtr doc; /* your existing document */
xmlDtdPtr dtd = xmlParseDTD(NULL, filename_of_dtd); /* parse the DTD */

dtd-&gt;name = xmlStrdup((xmlChar*)"root_name"); /* use the given root */

/* attach the DTD as the internal subset, ahead of any existing content */
doc-&gt;intSubset = dtd;
if (doc-&gt;children == NULL)
    xmlAddChild((xmlNodePtr)doc, (xmlNodePtr)dtd);
else
    xmlAddPrevSibling(doc-&gt;children, (xmlNodePtr)dtd);</pre>
418  </li>
419  <li>So what is this funky "xmlChar" used all the time?
    <p>It is a null-terminated sequence of UTF-8 characters. And only UTF-8!
    You need to convert strings encoded in other ways to UTF-8 before
    passing them to the API. This can be accomplished with the iconv library,
    for instance.</p>
424  </li>
425  <li>etc ...</li>
426</ol>
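<p>The "extra nodes" entry in the Developer corner above comes down to
skipping the blank text nodes when looking for an element. The following is a
minimal, self-contained sketch of that traversal. It deliberately does not
use libxml2: <code>demo_node</code>, <code>demo_first_element()</code> and
<code>demo_build_plan()</code> are hypothetical stand-ins that only mirror the
<code>type</code>, <code>children</code> and <code>next</code> links of a
parsed node, so that the logic can be compiled and tried on its own.</p>

```c
#include <stddef.h>

/* Stand-in node kinds for this sketch only (not the libxml2 enum). */
enum demo_type { DEMO_ELEMENT_NODE, DEMO_TEXT_NODE };

/* Hypothetical miniature of a parsed node: just the links the FAQ uses. */
struct demo_node {
    enum demo_type type;
    const char *name;            /* element name, or "text" */
    struct demo_node *children;  /* first child, blank text included */
    struct demo_node *next;      /* next sibling */
};

/* Return the first element at or after 'node', skipping text nodes. */
struct demo_node *demo_first_element(struct demo_node *node)
{
    while (node != NULL && node->type != DEMO_ELEMENT_NODE)
        node = node->next;
    return node;
}

/* Build the tree a parser would produce for the PLAN document above:
 * each line break between tags becomes a text node of its own. */
struct demo_node *demo_build_plan(void)
{
    static struct demo_node blank3 = { DEMO_TEXT_NODE, "text", NULL, NULL };
    static struct demo_node node2  = { DEMO_ELEMENT_NODE, "NODE", NULL, &blank3 };
    static struct demo_node blank2 = { DEMO_TEXT_NODE, "text", NULL, &node2 };
    static struct demo_node node1  = { DEMO_ELEMENT_NODE, "NODE", NULL, &blank2 };
    static struct demo_node blank1 = { DEMO_TEXT_NODE, "text", NULL, &node1 };
    static struct demo_node plan   = { DEMO_ELEMENT_NODE, "PLAN", &blank1, NULL };
    return &plan;
}
```

<p>In this sketch <code>plan-&gt;children</code> is the blank text node, which
is why the question above only got the element after following
<code>next</code>. With the real library the same loop would presumably be
started at <code>pxmlDoc-&gt;children-&gt;children</code> and compare the node
<code>type</code> against <code>XML_ELEMENT_NODE</code>.</p>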
427
428<p></p>
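<p>As the xmlChar entry above says, strings must be UTF-8 before they reach
the API. A minimal sketch of such a conversion with the POSIX iconv interface
follows. The helper name <code>to_utf8()</code> and its return convention are
assumptions made for this example, not part of libxml, and the sketch assumes
the platform's iconv recognizes the source charset name.</p>

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string 'in', encoded in
 * 'from_charset', into UTF-8 in 'out' (capacity 'out_size' bytes).
 * Returns the number of bytes written (excluding the NUL), or -1 on error. */
int to_utf8(const char *from_charset, const char *in,
            char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes char **; input is not modified */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;  /* keep room for the terminating NUL */

    cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return -1;            /* charset not supported by this iconv */

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;            /* invalid input or output buffer too small */

    *outp = '\0';
    return (int)(outp - out);
}
```

<p>The converted buffer can then be handed to the API where an xmlChar string
is expected, with a cast as in the DTD example above.</p>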
429
430<h2><a name="Documentat">Documentation</a></h2>
431
432<p>There are several on-line resources related to using libxml:</p>
433<ol>
  <li>Use the <a href="search.php">search engine</a> to look up
  information.</li>
436  <li>Check the <a href="FAQ.html">FAQ.</a></li>
437  <li>Check the <a href="http://xmlsoft.org/html/libxml-lib.html">extensive
438    documentation</a> automatically extracted from code comments (using <a
439    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gtk-doc">gtk
440    doc</a>).</li>
441  <li>Look at the documentation about <a href="encoding.html">libxml
442    internationalization support</a>.</li>
443  <li>This page provides a global overview and <a href="example.html">some
444    examples</a> on how to use libxml.</li>
445  <li>John Fleck's libxml tutorial: <a href="tutorial/index.html">html</a> or
446    <a href="tutorial/xmltutorial.pdf">pdf</a>.</li>
447  <li><a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
448    href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
449    documentation</a> explaining how to use the libxml SAX interface.</li>
450  <li>George Lebl wrote <a
451    href="http://www-4.ibm.com/software/developer/library/gnome3/">an article
452    for IBM developerWorks</a> about using libxml.</li>
453  <li>Check <a href="http://cvs.gnome.org/lxr/source/gnome-xml/TODO">the TODO
454    file</a>.</li>
455  <li>Read the <a href="upgrade.html">1.x to 2.x upgrade path</a>
456    description. If you are starting a new project using libxml you should
457    really use the 2.x version.</li>
458  <li>And don't forget to look at the <a
459    href="http://mail.gnome.org/archives/xml/">mailing-list archive</a>.</li>
460</ol>
461
462<h2><a name="Reporting">Reporting bugs and getting help</a></h2>
463
464<p>Well, bugs or missing features are always possible, and I will make a
465point of fixing them in a timely fashion. The best way to report a bug is to
466use the <a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">Gnome
467bug tracking database</a> (make sure to use the "libxml2" module name). I
468look at reports there regularly and it's good to have a reminder when a bug
469is still open. Be sure to specify that the bug is for the package libxml2.</p>
470
471<p>There is also a mailing-list <a
472href="mailto:xml@gnome.org">xml@gnome.org</a> for libxml, with an  <a
473href="http://mail.gnome.org/archives/xml/">on-line archive</a> (<a
474href="http://xmlsoft.org/messages">old</a>). To subscribe to this list,
475please visit the <a
476href="http://mail.gnome.org/mailman/listinfo/xml">associated Web</a> page and
477follow the instructions. <strong>Do not send code, I won't debug it</strong>
478(but patches are really appreciated!).</p>
479
480<p>Check the following <strong><span style="color: #FF0000">before
481posting</span></strong>:</p>
482<ul>
  <li>Read the <a href="FAQ.html">FAQ</a> and <a href="search.php">use the
    search engine</a> to get information related to your problem.</li>
485  <li>Make sure you are <a href="ftp://xmlsoft.org/">using a recent
486    version</a>, and that the problem still shows up in a recent version.</li>
487  <li>Check the <a href="http://mail.gnome.org/archives/xml/">list
488    archives</a> to see if the problem was reported already. In this case
489    there is probably a fix available, similarly check the <a
490    href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">registered
491    open bugs</a>.</li>
492  <li>Make sure you can reproduce the bug with xmllint or one of the test
493    programs found in source in the distribution.</li>
494  <li>Please send the command showing the error as well as the input (as an
495    attachment)</li>
496</ul>
497
<p>Then send the bug report, with the information needed to reproduce it, to the <a
href="mailto:xml@gnome.org">xml@gnome.org</a> list; if it's really libxml
related I will approve it. Please do not send mail to me directly: it makes
things really hard to track, and in some cases I am not the best person to
answer a given question. Ask on the list.</p>
503
504<p>To <span style="color: #E50000">be really clear about support</span>:</p>
505<ul>
  <li>Support or help <span style="color: #E50000">requests MUST be sent to
    the list or to bugzilla</span> in case of problems, so that the questions
    and answers can be shared publicly. Failing to do so carries the implicit
    message "I want free support but I don't want to share the benefits with
    others" and is not welcome. I will automatically Carbon-Copy the
    xml@gnome.org mailing list for any technical reply made about libxml2 or
    libxslt.</li>
  <li>There is <span style="color: #E50000">no guarantee of support</span>.
    If your question remains unanswered after a week, repost it, making sure
    you gave all the details needed and the information requested.</li>
  <li>Failing to provide the information requested, or to double check first
    for prior feedback, also carries the implicit message "the time of the
    library maintainers is less valuable than my time" and might not be
    welcome.</li>
520</ul>
521
522<p>Of course, bugs reported with a suggested patch for fixing them will
523probably be processed faster than those without.</p>
524
525<p>If you're looking for help, a quick look at <a
526href="http://mail.gnome.org/archives/xml/">the list archive</a> may actually
527provide the answer. I usually send source samples when answering libxml usage
528questions. The <a href="http://xmlsoft.org/html/book1.html">auto-generated
documentation</a> is not as polished as I would like (I need to learn more
530about DocBook), but it's a good starting point.</p>
531
532<h2><a name="help">How to help</a></h2>
533
<p>You can help the project in various ways. The best thing to do first is to
subscribe to the mailing-list as explained before, then check the <a
href="http://mail.gnome.org/archives/xml/">archives</a> and the <a
href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">Gnome bug
database</a>:</p>
539<ol>
540  <li>Provide patches when you find problems.</li>
  <li>Provide the diffs when you port libxml to a new platform. They may not
    be integrated in all cases, but they help pinpoint portability
    problems.</li>
  <li>Provide documentation fixes (either as patches to the code comments or
    as HTML diffs).</li>
  <li>Provide new documentation pieces (translations, examples, etc.).</li>
  <li>Check the TODO file and try to close one of the items.</li>
  <li>Take one of the points raised in the archive or the bug database and
    provide a fix. <a href="mailto:daniel@veillard.com">Get in touch with
    me</a> beforehand to avoid synchronization problems and to check that the
    suggested fix will fit in nicely :-)</li>
553</ol>
554
555<h2><a name="Downloads">Downloads</a></h2>
556
557<p>The latest versions of libxml can be found on <a
558href="ftp://xmlsoft.org/">xmlsoft.org</a> (<a
559href="ftp://speakeasy.rpmfind.net/pub/libxml/">Seattle</a>, <a
560href="ftp://fr.rpmfind.net/pub/libxml/">France</a>) or on the <a
561href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a> either
as a <a href="ftp://ftp.gnome.org/pub/GNOME/sources/libxml2/2.5/">source
archive</a><!-- commenting this out because they seem to have disappeared or <a
href="ftp://ftp.gnome.org/pub/GNOME/stable/redhat/i386/libxml/">RPM
packages</a> -->.
Antonin Sprinzl also provides <a href="ftp://gd.tuwien.ac.at/pub/libxml/">a
mirror in Austria</a>. (NOTE that you need both the <a
568href="http://rpmfind.net/linux/RPM/libxml2.html">libxml(2)</a> and <a
569href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml(2)-devel</a>
570packages installed to compile applications using libxml.) <a
571href="mailto:igor@zlatkovic.com">Igor  Zlatkovic</a> is now the maintainer of
572the Windows port, <a
573href="http://www.zlatkovic.com/projects/libxml/index.html">he provides
574binaries</a>. <a href="mailto:Gary.Pennington@sun.com">Gary Pennington</a>
575provides <a href="http://garypennington.net/libxml2/">Solaris binaries</a>.
576<a href="mailto:Steve.Ball@zveno.com">Steve Ball</a> provides <a
href="http://www.zveno.com/open_source/libxml2xslt.html">Mac OS X
578binaries</a>.</p>
579
580<p><a name="Snapshot">Snapshot:</a></p>
581<ul>
582  <li>Code from the W3C cvs base libxml <a
583    href="ftp://xmlsoft.org/cvs-snapshot.tar.gz">cvs-snapshot.tar.gz</a>.</li>
584  <li>Docs, content of the web site, the list archive included <a
585    href="ftp://xmlsoft.org/libxml-docs.tar.gz">libxml-docs.tar.gz</a>.</li>
586</ul>
587
588<p><a name="Contribs">Contributions:</a></p>
589
<p>I do accept external contributions, especially if compiled on another
platform; get in touch with me to upload the package. Wrappers for various
languages have been provided and can be found in the <a
href="contribs.html">contrib section</a>.</p>
594
595<p>Libxml is also available from CVS:</p>
596<ul>
597  <li><p>The <a
598    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gnome-xml">Gnome
599    CVS base</a>. Check the <a
600    href="http://developer.gnome.org/tools/cvs.html">Gnome CVS Tools</a>
601    page; the CVS module is <b>gnome-xml</b>.</p>
602  </li>
603  <li>The <strong>libxslt</strong> module is also present there</li>
604</ul>
605
606<h2><a name="News">News</a></h2>
607
608<h3>CVS only : check the <a
609href="http://cvs.gnome.org/lxr/source/gnome-xml/ChangeLog">Changelog</a> file
610for a really accurate description</h3>
611
<p>Items not yet finished and still being worked on; get in touch with the
list if you want to test them:</p>
614<ul>
615  <li>More testing on RelaxNG</li>
616  <li>Finishing up <a href="http://www.w3.org/TR/xmlschema-1/">XML
617  Schemas</a></li>
618</ul>
619
<h3>2.5.5: Mar 24 2003</h3>
621<ul>
  <li>Lots of fixes on the Relax NG implementation. More testing including
    DocBook and TEI examples.</li>
  <li>Increased the support for W3C XML Schemas datatypes</li>
625  <li>Several bug fixes in the URI handling layer</li>
626  <li>Bug fixes: HTML parser, xmlReader, DTD validation, XPath, encoding
627    conversion, line counting in the parser.</li>
628  <li>Added support for $XMLLINT_INDENT environment variable, FTP delete</li>
629  <li>Fixed the RPM spec file name</li>
630</ul>
631
632<h3>2.5.4: Feb 20 2003</h3>
633<ul>
634  <li>Conformance testing and lots of fixes on the Relax NG and XInclude
635    implementations</li>
636  <li>Implementation of XPointer element() scheme</li>
637  <li>Bug fixes: XML parser, XInclude entities merge, validity checking on
638    namespaces,
639    2 serialization bugs, node info generation problems, a DTD regexp
640    generation problem.
641  </li>
642  <li>Portability: windows updates and path canonicalization (Igor)</li>
643  <li>A few typo fixes (Kjartan Maraas)</li>
644  <li>Python bindings generator fixes (Stephane Bidoul)</li>
645</ul>
646
647<h3>2.5.3: Feb 10 2003</h3>
648<ul>
649  <li>RelaxNG and XML Schemas datatypes improvements, and added a first
650    version of RelaxNG Python bindings</li>
651  <li>Fixes: XLink (Sean Chittenden), XInclude (Sean Chittenden), API fix for
652    serializing namespace nodes, encoding conversion bug, XHTML1
653  serialization</li>
654  <li>Portability fixes: Windows (Igor), AMD 64bits RPM spec file</li>
655</ul>
656
657<h3>2.5.2: Feb 5 2003</h3>
658<ul>
659  <li>First implementation of RelaxNG, added --relaxng flag to xmllint</li>
660  <li>Schemas support now compiled in by default.</li>
661  <li>Bug fixes: DTD validation, namespace checking, XInclude and entities,
662    delegateURI in XML Catalogs, HTML parser, XML reader (St&eacute;phane Bidoul),
663    XPath parser and evaluation,  UTF8ToUTF8 serialization, XML reader memory
664    consumption, HTML parser, HTML serialization in the presence of
665  namespaces</li>
666  <li>added an HTML API to check elements and attributes.</li>
667  <li>Documentation improvement, PDF for the tutorial (John Fleck), doc
668    patches (Stefan Kost)</li>
669  <li>Portability fixes: NetBSD (Julio Merino), Windows (Igor Zlatkovic)</li>
670  <li>Added python bindings for XPointer, contextual error reporting
671    (St&eacute;phane Bidoul)</li>
672  <li>URI/file escaping problems (Stefano Zacchiroli)</li>
673</ul>
674
675<h3>2.5.1: Jan 8 2003</h3>
676<ul>
677  <li>Fixes a memory leak and configuration/compilation problems in 2.5.0</li>
678  <li>documentation updates (John)</li>
679  <li>a couple of XmlTextReader fixes</li>
680</ul>
681
682<h3>2.5.0: Jan 6 2003</h3>
683<ul>
684  <li>New <a href="xmlreader.html">XmlTextReader interface</a> based on C#
685    API (with help of St&eacute;phane Bidoul)</li>
686  <li>Windows: more exports, including the new API (Igor)</li>
687  <li>XInclude fallback fix</li>
688  <li>Python: bindings for the new API, packaging (St&eacute;phane Bidoul),
689    drv_libxml2.py Python xml.sax driver (St&eacute;phane Bidoul), fixes, speedup
690    and iterators for Python-2.2 (Hannu Krosing)</li>
691  <li>Tutorial fixes (John Fleck and Niraj Tolia), xmllint man update
692  (John)</li>
693  <li>Fix an XML parser bug raised by Vyacheslav Pindyura</li>
694  <li>Fix for VMS serialization (Nigel Hall) and config (Craig A. Berry)</li>
695  <li>Entities handling fixes</li>
696  <li>new API to optionally track node creation and deletion (Lukas
697  Schroeder)</li>
698  <li>Added documentation for the XmlTextReader interface and some <a
699    href="guidelines.html">XML guidelines</a></li>
700</ul>
701
702<h3>2.4.30: Dec 12 2002</h3>
703<ul>
704  <li>2.4.29 broke the python bindings, rereleasing</li>
705  <li>Improvement/fixes of the XML API generator, and a couple of minor code
706    fixes.</li>
707</ul>
708
709<h3>2.4.29: Dec 11 2002</h3>
710<ul>
711  <li>Windows fixes (Igor): Windows CE port, pthread linking, python bindings
712    (St&eacute;phane Bidoul), Mingw (Magnus Henoch), and export list updates</li>
713  <li>Fix for prev in python bindings (ERDI Gergo)</li>
714  <li>Fix for entities handling (Marcus Clarke)</li>
715  <li>Refactored the XML and HTML dumps to a single code path, fixed XHTML1
716    dump</li>
717  <li>Fix for URI parsing when handling URNs with fragment identifiers</li>
718  <li>Fix for HTTP URL escaping problem</li>
719  <li>added an XmlTextReader (C#) like API (work in progress)</li>
720  <li>Rewrote the API in XML generation script, includes a C parser and saves
721    more information needed for C# bindings</li>
722</ul>
723
724<h3>2.4.28: Nov 22 2002</h3>
725<ul>
726  <li>a couple of python binding fixes</li>
727  <li>2 bug fixes in the XML push parser</li>
728  <li>potential memory leak removed (Martin Stoilov)</li>
729  <li>fix to the configure script for Unix (Dimitri Papadopoulos)</li>
730  <li>added encoding support for XInclude parse="text"</li>
731  <li>autodetection of XHTML1 and specific serialization rules added</li>
732  <li>nasty threading bug fixed (William Brack)</li>
733</ul>
734
735<h3>2.4.27: Nov 17 2002</h3>
736<ul>
737  <li>fixes for the Python bindings</li>
738  <li>a number of bug fixes: SGML catalogs, xmlParseBalancedChunkMemory(),
739    HTML parser,  Schemas (Charles Bozeman), document fragment support
740    (Christian Glahn), xmlReconciliateNs (Brian Stafford), XPointer,
741    xmlFreeNode(), xmlSAXParseMemory (Peter Jones), xmlGetNodePath (Petr
742    Pajas), entities processing</li>
743  <li>added grep to xmllint --shell</li>
744  <li>VMS update patch from Craig A. Berry</li>
745  <li>cleanup of the Windows build with support for more compilers (Igor),
746    better thread support on Windows</li>
747  <li>cleanup of Unix Makefiles and spec file</li>
748  <li>Improvements to the documentation (John Fleck)</li>
749</ul>
750
751<h3>2.4.26: Oct 18 2002</h3>
752<ul>
753  <li>Patches for Windows CE port, improvements on Windows paths handling</li>
754  <li>Fixes to the validation code (DTD and Schemas), xmlGetNodePath(),
755    HTML serialization, Namespace compliance, and a number of small
756  problems</li>
757</ul>
758
759<h3>2.4.25: Sep 26 2002</h3>
760<ul>
761  <li>A number of bug fixes: XPath, validation, Python bindings, DOM and
762    tree, XML I/O, HTML</li>
763  <li>Serious rewrite of XInclude</li>
764  <li>Made XML Schemas regexp part of the default build and APIs, small fix
765    and improvement of the regexp core</li>
766  <li>Changed the validation code to reuse XML Schemas regexp APIs</li>
767  <li>Better handling of Windows file paths, improvement of Makefiles (Igor,
768    Daniel Gehriger, Mark Vakoc)</li>
769  <li>Improved the python I/O bindings, the tests, added resolver and regexp
770    APIs</li>
771  <li>New logos from Marc Liyanage</li>
772  <li>Tutorial improvements: John Fleck, Christopher Harris</li>
773  <li>Makefile: Fixes for AMD x86_64 (Mandrake), DESTDIR (Christophe
774  Merlet)</li>
775  <li>removal of all stderr/perror use for error reporting</li>
776  <li>Better error reporting: XPath and DTD validation</li>
777  <li>update of the trio portability layer (Bjorn Reese)</li>
778</ul>
779
780<h3>2.4.24: Aug 22 2002</h3>
781<ul>
782  <li>XPath fixes (William), xf:escape-uri() (Wesley Terpstra)</li>
783  <li>Python binding fixes: makefiles (William), generator, rpm build, x86-64
784    (fcrozat)</li>
785  <li>HTML &lt;style&gt; and boolean attributes serializer fixes</li>
786  <li>C14N improvements by Aleksey</li>
787  <li>doc cleanups: Rick Jones</li>
788  <li>Windows compiler makefile updates: Igor and Elizabeth Barham</li>
789  <li>XInclude: implementation of fallback and xml:base fixup added</li>
790</ul>
791
792<h3>2.4.23: July 6 2002</h3>
793<ul>
794  <li>performance patches: Peter Jacobi</li>
795  <li>c14n fixes, testsuite and performance: Aleksey Sanin</li>
796  <li>added xmlDocFormatDump: Chema Celorio</li>
797  <li>new tutorial: John Fleck</li>
798  <li>new hash functions and performance: Sander Vesik, portability fix from
799    Peter Jacobi</li>
800  <li>a number of bug fixes: XPath (William Brack, Richard Jinks), XML and
801    HTML parsers, ID lookup function</li>
802  <li>removal of all remaining sprintf: Aleksey Sanin</li>
803</ul>
804
805<h3>2.4.22: May 27 2002</h3>
806<ul>
807  <li>a number of bug fixes: configure scripts, base handling, parser, memory
808    usage, HTML parser, XPath, documentation (Christian Cornelssen),
809    indentation, URI parsing</li>
810  <li>Optimizations for XMLSec, fixing and making public some of the network
811    protocol handlers (Aleksey)</li>
812  <li>performance patch from Gary Pennington</li>
813  <li>Charles Bozeman provided date and time support for XML Schemas
814  datatypes</li>
815</ul>
816
817<h3>2.4.21: Apr 29 2002</h3>
818
819<p>This release is a bug fix release that also contains the early XML
820Schemas <a href="http://www.w3.org/TR/xmlschema-1/">structures</a> and <a
821href="http://www.w3.org/TR/xmlschema-2/">datatypes</a> code. Beware: all
822interfaces are likely to change, there are huge holes, it is clearly a work in
823progress, and don't even think of putting this code in a production system;
824it's actually not compiled in by default. The real fixes are:</p>
825<ul>
826  <li>a couple of bugs or limitations introduced in 2.4.20</li>
827  <li>patches for Borland C++ and MSC by Igor</li>
828  <li>some fixes on XPath strings and conformance patches by Richard
829  Jinks</li>
830  <li>patch from Aleksey for the ExcC14N specification</li>
831  <li>OSF/1 bug fix by Bjorn</li>
832</ul>
833
834<h3>2.4.20: Apr 15 2002</h3>
835<ul>
836  <li>bug fixes: file descriptor leak, XPath, HTML output, DTD validation</li>
837  <li>XPath conformance testing by Richard Jinks</li>
838  <li>Portability fixes: Solaris, MPE/iX, Windows, OSF/1, python bindings,
839    libxml.m4</li>
840</ul>
841
842<h3>2.4.19: Mar 25 2002</h3>
843<ul>
844  <li>bug fixes: half a dozen XPath bugs, Validation, ISO-Latin to UTF8
845    encoder</li>
846  <li>portability fixes in the HTTP code</li>
847  <li>memory allocation checks using valgrind, and profiling tests</li>
848  <li>revamp of the Windows build and Makefiles</li>
849</ul>
850
851<h3>2.4.18: Mar 18 2002</h3>
852<ul>
853  <li>bug fixes: tree, SAX, canonicalization, validation, portability,
854  XPath</li>
855  <li>removed the --with-buffer option, which was becoming unmaintainable</li>
856  <li>serious cleanup of the Python makefiles</li>
857  <li>speedup patch to XPath, very effective for DocBook stylesheets</li>
858  <li>Fixes for Windows build, cleanup of the documentation</li>
859</ul>
860
861<h3>2.4.17: Mar 8 2002</h3>
862<ul>
863  <li>a lot of bug fixes, including "namespace nodes have no parents in
864  XPath"</li>
865  <li>fixed/improved the Python wrappers, added more examples and more
866    regression tests, XPath extension functions can now return node-sets</li>
867  <li>added the XML Canonicalization support from Aleksey Sanin</li>
868</ul>
869
870<h3>2.4.16: Feb 20 2002</h3>
871<ul>
872  <li>a lot of bug fixes, most of them were triggered by the XML Testsuite
873    from OASIS and W3C. Compliance has been significantly improved.</li>
874  <li>a couple of portability fixes too.</li>
875</ul>
876
877<h3>2.4.15: Feb 11 2002</h3>
878<ul>
879  <li>Fixed the Makefiles, especially the python module ones</li>
880  <li>A few bug fixes and cleanup</li>
881  <li>Includes cleanup</li>
882</ul>
883
884<h3>2.4.14: Feb 8 2002</h3>
885<ul>
886  <li>Change of License to the <a
887    href="http://www.opensource.org/licenses/mit-license.html">MIT
888    License</a> basically for integration in XFree86 codebase, and removing
889    confusion around the previous dual-licensing</li>
890  <li>added Python bindings, beta software but should already be quite
891    complete</li>
892  <li>a large number of fixes and cleanups, especially for all tree
893    manipulations</li>
894  <li>cleanup of the headers, generation of a reference API definition in
895  XML</li>
896</ul>
897
898<h3>2.4.13: Jan 14 2002</h3>
899<ul>
900  <li>update of the documentation: John Fleck and Charlie Bozeman</li>
901  <li>cleanup of timing code from Justin Fletcher</li>
902  <li>fixes for Windows and initial thread support on Win32: Igor and Serguei
903    Narojnyi</li>
904  <li>Cygwin patch from Robert Collins</li>
905  <li>added xmlSetEntityReferenceFunc() for Keith Isdale work on xsldbg</li>
906</ul>
907
908<h3>2.4.12: Dec 7 2001</h3>
909<ul>
910  <li>a few bug fixes: thread (Gary Pennington), xmllint (Geert Kloosterman),
911    XML parser (Robin Berjon), XPointer (Danny Jamshy), I/O cleanups
912  (robert)</li>
913  <li>Eric Lavigne contributed project files for MacOS</li>
914  <li>some makefiles cleanups</li>
915</ul>
916
917<h3>2.4.11: Nov 26 2001</h3>
918<ul>
919  <li>fixed a couple of errors in the includes, fixed a few bugs, some code
920    cleanups</li>
921  <li>xmllint man pages improvement by Heiko Rupp</li>
922  <li>updated VMS build instructions from John A Fotheringham</li>
923  <li>Windows Makefiles updates from Igor</li>
924</ul>
925
926<h3>2.4.10: Nov 10 2001</h3>
927<ul>
928  <li>URI escaping fix (Joel Young)</li>
929  <li>added xmlGetNodePath() (for paths or XPointers generation)</li>
930  <li>Fixes namespace handling problems when using DTD and validation</li>
931  <li>improvements on xmllint: Morus Walter patches for --format and
932    --encode, Stefan Kost and Heiko Rupp improvements on the --shell</li>
933  <li>fixes for xmlcatalog linking pointed by Weiqi Gao</li>
934  <li>fixes to the HTML parser</li>
935</ul>
936
937<h3>2.4.9: Nov 6 2001</h3>
938<ul>
939  <li>fixes more catalog bugs</li>
940  <li>avoid a compilation problem, improve xmlGetLineNo()</li>
941</ul>
942
943<h3>2.4.8: Nov 4 2001</h3>
944<ul>
945  <li>fixed SGML catalogs broken in previous release, updated xmlcatalog
946  tool</li>
947  <li>fixed a compile error and some include troubles.</li>
948</ul>
949
950<h3>2.4.7: Oct 30 2001</h3>
951<ul>
952  <li>exported some debugging interfaces</li>
953  <li>serious rewrite of the catalog code</li>
954  <li>integrated Gary Pennington thread safety patch, added configure option
955    and regression tests</li>
956  <li>removed an HTML parser bug</li>
957  <li>fixed a couple of potentially serious validation bugs</li>
958  <li>integrated the SGML DocBook support in xmllint</li>
959  <li>changed the nanoftp anonymous login passwd</li>
960  <li>some I/O cleanup and a couple of interfaces for Perl wrapper</li>
961  <li>general bug fixes</li>
962  <li>updated xmllint man page by John Fleck</li>
963  <li>some VMS and Windows updates</li>
964</ul>
965
966<h3>2.4.6: Oct 10 2001</h3>
967<ul>
968  <li>added an updated man page by John Fleck</li>
969  <li>portability and configure fixes</li>
970  <li>an infinite loop on the HTML parser was removed (William)</li>
971  <li>Windows makefile patches from Igor</li>
972  <li>fixed half a dozen bugs reported for libxml or libxslt</li>
973  <li>updated xmlcatalog to be able to modify SGML super catalogs</li>
974</ul>
975
976<h3>2.4.5: Sep 14 2001</h3>
977<ul>
978  <li>Remove a few annoying bugs in 2.4.4</li>
979  <li>forces the HTML serializer to output decimal charrefs since some
980    versions of Netscape can't handle hexadecimal ones</li>
981</ul>
982
983<h3>1.8.16: Sep 14 2001</h3>
984<ul>
985  <li>maintenance release of the old libxml1 branch, a couple of bug and
986    portability fixes</li>
987</ul>
988
989<h3>2.4.4: Sep 12 2001</h3>
990<ul>
991  <li>added --convert to xmlcatalog, bug fixes and cleanups of XML
992  Catalog</li>
993  <li>a few bug fixes and some portability changes</li>
994  <li>some documentation cleanups</li>
995</ul>
996
997<h3>2.4.3:  Aug 23 2001</h3>
998<ul>
999  <li>XML Catalog support, see the doc</li>
1000  <li>New NaN/Infinity floating point code</li>
1001  <li>A few bug fixes</li>
1002</ul>
1003
1004<h3>2.4.2:  Aug 15 2001</h3>
1005<ul>
1006  <li>adds xmlLineNumbersDefault() to control line number generation</li>
1007  <li>lots of bug fixes</li>
1008  <li>the Microsoft MSC project files should now be up to date</li>
1009  <li>inheritance of namespaces from DTD defaulted attributes</li>
1010  <li>fixes a serious potential security bug</li>
1011  <li>added a --format option to xmllint</li>
1012</ul>
1013
1014<h3>2.4.1:  July 24 2001</h3>
1015<ul>
1016  <li>possibility to keep line numbers in the tree</li>
1017  <li>some NaN computation fixes</li>
1018  <li>extension of the XPath API</li>
1019  <li>cleanup for alpha and ia64 targets</li>
1020  <li>patch to allow saving through HTTP PUT or POST</li>
1021</ul>
1022
1023<h3>2.4.0: July 10 2001</h3>
1024<ul>
1025  <li>Fixed a few bugs in XPath, validation, and tree handling.</li>
1026  <li>Fixed XML Base implementation, added a couple of examples to the
1027    regression tests</li>
1028  <li>A bit of cleanup</li>
1029</ul>
1030
1031<h3>2.3.14: July 5 2001</h3>
1032<ul>
1033  <li>fixed some entities problems and reduced memory requirements when
1034    substituting them</li>
1035  <li>lots of improvements in the XPath queries interpreter, which can be
1036    substantially faster</li>
1037  <li>Makefiles and configure cleanups</li>
1038  <li>Fixes to XPath variable eval, and compare on empty node set</li>
1039  <li>HTML tag closing bug fixed</li>
1040  <li>Fixed a URI reference computation problem when validating</li>
1041</ul>
1042
1043<h3>2.3.13: June 28 2001</h3>
1044<ul>
1045  <li>2.3.12 configure.in was broken as well as the push mode XML parser</li>
1046  <li>a few more fixes for compilation on Windows MSC by Yon Derek</li>
1047</ul>
1048
1049<h3>1.8.14: June 28 2001</h3>
1050<ul>
1051  <li>Zbigniew Chyla gave a patch to use the old XML parser in push mode</li>
1052  <li>Small Makefile fix</li>
1053</ul>
1054
1055<h3>2.3.12: June 26 2001</h3>
1056<ul>
1057  <li>lots of cleanup</li>
1058  <li>a couple of validation fixes</li>
1059  <li>fixed line number counting</li>
1060  <li>fixed serious problems in the XInclude processing</li>
1061  <li>added support for UTF8 BOM at beginning of entities</li>
1062  <li>fixed a strange gcc optimizer bug in XPath handling of floats, gcc-3.0
1063    miscompiles uri.c (William), Thomas Leitner provided a fix for the
1064    optimizer on Tru64</li>
1065  <li>incorporated Yon Derek and Igor Zlatkovic  fixes and improvements for
1066    compilation on Windows MSC</li>
1067  <li>update of libxml-doc.el (Felix Natter)</li>
1068  <li>fixed 2 bugs in URI normalization code</li>
1069</ul>
1070
1071<h3>2.3.11: June 17 2001</h3>
1072<ul>
1073  <li>updates to trio, Makefiles and configure should fix some portability
1074    problems (alpha)</li>
1075  <li>fixed some HTML serialization problems (pre, script, and block/inline
1076    handling), added encoding aware APIs, cleanup of this code</li>
1077  <li>added xmlHasNsProp()</li>
1078  <li>implemented a specific PI for encoding support in the DocBook SGML
1079    parser</li>
1080  <li>some XPath fixes (-Infinity, / as a function parameter and namespaces
1081    node selection)</li>
1082  <li>fixed a performance problem and an error in the validation code</li>
1083  <li>fixed XInclude routine to implement the recursive behaviour</li>
1084  <li>fixed xmlFreeNode problem when libxml is included statically twice</li>
1085  <li>added --version to xmllint for bug reports</li>
1086</ul>
1087
1088<h3>2.3.10: June 1 2001</h3>
1089<ul>
1090  <li>fixed the SGML catalog support</li>
1091  <li>a number of reported bugs got fixed, in XPath, iconv detection,
1092    XInclude processing</li>
1093  <li>XPath string function should now handle unicode correctly</li>
1094</ul>
1095
1096<h3>2.3.9: May 19 2001</h3>
1097
1098<p>Lots of bugfixes, and added a basic SGML catalog support:</p>
1099<ul>
1100  <li>HTML push bugfix #54891 and another patch from Jonas Borgstr&ouml;m</li>
1101  <li>some serious speed optimization again</li>
1102  <li>some documentation cleanups</li>
1103  <li>trying to get better linking on Solaris (-R)</li>
1104  <li>XPath API cleanup from Thomas Broyer</li>
1105  <li>Validation bug fixed #54631, added a patch from Gary Pennington, fixed
1106    xmlValidGetValidElements()</li>
1107  <li>Added an INSTALL file</li>
1108  <li>Attribute removal added to API: #54433</li>
1109  <li>added a basic support for SGML catalogs</li>
1110  <li>fixed xmlKeepBlanksDefault(0) API</li>
1111  <li>bugfix in xmlNodeGetLang()</li>
1112  <li>fixed a small configure portability problem</li>
1113  <li>fixed an inversion of SYSTEM and PUBLIC identifier in HTML document</li>
1114</ul>
1115
1116<h3>1.8.13: May 14 2001</h3>
1117<ul>
1118  <li>bug fix release of the old libxml1 branch used by Gnome</li>
1119</ul>
1120
1121<h3>2.3.8: May 3 2001</h3>
1122<ul>
1123  <li>Integrated an SGML DocBook parser for the Gnome project</li>
1124  <li>Fixed a few things in the HTML parser</li>
1125  <li>Fixed some XPath bugs raised by XSLT use, tried to fix the floating
1126    point portability issue</li>
1127  <li>Speed improvement (8M/s for SAX, 3M/s for DOM, 1.5M/s for
1128    DOM+validation using the XML REC as input and a 700MHz celeron).</li>
1129  <li>incorporated more Windows cleanup</li>
1130  <li>added xmlSaveFormatFile()</li>
1131  <li>fixed problems in copying nodes with entities references (gdome)</li>
1132  <li>removed some troubles surrounding the new validation module</li>
1133</ul>
1134
1135<h3>2.3.7: April 22 2001</h3>
1136<ul>
1137  <li>lots of small bug fixes, corrected XPointer</li>
1138  <li>Non deterministic content model validation support</li>
1139  <li>added xmlDocCopyNode for gdome2</li>
1140  <li>revamped the way the HTML parser handles end of tags</li>
1141  <li>XPath: corrections of namespaces support and number formatting</li>
1142  <li>Windows: Igor Zlatkovic patches for MSC compilation</li>
1143  <li>HTML output fixes from P C Chow and William M. Brack</li>
1144  <li>Noticeably improved validation speed for DocBook</li>
1145  <li>fixed a big bug with ID declared in external parsed entities</li>
1146  <li>portability fixes, update of Trio from Bjorn Reese</li>
1147</ul>
1148
1149<h3>2.3.6: April 8 2001</h3>
1150<ul>
1151  <li>Code cleanup using extreme gcc compiler warning options, found and
1152    cleared half a dozen potential problems</li>
1153  <li>the Eazel team found an XML parser bug</li>
1154  <li>cleaned up the use of some of the string formatting functions, used the
1155    trio library code to provide the ones needed when the platform is missing
1156    them</li>
1157  <li>xpath: removed a memory leak and fixed the predicate evaluation
1158    problem, extended the testsuite and cleaned up the result. XPointer seems
1159    broken ...</li>
1160</ul>
1161
1162<h3>2.3.5: Mar 23 2001</h3>
1163<ul>
1164  <li>Biggest change is separate parsing and evaluation of XPath expressions,
1165    there are some new APIs for this too</li>
1166  <li>included a number of bug fixes (XML push parser, 51876, notations,
1167  52299)</li>
1168  <li>Fixed some portability issues</li>
1169</ul>
1170
1171<h3>2.3.4: Mar 10 2001</h3>
1172<ul>
1173  <li>Fixed bugs #51860 and #51861</li>
1174  <li>Added a global variable xmlDefaultBufferSize to allow default buffer
1175    size to be application tunable.</li>
1176  <li>Some cleanup in the validation code, still a bug left and this part
1177    should probably be rewritten to support ambiguous content model :-\</li>
1178  <li>Fix a couple of serious bugs introduced or raised by changes in 2.3.3
1179    parser</li>
1180  <li>Fixed another bug in xmlNodeGetContent()</li>
1181  <li>Bjorn fixed XPath node collection and Number formatting</li>
1182  <li>Fixed a loop reported in the HTML parsing</li>
1183  <li>blank spaces are reported even if the DTD content model proves that they
1184    are formatting spaces, this is for XML conformance</li>
1185</ul>
1186
1187<h3>2.3.3: Mar 1 2001</h3>
1188<ul>
1189  <li>small change in XPath for XSLT</li>
1190  <li>documentation cleanups</li>
1191  <li>fix in validation by Gary Pennington</li>
1192  <li>serious parsing performance improvements</li>
1193</ul>
1194
1195<h3>2.3.2: Feb 24 2001</h3>
1196<ul>
1197  <li>chasing XPath bugs, found a bunch, completed some TODO</li>
1198  <li>fixed a Dtd parsing bug</li>
1199  <li>fixed a bug in xmlNodeGetContent</li>
1200  <li>ID/IDREF support partly rewritten by Gary Pennington</li>
1201</ul>
1202
1203<h3>2.3.1: Feb 15 2001</h3>
1204<ul>
1205  <li>some XPath and HTML bug fixes for XSLT</li>
1206  <li>small extension of the hash table interfaces for DOM gdome2
1207    implementation</li>
1208  <li>A few bug fixes</li>
1209</ul>
1210
1211<h3>2.3.0: Feb 8 2001 (2.2.12 was on 25 Jan but I didn't keep track)</h3>
1212<ul>
1213  <li>Lots of XPath bug fixes</li>
1214  <li>Add a mode with Dtd lookup but without validation error reporting for
1215    XSLT</li>
1216  <li>Add support for text node without escaping (XSLT)</li>
1217  <li>bug fixes for xmlCheckFilename</li>
1218  <li>validation code bug fixes from Gary Pennington</li>
1219  <li>Patch from Paul D. Smith correcting URI path normalization</li>
1220  <li>Patch to allow simultaneous install of libxml-devel and
1221  libxml2-devel</li>
1222  <li>the example Makefile is now fixed</li>
1223  <li>added HTML to the RPM packages</li>
1224  <li>tree copying bugfixes</li>
1225  <li>updates to Windows makefiles</li>
1226  <li>optimization patch from Bjorn Reese</li>
1227</ul>
1228
1229<h3>2.2.11: Jan 4 2001</h3>
1230<ul>
1231  <li>bunch of bug fixes (memory I/O, xpath, ftp/http, ...)</li>
1232  <li>added htmlHandleOmittedElem()</li>
1233  <li>Applied Bjorn Reese's first IPv6 patch</li>
1234  <li>Applied Paul D. Smith patches for validation of XInclude results</li>
1235  <li>added XPointer xmlns() new scheme support</li>
1236</ul>
1237
1238<h3>2.2.10: Nov 25 2000</h3>
1239<ul>
1240  <li>Fix the Windows problems of 2.2.8</li>
1241  <li>integrate OpenVMS patches</li>
1242  <li>better handling of some nasty HTML input</li>
1243  <li>Improved the XPointer implementation</li>
1244  <li>integrate a number of provided patches</li>
1245</ul>
1246
1247<h3>2.2.9: Nov 25 2000</h3>
1248<ul>
1249  <li>erroneous release :-(</li>
1250</ul>
1251
1252<h3>2.2.8: Nov 13 2000</h3>
1253<ul>
1254  <li>First version of <a href="http://www.w3.org/TR/xinclude">XInclude</a>
1255    support</li>
1256  <li>Patch in conditional section handling</li>
1257  <li>updated MS compiler project</li>
1258  <li>fixed some XPath problems</li>
1259  <li>added an URI escaping function</li>
1260  <li>some other bug fixes</li>
1261</ul>
1262
1263<h3>2.2.7: Oct 31 2000</h3>
1264<ul>
1265  <li>added message redirection</li>
1266  <li>XPath improvements (thanks TOM !)</li>
1267  <li>xmlIOParseDTD() added</li>
1268  <li>various small fixes in the HTML, URI, HTTP and XPointer support</li>
1269  <li>some cleanup of the Makefile, autoconf and the distribution content</li>
1270</ul>
1271
1272<h3>2.2.6: Oct 25 2000</h3>
1273<ul>
1274  <li>Added a hash table module, migrated a number of internal structures to
1275    it</li>
1276  <li>Fixed a posteriori validation problems</li>
1277  <li>HTTP module cleanups</li>
1278  <li>HTML parser improvements (tag errors, script/style handling, attribute
1279    normalization)</li>
1280  <li>coalescing of adjacent text nodes</li>
1281  <li>couple of XPath bug fixes, exported the internal API</li>
1282</ul>
1283
1284<h3>2.2.5: Oct 15 2000</h3>
1285<ul>
1286  <li>XPointer implementation and testsuite</li>
1287  <li>Lots of XPath fixes, added variable and functions registration, more
1288    tests</li>
1289  <li>Portability fixes, lots of enhancements toward an easy Windows build
1290    and release</li>
1291  <li>Late validation fixes</li>
1292  <li>Integrated a lot of contributed patches</li>
1293  <li>added memory management docs</li>
1294  <li>a performance problem when using large buffers seems fixed</li>
1295</ul>
1296
1297<h3>2.2.4: Oct 1 2000</h3>
1298<ul>
1299  <li>main XPath problem fixed</li>
1300  <li>Integrated portability patches for Windows</li>
1301  <li>Serious bug fixes on the URI and HTML code</li>
1302</ul>
1303
1304<h3>2.2.3: Sep 17 2000</h3>
1305<ul>
1306  <li>bug fixes</li>
1307  <li>cleanup of entity handling code</li>
1308  <li>overall review of all loops in the parsers, all sprintf usage has been
1309    checked too</li>
1310  <li>Far better handling of large DTDs. Validating against the DocBook XML
1311    DTD works smoothly now.</li>
1312</ul>
1313
1314<h3>1.8.10: Sep 6 2000</h3>
1315<ul>
1316  <li>bug fix release for some Gnome projects</li>
1317</ul>
1318
1319<h3>2.2.2: August 12 2000</h3>
1320<ul>
1321  <li>mostly bug fixes</li>
1322  <li>started adding routines to access xml parser context options</li>
1323</ul>
1324
1325<h3>2.2.1: July 21 2000</h3>
1326<ul>
1327  <li>a pure bug fix release</li>
1328  <li>fixed an encoding support problem when parsing from a memory block</li>
1329  <li>fixed a DOCTYPE parsing problem</li>
1330  <li>removed a bug in the function that allows overriding the memory
1331    allocation routines</li>
1332</ul>
1333
1334<h3>2.2.0: July 14 2000</h3>
1335<ul>
1336  <li>applied a lot of portability fixes</li>
1337  <li>better encoding support/cleanup and saving (content is now always
1338    encoded in UTF-8)</li>
1339  <li>the HTML parser now correctly handles encodings</li>
1340  <li>added xmlHasProp()</li>
1341  <li>fixed a serious problem with &amp;#38;</li>
1342  <li>propagated the fix to FTP client</li>
1343  <li>cleanup, bugfixes, etc ...</li>
1344  <li>Added a page about <a href="encoding.html">libxml Internationalization
1345    support</a></li>
1346</ul>
1347
1348<h3>1.8.9:  July 9 2000</h3>
1349<ul>
1350  <li>fixed the spec file, the RPMs should be better</li>
1351  <li>fixed a serious bug in the FTP implementation, released 1.8.9 to solve
1352    a problem for rpmfind users</li>
1353</ul>
1354
1355<h3>2.1.1: July 1 2000</h3>
1356<ul>
1357  <li>fixes a couple of bugs in the 2.1.0 packaging</li>
1358  <li>improvements on the HTML parser</li>
1359</ul>
1360
1361<h3>2.1.0 and 1.8.8: June 29 2000</h3>
1362<ul>
1363  <li>1.8.8 is mostly a convenience package for upgrading to libxml2 according
1364    to <a href="upgrade.html">new instructions</a>. It fixes a nasty problem
1365    about &amp;#38; charref parsing</li>
1366  <li>2.1.0 also eases the upgrade from libxml v1 to the recent version. It
1367    also contains numerous fixes and enhancements:
1368    <ul>
1369      <li>added xmlStopParser() to stop parsing</li>
1370      <li>greatly improved parsing speed when there are large CDATA blocks</li>
1371      <li>includes XPath patches provided by Picdar Technology</li>
1372      <li>tried to fix as many DTD validation and namespace related
1373        problems as possible</li>
1374      <li>output to a given encoding has been added/tested</li>
1375      <li>lots of various fixes</li>
1376    </ul>
1377  </li>
1378</ul>
1379
1380<h3>2.0.0: Apr 12 2000</h3>
1381<ul>
1382  <li>First public release of libxml2. If you are using libxml, it's a good
1383    idea to check the 1.x to 2.x upgrade instructions. NOTE: while initially
1384    scheduled for Apr 3 the release occurred only on Apr 12 due to massive
1385    workload.</li>
1386  <li>The include files are now located under $prefix/include/libxml (instead of
1387    $prefix/include/gnome-xml), they also are referenced by
1388    <pre>#include &lt;libxml/xxx.h&gt;</pre>
1389    <p>instead of</p>
1390    <pre>#include "xxx.h"</pre>
1391  </li>
1392  <li>a new URI module for parsing URIs and following strictly RFC 2396</li>
1393  <li>the memory allocation routines used by libxml can now be overloaded
1394    dynamically by using xmlMemSetup()</li>
1395  <li>The previously CVS only tool tester has been renamed
1396    <strong>xmllint</strong> and is now installed as part of the libxml2
1397    package</li>
1398  <li>The I/O interface has been revamped. There are now ways to plug in
1399    specific I/O modules, either at the URI scheme detection level using
1400    xmlRegisterInputCallbacks()  or by passing I/O functions when creating a
1401    parser context using xmlCreateIOParserCtxt()</li>
1402  <li>there is a C preprocessor macro LIBXML_VERSION providing the version
1403    number of the libxml module in use</li>
1404  <li>a number of optional features of libxml can now be excluded at
1405    configure time (FTP/HTTP/HTML/XPath/Debug)</li>
1406</ul>
1407
1408<h3>2.0.0beta: Mar 14 2000</h3>
1409<ul>
1410  <li>This is a first Beta release of libxml version 2</li>
  <li>It's available only from <a href="ftp://xmlsoft.org/">xmlsoft.org
    FTP</a>; it's packaged as libxml2-2.0.0beta and available as tar and
  RPMs</li>
1414  <li>This version is now the head in the Gnome CVS base, the old one is
1415    available under the tag LIB_XML_1_X</li>
  <li>This includes a very large set of changes. From a programmatic point
    of view, applications should not have to be modified too much; check the
    <a href="upgrade.html">upgrade page</a></li>
  <li>Some interfaces may change (especially a bit about encoding).</li>
  <li>the updates include:
    <ul>
      <li>fixed I18N support. ISO-Latin-x/UTF-8/UTF-16 (nearly) seem to be
        correctly handled now</li>
      <li>Better handling of entities, especially well-formedness checking
        and proper PEref extensions in external subsets</li>
      <li>DTD conditional sections</li>
      <li>Validation now correctly handles entity content</li>
1428      <li><a href="http://rpmfind.net/tools/gdome/messages/0039.html">change
1429        structures to accommodate DOM</a></li>
1430    </ul>
1431  </li>
  <li>Serious progress was made toward compliance; <a
    href="conf/result.html">here are the results of the tests</a> against the
    OASIS test suite (except the Japanese tests, since I don't support that
    encoding yet). This URL is rebuilt every couple of hours using the CVS
    head version.</li>
1437</ul>
1438
1439<h3>1.8.7: Mar 6 2000</h3>
1440<ul>
1441  <li>This is a bug fix release:</li>
  <li>It is possible to disable the ignorable blanks heuristic used by
    libxml-1.x; a new function, xmlKeepBlanksDefault(0), allows this. Note
    that for adherence to the XML spec, this behaviour will be disabled by
    default in 2.x. The same function makes it possible to keep
    compatibility for old code.</li>
1447  <li>Blanks in &lt;a&gt;  &lt;/a&gt; constructs are not ignored anymore,
1448    avoiding heuristic is really the Right Way :-\</li>
1449  <li>The unchecked use of snprintf which was breaking libxml-1.8.6
1450    compilation on some platforms has been fixed</li>
1451  <li>nanoftp.c nanohttp.c: Fixed '#' and '?' stripping when processing
1452  URIs</li>
1453</ul>
1454
1455<h3>1.8.6: Jan 31 2000</h3>
1456<ul>
1457  <li>added a nanoFTP transport module, debugged until the new version of <a
1458    href="http://rpmfind.net/linux/rpm2html/rpmfind.html">rpmfind</a> can use
1459    it without troubles</li>
1460</ul>
1461
1462<h3>1.8.5: Jan 21 2000</h3>
1463<ul>
1464  <li>adding APIs to parse a well balanced chunk of XML (production <a
1465    href="http://www.w3.org/TR/REC-xml#NT-content">[43] content</a> of the
1466    XML spec)</li>
1467  <li>fixed a hideous bug in xmlGetProp pointed by Rune.Djurhuus@fast.no</li>
1468  <li>Jody Goldberg &lt;jgoldberg@home.com&gt; provided another patch trying
1469    to solve the zlib checks problems</li>
1470  <li>The current state in gnome CVS base is expected to ship as 1.8.5 with
1471    gnumeric soon</li>
1472</ul>
1473
1474<h3>1.8.4: Jan 13 2000</h3>
1475<ul>
1476  <li>bug fixes, reintroduced xmlNewGlobalNs(), fixed xmlNewNs()</li>
  <li>all exit() calls should have been removed from libxml</li>
1478  <li>fixed a problem with INCLUDE_WINSOCK on WIN32 platform</li>
1479  <li>added newDocFragment()</li>
1480</ul>
1481
1482<h3>1.8.3: Jan 5 2000</h3>
1483<ul>
1484  <li>a Push interface for the XML and HTML parsers</li>
1485  <li>a shell-like interface to the document tree (try tester --shell :-)</li>
  <li>lots of bug fixes and improvements added over the XMas holidays</li>
1487  <li>fixed the DTD parsing code to work with the xhtml DTD</li>
1488  <li>added xmlRemoveProp(), xmlRemoveID() and xmlRemoveRef()</li>
1489  <li>Fixed bugs in xmlNewNs()</li>
1490  <li>External entity loading code has been revamped, now it uses
1491    xmlLoadExternalEntity(), some fix on entities processing were added</li>
1492  <li>cleaned up WIN32 includes of socket stuff</li>
1493</ul>
1494
1495<h3>1.8.2: Dec 21 1999</h3>
1496<ul>
1497  <li>I got another problem with includes and C++, I hope this issue is fixed
1498    for good this time</li>
1499  <li>Added a few tree modification functions: xmlReplaceNode,
1500    xmlAddPrevSibling, xmlAddNextSibling, xmlNodeSetName and
1501    xmlDocSetRootElement</li>
1502  <li>Tried to improve the HTML output with help from <a
1503    href="mailto:clahey@umich.edu">Chris Lahey</a></li>
1504</ul>
1505
1506<h3>1.8.1: Dec 18 1999</h3>
1507<ul>
  <li>various patches to avoid troubles when using libxml with C++ compilers:
    the "namespace" keyword and C escaping in include files</li>
1510  <li>a problem in one of the core macros IS_CHAR was corrected</li>
1511  <li>fixed a bug introduced in 1.8.0 breaking default namespace processing,
1512    and more specifically the Dia application</li>
1513  <li>fixed a posteriori validation (validation after parsing, or by using a
1514    Dtd not specified in the original document)</li>
1515  <li>fixed a bug in</li>
1516</ul>
1517
1518<h3>1.8.0: Dec 12 1999</h3>
1519<ul>
1520  <li>cleanup, especially memory wise</li>
1521  <li>the parser should be more reliable, especially the HTML one, it should
1522    not crash, whatever the input !</li>
1523  <li>Integrated various patches, especially a speedup improvement for large
1524    dataset from <a href="mailto:cnygard@bellatlantic.net">Carl Nygard</a>,
1525    configure with --with-buffers to enable them.</li>
1526  <li>attribute normalization, oops should have been added long ago !</li>
1527  <li>attributes defaulted from DTDs should be available, xmlSetProp() now
1528    does entities escaping by default.</li>
1529</ul>
1530
1531<h3>1.7.4: Oct 25 1999</h3>
1532<ul>
1533  <li>Lots of HTML improvement</li>
1534  <li>Fixed some errors when saving both XML and HTML</li>
1535  <li>More examples, the regression tests should now look clean</li>
1536  <li>Fixed a bug with contiguous charref</li>
1537</ul>
1538
1539<h3>1.7.3: Sep 29 1999</h3>
1540<ul>
1541  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available, fixed</li>
1544</ul>
1545
1546<h3>1.7.1: Sep 24 1999</h3>
1547<ul>
1548  <li>The basic type for strings manipulated by libxml has been renamed in
1549    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
1550    is that CHAR was conflicting with a predefined type on Windows. However
1551    on non WIN32 environment, compatibility is provided by the way of  a
1552    <strong>#define </strong>.</li>
1553  <li>Changed another error : the use of a structure field called errno, and
1554    leading to troubles on platforms where it's a macro</li>
1555</ul>
1556
1557<h3>1.7.0: Sep 23 1999</h3>
1558<ul>
1559  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
1560    href="html/libxml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by another means than a simple
    printf-like callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
1564  <li>Serious memory leaks fixed (there is now a <a
1565    href="html/libxml-xmlmemory.html">memory wrapper</a> module)</li>
1566  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
1567    implementation</li>
1568  <li>Added an HTML parser front-end</li>
1569</ul>
1570
1571<h2><a name="XML">XML</a></h2>
1572
1573<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for
1574markup-based structured documents. Here is <a name="example">an example XML
1575document</a>:</p>
1576<pre>&lt;?xml version="1.0"?&gt;
1577&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too"&gt;
1578  &lt;head&gt;
1579   &lt;title&gt;Welcome to Gnome&lt;/title&gt;
1580  &lt;/head&gt;
1581  &lt;chapter&gt;
1582   &lt;title&gt;The Linux adventure&lt;/title&gt;
1583   &lt;p&gt;bla bla bla ...&lt;/p&gt;
1584   &lt;image href="linus.gif"/&gt;
1585   &lt;p&gt;...&lt;/p&gt;
1586  &lt;/chapter&gt;
1587&lt;/EXAMPLE&gt;</pre>
1588
1589<p>The first line specifies that it is an XML document and gives useful
1590information about its encoding.  Then the rest of the document is a text
1591format whose structure is specified by tags between brackets. <strong>Each
1592tag opened has to be closed</strong>. XML is pedantic about this. However, if
1593a tag is empty (no content), a single tag can serve as both the opening and
1594closing tag if it ends with <code>/&gt;</code> rather than with
1595<code>&gt;</code>. Note that, for example, the image tag has no content (just
1596an attribute) and is closed by ending the tag with <code>/&gt;</code>.</p>
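<p>The closing rule can be checked mechanically. The following is a minimal
sketch, assuming Python 3 and its standard library <code>xml.dom.minidom</code>
module rather than libxml2 itself (a substitution made so the example runs
without libxml2 installed); the helper name <code>is_well_formed</code> is
hypothetical:</p>

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        minidom.parseString(text).unlink()
        return True
    except ExpatError:
        return False


closed = '<chapter><p>...</p><image href="linus.gif"/></chapter>'
unclosed = '<chapter><p>...</p><image href="linus.gif"></chapter>'

print(is_well_formed(closed))    # the empty image tag ends with "/>"
print(is_well_formed(unclosed))  # the image tag is opened but never closed

# An empty-element tag and an explicit open/close pair describe the same
# element: neither has any content.
short_form = minidom.parseString('<image href="linus.gif"/>').documentElement
long_form = minidom.parseString('<image href="linus.gif"></image>').documentElement
print(short_form.hasChildNodes(), long_form.hasChildNodes())
```

<p>Only the first fragment is accepted; the second is rejected because the
image tag is never closed.</p>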
1597
1598<p>XML can be applied successfully to a wide range of tasks, ranging from
1599long term structured document maintenance (where it follows the steps of
1600SGML) to simple data encoding mechanisms like configuration file formatting
1601(glade), spreadsheets (gnumeric), or even shorter lived documents such as
1602WebDAV where it is used to encode remote calls between a client and a
1603server.</p>
1604
1605<h2><a name="XSLT">XSLT</a></h2>
1606
1607<p>Check <a href="http://xmlsoft.org/XSLT">the separate libxslt page</a></p>
1608
<p><a href="http://www.w3.org/TR/xslt">XSL Transformations</a> is a
language for transforming XML documents into other XML documents (or
HTML/textual output).</p>

<p>A separate library called libxslt is being developed on top of libxml2.
The "libxslt" module can also be found in the Gnome CVS base.</p>
1615
1616<p>You can check the <a
1617href="http://cvs.gnome.org/lxr/source/libxslt/FEATURES">features</a>
supported and the progress in the <a
1619href="http://cvs.gnome.org/lxr/source/libxslt/ChangeLog"
1620name="Changelog">Changelog</a>.</p>
1621
1622<h2><a name="Python">Python and bindings</a></h2>
1623
1624<p>There are a number of language bindings and wrappers available for
1625libxml2, the list below is not exhaustive. Please contact the <a
1626href="http://mail.gnome.org/mailman/listinfo/xml-bindings">xml-bindings@gnome.org</a>
1627(<a href="http://mail.gnome.org/archives/xml-bindings/">archives</a>) in
1628order to get updates to this list or to discuss the specific topic of libxml2
1629or libxslt wrappers or bindings:</p>
1630<ul>
1631  <li><a href="http://libxmlplusplus.sourceforge.net/">Libxml++</a> seems the
1632    most up-to-date C++ bindings for libxml2, check the <a
1633    href="http://libxmlplusplus.sourceforge.net/reference/html/hierarchy.html">documentation</a>
1634    and the <a
1635    href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/libxmlplusplus/libxml%2b%2b/examples/">examples</a>.</li>
1636  <li>There is another <a href="http://libgdome-cpp.berlios.de/">C++ wrapper
1637    based on the gdome2 bindings</a> maintained by Tobias Peters.</li>
1638  <li>and a third C++ wrapper by Peter Jones &lt;pjones@pmade.org&gt;
1639    <p>Website: <a
1640    href="http://pmade.org/pjones/software/xmlwrapp/">http://pmade.org/pjones/software/xmlwrapp/</a></p>
1641  </li>
1642  <li><a
1643    href="http://mail.gnome.org/archives/xml/2001-March/msg00014.html">Matt
1644    Sergeant</a> developed <a
1645    href="http://axkit.org/download/">XML::LibXSLT</a>, a Perl wrapper for
1646    libxml2/libxslt as part of the <a href="http://axkit.com/">AxKit XML
1647    application server</a>.</li>
1648  <li><a href="mailto:dkuhlman@cutter.rexx.com">Dave Kuhlman</a> provides an
1649    earlier version of the libxml/libxslt <a
1650    href="http://www.rexx.com/~dkuhlman">wrappers for Python</a>.</li>
1651  <li>Gopal.V and Peter Minten develop <a
1652    href="http://savannah.gnu.org/projects/libxmlsharp">libxml#</a>, a set of
1653    C# libxml2 bindings.</li>
1654  <li>Petr Kozelka provides <a
1655    href="http://sourceforge.net/projects/libxml2-pas">Pascal units to glue
1656    libxml2</a> with Kylix, Delphi and other Pascal compilers.</li>
1657  <li>Uwe Fechner also provides <a
1658    href="http://sourceforge.net/projects/idom2-pas/">idom2</a>, a DOM2
1659    implementation for Kylix2/D5/D6 from Borland.</li>
1660  <li>Wai-Sun "Squidster" Chia provides <a
1661    href="http://www.rubycolor.org/arc/redist/">bindings for Ruby</a>  and
1662    libxml2 bindings are also available in Ruby through the <a
1663    href="http://libgdome-ruby.berlios.de/">libgdome-ruby</a> module
1664    maintained by Tobias Peters.</li>
  <li>Steve Ball and contributors maintain <a
1666    href="http://tclxml.sourceforge.net/">libxml2 and libxslt bindings for
1667    Tcl</a>.</li>
1668  <li>There is support for libxml2 in the DOM module of PHP.</li>
1669  <li><a href="http://savannah.gnu.org/projects/classpathx/">LibxmlJ</a> is
1670    an effort to create a 100% JAXP-compatible Java wrapper for libxml2 and
1671    libxslt as part of GNU ClasspathX project.</li>
1672</ul>
1673
<p>The distribution includes a set of Python bindings, which are guaranteed
to be maintained as part of the library in the future, though the Python
interface has not yet reached the completeness of the C API.</p>
1677
<p><a href="mailto:stephane.bidoul@softwareag.com">St&eacute;phane Bidoul</a>
1679maintains <a href="http://users.skynet.be/sbi/libxml-python/">a Windows port
1680of the Python bindings</a>.</p>
1681
<p>Note to people interested in building bindings: the API is formalized as
<a href="libxml2-api.xml">an XML API description file</a>, which makes it
possible to automate a large part of the Python bindings. It includes function
descriptions, enums, structures, typedefs, etc. The Python script used to
build the bindings is python/generator.py in the source distribution.</p>
1687
1688<p>To install the Python bindings there are 2 options:</p>
1689<ul>
1690  <li>If you use an RPM based distribution, simply install the <a
1691    href="http://rpmfind.net/linux/rpm2html/search.php?query=libxml2-python">libxml2-python
1692    RPM</a> (and if needed the <a
1693    href="http://rpmfind.net/linux/rpm2html/search.php?query=libxslt-python">libxslt-python
1694    RPM</a>).</li>
1695  <li>Otherwise use the <a href="ftp://xmlsoft.org/python/">libxml2-python
1696    module distribution</a> corresponding to your installed version of
1697    libxml2 and libxslt. Note that to install it you will need both libxml2
1698    and libxslt installed and run "python setup.py build install" in the
1699    module tree.</li>
1700</ul>
1701
1702<p>The distribution includes a set of examples and regression tests for the
1703python bindings in the <code>python/tests</code> directory. Here are some
1704excerpts from those tests:</p>
1705
1706<h3>tst.py:</h3>
1707
1708<p>This is a basic test of the file interface and DOM navigation:</p>
<pre>import sys
import libxml2

# parse the file and check the name recorded for the document
doc = libxml2.parseFile("tst.xml")
if doc.name != "tst.xml":
    print("doc.name failed")
    sys.exit(1)
# the first child of the document is the root element
root = doc.children
if root.name != "doc":
    print("root.name failed")
    sys.exit(1)
child = root.children
if child.name != "foo":
    print("child.name failed")
    sys.exit(1)
# documents must be released explicitly
doc.freeDoc()</pre>
1724
<p>The Python module is called libxml2; parseFile is the equivalent of
xmlParseFile (most of the bindings are automatically generated: the xml
prefix is removed and the casing convention is kept). All nodes seen at the
binding level share the same subset of accessors:</p>
1729<ul>
1730  <li><code>name</code> : returns the node name</li>
1731  <li><code>type</code> : returns a string indicating the node type</li>
1732  <li><code>content</code> : returns the content of the node, it is based on
1733    xmlNodeGetContent() and hence is recursive.</li>
1734  <li><code>parent</code> , <code>children</code>, <code>last</code>,
1735    <code>next</code>, <code>prev</code>, <code>doc</code>,
1736    <code>properties</code>: pointing to the associated element in the tree,
1737    those may return None in case no such link exists.</li>
1738</ul>
1739
<p>Also note the need to explicitly deallocate documents with freeDoc().
Reference counting for libxml2 trees would need quite a lot of work to
function properly, and rather than risk memory leaks if it is not implemented
correctly, it sounds safer to have an explicit function to free a tree. The
wrapper Python objects like doc, root or child are then automatically garbage
collected.</p>
1746
1747<h3>validate.py:</h3>
1748
<p>This test checks the validation interfaces and the redirection of error
messages:</p>
<pre>import libxml2

# deactivate error messages from the validation
def noerr(ctx, str):
    pass

libxml2.registerErrorHandler(noerr, None)

ctxt = libxml2.createFileParserCtxt("invalid.xml")
ctxt.validate(1)
ctxt.parseDocument()
doc = ctxt.doc()
valid = ctxt.isValid()
doc.freeDoc()
if valid != 0:
    print("validity check failed")</pre>
1767
<p>The first thing to notice is the call to registerErrorHandler(), which
defines a new error handler global to the library. It is used to avoid seeing
the error messages when trying to validate the invalid document.</p>

<p>The main interest of that test is the creation of a parser context with
createFileParserCtxt() and how the behaviour can be changed before calling
parseDocument(). Similarly, the information resulting from the parsing phase
is also available using context methods.</p>

<p>Contexts, like nodes, are defined as classes, and the libxml2 wrappers map
the C function interfaces to object methods as much as possible. The best way
to get a complete view of what methods are supported is to look at the
libxml2.py module containing all the wrappers.</p>
1781
1782<h3>push.py:</h3>
1783
<p>This test shows how to activate the push parser interface:</p>
1785<pre>import libxml2
1786
1787ctxt = libxml2.createPushParser(None, "&lt;foo", 4, "test.xml")
1788ctxt.parseChunk("/&gt;", 2, 1)
1789doc = ctxt.doc()
1790
1791doc.freeDoc()</pre>
1792
1793<p>The context is created with a special call based on the
1794xmlCreatePushParser() from the C library. The first argument is an optional
1795SAX callback object, then the initial set of data, the length and the name of
1796the resource in case URI-References need to be computed by the parser.</p>
1797
1798<p>Then the data are pushed using the parseChunk() method, the last call
1799setting the third argument terminate to 1.</p>
1800
1801<h3>pushSAX.py:</h3>
1802
<p>This test shows the use of the event-based parsing interfaces. In this
case the parser does not build a document, but provides callback information
as it makes progress analyzing the data being provided:</p>
1806<pre>import libxml2
1807log = ""
1808
1809class callback:
1810    def startDocument(self):
1811        global log
1812        log = log + "startDocument:"
1813
1814    def endDocument(self):
1815        global log
1816        log = log + "endDocument:"
1817
1818    def startElement(self, tag, attrs):
1819        global log
1820        log = log + "startElement %s %s:" % (tag, attrs)
1821
1822    def endElement(self, tag):
1823        global log
1824        log = log + "endElement %s:" % (tag)
1825
1826    def characters(self, data):
1827        global log
1828        log = log + "characters: %s:" % (data)
1829
1830    def warning(self, msg):
1831        global log
1832        log = log + "warning: %s:" % (msg)
1833
1834    def error(self, msg):
1835        global log
1836        log = log + "error: %s:" % (msg)
1837
1838    def fatalError(self, msg):
1839        global log
1840        log = log + "fatalError: %s:" % (msg)
1841
1842handler = callback()
1843
1844ctxt = libxml2.createPushParser(handler, "&lt;foo", 4, "test.xml")
1845chunk = " url='tst'&gt;b"
1846ctxt.parseChunk(chunk, len(chunk), 0)
1847chunk = "ar&lt;/foo&gt;"
1848ctxt.parseChunk(chunk, len(chunk), 1)
1849
reference = ("startDocument:startElement foo {'url': 'tst'}:"
             "characters: bar:endElement foo:endDocument:")
if log != reference:
    print("Error got: %s" % log)
    print("Expected: %s" % reference)</pre>
1855
<p>The key object in that test is the handler. It provides a number of entry
points which can be called by the parser as it makes progress to indicate
the information set obtained. The full set of callbacks is larger than what
the callback class in that specific example implements (see the SAX
definition for a complete list). The wrapper will only call those supplied by
the object when activated. The startElement callback receives the name of the
element and a dictionary containing the attributes carried by this element.</p>

<p>Also note that the reference string generated from the callback shows a
single characters call even though the string "bar" is passed to the parser
in 2 different calls to parseChunk().</p>
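<p>The remark about character data can be made independent of how a parser
happens to group its callbacks. The following is only a sketch: it assumes
Python 3 and uses the standard library <code>xml.sax</code> incremental parser
in place of the libxml2 bindings (a substitution made so that it runs without
libxml2 installed), and the <code>TextCollector</code> class is a hypothetical
name. The handler gathers the pieces it receives and joins them when the
element ends, so it does not matter whether "bar" arrives in one call or
several:</p>

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Record element events and join the character data of each element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.log = []
        self.pieces = []

    def startElement(self, tag, attrs):
        self.pieces = []
        self.log.append(("startElement", tag, dict(attrs.items())))

    def characters(self, data):
        # May be called once or several times for a single run of text.
        self.pieces.append(data)

    def endElement(self, tag):
        self.log.append(("characters", "".join(self.pieces)))
        self.log.append(("endElement", tag))


handler = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
# Same data as in pushSAX.py, pushed in three chunks.
for chunk in ("<foo", " url='tst'>b", "ar</foo>"):
    parser.feed(chunk)
parser.close()
print(handler.log)
```

<p>Joining the pieces in endElement keeps the handler correct whatever
chunking the parser applies to the text.</p>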
1867
1868<h3>xpath.py:</h3>
1869
<p>This is a basic test of the XPath wrapper support:</p>
<pre>import sys
import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
doc.freeDoc()
ctxt.xpathFreeContext()</pre>
1884
<p>This test parses a file, then creates an XPath context to evaluate XPath
expressions on it. The xpathEval() method executes an XPath query and returns
the result mapped in a Python way. Strings and numbers are natively converted,
and node sets are returned as a tuple of libxml2 Python node wrappers. Like
the document, the XPath context needs to be freed explicitly. Also note that
the result of the XPath query may point back to the document tree, and hence
the document must be freed only after the result of the query has been
used.</p>
1892
1893<h3>xpathext.py:</h3>
1894
1895<p>This test shows how to extend the XPath engine with functions written in
1896python:</p>
1897<pre>import libxml2
1898
1899def foo(ctx, x):
1900    return x + 1
1901
1902doc = libxml2.parseFile("tst.xml")
1903ctxt = doc.xpathNewContext()
1904libxml2.registerXPathFunction(ctxt._o, "foo", None, foo)
1905res = ctxt.xpathEval("foo(1)")
1906if res != 2:
    print("xpath extension failure")
1908doc.freeDoc()
1909ctxt.xpathFreeContext()</pre>
1910
1911<p>Note how the extension function is registered with the context (but that
1912part is not yet finalized, this may change slightly in the future).</p>
1913
1914<h3>tstxpath.py:</h3>
1915
1916<p>This test is similar to the previous one but shows how the extension
1917function can access the XPath evaluation context:</p>
1918<pre>def foo(ctx, x):
1919    global called
1920
1921    #
1922    # test that access to the XPath evaluation contexts
1923    #
1924    pctxt = libxml2.xpathParserContext(_obj=ctx)
1925    ctxt = pctxt.context()
1926    called = ctxt.function()
1927    return x + 1</pre>
1928
<p>The interfaces around the XPath parser (or rather evaluation) context are
not all finalized, but they should be sufficient to do contextual work at the
evaluation point.</p>
1932
1933<h3>Memory debugging:</h3>
1934
<p>Last but not least, all tests start with the following prologue:</p>
1936<pre>#memory debug specific
1937libxml2.debugMemory(1)</pre>
1938
1939<p>and ends with the following epilogue:</p>
1940<pre>#memory debug specific
1941libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
    print("OK")
else:
    print("Memory leak %d bytes" % (libxml2.debugMemory(1)))
1946    libxml2.dumpMemory()</pre>
1947
<p>Those activate the memory debugging interface of libxml2, where all
blocks allocated in the library are tracked. The epilogue then cleans up the
library state and checks that all allocated memory has been freed. If not, it
calls dumpMemory(), which saves the list of remaining blocks in a
<code>.memdump</code> file.</p>
1952
1953<h2><a name="architecture">libxml architecture</a></h2>
1954
1955<p>Libxml is made of multiple components; some of them are optional, and most
1956of the block interfaces are public. The main components are:</p>
1957<ul>
1958  <li>an Input/Output layer</li>
1959  <li>FTP and HTTP client layers (optional)</li>
1960  <li>an Internationalization layer managing the encodings support</li>
1961  <li>a URI module</li>
1962  <li>the XML parser and its basic SAX interface</li>
1963  <li>an HTML parser using the same SAX interface (optional)</li>
1964  <li>a SAX tree module to build an in-memory DOM representation</li>
1965  <li>a tree module to manipulate the DOM representation</li>
1966  <li>a validation module using the DOM representation (optional)</li>
1967  <li>an XPath module for global lookup in a DOM representation
1968  (optional)</li>
1969  <li>a debug module (optional)</li>
1970</ul>
1971
1972<p>Graphically this gives the following:</p>
1973
1974<p><img src="libxml.gif" alt="a graphical view of the various"></p>
1975
1976<p></p>
1977
1978<h2><a name="tree">The tree output</a></h2>
1979
1980<p>The parser returns a tree built during the document analysis. The value
1981returned is an <strong>xmlDocPtr</strong> (i.e., a pointer to an
1982<strong>xmlDoc</strong> structure). This structure contains information such
1983as the file name, the document type, and a <strong>children</strong> pointer
1984which is the root of the document (or more exactly the first child under the
1985root which is the document). The tree is made of <strong>xmlNode</strong>s,
1986chained in double-linked lists of siblings and with a children&lt;-&gt;parent
1987relationship. An xmlNode can also carry properties (a chain of xmlAttr
1988structures). An attribute may have a value which is a list of TEXT or
1989ENTITY_REF nodes.</p>
1990
1991<p>Here is an example (erroneous with respect to the XML spec since there
1992should be only one ELEMENT under the root):</p>
1993
1994<p><img src="structure.gif" alt=" structure.gif "></p>
1995
1996<p>In the source package there is a small program (not installed by default)
1997called <strong>xmllint</strong> which parses XML files given as argument and
1998prints them back as parsed. This is useful for detecting errors both in XML
1999code and in the XML parser itself. It has an option <strong>--debug</strong>
2000which prints the actual in-memory structure of the document; here is the
2001result with the <a href="#example">example</a> given before:</p>
2002<pre>DOCUMENT
2003version=1.0
2004standalone=true
2005  ELEMENT EXAMPLE
2006    ATTRIBUTE prop1
2007      TEXT
2008      content=gnome is great
2009    ATTRIBUTE prop2
2010      ENTITY_REF
2011      TEXT
2012      content= linux too 
2013    ELEMENT head
2014      ELEMENT title
2015        TEXT
2016        content=Welcome to Gnome
2017    ELEMENT chapter
2018      ELEMENT title
2019        TEXT
2020        content=The Linux adventure
2021      ELEMENT p
2022        TEXT
2023        content=bla bla bla ...
2024      ELEMENT image
2025        ATTRIBUTE href
2026          TEXT
2027          content=linus.gif
2028      ELEMENT p
2029        TEXT
2030        content=...</pre>
2031
2032<p>This should be useful for learning the internal representation model.</p>
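<p>A similar walk over the tree can be written in a few lines. This is only a
sketch under stated assumptions: it uses Python 3 and the standard library
<code>xml.dom.minidom</code> module instead of libxml2, the <code>dump</code>
helper is a hypothetical name, and its output is not meant to match the
xmllint output byte for byte (whitespace-only text nodes are skipped, and
minidom hands back attribute values as plain strings with the entity already
expanded):</p>

```python
from xml.dom import minidom

EXAMPLE = """<?xml version="1.0"?>
<EXAMPLE prop1="gnome is great" prop2="&amp; linux too">
  <head>
   <title>Welcome to Gnome</title>
  </head>
  <chapter>
   <title>The Linux adventure</title>
   <p>bla bla bla ...</p>
   <image href="linus.gif"/>
   <p>...</p>
  </chapter>
</EXAMPLE>"""


def dump(node, depth=0, out=None):
    """Append an indented description of node and its subtree to out."""
    if out is None:
        out = []
    pad = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        out.append("%sELEMENT %s" % (pad, node.tagName))
        for name, value in sorted(node.attributes.items()):
            out.append("%s  ATTRIBUTE %s" % (pad, name))
            out.append("%s    content=%s" % (pad, value))
        # Children are visited in document order, one level deeper.
        for child in node.childNodes:
            dump(child, depth + 1, out)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Whitespace-only text between tags is skipped to keep the dump short.
        out.append("%sTEXT" % pad)
        out.append("%scontent=%s" % (pad, node.data))
    return out


doc = minidom.parseString(EXAMPLE)
lines = dump(doc.documentElement)
print("\n".join(lines))
doc.unlink()
```

<p>The recursion mirrors the children and sibling links described above: each
element lists its attributes first, then its children.</p>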
2033
2034<h2><a name="interface">The SAX interface</a></h2>
2035
2036<p>Sometimes the DOM tree output is just too large to fit reasonably into
2037memory. In that case (and if you don't expect to save back the XML document
2038loaded using libxml), it's better to use the SAX interface of libxml. SAX is
2039a <strong>callback-based interface</strong> to the parser. Before parsing,
2040the application layer registers a customized set of callbacks which are
2041called by the library as it progresses through the XML input.</p>
2042
<p>To get more detailed step-by-step guidance on using the SAX interface of
libxml, see the <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice
documentation</a> written by <a href="mailto:james@daa.com.au">James
Henstridge</a>.</p>
2048
2049<p>You can debug the SAX behaviour by using the <strong>testSAX</strong>
2050program located in the gnome-xml module (it's usually not shipped in the
2051binary packages of libxml, but you can find it in the tar source
2052distribution). Here is the sequence of callbacks that would be reported by
2053testSAX when parsing the example XML document shown earlier:</p>
2054<pre>SAX.setDocumentLocator()
2055SAX.startDocument()
2056SAX.getEntity(amp)
2057SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp;amp; linux too')
2058SAX.characters(   , 3)
2059SAX.startElement(head)
2060SAX.characters(    , 4)
2061SAX.startElement(title)
2062SAX.characters(Welcome to Gnome, 16)
2063SAX.endElement(title)
2064SAX.characters(   , 3)
2065SAX.endElement(head)
2066SAX.characters(   , 3)
2067SAX.startElement(chapter)
2068SAX.characters(    , 4)
2069SAX.startElement(title)
2070SAX.characters(The Linux adventure, 19)
2071SAX.endElement(title)
2072SAX.characters(    , 4)
2073SAX.startElement(p)
2074SAX.characters(bla bla bla ..., 15)
2075SAX.endElement(p)
2076SAX.characters(    , 4)
2077SAX.startElement(image, href='linus.gif')
2078SAX.endElement(image)
2079SAX.characters(    , 4)
2080SAX.startElement(p)
2081SAX.characters(..., 3)
2082SAX.endElement(p)
2083SAX.characters(   , 3)
2084SAX.endElement(chapter)
2085SAX.characters( , 1)
2086SAX.endElement(EXAMPLE)
2087SAX.endDocument()</pre>
2088
<p>Most of the other interfaces of libxml are based on the DOM tree-building
facility, so nearly everything up to the end of this document presupposes the
use of the standard DOM tree build. Note that the DOM tree itself is built by
a set of registered default callbacks, without any specific internal
interface.</p>
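<p>The callback model can be tried out without the C library. The sketch
below assumes Python 3 and the standard library <code>xml.sax</code> module,
not testSAX or libxml2; the <code>EventLogger</code> class is a hypothetical
name. It records document and element events for the same example document
and deliberately leaves out the characters callbacks, since the way text is
split across calls can differ from one parser to another:</p>

```python
import xml.sax

EXAMPLE = b"""<?xml version="1.0"?>
<EXAMPLE prop1="gnome is great" prop2="&amp; linux too">
  <head>
   <title>Welcome to Gnome</title>
  </head>
  <chapter>
   <title>The Linux adventure</title>
   <p>bla bla bla ...</p>
   <image href="linus.gif"/>
   <p>...</p>
  </chapter>
</EXAMPLE>"""


class EventLogger(xml.sax.ContentHandler):
    """Record document and element callbacks in the order they arrive."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        args = [name] + ["%s='%s'" % (k, v) for k, v in sorted(attrs.items())]
        self.events.append("startElement(%s)" % ", ".join(args))

    def endElement(self, name):
        self.events.append("endElement(%s)" % name)


logger = EventLogger()
xml.sax.parseString(EXAMPLE, logger)
events = logger.events
print("\n".join(events))
```

<p>No tree is built here: the application only sees the stream of events and
keeps whatever state it needs.</p>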
2094
2095<h2><a name="Validation">Validation &amp; DTDs</a></h2>
2096
<p>Table of Contents:</p>
<ol>
  <li><a href="#General5">General overview</a></li>
  <li><a href="#definition1">The definition</a></li>
  <li><a href="#Simple1">Simple rules</a>
    <ol>
      <li><a href="#reference1">How to reference a DTD from a document</a></li>
2104      <li><a href="#Declaring">Declaring elements</a></li>
2105      <li><a href="#Declaring1">Declaring attributes</a></li>
2106    </ol>
2107  </li>
2108  <li><a href="#Some">Some examples</a></li>
2109  <li><a href="#validate">How to validate</a></li>
2110  <li><a href="#Other">Other resources</a></li>
2111</ol>
2112
2113<h3><a name="General5">General overview</a></h3>
2114
<p>Well, what is validation and what is a DTD?</p>
2116
2117<p>DTD is the acronym for Document Type Definition. This is a description of
2118the content for a family of XML files. This is part of the XML 1.0
2119specification, and allows one to describe and verify that a given document
2120instance conforms to the set of rules detailing its structure and content.</p>
2121
2122<p>Validation is the process of checking a document against a DTD (more
2123generally against a set of construction rules).</p>
2124
<p>The validation process and building DTDs are the two most difficult parts
of the XML life cycle. Briefly, a DTD defines all the possible elements to be
found within your document and the formal shape of your document tree (by
defining the allowed content of an element: either text, a regular expression
for the allowed list of children, or mixed content, i.e. both text and
children). The DTD also defines the valid attributes for all elements and
the types of those attributes.</p>
2132
2133<h3><a name="definition1">The definition</a></h3>
2134
2135<p>The <a href="http://www.w3.org/TR/REC-xml">W3C XML Recommendation</a> (<a
2136href="http://www.xml.com/axml/axml.html">Tim Bray's annotated version of
2137Rev1</a>):</p>
2138<ul>
2139  <li><a href="http://www.w3.org/TR/REC-xml#elemdecls">Declaring
2140  elements</a></li>
2141  <li><a href="http://www.w3.org/TR/REC-xml#attdecls">Declaring
2142  attributes</a></li>
2143</ul>
2144
<p>Unfortunately, all of this is inherited from the SGML world, and the
syntax is ancient.</p>
2147
2148<h3><a name="Simple1">Simple rules</a></h3>
2149
<p>Writing DTDs can be done in many ways. The rules for building them can be
radically different depending on whether you need something permanent or
something which can evolve over time. Really complex DTDs like the DocBook
ones are flexible but considerably harder to design. I will just focus on
DTDs for formats with a fixed, simple structure. What follows is just a set
of basic rules; it is definitely not exhaustive, nor usable for complex DTD
design.</p>
2156
2157<h4><a name="reference1">How to reference a DTD from a document</a>:</h4>
2158
<p>Assuming the top element of the document is <code>spec</code> and the DTD
is placed in the file <code>mydtd</code> in the subdirectory
<code>dtds</code> of the directory from which the document was loaded:</p>
2162
2163<p><code>&lt;!DOCTYPE spec SYSTEM "dtds/mydtd"&gt;</code></p>
2164
2165<p>Notes:</p>
2166<ul>
  <li>The system string is actually a URI reference (as defined in <a
    href="http://www.ietf.org/rfc/rfc2396.txt">RFC 2396</a>), so you can use a
    full URL string indicating the location of your DTD on the Web. This is a
    really good thing to do if you want others to validate your document.</li>
  <li>It is also possible to associate a <code>PUBLIC</code> identifier (a
    magic string) so that the DTD is looked up in catalogs on the client side
    without having to locate it on the web.</li>
  <li>A DTD contains a set of element and attribute declarations, but it
    does not define what the root of the document should be. This is
    explicitly told to the parser/validator as the first name in the
    <code>DOCTYPE</code> declaration.</li>
2178</ul>
2179
2180<h4><a name="Declaring2">Declaring elements</a>:</h4>
2181
2182<p>The following declares an element <code>spec</code>:</p>
2183
2184<p><code>&lt;!ELEMENT spec (front, body, back?)&gt;</code></p>
2185
<p>It also expresses that the <code>spec</code> element contains one
<code>front</code>, one <code>body</code> and one optional <code>back</code>
child element, in this order. An element and its content are thus declared in
a single declaration. Similarly, the following declares <code>div1</code>
elements:</p>
2191
2192<p><code>&lt;!ELEMENT div1 (head, (p | list | note)*, div2?)&gt;</code></p>
2193
<p>which means <code>div1</code> contains one <code>head</code>, then any
number of <code>p</code>, <code>list</code> and <code>note</code> elements in
any order, and then an optional <code>div2</code>. Last but not least, an
element can contain text:</p>
2198
2199<p><code>&lt;!ELEMENT b (#PCDATA)&gt;</code></p>
2200
<p><code>b</code> contains text. An element can also have mixed content (text
and elements in no particular order):</p>
2203
2204<p><code>&lt;!ELEMENT p (#PCDATA|a|ul|b|i|em)*&gt;</code></p>
2205
<p><code>p</code> can contain text or <code>a</code>, <code>ul</code>,
<code>b</code>, <code>i</code> or <code>em</code> elements in no particular
order.</p>
2209
2210<h4><a name="Declaring1">Declaring attributes</a>:</h4>
2211
<p>Again, an attribute declaration includes the definition of its content:</p>
2213
2214<p><code>&lt;!ATTLIST termdef name CDATA #IMPLIED&gt;</code></p>
2215
<p>means that the element <code>termdef</code> can have a <code>name</code>
attribute containing text (<code>CDATA</code>), and that this attribute is
optional (<code>#IMPLIED</code>). The attribute value can also be restricted
to a set:</p>
2220
2221<p><code>&lt;!ATTLIST list type (bullets|ordered|glossary)
2222"ordered"&gt;</code></p>
2223
<p>means that the <code>list</code> element has a <code>type</code> attribute
with 3 allowed values, "bullets", "ordered" or "glossary", which defaults to
"ordered" if the attribute is not explicitly specified.</p>
2227
<p>The content type of an attribute can be text (<code>CDATA</code>),
anchor/reference/references
(<code>ID</code>/<code>IDREF</code>/<code>IDREFS</code>), entity(ies)
(<code>ENTITY</code>/<code>ENTITIES</code>) or name(s)
(<code>NMTOKEN</code>/<code>NMTOKENS</code>). The following defines that a
<code>chapter</code> element can have an optional <code>id</code> attribute
of type <code>ID</code>, usable for reference from attributes of type
<code>IDREF</code>:</p>
2236
2237<p><code>&lt;!ATTLIST chapter id ID #IMPLIED&gt;</code></p>
2238
<p>The last value of an attribute definition can be <code>#REQUIRED</code>,
meaning that the attribute has to be given, <code>#IMPLIED</code>,
meaning that it is optional, or the default value (possibly prefixed by
<code>#FIXED</code> if it is the only value allowed).</p>
2243
2244<p>Notes:</p>
2245<ul>
2246  <li>Usually the attributes pertaining to a given element are declared in a
2247    single expression, but it is just a convention adopted by a lot of DTD
2248    writers:
2249    <pre>&lt;!ATTLIST termdef
2250          id      ID      #REQUIRED
2251          name    CDATA   #IMPLIED&gt;</pre>
2252    <p>The previous construct defines both <code>id</code> and
2253    <code>name</code> attributes for the element <code>termdef</code>.</p>
2254  </li>
2255</ul>
2256
2257<h3><a name="Some1">Some examples</a></h3>
2258
2259<p>The directory <code>test/valid/dtds/</code> in the libxml distribution
2260contains some complex DTD examples. The example in the file
2261<code>test/valid/dia.xml</code> shows an XML file where the simple DTD is
2262directly included within the document.</p>
2263
2264<h3><a name="validate1">How to validate</a></h3>
2265
<p>The simplest way is to use the xmllint program included with libxml. The
<code>--valid</code> option turns on validation of the files given as input.
For example, the following validates a copy of the first revision of the XML
1.0 specification:</p>
2270
2271<p><code>xmllint --valid --noout test/valid/REC-xml-19980210.xml</code></p>
2272
<p>The <code>--noout</code> option disables output of the resulting tree.</p>

<p>The <code>--dtdvalid dtd</code> option allows validation of the document(s)
against a given DTD.</p>
2277
2278<p>Libxml exports an API to handle DTDs and validation, check the <a
2279href="http://xmlsoft.org/html/libxml-valid.html">associated
2280description</a>.</p>
2281
2282<h3><a name="Other1">Other resources</a></h3>
2283
<p>DTDs are as old as SGML, so there may be a number of examples on-line. I
will just list one for now; other pointers are welcome:</p>
2286<ul>
2287  <li><a href="http://www.xml101.com:8081/dtd/">XML-101 DTD</a></li>
2288</ul>
2289
<p>I suggest looking at the examples found under
<code>test/valid/dtds</code> and at any of the large number of books
available on XML. The dia example in <code>test/valid</code> should be both
simple and complete enough to allow you to build your own.</p>
2293
2294<p></p>
2295
2296<h2><a name="Memory">Memory Management</a></h2>
2297
<p>Table of Contents:</p>
2299<ol>
2300  <li><a href="#General3">General overview</a></li>
2301  <li><a href="#setting">Setting libxml set of memory routines</a></li>
2302  <li><a href="#cleanup">Cleaning up after parsing</a></li>
2303  <li><a href="#Debugging">Debugging routines</a></li>
2304  <li><a href="#General4">General memory requirements</a></li>
2305</ol>
2306
2307<h3><a name="General3">General overview</a></h3>
2308
2309<p>The module <code><a
2310href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlmemory.h</a></code>
2311provides the interfaces to the libxml memory system:</p>
2312<ul>
  <li>libxml does not use the libc memory allocator directly but xmlFree(),
    xmlMalloc() and xmlRealloc()</li>
  <li>those routines can be reassigned to a specific set of routines, by
    default the libc ones, i.e. free(), malloc() and realloc()</li>
  <li>the xmlmemory.c module includes a set of debugging routines</li>
2318</ul>
2319
2320<h3><a name="setting">Setting libxml set of memory routines</a></h3>
2321
2322<p>It is sometimes useful to not use the default memory allocator, either for
2323debugging, analysis or to implement a specific behaviour on memory management
2324(like on embedded systems). Two function calls are available to do so:</p>
2325<ul>
  <li><a href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlMemGet()</a>
    which returns the current set of functions in use by the parser</li>
  <li><a
    href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlMemSetup()</a>
    which sets up a new set of memory allocation functions</li>
2331</ul>
2332
<p>Of course a call to xmlMemSetup() should probably be done before calling
any other libxml routines (unless you are sure your allocation routines are
compatible).</p>
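<p>The reason for this ordering advice can be seen in a small standalone
sketch of the indirection. It does not call libxml:
<code>demoMemSetup()</code>, <code>demoMemGet()</code> and the other
<code>demo</code> names are hypothetical stand-ins which only mimic the idea
behind <code>xmlMemSetup()</code> and <code>xmlMemGet()</code>, reduced to two
routines.</p>

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the entry points a library would call
 * instead of using malloc()/free() directly. */
typedef void *(*demoMallocFunc)(size_t size);
typedef void (*demoFreeFunc)(void *mem);

static demoMallocFunc demoMalloc = malloc;  /* default: the libc routines */
static demoFreeFunc demoFree = free;

/* Install a new pair of routines; returns 0 on success, -1 on bad input. */
int demoMemSetup(demoFreeFunc freeFunc, demoMallocFunc mallocFunc)
{
    if (freeFunc == NULL || mallocFunc == NULL)
        return -1;
    demoFree = freeFunc;
    demoMalloc = mallocFunc;
    return 0;
}

/* Report the routines currently in use. */
void demoMemGet(demoFreeFunc *freeFunc, demoMallocFunc *mallocFunc)
{
    if (freeFunc != NULL)
        *freeFunc = demoFree;
    if (mallocFunc != NULL)
        *mallocFunc = demoMalloc;
}

/* A replacement allocator which counts the blocks still alive. */
static long liveBlocks = 0;

static void *countingMalloc(size_t size)
{
    void *mem = malloc(size);

    if (mem != NULL)
        liveBlocks++;
    return mem;
}

static void countingFree(void *mem)
{
    if (mem != NULL)
        liveBlocks--;
    free(mem);
}

long demoLiveBlocks(void)
{
    return liveBlocks;
}

/* Install the counting allocator; meant to be called before anything else. */
int demoInstallCounting(void)
{
    return demoMemSetup(countingFree, countingMalloc);
}

/* What library code would do: always go through the indirection. */
void *demoAlloc(size_t size)
{
    return demoMalloc(size);
}

void demoRelease(void *mem)
{
    demoFree(mem);
}
```

<p>If the routines were swapped while blocks obtained from the previous
allocator were still alive, those blocks would later be released by a
replacement which never saw them (here the counter would go negative), which
is why the setup call is best made first.</p>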
2336
2337<h3><a name="cleanup">Cleaning up after parsing</a></h3>
2338
<p>Libxml is not stateless: a few memory structures must be allocated before
the parser is fully functional (some encoding structures for example). This
also means that once parsing is finished there is a tiny amount of memory (a
few hundred bytes) which can be reclaimed if you don't reuse the parser
immediately:</p>
2344<ul>
  <li><a href="http://xmlsoft.org/html/libxml-parser.html">xmlCleanupParser()</a>
    is a centralized routine to free the parsing state. Note that it
    won't deallocate any tree that was produced (use xmlFreeDoc() and
    related routines for this).</li>
  <li><a href="http://xmlsoft.org/html/libxml-parser.html">xmlInitParser()</a>
    is the dual routine, which preallocates the parsing state. This
    can be useful, for example, to avoid initialization reentrancy
    problems when using libxml in multithreaded applications.</li>
2353</ul>
2354
<p>Generally xmlCleanupParser() is safe: if needed, the state will be rebuilt
at the next invocation of parser routines. Be careful of the consequences
in multithreaded applications, though.</p>
2358
2359<h3><a name="Debugging">Debugging routines</a></h3>
2360
<p>When configured using the --with-mem-debug flag (off by default), libxml
uses a set of memory allocation debugging routines which keep track of all
allocated blocks and of the location in the code where the routine was
called. A couple of other debugging routines make it possible to dump
information about the allocated memory to a file, or to call a specific
routine when a given block number is allocated:</p>
2366<ul>
2367  <li><a
2368    href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlMallocLoc()</a>
2369    <a
2370    href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlReallocLoc()</a>
2371    and <a
2372    href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlMemStrdupLoc()</a>
2373    are the memory debugging replacement allocation routines</li>
  <li><a href="http://xmlsoft.org/html/libxml-xmlmemory.html">xmlMemoryDump()</a>
    dumps all the information about the allocated memory blocks left
    to the <code>.memdump</code> file</li>
2377</ul>
2378
<p>When developing libxml, memory debugging is enabled: the test programs call
xmlMemoryDump() and the "make test" regression tests check for any
memory leak during the full regression test sequence. This helps a lot in
ensuring that libxml does not leak memory and in bullet-proofing its use of
memory allocations (some libc implementations are known to be far too
permissive, resulting in major portability problems!).</p>
2385
<p>If the .memdump file reports a leak, it displays the allocation function
and also tries to give some information about the content and structure of
the allocated blocks left. This is sufficient in most cases to find the
culprit, but not always. Assuming the allocation problem is reproducible, it
can be tracked down more easily as follows:</p>
2391<ol>
  <li>write down the number xxxx of the block which was not deallocated</li>
  <li>export the environment variable XML_MEM_BREAKPOINT=xxxx; the easiest
    way when using GDB is to simply give the command
    <p><code>set environment XML_MEM_BREAKPOINT xxxx</code></p>
    <p>before running the program.</p>
  </li>
  <li>run the program under a debugger and set a breakpoint on
    xmlMallocBreakpoint(), a specific function called when this precise block
    is allocated</li>
  <li>when the breakpoint is reached, you can then do a fine analysis of the
    allocation and step through the code to see the condition resulting in
    the missing deallocation.</li>
2404</ol>
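<p>To make the steps above less abstract, here is a minimal standalone sketch
of how a block-number breakpoint of this kind can work. It is not the code
from xmlmemory.c: the <code>demo</code> names and the
<code>DEMO_MEM_BREAKPOINT</code> variable are hypothetical, and the real
routines keep far more information about each block.</p>

```c
#include <stdio.h>
#include <stdlib.h>

unsigned long demoBlockCounter = 0;  /* number given to the last block */
unsigned long demoBreakBlock = 0;    /* block to watch, 0 means none */
int demoBreakHits = 0;               /* how often the hook was reached */

/* Hook whose only purpose is to give the debugger a place to stop. */
void demoMallocBreakpoint(void)
{
    demoBreakHits++;
}

/* Pick the block number to watch from the environment, if it is set. */
void demoInitBreakpoint(void)
{
    const char *value = getenv("DEMO_MEM_BREAKPOINT");

    if (value != NULL)
        demoBreakBlock = strtoul(value, NULL, 10);
}

/* Allocate a block, number it, and report where the request came from
 * when it is the watched block. */
void *demoMallocLoc(size_t size, const char *file, int line)
{
    void *mem = malloc(size);

    if (mem == NULL)
        return NULL;
    demoBlockCounter++;
    if (demoBlockCounter == demoBreakBlock) {
        fprintf(stderr, "block %lu allocated at %s:%d\n",
                demoBlockCounter, file, line);
        demoMallocBreakpoint();
    }
    return mem;
}

/* Convenience macro so that callers record their own location. */
#define demoMallocTracked(size) demoMallocLoc((size), __FILE__, __LINE__)
```

<p>Because block numbers are handed out in allocation order, the approach
only helps when the program allocates in the same order on every run, which
is the reproducibility assumption made above.</p>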
2405
<p>I used to use a commercial tool to debug libxml memory problems, but after
noticing that it was not detecting memory leaks, I switched to this simple
mechanism, which has proved extremely efficient so far. Lately I have also
used <a href="http://developer.kde.org/~sewardj/">valgrind</a> with quite
some success. It is tied to the i386 architecture since it works by emulating
the processor and instruction set; it is slow but extremely efficient, i.e.
it spots memory usage errors in a very precise way.</p>
2413
2414<h3><a name="General4">General memory requirements</a></h3>
2415
<p>How much memory does libxml require? It is hard to give an average, since
it depends on a number of things:</p>
2418<ul>
  <li>the parser itself should work in a fixed amount of memory, except for
    information maintained about the stacks of names and entity locations.
    The I/O and encoding handlers will probably account for a few KBytes.
    This is true for both the XML and HTML parsers (though the HTML parser
    needs more state).</li>
  <li>If you are generating the DOM tree, then memory requirements will grow
    nearly linearly with the size of the data. In general, for a balanced
    textual document, the internal memory requirement is about 4 times the
    size of the UTF-8 serialization of this document (for example, the
    XML-1.0 recommendation is a bit more than 150 KBytes and takes 650 KBytes
    of main memory when parsed). Validation will add an amount of memory
    required for maintaining the external DTD state, which should be linear
    with the complexity of the content model defined by the DTD.</li>
  <li>If you don't care about the advanced features of libxml like
    validation, DOM, XPath or XPointer, but really need to work within fixed
    memory requirements, then the SAX interface should be used.</li>
2435</ul>
2436
2437<p></p>
2438
2439<h2><a name="Encodings">Encodings support</a></h2>
2440
<p>Table of Contents:</p>
2442<ol>
2443  <li><a href="encoding.html#What">What does internationalization support
2444    mean ?</a></li>
2445  <li><a href="encoding.html#internal">The internal encoding, how and
2446  why</a></li>
2447  <li><a href="encoding.html#implemente">How is it implemented ?</a></li>
2448  <li><a href="encoding.html#Default">Default supported encodings</a></li>
2449  <li><a href="encoding.html#extend">How to extend the existing
2450  support</a></li>
2451</ol>
2452
2453<h3><a name="What">What does internationalization support mean ?</a></h3>
2454
<p>XML was designed from the start to allow the support of any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range.
UTF-8 is a variable length encoding whose greatest strengths are that it
reuses the same encoding for ASCII and saves space for Western languages, but
it is a bit more complex to handle in practice. UTF-16 uses 2 bytes per
character (and sometimes combines two such 2-byte units into a pair for a
single character); it makes implementation easier, but looks a bit like
overkill for encoding Western languages. Moreover, the XML specification
allows documents to be encoded in other encodings on the condition that they
are clearly labeled as such. For example, the following is a well-formed XML
document encoded in ISO-8859-1 and using the accented letters that we French
like, for both markup and content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;tr&egrave;s&gt;l&agrave;&lt;/tr&egrave;s&gt;</pre>
2469
2470<p>Having internationalization support in libxml means the following:</p>
2471<ul>
2472  <li>the document is properly parsed</li>
  <li>information about its encoding is saved</li>
2474  <li>it can be modified</li>
2475  <li>it can be saved in its original encoding</li>
2476  <li>it can also be saved in another encoding supported by libxml (for
2477    example straight UTF8 or even an ASCII form)</li>
2478</ul>
2479
2480<p>Another very important point is that the whole libxml API, with the
2481exception of a few routines to read with a specific encoding or save to a
2482specific encoding, is completely agnostic about the original encoding of the
2483document.</p>
2484
<p>It should be noted too that the HTML parser embedded in libxml now obeys
the same rules. The following document will be (as of 2.2.2) handled in
an internationalized fashion by libxml as well:</p>
2488<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
2489                      "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
2490&lt;html lang="fr"&gt;
2491&lt;head&gt;
2492  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"&gt;
2493&lt;/head&gt;
2494&lt;body&gt;
&lt;p&gt;W3C cr&eacute;e des standards pour le Web.&lt;/body&gt;
2496&lt;/html&gt;</pre>
2497
2498<h3><a name="internal">The internal encoding, how and why</a></h3>
2499
<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and to make that encoding UTF-8. Here is the
rationale for those choices:</p>
2503<ul>
  <li>keeping the native encoding in the internal form would force libxml
    users (or the associated code) to be fully aware of the encoding of the
    original document. For example, when adding a text node to a document,
    the content would have to be provided in the document encoding, i.e. the
    client code would have to check it beforehand, make sure it conforms
    to the encoding, etc. This is very hard in practice, though in some
    specific cases it may make sense.</li>
  <li>the second decision was which encoding to use. From the XML spec, only
    UTF-8 and UTF-16 really make sense, being the only two encodings for
    which support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
    considered an intelligent choice too, since it is a direct Unicode
    mapping. I selected UTF-8 on the basis of efficiency and compatibility
    with surrounding software:
2517    <ul>
      <li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU-wise), is also far more compact
        than UTF-16 (and UCS-4) for a majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architectures is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will thrash
        caches (main memory/external caches/internal caches), and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8</li>
      <li>Most libxml version 1 users were using it with straight ASCII
        most of the time. Switching to an internal encoding which would have
        required all their code to be rewritten was a serious show-stopper
        for using UTF-16 or UCS-4.</li>
      <li>UTF-8 is being used as the de-facto internal encoding standard for
        related code like the <a href="http://www.pango.org/">pango</a>
        upcoming Gnome text widget, and a lot of Unix code (yep, another
        place where the Unix programmer base takes a different approach from
        Microsoft, who are using UTF-16)</li>
2537    </ul>
2538  </li>
2539</ul>
2540
2541<p>What does this mean in practice for the libxml user:</p>
2542<ul>
  <li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
    as valid UTF-8 strings. The proper way to terminate an xmlChar * string
    is simply to append a 0 byte, as usual.</li>
  <li>One just needs to make sure that when using chars outside the ASCII
    set, the values have been properly converted to UTF-8</li>
2548</ul>
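<p>As an illustration of the second point, the following standalone sketch
converts an ISO-8859-1 string into a 0-terminated UTF-8 byte string of the
kind expected in an xmlChar * argument. The function name
<code>demoLatin1ToUTF8()</code> is hypothetical and the code does not depend
on libxml; it only handles ISO-8859-1, where each byte value is also the
Unicode code point.</p>

```c
#include <stdlib.h>
#include <string.h>

/* Convert a 0-terminated ISO-8859-1 string to a newly allocated UTF-8
 * string. Returns NULL if memory cannot be allocated; the caller frees
 * the result. */
unsigned char *demoLatin1ToUTF8(const unsigned char *in)
{
    size_t len = strlen((const char *) in);
    /* Each ISO-8859-1 byte becomes at most 2 UTF-8 bytes. */
    unsigned char *out = malloc(2 * len + 1);
    unsigned char *cur = out;

    if (out == NULL)
        return NULL;
    for (; *in != 0; in++) {
        if (*in < 0x80) {
            *cur++ = *in;                                    /* ASCII */
        } else {
            *cur++ = (unsigned char) (0xC0 | (*in >> 6));    /* 110xxxxx */
            *cur++ = (unsigned char) (0x80 | (*in & 0x3F));  /* 10xxxxxx */
        }
    }
    *cur = 0;                                /* usual 0 terminator */
    return out;
}
```

<p>For instance the ISO-8859-1 byte 0xE8 (&egrave;) becomes the two bytes
0xC3 0xA8, while pure ASCII input is returned unchanged.</p>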
2549
2550<h3><a name="implemente">How is it implemented ?</a></h3>
2551
<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations,
i.e. when reading a document or saving one. Let's look first at the reading
sequence:</p>
2556<ol>
  <li>when a document is processed, we usually don't know the encoding; a
    simple heuristic allows detecting UTF-16 and UCS-4 and distinguishing
    them from those encodings where the ASCII range (0-0x7F) maps to
    ASCII</li>
  <li>the XML declaration, if available, is parsed, including the encoding
    declaration. At that point, if the autodetected encoding is different
    from the one declared, a call to xmlSwitchEncoding() is issued.</li>
  <li>If there is no encoding declaration, then the input has to be in either
    UTF-8 or UTF-16. If it is not, then at some point when processing the
    input, the converter/checker of UTF-8 form will raise an encoding error.
    You may end up with a garbled document, or no document at all! Example:
    <pre>~/XML -&gt; /xmllint err.xml 
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;tr&egrave;s&gt;l&agrave;&lt;/tr&egrave;s&gt;
   ^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;tr&egrave;s&gt;l&agrave;&lt;/tr&egrave;s&gt;
   ^</pre>
2574  </li>
  <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
    then searches the default registered encoding converters for that
    encoding. If it's not within the default set and iconv() support has been
    compiled in, it will ask iconv for such an encoder. If this fails, then
    the parser will report an error and stop processing:
2580    <pre>~/XML -&gt; /xmllint err2.xml 
2581err2.xml:1: error: Unsupported encoding UnsupportedEnc
2582&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
2583                                             ^</pre>
2584  </li>
  <li>From that point on, the encoder progressively processes the input (it
    is plugged in as a front-end to the I/O module) for that entity. It
    captures the document to be parsed and converts it on the fly to UTF-8.
    The parser itself just does UTF-8 checking of this input and processes it
    transparently. The only difference is that the encoding information has
    been added to the parsing context (more precisely to the input
    corresponding to this entity).</li>
  <li>The result (when using DOM) is an internal form completely in UTF-8,
    with just the encoding information stored on the document node.</li>
2594</ol>
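<p>The heuristic of the first step can be sketched as follows. This is a
simplified standalone illustration based on the autodetection appendix of the
XML specification, not the routine used by libxml; the
<code>demoDetectEncoding()</code> function and the <code>DEMO_ENC_*</code>
values are hypothetical names.</p>

```c
#include <stddef.h>

typedef enum {
    DEMO_ENC_UNKNOWN,   /* nothing recognized: rely on the declaration */
    DEMO_ENC_UTF8,
    DEMO_ENC_UTF16LE,
    DEMO_ENC_UTF16BE,
    DEMO_ENC_UCS4LE,
    DEMO_ENC_UCS4BE
} demoEncoding;

/* Guess the encoding family from the first bytes of a document. */
demoEncoding demoDetectEncoding(const unsigned char *in, size_t len)
{
    if (in == NULL)
        return DEMO_ENC_UNKNOWN;
    if (len >= 4) {
        /* '<' stored on 4 bytes, or a 4-byte byte order mark */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return DEMO_ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return DEMO_ENC_UCS4LE;
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
            return DEMO_ENC_UCS4BE;
        if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
            return DEMO_ENC_UCS4LE;
        /* "<?" stored on 2 bytes per character */
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return DEMO_ENC_UTF16BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return DEMO_ENC_UTF16LE;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return DEMO_ENC_UTF8;                   /* UTF-8 byte order mark */
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF)
            return DEMO_ENC_UTF16BE;            /* UTF-16 byte order mark */
        if (in[0] == 0xFF && in[1] == 0xFE)
            return DEMO_ENC_UTF16LE;            /* UTF-16 byte order mark */
    }
    return DEMO_ENC_UNKNOWN;
}
```

<p>A document starting with the plain bytes of <code>&lt;?xml</code> yields
<code>DEMO_ENC_UNKNOWN</code> here: its ASCII range maps to ASCII, so the
encoding declaration can be read directly and the decision is left to the
second step.</p>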
2595
<p>So what happens when saving the document (assuming you collected/built an
xmlDoc DOM-like structure)? It depends on the function called:
xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
2601<ol>
2602  <li>if no encoding is given, libxml will look for an encoding value
2603    associated to the document and if it exists will try to save to that
2604    encoding,
2605    <p>otherwise everything is written in the internal form, i.e. UTF-8</p>
2606  </li>
  <li>so if an encoding was specified, either at the API level or on the
    document, libxml will again canonicalize the encoding name and look up a
    converter in the registered set or through iconv. If none is found, the
    function will return an error code</li>
  <li>the converter is placed before the I/O buffer layer, as another kind of
    buffer; libxml will then simply push the UTF-8 serialization through
    that buffer, which will progressively be converted and pushed onto
    the I/O layer.</li>
  <li>It is possible that the converter code fails on some input; for example
    trying to push a UTF-8 encoded Chinese character through the UTF-8 to
    ISO-8859-1 converter won't work. Since the encoders are progressive, they
    will just report the error and the number of bytes converted. At that
    point libxml will decode the offending character, remove it from the
    buffer, replace it with the associated charRef encoding &amp;#123; and
    resume the conversion. This guarantees that any document will be saved
    without losses (except for markup names, where this is not legal; this is
    a problem in the current version, so in practice avoid using non-ASCII
    characters for tag or attribute names). A special "ascii" encoding
    name is used to save documents in a pure ASCII form; it can be used when
    portability is really crucial</li>
2627</ol>
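<p>The charRef fallback described in the last step can be illustrated by a
small standalone converter. This is only a sketch of the idea, not libxml's
progressive encoder: <code>demoUTF8ToLatin1()</code> is a hypothetical name,
the whole string is converted in one call, and the UTF-8 decoding is kept
minimal (overlong sequences, for instance, are not rejected).</p>

```c
#include <stdio.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at in, store the code point in *code
 * and return the number of bytes used, or 0 if the input is invalid. */
static size_t demoDecodeUTF8(const unsigned char *in, unsigned long *code)
{
    size_t extra, i;

    if (in[0] < 0x80) {
        *code = in[0];
        return 1;
    } else if ((in[0] & 0xE0) == 0xC0) {
        *code = in[0] & 0x1F;
        extra = 1;
    } else if ((in[0] & 0xF0) == 0xE0) {
        *code = in[0] & 0x0F;
        extra = 2;
    } else if ((in[0] & 0xF8) == 0xF0) {
        *code = in[0] & 0x07;
        extra = 3;
    } else {
        return 0;
    }
    for (i = 1; i <= extra; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;           /* also catches a premature 0 terminator */
        *code = (*code << 6) | (in[i] & 0x3F);
    }
    return extra + 1;
}

/* Convert 0-terminated UTF-8 to ISO-8859-1, writing &#N; for characters
 * which ISO-8859-1 cannot represent. Returns 0 on success, -1 on invalid
 * input or if out (of size outlen) is too small. */
int demoUTF8ToLatin1(const unsigned char *in, char *out, size_t outlen)
{
    size_t used = 0;

    while (*in != 0) {
        unsigned long code;
        size_t n = demoDecodeUTF8(in, &code);
        int written;

        if (n == 0)
            return -1;
        if (code < 0x100) {
            if (used + 1 >= outlen)
                return -1;
            out[used] = (char) (unsigned char) code;
            written = 1;
        } else {
            written = snprintf(out + used, outlen - used, "&#%lu;", code);
            if (written < 0 || (size_t) written >= outlen - used)
                return -1;
        }
        used += (size_t) written;
        in += n;
    }
    if (used >= outlen)
        return -1;
    out[used] = 0;
    return 0;
}
```

<p>With this sketch a Chinese character such as U+4E2D comes out as
<code>&amp;#20013;</code>, which any XML parser will read back as the same
character, while characters below 0x100 are written as single bytes.</p>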
2628
<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; /xmllint isolat1 
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;tr&egrave;s&gt;l&agrave;&lt;/tr&egrave;s&gt;
~/XML -&gt; /xmllint --encode UTF-8 isolat1 
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;tr&egrave;s&gt;l&agrave;&lt;/tr&egrave;s&gt;
~/XML -&gt; </pre>
2637
<p>The same processing is applied to HTML I18N processing, reusing most of
the code. Looking up and modifying the content encoding is a bit more
difficult since it is located in a &lt;meta&gt; tag under the &lt;head&gt;,
so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(),
have been provided. The parser also attempts to switch encoding on the fly
when detecting such a tag on input. Except for that, the processing is the
same.</p>
2645
2646<h3><a name="Default">Default supported encodings</a></h3>
2647
2648<p>libxml has a set of default converters for the following encodings
2649(located in encoding.c):</p>
2650<ol>
2651  <li>UTF-8 is supported by default (null handlers)</li>
2652  <li>UTF-16, both little and big endian</li>
2653  <li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
2654  <li>ASCII, useful mostly for saving</li>
2655  <li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
2656    predefined entities like &amp;copy; for the Copyright sign.</li>
2657</ol>
2658
<p>Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml. On a
Linux machine with glibc-2.1 the list of supported encodings and aliases
fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings,
and the various Japanese ones.</p>
2664
2665<h4>Encoding aliases</h4>
2666
<p>From 2.2.3, libxml has support for registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases
for existing encodings. Once they are registered, libxml will automatically
look up the aliases when handling a document:</p>
2673<ul>
2674  <li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
2675  <li>int xmlDelEncodingAlias(const char *alias);</li>
2676  <li>const char * xmlGetEncodingAlias(const char *alias);</li>
2677  <li>void xmlCleanupEncodingAliases(void);</li>
2678</ul>
2679
2680<h3><a name="extend">How to extend the existing support</a></h3>
2681
<p>Adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx). They will then be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders, their arguments and
expected return values are described in the encoding.h header.</p>
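<p>As a hedged illustration of what such an input routine could look like,
here is a standalone converter from ISO-8859-15 to UTF-8 written with the
four-argument shape used by the libxml encoders. The name
<code>demoLatin9ToUTF8()</code> is hypothetical, ISO-8859-15 was picked only
because it needs a small mapping table, and the return convention shown
(number of bytes written, or -1 when the output buffer is too small) is an
assumption which should be checked against the encoding.h header of the
version you use.</p>

```c
#include <stddef.h>

/* Map one ISO-8859-15 byte to its Unicode code point. Only 8 positions
 * differ from ISO-8859-1. */
static unsigned int demoLatin9ToUnicode(unsigned char c)
{
    switch (c) {
        case 0xA4: return 0x20AC;  /* euro sign */
        case 0xA6: return 0x0160;  /* S with caron */
        case 0xA8: return 0x0161;  /* s with caron */
        case 0xB4: return 0x017D;  /* Z with caron */
        case 0xB8: return 0x017E;  /* z with caron */
        case 0xBC: return 0x0152;  /* OE ligature */
        case 0xBD: return 0x0153;  /* oe ligature */
        case 0xBE: return 0x0178;  /* Y with diaeresis */
        default:   return c;
    }
}

/* On entry *inlen and *outlen hold the sizes of the two buffers; on return
 * they hold the number of bytes consumed and produced. Returns the number
 * of bytes written, or -1 if out was too small to convert everything (the
 * part already converted remains valid). */
int demoLatin9ToUTF8(unsigned char *out, int *outlen,
                     const unsigned char *in, int *inlen)
{
    int i = 0, o = 0, complete;

    if (out == NULL || outlen == NULL || in == NULL || inlen == NULL)
        return -1;
    while (i < *inlen) {
        unsigned int code = demoLatin9ToUnicode(in[i]);
        int need = (code < 0x80) ? 1 : (code < 0x800) ? 2 : 3;

        if (o + need > *outlen)
            break;                      /* not enough room left */
        if (need == 1) {
            out[o++] = (unsigned char) code;
        } else if (need == 2) {
            out[o++] = (unsigned char) (0xC0 | (code >> 6));
            out[o++] = (unsigned char) (0x80 | (code & 0x3F));
        } else {
            out[o++] = (unsigned char) (0xE0 | (code >> 12));
            out[o++] = (unsigned char) (0x80 | ((code >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (code & 0x3F));
        }
        i++;
    }
    complete = (i == *inlen);
    *inlen = i;
    *outlen = o;
    return complete ? o : -1;
}

/* With libxml available, this routine and its UTF-8 to ISO-8859-15
 * counterpart (not shown) would be handed to xmlNewCharEncodingHandler(). */
```

<p>Reporting the consumed and produced byte counts through the two length
arguments is what lets a caller feed the input progressively and resume
after a full output buffer.</p>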
2690
<p>A quick note on the topic of subverting the parser to use an internal
encoding other than UTF-8: in some cases people will absolutely want to
keep a different internal encoding. I think it is still possible (but the
encoding must be compatible with ASCII on the same subrange), though I
haven't tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and to bypass the
UTF-8 checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but
there is no guarantee that this will work. You may also have some trouble
saving back.</p>
2701
<p>Basically, proper I18N support is important. It requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>
2705
2706<h2><a name="IO">I/O Interfaces</a></h2>
2707
<p>Table of Contents:</p>
2709<ol>
2710  <li><a href="#General1">General overview</a></li>
2711  <li><a href="#basic">The basic buffer type</a></li>
2712  <li><a href="#Input">Input I/O handlers</a></li>
2713  <li><a href="#Output">Output I/O handlers</a></li>
2714  <li><a href="#entities">The entities loader</a></li>
2715  <li><a href="#Example2">Example of customized I/O</a></li>
2716</ol>
2717
2718<h3><a name="General1">General overview</a></h3>
2719
2720<p>The module <code><a
2721href="http://xmlsoft.org/html/libxml-xmlio.html">xmlIO.h</a></code> provides
2722the interfaces to the libxml I/O system. This consists of 4 main parts:</p>
2723<ul>
  <li>The entities loader: this is a routine which tries to fetch the
    entities (files) based on their PUBLIC and SYSTEM identifiers. The
    default loader doesn't look at the public identifier since libxml does
    not maintain a catalog. You can redefine your own entity loader by using
    <code>xmlGetExternalEntityLoader()</code> and
    <code>xmlSetExternalEntityLoader()</code>. <a href="#entities">Check the
    example</a>.</li>
  <li>Input I/O buffers, which are a convenience structure used by the
    parser(s) input layer to handle fetching the information to feed the
    parser. This provides buffering and is also a placeholder where the
    encoding converters to UTF-8 are piggy-backed.</li>
  <li>Output I/O buffers, which are similar to the Input ones and fulfill a
    similar task, but when generating a serialization from a tree.</li>
  <li>A mechanism to register sets of I/O callbacks and associate them with
    specific naming schemes like the protocol part of the URIs.
    <p>This affects the default I/O operations and allows the use of specific
    I/O handlers for certain names.</p>
  </li>
2741  </li>
2742</ul>
2743
2744<p>The general mechanism used when loading http://rpmfind.net/xml.html for
2745example in the HTML parser is the following:</p>
2746<ol>
2747  <li>The default entity loader calls <code>xmlNewInputFromFile()</code> with
2748    the parsing context and the URI string.</li>
  <li>The URI string is checked against the registered handlers using their
    match() callback function. If the HTTP module was compiled in, it is
    registered and its match() function will succeed.</li>
  <li>The open() function of the handler is called and, if successful,
    returns an I/O input buffer.</li>
  <li>The parser then starts reading from this buffer and progressively
    fetches information from the resource, calling the read() function of
    the handler until the resource is exhausted.</li>
  <li>If an encoding change is detected, the converter is installed on the
    input buffer, providing buffering and efficient use of the conversion
    routines.</li>
  <li>Once the parser has finished, the close() function of the handler is
    called once and the input buffer and associated resources are
    deallocated.</li>
2763</ol>
2764
2765<p>The user defined callbacks are checked first to allow overriding of the
2766default libxml I/O routines.</p>
2767
2768<h3><a name="basic">The basic buffer type</a></h3>
2769
<p>All buffer manipulation is done using the <code>xmlBuffer</code> type
defined in <code><a
href="http://xmlsoft.org/html/libxml-tree.html">tree.h</a></code>, which is a
resizable memory buffer. The buffer allocation strategy can be either
best-fit or exponential doubling (a CPU vs. memory use trade-off). The
values are <code>XML_BUFFER_ALLOC_EXACT</code> and
<code>XML_BUFFER_ALLOC_DOUBLEIT</code>; they can be set for an individual
buffer using <code>xmlBufferSetAllocationScheme()</code> or on a system-wide
basis using <code>xmlSetBufferAllocationScheme()</code>. A number of
functions for manipulating buffers are available, with names starting with
the <code>xmlBuffer...</code> prefix.</p>
2780
2781<h3><a name="Input">Input I/O handlers</a></h3>
2782
<p>An input I/O handler is a simple structure,
<code>xmlParserInputBuffer</code>, containing a context associated with the
resource (file descriptor, or pointer to a protocol handler), the read() and
close() callbacks to use, and an xmlBuffer. An extra xmlBuffer and a charset
encoding handler are also present to support charset conversion when
needed.</p>
2789
2790<h3><a name="Output">Output I/O handlers</a></h3>
2791
<p>An output handler, <code>xmlOutputBuffer</code>, has the same layout as an
input one, except that the callbacks are write() and close().</p>
2794
2795<h3><a name="entities">The entities loader</a></h3>
2796
<p>The entity loader resolves requests for new entities and creates inputs for
the parser. Creating an input from a filename or a URI string is done
through the xmlNewInputFromFile() routine. The default entity loader does not
handle the PUBLIC identifier associated with an entity (if any), so it just
calls xmlNewInputFromFile() with the SYSTEM identifier (which is mandatory in
XML).</p>

<p>If you want to hook up a catalog mechanism, you simply need to override
the default entity loader. Here is an example:</p>
<pre>#include &lt;libxml/xmlIO.h&gt;
#include &lt;libxml/parserInternals.h&gt;   /* xmlNewInputFromFile() */

xmlExternalEntityLoader defaultLoader = NULL;

xmlParserInputPtr
xmlMyExternalEntityLoader(const char *URL, const char *ID,
                               xmlParserCtxtPtr ctxt) {
    xmlParserInputPtr ret = NULL;
    const char *fileID = NULL;
    /* lookup for the fileID depending on ID */

    /* only try the local file if the lookup found one */
    if (fileID != NULL)
        ret = xmlNewInputFromFile(ctxt, fileID);
    if (ret != NULL)
        return(ret);
    /* otherwise fall back to the loader installed before ours */
    if (defaultLoader != NULL)
        ret = defaultLoader(URL, ID, ctxt);
    return(ret);
}
2824
2825int main(..) {
2826    ...
2827
2828    /*
2829     * Install our own entity loader
2830     */
2831    defaultLoader = xmlGetExternalEntityLoader();
2832    xmlSetExternalEntityLoader(xmlMyExternalEntityLoader);
2833
2834    ...
2835}</pre>
2836
2837<h3><a name="Example2">Example of customized I/O</a></h3>
2838
<p>This example comes from <a href="http://xmlsoft.org/messages/0708.html">a
real use case</a>: xmlDocDump() closes the FILE * passed by the application
and this was a problem. The <a
href="http://xmlsoft.org/messages/0711.html">solution</a> was to define a
new output handler with the closing call deactivated:</p>
2844<ol>
  <li>First define a new I/O output allocator where the output doesn't close
    the file:
2847    <pre>xmlOutputBufferPtr
2848xmlOutputBufferCreateOwn(FILE *file, xmlCharEncodingHandlerPtr encoder) {
    xmlOutputBufferPtr ret;

    if (xmlOutputCallbackInitialized == 0)
        xmlRegisterDefaultOutputCallbacks();

    if (file == NULL) return(NULL);
    ret = xmlAllocOutputBuffer(encoder);
    if (ret != NULL) {
        ret-&gt;context = file;
        ret-&gt;writecallback = xmlFileWrite;
        ret-&gt;closecallback = NULL;  /* No close callback */
    }
    return(ret);
}</pre>
2901  </li>
2902  <li>And then use it to save the document:
2903    <pre>FILE *f;
2904xmlOutputBufferPtr output;
2905xmlDocPtr doc;
2906int res;
2907
2908f = ...
2909doc = ....
2910
2911output = xmlOutputBufferCreateOwn(f, NULL);
2912res = xmlSaveFileTo(output, doc, NULL);
2913    </pre>
2914  </li>
2915</ol>
2916
2917<h2><a name="Catalog">Catalog support</a></h2>
2918
<p>Table of Contents:</p>
<ol>
  <li><a href="#General2">General overview</a></li>
2922  <li><a href="#definition">The definition</a></li>
2923  <li><a href="#Simple">Using catalogs</a></li>
2924  <li><a href="#Some">Some examples</a></li>
2925  <li><a href="#reference">How to tune  catalog usage</a></li>
2926  <li><a href="#validate">How to debug catalog processing</a></li>
2927  <li><a href="#Declaring">How to create and maintain catalogs</a></li>
2928  <li><a href="#implemento">The implementor corner quick review of the
2929  API</a></li>
2930  <li><a href="#Other">Other resources</a></li>
2931</ol>
2932
2933<h3><a name="General2">General overview</a></h3>
2934
2935<p>What is a catalog? Basically it's a lookup mechanism used when an entity
2936(a file or a remote resource) references another entity. The catalog lookup
2937is inserted between the moment the reference is recognized by the software
2938(XML parser, stylesheet processing, or even images referenced for inclusion
2939in a rendering) and the time where loading that resource is actually
2940started.</p>
2941
2942<p>It is basically used for 3 things:</p>
2943<ul>
  <li>mapping from "logical" names, the public identifiers, to a more
    concrete name usable for download (a URI). For example it can associate
2946    the logical name
2947    <p>"-//OASIS//DTD DocBook XML V4.1.2//EN"</p>
2948    <p>of the DocBook 4.1.2 XML DTD with the actual URL where it can be
2949    downloaded</p>
2950    <p>http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd</p>
2951  </li>
  <li>remapping from a given URL to another one, like an HTTP redirection
    saying that
    <p>"http://www.oasis-open.org/committes/tr.xsl"</p>
    <p>should really be looked up at</p>
    <p>"http://www.oasis-open.org/committes/entity/stylesheets/base/tr.xsl"</p>
  </li>
  <li>providing a local cache mechanism for loading the entities
    associated with public identifiers or remote resources. This is a really
    important feature for any significant deployment of XML or SGML, since it
    avoids the hazards and delays associated with fetching remote
    resources.</li>
2963</ul>
2964
2965<h3><a name="definition">The definitions</a></h3>
2966
<p>Libxml, as of 2.4.3, implements 2 kinds of catalogs:</p>
<ul>
  <li>the older SGML catalogs. The official spec is SGML Open Technical
    Resolution TR9401:1997, but the format is better understood by reading <a
    href="http://www.jclark.com/sp/catalog.htm">the SP Catalog page</a> from
    James Clark. This is relatively old and not the preferred mode of
    operation of libxml.</li>
  <li><a href="http://www.oasis-open.org/committees/entity/spec.html">XML
    Catalogs</a>, which are far more flexible, more recent, use an XML
    syntax and should scale much better. This is the default option of
    libxml.</li>
2977</ul>
2978
2979<p></p>
2980
2981<h3><a name="Simple">Using catalog</a></h3>
2982
<p>In a normal environment libxml will by default check for the presence of a
catalog in /etc/xml/catalog and, assuming it has been correctly populated,
the processing is completely transparent to the document user. To take a
concrete example, suppose you are authoring a DocBook document which starts
with the following DOCTYPE definition:</p>
2988<pre>&lt;?xml version='1.0'?&gt;
2989&lt;!DOCTYPE book PUBLIC "-//Norman Walsh//DTD DocBk XML V3.1.4//EN"
2990          "http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd"&gt;</pre>
2991
<p>When validating the document with libxml, the catalog will be
automatically consulted to look up the public identifier "-//Norman
Walsh//DTD DocBk XML V3.1.4//EN" and the system identifier
"http://nwalsh.com/docbook/xml/3.1.4/db3xml.dtd". If these entities have
been installed on your system and the catalogs actually point to them, libxml
will fetch them from the local disk.</p>

<p style="font-size: 10pt"><strong>Note</strong>: Don't actually use this
DOCTYPE, it refers to a really old version, but it is fine as an example.</p>

<p>Libxml will check the catalog each time it is requested to load an
entity; this includes DTDs, external parsed entities, stylesheets, etc. If
your system is correctly configured, all the authoring and processing phases
should use only local files, while your document stays portable because it
uses the canonical public and system IDs referencing the remote document.</p>
3007
3008<h3><a name="Some">Some examples:</a></h3>
3009
<p>Here are a couple of fragments from XML Catalogs used in libxml's early
regression tests in <code>test/catalogs</code>:</p>
3012<pre>&lt;?xml version="1.0"?&gt;
3013&lt;!DOCTYPE catalog PUBLIC 
3014   "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
3015   "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd"&gt;
3016&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
3017  &lt;public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
3018   uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/&gt;
3019...</pre>
3020
<p>This is the beginning of a catalog for DocBook 4.1.2. XML Catalogs are
written in XML, and there is a specific namespace for catalog elements:
"urn:oasis:names:tc:entity:xmlns:xml:catalog". The first entry in this
catalog is a <code>public</code> mapping, which associates a Public
Identifier with a URI.</p>
3026<pre>...
3027    &lt;rewriteSystem systemIdStartString="http://www.oasis-open.org/docbook/"
3028                   rewritePrefix="file:///usr/share/xml/docbook/"/&gt;
3029...</pre>
3030
<p>A <code>rewriteSystem</code> is a very powerful instruction: it says that
any URI starting with a given prefix should be looked up at another URI,
constructed by replacing the prefix with a new one. In effect this acts like
a cache system for a full area of the Web. In practice it is extremely useful
with a file prefix if you have installed a copy of those resources on your
local system.</p>
3037<pre>...
3038&lt;delegatePublic publicIdStartString="-//OASIS//DTD XML Catalog //"
3039                catalog="file:///usr/share/xml/docbook.xml"/&gt;
3040&lt;delegatePublic publicIdStartString="-//OASIS//ENTITIES DocBook XML"
3041                catalog="file:///usr/share/xml/docbook.xml"/&gt;
3042&lt;delegatePublic publicIdStartString="-//OASIS//DTD DocBook XML"
3043                catalog="file:///usr/share/xml/docbook.xml"/&gt;
3044&lt;delegateSystem systemIdStartString="http://www.oasis-open.org/docbook/"
3045                catalog="file:///usr/share/xml/docbook.xml"/&gt;
3046&lt;delegateURI uriStartString="http://www.oasis-open.org/docbook/"
3047                catalog="file:///usr/share/xml/docbook.xml"/&gt;
3048...</pre>
3049
<p>Delegation is the core feature which allows building a tree of catalogs,
which is easier to maintain than a single catalog. Based on Public
Identifier, System Identifier or URI prefixes, it instructs the catalog
software to look up entries in another resource. The set of entries
presented should be sufficient to redirect the resolution of all DocBook
references to the specific catalog in
<code>/usr/share/xml/docbook.xml</code>. This one in turn could delegate all
references for DocBook 4.2.1 to a specific catalog installed at the same time
as the DocBook resources on the local machine.</p>
3059
3060<h3><a name="reference">How to tune catalog usage:</a></h3>
3061
<p>The user can change the default catalog behaviour by redirecting queries
to their own set of catalogs. This can be done by setting the
<code>XML_CATALOG_FILES</code> environment variable to a list of catalogs; an
empty one should deactivate loading the default <code>/etc/xml/catalog</code>
catalog.</p>
3067
3068<h3><a name="validate">How to debug catalog processing:</a></h3>
3069
<p>Setting the <code>XML_DEBUG_CATALOG</code> environment variable will
make libxml output debugging information for each catalog operation, for
example:</p>
3073<pre>orchis:~/XML -&gt; xmllint --memory --noout test/ent2
3074warning: failed to load external entity "title.xml"
3075orchis:~/XML -&gt; export XML_DEBUG_CATALOG=
3076orchis:~/XML -&gt; xmllint --memory --noout test/ent2
3077Failed to parse catalog /etc/xml/catalog
3078Failed to parse catalog /etc/xml/catalog
3079warning: failed to load external entity "title.xml"
3080Catalogs cleanup
3081orchis:~/XML -&gt; </pre>
3082
<p>The test/ent2 file references an entity. Running the parser from memory
makes the base URI unavailable, so the "title.xml" entity cannot be loaded.
Setting the debug environment variable reveals that an attempt is made to
load <code>/etc/xml/catalog</code>, but since it's not present the
resolution fails.</p>

<p>The most advanced way to debug XML catalog processing is to use the
<strong>xmlcatalog</strong> command shipped with libxml2. It can load
catalogs and make resolution queries to show what is going on. This is also
used for the regression tests:</p>
3093<pre>orchis:~/XML -&gt; /xmlcatalog test/catalogs/docbook.xml \
3094                   "-//OASIS//DTD DocBook XML V4.1.2//EN"
3095http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
3096orchis:~/XML -&gt; </pre>
3097
<p>To see what is going on, adding one -v flag increases the verbosity
level to indicate the processing done (adding a second flag also indicates
which elements are recognized at parsing):</p>
3101<pre>orchis:~/XML -&gt; /xmlcatalog -v test/catalogs/docbook.xml \
3102                   "-//OASIS//DTD DocBook XML V4.1.2//EN"
3103Parsing catalog test/catalogs/docbook.xml's content
3104Found public match -//OASIS//DTD DocBook XML V4.1.2//EN
3105http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
3106Catalogs cleanup
3107orchis:~/XML -&gt; </pre>
3108
3109<p>A shell interface is also available to debug and process multiple queries
3110(and for regression tests):</p>
3111<pre>orchis:~/XML -&gt; /xmlcatalog -shell test/catalogs/docbook.xml \
3112                   "-//OASIS//DTD DocBook XML V4.1.2//EN"
3113&gt; help   
3114Commands available:
3115public PublicID: make a PUBLIC identifier lookup
3116system SystemID: make a SYSTEM identifier lookup
3117resolve PublicID SystemID: do a full resolver lookup
3118add 'type' 'orig' 'replace' : add an entry
3119del 'values' : remove values
3120dump: print the current catalog state
3121debug: increase the verbosity level
3122quiet: decrease the verbosity level
3123exit:  quit the shell
3124&gt; public "-//OASIS//DTD DocBook XML V4.1.2//EN"
3125http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd
3126&gt; quit
3127orchis:~/XML -&gt; </pre>
3128
<p>This should be sufficient for most debugging purposes; it was actually
used heavily to debug the XML Catalog implementation itself.</p>
3131
3132<h3><a name="Declaring">How to create and maintain</a> catalogs:</h3>
3133
<p>Basically XML Catalogs are XML files, so you can either use XML tools to
manage them or use <strong>xmlcatalog</strong>. The first step is to create
a catalog; the <code>--create</code> option provides this facility:</p>
3137<pre>orchis:~/XML -&gt; /xmlcatalog --create tst.xml
3138&lt;?xml version="1.0"?&gt;
3139&lt;!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
3140         "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd"&gt;
3141&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/&gt;
3142orchis:~/XML -&gt; </pre>
3143
<p>By default xmlcatalog does not overwrite the original catalog and writes
the result to the standard output; this can be overridden using the
<code>--noout</code> option. The <code>--add</code> option adds entries to
the catalog:</p>
3148<pre>orchis:~/XML -&gt; /xmlcatalog --noout --create --add "public" \
3149  "-//OASIS//DTD DocBook XML V4.1.2//EN" \
3150  http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd tst.xml
3151orchis:~/XML -&gt; cat tst.xml
3152&lt;?xml version="1.0"?&gt;
3153&lt;!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" \
3154  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd"&gt;
3155&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"&gt;
3156&lt;public publicId="-//OASIS//DTD DocBook XML V4.1.2//EN"
3157        uri="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"/&gt;
3158&lt;/catalog&gt;
3159orchis:~/XML -&gt; </pre>
3160
<p>The <code>--add</code> option always takes 3 parameters, even though some
of the XML Catalog constructs (like nextCatalog) have only a single
argument; in that case just pass an empty third string, it will be
ignored.</p>

<p>Similarly the <code>--del</code> option removes matching entries from the
catalog:</p>
3167<pre>orchis:~/XML -&gt; /xmlcatalog --del \
3168  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" tst.xml
3169&lt;?xml version="1.0"?&gt;
3170&lt;!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
3171    "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd"&gt;
3172&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"/&gt;
3173orchis:~/XML -&gt; </pre>
3174
3175<p>The catalog is now empty. Note that the matching of <code>-del</code> is
3176exact and would have worked in a similar fashion with the Public ID
3177string.</p>
3178
3179<p>This is rudimentary but should be sufficient to manage a not too complex
3180catalog tree of resources.</p>
3181
3182<h3><a name="implemento">The implementor corner quick review of the
3183API:</a></h3>
3184
<p>As for every other module of libxml, there is an automatically generated
<a href="html/libxml-catalog.html">API page for catalog support</a>.</p>
3188
3189<p>The header for the catalog interfaces should be included as:</p>
3190<pre>#include &lt;libxml/catalog.h&gt;</pre>
3191
<p>The API is deliberately kept very simple. First, it is not obvious that
applications really need access to it, since catalog lookup is the default
behaviour of libxml (Note: it is possible to completely override the libxml
default catalog by using <a
href="html/libxml-parser.html">xmlSetExternalEntityLoader</a> to plug in an
application-specific resolver).</p>

<p>Basically libxml supports 2 catalog lists:</p>
<ul>
  <li>the default one, global and shared by the whole application</li>
  <li>a per-document catalog, which is built if the document uses the
    <code>oasis-xml-catalog</code> PIs to specify its own catalog list. It is
    associated with the parser context and destroyed when the parsing context
    is destroyed.</li>
</ul>

<p>The per-document catalog will be used first if it exists.</p>
3208
3209<h4>Initialization routines:</h4>
3210
<p>xmlInitializeCatalog(), xmlLoadCatalog() and xmlLoadCatalogs() should be
used at startup to initialize the catalog. If the catalog should be
initialized with specific values, xmlLoadCatalog() or xmlLoadCatalogs()
should be called before xmlInitializeCatalog(), which would otherwise do a
default initialization first.</p>

<p>The xmlCatalogAddLocal() call is used by the parser to grow the
document's own catalog list if needed.</p>
3219
3220<h4>Preferences setup:</h4>
3221
<p>The XML Catalog spec requires the possibility to select the default
preference between public and system delegation;
xmlCatalogSetDefaultPrefer() allows this. xmlCatalogSetDefaults() and
xmlCatalogGetDefaults() control whether XML Catalog resolution should be
forbidden, allowed for the global catalog, for the document catalog, or for
both; the default is to allow both.</p>

<p>And of course xmlCatalogSetDebug() enables the generation of debug
messages (through the xmlGenericError() mechanism).</p>
3231
3232<h4>Querying routines:</h4>
3233
<p>xmlCatalogResolve(), xmlCatalogResolveSystem(), xmlCatalogResolvePublic()
and xmlCatalogResolveURI() are relatively self-explanatory if you read the
XML Catalog specification: they correspond to the algorithms of section 7.
They should also work, with simplified semantics, if you have loaded an SGML
catalog.</p>

<p>xmlCatalogLocalResolve() and xmlCatalogLocalResolveURI() are the same but
operate on the document catalog list.</p>
3241
3242<h4>Cleanup and Miscellaneous:</h4>
3243
<p>xmlCatalogCleanup() frees the global catalog; xmlCatalogFreeLocal() is
the per-document equivalent.</p>

<p>xmlCatalogAdd() and xmlCatalogRemove() are used to dynamically modify the
first catalog in the global list, and xmlCatalogDump() dumps the state of a
catalog. Those routines are primarily designed for xmlcatalog; I'm not
sure that exposing more complex interfaces (like navigation ones) would be
really useful.</p>

<p>xmlParseCatalogFile() is a function used to load XML Catalog files. It
is similar to xmlParseFile() except that it bypasses all catalog lookups;
it is provided because this functionality may be useful for client tools.</p>
3256
3257<h4>threaded environments:</h4>
3258
3259<p>Since the catalog tree is built progressively, some care has been taken to
3260try to avoid troubles in multithreaded environments. The code is now thread
3261safe assuming that the libxml library has been compiled with threads
3262support.</p>
3263
3264<p></p>
3265
3266<h3><a name="Other">Other resources</a></h3>
3267
3268<p>The XML Catalog specification is relatively recent so there isn't much
3269literature to point at:</p>
3270<ul>
  <li>You can find a good rant from Norm Walsh about <a
    href="http://www.arbortext.com/Think_Tank/XML_Resources/Issue_Three/issue_three.html">the
    need for catalogs</a>. It provides a lot of context information, even if
    I don't agree with everything presented. Norm also wrote a more recent
    article, <a
    href="http://wwws.sun.com/software/xml/developers/resolver/article/">XML
    entities and URI resolvers</a>, describing them.</li>
3278  <li>An <a href="http://home.ccil.org/~cowan/XML/XCatalog.html">old XML
3279    catalog proposal</a> from John Cowan</li>
3280  <li>The <a href="http://www.rddl.org/">Resource Directory Description
3281    Language</a> (RDDL) another catalog system but more oriented toward
3282    providing metadata for XML namespaces.</li>
  <li>The page from the OASIS Technical <a
    href="http://www.oasis-open.org/committees/entity/">Committee on Entity
    Resolution</a>, which maintains XML Catalog. You will find pointers to
    the specification updates, some background, and pointers to other tools
    providing XML Catalog support.</li>
  <li>Here is a <a href="buildDocBookCatalog">shell script</a> to generate
    XML Catalogs for DocBook 4.1.2. If it can write to the /etc/xml/
    directory, it will set up /etc/xml/catalog and /etc/xml/docbook based on
    the resources found on the system. Otherwise it will just create
    ~/xmlcatalog and ~/dbkxmlcatalog, and doing:
    <p><code>export XMLCATALOG=$HOME/xmlcatalog</code></p>
    <p>should make it possible to process DocBook documents without
    requiring network access for the DTD or stylesheets.</p>
3296  </li>
3297  <li>I have uploaded <a href="ftp://xmlsoft.org/test/dbk412catalog.tar.gz">a
3298    small tarball</a> containing XML Catalogs for DocBook 4.1.2 which seems
3299    to work fine for me too</li>
3300  <li>The <a href="http://www.xmlsoft.org/xmlcatalog_man.html">xmlcatalog
3301    manual page</a></li>
3302</ul>
3303
3304<p>If you have suggestions for corrections or additions, simply contact
3305me:</p>
3306
3307<h2><a name="library">The parser interfaces</a></h2>
3308
3309<p>This section is directly intended to help programmers getting bootstrapped
3310using the XML library from the C language. It is not intended to be
3311extensive. I hope the automatically generated documents will provide the
3312completeness required, but as a separate set of documents. The interfaces of
3313the XML library are by principle low level, there is nearly zero abstraction.
3314Those interested in a higher level API should <a href="#DOM">look at
3315DOM</a>.</p>
3316
3317<p>The <a href="html/libxml-parser.html">parser interfaces for XML</a> are
3318separated from the <a href="html/libxml-htmlparser.html">HTML parser
3319interfaces</a>.  Let's have a look at how the XML parser can be called:</p>
3320
3321<h3><a name="Invoking">Invoking the parser : the pull method</a></h3>
3322
3323<p>Usually, the first thing to do is to read an XML input. The parser accepts
3324documents either from in-memory strings or from files.  The functions are
3325defined in "parser.h":</p>
3326<dl>
  <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt>
    <dd><p>Parse a document contained in an in-memory buffer of
      <code>size</code> bytes.</p>
    </dd>
3330</dl>
3331<dl>
3332  <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt>
3333    <dd><p>Parse an XML document contained in a (possibly compressed)
3334      file.</p>
3335    </dd>
3336</dl>
3337
3338<p>The parser returns a pointer to the document structure (or NULL in case of
3339failure).</p>
3340
3341<h3 id="Invoking1">Invoking the parser: the push method</h3>
3342
3343<p>In order for the application to keep the control when the document is
3344being fetched (which is common for GUI based programs) libxml provides a push
3345interface, too, as of version 1.8.3. Here are the interface functions:</p>
3346<pre>xmlParserCtxtPtr xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax,
3347                                         void *user_data,
3348                                         const char *chunk,
3349                                         int size,
3350                                         const char *filename);
3351int              xmlParseChunk          (xmlParserCtxtPtr ctxt,
3352                                         const char *chunk,
3353                                         int size,
3354                                         int terminate);</pre>
3355
3356<p>and here is a simple example showing how to use the interface:</p>
<pre>            FILE *f;
            xmlDocPtr doc = NULL;

            f = fopen(filename, "r");
            if (f != NULL) {
                int res, size = 1024;
                char chars[1024];
                xmlParserCtxtPtr ctxt;

                /* read the first 4 bytes so the encoding can be detected */
                res = fread(chars, 1, 4, f);
                if (res &gt; 0) {
                    ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                chars, res, filename);
                    /* feed the rest of the file chunk by chunk */
                    while ((res = fread(chars, 1, size, f)) &gt; 0) {
                        xmlParseChunk(ctxt, chars, res, 0);
                    }
                    /* an empty chunk with terminate set ends the parse */
                    xmlParseChunk(ctxt, chars, 0, 1);
                    doc = ctxt-&gt;myDoc;
                    xmlFreeParserCtxt(ctxt);
                }
                fclose(f);
            }</pre>
3377
3378<p>The HTML parser embedded into libxml also has a push interface; the
3379functions are just prefixed by "html" rather than "xml".</p>
3380
3381<h3 id="Invoking2">Invoking the parser: the SAX interface</h3>
3382
<p>The tree-building interface makes the parser memory-hungry, first loading
the document in memory and then building the tree itself. Reading a document
without building the tree is possible using the SAX interfaces (see SAX.h and
<a href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">James
Henstridge's documentation</a>). Note also that the push interface can be
limited to SAX: just use the first two arguments of
<code>xmlCreatePushParserCtxt()</code>.</p>
3390
3391<h3><a name="Building">Building a tree from scratch</a></h3>
3392
3393<p>The other way to get an XML tree in memory is by building it. Basically
3394there is a set of functions dedicated to building new elements. (These are
3395also described in &lt;libxml/tree.h&gt;.) For example, here is a piece of
3396code that produces the XML document used in the previous examples:</p>
3397<pre>    #include &lt;libxml/tree.h&gt;
3398    xmlDocPtr doc;
3399    xmlNodePtr tree, subtree;
3400
3401    doc = xmlNewDoc("1.0");
3402    doc-&gt;children = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL);
3403    xmlSetProp(doc-&gt;children, "prop1", "gnome is great");
3404    xmlSetProp(doc-&gt;children, "prop2", "&amp; linux too");
3405    tree = xmlNewChild(doc-&gt;children, NULL, "head", NULL);
3406    subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome");
3407    tree = xmlNewChild(doc-&gt;children, NULL, "chapter", NULL);
3408    subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure");
3409    subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ...");
3410    subtree = xmlNewChild(tree, NULL, "image", NULL);
3411    xmlSetProp(subtree, "href", "linus.gif");</pre>
3412
3413<p>Not really rocket science ...</p>
3414
3415<h3><a name="Traversing">Traversing the tree</a></h3>
3416
3417<p>Basically by <a href="html/libxml-tree.html">including "tree.h"</a> your
3418code has access to the internal structure of all the elements of the tree.
3419The names should be somewhat simple like <strong>parent</strong>,
3420<strong>children</strong>, <strong>next</strong>, <strong>prev</strong>,
3421<strong>properties</strong>, etc... For example, still with the previous
3422example:</p>
3423<pre><code>doc-&gt;children-&gt;children-&gt;children</code></pre>
3424
3425<p>points to the title element,</p>
3426<pre>doc-&gt;children-&gt;children-&gt;next-&gt;children-&gt;children</pre>
3427
3428<p>points to the text node containing the chapter title "The Linux
3429adventure".</p>
3430
<p><strong>NOTE</strong>: XML allows <em>PI</em>s and <em>comments</em> to be
present before the document root, so <code>doc-&gt;children</code> may point
to a node which is not the document Root Element; the function
<code>xmlDocGetRootElement()</code> was added for this purpose.</p>
3435
3436<h3><a name="Modifying">Modifying the tree</a></h3>
3437
3438<p>Functions are provided for reading and writing the document content. Here
3439is an excerpt from the <a href="html/libxml-tree.html">tree API</a>:</p>
3440<dl>
3441  <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const
3442  xmlChar *value);</code></dt>
3443    <dd><p>This sets (or changes) an attribute carried by an ELEMENT node.
3444      The value can be NULL.</p>
3445    </dd>
3446</dl>
3447<dl>
  <dt><code>xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar
  *name);</code></dt>
    <dd><p>This function returns a pointer to a new copy of the property
      content. Note that the user must deallocate the result with
      <code>xmlFree()</code>.</p>
3452    </dd>
3453</dl>
3454
3455<p>Two functions are provided for reading and writing the text associated
3456with elements:</p>
3457<dl>
3458  <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar
3459  *value);</code></dt>
3460    <dd><p>This function takes an "external" string and converts it to one
3461      text node or possibly to a list of entity and text nodes. All
3462      non-predefined entity references like &amp;Gnome; will be stored
3463      internally as entity nodes, hence the result of the function may not be
3464      a single node.</p>
3465    </dd>
3466</dl>
3467<dl>
3468  <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int
3469  inLine);</code></dt>
3470    <dd><p>This function is the inverse of
3471      <code>xmlStringGetNodeList()</code>. It generates a new string
3472      containing the content of the text and entity nodes. Note the extra
3473      argument inLine. If this argument is set to 1, the function will expand
3474      entity references.  For example, instead of returning the &amp;Gnome;
3475      XML encoding in the string, it will substitute it with its value (say,
3476      "GNU Network Object Model Environment").</p>
3477    </dd>
3478</dl>
3479
3480<h3><a name="Saving">Saving a tree</a></h3>
3481
<p>Three options are available:</p>
<dl>
  <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar**mem, int
  *size);</code></dt>
    <dd><p>Returns a newly allocated buffer into which the document has been
      saved; the caller must release it with <code>xmlFree()</code>.</p>
    </dd>
</dl>
<dl>
  <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt>
    <dd><p>Dumps a document to an open <code>FILE</code> stream.</p>
3492    </dd>
3493</dl>
3494<dl>
3495  <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt>
3496    <dd><p>Saves the document to a file. In this case, the compression
3497      interface is triggered if it has been turned on.</p>
3498    </dd>
3499</dl>
3500
3501<h3><a name="Compressio">Compression</a></h3>
3502
<p>The library transparently handles compression when doing file-based
accesses. The level of compression on saves can be set either globally
or individually for one document:</p>
<dl>
  <dt><code>int  xmlGetDocCompressMode (xmlDocPtr doc);</code></dt>
    <dd><p>Gets the document compression level (0-9).</p>
    </dd>
</dl>
<dl>
  <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt>
    <dd><p>Sets the document compression level.</p>
    </dd>
</dl>
<dl>
  <dt><code>int  xmlGetCompressMode(void);</code></dt>
    <dd><p>Gets the default compression level.</p>
    </dd>
</dl>
<dl>
  <dt><code>void xmlSetCompressMode(int mode);</code></dt>
    <dd><p>Sets the default compression level.</p>
3524    </dd>
3525</dl>
3526
3527<h2><a name="Entities">Entities or no entities</a></h2>
3528
3529<p>Entities in principle are similar to simple C macros. An entity defines an
3530abbreviation for a given string that you can reuse many times throughout the
3531content of your document. Entities are especially useful when a given string
3532may occur frequently within a document, or to confine the change needed to a
3533document to a restricted area in the internal subset of the document (at the
3534beginning). Example:</p>
3535<pre>1 &lt;?xml version="1.0"?&gt;
35362 &lt;!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
35373 &lt;!ENTITY xml "Extensible Markup Language"&gt;
35384 ]&gt;
35395 &lt;EXAMPLE&gt;
35406    &amp;xml;
35417 &lt;/EXAMPLE&gt;</pre>
3542
3543<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing
3544its name with '&amp;' and following it by ';' without any spaces added. There
3545are 5 predefined entities in libxml allowing you to escape characters with
3546predefined meaning in some parts of the xml document content:
3547<strong>&amp;lt;</strong> for the character '&lt;', <strong>&amp;gt;</strong>
3548for the character '&gt;',  <strong>&amp;apos;</strong> for the character ''',
3549<strong>&amp;quot;</strong> for the character '"', and
3550<strong>&amp;amp;</strong> for the character '&amp;'.</p>
3551
<p>One of the problems related to entities is that you may want the parser to
substitute an entity's content so that you can see the replacement text in
your application. Or you may prefer to keep entity references as such in the
content to be able to save the document back without losing this usually
precious information (if the user went through the pain of explicitly
defining entities, he may have a rather negative attitude if you blindly
substitute them at save time). The <a
href="html/libxml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a>
function allows you to check and change the behaviour, which is to not
substitute entities by default.</p>
3562
3563<p>Here is the DOM tree built by libxml for the previous document in the
3564default case:</p>
3565<pre>/gnome/src/gnome-xml -&gt; /xmllint --debug test/ent1
3566DOCUMENT
3567version=1.0
3568   ELEMENT EXAMPLE
3569     TEXT
3570     content=
3571     ENTITY_REF
3572       INTERNAL_GENERAL_ENTITY xml
3573       content=Extensible Markup Language
3574     TEXT
3575     content=</pre>
3576
3577<p>And here is the result when substituting entities:</p>
3578<pre>/gnome/src/gnome-xml -&gt; /tester --debug --noent test/ent1
3579DOCUMENT
3580version=1.0
3581   ELEMENT EXAMPLE
3582     TEXT
3583     content=     Extensible Markup Language</pre>
3584
<p>So, entities or no entities? Basically, it depends on your use case. I
suggest that you keep the non-substituting default behaviour and avoid using
entities in your XML document or data if you are not willing to handle the
entity reference nodes in the DOM tree.</p>
3589
<p>Note that at save time libxml enforces the conversion of the predefined
entities where necessary to prevent well-formedness problems, and at parse
time it transparently replaces them with the corresponding characters (i.e.
it will not generate entity reference nodes in the DOM tree or call the
reference() SAX callback when finding them in the input).</p>
3595
<p><span style="background-color: #FF0000">WARNING</span>: handling entities
on top of the libxml SAX interface is difficult!!! If you plan to use
non-predefined entities in your documents, then the learning curve to handle
them using the SAX API may be long. If you plan to use complex documents, I
strongly suggest you consider using the DOM interface instead and let libxml
deal with the complexity rather than trying to do it yourself.</p>
3602
3603<h2><a name="Namespaces">Namespaces</a></h2>
3604
3605<p>The libxml library implements <a
3606href="http://www.w3.org/TR/REC-xml-names/">XML namespaces</a> support by
3607recognizing namespace constructs in the input, and does namespace lookup
3608automatically when building the DOM tree. A namespace declaration is
3609associated with an in-memory structure and all elements or attributes within
3610that namespace point to it. Hence testing the namespace is a simple and fast
3611equality operation at the user level.</p>
3612
<p>I suggest that people using libxml use a namespace, and declare it in the
root element of their document as the default namespace. Then they don't need
to use the prefix in the content, but we will have a basis for future semantic
refinement and merging of data from different sources. This doesn't increase
the size of the XML output significantly, but significantly increases its
value in the long term. Example:</p>
3619<pre>&lt;mydoc xmlns="http://mydoc.example.org/schemas/"&gt;
3620   &lt;elem1&gt;...&lt;/elem1&gt;
3621   &lt;elem2&gt;...&lt;/elem2&gt;
3622&lt;/mydoc&gt;</pre>
3623
<p>The namespace value has to be an absolute URL, but the URL doesn't have to
point to any existing resource on the Web. It will bind all the elements and
attributes with that URL. I suggest using a URL within a domain you
control, and the URL should contain some kind of version information if
possible. For example, <code>"http://www.gnome.org/gnumeric/1.0/"</code> is a
good namespace scheme.</p>
3630
<p>Then when you load a file, make sure that a namespace carrying the
version-independent prefix is installed on the root element of your document,
and if the version information doesn't match something you know, warn the user
and be liberal in what you accept as the input. Also do *not* try to base
namespace checking on the prefix value. &lt;foo:text&gt; may be exactly the
same as &lt;bar:text&gt; in another document. What really matters is the URI
associated with the element or the attribute, not the prefix string (which is
just a shortcut for the full URI). In libxml, elements and attributes have an
<code>ns</code> field pointing to an xmlNs structure detailing the namespace
prefix and its URI.</p>
3641
3645
3646<p>Usually people object to using namespaces together with validity checking.
3647I will try to make sure that using namespaces won't break validity checking,
3648so even if you plan to use or currently are using validation I strongly
3649suggest adding namespaces to your document. A default namespace scheme
3650<code>xmlns="http://...."</code> should not break validity even on less
3651flexible parsers. Using namespaces to mix and differentiate content coming
3652from multiple DTDs will certainly break current validation schemes. I will
3653try to provide ways to do this, but this may not be portable or
3654standardized.</p>
3655
3656<h2><a name="Upgrading">Upgrading 1.x code</a></h2>
3657
3658<p>Incompatible changes:</p>
3659
3660<p>Version 2 of libxml is the first version introducing serious backward
3661incompatible changes. The main goals were:</p>
3662<ul>
  <li>a general cleanup. A number of mistakes inherited from the very early
    versions couldn't be changed due to compatibility constraints, for
    example the "childs" field in the nodes.</li>
  <li>Uniformization of the various nodes, at least for their header and link
    parts (doc, parent, children, prev, next). The goal is a simpler
    programming model and an easier task for DOM implementors.</li>
  <li>better conformance to the XML specification. For example, version 1.x
    had a heuristic to try to detect ignorable white spaces. As a result the
    SAX event generated was ignorableWhitespace() while the spec requires
    characters() in that case. This also means that a number of DOM nodes
    containing blank text, which were not present before, may now populate
    the DOM tree.</li>
3675</ul>
3676
3677<h3>How to fix libxml-1.x code:</h3>
3678
<p>So client code of libxml designed to run with version 1.x may have to be
changed to compile against version 2.x of libxml. Here is a list of the
changes that I have collected. They may not be sufficient, so if you find
other changes which are required, <a
href="mailto:Daniel.Veillard@w3.org">drop me a mail</a>:</p>
3684<ol>
  <li>The package name has changed from libxml to libxml2, and the library
    name is now -lxml2. There is a new xml2-config script which should be
    used to select the right parameters for libxml2.</li>
  <li>The node <strong>childs</strong> field has been renamed
    <strong>children</strong>, so s/childs/children/g should be applied
    (the probability of having "childs" anywhere else is close to 0).</li>
  <li>The document no longer has a <strong>root</strong> field; it has
    been replaced by <strong>children</strong>, and usually you will get a
    list of nodes here. For example, a Dtd node for the internal subset
    and its declarations may be found in that list, as well as processing
    instructions or comments found before or after the document root element.
    Use <strong>xmlDocGetRootElement(doc)</strong> to get the root element of
    a document. Alternatively, if you are sure not to reference DTDs nor have
    PIs or comments before or after the root element,
    s/-&gt;root/-&gt;children/g will probably do it.</li>
  <li>The white space issue: this one is more complex. Except in the special
    case of validating parsing, the line breaks and spaces usually used for
    indenting and formatting the document content become significant. So they
    are reported by SAX, and if you're using the DOM tree, corresponding nodes
    are generated. Two approaches can be taken:
    <ol>
      <li>the lazy one: use the compatibility call
        <strong>xmlKeepBlanksDefault(0)</strong>, but be aware that you are
        relying on a special (and possibly broken) set of heuristics of
        libxml to detect ignorable blanks. Don't complain if it breaks or
        makes your application not 100% clean w.r.t. its input.</li>
      <li>the Right Way: change your code to accept possibly insignificant
        blank characters, or have your tree populated with weird blank text
        nodes. You can spot them using the convenience function
        <strong>xmlIsBlankNode(node)</strong>, which returns 1 for such blank
        nodes.</li>
    </ol>
    <p>Note also that with the new default the output functions don't add any
    extra indentation when saving a tree, in order to be able to round trip
    (read and save) without inflating the document with extra formatting
    chars.</p>
  </li>
  <li>The include path has changed to $prefix/libxml/ and the includes
    themselves use this new prefix in their include instructions. If you are
    using (as expected) the
    <pre>xml2-config --cflags</pre>
    <p>output to generate your compile commands, this will probably work out
    of the box.</p>
  </li>
  <li>xmlDetectCharEncoding takes an extra argument indicating the length in
    bytes of the head of the document available for character detection.</li>
3731</ol>
3732
3733<h3>Ensuring both libxml-1.x and libxml-2.x compatibility</h3>
3734
<p>Two new versions, libxml 1.8.11 and libxml2 2.3.4, have been released
to allow a smooth upgrade of existing libxml v1 code while retaining
compatibility. They offer the following:</p>
3738<ol>
3739  <li>similar include naming, one should use
3740    <strong>#include&lt;libxml/...&gt;</strong> in both cases.</li>
3741  <li>similar identifiers defined via macros for the child and root fields:
3742    respectively <strong>xmlChildrenNode</strong> and
3743    <strong>xmlRootNode</strong></li>
3744  <li>a new macro <strong>LIBXML_TEST_VERSION</strong> which should be
3745    inserted once in the client code</li>
3746</ol>
3747
3748<p>So the roadmap to upgrade your existing libxml applications is the
3749following:</p>
3750<ol>
3751  <li>install the  libxml-1.8.8 (and libxml-devel-1.8.8) packages</li>
3752  <li>find all occurrences where the xmlDoc <strong>root</strong> field is
3753    used and change it to <strong>xmlRootNode</strong></li>
3754  <li>similarly find all occurrences where the xmlNode
3755    <strong>childs</strong> field is used and change it to
3756    <strong>xmlChildrenNode</strong></li>
3757  <li>add a <strong>LIBXML_TEST_VERSION</strong> macro somewhere in your
3758    <strong>main()</strong> or in the library init entry point</li>
3759  <li>Recompile, check compatibility, it should still work</li>
3760  <li>Change your configure script to look first for xml2-config and fall
3761    back using xml-config . Use the --cflags and --libs output of the command
3762    as the Include and Linking parameters needed to use libxml.</li>
3763  <li>install libxml2-2.3.x and  libxml2-devel-2.3.x (libxml-1.8.y and
3764    libxml-devel-1.8.y can be kept simultaneously)</li>
3765  <li>remove your config.cache, relaunch your configuration mechanism, and
3766    recompile, if steps 2 and 3 were done right it should compile as-is</li>
  <li>Test that your application is still running correctly. If not, this may
    be due to extra empty nodes created by formatting spaces being kept in
    libxml2, contrary to libxml1; in that case insert
    xmlKeepBlanksDefault(0) in your code before calling the parser (next to
    <strong>LIBXML_TEST_VERSION</strong> is a fine place).</li>
3772</ol>
3773
3774<p>Following those steps should work. It worked for some of my own code.</p>
3775
<p>Let me put some emphasis on the fact that there are far more changes from
libxml 1.x to 2.x than the ones you may have to patch for. The overall code
has been considerably cleaned up and the conformance to the XML specification
has been drastically improved too. Don't take those changes as an excuse to
not upgrade, it may cost a lot in the long term ...</p>
3781
3782<h2><a name="Thread">Thread safety</a></h2>
3783
<p>Starting with 2.4.7, libxml makes provisions to ensure that concurrent
threads can safely work in parallel parsing different documents. There are
however a couple of things to do to ensure it:</p>
<ul>
  <li>configure the library accordingly using the --with-threads option</li>
  <li>call xmlInitParser() in the "main" thread before using any of the
    libxml API (except possibly selecting a different memory allocator)</li>
</ul>

<p>Note that thread safety cannot be ensured for multiple threads sharing
the same document; the locking must be done at the application level. libxml
exports a basic mutex and reentrant mutex API in &lt;libxml/threads.h&gt;.
The parts of the library checked for thread safety are:</p>
3797<ul>
3798  <li>concurrent loading</li>
3799  <li>file access resolution</li>
3800  <li>catalog access</li>
3801  <li>catalog building</li>
3802  <li>entities lookup/accesses</li>
3803  <li>validation</li>
3804  <li>global variables per-thread override</li>
3805  <li>memory handling</li>
3806</ul>
3807
3808<p>XPath is supposed to be thread safe now, but this wasn't tested
3809seriously.</p>
3810
3811<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2>
3812
3813<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document
3814Object Model</em>; this is an API for accessing XML or HTML structured
3815documents. Native support for DOM in Gnome is on the way (module gnome-dom),
3816and will be based on gnome-xml. This will be a far cleaner interface to
3817manipulate XML files within Gnome since it won't expose the internal
3818structure.</p>
3819
<p>The current DOM implementation on top of libxml is the <a
href="http://cvs.gnome.org/lxr/source/gdome2/">gdome2 Gnome module</a>. This
is a full DOM interface, thanks to Paolo Casarini; check the <a
href="http://www.cs.unibo.it/~casarini/gdome2/">Gdome2 homepage</a> for more
information.</p>
3825
3826<h2><a name="Example"></a><a name="real">A real example</a></h2>
3827
<p>Here is a real-size example, where the actual content of the application
data is not kept in the DOM tree but uses internal structures. It is based on
a proposal to keep a database of jobs related to Gnome, with an XML based
storage structure. Here is an <a href="gjobs.xml">XML encoded jobs
base</a>:</p>
3833<pre>&lt;?xml version="1.0"?&gt;
3834&lt;gjob:Helping xmlns:gjob="http://www.gnome.org/some-location"&gt;
3835  &lt;gjob:Jobs&gt;
3836
3837    &lt;gjob:Job&gt;
3838      &lt;gjob:Project ID="3"/&gt;
3839      &lt;gjob:Application&gt;GBackup&lt;/gjob:Application&gt;
3840      &lt;gjob:Category&gt;Development&lt;/gjob:Category&gt;
3841
3842      &lt;gjob:Update&gt;
3843        &lt;gjob:Status&gt;Open&lt;/gjob:Status&gt;
3844        &lt;gjob:Modified&gt;Mon, 07 Jun 1999 20:27:45 -0400 MET DST&lt;/gjob:Modified&gt;
3845        &lt;gjob:Salary&gt;USD 0.00&lt;/gjob:Salary&gt;
3846      &lt;/gjob:Update&gt;
3847
3848      &lt;gjob:Developers&gt;
3849        &lt;gjob:Developer&gt;
3850        &lt;/gjob:Developer&gt;
3851      &lt;/gjob:Developers&gt;
3852
3853      &lt;gjob:Contact&gt;
3854        &lt;gjob:Person&gt;Nathan Clemons&lt;/gjob:Person&gt;
3855        &lt;gjob:Email&gt;nathan@windsofstorm.net&lt;/gjob:Email&gt;
3856        &lt;gjob:Company&gt;
3857        &lt;/gjob:Company&gt;
3858        &lt;gjob:Organisation&gt;
3859        &lt;/gjob:Organisation&gt;
3860        &lt;gjob:Webpage&gt;
3861        &lt;/gjob:Webpage&gt;
3862        &lt;gjob:Snailmail&gt;
3863        &lt;/gjob:Snailmail&gt;
3864        &lt;gjob:Phone&gt;
3865        &lt;/gjob:Phone&gt;
3866      &lt;/gjob:Contact&gt;
3867
3868      &lt;gjob:Requirements&gt;
3869      The program should be released as free software, under the GPL.
3870      &lt;/gjob:Requirements&gt;
3871
3872      &lt;gjob:Skills&gt;
3873      &lt;/gjob:Skills&gt;
3874
3875      &lt;gjob:Details&gt;
3876      A GNOME based system that will allow a superuser to configure 
3877      compressed and uncompressed files and/or file systems to be backed 
3878      up with a supported media in the system.  This should be able to 
3879      perform via find commands generating a list of files that are passed 
3880      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine 
3881      or via operations performed on the filesystem itself. Email 
3882      notification and GUI status display very important.
3883      &lt;/gjob:Details&gt;
3884
3885    &lt;/gjob:Job&gt;
3886
3887  &lt;/gjob:Jobs&gt;
3888&lt;/gjob:Helping&gt;</pre>
3889
3890<p>While loading the XML file into an internal DOM tree is a matter of
3891calling only a couple of functions, browsing the tree to gather the data and
3892generate the internal structures is harder, and more error prone.</p>
3893
3894<p>The suggested principle is to be tolerant with respect to the input
3895structure. For example, the ordering of the attributes is not significant,
3896the XML specification is clear about it. It's also usually a good idea not to
3897depend on the order of the children of a given node, unless it really makes
3898things harder. Here is some code to parse the information for a person:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/* Trace helper used by the parsing functions */
#define DEBUG(x) fprintf(stderr, "%s", x)

/*
 * A person record; the strings are allocated by libxml
 */
typedef struct person {
    xmlChar *name;
    xmlChar *email;
    xmlChar *company;
    xmlChar *organisation;
    xmlChar *smail;
    xmlChar *webPage;
    xmlChar *phone;
} person, *personPtr;

/*
 * And the code needed to parse it
 */
personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    personPtr ret = NULL;

    DEBUG("parsePerson\n");
    /*
     * allocate the struct
     */
    ret = (personPtr) malloc(sizeof(person));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(person));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        /* node names are xmlChar strings, so compare them with xmlStrcmp() */
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Person")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;name = xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Email")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;email = xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>
3941
3942<p>Here are a couple of things to notice:</p>
3943<ul>
  <li>Usually a recursive parsing style is the most convenient one: XML data
    is by nature subject to repetitive constructs and usually exhibits highly
    structured patterns.</li>
  <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em> are
    the pointer to the global XML document and the namespace reserved for
    the application. Document-wide information is needed, for example, to
    decode entities, and it's good coding practice to define a namespace for
    your application's set of data and test that the elements and attributes
    you're analyzing actually pertain to your application space. This is
    done by a simple equality test (cur-&gt;ns == ns).</li>
  <li>To retrieve text and attribute values, you can use the function
    <em>xmlNodeListGetString</em> to gather all the text and entity reference
    nodes generated by the DOM output and produce a single text string.</li>
3957</ul>
3958
3959<p>Here is another piece of code used to parse another level of the
3960structure:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/* Trace helper, defined here only if it is not already available */
#ifndef DEBUG
#define DEBUG(x) fprintf(stderr, "%s", x)
#endif

/*
 * a Description for a Job
 */
typedef struct job {
    xmlChar *projectID;
    xmlChar *application;
    xmlChar *category;
    personPtr contact;
    int nbDevelopers;
    personPtr developers[100]; /* using dynamic alloc is left as an exercise */
} job, *jobPtr;

/*
 * And the code needed to parse it
 */
jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    jobPtr ret = NULL;

    DEBUG("parseJob\n");
    /*
     * allocate the struct
     */
    ret = (jobPtr) malloc(sizeof(job));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(job));

    /* We don't care what the top level element name is */
    cur = cur-&gt;xmlChildrenNode;
    while (cur != NULL) {
        /* node names are xmlChar strings, so compare them with xmlStrcmp() */
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Project")) &amp;&amp; (cur-&gt;ns == ns)) {
            ret-&gt;projectID = xmlGetProp(cur, BAD_CAST "ID");
            if (ret-&gt;projectID == NULL) {
                fprintf(stderr, "Project has no ID\n");
            }
        }
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Application")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;application = xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Category")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;category = xmlNodeListGetString(doc, cur-&gt;xmlChildrenNode, 1);
        if ((!xmlStrcmp(cur-&gt;name, BAD_CAST "Contact")) &amp;&amp; (cur-&gt;ns == ns))
            ret-&gt;contact = parsePerson(doc, ns, cur);
        cur = cur-&gt;next;
    }

    return(ret);
}</pre>
4012
4013<p>Once you are used to it, writing this kind of code is quite simple, but
4014boring. Ultimately, it could be possible to write stubbers taking either C
4015data structure definitions, a set of XML examples or an XML DTD and produce
4016the code needed to import and export the content between C data and XML
4017storage. This is left as an exercise to the reader :-)</p>
4018
<p>Feel free to use <a href="example/gjobread.c">the code for the full C
parsing example</a> as a template. It is also available with a Makefile in
the Gnome CVS base under gnome-xml/example.</p>
4022
4023<h2><a name="Contributi">Contributions</a></h2>
4024<ul>
4025  <li>Bjorn Reese, William Brack and Thomas Broyer have provided a number of
4026    patches, Gary Pennington worked on the validation API, threading support
4027    and Solaris port.</li>
  <li>John Fleck helps maintain the documentation and man pages.</li>
4029  <li><a href="mailto:igor@zlatkovic.com">Igor  Zlatkovic</a> is now the
4030    maintainer of the Windows port, <a
4031    href="http://www.zlatkovic.com/projects/libxml/index.html">he provides
4032    binaries</a></li>
4033  <li><a href="mailto:Gary.Pennington@sun.com">Gary Pennington</a> provides
4034    <a href="http://garypennington.net/libxml2/">Solaris binaries</a></li>
4035  <li><a
4036    href="http://mail.gnome.org/archives/xml/2001-March/msg00014.html">Matt
4037    Sergeant</a> developed <a
4038    href="http://axkit.org/download/">XML::LibXSLT</a>, a Perl wrapper for
4039    libxml2/libxslt as part of the <a href="http://axkit.com/">AxKit XML
4040    application server</a></li>
4041  <li><a href="mailto:fnatter@gmx.net">Felix Natter</a> and <a
4042    href="mailto:geertk@ai.rug.nl">Geert Kloosterman</a> provide <a
    href="libxml-doc.el">an emacs module</a> to look up libxml(2) function
    documentation</li>
4045  <li><a href="mailto:sherwin@nlm.nih.gov">Ziying Sherwin</a> provided <a
4046    href="http://xmlsoft.org/messages/0488.html">man pages</a></li>
4047  <li>there is a module for <a
4048    href="http://acs-misc.sourceforge.net/nsxml.html">libxml/libxslt support
4049    in OpenNSD/AOLServer</a></li>
4050  <li><a href="mailto:dkuhlman@cutter.rexx.com">Dave Kuhlman</a> provided the
4051    first version of libxml/libxslt <a
4052    href="http://www.rexx.com/~dkuhlman">wrappers for Python</a></li>
4053  <li>Petr Kozelka provides <a
4054    href="http://sourceforge.net/projects/libxml2-pas">Pascal units to glue
4055    libxml2</a> with Kylix and Delphi and other Pascal compilers</li>
4056  <li><a href="mailto:aleksey@aleksey.com">Aleksey Sanin</a> implemented the
4057    <a href="http://www.w3.org/Signature/">XML Canonicalization and XML
4058    Digital Signature</a> <a
4059    href="http://www.aleksey.com/xmlsec/">implementations for libxml2</a></li>
4060  <li><a href="mailto:Steve.Ball@zveno.com">Steve Ball</a>, <a
4061    href="http://www.zveno.com/">Zveno</a> and contributors maintain <a
4062    href="http://tclxml.sourceforge.net/">tcl bindings for libxml2 and
4063    libxslt</a>, as well as <a
4064    href="http://tclxml.sf.net/tkxmllint.html">tkxmllint</a> a GUI for
4065    xmllint and <a href="http://tclxml.sf.net/tkxsltproc.html">tkxsltproc</a>
4066    a GUI for xsltproc.</li>
4067</ul>
4068
4069<p></p>
4070</body>
4071</html>
4072