xml.html revision 6e93c4aa47c11b86642c504301a3ff24ee191172
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title>The XML C library for Gnome</title>
  <meta name="GENERATOR" content="amaya V4.1">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
alt="Gnome Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png"
alt="W3C Logo"></a></p>

<h1 align="center">The XML C library for Gnome</h1>

<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2>

<p></p>
<ul>
  <li><a href="#Introducti">Introduction</a></li>
  <li><a href="#Documentat">Documentation</a></li>
  <li><a href="#Reporting">Reporting bugs and getting help</a></li>
  <li><a href="#help">how to help</a></li>
  <li><a href="#Downloads">Downloads</a></li>
  <li><a href="#News">News</a></li>
  <li><a href="#XML">XML</a></li>
  <li><a href="#XSLT">XSLT</a></li>
  <li><a href="#tree">The tree output</a></li>
  <li><a href="#interface">The SAX interface</a></li>
  <li><a href="#library">The XML library interfaces</a>
    <ul>
      <li><a href="#Invoking">Invoking the parser: the pull way</a></li>
      <li><a href="#Invoking">Invoking the parser: the push way</a></li>
      <li><a href="#Invoking2">Invoking the parser: the SAX interface</a></li>
      <li><a href="#Building">Building a tree from scratch</a></li>
      <li><a href="#Traversing">Traversing the tree</a></li>
      <li><a href="#Modifying">Modifying the tree</a></li>
      <li><a href="#Saving">Saving the tree</a></li>
      <li><a href="#Compressio">Compression</a></li>
    </ul>
  </li>
  <li><a href="#Entities">Entities or no entities</a></li>
  <li><a href="#Namespaces">Namespaces</a></li>
  <li><a href="#Validation">Validation</a></li>
  <li><a href="#Principles">DOM principles</a></li>
  <li><a href="#real">A real example</a></li>
  <li><a href="#Contributi">Contributions</a></li>
</ul>

<p>Separate documents:</p>
<ul>
  <li><a href="upgrade.html">upgrade instructions for migrating to
    libxml2</a></li>
  <li><a href="encoding.html">libxml Internationalization support</a></li>
  <li><a href="xmlio.html">libxml Input/Output interfaces</a></li>
  <li><a href="xmlmem.html">libxml Memory interfaces</a></li>
  <li><a href="xmldtd.html">a short introduction about DTDs and
    libxml</a></li>
  <li><a href="http://xmlsoft.org/XSLT/">the libxslt page</a></li>
</ul>

<h2><a name="Introducti">Introduction</a></h2>

<p>This document describes libxml, the <a
href="http://www.w3.org/XML/">XML</a> C library developed for the <a
href="http://www.gnome.org/">Gnome</a> project. <a
href="http://www.w3.org/XML/">XML is a standard</a> for building tag-based
structured documents/data.</p>

<p>Here are some key points about libxml:</p>
<ul>
  <li>Libxml exports Push and Pull type parser interfaces for both XML and
    HTML.</li>
  <li>Libxml can do DTD validation at parse time, using a parsed document
    instance, or with an arbitrary DTD.</li>
  <li>Libxml now includes nearly complete <a
    href="http://www.w3.org/TR/xpath">XPath</a> and <a
    href="http://www.w3.org/TR/xptr">XPointer</a> implementations.</li>
  <li>It is written in plain C, making as few assumptions as possible, and
    sticking closely to ANSI C/POSIX for easy embedding.
    It works on Linux/Unix/Windows and has been ported to a number of other
    platforms.</li>
  <li>Basic HTTP and FTP client support allows applications to fetch
    remote resources.</li>
  <li>The design is modular; most of the extensions can be compiled out.</li>
  <li>The internal document representation is as close as possible to the <a
    href="http://www.w3.org/DOM/">DOM</a> interfaces.</li>
  <li>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX
    like interface</a>; the interface is designed to be compatible with <a
    href="http://www.jclark.com/xml/expat.html">Expat</a>.</li>
  <li>This library is released both under the <a
    href="http://www.w3.org/Consortium/Legal/copyright-software-19980720.html">W3C
    IPR</a> and the <a href="http://www.gnu.org/copyleft/lesser.html">GNU
    LGPL</a>. Use either at your convenience. This should make everybody
    happy; if not, drop me a mail.</li>
</ul>

<p>Warning: unless you are forced to because your application links with a
Gnome library requiring it, <strong><span
style="background-color: #FF0000">Do Not Use libxml1</span></strong>, use
libxml2.</p>

<h2><a name="Documentat">Documentation</a></h2>

<p>There are some on-line resources about using libxml:</p>
<ol>
  <li>Check the <a href="FAQ.html">FAQ</a>.</li>
  <li>Check the <a href="http://xmlsoft.org/html/libxml-lib.html">extensive
    documentation</a> automatically extracted from code comments (using <a
    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gtk-doc">gtk
    doc</a>).</li>
  <li>Look at the documentation about <a href="encoding.html">libxml
    internationalization support</a>.</li>
  <li>This page provides a global overview and <a href="#real">some
    examples</a> on how to use libxml.</li>
  <li><a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
    href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
    documentation</a>
    explaining how to use the libxml SAX interface.</li>
  <li>George Lebl wrote <a
    href="http://www-4.ibm.com/software/developer/library/gnome3/">an article
    for IBM developerWorks</a> about using libxml.</li>
  <li>It is also a good idea to check <a href="mailto:raph@levien.com">Raph
    Levien</a>'s <a href="http://levien.com/gnome/">web site</a> since he is
    building the <a href="http://levien.com/gnome/gdome.html">DOM interface
    gdome</a> on top of the libxml result tree and an implementation of <a
    href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a
    href="http://www.levien.com/svg/">gill</a>. Check his <a
    href="http://www.levien.com/gnome/domination.html">DOMination
    paper</a>.</li>
  <li>Check <a href="http://cvs.gnome.org/lxr/source/gnome-xml/TODO">the TODO
    file</a>.</li>
  <li>Read the <a href="upgrade.html">1.x to 2.x upgrade path</a>. If you are
    starting a new project using libxml you should really use the 2.x
    version.</li>
  <li>And don't forget to look at the <a href="/messages/">mailing-list
    archive</a>.</li>
</ol>

<h2><a name="Reporting">Reporting bugs and getting help</a></h2>

<p>Well, bugs or missing features are always possible, and I will make a point
of fixing them in a timely fashion. The best way to report a bug is to use the
<a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">Gnome bug
tracking database</a> (make sure to use the "libxml" module name). I look at
reports there regularly and it's good to have a reminder when a bug is still
open. Check the <a
href="http://bugzilla.gnome.org/bugwritinghelp.html">instructions on reporting
bugs</a> and be sure to specify that the bug is for the package libxml.</p>

<p>There is also a mailing-list <a
href="mailto:xml@gnome.org">xml@gnome.org</a> for libxml, with an <a
href="http://mail.gnome.org/archives/xml/">on-line archive</a> (<a
href="http://xmlsoft.org/messages">old</a>).
To subscribe to this list, please
visit the <a href="http://mail.gnome.org/mailman/listinfo/xml">associated
Web</a> page and follow the instructions.</p>

<p>Alternatively, you can just send the bug to the <a
href="mailto:xml@gnome.org">xml@gnome.org</a> list; if it's really libxml
related I will approve it. Please do not send me mail directly, especially for
portability problems: it makes things much harder to track, and in some cases
I'm not the best person to answer a given question. Ask the list instead.</p>

<p>Of course, bugs reported with a suggested patch for fixing them will
probably be processed faster.</p>

<p>If you're looking for help, a quick look at <a
href="http://xmlsoft.org/messages/#407">the list archive</a> may actually
provide the answer; I usually send source samples when answering libxml usage
questions. The <a href="http://xmlsoft.org/html/book1.html">auto-generated
documentation</a> is not as polished as I would like (I need to learn more
about Docbook), but it's a good starting point.</p>

<h2><a name="help">How to help</a></h2>

<p>You can help the project in various ways. The best thing to do first is to
subscribe to the mailing-list as explained before, then check the <a
href="http://xmlsoft.org/messages/">archives</a> and the <a
href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug
database</a>:</p>
<ol>
  <li>provide patches when you find problems</li>
  <li>provide the diffs when you port libxml to a new platform.
    They may not
    be integrated in all cases but help pinpoint portability problems.</li>
  <li>provide documentation fixes (either as patches to the code comments or
    as HTML diffs).</li>
  <li>provide new documentation pieces (translations, examples, etc.)</li>
  <li>check the TODO file and try to close one of the items</li>
  <li>take one of the points raised in the archive or the bug database and
    provide a fix. <a href="mailto:Daniel.Veillard@imag.fr">Get in touch with
    me</a> beforehand to avoid synchronization problems and to check that the
    suggested fix will fit in nicely :-)</li>
</ol>

<h2><a name="Downloads">Downloads</a></h2>

<p>The latest versions of libxml can be found on <a
href="ftp://xmlsoft.org/">xmlsoft.org</a> or on the <a
href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a> either
as a <a href="ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/">source
archive</a> or <a
href="ftp://ftp.gnome.org/pub/GNOME/stable/redhat/i386/libxml/">RPM
packages</a>. (NOTE that you need both the <a
href="http://rpmfind.net/linux/RPM/libxml2.html">libxml(2)</a> and <a
href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml(2)-devel</a>
packages installed to compile applications using libxml.)</p>

<p><a name="Snapshot">Snapshot:</a></p>
<ul>
  <li>Code from the W3C cvs base libxml <a
    href="ftp://xmlsoft.org/cvs-snapshot.tar.gz">cvs-snapshot.tar.gz</a></li>
  <li>Docs, content of the web site, the list archive included <a
    href="ftp://xmlsoft.org/libxml-docs.tar.gz">libxml-docs.tar.gz</a></li>
</ul>

<p><a name="Contribs">Contribs:</a></p>

<p>I do accept external contributions, especially if compiling on another
platform; get in touch with me to upload the package.
I will keep them in the
<a href="ftp://xmlsoft.org/contribs/">contrib directory</a>.</p>

<p>Libxml is also available from CVS:</p>
<ul>
  <li><p>The <a
    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gnome-xml">Gnome
    CVS base</a>. Check the <a
    href="http://developer.gnome.org/tools/cvs.html">Gnome CVS Tools</a> page;
    the CVS module is <b>gnome-xml</b>.</p>
  </li>
  <li>The <strong>libxslt</strong> module is also present there</li>
</ul>

<h2><a name="News">News</a></h2>

<h3>CVS only: check the <a
href="http://cvs.gnome.org/lxr/source/gnome-xml/ChangeLog">Changelog</a> file
for a really accurate description</h3>

<p>Items floating around but not actively worked on; get in touch with me if
you want to test those:</p>
<ul>
  <li>Implementing <a href="http://xmlsoft.org/XSLT">XSLT</a>: this is done as
    a separate C library on top of libxml called libxslt, not released yet but
    available from CVS</li>
  <li>Finishing up <a href="http://www.w3.org/TR/xptr">XPointer</a> and <a
    href="http://www.w3.org/TR/xinclude">XInclude</a></li>
  <li>(seems to work but delayed from release) parsing/import of Docbook SGML
    docs</li>
</ul>

<h3>2.3.10: June 1 2001</h3>
<ul>
  <li>fixed the SGML catalog support</li>
  <li>a number of reported bugs got fixed, in XPath, iconv detection, XInclude
    processing</li>
  <li>XPath string functions should now handle Unicode correctly</li>
</ul>

<h3>2.3.9: May 19 2001</h3>

<p>Lots of bugfixes, and added a basic SGML catalog support:</p>
<ul>
  <li>HTML push bugfix #54891 and another patch from Jonas Borgstr&ouml;m</li>
  <li>some serious speed optimisation again</li>
  <li>some documentation cleanups</li>
  <li>trying to get better linking on solaris (-R)</li>
  <li>XPath API cleanup from Thomas Broyer</li>
  <li>Validation bug fixed #54631, added a patch from Gary Pennington, fixed
    xmlValidGetValidElements()</li>
  <li>Added an INSTALL file</li>
  <li>Attribute removal added to API: #54433</li>
  <li>added a basic support for SGML catalogs</li>
  <li>fixed xmlKeepBlanksDefault(0) API</li>
  <li>bugfix in xmlNodeGetLang()</li>
  <li>fixed a small configure portability problem</li>
  <li>fixed an inversion of SYSTEM and PUBLIC identifier in HTML document</li>
</ul>

<h3>1.8.13: May 14 2001</h3>
<ul>
  <li>bugfix release of the old libxml1 branch used by Gnome</li>
</ul>

<h3>2.3.8: May 3 2001</h3>
<ul>
  <li>Integrated an SGML DocBook parser for the Gnome project</li>
  <li>Fixed a few things in the HTML parser</li>
  <li>Fixed some XPath bugs raised by XSLT use, tried to fix the floating
    point portability issue</li>
  <li>Speed improvement (8M/s for SAX, 3M/s for DOM, 1.5M/s for DOM+validation
    using the XML REC as input and a 700MHz celeron).</li>
  <li>incorporated more Windows cleanup</li>
  <li>added xmlSaveFormatFile()</li>
  <li>fixed problems in copying nodes with entity references (gdome)</li>
  <li>removed some troubles surrounding the new validation module</li>
</ul>

<h3>2.3.7: April 22 2001</h3>
<ul>
  <li>lots of small bug fixes, corrected XPointer</li>
  <li>Non-deterministic content model validation support</li>
  <li>added xmlDocCopyNode for gdome2</li>
  <li>revamped the way the HTML parser handles end of tags</li>
  <li>XPath: corrections of namespaces support and number formatting</li>
  <li>Windows: Igor Zlatkovic patches for MSC compilation</li>
  <li>HTML output fixes from P C Chow and William M.
    Brack</li>
  <li>Improved validation speed, noticeably for DocBook</li>
  <li>fixed a big bug with ID declared in external parsed entities</li>
  <li>portability fixes, update of Trio from Bjorn Reese</li>
</ul>

<h3>2.3.6: April 8 2001</h3>
<ul>
  <li>Code cleanup using extreme gcc compiler warning options, found and
    cleared half a dozen potential problems</li>
  <li>the Eazel team found an XML parser bug</li>
  <li>cleaned up the use of some of the string formatting functions; used the
    trio library code to provide the ones needed when the platform is missing
    them</li>
  <li>xpath: removed a memory leak and fixed the predicate evaluation problem,
    extended the testsuite and cleaned up the result. XPointer seems broken
    ...</li>
</ul>

<h3>2.3.5: Mar 23 2001</h3>
<ul>
  <li>Biggest change is separate parsing and evaluation of XPath expressions,
    there are some new APIs for this too</li>
  <li>included a number of bug fixes (XML push parser, 51876, notations,
    52299)</li>
  <li>Fixed some portability issues</li>
</ul>

<h3>2.3.4: Mar 10 2001</h3>
<ul>
  <li>Fixed bugs #51860 and #51861</li>
  <li>Added a global variable xmlDefaultBufferSize to allow default buffer
    size to be application tunable.</li>
  <li>Some cleanup in the validation code, still a bug left and this part
    should probably be rewritten to support ambiguous content models :-\</li>
  <li>Fix a couple of serious bugs introduced or raised by changes in 2.3.3
    parser</li>
  <li>Fixed another bug in xmlNodeGetContent()</li>
  <li>Bjorn fixed XPath node collection and Number formatting</li>
  <li>Fixed a loop reported in the HTML parsing</li>
  <li>blank spaces are reported even if the Dtd content model proves that they
    are formatting spaces; this is for XML conformance</li>
</ul>

<h3>2.3.3: Mar 1 2001</h3>
<ul>
  <li>small change in XPath for XSLT</li>
  <li>documentation cleanups</li>
  <li>fix in validation by Gary Pennington</li>
  <li>serious parsing performance improvements</li>
</ul>

<h3>2.3.2: Feb 24 2001</h3>
<ul>
  <li>chasing XPath bugs, found a bunch, completed some TODO</li>
  <li>fixed a Dtd parsing bug</li>
  <li>fixed a bug in xmlNodeGetContent</li>
  <li>ID/IDREF support partly rewritten by Gary Pennington</li>
</ul>

<h3>2.3.1: Feb 15 2001</h3>
<ul>
  <li>some XPath and HTML bug fixes for XSLT</li>
  <li>small extension of the hash table interfaces for DOM gdome2
    implementation</li>
  <li>A few bug fixes</li>
</ul>

<h3>2.3.0: Feb 8 2001 (2.2.12 was on 25 Jan but I didn't keep track)</h3>
<ul>
  <li>Lots of XPath bug fixes</li>
  <li>Add a mode with Dtd lookup but without validation error reporting for
    XSLT</li>
  <li>Add support for text node without escaping (XSLT)</li>
  <li>bug fixes for xmlCheckFilename</li>
  <li>validation code bug fixes from Gary Pennington</li>
  <li>Patch from Paul D. Smith correcting URI path normalization</li>
  <li>Patch to allow simultaneous install of libxml-devel and
    libxml2-devel</li>
  <li>the example Makefile is now fixed</li>
  <li>added HTML to the RPM packages</li>
  <li>tree copying bugfixes</li>
  <li>updates to Windows makefiles</li>
  <li>optimisation patch from Bjorn Reese</li>
</ul>

<h3>2.2.11: Jan 4 2001</h3>
<ul>
  <li>bunch of bug fixes (memory I/O, xpath, ftp/http, ...)</li>
  <li>added htmlHandleOmittedElem()</li>
  <li>Applied Bjorn Reese's first IPv6 patch</li>
  <li>Applied Paul D.
    Smith patches for validation of XInclude results</li>
  <li>added XPointer xmlns() new scheme support</li>
</ul>

<h3>2.2.10: Nov 25 2000</h3>
<ul>
  <li>Fix the Windows problems of 2.2.8</li>
  <li>integrate OpenVMS patches</li>
  <li>better handling of some nasty HTML input</li>
  <li>Improved the XPointer implementation</li>
  <li>integrate a number of provided patches</li>
</ul>

<h3>2.2.9: Nov 25 2000</h3>
<ul>
  <li>erroneous release :-(</li>
</ul>

<h3>2.2.8: Nov 13 2000</h3>
<ul>
  <li>First version of <a href="http://www.w3.org/TR/xinclude">XInclude</a>
    support</li>
  <li>Patch in conditional section handling</li>
  <li>updated MS compiler project</li>
  <li>fixed some XPath problems</li>
  <li>added a URI escaping function</li>
  <li>some other bug fixes</li>
</ul>

<h3>2.2.7: Oct 31 2000</h3>
<ul>
  <li>added message redirection</li>
  <li>XPath improvements (thanks TOM !)</li>
  <li>xmlIOParseDTD() added</li>
  <li>various small fixes in the HTML, URI, HTTP and XPointer support</li>
  <li>some cleanup of the Makefile, autoconf and the distribution content</li>
</ul>

<h3>2.2.6: Oct 25 2000</h3>
<ul>
  <li>Added a hash table module, migrated a number of internal structures to
    it</li>
  <li>Fixed a posteriori validation problems</li>
  <li>HTTP module cleanups</li>
  <li>HTML parser improvements (tag errors, script/style handling, attribute
    normalization)</li>
  <li>coalescing of adjacent text nodes</li>
  <li>couple of XPath bug fixes, exported the internal API</li>
</ul>

<h3>2.2.5: Oct 15 2000</h3>
<ul>
  <li>XPointer implementation and testsuite</li>
  <li>Lots of XPath fixes, added variable and functions registration, more
    tests</li>
  <li>Portability fixes, lots of enhancements toward an easy Windows build and
    release</li>
  <li>Late validation fixes</li>
  <li>Integrated a lot of contributed
    patches</li>
  <li>added memory management docs</li>
  <li>a performance problem when using large buffers seems fixed</li>
</ul>

<h3>2.2.4: Oct 1 2000</h3>
<ul>
  <li>main XPath problem fixed</li>
  <li>Integrated portability patches for Windows</li>
  <li>Serious bug fixes on the URI and HTML code</li>
</ul>

<h3>2.2.3: Sep 17 2000</h3>
<ul>
  <li>bug fixes</li>
  <li>cleanup of entity handling code</li>
  <li>overall review of all loops in the parsers, all sprintf usage has been
    checked too</li>
  <li>Far better handling of large Dtds. Validating against Docbook XML Dtd
    works smoothly now.</li>
</ul>

<h3>1.8.10: Sep 6 2000</h3>
<ul>
  <li>bug fix release for some Gnome projects</li>
</ul>

<h3>2.2.2: August 12 2000</h3>
<ul>
  <li>mostly bug fixes</li>
  <li>started adding routines to access xml parser context options</li>
</ul>

<h3>2.2.1: July 21 2000</h3>
<ul>
  <li>a purely bug fix release</li>
  <li>fixed an encoding support problem when parsing from a memory block</li>
  <li>fixed a DOCTYPE parsing problem</li>
  <li>removed a bug in the function allowing to override the memory allocation
    routines</li>
</ul>

<h3>2.2.0: July 14 2000</h3>
<ul>
  <li>applied a lot of portability fixes</li>
  <li>better encoding support/cleanup and saving (content is now always
    encoded in UTF-8)</li>
  <li>the HTML parser now correctly handles encodings</li>
  <li>added xmlHasProp()</li>
  <li>fixed a serious problem with &amp;#38;</li>
  <li>propagated the fix to FTP client</li>
  <li>cleanup, bugfixes, etc ...</li>
  <li>Added a page about <a href="encoding.html">libxml Internationalization
    support</a></li>
</ul>

<h3>1.8.9: July 9 2000</h3>
<ul>
  <li>fixed the spec, the RPMs should be better</li>
  <li>fixed a serious bug in the FTP implementation, released 1.8.9 to solve
    rpmfind users' problem</li>
</ul>

<h3>2.1.1: July 1 2000</h3>
<ul>
  <li>fixes a couple of bugs in the 2.1.0 packaging</li>
  <li>improvements on the HTML parser</li>
</ul>

<h3>2.1.0 and 1.8.8: June 29 2000</h3>
<ul>
  <li>1.8.8 is mostly a commodity package for upgrading to libxml2 according to
    <a href="upgrade.html">new instructions</a>. It fixes a nasty problem
    about &amp;#38; charref parsing</li>
  <li>2.1.0 also eases the upgrade from libxml v1 to the recent version. It
    also contains numerous fixes and enhancements:
    <ul>
      <li>added xmlStopParser() to stop parsing</li>
      <li>greatly improved parsing speed when there are large CDATA blocks</li>
      <li>includes XPath patches provided by Picdar Technology</li>
      <li>tried to fix as many DTD validation and namespace
        related problems as possible</li>
      <li>output to a given encoding has been added/tested</li>
      <li>lots of various fixes</li>
    </ul>
  </li>
</ul>

<h3>2.0.0: Apr 12 2000</h3>
<ul>
  <li>First public release of libxml2. If you are using libxml, it's a good
    idea to check the 1.x to 2.x upgrade instructions. NOTE: while initially
    scheduled for Apr 3 the release occurred only on Apr 12 due to massive
    workload.</li>
  <li>The includes are now located under $prefix/include/libxml (instead of
    $prefix/include/gnome-xml), they are also referenced by
    <pre>#include &lt;libxml/xxx.h&gt;</pre>
    <p>instead of</p>
    <pre>#include "xxx.h"</pre>
  </li>
  <li>a new URI module for parsing URIs and following strictly RFC 2396</li>
  <li>the memory allocation routines used by libxml can now be overloaded
    dynamically by using xmlMemSetup()</li>
  <li>The previously CVS only tool tester has been renamed
    <strong>xmllint</strong> and is now installed as part of the libxml2
    package</li>
  <li>The I/O interface has been revamped.
    There are now ways to plug in
    specific I/O modules, either at the URI scheme detection level using
    xmlRegisterInputCallbacks() or by passing I/O functions when creating a
    parser context using xmlCreateIOParserCtxt()</li>
  <li>there is a C preprocessor macro LIBXML_VERSION providing the version
    number of the libxml module in use</li>
  <li>a number of optional features of libxml can now be excluded at configure
    time (FTP/HTTP/HTML/XPath/Debug)</li>
</ul>

<h3>2.0.0beta: Mar 14 2000</h3>
<ul>
  <li>This is a first Beta release of libxml version 2</li>
  <li>It's available only from <a href="ftp://xmlsoft.org/">xmlsoft.org
    FTP</a>, it's packaged as libxml2-2.0.0beta and available as tar and
    RPMs</li>
  <li>This version is now the head in the Gnome CVS base, the old one is
    available under the tag LIB_XML_1_X</li>
  <li>This includes a very large set of changes. From a programmatic point of
    view applications should not have to be modified too much, check the <a
    href="upgrade.html">upgrade page</a></li>
  <li>Some interfaces may change (especially a bit about encoding).</li>
  <li>the updates include:
    <ul>
      <li>fix I18N support. ISO-Latin-x/UTF-8/UTF-16 (nearly) seem correctly
        handled now</li>
      <li>Better handling of entities, especially well formedness checking and
        proper PEref extensions in external subsets</li>
      <li>DTD conditional sections</li>
      <li>Validation now correctly handles entities content</li>
      <li><a href="http://rpmfind.net/tools/gdome/messages/0039.html">change
        structures to accommodate DOM</a></li>
    </ul>
  </li>
  <li>Serious progress was made toward compliance, <a
    href="conf/result.html">here are the results of the tests</a> against the
    OASIS testsuite (except the Japanese tests since I don't support that
    encoding yet).
    This URL is rebuilt every couple of hours using the CVS
    head version.</li>
</ul>

<h3>1.8.7: Mar 6 2000</h3>
<ul>
  <li>This is a bug fix release:</li>
  <li>It is possible to disable the ignorable blanks heuristic used by
    libxml-1.x, a new function xmlKeepBlanksDefault(0) will allow this. Note
    that for adherence to the XML spec, this behaviour will be disabled by
    default in 2.x. The same function will allow old code to keep
    compatibility.</li>
  <li>Blanks in &lt;a&gt; &lt;/a&gt; constructs are not ignored anymore,
    avoiding heuristic is really the Right Way :-\</li>
  <li>The unchecked use of snprintf which was breaking libxml-1.8.6
    compilation on some platforms has been fixed</li>
  <li>nanoftp.c nanohttp.c: Fixed '#' and '?' stripping when processing
    URIs</li>
</ul>

<h3>1.8.6: Jan 31 2000</h3>
<ul>
  <li>added a nanoFTP transport module, debugged until the new version of <a
    href="http://rpmfind.net/linux/rpm2html/rpmfind.html">rpmfind</a> can use
    it without troubles</li>
</ul>

<h3>1.8.5: Jan 21 2000</h3>
<ul>
  <li>adding APIs to parse a well balanced chunk of XML (production <a
    href="http://www.w3.org/TR/REC-xml#NT-content">[43] content</a> of the XML
    spec)</li>
  <li>fixed a hideous bug in xmlGetProp pointed out by Rune.Djurhuus@fast.no</li>
  <li>Jody Goldberg &lt;jgoldberg@home.com&gt; provided another patch trying
    to solve the zlib checks problems</li>
  <li>The current state in gnome CVS base is expected to ship as 1.8.5 with
    gnumeric soon</li>
</ul>

<h3>1.8.4: Jan 13 2000</h3>
<ul>
  <li>bug fixes, reintroduced xmlNewGlobalNs(), fixed xmlNewNs()</li>
  <li>all exit() calls should have been removed from libxml</li>
  <li>fixed a problem with INCLUDE_WINSOCK on WIN32 platform</li>
  <li>added newDocFragment()</li>
</ul>

<h3>1.8.3: Jan 5 2000</h3>
<ul>
  <li>a Push interface for the XML and HTML parsers</li>
  <li>a shell-like
    interface to the document tree (try tester --shell :-)</li>
  <li>lots of bug fixes and improvements added over XMas holidays</li>
  <li>fixed the DTD parsing code to work with the xhtml DTD</li>
  <li>added xmlRemoveProp(), xmlRemoveID() and xmlRemoveRef()</li>
  <li>Fixed bugs in xmlNewNs()</li>
  <li>External entity loading code has been revamped, now it uses
    xmlLoadExternalEntity(), some fixes to entity processing were added</li>
  <li>cleaned up WIN32 includes of socket stuff</li>
</ul>

<h3>1.8.2: Dec 21 1999</h3>
<ul>
  <li>I got another problem with includes and C++, I hope this issue is fixed
    for good this time</li>
  <li>Added a few tree modification functions: xmlReplaceNode,
    xmlAddPrevSibling, xmlAddNextSibling, xmlNodeSetName and
    xmlDocSetRootElement</li>
  <li>Tried to improve the HTML output with help from <a
    href="mailto:clahey@umich.edu">Chris Lahey</a></li>
</ul>

<h3>1.8.1: Dec 18 1999</h3>
<ul>
  <li>various patches to avoid troubles when using libxml with C++ compilers:
    the "namespace" keyword and C escaping in include files</li>
  <li>a problem in one of the core macros IS_CHAR was corrected</li>
  <li>fixed a bug introduced in 1.8.0 breaking default namespace processing,
    and more specifically the Dia application</li>
  <li>fixed a posteriori validation (validation after parsing, or by using a
    Dtd not specified in the original document)</li>
  <li>fixed a bug in</li>
</ul>

<h3>1.8.0: Dec 12 1999</h3>
<ul>
  <li>cleanup, especially memory wise</li>
  <li>the parser should be more reliable, especially the HTML one, it should
    not crash, whatever the input !</li>
  <li>Integrated various patches, especially a speedup improvement for large
    datasets from <a href="mailto:cnygard@bellatlantic.net">Carl Nygard</a>,
    configure with --with-buffers to enable them.</li>
  <li>attribute normalization, oops should have been added long ago !</li>
  <li>attributes defaulted from Dtds should be available, xmlSetProp() now
    does entities escaping by default.</li>
</ul>

<h3>1.7.4: Oct 25 1999</h3>
<ul>
  <li>Lots of HTML improvements</li>
  <li>Fixed some errors when saving both XML and HTML</li>
  <li>More examples, the regression tests should now look clean</li>
  <li>Fixed a bug with contiguous charref</li>
</ul>

<h3>1.7.3: Sep 29 1999</h3>
<ul>
  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available, fixed</li>
</ul>

<h3>1.7.1: Sep 24 1999</h3>
<ul>
  <li>The basic type for strings manipulated by libxml has been renamed in
    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
    is that CHAR was conflicting with a predefined type on Windows. However on
    non WIN32 environments, compatibility is provided by way of a
    <strong>#define</strong>.</li>
  <li>Fixed another error: the use of a structure field called errno,
    leading to trouble on platforms where it's a macro</li>
</ul>

<h3>1.7.0: Sep 23 1999</h3>
<ul>
  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
    href="html/libxml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by another means than a simple printf
    like callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
  <li>Serious memory leaks fixed (there is now a <a
    href="html/libxml-xmlmemory.html">memory wrapper</a> module)</li>
  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
    implementation</li>
  <li>Added an HTML parser front-end</li>
</ul>

<h2><a name="XML">XML</a></h2>

<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for
markup-based structured documents.
Here is <a name="example">an example XML
document</a>:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too"&gt;
  &lt;head&gt;
   &lt;title&gt;Welcome to Gnome&lt;/title&gt;
  &lt;/head&gt;
  &lt;chapter&gt;
   &lt;title&gt;The Linux adventure&lt;/title&gt;
   &lt;p&gt;bla bla bla ...&lt;/p&gt;
   &lt;image href="linus.gif"/&gt;
   &lt;p&gt;...&lt;/p&gt;
  &lt;/chapter&gt;
&lt;/EXAMPLE&gt;</pre>

<p>The first line specifies that it's an XML document and gives useful
information about its encoding. Then the document is a text format whose
structure is specified by tags between brackets. <strong>Each tag opened has
to be closed</strong>. XML is pedantic about this. However, if a tag is empty
(no content), a single tag can serve as both the opening and closing tag if it
ends with <code>/&gt;</code> rather than with <code>&gt;</code>. Note that,
for example, the image tag has no content (just an attribute) and is closed by
ending the tag with <code>/&gt;</code>.</p>

<p>XML can be applied successfully to a wide range of uses, from long term
structured document maintenance (where it follows the steps of SGML) to simple
data encoding mechanisms like configuration file formatting (glade),
spreadsheets (gnumeric), or even shorter lived documents such as WebDAV where
it is used to encode remote calls between a client and a server.</p>

<h2><a name="XSLT">XSLT</a></h2>

<p>Check <a href="http://xmlsoft.org/XSLT">the separate libxslt page</a>.</p>

<p><a href="http://www.w3.org/TR/xslt">XSL Transformations</a> is a language
for transforming XML documents into other XML documents (or HTML/textual
output).</p>

<p>A separate library called libxslt is being built on top of libxml2.
This 776module "libxslt" can be found in the Gnome CVS base too.</p> 777 778<p>You can check the <a 779href="http://cvs.gnome.org/lxr/source/libxslt/FEATURES">features</a> supported 780and the progress in the <a 781href="http://cvs.gnome.org/lxr/source/libxslt/ChangeLog">Changelog</a></p> 782 783<h2>An overview of libxml architecture</h2> 784 785<p>Libxml is made of multiple components; some of them are optional, and most 786of the block interfaces are public. The main components are:</p> 787<ul> 788 <li>an Input/Output layer</li> 789 <li>FTP and HTTP client layers (optional)</li> 790 <li>an Internationalization layer managing encoding support</li> 791 <li>a URI module</li> 792 <li>the XML parser and its basic SAX interface</li> 793 <li>an HTML parser using the same SAX interface (optional)</li> 794 <li>a SAX tree module to build an in-memory DOM representation</li> 795 <li>a tree module to manipulate the DOM representation</li> 796 <li>a validation module using the DOM representation (optional)</li> 797 <li>an XPath module for global lookup in a DOM representation 798 (optional)</li> 799 <li>a debug module (optional)</li> 800</ul> 801 802<p>Graphically this gives the following:</p> 803 804<p><img src="libxml.gif" alt="a graphical view of the various"></p> 805 806<p></p> 807 808<h2><a name="tree">The tree output</a></h2> 809 810<p>The parser returns a tree built during the document analysis. The value 811returned is an <strong>xmlDocPtr</strong> (i.e., a pointer to an 812<strong>xmlDoc</strong> structure). This structure contains information such 813as the file name, the document type, and a <strong>children</strong> pointer 814which is the root of the document (or more exactly the first child under the 815root which is the document). The tree is made of <strong>xmlNode</strong>s, 816chained in doubly-linked lists of siblings and with a children<->parent 817relationship. An xmlNode can also carry properties (a chain of xmlAttr 818structures). 
An attribute may have a value which is a list of TEXT or 819ENTITY_REF nodes.</p> 820 821<p>Here is an example (erroneous with respect to the XML spec since there 822should be only one ELEMENT under the root):</p> 823 824<p><img src="structure.gif" alt=" structure.gif "></p> 825 826<p>In the source package there is a small program (not installed by default) 827called <strong>xmllint</strong> which parses XML files given as argument and 828prints them back as parsed. This is useful for detecting errors both in XML 829code and in the XML parser itself. It has an option <strong>--debug</strong> 830which prints the actual in-memory structure of the document; here is the 831result with the <a href="#example">example</a> given before:</p> 832<pre>DOCUMENT 833version=1.0 834standalone=true 835 ELEMENT EXAMPLE 836 ATTRIBUTE prop1 837 TEXT 838 content=gnome is great 839 ATTRIBUTE prop2 840 ENTITY_REF 841 TEXT 842 content= linux too 843 ELEMENT head 844 ELEMENT title 845 TEXT 846 content=Welcome to Gnome 847 ELEMENT chapter 848 ELEMENT title 849 TEXT 850 content=The Linux adventure 851 ELEMENT p 852 TEXT 853 content=bla bla bla ... 854 ELEMENT image 855 ATTRIBUTE href 856 TEXT 857 content=linus.gif 858 ELEMENT p 859 TEXT 860 content=...</pre> 861 862<p>This should be useful for learning the internal representation model.</p> 863 864<h2><a name="interface">The SAX interface</a></h2> 865 866<p>Sometimes the DOM tree output is just too large to fit reasonably into 867memory. In that case (and if you don't expect to save back the XML document 868loaded using libxml), it's better to use the SAX interface of libxml. SAX is a 869<strong>callback-based interface</strong> to the parser. 
Before parsing, the 870application layer registers a customized set of callbacks which are called by 871the library as it progresses through the XML input.</p> 872 873<p>To get more detailed step-by-step guidance on using the SAX interface of 874libxml, see the <a 875href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice 876documentation</a> written by <a href="mailto:james@daa.com.au">James 877Henstridge</a>.</p> 878 879<p>You can debug the SAX behaviour by using the <strong>testSAX</strong> 880program located in the gnome-xml module (it's usually not shipped in the 881binary packages of libxml, but you can find it in the tar source 882distribution). Here is the sequence of callbacks that would be reported by 883testSAX when parsing the example XML document shown earlier:</p> 884<pre>SAX.setDocumentLocator() 885SAX.startDocument() 886SAX.getEntity(amp) 887SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp; linux too') 888SAX.characters( , 3) 889SAX.startElement(head) 890SAX.characters( , 4) 891SAX.startElement(title) 892SAX.characters(Welcome to Gnome, 16) 893SAX.endElement(title) 894SAX.characters( , 3) 895SAX.endElement(head) 896SAX.characters( , 3) 897SAX.startElement(chapter) 898SAX.characters( , 4) 899SAX.startElement(title) 900SAX.characters(The Linux adventure, 19) 901SAX.endElement(title) 902SAX.characters( , 4) 903SAX.startElement(p) 904SAX.characters(bla bla bla ..., 15) 905SAX.endElement(p) 906SAX.characters( , 4) 907SAX.startElement(image, href='linus.gif') 908SAX.endElement(image) 909SAX.characters( , 4) 910SAX.startElement(p) 911SAX.characters(..., 3) 912SAX.endElement(p) 913SAX.characters( , 3) 914SAX.endElement(chapter) 915SAX.characters( , 1) 916SAX.endElement(EXAMPLE) 917SAX.endDocument()</pre> 918 919<p>Most of the other interfaces of libxml are based on the DOM tree-building 920facility, so nearly everything up to the end of this document presupposes the 921use of the standard DOM tree build. 
Note that the DOM tree itself is built by 922a set of registered default callbacks, without any specific internal 923interface.</p> 924 925<h2><a name="library">The XML library interfaces</a></h2> 926 927<p>This section is directly intended to help programmers get bootstrapped 928using the XML library from the C language. It is not intended to be extensive. 929I hope the automatically generated documents will provide the completeness 930required, but as a separate set of documents. The interfaces of the XML 931library are low level by design; there is nearly zero abstraction. Those 932interested in a higher level API should <a href="#DOM">look at DOM</a>.</p> 933 934<p>The <a href="html/libxml-parser.html">parser interfaces for XML</a> are 935separated from the <a href="html/libxml-htmlparser.html">HTML parser 936interfaces</a>. Let's have a look at how the XML parser can be called:</p> 937 938<h3><a name="Invoking">Invoking the parser: the pull method</a></h3> 939 940<p>Usually, the first thing to do is to read an XML input. The parser accepts 941documents either from in-memory strings or from files. The functions are 942defined in "parser.h":</p> 943<dl> 944 <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt> 945 <dd><p>Parse a document held in a memory buffer; <code>size</code> gives its length in bytes.</p> 946 </dd> 947</dl> 948<dl> 949 <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt> 950 <dd><p>Parse an XML document contained in a (possibly compressed) 951 file.</p> 952 </dd> 953</dl> 954 955<p>The parser returns a pointer to the document structure (or NULL in case of 956failure).</p> 957 958<h3 id="Invoking1">Invoking the parser: the push method</h3> 959 960<p>In order for the application to keep control while the document is being 961fetched (which is common for GUI-based programs) libxml provides a push 962interface, too, as of version 1.8.3. 
Here are the interface functions:</p> 963<pre>xmlParserCtxtPtr xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax, 964 void *user_data, 965 const char *chunk, 966 int size, 967 const char *filename); 968int xmlParseChunk (xmlParserCtxtPtr ctxt, 969 const char *chunk, 970 int size, 971 int terminate);</pre> 972 973<p>and here is a simple example showing how to use the interface:</p> 974<pre> FILE *f; 975 976 f = fopen(filename, "r"); 977 if (f != NULL) { 978 int res, size = 1024; 979 char chars[1024]; 980 xmlParserCtxtPtr ctxt; 981 982 res = fread(chars, 1, 4, f); 983 if (res > 0) { 984 ctxt = xmlCreatePushParserCtxt(NULL, NULL, 985 chars, res, filename); 986 while ((res = fread(chars, 1, size, f)) > 0) { 987 xmlParseChunk(ctxt, chars, res, 0); 988 } 989 xmlParseChunk(ctxt, chars, 0, 1); 990 doc = ctxt->myDoc; 991 xmlFreeParserCtxt(ctxt); 992 } 993 }</pre> 994 995<p>The HTML parser embedded into libxml also has a push interface; the 996functions are just prefixed by "html" rather than "xml".</p> 997 998<h3 id="Invoking2">Invoking the parser: the SAX interface</h3> 999 1000<p>The tree-building interface makes the parser memory-hungry, first loading 1001the document in memory and then building the tree itself. Reading a document 1002without building the tree is possible using the SAX interfaces (see SAX.h and 1003<a href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">James 1004Henstridge's documentation</a>). Note also that the push interface can be 1005limited to SAX: just use the first two arguments of 1006<code>xmlCreatePushParserCtxt()</code>.</p> 1007 1008<h3><a name="Building">Building a tree from scratch</a></h3> 1009 1010<p>The other way to get an XML tree in memory is by building it. Basically 1011there is a set of functions dedicated to building new elements. (These are 1012also described in <libxml/tree.h>.) 
For example, here is a piece of code 1013that produces the XML document used in the previous examples:</p> 1014<pre> #include <libxml/tree.h> 1015 xmlDocPtr doc; 1016 xmlNodePtr tree, subtree; 1017 1018 doc = xmlNewDoc("1.0"); 1019 doc->children = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL); 1020 xmlSetProp(doc->children, "prop1", "gnome is great"); 1021 xmlSetProp(doc->children, "prop2", "& linux too"); 1022 tree = xmlNewChild(doc->children, NULL, "head", NULL); 1023 subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome"); 1024 tree = xmlNewChild(doc->children, NULL, "chapter", NULL); 1025 subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure"); 1026 subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ..."); 1027 subtree = xmlNewChild(tree, NULL, "image", NULL); 1028 xmlSetProp(subtree, "href", "linus.gif");</pre> 1029 1030<p>Not really rocket science ...</p> 1031 1032<h3><a name="Traversing">Traversing the tree</a></h3> 1033 1034<p>Basically by <a href="html/libxml-tree.html">including "tree.h"</a> your 1035code has access to the internal structure of all the elements of the tree. The 1036names should be somewhat simple like <strong>parent</strong>, 1037<strong>children</strong>, <strong>next</strong>, <strong>prev</strong>, 1038<strong>properties</strong>, etc... 
For example, still with the previous 1039example:</p> 1040<pre><code>doc->children->children->children</code></pre> 1041 1042<p>points to the title element,</p> 1043<pre>doc->children->children->next->children->children</pre> 1044 1045<p>points to the text node containing the chapter title "The Linux 1046adventure".</p> 1047 1048<p><strong>NOTE</strong>: XML allows <em>PI</em>s and <em>comments</em> to be 1049present before the document root, so <code>doc->children</code> may point 1050to a node which is not the document root element; the function 1051<code>xmlDocGetRootElement()</code> was added to retrieve the root element.</p> 1052 1053<h3><a name="Modifying">Modifying the tree</a></h3> 1054 1055<p>Functions are provided for reading and writing the document content. Here 1056is an excerpt from the <a href="html/libxml-tree.html">tree API</a>:</p> 1057<dl> 1058 <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const 1059 xmlChar *value);</code></dt> 1060 <dd><p>This sets (or changes) an attribute carried by an ELEMENT node. The 1061 value can be NULL.</p> 1062 </dd> 1063</dl> 1064<dl> 1065 <dt><code>xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar 1066 *name);</code></dt> 1067 <dd><p>This function returns a pointer to a new copy of the property 1068 content. Note that the user must deallocate the result.</p> 1069 </dd> 1070</dl> 1071 1072<p>Two functions are provided for reading and writing the text associated with 1073elements:</p> 1074<dl> 1075 <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar 1076 *value);</code></dt> 1077 <dd><p>This function takes an "external" string and converts it to one 1078 text node or possibly to a list of entity and text nodes. 
All 1079 non-predefined entity references like &Gnome; will be stored 1080 internally as entity nodes, hence the result of the function may not be 1081 a single node.</p> 1082 </dd> 1083</dl> 1084<dl> 1085 <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int 1086 inLine);</code></dt> 1087 <dd><p>This function is the inverse of 1088 <code>xmlStringGetNodeList()</code>. It generates a new string 1089 containing the content of the text and entity nodes. Note the extra 1090 argument inLine. If this argument is set to 1, the function will expand 1091 entity references. For example, instead of returning the &Gnome; 1092 XML encoding in the string, it will substitute it with its value (say, 1093 "GNU Network Object Model Environment").</p> 1094 </dd> 1095</dl> 1096 1097<h3><a name="Saving">Saving a tree</a></h3> 1098 1099<p>Basically three options are possible:</p> 1100<dl> 1101 <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar **mem, int 1102 *size);</code></dt> 1103 <dd><p>Saves the document into a newly allocated buffer, returned through <code>mem</code> with its length in <code>size</code>.</p> 1104 </dd> 1105</dl> 1106<dl> 1107 <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt> 1108 <dd><p>Dumps a document to an open <code>FILE</code> stream.</p> 1109 </dd> 1110</dl> 1111<dl> 1112 <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt> 1113 <dd><p>Saves the document to a file. In this case, the compression 1114 interface is triggered if it has been turned on.</p> 1115 </dd> 1116</dl> 1117 1118<h3><a name="Compressio">Compression</a></h3> 1119 1120<p>The library transparently handles compression when doing file-based 1121access. 
The level of compression on saves can be turned on either globally 1122or individually for one file:</p> 1123<dl> 1124 <dt><code>int xmlGetDocCompressMode (xmlDocPtr doc);</code></dt> 1125 <dd><p>Gets the document compression ratio (0-9).</p> 1126 </dd> 1127</dl> 1128<dl> 1129 <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt> 1130 <dd><p>Sets the document compression ratio.</p> 1131 </dd> 1132</dl> 1133<dl> 1134 <dt><code>int xmlGetCompressMode(void);</code></dt> 1135 <dd><p>Gets the default compression ratio.</p> 1136 </dd> 1137</dl> 1138<dl> 1139 <dt><code>void xmlSetCompressMode(int mode);</code></dt> 1140 <dd><p>Sets the default compression ratio.</p> 1141 </dd> 1142</dl> 1143 1144<h2><a name="Entities">Entities or no entities</a></h2> 1145 1146<p>Entities in principle are similar to simple C macros. An entity defines an 1147abbreviation for a given string that you can reuse many times throughout the 1148content of your document. Entities are especially useful when a given string 1149may occur frequently within a document, or to confine the change needed to a 1150document to a restricted area in the internal subset of the document (at the 1151beginning). Example:</p> 1152<pre>1 <?xml version="1.0"?> 11532 <!DOCTYPE EXAMPLE SYSTEM "example.dtd" [ 11543 <!ENTITY xml "Extensible Markup Language"> 11554 ]> 11565 <EXAMPLE> 11576 &xml; 11587 </EXAMPLE></pre> 1159 1160<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing 1161its name with '&' and following it by ';' without any spaces added. 
There 1162are 5 predefined entities in libxml allowing you to escape characters with 1163predefined meaning in some parts of the XML document content: 1164<strong>&lt;</strong> for the character '<', <strong>&gt;</strong> 1165for the character '>', <strong>&apos;</strong> for the character ''', 1166<strong>&quot;</strong> for the character '"', and 1167<strong>&amp;</strong> for the character '&'.</p> 1168 1169<p>One of the problems related to entities is that you may want the parser to 1170substitute an entity's content so that you can see the replacement text in 1171your application. Or you may prefer to keep entity references as such in the 1172content to be able to save the document back without losing this usually 1173precious information (if the user went through the pain of explicitly defining 1174entities, he may have a rather negative attitude if you blindly substitute 1175them at save time). The <a 1176href="html/libxml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a> 1177function allows you to check and change the behaviour, which is to not 1178substitute entities by default.</p> 1179 1180<p>Here is the DOM tree built by libxml for the previous document in the 1181default case:</p> 1182<pre>/gnome/src/gnome-xml -> /xmllint --debug test/ent1 1183DOCUMENT 1184version=1.0 1185 ELEMENT EXAMPLE 1186 TEXT 1187 content= 1188 ENTITY_REF 1189 INTERNAL_GENERAL_ENTITY xml 1190 content=Extensible Markup Language 1191 TEXT 1192 content=</pre> 1193 1194<p>And here is the result when substituting entities:</p> 1195<pre>/gnome/src/gnome-xml -> /tester --debug --noent test/ent1 1196DOCUMENT 1197version=1.0 1198 ELEMENT EXAMPLE 1199 TEXT 1200 content= Extensible Markup Language</pre> 1201 1202<p>So, entities or no entities? Basically, it depends on your use case. 
I 1203suggest that you keep the non-substituting default behaviour and avoid using 1204entities in your XML document or data if you are not willing to handle the 1205entity reference elements in the DOM tree.</p> 1206 1207<p>Note that at save time libxml enforces the conversion of the predefined 1208entities where necessary to prevent well-formedness problems, and will also 1209transparently replace those with chars (i.e. it will not generate entity 1210reference elements in the DOM tree or call the reference() SAX callback when 1211finding them in the input).</p> 1212 1213<p><span style="background-color: #FF0000">WARNING</span>: handling entities 1214on top of the libxml SAX interface is difficult!!! If you plan to use 1215non-predefined entities in your documents, then the learning curve to handle 1216them using the SAX API may be long. If you plan to use complex documents, I 1217strongly suggest you consider using the DOM interface instead and let libxml 1218deal with the complexity rather than trying to do it yourself.</p> 1219 1220<h2><a name="Namespaces">Namespaces</a></h2> 1221 1222<p>The libxml library implements <a 1223href="http://www.w3.org/TR/REC-xml-names/">XML namespaces</a> support by 1224recognizing namespace constructs in the input, and does namespace lookup 1225automatically when building the DOM tree. A namespace declaration is 1226associated with an in-memory structure and all elements or attributes within 1227that namespace point to it. Hence testing the namespace is a simple and fast 1228equality operation at the user level.</p> 1229 1230<p>I suggest that people using libxml use a namespace, and declare it in the 1231root element of their document as the default namespace. Then they don't need 1232to use the prefix in the content but we will have a basis for future semantic 1233refinement and merging of data from different sources. 
This doesn't increase 1234the size of the XML output significantly, but significantly increases its 1235value in the long term. Example:</p> 1236<pre><mydoc xmlns="http://mydoc.example.org/schemas/"> 1237 <elem1>...</elem1> 1238 <elem2>...</elem2> 1239</mydoc></pre> 1240 1241<p>The namespace value has to be an absolute URL, but the URL doesn't have to 1242point to any existing resource on the Web. It will bind all the elements and 1243attributes with that URL. I suggest using a URL within a domain you control, 1244and including some kind of version information in the URL if possible. 1245For example, <code>"http://www.gnome.org/gnumeric/1.0/"</code> is a good 1246namespace scheme.</p> 1247 1248<p>Then when you load a file, make sure that a namespace carrying the 1249version-independent prefix is installed on the root element of your document, 1250and if the version information doesn't match something you know, warn the user 1251and be liberal in what you accept as the input. Also do *not* try to base 1252namespace checking on the prefix value. <foo:text> may be exactly the 1253same as <bar:text> in another document. What really matters is the URI 1254associated with the element or the attribute, not the prefix string (which is 1255just a shortcut for the full URI). In libxml, elements and attributes have an 1256<code>ns</code> field pointing to an xmlNs structure detailing the namespace 1257prefix and its URI.</p> 1258 1259<p>@@Interfaces@@</p> 1260 1261<p>@@Examples@@</p> 1262 1263<p>Usually people object to using namespaces together with validity checking. 1264I will try to make sure that using namespaces won't break validity checking, 1265so even if you plan to use or currently are using validation I strongly 1266suggest adding namespaces to your document. A default namespace scheme 1267<code>xmlns="http://...."</code> should not break validity even on less 1268flexible parsers. 
Using namespaces to mix and differentiate content coming 1269from multiple DTDs will certainly break current validation schemes. I will try 1270to provide ways to do this, but this may not be portable or standardized.</p> 1271 1272<h2><a name="Validation">Validation, or are you afraid of DTDs?</a></h2> 1273 1274<p>Well, what is validation and what is a DTD?</p> 1275 1276<p>Validation is the process of checking a document against a set of 1277construction rules; a <strong>DTD</strong> (Document Type Definition) is such 1278a set of rules.</p> 1279 1280<p>The validation process and building DTDs are the two most difficult parts 1281of the XML life cycle. Briefly, a DTD defines all the possible elements to be 1282found within your document and the formal shape of your document tree (by 1283defining the allowed content of an element: either text, a regular expression 1284for the allowed list of children, or mixed content, i.e. both text and 1285children). The DTD also defines the allowed attributes for all elements and 1286the types of the attributes. For more detailed information, I suggest that you 1287read the related parts of the XML specification, the examples found under 1288gnome-xml/test/valid/dtd and any of the large number of books available on 1289XML. The dia example in gnome-xml/test/valid should be both simple and 1290complete enough to allow you to build your own.</p> 1291 1292<p>A word of warning: building a good DTD which will fit the needs of your 1293application in the long term is far from trivial; however, the extra level of 1294quality it can ensure is well worth the price for some sets of applications or 1295if you already have a DTD defined for your application field.</p> 1296 1297<p>The validation is not completely finished but in a (very IMHO) usable 1298state. 
Until a real validation interface is defined, the way to do it is to 1299declare and set the <strong>xmlDoValidityCheckingDefaultValue</strong> external 1300variable to 1; this will of course be changed at some point:</p> 1301 1302<p>extern int xmlDoValidityCheckingDefaultValue;</p> 1303 1304<p>...</p> 1305 1306<p>xmlDoValidityCheckingDefaultValue = 1;</p> 1307 1308<p></p> 1309 1310<p>To handle external entities, use the function 1311<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to 1312link your HTTP/FTP/Entities database library into the standard libxml 1313core.</p> 1314 1315<p>@@interfaces@@</p> 1316 1317<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2> 1318 1319<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object 1320Model</em>; this is an API for accessing XML or HTML structured documents. 1321Native support for DOM in Gnome is on the way (module gnome-dom), and will be 1322based on gnome-xml. This will be a far cleaner interface to manipulate XML 1323files within Gnome since it won't expose the internal structure.</p> 1324 1325<p>The current DOM implementation on top of libxml is the <a 1326href="http://cvs.gnome.org/lxr/source/gdome2/">gdome2 Gnome module</a>. This 1327is a full DOM interface, thanks to Paolo Casarini; check the <a 1328href="http://www.cs.unibo.it/~casarini/gdome2/">Gdome2 homepage</a> for more 1329information.</p> 1330 1331<p>The gnome-dom and gdome modules in the Gnome CVS base are obsolete.</p> 1332 1333<h2><a name="Example"></a><a name="real">A real example</a></h2> 1334 1335<p>Here is a real-size example, where the actual content of the application 1336data is not kept in the DOM tree but uses internal structures. It is based on 1337a proposal to keep a database of jobs related to Gnome, with an XML-based 1338storage structure. 
Here is an <a href="gjobs.xml">XML encoded jobs 1339base</a>:</p> 1340<pre><?xml version="1.0"?> 1341<gjob:Helping xmlns:gjob="http://www.gnome.org/some-location"> 1342 <gjob:Jobs> 1343 1344 <gjob:Job> 1345 <gjob:Project ID="3"/> 1346 <gjob:Application>GBackup</gjob:Application> 1347 <gjob:Category>Development</gjob:Category> 1348 1349 <gjob:Update> 1350 <gjob:Status>Open</gjob:Status> 1351 <gjob:Modified>Mon, 07 Jun 1999 20:27:45 -0400 MET DST</gjob:Modified> 1352 <gjob:Salary>USD 0.00</gjob:Salary> 1353 </gjob:Update> 1354 1355 <gjob:Developers> 1356 <gjob:Developer> 1357 </gjob:Developer> 1358 </gjob:Developers> 1359 1360 <gjob:Contact> 1361 <gjob:Person>Nathan Clemons</gjob:Person> 1362 <gjob:Email>nathan@windsofstorm.net</gjob:Email> 1363 <gjob:Company> 1364 </gjob:Company> 1365 <gjob:Organisation> 1366 </gjob:Organisation> 1367 <gjob:Webpage> 1368 </gjob:Webpage> 1369 <gjob:Snailmail> 1370 </gjob:Snailmail> 1371 <gjob:Phone> 1372 </gjob:Phone> 1373 </gjob:Contact> 1374 1375 <gjob:Requirements> 1376 The program should be released as free software, under the GPL. 1377 </gjob:Requirements> 1378 1379 <gjob:Skills> 1380 </gjob:Skills> 1381 1382 <gjob:Details> 1383 A GNOME based system that will allow a superuser to configure 1384 compressed and uncompressed files and/or file systems to be backed 1385 up with a supported media in the system. This should be able to 1386 perform via find commands generating a list of files that are passed 1387 to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine 1388 or via operations performed on the filesystem itself. Email 1389 notification and GUI status display very important. 
1390 </gjob:Details> 1391 1392 </gjob:Job> 1393 1394 </gjob:Jobs> 1395</gjob:Helping></pre> 1396 1397<p>While loading the XML file into an internal DOM tree is a matter of calling 1398only a couple of functions, browsing the tree to gather the data and generate 1399the internal structures is harder, and more error-prone.</p> 1400 1401<p>The suggested principle is to be tolerant with respect to the input 1402structure. For example, the ordering of the attributes is not significant; the 1403XML specification is clear about it. It's also usually a good idea not to 1404depend on the order of the children of a given node, unless it really makes 1405things harder. Here is some code to parse the information for a person:</p> 1406<pre>/* 1407 * A person record 1408 */ 1409typedef struct person { 1410 char *name; 1411 char *email; 1412 char *company; 1413 char *organisation; 1414 char *smail; 1415 char *webPage; 1416 char *phone; 1417} person, *personPtr; 1418 1419/* 1420 * And the code needed to parse it 1421 */ 1422personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) { 1423 personPtr ret = NULL; 1424 1425DEBUG("parsePerson\n"); 1426 /* 1427 * allocate the struct 1428 */ 1429 ret = (personPtr) malloc(sizeof(person)); 1430 if (ret == NULL) { 1431 fprintf(stderr,"out of memory\n"); 1432 return(NULL); 1433 } 1434 memset(ret, 0, sizeof(person)); 1435 1436 /* We don't care what the top level element name is */ 1437 cur = cur->xmlChildrenNode; 1438 while (cur != NULL) { 1439 if ((!strcmp(cur->name, "Person")) && (cur->ns == ns)) 1440 ret->name = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1); 1441 if ((!strcmp(cur->name, "Email")) && (cur->ns == ns)) 1442 ret->email = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1); 1443 cur = cur->next; 1444 } 1445 1446 return(ret); 1447}</pre> 1448 1449<p>Here are a couple of things to notice:</p> 1450<ul> 1451 <li>Usually a recursive parsing style is the more convenient one: XML data 1452 is by nature subject to repetitive 
constructs and usually exhibits highly 1453 structured patterns.</li> 1454 <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em>, i.e. 1455 the pointer to the global XML document and the namespace reserved to the 1456 application. Document-wide information is needed, for example, to decode 1457 entities, and it's good coding practice to define a namespace for your 1458 application's set of data and test that the elements and attributes you're 1459 analyzing actually pertain to your application space. This is done by a 1460 simple equality test (cur->ns == ns).</li> 1461 <li>To retrieve text and attribute values, you can use the function 1462 <em>xmlNodeListGetString</em> to gather all the text and entity reference 1463 nodes generated by the DOM output and produce a single text string.</li> 1464</ul> 1465 1466<p>Here is another piece of code used to parse another level of the 1467structure:</p> 1468<pre>#include <libxml/tree.h> 1469/* 1470 * a Description for a Job 1471 */ 1472typedef struct job { 1473 char *projectID; 1474 char *application; 1475 char *category; 1476 personPtr contact; 1477 int nbDevelopers; 1478 personPtr developers[100]; /* using dynamic alloc is left as an exercise */ 1479} job, *jobPtr; 1480 1481/* 1482 * And the code needed to parse it 1483 */ 1484jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) { 1485 jobPtr ret = NULL; 1486 1487DEBUG("parseJob\n"); 1488 /* 1489 * allocate the struct 1490 */ 1491 ret = (jobPtr) malloc(sizeof(job)); 1492 if (ret == NULL) { 1493 fprintf(stderr,"out of memory\n"); 1494 return(NULL); 1495 } 1496 memset(ret, 0, sizeof(job)); 1497 1498 /* We don't care what the top level element name is */ 1499 cur = cur->xmlChildrenNode; 1500 while (cur != NULL) { 1501 1502 if ((!strcmp(cur->name, "Project")) && (cur->ns == ns)) { 1503 ret->projectID = xmlGetProp(cur, "ID"); 1504 if (ret->projectID == NULL) { 1505 fprintf(stderr, "Project has no ID\n"); 1506 } 1507 } 1508 if ((!strcmp(cur->name, 
"Application")) && (cur->ns == ns)) 1509 ret->application = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1); 1510 if ((!strcmp(cur->name, "Category")) && (cur->ns == ns)) 1511 ret->category = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1); 1512 if ((!strcmp(cur->name, "Contact")) && (cur->ns == ns)) 1513 ret->contact = parsePerson(doc, ns, cur); 1514 cur = cur->next; 1515 } 1516 1517 return(ret); 1518}</pre> 1519 1520<p>Once you are used to it, writing this kind of code is quite simple, but 1521boring. Ultimately, it could be possible to write stubbers taking either C data 1522structure definitions, a set of XML examples or an XML DTD and producing the 1523code needed to import and export the content between C data and XML storage. 1524This is left as an exercise to the reader :-)</p> 1525 1526<p>Feel free to use <a href="example/gjobread.c">the code for the full C 1527parsing example</a> as a template; it is also available with a Makefile in the 1528Gnome CVS base under gnome-xml/example.</p> 1529 1530<h2><a name="Contributi">Contributions</a></h2> 1531<ul> 1532 <li><a href="mailto:ari@lusis.org">Ari Johnson</a> provides a C++ wrapper 1533 for libxml: 1534 <p>Website: <a 1535 href="http://lusis.org/~ari/xml++/">http://lusis.org/~ari/xml++/</a></p> 1536 <p>Download: <a 1537 href="http://lusis.org/~ari/xml++/libxml++.tar.gz">http://lusis.org/~ari/xml++/libxml++.tar.gz</a></p> 1538 </li> 1539 <li><a href="mailto:doolin@cs.utk.edu">David Doolin</a> provides a 1540 precompiled Windows version 1541 <p><a 1542 href="http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/">http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/</a></p> 1543 </li> 1544 <li><a 1545 href="http://mail.gnome.org/archives/xml/2001-March/msg00014.html">Matt 1546 Sergeant</a> developed <a 1547 href="http://axkit.org/download/">XML::LibXSLT</a>, a Perl wrapper for 1548 libxml2/libxslt as part of the <a href="http://axkit.com/">AxKit XML 1549 application server</a></li> 1550 <li><a 
href="mailto:fnatter@gmx.net">Felix Natter</a> provided <a 1551 href="libxml-doc.el">an emacs module</a> to lookup libxml functions 1552 documentation</li> 1553 <li><a href="mailto:sherwin@nlm.nih.gov">Ziying Sherwin</a> provided <a 1554 href="http://xmlsoft.org/messages/0488.html">man pages</a> (not yet 1555 integrated in the distribution)</li> 1556</ul> 1557 1558<p></p> 1559 1560<p><a href="mailto:Daniel.Veillard@imag.fr">Daniel Veillard</a></p> 1561 1562<p>$Id: xml.html,v 1.85 2001/06/01 10:11:57 veillard Exp $</p> 1563</body> 1564</html> 1565