xml.html revision 29a11cc696655f9ac841a5ca28b272e4150aafa1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
  <title>The XML C library for Gnome</title>
  <meta name="GENERATOR" content="amaya V3.2.1">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<p><a href="http://www.gnome.org/"><img src="smallfootonly.gif"
alt="Gnome Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png"
alt="W3C Logo"></a></p>

<h1 align="center">The XML C library for Gnome</h1>

<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2>

<p></p>
<ul>
  <li><a href="#Introducti">Introduction</a></li>
  <li><a href="#Documentat">Documentation</a></li>
  <li><a href="#Reporting">Reporting bugs and getting help</a></li>
  <li><a href="#help">How to help</a></li>
  <li><a href="#Downloads">Downloads</a></li>
  <li><a href="#News">News</a></li>
  <li><a href="#XML">XML</a></li>
  <li><a href="#tree">The tree output</a></li>
  <li><a href="#interface">The SAX interface</a></li>
  <li><a href="#library">The XML library interfaces</a>
    <ul>
      <li><a href="#Invoking">Invoking the parser: the pull method</a></li>
      <li><a href="#Invoking1">Invoking the parser: the push method</a></li>
      <li><a href="#Invoking2">Invoking the parser: the SAX interface</a></li>
      <li><a href="#Building">Building a tree from scratch</a></li>
      <li><a href="#Traversing">Traversing the tree</a></li>
      <li><a href="#Modifying">Modifying the tree</a></li>
      <li><a href="#Saving">Saving the tree</a></li>
      <li><a href="#Compressio">Compression</a></li>
    </ul>
  </li>
  <li><a href="#Entities">Entities or no entities</a></li>
  <li><a href="#Namespaces">Namespaces</a></li>
  <li><a href="#Validation">Validation</a></li>
  <li><a href="#Principles">DOM principles</a></li>
  <li><a href="#real">A real example</a></li>
  <li><a href="#Contributi">Contributions</a></li>
</ul>

<p>Separate documents:</p>
<ul>
  <li><a href="upgrade.html">upgrade instructions for migrating to
    libxml2</a></li>
  <li><a href="encoding.html">libxml Internationalization support</a></li>
  <li><a href="xmlio.html">libxml Input/Output interfaces</a></li>
  <li><a href="xmlmem.html">libxml Memory interfaces</a></li>
</ul>

<h2><a name="Introducti">Introduction</a></h2>

<p>This document describes libxml, the <a
href="http://www.w3.org/XML/">XML</a> C library developed for the <a
href="http://www.gnome.org/">Gnome</a> project. <a
href="http://www.w3.org/XML/">XML is a standard</a> for building tag-based
structured documents/data.</p>

<p>Here are some key points about libxml:</p>
<ul>
  <li>Libxml exports Push and Pull type parser interfaces for both XML and
    HTML.</li>
  <li>Libxml can do DTD validation at parse time, using a parsed document
    instance, or with an arbitrary DTD.</li>
  <li>Libxml now includes nearly complete <a
    href="http://www.w3.org/TR/xpath">XPath</a> and <a
    href="http://www.w3.org/TR/xptr">XPointer</a> implementations.</li>
  <li>It is written in plain C, making as few assumptions as possible, and
    sticking closely to ANSI C/POSIX for easy embedding.
    Works on Linux/Unix/Windows, and has been ported to a number of other
    platforms.</li>
  <li>Basic HTTP and FTP client support for fetching remote
    resources.</li>
  <li>The design is modular; most of the extensions can be compiled out.</li>
  <li>The internal document representation is as close as possible to the <a
    href="http://www.w3.org/DOM/">DOM</a> interfaces.</li>
  <li>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX
    like interface</a>; the interface is designed to be compatible with <a
    href="http://www.jclark.com/xml/expat.html">Expat</a>.</li>
  <li>This library is released both under the <a
    href="http://www.w3.org/Consortium/Legal/copyright-software-19980720.html">W3C
    IPR</a> and the <a href="http://www.gnu.org/copyleft/lesser.html">GNU
    LGPL</a>. Use either at your convenience; basically this should make
    everybody happy. If not, drop me a mail.</li>
</ul>

<h2><a name="Documentat">Documentation</a></h2>

<p>There are some on-line resources about using libxml:</p>
<ol>
  <li>Check the <a href="FAQ.html">FAQ</a></li>
  <li>Check the <a href="http://xmlsoft.org/html/libxml-lib.html">extensive
    documentation</a> automatically extracted from code comments (using <a
    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gtk-doc">gtk
    doc</a>).</li>
  <li>Look at the documentation about <a href="encoding.html">libxml
    internationalization support</a></li>
  <li>This page provides a global overview and <a href="#real">some
    examples</a> on how to use libxml.</li>
  <li><a href="mailto:james@daa.com.au">James Henstridge</a> wrote <a
    href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">some nice
    documentation</a> explaining how to use the libxml SAX interface.</li>
  <li>George Lebl wrote <a
    href="http://www-4.ibm.com/software/developer/library/gnome3/">an article
    for IBM developerWorks</a> about using libxml.</li>
  <li>It is also a good
    idea to check <a href="mailto:raph@levien.com">Raph
    Levien</a>'s <a href="http://levien.com/gnome/">web site</a> since he is
    building the <a href="http://levien.com/gnome/gdome.html">DOM interface
    gdome</a> on top of the libxml result tree and an implementation of <a
    href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a
    href="http://www.levien.com/svg/">gill</a>. Check his <a
    href="http://www.levien.com/gnome/domination.html">DOMination
    paper</a>.</li>
  <li>Check <a href="http://cvs.gnome.org/lxr/source/gnome-xml/TODO">the TODO
    file</a></li>
  <li>Read the <a href="upgrade.html">1.x to 2.x upgrade path</a>. If you are
    starting a new project using libxml you should really use the 2.x
    version.</li>
  <li>And don't forget to look at the <a href="/messages/">mailing-list
    archive</a>.</li>
</ol>

<h2><a name="Reporting">Reporting bugs and getting help</a></h2>

<p>Well, bugs or missing features are always possible, and I will make a point
of fixing them in a timely fashion. The best way to report a bug is to use the
<a href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug tracking
database</a> (make sure to use the "gnome-xml" module name, not libxml or
libxml2). I look at reports there regularly and it's good to have a reminder
when a bug is still open. Check the <a
href="http://bugs.gnome.org/Reporting.html">instructions on reporting bugs</a>
and be sure to specify that the bug is for the package gnome-xml.</p>

<p>There is also a mailing-list <a
href="mailto:xml@rpmfind.net">xml@rpmfind.net</a> for libxml, with an <a
href="http://xmlsoft.org/messages">on-line archive</a>.
To subscribe to this
majordomo based list, send a mail message to <a
href="mailto:majordomo@rpmfind.net">majordomo@rpmfind.net</a> with "subscribe
xml" in the <strong>content</strong> of the message.</p>

<p>Alternatively, you can just send the bug to the <a
href="mailto:xml@rpmfind.net">xml@rpmfind.net</a> list; if it's really libxml
related I will approve it.</p>

<p>Of course, bug reports with a suggested patch for fixing them will
probably be processed faster.</p>

<p>If you're looking for help, a quick look at <a
href="http://xmlsoft.org/messages/#407">the list archive</a> may actually
provide the answer; I usually send source samples when answering libxml usage
questions. The <a href="http://xmlsoft.org/html/book1.html">auto-generated
documentation</a> is not as polished as I would like (I need to learn more
about Docbook), but it's a good starting point.</p>

<h2><a name="help">How to help</a></h2>

<p>You can help the project in various ways. The best thing to do first is to
subscribe to the mailing-list as explained before, and check the <a
href="http://xmlsoft.org/messages/">archives</a> and the <a
href="http://bugs.gnome.org/db/pa/lgnome-xml.html">Gnome bug
database</a>:</p>
<ol>
  <li>provide patches when you find problems</li>
  <li>provide the diffs when you port libxml to a new platform. They may not
    be integrated in all cases but help pinpoint portability
    problems</li>
  <li>provide documentation fixes (either as patches to the code comments or
    as HTML diffs)</li>
  <li>provide new documentation pieces (translations, examples, etc.)</li>
  <li>check the TODO file and try to close one of the items</li>
  <li>take one of the points raised in the archive or the bug database and
    provide a fix.
    <a href="mailto:Daniel.Veillard@w3.org">Get in touch with
    me</a> beforehand to avoid synchronization problems and to check that the
    suggested fix will fit in nicely :-)</li>
</ol>

<h2><a name="Downloads">Downloads</a></h2>

<p>The latest versions of libxml can be found on <a
href="ftp://rpmfind.net/pub/libxml/">rpmfind.net</a> or on the <a
href="ftp://ftp.gnome.org/pub/GNOME/MIRRORS.html">Gnome FTP server</a> either
as a <a href="ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/">source
archive</a> or <a
href="ftp://ftp.gnome.org/pub/GNOME/contrib/redhat/SRPMS/">RPM packages</a>.
(NOTE that you need both the <a
href="http://rpmfind.net/linux/RPM/libxml2.html">libxml(2)</a> and <a
href="http://rpmfind.net/linux/RPM/libxml2-devel.html">libxml(2)-devel</a>
packages installed to compile applications using libxml.)</p>

<p><a name="Snapshot">Snapshot:</a></p>
<ul>
  <li>Code from the W3C cvs base libxml <a
    href="ftp://rpmfind.net/pub/libxml/cvs-snapshot.tar.gz">cvs-snapshot.tar.gz</a></li>
  <li>Docs, content of the web site, the list archive included <a
    href="ftp://rpmfind.net/pub/libxml/libxml-docs.tar.gz">libxml-docs.tar.gz</a></li>
</ul>

<p><a name="Contribs">Contribs:</a></p>

<p>I do accept external contributions, especially if compiling on another
platform; get in touch with me to upload the package.
I will keep them in the
<a href="ftp://rpmfind.net/pub/libxml/contribs/">contrib directory</a>.</p>

<p>Libxml is also available from 2 CVS bases:</p>
<ul>
  <li><p>The <a href="http://dev.w3.org/cvsweb/XML/">W3C CVS base</a>,
    available read-only using the CVS pserver authentication (I tend to use
    this base for my own development, so it's updated more regularly, but the
    content may not be as stable):</p>
    <pre>CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
password: anonymous
module: XML</pre>
  </li>
  <li><p>The <a
    href="http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=gnome-xml">Gnome
    CVS base</a>. Check the <a
    href="http://developer.gnome.org/tools/cvs.html">Gnome CVS Tools</a> page;
    the CVS module is <b>gnome-xml</b>.</p>
  </li>
</ul>

<h2><a name="News">News</a></h2>

<h3>CVS only: check the <a
href="http://cvs.gnome.org/lxr/source/gnome-xml/ChangeLog">Changelog</a> file
for a really accurate description</h3>

<p>Items floating around but not actively worked on; get in touch with me if
you want to test those:</p>
<ul>
  <li>working on HTML and XML links recognition layers</li>
  <li>parsing/import of Docbook SGML docs</li>
</ul>

<h3>2.2.6: Oct 25 2000</h3>
<ul>
  <li>Added a hash table module, migrated a number of internal structures to
    it</li>
  <li>Fixed a posteriori validation problems</li>
  <li>HTTP module cleanups</li>
  <li>HTML parser improvements (tag errors, script/style handling, attribute
    normalization)</li>
  <li>coalescing of adjacent text nodes</li>
  <li>couple of XPath bug fixes, exported the internal API</li>
</ul>

<h3>2.2.5: Oct 15 2000</h3>
<ul>
  <li>XPointer implementation and testsuite</li>
  <li>Lots of XPath fixes, added variable and function registration, more
    tests</li>
  <li>Portability fixes, lots of enhancements toward an easy Windows build and
    release</li>
  <li>Late validation fixes</li>
  <li>Integrated a lot of contributed patches</li>
  <li>added memory management docs</li>
  <li>a performance problem when using large buffers seems fixed</li>
</ul>

<h3>2.2.4: Oct 1 2000</h3>
<ul>
  <li>main XPath problem fixed</li>
  <li>Integrated portability patches for Windows</li>
  <li>Serious bug fixes on the URI and HTML code</li>
</ul>

<h3>2.2.3: Sep 17 2000</h3>
<ul>
  <li>bug fixes</li>
  <li>cleanup of entity handling code</li>
  <li>overall review of all loops in the parsers, all sprintf usage has been
    checked too</li>
  <li>Far better handling of large DTDs. Validating against the Docbook XML
    DTD works smoothly now.</li>
</ul>

<h3>1.8.10: Sep 6 2000</h3>
<ul>
  <li>bug fix release for some Gnome projects</li>
</ul>

<h3>2.2.2: August 12 2000</h3>
<ul>
  <li>mostly bug fixes</li>
  <li>started adding routines to access xml parser context options</li>
</ul>

<h3>2.2.1: July 21 2000</h3>
<ul>
  <li>a pure bug fix release</li>
  <li>fixed an encoding support problem when parsing from a memory block</li>
  <li>fixed a DOCTYPE parsing problem</li>
  <li>removed a bug in the function that allows overriding the memory
    allocation routines</li>
</ul>

<h3>2.2.0: July 14 2000</h3>
<ul>
  <li>applied a lot of portability fixes</li>
  <li>better encoding support/cleanup and saving (content is now always
    encoded in UTF-8)</li>
  <li>the HTML parser now correctly handles encodings</li>
  <li>added xmlHasProp()</li>
  <li>fixed a serious problem with &amp;#38;</li>
  <li>propagated the fix to FTP client</li>
  <li>cleanup, bugfixes, etc ...</li>
  <li>Added a page about <a href="encoding.html">libxml Internationalization
    support</a></li>
</ul>

<h3>1.8.9: July 9 2000</h3>
<ul>
  <li>fixed the spec file; the RPMs should be better</li>
  <li>fixed a serious bug in the FTP
    implementation, released 1.8.9 to solve
    rpmfind users' problem</li>
</ul>

<h3>2.1.1: July 1 2000</h3>
<ul>
  <li>fixes a couple of bugs in the 2.1.0 packaging</li>
  <li>improvements on the HTML parser</li>
</ul>

<h3>2.1.0 and 1.8.8: June 29 2000</h3>
<ul>
  <li>1.8.8 is mostly a convenience package for upgrading to libxml2 according
    to <a href="upgrade.html">new instructions</a>. It fixes a nasty problem
    about &amp;#38; charref parsing</li>
  <li>2.1.0 also eases the upgrade from libxml v1 to the recent version. It
    also contains numerous fixes and enhancements:
    <ul>
      <li>added xmlStopParser() to stop parsing</li>
      <li>greatly improved parsing speed when there are large CDATA
        blocks</li>
      <li>includes XPath patches provided by Picdar Technology</li>
      <li>tried to fix as many DTD validation and namespace
        related problems as possible</li>
      <li>output to a given encoding has been added/tested</li>
      <li>lots of various fixes</li>
    </ul>
  </li>
</ul>

<h3>2.0.0: Apr 12 2000</h3>
<ul>
  <li>First public release of libxml2. If you are using libxml, it's a good
    idea to check the 1.x to 2.x upgrade instructions. NOTE: while initially
    scheduled for Apr 3 the release occurred only on Apr 12 due to massive
    workload.</li>
  <li>The includes are now located under $prefix/include/libxml (instead of
    $prefix/include/gnome-xml), they also are referenced by
    <pre>#include &lt;libxml/xxx.h&gt;</pre>
    <p>instead of</p>
    <pre>#include "xxx.h"</pre>
  </li>
  <li>a new URI module for parsing URIs and following strictly RFC 2396</li>
  <li>the memory allocation routines used by libxml can now be overloaded
    dynamically by using xmlMemSetup()</li>
  <li>The previously CVS only tool tester has been renamed
    <strong>xmllint</strong> and is now installed as part of the libxml2
    package</li>
  <li>The I/O interface has been revamped.
    There are now ways to plug in
    specific I/O modules, either at the URI scheme detection level using
    xmlRegisterInputCallbacks() or by passing I/O functions when creating a
    parser context using xmlCreateIOParserCtxt()</li>
  <li>there is a C preprocessor macro LIBXML_VERSION providing the version
    number of the libxml module in use</li>
  <li>a number of optional features of libxml can now be excluded at configure
    time (FTP/HTTP/HTML/XPath/Debug)</li>
</ul>

<h3>2.0.0beta: Mar 14 2000</h3>
<ul>
  <li>This is a first Beta release of libxml version 2</li>
  <li>It's available only from <a
    href="ftp://rpmfind.net/pub/libxml/">rpmfind.net FTP</a>, it's packaged as
    libxml2-2.0.0beta and available as tar and RPMs</li>
  <li>This version is now the head in the Gnome CVS base, the old one is
    available under the tag LIB_XML_1_X</li>
  <li>This includes a very large set of changes. From a programmatic point of
    view applications should not have to be modified too much, check the <a
    href="upgrade.html">upgrade page</a></li>
  <li>Some interfaces may change (especially a bit about encoding).</li>
  <li>the updates include:
    <ul>
      <li>fix I18N support. ISO-Latin-x/UTF-8/UTF-16 (nearly) seems correctly
        handled now</li>
      <li>Better handling of entities, especially well formedness checking and
        proper PEref extensions in external subsets</li>
      <li>DTD conditional sections</li>
      <li>Validation now correctly handles entity content</li>
      <li><a href="http://rpmfind.net/tools/gdome/messages/0039.html">change
        structures to accommodate DOM</a></li>
    </ul>
  </li>
  <li>Serious progress was made toward compliance, <a
    href="conf/result.html">here are the results of the tests</a> against the
    OASIS testsuite (except the Japanese tests since I don't support that
    encoding yet).
    This URL is rebuilt every couple of hours using the CVS
    head version.</li>
</ul>

<h3>1.8.7: Mar 6 2000</h3>
<ul>
  <li>This is a bug fix release:</li>
  <li>It is possible to disable the ignorable blanks heuristic used by
    libxml-1.x, a new function xmlKeepBlanksDefault(0) will allow this. Note
    that for adherence to the XML spec, this behaviour will be disabled by
    default in 2.x. The same function will allow keeping compatibility for
    old code.</li>
  <li>Blanks in &lt;a&gt; &lt;/a&gt; constructs are not ignored anymore,
    avoiding heuristics is really the Right Way :-\</li>
  <li>The unchecked use of snprintf which was breaking libxml-1.8.6
    compilation on some platforms has been fixed</li>
  <li>nanoftp.c nanohttp.c: Fixed '#' and '?' stripping when processing
    URIs</li>
</ul>

<h3>1.8.6: Jan 31 2000</h3>
<ul>
  <li>added a nanoFTP transport module, debugged until the new version of <a
    href="http://rpmfind.net/linux/rpm2html/rpmfind.html">rpmfind</a> can use
    it without troubles</li>
</ul>

<h3>1.8.5: Jan 21 2000</h3>
<ul>
  <li>adding APIs to parse a well balanced chunk of XML (production <a
    href="http://www.w3.org/TR/REC-xml#NT-content">[43] content</a> of the XML
    spec)</li>
  <li>fixed a hideous bug in xmlGetProp pointed out by
    Rune.Djurhuus@fast.no</li>
  <li>Jody Goldberg &lt;jgoldberg@home.com&gt; provided another patch trying
    to solve the zlib checks problems</li>
  <li>The current state in gnome CVS base is expected to ship as 1.8.5 with
    gnumeric soon</li>
</ul>

<h3>1.8.4: Jan 13 2000</h3>
<ul>
  <li>bug fixes, reintroduced xmlNewGlobalNs(), fixed xmlNewNs()</li>
  <li>all exit() calls should have been removed from libxml</li>
  <li>fixed a problem with INCLUDE_WINSOCK on WIN32 platform</li>
  <li>added newDocFragment()</li>
</ul>

<h3>1.8.3: Jan 5 2000</h3>
<ul>
  <li>a Push interface for the XML and HTML parsers</li>
  <li>a shell-like
    interface to the document tree (try tester --shell :-)</li>
  <li>lots of bug fixes and improvements added over the Xmas holidays</li>
  <li>fixed the DTD parsing code to work with the xhtml DTD</li>
  <li>added xmlRemoveProp(), xmlRemoveID() and xmlRemoveRef()</li>
  <li>Fixed bugs in xmlNewNs()</li>
  <li>External entity loading code has been revamped, now it uses
    xmlLoadExternalEntity(), some fixes on entities processing were added</li>
  <li>cleaned up WIN32 includes of socket stuff</li>
</ul>

<h3>1.8.2: Dec 21 1999</h3>
<ul>
  <li>I got another problem with includes and C++, I hope this issue is fixed
    for good this time</li>
  <li>Added a few tree modification functions: xmlReplaceNode,
    xmlAddPrevSibling, xmlAddNextSibling, xmlNodeSetName and
    xmlDocSetRootElement</li>
  <li>Tried to improve the HTML output with help from <a
    href="mailto:clahey@umich.edu">Chris Lahey</a></li>
</ul>

<h3>1.8.1: Dec 18 1999</h3>
<ul>
  <li>various patches to avoid troubles when using libxml with C++ compilers:
    the "namespace" keyword and C escaping in include files</li>
  <li>a problem in one of the core macros IS_CHAR was corrected</li>
  <li>fixed a bug introduced in 1.8.0 breaking default namespace processing,
    and more specifically the Dia application</li>
  <li>fixed a posteriori validation (validation after parsing, or by using a
    DTD not specified in the original document)</li>
  <li>fixed a bug in</li>
</ul>

<h3>1.8.0: Dec 12 1999</h3>
<ul>
  <li>cleanup, especially memory wise</li>
  <li>the parser should be more reliable, especially the HTML one, it should
    not crash, whatever the input !</li>
  <li>Integrated various patches, especially a speedup improvement for large
    datasets from <a href="mailto:cnygard@bellatlantic.net">Carl Nygard</a>,
    configure with --with-buffers to enable them.</li>
  <li>attribute normalization, oops should have been added long ago !</li>
  <li>attributes defaulted from DTDs should be available, xmlSetProp() now
    does entities escaping by default.</li>
</ul>

<h3>1.7.4: Oct 25 1999</h3>
<ul>
  <li>Lots of HTML improvements</li>
  <li>Fixed some errors when saving both XML and HTML</li>
  <li>More examples, the regression tests should now look clean</li>
  <li>Fixed a bug with contiguous charref</li>
</ul>

<h3>1.7.3: Sep 29 1999</h3>
<ul>
  <li>portability problems fixed</li>
  <li>snprintf was used unconditionally, leading to link problems on systems
    where it's not available, fixed</li>
</ul>

<h3>1.7.1: Sep 24 1999</h3>
<ul>
  <li>The basic type for strings manipulated by libxml has been renamed in
    1.7.1 from <strong>CHAR</strong> to <strong>xmlChar</strong>. The reason
    is that CHAR was conflicting with a predefined type on Windows. However on
    non WIN32 environments, compatibility is provided by way of a
    <strong>#define</strong>.</li>
  <li>Fixed another error: the use of a structure field called errno,
    leading to trouble on platforms where it's a macro</li>
</ul>

<h3>1.7.0: Sep 23 1999</h3>
<ul>
  <li>Added the ability to fetch remote DTD or parsed entities, see the <a
    href="html/gnome-xml-nanohttp.html">nanohttp</a> module.</li>
  <li>Added an errno to report errors by another means than a simple printf
    like callback</li>
  <li>Finished ID/IDREF support and checking when validating</li>
  <li>Serious memory leaks fixed (there is now a <a
    href="html/gnome-xml-xmlmemory.html">memory wrapper</a> module)</li>
  <li>Improvement of <a href="http://www.w3.org/TR/xpath">XPath</a>
    implementation</li>
  <li>Added an HTML parser front-end</li>
</ul>

<h2><a name="XML">XML</a></h2>

<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for
markup-based structured documents.
Here is <a name="example">an example XML
document</a>:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;EXAMPLE prop1="gnome is great" prop2="&amp;amp; linux too"&gt;
  &lt;head&gt;
   &lt;title&gt;Welcome to Gnome&lt;/title&gt;
  &lt;/head&gt;
  &lt;chapter&gt;
   &lt;title&gt;The Linux adventure&lt;/title&gt;
   &lt;p&gt;bla bla bla ...&lt;/p&gt;
   &lt;image href="linus.gif"/&gt;
   &lt;p&gt;...&lt;/p&gt;
  &lt;/chapter&gt;
&lt;/EXAMPLE&gt;</pre>

<p>The first line specifies that it's an XML document and can give useful
information about its encoding. The rest of the document is text whose
structure is specified by tags between angle brackets. <strong>Each tag opened
has to be closed</strong>. XML is pedantic about this. However, if a tag is
empty (no content), a single tag can serve as both the opening and closing tag
if it ends with <code>/&gt;</code> rather than with <code>&gt;</code>. Note
that, for example, the image tag has no content (just an attribute) and is
closed by ending the tag with <code>/&gt;</code>.</p>

<p>XML can be applied successfully to a wide range of uses, from long term
structured document maintenance (where it follows the steps of SGML) to simple
data encoding mechanisms like configuration file formatting (glade),
spreadsheets (gnumeric), or even shorter lived documents such as WebDAV where
it is used to encode remote calls between a client and a server.</p>

<h2>An overview of libxml architecture</h2>

<p>Libxml is made of multiple components, some of them optional, and most of
the block interfaces are public.
The main components are:</p>
<ul>
  <li>an Input/Output layer</li>
  <li>FTP and HTTP client layers (optional)</li>
  <li>an Internationalization layer managing the encodings support</li>
  <li>a URI module</li>
  <li>the XML parser and its basic SAX interface</li>
  <li>an HTML parser using the same SAX interface (optional)</li>
  <li>a SAX tree module to build an in-memory DOM representation</li>
  <li>a tree module to manipulate the DOM representation</li>
  <li>a validation module using the DOM representation (optional)</li>
  <li>an XPath module for global lookup in a DOM representation
    (optional)</li>
  <li>a debug module (optional)</li>
</ul>

<p>Graphically this gives the following:</p>

<p><img src="libxml.gif" alt="a graphical view of the various"></p>

<p></p>

<h2><a name="tree">The tree output</a></h2>

<p>The parser returns a tree built during the document analysis. The value
returned is an <strong>xmlDocPtr</strong> (i.e., a pointer to an
<strong>xmlDoc</strong> structure). This structure contains information such
as the file name, the document type, and a <strong>children</strong> pointer
which is the root of the document (or more exactly the first child under the
root which is the document). The tree is made of <strong>xmlNode</strong>s,
chained in doubly-linked lists of siblings and with children&lt;-&gt;parent
relationships. An xmlNode can also carry properties (a chain of xmlAttr
structures). An attribute may have a value which is a list of TEXT or
ENTITY_REF nodes.</p>

<p>Here is an example (erroneous with respect to the XML spec since there
should be only one ELEMENT under the root):</p>

<p><img src="structure.gif" alt=" structure.gif "></p>

<p>In the source package there is a small program (not installed by default)
called <strong>xmllint</strong> which parses XML files given as arguments and
prints them back as parsed.
This is useful for detecting errors both in XML
code and in the XML parser itself. It has an option <strong>--debug</strong>
which prints the actual in-memory structure of the document; here is the
result with the <a href="#example">example</a> given before:</p>
<pre>DOCUMENT
version=1.0
standalone=true
  ELEMENT EXAMPLE
    ATTRIBUTE prop1
      TEXT
      content=gnome is great
    ATTRIBUTE prop2
      ENTITY_REF
      TEXT
      content= linux too
    ELEMENT head
      ELEMENT title
        TEXT
        content=Welcome to Gnome
    ELEMENT chapter
      ELEMENT title
        TEXT
        content=The Linux adventure
      ELEMENT p
        TEXT
        content=bla bla bla ...
      ELEMENT image
        ATTRIBUTE href
          TEXT
          content=linus.gif
      ELEMENT p
        TEXT
        content=...</pre>

<p>This should be useful for learning the internal representation model.</p>

<h2><a name="interface">The SAX interface</a></h2>

<p>Sometimes the DOM tree output is just too large to fit reasonably into
memory. In that case (and if you don't expect to save back the XML document
loaded using libxml), it's better to use the SAX interface of libxml. SAX is a
<strong>callback-based interface</strong> to the parser. Before parsing, the
application layer registers a customized set of callbacks which are called by
the library as it progresses through the XML input.</p>

<p>To get more detailed step-by-step guidance on using the SAX interface of
libxml, see the <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">nice
documentation</a> written by <a href="mailto:james@daa.com.au">James
Henstridge</a>.</p>

<p>You can debug the SAX behaviour by using the <strong>testSAX</strong>
program located in the gnome-xml module (it's usually not shipped in the
binary packages of libxml, but you can find it in the tar source
distribution).
Here is the sequence of callbacks that would be reported by
testSAX when parsing the example XML document shown earlier:</p>
<pre>SAX.setDocumentLocator()
SAX.startDocument()
SAX.getEntity(amp)
SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp;amp; linux too')
SAX.characters( , 3)
SAX.startElement(head)
SAX.characters( , 4)
SAX.startElement(title)
SAX.characters(Welcome to Gnome, 16)
SAX.endElement(title)
SAX.characters( , 3)
SAX.endElement(head)
SAX.characters( , 3)
SAX.startElement(chapter)
SAX.characters( , 4)
SAX.startElement(title)
SAX.characters(The Linux adventure, 19)
SAX.endElement(title)
SAX.characters( , 4)
SAX.startElement(p)
SAX.characters(bla bla bla ..., 15)
SAX.endElement(p)
SAX.characters( , 4)
SAX.startElement(image, href='linus.gif')
SAX.endElement(image)
SAX.characters( , 4)
SAX.startElement(p)
SAX.characters(..., 3)
SAX.endElement(p)
SAX.characters( , 3)
SAX.endElement(chapter)
SAX.characters( , 1)
SAX.endElement(EXAMPLE)
SAX.endDocument()</pre>

<p>Most of the other functionalities of libxml are based on the DOM
tree-building facility, so nearly everything up to the end of this document
presupposes the use of the standard DOM tree build. Note that the DOM tree
itself is built by a set of registered default callbacks, without any
specific internal interface.</p>

<h2><a name="library">The XML library interfaces</a></h2>

<p>This section is directly intended to help programmers get bootstrapped
using the XML library from the C language. It is not intended to be extensive.
I hope the automatically generated documents will provide the completeness
required, but as a separate set of documents. The interfaces of the XML
library are by principle low level; there is nearly zero abstraction.
Those
interested in a higher level API should <a href="#DOM">look at DOM</a>.</p>

<p>The <a href="html/gnome-xml-parser.html">parser interfaces for XML</a> are
separated from the <a href="html/gnome-xml-htmlparser.html">HTML parser
interfaces</a>. Let's have a look at how the XML parser can be called:</p>

<h3><a name="Invoking">Invoking the parser: the pull method</a></h3>

<p>Usually, the first thing to do is to read an XML input. The parser accepts
documents either from in-memory strings or from files. The functions are
defined in "parser.h":</p>
<dl>
  <dt><code>xmlDocPtr xmlParseMemory(char *buffer, int size);</code></dt>
    <dd><p>Parse a null-terminated string containing the document.</p>
    </dd>
</dl>
<dl>
  <dt><code>xmlDocPtr xmlParseFile(const char *filename);</code></dt>
    <dd><p>Parse an XML document contained in a (possibly compressed)
      file.</p>
    </dd>
</dl>

<p>The parser returns a pointer to the document structure (or NULL in case of
failure).</p>

<h3 id="Invoking1">Invoking the parser: the push method</h3>

<p>In order for the application to keep control while the document is being
fetched (which is common for GUI based programs) libxml provides a push
interface, too, as of version 1.8.3.
Here are the interface functions:</p>
<pre>xmlParserCtxtPtr xmlCreatePushParserCtxt(xmlSAXHandlerPtr sax,
                                         void *user_data,
                                         const char *chunk,
                                         int size,
                                         const char *filename);
int              xmlParseChunk          (xmlParserCtxtPtr ctxt,
                                         const char *chunk,
                                         int size,
                                         int terminate);</pre>

<p>and here is a simple example showing how to use the interface:</p>
<pre>            FILE *f;
            xmlDocPtr doc = NULL;

            f = fopen(filename, "r");
            if (f != NULL) {
                int res, size = 1024;
                char chars[1024];
                xmlParserCtxtPtr ctxt;

                /* the first 4 bytes let the parser detect the encoding */
                res = fread(chars, 1, 4, f);
                if (res > 0) {
                    ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                chars, res, filename);
                    /* feed the rest of the file chunk by chunk */
                    while ((res = fread(chars, 1, size, f)) > 0) {
                        xmlParseChunk(ctxt, chars, res, 0);
                    }
                    /* terminate = 1 signals the end of the document */
                    xmlParseChunk(ctxt, chars, 0, 1);
                    doc = ctxt->myDoc;
                    xmlFreeParserCtxt(ctxt);
                }
                fclose(f);
            }</pre>

<p>Also note that the HTML parser embedded into libxml has a push interface
as well; its functions are just prefixed by "html" rather than "xml".</p>

<h3 id="Invoking2">Invoking the parser: the SAX interface</h3>

<p>A couple of comments can be made. First, building a tree means that the
parser is memory-hungry: it has to load the document in memory and then build
the tree. Reading a document without building the tree is possible using the
SAX interfaces (see SAX.h and <a
href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">James
Henstridge's documentation</a>). Note also that the push interface can be
limited to SAX: just use the first two arguments of
<code>xmlCreatePushParserCtxt()</code>.</p>

<h3><a name="Building">Building a tree from scratch</a></h3>

<p>The other way to get an XML tree in memory is by building it. There is a
set of functions dedicated to building new elements. (These are also
described in &lt;libxml/tree.h&gt;.)
For example, here is a piece of code
that produces the XML document used in the previous examples:</p>
<pre>    #include &lt;libxml/tree.h&gt;
    xmlDocPtr doc;
    xmlNodePtr tree, subtree;

    doc = xmlNewDoc("1.0");
    doc->children = xmlNewDocNode(doc, NULL, "EXAMPLE", NULL);
    xmlSetProp(doc->children, "prop1", "gnome is great");
    xmlSetProp(doc->children, "prop2", "& linux too");
    tree = xmlNewChild(doc->children, NULL, "head", NULL);
    subtree = xmlNewChild(tree, NULL, "title", "Welcome to Gnome");
    tree = xmlNewChild(doc->children, NULL, "chapter", NULL);
    subtree = xmlNewChild(tree, NULL, "title", "The Linux adventure");
    subtree = xmlNewChild(tree, NULL, "p", "bla bla bla ...");
    subtree = xmlNewChild(tree, NULL, "image", NULL);
    xmlSetProp(subtree, "href", "linus.gif");</pre>

<p>Not really rocket science ...</p>

<h3><a name="Traversing">Traversing the tree</a></h3>

<p>Basically by <a href="html/gnome-xml-tree.html">including "tree.h"</a> your
code has access to the internal structure of all the elements of the tree. The
field names should be fairly self-explanatory, like <strong>parent</strong>,
<strong>children</strong>, <strong>next</strong>, <strong>prev</strong>,
<strong>properties</strong>, etc.
For example, still with the previous
example:</p>
<pre><code>doc->children->children->children</code></pre>

<p>points to the title element,</p>
<pre>doc->children->children->next->children->children</pre>

<p>points to the text node containing the chapter title "The Linux
adventure".</p>

<p><strong>NOTE</strong>: XML allows <em>PI</em>s and <em>comments</em> to be
present before the document root, so <code>doc->children</code> may point
to a node which is not the document's root element; the function
<code>xmlDocGetRootElement()</code> was added to retrieve it.</p>

<h3><a name="Modifying">Modifying the tree</a></h3>

<p>Functions are provided for reading and writing the document content. Here
is an excerpt from the <a href="html/gnome-xml-tree.html">tree API</a>:</p>
<dl>
  <dt><code>xmlAttrPtr xmlSetProp(xmlNodePtr node, const xmlChar *name, const
  xmlChar *value);</code></dt>
    <dd><p>This sets (or changes) an attribute carried by an ELEMENT node. The
      value can be NULL.</p>
    </dd>
</dl>
<dl>
  <dt><code>xmlChar *xmlGetProp(xmlNodePtr node, const xmlChar
  *name);</code></dt>
    <dd><p>This function returns a pointer to a new copy of the property
      content. Note that the user must deallocate the result.</p>
    </dd>
</dl>

<p>Two functions are provided for reading and writing the text associated with
elements:</p>
<dl>
  <dt><code>xmlNodePtr xmlStringGetNodeList(xmlDocPtr doc, const xmlChar
  *value);</code></dt>
    <dd><p>This function takes an "external" string and converts it to one
      text node or possibly to a list of entity and text nodes.
All non-predefined
      entity references like &amp;Gnome; will be stored internally as entity
      nodes, hence the result of the function may not be a single node.</p>
    </dd>
</dl>
<dl>
  <dt><code>xmlChar *xmlNodeListGetString(xmlDocPtr doc, xmlNodePtr list, int
  inLine);</code></dt>
    <dd><p>This function is the inverse of
      <code>xmlStringGetNodeList()</code>. It generates a new string
      containing the content of the text and entity nodes. Note the extra
      argument inLine. If this argument is set to 1, the function will expand
      entity references. For example, instead of returning the &amp;Gnome;
      XML encoding in the string, it will substitute it with its value (say,
      "GNU Network Object Model Environment"). Set this argument if you want
      to use the string outside of XML, for example in a user interface.</p>
    </dd>
</dl>

<h3><a name="Saving">Saving a tree</a></h3>

<p>Three options are available:</p>
<dl>
  <dt><code>void xmlDocDumpMemory(xmlDocPtr cur, xmlChar**mem, int
  *size);</code></dt>
    <dd><p>Returns a buffer into which the document has been saved.</p>
    </dd>
</dl>
<dl>
  <dt><code>extern void xmlDocDump(FILE *f, xmlDocPtr doc);</code></dt>
    <dd><p>Dumps a document to an open file stream.</p>
    </dd>
</dl>
<dl>
  <dt><code>int xmlSaveFile(const char *filename, xmlDocPtr cur);</code></dt>
    <dd><p>Saves the document to a file. In this case, the compression
      interface is triggered if it has been turned on.</p>
    </dd>
</dl>

<h3><a name="Compressio">Compression</a></h3>

<p>The library transparently handles compression when doing file-based
accesses.
The level of compression on saves can be set either globally
or individually for one file:</p>
<dl>
  <dt><code>int xmlGetDocCompressMode (xmlDocPtr doc);</code></dt>
    <dd><p>Gets the document compression level (0-9).</p>
    </dd>
</dl>
<dl>
  <dt><code>void xmlSetDocCompressMode (xmlDocPtr doc, int mode);</code></dt>
    <dd><p>Sets the document compression level.</p>
    </dd>
</dl>
<dl>
  <dt><code>int xmlGetCompressMode(void);</code></dt>
    <dd><p>Gets the default compression level.</p>
    </dd>
</dl>
<dl>
  <dt><code>void xmlSetCompressMode(int mode);</code></dt>
    <dd><p>Sets the default compression level.</p>
    </dd>
</dl>

<h2><a name="Entities">Entities or no entities</a></h2>

<p>Entities in principle are similar to simple C macros. An entity defines an
abbreviation for a given string that you can reuse many times throughout the
content of your document. Entities are especially useful when a given string
may occur frequently within a document, or to confine the change needed to a
document to a restricted area in the internal subset of the document (at the
beginning). Example:</p>
<pre>1 &lt;?xml version="1.0"?&gt;
2 &lt;!DOCTYPE EXAMPLE SYSTEM "example.dtd" [
3 &lt;!ENTITY xml "Extensible Markup Language"&gt;
4 ]&gt;
5 &lt;EXAMPLE&gt;
6    &amp;xml;
7 &lt;/EXAMPLE&gt;</pre>

<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing
its name with '&' and following it by ';' without any spaces added.
There
are 5 predefined entities in libxml allowing you to escape characters with
predefined meaning in some parts of the xml document content:
<strong>&amp;lt;</strong> for the character '&lt;', <strong>&amp;gt;</strong>
for the character '&gt;', <strong>&amp;apos;</strong> for the character ''',
<strong>&amp;quot;</strong> for the character '"', and
<strong>&amp;amp;</strong> for the character '&'.</p>

<p>One of the problems related to entities is that you may want the parser to
substitute an entity's content so that you can see the replacement text in
your application. Or you may prefer to keep entity references as such in the
content to be able to save the document back without losing this usually
precious information (if the user went through the pain of explicitly defining
entities, they may have a rather negative attitude if you blindly substitute
them at save time). The <a
href="html/gnome-xml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a>
function allows you to check and change the behaviour, which is to not
substitute entities by default.</p>

<p>Here is the DOM tree built by libxml for the previous document in the
default case:</p>
<pre>/gnome/src/gnome-xml -> /xmllint --debug test/ent1
DOCUMENT
version=1.0
  ELEMENT EXAMPLE
    TEXT
    content=
    ENTITY_REF
      INTERNAL_GENERAL_ENTITY xml
      content=Extensible Markup Language
    TEXT
    content=</pre>

<p>And here is the result when substituting entities:</p>
<pre>/gnome/src/gnome-xml -> /tester --debug --noent test/ent1
DOCUMENT
version=1.0
  ELEMENT EXAMPLE
    TEXT
    content= Extensible Markup Language</pre>

<p>So, entities or no entities? Basically, it depends on your use case.
I
suggest that you keep the non-substituting default behaviour and avoid using
entities in your XML document or data if you are not willing to handle the
entity reference elements in the DOM tree.</p>

<p>Note that at save time libxml enforces the conversion of the predefined
entities where necessary to prevent well-formedness problems, and at parse
time it transparently replaces them with the corresponding characters (i.e.,
it will not generate entity reference elements in the DOM tree or call the
reference() SAX callback when finding them in the input).</p>

<p><span style="background-color: #FF0000">WARNING</span>: handling entities
on top of the libxml SAX interface is difficult! If you plan to use
non-predefined entities in your documents, then the learning curve to handle
them using the SAX API may be long. If you plan to use complex documents, I
strongly suggest you consider using the DOM interface instead and let libxml
deal with the complexity rather than trying to do it yourself.</p>

<h2><a name="Namespaces">Namespaces</a></h2>

<p>The libxml library implements <a
href="http://www.w3.org/TR/REC-xml-names/">XML namespaces</a> support by
recognizing namespace constructs in the input, and does namespace lookup
automatically when building the DOM tree. A namespace declaration is
associated with an in-memory structure and all elements or attributes within
that namespace point to it. Hence testing the namespace is a simple and fast
equality operation at the user level.</p>

<p>I suggest that people using libxml use a namespace, and declare it in the
root element of their document as the default namespace. Then they don't need
to use the prefix in the content but we will have a basis for future semantic
refinement and merging of data from different sources.
This doesn't
significantly increase the size of the XML output, but it significantly
increases its value in the long term. Example:</p>
<pre>&lt;mydoc xmlns="http://mydoc.example.org/schemas/"&gt;
   &lt;elem1&gt;...&lt;/elem1&gt;
   &lt;elem2&gt;...&lt;/elem2&gt;
&lt;/mydoc&gt;</pre>

<p>Concerning the namespace value, this has to be a URL, but the URL doesn't
have to point to any existing resource on the Web. It will bind all the
elements and attributes with that URL. I suggest using a URL within a domain
you control, and that the URL should contain some kind of version information
if possible. For example, <code>"http://www.gnome.org/gnumeric/1.0/"</code> is
a good namespace scheme.</p>

<p>Then when you load a file, make sure that a namespace carrying the
version-independent prefix is installed on the root element of your document,
and if the version information doesn't match something you know, warn the user
and be liberal in what you accept as the input. Also do *not* try to base
namespace checking on the prefix value. &lt;foo:text&gt; may be exactly the
same as &lt;bar:text&gt; in another document. What really matters is the URI
associated with the element or the attribute, not the prefix string (which is
just a shortcut for the full URI). In libxml, elements and attributes have an
<code>ns</code> field pointing to an xmlNs structure detailing the namespace
prefix and its URI.</p>

<p>@@Interfaces@@</p>

<p>@@Examples@@</p>

<p>People usually object to using namespaces when validation is involved. I
disagree, and will make sure that using namespaces won't break validity
checking, so even if you plan to use or currently are using validation I
strongly suggest adding namespaces to your document. A default namespace
scheme <code>xmlns="http://...."</code> should not break validity even on less
flexible parsers.
Now using namespaces to mix and differentiate content coming
from multiple DTDs will certainly break current validation schemes. I will try
to provide ways to do this, but this may not be portable or standardized.</p>

<h2><a name="Validation">Validation, or are you afraid of DTDs?</a></h2>

<p>Well, what is validation and what is a DTD?</p>

<p>Validation is the process of checking a document against a set of
construction rules; a <strong>DTD</strong> (Document Type Definition) is such
a set of rules.</p>

<p>The validation process and building DTDs are the two most difficult parts
of the XML life cycle. Briefly, a DTD defines all the possible elements to be
found within your document and the formal shape of your document tree (by
defining the allowed content of an element: either text, a regular expression
for the allowed list of children, or mixed content, i.e. both text and
children). The DTD also defines the allowed attributes for all elements and
the types of the attributes. For more detailed information, I suggest reading
the related parts of the XML specification, the examples found under
gnome-xml/test/valid/dtd and the large number of books available on XML. The
dia example in gnome-xml/test/valid should be both simple and complete enough
to allow you to build your own.</p>

<p>A word of warning: building a good DTD which will fit the needs of your
application in the long term is far from trivial. However, the extra level of
quality it can ensure is well worth the price for some sets of applications or
if you already have a DTD defined for your application field.</p>

<p>The validation is not completely finished but in a (very IMHO) usable
state.
Until a real validation interface is defined, the way to do it is to
declare the <strong>xmlDoValidityCheckingDefaultValue</strong> external
variable and set it to 1; this will of course be changed at some point:</p>

<p><code>extern int xmlDoValidityCheckingDefaultValue;</code></p>

<p>...</p>

<p><code>xmlDoValidityCheckingDefaultValue = 1;</code></p>

<p></p>

<p>To handle external entities, use the function
<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to
link your HTTP/FTP/Entities database library into the standard libxml
core.</p>

<p>@@interfaces@@</p>

<h2><a name="DOM"></a><a name="Principles">DOM Principles</a></h2>

<p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object
Model</em>; this is an API for accessing XML or HTML structured documents.
Native support for DOM in Gnome is on the way (module gnome-dom), and it will
be based on gnome-xml. This will be a far cleaner interface to manipulate XML
files within Gnome since it won't expose the internal structure. DOM defines a
set of IDL (or Java) interfaces for traversing and manipulating a
document.
The DOM library will allow accessing and modifying "live" documents
present in other programs like this:</p>

<p><img src="DOM.gif" alt=" DOM.gif "></p>

<p>This should greatly help with things like modifying a gnumeric spreadsheet
embedded in a GWP document, for example.</p>

<p>The current DOM implementation on top of libxml is the <a
href="http://cvs.gnome.org/lxr/source/gdome/">gdome Gnome module</a>; this is
a full DOM interface, thanks to <a href="mailto:raph@levien.com">Raph
Levien</a>.</p>

<p>The gnome-dom module in the Gnome CVS base is obsolete.</p>

<h2><a name="Example"></a><a name="real">A real example</a></h2>

<p>Here is a real size example, where the actual content of the application
data is not kept in the DOM tree but uses internal structures. It is based on
a proposal to keep a database of jobs related to Gnome, with an XML based
storage structure. Here is an <a href="gjobs.xml">XML encoded jobs
base</a>:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;gjob:Helping xmlns:gjob="http://www.gnome.org/some-location"&gt;
  &lt;gjob:Jobs&gt;

    &lt;gjob:Job&gt;
      &lt;gjob:Project ID="3"/&gt;
      &lt;gjob:Application&gt;GBackup&lt;/gjob:Application&gt;
      &lt;gjob:Category&gt;Development&lt;/gjob:Category&gt;

      &lt;gjob:Update&gt;
        &lt;gjob:Status&gt;Open&lt;/gjob:Status&gt;
        &lt;gjob:Modified&gt;Mon, 07 Jun 1999 20:27:45 -0400 MET DST&lt;/gjob:Modified&gt;
        &lt;gjob:Salary&gt;USD 0.00&lt;/gjob:Salary&gt;
      &lt;/gjob:Update&gt;

      &lt;gjob:Developers&gt;
        &lt;gjob:Developer&gt;
        &lt;/gjob:Developer&gt;
      &lt;/gjob:Developers&gt;

      &lt;gjob:Contact&gt;
        &lt;gjob:Person&gt;Nathan Clemons&lt;/gjob:Person&gt;
        &lt;gjob:Email&gt;nathan@windsofstorm.net&lt;/gjob:Email&gt;
        &lt;gjob:Company&gt;
        &lt;/gjob:Company&gt;
        &lt;gjob:Organisation&gt;
        &lt;/gjob:Organisation&gt;
        &lt;gjob:Webpage&gt;
        &lt;/gjob:Webpage&gt;
        &lt;gjob:Snailmail&gt;
        &lt;/gjob:Snailmail&gt;
        &lt;gjob:Phone&gt;
        &lt;/gjob:Phone&gt;
      &lt;/gjob:Contact&gt;

      &lt;gjob:Requirements&gt;
      The program should
be released as free software, under the GPL.
      &lt;/gjob:Requirements&gt;

      &lt;gjob:Skills&gt;
      &lt;/gjob:Skills&gt;

      &lt;gjob:Details&gt;
      A GNOME based system that will allow a superuser to configure
      compressed and uncompressed files and/or file systems to be backed
      up with a supported media in the system.  This should be able to
      perform via find commands generating a list of files that are passed
      to tar, dd, cpio, cp, gzip, etc., to be directed to the tape machine
      or via operations performed on the filesystem itself. Email
      notification and GUI status display very important.
      &lt;/gjob:Details&gt;

    &lt;/gjob:Job&gt;

  &lt;/gjob:Jobs&gt;
&lt;/gjob:Helping&gt;</pre>

<p>While loading the XML file into an internal DOM tree is a matter of calling
only a couple of functions, browsing the tree to gather the information and
generate the internal structures is harder and more error prone.</p>

<p>The suggested principle is to be tolerant with respect to the input
structure. For example, the ordering of the attributes is not significant;
the XML specification is clear about it. It's also usually a good idea not to
depend on the order of the children of a given node, unless doing so really
makes things harder.
Here is some code to parse the information for a
person:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;libxml/tree.h&gt;

/*
 * A person record
 */
typedef struct person {
    char *name;
    char *email;
    char *company;
    char *organisation;
    char *smail;
    char *webPage;
    char *phone;
} person, *personPtr;

/*
 * And the code needed to parse it
 */
personPtr parsePerson(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    personPtr ret = NULL;

    /* DEBUG is a tracing macro assumed to be defined by the application */
DEBUG("parsePerson\n");
    /*
     * allocate the struct
     */
    ret = (personPtr) malloc(sizeof(person));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(person));

    /* We don't care what the top level element name is */
    cur = cur->xmlChildrenNode;
    while (cur != NULL) {
        if ((!strcmp(cur->name, "Person")) && (cur->ns == ns))
            ret->name = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1);
        if ((!strcmp(cur->name, "Email")) && (cur->ns == ns))
            ret->email = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1);
        cur = cur->next;
    }

    return(ret);
}</pre>

<p>Here are a couple of things to notice:</p>
<ul>
  <li>Usually a recursive parsing style is the most convenient one, XML data
    being by nature subject to repetitive constructs and usually exhibiting
    highly structured patterns.</li>
  <li>The two arguments of type <em>xmlDocPtr</em> and <em>xmlNsPtr</em>, i.e.
    the pointer to the global XML document and the namespace reserved to the
    application. Document-wide information is needed, for example, to decode
    entities, and it's a good coding practice to define a namespace for your
    application's set of data and test that the elements and attributes you're
    analyzing actually pertain to your application space.
This is done by a
    simple equality test (cur->ns == ns).</li>
  <li>To retrieve text and attribute values, it is suggested to use the
    function <em>xmlNodeListGetString</em> to gather all the text and entity
    reference nodes generated by the DOM output and produce a single text
    string.</li>
</ul>

<p>Here is another piece of code used to parse another level of the
structure:</p>
<pre>#include &lt;libxml/tree.h&gt;
/*
 * a Description for a Job
 */
typedef struct job {
    char *projectID;
    char *application;
    char *category;
    personPtr contact;
    int nbDevelopers;
    personPtr developers[100]; /* using dynamic alloc is left as an exercise */
} job, *jobPtr;

/*
 * And the code needed to parse it
 */
jobPtr parseJob(xmlDocPtr doc, xmlNsPtr ns, xmlNodePtr cur) {
    jobPtr ret = NULL;

DEBUG("parseJob\n");
    /*
     * allocate the struct
     */
    ret = (jobPtr) malloc(sizeof(job));
    if (ret == NULL) {
        fprintf(stderr,"out of memory\n");
        return(NULL);
    }
    memset(ret, 0, sizeof(job));

    /* We don't care what the top level element name is */
    cur = cur->xmlChildrenNode;
    while (cur != NULL) {

        if ((!strcmp(cur->name, "Project")) && (cur->ns == ns)) {
            /* the ID is carried by an attribute, not by a text child */
            ret->projectID = xmlGetProp(cur, "ID");
            if (ret->projectID == NULL) {
                fprintf(stderr, "Project has no ID\n");
            }
        }
        if ((!strcmp(cur->name, "Application")) && (cur->ns == ns))
            ret->application = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1);
        if ((!strcmp(cur->name, "Category")) && (cur->ns == ns))
            ret->category = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1);
        if ((!strcmp(cur->name, "Contact")) && (cur->ns == ns))
            ret->contact = parsePerson(doc, ns, cur);
        cur = cur->next;
    }

    return(ret);
}</pre>

<p>One can notice that once used to it, writing this kind of code is quite
simple, but boring. Ultimately, it could be possible to write stubbers taking
either C data structure definitions, a set of XML examples or an XML DTD and
producing the code needed to import and export the content between C data and
XML storage. This is left as an exercise to the reader :-)</p>

<p>Feel free to use <a href="example/gjobread.c">the code for the full C
parsing example</a> as a template; it is also available with a Makefile in the
Gnome CVS base under gnome-xml/example.</p>

<h2><a name="Contributi">Contributions</a></h2>
<ul>
  <li><a href="mailto:ari@btigate.com">Ari Johnson</a> provides a C++ wrapper
    for libxml:
    <p>Website: <a
    href="http://lusis.org/~ari/xml++/">http://lusis.org/~ari/xml++/</a></p>
    <p>Download: <a
    href="http://lusis.org/~ari/xml++/libxml++.tar.gz">http://lusis.org/~ari/xml++/libxml++.tar.gz</a></p>
  </li>
  <li><a href="mailto:doolin@cs.utk.edu">David Doolin</a> provides a
    precompiled Windows version
    <p><a
    href="http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/">http://www.ce.berkeley.edu/~doolin/code/libxmlwin32/</a></p>
  </li>
  <li><a href="mailto:fnatter@gmx.net">Felix Natter</a> provided <a
    href="libxml-doc.el">an emacs module</a> to look up libxml function
    documentation</li>
  <li><a href="mailto:sherwin@nlm.nih.gov">Ziying Sherwin</a> provided <a
    href="http://xmlsoft.org/messages/0488.html">man pages</a> (not yet
    integrated in the distribution)</li>
</ul>

<p></p>

<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>

<p>$Id: xml.html,v 1.56 2000/10/21 09:25:52 veillard Exp $</p>
</body>
</html>