xmltutorial.xml revision bd3b4fd15bc7fa17816a73cc8325085dcc378e8f
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
    "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [
<!ENTITY KEYWORD SYSTEM "includekeyword.c">
<!ENTITY STORY SYSTEM "includestory.xml">
<!ENTITY ADDKEYWORD SYSTEM "includeaddkeyword.c">
<!ENTITY ADDATTRIBUTE SYSTEM "includeaddattribute.c">
<!ENTITY GETATTRIBUTE SYSTEM "includegetattribute.c">
<!ENTITY CONVERT SYSTEM "includeconvert.c">
]>
<article lang="en">
  <articleinfo>
    <title>Libxml Tutorial</title>
    <author>
      <firstname>John</firstname>
      <surname>Fleck</surname>
      <email>jfleck@inkstain.net</email>
    </author>
    <copyright>
      <year>2002</year>
      <holder>John Fleck</holder>
    </copyright>
    <revhistory>
      <revision>
        <revnumber>1</revnumber>
        <date>June 4, 2002</date>
      </revision>
      <revision>
        <revnumber>2</revnumber>
        <date>June 12, 2002</date>
      </revision>
      <revision>
        <revnumber>3</revnumber>
        <date>Aug. 31, 2002</date>
      </revision>
      <revision>
        <revnumber>4</revnumber>
        <date>Nov. 10, 2002</date>
      </revision>
    </revhistory>
  </articleinfo>
  <abstract>
    <para>Libxml is a freely licensed C language library for handling
    <acronym>XML</acronym>, portable across a large number of platforms. This
    tutorial provides examples of its basic functions.</para>
  </abstract>
  <sect1 id="introduction">
    <title>Introduction</title>
    <para>Libxml is a C language library implementing functions for reading,
    creating and manipulating <acronym>XML</acronym> data. This tutorial
    provides example code and explanations of its basic functionality.</para>
    <para>Libxml and more details about its use are available on <ulink
    url="http://www.xmlsoft.org/">the project home page</ulink>. Included there is complete <ulink url="http://xmlsoft.org/html/libxml-lib.html">
    <acronym>API</acronym> documentation</ulink>.
    This tutorial is not meant
    to substitute for that complete documentation, but to illustrate the
    functions needed to use the library to perform basic operations.
<!--
    Links to
    other resources can be found in <xref linkend="furtherresources" />.
-->
    </para>
    <para>The tutorial is based on a simple <acronym>XML</acronym> application I
    use for articles I write. The format includes metadata and the body
    of the article.</para>
    <para>The example code in this tutorial demonstrates how to:
      <itemizedlist>
        <listitem>
          <para>Parse the document.</para>
        </listitem>
        <listitem>
          <para>Extract the text within a specified element.</para>
        </listitem>
        <listitem>
          <para>Add an element and its content.</para>
        </listitem>
        <listitem>
          <para>Add an attribute.</para>
        </listitem>
        <listitem>
          <para>Extract the value of an attribute.</para>
        </listitem>
      </itemizedlist>
    </para>
    <para>Full code for the examples is included in the appendices.</para>

  </sect1>

  <sect1 id="xmltutorialdatatypes">
    <title>Data Types</title>
    <para><application>Libxml</application> declares a number of data types we
    will encounter repeatedly, hiding the messy stuff so you do not have to deal
    with it unless you have some specific need.</para>
    <para>
      <variablelist>
        <varlistentry>
          <term><ulink
          url="http://xmlsoft.org/html/libxml-tree.html#XMLCHAR">xmlChar</ulink></term>
          <listitem>
            <para>A basic replacement for char, a byte in a UTF-8 encoded
            string. If your data uses another encoding, it must be converted to
            UTF-8 for use with <application>libxml's</application>
            functions.
            More information on encoding is available on the <ulink
            url="http://www.xmlsoft.org/encoding.html"><application>libxml</application> encoding support web page</ulink>.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>
            <ulink url="http://xmlsoft.org/html/libxml-tree.html#XMLDOC">xmlDoc</ulink></term>
          <listitem>
            <para>A structure containing the tree created by a parsed document. <ulink
            url="http://xmlsoft.org/html/libxml-tree.html#XMLDOCPTR">xmlDocPtr</ulink>
            is a pointer to the structure.</para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><ulink
          url="http://xmlsoft.org/html/libxml-tree.html#XMLNODEPTR">xmlNodePtr</ulink>
          and <ulink url="http://xmlsoft.org/html/libxml-tree.html#XMLNODE">xmlNode</ulink></term>
          <listitem>
            <para>A structure containing a single node. <ulink
            url="http://xmlsoft.org/html/libxml-tree.html#XMLNODEPTR">xmlNodePtr</ulink>
            is a pointer to the structure, and is used in traversing the document tree.</para>
          </listitem>
        </varlistentry>
      </variablelist>
    </para>

  </sect1>

  <sect1 id="xmltutorialparsing">
    <title>Parsing the File</title>
    <para>Parsing the file requires only the name of the file and a single
    function call, plus error checking. Full code: <xref
    linkend="keywordappendix" /></para>
    <para>
      <programlisting>
        <co id="declaredoc" /> xmlDocPtr doc;
        <co id="declarenode" /> xmlNodePtr cur;

        <co id="parsefile" /> doc = xmlParseFile(docname);

        <co id="checkparseerror" /> if (doc == NULL) {
                fprintf(stderr,"Document not parsed successfully.\n");
                return;
        }

        <co id="getrootelement" /> cur = xmlDocGetRootElement(doc);

        <co id="checkemptyerror" /> if (cur == NULL) {
                fprintf(stderr,"empty document\n");
                xmlFreeDoc(doc);
                return;
        }

        <co id="checkroottype" /> if (xmlStrcmp(cur->name, (const xmlChar *) "story")) {
                fprintf(stderr,"document of the wrong type, root node != story\n");
                xmlFreeDoc(doc);
                return;
        }

      </programlisting>
      <calloutlist>
        <callout arearefs="declaredoc">
          <para>Declare the pointer that will point to your parsed document.</para>
        </callout>
        <callout arearefs="declarenode">
          <para>Declare a node pointer (you'll need this in order to
          interact with individual nodes).</para>
        </callout>
        <callout arearefs="parsefile">
          <para>Parse the file named by <varname>docname</varname>.
          <function>xmlParseFile</function> returns NULL if parsing
          fails.</para>
        </callout>
        <callout arearefs="checkparseerror">
          <para>Check to see that the document was successfully parsed. If it
          was not, <application>libxml</application> will at this point
          register an error and stop.
            <note>
              <para>One common example of an error at this point is improper
              handling of encoding. The <acronym>XML</acronym> standard requires
              documents stored with an encoding other than UTF-8 or UTF-16 to
              contain an explicit declaration of their encoding. If the
              declaration is there, <application>libxml</application> will
              automatically perform the necessary conversion to UTF-8 for
              you. More information on <acronym>XML's</acronym> encoding
              requirements is contained in the <ulink
              url="http://www.w3.org/TR/REC-xml#charencoding">standard</ulink>.</para>
            </note>
          </para>
        </callout>
        <callout arearefs="getrootelement">
          <para>Retrieve the document's root element.</para>
        </callout>
        <callout arearefs="checkemptyerror">
          <para>Check to make sure the document actually contains something.</para>
        </callout>
        <callout arearefs="checkroottype">
          <para>In our case, we need to make sure the document is the right
          type. "story" is the root type of the documents used in this
          tutorial.</para>
        </callout>
      </calloutlist>
    </para>
  </sect1>

  <sect1 id="xmltutorialgettext">
    <title>Retrieving Element Content</title>
    <para>Retrieving the content of an element involves traversing the document
    tree until you find what you are looking for. In this case, we are looking
    for an element called "keyword" contained within an element called "story". The
    process to find the node we are interested in involves tediously walking the
    tree. We assume you already have an xmlDocPtr called <varname>doc</varname>
    and an xmlNodePtr called <varname>cur</varname>.</para>

    <para>
      <programlisting>
        <co id="getchildnode" /> cur = cur->xmlChildrenNode;
        <co id="huntstoryinfo" /> while (cur != NULL) {
                if ((!xmlStrcmp(cur->name, (const xmlChar *)"storyinfo"))){
                        parseStory (doc, cur);
                }

        cur = cur->next;
        }

      </programlisting>

      <calloutlist>
        <callout arearefs="getchildnode">
          <para>Get the first child node of <varname>cur</varname>. At this
          point, <varname>cur</varname> points at the document root, which is
          the element "story".</para>
        </callout>
        <callout arearefs="huntstoryinfo">
          <para>This loop iterates through the elements that are children of
          "story", looking for one called "storyinfo". That
          is the element that will contain the "keywords" we are
          looking for. It uses the <application>libxml</application> string
          comparison
          function, <function><ulink
          url="http://xmlsoft.org/html/libxml-parser.html#XMLSTRCMP">xmlStrcmp</ulink></function>.
          If there is a match, it calls the function <function>parseStory</function>.</para>
        </callout>
      </calloutlist>
    </para>

    <para>
      <programlisting>
void
parseStory (xmlDocPtr doc, xmlNodePtr cur) {

        xmlChar *key;
 <co id="anothergetchild" /> cur = cur->xmlChildrenNode;
 <co id="findkeyword" /> while (cur != NULL) {
            if ((!xmlStrcmp(cur->name, (const xmlChar *)"keyword"))) {
 <co id="foundkeyword" />       key = xmlNodeListGetString(doc, cur->xmlChildrenNode, 1);
                if (key != NULL) {
                    printf("keyword: %s\n", key);
                    xmlFree(key);
                }
            }
        cur = cur->next;
        }
        return;
}
      </programlisting>
      <calloutlist>
        <callout arearefs="anothergetchild">
          <para>Again we get the first child node.</para>
        </callout>
        <callout arearefs="findkeyword">
          <para>Like the loop above, we then iterate through the nodes, looking
          for one that matches the element we're interested in, in this case
          "keyword".</para>
        </callout>
        <callout arearefs="foundkeyword">
          <para>When we find the "keyword" element, we need to print
          its contents. Remember that in <acronym>XML</acronym>, the text
          contained within an element is a child node of that element, so we
          turn to <varname>cur->xmlChildrenNode</varname>. To retrieve it, we
          use the function <function><ulink
          url="http://xmlsoft.org/html/libxml-tree.html#XMLNODELISTGETSTRING">xmlNodeListGetString</ulink></function>, which also takes the <varname>doc</varname> pointer as an argument. In this case, we just print it out. Because <function>xmlNodeListGetString</function> allocates memory for the string it returns, we release it with <function>xmlFree</function> when we are done.</para>
        </callout>
      </calloutlist>
    </para>

  </sect1>

<sect1 id="xmltutorialwritingcontent">
    <title>Writing Element Content</title>
    <para>Writing element content uses many of the same steps we used above:
    parsing the document and walking the tree. We parse the document,
    then traverse the tree to find the place we want to insert our element. For
    this example, we want to again find the "storyinfo" element and
    this time insert a keyword.
    Then we'll write the file to disk. Full code:
    <xref linkend="addkeywordappendix" /></para>

    <para>
      The main difference in this example is in
      <function>parseStory</function>:

      <programlisting>
void
parseStory (xmlDocPtr doc, xmlNodePtr cur, char *keyword) {

 <co id="addkeyword" /> xmlNewTextChild (cur, NULL, (const xmlChar *) "keyword", (const xmlChar *) keyword);
    return;
}
      </programlisting>
      <calloutlist>
        <callout arearefs="addkeyword">
          <para>The <function><ulink
          url="http://xmlsoft.org/html/libxml-tree.html#XMLNEWTEXTCHILD">xmlNewTextChild</ulink></function>
          function adds a new child element at the
          current node pointer's location in the
          tree, specified by <varname>cur</varname>. The second argument is
          the element's namespace; if you want the element to have one, you
          can add it here. In our case, the namespace is NULL.</para>
        </callout>
      </calloutlist>
    </para>

    <para>
      Once the node has been added, we would like to write the document to
      file.
      <programlisting>
     xmlSaveFormatFile (docname, doc, 1);
      </programlisting>
      The first parameter is the name of the file to be written. You'll notice
      it is the same as the file we just read. In this case, we just write over
      the old file. The second parameter is a pointer to the xmlDoc
      structure. Setting the third parameter equal to one ensures indenting on output.
    </para>
  </sect1>

  <sect1 id="xmltutorialwritingattribute">
    <title>Writing Attributes</title>
    <para>Writing an attribute is similar to writing text to a new element. In
    this case, we'll add a reference <acronym>URI</acronym> to our
    document. Full code: <xref linkend="addattributeappendix" />.</para>
    <para>
      A <sgmltag>reference</sgmltag> is a child of the <sgmltag>story</sgmltag>
      element, so finding the place to put our new element and attribute is
      simple.
      As soon as we do the error-checking test in our
      <function>parseDoc</function>, we are in the right spot to add our
      element. But before we do that, we need to make a declaration using a
      data type we have not seen yet:
      <programlisting>
        xmlAttrPtr newattr;
      </programlisting>
      We also need an extra xmlNodePtr:
      <programlisting>
        xmlNodePtr newnode;
      </programlisting>
    </para>
    <para>
      The rest of <function>parseDoc</function> is the same as before until we
      check to see if our root element is <sgmltag>story</sgmltag>. If it is,
      then we know we are at the right spot to add our element:

      <programlisting>
        <co id="addreferencenode" /> newnode = xmlNewTextChild (cur, NULL, (const xmlChar *) "reference", NULL);
        <co id="addattributenode" /> newattr = xmlNewProp (newnode, (const xmlChar *) "uri", (const xmlChar *) uri);
      </programlisting>
      <calloutlist>
        <callout arearefs="addreferencenode">
          <para>First we add a new node at the location of the current node
          pointer, <varname>cur</varname>, using the <ulink
          url="http://xmlsoft.org/html/libxml-tree.html#XMLNEWTEXTCHILD">xmlNewTextChild</ulink> function.</para>
        </callout>
        <callout arearefs="addattributenode">
          <para>Then we attach the <sgmltag>uri</sgmltag> attribute to the
          new node with <function>xmlNewProp</function>, passing the
          attribute's name and its value.</para>
        </callout>
      </calloutlist>
    </para>

    <para>Once the node is added, the file is written to disk just as in the
    previous example in which we added an element with text content.</para>

  </sect1>

  <sect1 id="xmltutorialattribute">
    <title>Retrieving Attributes</title>
    <para>Retrieving the value of an attribute is similar to the previous
    example in which we retrieved a node's text contents. In this case we'll
    extract the value of the <acronym>URI</acronym> we added in the previous
    section. Full code: <xref linkend="getattributeappendix" />.</para>
    <para>
      The initial steps for this example are similar to the previous ones: parse
      the doc, find the element you are interested in, then enter a function to
      carry out the specific task required.
      In this case, we call
      <function>getReference</function>:
      <programlisting>
void
getReference (xmlDocPtr doc, xmlNodePtr cur) {

        xmlChar *uri;
        cur = cur->xmlChildrenNode;
        while (cur != NULL) {
            if ((!xmlStrcmp(cur->name, (const xmlChar *)"reference"))) {
 <co id="getattributevalue" />  uri = xmlGetProp(cur, (const xmlChar *) "uri");
                if (uri != NULL) {
                    printf("uri: %s\n", uri);
                    xmlFree(uri);
                }
            }
            cur = cur->next;
        }
        return;
}
      </programlisting>

      <calloutlist>
        <callout arearefs="getattributevalue">
          <para>
            The key function is <function><ulink
            url="http://xmlsoft.org/html/libxml-tree.html#XMLGETPROP">xmlGetProp</ulink></function>, which returns a
            pointer to an <varname>xmlChar</varname> string containing the
            attribute's value, or NULL if the attribute is not found. In this case,
            we just print it out, then release the string with
            <function>xmlFree</function>, since the caller owns the returned memory.
            <note>
              <para>
                If you are using a <acronym>DTD</acronym> that declares a fixed or
                default value for the attribute, this function will retrieve it.
              </para>
            </note>
          </para>
        </callout>
      </calloutlist>

    </para>
  </sect1>

  <sect1 id="xmltutorialconvert">
    <title>Encoding Conversion</title>

    <para>Data encoding compatibility problems are one of the most common
    difficulties encountered by programmers new to <acronym>XML</acronym> in
    general and <application>libxml</application> in particular. Thinking
    through the design of your application in light of this issue will help
    avoid difficulties later. Internally, <application>libxml</application>
    stores and manipulates data in the UTF-8 format. Data used by your program
    in other formats, such as the commonly used ISO-8859-1 encoding, must be
    converted to UTF-8 before passing it to <application>libxml</application>
    functions. If you want your program's output in an encoding other than
    UTF-8, you also must convert it.</para>

    <para><application>Libxml</application> uses
    <application>iconv</application> if it is available to convert
    data.
    Without <application>iconv</application>, only UTF-8, UTF-16 and
    ISO-8859-1 can be used as external formats. With
    <application>iconv</application>, any format can be used provided
    <application>iconv</application> is able to convert it to and from
    UTF-8. Currently <application>iconv</application> supports about 150
    different character formats with the ability to convert from any to any. While
    the actual number of supported formats varies between implementations, nearly every
    <application>iconv</application> implementation supports all of the
    commonly used encodings.</para>

    <warning>
      <para>A common mistake is to use different formats for the internal data
      in different parts of one's code. The most common case is an application
      that assumes ISO-8859-1 to be the internal data format, combined with
      <application>libxml</application>, which assumes UTF-8 to be the
      internal data format. The result is an application that treats internal
      data differently, depending on which code section is executing. One
      part of the code or the other will then, naturally, misinterpret the data.
      </para>
    </warning>

    <para>This example constructs a simple document, then adds content provided
    at the command line to the document's root element and outputs the results
    to <filename>stdout</filename> in the proper encoding. For this example, we
    use ISO-8859-1 encoding. The encoding of the string input at the command
    line is converted from ISO-8859-1 to UTF-8.
    Full code: <xref
    linkend="convertappendix" /></para>

    <para>The conversion, encapsulated in the example code in the
    <function>convert</function> function, uses
    <application>libxml's</application>
    <function>xmlFindCharEncodingHandler</function> function:
      <programlisting>
        <co id="handlerdatatype" />xmlCharEncodingHandlerPtr handler;
        <co id="calcsize" />size = (int)strlen(in)+1;
        out_size = size*2-1;
        out = malloc((size_t)out_size);

…
        <co id="findhandlerfunction" />handler = xmlFindCharEncodingHandler(encoding);
…
        <co id="callconversionfunction" />handler->input(out, &amp;out_size, in, &amp;temp);
…
        <co id="outputencoding" />xmlSaveFormatFileEnc("-", doc, encoding, 1);
      </programlisting>
      <calloutlist>
        <callout arearefs="handlerdatatype">
          <para><varname>handler</varname> is declared as a pointer to an
          <structname>xmlCharEncodingHandler</structname> structure, which
          holds the conversion functions for one encoding.</para>
        </callout>
        <callout arearefs="calcsize">
          <para>The handler's conversion function needs
          to be given the size of the input and output strings, which are
          calculated here for strings <varname>in</varname> and
          <varname>out</varname>. For ISO-8859-1 input, each byte expands to
          at most two bytes of UTF-8, so an output buffer of
          <varname>size*2-1</varname> bytes is always large enough.</para>
        </callout>
        <callout arearefs="findhandlerfunction">
          <para><function>xmlFindCharEncodingHandler</function> takes as its
          argument the data's initial encoding and searches
          <application>libxml's</application> built-in set of conversion
          handlers, returning a pointer to the handler or NULL if none is
          found.</para>
        </callout>
        <callout arearefs="callconversionfunction">
          <para>The conversion function identified by <varname>handler</varname>
          requires as its arguments pointers to the input and output strings,
          along with the length of each.
          The lengths must be determined
          separately by the application.</para>
        </callout>
        <callout arearefs="outputencoding">
          <para>To output in a specified encoding rather than UTF-8, we use
          <function>xmlSaveFormatFileEnc</function>, specifying the
          encoding.</para>
        </callout>
      </calloutlist>
    </para>
  </sect1>

<!--
  <appendix id="furtherresources">
    <title>Further Resources</title>
    <para></para>
  </appendix>
-->
  <appendix id="sampledoc">
    <title>Sample Document</title>
    <programlisting>&STORY;</programlisting>
  </appendix>
  <appendix id="keywordappendix">
    <title>Code for Keyword Example</title>
    <para>
      <programlisting>&KEYWORD;</programlisting>
    </para>
  </appendix>
<appendix id="addkeywordappendix">
    <title>Code for Add Keyword Example</title>
    <para>
      <programlisting>&ADDKEYWORD;</programlisting>
    </para>
  </appendix>
<appendix id="addattributeappendix">
    <title>Code for Add Attribute Example</title>
    <para>
      <programlisting>&ADDATTRIBUTE;</programlisting>
    </para>
  </appendix>
<appendix id="getattributeappendix">
    <title>Code for Retrieving Attribute Value Example</title>
    <para>
      <programlisting>&GETATTRIBUTE;</programlisting>
    </para>
  </appendix>
  <appendix id="convertappendix">
    <title>Code for Encoding Conversion Example</title>
    <para>
      <programlisting>&CONVERT;</programlisting>
    </para>
  </appendix>
  <appendix>
    <title>Acknowledgements</title>
    <para>A number of people have generously offered feedback, code and
    suggested improvements to this tutorial. In no particular order:
      <simplelist>
        <member>Daniel Veillard</member>
        <member>Marcus Labib Iskander</member>
        <member>Christopher R. Harris</member>
        <member>Igor Zlatkovic</member>
      </simplelist>
    </para>
  </appendix>
</article>