==============================
The Google URL Parsing Library
==============================

This is the Google URL Parsing Library which parses and canonicalizes URLs.
Please see the LICENSE.txt file for licensing information.

Features
========

   * Easily embeddable: This library was written with a variety of client and
     server programs in mind, so unlike most implementations of URL parsing
     and canonicalization, it can be easily embedded.

   * Fast: hundreds of thousands of typical URLs can be parsed and
     canonicalized per second on a modern CPU. It is much faster than, for
     example, calling WinInet's corresponding functions.

   * Compatible: When possible, this library strives for IE7 compatibility,
     both for general web compatibility and so that IE add-ons or other
     applications that communicate with or embed IE will work properly.

     It supports Unix-style file URLs, as well as the more complex rules for
     Windows file URLs. Note that total compatibility is not possible (for
     example, IE6 and IE7 disagree about how to parse certain IP addresses),
     and that this library is stricter than IE about certain illegal, rarely
     used, and potentially dangerous constructs, such as escaped control
     characters in host names. It is typically a little less strict than
     Firefox.


Example
=======

An example implementation of a URL object that uses this library is provided
in src/gurl.*. This implementation uses the "application integration" layer
discussed below to interface with the low-level parsing and canonicalization
functions.


Building
========

The canonicalization files require ICU for some UTF-8 and UTF-16 conversion
macros. If your project does not use ICU, it should be straightforward to
factor out the ICU macros and functions that are used; there are only a few
well-isolated ones.

TODO(brettw) ADD INSTRUCTIONS FOR GETTING ICU HERE!

logging.h and logging.cc are Windows-only because the corresponding Unix
logging system has many dependencies. This library uses few of the logging
macros, and a dummy header can easily be written that defines the
appropriate things for Unix.


Definitions
===========

"Standard URL": A URL with an "authority", which is a hostname and optionally
   a port, username, and password. Most URLs, such as HTTP and FTP, are
   standard.

"File URL": A URL that references a file on disk. There are special rules for
   this type of URL. Note that it may have a hostname! "localhost" is allowed;
   for example, "file://localhost/foo" is the same as "file:///foo".

"Path URL": This is everything else. There is no standard on how to treat these
   URLs, or even what they are called. This library decomposes them into a
   scheme and a path. The path is everything following the scheme. This type of
   URL includes "javascript", "data", and even "mailto" (although "mailto"
   might look like a standard scheme in some respects, it is not).


Design
======

The library is divided into four layers. They are listed here from the lowest
to the highest; you can use any portion of the library as long as you embed the
layers below it.

1. Parsing
----------
At the lowest level is the parsing code. The files encompassing this are
url_parse.* and the main include file is src/url_parse.h. This code will, given
an input string, parse it into the most likely form of a URL.

Parsing cannot fail and does no validation. The exception is the port number,
which it currently validates, but this is a bug. Given crazy input, the parser
will do its best to find the various URL components according to its rules (see
url_parse_unittest.cc for some examples).
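
As a rough illustration of the "find the components, never fail" approach
described above, the sketch below locates a scheme as an offset/length pair
into the input string. SpanSketch and ExtractSchemeSketch are hypothetical
names made up for this example; they are not the declarations in
src/url_parse.h, and the real parser's rules are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a parsed component: an offset and a length into
// the input string. A length of -1 means "component not present".
struct SpanSketch {
  int begin = 0;
  int len = -1;
  bool is_valid() const { return len != -1; }
};

// Finds the scheme, taken here to be the text before the first ':', after
// skipping leading spaces and control characters. It performs no validation:
// when there is no colon it returns false and leaves *scheme marked as
// "not present".
bool ExtractSchemeSketch(const std::string& url, SpanSketch* scheme) {
  assert(scheme != nullptr);
  *scheme = SpanSketch();

  std::size_t begin = 0;
  while (begin < url.size() &&
         static_cast<unsigned char>(url[begin]) <= ' ') {
    ++begin;
  }

  for (std::size_t i = begin; i < url.size(); ++i) {
    if (url[i] == ':') {
      scheme->begin = static_cast<int>(begin);
      scheme->len = static_cast<int>(i - begin);
      return true;
    }
  }
  return false;
}
```

Storing offsets rather than copies means the caller keeps ownership of the
input string and no allocation is needed per component, which is one plausible
reason for a parser at this level to work this way.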

To use this, an application will typically use ExtractScheme to determine the
type of a given input URL, and then call one of the initialization functions:
"ParseStandardURL", "ParsePathURL", or "ParseFileURL". This will result in
a "Parsed" structure which identifies the substrings of each identified
component.

2. Canonicalization
-------------------
At the next highest level is canonicalization. The files encompassing this are
url_canon.* and the main include file is src/url_canon.h. This code will
validate an already-parsed URL, and will convert it to a canonical form. For
example, this will convert host names to lowercase, convert IP addresses
into dotted-decimal notation, handle encoding issues, etc.

This layer will always do its best to produce a reasonable output string, but
it may report that the string is invalid. For example, if there are invalid
characters in the host name, it will escape them or replace them with the
Unicode "invalid character" character, but will still report failure. This
way, the program can display error messages to the user with the output, log
it, etc. and the string will have some meaning.

Canonicalized output is written to a CanonOutput object which is a simple
wrapper around an expanding buffer. An implementation called RawCanonOutput is
provided that writes to a raw buffer with a fixed amount statically allocated
(for performance). Applications using STL can use StdStringCanonOutput defined
in url_canon_stdstring.h which writes into a std::string.

A normal application would call one of the three high-level functions
"CanonicalizeStandardURL", "CanonicalizeFileURL", and "CanonicalizePathURL"
depending on the type of URL in question. Lower-level functions are also
provided which will canonicalize individual parts of a URL (for example,
"CanonicalizeHost").
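
The "best-effort output plus a failure result" contract described above can be
sketched as follows. CanonicalizeHostSketch is a hypothetical, ASCII-only toy
written for this README; it is not the library's CanonicalizeHost, it ignores
IDN and IP addresses, and the set of characters it accepts is an assumption
chosen only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified host canonicalizer (ASCII only). It always appends
// some output to *out: uppercase letters are lowercased, characters in the
// assumed-valid set are copied, and anything else is percent-escaped. The
// return value is false if any character had to be escaped, so the caller
// gets both a meaningful string and a validity flag.
bool CanonicalizeHostSketch(const std::string& host, std::string* out) {
  assert(out != nullptr);
  static const char kHex[] = "0123456789ABCDEF";
  bool success = true;

  for (unsigned char c : host) {
    if (c >= 'A' && c <= 'Z') {
      out->push_back(static_cast<char>(c - 'A' + 'a'));
    } else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
               c == '-' || c == '.' || c == '_') {
      out->push_back(static_cast<char>(c));
    } else {
      // Invalid here: keep a readable escaped form, but remember the failure.
      out->push_back('%');
      out->push_back(kHex[c >> 4]);
      out->push_back(kHex[c & 0xF]);
      success = false;
    }
  }
  return success;
}
```

Appending to a caller-supplied std::string loosely mirrors the idea of writing
into an output object rather than returning a new string, so several
components can be canonicalized into one buffer in sequence.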

Part of this layer is the integration with the host system for IDN and encoding
conversion. An implementation that provides integration with ICU
(http://www-306.ibm.com/software/globalization/icu/index.jsp) is provided in
src/url_canon_icu.cc. The embedder may wish to replace this file with
implementations of the functions for their own IDN library if they do not use
ICU.

3. Application integration
--------------------------
The canonicalization and parsing layers do not know anything about the URI
schemes supported by your application. The parsing and canonicalization
functions are very low-level, and you must call the correct function to do the
work (for example, "CanonicalizeFileURL").

The application integration in url_util.* provides wrappers around the
low-level parsing and canonicalization to call the correct versions for
different identified schemes. Embedders will want to modify this file if
necessary to suit the needs of their application.

4. URL object
-------------
The highest level is the "URL" object that a C++ application would use
to encapsulate a URL. Embedders will typically want to provide their own URL
object that meets the requirements of their system. A reasonably complete
example implementation is provided in src/gurl.*. You may wish to use this
object, extend or modify it, or write your own.

Whitespace
----------
Sometimes, you may want to remove linefeeds and tabs from the content of a URL.
Some web pages, for example, expect that a URL spanning two lines should be
treated as one with the newline removed. Depending on the source of the URLs
you are canonicalizing, these newlines may or may not be trimmed off.

If you want this behavior, call RemoveURLWhitespace before parsing. This will
remove CR, LF and TAB from the input. Note that it preserves spaces.
On typical URLs, this function produces a 10-15% speed reduction, so it is
optional and not done automatically. The example GURL object and the url_util
wrapper do this for you.

Tests
=====

There are a number of *_unittest.cc and *_perftest.cc files. These files are
not currently compilable as they rely on a unit testing framework that is not
included. Tests are declared like this:
  TEST(TestCaseName, TestName) {
    ASSERT_TRUE(a);
    EXPECT_EQ(a, b);
  }
If you would like to compile them, it should be straightforward to define
the TEST macro (which would declare a function by combining the two arguments)
and the other macros whose behavior should be self-explanatory (EXPECT is like
an ASSERT, but does not stop the test; if you are doing this, you probably
don't care about this difference). Then you would define a .cc file that
calls all of these functions.
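
One possible way to define these macros is sketched below. Everything in the
mini_test namespace is hypothetical and exists only for this example; it is
not the framework the tests were written against, and it covers only the three
macros shown above. As an assumption made for convenience, tests register
themselves at static-initialization time instead of being called one by one
from a hand-written .cc file.

```cpp
#include <cassert>
#include <iostream>
#include <vector>

// Hypothetical minimal stand-in for the missing unit testing framework.
namespace mini_test {
using TestFn = void (*)();

// Function-local statics avoid static-initialization-order problems.
inline std::vector<TestFn>& Registry() {
  static std::vector<TestFn> tests;
  return tests;
}
inline int& FailureCount() {
  static int failures = 0;
  return failures;
}

// Constructing a Registrar adds one test function to the registry.
struct Registrar {
  explicit Registrar(TestFn fn) {
    assert(fn != nullptr);
    Registry().push_back(fn);
  }
};

// Runs every registered test and returns the number of failed checks.
inline int RunAllTests() {
  FailureCount() = 0;
  for (TestFn fn : Registry()) fn();
  return FailureCount();
}
}  // namespace mini_test

// Declares a function named by combining the two arguments, registers it,
// and then opens its definition so the test body can follow in braces.
#define TEST(test_case, name)                                   \
  static void test_case##_##name();                             \
  static mini_test::Registrar registrar_##test_case##_##name(   \
      &test_case##_##name);                                     \
  static void test_case##_##name()

// Records a failed check and prints where it happened.
#define MINI_TEST_FAIL(text)                                    \
  do {                                                          \
    ++mini_test::FailureCount();                                \
    std::cerr << __FILE__ << ":" << __LINE__ << ": failed: "    \
              << text << "\n";                                  \
  } while (0)

// EXPECT_* records a failure and lets the test continue.
#define EXPECT_TRUE(cond) \
  do { if (!(cond)) MINI_TEST_FAIL(#cond); } while (0)
#define EXPECT_EQ(a, b) EXPECT_TRUE((a) == (b))

// ASSERT_* records a failure and returns from the test function.
#define ASSERT_TRUE(cond) \
  do { if (!(cond)) { MINI_TEST_FAIL(#cond); return; } } while (0)

// Example test in the style shown above.
TEST(SketchTest, Arithmetic) {
  int a = 2;
  ASSERT_TRUE(a > 0);
  EXPECT_EQ(a + a, 4);
}
```

A program's main function could then return mini_test::RunAllTests() so that
the process exit status is nonzero when any check fails.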