1/* 2 * Copyright (C) 2011 The Guava Authors 3 * 4 * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except 5 * in compliance with the License. You may obtain a copy of the License at 6 * 7 * http://www.apache.org/licenses/LICENSE-2.0 8 * 9 * Unless required by applicable law or agreed to in writing, software distributed under the License 10 * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 11 * or implied. See the License for the specific language governing permissions and limitations under 12 * the License. 13 */ 14 15package com.google.common.hash; 16 17import com.google.common.annotations.Beta; 18import com.google.common.primitives.Ints; 19 20import java.nio.charset.Charset; 21 22/** 23 * A hash function is a collision-averse pure function that maps an arbitrary block of 24 * data to a number called a <i>hash code</i>. 25 * 26 * <h3>Definition</h3> 27 * 28 * <p>Unpacking this definition: 29 * 30 * <ul> 31 * <li><b>block of data:</b> the input for a hash function is always, in concept, an 32 * ordered byte array. This hashing API accepts an arbitrary sequence of byte and 33 * multibyte values (via {@link Hasher}), but this is merely a convenience; these are 34 * always translated into raw byte sequences under the covers. 35 * 36 * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit 37 * length (given by {@link #bits}). For example, {@link Hashing#sha1} produces a 38 * 160-bit number, while {@link Hashing#murmur3_32()} yields only 32 bits. Because a 39 * {@code long} value is clearly insufficient to hold all hash code values, this API 40 * represents a hash code as an instance of {@link HashCode}. 41 * 42 * <li><b>pure function:</b> the value produced must depend only on the input bytes, in 43 * the order they appear. Input data is never modified. {@link HashFunction} instances 44 * should always be stateless, and therefore thread-safe. 45 * 46 * <li><b>collision-averse:</b> while it can't be helped that a hash function will 47 * sometimes produce the same hash code for distinct inputs (a "collision"), every 48 * hash function strives to <i>some</i> degree to make this unlikely. (Without this 49 * condition, a function that always returns zero could be called a hash function. It 50 * is not.) 51 * </ul> 52 * 53 * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield 54 * unequal <i>often</i>." This is the most important characteristic of all hash functions. 55 * 56 * <h3>Desirable properties</h3> 57 * 58 * <p>A high-quality hash function strives for some subset of the following virtues: 59 * 60 * <ul> 61 * <li><b>collision-resistant:</b> while the definition above requires making at least 62 * <i>some</i> token attempt, one measure of the quality of a hash function is <i>how 63 * well</i> it succeeds at this goal. Important note: it may be easy to achieve the 64 * theoretical minimum collision rate when using completely <i>random</i> sample 65 * input. The true test of a hash function is how it performs on representative 66 * real-world data, which tends to contain many hidden patterns and clumps. The goal 67 * of a good hash function is to stamp these patterns out as thoroughly as possible. 68 * 69 * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should 70 * yield only the expected <i>twofold</i> increase to all collision rates. Informally, 71 * the "information" in the hash code should be as evenly "spread out" through the 72 * hash code's bits as possible. The result is that, for example, when choosing a 73 * bucket in a hash table of size 2^8, <i>any</i> eight bits could be consistently 74 * used. 75 * 76 * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are 77 * designed to make it as infeasible as possible to reverse-engineer the input that 78 * produced a given hash code, or even to discover <i>any</i> two distinct inputs that 79 * yield the same result. These are called <i>cryptographic hash functions</i>. But, 80 * whenever it is learned that either of these feats has become computationally 81 * feasible, the function is deemed "broken" and should no longer be used for secure 82 * purposes. (This is the likely eventual fate of <i>all</i> cryptographic hashes.) 83 * 84 * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration. 85 * We have published <a href="#noWeHaventYet">microbenchmark results</a> for many 86 * common hash functions. 87 * </ul> 88 * 89 * <h3>Providing input to a hash function</h3> 90 * 91 * <p>The primary way to provide the data that your hash function should act on is via a 92 * {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher}, 93 * "push" the relevant data into it using methods like {@link Hasher#putBytes(byte[])}, 94 * and finally ask for the {@code HashCode} when finished using {@link Hasher#hash}. (See 95 * an {@linkplain #newHasher example} of this.) 96 * 97 * <p>If all you want to hash is a single byte array, string or {@code long} value, there 98 * are convenient shortcut methods defined directly on {@link HashFunction} to make this 99 * easier. 100 * 101 * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code 102 * T} provided that you implement a {@link Funnel Funnel<T>} to specify how to "feed" data 103 * from that object into the function. (See {@linkplain Hasher#putObject an example} of 104 * this.) 105 * 106 * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always 107 * interpreted in <i>little-endian</i> order. That is, hashing the byte array {@code 108 * {0x01, 0x02, 0x03, 0x04}} is equivalent to hashing the {@code int} value {@code 109 * 0x04030201}. If this isn't what you need, methods such as {@link Integer#reverseBytes} 110 * and {@link Ints#toByteArray} will help. 111 * 112 * <h3>Relationship to {@link Object#hashCode}</h3> 113 * 114 * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no 115 * separation between hash algorithms and the data they act on, so alternate hash 116 * algorithms can't be easily substituted. Also, implementations of {@code hashCode} tend 117 * to be poor-quality, in part because they end up depending on <i>other</i> existing 118 * poor-quality {@code hashCode} implementations, including those in many JDK classes. 119 * 120 * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak 121 * collision prevention and <i>no</i> expectation of bit dispersion. This leaves them 122 * perfectly suitable for use in hash tables, because extra collisions cause only a slight 123 * performance hit, while poor bit dispersion is easily corrected using a secondary hash 124 * function (which all reasonable hash table implementations in Java use). For the many 125 * uses of hash functions beyond data structures, however, {@code Object.hashCode} almost 126 * always falls short -- hence this library. 127 * 128 * @author Kevin Bourrillion 129 * @since 11.0 130 */ 131@Beta 132public interface HashFunction { 133 /** 134 * Begins a new hash code computation by returning an initialized, stateful {@code 135 * Hasher} instance that is ready to receive data. Example: <pre> {@code 136 * 137 * HashFunction hf = Hashing.md5(); 138 * HashCode hc = hf.newHasher() 139 * .putLong(id) 140 * .putBoolean(isActive) 141 * .hash();}</pre> 142 */ 143 Hasher newHasher(); 144 145 /** 146 * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the 147 * expected size of the input (in bytes). This is only important for non-streaming hash 148 * functions (hash functions that need to buffer their whole input before processing any 149 * of it). 150 */ 151 Hasher newHasher(int expectedInputSize); 152 153 /** 154 * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given 155 * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i> 156 * perform better than its longhand equivalent, but should not perform worse. 157 * 158 * @since 12.0 159 */ 160 HashCode hashInt(int input); 161 162 /** 163 * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the 164 * given {@code long} value, interpreted in little-endian byte order. The implementation 165 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 166 */ 167 HashCode hashLong(long input); 168 169 /** 170 * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation 171 * <i>might</i> perform better than its longhand equivalent, but should not perform 172 * worse. 173 */ 174 HashCode hashBytes(byte[] input); 175 176 /** 177 * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation 178 * <i>might</i> perform better than its longhand equivalent, but should not perform 179 * worse. 180 * 181 * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length} 182 * or {@code len < 0} 183 */ 184 HashCode hashBytes(byte[] input, int off, int len); 185 186 /** 187 * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation 188 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 189 * Note that no character encoding is performed; the low byte and high byte of each {@code char} 190 * are hashed directly (in that order). 191 * 192 * @since 15.0 (since 11.0 as hashString(CharSequence)). 193 */ 194 HashCode hashUnencodedChars(CharSequence input); 195 196 /** 197 * Shortcut for {@code newHasher().putUnencodedChars(input).hash()}. The implementation 198 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 199 * Note that no character encoding is performed; the low byte and high byte of each {@code char} 200 * are hashed directly (in that order). 201 * 202 * @deprecated Use {@link HashFunction#hashUnencodedChars} instead. This method is scheduled for 203 * removal in Guava 16.0. 204 */ 205 @Deprecated 206 HashCode hashString(CharSequence input); 207 208 /** 209 * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded 210 * using the given {@link Charset}. The implementation <i>might</i> perform better than its 211 * longhand equivalent, but should not perform worse. 212 */ 213 HashCode hashString(CharSequence input, Charset charset); 214 215 /** 216 * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation 217 * <i>might</i> perform better than its longhand equivalent, but should not perform worse. 218 * 219 * @since 14.0 220 */ 221 <T> HashCode hashObject(T instance, Funnel<? super T> funnel); 222 223 /** 224 * Returns the number of bits (a multiple of 32) that each hash code produced by this 225 * hash function has. 226 */ 227 int bits(); 228} 229