I notice that Swift Data type's hashValue collision when first 80 byte of data
and data length
are same because of the Implementation only use first 80 bytes to compute the hash.
https://web.archive.org/web/20120605052030/https://opensource.apple.com/source/CF/CF-635.21/CFData.c
also, even if hash collision on the situation like this, I can check data is really equal or not by ==
does there any reason for this implementation(only use 80 byte of data to make hashValue)?
test code is under below
let dataArray: [UInt8] = [
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00
]
var dataArray1: [UInt8] = dataArray
var dataArray2: [UInt8] = dataArray
dataArray1.append(contentsOf: [0x00, 0x00, 0x00, 0x00])
dataArray2.append(contentsOf: [0xff, 0xff, 0xff, 0xff])
let data1 = Data(dataArray1)
let data2 = Data(dataArray2) // Only last 4 byte differs
print(data1.hashValue)
print(data2.hashValue)
print(data1.hashValue == data2.hashValue) // true
print(data1 == data2) // false
That Core Foundation source code is very old. I recommend that you not use it to draw conclusion about how things currently work.
You can see how Foundation actually works in Swift open source. I’m gonna provide two links here, one for the Swift Data
type and one for CFData
itself.
Absolutely. Data
values can be vast — on a modern Mac it’s perfectly reasonable to store GiB of data in one — and hashing all of that data would be ridiculously expensive.
Yes. That’s how hashing works:
-
Two values that are identical must hash to the same value.
-
OTOH, it’s expected that non-identical values might hash to the same value.
If you think about Data
that makes sense. Data
can store way more bits of information than its hash, and thus hash collisions are inevitable.
As to the specific algorithm choice (using the first 80 bytes), I’m not sure why that was chosen. You could imagine lots of algorithms that are potentially better. However, that is a performance concern, not a question of correctness. And performance is tricky:
-
Hash calculation is meant to be cheap, so it makes sense not to look at all the data.
-
On modern CPUs non-sequential memory accesses are way more expensive than sequential ones, so it makes sense not to jump around too much.
Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"