Hash Collision in Data type

I notice that Swift Data type's hashValue collision when first 80 byte of data and data length are same because of the Implementation only use first 80 bytes to compute the hash.

https://web.archive.org/web/20120605052030/https://opensource.apple.com/source/CF/CF-635.21/CFData.c

also, even if hash collision on the situation like this, I can check data is really equal or not by ==

does there any reason for this implementation(only use 80 byte of data to make hashValue)?

test code is under below

let dataArray: [UInt8] = [
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
]

var dataArray1: [UInt8] = dataArray
var dataArray2: [UInt8] = dataArray

dataArray1.append(contentsOf: [0x00, 0x00, 0x00, 0x00])
dataArray2.append(contentsOf: [0xff, 0xff, 0xff, 0xff])

let data1 = Data(dataArray1)
let data2 = Data(dataArray2) // Only last 4 byte differs

    print(data1.hashValue)
    print(data2.hashValue)
    print(data1.hashValue == data2.hashValue) // true
    print(data1 == data2) // false
Answered by DTS Engineer in 824946022

That Core Foundation source code is very old. I recommend that you not use it to draw conclusion about how things currently work.

You can see how Foundation actually works in Swift open source. I’m gonna provide two links here, one for the Swift Data type and one for CFData itself.

Written by tkgka in 774367021
does there any reason for this implementation (only use 80 byte of data to make hashValue)?

Absolutely. Data values can be vast — on a modern Mac it’s perfectly reasonable to store GiB of data in one — and hashing all of that data would be ridiculously expensive.

Written by tkgka in 774367021
also, even if hash collision on the situation like this, I can check data is really equal or not by ==

Yes. That’s how hashing works:

  • Two values that are identical must hash to the same value.

  • OTOH, it’s expected that non-identical values might hash to the same value.

If you think about Data that makes sense. Data can store way more bits of information than its hash, and thus hash collisions are inevitable.

As to the specific algorithm choice (using the first 80 bytes), I’m not sure why that was chosen. You could imagine lots of algorithms that are potentially better. However, that is a performance concern, not a question of correctness. And performance is tricky:

  • Hash calculation is meant to be cheap, so it makes sense not to look at all the data.

  • On modern CPUs non-sequential memory accesses are way more expensive than sequential ones, so it makes sense not to jump around too much.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Accepted Answer

That Core Foundation source code is very old. I recommend that you not use it to draw conclusion about how things currently work.

You can see how Foundation actually works in Swift open source. I’m gonna provide two links here, one for the Swift Data type and one for CFData itself.

Written by tkgka in 774367021
does there any reason for this implementation (only use 80 byte of data to make hashValue)?

Absolutely. Data values can be vast — on a modern Mac it’s perfectly reasonable to store GiB of data in one — and hashing all of that data would be ridiculously expensive.

Written by tkgka in 774367021
also, even if hash collision on the situation like this, I can check data is really equal or not by ==

Yes. That’s how hashing works:

  • Two values that are identical must hash to the same value.

  • OTOH, it’s expected that non-identical values might hash to the same value.

If you think about Data that makes sense. Data can store way more bits of information than its hash, and thus hash collisions are inevitable.

As to the specific algorithm choice (using the first 80 bytes), I’m not sure why that was chosen. You could imagine lots of algorithms that are potentially better. However, that is a performance concern, not a question of correctness. And performance is tricky:

  • Hash calculation is meant to be cheap, so it makes sense not to look at all the data.

  • On modern CPUs non-sequential memory accesses are way more expensive than sequential ones, so it makes sense not to jump around too much.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Hash Collision in Data type
 
 
Q