Hash Collision in Data type

Question

Created Feb ’25

Replies 1

Boosts 0

Participants 2

I notice that Swift Data type's hashValue collision when first 80 byte of data and data length are same because of the Implementation only use first 80 bytes to compute the hash.

https://web.archive.org/web/20120605052030/https://opensource.apple.com/source/CF/CF-635.21/CFData.c

also, even if hash collision on the situation like this, I can check data is really equal or not by ==

does there any reason for this implementation(only use 80 byte of data to make hashValue)?

test code is under below

let dataArray: [UInt8] = [
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00
]

var dataArray1: [UInt8] = dataArray
var dataArray2: [UInt8] = dataArray

dataArray1.append(contentsOf: [0x00, 0x00, 0x00, 0x00])
dataArray2.append(contentsOf: [0xff, 0xff, 0xff, 0xff])

let data1 = Data(dataArray1)
let data2 = Data(dataArray2) // Only last 4 byte differs

    print(data1.hashValue)
    print(data2.hashValue)
    print(data1.hashValue == data2.hashValue) // true
    print(data1 == data2) // false

Boost

Answer 1

DTS Engineer OP

Apple

Feb ’25

Accepted Answer

Recommended

That Core Foundation source code is very old. I recommend that you not use it to draw conclusion about how things currently work.

You can see how Foundation actually works in Swift open source. I’m gonna provide two links here, one for the Swift Data type and one for CFData itself.

does there any reason for this implementation (only use 80 byte of data to make hashValue)?

Absolutely. Data values can be vast — on a modern Mac it’s perfectly reasonable to store GiB of data in one — and hashing all of that data would be ridiculously expensive.

also, even if hash collision on the situation like this, I can check data is really equal or not by ==

Yes. That’s how hashing works:

Two values that are identical must hash to the same value.
OTOH, it’s expected that non-identical values might hash to the same value.

If you think about Data that makes sense. Data can store way more bits of information than its hash, and thus hash collisions are inevitable.

As to the specific algorithm choice (using the first 80 bytes), I’m not sure why that was chosen. You could imagine lots of algorithms that are potentially better. However, that is a performance concern, not a question of correctness. And performance is tricky:

Hash calculation is meant to be cheap, so it makes sense not to look at all the data.
On modern CPUs non-sequential memory accesses are way more expensive than sequential ones, so it makes sense not to jump around too much.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

1