Disc ID Calculation

cdhand.gif

Let us first have some words about how data is organized on a CD, going top down.

An Audio CD (CD-DA) can hold up to 99 audio tracks.

Sampling is done at a rate of 44.1 kHz using 16 bits resolution per channel, thus there are 44100 x 2 bytes x 2 channels (Stereo) = 176400 bytes of PCM data stored per second.

This audio data is contained in logical blocks of 2352 bytes each on the CD, holding 2352 / 176400 = 1 / 75 seconds of sound.

A logical block plus 882 bytes of error correction and control data forms a raw block of 3234 bytes that is spread among 98 frames of 33 bytes each that are all together written on one spiral track among the CD.

At present we can stop at the level of logical blocks for our task. The deeper stuff will be needed in case the CD-Text format becomes popular, which stores extra data (like artist and track information) in the above mentioned control data parts.

Contents

Reading the TOC

Audio CD

Now let us have a look at a real world CD-DA. Using a tool like cdrecord on this CD will yield:

$ cdrecord dev=/dev/cdrom -toc
...
first: 1 last 6
track:   1 lba:         0 (        0) 00:02:00 adr: 1 control: 2 mode: 0
track:   2 lba:     15213 (    60852) 03:24:63 adr: 1 control: 2 mode: 0
track:   3 lba:     32164 (   128656) 07:10:64 adr: 1 control: 2 mode: 0
track:   4 lba:     46442 (   185768) 10:21:17 adr: 1 control: 2 mode: 0
track:   5 lba:     63264 (   253056) 14:05:39 adr: 1 control: 2 mode: 0
track:   6 lba:     80339 (   321356) 17:53:14 adr: 1 control: 2 mode: 0
track:lout lba:     95312 (   381248) 21:12:62 adr: 1 control: 2 mode: -1

You can see that the CD has 6 audio tracks and the special lead-out track (it has number 170 in the CD TOC). Also note that the LBA (Logical Block Address) offsets start at address 0, but the first track starts actually at 00:02:00 (the standard length of the lead-in track). So we need to add 150 logical blocks to each LBA offset. The resulting data needed to calculate a MusicBrainz disc ID are:

track 1:       150    (150 + 0)
track 2:       15363  (150 + 15213)
track 3:       32314  (150 + 32164)
track 4:       46592  (150 + 46442)
track 5:       63414  (150 + 63264)
track 6:       80489  (150 + 80339)
lead-out track: 95462  (150 + 95312)

Multi-session (audio + data) CD

Because MusicBrainz doesn't include data tracks in Disc IDs, reading the TOC from a multi-session disc is a little more complicated. Running cdrecord on this CD will give us:

$ cdrecord dev=/dev/cdrom -toc
...
first: 1 last 8
track:   1 lba:         0 (        0) 00:02:00 adr: 1 control: 0 mode: 0
track:   2 lba:     13959 (    55836) 03:08:09 adr: 1 control: 0 mode: 0
track:   3 lba:     33436 (   133744) 07:27:61 adr: 1 control: 0 mode: 0
track:   4 lba:     52927 (   211708) 11:47:52 adr: 1 control: 0 mode: 0
track:   5 lba:     65631 (   262524) 14:37:06 adr: 1 control: 0 mode: 0
track:   6 lba:     77742 (   310968) 17:18:42 adr: 1 control: 0 mode: 0
track:   7 lba:     99024 (   396096) 22:02:24 adr: 1 control: 0 mode: 0
track:   8 lba:    125824 (   503296) 27:59:49 adr: 1 control: 6 mode: 1
track:lout lba:    188333 (   753332) 41:53:08 adr: 1 control: 6 mode: -1

This is an example of a CD with an extra track of data (what you know as CD-ROM), marketed in this case as CD-Extra, featuring a video and some pictures. (More precise: this is disc with two sessions, audio and data)

This CD has 8 tracks, but only the first 7 audio tracks should be used to calculate a MusicBrainz disc ID. The problem is that we can't use the offset of track 8 as the "lead-out track" offset, because there is a gap between the audio session and the data session. This gap is 11400 frames long (11250 frames for lead-out/lead-in + 150 frames of pre-gap), so we need to substract this value from the offset of track 8 to get the end of track 7. The result is:

track 1:       150    (150 + 0)
track 2:       14109  (150 + 13959)
track 3:       33586  (150 + 33436)
track 4:       53077  (150 + 52927)
track 5:       65781  (150 + 65631)
track 6:       77892  (150 + 77742)
track 7:       99174  (150 + 99024)
lead-out track: 114574 (150 + 125824 - 11400)

Calculating the Disc ID

The CD Index algorithm simply takes the following pieces of data and runs them through the SHA-1 hash function:

Before the data is fed through the SHA-1 hash, it is converted to upper-case hex ASCII using printf("%02X", value); for single-byte values and printf("%08X", value); for 4-byte values.

Code is a better definition than English, so here is the code that calculates the DiscID:

sprintf(temp, "%02X", pCDInfo->First­Track);
sha_update(&sha, (unsigned char*) temp, strlen(temp));

sprintf(temp, "%02X", pCDInfo->Last­Track);
sha_update(&sha, (unsigned char*) temp, strlen(temp));

for (i = 0; i < 100; i++) {
    sprintf(temp, "%08X", pCDInfo->Frame­Offset[i]);
    sha_update(&sha, (unsigned char*) temp, strlen(temp));
}
sha_final(digest, &sha);

Note that the lead-out track is stored in pCDInfo->Frame­Offset[0].

The resulting 20 byte SHA-1 signature is converted to a base 64 encoded character ASCII string that is the disc ID. One uses base 64 encoding to map an arbitrarily long string of bytes onto a string of printable ASCII characters. It does this by redistributing the 24 bits of three 8-bit codes each into four 6-bit codes from a table of 64 very common ASCII characters, which serve as digits in a base 64 number system. Thus we end up with an ID string of 28 characters, in the above audio CD example with 49HHV7Eb8UKF3aQiNmu1GR8vKTY-.

The astute reader might observe that the aforementioned 20 byte SHA-1 signature contains 160 bits, and so therefore will always be representable with 27 digits in a base 64 number system. (To convince yourself of this, consider that 64^27 > 2^160). Why, then, does a MusicBrainz DiscID contain 28 characters? The answer lies in the fact that the algorithm MusicBrainz uses for encoding a string of bytes, like the 20 byte SHA-1 signature, into this special ASCII string works on blocks of three bytes at a time. It converts each block of three bytes into 4 ASCII characters. A tricky situation arises if the encoding algorithm is asked to encode a string of n bytes, where n is not a multiple of three. In this case, the first floor[n/3] three-byte-long blocks are encoded as described in the previous paragraph, but the last block of bytes contains less than three bytes (it contains either 1 byte or it contains 2 bytes). To understand how the encoding algorithm deals with this pathological block of bytes, it is useful to look at the code in base64.c. The moral of the story is that a special character (MusicBrainz uses "-" as this special character) is needed, in addition to the 64 base-64 digits, in order to be able to encode a block of 1 or 2 bytes. Because the SHA-1 signature is 20 bytes long, and 20 = 2 (mod 3) the final block of bytes contains just 2 bytes, which, for reasons that are fully revealed in base64.c, means that the last character in every MusicBrainz DiscID is "-".

Note: This base 64 string is not the same one specified in RFC822. The RFC822 spec uses +, /, and = characters, all of which are special HTTP/URL characters. To avoid the problems with dealing with that, I (Rob) used ., _, and -. For details on this, please refer to base64.c in the libdiscid source code.

This scheme has the advantage of being very simple (simple to understand, simple to implement) and it is not ambiguous. However, two different pressings of the same disc may have different IDs. To handle this case, the MusicBrainz system will let a user check to see if the CD already exists in the system under a different ID. If so, the system creates a new association for the different pressing of the same CD.

If you'd like to know more about this, please download the client source code and check it out. The code is clean and self documenting.

If you are interested in creating other MusicBrainz clients and need the SHA-1 source code, you can either dig through the CD Index source, or check the W3C page on SHA-1.


Links