This page has not been reviewed by our documentation team (more info).

Find Duplicate Music Files

FDMF is an audio fingerprinting system similar to TRM. It could potentially replace TRM (see GettingRidOfTRM for more information about why we would want to do this).

The following is from [WWW] an email sent to the DevelopersMailingList:


The short version is that I think that the algorithm used by fdmf is a viable replacement for TRMs. As for the details..... well that's the long version. :)

Overview of fdmf

fdmf is a GPL'd package available from [WWW] http://www.w140.com/audio/. It is designed to Find Duplicate Music Files (hence the less than catchy name), and a large part of that task involves fingerprinting songs and comparing the fingerprints. In other words, it is does the same thing that TRMs attempt to do. The details of the fingerprint generation are all laid out in the package README, but basically it goes like this: take a binary file, decode it to wav, do some spectral analysis, and then quantize the results into 3 separate 256-bit fingerprints representing different things about the same song. That's right, you end up with *three* fingerprints, for a total fdmf signature of 768 bits. For clarity, I'll use "fingerprint" to talk about one of the 256-bit keys, and "signature" to refer to all 3 fingerprints combined.

Already we can see a couple differences from TRMs. First, signature generation is entirely client side. This is a good thing. Second, signatures are nothing more than a reduction of existing data, meaning that they are neither random nor likely to collide with anything other than a very similar wav. This is also a good thing. Third, the fuzzy logic of the matching is applied at match time, not signature generation time. While this takes more cycles overall and makes matching more expensive, I will later argue that this ends up being a good thing as well.

I might not have made it clear, so let me say this explicitly: like CD Index IDs, fdmf signatures are overly specific. It's quite likely that different binary versions of the same song will have different fdmf signatures. In fact, it's almost certain that there will be a different signature for most (if not every) different binary file out there. (This is in stark contrast to TRMs.) But if things that should match each other will have different signatures, how will the matches be detected?

Like I said, a fdmf signature is actually comprised of three 256-bit fingerprints. These are not random, like TRMs are, but decompositions of the song. In order to tell if two songs are the same, the hamming distance between each of the corresponding fingerprints is computed, and if the distance of all three fingerprints is less than some threshold (call it T), then the songs are considered to be the same. (The hamming distance is the number of bits that are different between two bit strings.) Each of the 3 fingerprints might use a different value for T (remember they measure different things) so T is really an array of 3 thresholds, each ranging from 0-255. Also, just to be clear, T should stay constant once settled upon, or else a match check may return different results.

If T is set to be {0,0,0} then a given fdmf signature will only match itself. Because fdmf signatures are overly specific, that is undesirable. If T is set to be {255,255,255} then a given fdmf signature will match anything. That is obviously bad. So the trick is to figure out the optimal values for T. We would want values to be large enough to recognize different versions of the same song, but not so large as to think that two different songs are the same. The author suggests some values that work for him, but my experiments have show them to be too large.

How well fdmf works

I've run the algorithm over a large, diverse population of music, and I'm happy to say that I didn't waste my time and that it works pretty well. Certainly much better than TRMs. What's more, if we can accept that a logical song might have many fdmf signatures (just like how a logical album might have many CD Index IDs), then we simply set the thresholds low and suddenly the algorithm works extremely well - assuming we have people associating accurate signatures as they find unknown signatures.

But enough qualitative talk.... here are my quantitative metrics:

  1. The time to generate a fingerprint. Because the entire file must be decoded, this depends on the file. The fdmf author claims about 90% of this time is actually spent decoding to wav, but, as that's a required step (unless somebody has a bunch of wav they want to analyze....), then we can't throw it out. As two points of reference, it takes an unloaded 2Ghz celeron about 7 seconds to generate a signature for a 4MB mp3 stored on a local drive. It takes a fully loaded 1.6GHz PPC G5 about 20 seconds to generate a signature for a similar file stored on an nfs mount.

  2. The time to lookup a fingerprint. As described above, looking for a matching fdmf signatures is not nearly the trivial problem that a TRM lookup is. This is because the fuzzy logic is applied at search time. A naive implementation would need to take the search signature and compare it to every known signature, returning which known signatures are under the threshold. Obviously, O(n) algorithms are impractical for 4 million+ signatures. (This is actually a fascinating computer science problem that many of my friends spent way too much time on. If anybody wants to exercise their CS theory, let me know and I'll give you the problem in an abstract wording.) Unfortunately, we couldn't come up with anything that was both accurate and less than O(n).... assuming that a newly added signature has to be included in search results immediately. If that assumption is NOT true - that is, if a signature doesn't have to be included in search results until some time after it is known to the system - then fdmf signatures can be made to simply be a matter of an index lookup in SQL. In other words, cheap, much less than a second, and still on par with TRM lookups. I'll get into the "how" farther down.

  3. How many logically different songs have the same fingerprint. Ideally this would be 1. Out of the 85,000 files I tested against, every file with the same fdmf key also had the same md5 hash, was a different bitrate encoding of the same source wav, or had been severely truncated to be less than 3 seconds in length. The former two is what you'd expect, and the latter is a corner case that I don't think any audio fingerprinting will be to deal with gracefully.

  4. How many logically different songs are returned in a given fingerprint search. Ideally this would be 1. This is completely dependent upon the threshold values chosen. Lower thresholds keep this number low, but at the expense of the next metric. In my sample set, any threshold less than {23,41,22} gave ideal results. Thresholds less than {30,50,30} gave less than 5 incorrect matches, all of them related to various censoring levels of lyrics. It is is expected that as the population of known songs increases, the number of incorrect matches will eventually increase with thresholds too large.

  5. How many logically identical songs are missed by searching for the fingerprint of any binary version of the song. Ideally this would be 0. Again, this is completely dependent upon the threshold values chosen. As the thresholds go up, this number goes down, but at the expense of the previous metric. Unfortunately it is also completely dependent upon the known songs. With my sample set, then, using the thresholds {30,50,30} might have missed up to 36 logical songs, if every song only had a single signature recorded. Using the thresholds of {23,41,22}, up to 168 songs might have been missed.

  6. How many fingerprints are required for logically identical songs. Ideally this would be 1. The fdmf algorithm really has no control over this. Because the signatures are a reduction of the binary files, it is likely that there will be as many signatures as there are binary files. While that sounds daunting, remember that each signature is only 96B in size, and does not need to be stored in memory.

As for the overall requirements:

  • Generating fingerprints must scale. (at least 1,000/sec?)

YES - this is done entirely client-side. Hard to imaging something more scalable than that. :)

  • Looking up fingerprint matches must scale. (at least 1,000/sec?)

YES - although computationally expensive, one-time-only work can be done before the signature is made available for search results. After that work is done, it's a simple index lookup. Both the computation and the SQL lookup can be load balanced over server farms, letting this scale almost infinitely.

  • Storing fingerprints must scale. (hold much more than 4M fingerprints)

YES - even though a signature is only 96B, it's quite likely that every distinct binary file out there will generate its own signature. So let's say an average of 20 signatures over 4 million songs, and we still use less than 10GB of storage. By today's storage standards, that's chump change. Still, if this is too much, a huffman tree should be able to compress a signature a little bit, leading to some unknown (but probably significant) level of space savings at the expense of computation time when submitting a new signature. It *should* go without saying that signatures do not need to be stored in ram... but maybe I need to explicitly say that anyway. :)

  • Free.

  • Open source?

YES - the code is GPL'd. Does that mean it's patent-free? Hell if I know. I'm no lawyer, but I suspect at the very least it means we'd have plausible protection to not get sued before a cease and desist was issued.

fdmf is not perfect. As noted above, it has trouble distinguishing between very short snippets of sound, even though they are obviously different to the ear. fdmf also has trouble differentiating between versions of songs with uncensored vs. clean lyrics, though having a tight enough threshold eliminates this problem. If the thresholds are too tight, then songs that should be considered matches might not be returned as such. This problem could be avoided, but only at the expense of requiring more signature associations to a logical song. That makes this less useful until a large set of signature associations are built up.

fdmf is also not cheap in terms of backend resources. I'll get into how to optimize it down below, but even though it can scale across machines and the expense will be hidden from the end users, it will require O(n) computations for each new signature, where n is the number of known signatures. That will just keep getting more expensive, though at least it will stay linear.

Finally, fdmf has some advantages over TRMs due to the difference in algorithm strategy. Because the fuzzy logic is applied at search time, that means that we can adjust those thresholds without having to recompute signatures. It also means a client could specify the precision of the matching they want to use. Both of these things are impossible with TRMs, and seem to be real steps forward.

How fdmf might work for MB

From an end user perspective, fdmf signatures should be a drop-in replacement for TRMs. The biggest change would probably be the addition of libplot as a requirement. The signatures are obviously longer than TRMs, but even when base64 encoded for UI display, they are only 129 characters long. (If we wanted to shorten that up we could encode the strings with ascii85, but I don't know what that really gains us.)

Things are a little different on the backend. Obviously, the TRM server goes away. But don't get rid of that hardware just yet.... now is as good as time as any to describe how I envision to speed up signature lookups from O(n) to very fast.

The trick is to notice that, as things are today, if a TRM isn't in the system when you try to look it up, MB simply says it doesn't know about it. fdmf signatures could work better, but that requires O(n) computations at lookup time. So instead, this is what could happen. If somebody searches for a signature that the system doesn't know about, the system will simply report there are no matches, even if there might be. At the same time, the system will record the search signature and kick off O(n) computations, recording which known signatures it matches. Once it has walked through all known signatures and recorded the matches, it marks the new search signature as "ready". The next time somebody searches for that signature, all the matches are pre-cached, and the time spent is a simple index lookup.

It's easy to see how this process naturally lends itself to parallel computing. Therefore, even though adding a signature is expensive, it can scale extremely well. The biggest bottleneck will be the database which serves the computing farm, but because most of that load is read-only, it's trivial to scale it across database servers. Not that *that* is cheap either.... but it does scale well.

I've been recording all my data in postgres, so database support shouldn't be a problem. I imagine that MB will want to have a slightly more refined schema than what I've been using. :) I have *not* yet cleaned up the install requirements or packaging of fdmf, so for instance the signature generation is still a perl script which calls decoders and the spline program from the plotutils package, before dumping the key into postgres.... obviously, that's hardly portable. But there's no reason why it shouldn't be easy to make some ansi C tool that calls uses libmad and libplot to do all the same work, and then sends the result to some tcp socket or generic web service. I can't speak for windows, but that would let things run just fine on posix.

Summary

Well, this has been a lengthy email. If you've gotten this far, congratulations. Let me recap.

fdmf.....

  1. appears to be a viable replacement to TRMs.

  2. looks capable of avoiding all the problems TRMs have.

  3. allows us to change the fuzziness of the searching sever side without having to re-compute all signatures, if it turns out our thresholds were set wrong.

  4. allows clients to specify how fuzzy they want their searches to be.

  5. will scale very well, but will probably require more CPU resources than TRMs do.

As for next steps.... I don't know. I assume people would want to try this on their own collections? If so, I can polish up the scripts I've written. Perhaps we want to involve the author of the tool? Should I spend time repackaging the fdmf signature generator into a real posix app?

 
Creative Commons EFF GPL LGPL This material is Open Knowledge Valid XHTML 1.1 Valid CSS 2.0
Original Design|vacubomb.com Contact details Server version: RELEASE-20071014