Can I convert long binary strings into unique serial numbers of much shorter length?
Show older comments
I’m generating a very large number of long binary strings, for example:
S1 = [0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 …]
S2 = [1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 …]
S3 = …
Sometimes these strings are thousands of elements long and every time I generate a new string I would like to make sure that it is unique and has never been generated before. I could do this by simply storing all strings that has ever existed and using a function like ismember, isequal or something to compare every string to all previous strings, however, this takes a very long time and quickly becomes unfeasible as the list grows.
So I was wondering if there was a faster way of doing this. For example, could I convert every string into a unique serial number of much shorter length somehow so that S1 instead of being represented as a string became the number: 94237 (there is no logic behind this example number) and S 2 some other reasonably short number which perhaps would be much faster to compare uniqueness against than a long string? Is this possible and are there clever conventional ways of converting binary strings into unique serial numbers of minimal length? The serial number does not need to be backwards convertible into the string, all I need is to determine whether or not it is unique.
2 Comments
Pick a hash function that suits your data and requirements:
Note that any hash loses information: hash functions may allow collisions between different data by giving them the same hash value.
Petter Stefansson
on 6 Sep 2016
Accepted Answer
More Answers (1)
You said that you generate the numbers, but you do not mention any constraints on the generation. Without any constraints, just pick a number and append 1 to ensure that the numbers are unique.
If you do have constraints, it would be good to know them. hash functions are fine, but you cannot be 100% sure that there are no collisions, i.e., that you do not generate the same hash for the different inputs. It is just that the hash function tries so minimise the probability of such collisions.
5 Comments
Petter Stefansson
on 6 Sep 2016
Thorsten
on 6 Sep 2016
You can generate your first number of thousands of 1's and 0's
s
Then you generate your next unique number
s 1
and the next
s 1 1
and so one. All number are unique. And if you like you can randomly append 0 and 1.
Petter Stefansson
on 6 Sep 2016
That's exactly what I've suggested. You didn't mention that you need millions of sequences. Do you know in advance an upper limit of the number of sequences you need?
Then you could generate these N unique sequences in decimal reprentation using
seq = randperm(N);
and convert to binary
n = ceil(log2(N))
binseq = dec2bin(seq, n);
and pick from the precomputed list of unique binary sequences, one after the other.
Categories
Find more on Data Type Conversion in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!