## Thursday, August 06, 2015

### How hard would this cipher be for Eve to crack?

I've been looking at old ciphers since I am teaching an HS course on crypto. We've done shift, affine, matrix, Playfair, one-time pad, and Vigenère, and then, noting that in all of the above Alice and Bob need to meet, we did Diffie-Hellman.

The Playfair cipher is interesting in that it gives a compact way to obtain a mapping from pairs of letters to pairs of letters that (informally) LOOKS random. Of course NOW it would not be hard for Alice and Bob to AGREE on a random mapping of {a,b,...,z}^2 to {a,b,...,z}^2 and use that. For that matter, Alice and Bob could agree on a random mapping from {a,b,...,z}^k to {a,b,...,z}^k for some reasonable value of k.

So consider the following cipher with parameter k: Alice and Bob generate a RANDOM 1-1, onto mapping of {a,...,z}^k to {a,...,z}^k. To send a message, Alice breaks it into blocks of k letters and encodes each block using the mapping they agreed on.
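A minimal sketch of the scheme (hypothetical helper names, not a hardened implementation; it is only feasible for tiny k, since the table has 26^k entries):

```python
import itertools
import random
import string

def make_key(k, seed=None):
    """Generate a uniformly random bijection on k-letter blocks (the shared key)."""
    rng = random.Random(seed)
    blocks = [''.join(t) for t in itertools.product(string.ascii_lowercase, repeat=k)]
    shuffled = blocks[:]
    rng.shuffle(shuffled)
    return dict(zip(blocks, shuffled))

def encrypt(plaintext, key, k):
    """Break the message into k-letter blocks and substitute each via the key."""
    assert len(plaintext) % k == 0
    return ''.join(key[plaintext[i:i + k]] for i in range(0, len(plaintext), k))

def decrypt(ciphertext, key, k):
    """Invert the substitution block by block."""
    inverse = {v: b for b, v in key.items()}
    return ''.join(inverse[ciphertext[i:i + k]] for i in range(0, len(ciphertext), k))
```

For k=2 the key table already has 676 entries; the ciphertext is the same length as the plaintext, which is exactly the property the attacks below exploit.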

Question One: How big does k have to be before this is impractical for Alice and Bob?

Question Two: If k=1 then Eve can easily break this with frequency analysis of single letters. For k=2 Eve can easily break this with frequency analysis of digraphs (pairs of letters). Is there a value of k beyond which the distribution of k-grams is no longer useful? As k goes up, does it get easier or harder to crack? I originally thought harder, which is why I thought this would be a good code, but frankly I don't really know. Looking over the web it is EASY to find frequencies of letters and digraphs, but very hard to find any source for frequencies of (say) 10-grams. So at the very least Eve would need to build up her own statistics --- but of course she could do this.

SO, here is the question: Is there a value of k such that this code is easy for Alice and Bob but hard for Eve TODAY? How about X years ago?

1. About whether it gets easier or harder as k goes up: I would assume it only gets easier. While it is true that looking up statistics becomes harder after a certain point, it is also true that the entropy per block goes down very rapidly. For instance, I would assume that the trigram "xkt" never appears in an English text.

I should also say that I did frequency analysis for real as a child, on the encrypted (k=1) notes that my sister used to write (what can I say, I was a nosy kid :-)) ). In my experience it was easy to match the more frequent letters (the notes were in Farsi) but very hard to match the less frequent ones. From there I used my intuition about two-letter words and so on to recover the whole translation table.

I should also mention that, at the time, the Internet was not widely accessible, so I just wrote a program that read two texts and tried to match up the letters according to their frequencies. Nowadays you can very easily write a program that reads Wikipedia and generates the frequency table.
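Such a frequency-table builder is only a few lines today. A sketch (hypothetical function name; it assumes the corpus is already a plain-text string):

```python
from collections import Counter

def kgram_frequencies(text, k):
    """Estimate relative frequencies of overlapping k-grams in a corpus,
    keeping letters only and lowercasing, as classical frequency tables do."""
    letters = ''.join(ch for ch in text.lower() if ch.isalpha())
    counts = Counter(letters[i:i + k] for i in range(len(letters) - k + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}
```

Eve would run this over a large public corpus, sort both her table and the ciphertext's k-gram counts by frequency, and match them up, which is exactly the program described above.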

Therefore, I would guess that, given a long enough encrypted text, it becomes easier to extract the translation table as k goes up, because the entropy decreases and so distinguishing frequent patterns from infrequent ones becomes easier (the same way compression works better when you can compress longer words).

2. > Question One: How big does k have to be before this is impractical for
> Alice and Bob?

Define impractical.

The key length is log_2((26^k)!) bits, which is more than k * 26^k ~ k * 2^{4.7k}. This becomes inefficient for k=3, unpleasant for k=9, and ridiculous for k=15.
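As a sanity check on these numbers, log_2((26^k)!) can be computed without forming the huge factorial, e.g. via the log-gamma function (a quick sketch, hypothetical function name):

```python
import math

def key_bits(k, alphabet=26):
    """Bits needed to specify a uniformly random permutation of all
    k-letter blocks: log2((alphabet^k)!), via lgamma(n+1) = ln(n!)."""
    n = alphabet ** k
    return math.lgamma(n + 1) / math.log(2)
```

key_bits(1) is roughly 88 bits (already a third of an AES-256 key), and key_bits(3) is already over 200,000 bits.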

In contrast, AES has a key length of 256 bits (if you are paranoid).

> Is it secure?

Note first that it is trivially breakable under chosen-plaintext attacks, and probably under known-plaintext attacks as well for reasonable values of k (e.g., k=8 or k=9). Beyond that, I would start by looking at (bi-)word frequencies rather than letter frequencies.

It sounds strong, and certainly for large k there are long messages for which it is impossible to determine the exact string with better than exponentially small probability (i.e., strings in which no k-tuple is repeated). However, the security is significantly weaker than it may first appear. Part of the problem is that as the key gets longer you need more ciphertext to mount a ciphertext-only attack, and it may seem hard to imagine that much ciphertext when you are considering encoding English-language text.

However... the security should not depend on the type of data you are sending, and when you convert this to encrypting bit strings, problems start to become obvious for certain types of data. Take the obvious extension to bit strings, where you map strings of k bits onto other strings of k bits. Since there are (2^k)! such mappings, you cannot reasonably go above a block size of about 4 bytes.
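To see why roughly 4 bytes is the limit: just storing a permutation of 2^k blocks takes k * 2^k bits, before you even worry about generating it uniformly. A quick check (hypothetical function name):

```python
def key_storage_bytes(k):
    """Bytes needed to tabulate a permutation of 2^k blocks:
    2^k table entries of k bits each."""
    return (k * 2 ** k) // 8
```

For k=32 this comes to 2^34 bytes, i.e. 16 GiB of key material, already at the edge of practicality.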

Now, if you take a uniformly random string and encrypt it, there is clearly no way to determine the corresponding plaintext with probability greater than the inverse of the number of keys. One way of seeing why this is true: the randomness in the key maps each input block onto a uniform distribution over the 2^k possible blocks, but the length of the message is unchanged, so the ciphertext effectively carries log_2((2^k)!) fewer bits of information about the plaintext.

However, the encryption scheme has a flaw: whenever I see the same value in two k-bit blocks of the ciphertext, I know that the corresponding plaintext blocks are equal. This can be enough to mount an attack if the plaintext comes from a sufficiently low-entropy source (or, equivalently, if there is a sufficient corpus of ciphertext), so that the length of the message is significantly longer than the length of the key plus the Shannon information of the source. Below is an example of how one might accomplish such an attack.

For simplicity, suppose that k=24 (in which case the key is around 378 million bits long). Now suppose we encrypt an uncompressed high-resolution photograph with 24 bits per pixel. Each pixel gets encoded separately, and pixels with the same colour value will have the same ciphertext value. Since we can expect a reasonably high number of pixels in the same spatial region of the photograph to share a colour, we can infer the dimensions of the image by finding the wrapping of the string that maximizes the spatial clustering of ciphertext pixels of the same value. Furthermore, we know that in most photographs nearby pixels are likely to have similar colours, so we can start to infer the distances between the RGB plaintext values corresponding to different ciphertext values. Once you have approximated these distances, all that is left is to distinguish which extremes correspond to high vs. low values of R, G, and B. But there are only 8 possibilities for this, so you are basically done. Plus, now you have the key, so you can break non-image messages too.

I chose k=24 above because it exactly matches the length of each pixel; however, even if it didn't, this attack would still work provided you consider pairs of ciphertext blocks whose separation in bits is a multiple of 24.
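The leak this attack exploits can be demonstrated in a few lines: identical ciphertext blocks always come from identical plaintext blocks. A sketch (hypothetical function name; block_bytes=3 corresponds to k=24):

```python
from collections import Counter

def repeated_blocks(ciphertext, block_bytes=3):
    """Find blocks that occur more than once in the ciphertext; every
    repeat reveals that the underlying plaintext blocks are equal."""
    blocks = [ciphertext[i:i + block_bytes]
              for i in range(0, len(ciphertext) - block_bytes + 1, block_bytes)]
    counts = Counter(blocks)
    return {b: c for b, c in counts.items() if c > 1}
```

Run on the encrypted photograph, the repeat map is essentially a posterized copy of the image, which is what makes the dimension-guessing step work.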

The above is only a toy example of how one might attack the scheme, but the scheme also lacks an important form of security: to be truly secure, one would want it to be hard to distinguish between the encryptions of any two chosen plaintext strings. This is not the case for your scheme: for example, the ciphertexts corresponding to the messages [AAA...A][BBB...B] and [AAA...A][AAA...A] (where [ and ] indicate the block boundaries) can be easily distinguished, since the first necessarily has different ciphertext values in the two blocks, whereas the second necessarily has the same value in both. As a result, it is easy to come up with situations where the security falls apart, as above.
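That distinguisher can be shown concretely with a toy k=3 instance (hypothetical names; the key is just a random bijection on 3-letter blocks, as in the scheme):

```python
import itertools
import random
import string

# Toy k=3 key: a random bijection on 3-letter blocks.
rng = random.Random(0)
blocks = [''.join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]
image = blocks[:]
rng.shuffle(image)
key = dict(zip(blocks, image))

def enc(msg):
    """Encrypt a message whose length is a multiple of 3."""
    return ''.join(key[msg[i:i + 3]] for i in range(0, len(msg), 3))

def eve_guess(ct):
    """Eve's distinguisher: equal ciphertext blocks imply equal plaintext
    blocks, because the substitution is a deterministic bijection."""
    return "aaaaaa" if ct[:3] == ct[3:6] else "aaabbb"

assert eve_guess(enc("aaaaaa")) == "aaaaaa"  # repeated block leaks through
assert eve_guess(enc("aaabbb")) == "aaabbb"  # distinct blocks stay distinct
```

Eve never learns the key here; the deterministic block structure alone lets her win the distinguishing game every time.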