In a word: Moof.



The collected conjectures and observations of one (1) Colin Slater.

Not-so-simple Substitution Cipher

20 Jul 2005, 11:49AM

Yesterday I encountered a friend of mine working on an interesting problem. He had been given a few files full of hex bytes (in ASCII,) encrypted with a simple mono-alphabetic substitution cipher. Typically a substitution cipher is something to be laughed at when it comes to security, but there was a slight twist on this one. The encrypted data was not English text, but a windows .dll.

The normal way to attack a substitution cipher is to create a frequency distribution of the cipher text and the "probable plaintext", and correlate the results. This works well for English text, because there are few possible characters, and different letters have significantly different frequencies. But this was a binary file full of non-ascii "garbage", how could we solve this?

To start with, I implemented a simple frequency distribution in ruby. I had not used ruby too much before, so I thought I'd try it out. The result was 7 lines of code. I was impressed. I'll quote it here just for those who haven't looked at ruby any:

freq = Array.new(256, 0)
total =  0

File.open(ARGV[0],"r") do |cyphertext|
  cyphertext.each(" ") { |num| freq[num.hex] = freq[num.hex] + 1
                          total = total + 1}
end

freq.each_index { |i| print i, ": ", freq[i].to_f/total, "\n" }

Ruby is not another java/C/C++ clone, nor is it some conceptually challenging functional language. There are some nice features that make it very powerful, but also very easy to write code quickly with. I spent far more time looking at the problem instead of looking at the code, which would not have been the case with my favored language C.

So I had my two frequency distributions; now I had to somehow correlate them. I have no experience with statistics, and that's a problem. My frequency tables gave me the number of occurences of each character, as a percentage of the total number of characters. My correlation code took the ciphertext table and the table from an unencrypted dll, and matched up any pairs that had similar percentages. This worked rather poorly. The frequencies of many bytes are very close, making it very difficult for me to tell what's statistically significant and what is not. There has to be a better way to do this. I'm sure someone with more experience than I could point out one.

The frequency distribution failed, now I was back to looking for patterns by hand. The header of the dll gave me a few known cyphertext-plaintext pairs, maybe there was a pattern in them. I tried xor'ing them, adding them, using a list of 0->255,1->254,..., anything I could think of. No luck.

You probably expect I'm going to explain how I finally solved the problem. I wouldn't be writing this if there wasn't a happy ending, right? Nope, I'm stumped. I'm still looking for possible solutions; I'll try whatever anyone suggests.

——cs

Comments / Trackbacks

TB Url

Gonna post any of the files?

Posted by Brian Beck on 20 Jul 2005, 1:50PM.

What a novel and very overlooked idea by me. Here It Be.

Posted by Colin on 20 Jul 2005, 2:13PM.

Leave a Comment




Remember me?

(You may use HTML tags for style)