The goal of this file is to decode messages based on a simple letter-swapping cypher. For example, using the swapping rule of replacing each letter with the one ahead of it (i.e. A -> B, B -> C, ..., Z -> A), the word STATISTICS would be coded as TUBUJTUJDT. A brute force attack would require attempting all 26! letter-swapping combinations, so instead we will try a random letter-swapping rule and then see how well it compares to the transition probabilities of the stochastic process of letters making up English words. First, though, we need an empirical estimate of the transition probabilities of letters in the Markov chain of word building, based on the frequency of words in English. Put another way, if I know I have the letters TH, the probability of the next letter being a Z, F, or B is very small, but the probability of the next letter being a vowel (A, E, I, O, U, Y) is very high. We use the transition probabilities for letters in English words to devise a stochastic process for English writing. We can't just use a list of unique English words; instead we need to weight the list of English words by how often they occur. This is inherently different from a spell-check algorithm, which just needs to reference a word in a long list of words where the frequency of usage is not important.
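As a quick aside, the shift-by-one encoding rule above can be applied in one line with base R's chartr(); this is just a sketch of the cypher, not part of the decoding pipeline:

```r
# Encode a word under the shift-by-one rule (A -> B, ..., Z -> A) using
# chartr(), which maps each character in `old` to the character at the
# same position in `new`:
old <- paste(LETTERS, collapse = "")             # "ABC...Z"
new <- paste(LETTERS[c(2:26, 1)], collapse = "") # "BCD...ZA"
chartr(old, new, "STATISTICS")                   # "TUBUJTUJDT"
```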
How well the phrase we want to decode matches these letter transitions gives us a probabilistic model. The first part of this document outlines how to acquire the transition probabilities from a large training set of letters in words in English writing. We will then use the empirical transition probability matrix to decode a phrase obtained from http://cryptogramcorner.org on Thu Feb 19, 2015.
coded = " PKT YTUJ MHSN JMCU FMHIY JK INHFM JMN WCXXHFZN. CJ CU YHEN KL NSNIPJMCXD HXE KL XKJMCXD, JMN UJICSCXD RCZZ, JMN ZKKA, JMN RHZA, JMN WIKWKIJCKXU KL JMN GKEP, JMN UKTXE KL JMN SKCFN, JMN NHUN KL JMN DNUJTINU. UHIHM GNIXMHIEJ "
print(coded)
## [1] " PKT YTUJ MHSN JMCU FMHIY JK INHFM JMN WCXXHFZN. CJ CU YHEN KL NSNIPJMCXD HXE KL XKJMCXD, JMN UJICSCXD RCZZ, JMN ZKKA, JMN RHZA, JMN WIKWKIJCKXU KL JMN GKEP, JMN UKTXE KL JMN SKCFN, JMN NHUN KL JMN DNUJTINU. UHIHM GNIXMHIEJ "
This stochastic-process-based text decoding is based on the method and code from http://alandgraf.blogspot.ca/2013/01/text-decryption-using-mcmc.html?m=1, also re-posted here: http://www.r-bloggers.com/text-decryption-using-mcmc/. In that method they determine the one-step-ahead transition probabilities for letters by scanning the book 'War and Peace'. The original version referred to above uses the 27 x 27 matrix (729 matrix elements) of letter transition probabilities (where the 27th 'letter' is actually a space, period, comma, etc.). However, many longer letter combinations are very common; for example, common word endings include ED, LY and ING. Consequently, in this code I use the (much slower) (27 x 27 = 729) x (27 x 27 = 729) matrix of two-letter (dyad) transition probabilities, with 729^2 = 531,441 matrix elements, of which only 27^3 = 19,683 are structurally possible. While the original code examines a Markov chain where the next letter in sequence depends only on the previous letter, naively extending the matrix risks losing that Markov property. However, the problem can be reformulated as a Markov process with increased accuracy by effectively taking the dyad of two letters as the history and then predicting the third letter in the sequence: set up the 729 x 729 transition matrix as moving from (letter i-1, letter i) to (letter i, letter i+1). In that way the row and column states are (AA), (AB), (AC), ..., (_ _), where _ is used for the punctuation item. This code is much slower because I set it up with nested for loops.
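To make those sizes concrete, here is a quick count of the dyad state space; "_" is used here only as a stand-in label for the lumped punctuation character:

```r
# Enumerate the dyad states described above.
chars  <- c(LETTERS, "_")
states <- as.vector(outer(chars, chars, paste))
length(states)  # 729 dyad states, hence a 729 x 729 transition matrix
729^2           # 531441 matrix elements in total
27^3            # only 19683 (dyad -> dyad) transitions are structurally possible
```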
As an additional modification, since the book 'War and Peace' contains many Russian names and some French phrases, I used a different set of texts. In this example the books used as training data were 'A Tale of Two Cities', 'Moby Dick', 'The Adventures of Sherlock Holmes', and 'Pride and Prejudice', all retrieved from Project Gutenberg: http://www.gutenberg.org/ebooks/search/?sort_order=downloads A large sample of words is needed to estimate the transition probabilities.
Load the data and then convert all letters to upper case:
setwd("/Users/iamdavecampbell/Dropbox/Stat380-2015/Text Decode")
reference=readLines("Books.txt",encoding="UTF-8")
reference=toupper(reference)
Start with a 729 x 729 matrix indexed by the dyads of the 27 characters (A to Z plus punctuation), to be filled in from over 60,000 lines of text. Note that there are a lot of strange formatting punctuation marks, including quotes and slashes, so they are all lumped together as a common 27th character.
LetterDyadSet = NULL
characters2use = c(toupper(letters),"")
for (lp in 1:length(characters2use)) {
  LetterDyadSet = c(LetterDyadSet, paste(characters2use[lp], characters2use))
}
possible.moves=matrix(0,nrow=27*27,ncol=27*27,dimnames=list(LetterDyadSet,LetterDyadSet))
Note that I start with a value of 1 in all structurally possible locations, so as to avoid zero probabilities (for convenience, in case we're decyphering strange text with unusual, never-before-seen words). For example, going from (A,P) to (D,P) is not possible, because a move always goes from (TwoBackLetter, LastLetter) to (LastLetter, CurrentLetter). This adds a block-diagonal set of non-zero elements.
Twobackletter = ""
lastletter = ""
for (Twobackletter in characters2use) {
  for (lastletter in characters2use) {
    BackDyad = paste(Twobackletter, lastletter)
    for (curletter in characters2use) {
      currDyad = paste(lastletter, curletter)
      possible.moves[dimnames(possible.moves)[[1]] == BackDyad,
                     dimnames(possible.moves)[[2]] == currDyad] = 1
    }
  }
}
trans.mat = possible.moves
Now sweep through the text keeping track of the current letter and the last two letters and build up the empirical stochastic process based on the text.
# This segment of code is modified from
# <http://alandgraf.blogspot.ca/2013/01/text-decryption-using-mcmc.html?m=1>
# to accommodate the dyad letter structure.
Twobackletter = ""
lastletter = ""
for (ln in 1:length(reference)) {
  if (ln %% 2500 == 0) { cat("Line of text", ln, "\n") }
  for (pos in 1:nchar(reference[ln])) { # position within a line for the current letter
    curletter = substring(reference[ln], pos, pos)
    currDyad = paste(lastletter, curletter)
    BackDyad = paste(Twobackletter, lastletter)
    # diagnostic help:
    # if (ln %% 500 == 0) { cat("Line of text", ln, "\n")
    #   print(paste(BackDyad, currDyad))
    # }
    if (currDyad %in% LetterDyadSet) {
      trans.mat[dimnames(trans.mat)[[1]] == BackDyad,
                dimnames(trans.mat)[[2]] == currDyad] =
        trans.mat[dimnames(trans.mat)[[1]] == BackDyad,
                  dimnames(trans.mat)[[2]] == currDyad] + 1
      Twobackletter = lastletter
      lastletter = curletter
    } else { # the current letter is a punctuation mark: lump it into the 27th character
      curletter = ""
      currDyad = paste(lastletter, curletter)
      trans.mat[dimnames(trans.mat)[[1]] == BackDyad,
                dimnames(trans.mat)[[2]] == currDyad] =
        trans.mat[dimnames(trans.mat)[[1]] == BackDyad,
                  dimnames(trans.mat)[[2]] == currDyad] + 1
      Twobackletter = lastletter
      lastletter = curletter
    }
  }
}
## Line of text 2500
## Line of text 5000
## Line of text 7500
## Line of text 10000
## Line of text 12500
## Line of text 15000
## Line of text 17500
## Line of text 20000
## Line of text 22500
## Line of text 25000
## Line of text 27500
## Line of text 30000
## Line of text 32500
## Line of text 35000
## Line of text 37500
## Line of text 40000
## Line of text 42500
## Line of text 45000
## Line of text 47500
## Line of text 50000
## Line of text 52500
## Line of text 55000
## Line of text 57500
## Line of text 60000
## Line of text 62500
# Now to normalize the whole thing to make sure that the row sums are equal to 1
trans.prob.mat = sweep(trans.mat,1,rowSums(trans.mat),FUN="/")
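As a quick sanity check on the sweep() call, here is the same row-normalization applied to a toy count matrix (made up purely for illustration):

```r
# Row-normalize a small count matrix, exactly as done for trans.mat above:
# each row is divided by its sum, turning counts into conditional probabilities.
m <- matrix(c(2, 2,
              1, 3), nrow = 2, byrow = TRUE)
p <- sweep(m, 1, rowSums(m), FUN = "/")
rowSums(p)  # each row now sums to 1
```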
One potential problem is that line starts often have strange punctuation and are therefore treated as though they begin with two punctuation items, so in the end a pair of punctuation items could mean the start of a sentence, a new line, or an element of a list. Words which can start (or end) a sentence are different in essence from those confined to the middle. This will lead to poor estimates, apart from the insight it gives into letters that start words. Additionally, in most cases triples or quadruples of letters would be more informative, because, for example, 'ING' almost surely occurs at the end of a word.
The 729 x 729 matrix with logical constraints is too big and uninformative to visualize directly.
Instead we can look at some individual relationships and transitions. What are the most likely transitions from (T,H), (A,N), (R,E), or (I,N)?
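That lookup amounts to indexing a row of trans.prob.mat by its dyad name and sorting. As a self-contained sketch, here is a miniature version built from a short made-up training string ("_" standing in for the lumped punctuation character); the full version would index trans.prob.mat the same way:

```r
# Miniature dyad transition table: which dyads tend to follow "T H"?
txt <- "THE CAT AND THE HAT THEN SAT"
sym <- strsplit(txt, "")[[1]]
sym[!sym %in% LETTERS] <- "_"                  # lump non-letters together
n    <- length(sym)
back <- paste(sym[1:(n - 2)], sym[2:(n - 1)])  # state (letter i-1, letter i)
curr <- paste(sym[2:(n - 1)], sym[3:n])        # state (letter i, letter i+1)
tab  <- table(back, curr)
tab["T H", ] / sum(tab["T H", ])  # in this toy text, "H E" always follows "T H"
```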
The goal from here is to use MCMC: initially propose a cypher, then test how reasonable it is based on the transition probability matrix that we built above. We then propose altering the cypher and see if the result is any more likely. If it is more likely, we definitely keep the proposed altered cypher; if it's worse, we accept it with some probability. In the end we want a distribution of decoded messages so we can pull out the most likely and/or some nearby, or even equivalently good, decodings. If we didn't do this probabilistically, we would need to actually try all the cyphers and read all the messages. That would surely be painful.
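That accept/reject step is the standard Metropolis rule. As a sketch (the MCMC loop below implements the same comparison inline rather than as a separate function):

```r
# Metropolis acceptance: a more likely proposal is always accepted, a worse
# one is accepted with probability exp(prop.loglike - cur.loglike).
accept <- function(prop.loglike, cur.loglike) {
  runif(1) < exp(prop.loglike - cur.loglike)
}
accept(-5, -10)  # TRUE: the proposal is more likely, so it is always accepted
```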
Note that we set up a stochastic process on the letters by building up a transition probability matrix. Now we are going to use that to see if we can decode our message by finding the most probable mapping of letters to fit with the probability model in our transition probability matrix. To do that we will put the letter mapping through another stochastic process (MCMC) which will run a random walk to give us a sample from the limiting distribution of reasonable mappings. From there we will have the ability to (probabilistically) decode our message.
# The code from <http://alandgraf.blogspot.ca/2013/01/text-decryption-using-mcmc.html?m=1>
# also provides a decoding function (decode) based on the substitution
# rule and a likelihood function sum(log.prob) based on the stochastic
# process. I've modified their log.prob function to deal with the
# dyad structure.
decode <- function(mapping, coded) {
  coded = toupper(coded)
  decoded = coded
  for (i in 1:nchar(coded)) {
    if (substring(coded, i, i) %in% toupper(letters)) {
      substring(decoded, i, i) = toupper(letters[mapping == substring(coded, i, i)])
    }
  }
  decoded
}
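To check the mapping convention (mapping[i] is the coded letter that stands for the plain letter letters[i]), note that for a permutation this decoding is equivalent to a single chartr() call; here with the shift-by-one cypher from the introduction:

```r
# For the shift-by-one cypher, plain "a" is coded as "B", so mapping[1] = "B":
mapping <- LETTERS[c(2:26, 1)]
coded   <- "TUBUJTUJDT"
chartr(paste(mapping, collapse = ""),
       paste(LETTERS, collapse = ""), coded)  # "STATISTICS"
```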
log.prob <- function(mapping, decoded) {
  logprob = 0
  Twobackletter = ""
  lastletter = ""
  for (i in 1:nchar(decoded)) {
    curletter = substring(decoded, i, i)
    if (curletter == " ") { curletter = "" }
    currDyad = paste(lastletter, curletter)
    BackDyad = paste(Twobackletter, lastletter)
    if (currDyad %in% LetterDyadSet) {
      logprob = c(logprob, log(trans.prob.mat[rownames(trans.mat) == BackDyad,
                                              colnames(trans.mat) == currDyad]))
      Twobackletter = lastletter
      lastletter = curletter
    } else {
      # print(paste(BackDyad, currDyad))
      curletter = ""
      currDyad = paste(lastletter, curletter)
      logprob = c(logprob, log(trans.prob.mat[rownames(trans.mat) == BackDyad,
                                              colnames(trans.mat) == currDyad]))
      Twobackletter = lastletter
      lastletter = curletter
    }
  }
  # if the method fails, just return some really big negative number.
  if (is.nan(sum(logprob)) || is.infinite(sum(logprob))) { logprob = -1000000000 }
  logprob
}
For reproducibility, I will set the random seed:
set.seed(1123)
Now set up a starting point and run the MCMC:
# Using code modified
# from <http://alandgraf.blogspot.ca/2013/01/text-decryption-using-mcmc.html?m=1>
# to accommodate the letter dyads and to keep decoded values and
# mappings.
mapping = sample(toupper(letters))
iter = 1
Niters = 50000
mapping = matrix(mapping, nrow = 26, ncol = Niters + 1)
cur.loglike = rep(0, Niters + 1)
cur.decode = matrix(rep(decode(mapping[, iter], coded), Niters + 1),
                    nrow = Niters + 1, ncol = nchar(decode(mapping[, iter], coded)))
cur.loglike[1] = sum(log.prob(mapping[, 1], cur.decode[1]))
max.loglike = cur.loglike[1]
max.decode = cur.decode[1]
print("Starting with the initial mapping and attempted decoded message:")
## [1] "Starting with the initial mapping and attempted decoded message:"
print(rbind(from=mapping[,1],to=letters))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## from "T" "V" "U" "M" "S" "D" "X" "N" "Q" "F" "J" "P" "G"
## to "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## from "Y" "O" "E" "W" "Z" "B" "I" "L" "H" "K" "R"
## to "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [,25] [,26]
## from "C" "A"
## to "y" "z"
cat(1,"Current log likelihood: ", cur.loglike[1], "and decoded message: ", cur.decode[1],"\n")
## 1 Current log likelihood: -1144 and decoded message: LWA NACK DVEH KDYC JDVTN KW THVJD KDH QYGGVJRH. YK YC NVPH WU HEHTLKDYGF VGP WU GWKDYGF, KDH CKTYEYGF XYRR, KDH RWWZ, KDH XVRZ, KDH QTWQWTKYWGC WU KDH MWPL, KDH CWAGP WU KDH EWYJH, KDH HVCH WU KDH FHCKATHC. CVTVD MHTGDVTPK
while (iter <= Niters) {
  # select letters to switch (move 1 pair at a time)
  proposal = sample(1:26, 2)
  prop.mapping = mapping[, iter]
  prop.mapping[proposal[1]] = mapping[proposal[2], iter]
  prop.mapping[proposal[2]] = mapping[proposal[1], iter]
  prop.decode = decode(prop.mapping, coded)
  prop.loglike = sum(log.prob(prop.mapping, prop.decode))
  if (runif(1) < exp(prop.loglike - cur.loglike[iter])) {
    # accept: always when the proposal is more likely, otherwise with
    # probability exp(prop.loglike - cur.loglike)
    mapping[, iter + 1] = prop.mapping
    cur.decode[iter + 1] = prop.decode
    cur.loglike[iter + 1] = prop.loglike
    if (cur.loglike[iter + 1] > max.loglike) {
      max.loglike = cur.loglike[iter + 1]
      max.decode = cur.decode[iter + 1]
    }
  } else {
    # reject: carry the current state forward
    cur.decode[iter + 1] = cur.decode[iter]
    cur.loglike[iter + 1] = cur.loglike[iter]
    mapping[, iter + 1] = mapping[, iter]
  }
  if ((iter %% (Niters / 20)) == 0) {
    cat("iteration:", iter, "Log likelihood: ", cur.loglike[iter],
        "and decoded message: ", cur.decode[iter], "\n")
  }
  iter = iter + 1
}
## iteration: 2500 Log likelihood: -634.3 and decoded message: YOU MUHA RIVE ARCH BRISM AO SEIBR ARE PCNNIBLE. CA CH MIGE OF EVESYARCND ING OF NOARCND, ARE HASCVCND WCLL, ARE LOOK, ARE WILK, ARE PSOPOSACONH OF ARE TOGY, ARE HOUNG OF ARE VOCBE, ARE EIHE OF ARE DEHAUSEH. HISIR TESNRISGA
## iteration: 5000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 7500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 10000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 12500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 15000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 17500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 20000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 22500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 25000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 27500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 30000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 32500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 35000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 37500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 40000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 42500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 45000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 47500 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
## iteration: 50000 Log likelihood: -406.6 and decoded message: YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT
In my case it was very fast to solve the puzzle: " YOU MUST HAVE THIS CHARM TO REACH THE PINNACLE. IT IS MADE OF EVERYTHING AND OF NOTHING, THE STRIVING WILL, THE LOOK, THE WALK, THE PROPORTIONS OF THE BODY, THE SOUND OF THE VOICE, THE EASE OF THE GESTURES. SARAH BERNHARDT "
I ran the algorithm for 50,000 iterations. We can assess the performance by watching how the letter mapping moves around the probability space. Higher values indicate a more probable letter mapping, given the stochastic process that we devised above and the ordering of the letters in the message that we are trying to decode. Initially the mapping is not good, so the decoding is really poor. Quickly, though, the mapping of letters moves to a much better location and the probability of the decoded message increases. The following plots show that the decoding (MCMC) stochastic process has to travel through transient states for about 5,000 iterations until it reaches the recurrent part of the mapping states (i.e. the limiting distribution of the mapping). Recall that there are now two layers of stochastic process: the one we used to determine the transition probabilities for going from (2 letters back, 1 letter back) to (1 letter back, current letter), and the one that we are using for the MCMC in order to find a good mapping to decode the message.
We can use iterations 10,000 to 50,000 (the recurrent part of the MCMC) to build up a probability model for the decoding cypher. We can also look at the probability of the decoded mapping scheme; darker squares indicate higher-probability events. Ultimately we get a distribution of possible decodings. Until we get a new message to decode, we won't be able to refine the mapping further, because there is little information about letters that do not appear in the decoded message, such as J, Z, X, and Q; as a result those decoded letters end up with equal probability of being the coded letters V, Q, O, and B. It looks like the distribution of possible decodings in this case contains a single possible message.
library(ggplot2)
library(reshape)
mapmatrix = matrix(0, nrow = 26, ncol = 26, dimnames = list(letters, toupper(letters)))
for (from in 1:26) {
  fromLetter = letters[from]
  for (to in 1:26) {
    toLetter = toupper(letters)[to]
    mapmatrix[fromLetter, toLetter] = sum(mapping[from, 10000:iter] == toLetter) / (iter - 10000)
  }
}
ggplot(melt(mapmatrix), aes(X1, X2, fill = value)) +
  geom_tile(colour = "grey") +
  scale_fill_gradient(low = "white", high = "blue") +
  xlab("To decoded letter") + ylab("From coded letter")