Crick's Comma Free Genetic Code

How it works and how it gives 20 amino acids

Apr 24, 2025

Crick’s Comma Free Genetic Code: Crick suggested the "adaptor hypothesis," which posits that amino acids do not interact directly with mRNA but are carried by small molecules that recognize specific codons. (Today, the adaptor molecules would be called tRNAs.) The codons are then thought to be non-overlapping triplets of bases. This code removes the restriction in the Gamow schemes that allowed only certain combinations of amino acids to be adjacent.

However, care must be taken to ensure that each triplet uniquely identifies one amino acid. For example, suppose that somewhere in an mRNA there is the partial sequence ... UGUCGUAAG.... (Note that in RNA uracil replaces the thymine of DNA, and so the code is written with U rather than T.) The intended reading frame is ... UGU, CGU, AAG..., but the RNA molecule has no spaces or commas to indicate codon boundaries. The sequence could equally well be read as ... UG, UCG, UAA, G ... or ... U, GUC, GUA, AG.... Each of these alternatives would have a different meaning. Furthermore, adaptor molecules that attached to the messenger RNA in different reading frames might interfere with one another and prevent any protein from being produced at all.

Crick suggested that adaptor molecules exist for only a subset of the 64 codons, with the result that only that subset would be meaningful; the rest of the triplets would be "nonsense codons." The trick is to construct the code in such a way that when any two meaningful codons are put next to each other, the frame-shifted overlap codons are always nonsense. For example, if CGU and AAG are sense codons, then GUA and UAA must be nonsense, because they appear inside the concatenated sequence CGUAAG. Similarly, AGC and GCG are ruled out by the sequence AAGCGU. The codons AAA, CCC, GGG and UUU cannot appear in any comma-free code, since they cannot combine with themselves without generating reading-frame ambiguity.

If all the out-of-frame triplets are nonsense, then there is only one possible way to read the message. A code with this property is said to be comma-free, since messages remain unambiguous even when words are run togetherwithoutcommasorspaces.

Show that such a comma-free code can contain 20 codons, and so can code for 20 amino acids. One way to solve this problem is to write a Matlab program for it.
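The frame-shifted triplets in the CGU/AAG example can be listed mechanically. The following is a minimal sketch, not part of the original solution; the variable names (first, second, shift1 and so on) are illustrative assumptions.

```matlab
% Out-of-frame triplets created when two codons are written back to back.
first  = 'CGU';
second = 'AAG';

pair   = [first second];     % 'CGUAAG'
shift1 = pair(2:4);          % frame shifted by one base
shift2 = pair(3:5);          % frame shifted by two bases
fprintf('%s contains %s and %s\n', pair, shift1, shift2);

pairRev = [second first];    % 'AAGCGU'
shift3  = pairRev(2:4);
shift4  = pairRev(3:5);
fprintf('%s contains %s and %s\n', pairRev, shift3, shift4);
```

The four triplets printed (GUA, UAA, AGC, GCG) are the ones the text rules out once CGU and AAG are both sense codons.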

To solve this problem, start by assuming that a particular codon is allowed. (From here on the codons are written in the DNA alphabet, with T in place of U.) For instance, if ACT is an allowed codon, then to eliminate ambiguity ACTACT must be uniquely readable as two ACTs. This eliminates CTA and TAC as possible codons, because either would make the reading frame ambiguous. Now choose a different possible codon, say CGT, eliminate the codons that it makes ambiguous, and so on. Note that not every sequence of choices leads to 20 codons; many end with fewer than 20. The point is that the largest set that can be reached by some sequence of choices contains 20 codons.

Solution: The first thing to notice is that AAA, CCC, GGG and TTT cannot be codons in the comma-free code because each would cause reading-frame ambiguity (e.g., in the sequence AAAAAAAA, the reading frame is ambiguous). It is also easy to see that if XYZ is an allowed codon, then its cyclic permutations YZX and ZXY cannot be codons. For example, if both XYZ and YZX were allowed, then the sequence XYZXYZ could be read in two frames: XYZ XYZ and X YZX YZ. Thus, for any allowed codon, its two other cyclic permutations are disallowed. The 60 possible codons that remain after discarding AAA, CCC, GGG and TTT therefore divide into 60/3 = 20 distinct classes of three cyclic permutations each, and a comma-free code can contain at most one codon from each class, that is, at most 20 codons. In principle, it should be possible to choose one codon from each class to avoid ambiguity of the reading frame.
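The class count can be confirmed by brute force. The sketch below is an illustration only: it uses the numbering 1 to 4 for the four bases, and the names seen and nClasses are assumptions introduced here.

```matlab
% Count the cyclic classes among the triplets other than AAA, CCC, GGG, TTT.
seen = false(4,4,4);     % seen(a,b,c) is true once triplet abc belongs to a class
nClasses = 0;
for a = 1:4
    for b = 1:4
        for c = 1:4
            if (a == b && b == c) || seen(a,b,c)
                continue   % skip the four uniform triplets and known members
            end
            % mark the triplet together with its two other cyclic permutations
            seen(a,b,c) = true;
            seen(b,c,a) = true;
            seen(c,a,b) = true;
            nClasses = nClasses + 1;
        end
    end
end
fprintf('%d cyclic classes\n', nClasses);   % prints: 20 cyclic classes
```

This only bounds the size of a comma-free code at 20; it does not by itself show that one codon can be picked from every class without conflict.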

Although this seems plausible, the actual exercise of devising such a code is not trivial.

Below is a Matlab program that iteratively finds the codons. The nucleotides have been replaced by numbers: A==1, C==2, G==3, T==4.
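To read the numeric output as codons, the mapping can be undone digit by digit. This is a small sketch under the assumption, consistent with how the programs below decode a codon, that the hundreds digit is the first base; the variable names are illustrative.

```matlab
% Convert a numeric codon such as 123 to letters and back.
letters = 'ACGT';                  % A==1, C==2, G==3, T==4
code = 123;
d = [floor(code/100), mod(floor(code/10),10), mod(code,10)];   % [1 2 3]
codon = letters(d);                % 'ACG'
back = d * [100; 10; 1];           % 123 again
fprintf('%d <-> %s\n', back, codon);
```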

When you run the code, it asks you for an input codon. The number of codons it finds that have no reading-frame ambiguity (and so can be used in a comma-free code) depends on the input codon. Some starting codons do not lead to twenty codons. For example, starting with 112 finds only 9 codons:

[112,311,411,321,421,312,412,322,422]

However, starting with codon 123 generates a complete set of 20:

C: [123,211,311,411,122,322,422,213,313,413,323,423,433,214,314,414,124,324,424,434]

Other comma-free codes can be obtained by choosing different start codons, for example 231 and 312.
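As a quick sanity check, the 20-codon list reported for start codon 123 can be tested directly, without going through a file. The sketch below copies that list by hand and uses illustrative names (D, isCommaFree); it is a compact variant of the pairwise test, not the author's program.

```matlab
% Check the 20-codon list for start codon 123 and print it in letters.
C = [123,211,311,411,122,322,422,213,313,413, ...
     323,423,433,214,314,414,124,324,424,434];
n = numel(C);
D = [floor(C/100); mod(floor(C/10),10); mod(C,10)];   % 3-by-n table of digits

isCommaFree = true;
for i = 1:n
    for j = 1:n
        s = [D(:,i); D(:,j)];                 % six bases: codon i, then codon j
        f1 = s(2)*100 + s(3)*10 + s(4);       % frame shifted by one base
        f2 = s(3)*100 + s(4)*10 + s(5);       % frame shifted by two bases
        if any(C == f1) || any(C == f2)
            isCommaFree = false;              % an out-of-frame triplet is a codon
        end
    end
end

letters = 'ACGT';
for i = 1:n
    fprintf('%s ', letters(D(:,i).'));
end
fprintf('\ncomma-free: %d\n', isCommaFree);
```

Because i and j both run over every codon, each ordered pair is tested, so checking the two shifted frames of codon i followed by codon j covers both orders.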

The second program below checks that your list of codons has no reading-frame ambiguity. The first program writes its output array C to a local file called C.csv, and the second program reads the list from that file.

Code to find the codons:

______________________________

% Code to find the codons for Crick's comma-free code. The number found
% depends on the starting codon entered at the prompt - the initial value
% 123 gives all 20.
clc; clear all; close all;

% Build the 60 candidate codons (every triplet except 111, 222, 333, 444)
C00 = zeros(1,60);
ic = 0;
for i=1:4
    for j=1:4
        for k=1:4
            if abs(i-j)+abs(i-k)+abs(k-j)>0
                ic = ic+1;
                C00(ic) = i+j*10+k*100;
            end
        end
    end
end

C(1)=input(' give initial codon \n');

for ij=1:20
    % C0 = candidates that are not already in the code C
    C0=C00;
    n = length(C);
    iw = zeros(n,1);
    nc = 0;
    for i=1:60
        for j=1:n
            if C0(i)==C(j)
                nc = nc+1;
                iw(nc) = i;
            end
        end
    end
    C0(iw(:))=[];

    % now try to add codons one by one and check
    m = length(C0);
    n=n+1;
    err = 1;
    kk = 0;
    while (err == 1)
        kk = kk+1;
        if kk>m
            % every candidate failed: drop the trial codon and report
            C(n)=[];
            n=n-1;
            for ip=1:n
                fprintf('Codon %i = %i \n',ip,C(ip))
            end
            fprintf('*********\n')
            fprintf(' %i codons found \n',n)
            break
        end
        jerr = 0;
        C(n) = C0(kk); % add one more codon from C0
        % check that this still works
        for i=1:n
            c3 = mod(C(i),10);
            c2 = mod((C(i)-c3)/10,10);
            c1 = (C(i)-c2*10-c3)/100;
            for j=1:n
                d3 = mod(C(j),10);
                d2 = mod((C(j)-d3)/10,10);
                d1 = (C(j)-d2*10-d3)/100;
                % out-of-frame triplets of d1d2d3c1c2c3 and c1c2c3d1d2d3
                I1 = d2*100+d3*10+c1;
                I2 = d3*100+c1*10+c2;
                I3 = c2*100+c3*10+d1;
                I4 = c3*100+d1*10+d2;
                for k=1:n
                    if (I1==C(k))||(I2==C(k))||(I3==C(k))||(I4==C(k))
                        jerr = 1;
                        % fprintf(' error %i %i %i %i %i \n',I1,I2,I3,I4,C(k))
                    end
                end
            end
        end
        if jerr == 0
            err = 0;
        end
    end
    if err == 1
        break % no further codon can be added, so stop (avoids repeated reports)
    end
end

csvwrite('./C.csv',C)

________________________

Code to check the codons:

% Reads the codon list from C.csv and tests every ordered pair of codons.
clc; clear all; close all;
C = csvread('./C.csv');
n=length(C); ierr=0;
for i=1:n
    c3 = mod(C(i),10);
    c2 = mod((C(i)-c3)/10,10);
    c1 = (C(i)-c2*10-c3)/100;
    for j=1:n
        d3 = mod(C(j),10);
        d2 = mod((C(j)-d3)/10,10);
        d1 = (C(j)-d2*10-d3)/100;
        % out-of-frame triplets of d1d2d3c1c2c3 (I1, I2) and c1c2c3d1d2d3 (I3, I4)
        I1 = d2*100+d3*10+c1;
        I2 = d3*100+c1*10+c2;
        I3 = c2*100+c3*10+d1;
        I4 = c3*100+d1*10+d2;
        for k=1:n
            if (I1==C(k))||(I2==C(k))
                fprintf(' error 1 %i %i %i %i %i \n',I1,I2,I3,I4,C(k))
                ierr=1;
            end
            if (I3==C(k))||(I4==C(k))
                fprintf(' error 2 %i %i %i %i %i \n',I1,I2,I3,I4,C(k))
                ierr=1;
            end
        end
    end
end

if ierr==0
    for ij=1:n
        fprintf('Codon %i = %i \n',ij,C(ij))
    end
    fprintf(' codons have no reading frame ambiguity \n')
end

Science on Saturdays
