Biology 335 - Molecular Genetics

Genomic Libraries


Vector Ligations




The bacteria transformed with the plasmid vector ligation reaction, or the phage vector ligation reaction packaged in vitro, represent a population of recombinant molecules.

Each bacterial transformant (plasmid vectors) or in vitro packaged virus particle (phage vectors) represents a single vector molecule joined to a unique foreign DNA fragment.
The population of transformants or virus particles obtained is called a library - and in the primary library, each individual bacterium or virus is unique.

Since our objective in constructing such a library is to identify and isolate a specific foreign DNA fragment, it is important that all DNA sequences in the sample are represented by one or more individual recombinant(s).


Foreign Genomic DNA Fragment Preparation

We have already looked at the pattern of genomic fragments generated by restriction enzyme digestion of genomic DNA.
If cut with a 4-mer enzyme, the average fragment size is approximately 256 bp (4^4).
Similarly, digestion with a 6-mer enzyme generates an average fragment size of approximately 4,000 bp (4^6 = 4,096),
and digestion with an 8-mer generates an average fragment size of approximately 65,000 bp (4^8 = 65,536).
For each of these digests, fragments larger and smaller than the average are also produced.
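As a quick check of these averages, the sketch below assumes that all four bases are equally frequent and randomly distributed, so that a given n-base recognition site is expected about once every 4^n bp. The function name is illustrative only.

```python
def average_fragment_size(site_length):
    """Expected spacing (bp) between recognition sites of the given length,
    assuming equal base frequencies and a random sequence."""
    return 4 ** site_length


for n in (4, 6, 8):
    print(f"{n}-mer enzyme: about {average_fragment_size(n):,} bp")
```

Real genomes are not random sequences, so observed averages can differ noticeably from these figures.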

Given the size limitations imposed by the various vector systems:
recombinant plasmids greater than 10 kb transform bacteria very inefficiently,
lambda insertion vectors cannot package inserts greater than 10 kb in size,
lambda substitution vectors can only accommodate inserts between 10 and 20 kb in size,
complete genomic digests with 6-mer or 8-mer enzymes will produce large genomic DNA fragments which will not be represented in a library using any of these vector systems.
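To get a feel for how much of the genome a complete digest could lose, the sketch below assumes that recognition sites are scattered at random, so that fragment lengths are approximately exponentially distributed around the average size m. Under that assumption, the fraction of the genome lying in fragments longer than a limit L is (1 + L/m) e^(-L/m). Both the model and the function name are illustrative assumptions, not part of the original notes.

```python
import math


def genome_fraction_in_large_fragments(average_size, size_limit):
    """Fraction of total DNA found in fragments longer than size_limit,
    assuming exponentially distributed fragment lengths (random sites)."""
    x = size_limit / average_size
    return (1 + x) * math.exp(-x)


# 6-mer complete digest (average 4,096 bp) against a 10 kb insert limit
print(genome_fraction_in_large_fragments(4096, 10_000))
# 8-mer complete digest (average 65,536 bp) against a 20 kb insert limit
print(genome_fraction_in_large_fragments(65_536, 20_000))
```

Under these assumptions, roughly 30% of the genome ends up in fragments too large for a 10 kb limit after a 6-mer digest, and nearly all of it exceeds 20 kb after an 8-mer digest.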

In order to ensure that all genomic sequences are represented in our libraries, we must generate random genomic fragments and estimate the probability that our specific DNA fragment will be included in a statistically relevant number of recombinant molecules.

To generate the random genomic DNA fragments and ligate them into our selected vector, we focus our attention on pairs of enzymes which generate compatible ends to facilitate ligation.
One of these enzymes must cut the genome frequently (i.e., a 4-mer) so that we can obtain a random population of genomic fragments,
while the other must cut uniquely in the vector (i.e., a 6-mer enzyme).
The most common enzymes used for this purpose are SauIIIA (Sau3AI, which recognizes GATC) and BamHI (which recognizes GGATCC); both leave the same 5' GATC overhang.



In order to generate a random population of genomic DNA fragments using an enzyme that cuts frequently, we use partial digest conditions where the enzyme only cuts the genome at a fraction of its recognition sites (limit the enzyme concentration, the time of digestion, or both).


Since the recognition sites occur frequently and we cut them randomly, the ends of the fragments we produce are a random selection of genomic sites.
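The effect of a partial digest can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: every position is a recognition site with probability 1/256 (a 4-mer in a random sequence), and each site is cut independently with the same probability. The function name, parameters, and the toy 2 Mb genome are all hypothetical.

```python
import random


def partial_digest(genome_length, site_spacing, cut_fraction, seed=0):
    """Return the fragment lengths from a simulated partial digest.

    Each position is a recognition site with probability 1/site_spacing,
    and each site is cut with probability cut_fraction.
    """
    rng = random.Random(seed)
    site_probability = 1.0 / site_spacing
    fragments = []
    last_cut = 0
    for position in range(1, genome_length):
        is_site = rng.random() < site_probability
        if is_site and rng.random() < cut_fraction:
            fragments.append(position - last_cut)
            last_cut = position
    fragments.append(genome_length - last_cut)  # final fragment up to the end
    return fragments


fragments = partial_digest(2_000_000, site_spacing=256, cut_fraction=0.05)
mean_size = sum(fragments) / len(fragments)
print(f"{len(fragments)} fragments, mean size about {mean_size:.0f} bp")
```

Cutting only about 5% of the sites raises the expected fragment size from 256 bp to roughly 256 / 0.05, or about 5 kb, and changing the seed gives a different random set of fragment ends.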


Following partial digestion of the genomic DNA, we isolate an appropriate size class of genomic fragments - each end of each fragment represents one of many possible SauIIIA sites.

These fragments are then ligated into a BamHI-digested vector - either a plasmid or a lambda insertion or substitution vector.

Each recombinant transformant or packaged phage resulting from this ligation represents an individual random genomic fragment.

This produces a population of recombinant molecules that contain many overlapping genomic fragments - any given genomic sequence will be represented by several of these overlapping fragments.


Since this library contains a random population of genomic fragments, we need to use a statistical approach to determine how many independent recombinant plasmids or phage are needed to ensure that all genomic sequences are represented in the library.

There are two statistical approaches used to estimate the number of recombinants required.

The first is the 'lazy molecular biologist's rule of thumb':

First, we calculate the number of individual recombinants that contain sufficient genomic DNA sequence to represent one complete genome - the genome equivalent.


1 genome equivalent = total genome length / average insert size

For the human genome, using an average plasmid insert size of 6 kb,
1 genome equivalent = (3 x 10^9 bp) / (6 x 10^3 bp) = 5 x 10^5 plasmids.

For the human genome, using a phage substitution vector insert size of 20 kb,
1 genome equivalent = (3 x 10^9 bp) / (2 x 10^4 bp) = 1.5 x 10^5 phage.
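The genome equivalent calculation above can be sketched in a few lines. The function name is illustrative; the genome and insert sizes are the ones used in the text.

```python
def genome_equivalent(genome_length_bp, average_insert_bp):
    """Number of non-overlapping inserts needed to span the genome once."""
    return genome_length_bp / average_insert_bp


HUMAN_GENOME_BP = 3_000_000_000

print(genome_equivalent(HUMAN_GENOME_BP, 6_000))   # plasmid library, 6 kb inserts
print(genome_equivalent(HUMAN_GENOME_BP, 20_000))  # lambda substitution library, 20 kb inserts
```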


The genome equivalent assumes that the genomic inserts are non-overlapping.
In reality, these are random, overlapping fragments, so the number of clones covering any given sequence varies across the genome (roughly following a Poisson distribution).

To approximate this distribution the 'rule of thumb' stipulates that

5 genome equivalents are necessary to ensure that
all genomic sequences will be represented with 95% probability

10 genome equivalents are necessary to ensure that
all genomic sequences will be represented with 99% probability.


The second approach uses the simple statistical formula:


N = ln (1 - p) / ln (1 - f)

where N = number of independent recombinants required

p = probability that a given sequence is represented (0.95 or 0.99)

and f = fraction of the genome contained in a single average insert
(2 x 10^4) / (3 x 10^9) = 6.7 x 10^-6 for a lambda substitution library of the human genome
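The formula follows from asking that the chance of missing a given sequence in all N independent clones, (1 - f)^N, be no more than 1 - p. A minimal sketch of the calculation is given below; the function name is illustrative, and math.log1p is used only because f is very small.

```python
import math


def clones_required(p, f):
    """Number of independent recombinants N such that a given sequence is
    present with probability p, where f is the fraction of the genome
    carried by one average insert: N = ln(1 - p) / ln(1 - f)."""
    return math.ceil(math.log(1 - p) / math.log1p(-f))


HUMAN_GENOME_BP = 3e9

for insert_bp in (6e3, 2e4):
    f = insert_bp / HUMAN_GENOME_BP
    for p in (0.95, 0.99):
        n = clones_required(p, f)
        print(f"insert {insert_bp:.0f} bp, p = {p}: N = {n:.2e}")
```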


The molecular biologist's rule of thumb overestimates the number of recombinants needed to give complete representation of the genome - but then you can never really have TOO many recombinants, can you?
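How much the rule of thumb overestimates can be checked by turning the formula around and asking what probability a given number of genome equivalents actually delivers: p = 1 - (1 - f)^N, with N equal to the number of genome equivalents divided by f. The sketch below uses the 6 kb plasmid case; the function name is illustrative.

```python
import math


def probability_represented(genome_equivalents, f):
    """Probability that a given sequence is present in a library holding
    the stated number of genome equivalents, where f is the fraction of
    the genome per insert: p = 1 - (1 - f)^N."""
    n_clones = genome_equivalents / f
    return 1 - math.exp(n_clones * math.log1p(-f))


f_plasmid = 6e3 / 3e9
for equivalents in (3, 5, 10):
    p = probability_represented(equivalents, f_plasmid)
    print(f"{equivalents} genome equivalents: p = {p:.5f}")
```

On this model, about 3 genome equivalents already give 95% and 5 give better than 99%, so the rule of thumb errs on the safe side.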

For the human genome,

a random shotgun plasmid library constructed using genomic inserts of 6 kb needs to contain at least


for 95% coverage

5 genomic equivalents = 5 ( 3 x 10^9 bp ) / ( 6 x 10^3 bp )
= 2.5 x 10^6 clones


N = ln ( 1 - p ) / ln ( 1 - f ) = ln ( 1 - 0.95 ) / ln ( 1 - ( 6 x 10^3 / 3 x 10^9 ) )
= ln ( 0.05 ) / ln ( 1 - 0.000002 )
= 1.5 x 10^6 clones

for 99% coverage

10 genomic equivalents = 10 ( 3 x 10^9 bp ) / ( 6 x 10^3 bp )
= 5 x 10^6 clones


N = ln ( 1 - 0.99 ) / ln ( 1 - 0.000002 )
= 2.3 x 10^6 clones


a random shotgun lambda substitution library constructed using genomic inserts of 20 kb needs to contain at least


for 95% coverage,

5 genomic equivalents = 5 ( 3 x 10^9 bp ) / ( 2 x 10^4 bp )
= 7.5 x 10^5 clones


N = ln ( 1 - 0.95 ) / ln ( 1 - ( 2 x 10^4 / 3 x 10^9 ) )
= ln ( 0.05 ) / ln ( 1 - 6.7 x 10^-6 )
= 4.5 x 10^5 clones.

for 99% coverage

10 genome equivalents = 1.5 x 10^6 clones

N = ln ( 1 - 0.99 ) / ln ( 1 - 6.7 x 10^-6 )
= 6.9 x 10^5 clones

