Genomic sequencing


As previously touched on, the genome is the entirety of genetic material carried by an individual or species, and varies accordingly. The database of sequenced genomes from different species is growing and includes the human genome (from the Human Genome Project), which can be viewed chromosome by chromosome.

Simple genomes

Simple genomes such as those of viruses make it relatively straightforward to assign a protein to each gene in the genome, and thus to build a database of all the proteins the genome codes for. This complete set of proteins is known as a proteome.

The information gleaned from a virus proteome, for example, can inform vaccination targets by selecting appropriate antigens such as elements of the viral capsid.

Other exciting synthetic biology applications can also be explored, such as glowing beer, or synthesising specific compounds useful in medicine or manufacturing in organisms to which the product isn't native, in an attempt to boost production or create new products.

Complex genomes

Analysing and storing information about more complex genomes is hindered by non-coding DNA and regulatory genes, which take up the vast majority of this type of genome. This means that the sequences which actually code for protein products are in the minority.

The proteomes corresponding to complex genomes, human included, are therefore difficult to build. Sequencing methods themselves have undergone, and continue to undergo, a rapid evolution towards faster, more efficient, automated techniques that can yield tremendous amounts of data.


For example, Sanger sequencing was for a long time the main method of sequencing DNA and has given rise to many variations. The basic concept follows these steps:

1. Mix copies of the target DNA to be sequenced with DNA polymerase, normal nucleotides and a small amount of labelled (originally radioactive) nucleotides carrying A, T, G or C bases

2. Once incorporated, the labelled nucleotides prevent further DNA lengthening, resulting in a mixture of strands of different lengths, each complementary to part of the template DNA (e.g. AATGGC creates TTACCG, TACCG, ACCG, CCG, CG and G)

3. Run the DNA mixture on a gel to separate the different strands by size

4. Infer the sequence based on the results: the labelled reading of the final base (A, T, C or G) of each strand, alongside the size order of the strands (smaller strands run further down the gel while larger strands stay towards the top, where they were loaded)
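
The steps above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: every possible terminated strand is produced exactly once, the gel separates strands perfectly by length, and the function names are made up for this example. The new strand is built starting opposite the right-hand end of the template as written, which is why the strands in the example above all end in G.

```python
# Toy model of chain-termination (Sanger) sequencing.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def new_strand(template: str) -> str:
    """Strand built on the template, in the order its bases are added."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def terminated_fragments(template: str) -> list[str]:
    """One strand per position where a labelled nucleotide stopped lengthening."""
    strand = new_strand(template)
    return [strand[:length] for length in range(1, len(strand) + 1)]

def read_gel(fragments: list[str]) -> str:
    """Order strands from smallest to largest and read each one's final, labelled base."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))

fragments = terminated_fragments("AATGGC")

# Written aligned against the template, as in the example above:
print([fragment[::-1] for fragment in fragments])
# ['G', 'CG', 'CCG', 'ACCG', 'TACCG', 'TTACCG']

# Sequence of the new strand, read from the smallest strand upwards:
print(read_gel(fragments))  # GCCATT
```

Reading GCCATT backwards and swapping each base for its partner gives back the template, AATGGC.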

At present, protein sequencing is not efficient enough to compete with simply sequencing the corresponding DNA and inferring the protein from it. The speed and cost of DNA sequencing have been compared to the trend in transistor technology known as Moore's Law, the observation that the number of transistors on a chip doubles roughly every two years while the cost per transistor falls. So far DNA sequencing has followed a similar trend of rising speed and falling cost, which is known as the Carlson Curve.
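
To show what inferring the protein from the DNA can look like, here is a minimal sketch that translates a coding DNA sequence using the standard genetic code. It assumes the sequence is already the coding strand, starts in the correct reading frame and contains no introns. The example sequence and the function name are invented for illustration.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(coding_dna: str) -> str:
    """Read the DNA three bases at a time until a stop codon is reached."""
    protein = []
    for start in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Made-up sequence: Met, Lys, Glu, His, then a stop codon.
print(translate("ATGAAAGAACATTGA"))  # MKEH
```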

Tracing key events in evolution

This sequence data alongside fossil evidence helps trace back major biological events in history such as the last universal common ancestor (LUCA) of all life on Earth, emergence of the first prokaryotes, the evolution of photosynthesis, the first eukaryotic organisms and multicellular life.

The comparison of sequences has provided evidence for 3 domains (the largest level of classification) of life: bacteria, archaea and eukaryotes.

By using the mutation rate of DNA or amino acid sequences in different organisms, a technique called the molecular clock can be used to calculate backwards to the most likely point in time when two organisms diverged. For example, two species whose sequences differ at more positions are likely to be more distantly related than two species whose sequences differ at fewer positions.
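
The arithmetic behind a simple molecular clock can be sketched as below. It assumes mutations accumulate at a constant rate in both lineages, so the differences between two species are shared between two branches (hence the factor of 2). The rate and the difference counts are invented numbers, not measured values.

```python
def divergence_time(differences: int, sites: int, rate_per_site_per_year: float) -> float:
    """Estimate the years since two sequences shared a common ancestor."""
    distance = differences / sites  # proportion of positions that differ
    # Both lineages have been mutating since the split, so divide by 2 * rate.
    return distance / (2 * rate_per_site_per_year)

RATE = 1e-9  # hypothetical: one change per site every billion years

close_pair = divergence_time(differences=5, sites=1000, rate_per_site_per_year=RATE)
distant_pair = divergence_time(differences=20, sites=1000, rate_per_site_per_year=RATE)

print(f"{close_pair:,.0f} years")    # roughly 2.5 million years
print(f"{distant_pair:,.0f} years")  # roughly 10 million years
```

Four times as many differences gives a divergence point four times further back, which is the reasoning in the paragraph above expressed as numbers.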

Determining evolutionary relationships between species

In the era of molecular biology, we no longer have to rely only on superficial visual cues in order to classify species. We can look at and compare their DNA, proteins, etc. A common method of visualising these differences is gel electrophoresis, which involves loading small volumes of samples on a gel and running a current across it in order to separate the samples by size.

Since the gel has a microscopic matrix inside that provides resistance against sample movement through it, the larger molecules move more slowly while the smaller fragments can move more quickly.

The positive electrode is at the bottom of the tank, while the samples are loaded at the top. Because the molecules carry a negative charge, they move downwards towards the bottom of the gel. The current is run across the gel for around 30-60 minutes. Samples that run for too long will run off the gel into the buffer solution and are lost. Afterwards, each sample's progression on the gel can be visualised using a stain solution or a pre-existing label that is visible under UV light.

The protein or DNA samples can come, for example, from the different species' muscle or some other tissue source. DNA samples can be replicated in the lab using specific primers (in PCR) to make copies of the genes or sections of DNA that are to be compared on a gel. Alternatively, all the proteins present in a sample can be investigated by running the whole sample on a gel and comparing the differences.

The bands on the gel might look something like this. Based on the height on the gel of the different bands (which represent different proteins in the sample), we can see that all species have the band at the top. This is a protein of the same size that they all share.

The second largest protein (the second highest band on the gel) belongs to Species Y and is unique to it, not being shared by the other species. The same goes for the third band down, which belongs to Species Z. Looking at all the bands for each species, we can see that Species A and Species Z share the most bands (3), so we assume from this data that, based on their proteins, Species A and Species Z are more closely related than any other pair of species here (A, X, Y, Z).
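
The band-counting logic can be written out as a short sketch. The band sizes below are hypothetical, chosen only to match the description above (one band shared by all, one unique to Y, one unique to Z, three shared by A and Z); they are not read from a real gel.

```python
from itertools import combinations

# Hypothetical band sizes (in kDa) for each species' lane.
bands = {
    "A": {100, 60, 40, 20},
    "X": {100, 50, 20},
    "Y": {100, 90, 50},
    "Z": {100, 80, 60, 40},
}

# Count the bands each pair of species has in common.
shared = {
    pair: len(bands[pair[0]] & bands[pair[1]])
    for pair in combinations(sorted(bands), 2)
}

for pair, count in shared.items():
    print(pair, count)

closest = max(shared, key=shared.get)
print("Most shared bands:", closest, shared[closest])  # ('A', 'Z') 3
```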

With advancing technology, scientists no longer have to rely on capturing animals or gathering data manually in the field. Bioinformatics enables the analysis of a whole genome from a computer. Once the initial DNA sequencing has taken place, a lot of research can be conducted just from that data. For example, the DNA, mRNA or amino acid sequence between two individuals or species can be compared.

From this short sequence of amino acids in the haemoglobin of these different species we can infer several things. Let's do humans and chimps first! How many differences are there? Lys, Glu, His, Iso and... Lys, Glu, His, Iso. Right. Absolutely no difference. Humans and gorillas have one difference, zebras and horses have one difference and zebras and humans have 3 differences!
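
Counting differences like this is easy to automate. In the sketch below, the human and chimp sequences are the ones quoted above; the gorilla, horse and zebra sequences are illustrative placeholders chosen only so that the counts agree with the numbers in the text, and should not be taken as the real haemoglobin data.

```python
# Four-residue stretch of haemoglobin per species.
# Gorilla, horse and zebra entries are assumed for illustration.
sequences = {
    "human":   ["Lys", "Glu", "His", "Iso"],
    "chimp":   ["Lys", "Glu", "His", "Iso"],
    "gorilla": ["Lys", "Glu", "His", "Lys"],
    "horse":   ["Arg", "Lys", "His", "Lys"],
    "zebra":   ["Arg", "Lys", "His", "Arg"],
}

def count_differences(species_a: str, species_b: str) -> int:
    """Number of positions where the two species have a different amino acid."""
    pairs = zip(sequences[species_a], sequences[species_b])
    return sum(a != b for a, b in pairs)

print(count_differences("human", "chimp"))    # 0
print(count_differences("human", "gorilla"))  # 1
print(count_differences("zebra", "horse"))    # 1
print(count_differences("zebra", "human"))    # 3
```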

We can infer a lot of different information from this table, and it's just a very small sequence in just one protein looking at just five different species. The potential of investigating diversity with molecular biology tools is astounding.

DNA can be studied similarly, and a lot of creativity can be employed to come up with ways to twist and turn heaps of genetic data in such a way that interesting information can be pulled out. In this example, it's a fairly straightforward, run of the mill comparison between the DNA sequence itself of a mouse gene versus a fly gene.

We can see that the DNA sequence itself is 76.66% identical, while the protein product resulting from the exons only is actually 100% identical between the two sequences (highlighted in green).
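
How two DNA sequences can differ while still coding for an identical protein can be shown with a small sketch. The two sequences here are made up (they are not the mouse and fly genes), they contain no introns, and the codon table is deliberately partial, covering only the codons used in this example.

```python
# Partial codon table: only the codons needed for this example.
CODONS = {
    "ATG": "Met",
    "GCT": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "GGA": "Gly", "GGT": "Gly",
}

def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions carrying the same base (equal lengths assumed)."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100 * matches / len(seq_a)

def translate(seq: str) -> list[str]:
    """Translate three bases at a time using the partial table above."""
    return [CODONS[seq[i:i + 3]] for i in range(0, len(seq), 3)]

gene_1 = "ATGGCTAAAGGA"  # hypothetical
gene_2 = "ATGGCCAAGGGT"  # hypothetical, differs at three positions

print(f"{percent_identity(gene_1, gene_2):.2f}% identical DNA")  # 75.00% identical DNA
print(translate(gene_1) == translate(gene_2))                    # True
```

All three differences fall on the last base of a codon, where a change often leaves the amino acid unchanged.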

Personal genomics and health

Following the first successful complete sequencing of a human genome, a newer project called the 100,000 Genomes Project has been launched in the UK by the government through the NHS to sequence the genomes of rare disease and cancer patients as well as their families. The insights gleaned from this data may serve to find treatments, as well as provide a rich source for further research that may be relevant to other disciplines in molecular biology and epidemiology.

Ethical implications

I got some of my DNA screened for several select markers, including for Alzheimer's disease and Parkinson's, as well as many inherited conditions. Before I could see the results, which could tell me I am at a higher risk for some of these conditions, I had to read a statement explaining what these results could mean, not just for myself, but for members of my family too. Maybe I didn't really care at the time whether I would be more likely to get Alzheimer's in my old age, but suddenly I realised it might be extremely relevant for my mother or grandmother.

Genetic information can affect people's outlook on health, lifestyle, family connections, reproduction and identity. Personally, I found out I am a carrier of a thrombosis factor associated with a 5 times higher risk of blood clotting. It won't affect me, but it might affect my genetic children if they receive two copies. I also found out I metabolise certain drugs quicker, and others slower. This might be useful in the future if I need to take them. Some are for epilepsy, some for diabetes, and so on.

Ancestry-wise, I expected my mother's side to be Balkan (Romanian), and my father's side to be Middle Eastern (Iranian) based on the region assignments at the time, representing population locations as far back as several hundred years. Indeed, I scored 43% Middle Eastern, but only 14% Balkan! Other populations included Ashkenazi Jewish, Italian and East Asian, with most of it being non-specific, vaguely European. I take it in good humour and am very proud of all these findings, but there are people who might have strong reactions to this type of knowledge about their ancestry.

The ethical implications stretch quite far and wide, up and down. The knowledge pertains to trivial matters such as earwax type and caffeine metabolism, but also significant health markers such as those for breast cancer and Alzheimer's. They pertain to ourselves as individuals, but stretch to our immediate genetic relatives, generations above, generations below and indeed those yet to be born. This is why this information requires careful treatment.


As briefly touched upon in the introduction to this chapter, genomics (the study of genomes) is emerging as a key scientific field in terms of addressing disease and learning more about health. Within healthcare, genomics has the potential, and has already begun, to support risk prediction, prevention, diagnosis, treatment in terms of drug choice and dosage, and prognosis.

Genomic medicine started in the areas of oncology, pharmacology, rare and undiagnosed diseases, and infectious disease.

Risk prediction works by studying associations between certain diseases and the presence of specific gene variants that are more common in that patient population. Sometimes, especially for rare diseases that tend to have a single genetic root, it's possible to know the mechanism by which the mutation causes the disease. However, other times this hasn't been elucidated and all we can work with is the knowledge that, for reasons as yet unknown, the association stands. It gives a patient a percentage increased lifetime likelihood of developing a certain disease.
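
What a percentage increase in lifetime likelihood means in practice depends on the starting risk. The sketch below shows the arithmetic only; the 2% baseline and the size of the increases are invented numbers, not figures for any real condition.

```python
def absolute_risk(baseline: float, relative_risk: float) -> float:
    """Lifetime risk for a carrier, given the population baseline and a risk multiplier."""
    return min(1.0, baseline * relative_risk)

def risk_after_increase(baseline: float, percent_increase: float) -> float:
    """Same calculation when the association is quoted as a percentage increase."""
    return absolute_risk(baseline, 1 + percent_increase / 100)

BASELINE = 0.02  # hypothetical: 2 in 100 people develop the condition

print(f"{risk_after_increase(BASELINE, 50):.1%}")  # 3.0%
print(f"{absolute_risk(BASELINE, 5):.1%}")         # 10.0%
```

A 50% increase on a small baseline is still a small absolute risk, which is one reason these results need careful explanation to patients.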

One example is the BRCA1 and BRCA2 genes, whose protein products are involved in DNA repair in cells, acting as tumour suppressor genes. Different variants of these genes have been linked to a 20-60% increased risk of breast and ovarian cancer.

Prevention can then take place by paying close attention, just by being aware of the increased risk, or in some cases, preventative interventions such as taking certain drugs or elective surgeries. In pharmacology, knowledge of increased risk of side effects from certain drugs can inform patients to avoid them or take an alternative drug. This ties in with treatment, and a patient's option to take a drug they will personally have a better response to, or at a better tailored dose. For example, fast metabolism of a drug may mean they will have to take it more frequently as their body is breaking it down more quickly.
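
Why a fast metaboliser may need more frequent doses can be illustrated with a simple half-life calculation. This assumes the drug is cleared at a steady exponential rate, which is a simplification, and the half-lives are hypothetical values rather than data for any real drug.

```python
def fraction_remaining(hours: float, half_life_hours: float) -> float:
    """Fraction of a dose still in the body after a given time."""
    return 0.5 ** (hours / half_life_hours)

TYPICAL_HALF_LIFE = 8.0  # hypothetical, in hours
FAST_HALF_LIFE = 4.0     # hypothetical fast metaboliser

for label, half_life in [("typical", TYPICAL_HALF_LIFE), ("fast", FAST_HALF_LIFE)]:
    left = fraction_remaining(8, half_life)
    print(f"{label}: {left:.0%} of the dose left after 8 hours")
# typical: 50% of the dose left after 8 hours
# fast: 25% of the dose left after 8 hours
```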

Prognosis is about knowing the likely outcome of a condition. This can connect back to the drugs taken and the response to those, or refer to how a disease might develop. For example, for some diseases there are multiple gene variants with different outcomes. This could be in terms of the likelihood of getting the disease, as well as in terms of disease severity and progression.

Fighting disease with non-human genomes

Having access to the genomes of other species can reveal how various metabolic pathways and proteins work, and how they compare between different organisms. For example, the genomes of the malaria-causing parasite Plasmodium falciparum, as well as its vector, the mosquito Anopheles gambiae, have been sequenced. This data can help develop better ways of controlling malaria.

Genetic information on the parasite can help edit its genome in order to produce attenuated versions for the purpose of vaccination. Alternatively, the mosquito which carries the parasite to humans could be modified so that it's no longer capable of transmitting the malaria agent to humans.

Pests of crops of interest to humans can also be better tackled through information from sequenced genomes.

Research species are often sequenced first, such as the ubiquitous "lab rats", flies (Drosophila melanogaster), worms (Caenorhabditis elegans) and frogs (Xenopus laevis).
