Mapping the Cannabis Genome

Soft Secrets
18 Oct 2011

Medicinal Genomics released the raw sequence of the newly mapped Cannabis sativa genome, and is set to release the C. indica genome in a matter of weeks.


On August 18th 2011, the Massachusetts-based company Medicinal Genomics released the raw sequence of the newly mapped Cannabis sativa genome, and is set to release the C. indica genome in a matter of weeks.

This historic event is set to have far-reaching implications for the field of medicinal Cannabis research, as well as opening up the possibility of far more accurate quality control testing - and giving breeders a huge advantage in the drive to create new strains. The study of genetics is a complex business in itself, and at first glance appears incomprehensible to most. The majority of people vaguely remember the basics from high-school biology class, but would be hard-pressed to provide a decent explanation. SSUSA brings you up to speed, while going into a little more detail about the Cannabis genome project itself and why its impact on Cannabis research will be so great.

Kevin McKernan, CEO of Medicinal Genomics, has a strong background in the development and implementation of sequencing technology. He was involved with the Human Genome Project and was responsible for the design of a new bench-top SOLiD sequencing instrument during his time at Life Technologies, a highly regarded biotechnology company. With Medicinal Genomics, he was able to combine his own expertise and the unique opportunity presented by the emerging medical Cannabis market to begin his own concern. It is estimated that the project cost around $200,000 in total, a cost that Kevin evidently does not seek to recoup through licensing use of the data, as it has been released for public use through Amazon's EC2 cloud computing service.

However, future funding is unlikely to be an issue as the interest in this project - and its results - has been phenomenal; the number of potential applications for various sectors is countless, and much more work must be done to interpret the raw data into a precise genomic structure. Having made such an initial impact, this energetic young company is sure to be making genomic headlines again in the future.

It might be pertinent at this point to provide a short introduction to genomics. Put as succinctly as possible: genomics is the field of genome research, and is a highly specialized branch of genetic studies. The genome is the complete genetic code of an individual, and is contained within its DNA. In most species, DNA forms chromosomes containing both genes and non-coding DNA. A gene is a stretch of DNA that controls a hereditary trait in a species; an allele is one of the variant forms of that gene found in individuals of the species. Non-coding DNA may represent a large proportion of the genome, but does not encode genes, and not all of its functions are fully understood. The number of chromosomes in the genome differs between species: humans have forty-six, Cannabis just twenty; the number of genes within each chromosome also varies.

DNA itself is a biological polymer or repeating macromolecule, made up of millions of sections called nucleotides. Each of these sections is composed of a simple sugar, a phosphate group, and a nitrogenous base (nucleobase). This base may be one of four different compounds - adenine, guanine, thymine and cytosine. Through billions of permutations of these four bases, every living species has evolved its own unique genome.

Sexually reproducing species, such as humans and C. sativa, usually have sets of chromosome pairs: one set inherited from the mother's ovum (or ovule, in the case of flowering plants) and a corresponding set from the father's sperm (or pollen, in the case of flowering plants). These sets are randomly recombined in the process of reproduction to form genetically distinct offspring. A human embryo will usually contain twenty-two pairs of non-sexual chromosomes (autosomes) and one pair of sexual chromosomes (allosomes). C. sativa usually has nine autosome pairs and one allosome pair, giving its total of twenty chromosomes.

To map the complete genome of a sexually reproducing species the full set of chromosomes, including both the male and the female version of the allosome, must be sequenced - that is, tested to determine the exact order of the four nucleobases along the polymer. The DNA contained within the mitochondria and the chloroplasts must also be mapped, as it differs from chromosomal DNA and has its own specific functions according to species.

Very few whole genomes of higher plant species have so far been mapped and published, and C. sativa is unique in being the first dioecious plant sequenced primarily for its medicinal value - other genomes mapped have either been of model plants or of oil and food crops. The medicinal plant Artemisia annua was previously sequenced to better understand the enzyme pathways that make the anti-malaria drug artemisinin; however, its growth cycle is much longer than that of Cannabis and it has only one medicinal compound of interest.
Cannabis has 85 cannabinoids and potentially hundreds of terpenes of therapeutic interest, making it a potentially far more valuable medicinal plant.

Medicinal Genomics intends to map the genomes of many beneficial plants in the future - but why was Cannabis the first choice for them? Initially introduced to the subject through a 2003 publication in Nature Reviews, CEO Kevin McKernan soon recognized the potential for investment. Not only is the market for medical Cannabis growing at a rate of up to 50% per year in the US, but the wide range of syndromes that can be successfully treated by Cannabis and derivative preparations could well be unrivaled in nature. In addition, its importance as a food crop is unquestioned, and will also increase as the market begins to expand. Furthermore, its potential as a biodiesel crop - in times when the US and other developed countries are expanding their bio-ethanol production at an unprecedented rate - may cement its role in future global trade.

Hemp varieties are genetically quite different and will require separate sequencing - although this is unlikely to be too far away given the rapidly growing global interest in the crop. The ability to create new varieties that are ever more suited to purpose would be highly advantageous, and knowledge of the genomic structure of the strains involved could provide the basis for selection of parents, as more accurate predictions can be made about the nutrient and cannabinoid profile of the offspring.

Depending on the complexity of the genome and the extent of possible variation between individuals and varieties of the species in question, several techniques may be employed to sequence genomic data. The researchers at Medicinal Genomics started out using short-read technology, which analyzes short segments of DNA (around 200 base pairs) and collates the results to provide a complete picture.
However, this method proved ineffectual at revealing the true complexity of the genetic code, and longer reads were needed. The technology used to obtain the eventual results, the GS-FLX+ platform, is a Next Generation Sequencing technology that reads the DNA in fragments up to 750 base pairs (bp) long. The team generated 49.5 million sequence reads, each approximately 630bp in length. As little as three micrograms (μg) of genomic DNA, derived from any type of organism, is sufficient to perform a sequencing run.
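To make the figures above a little more concrete, here is a minimal Python sketch. It does two things: it multiplies out the read count and read length quoted in the article, and it shows, on a made-up toy example, what "collating" overlapping reads into a longer sequence can look like. The function names, the toy reads and the greedy merging strategy are all assumptions chosen for illustration; this is not the software or the method Medicinal Genomics used, and real assemblers must also cope with sequencing errors, repeats and both DNA strands.

```python
def total_bases(n_reads: int, mean_read_length: int) -> int:
    """Raw yield of a run: number of reads times mean read length."""
    return n_reads * mean_read_length


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    longest_possible = min(len(a), len(b))
    for n in range(longest_possible, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy assembler: repeatedly merge the pair of reads with the largest overlap.

    Hypothetical illustration only; it ignores errors, repeats and
    reverse-complement reads.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left that overlaps; stop with separate pieces
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Figures quoted in the article: 49.5 million reads of roughly 630bp each.
    print(total_bases(49_500_000, 630))  # 31185000000, i.e. about 31 billion bases

    # Made-up reads covering a made-up 14-base sequence.
    toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
    print(greedy_assemble(toy_reads))  # ['ATGCGTACGTTAGC']
```

The toy also hints at why read length matters: the shorter the reads, the shorter the overlaps available to join them, and the more likely it is that two unrelated stretches of a genome look identical to the assembler.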

The DNA purification was performed at the company's Amsterdam facility, and the sequencing itself was performed, using the latest high-throughput technology, by various research institutions - including Roche's 454 Life Sciences sequencing laboratory. The varieties used were acquired through collaboration with DNA Genetics, the Amsterdam seed company, and required special breeding programs to develop.

For the indica genome sequence, triple back-crossed L.A. Confidential was used. A back-crossed specimen is the offspring of a plant and the plant's parent, and is usually created to cement a characteristic within a strain, or possibly to breed out unwanted consequences of a previous cross. Double or triple back-crossing skips generations, crossing a specimen with its grandparent or great-grandparent. By triple back-crossing the L.A. Confidential strain, the breeders were able to create a phenotype that had great genetic similarity to its pure indica ancestors.

For the sativa genome, the sativa hybrid Chemdawg was used. It is a highly regarded medical strain which, although having some indica ancestry, has a very cerebral effect and many sativa characteristics. Medicinal Genomics also worked with the Greenhouse Seed Company to investigate high-CBD landraces and ruderalis varieties.

Sanger Sequencing was the method used for the Human Genome Project. To purify the DNA and remove the remaining organic material, the sample is broken down mechanically, and the chromosomes are then separated into their component strands through a process known as the polymerase chain reaction (PCR). Polymerase refers to various enzymes that assist the replication and repair of DNA, catalyzing the linking of nucleotides in a specific order, and using a single short strand of DNA (or primer) as a starting point. The PCR combines polymerase with an artificial primer to create multiple instances of the same strand, each segment identical - but for one nucleotide's difference in length.
The final base of each fragment is then fluorescently dyed for identification purposes, and the fragments are separated and arranged through a process of gel electrophoresis (where dispersed particles in a fluid are forced to migrate by an electrical charge). The dyed nucleobases travel one by one through the gel, and pass through a laser beam, which is transformed into a different wavelength of light according to the type of base. The beam is then focused onto a spectrograph by lenses and read by a CCD camera system, and the order in which the different colors are recorded determines the original order of the nucleobases along the DNA polymer. The chromosomes are thus reconstructed into their original form, and the genome is said to be assembled.

Two different Next Generation Sequencing methods, known as Sequencing by Synthesis, were employed in the mapping of the C. sativa genome. PyroSequencing from Roche/454 uses PCR (like Sanger Sequencing above), but a massively parallel version called water-in-oil Emulsion PCR. The genome is fragmented into 700-1000 base fragments, which are modified to have the same DNA sequence on the ends, and whipped into an emulsion with particles that carry copies of the primer sequences. The salad dressing-like emulsion isolates one DNA molecule in a PCR-enabled water droplet surrounded by oil. This single molecule per droplet is achieved by simple dilution: most droplets are in fact empty, and a very rare few have a single DNA molecule in them. Billions of droplets will each have a distinct DNA molecule and particles present in them. 'Xerox' copying the DNA molecules in these emulsion droplets effectively amplifies the signal one is looking at in the sequencing process. Once the DNA attaches to the particles, they can be placed in a plate containing millions of wells, which is attached to a CCD camera.
Luciferase (a firefly enzyme) is used to produce light that the camera can measure each time one of the four letters flowed across the chip is added to a growing strand. As adenine is flowed across the chip, any well whose strand of DNA has T as the next letter will create light that the camera can measure. The chip is washed and the process is repeated with the other three bases. The strands of DNA are grown with polymerase and bases sequentially added. This process is repeated 500 times for each of the four bases to reach over 700 bases of sequence, and takes approximately 24 hours to complete.

Illumina's Sequencing by Synthesis is fundamentally very different and uses Bridge PCR, which results in fewer DNA molecules and hence shorter read length. It also requires different detection mechanisms, using lasers and fluorescent dyes, but offers better sequencing accuracy - especially in sequence stretches known as homopolymers (long runs of the same letter), as in AAAAAAATTTGGG.

The result was 131 billion bases of raw sequence - far more than the 157 million or so bases found in the genome of the model plant Arabidopsis thaliana, a species of flowering plant, although the data requires further interpretation to define the true size and scope of the entire genome. It is to be expected that C. sativa would show more genetic complexity than A. thaliana: the latter is from a genus of relatively simple plants; Cannabis on the other hand has highly developed systems in place, most obviously the system of cannabinoid production.

The genomic variation between the different samples tested was over 1%, which is ten times higher than that seen in humans - this ability to express hugely different genotypes while retaining the ability to interbreed is a large part of why Cannabis is such an adaptable species, and why it appears in so many different forms.
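The flow-by-flow read-out described above for PyroSequencing can be pictured with a short Python sketch. It is a simplified toy under stated assumptions, not Roche/454's software: the function names, the flow order "TACG" and the noise-free integer "signal" are all hypothetical. Each flow offers one base; the signal recorded is the number of consecutive positions on the template whose complementary base matches it.

```python
from itertools import cycle

# Watson-Crick pairing: the base added to the growing strand is the
# complement of the next base on the template.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def flowgram(template: str, flow_order: str = "TACG"):
    """Simulate idealized pyrosequencing flows over one template strand.

    Returns a list of (base_flowed, signal) pairs, where signal is the
    number of bases incorporated during that flow (0 means no light).
    The flow order is an assumption made for this example.
    """
    if set(template) - set(COMPLEMENT):
        raise ValueError("template may only contain A, C, G and T")
    if set(flow_order) != set(COMPLEMENT):
        raise ValueError("flow_order must contain each of A, C, G and T")

    flows = []
    pos = 0
    for base in cycle(flow_order):
        if pos >= len(template):
            break
        run = 0
        # A run of identical template bases is extended within a single flow.
        while pos < len(template) and COMPLEMENT[template[pos]] == base:
            run += 1
            pos += 1
        flows.append((base, run))
    return flows


def call_bases(flows) -> str:
    """Turn a flowgram back into the sequence of the newly grown strand."""
    return "".join(base * signal for base, signal in flows)


if __name__ == "__main__":
    flows = flowgram("TTTGCA")
    print(flows)              # [('T', 0), ('A', 3), ('C', 1), ('G', 1), ('T', 1)]
    print(call_bases(flows))  # AAACGT

    # The homopolymer example from the text, used here as the template.
    print(flowgram("AAAAAAATTTGGG"))  # [('T', 7), ('A', 3), ('C', 3)]
```

In this idealized model a run of seven identical letters shows up as a single flow with a signal of 7. On a real instrument that number has to be estimated from the brightness of one flash of light, which is one plausible way to see why long homopolymers are harder to call with this chemistry.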
Although much interpretive research remains to be done, the sequencing of the entire genome, which will enable study of the plant with no need for physical specimens, should throw a great deal of light on questions that have remained incompletely answered thus far: the exact nature and function of the cannabinoid system, the genetic differences between varieties and subspecies such as indica and sativa, and the complexities of sexual reproduction, to name just a few. For example, the genomic data could be used to determine the exact location and function of the genes within the sex chromosomes that control eventual gender, and the capability for self-pollination (for example when the females are left several weeks beyond normal harvest, and produce genetically identical seeds in an effort to sustain the genotype) could be pinpointed. Conversely, programs to entirely breed out hermaphroditic tendencies would be highly desirable for many traditional growers. But perhaps it is the question of the cannabinoid function, and how exactly it relates to our own endocannabinoid system, that is the most fundamental.

The publication of the genomic data will inevitably lead to ethical issues being raised. Genetic modification of crops remains a controversial issue - despite widespread implementation, there are still unanswered questions regarding the effect of modified crops on the ecosystems to which they are introduced. For crops grown indoors, this is less of an issue, but for large-scale outdoor cultivation the impact on biodiversity is thought to be detrimental in some cases. When dealing with a subject as complex as organic chemistry, it is very difficult to predict the effect a single modification will have on normal interactions between different elements within an ecosystem.
The question of whether it is better to selectively breed or genetically modify for a desired trait is difficult to answer. Breeding projects can generate quicker results via careful selection of specific phenotypes; developing strains by monitoring their genetic pathways may take longer, but should lead to a more controlled end result. However, the implications of the newly released genomic data are more wide-reaching than simply assisting genetic modification programs, and the precedent set by the Medicinal Genomics team of making their data open-source will hopefully provide the incentive for others to do the same. Without doubt, the involvement of many will be needed to complete the next stage of this project, and if the team releases their planned iPhone app, which will allow users to add genomic annotations to the sequenced data, they will be providing huge support to the scientific community.

Functional genomics is the branch that uses genome project results to assess the interactions between specific genes and proteins and their expression in the phenotype. This is generally the part of the research that takes the longest - the raw genomic data for the Human Genome Project was first released in 2001 but it was not until 2003 that it was declared 'complete' (all chromosomes mapped and every gene within the chromosomes identified and located, without errors).

After assembly, the genome has to be annotated - that is, the regions that contain specific genes must be located and the relevant biological information attached to the electronically-stored data.
Genome annotation continues to this day for the human genome, as the total number of different expressions of the thousands of genes contained in it is staggeringly high and requires expert human input to describe. In order to assess the impact of a specific gene, genes can be deleted or disrupted in the genome and the effect observed in the resulting phenotype. This process has helped researchers identify several key genetic functions in humans, shedding light on various disorders and their causes.

Once the genome is annotated, the data can be used in many ways - to determine the extent of similarity or to identify the genes responsible for similar functions (and how they differ) between species or varieties; to study the evolution of a species, its ancestry and developmental path, and how it fits into the taxonomic system; and ultimately to provide another piece to the unimaginably vast puzzle that is the evolutionary tree, whose roots are the protozoa of Earth's early days and whose branches number millions. Comparative genomics specifically deals with the variation between different genomes, and is important in understanding how, and in response to what pressures, mutation occurs and divergence becomes established. It can also be used to study the adverse mutations that can arise, which for humans is already providing much information regarding the nature of cancers and why they occur.

While advances in genetic research as a whole have provided much insight into the nature and importance of the Cannabis plant, the release of the C. sativa genome and the ongoing research it has engendered are set to sharply propel us into a much higher level of overall understanding. Not only this, but a clear message has been sent to the public: that Cannabis is a medicinal plant worthy of serious consideration, and that the academic community recognizes it as such. Eventually, reluctant governments across the world will be forced to follow suit.
Support from credible and respected sources, such as Kevin McKernan and his team, is vital to the success of the movement, and they well deserve the plaudits they are receiving for their outstanding work.

Special thanks to Kevin McKernan for his input.
