All information in the human genome is stored in the DNA sequence in the cell nucleus, and this sequence was mapped in the early 2000s. Genes are defined sections of DNA encoding different types of proteins. In recent decades, researchers have been able to define around 21,000 protein-coding human genes, for example by using DNA analysis. In the different cell types of the body, different protein-producing genes are active or inactive, and many medical conditions also depend on altered activity of specific genes.
In humans, only about 1.5% of the genome consists of protein-coding genes. Of the remaining DNA, some sequences are used to regulate the genes' production of proteins, but the bulk of the DNA is considered to lack any purpose and is often referred to as "junk DNA". Within this junk DNA there are so-called pseudogenes. Pseudogenes have been considered non-functional genes, believed to be gene remnants that lost their function during evolution.
In the current paper in Nature Methods, researchers present a new proteogenomics method, which makes it possible to track down protein-coding genes in the remaining 98.5% of the genome, a task that has until now been impossible to pursue. Among other things, the research shows that some pseudogenes produce proteins, indicating that they do indeed have a function.
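To give a feel for the matching problem behind such a search, the sketch below translates a DNA string in all six reading frames and looks for an observed peptide in each translation. It is a toy illustration only, not the HiRIEF LC-MS pipeline from the paper: the function names, the example sequence and the exact-match search are assumptions made here, and a real search works with mass spectra and statistical scoring across billions of bases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(dna, peptide):
    """List (strand, frame, protein_position) for every exact peptide match.

    Hypothetical helper for illustration: a real pipeline would not use
    exact string matching.
    """
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide, pos + 1)
    return hits


# Made-up example: the peptide MAK is encoded two bases into the sequence.
example_dna = "CCATGGCTAAATGA"
print(six_frame_hits(example_dna, "MAK"))  # [('+', 2, 0)]
```

Even in this simplified form, every stretch of DNA has to be read six ways, which hints at why the number of candidate locations in a whole genome runs into the millions.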
"To be able to do this we had to match experimental data for sequences of peptides with millions of possible locations in the whole genome", says Associate Professor and study leader Janne Lehtiö. "We had to develop both new experimental and bioinformatics methods to allow protein based gene detection, but when we had everything in place it felt like participating in a Jules Verne adventure inside the genome."
The Lehtiö team found evidence for almost one hundred new protein-coding regions in the human genome. Similar findings were made in cells from mice. Many of the new proteins encoded by pseudogenes could also be traced in other cancer cell lines, and the next objective on the researchers' agenda is to investigate if these genes in the "junkyard" of the genome play a role in cancer or other diseases.
"Our study challenges the old theory that pseudogenes don't code for proteins", says Dr Lehtiö. "The presented method allows for protein-based genome annotation in organisms with complex genomes and can lead to discovery of many novel protein-coding genes, not only in humans but in any species with a known DNA sequence."
The current study was conducted by researchers from Karolinska Institutet, Stockholm University and the Royal Institute of Technology (KTH), all active at Science for Life Laboratory (SciLifeLab). The Principal Investigator, Associate Professor Janne Lehtiö, is active at the Department of Oncology-Pathology at Karolinska Institutet and his laboratory is located at SciLifeLab. GE Healthcare Bio-Sciences in Uppsala provided technical support for the method development. The research was funded by the Swedish Research Council, the Swedish Cancer Society, the Stockholm County Council, Stockholm's Cancer Society, and by the EU FP7 project GlycoHit.
Publication: 'HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics', Branca R.M.M., Orre L.M., Johansson H.J., Granholm V., Huss M., Pérez-Bercoff Å., Forshed J., Käll L., Lehtiö J., Nature Methods, advance online publication 17 November 2013, doi: 10.1038/nmeth.2732.