How to compute normalized RNA-seq expression from multicov files

Why you should use TPM rather than RPKM

The central point of these papers is to work out an alternative measure of RNA-seq expression abundance that resembles as closely as possible the relative molar concentration (rmc) of each RNA species present in a sample. Because the rmc values of all RNA species in a sample sum to one, the average rmc across genes has to be a constant that depends only on the number of genes mapped in an RNA-seq experiment.

One measure that fulfills the invariant-average criterion is transcripts per million (TPM), defined as

$$\mathrm{TPM}_g = \frac{t_g \cdot 10^6}{T}$$

where t_g is a proxy for the number of transcripts that can be explained by the reads mapped to gene g, and T is the sum of t_g over all genes. If one is interested in mRNA abundance, the average TPM, and thus the average rmc, is inversely proportional to the number of features present in the reference annotation.
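To make the definition concrete, here is a minimal Python sketch. It assumes that t_g is estimated as r_g * rl / fl_g (read count times read length, divided by feature length), which is one common choice and is not spelled out in the text above. The function name `tpm` and the toy numbers are purely illustrative.

```python
def tpm(read_counts, feature_lengths, read_length):
    """Return TPM values for parallel lists of read counts and feature lengths."""
    # Assumption: t_g = r_g * rl / fl_g approximates the number of
    # transcripts explained by the reads mapped to gene g.
    t = [r * read_length / fl for r, fl in zip(read_counts, feature_lengths)]
    total = sum(t)  # T, the sum of t_g over all genes
    if total == 0:
        return [0.0 for _ in t]
    return [1e6 * t_g / total for t_g in t]


# Toy data: three genes with hypothetical counts and lengths
counts = [100, 200, 300]
lengths = [1000, 2000, 1500]
values = tpm(counts, lengths, read_length=50)

# The values sum to one million, so their mean is 10^6 / (number of genes)
mean_tpm = sum(values) / len(values)
```

With a single read length per sample, the factor rl appears in both t_g and T and cancels in the ratio. It is kept here only so that the code mirrors the definition.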

In practice, TPM values for individual genes can be computed from read count tables, i.e. tables that give the number of reads overlapping each gene. Typical programs for obtaining read count tables are htseq-count and multiBamCov (see bedtools multicov).
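As a rough illustration of how such a table could be normalized, the following Python sketch converts multicov-style rows into TPM. It assumes tab-separated BED6 input (six annotation columns) followed by one read count column per sample, takes the feature length as end minus start, and again uses t_g = r_g * rl / fl_g. The function name `normalize_rows` is hypothetical, and this is not the code of any existing tool.

```python
def normalize_rows(lines, read_length):
    """Turn BED6-plus-counts rows into BED6-plus-TPM rows (one column per sample)."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    n_samples = len(rows[0]) - 6  # assumption: first six columns are BED6
    lengths = [int(r[2]) - int(r[1]) for r in rows]  # BED intervals are half-open
    # t[s][i]: transcript proxy for gene i in sample s
    t = [
        [int(r[6 + s]) * read_length / fl for r, fl in zip(rows, lengths)]
        for s in range(n_samples)
    ]
    totals = [sum(col) for col in t]  # one T per sample
    out = []
    for i, r in enumerate(rows):
        tpms = [
            1e6 * t[s][i] / totals[s] if totals[s] else 0.0
            for s in range(n_samples)
        ]
        out.append("\t".join(r[:6] + [f"{v:.2f}" for v in tpms]))
    return out


# Two hypothetical genes, two samples
example = [
    "chr1\t0\t1000\tgeneA\t0\t+\t100\t10",
    "chr1\t2000\t4000\tgeneB\t0\t-\t200\t60",
]
normalized = normalize_rows(example, read_length=50)
```

Each sample column is normalized independently, so TPM values are comparable across genes within a sample and sum to one million per column.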

I have recently implemented normalize_multicov.pl, a tool for computing normalized RNA-seq expression in terms of TPM from multicov files. It is part of the ViennaNGS Perl Modules for NGS analysis and very easy to use: just provide it with the output of a bedtools multicov run on your data, together with the read length used for sequencing your samples, and get back a normalized multicov file with your samples expressed in TPM. That's all ...

Posted by Michael T. Wolfinger