RNAseq trimming

Hi,

TL;DR: I want to know whether RNAseq data needs to be filtered by counts (to remove low-count genes) before differential expression comparison and pathway enrichment analysis.

From the answers: it should not be needed, but if I insist on doing it, I should use the second proposed method (a low count threshold that must be reached in a certain number of samples/individuals). To select the threshold value, rely on the literature and wait for the reviewers' feedback.

Why it is not needed with DESeq2: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

Introduction: I am working on a monogenic disease that produces systemic alterations, especially disorders of the central nervous system (CNS). I have completed the analysis pipeline, but there is a point of strong contention with some colleagues: how we should filter the data by counts.

Samples: blood RNAseq

Experimental design: Two study groups, 10 individuals in each.

Original material: Raw and normalized counts, DESeq2 comparison results.

Objective: Describe the differences between controls and study individuals and proceed with enrichment analysis (GSEA).

So how should I address this problem?

1. Average count: First, there is the possibility of filtering the data by mean count. Here is the distribution of the average count for one of the comparisons. I do not really like this approach, because it is too sensitive to extreme values. Also, what if a subpopulation of the group of interest strongly overexpresses a gene that is not expressed in any of the other samples?

Some coworkers are pushing for a cutoff of 50 (log2(50 + 1) ≈ 5.67). I am not convinced by this, because in the literature I have found values around 10 or 15. My opinion: from a biological standpoint, transcripts coming from the CNS are going to be much less abundant in blood samples than, for example, those related to immune system activity and metabolism.

Figure: average counts and log2(average counts + 1).
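To make the concern about the mean-count filter concrete, here is a minimal sketch in plain Python. The two genes and their counts are made up for illustration (20 samples, as in the design above), and `passes_mean_filter` is a hypothetical helper, not part of DESeq2 or any other package. It shows the candidate cutoffs on the log2(x + 1) scale and how the mean reacts to a single extreme sample versus a small subgroup.

```python
import math
from statistics import mean

# Candidate mean-count cutoffs discussed above, on the log2(x + 1) scale.
for cutoff in (10, 15, 50):
    print(f"cutoff {cutoff:>2} -> log2(cutoff + 1) = {math.log2(cutoff + 1):.2f}")

# Hypothetical gene expressed in only 4 of the 20 individuals (a subgroup).
subgroup_gene = [0] * 16 + [120, 150, 90, 200]

# Hypothetical gene with counts in a single extreme sample.
outlier_gene = [0] * 19 + [1200]


def passes_mean_filter(counts, cutoff):
    """Return True if the mean count across samples reaches the cutoff."""
    return mean(counts) >= cutoff


for name, counts in (("subgroup_gene", subgroup_gene), ("outlier_gene", outlier_gene)):
    print(name, "mean =", mean(counts),
          "| kept at 10:", passes_mean_filter(counts, 10),
          "| kept at 50:", passes_mean_filter(counts, 50))
```

With these toy numbers, a mean cutoff of 50 drops the gene expressed in four individuals (mean 28) but keeps the gene driven by one outlier sample (mean 60), which is the opposite of what we would want.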

2. Individual counts: We can select a minimum count to be recognized as a "real/reliable level of expression", which has to be reached in a set number of samples.

First, here is the distribution of counts for each gene and sample, in case it is a relevant factor. It seems that around a value of 10 counts there is an interesting change in the behaviour of the data.

Distribution of counts per gene and sample.

Then, if we count the samples that exceed a certain threshold for each gene, we get an idea of how many samples per gene show counts over the threshold (excuse the Spanish in the figure, I was too lazy to translate it). I found it pretty interesting that there is a mass of genes that only surpass the threshold in 0, 1 or 2 samples, and another big mass of genes whose expression is over the threshold in almost all of them.

Number of samples per gene that present a number of counts greater than the threshold. Translations: Umbral = threshold; Media = average; Mediana = median.
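The tabulation behind this plot can be sketched as follows. This is only an illustration in plain Python: the toy matrix, the gene names and the helper `samples_above` are hypothetical, and I am assuming that "over the threshold" means a count of at least the threshold (`>=`).

```python
from collections import Counter

# Toy count matrix: gene -> raw counts in 6 hypothetical samples.
toy_counts = {
    "gene_low":  [0, 1, 0, 2, 0, 0],
    "gene_sub":  [0, 0, 35, 0, 42, 1],
    "gene_mid":  [12, 8, 15, 9, 11, 30],
    "gene_high": [150, 98, 210, 175, 133, 160],
}


def samples_above(counts, threshold=10):
    """Number of samples whose count is at least `threshold` (assumed >=)."""
    return sum(1 for c in counts if c >= threshold)


# For each gene, how many samples reach the threshold.
per_gene = {gene: samples_above(counts) for gene, counts in toy_counts.items()}

# Histogram: number of qualifying samples -> number of genes with that value.
histogram = Counter(per_gene.values())

for n_samples in sorted(histogram):
    print(f"{n_samples} samples >= threshold: {histogram[n_samples]} gene(s)")
```

On a real matrix, the two masses described above would show up as large bins near 0 and near the total number of samples.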

We want to find CNS genes in blood, and understandably they are going to be present in lower amounts. There is also the possibility that some genes are only expressed in some subgroups. For these reasons, in my opinion, one option is to select 10 counts as the threshold and require it in at least 4 individuals.

What do you think about this? Would you pursue other mechanisms or methodologies?
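A minimal sketch of this rule, assuming it means "a gene is kept if at least 4 samples have a count of 10 or more". The function names and the toy genes are hypothetical and only illustrate the logic; in practice the same rule would be applied to the raw count matrix before running DESeq2.

```python
def keep_gene(counts, min_count=10, min_samples=4):
    """True if at least `min_samples` samples have a count >= `min_count`."""
    return sum(c >= min_count for c in counts) >= min_samples


def filter_genes(count_matrix, min_count=10, min_samples=4):
    """Return only the genes (rows) that satisfy the per-sample count rule."""
    return {
        gene: counts
        for gene, counts in count_matrix.items()
        if keep_gene(counts, min_count, min_samples)
    }


# Toy matrix with 20 samples per gene (made-up numbers).
toy_counts = {
    "subgroup_gene": [0] * 16 + [120, 150, 90, 200],  # expressed in 4 individuals
    "outlier_gene":  [0] * 19 + [1200],               # one extreme sample
    "low_gene":      [2, 5, 1, 0, 3] * 4,             # never reaches 10 counts
    "broad_gene":    [25, 30, 18, 22] * 5,            # expressed in all samples
}

kept = sorted(filter_genes(toy_counts))
print(kept)  # ['broad_gene', 'subgroup_gene']
```

Unlike a mean-count cutoff, this rule keeps the gene expressed in a four-individual subgroup and drops the gene whose signal comes from a single sample.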

Thanks in advance,

David