TRANSECT

 

TRANSECT Manual 24.04

 

1. Background

TRANSECT works by defining two groups (strata, the plural of stratum) within a cohort based solely on the expression of a gene or gene set of interest, and subsequently compares the strata, one against the other, for global expression changes and functional differences. TRANSECT outputs descriptive statistics about the gene/s of interest, the products of the stratification process, the differential expression results and subsequent enrichment outcomes. The application uses publicly available large cohort datasets and simply requires the user to choose, at a minimum:

  1. the cohort database containing participant IDs and gene expression measurements

  2. a gene (or multiple genes) of interest whose expression levels are used to rank participants in the cohort database

  3. an integer percentile value applied to the expression or ranking measurements as a threshold to partition the cohort into low and high strata for subsequent comparisons

 

2. Getting started

2.1 Installation using Conda

(10 - 15 minutes on an average PC)

Using a Conda environment for the sizable number of prerequisite modules and dependencies needed by TRANSECT is recommended for most use cases. It is easier to set up, cleaner, simpler to manage and much quicker than the native install.

 

  1. Start by cloning the repository using the git command to a suitable location on your device

     

    Alternatively TRANSECT code and executable files can be downloaded from GitHub at https://github.com/twobeers75/TRANSECT. Click on the green "Code" button followed by "Download ZIP" (note the download location). Find the downloaded ZIP file and move it to an appropriate location if required, before extracting the contents and renaming the folder

     

  2. Install Conda on your system (version > 24.1.0). You can skip this step if you already have it. There are many wikis on how to install Conda for Ubuntu, here is just one. Please consult the Conda documentation relevant to your operating system.

     

  3. Next, we create the TRANSECT Conda environment. Here we will run an installation script that automates the creation of the TRANSECT environment, with all the tools and dependencies required to run TRANSECT. The scripts required for this can be found in the INSTALL/ subdirectory of the TRANSECT folder.

 

And that's it! You should now have all the necessary applications and dependencies in the TRANSECT environment to run this application. Please note, just like any virtual environment you are required to activate the TRANSECT environment in order to use the application. You can deactivate at will when not in use.

A few extra very useful commands for those not accustomed to Conda environments

More information about managing Conda environments can be found here

2.2 Native installation on Ubuntu

(30 - 45 minutes or longer on an average PC)

NOTE: This is not the recommended installation procedure. TRANSECT requires and depends on numerous packages and applications. These take some time to install natively if not already present. A fresh install on a vanilla Ubuntu 22.04 can take 30-45mins depending on the PC and network speeds.

  1. To start, clone the repo

     

    Alternatively TRANSECT code and executable files can be downloaded from GitHub at https://github.com/twobeers75/TRANSECT. Click on the green "Code" button followed by "Download ZIP" (note the download location). Find the downloaded ZIP file and move it to an appropriate location if required, before extracting the contents and renaming the folder

     

  2. Install python3 pip, java if required, and other TRANSECT dependencies (approx. 1-2min)

     

  3. Install R, the "pacman" package and Bioconductor specific packages. You can skip this step if you already have R. There are many wikis on how to install R on Ubuntu, here is just one (specifically for Ubuntu 22.04) (approx. 1min)

     

  4. Start R from the terminal and install pacman and devtools. Follow the prompts and choose (if asked) to install these packages into a personal library.

    Once you enter the R shell, you should see a number of lines about the R version and licenses printed in the terminal, followed by a ">" symbol. This symbol is used below to indicate that you need to be in the R shell to run these commands. Do not copy the ">" symbol itself, as the commands will not work with it. (approx. 25mins)

 

NOTE: TRANSECT requires many additional R packages; however, these are all installed on demand the first time (and only the first time) you run each of the different TRANSECT commands after a native install. Please keep this in mind on your first run, as it will take substantially longer than all subsequent runs.

 

3. Basic demonstration

TRANSECT has two main operations: Prepare and Analyse.

In order for TRANSECT to function, it first requires a cohort dataset to work on. This is retrieved and formatted appropriately by the Prepare scripts. Subsequently, TRANSECT can run analyses on the downloaded data using the Analyse scripts.

Example commands to investigate ZEB1 using the RECOUNT3 TCGA PRAD cohort:

 

4. TRANSECT workflow

4.1 Prepare workflow diagram

[Figure: Prepare workflow diagram]

4.2 Analyse workflow diagram

[Figure: Analyse workflow diagram]

 

5. Stratification modes

The basic premise of TRANSECT is to stratify individuals from large cohort transcriptomic data into defined groups, called strata, based on the expression of a single gene or of a composite gene set. The resulting strata are subsequently compared one to the other in order to assess global expression changes and functional differences.

5.1 Single gene analysis

Single gene stratification is simply the division of individuals within a cohort population into distinct strata based solely on the expression of one single gene. In the current version of TRANSECT, individuals with expression levels at or near both ends of the physiological limits for the gene of interest are grouped separately and subsequently compared (depiction below).

[Figure: depiction of single gene stratification into low and high strata]
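The single gene stratification described above can be sketched in a few lines of Python. This is only an illustration of the idea and not TRANSECT's own code: the function name `stratify`, the toy participant IDs and TPM values, and the rule used to turn a percentile into a stratum size (integer division, with a minimum of one participant) are all assumptions. The valid percentile range of 2 to 25 is the one given in the parameter descriptions of this manual.

```python
def stratify(expression, percentile):
    """Split participants into low and high strata by one gene's expression.

    expression: dict mapping participant ID -> expression value (e.g. TPM)
    percentile: integer x; the bottom x% and top x% of participants are kept
    """
    if not 2 <= percentile <= 25:
        raise ValueError("percentile must be between 2 and 25")
    # Order participants from lowest to highest expression.
    ranked = sorted(expression, key=expression.get)
    # Assumption: stratum size is x% of the cohort rounded down, at least 1.
    n = max(1, len(ranked) * percentile // 100)
    return ranked[:n], ranked[-n:]


# Toy cohort of 20 hypothetical participants with made-up TPM values.
cohort = {f"P{i:02d}": float(i * i) for i in range(1, 21)}
low, high = stratify(cohort, percentile=10)
print("low stratum: ", low)
print("high stratum:", high)
```

Participants between the two thresholds are simply left out of the comparison, which is why larger cohorts give better defined strata.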

5.2 Composite gene analysis - Additive mode

Composite analyses use information from multiple genes simultaneously to divide individuals within a cohort population into distinct strata. The additive mode of TRANSECT uses expression information from multiple genes (2 – 5 genes in the current implementation) to rank individuals expressing each of the component genes at near physiological extremes and separate them into low and high strata for subsequent DE analysis. This is achieved by computing the average of the rank positions across all component genes for each participant and using this metric to order the individuals within the cohort. Then, in like manner to the single gene analysis, individuals with extremely high average rank positions are grouped and compared to individuals with extremely low average rank positions.

It is important to note that this process leads, in most scenarios, to the exclusion of participants with extreme expression for any one (or more) of the component genes of interest. A good example of this can be seen from the additive mode case study for ESR1, PGR and ERBB2 (triple-negative breast cancer genes) in the RECOUNT3 BRCA cohort.
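A minimal sketch of the average-rank idea, written in Python purely for illustration, is given here. The helper names (`rank_positions`, `additive_order`), the way ties are broken and the TPM values are assumptions and are not taken from TRANSECT itself. Participant "A" mimics the situation described above: very high expression of one gene (ERBB2) but very low expression of the other two.

```python
def rank_positions(values):
    """Return participant -> rank position (1 = lowest expression).

    Assumption: ties are broken by input order.
    """
    ordered = sorted(values, key=values.get)
    return {pid: pos for pos, pid in enumerate(ordered, start=1)}


def additive_order(expression_by_gene):
    """Order participants by their mean rank across all component genes."""
    genes = list(expression_by_gene)
    if not 2 <= len(genes) <= 5:
        raise ValueError("additive mode expects 2 to 5 genes")
    ranks = {g: rank_positions(expression_by_gene[g]) for g in genes}
    participants = list(expression_by_gene[genes[0]])
    mean_rank = {
        p: sum(ranks[g][p] for g in genes) / len(genes) for p in participants
    }
    return sorted(participants, key=mean_rank.get), mean_rank


# Made-up TPM values for five hypothetical participants.
tpm = {
    "ESR1": {"A": 0.4, "B": 5.0, "C": 40.0, "D": 90.0, "E": 200.0},
    "PGR": {"A": 0.2, "B": 3.0, "C": 25.0, "D": 60.0, "E": 150.0},
    "ERBB2": {"A": 3700.0, "B": 20.0, "C": 45.0, "D": 80.0, "E": 300.0},
}
order, mean_rank = additive_order(tpm)
print(order)           # participants from lowest to highest mean rank
print(mean_rank["A"])  # mean rank of the ERBB2-only high expresser
```

With one participant per stratum, "B" would form the low stratum and "E" the high stratum. "A" lands in neither, despite having the highest ERBB2 value in the toy cohort.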

The figure below (a TRANSECT output for this type of analysis) plots the expression level separately for each component gene (here ESR1, PGR and ERBB2), for each participant. Each participant in the cohort therefore occupies a single point on each of the three distributions in the figure, corresponding to that participant's expression level for each of the three genes of interest. Featured on the figure is participant TCGA-C8-A12Z-01A, who possesses the highest expression level for ERBB2 (TPM=3747.42) of all participants in the cohort but expresses neither ESR1 nor PGR (TPM=0.37 and 0.15 respectively). Individuals stratified by TRANSECT who rank low for expression of all three receptors are marked in cyan (one point per individual on each of the three distributions) and, conversely, those ranking high for expression of all three receptor genes are marked in magenta.

[Figure: TRANSECT additive mode output showing per-participant expression of ESR1, PGR and ERBB2 in the RECOUNT3 BRCA cohort]

5.3 Composite gene analysis - Ratio mode

In the same manner as the additive mode described above, the ratio mode also considers information from multiple genes simultaneously to partition individuals within a cohort population into distinct strata. The ratio mode uses expression information from strictly two genes to rank individuals. In order to achieve this, TRANSECT calculates a simple log-ratio statistic (log fold change) between the 2 genes of interest for each patient and uses this as the rank metric. Extremely low ratio scores will demarcate participants where geneA >>> geneB and vice versa. Again, individuals at both extremes are grouped and compared.

Like the additive mode, participants with extreme expression for any one of the component genes may, or may not, make it into the stratified groups for later comparison.
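The ratio metric can be illustrated with a short Python sketch. The exact formula used by TRANSECT is not given in this manual, so the following are assumptions: the score is computed as log2((geneB + pseudocount) / (geneA + pseudocount)), a pseudocount of 1 is added to avoid taking the log of zero, and the orientation is chosen only so that low scores correspond to geneA >>> geneB as stated above. The participant IDs and TPM values are made up.

```python
import math


def ratio_scores(gene_a, gene_b, pseudocount=1.0):
    """Return participant -> log2 ratio score for two genes.

    Assumption: score = log2((B + pseudocount) / (A + pseudocount)), so that
    low scores correspond to geneA >> geneB.
    """
    return {
        pid: math.log2((gene_b[pid] + pseudocount) / (gene_a[pid] + pseudocount))
        for pid in gene_a
    }


# Made-up TPM values for four hypothetical participants.
gene_a = {"P1": 255.0, "P2": 15.0, "P3": 7.0, "P4": 0.0}
gene_b = {"P1": 0.0, "P2": 7.0, "P3": 15.0, "P4": 255.0}

scores = ratio_scores(gene_a, gene_b)
order = sorted(scores, key=scores.get)  # lowest to highest ratio score
for pid in order:
    print(pid, scores[pid])
```

Participants at the two ends of `order` are the candidates for the low and high strata. Those in the middle, where the two genes are expressed at similar levels, are left out.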

The figure below (another TRANSECT output for this type of analysis) plots the expression level separately for the two component genes (IL3RA and CSF2RB), for each participant. Here, each participant in the cohort occupies two points on the plot (one directly above the other), corresponding to that participant's expression level for each of the two genes of interest. Participants are ordered across the x-axis from low to high based on their ratio score (converted to a rank metric). Featured on the figure are two participants, TCGA-AB-2868-03A and TCGA-AB-2959-03A, who rank lowest and highest respectively for this analysis in this cohort, shown with both their ratio score and TPM expression values. The solid vertical blue and red lines demarcate the thresholds for inclusion into the low and high strata respectively.

[Figure: TRANSECT ratio mode output showing per-participant expression of IL3RA and CSF2RB, ordered by ratio score]

5.4 Multimodal analysis

In select large cohort studies there exist measurements derived from multiple omics for the same individual at the same or similar timepoints. For example, the TCGA study consists of RNA (mRNA, miRNA), and DNA (methylation, mutation, and copy number) data in addition to the associated global proteomics data generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). TRANSECT has the facility to survey changes in one omics data type based on the stratification of individuals using matched data from another omics. As in the use cases above, individuals at each extreme are grouped and compared.

TRANSECT comes preconfigured with the ability to assess global changes in mRNAs after stratifying cohort participants by their miRNA expression. This can only be run using GDC TCGA data that possess sufficient numbers of matched mRNA and miRNA expression data. Other types of multimodal analyses using different omics types require custom configuration.

 

6. Prepare commands and options

Prepare is a process that retrieves the raw data from online repositories and prepares it (if required) for analysis. TRANSECT comes bundled with three different prepare scripts, one each for RECOUNT3, GTEx and GDC-TCGA data.

All downloaded and formatted data is stored by default in the TRANSECT/data/<RECOUNT3|GTEx|GDC>/ subdirectory in individual folders named by tissue/cancer abbreviation. For example, the RECOUNT3 PRAD data downloaded in the basic demonstration of this manual is stored in /TRANSECT/data/RECOUNT3/PRAD/

6.1 RECOUNT3

Retrieve and prepare RECOUNT3 RNA-seq data for in-house custom analyses.

USAGE:

PARAMETERS:

  -h  Show help text

  -p  RECOUNT3 project id: needs to be a valid RECOUNT3 project id (e.g. BRCA for TCGA data or BREAST for GTEx). Required

You can retrieve and prepare more than one RECOUNT3 dataset by using a bash for loop like this:

6.2 GTEx

Retrieve and prepare GTEx RNA-seq data for in-house custom analyses. Unlike RECOUNT3 and GDC data retrieval, GTEx data for all tissue types are retrieved in a single file. Subsequently, this is separated into tissue specific datasets using the information in the metadata file.

USAGE:

PARAMETERS:

  -h  Show help text

  -a  retrieve all expression data (mRNA counts and TPMs). Required for the proper functioning of TRANSECT

  -c  retrieve only mRNA counts

  -t  retrieve only mRNA TPMs

6.3 GDC

Retrieve and prepare TCGA RNA-seq data for in-house custom analyses.

USAGE:

PARAMETERS:

  -h  Show help text

  -p  TCGA project id: needs to be a valid TCGA project id as used by GDC (e.g. TCGA-BRCA). Required

  -a  retrieve all expression data (mRNA counts and TPMs as well as miR and isomiR RPMs)

  -c  retrieve only mRNA counts

  -r  retrieve only miR RPMs

  -R  retrieve only isomiR RPMs

  -k  keep all data (Default: False)

  -n  data is not from TCGA study (Default: False)

To retrieve and prepare more than one TCGA cancer dataset, use a bash for loop like this:

To retrieve and prepare all TCGA cancer datasets, you can loop through all lines in GDC_API/TCGA_Study_Abbreviations.tsv (WARNING: this requires lots of time, network bandwidth and disk space)

 

Please be aware that some of these collections are large and require substantial disk space. They can take a considerable amount of time to download and process. For example, downloading and processing GDC TCGA-BRCA takes just over 30 minutes (using a high speed network connection and an up to date workstation) and requires more than 14GB of disk space (most of which can be, and by default is, deleted afterwards). In comparison, GDC TCGA-LAML takes less than 5 minutes to retrieve and requires less than 2GB of disk space.

In addition, the GDC prepare script often fails when downloading large datasets. This is caused by network connectivity issues (tested only in Australia) with the GDC repository. If you experience issues, delete the relevant dataset and retry the prepare command.
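The "delete and retry" advice above can be automated. The Python sketch below is one hypothetical way of doing so and is not part of TRANSECT: `prepare_with_retry` and its arguments are made-up names, and `run_prepare` stands for any zero-argument callable that raises an exception when the prepare command fails (for example a small wrapper around `subprocess.run(..., check=True)` calling the GDC prepare script).

```python
import shutil
from pathlib import Path


def prepare_with_retry(run_prepare, dataset_dir, attempts=3):
    """Call run_prepare(); on failure delete dataset_dir and try again.

    run_prepare is a hypothetical zero-argument callable that raises on
    failure. Returns the number of the attempt that succeeded.
    """
    dataset_dir = Path(dataset_dir)
    for attempt in range(1, attempts + 1):
        try:
            run_prepare()
            return attempt
        except Exception:
            # Remove the partial download so the next attempt starts clean.
            shutil.rmtree(dataset_dir, ignore_errors=True)
            if attempt == attempts:
                raise
```

The dataset directory to pass in is the per-project folder described at the start of this section. Only that folder is removed between attempts.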

 

7. Analysis commands and options

Analyse is a process that uses the prepared public data from above, conducts the stratified differential expression and produces all the outputs. Like with the prepare operations, TRANSECT comes bundled with three analyse scripts, one each for RECOUNT3, GTEx and GDC-TCGA.

Unlike the prepare operations, the output from these calls is saved in the current working directory, and it is therefore recommended to create a descriptively named folder for each of your analyses. TRANSECT comes with a preinstalled output folder containing subdirectories (TRANSECT/output/<RECOUNT3|GTEx|GDC>/); however, you may choose any working directory at your discretion. Keep in mind that if TRANSECT output exists in the current working directory, it will be overwritten.
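Because existing output is overwritten, it can be worth checking a working directory before starting an analysis. The following Python sketch is a hypothetical helper and not part of TRANSECT. It looks for the three output folder names listed in the Output section of this manual (01-Stratification, 02-DE and 03-Enrichment).

```python
from pathlib import Path

# Output folder names as listed in the Output section of this manual.
OUTPUT_FOLDERS = ("01-Stratification", "02-DE", "03-Enrichment")


def existing_outputs(workdir="."):
    """Return the TRANSECT output folders already present in workdir."""
    base = Path(workdir)
    return [name for name in OUTPUT_FOLDERS if (base / name).is_dir()]


found = existing_outputs()
if found:
    print("Existing output would be overwritten:", ", ".join(found))
else:
    print("No previous TRANSECT output found in this directory.")
```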

For each script, composite analyses can be run using the plus character (+) for additive combinations or the percent character (%) for ratio. The special character is placed between gene names, for example ESR1+PGR+ERBB2 (additive) or ESRP1%ZEB1 (ratio).
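To make the gene syntax concrete, here is a small Python sketch of how such an argument could be interpreted. It is not TRANSECT's parser: the function name `parse_gene_spec` is made up, and rejecting a mix of "+" and "%" is an assumption. The gene count limits (2 to 5 for additive, exactly 2 for ratio) are those given in the Stratification modes section.

```python
def parse_gene_spec(spec):
    """Split a gene argument into (mode, list of gene symbols)."""
    if "+" in spec and "%" in spec:
        # Assumption: the two composite modes cannot be combined.
        raise ValueError("use either '+' or '%', not both")
    if "+" in spec:
        mode, genes = "additive", spec.split("+")
        if not 2 <= len(genes) <= 5:
            raise ValueError("additive mode takes 2 to 5 genes")
    elif "%" in spec:
        mode, genes = "ratio", spec.split("%")
        if len(genes) != 2:
            raise ValueError("ratio mode takes exactly 2 genes")
    else:
        mode, genes = "single", [spec]
    if any(not gene for gene in genes):
        raise ValueError("empty gene name")
    return mode, genes


for example in ("ZEB1", "ESR1+PGR+ERBB2", "ESRP1%ZEB1"):
    print(example, "->", parse_gene_spec(example))
```

Note that "%" and "+" can have special meaning in some shells or URLs, so quoting the gene argument on the command line is a sensible precaution.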

7.1 RECOUNT3

Differential expression analysis of RECOUNT3 data stratified into high and low groups by gene of interest. Please run this wrapper script in the directory of the desired output location.

USAGE:

PARAMETERS:

  -h  Show help text

  -p  RECOUNT3 tissue id: needs to be a valid RECOUNT3 tissue id as at RECOUNT3 (e.g. BRCA for TCGA or BREAST for GTEx). Required

  -g  Gene of interest: needs to be a valid HGNC symbol (e.g. ZEB1). Required

  -s  Stratify by molecule: must match -g and can only be mRNA at present. Required

  -t  Percentile: stratify data into top and bottom x percentile (valid x between 2 and 25). Required

  -e  Enrichment analyses: run GSEA on DE results (Default: only run WebGestalt)

  -S  Switch pairwise comparison: find genes DE in low group compared to high group (Default: high compared to low)

  -a  Do all analyses

  -c  Do correlation analysis only

  -d  Do differential expression analysis only

7.2 GTEx

Differential expression analysis of GTEx data stratified into high and low groups by gene of interest. Please run this wrapper script in the directory of the desired output location.

USAGE:

PARAMETERS:

  -h  Show help text

  -p  GTEx tissue id: needs to be a valid GTEx tissue id as at GTEx (e.g. Breast). Required

  -g  Gene of interest: needs to be a valid HGNC symbol (e.g. ZEB1). Required

  -s  Stratify by molecule: must match -g and can only be mRNA at present. Required

  -t  Percentile: stratify data into top and bottom x percentile (valid x between 2 and 25). Required

  -e  Enrichment analyses: run GSEA on DE results (Default: only run WebGestalt)

  -S  Switch pairwise comparison: find genes DE in low group compared to high group (Default: high compared to low)

  -a  Do all analyses

  -c  Do correlation analysis only

  -d  Do differential expression analysis only

7.3 GDC

Differential expression analysis of TCGA data stratified into high and low groups by gene of interest. Please run this wrapper script in the directory of the desired output location.

USAGE:

PARAMETERS:

  -h  Show this help text

  -p  TCGA project id: needs to be a valid TCGA project id as at the GDC (e.g. TCGA-BRCA). Required

  -g  Gene of interest: needs to be a valid HGNC symbol (e.g. ZEB1). Required

  -s  Stratify by molecule: must match -g and can only be one of (mRNA or miRNA). Required

  -t  Percentile: stratify data into top and bottom x percentile (valid x between 2 and 25). Required

  -e  Enrichment analyses: run GSEA on DE results (Default: only run WebGestalt)

  -S  Switch pairwise comparison: find genes DE in low group compared to high group (Default: high compared to low)

  -a  Do all analyses

  -c  Do correlation analysis only

  -d  Do differential expression analysis only

 

8. Output

TRANSECT takes in a cohort dataset and processes the data as follows.

  1. First, TRANSECT partitions the data by the expression of the gene/s of interest into low and high strata

  2. Subsequently, TRANSECT compares the resulting strata, one to the other, to identify differentially expressed genes

  3. And finally, TRANSECT uses the results from the DE analysis to run functional annotation and enrichment analyses

The outputs from TRANSECT are likewise grouped into three categories and returned in three folders in the working directory from which the program is executed.

8.1 01-Stratification

The stratification process produces 2 tables, and 3 plots.

  1. GOI_exp_raw_OG.tsv contains the raw original expression (TPM) data for all gene/s of interest

  2. GOI_exp_with_strat.tsv contains the same data sorted with additional columns relating to the participants ranking score, percentiles and quantile values.

  3. TPM_histogram.html or TPM_Boxplot_Sina.html or TPM_Scatter.html, depending on the chosen TRANSECT mode. Each plots data from the two tables above in a different way to describe the distribution of gene expression across the cohort participants

  4. TPM_N-T_boxplot.html which shows the distribution of expression partitioned by disease state when available

  5. TPM_strat_boxplot.html which plots the low and high strata participants resulting from the stratification process

8.2 02-DE

The DE analysis produces many tables and plots most easily described as follows.

  1. DE Setup – design.tsv and gene_raw_expression_data_cpm.csv

  2. DE QC – bcv and mean_var.png plots as well as the MDS-Plot.html in the glimma-plots folder

  3. Normalised expression tables - gene_normalised_expression_data_cpm.csv (also in log form)

  4. DE result tables - High_Vs_Low_de_sigFC.csv and top_tags.csv

  5. DE result plots - High_Vs_Low_volcano.png and High_Vs_Low_heatmap.png as well as an interactive version of the volcano plot in the glimma-plots folder

  6. The glimma-plots folder containing the interactive web plots and associated data

8.3 03-Enrichment

The 2 enrichment analyses result in the production of two folders each with a separate collection of tables and plots.

  1. GSEA

    When selected, this folder contains the output folders from running GSEA against the Hallmark as well as the Curated MSigDB collections respectively. Within each folder, users can open the index.html file to access and interact with the results in a web browser. In addition, the results are summarised and provided in tabular form (.csv) as well as interactive form (.html). See the GSEA User Guide for more details

    GSEA input data – 3 text files used for the GSEA analysis are saved in the top-level folder. The default GSEA method used by TRANSECT is the pre-ranked method. Input for this analysis can be found in the .rnk file. Provided but not used by TRANSECT are alternate GSEA input files (.cls and .txt). These files can be used to rerun GSEA outside of TRANSECT with custom parameters against different collections.

  2. WebGestalt

    The ORA results are presented in six folders; two each for disease, gene ontology and pathway enrichment, for up and down regulated genes separately (when available). Within each folder, users can open the .html file to access and interact with the results in a web browser. See the WebGestalt Manual for more details

 

9. Precautions

9.1 Cohort size

TRANSECT requires large numbers of participants in the cohort data sets to achieve appropriate stratification and grouping. Ideally, individual members of each stratum derived from the stratification process will share highly similar attributes or characteristics (here, gene expression levels). Cohort data sets with low participant numbers are unlikely to provide the random sampling of a population needed to achieve defined strata containing members with shared characteristics, and may force the allocation of members with different characteristics into the same stratum.
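A quick back-of-the-envelope calculation shows why small cohorts are a problem. The Python sketch below tabulates how many participants end up in each stratum for a few cohort sizes and percentile thresholds. The rounding rule (rounding down) is an assumption and may differ from what TRANSECT does, and the cohort sizes are arbitrary examples.

```python
def stratum_size(cohort_size, percentile):
    """Participants per stratum, assuming x% of the cohort rounded down."""
    return cohort_size * percentile // 100


percentiles = (2, 5, 10, 25)
print("cohort", *[f"{p}%" for p in percentiles], sep="\t")
for n in (40, 150, 500, 1000):
    print(n, *[stratum_size(n, p) for p in percentiles], sep="\t")
```

Under this assumption, a cohort of 40 participants yields no participants at all per stratum at the 2nd percentile and only 2 at the 5th, far too few for a meaningful differential expression analysis.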

9.2 Bulk RNA-seq heterogeneity

The heterogeneity of cell types within the bulk tissue samples that make up these cohort data sets can lead to misleading observations if not carefully considered. As an example, whilst examining one of our case studies using the RECOUNT3 GTEx Blood cohort (ratio mode - IL3RA%CSF2RB), we stumbled upon separation statistics in the MDS plot that appeared altogether unlikely and very dubious.

[Figure: MDS plot of high (red) and low (blue) strata for the IL3RA%CSF2RB analysis of the RECOUNT3 GTEx Blood cohort]

A few telling signs stand out. First, dimension 1 alone accounts for close to 90% of the separation between high (red) and low (blue) samples, something usually only observed between isogenic cell lines or technical replicates. Second, four samples from the high group appear not to belong. Diving into the GTEx metadata for the participants in these strata, we found that the RECOUNT3 GTEx Blood cohort had unaccounted substructure!

Below is the table of participants used in the above analysis separated by stratum. On the left is the low group (light blue background) with GTEx metadata showing consistently for each participant, membership in "GTEx SMTS Blood" and "GTEx SMTSD Whole Blood", as expected. On the right (light red background), the same for the high stratum showing that although all participants in this stratum are annotated as "GTEx SMTS Blood", the detailed annotation reveals all but four are actually "Cells - EBV-transformed lymphocytes". Unintentionally, we were comparing Whole Blood to EBV-transformed lymphocytes. (NOTE: “SMTS” - Tissue Type, area from which the tissue sample was taken and “SMTSD” - Tissue Type, more specific detail of tissue type)

[Figure: table of participants by stratum with their GTEx SMTS and SMTSD annotations]

Undoubtedly, there is more substructure in the Whole Blood cohort, some of it subtle, some perhaps less so.
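A check of this kind is easy to script once the stratum membership and the cohort metadata are at hand. The Python sketch below counts the values of a metadata field (here SMTSD) within each stratum and flags strata that contain more than one value. The participant IDs are made up and the data structures are assumptions. Only the annotation values are taken from the example above.

```python
from collections import Counter


def composition(stratum_ids, metadata, field="SMTSD"):
    """Count the values of one metadata field among the members of a stratum."""
    return Counter(metadata[pid][field] for pid in stratum_ids)


# Made-up participant IDs; the annotation values are the ones quoted above.
metadata = {
    "L1": {"SMTS": "Blood", "SMTSD": "Whole Blood"},
    "L2": {"SMTS": "Blood", "SMTSD": "Whole Blood"},
    "H1": {"SMTS": "Blood", "SMTSD": "Cells - EBV-transformed lymphocytes"},
    "H2": {"SMTS": "Blood", "SMTSD": "Cells - EBV-transformed lymphocytes"},
    "H3": {"SMTS": "Blood", "SMTSD": "Whole Blood"},
}
strata = {"low": ["L1", "L2"], "high": ["H1", "H2", "H3"]}

for name, members in strata.items():
    counts = composition(members, metadata)
    flag = "mixed" if len(counts) > 1 else "uniform"
    print(name, dict(counts), flag)
```

Running the same count on the coarser SMTS field would report both strata as uniform, which is exactly how the substructure went unnoticed in the first place.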

 

10. Example commands

Single gene analysis

Composite gene analysis - Additive mode

Composite gene analysis - Ratio mode

Multimodal analysis

 

11. Custom Databases

11.1 Subsetting preconfigured DBs

TRANSECT comes with the capability of retrieving and formatting data from GTEx, GDC and RECOUNT3, ready for analyses. In some situations, there may be a strong case for refining or "subsetting" these data into smaller parts. At a bare minimum, TRANSECT requires 2 elements for each DB: 1. Count matrix and 2. TPM matrix. These matrices must be matching (identical dimensions with identical row and column names).
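Before pointing TRANSECT at a subset, it is worth confirming that the two matrices really do match. The Python sketch below compares the row and column labels of two tab-delimited matrices. The layout (a header row of sample IDs, with the first column holding gene IDs), the requirement that labels appear in the same order, and the toy values are all assumptions made for illustration.

```python
import csv
import io


def read_labels(handle):
    """Return (column names, row names) of a tab-delimited matrix.

    Assumption: the first line is a header and the first column holds the
    row names.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [record[0] for record in reader if record]
    return header[1:], rows


def matrices_match(counts_handle, tpm_handle):
    """True if both matrices have the same row and column labels, in order."""
    return read_labels(counts_handle) == read_labels(tpm_handle)


# Toy matrices with made-up values; real files would be opened with open().
counts = io.StringIO("gene\tS1\tS2\nZEB1\t10\t250\nESRP1\t300\t4\n")
tpm = io.StringIO("gene\tS1\tS2\nZEB1\t1.2\t31.0\nESRP1\t40.5\t0.6\n")
print(matrices_match(counts, tpm))
```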

Here, we will use data from RECOUNT3 for TCGA-BRCA as an example of the steps required to initiate a subset-DB. We start in R, requiring the tidyverse and data.table R libraries. Once in R or RStudio, navigate to your RECOUNT3 BRCA data repository and execute the following code.

Next, we open a bash terminal and execute the following commands

Done! The same procedure can be used for subsetting any other DB from each of the three data sources. Just remember to abide by the appropriate naming conventions and to apply the new name consistently in the TRANSECT/REF_FILES/study_abbreviations/ documents.

11.2 Using private, in-house or custom data sets

Private, in-house and custom data can likewise be configured if the reference genome used to create the matrix is the same as the one TRANSECT uses (GENCODE v36 for GDC, v26:v39 for GTEx and v26 for RECOUNT3). In cases where this is not true, extra steps are required. Let's start from the beginning.

Open a bash terminal and execute the following commands

We will come back to this terminal but need now to swap over into R for some data wrangling.

Next, we go back to the bash terminal and execute the following commands

Stratification and DE work. If this is all you need then job done!

But, both WebGestalt and GSEA expect actual gene names and will complain about the likes of "ENSG00000112531.16_QKI". If these analyses are required then, back into R to fix the issue.
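The manual performs this fix in R. Purely to illustrate the renaming idea, here is an equivalent sketch in Python. It assumes that every affected identifier has the form "ENSG-id, underscore, gene symbol", as in the example above, and the function name `gene_symbol` is made up. Identifiers with additional underscores, or symbols that become duplicated after renaming, would need extra handling that is not shown here.

```python
def gene_symbol(identifier):
    """Return the symbol part of an 'ENSG..._SYMBOL' identifier.

    Identifiers that do not follow this pattern are returned unchanged.
    """
    if identifier.startswith("ENSG") and "_" in identifier:
        # Keep everything after the first underscore.
        return identifier.split("_", 1)[1]
    return identifier


for name in ("ENSG00000112531.16_QKI", "ZEB1"):
    print(name, "->", gene_symbol(name))
```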

And, finally (I promise :-)), we can run TRANSECT to completion.

And, that's it! A few things to consider if you decide to venture into something like this;

  1. Be prepared: you will likely have to do some major data wrangling, and a good data scientist/bioinformatician goes a long way.

  2. Some data sets are extremely large and equally unwieldy. Be prepared.

  3. Because TRANSECT is configured to recognise TCGA and GTEx participant identifiers, when using these data TRANSECT knows which samples are cancer and which are not ("normal"). For custom DBs with unknown/unrecognisable participant/sample IDs, TRANSECT will not know whether these are diseased or not, whether they are human or not, whether the data is expression or not. In these cases, the plots produced will all default to "normal" samples and "TPM" measurements, whether this is true or not.
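For TCGA data, the disease state mentioned in the last point can be read directly from the sample barcode: the first two digits of the fourth field are the sample type code, where 01 to 09 denote tumour samples, 10 to 19 normal samples and 20 to 29 control samples. The Python sketch below shows this convention. It illustrates the TCGA barcode scheme only and is not necessarily how TRANSECT implements the check; the function name `tcga_sample_class` and the "unknown" fallback are assumptions.

```python
def tcga_sample_class(barcode):
    """Classify a TCGA barcode as 'tumour', 'normal', 'control' or 'unknown'."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or not parts[3][:2].isdigit():
        # Not a recognisable TCGA sample barcode (e.g. a custom sample ID).
        return "unknown"
    code = int(parts[3][:2])
    if 1 <= code <= 9:
        return "tumour"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


for barcode in ("TCGA-C8-A12Z-01A", "TCGA-AB-2868-03A", "my_sample_7"):
    print(barcode, "->", tcga_sample_class(barcode))
```

A custom sample ID such as "my_sample_7" carries no such information, which is why plots for custom DBs fall back to a single default label.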

 

12. Publication details

Currently unpublished.

 

This manual was produced in markdown using typora v1.9.5