Projects

psalguero · Accepted Answer

The tappAS application is project-based: you create a project, input your data, and work with it. Each project has a corresponding file folder where all its data and analysis results are stored. The application provides all the necessary project management functions: create, open, rename, and delete. To create a project, you must provide the following information:

A unique project name
The biological species associated with the RNA-seq data
The file location of your annotation features file, or a selection from the annotation files provided by the application
The experiment type
The file location for your experiment design file
The file location for your transcript level raw counts expression matrix file
Optionally, but recommended, the low count and coefficient of variation filtering parameters
Optionally, the file location of an inclusion or exclusion transcripts list for filtering

Input Data and Filtering

There are three input data files required to create a project: an experiment design file, a transcript level raw counts expression matrix, and a corresponding annotation file. The blocks of the input data and optional filtering diagram (A through F) are described below:

A. Experiment Design

An experiment design file defines the experimental groups, the time slots (for time-course experiments), and the replicates. The first experimental group is considered the control group. See Experiment Design File Format for details.

B. Expression Matrix

A data file containing transcript level raw counts for one or more experimental groups and one or more time points with at least two replicates each. You must provide raw counts in the expression matrix; they are required for some statistical analyses. Internally, the application maintains a copy of the original raw counts matrix as well as a normalized copy. See Expression Matrix File Format for details.

C. Annotation Features

A data file containing annotation features for all expressed transcripts. Any transcript in the expression matrix that is not included in this file will be filtered out. You may use one of the annotation files provided by the application or use your own. The application currently provides the following annotation files:

Homo sapiens – Ensembl and RefSeq
Mus musculus – Ensembl and RefSeq
Arabidopsis thaliana – Ensembl
Zea mays – Ensembl

See Annotation Features File Format for details.

D. Low Counts and Coefficient of Variation Filter

An optional filter for removing transcripts with low expression levels and inconsistent expression values across samples.
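
To illustrate the idea behind this filter, here is a minimal Python sketch. It is not the application's implementation: the function name `passes_filter`, the thresholds `min_mean` and `max_cv`, and the rule of applying them to all samples of a transcript at once are assumptions made for illustration. The coefficient of variation is computed as the sample standard deviation divided by the mean.

```python
from statistics import mean, stdev


def passes_filter(counts, min_mean=1.0, max_cv=1.0):
    """Return True if a transcript's counts are high and consistent enough.

    Hypothetical rule: keep the transcript when its mean count is at least
    `min_mean` and its coefficient of variation is at most `max_cv`.
    Both thresholds are illustrative, not the application's defaults.
    """
    avg = mean(counts)
    if avg == 0 or avg < min_mean:
        return False  # low (or zero) expression
    cv = stdev(counts) / avg  # coefficient of variation
    return cv <= max_cv


matrix = {
    "Transcript.A": [7275, 3602, 3707, 3485],  # high and fairly consistent
    "Transcript.B": [0, 0, 1, 0],              # low counts
    "Transcript.C": [100, 1, 1, 1],            # inconsistent across samples
}
kept = [name for name, counts in matrix.items() if passes_filter(counts)]
print(kept)
```

With these illustrative thresholds only `Transcript.A` is kept.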

E. Transcripts Filter

An optional transcripts filter for removing unwanted transcripts. You may provide either an inclusion list of transcripts to keep or an exclusion list of transcripts to filter out. For example, you may initially bring all the data into a project, use the application's ad hoc queries or analysis results to generate and export a transcripts list, and then re-input the data into the project with the exported list applied as a filter.

F. Project Data

The project data consists of all the transcripts that remain after filtering, along with their corresponding annotation features. Transcripts that are filtered out are no longer part of the project data. For example, if a gene contains five isoforms and two of them are filtered out, the project data will only have three isoforms for the gene. If all isoforms for a gene are filtered out, the gene will no longer be part of the project data. It is important to understand that, from the application's perspective, the data included in the project represents the 'universe' for the project. Genes and transcripts that are not part of the project data are not taken into account in any way by the application. For example, 'All genes' in a data analysis refers to all genes in the project data, not all genes for the species or all genes in the annotation file. You may re-input the data for a project at any time; however, all existing analysis results will be cleared.
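
The effect of filtering on the project 'universe' can be sketched in a few lines of Python. This is only an illustration of the behavior described above; the function names and the transcript and gene identifiers are hypothetical.

```python
from collections import Counter


def apply_transcript_filter(transcript_to_gene, inclusion=None, exclusion=None):
    """Return the transcripts that remain after an inclusion or exclusion list."""
    kept = {}
    for transcript, gene in transcript_to_gene.items():
        if inclusion is not None and transcript not in inclusion:
            continue  # not on the inclusion list
        if exclusion is not None and transcript in exclusion:
            continue  # explicitly filtered out
        kept[transcript] = gene
    return kept


def isoforms_per_gene(transcript_to_gene):
    """Count the isoforms each gene has in the given transcript set."""
    return dict(Counter(transcript_to_gene.values()))


# Hypothetical data: GeneA has five isoforms, GeneB has one.
all_transcripts = {
    "GeneA.1": "GeneA", "GeneA.2": "GeneA", "GeneA.3": "GeneA",
    "GeneA.4": "GeneA", "GeneA.5": "GeneA", "GeneB.1": "GeneB",
}
project = apply_transcript_filter(
    all_transcripts, exclusion={"GeneA.4", "GeneA.5", "GeneB.1"}
)
print(isoforms_per_gene(project))
```

GeneA is left with three isoforms, and GeneB no longer appears at all, so an 'All genes' analysis on this project would cover GeneA only.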

Expression Matrix Data Normalization

As previously stated, the application keeps a copy of the original raw counts expression matrix and also creates a new matrix of normalized counts. The data are normalized with the Trimmed Mean of M-values (TMM) procedure by Robinson and Oshlack, as provided in the R package NOISeq. You may view the NOISeq documentation and installation instructions at:

https://www.bioconductor.org/packages/release/bioc/html/NOISeq.html
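
To give a feel for what TMM does, the following Python sketch computes a scaling factor for one sample relative to a reference sample. It is a deliberately simplified, unweighted variant written for illustration; it is not the NOISeq code and will not reproduce the application's normalized values. The function name `tmm_factor` is hypothetical, and the trimming fractions (30% of log-ratios, 5% of average log-expression) are assumed defaults.

```python
import math


def tmm_factor(sample, reference, m_trim=0.30, a_trim=0.05):
    """Simplified, unweighted TMM scaling factor of `sample` vs `reference`."""
    n_sample, n_ref = sum(sample), sum(reference)
    points = []  # (M, A) per transcript expressed in both samples
    for ys, yr in zip(sample, reference):
        if ys > 0 and yr > 0:
            ps, pr = ys / n_sample, yr / n_ref
            m = math.log2(ps / pr)        # log fold change
            a = 0.5 * math.log2(ps * pr)  # average log expression
            points.append((m, a))
    n = len(points)
    # Rank-based trimming: drop the extremes of M and of A from both ends.
    by_m = sorted(range(n), key=lambda i: points[i][0])
    by_a = sorted(range(n), key=lambda i: points[i][1])
    keep_m = set(by_m[int(n * m_trim): n - int(n * m_trim)])
    keep_a = set(by_a[int(n * a_trim): n - int(n * a_trim)])
    keep = keep_m & keep_a
    if not keep:
        return 1.0  # nothing left to estimate from
    mean_m = sum(points[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
deeper = [2 * count for count in reference]  # same composition, twice the depth
skewed = reference[:-1] + [1000]             # one transcript dominates
print(tmm_factor(deeper, reference))
print(tmm_factor(skewed, reference))
```

A sample that differs from the reference only in sequencing depth gets a factor of 1, while the sample dominated by one highly expressed transcript gets a factor below 1, which compensates for the depressed proportions of its other transcripts.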

Experiment Design File Format

The experiment design file defines the relationship between the expression matrix data and the various experimental groups, time slots, and replicates. There are three experiment types supported by the application:

Case-Control
Time-Course Single Series
Time-Course Multiple Series

The design file changes depending on the experiment type. However, regardless of experiment type, you can use the same expression matrix and modify only the design file. This gives you the option to run a case-control analysis or a time-course single series analysis using the data from a time-course multiple series experiment. You may also leave out replicates, time slots, etc. without making any changes to the expression matrix. Regardless of what data you use from the expression matrix, the first experimental group is treated as the control group where relevant. The following format rules apply to all design files:

The data must be in Tab Separated Values (TSV) format and must contain a single line header
Comment lines are not allowed
The first experimental group is considered the control group where relevant
All samples for an experimental group must be grouped together
All samples for a given time slot, within an experimental group, must be grouped together
All time slots for a given group must be specified in chronological order
Time values must be specified using numbers only - no time units
Sample column names must be unique
Sample column names are case-sensitive and must match the expression matrix
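
The rules above can be checked before creating a project. The following Python sketch is one possible pre-flight check, not part of the application. The function name `validate_design` is hypothetical, and it assumes the column layouts shown in the sample design files in this document: two columns (sample, group) for case-control and three columns (sample, time, group) for time-course designs. It does not check the match against the expression matrix.

```python
import csv
import io


def validate_design(text):
    """Return a list of format-rule violations found in design file text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if len(rows) < 2:
        return ["expected a header line followed by sample rows"]
    header, data = rows[0], rows[1:]
    # Assumption: 2 columns = sample/group, 3 columns = sample/time/group.
    if len(header) not in (2, 3):
        return [f"expected 2 or 3 tab-separated columns, found {len(header)}"]
    has_time = len(header) == 3

    errors = []
    seen_samples = set()
    finished_groups = set()  # groups whose block of rows has already ended
    current_group = None
    current_time = None
    for line_no, row in enumerate(data, start=2):
        if row and row[0].startswith("#"):
            errors.append(f"line {line_no}: comment lines are not allowed")
            continue
        if len(row) != len(header):
            errors.append(f"line {line_no}: expected {len(header)} columns")
            continue
        sample, group = row[0], row[-1]
        if sample in seen_samples:
            errors.append(f"line {line_no}: duplicate sample name {sample!r}")
        seen_samples.add(sample)
        if group != current_group:
            if group in finished_groups:
                errors.append(f"line {line_no}: group {group!r} is not grouped together")
            if current_group is not None:
                finished_groups.add(current_group)
            current_group, current_time = group, None
        if has_time:
            try:
                time = float(row[1])
            except ValueError:
                errors.append(f"line {line_no}: time {row[1]!r} is not a plain number")
                continue
            # A decrease means slots are out of order or not grouped together.
            if current_time is not None and time < current_time:
                errors.append(f"line {line_no}: time slots are not in chronological order")
            current_time = time
    return errors


design = "sample\ttime\tgroup\nCASE1\t0\tCASE\nCASE2\t0h\tCASE\nCASE1\t3\tCASE\n"
for problem in validate_design(design):
    print(problem)
```

An empty list means no violation of these particular rules was found; it does not guarantee that the application will accept the file.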

Case-Control Design File

The case-control design file must contain two experimental groups. Each group must contain at least two replicates. Sample design file:

sample	group
CASE1	CASE
CASE2	CASE
CONTROL1	CONTROL
CONTROL2	CONTROL

Single Series Time-Course Design File

The single series time-course design file must contain a single experimental group. The group must contain at least two time slots with a minimum of two replicates per time slot. Sample design file:

sample	time	group
CASE1	0	CASE
CASE2	0	CASE
CASE3	3	CASE
CASE4	3	CASE

Multiple Series Time-Course Design File

The multiple series time-course design file must contain at least two experimental groups. Each group must contain at least two time slots with a minimum of two replicates per time slot. Sample design file:

sample	time	group
CASE1	0	CASE
CASE2	0	CASE
CASE3	3	CASE
CASE4	3	CASE
CONTROL1	0	CONTROL
CONTROL2	0	CONTROL
CONTROL3	3	CONTROL
CONTROL4	3	CONTROL
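
Because the same expression matrix can be reused with a different design file, a reduced design can be derived from a larger one. As a minimal sketch, the hypothetical Python helper below builds a single series design from the multiple series sample above by keeping the rows of one group. It assumes the group is the last column, as in the samples in this document.

```python
def single_series_from_multiple(design_text, group):
    """Keep the header and the rows of one experimental group."""
    lines = design_text.strip().split("\n")
    header, rows = lines[0], lines[1:]
    kept = [row for row in rows if row.split("\t")[-1] == group]
    if not kept:
        raise ValueError(f"group {group!r} not found in design file")
    return "\n".join([header] + kept) + "\n"


multiple = (
    "sample\ttime\tgroup\n"
    "CASE1\t0\tCASE\nCASE2\t0\tCASE\nCASE3\t3\tCASE\nCASE4\t3\tCASE\n"
    "CONTROL1\t0\tCONTROL\nCONTROL2\t0\tCONTROL\n"
    "CONTROL3\t3\tCONTROL\nCONTROL4\t3\tCONTROL\n"
)
print(single_series_from_multiple(multiple, "CASE"))
```

The row order is preserved, so the grouping and chronological ordering of the original file carry over to the reduced one.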

Expression Matrix File Format

The expression matrix file must contain raw expression counts for one or more experimental groups. Each group may have one or more time slots with each time slot having at least two replicates. The following format rules apply:

The data must be in Tab Separated Values (TSV) format and must contain a single line header
Each row is identified by a unique transcript ID, which must match one of the transcripts in the annotation file; otherwise the row is discarded
Sample column names must be unique
Sample column names are case-sensitive and must match the experiment design file
The columns do not need to be in any specific order - the experiment design file will provide grouping information

Expression matrix file partial contents sample:

	NPC1	NPC2	OLD1	OLD2
Transcript.1	7275	3602	3707	3485
Transcript.2	358.64	206.58	2056.72	2094.65
Transcript.3	332.44	329.38	1529.46	1318.57
Transcript.4	46.92	13.03	20.82	33.78
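
Two of the rules above, unique identifiers and sample names that match the design file, are easy to check ahead of time. The Python sketch below is one way to do so; the function name `check_matrix` and the report keys are hypothetical. It assumes the header layout of the sample above, where the first header cell (over the transcript ID column) is empty. Matrix columns that the design file does not mention are not reported, since a design file may leave samples out.

```python
import csv
import io
from collections import Counter


def check_matrix(matrix_text, design_text):
    """Report duplicate names and design samples missing from the matrix."""
    matrix = [r for r in csv.reader(io.StringIO(matrix_text), delimiter="\t") if r]
    design = [r for r in csv.reader(io.StringIO(design_text), delimiter="\t") if r]
    matrix_samples = matrix[0][1:]  # skip the cell over the transcript IDs
    design_samples = [row[0] for row in design[1:]]
    transcript_ids = [row[0] for row in matrix[1:]]

    def duplicates(names):
        return sorted(name for name, count in Counter(names).items() if count > 1)

    return {
        "duplicate_columns": duplicates(matrix_samples),
        "duplicate_transcripts": duplicates(transcript_ids),
        # Comparison is case-sensitive, as the format rules require.
        "missing_from_matrix": [s for s in design_samples if s not in matrix_samples],
    }


matrix_text = (
    "\tNPC1\tNPC2\tOLD1\tOLD2\n"
    "T.1\t10\t12\t30\t28\n"
    "T.2\t5\t6\t7\t8\n"
    "T.2\t9\t10\t11\t12\n"
)
design_text = "sample\tgroup\nNPC1\tNPC\nNPC2\tNPC\nOLD1\tOLD\nold2\tOLD\n"
print(check_matrix(matrix_text, design_text))
```

In this made-up input, `T.2` appears twice and the design file's `old2` does not match the matrix column `OLD2` because of the case difference.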

Annotation Features File Format

The annotation file must follow the basic Generic Feature Format 3 (GFF3). However, it has been slightly modified to suit the application: the "score" and "phase" columns are not used and some of the attributes may not fully abide by the formal specifications. The file consists of a set of annotation features for each transcript. Each set of features is divided into sections as follows:

Transcript 1
Transcript Level Feature Annotations – basic transcript information, UTR motifs, microRNAs, etc.
Genomic Level Feature Annotations – exons, splice junctions, etc.
Protein Level Feature Annotations – gene ontology features, domains, phosphorylation sites, etc.
Transcript 2
…
Transcript 3
…

Some of the annotation features must be named as expected by the application. The required names are listed here (see also the sample annotation file contents):

Source	Feature	Description
tappAS	transcript	Start of transcript features
tappAS	gene	Gene information
tappAS	CDS	CDS information
tappAS	genomic	Start of genomic features
tappAS	exon	Exon
tappAS	splice_junction	Splice junction
tappAS	protein	Start of protein features

In addition, the following attributes must be named as required by the application (see also the sample annotation file contents):

Attribute	Description
ID	Feature ID
Name	Feature name
Desc	Feature description
Chr	Feature chromosome

Annotation file partial contents sample (header should not be included):

SeqName	Source	Feature	Start	End	Score	Strand	Phase	Attributes
PB.3189.4	tappAS	transcript	1	1399	.	+	.	ID=XM_006524897.1; primary_class=full_splice_match; PosType=T
PB.3189.4	tappAS	gene	1	1399	.	+	.	ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase (glutaminyl cyclase); PosType=T
PB.3189.4	tappAS	CDS	10	951	.	+	.	ID=XP_006524960.1; PosType=T
PB.3189.4	UTRsite	3'UTRmotif	1288	1295	.	+	.	ID=U0023; Name=K-BOX; Desc=K-Box; PosType=T
PB.3189.4	UTRsite	PAS	1380	1399	.	+	.	ID=U0043; Name=PAS; Desc=Polyadenylation Signal; PosType=T
PB.3189.4	mirWalk	miRNA	986	993	.	+	.	ID=mmu-miR-495-5p; Name=mmu-miR-495-5p; Desc=UTR3; PosType=T
PB.3189.4	tappAS	genomic	1	1	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79052257	79052388	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79070673	79070951	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79077482	79077658	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79079467	79079566	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79081747	79081863	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	exon	79089623	79090216	.	+	.	Chr=chr17; PosType=G
PB.3189.4	tappAS	splice_junction	79052388	79070673	.	+	.	ID=known_canonical; Chr=chr17; PosType=G
PB.3189.4	tappAS	splice_junction	79070951	79077482	.	+	.	ID=known_canonical; Chr=chr17; PosType=G
PB.3189.4	tappAS	splice_junction	79077658	79079467	.	+	.	ID=known_canonical; Chr=chr17; PosType=G
...	...	...	...	...	...	...	...	...
PB.3189.4	tappAS	protein	1	313	.	+	.	ID=NP_001303658.1; PosType=P
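
For readers who want to inspect an annotation file programmatically, the following Python sketch parses one line of the format shown above. It is a minimal reader written for illustration, not the application's parser; the function name `parse_annotation_line` and the returned dictionary keys are hypothetical. It assumes nine tab-separated columns and attributes separated by semicolons, so an attribute value that itself contains a semicolon would be split incorrectly.

```python
def parse_annotation_line(line):
    """Split one annotation line into its columns and attribute pairs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seq_name, source, feature, start, end, _score, strand, _phase, attr_text = fields
    attributes = {}
    for part in attr_text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        attributes[key] = value
    return {
        "transcript": seq_name,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,  # the score and phase columns are not used
    }


line = (
    "PB.3189.4\ttappAS\tgene\t1\t1399\t.\t+\t.\t"
    "ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase (glutaminyl cyclase); PosType=T"
)
record = parse_annotation_line(line)
print(record["feature"], record["attributes"]["Name"], record["attributes"]["PosType"])
```

The `PosType` attribute in the sample takes the values T, G, and P, which line up with the transcript, genomic, and protein sections of each transcript's feature set.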

Generating an annotation file is not a trivial task, and it is not recommended unless you have a good programming background and knowledge of annotation features. If possible, use one of the annotation files provided by the application. If no annotation file is provided for the species you are interested in, you may contact us.