AniProtDB | Division of Intramural Research

The SRA Toolkit is a collection of tools and libraries for downloading and processing data from the Sequence Read Archive (SRA).

Website:

https://ncbi.github.io/sra-tools/

Example Usage:

fasterq-dump --split-3 SRR5311041

Extra:

Append /1 and /2 to the read header names so that the sequences can be run in Trinity:

awk '{ if (NR%4==1 || NR%4==3) { print $1"/1" } else { print } }' SRR5311041_1.fastq > temp_SRR5311041_1.fastq

awk '{ if (NR%4==1 || NR%4==3) { print $1"/2" } else { print } }' SRR5311041_2.fastq > temp_SRR5311041_2.fastq

mv temp_SRR5311041_1.fastq SRR5311041_1.fastq
mv temp_SRR5311041_2.fastq SRR5311041_2.fastq
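
A quick sanity check after this step can catch a truncated download or a header that was not rewritten. The sketch below is an assumption for illustration and not part of the published pipeline: the function names count_reads and headers_end_with are hypothetical, and both assume strict four-line FASTQ records.

```shell
# Hypothetical helpers, not part of the SRA Toolkit or Trinity.
# Both assume strict four-line FASTQ records.

# Print the number of reads in a FASTQ file (total lines divided by four).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Succeed only if the first field of every header line ends with the suffix.
# Usage: headers_end_with FILE SUFFIX
headers_end_with() {
    awk -v suffix="$2" '
        NR % 4 == 1 && substr($1, length($1) - length(suffix) + 1) != suffix { bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, count_reads SRR5311041_1.fastq and count_reads SRR5311041_2.fastq should print the same number, and headers_end_with SRR5311041_1.fastq /1 should return success.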

FastQC is a program designed to spot potential problems in high-throughput sequencing datasets. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report that summarizes the results.

Website:

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Example Usage:

mkdir fastqc
fastqc -o fastqc/ SRR5311041_1.fastq SRR5311041_2.fastq

Trim Galore! is a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files.

Website:

https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

Example Usage:

trim_galore --paired -q 30 --stringency 5 --length 70 SRR5311041_1.fastq SRR5311041_2.fastq

FastQC is run a second time, here on the trimmed reads produced by Trim Galore!, to confirm that quality and adapter trimming resolved the problems flagged in the first round.

Website:

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Example Usage:

mkdir fastqc_2ndRound
fastqc -o fastqc_2ndRound/ SRR5311041_1_val_1.fq SRR5311041_2_val_2.fq

Rcorrector implements a k-mer based method to correct random sequencing errors in Illumina RNA-seq reads.

Website:

https://github.com/mourisl/Rcorrector/

Example Usage:

run_rcorrector.pl -1 SRR5311041_1_val_1.fq -2 SRR5311041_2_val_2.fq -t 16 -k 31

Trinity is a state-of-the-art method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.

Website:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/

Example Usage:

Trinity --seqType fq --max_memory 240G --left SRR5311041_1_val_1.cor.fq --right SRR5311041_2_val_2.cor.fq --CPU 12 --min_kmer_cov=2 --SS_lib_type FR --output trinity_Hhongkongensis

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity or constructed from RNA-Seq alignments to the genome using TopHat and Cufflinks.

Website:

https://github.com/TransDecoder/TransDecoder/wiki/

Example Usage:

TransDecoder.LongOrfs -t Trinity.fasta
TransDecoder.Predict -t Trinity.fasta

Ensuring that a proteome is non-redundant and does not contain more than one copy of a protein will give the user a more accurate BUSCO completeness score.

Script:

Example Usage:

mkdir proteome
cd proteome

Before running the Perl script, convert the proteome file from multi-line FASTA to single-line FASTA:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < Trinity.fasta.transdecoder.pep | sed "1d" > Trinity.fasta.transdecoder_singleLine.fasta
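
One way to confirm that the conversion worked is to check that header lines and sequence lines strictly alternate. The sketch below is an assumption for illustration; the function name is_single_line_fasta is hypothetical and is not part of the pipeline scripts.

```shell
# Hypothetical helper: succeed only if the file is non-empty and strictly
# alternates between one header line (starting with ">") and one sequence line.
is_single_line_fasta() {
    awk '
        NR % 2 == 1 && !/^>/ { bad = 1 }   # odd lines must be headers
        NR % 2 == 0 &&  /^>/ { bad = 1 }   # even lines must be sequence
        END { exit (bad || NR == 0 || NR % 2) ? 1 : 0 }
    ' "$1"
}
```

For example, is_single_line_fasta Trinity.fasta.transdecoder_singleLine.fasta && echo "single-line FASTA" prints the message only when every record occupies exactly two lines.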

perl filter_redundancy.pl Hhongkongensis
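
To illustrate what redundancy removal means in its simplest form, the sketch below drops every record whose sequence is an exact duplicate of one already seen and keeps the first header. It is an assumption for illustration only and is not the logic of filter_redundancy.pl, which may apply different criteria. The function name drop_exact_duplicates is hypothetical, and the input is expected to be single-line FASTA.

```shell
# Hypothetical helper, NOT the logic of filter_redundancy.pl.
# Input: single-line FASTA (one header line, then one sequence line).
# Output: the same records, minus any whose sequence was already printed.
drop_exact_duplicates() {
    awk '
        /^>/ { header = $0; next }               # remember the current header
        !seen[$0]++ { print header; print $0 }   # print only first occurrence
    ' "$1"
}
```

For example, drop_exact_duplicates Trinity.fasta.transdecoder_singleLine.fasta | grep -c '^>' reports how many distinct sequences remain.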

BUSCO completeness assessments employ sets of Benchmarking Universal Single-Copy Orthologs from OrthoDB to provide quantitative measures of the completeness of genome assemblies, annotated gene sets, and transcriptomes in terms of expected gene content.

Website:

https://gitlab.com/ezlab/busco/

Example Usage:

busco -i Hhongkongensis_proteins.fasta -c 8 -o Hhongkongensis -l metazoa_odb10 -m proteins

InterProScan is the software package that allows sequences (protein and nucleic acid) to be scanned against InterPro's signatures. Signatures are predictive models, provided by several different databases, that make up the InterPro consortium.

Online Tool Website:

https://www.ebi.ac.uk/interpro/search/sequence/

Website:

https://github.com/ebi-pf-team/interproscan/wiki/HowToRun/

Example Usage:

./interproscan.sh -appl CDD -t p -f tsv -pa -goterms -iprlookup -i Hhongkongensis_proteins.fasta -o interproscan_Hhongkongensis_CDD_results.txt

The script parse_interproscan.pl can be used to retrieve proteins with the domain of interest.

Script:

Example Usage:

perl parse_interproscan.pl -d cd00934 -s "Hoilungia hongkongensis" -f Hhongkongensis_proteins.fasta -i interproscan_Hhongkongensis_CDD_results.txt

Note:

Users can choose the CDD or Pfam domains of their preference, and search the proteome of one species ("Hsapiens" for Homo sapiens, "Nnomurai" for Nemopilema nomurai, …).

Required Files:

Proteome file and InterProScan scan results file.
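
For a quick look at the results without the Perl script, the InterProScan TSV file can also be filtered directly. The sketch below is an assumption and not the logic of parse_interproscan.pl; it relies on the InterProScan TSV layout, in which column 1 holds the protein accession and column 5 holds the signature accession. The function name proteins_with_domain is hypothetical.

```shell
# Hypothetical helper, NOT the logic of parse_interproscan.pl.
# Usage: proteins_with_domain DOMAIN_ACCESSION INTERPROSCAN_TSV
# Prints the unique protein accessions (column 1) that have at least one
# match whose signature accession (column 5) equals DOMAIN_ACCESSION.
proteins_with_domain() {
    awk -F '\t' -v domain="$1" '$5 == domain { print $1 }' "$2" | sort -u
}
```

For example, proteins_with_domain cd00934 interproscan_Hhongkongensis_CDD_results.txt lists the proteins carrying the CDD domain used in the example above.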

Pipeline Overview

Metazoan Proteome Pipeline
