Discussion of file formats.
Standards to use, libraries, best practices.

Bioinformatics standard: GFF file


GFF: Generic Feature Format Version 3

Specifications

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Samples

S1

##gff-version 3.2.1
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002

S2

ctg123 . cDNA_match 1050 9000 6.2e-45 + . ID=match00001;Target=cdna0123 12 2964;Gap=M451 D3499 M501 D1499 M2001

Recommended library

?

Comments

Element length

In S1, what is the length of element mRNA00001? Is base 9000 part of the element?
The specification is not explicit on this point, but other samples found in the specification imply the following:
L = end - start + 1, so the base at position 'end' is part of the element.

Target length = target end - target start + 1
Ref length = ref end - ref start + 1
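
The length formula above can be sketched in Python as follows. The function name feature_length is hypothetical and used only for illustration. It assumes that start and end are 1-based and that both are part of the element, which is the reading argued for in this note.

```python
def feature_length(start: int, end: int) -> int:
    """Length of an element whose start and end bases are both included.

    Assumption: coordinates are 1-based and start <= end.
    """
    if start > end:
        raise ValueError("start must be less than or equal to end")
    return end - start + 1


# mRNA00001 in S1 spans bases 1050 to 9000.
print(feature_length(1050, 9000))
```

The same function applies to the Target coordinates of S2 (12 and 2964), since the target length uses the same formula.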

Argument

In S2, the target length is the sum of the M values of the Gap attribute: 451 + 501 + 2001 = 2953 bases = 2964 - 12 + 1

In S2, the reference length is the sum of the M and D values of the Gap attribute: 451 + 3499 + 501 + 1499 + 2001 = 7951 bases = 9000 - 1050 + 1
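
This check can be reproduced with a short Python sketch. The function name gap_lengths is hypothetical. The handling of M and D follows the argument above. The handling of I (counted on the target side only) is an assumption based on how the GFF3 specification describes that operation, and the F and R frameshift operations are deliberately ignored.

```python
import re
from typing import Tuple


def gap_lengths(gap: str) -> Tuple[int, int]:
    """Return (reference_length, target_length) for a Gap attribute value.

    Assumption: only M, I and D operations are counted; F and R are skipped.
    """
    ref_len = 0
    target_len = 0
    for op, count in re.findall(r"([MID])(\d+)", gap):
        n = int(count)
        if op in "MD":  # match and deletion both cover reference bases
            ref_len += n
        if op in "MI":  # match and insertion both cover target bases
            target_len += n
    return ref_len, target_len


# Gap attribute of match00001 in S2.
ref_len, target_len = gap_lengths("M451 D3499 M501 D1499 M2001")
print(ref_len, target_len)
```

For S2, the two values can then be compared with 9000 - 1050 + 1 and 2964 - 12 + 1 respectively.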