
Adding OBO Flat File Format to Taxonomy Import Export


The parser needs to follow the instructions here:
http://geneontology.org/GO.format.obo-1_2.shtml

To import a file of the kind available here:
http://geneontology.org/GO.downloads.ontology.shtml

Taxonomy Import/Export is the slightly misnamed taxonomy_xml module in Drupal's CVS: http://drupal.org/project/taxonomy_xml

It works through include files, each of which adds the ability to import and export a different format.

taxonomy_xml invokes a "taxonomy_term_presave" hook, but nothing implements this hook: not Drupal core, and not any module in our SCF distribution, including taxonomy_enhancer and taxonomy_xml itself.

<?php
    // Here's where last-minute data storage done by other modules gets set up
    module_invoke_all('taxonomy_term_presave', $term);
?>
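
For illustration, here is a minimal sketch of what an implementation of that hook could look like. The module name "example" and the field name field_alt_id are made up, and I am assuming $term arrives as an object carrying the OBO tags (as in the parser output further down); nothing like this exists in the distribution.

```php
<?php
/**
 * Hypothetical implementation of the taxonomy_term_presave hook.
 *
 * "example" stands in for a real module name. Assumes $term is an object
 * and that alt_id, when present, holds one or more alternate OBO IDs.
 */
function example_taxonomy_term_presave($term) {
  if (empty($term->alt_id)) {
    // Nothing to store for terms without alternate IDs.
    return;
  }
  if (!isset($term->fields)) {
    $term->fields = array();
  }
  // field_alt_id is an invented field name, in the "fields" layout
  // that taxonomy_enhancer expects (one 'value' entry per item).
  $term->fields['field_alt_id'] = array();
  foreach ((array) $term->alt_id as $alt_id) {
    $term->fields['field_alt_id'][] = array('value' => $alt_id);
  }
}
```

Because objects are passed by handle, the change is visible to the caller even though module_invoke_all() does not pass arguments by reference.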

Here's what my drupal_set_message for the raw data looks like; it has simply converted the file to a string. (Other messages also shown.)

* Loaded file sample-OBO_gene_ontology.obo. Now processing it.
*

'format-version: 1.2
date: 31:10:2008 15:18
saved-by: dph
auto-generated-by: OBO-Edit 1.101
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_goa "GOA and proteome slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: gosubset_prok "Prokaryotic GO subset"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: $Revision: 5.876 $

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism." [GOC:go_curators, GOC:isa_complete, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: gosubset_prok
synonym: "reproductive physiological process" EXACT []
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
is_obsolete: true
consider: GO:0042254
consider: GO:0051082

[Term]
id: GO:0080025
name: phosphatidylinositol-3,5-bisphosphate binding
namespace: molecular_function
def: "Interacting selectively with phosphatidylinositol-3,5-bisphosphate, a diphosphorylated derivative of phosphatidylinositol." [PMID:18397324]
synonym: "PtdIns(3,5)P2 binding" EXACT []
is_a: GO:0035091 ! phosphoinositide binding

[Term]
id: GO:0080026
name: response to indolebutyric acid stimulus
namespace: biological_process
def: "A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of an indolebutyric acid stimulus." [PMID:18725356]
synonym: "response to IBA stimulus" EXACT []
synonym: "response to indole-3-butyric acid stimulus" NARROW []
is_a: GO:0009733 ! response to auxin stimulus

[Typedef]
id: negatively_regulates
name: negatively_regulates
is_a: regulates ! regulates

[Typedef]
id: part_of
name: part_of
xref: OBO_REL:part_of
is_transitive: true

[Typedef]
id: positively_regulates
name: positively_regulates
is_a: regulates ! regulates

[Typedef]
id: regulates
name: regulates
transitive_over: part_of

'

error
Failed to import any new terms. This may be due to syntax or formattings errors in the import file.

Initial Working OBO Import

Success, sample file, with test output.

Sources of test output:

<?php
/**
* Add term to the array of terms we are building and reset the term.
*/
function taxonomy_xml_obo_add_term(&$terms, &$term) {
  $terms[] = $term;
  // Debug output: dump each term as it is collected.
  drupal_set_message('<pre>' . var_export($term, TRUE) . '</pre>');
  $term = array();
  return;
}
?>

<?php
function taxonomy_xml_obo_parse(&$data, $vid = 0, $url = '') {
  #drupal_set_message(t("Importing from provided OBO data file %url.", array('%url' => $url)));
/*
  if ($vid == 0) {
    // We've been asked to use the vocab described in the source file.
    // OBO files can define different vocabs for different terms
    // but I don't think the parser can handle that
    drupal_set_message(t("No vocabulary specified in the form, using a default one."));
    // Create a placeholder, use that
    $vocabulary = _taxonomy_xml_get_vocabulary_placeholder('Taxa');
    $vid = $vocabulary->vid;
  }
  else {
    // Else using a form-selected vocab.
    $vocabulary = taxonomy_vocabulary_load($vid);
  }
*/ 

  // BEGIN the first loop, finding terms in this document

  // Remembering all terms is memory-intensive, but may be more efficient in batch jobs.
  // Use a static list where possible. EXPERIMENTAL
  $terms =& taxonomy_xml_current_terms();
 
  #dpm(array("About to start analyzing a data doc $url, known terms are: " => $terms));
 
  $lines = taxonomy_xml_obo_prepare_data($data);

  $term_line = 0;
  foreach($lines as $i => $line) {
    if (trim($line) == '[Term]') {
      if ($term_line > 1) {
        // this check covers us if an OBO file is missing a blank line
        // this was never reset to zero, and we've had at least one line
        // we should try to 'save' the term before going farther
        taxonomy_xml_obo_add_term($terms, $term);
      }
      $term_line = 1;
    }
    elseif ($term_line > 1 && trim($line) == '') {
      taxonomy_xml_obo_add_term($terms, $term);
      $term_line = 0;
    }
    elseif ($term_line) {
      taxonomy_xml_obo_add_pair($line, $term);
      $term_line++;
    }
    unset($lines[$i]);
  }

  foreach ($terms as &$term) {
    // Skip duplicates (some dupes may exist due to the use of handles)
//    if ($term->taxonomy_xml_presaved) continue;
    $term = (object)$term;
   
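    // The parser stores tag values as strings, so compare against 'true'
    // explicitly; a loose != TRUE would also treat 'false' as obsolete.
    if (!isset($term->is_obsolete) || $term->is_obsolete !== 'true') {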
 
      $term->vid = taxonomy_xml_obo_get_vid($term->namespace);
 
      // Translate the predicate statements into the syntax we need
      taxonomy_xml_canonicize_predicates($term);
      // ^^ does nothing useful?
 
      $term->description = $term->def;
      unset($term->def);
     
      // taxonomy_enhancer likes everything in a "fields" array.
      $term->fields = array();
      // Furthermore, everything can have multiple values (per data structure).
      $term->fields['field_uri'] = array(
          0 => array(
          'value' => $term->id,
          ),
      );
      // This is ugly and I don't like it: extreme, unnecessary complexity.
      // Honestly, there is no good reason we can't just use $term->field_whatever;
      // this is taxonomy, not CCK, so single values are fine!
     
      // Data is now massaged and referring to itself correctly,
      // Start creating terms so we can retrieve term ids
 
      // Ensure name is valid   
      if (! $term->name) {
        drupal_set_message(t("Problem, we were unable to find a specific label for the term referred to as @id.", array('@id' => $term->id)));
        $term->name = $term->id;
      }

      // Debug output: dump the massaged term object.
      drupal_set_message('<pre>' . var_export($term, TRUE) . '</pre>');
// ...
// lots of other stuff here
// ...
    }
  } // end term-construction loop;
 
  #dpm(array('created a bunch of terms, now they need relations set.' => $terms));
 
  taxonomy_xml_set_term_relations($terms);

  #dpm(array('After re-linking, we now have all terms set' => $terms));

  return $terms;
 
}
?>
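
The two helpers the loop calls, taxonomy_xml_obo_prepare_data() and taxonomy_xml_obo_add_pair(), are not shown above. Below is a rough sketch of what they might do, under hypothetical example_ names. The list of always-multiple tags is a guess read off the test output that follows (where xref and relationship come out as plain strings), not taken from the real code.

```php
<?php
/**
 * Split raw OBO text into lines (stand-in for the prepare_data helper).
 */
function example_obo_prepare_data($data) {
  // Accept Unix, Windows and old Mac line endings.
  return preg_split('/\r\n|\r|\n/', $data);
}

/**
 * Store one "tag: value" line on the term array (stand-in for add_pair).
 */
function example_obo_add_pair($line, &$term) {
  // Assumption: these tags may repeat, so they are always kept as arrays.
  static $multiple = array('alt_id', 'subset', 'synonym', 'is_a', 'consider');

  // Split on the first colon only; values such as "GO:0000001" contain
  // colons of their own.
  $parts = explode(':', $line, 2);
  if (count($parts) < 2) {
    // Not a tag-value pair; ignore the line.
    return;
  }
  $tag = trim($parts[0]);
  $value = trim($parts[1]);

  if (in_array($tag, $multiple)) {
    $term[$tag][] = $value;
  }
  else {
    $term[$tag] = $value;
  }
}

// Usage: feed two lines of a stanza into an empty term array.
$example_term = array();
$example_lines = example_obo_prepare_data("id: GO:0000001\nis_a: GO:0048308 ! organelle inheritance");
foreach ($example_lines as $example_line) {
  example_obo_add_pair($example_line, $example_term);
}
```

Note that a single-valued tag that does repeat in a file would be silently overwritten by this sketch, so the tag list needs checking against the OBO spec before real use.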

    *  Loaded file sample-OBO_gene_ontology.obo. Now processing it.
    * 118 rows of data
    *

      array (
        'id' => 'GO:0000001',
        'name' => 'mitochondrion inheritance',
        'namespace' => 'biological_process',
        'def' => '"The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]',
        'synonym' =>
        array (
          0 => '"mitochondrial inheritance" EXACT []',
        ),
        'is_a' =>
        array (
          0 => 'GO:0048308 ! organelle inheritance',
          1 => 'GO:0048311 ! mitochondrion distribution',
        ),
      )

    *

      array (
        'id' => 'GO:0000002',
        'name' => 'mitochondrial genome maintenance',
        'namespace' => 'biological_process',
        'def' => '"The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]',
        'is_a' =>
        array (
          0 => 'GO:0007005 ! mitochondrion organization',
        ),
      )

    *

      array (
        'id' => 'GO:0000003',
        'name' => 'reproduction',
        'namespace' => 'biological_process',
        'alt_id' =>
        array (
          0 => 'GO:0019952',
          1 => 'GO:0050876',
        ),
        'def' => '"The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism." [GOC:go_curators, GOC:isa_complete, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]',
        'subset' =>
        array (
          0 => 'goslim_generic',
          1 => 'goslim_pir',
          2 => 'goslim_plant',
          3 => 'gosubset_prok',
        ),
        'synonym' =>
        array (
          0 => '"reproductive physiological process" EXACT []',
        ),
        'is_a' =>
        array (
          0 => 'GO:0008150 ! biological_process',
        ),
      )

    *

      array (
        'id' => 'GO:0000005',
        'name' => 'ribosomal chaperone activity',
        'namespace' => 'molecular_function',
        'def' => '"OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]',
        'comment' => 'This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.',
        'is_obsolete' => 'true',
        'consider' =>
        array (
          0 => 'GO:0042254',
          1 => 'GO:0051082',
        ),
      )

    *

      array (
        'id' => 'GO:0033959',
        'name' => 'deoxyribodipyrimidine endonucleosidase activity',
        'namespace' => 'molecular_function',
        'def' => '"Catalysis of the reaction: cleaves the N-glycosidic bond between the 5\'-pyrimidine residue in cyclobutadipyrimidine (in DNA) and the corresponding deoxy-D-ribose residue." [EC:3.2.2.17]',
        'synonym' =>
        array (
          0 => '"deoxy-D-ribocyclobutadipyrimidine polynucleotidodeoxyribohydrolase activity" EXACT [EC:3.2.2.17]',
          1 => '"deoxyribonucleate pyrimidine dimer glycosidase activity" EXACT [EC:3.2.2.17]',
          2 => '"endonuclease V activity" BROAD [EC:3.2.2.17]',
          3 => '"PD-DNA glycosylase activity" EXACT [EC:3.2.2.17]',
          4 => '"pyrimidine dimer DNA glycosylase activity" EXACT [EC:3.2.2.17]',
          5 => '"pyrimidine dimer DNA-glycosylase activity" EXACT [EC:3.2.2.17]',
          6 => '"T4-induced UV endonuclease activity" EXACT [EC:3.2.2.17]',
        ),
        'xref' => 'EC:3.2.2.17',
        'is_a' =>
        array (
          0 => 'GO:0016799 ! hydrolase activity, hydrolyzing N-glycosyl compounds',
        ),
      )

    *

      array (
        'id' => 'GO:0033985',
        'name' => 'acidocalcisome lumen',
        'namespace' => 'cellular_component',
        'def' => '"The volume enclosed by the membranes of an acidocalcisome." [GOC:mah]',
        'is_a' =>
        array (
          0 => 'GO:0044444 ! cytoplasmic part',
          1 => 'GO:0070013 ! intracellular organelle lumen',
        ),
        'relationship' => 'part_of GO:0020022 ! acidocalcisome',
      )

    *

      array (
        'id' => 'GO:0080025',
        'name' => 'phosphatidylinositol-3,5-bisphosphate binding',
        'namespace' => 'molecular_function',
        'def' => '"Interacting selectively with phosphatidylinositol-3,5-bisphosphate, a diphosphorylated derivative of phosphatidylinositol." [PMID:18397324]',
        'synonym' =>
        array (
          0 => '"PtdIns(3,5)P2 binding" EXACT []',
        ),
        'is_a' =>
        array (
          0 => 'GO:0035091 ! phosphoinositide binding',
        ),
      )

    *

      array (
        'id' => 'GO:0080026',
        'name' => 'response to indolebutyric acid stimulus',
        'namespace' => 'biological_process',
        'def' => '"A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of an indolebutyric acid stimulus." [PMID:18725356]',
        'synonym' =>
        array (
          0 => '"response to IBA stimulus" EXACT []',
          1 => '"response to indole-3-butyric acid stimulus" NARROW []',
        ),
        'is_a' =>
        array (
          0 => 'GO:0009733 ! response to auxin stimulus',
        ),
      )

    * Namespace biological_process matches vocabulary ID number 7.
    *

      stdClass::__set_state(array(
         'id' => 'GO:0000001',
         'name' => 'mitochondrion inheritance',
         'namespace' => 'biological_process',
         'synonym' =>
        array (
          0 => '"mitochondrial inheritance" EXACT []',
        ),
         'is_a' =>
        array (
          0 => 'GO:0048308 ! organelle inheritance',
          1 => 'GO:0048311 ! mitochondrion distribution',
        ),
         'vid' => '7',
         'description' => '"The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0000001',
            ),
          ),
        ),
      ))

    *

      stdClass::__set_state(array(
         'id' => 'GO:0000002',
         'name' => 'mitochondrial genome maintenance',
         'namespace' => 'biological_process',
         'is_a' =>
        array (
          0 => 'GO:0007005 ! mitochondrion organization',
        ),
         'vid' => '7',
         'description' => '"The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0000002',
            ),
          ),
        ),
      ))

    *

      stdClass::__set_state(array(
         'id' => 'GO:0000003',
         'name' => 'reproduction',
         'namespace' => 'biological_process',
         'alt_id' =>
        array (
          0 => 'GO:0019952',
          1 => 'GO:0050876',
        ),
         'subset' =>
        array (
          0 => 'goslim_generic',
          1 => 'goslim_pir',
          2 => 'goslim_plant',
          3 => 'gosubset_prok',
        ),
         'synonym' =>
        array (
          0 => '"reproductive physiological process" EXACT []',
        ),
         'is_a' =>
        array (
          0 => 'GO:0008150 ! biological_process',
        ),
         'vid' => '7',
         'description' => '"The production by an organism of new individuals that contain some portion of their genetic material inherited from that organism." [GOC:go_curators, GOC:isa_complete, ISBN:0198506732 "Oxford Dictionary of Biochemistry and Molecular Biology"]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0000003',
            ),
          ),
        ),
      ))

    * Namespace molecular_function matches vocabulary ID number 9.
    *

      stdClass::__set_state(array(
         'id' => 'GO:0033959',
         'name' => 'deoxyribodipyrimidine endonucleosidase activity',
         'namespace' => 'molecular_function',
         'synonym' =>
        array (
          0 => '"deoxy-D-ribocyclobutadipyrimidine polynucleotidodeoxyribohydrolase activity" EXACT [EC:3.2.2.17]',
          1 => '"deoxyribonucleate pyrimidine dimer glycosidase activity" EXACT [EC:3.2.2.17]',
          2 => '"endonuclease V activity" BROAD [EC:3.2.2.17]',
          3 => '"PD-DNA glycosylase activity" EXACT [EC:3.2.2.17]',
          4 => '"pyrimidine dimer DNA glycosylase activity" EXACT [EC:3.2.2.17]',
          5 => '"pyrimidine dimer DNA-glycosylase activity" EXACT [EC:3.2.2.17]',
          6 => '"T4-induced UV endonuclease activity" EXACT [EC:3.2.2.17]',
        ),
         'xref' => 'EC:3.2.2.17',
         'is_a' =>
        array (
          0 => 'GO:0016799 ! hydrolase activity, hydrolyzing N-glycosyl compounds',
        ),
         'vid' => '9',
         'description' => '"Catalysis of the reaction: cleaves the N-glycosidic bond between the 5\'-pyrimidine residue in cyclobutadipyrimidine (in DNA) and the corresponding deoxy-D-ribose residue." [EC:3.2.2.17]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0033959',
            ),
          ),
        ),
      ))

    * Namespace cellular_component matches vocabulary ID number 8.
    *

      stdClass::__set_state(array(
         'id' => 'GO:0033985',
         'name' => 'acidocalcisome lumen',
         'namespace' => 'cellular_component',
         'is_a' =>
        array (
          0 => 'GO:0044444 ! cytoplasmic part',
          1 => 'GO:0070013 ! intracellular organelle lumen',
        ),
         'relationship' => 'part_of GO:0020022 ! acidocalcisome',
         'vid' => '8',
         'description' => '"The volume enclosed by the membranes of an acidocalcisome." [GOC:mah]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0033985',
            ),
          ),
        ),
      ))

    *

      stdClass::__set_state(array(
         'id' => 'GO:0080025',
         'name' => 'phosphatidylinositol-3,5-bisphosphate binding',
         'namespace' => 'molecular_function',
         'synonym' =>
        array (
          0 => '"PtdIns(3,5)P2 binding" EXACT []',
        ),
         'is_a' =>
        array (
          0 => 'GO:0035091 ! phosphoinositide binding',
        ),
         'vid' => '9',
         'description' => '"Interacting selectively with phosphatidylinositol-3,5-bisphosphate, a diphosphorylated derivative of phosphatidylinositol." [PMID:18397324]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0080025',
            ),
          ),
        ),
      ))

    *

      stdClass::__set_state(array(
         'id' => 'GO:0080026',
         'name' => 'response to indolebutyric acid stimulus',
         'namespace' => 'biological_process',
         'synonym' =>
        array (
          0 => '"response to IBA stimulus" EXACT []',
          1 => '"response to indole-3-butyric acid stimulus" NARROW []',
        ),
         'is_a' =>
        array (
          0 => 'GO:0009733 ! response to auxin stimulus',
        ),
         'vid' => '7',
         'description' => '"A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of an indolebutyric acid stimulus." [PMID:18725356]',
         'fields' =>
        array (
          'field_uri' =>
          array (
            0 =>
            array (
              'value' => 'GO:0080026',
            ),
          ),
        ),
      ))

    * Updated 8 term(s) mitochondrion inheritance, mitochondrial genome maintenance, reproduction, ribosomal chaperone activity, deoxyribodipyrimidine endonucleosidase activity, acidocalcisome lumen, phosphatidylinositol-3,5-bisphosphate binding, response to indolebutyric acid stimulus.
    * Imported vocabulary . You may now need to Review the vocabulary settings or List the terms
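
As the dumps show, the values are still raw: each description keeps its surrounding quotes and bracketed dbxref list, and each is_a value keeps its trailing "! name" comment. Here is a minimal sketch of cleaning both, using hypothetical helper names that are not part of taxonomy_xml. Escape sequences inside the quoted text are left untouched.

```php
<?php
/**
 * Strip a trailing "! comment" from an ID-valued tag such as is_a.
 *
 * Assumption: "!" never occurs inside the ID itself. Do not use this on
 * quoted values (def, synonym), where "!" may be part of the text.
 */
function example_obo_strip_comment($value) {
  $pos = strpos($value, '!');
  return trim($pos === FALSE ? $value : substr($value, 0, $pos));
}

/**
 * Split a def value into its quoted text and its dbxref list.
 */
function example_obo_split_def($def) {
  // A quoted string (backslash escapes allowed), then a [bracketed] list.
  if (preg_match('/^"((?:[^"\\\\]|\\\\.)*)"\s*\[(.*)\]\s*$/s', $def, $m)) {
    return array('text' => $m[1], 'dbxrefs' => $m[2]);
  }
  // Fall back to the raw value when the pattern does not match.
  return array('text' => $def, 'dbxrefs' => '');
}
```

With these, the parent ID for the relation step would be example_obo_strip_comment($term->is_a[0]), and the description could be set from the 'text' part alone.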

Resolution

Comments

OWL/RDF ?

All of the OBO ontologies are also available as OWL ontology files (see obofoundry.org). The Taxonomy Import/Export module already has functionality for importing RDF, and this functionality could easily be added.

Cheers,
Matthias Samwald
DERI Galway, Ireland
Konrad Lorenz Institute for Evolution and Cognition Research, Austria

Thanks

Parsing the format hasn't been a problem; adding the terms through taxonomy_save_term has been. Batch API can be used to prevent timeouts, but the import is still ridiculously slow.
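
For reference, a rough sketch of the Batch API approach: an operation callback that saves a slice of terms per request and reports progress through $context. The function name, the chunk size of 50, and the $save callback parameter (standing in for taxonomy_save_term) are all assumptions for illustration. This only spreads the work across requests; it does nothing to make each save faster.

```php
<?php
/**
 * Batch operation callback sketch: save one slice of terms per pass.
 *
 * $terms is the full list of term arrays; $save is the name of a callback
 * that stores one term (hypothetical stand-in for taxonomy_save_term).
 */
function example_obo_batch_save($terms, $save, &$context) {
  // Assumed chunk size; tune it to the server's time limit.
  $per_pass = 50;

  // The sandbox persists between passes of the same operation.
  if (!isset($context['sandbox']['position'])) {
    $context['sandbox']['position'] = 0;
  }

  $total = count($terms);
  $slice = array_slice($terms, $context['sandbox']['position'], $per_pass);
  foreach ($slice as $term) {
    call_user_func($save, $term);
    $context['results'][] = $term['name'];
  }
  $context['sandbox']['position'] += count($slice);

  // A value below 1 asks the batch engine to call this operation again.
  $context['finished'] = $total ? min(1, $context['sandbox']['position'] / $total) : 1;
}
```

In Drupal this would be listed under 'operations' in the array handed to batch_set(); the engine then keeps calling it until $context['finished'] reaches 1.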
