(in-package :ecocyc)
#|
###################################################################################
Biocyc2SBML description:
========================
A tool for converting the information contained in the Biocyc
Pathway/Genome Database into SBML models.
Author: Jeremy Zucker
History: Version 1.16. The original version of this program was
written during the 2003 SBML Hackathon that was hosted at the
Virginia Bioinformatics Institute. This is my first non-trivial
Common Lisp program, so feedback is definitely encouraged.
License: This software is released under the Lisp Lesser GNU Public License (LLGPL).
See LLGPL.txt and LGPL.txt for more details.
Requirements:
=============
Biocyc2SBML depends on the Biocyc pathway tools suite distributed by SRI.
http://BioCyc.org/downloads.shtml
To obtain the Software-Database Bundle, please contact
biocyc-info@ai.sri.com to obtain a software license agreement.
o Universities, non-profit research institutes, government
laboratories: Local installations of BioCyc are free for
research purposes.
o Commercial sites: BioCyc is available for a fee to commercial
institutions.
Operating System: Solaris 8, Linux (2.2 or higher kernel, OpenMotif,
and glibc 2.1+), or Microsoft Windows 98/2000/NT/XP (not tested on
Solaris 9 or Windows Me).
Open Biocyc Databases:
======================
The following version 7.5 Pathway/Genome databases are currently open:
* AgroCyc -- Agrobacterium tumefaciens
* HpyCyc -- Helicobacter pylori
* MtbCdcCyc -- Mycobacterium tuberculosis CDC1551
* MtbRvCyc -- Mycobacterium tuberculosis H37Rv
* PseudoCyc -- Pseudomonas aeruginosa
* VchoCyc -- Vibrio cholerae
SBML models generated from these databases can be redistributed and
published without restriction. Please see
http://biocyc.org/open-reg.shtml for details.
Non-open Biocyc Databases:
==========================
* BsubCyc -- Bacillus subtilis
* CauloCyc -- Caulobacter crescentus
* CtraCyc -- Chlamydia trachomatis
* EcoCyc -- Escherichia coli
* HinCyc -- Haemophilus influenzae
* MetaCyc -- Metabolic pathways and enzymes from 150 species
* MpneuCyc -- Mycoplasma pneumoniae
* PseudoCyc -- Pseudomonas aeruginosa
* YeastCyc -- Saccharomyces cerevisiae
* TpalCyc -- Treponema pallidum
SBML MODELS GENERATED FROM THESE DATABASES HAVE RESTRICTIONS ON
PUBLICATION AND REDISTRIBUTION. Please see
http://biocyc.org/all-reg.shtml for details.
Invocation:
===========
bash $ pathway-tools -lisp
EC(1): (load "biocyc2sbml.lisp")
; Loading /home/zucker/src/BPHYS/biocyc2sbml.lisp
T
EC(2): (generate-sbml *open-orgs* "biocyc-open/")
;; ensure-directories-exist: creating
;; /home/zucker/src/BPHYS/biocyc-open/models
;; ensure-directories-exist: creating
;; /home/zucker/src/BPHYS/biocyc-open/doc
NIL
EC(3): (generate-sbml *all-orgs* "biocyc-full/")
;; ensure-directories-exist: creating
;; /home/zucker/src/BPHYS/biocyc-full/models
;; ensure-directories-exist: creating
;; /home/zucker/src/BPHYS/biocyc-full/doc
NIL
EC(4): :exit
SBML models were generated with the following command:
;; Function: 2sbml
;; Returns: NIL
;; Arguments: filename - output file
;; rxn-list - a list of reaction frames
;; org-id - the biocyc :org-id (default 'ecoli)
;; org-name - a string which represents the full name of
;; the organism
;; filter-p - a boolean function that takes a single reaction
;; frame and optionally one other argument as
;; parameters. (default 'nil)
;; Example usage:
A. Generate a model of E. coli describing all enzyme-catalyzed
reactions and transport reactions:
EC(1): (2sbml "ecoli-all.xml" (append (all-rxns :enzyme) (all-rxns :transport))
'ecoli "Escherichia coli K-12")
B. Generate a model of Bacillus subtilis describing only
small-molecule reactions that are balanced:
EC(2): (select-organism :org-id 'BSUB)
EC(3): (2sbml "bsub-balanced-small-molecules.xml"
         (all-rxns :small-molecule) 'BSUB "Bacillus subtilis" #'balanced-p)
C. Generate a web page describing all the small molecule metabolism
reactions of Vibrio cholerae N16961:
EC(4): (select-organism :org-id 'VCHO)
EC(5): (2html "vcho-smm.html" (all-rxns :smm) 'VCHO "Vibrio cholerae N16961")
D. Generate a web page describing only those Metacyc reactions which have
molecular weights for every reactant and product:
EC(6): (select-organism :org-id 'META)
EC(7): (2html "Metacyc-molecular-weights.html" (all-rxns) 'META
         "MetaCyc" #'molecular-weight-p)
SBML Validation
===============
Each SBML model generated can be validated using the testSBML script
in the test directory. The testSBML script assumes that
libsbml-2.0.1 has been installed in /usr/local. Please change
LD_LIBRARY_PATH if this is not the case.
Additionally, all generated SBML models can be validated with the
software packages listed on http://www.sbml.org
Bugs, Kludges, and Hacks:
=========================
A few tweaks were needed for the generated models to validate successfully:
1. SBML unique identifiers may only contain numbers, letters or underscores. Furthermore,
the first character cannot be a number. According to the BNF grammar
on page 7 of the level 2 SBML spec:
letter ::= 'a'..'z', 'A'..'Z'
digit ::= '0'..'9'
nameChar ::= letter | digit | '_'
name ::= ( letter | '_' ) nameChar*
This is in contrast to Biocyc unique identifiers, which
may contain parentheses, dashes, or HTML markup.
In order to ensure that no information is lost when converting a
Biocyc id to an SID, the following algorithm is employed:
a. If the first character is not a letter, prepend a single
underscore, e.g. 2-OCTAPRENYLPHENOL becomes _2-OCTAPRENYLPHENOL.
b. For each character in the Biocyc id, if the character is not
alphanumeric or an underscore, replace it with its ASCII code
delimited by double underscores on each side,
e.g. _2-OCTAPRENYLPHENOL becomes _2__45__OCTAPRENYLPHENOL.
Note that this algorithm is reversible as long as Biocyc never uses an
underscore at the beginning of an id and never happens to have an id
with a number delimited by double underscores. Fortunately, it does
not.
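Steps a and b above can be sketched as follows. This is only an
illustration, not the code this program actually uses: the names
sid-char-p, biocyc-id->sid and sid->biocyc-id are hypothetical, and the
sketch assumes that char-code returns ASCII codes for the characters
involved.

```lisp
;; Hypothetical helper names; a sketch of steps a and b, not the
;; implementation used elsewhere in this file.
(defun sid-char-p (ch)
  "True if CH is an ASCII letter, digit, or underscore."
  (and (< (char-code ch) 128)
       (or (alphanumericp ch) (char= ch #\_))))

(defun biocyc-id->sid (id)
  "Encode a Biocyc id (symbol or string) as an SBML SID."
  (let ((name (string id)))
    (with-output-to-string (out)
      ;; Step a: prepend an underscore unless the id starts with a letter.
      (unless (and (plusp (length name))
                   (< (char-code (char name 0)) 128)
                   (alpha-char-p (char name 0)))
        (write-char #\_ out))
      ;; Step b: replace every other character with __code__.
      (loop for ch across name
            do (if (sid-char-p ch)
                   (write-char ch out)
                   (format out "__~D__" (char-code ch)))))))

(defun sid->biocyc-id (sid)
  "Decode an SID, assuming no Biocyc id starts with an underscore or
contains a number delimited by double underscores."
  (let ((body (if (and (plusp (length sid)) (char= (char sid 0) #\_))
                  (subseq sid 1)
                  sid)))
    (with-output-to-string (out)
      (loop with i = 0
            while (< i (length body))
            do (let* ((start (and (< (1+ i) (length body))
                                  (string= "__" body :start2 i :end2 (+ i 2))
                                  (+ i 2)))
                      (end (and start
                                (position-if-not #'digit-char-p body
                                                 :start start))))
                 (if (and end
                          (> end start)
                          (<= (+ end 2) (length body))
                          (string= "__" body :start2 end :end2 (+ end 2)))
                     ;; Found __digits__: emit the character it encodes.
                     (progn
                       (write-char (code-char (parse-integer body
                                                             :start start
                                                             :end end))
                                   out)
                       (setf i (+ end 2)))
                     ;; Ordinary character: copy it through.
                     (progn
                       (write-char (char body i) out)
                       (incf i))))))))

;; (biocyc-id->sid "2-OCTAPRENYLPHENOL")        => "_2__45__OCTAPRENYLPHENOL"
;; (sid->biocyc-id "_2__45__OCTAPRENYLPHENOL")  => "2-OCTAPRENYLPHENOL"
```

The decoder is included only to illustrate the reversibility argument; it
relies on the same two assumptions about Biocyc ids that are stated above.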
2. In the notes section, XHTML does not appear to recognize entities
such as &beta; and &gamma;. To handle this namespace issue, all
names and reaction equations are html-encoded, so that for EC#
5.1.3.3, the ALDOSE-1-EPIMERASE-RXN has a reaction equation of:
α-D-glucose = β-D-glucose
3. According to the current SBML specification, a compound that is
transported must have an identifier for each compartment. I solve
this problem by creating two species tags for each transported
chemical and using the correct speciesReference in the transport
reaction.
4. Coefficients of a reaction had to be normalized in order to be
accepted by the SBML spec. Fortunately, the newest level 2
specification accepts floating point numbers for stoichiometry:
    Biocyc coefficient ==> SBML stoichiometry
    N                  ==> 1
    2N                 ==> 2
    M                  ==> 1
    0.5d0              ==> 0.5
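One way to realize the mapping above is sketched below. The function name
normalize-coefficient is hypothetical and the rule it applies (keep a
leading integer of a symbolic coefficient, otherwise use 1) is an
assumption inferred from the four rows of the table, not a description of
the code this program actually uses.

```lisp
;; Hypothetical helper; a sketch inferred from the table above.
(defun normalize-coefficient (coefficient)
  "Map a Biocyc coefficient (number, symbol, or string) to a number."
  (if (realp coefficient)
      ;; Integers pass through; other reals become single floats,
      ;; so 0.5d0 is printed as 0.5 rather than 0.5d0.
      (if (integerp coefficient)
          coefficient
          (float coefficient 1.0))
      ;; Symbolic coefficient such as N, 2N or M: keep a leading
      ;; integer if there is one, otherwise fall back to 1.
      (let* ((name (string coefficient))
             (end (or (position-if-not #'digit-char-p name)
                      (length name))))
        (if (zerop end)
            1
            (parse-integer name :end end)))))

;; (normalize-coefficient "N")    => 1
;; (normalize-coefficient "2N")   => 2
;; (normalize-coefficient 0.5d0)  => 0.5
```

Replacing a symbolic coefficient with 1 loses the information that the
stoichiometry was variable, so a real implementation might also want to
record the original coefficient in the notes.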
5. To represent the relationship of genes to enzymes to reactions, the
following convention is used:
a. Genes are species that are found only in the DNA compartment.
b. Enzymes are species that are found only in the cytoplasm.
c. For each enzyme complex, a reaction exists where the reactants
are the genes and the product is the enzyme complex.
d. For each enzyme-catalyzed reaction, a list of modifiers
associates each enzyme with the reaction it catalyzes.
Acknowledgements:
=================
Thanks to Matthew Temple for giving me permission to
pursue my interests. Thanks to Peter Karp and the Pathway-tools
support team for making Biocyc a reality and providing the pathway-tools API.
Thanks to the organizers and participants of the SBML Hackathon
for answering numerous arcane questions about SBML and providing
food, lodging, and facilities for hacking. Thanks to everyone in
the Church lab, especially Daniel Segre, Wayne Rindone, and Xiaoxia
Lin for their support, my BPHYS 101 project partners Jeremy Katz
and Julian Bonilla for embarking with me on this madness, and last
but not least George Church for getting
me excited about this project idea in the first place, and sending
me to the Hackathon in the end.
##############################################################################
|#
; This file contains Lisp functions for querying Pathway Tools
; databases and for generating an SBML file from the results.
;; Each entry is (org-id org-name); org-name is also used as the output file name.
(defparameter *all-orgs* '((VCHO "Vibrio-cholerae-N16961")
                           (TPAL "Treponema-pallidum-Nichols")
                           (PSEUDO "Pseudomonas-aeruginosa-PAO1")
                           (MTBRV "Mycobacterium-tuberculosis-H37Rv")
                           (MTBCDC "Mycobacterium-tuberculosis-CDC1551")
                           (MPNEU "Mycoplasma-pneumoniae")
                           (META "Metacyc")
                           (HPY "Helicobacter-pylori-26695")
                           (HINF "Haemophilus-influenzae")
                           (ECOLI "Escherichia-coli-K-12")
                           (CTRA "Chlamydia-trachomatis-D-UW-3-CX")
                           (CAULO "Caulobacter-crescentus")
                           (BSUB "Bacillus-subtilis")
                           (AGRO "Agrobacterium-tumefaciens-C58")))
; * AgroCyc -- Agrobacterium tumefaciens
; * HpyCyc -- Helicobacter pylori
; * MtbCdcCyc -- Mycobacterium tuberculosis CDC1551
; * MtbRvCyc -- Mycobacterium tuberculosis H37Rv
; * PseudoCyc -- Pseudomonas aeruginosa
; * VchoCyc -- Vibrio cholerae
(defparameter *open-orgs* '((VCHO "Vibrio-cholerae-N16961")
                            (AGRO "Agrobacterium-tumefaciens-C58")
                            (HPY "Helicobacter-pylori-26695")
                            (MTBRV "Mycobacterium-tuberculosis-H37Rv")
                            (MTBCDC "Mycobacterium-tuberculosis-CDC1551")
                            (PSEUDO "Pseudomonas-aeruginosa-PAO1")))
(defun generate-sbml (org-list dir-name)
  "Write an SBML model and an HTML page for each (org-id org-name) in ORG-LIST.
DIR-NAME is expected to end with a slash, e.g. \"biocyc-open/\"."
  (sri::check-and-create-directory (format nil "~Amodels/" dir-name))
  (sri::check-and-create-directory (format nil "~Adoc/" dir-name))
  (loop for org in org-list
        for org-id = (car org)
        for org-name = (cadr org)
        do
           (select-organism :org-id org-id)
           ;; Build paths the same way as the directories above (no doubled slash)
           (2sbml (format nil "~Amodels/~A.xml" dir-name org-name)
                  (append (all-rxns :enzyme) (all-rxns :transport))
                  org-id org-name nil)
           (2html (format nil "~Adoc/~A.html" dir-name org-name)
                  (append (all-rxns :enzyme) (all-rxns :transport))
                  org-id org-name nil)))
(defun smm-transport-rxns2sbml (filename &optional (org-id 'ecoli) (org-name "Biocyc"))
(2sbml filename
(append (all-rxns :smm)
(all-rxns :transport))
org-id
org-name 'nil))
(defun balanced-p (rxn &optional (tolerance 0.0))
  "Return true if the molecular weights on both sides of RXN agree within TOLERANCE."
  ;; SUM is bound locally instead of being set as an undeclared global.
  (let ((sum 0.0))
    (loop for reactant in (get-slot-values rxn 'left)
          do (decf sum (* (get-molecular-weight reactant)
                          (get-coefficient rxn 'left reactant))))
    (loop for product in (get-slot-values rxn 'right)
          do (incf sum (* (get-molecular-weight product)
                          (get-coefficient rxn 'right product))))
    ;; rxn-filter passes NIL as the second argument when no tolerance is given
    (<= (abs sum) (or tolerance 0.0))))
(defun balanced-rxns2sbml (filename &optional (org-id 'ecoli) (org-name "Biocyc"))
(2sbml filename
(append (all-rxns :smm)
(all-rxns :transport))
org-id
org-name
#'molecular-weight-p))
(defun balanced-rxns2html (filename &optional (org-id 'ecoli) (org-name "Biocyc"))
(2html filename
(append (all-rxns :smm)
(all-rxns :transport))
org-id
org-name
#'molecular-weight-p))
(defun make-html (org-id rxn-list &optional (org-name "Biocyc"))
  (format t "~%SBML model for ~A~%" org-name)
  (format t "~%~%")
  ;; Use the caller's reaction list so that a filter applied by 2html takes effect
  (print-notes org-id rxn-list org-name 'nil)
  (format t "~%"))
;; Function: 2sbml
;; Returns: NIL
;; Arguments: filename - output file
;; rxn-list - a list of reaction frames
;; org-id - the biocyc :org-id (default 'ecoli)
;; org-name - a string which represents the full name of
;; the organism
;; filter-p - a boolean function that takes a single reaction
;; frame and optionally one other argument as
;; parameters. (default 'nil)
;; Example usage:
; A. Generate a model of E. coli describing all enzyme-catalyzed
; reactions and transport reactions:
; EC(1): (2sbml "ecoli-all.xml" (append (all-rxns :enzyme) (all-rxns :transport))
; 'ecoli "Escherichia coli K-12")
; B. Generate a model of Bacillus subtilis describing only
; small-molecule reactions that are balanced:
; EC(2): (select-organism :org-id 'BSUB)
; EC(3): (2sbml "bsub-balanced-small-molecules.xml"
;          (all-rxns :small-molecule) 'BSUB "Bacillus subtilis" #'balanced-p)
; C. Generate a web page describing all the small molecule metabolism
; reactions of Vibrio cholerae N16961:
; EC(4): (select-organism :org-id 'VCHO)
; EC(5): (2html "vcho-smm.html" (all-rxns :smm) 'VCHO "Vibrio cholerae N16961")
; D. Generate a web page describing only those Metacyc reactions which have
; molecular weights for every reactant and product:
; EC(6): (select-organism :org-id 'META)
; EC(7): (2html "Metacyc-molecular-weights.html" (all-rxns) 'META
;          "MetaCyc" #'molecular-weight-p)
(defun 2sbml (filename rxn-list &optional
(org-id 'ecoli) (org-name "Biocyc") (filter-p 'nil))
(cond ((fequal nil filter-p)
(tofile filename
(make-sbml org-id rxn-list org-name)))
(t (tofile filename
(make-sbml org-id
(rxn-filter rxn-list filter-p)
org-name)))))
(defun 2html (filename rxn-list &optional
(org-id 'ecoli) (org-name "Biocyc")(filter-p 'nil))
(cond ((fequal nil filter-p)
(tofile filename
(make-html org-id rxn-list org-name)))
(t (tofile filename
(make-html org-id (rxn-filter rxn-list filter-p)
org-name)))))
;; Function: rxn-filter - keeps only the reactions that satisfy filter-p
;; Returns: a list of the reactions for which filter-p returned true
;; Arguments: rxn-list - a list of reaction frames
;; filter-p - a boolean function that takes a reaction frame and
;; one optional second argument
;; args - passed to filter-p as its second argument (NIL if omitted)
;; Examples: (rxn-filter (all-rxns) #'balanced-p)
(defun rxn-filter (rxn-list filter-p &optional args)
(let ((filtered-list '()))
(loop for rxn in rxn-list
when (funcall filter-p rxn args)
do (pushnew rxn filtered-list))
filtered-list))
;; Function: molecular-weight-p - checks whether every substrate of a
;; reaction has a molecular weight. (It does not currently check
;; whether the reaction is balanced.)
;; Returns: boolean - t if every substrate has a molecular weight, nil otherwise
;; Arguments: rxn - a reaction frame
;; args - ignored; accepted so the function can be used with rxn-filter
;; Example: (molecular-weight-p '|CITRATE-(RE)-SYNTHASE-RXN|) ==> t
;; (molecular-weight-p 'THIOREDOXIN-REDUCT-NADPH-RXN) ==> NIL
;;
(defun molecular-weight-p (rxn &optional args)
  (declare (ignore args))
  ;; get-molecular-weight returns 0 when no weight is recorded
  (loop for substrate in (substrates-of-reaction rxn)
        never (equal 0 (get-molecular-weight substrate))))
(defun get-molecular-weight (metabolite)
(if (and (coercible-to-frame-p metabolite) (not (fequal nil (get-slot-value metabolite 'molecular-weight))))
(get-slot-value metabolite 'molecular-weight)
0))
;; This function prints an SBML document to the standard output
;; Returns: NIL
;; Arguments: org-id - Biocyc id of the organism
;; rxn-list - list of reaction frames
;; Example usage:
;; (make-sbml 'ecoli (all-rxns :smm))
;;
(defun make-sbml (org-id rxn-list &optional (org-name "Biocyc"))
(select-organism :org-id org-id)
(print-header org-id org-name)
(print-notes org-id rxn-list org-name)
(print-list-of-compartments (list "cytoplasm" "extracellular" "dna"))
(print-list-of-species rxn-list)
(print-list-of-reactions rxn-list)
(print-footer))
(defun print-table (thead tbody bgcolor)
(format t "
~%" bgcolor (format nil "SBML model for ~A" org-name))
(format t "
")
(print-table (list "Citations")
(list (print-citation "D. Segre, J. Zucker, J. Katz, X. Lin, P. D'haeseleer, W. Rindone, P. Kharchenko, D. Nguyen, M. Wright, G. Church"
"From annotated genomes to metabolic flux models and kinetic parameter fitting"
"http://genetics.med.harvard.edu/~dsegre/Segre_etal_OMICS_2003.pdf"
"OMICS 7:3 301-317 2003" )
(print-citation "P. Karp, S. Paley, and P. Romero"
"The Pathway Tools Software"
"http://www.ai.sri.com/pkarp/pubs/02ptools.pdf"
"Bioinformatics 18:S225-32 2002"))
bgcolor)
(format t "