Supplementary Information 20
A. Databases that were used for the annotation pipeline and curation
Database  Reference Description
Nucleotide sequence
DDBJ Tateno et al. Nucleic Acids Res. 30, 27-30. (2002) all known nucleotide and protein sequences
EMBL Stoesser et al. Nucleic Acids Res. 30, 21-26. (2002) all known nucleotide and protein sequences
GenBank Benson et al. Nucleic Acids Res. 30, 17-20. (2002) all known nucleotide and protein sequences
Mouse Genome Informatics (MGI) - Mouse Genome Database (MGD) Blake et al. Nucleic Acids Res. 30, 113-115. (2002) model organism database for the laboratory mouse; gene, sequence, nomenclature, GO information among others
RefSeq/LocusLink Pruitt et al. Nucleic Acids Res. 29, 137-140. (2001) non-redundant collection of genes and reference sequence standards
dbEST (mouse division)   mouse EST sequences
UniGene Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002) clusters of ESTs and full-length mRNA sequences; each cluster represents a unique known or putative gene
TIGR Gene Indices J. Quackenbush et al. Nucleic Acids Res. 29, 159-164. (2001) TIGR and GenBank EST sequences assembled to tentative consensus sequences
nt (NCBI) Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002) all GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
Alternative splicing dB Zavolan et al. manuscript in preparation Database of alternatively spliced mouse transcripts
Mapping
MGSC v3 Mouse Genome Sequencing Consortium. Nature. (this issue) (2002) mouse genome sequence assembly
Human "Golden Path" International Human Genome Sequencing Consortium, Nature 409, 860-921. (2001) human genome sequence assembly
Ensembl Hubbard et al. Nucleic Acids Res. 30, 38-41. (2002) genome dataset containing confirmed and predicted genes, exons, transcripts, and contigs
Riken-GenoMapper M. musculus cDNA mapping H. Kiyosawa et al. in preparation RIKEN clones mapped to mouse genome incl. information on disease, public mouse genes, markers and ESTs
Riken-GenoMapper H. sapiens cDNA mapping H. Kiyosawa et al. in preparation RIKEN clones mapped to human genome incl. information on disease, public mouse genes, markers and ESTs
Radiation Hybrid Map  I. Yamanaka et al. J. Struct. Func. Genomics 2, 23-28. (2002) RIKEN clones mapped to mouse chromosomes based on sequence homology to ESTs of Whitehead mouse T31 radiation hybrid map
Protein sequence
nr (NCBI) Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002) non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
SPTR (SwissProt + TrEMBL non-redundant protein set) Bairoch et al. Nucleic Acids Res. 28, 45-48. (2000) annotated protein database with minimal redundancy; annotation incl. GO terms and functional sites
PIR NREF Wu et al. Nucleic Acids Res. 30, 35-37. (2002) non-redundant reference protein database that includes all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and PDB
Domains, motifs and superfamilies
SCOP Lo Conte et al. Nucleic Acids Res. 30, 264-267. (2002) structural classification of proteins
SUPERFAMILY Gough et al. Nucleic Acids Res. 30, 268-272. (2002) HMM based on the SCOP 'superfamily' level of protein domain classification
Pfam Bateman et al. Nucleic Acids Res. 30, 276-280. (2002) semi-automatic protein family database containing multiple protein alignments and profile-HMMs of these families
MDS Kawaji et al. Genome Res. 12, 367-378. (2002) novel motifs extracted from SPTR and FANTOM DB
InterPro Apweiler et al. Nucleic Acids Res. 29, 37-40. (2001) integrated view of other domain and functional site databases (PROSITE, PRINTS, ProDom and Pfam)
UTRsite and UTRdb Pesole et al. Nucleic Acids Res. 30, 335-340. (2002) UTRsite: nucleotide sequence patterns of UTRs where a functional role has been shown experimentally; UTRdb: non-redundant 3' and 5' UTR sequences of eukaryotic mRNAs enriched with annotations about functional elements and repeats
Pathway
KEGG Kanehisa et al. Nucleic Acids Res. 30, 42-46. (2002) metabolic and regulatory pathway maps
Disease
OMIM Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002) catalog of human genes and genetic disorders
Literature
PubMed   abstracts and bibliographic information of journal articles and books
Gene Ontology
GO database Ashburner et al. Nat Genet. 25, 25-29. (2000) gene ontology terms
SNP
dbSNP Wheeler et al. Nucleic Acids Res. 30, 13-16. (2002) single nucleotide polymorphisms
B. Programs that were used during the full-length sequencing and functional annotation
Software name Reference Description
Database searching
NCBI-BLAST Altschul et al. J. Mol. Biol. 215, 403-410. (1990) Basic Local Alignment Search Tool that includes a set of similarity search programs (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX)
RepeatMasker Smit, A.F.A. and Green, P. unpublished results screens DNA sequences against a library of repetitive elements, as well as for low complexity regions; it returns a masked query sequence ready for database searches
Protein Sequence Analysis  
FASTY Pearson et al. Genomics 46, 24-36. (1997) FASTY is a program of the FASTA package that compares a DNA sequence to a protein sequence database using the FASTA algorithm; it translates the DNA sequence in three forward (or reverse) frames and allows frameshifts
HMMER Eddy. Bioinformatics 14, 755-763. (1998) profile hidden Markov models for biological sequence analysis; searches a sequence database with a profile HMM or builds a hidden Markov model from a sequence alignment
InterProScan Zdobnov and Apweiler. Bioinformatics 17, 847-848. (2001) SW-based InterPro motif search
iPSORT Bannai et al. Bioinformatics 18, 298-305. (2002) Predicts the subcellular location of proteins
TMHMM A. Krogh et al. J. Mol. Biol. 305, 567-580. (2001) Prediction of transmembrane helices in proteins
COILS A. Lupas et al. Science 252, 1162-1164. (1991) Prediction of coiled-coil conformation from protein sequences
SignalP H. Nielsen et al. Proc Int Conf Intell Syst Mol Biol 6, 122-130. (1998) Prediction of the presence and location of signal peptide cleavage sites in amino acid sequences
Gene structure; Open Reading Frame
DECODER (in house) Fukunishi and Hayashizaki. Physiological Genomics 5, 81-87. (2001) extracts open reading frames from sequences and corrects frame-shifts
rsCDS (in house) M. Furuno et al. in preparation CDS prediction completely based on homology search of protein sequences
ProCrest (in house) J. Adachi et al. in preparation CDS prediction based on coding potential in DNA sequences
NCBI CDS Predictor (in house) L. Wagner (unpublished) CDS prediction based on both protein homology and coding potential
Sequence assembly, clustering, Gene Index building
Phred Ewing and Green. Genome Res. 8, 186-194. (1998) reads DNA sequencer trace data, calls bases, and assigns quality values to the bases
Phrap   assembles shotgun DNA sequence data into a contig sequence
Consed D. Gordon et al. Genome Res. 8, 195-202. (1998) edits sequence assemblies created by Phrap for reassembly of the same data set
CAP3 X. Huang et al. Genome Res. 9, 868-877. (1999) assembles sequences using base quality values in computation of overlaps between reads; construction of multiple sequence alignments of reads, and generation of consensus sequences; integrated in the TIGR Gene Index assembly pipeline
Megablast   nucleotide sequence alignment search program, used for clustering in the TIGR Gene Index assembly
TGI assembly pipeline J. Quackenbush et al. Nucleic Acids Res. 29, 159-164. (2001) TIGR Gene Index assembly pipeline
Mapping and genomic alignments
TGI mapping pipeline   genomic alignment and grouping of tentative transcript sequences
blEST L. Florea et al. Genome Res. 8, 967-974. (1998) cDNA-genome alignment program integrated in TIGR Gene Index genomic mapping pipeline
SIM4 L. Florea et al. Genome Res. 8, 967-974. (1998) aligns a cDNA sequence to a genomic sequence under the assumption that the differences between the two sequences are limited to introns in the genomic sequence and sequencing errors in either of the sequences
Gene Ontology Browser  
GO around J. Tanoue et al. Bioinformatics (in press) Gene ontology viewer
C. Systems that were used for computational analyses and curation
Software name Reference Description
FANTOM cDNA annotation system (CAS) T. Kasukawa et al. in preparation web-based system for human curation of sequences
ITOP T. Kasukawa et al. in preparation displays sequencing quality (PHRED) scores
Homology Viewer M. Furuno et al. in preparation Graphical viewer that shows homologous regions to protein sequences and start/stop codons for each frame
ClusTrans J. Adachi et al. in preparation RIKEN cDNA sequence clustering, viewer, and editor
READ Bono et al. Nucleic Acids Res. 30, 211-213. (2002) RIKEN expression array database
Metabolomapper H. Bono et al. in preparation system to browse and map assigned EC numbers to KEGG metabolic pathways
FACTS T. Nagashima et al. in preparation system to explore and curate computational higher functional annotations (protein interactions and disease associations) of cDNA clones using text sources