Centralized Life Sciences Data (CLSD)

The CLSD is a compendium of publicly available databases related to biology, genomics, medicine, etc. These have been federated to allow customized queries using structured query language (SQL). In some cases, the CLSD provides access to data not available through any other information resource.

The CLSD project is now being made available to the wider research community as a pilot through the TeraGrid with the aim of establishing it as a production service accessible through a variety of methods.

IU is actively soliciting input on which data sources and access methods would be most useful for research. To submit suggestions or to obtain assistance using CLSD, please contact "data at indiana dot edu".

CLSD is implemented using IBM's data federation technology, Information Integrator, which includes IBM's DB2 Relational Database.

How can CLSD be useful to researchers?

There are three important reasons that CLSD may be of value in accessing publicly available data:
  • Access to data from multiple databases using a single standard SQL query. This allows researchers to merge data from multiple sources. In addition, one can execute a BLAST search within an SQL query and merge the results with data from other sources. A short description of the use of SQL as an interface to CLSD is available here.
  • IU keeps data in CLSD constantly up to date, so researchers can access data from CLSD with confidence that they are getting the most recent data.
  • Researchers can use a web page to execute an SQL query, or write programs that use a WSRF web service, a JAX-RPC web service, JDBC, or a DB2 client library to send queries to CLSD and retrieve data. For more details about access methods see Accessing CLSD.
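
As a rough illustration of the first point, the sketch below joins a local table with a table held in a second, separate database using one SQL statement. It is only an analogy: SQLite's ATTACH stands in for the federation layer (CLSD itself uses DB2 and Information Integrator), and the table names, column names, and rows are hypothetical and not taken from the CLSD schema.

```python
import sqlite3


def build_demo_federation():
    """Return a connection with one local table and one 'remote' table.

    Assumption: SQLite's ATTACH stands in for the federation layer;
    CLSD itself uses DB2 and Information Integrator.
    """
    conn = sqlite3.connect(":memory:")
    # A second, separate in-memory database, addressed by the schema name "remote".
    conn.execute("ATTACH DATABASE ':memory:' AS remote")

    # Hypothetical table and column names; the real CLSD schema differs.
    conn.execute("CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE remote.pathway_enzyme (pathway TEXT, ec_number TEXT)")

    # Illustrative rows only, not data retrieved from CLSD.
    conn.executemany(
        "INSERT INTO enzyme VALUES (?, ?)",
        [("1.1.1.1", "Alcohol dehydrogenase"), ("2.7.1.1", "Hexokinase")],
    )
    conn.executemany(
        "INSERT INTO remote.pathway_enzyme VALUES (?, ?)",
        [("Glycolysis", "2.7.1.1"), ("Glycolysis", "1.1.1.1")],
    )
    return conn


def enzymes_in_pathway(conn, pathway):
    """Join the local and the 'remote' table in a single SQL query."""
    sql = """
        SELECT e.ec_number, e.name
        FROM enzyme AS e
        JOIN remote.pathway_enzyme AS p ON p.ec_number = e.ec_number
        WHERE p.pathway = ?
        ORDER BY e.ec_number
    """
    return conn.execute(sql, (pathway,)).fetchall()


if __name__ == "__main__":
    for row in enzymes_in_pathway(build_demo_federation(), "Glycolysis"):
        print(row)
```

The query itself does not need to know where each table physically lives, which is the property that federation provides to CLSD users.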

Which data sources are available?

The following data sources may be referenced within SQL queries. Some of these resources are stored in relational form within CLSD, some are federated from remote servers at NCBI, and some are mirrored on Indiana University computer systems and then federated into CLSD. (The federated resources appear as local tables within CLSD.)
  • Resources stored in relational form within CLSD:
    • BIND -- Pathways, Gene interactions
    • ENZYME -- Enzyme nomenclature
    • ePCR -- ePCR results of UniSTS vs Homo sapiens
    • SGD -- Saccharomyces Genome Database
    • KEGG data sources:
      • LIGAND -- Pathways, Reactions, & Compounds
      • PATHWAY -- Pathway map coordinates
    • NCBI data sources:
      • LocusLink -- Genetic Loci. (LocusLink has been inactive since July 1, 2005, when it was retired in favor of Entrez Gene; it is retained for archival use.)
      • UniGene -- Gene clusters
  • Federated data sources, where the data is stored:
    • at the originating site:
    • on local (mirror) servers external to CLSD but housed at Indiana University:
      • BLAST -- Basic Local Alignment Search Tool (mirrored at IU by UITS)
        • Nucleotide data: NT
        • Protein data: NR and Swiss-Prot
      • dbSNP -- Single Nucleotide Polymorphisms (mirrored at IU by IUSM)

Additional information

Some additional tools provided with DB2 can be used with CLSD and may be helpful to some researchers.

Users of CLSD can subscribe to the notification list for announcements about service changes and outages by emailing the text:

     subscribe clsd-notify-l
to the address:
     listserv at iupui dot edu
There is also a short slideset from a talk about CLSD.

Special parsers were developed to help convert data from non-relational to relational form for injection into CLSD. These parsers are available for download.
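
To give a sense of what such a conversion involves, the sketch below turns a small flat-file excerpt into rows suitable for a relational table. It is not one of the CLSD parsers: the function name is hypothetical, and the input is assumed to follow a simplified ENZYME-style layout (a two-letter line code, then the value starting in column 6, with "//" ending each record).

```python
def parse_enzyme_records(text):
    """Yield (ec_number, name) rows from ENZYME-style flat-file text.

    Assumed layout: 'ID' lines carry the EC number, 'DE' lines carry the
    (possibly multi-line) name, and '//' terminates a record.
    """
    ec_number, name_parts = None, []
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: emit a row only if an ID line was seen.
            if ec_number is not None:
                yield ec_number, " ".join(name_parts).rstrip(".")
            ec_number, name_parts = None, []
            continue
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            ec_number = value
        elif code == "DE":
            name_parts.append(value)
        # Other line codes are ignored in this sketch.


SAMPLE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
//
ID   2.7.1.1
DE   Hexokinase.
//
"""

if __name__ == "__main__":
    for row in parse_enzyme_records(SAMPLE):
        print(row)
```

Each yielded tuple corresponds to one row that could then be loaded into a table with an INSERT statement.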


Acknowledgements

The activities of the UITS Research Computing Division have been supported by Shared University Research Grants from IBM, Inc. to Indiana University.

The activities of the UITS Research Computing Division support efforts of the Indiana Genomics Initiative (INGEN), and have received financial assistance from INGEN. INGEN is supported in part by Lilly Endowment Inc.

CLSD uses DataDirect's Connect ODBC drivers for Unix in conjunction with Information Integrator to connect to SQL Server databases.