LISA’11: Tech Sessions – Databases keynote


Andy Palmer from the Novartis Institute for Biomedical Research talks about a renaissance in database system innovation.

It has been proven that one size does not fit all. With so many database systems available, how do you go about picking the right one? Empirical testing and benchmarking tools could address this.
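As a rough idea of what such empirical testing might look like, here is a minimal Python timing harness. The two candidate "query" functions are placeholders I've invented, not anything from the talk; in practice each would run the same representative query against a different DBMS.

```python
import statistics
import time

def benchmark(name, run_query, repeats=20):
    """Time a query callable several times and report the median latency."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    print(f"{name}: median {statistics.median(timings) * 1000:.2f} ms over {repeats} runs")

# Placeholder workloads standing in for real candidate systems.
def query_candidate_a():
    sum(range(100_000))   # stand-in for, say, a row-store scan

def query_candidate_b():
    sum(range(10_000))    # stand-in for, say, a column-store scan

if __name__ == "__main__":
    benchmark("candidate A", query_candidate_a)
    benchmark("candidate B", query_candidate_b)
```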

The generic types of DBMS right now: row stores (Oracle, MySQL, PostgreSQL), where rows are combined into tables; column stores (Vertica, HBase), which greatly improve SQL performance for analytics; and finally file-oriented systems (Cassandra, Hadoop), where MapReduce allows massive performance improvements.
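A toy Python sketch (my own illustration, with made-up data, not from the talk) of why the column layout helps analytic queries: aggregating one attribute only touches that column's array, while a row layout walks every record in full.

```python
# Row store: the same data kept as whole records; a single-column aggregate
# still has to visit every full row.
rows = [
    {"id": 1, "gene": "BRCA1", "expression": 4.2},
    {"id": 2, "gene": "TP53",  "expression": 7.9},
    {"id": 3, "gene": "EGFR",  "expression": 1.3},
]
row_total = sum(r["expression"] for r in rows)

# Column store: each attribute kept contiguously; the aggregate scans one list.
columns = {
    "id":         [1, 2, 3],
    "gene":       ["BRCA1", "TP53", "EGFR"],
    "expression": [4.2, 7.9, 1.3],
}
col_total = sum(columns["expression"])

assert row_total == col_total  # same answer, very different I/O pattern at scale
```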

Other types: document stores (CouchDB, MongoDB), array-oriented stores (Paradigm4) for science applications rather than business, and graph databases (Neo4j). Finally, not really a type but more of a practice: database federation (Composite Software, Teiid).
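For flavour, a small Python sketch of the document and graph models mentioned above; the compounds, assays and protein names here are hypothetical examples of mine, not data from the talk.

```python
# Document store: each record is a self-contained, schema-flexible document.
documents = [
    {"_id": 1, "compound": "X-101", "assays": [{"target": "EGFR", "ic50": 0.4}]},
    {"_id": 2, "compound": "X-205", "assays": []},
]
with_assays = [d["compound"] for d in documents if d["assays"]]

# Graph store: relationships are first-class; queries walk edges directly.
interactions = {
    "EGFR": ["GRB2", "SHC1"],
    "GRB2": ["SOS1"],
}
neighbours_of_egfr = interactions.get("EGFR", [])

print(with_assays, neighbours_of_egfr)
```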

This has brought a challenge to existing DBAs, who now need an interdisciplinary skill set: systems, clusters, database design, administration, and so on. In the sciences the data is the largest product of a project, so managing and processing it is the largest area for development.

Some myths that they needed to dispel when approaching this shift in practice:

  • oracle sucks
  • oracle is the solution for everything
  • each app should have its own db
  • should have only one integrated database
  • federation is the solution

Scale is a big issue: genome sequence data is 4 PB, with 70 TB produced a year. The introduction of the NGS (next-generation sequencing) project caused a data explosion. They rapidly found it impossible to use standard database technologies due to frequent changes to database schemas, formats and datasets. They tried many DBMSs, put together some standard tests, and collected questions on what sort of data, processing and analytics would be expected.

One possible solution was to combine many technologies into a workflow and process the data from its raw format into a format usable by the application, making full use of each database's advantages. Another angle was to use the Paradigm4 array store, which keeps the data in the format in which it will be processed. Using commodity hardware and Vertica, and storing the data this way, they achieved "dramatic" savings and 10x performance increases.
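A minimal Python sketch of the kind of workflow described: a map/reduce-style aggregation over raw records, followed by writing a flat, column-friendly output that an analytic store such as Vertica could bulk-load. The raw records and field names are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Raw instrument-style output: one line per read, before any database sees it.
raw = """read_id,gene
r1,BRCA1
r2,TP53
r3,BRCA1
r4,EGFR
"""

# "Map" step: emit (gene, 1) pairs from the raw records.
pairs = [(row["gene"], 1) for row in csv.DictReader(io.StringIO(raw))]

# "Reduce" step: aggregate counts per gene.
counts = defaultdict(int)
for gene, n in pairs:
    counts[gene] += n

# Final step: write the aggregate in a flat form suitable for bulk-loading
# into an analytic column store.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["gene", "read_count"])
for gene, n in sorted(counts.items()):
    writer.writerow([gene, n])
print(out.getvalue())
```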