New Database Schema

From VistrailsWiki
Revision as of 19:25, 22 February 2011 by Tommy (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

We are proposing an update to the schema for storing a vistrail in a MySQL database.

The current schema requires multiple SQL commands to access and update a vistrail. One SELECT query is required for each table we want to extract information from. This gets really slow when the server is far away from the client. For instance, there are currently around 28 tables used. A round-trip query from USA to Europe takes 0.17 seconds. That is about 5 seconds in total time. This means that each time you press the save button, VisTrails locks up for 5 seconds. We have not found a simple way of getting around this problem. We need to store the vistrail in a database because we want to be able to perform detailed queries across all the provenance information.

Proposed Improvements

We have come up with 3 possible solutions that each have its advantages and disadvantages. Comments and other suggestions are appreciated.

  1. Store all information in a single table.
    Pro: One single command can be used to access and update a vistrail.
    Pro: We can still use a vanilla MySQL server.
    Con: A lot of redundancy in this table since different objects need different kinds of fields such as dates and integers.
    Con: One big table might lead to slow performance.
    Con: Difficult to implement.
  2. Add an intermediate server application that can receive and perform multiple SQL commands in batches.
    Pro: No changes to the schema necessary.
    Con: We have to install a new application on the server side that handles communication between VisTrails and the MySQL server.
  3. Use a more object-focused database such as an XML database.
    Pro: A single query can retrieve a complete vistrail.
    Con: We would need to install a new database such as eXist
    Con: We would need to use a different query language such as XQuery
  4. Other ideas?

Benckmarking of INSERT methods

  • Using multiple statements in a single query:
    • 20000 INSERT statements takes 4 seconds
      • 2 seconds for query
      • 2 seconds for iterating over the results
    • 10000 INSERT statements with 2 values each takes 4 seconds
    • 1000 INSERT statements with 20 values each takes 2 seconds
    • 100 INSERT statements with 200 values each takes 1.5 seconds
    • 10 INSERT statements with 2000 values each takes 1.5 seconds
    • 1 INSERT statements with 20000 values each takes 1.4 seconds
    • 1 INSERT statements with 50000 values each takes 1.8 seconds (~60000 is max before max_allowed_packet size)
  • Using STORED PROCEDURE to unpack insert statements:
    • 1 INSERT statements with 60000 values each takes 2.3 seconds (~60000 is max before max_allowed_packet size)
    • 2 INSERT statements with 30000 values each takes 2.4 seconds
    • 10 INSERT statements with 2000 values each takes 1.7 seconds
    • 100 INSERT statements with 200 values each takes 2.3 seconds
    • 1000 INSERT statements with 20 values each takes 9 seconds
    • 10000 INSERT statements with 2 values each takes 169 seconds (inefficient string parsing in the stored procedure)