Notes from initial meeting
August 20, 2008
- Support the publishing, aggregation, and presentation of numerical output, "Datasets"
(the term "Data" is reserved for numerical output that is observed)
- To be used by the oceanographic modeling community
- Will have the following attributes:
- Registration of Datasets sources
- Scalable HA delivery of Datasets
- Inventory of all registered Datasets
- Efficient republishing of
. Datasets,
. aggregations, f(Datasets) => Datasets
. presentations, f(Datasets) => Visible product
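A minimal sketch of the two function shapes above, assuming a Dataset can be stood in for by a named list of numbers. The names Datasets, ensemble_mean, text_table and republish are hypothetical and only illustrate that an aggregation returns Datasets while a presentation returns something visible.

```python
from typing import Callable, Dict, List

# Assumption: a dataset is just a list of numbers keyed by name.
Dataset = List[float]
Datasets = Dict[str, Dataset]

Aggregation = Callable[[Datasets], Datasets]   # f(Datasets) => Datasets
Presentation = Callable[[Datasets], str]       # f(Datasets) => Visible product


def ensemble_mean(datasets: Datasets) -> Datasets:
    """Example aggregation: element-wise mean across all input datasets."""
    columns = zip(*datasets.values())
    return {"ensemble_mean": [sum(c) / len(c) for c in columns]}


def text_table(datasets: Datasets) -> str:
    """Example presentation: one text line per dataset."""
    return "\n".join(
        f"{name}: " + " ".join(f"{v:.2f}" for v in values)
        for name, values in datasets.items()
    )


def republish(datasets: Datasets, aggregation: Aggregation,
              presentation: Presentation) -> str:
    """Chain an aggregation into a presentation."""
    return presentation(aggregation(datasets))
```

Because an aggregation returns Datasets, its output can be registered and republished like any other dataset.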
Design
Exchange between producer and consumer (of nout sets, i.e. numerical output sets)
pieces of exchange:
- caching
- transform
- visualization
- inventory
Publishers should be able to come in and
- register themselves
- register their datasets
- subscribe themselves
- when they register nout sets,
let them define lease time in cache
. might not want to keep it forever,
. might want to mirror
. might want to update?
dataset review cycle time
no cache implies a pass through to the dataset source
- nout sets have a cache expiration time
a mirror site is a cache site with no time-to-live (its copy never expires)
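The lease rules above can be sketched as follows. This is only an illustration under assumptions: the class LeaseCache, the fetch callback, and the convention that a lease of 0 means "no cache" and a lease of None means "mirror" are all hypothetical, not something decided in the meeting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

NO_CACHE = 0.0   # assumed convention: zero lease means pass through to the source
MIRROR = None    # assumed convention: no lease means the copy never expires


@dataclass
class CacheEntry:
    payload: bytes
    stored_at: float


class LeaseCache:
    """Cache whose per-dataset lease time is chosen by the publisher."""

    def __init__(self, fetch: Callable[[str], bytes],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch      # hypothetical callback that reads from the dataset source
        self._clock = clock
        self._leases: Dict[str, Optional[float]] = {}
        self._entries: Dict[str, CacheEntry] = {}

    def register(self, dataset_id: str, lease_seconds: Optional[float]) -> None:
        self._leases[dataset_id] = lease_seconds

    def get(self, dataset_id: str) -> bytes:
        lease = self._leases[dataset_id]
        if lease == NO_CACHE:
            # No cache: every request goes straight to the source.
            return self._fetch(dataset_id)
        entry = self._entries.get(dataset_id)
        now = self._clock()
        if entry is not None and (lease is None or now - entry.stored_at < lease):
            return entry.payload
        # Missing or expired: refresh from the source and restart the lease.
        payload = self._fetch(dataset_id)
        self._entries[dataset_id] = CacheEntry(payload, now)
        return payload
```

Treating a mirror as a cache entry with no expiration keeps one code path for all three cases (pass-through, timed cache, mirror).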
Consumer
- discovery
- access
get the file through normal means
should be able to get and subscribe (registers)
what is the lease on a subscription?
notification cycle time (rss/atom; notify about new data, don't push data)
Inventory of exchange
use ERDDAP
- inventory - presentation layer
- transformation
- presentation
- use memcache in our erddap mangling for getting updates from sources
does that mean different erddaps can do fake crawler requests through memcache,
or actually share the in-memory representation of erddaps?
Use AWS EC2 and S3
pick a revision control system (svn, hg, git...)
Terms
Application
Conceptually, an application is composed of discrete functional
parts, components.
An instance of an application is comprised of component sets
Component
Something that runs on a Node/VM.... of specific defined utility.
Node/VM
Outline

ERDDAP
Functionality of ERDDAP, split up into exclusive (as possible) pieces
1. Transform engine
- from DAP formats, to lots of things
- also produces visualizations
2. Inventory of datasets
- static list of source URLs and associated metadata, by dataset
- Update service (via periodic polling)
3. Aggregation of datasets
- inherent since each dataset defined in datasets.xml can have any sourceURL
- provides web page with HTML table of datasets declared in datasets.xml
4. DAP Server
- can serve data via DAP protocol
- currently cannot serve locally stored datafiles
- REST and ROA style request URL format
5. Data Access Forms
- web pages for humans facilitating the creation of DAP requests
Sub projects
Breakdown of tasks that involve writing or modifying code to make a usable package of specific utility
Registration Web App
Present publisher (a real person) a form for submitting info on their data
source, and other specific metadata.
Store this info.
Functionality:
Research:
Determine exactly what info is needed from producer to get and serve their
data
Enumerate these as a list of parameters with specific meaning.
Ways to do this:
Use Django for the interface
?. Build on the generate datasetXml work done by the erddap people
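One way to start the "list of parameters with specific meaning" is a plain record type with validation. This is a sketch under assumptions: the exact fields are still a research item, so every field name below (and the class DatasetRegistration itself) is hypothetical, drawn only from what these notes mention (publisher, source, cache lease, review cycle time).

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class DatasetRegistration:
    """Hypothetical registration record; the real field list is still to be researched."""
    publisher_name: str
    contact_email: str
    dataset_name: str
    source_url: str                               # where the dataset is served from
    cache_lease_seconds: Optional[float] = None   # assumed: None = mirror, 0 = no cache
    review_interval_seconds: float = 3600.0       # assumed default review cycle time

    def problems(self) -> List[str]:
        """Return human-readable validation problems; an empty list means OK."""
        found: List[str] = []
        for field_name in ("publisher_name", "contact_email", "dataset_name"):
            if not getattr(self, field_name).strip():
                found.append(f"{field_name} is required")
        parsed = urlparse(self.source_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            found.append("source_url must be an http(s) URL")
        if self.cache_lease_seconds is not None and self.cache_lease_seconds < 0:
            found.append("cache_lease_seconds must not be negative")
        if self.review_interval_seconds <= 0:
            found.append("review_interval_seconds must be positive")
        return found
```

A record like this could back a Django form or feed a datasetXml generator later; keeping it framework-free for now avoids committing to either route.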
Mirroring: storage service (using S3, EBS)
Research:
Related to:
Registration service
Overview of projects, try to make these concrete
Figure out how potential sub projects are related, if and how they should
be coupled/decoupled...
Crawler: ERDDAP Inventory Review Service
The inventory source is defined in the datasets.xml file.
The inventory defines data sources
Inventory metadata needs to be updated.
Functionality:
Go to each source url at specified interval and get das, dds (metadata)
Compare to existing; differences imply updates; flag these indicating new
data is available
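The review step could be sketched like this, assuming the DAP convention that a source URL serves its metadata at the ".das" and ".dds" suffixes. The functions metadata_fingerprint and review, and the injected fetch callback, are hypothetical; this is not how ERDDAP itself does the check.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def metadata_fingerprint(source_url: str, fetch: Callable[[str], str]) -> str:
    """Hash the DAS and DDS responses for one source URL."""
    das = fetch(source_url + ".das")
    dds = fetch(source_url + ".dds")
    return hashlib.sha256((das + "\n" + dds).encode("utf-8")).hexdigest()


def review(source_urls: Iterable[str], known: Dict[str, str],
           fetch: Callable[[str], str]) -> List[str]:
    """Return the URLs whose metadata changed since the last review.

    `known` maps source URL to the last fingerprint seen and is updated in
    place. A URL seen for the first time is recorded but not flagged
    (an assumption; flagging it as new would be equally reasonable).
    """
    updated: List[str] = []
    for url in source_urls:
        fingerprint = metadata_fingerprint(url, fetch)
        previous = known.get(url)
        if previous is not None and previous != fingerprint:
            updated.append(url)
        known[url] = fingerprint
    return updated
```

Because the state is just a URL-to-fingerprint map, it could live in a shared store so several ERDDAP nodes see the same review results instead of each crawling on its own.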
Issues:
Any instance of an ERDDAP server is set to perform this crawling operation.
We need to somehow adapt this functionality and construct a service that
does this and shares its state with other ERDDAPs.
Related to:
Registration Component
References:
http://coastwatch.pfel.noaa.gov/erddap/download/setupDatasetsXml.html
Line 234 in gov/noaa/pfel/erddap/dataset/EDDGridFromDap.java
Research
The erddap people have a project for generating datasetXml files, but they
have told us that every data source always has some specific nuance that
requires the datasetXml file to be tweaked by a human.
These configuration details will have to be obtained from the registrant.
Notification Service
Consumers will subscribe to be notified when new data is available for
particular datasets they choose to follow.
Notifications are triggered to broadcast when the Review service detects
new metadata is available. (This is done on a dataset by dataset basis)
Functionality:
Users subscribe to datasets, at our site.
Users are notified about availability;
notification by dataset name, maybe include what has changed?
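A minimal sketch of the subscribe/notify flow, assuming delivery is delegated to some send callback (email, a feed writer, etc.). NotificationService and its method names are hypothetical. Only a notice is sent, never the data itself.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Set


class NotificationService:
    """Per-dataset subscriptions; the Review service calls dataset_updated()."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send   # hypothetical delivery callback: send(user, message)
        self._subscribers: DefaultDict[str, Set[str]] = defaultdict(set)

    def subscribe(self, user: str, dataset_name: str) -> None:
        self._subscribers[dataset_name].add(user)

    def unsubscribe(self, user: str, dataset_name: str) -> None:
        self._subscribers[dataset_name].discard(user)

    def dataset_updated(self, dataset_name: str, changes: str = "") -> int:
        """Notify every subscriber of one dataset; return how many were notified."""
        message = f"New data available for dataset '{dataset_name}'"
        if changes:
            message += f": {changes}"
        recipients = sorted(self._subscribers.get(dataset_name, ()))
        for user in recipients:
            self._send(user, message)
        return len(recipients)
```

The optional changes string leaves room for the "maybe include what has changed?" idea without requiring it.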
Related to:
Review Service
Overall design questions and ideas
Should we look at this as
- a super erddap? A project based around erddap expanding and growing its
core functionality
OR - an application that uses erddap a lot. The application is primarily a
scalable web application, interfaced by humans and programs (modeling,
analysis, etc.), that delegates certain tasks to ERDDAP
Do we want it to look like erddap from the consumer's perspective?
Or should it be our own simple interface that uses erddap for a lot, but adds
functionality around it.
Answer
It is like a super erddap. It should appear as normal erddap to the consumer, with the addition of the subscription service attached to each dataset.
Is it better to have all erddap nodes be registered for all datasets, or
to allocate each publisher a pool of erddaps?
This implies that there should be a set of nodes dedicated to knowing about
the inventory and we have to make an interface of all the publishers to
choose from, which point to their erddap servers with the dataset
inventory/listing things
EBS, if I understand it right, will make recovering from a major system
failure (cloud turning off...!) much quicker if persistent data doesn't
have to be moved from S3.
Need to know how the MySQL Cluster main-memory backup/recovery system should
work with the EBS snapshots.