Saturday 22 August 2009

Objects & Metadata

As usual Dave Graham brings up some interesting and worthwhile topics in his blog post here

Now being an ex database programmer ('ex' of anything being of course the very worst and most dangerous type), and of course a storage curmudgeon, I have a passion for the topic of metadata and data. And being somebody having to deal with PBs of object data I naturally have some concerns and views here...

Now normally I agree with Dave on a lot of things - but I have to say I much prefer my scallops to be seared and served on black pudding nice and simply, letting the quality of the flavours shine.

That said I have to agree re his view of being able to segment metadata & object storage models into two areas - but do think there is a place (almost essential IMHO) for both models in future storage.

We've seen this area tackled by a number of existing technologies re CAS and object stores (Caringo CFS gateway onto Castor object layer is a good example) - but we are only just starting to see the key new elements test these, namely vast scale (think EBs), geo-dispersal/distribution/replication, and low cost.

I do also think it's worth exploring some of the possible types / layers of metadata, for me this breaks into :-
  • System / Infrastructure metadata - the metadata mandated by the storage service subsystem for every application using the service and every object held within the storage service. System metadata is under the exclusive control of the storage service subsystem, although it can be referenced by applications & users. Examples include object ID, creation date, security, hash/checksum, and storage service SLA attributes (resilience, performance etc).
  • Application metadata - This is the metadata associated with each object that is controlled and required by the application service(s) utilising the object. There may be multiple sets of application metadata for a single object, each only accessible by the approved application.
  • Object metadata - context & descriptive attributes, object history, related objects, optional user extensible metadata
I would expect all 3 of these types of metadata to be linked with every object, with at least the 'system metadata' always held locally with the object. The 'application metadata' & 'object metadata' may reside in the storage system, the storage service, the application or any combination. (In this context I refer to the storage system as an object store, and the storage service as being object store + metadata store)
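To make the three layers a bit more concrete, here is a rough Python sketch of how they might hang off a single object. All the class, field and function names (`SystemMetadata`, `StoredObject`, `store`, `sla_class` and so on) are my own illustrative assumptions rather than any real product's schema. The system metadata is frozen to reflect that only the storage service controls it, and the application metadata is keyed per application.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict
import hashlib


@dataclass(frozen=True)
class SystemMetadata:
    """Set by the storage service; readable but not writable by applications."""
    object_id: str
    created: datetime
    checksum: str   # SHA-256 hex digest of the object payload
    sla_class: str  # hypothetical SLA label, e.g. "standard" or "gold"


@dataclass
class StoredObject:
    payload: bytes
    system: SystemMetadata
    # One metadata set per approved application, keyed by application name.
    application: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # Descriptive, user-extensible attributes (the 'object metadata' layer).
    descriptive: Dict[str, str] = field(default_factory=dict)

    def app_metadata(self, app_name: str) -> Dict[str, str]:
        """Return the metadata set for one application, creating it if absent."""
        return self.application.setdefault(app_name, {})


def store(object_id: str, payload: bytes, sla_class: str = "standard") -> StoredObject:
    """Create an object and let the 'storage service' fill in system metadata."""
    system = SystemMetadata(
        object_id=object_id,
        created=datetime.now(timezone.utc),
        checksum=hashlib.sha256(payload).hexdigest(),
        sla_class=sla_class,
    )
    return StoredObject(payload=payload, system=system)
```

An application would call something like `store("obj-1", data).app_metadata("billing")["cost_centre"] = "42"`, while any attempt to assign to `obj.system.sla_class` raises an error because the system layer is frozen.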

Some of the metadata relates to the application and infrastructure architecture (eg geo-location information re object distribution & replication) whilst some of the metadata are attribute fields used within the application itself.

Given the above, it should be clear that I certainly agree with an entry Dave made in his blog comments re :-
"interesting note on ownership to which I'd say that there has to be dual ownership, one from the system level (with immutable meta such as creation date, etc.) as well as mutable data (e.g. user generated meta). The meta db then needs to maintain and track 2 different levels. Policy can affect either, fwiw."
So some thoughts about where to locate metadata as it relates to the object :-
  • As referenced above, I believe 'System metadata' must always reside with the object as it is used by the storage service for management, manipulation and control of the object itself, and to ensure its resilience & availability.
  • As has been an issue with file-systems for some time, there is always an issue with fragmentation of the underlying persistency layer with vast size differences between objects and metadata when they are tightly coupled
  • As a result of needing to traverse the persistency layer to establish the metadata, there are performance issues associated with metadata embedded within the object layer - move the metadata to a record based system and performance & accessibility can increase dramatically
  • For certain classes of use (eg web 2 etc) it's often the metadata that is accessed, utilised & manipulated several orders of magnitude more often than the objects themselves, thus the above improvements in performance and accessibility of metadata (thin SQL query etc) make major differences
  • Clearly if the metadata and objects are held separately the metadata can be delivered to applications without needing to send the objects, similarly the metadata can be distributed separately / in-advance of the object. This has major advantages for application scaling and geo-distribution.
  • With the split of persistency location / methods this also allows for security layers to be handled differently for the metadata and the object.
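As a minimal sketch of the split described in the list above, the following Python keeps objects in one place and metadata in a record-based store that can be queried without touching the objects. The in-memory dict standing in for the object store, the `metadata` table layout and the `put` / `find` helpers are all assumptions for illustration, using only the standard library `sqlite3` module.

```python
import sqlite3

# Object layer: a plain dict stands in for the object store (an assumption).
object_store = {}

# Metadata layer: a record-based store, here an in-memory SQLite database.
meta_db = sqlite3.connect(":memory:")
meta_db.execute(
    "CREATE TABLE metadata ("
    "object_id TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (object_id, key))"
)


def put(object_id, payload, attributes):
    """Write the object to the object layer and its attributes to the metadata layer."""
    object_store[object_id] = payload
    meta_db.executemany(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
        [(object_id, key, value) for key, value in attributes.items()],
    )


def find(key, value):
    """Return matching object IDs using only the metadata layer."""
    rows = meta_db.execute(
        "SELECT object_id FROM metadata "
        "WHERE key = ? AND value = ? ORDER BY object_id",
        (key, value),
    )
    return [row[0] for row in rows]
```

A lookup such as `find("tag", "holiday")` never reads `object_store`, which is the point: the thin query hits the metadata records only, and the (much larger) objects are fetched later, if at all.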
This also brings into line a question area I've been working with for over 2 years with object stores: what features & functions should live in the application layer, and what features and functions should live within the infrastructure (storage service) layer? Which areas of metadata are actually data in their own right, and which are embedded in the application logic? There appear to be no clear rules or guidelines.

If you like, this could be seen as an argument between IaaS & PaaS - and for sure the only sensible answer for a company right now is IaaS, as PaaS exposes far more of the logic, taxonomy, behaviours, trends and metadata layers to the PaaS provider than is healthy.

There is also an additional interest point re metadata - as we move from the System metadata into the Application & Object metadata, should we consider privacy and encryption of the metadata itself? (assuming that the objects will always be protected appropriately) I could see how this will be a requirement in some multi-tenancy environments and for some metadata elements...

Lastly some more questions :-
  1. How do you cover the topics of backup/recovery of the various metadata elements?
  2. How do you cope with bulk import / export of the various levels of metadata and their logical / context relationships?
  3. What standards will emerge for metadata schema definitions and attributes?
  4. What standards will emerge for policy script languages & descriptors that manipulate objects within the storage systems? (think how to describe an SLA in a programmatic language)
  5. Can security authorisation & permission tokens exist and be enforced in a separate context and control domain to the identities?
Naturally in this we're not covering any of the 'internal' metadata used by the storage system to locate objects, to handle the multiple instances of the same object within an 'object storage service' (resilience, replication etc), to enable sharding / RSE encoding of objects etc that the storage system has to cope with.
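Going back to question 4, here is one very rough idea of what "an SLA in a programmatic language" could look like. It is purely a thought experiment: the `SlaPolicy` fields, the thresholds and the `violations` check are hypothetical names I've made up, not any emerging standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaPolicy:
    """A hypothetical SLA descriptor attached to an object as system metadata."""
    name: str
    min_replicas: int         # resilience: copies of the object
    min_sites: int            # geo-distribution: distinct locations
    max_read_latency_ms: int  # performance target


def violations(policy: SlaPolicy, replicas: int, sites: int,
               read_latency_ms: int) -> List[str]:
    """Compare an object's observed state against its policy."""
    problems = []
    if replicas < policy.min_replicas:
        problems.append("replicas")
    if sites < policy.min_sites:
        problems.append("sites")
    if read_latency_ms > policy.max_read_latency_ms:
        problems.append("latency")
    return problems
```

For example, with `gold = SlaPolicy("gold", 3, 2, 50)`, an object held as 2 replicas on 1 site with 80 ms reads would fail all three checks, and the storage service could act on that list.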

Now I'm off for some lunch, I'm hungry and fancy some seafood for some reason ;)


