Derivates API: Feature list

I have done a lot of research, talked to people in the community and thought about my GSoC project. The result is a more detailed feature list. I encourage you to check it out and to suggest improvements or leave comments.

Basic features

Ability to handle derivates as managed or unmanaged files

The main difference between these two approaches is that Drupal knows about managed files but not about unmanaged ones. Managed files have their own entry in the file_managed DB table, while unmanaged files exist only in the files/ folder. The Derivates API should also know about unmanaged files, either implicitly or explicitly.

Managed files are autonomous file entities, but we must preserve a reference to the original. This concept could be similar to node translation in i18n, where every translation holds a reference to the original node. QUESTION: Do we allow derivates of derivates? If yes, there should be some mechanism to prevent "infinite loops".
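If derivates of derivates are allowed, one possible guard is to walk the chain of references back to the original and stop on a repeated file or an overly long chain. The sketch below is in Python for brevity rather than Drupal's PHP; the function name, the `parent_of` mapping and the `max_depth` limit are hypothetical and only illustrate the idea.

```python
from typing import Dict


def find_original(file_id: int, parent_of: Dict[int, int], max_depth: int = 10) -> int:
    """Follow derivate -> source references until a file with no source is found.

    parent_of maps a derivate's file ID to the ID of the file it was made from
    (hypothetical storage; a real implementation would query the database).
    """
    seen = set()
    current = file_id
    while current in parent_of:
        if current in seen:
            raise ValueError(f"reference loop detected at file {current}")
        if len(seen) >= max_depth:
            raise ValueError("derivate chain is deeper than max_depth")
        seen.add(current)
        current = parent_of[current]
    return current


# File 3 was derived from file 2, which was derived from file 1.
chain = {3: 2, 2: 1}
original = find_original(3, chain)
```

The same check could be run before a new reference is saved, so that a loop is rejected at creation time rather than discovered later.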

Examples:

  • video thumbnail: unmanaged
  • low bitrate version of a podcast: unmanaged
  • video podcast in different formats: unmanaged
  • derivates that live in different streams: managed
  • free sample video of a full version (accessible only to registered users): managed

Drupal must know about the derivate

Drupal must know that a derivate exists. This is trivial for managed derivates, but can be tricky for unmanaged ones. It would be nice to have a mechanism for locating unmanaged files implicitly (similar to Image styles, where the location of an image is defined by the style name). If it can be done this way, we do not need any DB information about these files (i.e. internal file management).
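A minimal sketch of the implicit approach, in Python for brevity rather than Drupal's PHP. It assumes a path layout modelled on Image styles, with the derivate stored under a directory named after its configuration. The function name, the `derivates/` directory and the configuration name `podcast_low` are hypothetical and not part of any existing API.

```python
from pathlib import PurePosixPath
from typing import Optional


def derivate_uri(original_uri: str, config_name: str, extension: Optional[str] = None) -> str:
    """Compute where an unmanaged derivate lives from the original's URI alone."""
    scheme, sep, target = original_uri.partition("://")
    if not sep or not target:
        raise ValueError(f"not a stream wrapper URI: {original_uri!r}")
    path = PurePosixPath(target)
    if extension is not None:
        # The derivate may use a different file format than the original.
        path = path.with_suffix("." + extension.lstrip("."))
    # Assumed layout: public://derivates/<configuration>/<original scheme>/<original path>
    return f"public://derivates/{config_name}/{scheme}/{path}"


same_format = derivate_uri("public://podcasts/episode1.mp3", "podcast_low")
other_format = derivate_uri("public://podcasts/episode1.mp3", "podcast_low", "ogg")
```

Because the location is computed rather than stored, any code that knows the original and the configuration name can find the derivate without a DB lookup.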

Include an option to delete all derivates when the original is deleted

Some sites will want all derivates deleted when their original is deleted. Since this is not always the case, it should be optional. QUESTION: How do we delete unmanaged derivates if there is no information about them in the DB?

Include an option to delete the original when derivates are created

Sometimes we do not need the original anymore once a derivate has been created. When this happens, one of the derivates should be declared the new original.

Backend engines framework

Backends must know which files they support

There is probably no engine that supports all file types, streams and file extensions. Since each engine will support only a subset of files, we need a mechanism that lets engines tell us which files they support. A file can be described using one or more of these four parameters:

  • file type (image, video, audio, other, ...),
  • file stream (public://, private://, youtube://, ...),
  • mime type (image/jpeg, image/png, ...),
  • file extension (.avi, .mp4, .pdf, ...).

Each engine will be asked to provide a set of rules that defines the subset of files it supports. Rules will be combined with conjunction (i.e. all rules must pass). QUESTION: Do we need to support disjunction?

Example 1: 

  • type: video

Example 2:

  • type: video
  • stream: youtube://

Example 3:

  • type: audio
  • mime: audio/mpeg, audio/ogg
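The rule sets in these examples could be evaluated roughly as follows. This is a sketch in Python, not a proposed API: `FileInfo` and `engine_supports` are hypothetical names. It also assumes that several values listed for one parameter (as in Example 3) are alternatives, while the parameters themselves are combined with conjunction.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class FileInfo:
    """The four parameters that describe a file."""
    type: str
    stream: str
    mime: str
    extension: str


def engine_supports(rules: Dict[str, Set[str]], file: FileInfo) -> bool:
    """True if the file passes every rule; an empty rule set accepts any file."""
    return all(getattr(file, param) in allowed for param, allowed in rules.items())


# Example 2 expressed as a rule set: videos on the youtube:// stream only.
youtube_rules = {"type": {"video"}, "stream": {"youtube://"}}

remote_video = FileInfo("video", "youtube://", "video/mp4", ".mp4")
local_video = FileInfo("video", "public://", "video/mp4", ".mp4")
```

Here `engine_supports(youtube_rules, remote_video)` passes both rules, while `local_video` fails the stream rule and is rejected.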

Try to preserve metadata while transcoding

There are two types of metadata:

  1. Metadata stored in the file itself (e.g. EXIF, ID3, ...): preserving this metadata must be handled by the engines.
  2. Metadata stored in file entity fields: we should support copying fields from the original to the derivate, but this must be optional. We should also support custom field mapping, since the mapping will not always be 1:1. Because a derivate can be of a different file type than its original, the two will not always have the same fields.
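A custom field mapping could be as simple as a dictionary from field names on the original to field names on the derivate. The sketch below is in Python and uses hypothetical names throughout (`copy_fields` and all `field_*` names). Fields that are not mapped, or that the original does not have, are simply not copied.

```python
from typing import Any, Dict


def copy_fields(original: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Return derivate field values for every mapped field present on the original."""
    return {
        target: original[source]
        for source, target in mapping.items()
        if source in original
    }


video_fields = {
    "field_title": "Episode 1",
    "field_author": "Jane",
    "field_resolution": "1080p",  # meaningless for an audio derivate, so not mapped
}
video_to_audio = {
    "field_title": "field_title",     # copied under the same name
    "field_author": "field_credits",  # copied under a different name
}
audio_fields = copy_fields(video_fields, video_to_audio)
```

An empty mapping copies nothing, which covers the case where field copying is switched off.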

Status of conversion must be accessible

The engine must be able to report the status of a conversion. This information can be used as feedback to the content creator or to modify how the content is displayed (e.g. "This video is being transcoded and will be accessible in N minutes and M seconds."; Vimeo shows something similar for videos that are being transcoded). The display itself is not the job of this module; we just need to provide the information. A set of possible statuses must be defined in advance and could look like this:

  • scheduled,
  • running, 
  • done,
  • timed out,
  • unable to run,
  • error (terminated).

Transitions between these states could also trigger hooks.
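The statuses and the hook idea can be sketched as a small state machine. This is Python, not Drupal code, and the set of allowed transitions is an assumption made for the example (the list above does not define one). `Conversion`, `on_transition` and `transition` are hypothetical names.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    TIMED_OUT = "timed out"
    UNABLE_TO_RUN = "unable to run"
    ERROR = "error (terminated)"


# Assumed transitions: every state not listed as a key is final.
ALLOWED = {
    Status.SCHEDULED: {Status.RUNNING, Status.UNABLE_TO_RUN},
    Status.RUNNING: {Status.DONE, Status.TIMED_OUT, Status.ERROR},
}


class Conversion:
    def __init__(self) -> None:
        self.status = Status.SCHEDULED
        self._listeners: List[Callable[[Status, Status], None]] = []

    def on_transition(self, listener: Callable[[Status, Status], None]) -> None:
        """Register a callback, playing the role of a hook implementation."""
        self._listeners.append(listener)

    def transition(self, new_status: Status) -> None:
        if new_status not in ALLOWED.get(self.status, set()):
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        old_status, self.status = self.status, new_status
        for listener in self._listeners:
            listener(old_status, new_status)


log = []
job = Conversion()
job.on_transition(lambda old, new: log.append((old.value, new.value)))
job.transition(Status.RUNNING)
job.transition(Status.DONE)
```

After the two calls, `log` holds one entry per transition, which is the information a display layer would need.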

Transcoding scheduling (with rules or on demand)

There should be a set of rules that defines when a conversion/transcode actually starts. This should be the job of the API itself, since it has little to do with the engines. The API should simply start the conversion process on the engine side when the time is right. There are three options:

  1. Immediate run: the asset is transcoded immediately after it becomes available to Drupal. Suitable for: video thumbnails, transcoding of small video files, triggering transcodes on a 3rd party service, ...
  2. Scheduled run: some transcodes should be run at a later time (when more CPU resources are available, at a certain time, ...). There should be a mechanism for defining these rules. Rules should probably be both time-based (run at 3:00 AM) and non-time-based (run when system load does not exceed 0.5 for 5 minutes). Suitable for: transcoding of big/high quality video assets.
  3. On demand run: an asset that is already available to Drupal is transcoded on demand. This approach is similar to Image styles/Imagecache, since the actual conversion happens upon request. The request can come from the outside (a site visitor requests a .pdf version of a document saved as .odt) or from the inside (a site editor/webmaster triggers the conversion manually in order to publish the asset in another format).
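The three options could boil down to one decision function that the API calls whenever it considers starting a job (for example on cron). The sketch below is in Python, and every name in it is hypothetical. It assumes a time-based rule is resolved to a concrete "not before" timestamp when the job is queued, and that the load rule receives the load samples collected over the observation window.

```python
from datetime import datetime, time, timedelta
from typing import Optional, Sequence


def next_occurrence(now: datetime, at: time) -> datetime:
    """Resolve a rule like "run at 3:00 AM" to the next matching point in time."""
    candidate = datetime.combine(now.date(), at)
    return candidate if candidate > now else candidate + timedelta(days=1)


def load_is_low(samples: Sequence[float], max_load: float = 0.5) -> bool:
    """True if every sample in the window stayed at or below max_load."""
    return bool(samples) and all(sample <= max_load for sample in samples)


def should_start(mode: str, *, now: datetime, not_before: Optional[datetime] = None,
                 load_samples: Sequence[float] = (), requested: bool = False) -> bool:
    if mode == "immediate":
        return True
    if mode == "on_demand":
        return requested
    if mode == "scheduled":
        if not_before is not None:        # time-based rule
            return now >= not_before
        return load_is_low(load_samples)  # load-based rule
    raise ValueError(f"unknown scheduling mode: {mode}")


queued_at = datetime(2012, 5, 1, 16, 0)
run_at = next_occurrence(queued_at, time(3, 0))  # 3:00 AM on the following day
```

Keeping this decision in the API, as the text suggests, means an engine only ever receives a "start now" call.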

Engine-specific configuration

Each engine should be able to introduce its own set of configuration parameters. This configuration should be accessible in a central configuration form, but used only by the engine.

Other features

Ability to export configurations to a feature

It would be great to be able to export derivate configurations to a feature. This would allow us to easily create predefined configurations for the most common derivates. By configuration I do not mean the backend engine, but a set of parameters that defines a unique group of derivates. Those parameters are:

  • type of files that this configuration applies to,
  • managed/unmanaged files, 
  • backend engine to be used, 
  • metadata mapping,
  • scheduling configuration, 
  • derivate specific configuration (deletion policy, ...),
  • engine specific configuration,
  • ...

Views support

Derivates should be viewable with Views.

UPnP AV media servers support

There was a request to support UPnP AV media servers. This part should be handled by backend engines and stream wrappers, and should fit into this API if it is designed well.

Resources:

Other Questions:

  • Do we define a concept of "derivate type"? Where and how can we use that? Who can define a new derivate type, and how?
  • Do we define a concept of "derivate sets"? A set is a group of derivate configurations. Some parameters seem naturally more suited to derivate sets than to the derivates themselves. Example:
    • Derivate set: Video podcast
    • Derivate configurations: h264/720p/2000kbps, h264/480p/800kbps, xvid/480p/1500kbps, ...
    • Rules that apply to sets: deletion policy (what happens when the original is deleted), conversion scheduling, metadata mapping, ...
    • Rules that apply to a derivate: engine to be used, engine-specific configuration, managed/unmanaged, derivate stream, ...
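To make the split between set-level and derivate-level parameters concrete, here is one way the "Video podcast" example could be modelled. It is only a sketch in Python: the class names, attribute names, default values and the engine name `ffmpeg` are all assumptions, and whether derivate sets should exist at all is still an open question.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DerivateConfig:
    """Parameters that apply to a single derivate."""
    name: str
    engine: str
    managed: bool = False
    stream: str = "public://"
    engine_options: Dict[str, str] = field(default_factory=dict)


@dataclass
class DerivateSet:
    """Parameters shared by every derivate in the set."""
    name: str
    delete_with_original: bool = True
    scheduling: str = "scheduled"
    metadata_mapping: Dict[str, str] = field(default_factory=dict)
    derivates: List[DerivateConfig] = field(default_factory=list)


video_podcast = DerivateSet(
    name="Video podcast",
    derivates=[
        DerivateConfig("h264/720p/2000kbps", engine="ffmpeg"),
        DerivateConfig("h264/480p/800kbps", engine="ffmpeg"),
        DerivateConfig("xvid/480p/1500kbps", engine="ffmpeg"),
    ],
)
```

With this shape, changing the deletion policy or the scheduling is a single change on the set, while each derivate keeps its own engine settings.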