Multimedia inclusion and referencing issues


These issues have been discussed in the following topics:

message/view/GOALS/29930043
message/view/GOALS/30141635
Please don’t write more about this issue there; use the Discussion tab for this page.

The following is just a first attempt to capture some of the issues.



1. Introduction

The goals say that BetterGEDCOM (BG) should include a standard container specification to accommodate ancillary multimedia resources. By multimedia we mean digital resources that may represent photos, scanned images, video, sound, documents, web pages, diagrams, maps, (databases?), etc. Importantly, we need the ability to incorporate resources in formats that are as yet unknown to us (such as emerging audio codecs). Some resources (e.g., video) may demand high performance, which means we must not introduce significant overhead for media objects.

A resource may reside in a file or in information available via an internal computer interface or via a data network.


1.1 Other wiki references:


2. User requirements

2.1 A solution should allow


2.2 A solution could also allow


3. Technical solutions

3.1 Internal files

These are files transferred together with the genealogy data file, in a Container file. (Multimedia objects are not stored in XML in the genealogy data file due to efficiency considerations.)


Discussion: One possible Container file type is zip (possibly with an internal packaging structure). The Open Packaging Conventions (OPC), which are part of OOXML (Office Open XML, aka Open XML), have been mentioned in this context, and there are most likely others, such as ODF (OpenDocument Format). Another candidate is the "ISO image" (used for images of CDs/DVDs; ISO 9660 based or UDF based?). Expertise needed! Some existing standards may be too complex, but maybe it is possible to specify restrictions on their use. Some of them may contain other useful functionality not mentioned on this page.
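
As a minimal sketch of the zip option mentioned above, assuming Python's standard zipfile module, a Container file might be assembled as follows. The member name "data.xml", the "media/" folder and the function names are hypothetical and are not part of any agreed BG layout.

```python
import io
import zipfile


def build_container(xml_text, media):
    """Return the bytes of a zip Container holding one XML file plus media.

    xml_text: genealogy data as an XML string, stored under the
              hypothetical member name "data.xml".
    media:    dict mapping a file name to its raw bytes; each entry is
              stored under the hypothetical "media/" folder.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as container:
        container.writestr("data.xml", xml_text)
        for name, payload in media.items():
            container.writestr("media/" + name, payload)
    return buffer.getvalue()


def list_members(container_bytes):
    """Return the sorted member names found in a Container."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as container:
        return sorted(container.namelist())
```

The media bytes are written unchanged, so a format that is unknown today can be carried without the Container needing to understand it.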

3.2 External references

<<Needs much more work and discussion>>

A reference needs to identify the location of the object (e.g. a URL/URI), the access method if any (e.g. HTTP), and possibly the transfer method (e.g. e-mail) and transfer time. Alternatively, it may indicate that the identifier of the object should be used to obtain this info, if the object is already stored on the receiver's system, e.g. if the submitter has previously received the media from the receiver and thus does not need to return them.


3.3 Considerations independent of “internal” and “external”

The structure in the genealogy data file that references/describes the multimedia objects should in principle be the same for internal and external objects, but some info will be different.

Both the internal and external methods may be used in the same genealogy data file.
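
One way to picture a single reference structure that serves both internal and external objects is the sketch below. All field names are hypothetical, and the rule that a location containing "://" is external is only an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaReference:
    """Hypothetical record describing one multimedia object."""
    object_id: str                       # identifier of the object
    location: str                        # path in the Container, or a URI
    media_type: Optional[str] = None     # e.g. "image/jpeg"
    access_method: Optional[str] = None  # e.g. "HTTP"; unused if internal

    def is_external(self) -> bool:
        # Assumption: a location with a URI scheme separator is external.
        return "://" in self.location
```

The same record shape is used in both cases; only the optional fields differ, which matches the idea that the structure is the same in principle while some info differs.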

4. Backward compatibility issues

None? What about GEDCOM 5.5.1?


5. Grouping of functionality/support levels

<<Something to think about later>>


Comments

greglamberson 2010-11-13T20:50:09-08:00
ISO image?
I admit this isn't an area I know a lot about right now. Certainly I'm very familiar with archival formats, but I've never dealt with archival formats from a development point of view but rather as a user.

Consider this: Many software applications burn CDs for data transport, but this data can't then be imported or exported (typically). However, an "ISO image", which is a format used to store images of CDs and DVDs, is an efficient compressed file that has its own filesystem independent of any one operating system. (File systems used for ISO images are either ISO-9660 based or UDF based, I think.)

The issue with file formats is really one about what is easier for developers to adopt. I don't think users care much, as long as it doesn't mess with the data. Obviously naming conventions would be an issue, as well as any filesystem structure or conflicts between names if any filesystem structure changes occurred.

I wonder if using an ISO image format is a good option. Anyway, I look forward to seeing what those who know a lot more about this than I do have to say.
gthorud 2010-11-14T06:44:45-08:00
Goal title and scope
I suggest that we change the title of this page to "Multimedia transfer (rather than inclusion) and referencing issues", and that we move the inclusion bullet from goal 2 to a new goal 6, which reflects the new title of the page. Inclusion is just half the scope, as I see it, and could be interpreted as inclusion in the BG file.

Comments?
greglamberson 2010-11-15T11:40:05-08:00
I think we're getting a little mixed up here.

Of course, the purpose of BetterGEDCOM in terms of what we're doing is to allow genealogists to transfer all their data and documents. HOWEVER, the first part of the standard is NOT a transfer mechanism, speaking from a technology point of view. The first part of the project involves merely a way to format data uniformly.

The "transfer" portion of the initial work is done by humans who import and export the data, put it on a flash drive and put it in their pockets. Subsequent work will involve allowing computer programs and services to directly synchronize and transfer data, depending on the rules we set up to allow such transfers. That's what transfer involves from a technology point of view.

There is no question whatsoever that getting all data and documents into and out of programs completely is what we're doing.
GeneJ 2010-11-15T12:38:17-08:00
Perhaps semantics?

Whatever the uniform data format IS, it should be capable of recognizing the media as a source and as content that seeks a citation.
jbbenni 2010-11-15T13:08:39-08:00
There's been some good discussion. I feel the need to regroup and pin down what I think is established, and what questions remain. Bear with me, here's the world according to jbenni...

Quoting Wikipedia: GEDCOM, an acronym for GEnealogical Data COMmunication, is a de facto specification for exchanging genealogical data between different genealogy software.

So BG, like OG (Old/Original GEDCOM), is a specification for data exchange. Computers typically exchange data in streams or files, and for this purpose files seem like the obvious winner. They are easily exchanged and archived. An application that is BG compliant need not use BG internally, but it must be able to import data from valid BG files and export data to valid BG files without loss. This is key.

So what kind of data to exchange? Genealogy data is, broadly, conclusions and evidence. In the early days of GEDCOM evidence was generally in the form of a textual citation to an external official record or document. Before scanners and copiers, there wasn't much choice. But now evidence is increasingly in the form of files that are digital copies of original sources. (Images of official records, PDF or scanned pages of books, digital photos of grave markers with geocodes and timestamps, video recordings of family histories as told by elders, etc.)

So it's fair to say there has been a sea-change in the amount and kind of evidence data that is included in a genealogy project, and that GEDCOM has serious limitations as an exchange format for digital evidence (in addition to its other limitations discussed elsewhere). BG must do better, and must remain a super-set of GEDCOM in this area (perhaps in all areas).

There's solid consensus that the conclusions and all the meta-data about evidence belongs in XML. It's an obvious choice. No serious dissent, and besides it's in the goals.

There's good consensus that evidence in the form of digital source files be rolled up into a (ZIP-like) container file, with provision for certain important characteristics:
1. Support native file formats for multimedia, including new ones yet to come.
2. Low overhead (both space and time efficient). This means providing selectivity for compression and encoding choices. (E.g., Do compress XML, don't compress or base64 encode video)
3. Allow "indirection" and "relative addressing" -- these are techniques to provide convenient and portable ways to reference content in the container that endure across import/export actions
4. Identifying and cataloging content
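
Point 2 (selective compression) could be sketched roughly as follows, assuming a zip-based container and Python's standard zipfile module. The extension list and the function names are assumptions made for illustration, not an agreed rule.

```python
import io
import zipfile

# Assumption: extensions whose content is already compressed.
ALREADY_COMPRESSED = {".jpg", ".jpeg", ".png", ".mp3", ".mp4", ".zip"}


def choose_compression(name):
    """Pick a zip method per member: deflate text, store compressed media."""
    dot = name.rfind(".")
    extension = name[dot:].lower() if dot != -1 else ""
    if extension in ALREADY_COMPRESSED:
        return zipfile.ZIP_STORED
    return zipfile.ZIP_DEFLATED


def write_members(members):
    """Write a dict of {name: bytes} into an in-memory zip; return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as container:
        for name, payload in members.items():
            container.writestr(name, payload,
                               compress_type=choose_compression(name))
    return buffer.getvalue()
```

Here the XML is deflated while video is stored byte for byte, so reading a media member costs no decompression and no base64 decoding.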

Although I've said it in my own words, and others may say it better, I believe the concepts above are pretty well accepted by the BG community. A couple of open questions remain:

1. Given that a container is expected to be used to hold certain digital files, is there such a thing as a BG XML file that exists outside the container structure?
2. Given the need for a container in at least some cases, is there an existing body of work we can utilize to save effort in specifying and implementing the container?

With question 1 there are multiple views already expressed in this discussion. Some suggest that BG refers to the XML part of the payload, and that the container (in whatever format is chosen) be separate from that XML. Others suggest that BG means a container that always includes an XML part, and optionally includes other files as necessary.

Although my opinion is on record, I can see pros and cons. And I remain open to persuasion. I intend to split that topic out as a separate discussion thread, and see if we can achieve consensus here.

The second question is really just a technicality. The choice of container format and technology doesn't really interact with goals, unless there are requirements implications (like does the container need to provide access control to files within it?). I will try to separate that out too, and probably put it in the sandbox since it has less to do with goals.

I admit that there may be other questions about containers and file inclusion that I'm overlooking or haven't fully recognized. Help me out by stating the questions simply, and giving them their own thread if needed.

Agree?
gthorud 2010-11-15T14:07:21-08:00
Can someone present some real-world arguments why the XML structure containing the genealogy data (and not multimedia) should NEVER be allowed to exist as a file outside a container?
hrworth 2010-11-15T14:25:08-08:00
As a simple user of a genealogy program, I want to share my research with another researcher. However the data gets from me to the other user, I hope that you techies can make that happen.

What is not fully implemented in the current GEDCOM is media files. Some programs may contain links to media.

What is more important now, is that there may be Media linked in one genealogy program to People, Places, Relationships, and Sources. Those media files need to get from one place to another.

Let's take a Marriage Certificate and a Wedding Photo. Those two media objects will be linked in one program to the People in the Photo, the Names of People in the Marriage Certificate, The Place, perhaps, and certainly they could serve as Source material.

How it gets transported, I wouldn't know, but when it arrives at the other end, it needs to be extracted and presented to the other user with the same links between the People, Places, Events/Facts, and Source material. Images and all.

One User's opinion.

Russ
greglamberson 2010-11-15T15:14:48-08:00
Russ and GeneJ,

You both bring up completely different points that are, in my opinion, far more vital: How multimedia files are associated to data uniformly.

PLEASE, as this topic has been so badly mangled, can we please just end this discussion and split this up into other topics?
gthorud 2010-11-15T15:23:10-08:00
And you suggest which topics?
gthorud 2010-11-15T18:22:12-08:00
Since you insist that anyone can do what they want on this wiki, I have re-entered the link to the multimedia page on the navigation bar.
gthorud 2010-11-15T18:36:15-08:00
The last entry was aimed at Greg. I intend to update the Multimedia page tomorrow, with a reference to the container page, and focus more on the overall issue of multimedia - an area where the container will be one instrument.
jbbenni 2010-11-16T06:46:28-08:00
I didn't intend to insist any such thing! So thanks for clarifying the intended audience of your comment.

I'm learning as we go about what is possible and what is appropriate for this Wiki. (And I'm chagrined that I have authority to create pages but not delete them, and that I can't rename my own pages either!)

Titles are important, and brevity is helpful, so I welcome you revising the title/link.
gthorud 2010-11-16T07:20:16-08:00
I have dropped the intention of updating the multimedia page for now, I think there is need for more discussion first. I don't like to see wasted work.
jbbenni 2010-11-14T06:55:58-08:00
Interesting. You are emphasizing the distinction between transfer and inclusion. Can you elaborate?

I see the issue as a need for a container file format that is:
1. Durable, for archive purposes
2. Transportable across platforms and applications for import/export data sharing
3. Self-sufficient, in that the container holds both the XML payload and the native formatted ancillary files, along with enough meta information to navigate and cross-connect the whole thing.

If I'm correct, BG isn't an application, it's a file format. In my view, that means BG IS a container file format (like ODF) that holds (among other things) an XML payload where the GEDCOM replacement resides.

Are we heading in the same direction?
greglamberson 2010-11-14T07:28:12-08:00
gthorud,
Well, for the purposes of the current project, we're not focusing on the transfer of anything. We're merely worried about what state and format things are IN. Transferring them into or out of those formats is another matter. This may seem like a silly distinction, but it is actually vitally important to stay within the scope of the project. Communications/transfer issues and application-related issues (i.e., import/export) are separate concerns.

jbbenni, I agree wholeheartedly with your point of view here.
hrworth 2010-11-14T10:03:34-08:00
Greg,

I just wanted to add a User's Opinion here.

I want everything in a program on my PC transferred to wherever I am sending the information.

Knowing that the information on my PC may be Media, that does need to be, and should be addressed separately, for now, from the rest of the information being shared.

I think that understanding of what makes up Media on my PC is important. Understanding what makes up Media on another platform is important. So, understanding what is "out there" is what is being discussed.

Russ
gthorud 2010-11-14T19:44:53-08:00
Jbbenni

You write ”BG IS a container file format (like ODF) that holds (among other things) an XML payload where the GEDCOM replacement resides.”

I don’t know if it is this that Greg agrees wholeheartedly to?

This may be a misunderstanding, or we may have different opinions about how this should be done. It may be the term inclusion (contained in the original heading) that is misleading. My intention in the first entry in this topic was to say exactly that: it is misleading. But I see that what I wrote can be interpreted in several ways.

The current state of affairs is that multimedia will not be included in the Better Gedcom FILE, which is XML encoded.

It is my understanding that the Container is not needed when the genealogy data file is not accompanied by multimedia, although the g-data could be alone in a Container – e.g. for the purpose of compression. Thus I think of a

BG-file as a Gedcom file today, with changed/extended information and xml encoding, with no multimedia within it,

and it is in my view equivalent to what I have called “genealogy data file” on the page. I don’t think one should be required to use zip or an ISO-image or whatever Container – in order to transfer only the genealogy data. Tell me if I am wrong.

One way or another, we will have to define the term BG-file and “BG-file format” in the Glossary, and use that instead of “genealogy data”. Since these are extremely important terms, I would like to have agreement on a definition before we proceed – see above.

When BG-file is defined, we can define a “BG Container file”, a “BG Container” (the “envelope”) and a “BG Container file format” (which may turn out to be zip or ISO image when precisely defined some time in the future). So a “BG Container file” consists of a “BG Container” which contains
· (?zero or) one (or more?) BG-file AND
· zero or more multimedia files.


Greg

I agree that the term transfer is not a good choice either. Suggestions? Maybe “Append” – that is the only word that my dictionary suggests, leading to “multimedia appendixes”??

I also see that the term “internal reference” could be improved because what is provided is not only reference, but also the Container, while “external reference” is only reference – and it may be desirable to get rid of the word internal altogether.

And I still think we should have a 6th goal:

BetterGEDCOM should utilize a standardized file container to hold separate appended files such as multimedia, and should allow references from within the BetterGEDCOM data to these files or other files available via e.g. data communication networks.
greglamberson 2010-11-14T20:03:49-08:00
Well, it's not imperative that I agree to it, first off.

Secondly, I think that there are many ways to handle multimedia files or other supporting files. These files can technically be embedded within XML (which is not really advisable), or we can use a container format, and I'm sure there are other options whether or not they are particularly good ideas.

It's a good thing to explore all the possible options, so I'm pretty ok with whatever directions this goes right now. We do need to consider all the possibilities, and also keep in mind that as long as it serves the purposes of the project (i.e., the data files can be included easily somehow), there may be something that is far easier for developers to incorporate, and their input, which isn't even available yet, may change direction of our ideas entirely.

For the purposes of uniformity, do we need a container format to be consistent with or without any additional files attached? I don't know. It is true that XML benefits greatly from compression, but this is the sort of thing that I think software developers will have a great deal of input on later in the process, and I'm fine with that too.

As for the language of inclusion and the goal language and so forth, I don't really think it makes a huge difference. As usual, I recommend that if you feel strongly about it, go right ahead and change/add it.
GeneJ 2010-11-15T10:43:08-08:00
I'm kind of in the transfer camp, too.

While the point may have been expressed, I'm hopeful BetterGEDCOM recognizes media for its content AND source value.

Let's make sure the sources transfer WITH the media.
jbbenni 2010-11-14T06:47:41-08:00
Examples of container file technology
BG isn't the first project that wants the benefits of XML for its core data, plus the ability to include supporting resources in their native file formats. Office suites, for example, need to be able to include photos, sound, video, and charts into documents and presentations. A container file approach supports this perfectly, with the advantage that it will work with future native file formats too.

Major examples of XML plus a container include:
1. OPC (Open Packaging Conventions), see http://en.wikipedia.org/wiki/Open_Packaging_Conventions. This container specification is part of the larger Office Open XML (OOXML or OpenXML) standard developed by Microsoft for Office 2007 and beyond.
2. OpenDocument (ODF or "Open Document Format for Office Applications"), see http://en.wikipedia.org/wiki/OpenDocument. This is a competing approach that was developed by an open consortium, adopted by OpenOffice.Org, and has OASIS standardization and pretty good worldwide support.

Because OpenOffice.Org technology was designed with Open Source licensing and cross platform support from its inception, it may be a particularly good fit for BG.

For details on the OpenDocument's container technology, see http://docs.oasis-open.org/office/v1.2/cd05/OpenDocument-v1.2-cd05-part3.odt It has provisions for compression, encryption and digital signing -- but I didn't see access control. I don't consider lack of access control for parts to be a problem, although it was mentioned as desirable in discussions. It imposes some conditions on compliance that might be beneficial (e.g., requires a "manifest" and MIME information).
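
To illustrate what a "manifest" with MIME information could look like, here is a minimal sketch using Python's standard mimetypes and xml.etree modules. The element and attribute names are hypothetical; this is not the OpenDocument manifest schema, only a picture of the idea.

```python
import mimetypes
import xml.etree.ElementTree as ET


def build_manifest(member_names):
    """Return a simple XML manifest listing each member and its MIME type.

    The "manifest"/"entry" names are hypothetical and do not follow
    the OpenDocument manifest schema.
    """
    root = ET.Element("manifest")
    for name in member_names:
        guessed, _encoding = mimetypes.guess_type(name)
        ET.SubElement(root, "entry", {
            "path": name,
            # Fall back to a generic type for formats not yet known.
            "media-type": guessed or "application/octet-stream",
        })
    return ET.tostring(root, encoding="unicode")
```

A receiver could read such a manifest first to find out which members it can handle before opening any of them.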

There's at least one discussion of how OpenDocument and OPC have some degree of interoperability, suggesting the container portions of the respective projects are in fact compatible: http://www.oreillynet.com/xml/blog/2007/07/can_a_file_be_odf_and_open_xml.html#comment-1027843

Should we promote OpenDocument to the Sandbox section as a provisional candidate for the BG container format? If not, why not?

Thanks!
greglamberson 2010-11-14T07:42:14-08:00
jbbenni,

Absolutely. I suggest you make a new page for multimedia file and container issues (or something like that) subordinate to the Sandbox page and put a link there to it. Do you know how to do this? If not, let me know and I'll do it.
jbbenni 2010-11-15T13:47:24-08:00
Is the XML inside the container, or are they separate files?
This question is split-out from a discussion in [[Multimedia File Inclusion Issues:Goal title and scope]].

There's already pretty good consensus that in some cases, notably when including digital source files, BG needs a container to encapsulate multiple files in various native formats, and to provide an address framework for them.

So is the BG XML payload always delivered inside the container file, or is the BG XML in a standalone file with the container file as an option "on the side"?

Arguments for keeping XML and the container in separate files:
1. Keep it simple. XML is recognized as the best format for the "guts" of a BG file, and containers aren't always needed.
2. Humans can read XML, but not container files. One of the best things about XML is that both humans and computers can deal with it. Especially when developing and debugging, it's much easier to look inside an XML file to see what's going on.

Arguments for having the XML inside the container:
1. The XML content can be dependent on the state of the container, and cross-linked information recorded in two separate files will inevitably get out of synch. We can eliminate complexity and fragility if BG always imports/exports to a SINGLE FILE that is complete and contains a snapshot of everything in an internally consistent way.
2. It's easier to specify, design, implement, and educate users about a single file than a situation that can create one, the other, or both.
3. Backups and automation become simpler if a single file is at issue, rather than a pair (or cluster) of interdependent files with different modification times.

Potential compromise: BG operates on containers that always include the XML, but the XML can also be emitted "on the side".

This ensures that the BG container is always complete and consistent, yet allows for the convenience of XML access in text editors and such.

Comments?
gthorud 2010-11-15T18:08:00-08:00
The title seems to be a little too limiting wrt the number of choices.

I can't see that anyone has suggested an alternative where the BG XML payload should ALWAYS be kept outside the container in a separate file, so I assume that is agreed.
The Subject line makes me think that you consider this an alternative, but that is probably not the case.

I think the user should have two or three choices

1) have the BG xml payload as a separate file outside the container (possibly no multimedia at all)

2) the payload and the multimedia are within the same container file (this is your 2nd alternative), and maybe

3) the payload is in a file (outside any container or alone in a container) and the multimedia is in a separate container file with no payload. The cross-linking becomes more complex, but one will always have to be able to reference data outside the file where the payload is, so one could investigate the possibility. It is not much more complex: it should be possible with e.g. a common unique reference within the files, linking them together. (I assume 3) is the same as your first alternative.)


The alternative I do not want to see is that (4) the payload should be required to be in a container even when there is no multimedia. One of several reasons is that there should not be a requirement that all BG implementations must support multimedia using the container - especially if the container is complex to implement.

I am sorry but I am not sure I understand what you mean by "The XML content can be dependent on the state of the container, and cross-linked information recorded in two separate files will inevitably get out of synch." What is "State" of the container? Why is it "inevitably"?
jbbenni 2010-11-16T06:41:17-08:00
Thanks gthorud, for your good comments.

I may have read too much into someone else's comments, but I did think there was interest by someone other than me in having the XML payload always outside the media container. I'm happy with taking that option off the table (as you did) unless someone makes a case for it. I favor option 2 (your numbering or mine), when both XML and media data are present. I think we agree on that. The interesting case is when there is XML payload, but no media is present. Should the XML be in a container, as a standalone file, or (perhaps strangely) offered as both?

Consider this scenario: A BG project that starts life as pure XML with no media files. Each time the BG project is archived or exchanged, the XML payload would be the only content. But eventually some media is added to the project.

Would you want to see the XML as standalone before the media is added, but inside the container afterwards?

I'd rather see the XML inside the container from the outset, even if it means the container is overkill at first.

As for my comments about the XML content depending on the state of the container and inevitably getting out of synch, I may have leapt too far. Clearly, the XML should have the ability to reference items in the container. But with that capability comes a risk: a link from the XML can be broken by a change to the container. For example, when a file in the container is renamed or removed/moved (possibly to another path within the container), the XML link can be "broken", resulting in what is sometimes called a "dangling pointer".
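
A check for such dangling pointers could be sketched as below, assuming a zip container, the XML stored under the hypothetical name "data.xml", and references written as hypothetical <media href="..."/> elements. References containing "://" are assumed to be external and are skipped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_dangling_references(container_bytes, xml_name="data.xml"):
    """Return internal references in the XML that match no container member.

    Assumptions: the XML lives at `xml_name`, and each reference is a
    hypothetical <media href="..."/> element. References containing
    "://" are treated as external and are not checked here.
    """
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as container:
        members = set(container.namelist())
        root = ET.fromstring(container.read(xml_name))
    dangling = []
    for element in root.iter("media"):
        href = element.get("href", "")
        if "://" in href:
            continue  # external reference, not checked against members
        if href not in members:
            dangling.append(href)
    return dangling
```

Because the XML and the member list are read from the same file in one pass, the check sees a single consistent snapshot of both.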

In my experience with archival systems, the best way to protect against this problem is to make sure the XML and the container directory reflect the same "state of the system". One sure fire approach is to write both at the same time, and read both together. This means they need to be effectively treated as one file, so they might as well be one file. That's the justification I'm offering.

You were right to question my earlier blanket statement, and I hope I've provided a reasonable justification (above). Thoughts?
gthorud 2010-11-16T09:18:08-08:00
Regarding the issue of whether the BG payload should be able to exist as a file outside a container, or not. The answer is very dependent on the MINIMUM possible implications of always having a container, wrt:

a) Minimum File size overhead (in terms of encoding and any extra data to transfer)

b) Minimum Processing overhead

c) Minimum implementation effort for the container, on all platforms (using no container requires no implementation effort)

d) Minimum functionality – e.g. will the container allow transfer of the BG data without compressing them.

e) Will the software used for the container be around in 30 years? Is it so simple to implement that it does not matter?

f) How easy is it to determine what “container functionality” is used in a file? Is it easy to find out if I support, or do not support, the functionality in a received file.

g) Most likely more aspects to be considered.


Re. c) Is there code publicly available for all platforms/programming environments that can be used in COMMERCIAL implementations without problems?

Re. d) For very long-term storage, the implications of a bit error are much larger for a compressed file than for a non-compressed file. Therefore compression should not be used for long-term storage of the XML-coded BG genealogy data (i.e. I am not talking about e.g. video, where the requirements are different). I am not sure, but I think the strategy in archives for long-term storage is to store the bits and pieces of a complex multimedia file as individual files and use general functionality of the archiving system to tie things together, although they will also store the complex file.
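
The point about bit errors can be tried out with a small sketch using Python's standard zlib module. The helper names are hypothetical, and zlib stands in here for whatever compression a container would actually use.

```python
import zlib


def flip_bit(data, byte_index, bit=0):
    """Return a copy of `data` with one bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def count_differing_bytes(original, damaged):
    """Count positions where two equal-length byte strings differ."""
    return sum(a != b for a, b in zip(original, damaged))


def try_decompress(data):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None
```

Flipping one bit in plain XML damages exactly one byte and leaves the rest readable. Flipping one bit in the compressed stream typically makes try_decompress reject the stream or return text that no longer matches the original.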

Also, will there be interworking problems if the sender is using more functionality in the container than the receiver supports? If there is no container, the options are fewer.

So, in order to have a meaningful discussion, one must dig into the various container alternatives, but I am reluctant to go too far into that unless we have people with some knowledge of them with us, and it could be a lot of work. Also there are other issues in BG that I would like to work on in this early phase.


Re “dangling pointers” – I think I understand your point, and agree that it is desirable to have all info in one “place”, and we are not preventing anyone from writing a consistent file with BG data and multimedia, and no dangling pointers. But, should we prevent everyone from creating a file with reference to files outside the container? I think not. I think the user should be able to choose. (E.g. many Web-links are very likely to become “dangling” after some time, but do you want to prevent inclusion of URL/URNs?)
jbbenni 2010-11-16T12:09:48-08:00
There are always tradeoffs. But most of what you mention applies to any use of a container-- whether the XML is the only thing in it or otherwise. Consideration of functionality (e.g., compression option), long term survivability, feature set, licensing, and other aspects are valid questions about ANY use of containers. They don't really bear on the question of whether the XML goes into the container in the special case when there's no other data.

Your item c) actually argues against coding for XML outside the container. That is, we can save implementation effort if XML is ALWAYS in the container, rather than having to code two different cases.

Lastly, I don't suggest that external references be restricted, and URIs/URLs are good candidates for external references. I do suggest they be handled differently than internal references, and that we plan to accommodate broken external references. But there's no excuse for a broken internal reference, and you are right that putting all the cross-linked info in one place is the way to avoid that. I think it's the best way to avoid it, and so XML belongs inside the container most of the time. I suggest that it be all the time, to reduce complexity and provide a more consistent behavior.

I don't mean to stand in the way of anyone working on other BG issues at this phase. An ongoing discussion of containers needn't be an obstacle to other work.