A proposal by G. Thorud, 27 Nov 2011.
See file link below.

Introduction


Many of the most widely used US programs have to some extent implemented support for the citation style guide "Evidence Explained" (EE). In most cases the programs have parameterized their implementations, allowing the fields that hold the information rendered in citations (Citation Elements, CEs) to be defined in a database for each Master Source Type. There are also Templates that control the output of citations in reports. At least one program exports these CEs in GEDCOM files, in a non-standard way, but other programs cannot read them. In general, it is currently not possible to exchange citation data according to EE between the major US programs using GEDCOM.
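To make the parameterization concrete, here is a minimal sketch of what a program's Master Source Type definition might look like. All field and element names below are hypothetical illustrations, not the data model of any actual product:

```python
# A hedged sketch: one Master Source Type (MST) defined as a set of
# Citation Element (CE) keys plus a template that controls citation output.
# Names are invented for illustration only.
CENSUS_MST = {
    "name": "Census record",
    "citation_elements": ["jurisdiction", "year", "page", "repository"],
    "footnote_template": "{year} census of {jurisdiction}, p. {page} ({repository}).",
}

def render_footnote(mst, ce_values):
    """Fill the MST's template with the citation-element values supplied."""
    return mst["footnote_template"].format(**ce_values)

footnote = render_footnote(CENSUS_MST, {
    "jurisdiction": "Norway",
    "year": "1900",
    "page": "12",
    "repository": "National Archives",
})
print(footnote)  # 1900 census of Norway, p. 12 (National Archives).
```

The point of such parameterization is that new source types can be added as data (a new dictionary) rather than as program changes.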

There has been much discussion on the wiki for and against Evidence Explained. I see no reason to repeat those discussions here. No matter our opinions, EE support in programs will not go away, users will continue to record data according to EE, and there is a need to exchange these data between programs. At the same time there is a need to develop guides with definitions of Master Source Types for other countries, best done by users in those countries. Some of these guides should preferably contain more generic Master Source Types and Citation Elements than EE does, along the lines envisioned by those criticizing EE. It is also desirable to be able to import data from various databases on the Internet holding meta-data about sources, databases that are in many cases not designed for genealogy – this may also create a need for sets of Master Source Types supporting these data. All of the guides will evolve over time. Genealogy programs must therefore be able to handle and exchange data adhering to several guides, and it is preferable that source data can be converted from one guide to another so that citation output remains consistent.

Since there will be many guides, the best solution is to allow programs to exchange definitions of the Master Source Types, Citation Elements and Templates, thus easing development and distribution and reducing duplication of work.

There is also a need to support exchange of citation data between users in different countries, often speaking different languages. It must therefore be possible to exchange translated definitions so that the receiver can understand the data, and can render the citations using her/his own language and cultural conventions, possibly in a different style – to the extent possible without very complex solutions.

It should also be possible to extract source and citation data from the more complex sets of Master Source Types, which can be transferred in the few fields that current GEDCOM defines, thus providing backwards compatibility – to the extent possible.

The suggested data model describes data structures that try to satisfy all these needs, and it tries to create a platform that will satisfy both those who support EE and those who do not. Continuing to debate the pros and cons of EE will not bring us forward.

The data model is independent of the method used to transfer the data. I hope we can have a discussion split into several topics rather than one big discussion. One possible realization of the model is to specify an extension to current GEDCOM to hold these data.

The model is not finished in every detail, and there are some outstanding issues, but it should provide sufficient detail for discussions.

The document can be downloaded here
Sources and Citation data model DRAFT v 0.4 27nov2011.pdf

Comments

ttwetmore 2011-11-28T20:42:51-08:00
Tom's First Comments on Draft 0.4
Some comments on G. Thorud’s “A data model for sources and citations DRAFT v.0.4 A proposal for discussion.”

I read this with respect to Better GEDCOM, trying to see how an MSTS, MST, TS, and the others fit into the Better GEDCOM model. I have come to the conclusion that many don't, and so are not manifested in BG files as records or in any other form. Which made me wonder what the purpose of this document is. I have come to the conclusion that G is here designing the architecture for a system that could be used to translate source and citation information back and forth between different citation templates, which exist in different template sets as promulgated by various organizations, and between different forms of data as found in genealogical programs that exist today. This is a noble cause, but I don't see its usefulness for the purposes of BG.

BG has its set of classes and relationships to define the objects required in a BG transmission file. BG exists in a world in which citation templates from different organizations exist. BG source records must have types that are somehow congruent to the types used by those templates. So BG needs a Source record. We define a fixed set of types for those source records, taking into account the source types used by the most important template sets, along with a way to extend the set. We define a set of attributes (which G calls citation elements) that our Source records can have, along with suggestions as to which attributes are most important for the generation of citations for sources of the different types.

And then we must define how other BG records (Persons, Events, other Sources) can refer to these Source records. DeadEnds does this with the SourceReference concept, while G does it with the poorly named Citation entity. Each of these entities contains attributes (also called citation elements by G). I have provided a detailed example in which the Source record for a journal article is structured into attributes, and in which a SourceReference from that Source record is used to refer to the journal the article came from, with the attributes that are used in the SourceReference and the attributes found in the Source record for the journal as a whole. It is at this level that BG should be concentrating, not at the gigantic level of transforming data from anybody's database into citations from anybody's template set. This latter problem is monstrous and I can't see any way that tackling it is going to advance BG. This document represents G's 0.4th stab at architecting a model for this gigantic problem, but I don't see how it helps get to a BG model.

I started out with the hope of providing a detailed, point-by-point discussion of G's document, but when I realized that the document's scope goes well beyond that of BG, and that in fact little of it bears directly on BG, I lost the motivation to do so. So here are just a few random things.

A MSTS is a concept that exists at the level of standards promulgator (EE, etc). Not a BG concept.

In my opinion the MST and MSI are defined incorrectly, based on my assumption that an MST should be the indication of a source type as used by a standards promulgator, and an MSI should be the representation of an actual source of a specific type as used by a specific promulgator. With that perspective, why does an MST have Full Name, Alternative name, Short name? They belong in the instance. This is all made more confusing since there is no good, general definition of any of these concepts. The document is written as if the readers already share an understanding of these concepts, yet I find much of this discussion complicated and using terms I simply cannot get into focus.

Why does BG need the concept of a “master” source? What is a “master source” anyway? How is a master source different from any other source? The idea of the master source seems to come from the pick lists provided by some genealogical systems (TMG?) to make it easier for a user to add references to selected sources. The word “master” has no meaning with respect to BG. Source records are Source records. If an application happens to allow a user faster access to some of the Source records, that's a nice feature of the application, but provides no justification for using the term "master" in data models.

And just some points I jotted down while reading, some redundant with the above:

1. MSTS and MST objects are outside the realm of BG.

2. MSI objects are simply Source records as they are in BG. The word “master” is confusing and should be removed. How would this model handle “non-master” sources?

3. Citation records are SourceReference records in my terminology, that is, a record that provides some additional detail as to the location of information within a source. In my opinion it is wrong to use the term citation in any way with respect to naming a BG record type. This promulgates mistakes made elsewhere and does nothing to help understanding. Justifying calling a record a citation because it contains elements that are used to generate citations is wrong. Source records are also made up of attributes that are used to generate citations. This misuse of the term citation is the source of much of the confusion around the citation concept. Shouldn’t BG try to break the circle of confusion?

4. Template Sets, Citation Templates, and Conversion Templates are outside the realm of BG. If an application supports the EE template set, this is great, but the application is responsible for knowing how to interpret those templates with respect to the source types and attributes used in the actual data. BG's responsibility ends after providing those attributes in a solid model that includes Source records and SourceReferences.

5. Citation Element Types and Citation Element Values are a complicated way of describing a basic concept -- key/value attributes that describe important properties of sources (e.g., authors, titles), and places inside sources (e.g., issue numbers, page numbers). For the purpose of BG we need a list of which key/value pairs are most appropriate for each source type (e.g., what sources need names or titles, which need authors, which need web sites, which need publication dates, etc). The Citation Element Type as defined here is overly complex and outside the realm of BG. All a Citation Element Type should be is a key (or tag) and a description of what its value should be (its semantics) so a user can decide what attributes to add to each Source record and how the user should compose the value of the attribute.

6. G's model has 10 things in solid boxes in the model diagram. Only three of them are BG concepts -- Master Source Instance, which is the Source record; the Citation, which is the SourceReference; and the Repository, which is the Repository. Some of the others are concepts of some interest in specifying some of the details of Sources and SourceReferences. Most of the boxed elements are only needed in the architecture of the gigantic translating machine that I believe is the true basis of G's model.

(I should be thankful that G calls these attributes citation elements. Family Search and others have started calling them metadata! Holy citation, Batman!)
ttwetmore 2011-12-12T10:45:49-08:00
@Tony,

I am not an expert on URNs, but I read a number of articles about them when you brought up the idea. I have come to the definite conclusion that you are right and that URNs are appropriate for expressing source types and citation elements in an elegant manner. However, I am old-fashioned enough to still want to use the structured attribute solution.

And, as you say, UUIDs are inappropriate.
ACProctor 2011-12-12T12:46:33-08:00
You wouldn't have to give up the "structured attribute" scheme Tom - this would be inclusive. :-)

My post on Dec 7th tried to describe an end-to-end scenario. The element meta-data would be fetched from some repository using the URN as a key or handle associated with the source type.

The returned meta-data (i.e. element names and types) would be "structured", and I made a case for including that structured meta-data in the BG file in order to prevent a total reliance on that source-type repository thereafter.

The report writer would use the same key - plus the user's locale and the required citation style type - in order to retrieve a formatting template. That might be something like CSL or something else - I'm not really concerned by it.

The formatting of the element values into the URN base-string to produce a single "expanded URN" was a separate use of the URN string that I feel has potential but should really be discussed in a separate thread. Where the base-string identified the source-type, the expanded string would identify the specific citation-reference. It's true that the base-string and structured-attribute elements would also perform the same function, but a single string would be useful in application contexts like a relational table or a hashtable.
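One possible sketch of that expanded-URN idea, purely for illustration (the URN namespace and the encoding scheme here are invented, not a registered or proposed syntax):

```python
# Sketch: a base URN identifies the source type; appending the citation-
# element values yields one string identifying a specific citation
# reference, usable as a database or hashtable key. The "urn:example-
# sources" namespace and query-style encoding are hypothetical.
from urllib.parse import quote

BASE_URN = "urn:example-sources:census"  # identifies the source type

def expand_urn(base, elements):
    """Append key=value pairs to the base URN, percent-encoding values
    and sorting keys so equal element sets yield equal strings."""
    suffix = ";".join(f"{k}={quote(v)}" for k, v in sorted(elements.items()))
    return f"{base}?{suffix}"

key = expand_urn(BASE_URN, {"year": "1900", "page": "12"})
print(key)  # urn:example-sources:census?page=12;year=1900
```

Sorting the keys makes the expanded string canonical, which matters if it is to be compared or used as a lookup key.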

Tony
GeneJ 2011-12-13T14:10:28-08:00
@ Tom,

A clarification.

Much earlier you wrote, "For example the Dublin Core doesn't define all that many elements and their self-described task is to be able to describe all resources."
http://bettergedcom.wikispaces.com/message/view/A+Data+Model+for+Sources+and+Citations/47134014?o=40#47529584

Dublin Core allows the elements to be reused/duplicated. There are then hosts of "Qualified Dublin Core" versions, where useful extensions develop, in part to avoid confusion when an element would otherwise be duplicated.
ttwetmore 2011-12-13T17:50:25-08:00
@GeneJ,

Thanks.

The underlying question of my message is stated simply: how many citation element types does BG need? If you have read Louis's most recent messages, you will see he believes GEDCOM's current 3 might be enough. I have suggested maybe 20. Geir has suggested at least 50. An uncritical look at EE seems to indicate hundreds. My comment about Dublin Core was to point out that its authors were also on the conservative side of the answer.

Personally, if BG uses XML syntax, I hope that we will not end up piggy-backing onto a vast number of different namespaces. If I had my druthers we wouldn't use a single one. But if we end up adopting existing standards for dates, places, names, URNs, resources, ..., this is an issue I will likely have to concede.

Thank goodness I always have the DeadEnds model to come home to!
GeneJ 2011-12-13T19:41:46-08:00

We ought to start a pool, Tom.

I developed a list some time back that contains 64 elements. The list contains GEDCOM publication data and repository data; it also contains information that will seem duplicative ("email" vs "email for private use").
https://docs.google.com/spreadsheet/ccc?key=0AhGBiJ9HyACHdHQ1aEZpRnZ6S2lCSll1UUlaRkdOQWc&hl=en_US#gid=0

We will carve that list down from 64, but I don't see how we'll get it down to 20.

Internationally, my money is on 80-100, but that number includes administrative fields and elements for ISSN, ISBN, DOI and the like. I don't take names to the atomic level (given, lastname, etc.), but we have programs right now that do. The number would drop if we group concepts for, say, different types of contributors (editor, photographer, translator... as I recall from the Zotero forums, in France there is even a term for a special editor). Likewise, it would be a smaller group if we are successful with smart fields (like "jurisdiction" and maybe access date/access year).

To put this in perspective ...

My preliminary work took Yates's "elements" to the mid-500s. He agrees with that number, but in a recent posting stated there might be 2000-3000 if he created a smart list for the "closed" standard he envisioned (i.e., custom elements not allowed).

Zotero has, give or take, 100 elements. CSL has fewer.
Zotero Fields_alpha_97-04v.xls
louiskessler 2011-12-13T23:05:18-08:00

GeneJ:

The elements in that spreadsheet alone don't work.

Your Zotero spreadsheet, on the other hand, is PERFECT!

See my post at: http://bettergedcom.wikispaces.com/message/view/Sources+and+Citations/48156128
ACProctor 2012-01-16T09:30:33-08:00
This thread seems to have dried up a little. Going back over the discussions - both here and the displaced ones - I still see lots of unanswered questions. Here would be my list. Would anyone like to add a bunch more that we'd need to resolve in this area before we can move on?

1) Do we agree on the general goal of transporting the "essence" of a citation from one program to another, without restricting the recipient's choice of printed citation style, or their regional settings?

2) Should we distinguish between source-type and citation-type? For instance, a specific source may have different modes of reference, which then implies different tags.

3) Do we agree that an "element-only" approach is the only way to achieve this?

4) Do we use a URI, or other registerable key, for identifying types?

5) Do we define some types as part of the BG standard, implying that their key is based on our domain say?

6) Do we believe in a shared set of element tags, or that each type has its own private set of tags (as with database table column names)?

7) How do we represent elements that represent lists of values, e.g. lists of authors, lists of page references, lists of dates. STEMMA suggested using JSON's list/structure syntax in the element value. This nicely divorces it from XML so that our choice of format doesn't impact the meta-data interface.

8) Do we distinguish between a source reference and an actual source, e.g. a formal census reference or the actual place where we got the image from? STEMMA calls these the definite and indefinite source.

9) Do we persist a copy of the element meta-data inside BG, along the lines recommended in STEMMA? Otherwise, interpretation of the data is reliant on there being an equivalent 'type' (with the same key) available to the recipient.

10) Should BG get involved in citation templates, and to what extent? Should we simply define a data/meta-data interface that existing template engines can program to? How do we guarantee that they can be moved from one program to another (same user) with total fidelity unless there was a common exchange format for the templates?

11) Do we really need to worry about procedural templates (e.g. ones defined around scripting) or simply provide a declarative scheme where the meta-data provides everything the template engine needs to know. I personally believe that template engines that resort to scripting are badly designed and there's no good reason for doing it that way except maybe expediency.

12) Do we need to make recommendations on the 3-tier model where the element meta-data is defined in one place, the associated UI data in another, and the citation template in a third? This would be best-practice in a globalised application but that doesn't mean vendors will do it that way.

13) One requirement for avoiding (11) is to provide more information on any personal names, e.g. the cultural "name scheme" that defines its structure? This was discussed a little under Personal Names in general.
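As an aside on item 7, the JSON-in-the-element-value approach could be sketched like this (element name hypothetical; the point is only that the list syntax is independent of whatever container format BG chooses):

```python
# Sketch of item 7: a list-valued citation element (e.g. multiple
# authors) carried as one element whose value is a JSON array.
import json

# Sending side: serialize the list into the element value.
element_value = json.dumps(["Jane Doe", "John Smith"])
print(element_value)  # ["Jane Doe", "John Smith"]

# Receiving side: parse the value back into a list.
authors = json.loads(element_value)
print(authors)  # ['Jane Doe', 'John Smith']
```

Because the list structure lives inside the value, the same convention works whether the surrounding file format is XML, GEDCOM-like tags, or something else.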

Tony
ttwetmore 2012-01-16T17:50:09-08:00
Tony, My answers to some of your questions:

1) Do we agree on the general goal of transporting the "essence" of a citation from one program to another, without restricting the recipient's choice of printed citation style, or their regional settings?

I do, but I don’t think all others do. Some want to also be able to transport the exact way to display the citation as well as the data content of the citation. Which is why some think templates should be handled by BG.

2) Should we distinguish between source-type and citation-type? For instance, a specific source may have different modes of reference, which then implies different tags.

Not sure what you mean by citation type. If you mean footnote form versus bibliography form, I don’t think this type belongs in the citation at all. It’s simply a matter of display.

3) Do we agree that an "element-only" approach is the only way to achieve this?

I do.

4) Do we use a URI, or other registerable key, for identifying types?

I don’t like the added complexity of URI. I like simple key words.

5) Do we define some types as part of the BG standard, implying that their key is based on our domain say?

Take out the term "domain" and I agree.

6) Do we believe in a shared set of element tags, or that each type has its own private set of tags (as with database table column names)?

A standard set, though some tags might be very specific to just one or two source types.

7) How do we represent elements that represent lists of values, e.g. lists of authors, lists of page references, lists of dates. STEMMA suggested using JSON's list/structure syntax in the element value. This nicely divorces it from XML so that our choice of format doesn't impact the meta-data interface.

Use the same tag more than once, that is, use multiple data elements of the same type.
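That repeated-tag alternative might look like the following sketch, using XML only as an example container (the tag names are hypothetical):

```python
# Sketch of repeating the same element tag once per value, rather than
# packing a list into a single value. Tag names are illustrative only.
import xml.etree.ElementTree as ET

source = ET.Element("source")
for name in ["Jane Doe", "John Smith"]:
    ET.SubElement(source, "author").text = name

# The receiver collects all elements of that type back into a list.
authors = [e.text for e in source.findall("author")]
print(authors)  # ['Jane Doe', 'John Smith']
```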

8) Do we distinguish between a source reference and an actual source, e.g. a formal census reference or the actual place where we got the image from? STEMMA calls these the definite and indefinite source.

User preference. BG should handle both.

9) Do we persist a copy of the element meta-data inside BG, along the lines recommended in STEMMA? Otherwise, interpretation of the data is reliant on there being an equivalent 'type' (with the same key) available to the recipient.

In my opinion, the values of all elements are just strings, though some should have "experts" to help interpret their values. So no meta-data (in the proper sense of the word, not the citation element sense of the word) needed in BG. All that is needed is recommendations in the BG standard on how certain elements values should be interpreted. Leave the rest up to the vendors.

10) Should BG get involved in citation templates, and to what extent? Should we simply define a data/meta-data interface that existing template engines can program to? How do we guarantee that they can be moved from one program to another (same user) with total fidelity unless there was a common exchange format for the templates?

No.

11) Do we really need to worry about procedural templates (e.g. ones defined around scripting) or simply provide a declarative scheme where the meta-data provides everything the template engine needs to know. I personally believe that template engines that resort to scripting are badly designed and there's no good reason for doing it that way except maybe expediency.

I don’t know what a procedural template is so I can’t help here.

12) Do we need to make recommendations on the 3-tier model where the element meta-data is defined in one place, the associated UI data in another, and the citation template in a third? This would be best-practice in a globalised application but that doesn't mean vendors will do it that way.

BG can make recommendations on anything it doesn’t actually define, especially if we think it would make its use easier.

13) One requirement for avoiding (11) is to provide more information on any personal names, e.g. the cultural "name scheme" that defines its structure? This was discussed a little under Personal Names in general.

Since this is an offshoot of 11 I can’t answer. However, unlike others, I recommend treating names as very simple lists of name parts, wherein a certain subsequence of the parts, typically a subsequence of one, be identified as the "sort key." This is obviously the surname in western names.

Tom
ACProctor 2012-01-17T08:52:46-08:00
I have different opinions on many of these. However, I just wanted to clarify a few of the topics here:-

2) What you're describing, Tom, I've called the 'citation form' elsewhere (e.g. STEMMA). By 'citation type', I meant the type of reference to the source, and suggested it is possible to have a single source with different methods of reference - possibly dependent upon how it was catalogued in different repositories. The point is that the citation-types would then require different parameters for the same source-type. It's a subtle distinction that we could "paper over".

4) A simple keyword will not hack it. I know that it will result in chaos and ambiguity.

10) I would have agreed with you only a matter of weeks ago but the question I posed above is moving me the other way. I think we need a common format for import/export of a formatting template definition. This is why I also raised the subject of 'procedural templates'.

11) A procedural template is one that involves some type of procedural scripting rather than having a purely declarative definition. If we do define an interchange format for templates - which is absolutely possible - then it would have to be declarative and be backed-up by appropriate typing and meta-data.


Did you have any questions you'd like to add to the pile for discussion later though Tom?

Tony
ttwetmore 2012-01-17T10:16:31-08:00
Tony, thanks for your patience. Responding to your response:

I have different opinions on many of these. However, I just wanted to clarify a few of the topics here:-

After reading much of your writings I do agree that we disagree on a few things. I am going to try to respond to your STEMMA model eventually. I have read it through once and need another reading or two before I can get my thoughts enough in order. Unfortunately I am now extremely busy acting as the executor for my Dad's estate, and that is forcing me to travel back and forth between my home and his quite a lot. Once I get that process under better control I hope to have more time.

2) What you're describing, Tom, I've called the 'citation form' elsewhere (e.g. STEMMA). By 'citation type', I meant the type of reference to the source, and suggested it is possible to have a single source with different methods of reference - possibly dependent upon how it was catalogued in different repositories. The point is that the citation-types would then require different parameters for the same source-type. It's subtle distinction that we could "paper over".

Thanks. That distinction is a bit subtle for me. The DeadEnds model has source records and sourceReferences. The references are structures in records that point to sources in order to indicate the location of evidence. A person record would obviously have a sourceReference to a source record. Then source records can be hierarchically arranged by allowing each source record to also contain a sourceReference that specifies where that particular source can be found in a higher-layer source. Pretty conventional ideas, I'd say.
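That hierarchical arrangement can be sketched as follows. This is my own illustration of the idea described above, not DeadEnds code; the class and field names are invented:

```python
# Sketch of the Source / SourceReference arrangement: a Source record may
# itself carry a SourceReference locating it within a higher-level source.
class Source:
    def __init__(self, title, reference=None):
        self.title = title
        self.reference = reference  # optional SourceReference to a parent source

class SourceReference:
    def __init__(self, source, detail):
        self.source = source  # the Source being referred to
        self.detail = detail  # location within it, e.g. volume/page

journal = Source("Journal of Genealogy")
article = Source(
    "Smith family origins",
    reference=SourceReference(journal, "vol. 12, pp. 34-56"),
)

# Walking up the hierarchy from the article to the journal it appeared in:
print(article.reference.detail)        # vol. 12, pp. 34-56
print(article.reference.source.title)  # Journal of Genealogy
```

A person record would carry its own SourceReference to the article, so the chain person -> article -> journal falls out of the same two constructs.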

4) A simple keyword will not hack it. I know that it will result in chaos and ambiguity.

I disagree. GEDCOM has no sweat, no other genealogical format has any sweat, and XML, as an entire concept, has no sweat with using simple keywords. I could accuse you of FUD for bringing up chaos and ambiguity without any justification!!

10) I would have agreed with you only a matter of weeks ago but the question I posed above is moving me the other way. I think we need a common format for import/export of a formatting template definition. This is why I also raised the subject of 'procedural templates'.

Geir is in favor of this also, as stated in his document. I think it’s a big mistake to conflate data with presentation in the BG format. I’m not against a joint effort to address the issue of template formats, and not against the idea of different vendors using the same template formats, or about the idea of templates that are customized (as they in fact would obviously have to be) for BG data. I can envision vendors being able to share their templates and presentation data as easily as they can share their pure genealogical data. I believe BG should limit its scope to the data layer.

11) A procedural template is one that involves some type of procedural scripting rather than having a purely declarative definition. If we do define an interchange format for templates - which is absolutely possible - then it would have to be declarative and be backed-up by appropriate typing and meta-data.

Thanks, but I’ll wait til there’s an example to grok.

Did you have any questions you'd like to add to the pile for discussion later though Tom?

Frankly, I think the topic of sources and citations has been discussed enough. I have strong opinions that I have expressed many times. I have described what I believe BG should do with respect to sources and citations as my “four tasks,” and I have found no reason to change my views there. I have proposed the DeadEnds model as a whole, and singled out the source record and the source reference as the part of that model that I believe fully supports sources and citations at the data level that I believe BG should address.

BG has so far been unable to get beyond these discussions and make effective decisions. Whether this can change is a question I fear the answer to. There are not enough qualified technical people who understand the issues associated with BG right now, and there is no defined way to proceed technically and make decisions. There are three to five people here who are willing to discuss in detail their thoughts about what the BG standard should be, but their ideas seem to be divergent enough that I wonder whether such a small group can be effective.
ACProctor 2012-01-17T10:27:32-08:00
It sounds like a tough time for you at the moment Tom. Our thoughts are with you.

No pressure to review STEMMA or respond to any of this stuff yet. I only wanted to bubble some of the bigger questions to the surface since there are a lot of strong opinions in this area still.

Tony
ACProctor 2012-06-07T12:00:58-07:00
A follow-up to my own post from Dec 12th...

Re: "Although I don't know of such a case, someone who owned a particular email address might define URIs of the form email://me.mydomain.com"

I knew I'd seen references to this sort of thing (i.e. using an email as the root of a URI namespace), but I couldn't remember where.

Well, the syntax was actually mailto:name@emaildomain. The following link gives some examples, including in XML Namespace URIs alongside the ubiquitous http: scheme.

http://tools.ietf.org/html/draft-ietf-webdav-propfind-space-00.

I also found another reference in an answer to a question (not mine) at: http://www.coderanch.com/t/124566/XML/NameSpace-URI.

I would assume the same can be done using the im: scheme to create a unique URI root.

Tony
gthorud 2011-12-05T15:31:54-08:00
I think we need to keep the conflict level down, and we should all contribute to that.

I think Tom will appreciate GeneJ's expertise wrt Sources and Citations; she has made substantial contributions to this work already. I am not sure, but I expect that what Gene is referring to is that if you leave the responsibility to each vendor to figure out how to present citations based on "elements only" (simply transferring the value and type of each CE used), they will not be able to do it, when they have not even been capable of coming up with implementations adhering to the specifications in EE. I agree with that.


More about MST independent citation rendering.

Just to recap some of what's been written above:

Tom wrote: " It frankly doesn't matter what the MST is for the source they decide to apply the special template to as long as the necessary citation elements exist. Therefore, if we support the simple notion of a generic MST and the ability to use any of the full set of citation elements with generic MSTs, we provide a fully extendable system."

GeneJ called that the "Element-approach".

Citation Style Language (and thus Zotero) does in principle use this approach, but then they have developed a scripting language (a programming language) to do the task, a language which is considerably more complex than the templates of current genealogy programs. CSL is used to define a style in a style definition file (a large number of them are currently defined), and it also relies on files that specify language/culture dependencies (but, of course, not translation in general). CSL is not extendable wrt MSTs or CETs without reprogramming the rendering (citation production) engines.

CSL-based solutions go a long way towards being MST independent, but there are dependencies, and the MST does accompany the CEs. Whether a new MST can be introduced without changes to all (or many) of the large number of style definitions will depend on the new MST (assuming that the engine would allow new MSTs). CSL will not in general handle a generic MST.

I have not proposed to use CSL, since the capabilities of templates are well known, and they can most likely be extended to provide the necessary translation capabilities. CSL would most likely require considerable extension in order to support genealogy requirements. CSL is also so complex that it is unrealistic to expect most users who currently define templates in genealogy programs to be able to use it.

MST-independent rendering of citations may work with the few CETs in current GEDCOM, but expecting it to work with a "BG Core" that could easily have 50+ CETs (CSL currently has about 50, and the number is increasing) would require very complex implementations – and I have serious doubts that it is possible. MST-independent rendering would require a lot of research before one could say that it would work, and you can never be sure that there will not be problems, since citation styles are changing and there will be different requirements around the world.

MST-independent rendering would also make it impossible to support user extensions such as new CETs. I have no expectation that a foreign vendor would change their complex implementation to support my requirements.

I doubt that smaller vendors will have the resources to implement MST-independent rendering.
ttwetmore 2011-12-05T19:17:25-08:00
@Geir, quick response.

I believe the element-only approach is the only viable approach for BG to take. If BG also wishes to participate in an effort to create a standard approach for source templates that is fine. However the primary goal of BG is to define the format of files for the full exchange of genealogical data, and the element approach provides the simplest way to do that.

I've outlined my opinion on how BG should address the source and citation area as four tasks. I believe that approach is simple enough and commonsensical enough that it has some chance of success. Frankly, upon reading your document, I don't see any plan or path for BG to take towards our goals.

Anything else I could say would be redundant.
ttwetmore 2011-12-06T05:41:00-08:00
@Tony,

www.nationalarchives.gov.uk/?census=RG9&piece=2459&folio=19&page=3

Thanks for your example of a URN for handling source and citation info. I believe I understand it now.

In my mind this is an element-only approach. One could view this URN as equivalent to:

<source type="census" subtype="uk">
  <census> RG9 </census>
  <piece> 2459 </piece>
  <folio> 19 </folio>
  <page> 3 </page>
</source>

It seems that in this kind of example URN parameters and citation elements are equivalent concepts, and that the part of the URN before the parameters is equivalent to the source type.
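The equivalence can be shown mechanically. This sketch (Python, standard library only, using Tony's example URN) splits the string into a source-type part and a set of citation elements; the interpretation of the parts is Tom's, not anything defined by BG:

```python
from urllib.parse import urlparse, parse_qs

# Tony's example URN; the part before the query string plays the role of
# the source type, and the query parameters play the role of the CEs.
urn = "www.nationalarchives.gov.uk/?census=RG9&piece=2459&folio=19&page=3"

parsed = urlparse("//" + urn)   # "//" lets urlparse treat the host as netloc
elements = {k: v[0] for k, v in parse_qs(parsed.query).items()}

print(parsed.netloc)  # www.nationalarchives.gov.uk
print(elements)       # {'census': 'RG9', 'piece': '2459', 'folio': '19', 'page': '3'}
```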

Other than the syntactic sugar, do you think there is any other semantic difference between them? Do we get anything extra from either approach?

Do you think that the more "modern" appearance of a URN with respect to the longer and possibly more unwieldy XML element structure is an advantage?

I am old-fashioned enough to still like the element structure version, but my age should not be a factor in BG's decision making!!

Thanks for your patience in providing an example.
testuser42 2011-12-06T05:51:49-08:00
I don't claim to understand all the subtleties of the problem, so excuse me if I'm being too simplistic.

What if we demonstrate the element-only approach from the back to the front: Take a complex citation (GeneJ has posted many in various pages) and then step by step show where each element of the citation shows up in the BG data. This might reassure those who still doubt that all the CEs will be transferred with an "element-only" approach.


Another idea that has been touched already: (Presentation-)Templates to BG are like CSS to HTML. One could take this analogy and have two files: the data (citation elements) is in BG, and the style rules are in an extra template style sheet (TSS?). Your app could save your preferred style-sheet, or read in a style-sheet from elsewhere (e.g. data providers like genealogical societies, or journals, or the Mormons or...). If the program does not have that capability, that's OK -- it'll just show the CEs in its own standard style. But the data is not lost.

Maybe there's a need for a second kind of template, that doesn't concern the presentation but the "grouping" and selection of CEs suited for a specific job.
This doesn't sound too complicated to me, either. It could be done similar to the "SCHEMA" that was part of GEDCOM some versions ago, defining a title, description / explanation for usage, and the CEs to be offered. This could be in another file, or combined with the style-sheet. I think such a "citation template definition" would be imported once into your software, and then you would be offered the new template with the matching fields nicely grouped. If you don't have the "citation template definition" installed, you'd have to find the right CE fields yourself. It's just a little more work, but the data will be in the CEs just the same.

I do think that the most important CEs should be defined (and explained) by us.
There should be an easy way to add new/custom CEs, like a short definition statement (name of new CE, type of CE, intended usage of CE). These new definitions should be exported somehow ("citation template definition file" or inline in BG??), and some programs will read and interpret them. But even if un-interpreted, the data in custom CEs will show up in software just like data in predefined CEs (ie, as "key-value" pair or "key-type-value").
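The "short definition statement" could be as small as this sketch; the class and field names are invented for illustration, not part of any BG proposal. Even software that cannot interpret the definition can still fall back to showing the data as a key-type-value triple, as described above:

```python
from dataclasses import dataclass

# A minimal sketch of a "citation template definition" entry: each custom
# CE carries a name, a type, and an intended-usage note. All names here
# are illustrative.
@dataclass
class CitationElementDef:
    name: str        # e.g. "folio"
    ce_type: str     # e.g. "string", "integer", "date"
    usage: str       # human-readable explanation for the user

@dataclass
class CitationElementValue:
    definition: CitationElementDef
    value: str

    def as_triple(self):
        # Fallback display for software without the definition installed
        return (self.definition.name, self.definition.ce_type, self.value)

folio = CitationElementDef("folio", "string", "Folio number within a census piece")
print(CitationElementValue(folio, "19").as_triple())  # ('folio', 'string', '19')
```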
ACProctor 2011-12-06T06:31:50-08:00
Re: "In my mind this is an element-only approach. One could view this URN as equivalent to..."

Sorry Tom but you've missed the point. It is definitely not the same as an element-only scheme.

This scheme does use citation elements, and so avoids presentation issues of formatting & styles, but it also has a URN base-string.

The base-string is important because I was suggesting it could be the key that holds the whole thing together.

Forget about parameters and parameter insertion for a second. Just imagine the URN is a textual key that connects a citation-instance to the citation-template (i.e. the element names and types) and also to the style-template (e.g. CSL or something proprietary).

In my scheme, I've made that URN base-string deal with two different requirements but a plain un-parameterised URN could still connect the parts of this proposal together (i.e. citation-instance, citation-template, and style-template).

It just seemed to me that this is what's missing in the BG proposal and it might avoid the reliance on some magic element-only search-and-select algorithm. :-)
ttwetmore 2011-12-06T06:32:09-08:00
@Testuser

I don't claim to understand all the subtleties of the problem...
I think you’ve captured the subtleties.

What if we demonstrate the element-only approach from the back to the front: Take a complex citation (GeneJ has posted many in various pages) and then step by step show where each element of the citation shows up in the BG data.
Good idea.

...Templates to BG are like CSS to HTML. One could take this analogy and have two files: the data (citation elements) is in BG, and the style rules are in an extra template style sheet (TSS?). Your app could save your preferred style-sheet, or read in a style-sheet from elsewhere... If the program does not have that capability, that's OK -- it'll just show the CEs in its own standard style. But the data is not lost.
Yes. And BG could, if it wished, participate in the effort to design the template format. That effort would result in a “TSS” as you have described.

Maybe there's a need for a second kind of template, that doesn't concern the presentation but the "grouping" and selection of CEs suited for a specific job.
I believe this is exactly the concept of the [master] source type. Each source type has a specific set of citation elements, some, I assume, might be required while others optional. Geir has these concepts in his document.

...It could be done similar to the "SCHEMA" that was part of GEDCOM some versions ago...This could be in another file, or combined with the style-sheet. I think such a "citation template definition" would be imported once into your software, and then you would be offered the new template with the matching fields nicely grouped.
Yes. This specification file could drive the UI for adding/editing source information. The application uses the source type to decide which elements to show by default on the UI screen. I’d bet that RootsMagic and TMG do this now.

I do think that the most important CEs should be defined (and explained) by us.
There should be an easy way to add new/custom CEs, like a short definition statement (name of new CE, type of CE, intended usage of CE).
I agree also. This bears on GeneJ’s comments about the 80-20 rule.
ttwetmore 2011-12-06T06:45:54-08:00
@Tony,

Forget about parameters and parameter insertion for a second. Just imagine the URN is a textual key that connects a citation-instance to the citation-template (i.e. the element names and types) and also to the style-template (e.g. CSL or something proprietary).

I still see them as equivalent. I think of the URN "textual key" as the source type, and the source type accomplishes the same two purposes that you describe. The source key ties the "citation-instance" (what I call a source reference), which actually holds the source type, to its citation-template (which I would define as the set of required and optional citation-elements/attributes/metadata that should be used for that source type), and it also ties that source type to the style-template to be used to render it as a final text string.

So I see the source type as doing the same double duty that the URN textual key does. Perhaps my thinking is still flawed though.

In Testuser's very recent post he talked about two specification files, one for style templates and one for "citation element collections." In my view the source key provides the index into both these specifications, and I think this is also equivalent to your two uses of the textual key.
ACProctor 2011-12-06T06:56:12-08:00
Sounds like we're close to a consensus here.

The advantage of a URN as a key is that it's designed to be globally unique to an "authoring body", while still allowing other bodies/companies/individuals to create their own unique keys without fear of clashing.
ACProctor 2011-12-07T14:21:55-08:00
It appears there is still some vestige of interest in this URN suggestion. However, I think my posts were viewed as being UK-centric - which is unfortunate, because I had simply plucked one UK example in a hurry.

As Tom suggested, the scheme is not that different to the one based on Citation Elements, ... except for the use of this extra URN string. I'll therefore try and come at the suggestion from a different angle to see if that helps.

I believe there is a lot of consensus on this subject already but terminology is dividing us.

Suppose an end-user wants to insert a Citation Reference in their data. They know the parameters (or element values) that will need to go with it but we won't burden the illustration by picking a precise case. Let's leave it generic. The product being used would provide some list of supported Source Type descriptions and the end-user would select one relevant to their source.

The product would then need to load the meta-data associated with that Source Type which would include the names and types of the Citation Elements. I believe that meta-data should be held in some separate file (or table) so the product needs a Source Type Name to use in the associated lookup (not the 'description' as selected from the UI). Let's not presuppose what format that name has yet... we'll come to that in a second.

Now the BG file could just store the Source Type Name and the end-user's element values, without the meta-data for those Citation Elements. However, that would render the BG file useless without the Source Type repository always being available to re-do the lookup. I think it should therefore store the names and types of the Citation Elements in the Citation Reference being inserted.

OK, now let's move on to the same user generating a report. The report writer will need to know how to format that Citation Reference for the user's locale and with their selected citation style (e.g. CMOS). Another lookup would have to be performed to get, for instance, a relevant CSL definition that would format the Citation Reference. The same Source Type Name would be used as a key here, but together with the user's locale (e.g. en_US) and the citation style (e.g. CMOS).

Right, so - give or take some debate on the technical terms - we have a rough scheme that keeps presentation information out of the BG file, and allows a source description (picked from a list) to be related back to a Source Type in a separate repository, and also to a formatting template in a different repository.
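The two lookups in this rough scheme can be sketched as a tiny renderer. Everything here is invented for illustration (the repository contents, the dotted type name, and the no-template fallback); it only shows the shape of keying a formatting template by Source Type Name, locale, and style:

```python
# Hypothetical template repository keyed by (source type name, locale, style).
# The template string and type names are illustrative, not part of any proposal.
repo = {
    ("census.uk", "en_US", "CMOS"): "{census}, piece {piece}, folio {folio}, p. {page}",
}

def render(source_type, locale, style, elements):
    template = repo.get((source_type, locale, style))
    if template is None:
        # No matching template: dump the CEs as key: value pairs so the
        # data is still visible and nothing is lost
        return "; ".join(f"{k}: {v}" for k, v in elements.items())
    return template.format(**elements)

ces = {"census": "RG9", "piece": "2459", "folio": "19", "page": "3"}
print(render("census.uk", "en_US", "CMOS", ces))
# RG9, piece 2459, folio 19, p. 3
```

The fallback branch matches the point made earlier in the thread: a program without the template still shows the CEs in its own default style, and the data survives the round trip.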

My only contribution here was to suggest using a URN as that Source Type Name. A URN is simply a unique string registered to a given group/company/individual. It looks like a URL in syntax but isn't used for accessing anything. If someone has a Web URL registered then no one else should use a similar-looking URN. The URN could be based on the registered URL for the body controlling the source (e.g. a library or archive), or it could even be a BG one for generic globally-applicable sources like 'human testimony', e.g. http://www.BetterGEDCOM.com/source-type/testimony.

My previous post appeared a bit more complex because I'd also allowed the same URN idea to be used as a "generator" to produce a single string uniquely identifying the Citation Reference. Tom is correct when he says you could equally compare all the separate element values, but if you needed a single key for the Citation Reference (e.g. to put it in a SQL table), then my parameterised URN would provide a simple way of doing it that is guaranteed not to have any clashes. However, that's a separate issue from the primary usage described above.

Tony
gthorud 2011-12-10T17:55:23-08:00
Tony,

About the use of CSL. Using our terminology, a CSL style specification is loaded independent of the MST, but it may contain MST-specific rules. The MSTs of CSL are fixed, meaning that you would have to develop a special CSL program (and syntax) for genealogy.



If I have understood this correctly, a URN would be used for the same purposes as UUIDs in my document; the important property is that it is unique.

I am not an expert in this, but I seem to remember that there are various program libraries that will generate a UUID (of various types), or you can go to a web page and have one generated - any of them will produce a number that is unique for all practical purposes. So one aspect is how easy it is to generate the UUID or URN.

Another aspect that I can think of is overhead, i.e. the length of the string - there could be a small difference between the schemes. (A UUID is 128 bits, i.e. 16 bytes in binary form.)
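Both points (easy generation and the size comparison) can be checked directly with the standard library; this sketch assumes nothing BG-specific:

```python
import uuid

# Standard libraries generate UUIDs directly. A version-4 UUID is 128
# random bits: 16 bytes in binary form, but 36 characters in its usual
# hyphenated text form, which is what matters for overhead in a text file.
u = uuid.uuid4()
assert len(u.bytes) == 16   # binary size: 128 bits
assert len(str(u)) == 36    # text size as normally written
print(len(str(u)))          # 36
```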

Then, if all URNs should have the same root, you would need to set up a scheme for distributed assignment of values. Probably a small disadvantage for the URN.

In any case I think the string after the root URL should be language independent. UUIDs already have that property.

I guess the approach chosen would be BG-wide; URNs/UUIDs are needed in other areas than Sources and Citations.

Based on my current knowledge, I would have a slight preference for UUIDs, but I have no strong opinions.
ttwetmore 2011-12-10T19:18:56-08:00
I believe we are now talking about using URNs or UUIDs for source types. There won't be that many of them, so UUIDs are, in my opinion, inappropriate; UUIDs are intended to uniquely identify very large sets of objects that may grow unpredictably. UUIDs are 36 characters long (as generally expressed).

On the other hand, UUIDs are very appropriate for identifying persons and other records in a BG transmission file. Users would never be interested in using a UUID as a way to get meaning from a person record. However, users would be very interested in seeing a short descriptive tag letting them know what kind of source record they have, instead of an indecipherable 36-random-character string.

URNs are appropriate, as they can be human readable. However, I don't see anything wrong with sticking with short strings like "book," "article," "census," "certificate," "letter," "website," and so on. We may need to add a sub-type for these; keep that option in mind. Subtypes are very easy in URNs since you can just stick in some dots to give the URN some structure.
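The dotted sub-typing can be sketched as a fallback lookup; the type names and the parent-fallback rule are illustrative assumptions, not part of any BG proposal:

```python
# A renderer could try the most specific dotted source-type key first,
# then fall back to its parents when no template matches.
def template_lookup(source_type, templates):
    """Try e.g. 'census.uk.1861', then 'census.uk', then 'census'."""
    parts = source_type.split(".")
    while parts:
        key = ".".join(parts)
        if key in templates:
            return templates[key]
        parts.pop()
    return None

templates = {"census": "<generic census template>",
             "census.uk": "<UK census template>"}

print(template_lookup("census.uk.1861", templates))  # <UK census template>
print(template_lookup("letter", templates))          # None
```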
ACProctor 2011-12-12T05:53:20-08:00
(sorry, I was away for the weekend)

There is certainly some overlap between a URN and a UUID, Tom, but still a lot of differences too. A UUID is a randomised identifier that is designed to be spatially and temporally unique. There is no meaning in the resultant identifier (as you've already illustrated) - it is simply an identifier that is unique. A URN, on the other hand, is based around an issuing authority, and that is visible in the string. The string is also extensible and can include versioning and parameters if necessary.

Hence, for this particular usage, a URN is definitely more appropriate.

However, I'm afraid I need to tighten up my terminology a little since familiarity has made me a bit lazy. What I've been describing is technically not a URN. A URN is a form of URI that specifically uses a "urn:" scheme prefix and is designed to support hierarchical naming of objects. A much-quoted example is ISBN book references. The syntax is therefore more rigid (e.g. urn:xx:yy:zz) and the allowable characters more restricted. The associated namespace also has to be officially registered, and that administrative burden has lessened their usage.

What I've described is still a URI but employed for naming purposes, and is hence not a URL. Possibly through myopia, W3C still don't have a separate category for this, although the historical use of the term URN in this context is accepted. Some material describes it as a "namespace name" but that's not universal. A really familiar example is the "URI namespace" in XML bodies, and the context of that usage has strong similarities with what I've proposed.

Essentially, this form of URI names an object, or a type, and is guaranteed to be unique by virtue of the ownership of the domain used. For instance, a URI of http://www.BetterGedCOM.com would be unique to us if we owned the BetterGEDCOM.com domain. Although I don't know of such a case, someone who owned a particular email address might define URIs of the form email://me.mydomain.com. The XML standard specifically says that the namespace URI is not designed to be dereferencable, and so the "http:" scheme prefix isn't implying any access protocol. This style of URI can be created in a decentralised manner, unlike real URNs.

Some references if anyone wants to follow up on this - sorry about the lack of "citation styles" :-)

From: http://en.wikipedia.org/wiki/Uniform_resource_name in the introduction:

The term "uniform resource name" (URN) has been used historically to refer to both URIs under the "urn" scheme (RFC 2141), which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.

From: W3C at http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html "URNs, Namespaces, and Registries"
ttwetmore 2011-12-04T12:14:34-08:00
Research we've already done indicates an element-only approach (scope of work) doesn't solve the user transfer problem because the vendors will not independently develop compatible “citation templates.” (Even when accompanied by more than 1,000 examples in a 900-page book.)

The research you have done has no bearing on this issue. You have exported data in GEDCOM format from sets of programs and then imported that GEDCOM into sets of other programs and recorded what worked and what did not in the source area. Everybody and their grandmother knows that the results would be a disaster. Making a statement that this "research" indicates that an element-only approach wouldn't work has no basis.

We have no idea what vendors would do in the face of new standards, so using what vendors do now as an argument against element-only is wholly inappropriate.

Since you clearly don't like the element-only approach, could you please define what you think is the right way to transfer source and citation information in BG format. Can you provide precise examples, in a presumed BG file format, as Louis and I have done, of how you think it should work? Have you anything constructive to contribute to this discussion?

Someone wrote about targeting the “vast majority of cases.” Believe you are better off working the 80-20 rule—target 20% of the cases that represent 80% of all the entries.

If you read my comments on "structured flexibility" above you will see that I was pointing out that a major goal of model writing is to find the correct balance between structure and flexibility. One aspect of this balance is where along the xx-xx spectrum your aiming point lies. I would say that in sources and citations things are closer to 90-10 if not 95-5.
GeneJ 2011-12-04T12:21:39-08:00
"The research you have done has no bearing on this issue. You have exported data in GEDCOM format from sets of programs and then imported that GEDCOM into sets of other programs and recorded what worked and what did not in the source area. Everybody and their grandmother knows that the results would be a disaster. "

Beg your pardon, Tom. We conducted research in each of several programs, Tom, and compared the output of those native programs.
ttwetmore 2011-12-04T13:29:16-08:00
Beg your pardon, Tom. We conducted research in each of several programs, Tom, and compared the output of those native programs.

That was my point. If you would care to explain how that research has a bearing on the element-only approach, I would be glad to hear it.
ttwetmore 2011-12-04T13:40:14-08:00
I don't know the answer to the question in Tom's second-last posting, but I expect greater than 20 and considerably smaller than in EE - but there is a difference between having 1) MSTs for mainly the US, and 2) MSTs for "the whole world", more or less. The latter might add a few.
What is important is that we envisage more general MSTs and CETs than in EE; CETs are probably the most important to keep general.


I agree, Geir. Though I do wonder why the rest of the world would need many more MSTs than North America. One would think that books, journals, certificates, land records, censuses, parish registers, are fairly universal. Are there sources of evidence in other parts of the world that are very different from those used by North American genealogists? I know about the Scandinavian “farm books.” Does this require a new MST or can it fit into a more general category like a diary or a register or a journal? I don’t know; just asking. I can imagine that there might be a few new evidence sources to worry about, but would they add very significantly to the overall number of MSTs? Do you think the fact that the non-North American world has some different types of evidence is a problem for the element-only approach?

I very much agree with your comment that the CETs are the most important to keep general. There needs to be only one “title” citation element that would be reused by any of the MSTs needing a title. One “author” CE for all. One “editor” for all. And so on. In the lists you see, you often find “bookTitle,” “articleTitle,” and so on, inflating the number of CEs unnecessarily. When appropriate you might like “shortTitle”, but we’re really not going to end up with a lot of CEs that way.
ttwetmore 2011-12-04T21:36:44-08:00
@GeneJ,

You clearly don't believe the element-only approach is the right one to take for source and citations in BG data. What do you believe is the right way?

Here are two questions I asked people to answer. Would you mind letting me know what are your answers?

1. Do you believe that fully formatted, RTF-like citation strings belong in BG data?
2. Do you believe that source templates belong in BG data?
ttwetmore 2011-12-05T04:02:06-08:00
@Geir, a couple comments on a recent post of yours...

...If a program has a preferred template/method in the appropriate language that supports the MST of an imported MSI, it should in general be expected to ignore a Template for the MST transferred together with the MSI, and this is what the exporter of the file should expect, unless there is some special agreement that would apply to all templates with corresponding MSIs in the file. The importer could at his own discretion choose to use the template in the file, if for example he does not have an appropriate template, but the exporter cannot expect this to happen without prior agreement.

This means you believe templates can be exported and imported in BG files. Which means that BG will have to create a record format for them, which means there should only be one “standard” version of the EE templates, etc., which implies there should be only one copy of those templates, which implies, practically, that BG is responsible for them. I believe including templates in the BG model is a big mistake. I would like to hear the opinions of others.

My model contains a feature that allows a Citation Instance (reference note) to point to a template that will overrule the one defined for the MST.

And this means you believe that BG Citation Instances would actually point to the templates that would interpret them into strings. I also believe this is a mistake, not only because I don’t think the templates should be in the file to begin with, but also because the linking between citations and the templates that interpret them should be done at the source type level.

If the exporter were to control which template etc. to use, reports could very easily contain inconsistent citation styles.

I agree with this, but only because in my view exporters could never dictate such things, as they could not export templates.

Transfer of overruling templates would be useful when transferring between a user's programs, but a user should not rely on such templates without prior agreement...

Don’t you think that when we start talking about “overruling templates” that we have really gone overboard for the needs of BG?

Please consider how you would feel if you were a vendor considering implementing to the BG standard. How would you feel if BG had element-only source and citation info? How would you feel if BG also included source templates and required the vendor to use those templates to generate citations? Do you really think vendors are going to be willing to put together an entirely new source handling infrastructure for BG, especially the many vendors that now only do a rudimentary job with generating citations?
ACProctor 2011-12-05T05:08:49-08:00
I appreciate this work has progressed quite far down the road but I'll try and present a quite different viewpoint in case you've missed anything folks. Please don't think I'm trying to counter everything.

I believe the templates should not be the responsibility of BG and agree with Tom there. I would hope that they would be accessible from some separate repository as either CSL or some proprietary format. Since it is the reporting engine that would need them, I believe we shouldn't mandate too much in BG.

In the data format I was developing, I had originally provided a "Mickey Mouse" <DisplayFormat> within each Citation entity. However, that would have made those Citation entities localised for presentation in a given culture - which I was trying to avoid where possible - and it didn't allow for the different citation styles in common use. I've therefore backtracked on that.

The problem seems to be that of linking the Citation entities in the data with the citation templates available elsewhere. There's no common handle! Even if we have a set of MSTs, we cannot assume there will be a finite predetermined set. If the same report is generated with an alternative template repository (with different styles), or a different localised version of such a repository, then what links the citations to their respective templates? I think I saw somewhere else in this thread a description of a complex search-and-select algorithm.

Would it not be easier to have a URN? I've tried to use a parameterised URN in my scheme to provide a way of matching two specific citations for equality, i.e. the final URN with parameters inserted becomes a unique handle for the citation reference. However, the URN base string (with its parameter markers) could also be used as a handle for selecting a template from some template repository.

The URN idea works for quite a broad range of source types, including first-hand testimony, second-hand testimony, and even family folklore [some of the more difficult ones that challenged me at first]. It depends a little on whether any of the citation elements (parameters in my scheme) are themselves localised or not. I'm still researching for examples there. Another issue would be whether we could persuade vendors to accept the notion of a URN base-string and agree on them for the very common source types. I think whichever way we go it will impact vendors somehow, so maybe a minimal upfront string like this could be a nice anchor-point that keeps the citations and their templates at arm's length.

Tony
ttwetmore 2011-12-05T09:18:33-08:00
Tony,

I understand the idea of a URN for things like ISBN's for referring to titles and ISSN for referring to serials, but I don't know much more about them. I would like to see a couple examples of what you mean.
ACProctor 2011-12-05T09:49:45-08:00
A URN, which is another form of URI, looks like a URL but isn't actually used to locate anything - I'm sure you already know that Tom but I mention it for anyone else reading this. It is simply a unique identifying string.

For instance, if a base URN representing a UK census reference was defined as:

http://www.nationalarchives.gov.uk/?census=?&piece=?&folio=?&page=?

Then you can see that it takes a number of parameters for census, piece, folio, and page. These are the citation elements in your scheme. An actual citation instance would provide associated values for one specific citation reference. If they were inserted into the base URN then it would give something like:

http://www.nationalarchives.gov.uk/?census=RG9&piece=2459&folio=19&page=3

This uniquely identifies that particular UK census page and a string comparison can be used to test if another is the same, or similar. The base URN here would have to be chosen by agreement with the National Archives. I simply picked it as an example because of their association with the records.
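The parameter-insertion step can be sketched directly. The base URN and parameter names are the illustrative National Archives example above; the fixed parameter ordering is an added assumption so that simple string equality works:

```python
from urllib.parse import urlencode

# Tony's illustrative base URN and its parameter names (assumptions for
# the sketch, not an agreed registration with the National Archives).
BASE = "http://www.nationalarchives.gov.uk/?"
PARAMS = ("census", "piece", "folio", "page")

def citation_key(**values):
    # Insert the values in a fixed parameter order, so that two citations
    # of the same page always produce byte-identical keys
    return BASE + urlencode([(p, values[p]) for p in PARAMS])

a = citation_key(census="RG9", piece="2459", folio="19", page="3")
b = citation_key(page="3", folio="19", piece="2459", census="RG9")
assert a == b   # the order the values are supplied in no longer matters
print(a)        # http://www.nationalarchives.gov.uk/?census=RG9&piece=2459&folio=19&page=3
```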

It would be equally valid to select a BetterGEDCOM one. There is no requirement for there to be a similar Web site - the essential requirement is that no one else uses the same base URN for anything. We could provide a list together with the default master source types we come up with.

As another example, consider my scheme for first-hand testimony. I might have a base URN of:

http://www.parallaxview.co/familyhistorydata/testimony/?source=?&date=?

In a related citation instance, I would provide a parameter which was a PersonRef and the result would be their canonical name being used as the URN parameter.

My suggestion was basically saying that the base URN (i.e. without inserted parameters) could also be used as a handle to tie source types to citation templates, thus allowing an easier switching of template repositories due to localisation, change of style, or change of reporting product.
rmburkhead 2011-12-05T10:29:51-08:00
Reading through the thread on this topic, I think I am starting to see some merits in both points of view: the element-only approach, and the templated approach proposed by Geir. I am coming at this having been firmly rooted in the belief that the element-only approach is the way forward, so that will color my comments.

The first question I ask myself when looking at something like this is "what problem are we trying to solve?" Here, specifically, the problem is how to transfer the data representing sources, citations from those sources, and possibly repositories holding those sources from one genealogical system to another. To me, that is the fundamental goal. On top of that, there are additional goals regarding the presentation of that data in one or more particular citation and bibliographic styles.

First off, it is my position that the presentation of citation and bibliographic data in any particular style is likely outside the scope of BG's goal of the transfer and storage of genealogical data. Presentation of data is something best left to the applications that consume data, rather than to a standard that is used to transfer data. That being said, I recognize the value in coming up with a standard method of defining and communicating the structure of templates used to organize data into these forms of presentation, and think that BG can consider that as part of its scope separate from the communication of the data itself.

So how, then, to transfer the data efficiently and without loss of semantics? The element-only approach posits that it is possible to develop a set of well-defined data elements (fields, if you will) and structures (records, I suppose) that can define a source record, and a specific citation attached to a unit of genealogical data used/derived from that source. Further, it suggests that that well-defined set of data elements can then be mapped to most, if not all, "types" of sources from which citations can be made.

The templated approach--ignoring the presentational aspects for now--suggests that a set of well-defined data elements can be developed for each "type" of source material, but that it might be unreasonable to map those to a common set of discrete data elements. It then becomes the job of the standard to communicate the data in the form or context of those templates. It also becomes the job of the standard to define those templates.

The advantage of the first approach is that there is a common and well-defined set of fields in which to fit the data for any source type and any citation from that source. The challenge is mapping data from these disparate source types to this finite number of fields. I believe that this could be accomplished, but not without some contentious argumentation beforehand. The question is: is it valuable to have these data in a common set of fields for any purpose other than avoiding an explosion of fields that define sources and citations? The advantage of the second approach is that there is no need to try to match a particular data element in one source type to an equivalent data element in another source type. In not trying to equate a data element in one source type to that in another, I think that some ambiguity can be avoided, because there would be little or no attempt to shoe-horn a tangentially related element into a common data field.

So, if we separate the presentation elements, and focus only on the transfer of data, then the "templated" approach becomes an exercise of defining the named fields for each source type and corresponding citation types. If that is then boiled down, then what we end up with is (in the parlance of "GEDCOM-classic"--but only as a frame of reference, not a suggested syntax) a bunch of tags and type definitions for source and citation data. Not too different than defining a common set of discrete data elements.

Which brings us back around to presentation. To get a set of source and citation data into one presentation format or another (Mills, APA, etc.), it is a matter of arranging the set of data elements into the order and boilerplate called for by that style. That could be done either with a common set of well-defined data elements, or with a set of data templates (not presentation templates) that define the elements for a particular source type and citation type.
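That arranging step can be sketched as follows. The element names, their values, and the template string are invented for illustration; a real style (Mills, APA, etc.) would define its own boilerplate:

```python
import string

def render(template, elements):
    """Arrange citation elements into the order and boilerplate a style calls for."""
    return string.Template(template).substitute(elements)

# Invented element values for a made-up journal article.
elements = {
    "author": "John Smith",
    "title": "A Made-Up Article",
    "journal": "New England Historical and Genealogical Register",
    "volume": "120", "year": "1966", "pages": "12-19",
}

# A hypothetical EE-like presentation template.
ee_like = '$author, "$title," $journal $volume ($year): $pages.'
print(render(ee_like, elements))
```

The same `elements` dictionary could be fed to a different template to produce a different style, which is exactly the separation between data and presentation being discussed.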

A concern was raised, early on in the discussion, about terminology. I believe GeneJ suggested that we not be too concerned with pinning down terminology at this point in the discussion, and I am willing to accept that for now. On the other hand, having a common vocabulary that is rigidly defined, unambiguous, and that leaves as little room for (mis)interpretation as possible is critical to developing a successful standard.

I see, time and time again, implementations of GEDCOM and other (non-genealogical) standards where the implementer misinterpreted the semantics of a field or a construct, and thus failed to comply with the standard. Sometimes it is the fault of the implementer, a simple mistake reading the standard. Sometimes it is the fault of the standard, leaving so much ambiguity that a reasonable reader could interpret it any of a number of ways and still be right. Strict definitions of the terminology used to describe the standard, and rigid adherence to those definitions, are the way to avoid these cases.

At some point we will need to address this issue (and it will need to be addressed over and over again). The earlier, the better, since simply discussing any aspect of BG while one party defines a term one way, and another party defines it differently, will lead us into trouble. It is too easy to participate in a discussion assuming that everyone defines a term the same way, but even subtle differences can cause us problems later on. We need to, at some point, take the assumptions out of the equation, to the extent that we can.

There has been some discussion, too, about the syntax to be used with BG: XML, JSON, "GEDCOM-classic", etc. That would be a valuable discussion to have, but I believe it is outside the scope of this specific topic regarding sources and citations. I probably have lots to say about it, but I'll refrain from doing so in this thread.

-Robert
ACProctor 2011-12-05T10:49:12-08:00
...apologies: I may have confused 'citation templates' and 'presentation templates' in my post.

Tony
ttwetmore 2011-12-05T12:02:27-08:00
Robert asks "What problem are we trying to solve?" and answers it with how to transfer the data representing sources, citations from those sources...from one genealogical system to another. He adds the additional goal of the presentation of that data in one or more particular citation and bibliographic styles.

I agree, but stress that his first answer is the key for BG; though the second is also important, it is outside the core of BG to solve.

He further says that the presentation of citation and bibliographic data in any particular style is likely outside the scope of BG's goal of the transfer and storage of genealogical data, and that presentation of data is something best left to the applications that consume data, rather than to a standard that is used to transfer data.

I agree and have argued this point consistently.

He goes on to say that there is value in coming up with a standard method of defining and communicating the structure of templates used to organize data into these forms of presentation, and think that BG can consider that as part of its scope separate from the communication of the data itself.

I take this to mean that Robert believes BG might also help define the structure of templates. I don’t agree or disagree. I view this as an associated task, better not left up to any one vendor, but not part of core BG.

Robert then summarizes the element-only approach that I agree with so won’t quote.

He then suggests that the “templated” approach suggests that a set of well-defined data elements can be developed for each "type" of source material, but that it might be unreasonable to map those to a common set of discrete data elements. It, then, becomes the job of the standard to communicate the data in the form or context of those templates. It also becomes the job of the standard to define those templates.

I don’t see any problem finding a common set of elements to be used by all templates. I see the template approach differently, as being one in which templates themselves are “records” in the BG file, so that BG files also contain the templates that users wish to use, and those templates follow their data wherever it goes. This is very different from Robert’s definition of the templated approach.

Robert’s next point, if I might paraphrase, is that the element-only approach might prove difficult because there may be so many potentially different elements needed for so many different templates that it might be easier and less argumentative to simply use source-type-specific elements. My opinion is that coming up with a single set of elements for use in all source types will not be difficult (though probably argumentative!). Obviously each template would require only a subset of the elements.

... the "templated" approach becomes an exercise of defining the named fields for each source type and corresponding citation types. If that is then boiled down, then what we end up with is ... a bunch of tags and type definitions for source and citation data. Not too different than defining a common set of discrete data elements.

I agree, but with the proviso given above that I don’t see the templated approach the same way.

To get a set of source and citation data into one presentation format or another (Mills, APA, etc.), it is a matter of arranging the set of data elements into the order and boilerplate called for by that style.

Exactly. This is the key idea that separates the concepts of elements as data and templates as presentation. No mixing.

There has been some discussion, too, about the syntax to be used with BG: XML, JSON, "GEDCOM-classic", etc. That would be a valuable discussion to have, but I believe it is outside the scope of this specific topic regarding sources and citations. I probably have lots to say about it, but I'll refrain from doing so in this thread.

Yes. This is a red herring directing attention away from what is really important. The final decision has no developmental implications, since a plug-in could be used to export and import BG data in any or all of these formats.
ACProctor 2011-12-03T15:20:02-08:00
(apologies for diverting this thread Tom)

I totally agree with your idea of a two-level scheme. In the work I've been doing previously, I'd tried to make all my top-level entities fully hierarchical, i.e. Person, Place, Event, Citation, and Resource.

I'm not sure whether this is something you'd considered but it requires a looser interpretation of a Citation to make it work there. For instance, not just citing an article in its entirety, but the publication, the repository, etc., as separate parts.

These parts can then be linked in a hierarchy rather than all references to articles in the same publication, available from the same repository, all having some duplication of data.

Am I on the right track here or have I slipped off the rails again? ;-)
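The linked hierarchy being described might look something like the following sketch, where many article records share one publication record instead of duplicating its data. The record shapes, field names, and IDs are all hypothetical:

```python
# Separate records for repository, publication, and articles,
# linked by reference rather than by duplicated data.
repository  = {"id": "R1", "name": "Example County Archive"}
publication = {"id": "S1", "title": "Some Journal", "repository": "R1"}
article_a   = {"id": "S2", "title": "Article One", "in": "S1"}
article_b   = {"id": "S3", "title": "Article Two", "in": "S1"}

records = {r["id"]: r for r in (repository, publication, article_a, article_b)}

def chain(rec_id):
    """Walk up the hierarchy from an article to its repository."""
    rec = records[rec_id]
    parent = rec.get("in") or rec.get("repository")
    label = rec["title"] if "title" in rec else rec["name"]
    return [label] + (chain(parent) if parent else [])

print(chain("S2"))  # ['Article One', 'Some Journal', 'Example County Archive']
```

Both articles resolve to the same publication and repository records, so a change to the journal's details is made in exactly one place.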
ttwetmore 2011-12-03T15:29:39-08:00
AC,

I think you are on track (and I don't think you've slipped off any rails yet!). Very nice to have more thoughtful voices addressing the various issues. I have no problems with you bringing old threads back to life or twisting threads in new directions. New blood usually offers new perspectives!
GeneJ 2011-12-03T21:40:00-08:00
@ Louis,

You wrote, “if programs properly used this general structure with BetterGEDCOM improvements, then ALL the source [reference] data would be correctly transferred. Solving documented problems.”

Thank you for helping us focus attention on defining the problem(s) to be solved.

I've written in the past that if we can't transfer the elements, we can't possibly generate programmed citations, so I don’t mean to minimize the importance of moving the element data. I'm not sure, though, which documented problems we think an "element-only" transfer will fix/solve? If you mean the documented GEDCOM transfers, we know now that those transfer issues are only a symptom of the problem.

Users who take the time to develop well-formed citations in any particular program want those citations to survive the transfer.

Every time I write a blog, send an E-mail, produce a family group sheet, digitize a photo, submit a query … I need access to well-formed citations, and I turn to my genealogy software. Folks like me spend the time filling out the forms (and in my case, typing in all the bits that make up the “citation template”) so the software will produce those citations on command. If I’m going to lose the functionality of those elements on transfer, why bother with the forms—just free-form the citations. That’s Randy Seaver’s solution under GEDCOM. What would change under the element-only approach?

What am I missing? --GJ
ttwetmore 2011-12-04T01:58:13-08:00
"Element-only" means that BG transfers all citation elements, which means the BG data holds all the citation elements. All the data necessary to construct all the citations is always available in a BG archival file or a database based on the BG model. All, all, all, all, all. It also means that BG data does not contain nor does it transfer source templates or RFT-formatted citations strings that are "report-ready."

It is up to the applications to use the elements to generate the formatted citation strings that appear in reports. They do that by applying source templates to the elements found in the BG data. The applications do this. The applications let the users choose which templates to apply. The application holds the software that applies the templates to the BG elements, so the application is responsible for generating the fully formatted citation strings to be used in reports (blogs, email, family group sheets, queries, big reports). Do you or do you not agree with this idea? This is the guiding principle that we've been talking about for months, but you seem to either not agree with it or we haven't managed to explain it well enough.

Users who take the time to develop well-formed citations in any particular program want those citations to survive the transfer.

This can mean two things. If it means the users entered values for all the relevant citation elements (metadata) when they were filling out the source forms provided by their program, then those elements survive the transfer to any BG compliant program. If you mean the user actually composed their own citation strings, by fully typing in an RTF-like string by hand, putting the punctuation and quotes and font changes in the right places, then that will not survive the transfer unless the user puts the string into a BG note structure. If you believe that these formatted strings belong in a BG transfer file then you are far from the consensus that stylistic info (particularly templates) should be APPLIED to BG data, not CONTAINED in BG data.

Maybe we should all answer two questions:

1. Do we believe that fully formatted, RTF-like citation strings belong in BG data?
2. Do we believe that source templates belong in BG data?

In my opinion the answer to both these is a resounding "no."

Every time I write a blog, send an E-mail, produce a family group sheet, digitize a photo, submit a query … I need access to well-formed citations, and I turn to my genealogy software. Folks like me spend the time filling out the forms (and in my case, typing in all the bits that make up the “citation template”) so the software will produce those citations on command. If I’m going to lose the functionality of those elements on transfer, why bother with the forms—just free-form the citations. That’s Randy Seaver’s solution under GEDCOM. What would change under the element-only approach?

Nothing changes, and the element-only approach is indeed your salvation. Your program, upon importing BG data, gets all the citation elements included in that data. Your program then generates the citation strings that you need from those elements.
gthorud 2011-12-04T07:34:49-08:00
Tom,

How will you handle the MSTs that don't fit in your 20 or so MSTs?
ACProctor 2011-12-04T08:11:48-08:00
I was wondering about that too. There's a lot of structure in your design, Tom; does that make it too rigid? I don't want to sound negative, but I was thinking not just about organisations adding other source types that aren't in the MSTS but also knowledgeable users.

I wouldn't expect a product to ONLY acknowledge the pre-defined MSTs but supposing they don't provide any way of adding new ones. How big a job would it be to do it manually?
ACProctor 2011-12-04T08:33:46-08:00
Sorry, I meant "...the design...". No offence intended Geir :-)
ttwetmore 2011-12-04T09:02:08-08:00
How will you handle the MSTs that don't fit in your 20 or so MSTs?

Task 3 is the job of deciding the MSTs BG should support. The goal would be to define types for the vast majority of cases. Whether the number is 20 or 40 or 60 we would determine then. We might decide we need a type/subtype structure at that point. I think that many of the fine shades of difference between types of sources do not have to be encoded in the types or the citations. The same elements and the same templates may apply to many of the 350 or so lines in the Yates list, and so on. For example, the Dublin Core doesn't define all that many elements, and its self-described task is to be able to describe all resources.

For future MSTs that don't fit into any of our categories we make sure there is a user extension avenue available. Our goal would be to not require this avenue very often.

How many meaningfully different MSTs do you think there are?
ttwetmore 2011-12-04T09:26:07-08:00
There's a lot of structure in your design Tom; does that make it too rigid? I don't want to sound negative but I was thinking not just about organisations adding other source types that aren't in the MSTS but also knowledgeable users.

I see an important aspect of modeling as finding the right balance between structure and flexibility. Allowing a reasonable number of MSTs and a full set of citation element types seems to me the right balance between the two. Leaving too much flexibility, i.e., requiring users and organizations to create user defined MSTs and element types in many situations, leads to many problems, most notably the inability to share data between programs, which is the main problem BG was formed to solve. Yet too little flexibility forces users to pound round pegs into square holes in deciding how to enter data. I can only express my opinion that finding the right compromise for these issues is a major part of the "craft" of good model building. I do believe that user extensions must be allowed for, but that a good design minimizes the need to use them.

I wouldn't expect a product to ONLY acknowledge the pre-defined MSTs but supposing they don't provide any way of adding new ones. How big a job would it be to do it manually?

Bear with me a moment. A source template is basically a pattern that defines the set of citation elements to go looking for, and how to format the element values (the order in which they appear, the punctuation between them, italics & boldface, etc.). If an organization needs a new, special citation format, it can define a new source template that brings together the specific elements. It frankly doesn't matter what the MST is for the source they decide to apply the special template to, as long as the necessary citation elements exist. Therefore, if we support the simple notion of a generic MST and the ability to use any of the full set of citation elements with generic MSTs, we provide a fully extendable system. Even better, we can allow the user to name these generic MSTs. With this approach, the full data can still be shared between systems, because there is really no user extension involved, and if the templates are also shared, importing programs can also generate the same citation strings as the exporting program.
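A generic MST of this kind might be sketched like so. The type label, element names, and template string are all invented for illustration, not proposed definitions:

```python
# A source of an otherwise unsupported type, transferred as ordinary
# element data under a user-named generic MST, so nothing is lost even
# though the type is not predefined.
source = {
    "mst": "GENERIC",
    "mst_name": "Parish Fire-Insurance Ledger",  # user-chosen label
    "elements": {                                # drawn from the common element set
        "title": "Ledger of the Parish of X",
        "repository": "Example Diocesan Archive",
        "date_range": "1820-1851",
    },
}

# An importing program that has (or is given) a template for this generic
# type can format it; one that does not still keeps all the element data.
template = "{title} ({date_range}), {repository}."
print(template.format(**source["elements"]))
```

The key property is that the record round-trips between programs intact whether or not the receiving program knows how to format it.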
gthorud 2011-12-04T09:44:38-08:00
I don't know the answer to the question in Tom's second-last posting, but I expect greater than 20 and considerably smaller than in EE - but there is a difference between having 1) MSTs mainly for the US, and 2) MSTs for "the whole world"; the latter might add a few.

What is important is that we envisage more general MSTs and CETs than in EE, CETs are probably the most important to keep general.

The escape mechanism is already in my model called MSTS/MST/CET/TS and templates, corresponding to the functionality in several major programs.
gthorud 2011-12-04T09:53:25-08:00
About transfer and selection of citation templates for output in reports.

If I have understood GeneJ correctly, this is partly in response to her last posting, and Tom's reply to it.

In general the importing program should be the one to control how Citations are rendered in reports etc., and will be the one that selects the template or other method. If a program has a preferred template/method in the appropriate language that supports the MST of an imported MSI, it should in general be expected to ignore a Template for the MST transferred together with the MSI, and this is what the exporter of the file should expect, unless there is some special agreement that would apply to all templates with corresponding MSIs in the file. The importer could at his own discretion choose to use the template in the file, if for example he does not have an appropriate template, but the exporter cannot expect this to happen without prior agreement.

My model contains a feature that allows a Citation Instance (reference note) to point to a template that will overrule the one defined for the MST. This template could, for example, contain just text to be rendered in the citation. The same rule as in the previous paragraph must apply to these overruling templates, so the importing program could choose to use its own template, the one supplied in the file for the MST, or the overruling one.

The user of the importing program should be able to control what is to happen.

If the exporter were to control which template etc. to use, reports could very easily contain inconsistent citation styles.

Transfer of overruling templates would be useful when transferring between a user's own programs, but a user should not rely on such templates without prior agreement. The capability to transfer CEs containing prefixes or suffixes (cf. also notes) for the citation, which could be expected to survive import, could perhaps be a compromise.
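The precedence rules in the preceding paragraphs might be sketched as follows. The function and its parameters are hypothetical, not part of any proposal; the point is only that the importing program, under user control, makes the choice:

```python
# The importer chooses between its own template, the MST template
# transferred in the file, and an overruling template attached to the
# citation instance. The default expectation is the importer's own.
def choose_template(importer_template, file_mst_template, overruling_template,
                    user_preference="own"):
    if user_preference == "overruling" and overruling_template:
        return overruling_template
    if user_preference == "file" and file_mst_template:
        return file_mst_template
    if importer_template:
        return importer_template  # the default the exporter should expect
    # No appropriate local template: fall back to what the file supplied.
    return overruling_template or file_mst_template

print(choose_template("importer-tmpl", "file-tmpl", "override-tmpl"))
```

Because the importer's template wins by default, reports keep a consistent citation style, which is exactly the concern raised about letting the exporter control rendering.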
GeneJ 2011-12-04T11:16:03-08:00
Thank you for your comments.

Research we've already done indicates an element-only approach (scope of work) doesn't solve the user transfer problem because the vendors will not independently develop compatible “citation templates.” (Even when accompanied by more than 1,000 examples in a 900-page book.)

Someone wrote about targeting the “vast majority of cases.” I believe you are better off working the 80-20 rule: target the 20% of the cases that represent 80% of all the entries.
ACProctor 2011-12-03T09:18:41-08:00
Apart from the group being called BetterGEDCOM, what is the rationale for compatibility with GEDCOM?

XML is a modern international standard with generic tools available for verification and conformance to a schema. There are also intelligent editors and viewing tools for it.

If BG is worth its salt then it will be far better than GEDCOM in scope, flexibility, and robustness. That means you cannot convert BG to GEDCOM without significant data loss.

I agree that migrating GEDCOM data to BG is a must, but what is the value in giving BG a GEDCOM-like syntax? Being familiar with the existing GEDCOM format sounds like a red herring, since the conversion process will go via an in-memory data model, not directly from file-to-file.
GeneJ 2011-12-03T09:55:28-08:00
@Tom, I see.

Some comments, intended to be helpful in this particular context.

(1) I would find it odd to see a "master source" at the level of the journal publisher. It adds a needless level and typically would require a fair amount of programming to produce a good entry in the bibliography.

(2) I suppose some might record the dates a rare journal title was published, but I don't see a reason for such notation in this case. I see no reason to ask folks to research an entire journal's publication history to enter the information about a journal article.

(3) The journal title is _New England Historical and Genealogical Register_. (Have lost count of the times I mis-represented that title below as NEHGR.)

(4) NEHGR frequently publishes serialized articles, and had this been a real article, it might have appeared over different volumes and different issues--some of which might continue pagination and others that might not. Actually, that is one of the arguments FOR using templates--many users find it a burden to independently develop and maintain ways to consistently report on issues.

(5) As we have discussed before on the wiki, reporting about repositories for most published materials is a needless time and attention burden producing information irrelevant to those who don't live in that immediate vicinity. Similar logic would apply to recording a call number. Some users may choose to record these kinds of personal details, but we shouldn't suggest they form the basis of meaningful reference notes and bibliographic lists.

(6) You provided fields to present various details about the journal publisher. Although there are exceptions, modern citations don't usually make reference to the publishers of a major journal. (NEHGR would be considered a major journal.) See EE (2007) 796. As below (no. 7), templates help folks quickly identify the fields that need to be recorded so they don't spend their time and effort recording details that are not needed.

(7) Recording volume numbers in Roman numerals and identifying an issue by the issue number is not the current best practice. See EE (2007), 86 for practices about Roman numerals and p. 798 for modern presentations of volumes and issues. If we applied modern techniques to the made-up reference in your example, we would present the volume number in Arabic numerals, and the NEHGR issue would have been identified by a month and year. (Identifying the year of publication is separately helpful for copyright matters.)

(8) Titles of articles are frequently long, and many contain subtitles that make them longer still. Your example made no provision for a shortened title.

(9) Your example didn't provide for an ISSN or a DOI. The former is a reasonably long-standing standardized identifier for journals; the latter is more modern and selective.
ttwetmore 2011-12-03T10:15:41-08:00
The key point about GEDCOM is that it is both a syntax and a semantics. It is GEDCOM as a semantics that is 99% of the problem with GEDCOM. The syntax of GEDCOM is isomorphic to XML, so anything XML can do GEDCOM syntax can do too. This point of GEDCOM as syntax vs GEDCOM as semantics seems to be subtle unless you are used to dealing with languages and writing parsers, and so on.
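The isomorphism claim can be illustrated with a toy converter. This is only a sketch that handles simple `level tag value` lines (no XML escaping, no cross-references), not a full GEDCOM parser:

```python
# GEDCOM level numbers encode the same nesting that XML encodes with
# paired tags, so a mechanical translation between the two is possible.
def gedcom_to_xml(text):
    out, stack = [], []
    for line in text.strip().splitlines():
        level, tag, *value = line.split(" ", 2)
        level = int(level)
        while len(stack) > level:            # close elements deeper than this level
            out.append("</%s>" % stack.pop())
        out.append("<%s>%s" % (tag, value[0] if value else ""))
        stack.append(tag)
    while stack:                             # close whatever remains open
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(gedcom_to_xml("0 SOUR\n1 TITL Some Journal\n1 REPO\n2 NAME An Archive"))
```

The level numbers play exactly the role of XML's nesting tags, which is the sense in which the two syntaxes are interchangeable; the semantics (what SOUR, TITL, etc. mean) are a separate question.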

To say "you cannot convert BG to GEDCOM without significant data loss" is only true if you are talking about a GEDCOM semantic standard such as version 5.5. This is not what Louis and I mean when we talk about GEDCOM in this context. We are talking about using the GEDCOM syntax as the archival format for a much expanded genealogical data model. Louis likes this idea because we would keep all the parts of GEDCOM 5.5 that can move into the new model, and then extend that model with all the new record types and attribute types needed for the new model. This makes BG more compatible with current systems if it is undertaken as a sequence of steps that builds upon the established base. The idea being that vendors could continue with much of their current code base.

Certainly XML is now the big gorilla, and Louis and I both know that suggesting that BG use GEDCOM syntax rather than XML syntax is a lost cause. (At least I know that ;) ). But just like XML has tools and editors and verifiers, etc, GEDCOM is so old and so used, that it too has all those accoutrements, available in all languages on all operating systems. I have written GEDCOM parsers and validators up the kazoo over the past twenty years (in C, Java, C++, Objective-C, for UNIX, Windows, Mac), as have many others, and there are many sophisticated tools that exist for handling GEDCOM. For example, GEDCOM validators often check for cycles in the ancestry tree (I am my own grandpa), often check for date inconsistencies (both in life spans, and in terms of trivial things like parents being born after their children, and so forth), and many other things. This kind of deep semantic checking is outside the realm of automatic schema checking you can get with XML validators.

I am not making a strong plea that we don't use XML. I am simply expressing the opinions of an old developer who has been intimately involved with the computing world for 45 years. XML is a dominating trend now. It is as much hype as anything else. It has eliminated almost entirely all the custom, "little languages" that we used to write for special purpose compilers and specification tools. It is touted as the end-all and be-all, final specification language that can be used for every application, and the encoding language of choice for all computer applications. Never mind that a special purpose specification language might be much easier to write, read, understand, and be a tenth the size of the XML specification. There is a backlash underway against this Microsoftian trend of XML, as currently expressed in JSON and Google protocol buffers and the many developers who still know the value of special purpose specification languages.

Have you ever written an XSLT program? XSLT is a programming language written in XML. It is the language you are "supposed to use" if you need to do anything other than the most trivial processing of data in XML format. It is the most unwieldy programming language in the world. It represents the logical conclusion of the XML trend to create every language application as an XML schema. I have had to write a number of these programs to be consistent with genealogical systems that use a very complex XML format for their data and require pipeline tools that stream XML elements from one XML schema to another. I could go on, but the point is that an XSLT program will be ten times the length of a normal program, will be almost impossible to understand, will take many times the time and effort to develop as a normal program would, and so on. But XSLT, for an XML aficionado, is used as an example to demonstrate the universal power of XML.
ttwetmore 2011-12-03T10:49:36-08:00
Responses to GeneJ’s comments on my journal article example. In deference to Geir I have removed GeneJ’s comments but left her numbers so you can refer to them.

(1)
The publisher isn’t the “master source;” the journal is. It is not needless if there are many articles referenced from it.

(2)
These are example citation elements that are useful in the case of a journal and an article. This is an example, not a formal specification of exactly what citation elements to use in every situation. That is tasks 3 and 4, as I have outlined in the 4 tasks we need to accomplish.

(3)
Thank you, doesn’t bear on my points.

(4)
My approach is completely usable by templates; that is one of its main points. There is nothing technically hard about applying a template to a two-level hierarchy, if this is what you are concerned about.

(5)
This is an example. I am showing where a repository record would fit in the scheme. There is no requirement to use them.

(6)
This is an example. At this level I don’t care about the nuances of the old ways and the new ways, or the differences between major and minor journals. The point is demonstrating how source records and source references capture all the needs for generating citations in a simple, commonsensical and easy to understand manner. The details get worked out in tasks 3 and 4.

(7)
This is an example. Whether volume numbers are roman or arabic is meaningless for my purpose. The details get worked out in tasks 3 and 4.

(8)
Because this is an example where there wasn’t a long title. Other citation elements would be available for situations where they were required. The details get worked out in tasks 3 and 4.

(9)
Because this is an example, not intended to be complete. The details get worked out in tasks 3 and 4.

None of your comments have any bearing on my overall approach of source records and source references. None of them demonstrate a weakness. None of them make a constructive comment on making changes to the approach of source records and source references. They are simple comments on the details of the example I made up. I appreciate your knowledge on these specific points, but other than making a few trivial changes to the example, there is nothing here that demonstrates any problems with the approach.
ACProctor 2011-12-03T11:12:11-08:00
Re: "The key point about GEDCOM is that it is both a syntax and a semantics. It is GEDCOM as a semantics that is 99% of the problem with GEDCOM. The syntax of GEDCOM is isomorphic to XML, so anything XML can do GEDCOM syntax can do too. This point of GEDCOM as syntax vs GEDCOM as semantics seems to be subtle unless you are used to dealing with languages and writing parsers, and so on...."

I thought my 35 years in computing were unusual these days Tom. As you can see from my profile-paragraph we've both "been around the block a few times".

I'm sure with your experience you'll also agree that things have changed since the days when GEDCOM was new. Requirements are different these days. Resource limitations are vastly different.

XML would never have worked in the 60's or 70's. Similarly, no one would design a format like GEDCOM these days.

To preserve the syntax on the basis that you're retaining some of the semantics is stretching the truth a little. An update of GEDCOM along the lines discussed in this group, and certainly along the lines needed by the industry, would render the previous semantics inapplicable. It would have greatly changed semantics, irrespective of the syntax.

For what it's worth, my opinion would be to make a clean break and do the industry a favour. GEDCOM did its job, but it should be a mere legacy format nowadays.

As for XSLT, I would have to spit - hard - if I ever uttered the initialism. It's a vile language. It was supposed to be a functional programming language but it missed the mark as far as productivity goes. Someone's wet dream that didn't quite work if you ask me.
GeneJ 2011-12-03T11:17:16-08:00
@ Tony,

In an early meeting, one of the major vendors commented about the XML - GEDCOM discussions on the wiki. That vendor was pretty matter of fact, said XML would be a deal breaker for them.
ACProctor 2011-12-03T11:26:39-08:00
Wow! I hope they were a bigger player Gene. Did they qualify the comment with any type of reason?

I know people who hate XML for all the wrong reasons, e.g. "takes up more disk space", without looking at a balanced view.

Tom's view on XSLT is entirely justified IMHO but I wouldn't use that language for anything. In all my years of working with XML, I have managed to avoid it and always use other techniques instead.
louiskessler 2011-12-03T11:27:34-08:00
You can't make "a clean break". There are 500+ programs out there that use GEDCOM.

Extending the language for those programmers is a first requirement if we want to get as wide adoption as possible. That first step must be easy so the majority adopt it. Since every program can already parse GEDCOM input and produce GEDCOM output, extending their code should not be difficult.

But switching completely to XML, which many of those programmers have not even worked with, is asking too much as a first step. It won't get adopted that way.

p.s. I've been programming for over 35 years as well.
louiskessler 2011-12-03T11:37:22-08:00

Tom's example is an excellent suggestion as an improvement that can be made to BetterGEDCOM.

With one simple change (allowing a "1 SOUR @Snnn@" within a Source record), a hundred problems are solved. And that's without having to resort to complex newly defined data structures to contain everything.

An improvement such as this is something most genealogy vendors could accommodate. A completely new structure they probably could not.

Also to GeneJ: Note how Tom's example, which is GEDCOM, contains all the data. Yes, it is true that maybe a few tags here or there should be tweaked to get the data right, but the point is that if programs properly used this general structure with BetterGEDCOM improvements, then ALL the source reference data would be correctly transferred.

Then the programs themselves could implement any citation templates they want to display the data formally. Yes, we at BetterGEDCOM could come up with the standardized source templates for them, but that is not data and need not be part of the BetterGEDCOM semantics; it can be added as an Appendix to allow programmers who want to use consistent citation templates to do so.
ACProctor 2011-12-03T12:27:36-08:00
Re: "But switching completely to XML, which many of those programmers have not even worked with is asking too much as a first step. It won't get adopted that way"

With 35 years experience Louis, you'll appreciate that a good product wouldn't work directly from/to a particular file format anyway.

I would sincerely hope that their product would work from a memory model, supported by some indexed storage such as a database.

Loading and generation of a format like BG or GEDCOM should be hidden behind a library layer. In other words, it is much less of an issue than you implied.
louiskessler 2011-12-03T14:15:16-08:00

Tony,

Most genealogy programs out there are not modern programs built with library layers and modular development or OOP. Most are old clunky things built years ago that have been patched and repaired so many times, you wouldn't want to see their innards.

The GEDCOM input/output of these programs is mostly hand-coded, customized code that is not necessarily independent of the rest of the program.

Taking the GEDCOM input/output and replacing it with an XML input/output for these types of programs is not a simple cut and paste, but is major surgery with a long recovery time.
ttwetmore 2011-12-03T15:04:41-08:00
Louis, AC,

I accept that the BG external format will be XML. I’m simply not a fan. I don't share Louis's belief that the first steps of BG will be an evolution of semantics still clothed in GEDCOM. Though it would be possible, it is too politically incorrect.

My point on this thread is to help make sense of the source and citation world for BG. I thought Geir's report, though addressing difficult aspects of the source area, which are real, was mostly outside the realm of BG's purview, and that following that model would introduce years of nearly impossible work.

I’ll reiterate the four tasks I believe BG should tackle for the source and citation world:

1. Define the structure of a source record.
2. Define the structure of the source reference, a substructure used in records to refer to source records.
3. Define the list of tags for source types.
4. Define the list of key/value pairs that can appear as attributes/citation elements/metadata in the sources and source references.

I propose the DeadEnds model for 1 and 2. I have given a reasonably complex, two-tier example of how the source and source reference concepts work in DeadEnds, shown in both XML and GEDCOM syntax. I invite comment on that proposal. GeneJ did comment on the example, but only on the details of the example itself, which I hope means the overall approach makes sense to her.

The remaining tasks in my opinion are 3 and 4 above.

These tasks don’t include creating a template language. If we do tasks 3 and 4 correctly, others can devise the templates and the software needed to process our source and source references through their templates to create citations. BG should be standards and template agnostic. Applications should be able to apply EE templates, Yang templates, etc., against our records, as the users wish and the vendors provide the features. BG should concentrate exclusively on the data aspects of genealogical models and leave the stylistic and presentation issues to the applications.
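The model/view split Tom describes can be sketched minimally; the template syntax and element tags below are invented for illustration, not part of any proposal:

```python
# Minimal sketch of the model/view separation: BG carries only citation
# elements (the model); an application applies a template (the view).
# Template syntax and tag names here are illustrative assumptions.

def render_citation(template, elements):
    """Substitute citation-element values into a template string."""
    out = template
    for key, value in elements.items():
        out = out.replace("{" + key + "}", value)
    return out

# Citation elements as they might be transferred in a BG file (model)
elements = {
    "AUTH": "Richard James Hancock Jr.",
    "TITL": "Descendants of Charles Edward Hancock of Snow Hill, Maryland",
    "VOLM": "xxiii",
    "PAGE": "345",
}

# A hypothetical EE-style template supplied by the application (view)
ee_template = '{AUTH}, "{TITL}," {VOLM}: {PAGE}.'

print(render_citation(ee_template, elements))
```

The same elements could be run through a different template set without touching the data, which is exactly the independence Tom is arguing for.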
louiskessler 2011-12-01T17:51:05-08:00
GEDCOM effectively already has multilevel sources in it, because sources can have notes and notes can have sources.

Behold can display multi-level sources and Behold 2.0 will allow you to enter multi-level sources directly.

Louis
ttwetmore 2011-12-01T19:38:18-08:00
A few comments on Geir’s comments on my comments.

Master Source is used since just Source is considered by many to be ambiguous. I don't like Master source either, but I found no good alternative.

Source seems no more ambiguous than person, citation, event, ... I believe Source is the right word.

Citation should have been called Reference note, I agree, my mistake, will change.

Except that it is not a note. I’ll stick with the DeadEnds source reference because it is right for this. Note that references are not record-level entities; they are references in one record that refer to a location in another. See the article and journal example for details.

Citation Element is more difficult since it describes things that may also end up in a template for a Bibliography, not only in a reference note. Attribute is a too general term.

I can live with the term Citation Element. But I like to keep things uniform when possible. Key/value pairs show up universally in all records. We need a consistent name for that concept. I use attribute. We should not use a separate term for attributes when they are used in different kinds of records. Attribute is a good term because it is so general.

Metadata. Tom has described his understanding of the term used in the technical world, I will use the term as it is used in the "bibliographic" world, it is data ABOUT (Master) Sources – Books, whatever.

I didn’t know this, but it confirms my point that non-technical people will completely misinterpret a technical term because it makes them sound smart. If we use the term metadata we are perpetuating an unfortunate misuse of the term.

I think everyone involved knows what the terms represent and we can have more terminology discussion later. Until the next update, I will continue to use the terminology of the current version, there are much more important decisions to be made.

I disagree with the point and the implication. Getting the right terminology so everyone understands the concepts being discussed is the first key to understanding. Terms like “master source” and “metadata” immediately exclude everyone who does not yet have specialized knowledge. With good definitions you are halfway there. A major criticism I have of your report is that you do not define your terms.

In the middle there are several blue squares for Evidence Explained, Dublin Core, OAI etc. The model assumes that one agreed-upon (standard) specification describing Master Source Types, Citation Element Types and Templates will be developed for Evidence Explained, possibly one for a set with general MSTs and CETs similar to what Tom mentions (I have informally called it "BG Core"), and a small one containing the CEs of current Gedcom. It is not assumed that, for example, each vendor implementing EE will have their own definitions; that would create chaos. In addition to those three, I assume there will be user defined definitions – in my opinion you cannot prevent that – but this will not work if users expect to be able to transfer many of their user defined templates to another user; they should use "standard" definitions. I think this issue with user defined definitions will police itself when people understand the consequences.

I don’t think BG should try to accommodate all of the different MSTs and CETs that come from all the variety of MSTSs. That is an impossible task that will not only take forever, but will also never get done. BG should research the source types that are out there in the most popular MSTSs, and the citation elements that are out there, and come up with our best list of source types and citation element tags that we feel best covers the different MSTSs out there. That should be the end of our responsibility. It then becomes the job of any application that wants to use a particular MSTS to figure out how to fit BG sources and citation elements into the templates. The EE lists are simply too long for anyone to take seriously in designing a data model. BG cannot fix the ills of the world. All BG can do is put together as complete, constructive, and common-sense a set of sources and citation elements as it can, and then let them hang out there for others to make use of as they see fit. The alternative is to try to solve the gigantic problem you have outlined and still be analyzing it 50 years from now.

Throughout the first posting Tom writes as if there is already an existing BG model, and as if there is a defined scope of BG. Well, there is not, except that BG is about transfer between programs, so I have little to add to those comments; they are Tom's opinion.

Again my view is different. If you are seriously saying that now, after a year, there is no potential BG model out there in our heads, I’d say you are very wrong. And if you are right, then this past year has been about the most ineffective year imaginable in trying to solve a problem. The BG model for sources and citations has been discussed and it will have to be very close to the DeadEnds model. My main negative comment about your report, obviously, is that you have opened up the scope of this discussion to an amazingly wide perspective – so wide that, in my opinion, what we already know about what BG is simply gets lost in the details.

In Tom's 1st posting, 3rd paragraph starting "BG has its …", he mentions a "fixed set of the most important Master Source Types, which can be extended" and "a set of Citation Elements they can contain, suggesting the most important ones". I assume he means they will be defined in a standard or some other specification.

That’s exactly what I mean and it is BG’s job to do that. We need to do the following four tasks:

1. Define the structure of a source record.
2. Define the structure of what I call a source reference, which goes by different names.
3. Define the list of tags for source types.
4. Define the list of key/value pairs that can appear as attributes/citation elements in the sources and source references.

I consider tasks 1 and 2 essentially done with my DeadEnds proposal. Tweak it a bit if you must, but we know we need Source records and a way to reference them with additional location information. Tasks 3 and 4 are to take EE, Yates, Zotero, and other standards, and decide how to digest them down into twenty or so source types with thirty or so citation elements. This is probably the big difference between you and me. You see 3 and 4 as a project that will take a very large effort. I see it as a task that can be done by two or three people with the EE and other lists in front of them.
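The digested lists Tom envisions in tasks 3 and 4 could be captured in something as small as the following sketch; every tag below is an illustrative guess, not an agreed list:

```python
# Sketch of what the output of tasks 3 and 4 might look like: a short
# registry mapping source-type tags to the citation-element tags they may
# carry. All tags here are examples only, not a proposed standard.

SOURCE_TYPES = {
    "journalArticle": {"TITL", "AUTH", "VOLM", "NUMB", "PAGE"},
    "journal":        {"TITL", "PUBL", "DATE", "PLAC"},
    "book":           {"TITL", "AUTH", "PUBL", "DATE", "PLAC"},
}

def unknown_elements(source_type, elements):
    """Return the element tags not defined for the given source type."""
    allowed = SOURCE_TYPES.get(source_type, set())
    return {tag for tag in elements if tag not in allowed}

# A conforming record produces no unknown tags
print(unknown_elements("journalArticle", {"TITL": "...", "AUTH": "..."}))
```

An importing program could use such a registry to validate incoming sources, which is all the "standard" most applications would need from BG.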

As I have stated in the document "An Architecture …" that I published in August (see first page of my model document), I would very much like to see such a solution, informally called "Better Gedcom Core" (Core inspired by Dublin Core), in the second page of the data model. BUT, I have not seen such a specification, and I am not aware of any current work that is going on. I will not be surprised if it takes at least a couple of years to develop it for international use. Also, how will the extensions mentioned be handled – by updating a standard or …?

We either get something done or we don’t. I think source types and citation element types are key to getting BG done. I want BG done in my lifetime. Ergo, what you are describing here is our job, not something we wait for.


Unfortunately Tom ignores the fact that there is another solution that is not likely to disappear any time soon; it is called Evidence Explained and has been implemented by several major vendors. Who is the judge who has decided that users are not allowed to enter citation data according to EE, when the vendors support it? How are you going to enable transfer of data between these programs any time soon? How are you going to preserve the semantic context provided by the several hundred Citation Elements in EE? How are you going to handle the fact that these programs can have user defined Master Source Types and Citation Elements – are you going to prohibit the transfer of these between, for example, a user's programs?

There is nothing in my ideas to prevent a system from letting a user enter data in an EE manner. If we have the source types and element tags necessary to hold that data in our model things are fine. As I said, that is tasks 3 and 4.

The proposed data model is, among other things, trying to satisfy a requirement to develop a solution for EE that can be developed relatively fast, a requirement that has been discussed many times in the Developer Meetings.

I believe we are putting too much emphasis on EE. Properly approaching tasks 3 and 4 allows us to provide the basis so applications can then apply EE or any other MSTS template sets.

What I am trying to do is to build a platform where both EE and a "BG core" MSTS (according to Tom's and my taste) can coexist, simply by defining a way that data and definitions that already exist in many of the major programs can be exchanged.

We do that by completing tasks 3 and 4 and then letting the application writers write applications.

Tom characterizes my model as complex, but it should be compared to what you find in the major programs, not DeadEnds. Most of the programs allow you to define Master Source Types, Citation Elements and Templates. The things I have added are the capability to exchange these definitions, to do conversion (using templates) and a few features that attempt to provide translation. So, most of the complexity is already implemented in the major programs.

None of these things require BG directly. These are application features. If BG has a rich enough selection of source types and citation elements, applications can use them to solve these problems.

In the 4th paragraph of Tom's 1st posting, starting "And then …", he mentions "conversion from anybody's database into citations from anybody's template set, a monstrous problem". Yes, the model allows it, but I think that will regulate itself; I don't think you will see that in practice. The intention is to be able to convert between standardized or otherwise agreed MSTSs, hopefully including one for EE, from user defined MSTSs to the standard ones, and from "external formats" such as Dublin Core, Zotero etc. into the "standardized" formats. I believe the practicality of conversion between MSTSs defined by many individual users is very limited.

I agree in principle, but you seem to be implying that it is somehow BG’s responsibility to do these conversions. By completing tasks 3 and 4 we enable applications to perform these conversions.

If you look at Zotero, you may get an understanding of what conversion can achieve. Conversion will be an important feature in an environment where metadata is transferred automatically between various programs (and metadata schemes), rather than being typed in. For smaller vendors it will be much simpler to implement a general conversion mechanism rather than implementing conversion for each external scheme and database. We need to evolve the handling of metadata and citations beyond the methods used when Gedcom was designed more than 20 years ago.

Again, yes and yes. But BG is not a conversion platform. It is a data model that we are responsible for making rich enough that the conversions are possible. We do that by completing tasks 3 and 4.

In the paragraph numbered 4 in Tom's 1st posting, he defines what the realm of BG is, although there is no agreement on that. He states that transfer of MST/CET definitions is outside the realm, again assuming that there will be one standard à la "BG Core" that will contain all that users need. I do not believe there will be a standard that suits everyone anytime soon.

I have a very clear cut idea of what BG should be. It comes directly from all of our original statements -- “Better GEDCOM is a file format for the archiving and exchange of genealogical data that can encompass all the aspects of the data models of the current genealogical systems that we believe should be included.” A bit of paraphrasing, since I didn’t look it up, but this has been front and center since we started.

I live in a country with few native programs, and I am using several foreign programs which are supposed to be localized for my country. Well, they are not, in every detail, and it is necessary to enter user defined source types – the major part of EE cannot be used here. Does the built-in support for styles in these programs suit my needs? Sometimes it does, sometimes it does not; it depends on where I publish – thank heaven the programs got user defined templates. Currently, that requires someone importing my data to enter these templates. Assume that a genealogy journal here defines its own style. It will be much more efficient to distribute the definitions in a BG file than to require each user to type them in. I have absolutely no hope that a foreign vendor will ever implement that definition, and if one does it will take years.

You are suggesting that templates be storable in Better GEDCOM files. In my opinion this is outside the scope of BG, and unnecessary. Someone else should be responsible for creating a specification for source templates, and the language to represent those specifications. Source templates are not genealogical data. BG is a format for genealogical data. Again our job is only tasks 3 and 4. With those tasks we enable any application to use any source template set in any format from anybody’s standard against our source types and citation tags to generate the final citation strings. This is very basic. The principle is used commonly. Call it model and view as in graphical interfaces. The model holds the data. The view presents the data in different formats. BG is the model. Source templates are the view. The two should not be mixed. Call it data and style as in HTML and CSS. HTML is the data content and CSS is the stylistic rules for displaying the content.

Programs going beyond the capabilities of current Gedcom will have to support more than 50 MSTs, even with a "BG Core", and are likely to have internal structures equivalent to MST and CET entities. If they want to allow their users some freedom, they will also have structures for templates. Importing and exporting these, if they are according to a standard, is not a major undertaking. With regard to the work that a vendor that has not implemented support for EE would have to do, I expect that it will be less work to implement an import/export feature based on an extension to Gedcom than to type in all the data required for the definitions in EE – and then the other benefits of such a feature come for free.

I agree fully. And your model seems to be the design for such a program. As I said twice before, a very noble goal. But not BG’s goal. Our goal is to do tasks 3 and 4 so we can enable such programs to be designed.

If someone is going to write the definitions required to support EE, with many translations, why not do it in a format that can be handled by every program implementing EE, and that can be used for other purposes as well.

A very good point. But what has it got to do with getting tasks 3 and 4 done? All BG can hope for is that the people who finally get around to worrying about specifying EE in templates can do it well and can convince application writers to support their templates. I’m not disagreeing with you, just asking what it has got to do with BG.

Exchange of MST/CET and templates gives at least these benefits

- Users become much more independent of vendors, and can easily and quickly distribute definitions within a community (e.g. as a standard within a country) without having to rely on the willingness of a vendor to do it, or having to wait a year.

- Translations of these can be distributed quickly.

- A user can transfer his definitions between his programs.

- The definitions can be stored together with the data for long term storage.

- MST records are simpler to handle than if this info is carried in an MSI, e.g. an importer does not have to go through all MSIs to find the distinct MSTs. Translations of MST names are easily handled, and new translations can easily be introduced without a need to transfer the data again – in essence the users do not have to bother with translation.

- It will reduce the work needed by vendors, and this will benefit the smaller vendors.

- It allows implementation of source selection mechanisms tailored to users' needs, again without having to rely on a vendor.


I agree with all of this; a great analysis. Again I ask, what has any of this to do with BG?


In the paragraph numbered 5, Tom is again defining the realm of BG, and argues that a simple key value pair is sufficient. I don't see how he envisages a program to allow international data transfer without having the concept of a data type for person names etc., unless you standardize hardwired processing in the programs for each CE. Until a standard exists that defines every Citation Element that will be needed, there will be a need to identify the data type in transfer and there will be a need for some translation capabilities – however limited.

I understand the basis of your point. But so far every genealogical data model has managed just fine by using values that can be expressed as strings -- names, dates, places, ages, sexes, events, notes, sources, and so on. Media files all use text based mime types. The alternative of a key/value pair is a key/type/value triple. If there are places where we find this necessary we can introduce it. I’d bet we never have to use it though.
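Tom's contrast between a key/value pair and a key/type/value triple is easy to make concrete (the type name below is hypothetical):

```python
# Sketch of Tom's point: string-valued key/value pairs suffice for most
# genealogical data, and a key/type/value triple is the heavier fallback
# if explicit typing is ever needed. The type name is hypothetical.

# Key/value: everything expressible as a string
pair = ("DATE", "11 NOV 1911")

# Key/type/value: the alternative, carrying an explicit data type
triple = ("DATE", "gregorianDate", "11 NOV 1911")

def value_of(item):
    """Read the value uniformly from either form."""
    return item[-1]  # the value is always the last component

print(value_of(pair), "==", value_of(triple))
```

As the reader function shows, a consumer that treats all values as strings handles both forms, which is why the triple can be deferred until a real need appears.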
ttwetmore 2011-12-01T19:55:55-08:00
My model does not currently support 2-level sources. One reason is that I have never come around to evaluating the pros and cons. At this stage I have no strong opinion about such a scheme.

I think it a bit unfortunate that you haven't thought about this yet. Understanding how the different concepts of source, source reference, and evidence record referring to a source record work together becomes crystal clear in examples such as this. I think my main criticism of your "citation" and "reference note" ideas, even going back to your earlier report, stems from the fact that without seeing them inside the larger picture made possible with a multi-tier system, you can't truly appreciate their full natures.

I make no bones about the fact that I believe the DeadEnds approach – a "recursive source record", with the pointers between the evidence records (persons and events) and source records, and between source records, being the source reference structures as I have defined them – makes up the perfect way to model sources. By perfect I mean things like: 1) makes the most sense; 2) best models the real world; 3) is the simplest conceptually; 4) can be applied universally; 5) can be used easily by applications that must match citation elements against source templates; 6) uses concepts that are easy to explain because they match common sense; 7) requires the least number of keystrokes to get all information; 8) requires no redundancy when evidence refers to different locations within the same sources.
gthorud 2011-12-02T09:19:41-08:00
Tom, I suggest you repost the last very long posting; for someone who has not read my message it is very difficult to read. The best thing would be to preserve the headings and leave out my text – otherwise I doubt anyone will read it.
gthorud 2011-12-02T17:48:08-08:00
Lous,
If GEDCOM intended to support multilevel sources, the way you mention is a very strange way to do it. I don't think it was intended.
gthorud 2011-12-02T17:49:25-08:00
Louis, sorry about the spelling.
louiskessler 2011-12-02T21:41:43-08:00
Geir,

Maybe not, but it works very well:

0 @I1@ INDI
1 NAME John /Black/
1 BIRT
2 DATE 11 NOV 1911
2 SOUR @S1@

0 @S1@ SOUR
1 TITL Interview with Aunt Mabel
1 NOTE She said info about John came from family bible
2 SOUR @S2@

0 @S2@ SOUR
1 TITL Thompson Family Bible
1 NOTE Information about John was written in bible by his son
2 SOUR @S3@

0 @S3@ SOUR
1 TITL Information from John Black's son
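Louis's chained example (interview → bible → informant) can be walked mechanically once the records are parsed; a sketch assuming the records are held as plain dicts (the dict layout is illustrative, not a proposed model):

```python
# Sketch of walking Louis's multilevel source chain after parsing the
# GEDCOM records above. The dict representation is an assumption made
# purely for illustration.

records = {
    "S1": {"TITL": "Interview with Aunt Mabel", "SOUR": "S2"},
    "S2": {"TITL": "Thompson Family Bible", "SOUR": "S3"},
    "S3": {"TITL": "Information from John Black's son"},
}

def source_chain(records, start):
    """Return the titles from a source down to its ultimate origin."""
    chain, xref = [], start
    while xref is not None:
        record = records[xref]
        chain.append(record["TITL"])
        xref = record.get("SOUR")  # follow the next-level source, if any
    return chain

print(" <- ".join(source_chain(records, "S1")))
```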
ttwetmore 2011-12-03T07:00:02-08:00
Louis's example spurred me to recast my journal article example into GEDCOM syntax. I believe that there is nothing wrong with GEDCOM as the syntax for the external archive format for BG, and I think that Louis shares this view. The only problem with GEDCOM is that it has become politically incorrect to recommend it for its syntactic value, mostly because of the ascendancy and extreme political correctness of XML, and because of the negative influence caused by the incompleteness of the GEDCOM semantics model and the many conflicting extensions of GEDCOM 5.5 that vendors have used to compensate for the incompleteness.

Here is the example in GEDCOM. Of course I have had to add tags here and there, as semantic GEDCOM isn't rich enough for some of this information. I think it has been one of Louis's points from the beginning that we could "simply" modify and extend GEDCOM 5.5 to get to a Better GEDCOM solution. I fully agree with him, though because of the political incorrectness issues, I think this solution would be rejected by many.

We developers know that the external expression of a model is only syntactic sugar, so the GEDCOM syntax vs XML syntax (or JSON syntax or Google Protocol Buffer syntax) is a trivial issue with respect to developing the model. In my own DeadEnds software, written in Objective-C, I have a set of description methods that can generate external format in XML, GEDCOM and JSON.

0 @S101@ SOUR
1 TYPE journalArticle
1 TITL Descendants of Charles Edward Hancock of Snow Hill, Maryland
1 AUTH Richard James Hancock Jr.
1 SOUR @S202@
2 VOLM xxiii
2 NUMB 3
 
0 @S202@ SOUR
1 TYPE journal
1 TITL New England Historic Genealogical Register
1 PUBL
2 NAME New England Historic Genealogical Society
2 DATE 1847 to present
2 PLAC Boston, Massachusetts
1 REPO @R55@
2 CALL nh45-65
 
0 @R55@ REPO
1 TYPE library
1 NAME New England Historic Genealogical Society Library
1 PLAC Boston, Massachusetts
1 ADDR 101 Newbury Street
 
0 @P664@ INDI
1 NAME Charles Pettigrew /Hancock/
1 BIRT ...
1 DEAT ...
1 SOUR @S101@
2 PAGE 345 to 346
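Tom's remark that the external expression of the model is only "syntactic sugar" can be made concrete: the same source record (structure borrowed from the example above, dict layout assumed) emitted in GEDCOM, XML and JSON from one in-memory form.

```python
# Sketch of one in-memory record emitted in three external syntaxes, in
# the spirit of Tom's multi-format description methods. The dict layout
# and XML element names are assumptions for illustration.
import json

record = {"xref": "S202", "tag": "SOUR",
          "TYPE": "journal",
          "TITL": "New England Historic Genealogical Register"}

def to_gedcom(r):
    lines = ["0 @%s@ %s" % (r["xref"], r["tag"])]
    for key in ("TYPE", "TITL"):
        lines.append("1 %s %s" % (key, r[key]))
    return "\n".join(lines)

def to_xml(r):
    inner = "".join("<%s>%s</%s>" % (k.lower(), r[k], k.lower())
                    for k in ("TYPE", "TITL"))
    return '<source id="%s">%s</source>' % (r["xref"], inner)

print(to_gedcom(record))
print(to_xml(record))
print(json.dumps(record))
```

The model stays identical; only the serializer changes, which is the sense in which the GEDCOM-vs-XML question is trivial for model design.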
ttwetmore 2011-12-03T07:10:32-08:00
In the example, I should have pointed out that the following tags: TITL, AUTH, VOLM, NUMB, PUBL, NAME, DATE, CALL, are the things that would be called CITATION ELEMENTS or METADATA in Geir's vocabulary, and that the TYPE tag specifies what Geir would call a Master Source Type. Casting this into the terminology I use in the DeadEnds model, the 0 SOUR records are the Source records, and the 1 SOUR substructures are the Source References.

So I'm in full agreement with Louis that GEDCOM already has the necessary structures for Source records and Source References (which means the necessary structures for Citation Elements and Metadata).
GeneJ 2011-12-03T08:12:45-08:00
Hi Tom,

I've not been able to retrieve the article to which you refer. Always a chance that there is a fluke in American Ancestor's search engine.

Should I be able to retrieve the article you used for the example?
ttwetmore 2011-12-03T08:19:09-08:00
The article doesn't exist. I made it up. (Charles Edward Hancock was a great grandfather; Richard James Hancock Jr. was an uncle; Charles Pettigrew Hancock is a figment of my imagination). CEH was from Snow Hill, Maryland, and I have used articles from the "Register" in the past.
louiskessler 2011-12-03T08:54:05-08:00
Tom and I have agreed in the past that it is simple to map between GEDCOM and XML. The two are effectively equivalent.

I think a future BetterGEDCOM could be built both as an enhanced GEDCOM, and as an XML equivalent.

Since most developers use GEDCOM, it would be easier for them to migrate to a similar GEDCOM format. Simple utilities could do the translation of GEDCOM to XML and vice-versa so new developers, if they choose, can use the XML format.
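A sketch of the "simple utility" direction Louis mentions, assuming the level-numbered line format shown in the examples above (names invented; xref pointers such as @R55@ are not treated specially here):

```python
import json

def parse_gedcom(text):
    """Build a tree from GEDCOM's level-numbered lines; once the tree
    exists, emitting XML or JSON is mechanical. A sketch only: xref
    pointers such as @R55@ are not handled specially."""
    root = {"tag": "ROOT", "value": None, "children": []}
    stack = [root]                      # stack[level] = current parent
    for line in text.strip().splitlines():
        level_s, rest = line.strip().split(" ", 1)
        level = int(level_s)
        tag, _, value = rest.partition(" ")
        node = {"tag": tag, "value": value or None, "children": []}
        del stack[level + 1:]           # pop back to this line's depth
        stack[level]["children"].append(node)
        stack.append(node)
    return root

sample = """0 REPO
1 NAME New England Historic Genealogical Society Library
1 PLAC Boston, Massachusetts"""
repo = parse_gedcom(sample)["children"][0]
print(json.dumps(repo, indent=1))       # same record, JSON surface syntax
```

The reverse direction (tree back to GEDCOM lines) is a similarly small walk over the tree.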

Louis
ttwetmore 2011-11-29T14:16:29-08:00
I maintain that G's report goes way beyond BG's concern. But BG can obviously go that route if it chooses.

By designing a model that includes sources and source references, as I have done in the DeadEnds model, we can accommodate data from all current systems. As I have said, the problem is that of choosing a set of source types that are congruent to the current set of citation templates, and, for each, providing a list of the most important attributes (okay, citation elements) used for that kind of source. This simple approach addresses the "large user base effected [sic] by GEDCOM's inability to transfer the work."

There are no "existing technologies" behind G's report that I am ignoring. Do we think it is BG's task to produce a translator program to translate source information stored in RootsMagic format to the Legacy format, and also be able to generate citations from every one of those formats to every citation template set that exists in the real world? Is that the alleged technology I am ignoring? As I said, that's a noble goal, but it had better not be BG's goal or we'll get nothing done. Or do we think that simply coming up with the model for such a tool would give us the insights we need to understand BG's task vis-à-vis sources and citations? Much faster to read the DeadEnds model and then discuss why that isn't already the right model BG needs for sources.

P.P.S. Citation elements are not metadata, and some of us aren't awkward about using the term when it is appropriate to do so. But I'm not surprised that non-technologists/non-developers glom on to sexy terms and adopt them inappropriately because they sound so darned intelligent. But it is surprising when some developers do, as is being done in this case. Metadata is data about a whole set of entities AS A CLASS, not about a specific entity AS AN INSTANCE. Citation elements hold data about SPECIFIC SOURCES, not about the SOURCE CLASS as a whole. By the same reasoning we would have to conclude that the name and the sex of a person are metadata. In fact, by this reasoning EVERYTHING becomes metadata, for every attribute is ABOUT its object.

Now, saying that sources can have titles does indeed state metadata about the source class, since it describes a property of the class itself, namely "sources can have titles." Nothing awkward there at all. But saying that the title of a specific source, say "The Trasks of Nova Scotia," is metadata is just plain wrong, and among rational people, even awkward ones, there is no argument possible. That title is ordinary garden-variety data.
louiskessler 2011-11-29T22:43:39-08:00
I am basically in agreement with Tom's comments on the draft.

I think a clear distinction must be made between

1. The Repository/Source/Source Details which tell you what the source is,

2. The Citation/Citation Templates which tell you how to formalize and display the reference to the source, and

3. Everything else.

#1 needs to be formalized in BetterGEDCOM.

As long as the Repository/Source/Source Details are transferred, then every program can process them.

#2 can be done along with BetterGEDCOM if desired to formalize the way every program should then display the source reference. But that has nothing to do with storing or transferring the data.

Geir did a lot of work, but it would be wonderful if he could try to restructure it to include Tom's suggestions.

Also, I really like GeneJ's Zotero spreadsheet she showed at Monday's meeting. It contains source field types along the top row and attributes along the left column, with each cell filled with the attribute if it applies to the field type.

To me, if BetterGEDCOM can define this as it is needed to properly source genealogy data, then that alone might be all that's needed to get source data properly transferred between different programs.

The citation definitions could then be added as the icing, so that all programs can display the source reference the same way.

Louis
ACProctor 2011-11-30T08:03:16-08:00
This may be of interest since I've been working on this area recently.

I elected to have a well-defined <Citation> element that uses a parameterised URN to generate a unique reference string for a citation.

    <Citation Key='key'>
        [ <Name> citation-name </Name> ]
        <DisplayFormat> format-string </DisplayFormat>
        <URN> parameterised-urn </URN>
        <Params>
            { <Param Name='name'> default-value </Param> } ...
        </Params>
        [ <ResourceRef Key='key'/> ]
        [ NARRATIVE_TEXT ]
    </Citation>

For instance:

    <Citation Key='UkCensus1901'>
        <Name>1901 Census of England and Wales</Name>
        <DisplayFormat> [${Series}/${Piece}/${Folio}/${Page}] </DisplayFormat>
        <URN> http://www.nationalarchives.gov.uk/?census?&piece=?&folio=?&page=?</URN>
        <Params>
            <Param Name='Series'>RG13</Param>
            <Param Name='Piece'/>
            <Param Name='Folio'/>
            <Param Name='Page'/>
        </Params>
    </Citation>

Parameters would be specified at <CitationRef> time. The URN base string would have to be agreed with the body generating the identifiers in order to ensure that the resultant strings are unique, persistent, and can be correlated with citations from other datasets.
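A minimal sketch of how that parameter merge might work, under the assumption that both the DisplayFormat and the URN use the ${Name} placeholder syntax (the URN example above uses bare '?' placeholders, so this is one possible reading; the filled-in census values are invented):

```python
import re

def expand(template, defaults, supplied):
    """Merge <Param> defaults with values given at <CitationRef> time,
    then fill ${Name} placeholders. Missing values become empty strings."""
    params = {**defaults, **supplied}
    return re.sub(r"\$\{(\w+)\}", lambda m: params.get(m.group(1), ""), template)

defaults = {"Series": "RG13", "Piece": "", "Folio": "", "Page": ""}
at_ref_time = {"Piece": "318", "Folio": "71", "Page": "12"}   # invented values
print(expand("[${Series}/${Piece}/${Folio}/${Page}]", defaults, at_ref_time))
# → [RG13/318/71/12]
```

The same expansion applied to the URN base string yields the single comparison key for the citation.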

A similar <Resource> element uses a parameterised URL to locate "resources" (i.e. supporting documents, images, etc). These might be part of the local collection (in which case a file: URL might be applicable) or on the Internet (in which case an http: URL would be applicable). Parameters are inherited through multiple levels so the <ResourceRef> in the <Citation> might locate a local image version of the 1901 census citation.

My Narrative text originally contained a simple Source='...' attribute that allowed references to informal sources, such as a relative's comments, but I'm thinking of back-tracking on this. There's no reason why the same citation scheme couldn't be used in all cases if the standard provided a number of unique URN base strings for the common informal citations.

As usual, a lot of my stuff has been developed independently, and I acknowledge that my years in genealogy are fewer than many others' in this group. However, I'm as keen to share my thoughts (which may seem a bit radical on the face of it) as to hear other people's views, and cases that this scheme may or may not work for.
GeneJ 2011-11-30T09:48:41-08:00

It is encouraging to see a variety of interest in this topic.

Believe we should take care in making assumptions that bias how we review the proposal. It seems premature, too, to be critical about terminology in the proposal (or otherwise).

It may be easy for those who haven't worked with Mills' _Evidence Explained_ to assume away a variety of complexities. Even more so if you have not made some comparative review of Mills to CMOS to Register, or considered principal differences between UK-centric and US-centric source materials and presentations (much less Norway, Germany, etc.).

Ditto, it's possibly easy to assume away complexities if you haven't worked with various "style" implementations in existing software. The reverse is probably true for those who have. We made a small study of vendors' implementations of Evidence Explained and other styles, reviewing not only GEDCOM output, but also the approaches in the user interface and program native output. It's fair to say we found more differences than similarities, and that was before attempting to consider international exchanges.

Much earlier, of the folks interested in the sources and citations topic, there were few among us who had worked with _Evidence Explained_. Only some even had ready access to the work. Is that still the case?

We posted some information about application issues on the wiki. Have folks reviewed the early effort to develop summaries about the application systems (http://bettergedcom.wikispaces.com/Application+Data#RootsMagic) or some of the graphics / screenshots that are linked to the "About Citations" page?

GEDCOM itself is this nice small set of information relative to sources and citations. It is easy for any of us to suggest it be nudged just a bit here and there--and how well just a few changes would work for many of us. I've even posted some of those suggestions … but eventually we have to deal with the real-world circumstances that exist in some of the widely used desktop software applications.

A bit jammed up today … have time only for a couple of specifics, below. Hope this helps. --GJ

(1) MSTS. Someone suggested MSTS was unnecessary. Geir can probably explain MSTS better or more accurately, but from my user perspective, if you don't recognize the MSTS, then those properties/that functionality will be developed into "source types." (If Zotero had taken this approach, rather than its roughly 35 source types (they call them item types), there might be 35,000.) (This is where I repeat Louis' early notice--styles reflect trends/fads in that they come, go and update on an independent cycle.) If we want to have a reasonably clean set of source types, we need MSTS.


(2) The Zotero spreadsheet. I'm a believer. The Zotero folks started with reasonably well-cataloged source materials (and those items born digital)--by and large, standardized metadata existed. We can do that also by focusing first on, say, all published (or originally published) materials. I _assume_ we'd have reasonable agreement from the start about the "what" and even the "how" of these published materials … which would provide a good platform to work on some of the important technical issues (data types, translation, etc.). [See no. (1), above; the Zotero spreadsheet works for Zotero because it has that separate MSTS concept/development.]
ACProctor 2011-11-30T15:19:23-08:00
I hasten to add, Gene, that my URN scheme is not limited in any way, e.g. to the Internet. A URN is simply a unique name representing something, ...anything in fact. The fact it looks like a URL has confused many people in the past so I thought it was worth mentioning :-)

I picked a URN precisely because it is an established standard for creating unique names, or keys, for generalised "things". I needed a mechanism that could be parameterised and that wasn't limited at all in the types of citation - either formal or informal - that I might encounter. At the moment, I'm more interested in complete generality than a precise enumeration of all possible sources or citation types.

I don't have a copy of 'Evidence Explained' by E. S. Mills, but I was very interested in John Yates' 'Evidence Style Parametrization'. The table at http://jytangledweb.org/genealogy/evidencestyle/evidence_style.html gives parameter fields for the 170 models in the Mills book. That would fit nicely with my URN scheme except for the choice of a base URN string.
GeneJ 2011-11-30T15:38:12-08:00
Hi Tony,

Thanks. There are many needs to be met, aren't there?

We worked with Yates' model at the end of last year and corresponded with him at that time. I spent an obsessive :-) amount of time putting the material in spreadsheet form and working to reduce the number of elements in the model. I got to about 500, but further cuts would have required more context--it was obvious I would never get that number to a good level for the purpose of standardization. (We have a spreadsheet about that research linked to one of the wiki pages.) I believe John worked similarly; last I followed his progress, he had arrived at about 550 elements, but separately suggested that number might increase to about 3,000 if he accounted for a more complete range of user/documentation requirements.

He did a terrific job advancing the concept, but in addition to the volume of elements and other "bits," there are problems with implementations based on the 170 quick check models in EE 2007. Some are documented on the wiki. For one, the 170 models are examples for just a fraction of the style combination requirements.

I'm glad we worked with and became familiar with his materials.--GJ
ACProctor 2011-11-30T16:14:16-08:00
Interesting comparison between my URN scheme and the XML-based CSL (see http://citationstyles.org/downloads/specification.html for CSL 1.0).

CSL seems to be designed for formatting citations in a human-readable fashion; something like the CMOS one, I guess. In contrast, the URN scheme is a computer-readable format for holding the same set of parameters.

This is a little like the difference between the ISO 8601 date format I've recommended elsewhere (which is an unambiguous computer format) and a date formatted according to your local Regional Settings.

It's worth asking 'what does a URN offer that the CSL XML doesn't already have?'. I would say the ability to generate a single string that uniquely defines the citation for easy comparison/correlation by software.

The format-string in my example above is basically doing the job of the CSL.
GeneJ 2011-11-30T17:00:28-08:00
Hi Tony,

Zotero was the first reference management software to adopt CSL, aka, Citation Style Language (_Wikipedia_, "Zotero.").

Little aside ... I was led to John Yates' work by a posting he'd made on the Zotero forums. He was hoping Zotero would build a Mills style. I found his posting about six months later and thought that would be a great project, so I followed his work and that thread. I learned much in exchanges with the Zotero folks.

I'll be following the comments of others about the URN.
ttwetmore 2011-11-30T18:27:50-08:00
I am not going to get into another deep note about G’s report. The first response to my note seemed to say that my criticism implied that I didn’t share Better GEDCOM’s views about solving the source and citation problem. Nothing could be further from the truth. I’ve been suggesting a solution to the problem for a year now, have written it up in the DeadEnds model documents, and have presented examples of its use. I'll try to go over it again here.

My solution to the source and citation problem is based on the Zotero, Yates, EE (and frankly, the 24-year-old Wetmore) approach of defining a list of source types and then sets of tag/value pairs ("citation elements") that go along with each one. Since this seems the obvious solution to most of us, I wonder why this is not what we are doing. This is the MSTS and MST "problem" relegated to a simple list.

Here is a quick example showing a simple case where information about a person is extracted from an article found in a multi-volume journal. I’ll go through the full details of the DeadEnds approach. The salient points of this example are that there will be two source records, one for the article, and one for the journal as a whole. There will be a repository record stating where the journal is located. And of course there will be a persona record extracted from the article. There will be two source references, one in the person record that points to the article source record and has the pages where the evidence was found; and one from the article source record to the journal source record that holds the volume and issue number where the article is located.

A journal article provides an excellent example of a situation where a source within a source is the right solution for structuring the source and citation information into two levels. In most cases (e.g., a book) there only needs to be one level. I picked this example primarily because it provides a way of showing the need for two layers of sources in a real and uncontrived manner.

Here are the records one by one. I’m using a generic XML to show the examples. This is for convenience. DeadEnds is external language agnostic; with a plug-in it can generate external files in any format one might want.

First is the source record for the journal article. It has a source type and two citation elements (title and author). It has a source reference that points to the source that contains it (the journal); the source reference has two additional citation elements (volume and issue number) that locate the article in the journal.

<source id="s101" type="journalArticle">
  <title> Descendants of Charles Edward Hancock of Snow Hill, Maryland </title>
  <author> Richard James Hancock Jr. </author>
  <sourceRef id="s202">
    <volume> xxiii </volume>
    <number> 3 </number>
  </sourceRef>
</source>

Next is the source record for the journal as a whole. It also has citation elements, a simple one for its title, and a structured one for its publication info. This source is a “top-level” source so it contains a repository reference. The repository reference has an element that serves to locate the source in the repository, a call number in this case.

<source id="s202" type="journal">
  <title> New England Historic Genealogical Register </title>
  <publication>
    <publisher> New England Historic Genealogical Society </publisher>
    <date> 1847 to present </date>
    <place> Boston, Massachusetts </place>
  </publication>
  <repositoryRef id="r55">
    <callNumber> nh45-65 </callNumber>
  </repositoryRef>
</source>

To finish the source-chain records, here is the repository record.

<repository id="r55" type="library">
  <name> New England Historic Genealogical Society Library </name>
  <place> Boston, Massachusetts </place>
  <address> 101 Newbury Street </address>
</repository>

Here is a persona record (that is, a person record holding information extracted from a single source) that we have extracted from the journal article. Note the source reference to the journal article; it contains the citation element needed to locate the pages in the article where the evidence was found.

<person id="p664">
  <name> Charles Pettigrew Hancock </name>
  <birth> ...
  <death> ...
  ...
  <sourceRef id="s101">
    <page> 345 to 346 </page>
  </sourceRef>
</person>

I think the many advantages of this approach are obvious. There will never need to be more than exactly one source record for the article or the journal, since the information that locates material within them is held outside them, in the source references. All the citation elements are located at the exact logical spot where they belong.

Where does the final citation string come from? I hope it is obvious that all the “citation elements” needed for its generation are available in this chain of records and references. Yes, the citation elements needed to generate the string are located in different places, but in a strictly structured and logical manner. Any system employed by an application to interpret citation templates to generate citation strings can find all the citation elements it will need from this structure.
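That gathering step could be sketched like this (invented in-memory dictionaries mirroring the XML records above, not DeadEnds code): follow the chain of source references upward, collecting the citation elements found at each level.

```python
# Invented in-memory records keyed by id, mirroring the XML records above.
records = {
    "s101": {"type": "journalArticle",
             "elements": {"title": "Descendants of Charles Edward Hancock of Snow Hill, Maryland",
                          "author": "Richard James Hancock Jr."},
             "sourceRef": ("s202", {"volume": "xxiii", "number": "3"})},
    "s202": {"type": "journal",
             "elements": {"title": "New England Historic Genealogical Register"},
             "sourceRef": None},
}

def collect_elements(source_id, ref_elements):
    """Follow the source chain from the cited source to the top level,
    gathering the citation elements a template engine would need."""
    layers = [dict(ref_elements)]           # elements on the person's sourceRef
    while source_id is not None:
        record = records[source_id]
        layers.append(dict(record["elements"]))
        if record["sourceRef"]:
            source_id, ref_els = record["sourceRef"]
            layers.append(dict(ref_els))    # elements on the inter-source ref
        else:
            source_id = None
    return layers

layers = collect_elements("s101", {"page": "345 to 346"})
# layers now holds: page, article elements, volume/number, journal elements.
```

A template engine would then pick elements out of these layers in whatever order the citation style dictates.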

This is an example of the DeadEnds approach to sources, repositories and citations. I have proposed it as the Better GEDCOM approach. I believe it meets all goals BG has with respect to this area.
GeneJ 2011-11-30T19:51:43-08:00
Hi Tom,

Some quick comparisons follow:

Here are the Zotero fields:

Zotero fieldname
seriesText
journalAbbreviation
language
ISSN
shortTitle
libraryCatalog
rights
DOI
url
abstractNote
accessDate
archive
archiveLocation
callNumber
series
seriesTitle
publicationTitle
issue
date
extra
pages
title
volume

ROOTS MAGIC 4
Here are the Mills pre-packaged/interpreted fields from the RootsMagic v4 master source for "Journal Article, print" (each of the RM fields also contains a "hint" or description):
Note: RM cross references this template to "Ref: [EE, QC-14, p779, sec 14.16, p 798; E!, p 64]"

Master Source [name] [administrative]
Author
Article Title
Article Subtitle
Journal Title
Volume
Issue Date
PageRange


LEGACY 7.5
Here are the Mills-interpreted master source fields from Legacy 7.5's Source Writer's "Master Source" for "Periodicals > Journals > Basic Format" (references to p. 778 and separately to p. 798*; each of the Legacy fields also contains a "hint" or description).

Source List Name [administrative]
Title (described as journal's title)
*Legacy has a journal grouping from Mills p. 106 and from p. 143 another group from 779 ... 798 .. 799 ... 814 (and then has what may be the same groups referenced by section numbers)...


FTM-MAC
Here are the "source" fields from FTM-Mac for "Publications--Periodicals, Broadcasts, and Web Miscellanea > Periodicals > Journal Article - Print Edition".
Each of the fields has a "hint" or description.

Author Surname
Author Forename(s)
Other Authors
Article Title
Article Subtitle
Journal Title
Journal Volume
Journal Issue Date
Article Page(s)

(As soon as I post this I'm sure to find a typo or something I want to fix. Sigh.)
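Read side by side, the three vendors cover largely the same journal-article elements under different labels. A sketch of how a transfer format could carry them via a neutral mapping (the canonical element names are invented for illustration; the vendor labels are drawn from the lists above):

```python
# Invented canonical names mapped to each vendor's field labels for the
# journal-article source type, drawn from the three lists above.
FIELD_MAP = {
    "articleTitle": {"zotero": "title", "rootsmagic": "Article Title", "ftm": "Article Title"},
    "journalTitle": {"zotero": "publicationTitle", "rootsmagic": "Journal Title", "ftm": "Journal Title"},
    "volume":       {"zotero": "volume", "rootsmagic": "Volume", "ftm": "Journal Volume"},
    "pages":        {"zotero": "pages", "rootsmagic": "PageRange", "ftm": "Article Page(s)"},
}

def convert(record, src, dst):
    """Translate a record's field names from one vendor's labels to
    another's via the canonical names; unmapped fields are dropped."""
    reverse = {m[src]: canon for canon, m in FIELD_MAP.items() if m.get(src)}
    out = {}
    for field, value in record.items():
        canon = reverse.get(field)
        if canon and FIELD_MAP[canon].get(dst):
            out[FIELD_MAP[canon][dst]] = value
    return out

rm = {"Article Title": "Descendants of Charles Edward Hancock", "Volume": "xxiii"}
print(convert(rm, "rootsmagic", "zotero"))
# → {'title': 'Descendants of Charles Edward Hancock', 'volume': 'xxiii'}
```

Real mappings would need per-type tables and rules for split fields (e.g. FTM's separate author surname/forename), which is where the differences GeneJ found start to bite.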
gthorud 2011-12-01T09:52:35-08:00
Thanks for all the comments, and I am sorry that I have not replied earlier.

Terminology

- Master Source is used since just "Source" is considered by many to be ambiguous. I don't like Master Source either, but I found no good alternative.

- Citation should have been called Reference note, I agree, my mistake, will change.

- Citation Element is more difficult, since it describes things that may also end up in a template for a Bibliography, not only in a reference note. Attribute is too general a term.

- Metadata. Tom has described his understanding of the term as used in the technical world; I will use the term as it is used in the "bibliographic" world, where it is data ABOUT (Master) Sources – books, whatever.

I think everyone involved knows what the terms represent, and we can have more terminology discussion later. Until the next update, I will continue to use the terminology of the current version; there are much more important decisions to be made.

About page 2 in the model document


In the middle there are several blue squares for Evidence Explained, Dublin Core, OAI etc. The model assumes that one agreed-upon (standard) specification describing Master Source Types, Citation Element Types and Templates will be developed for Evidence Explained; possibly one for a set with general MSTs and CETs similar to what Tom mentions (I have informally called it "BG Core"); and a small one containing the CEs of current Gedcom. It is not assumed that, for example, each vendor implementing EE will have its own definitions – that would create chaos. In addition to those three, I assume that there will be user-defined definitions – in my opinion you cannot prevent that – but this will not work if users expect to be able to transfer many of their user-defined templates to another user; they should use "standard" definitions. I think this issue with user-defined definitions will police itself when people understand the consequences.

Finally, I do expect that within a country, or internationally, there will over time be additional "standard" definitions – possibly according to existing metadata schemes such as Dublin Core, OAI etc. (or one according to Zotero) – although the latter are currently more interesting as containers structured according to existing metadata schemes, into which you can import the external data and then use conversion to get them into EE, BG Core or Gedcom MSTs.

The BG model, scope and realm


Throughout the first posting Tom writes as if there were already an existing BG model, and as if there were a defined scope of BG. Well, there is not, except that BG is about transfer between programs, so I have little to add to those comments; they are Tom's opinion.

"BetterGEDCOM Core"


In Tom's first posting, third paragraph, starting "BG has its …", he mentions a "fixed set of the most important Master Source Types, which can be extended" and "a set of Citation Elements they can contain, suggesting the most important ones". I assume he means they will be defined in a standard or some other specification. As I have stated in the document "An Architecture …" that I published in August (see the first page of my model document), I would very much like to see such a solution, informally called "Better Gedcom Core" ("Core" inspired by Dublin Core) in the second page of the data model. BUT, I have not seen such a specification, and I am not aware of any current work that is going on. I will not be surprised if it takes at least a couple of years to develop it for international use. Also, how will the extensions mentioned be handled – by updating a standard, or …?

Evidence Explained


Unfortunately Tom ignores the fact that there is another solution that is not likely to disappear any time soon; it is called Evidence Explained, and it has been implemented by several major vendors. Who is the judge that has decided that users are not allowed to enter citation data according to EE, when the vendors support it? How are you going to enable transfer of data between these programs any time soon? How are you going to preserve the semantic context provided by the several hundred Citation Elements in EE? How are you going to handle the fact that these programs can have user-defined Master Source Types and Citation Elements – are you going to prohibit the transfer of these between, for example, a user's programs?
The proposed data model is, among other things, trying to satisfy a requirement to develop a solution for EE that can be developed relatively fast, a requirement that has been discussed many times in the Developer Meetings.

A platform

What I am trying to do is build a platform where both EE and a "BG Core" MSTS (to Tom's taste and mine) can coexist, simply by defining a way that data and definitions that already exist in many of the major programs can be exchanged.

Complexity

Tom characterizes my model as complex, but it should be compared to what you find in the major programs, not to DeadEnds. Most of the programs allow you to define Master Source Types, Citation Elements and Templates. The things I have added are the capability to exchange these definitions, to do conversion (using templates), and a few features that attempt to provide translation. So, most of the complexity is already implemented in the major programs.

Conversion

In the fourth paragraph of Tom's first posting, starting "And then …", he mentions "conversion from anybody's database into citations from anybody's template set, a monstrous problem". Yes, the model allows it, but I think that will regulate itself; I don't think you will see it in practice. The intention is to be able to convert between standardized or otherwise agreed MSTSs, hopefully including one for EE; from user-defined MSTSs to the standard ones; and from "external formats" such as Dublin Core, Zotero etc. into the "standardized" formats. I believe the practicality of conversion between MSTSs defined by many individual users is very limited.

If you look at Zotero, you may get an understanding of what conversion can achieve. Conversion will be an important feature in an environment where metadata is transferred automatically between various programs (and metadata schemes), rather than being typed in. For smaller vendors it will be much simpler to implement a general conversion mechanism rather than implementing conversion for each external scheme and database. We need to evolve the handling of metadata and citations beyond the methods used when Gedcom was designed more than 20 years ago.

Transfer of MST/CET and template definitions


In the paragraph numbered 4 in Tom's first posting, he defines what the realm of BG is, although there is no agreement on that. He states that transfer of MST/CET definitions is outside the realm, again assuming that there will be one standard à la "BG Core" that will contain all that users need. I do not believe there will be a standard that suits everyone anytime soon.

I live in a country with few native programs, and I am using several foreign programs which are supposed to be localized for my country. Well, they are not, in every detail, and it is necessary to enter user-defined source types – the major part of EE cannot be used here. Does the built-in support for styles in these programs suit my needs? Sometimes it does, sometimes it does not; it depends on where I publish – thank heaven the programs have user-defined templates. Currently, that requires someone importing my data to enter these templates. Assume that a genealogy journal here defines its own style. It will be much more efficient to distribute the definitions in a BG file rather than requiring each user to type them in. I have absolutely no hope that a foreign vendor will ever implement that definition, and if he does it will take years.

Programs going beyond the capabilities of current Gedcom will have to support more than 50 MSTs, even with a "BG Core", and are likely to have internal structures equivalent to MST and CET entities. If they want to allow their users some freedom, they will also have structures for templates. Importing and exporting these, if they are according to a standard, is not a major undertaking. With respect to the work that a vendor who has not implemented support for EE would have to do, I expect that it will be less work to implement an import/export feature based on an extension to Gedcom than to type in all the data required for the definitions in EE – and then the other benefits of such a feature come for free.

If someone is going to write the definitions required to support EE, with many translations, why not do it in a format that can be handled by every program implementing EE, and that can be used for other purposes as well?

Exchange of MST/CET definitions and templates gives at least these benefits:

- Users become much more independent of vendors, and can easily and quickly distribute definitions within a community (e.g. as a standard within a country) without having to rely on the willingness of a vendor to do it, or having to wait a year.

- Translations of these can be distributed quickly.

- A user can transfer his definitions between his programs.

- The definitions can be stored together with the data for long term storage.

- MST records are simpler to handle than if this info is carried in an MSI, e.g. an importer does not have to go through all MSIs to find the distinct MSTs. Translations of MST names are easily handled, and new translations can easily be introduced without a need to transfer the data again – in essence, users do not have to bother with translation.

- It will reduce the work needed by vendors, and this will benefit the smaller vendors.

- It allows implementation of source selection mechanisms tailored to users' needs, again without having to rely on a vendor.

Citation Element Types


In the paragraph numbered 5, Tom is again defining the realm of BG, and argues that a simple key/value pair is sufficient. I don't see how he envisages a program allowing international data transfer without the concept of a data type for person names etc., unless you standardize hardwired processing in the programs for each CE. Until a standard exists that defines every Citation Element that will be needed, there will be a need to identify the data type in transfer, and there will be a need for some translation capabilities – however limited.
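A small sketch of the point (the type names, registry and style labels are invented): if each Citation Element carries a declared data type, a receiving program can render it appropriately per context without hardwired per-element logic.

```python
# Invented Citation Element Type declarations: each element name is
# tagged with a data type, transferred alongside the data itself.
CE_TYPES = {"author": "personName", "title": "text", "accessDate": "date"}

def render(element, value, style):
    """Render one element value according to its declared data type;
    'style' selects, e.g., surname-first ordering for bibliographies.
    (Naive: assumes the last word of a personName is the surname.)"""
    ce_type = CE_TYPES.get(element, "text")
    if ce_type == "personName" and style == "bibliography":
        given, _, surname = value.rpartition(" ")
        return f"{surname}, {given}"     # invert for an alphabetized list
    return value                         # plain text passes through

print(render("author", "Richard James Hancock", "bibliography"))
# → Hancock, Richard James
```

With only untyped key/value pairs, every receiving program would need its own hardwired rule for which elements are names, dates, and so on.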

About Zotero


Zotero is partly based on the "Citation Style Language" (see e.g. my summary on the wiki), which is so far able to handle published (incl. digitally published) sources. It has to be expanded for e.g. archival material and other types of sources used by genealogists.

CSL and the programs interpreting it can roughly be said to provide the functionality of genealogy programs, the user configuration of them with respect to citations, and the processing of templates.
Zotero and CSL are not entirely consistent; they use different names for their equivalent of a CE because, as far as I know, they have different origins – so there is data loss in the mapping between Zotero and the CSL rendering engine.
Zotero is based on a web database containing citation style definitions (1750+, but many are the same style under a new name), each of them localized geographically by a single file that applies to all styles. A problem is that the MSTs (called item types) are in several cases hardwired in the style definition, so you may not be able to easily extend the definition with new MSTs without doing some "programming".

A BG Core could be based on an extension of the "variables" (=CEs) in CSL (ca. 60) and/or Zotero (100+), but many extensions would be needed to cover all the CEs required for the sources that genealogists use (a comparison of how many pages EE devotes to published material versus other kinds of sources might give an indication).

An aside: Re. generating URL


Last spring I sent a request to our National Archives asking if they had a guide for citations. The first reply I got was from someone working on their metadata database, and it said: "Use a URL identifying our database and a number, that's it!" (every electronically cataloged item has a number, unique across many of our archive institutions). Well, maybe some time in the future …

The archive does not have a hierarchical structure of numbers as in the UK. The archive's metadata database has links to a large collection of freely available digitized sources. Creating URLs that allow programs to access sources on a server can be done by entering the URL in a CE, and there is a data type for that, but supporting all the ways those URLs can be structured requires more work and is not necessary for the production of citations.

Summary

In summary, I have long been arguing for a simpler way to do things than in EE, but have realized that it will take some time to get there. At the same time I see EE existing in the real world, and it cannot be ignored. I also see many other schemes for doing the same thing, and variations between countries. I therefore see a need for a platform, which is not as complex as Tom tries to make it, that can support many ways of doing things – something that can be implemented without waiting for one solution that will suit all. When a platform has been defined, work can begin on standardizing EE and BG Core. I expect those who see "BG Core" as the only solution to start working on it right away.



I have not read the postings after GeneJ's posting here http://bettergedcom.wikispaces.com/message/view/A+Data+Model+for+Sources+and+Citations/47134014#47286752

Geir
gthorud 2011-12-01T14:45:07-08:00
About two level multilevel sources

My model does not currently support 2-level sources. One reason is that I have never gotten around to evaluating the pros and cons. At this stage I have no strong opinion about such a scheme.
The concept is described in Gentech. There is at least one program that has had two-level sources for a long time, Genbox, but the next version is supposed to also support Evidence Explained, so one might ask what the reason for that is. If I have understood it correctly, FamilySearch has been working on something similar in their ICE work. I have been told that that project is not advancing, but I cannot confirm it – in any case they had not advanced beyond published sources this fall.

My model could relatively easily be extended to support two levels by adding a link from the lower-level source to the higher-level source. It will also be necessary to enhance templates to allow referencing the correct CEs, at the appropriate level, if the same CE (e.g. Title) can be used at both levels. It may also have implications on the complexity of …

The major benefit of having two levels is reuse of data – you enter e.g. the Journal only once. It will also reduce the number of CETs, e.g. no need for a separate CE for the Journal Editor.
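The reuse idea above – enter the Journal once and let the lower-level source inherit it – can be sketched in a few lines. The class shape and field names below are purely illustrative, not the draft model's actual structures; the point is only the link from the lower level to the higher level and a local-first CE lookup.

```python
# Illustrative sketch only: a lower-level source links to a higher-level
# source, so higher-level citation elements are entered once and reused.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Source:
    mst: str                              # master source type, e.g. "journal"
    elements: dict = field(default_factory=dict)
    higher: Optional["Source"] = None     # link to the higher-level source

    def element(self, key: str):
        """Look up a CE locally first, then fall back to the higher level."""
        if key in self.elements:
            return self.elements[key]
        return self.higher.element(key) if self.higher else None

journal = Source("journal", {"title": "Some Journal", "publisher": "P"})
article = Source("journal-article", {"title": "Some Article"}, higher=journal)
print(article.element("title"))      # local value wins: Some Article
print(article.element("publisher"))  # inherited from the journal: P
```

Note that the local-first lookup is exactly why templates would need a way to say *which* level's Title they mean when the same CE exists at both levels.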

If you look at Genbox 3.7, you will find ca. 18 lower-level MSTs out of 95 MSTs. The usefulness must be evaluated against the complexity in implementation and the data entry process. A question is whether there are other implementations.

A similar splitting into several MSTs can also be envisaged for e.g. republications, digitization, transcription etc. of a source – but I have not seen any examples. I think Louis has mentioned this possibility earlier, and I have something similar in my "Architecture …" document, although solved differently. ICE may also have looked at this. If the above levels are considered vertical levels, the republications etc. could be considered horizontal "levels". A problem is that there could be many levels, so entering sources could become complex, but that could be handled by adding levels to e.g. the Master Source and Source Detail level in the Source Edit dialog in RootsMagic (this also applies to Tom's "vertical levels"). It may also have serious implications for the number of templates (unless you employ the "CE module" concept in my "Architecture …" document).
In any case this will need much more work and evaluation than a single example. The place for it could be in what I have called "BG Core".

Geir
GeneJ 2011-11-29T13:09:08-08:00
Scope of the work proposal: Most of us are members of BetterGEDCOM because we believe GEDCOM does not support features users either want vendors to develop, or features that have been developed and whose results users want to share. Many, me included, believe GEDCOM's single biggest failure relates to its inability to transfer work from programs that implemented expanded source and citation systems.

We can debate the features and benefits of these expanded systems, and we can be frustrated by the differences in program implementation, but we cannot ignore the large user base affected by GEDCOM's inability to transfer the work.

Now, if you don't think we should support these existing technologies, then I suppose you may feel parts of Geir's proposal are beyond the scope of BetterGEDCOM. I do, so I don't.

Technical Aspects: Tom's posting questioned some of the technical aspects of the work proposal. I might respond more after we've all had a chance to discuss the scope issues.


P.S. It is metadata. I'm surprised technologists/developers would find that term awkward. Wonder if that isn't a bit of a tell about our struggles with sources and citations.
GeneJ 2011-12-03T06:57:09-08:00
Meet Dublin Core aka DCMI
References:

Geir Thorud, "A Data Model for Sources and Citations," 1-3
http://bettergedcom.wikispaces.com/A+Data+Model+for+Sources+and+Citations
http://bettergedcom.wikispaces.com/file/view/Sources%20and%20Citation%20data%20model%20DRAFT%20v%200.4%2027nov2011.pdf

Dublin Core (R) Metadata Initiative home page
http://dublincore.org/

Dublin Core Metadata Element Set, Version 1.1
http://dublincore.org/documents/dces/

Dublin Core, "Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata."
http://dublincore.org/documents/dc-citation-guidelines/


"Dublin Core," _Wikipedia_
http://en.wikipedia.org/wiki/Dublin_Core

MARC to Dublin Core Crosswalk - developed by the Library of Congress (US)
http://www.loc.gov/marc/marc2dc.html

Zotero, "Import/Export Field Mappings" (scroll to "Unqualified Dublin Core RDF:")
http://www.zotero.org/support/kb/field_mappings

OCLC Research and the Dublin Core Metadata Initiative
http://www.oclc.org/research/activities/past/orprojects/dublincore/default.
htm

Dublin Core is bibliographic metadata terminology. There were originally 15 metadata terms (the "Dublin Core Metadata Elements" aka "Simple Dublin Core" and "Unqualified Dublin Core"). These 15 elements are:

1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights

These 15 original elements are part of the following standards:

IETF RFC 5013 [5]
ISO Standard 15836-2009 [6]
NISO Standard Z39.85 [7].

Qualified Dublin Core
Qualified Dublin Core includes three more elements (Audience, Provenance and RightsHolder), and, in an ongoing process, extended terms. Please see the Wikipedia article for a good overview and the Dublin Core web site for discussions.
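To make the connection to the citation-element discussion concrete, here is a minimal sketch of a crosswalk from genealogy-side citation elements to the 15 Simple Dublin Core elements listed above. The genealogy-side tag names on the left are hypothetical; the Dublin Core terms on the right are the real DCES 1.1 element names. A real crosswalk (compare the MARC-to-DC crosswalk referenced above) would of course be much larger.

```python
# Hypothetical CE tags mapped to real Simple Dublin Core element names.
CE_TO_DC = {
    "author":     "creator",
    "title":      "title",
    "publisher":  "publisher",
    "pub_date":   "date",
    "language":   "language",
    "repository": "source",
}

def to_dublin_core(elements: dict) -> dict:
    """Keep only elements with a DC equivalent, renamed to the DC term."""
    return {CE_TO_DC[k]: v for k, v in elements.items() if k in CE_TO_DC}

print(to_dublin_core({"author": "G. Thorud", "page": "235"}))
# {'creator': 'G. Thorud'}
```

Note what the example also shows: elements with no DC equivalent (here "page") simply drop out, which is the data-loss problem discussed elsewhere in this thread.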
GeneJ 2011-12-03T07:17:38-08:00
In the prior posting, a link to "OCLC Research and the Dublin Core Metadata Initiative" broke apart.

http://www.oclc.org/research/activities/past/orprojects/dublincore/default.htm
GeneJ 2011-12-03T07:22:47-08:00
Here are links to the references about the published standards mentioned in the original posting:

IETF RFC 5013
http://www.ietf.org/rfc/rfc5013.txt

ISO Standard 15836-2009
http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=52142

NISO Standard Z39.85
http://www.niso.org/kst/reports/standards?step=2&gid=None&project_key=9b7bffcd2daeca6198b4ee5a848f9beec2f600e5
gthorud 2011-12-06T08:43:35-08:00
Similarities and differences between sources around the world
I will try to single out this topic from the mega discussion "Tom's First Comments on Draft 0.4".

Tom asks here http://bettergedcom.wikispaces.com/message/view/A+Data+Model+for+Sources+and+Citations/47134014?o=40#47543524 about differences between sources in different countries:

"Though I do wonder why the rest of the world would need many more MSTs than North America. One would think that books, journals, certificates, land records, censuses, parish registers, are fairly universal. Are there sources of evidence in other parts of the world that are very different from those used by North American genealogists? I know about the Scandinavian “farm books.” Does this require a new MST or can it fit into a more general category like a diary or a register or a journal? I don’t know; just asking. I can imagine that there might be a few new evidence sources to worry about, but would they add very significantly to the overall number of MSTs? Do you think the fact that the non-North American world has some different types of evidence is a problem for the element-only approach?"

Here is a first try on a very complex issue.

It is very difficult to answer these questions. I seem to remember that there was some discussion about it before, but that might have been off wiki. I remember Adrian informing us about the "Archival hierarchy" in some parts of the UK, and there may have been more.

There are really two issues here, what are the differences between source types, and what are the differences between citation styles.

When I look at Evidence Explained 2007, page 235, there is a style for Census Records. It has e.g. jurisdiction and civil division. So if an MST has CEs based on the organizational structures in countries, they are certainly different. If you touch on the different types or various forms used for different purposes, there are certainly differences (not to mention the various fields used in the forms). So in a world-wide generic "Census MST", you would have to create VERY generic CEs – and I am not saying it is impossible, but I simply don't know how difficult it is. In any case you will have to do some research to find out, and censuses are just one of a HUGE number of source types – some being more important than others, but you have to handle them all.

If you were to handle censuses as just any other "collection/document" in an archive (without going into their internal structure) it would be easier, but censuses are frequently used documents, so people tend to create special MSTs, e.g. for easy (rapid) entry of data and easy identification of where to enter the data – EE tailors the CEs to the document. You cannot do that with a generic source.

When it comes to citation style, my guess is that you will only find a guide as detailed as EE in the US, but I have not checked. I know very little about citation guides or practices in other countries.

When it comes to Norway, there is no standard for genealogy citations, and I do not expect to see one before it is possible to exchange citation data between programs (other than by means of current GEDCOM). There was an initiative that aimed at developing a citation guide a few years ago, but that work has stopped without any results. My impression is that students in the "human sciences" at the universities are told to use either Chicago or Harvard. When I look in the only scientific genealogy journal we have, many authors seem to use Chicago (or something close to it), but more free-form citations – or rather footnotes with embedded citations – are also common. "Farm books", or rather "Bygdebooks" covering the farms and population in a geographic area over time, are just "books".

But since there are a lot of different sources, and a lot of cultural differences around the world, they could easily create problems for an "element only" approach. When I look at how long e.g. the library community has worked on standards for metadata, and when I look at the complex conversion engine being developed to convert various metadata from different countries in Europe into Europeana ( http://www.europeana.eu/portal/ ), it indicates that you will probably have to do a LOT of work before you know the implications of this diversity on an "element only" solution – and you may never get a 100% certain answer.

This does not mean that I am against working on "generic MSTs" and "CEs", but you will always need an "escape mechanism". Generic work should probably start with published sources, since I believe that the differences are smaller around the world. If you have generic sources for those you are some way on the road…
ttwetmore 2011-12-06T09:44:22-08:00
@Geir,

Thanks. From this I don't see any problems with an element-only approach. I hope you know that when I say "element-only" I am only referring to the BG data itself. Of course there have to be source templates and there have to be source type definitions. My view about "element-only" is only that the templates and type definitions are not embedded in the BG data. There are three big problems here: 1) structures for sources and citation elements; 2) source templates; 3) source type specifications. All three must be solved, but only the first is necessary for BG file formats. I think it is legitimate to ask how we can determine the list of CEs before problems 2 and 3 are fully solved. This is what I am now calling "task 3." I think we do it by gathering together all the popular standards – Chicago, Harvard, EE, etc. – and finding the common denominators.
gthorud 2011-12-06T12:02:36-08:00
Tom,

It is very confusing that in one posting you talk about ONE generic MST, and now you talk about different source types and templates.

See the "usage scenarios" that I just created.
ttwetmore 2011-12-06T13:00:09-08:00
I have never talked about one generic MST except in the context of having an EXTRA MST for the purpose of handling cases where there is no appropriate type. I am sorry for causing the confusion. I said something like "we could add an extra generic MST to handle other cases," and I made the point that we could even have a source template for the generic MST that could generate a catchall type citation string. In other words, have a GEDCOM-like escape value.

Again I apologize for generating so much confusion.
gthorud 2011-12-06T14:43:17-08:00
Tom,

I am not sure that using the generic MST for extensions only makes it non-problematic. As long as you do not transfer a template, you will still need big machinery at the receiving end that can render any combination of CEs – or you must have some other mechanism to transfer the template for that instance of the generic MST. The only difference is that the big machinery will not be used as much, but what does that matter if you have to implement it anyway?

Also, if you do not transfer a separate MST definition for the extended type, it is just an inefficient way to do it.
ttwetmore 2011-12-06T17:05:51-08:00
@Geir,

Nothing so complicated. I'm only suggesting that it would be possible to still generate a citation string for any source type by imagining a default template that would look for a set of the commonest citation elements in some order (say title, author, ...). I am not suggesting this as the normal way to work, simply suggesting something that could be made to work in extreme cases.

Just imagine a vendor who doesn't want to implement source templates in any way. There will be some like that. If you were such a vendor, how would you generate citation strings from a set of citation elements? Wouldn't you just look for a few of the important elements, e.g., title, author, name, and so on, and format those values in some fixed manner? I am not suggesting at all that this is the right thing to do, but only suggesting it as an easy course for a vendor who hasn't the time or cleverness to implement a full-blown template scheme.

Please realize that if BG does support a mechanism that includes source templates and a large variety of different source types, this will put intense pressure on vendors to implement features that they may have no interest in or little technical ability to understand. Simply by accepting a wide variety of citation elements and exporting them, they are way ahead of GEDCOM. They could even implement a very primitive UI screen that would allow their users to add sources and citation elements, simply by giving the users a menu of all the standard source types, and the ability to add citation elements with any tags to those sources. Then the program needs hardly any support for sources and citations, but still lets the user create the necessary records and structures.

No, you don't need big machinery at the receiving end to render any combination of CEs – just something that goes down a list of element tags in some predetermined order. Please, I am not suggesting anything like this as the normal way of doing things. I am talking about how vendors not interested in implementing full support, or users inventing new source types, can get very reasonable behavior with a very small investment of time and effort.
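The "go down a list of element tags in some predetermined order" idea above can be sketched in a few lines. The tag names and their ordering are hypothetical illustrations, not part of any BG specification; the point is only how small such a fallback renderer can be.

```python
# Minimal sketch of a fallback "default template": walk a fixed priority
# list of citation-element tags and join whatever values are present.
# Tag names here are hypothetical, not a proposed standard.
FALLBACK_ORDER = ["author", "title", "publisher", "date", "page"]

def fallback_citation(elements: dict) -> str:
    """Render a citation string from whatever common elements exist."""
    parts = [elements[tag] for tag in FALLBACK_ORDER if tag in elements]
    return ", ".join(parts) + "." if parts else ""

print(fallback_citation({"title": "Parish Register of X", "date": "1850",
                         "extra": "ignored"}))
# Parish Register of X, 1850.
```

As the example shows, elements with tags the fallback does not know about are silently dropped, which is the data-loss objection raised later in this thread.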
gthorud 2011-12-10T08:52:18-08:00
Tom,

A very simple solution based on just a few CEs is already in my data model – look at bullet #12 on page 4. The solution is called GEDCOM. For backwards compatibility, you could require each MSI to be accompanied by the source and citation data encoded according to current GEDCOM, in addition to the data used for a scheme with many CEs. Special conversion templates would generate them. It would then be the job of the program implementing the advanced scheme to provide a simple version of the data, and not the small vendor's program. Such a solution will be required in any case, since you cannot expect all vendors and users to move to BG at the same time.

A vendor trying to implement a solution based on a very small set of CEs will have a problem with data loss. You mention a few CEs, but in practice I think the number of "general" CEs needed to support all MSTs will be greater than 50. If you choose only 20–30 of those you will have data loss, or loss of semantics, or loss of "translation ability", not to mention loss of flexibility for users.

Are you suggesting that data loss is acceptable? If a vendor does not want to, or is incapable of implementing a solution without data loss, that is his problem.

I have no doubt that if you make a number of tradeoffs, yet to be defined, you could create a solution that would work in many of the cases, but until such a solution is described, and we know the tradeoffs, I cannot gamble on its feasibility. And in any case it will not support Evidence Explained.

You suggest that a user should be able to add any CET to a source, and you are again assuming that a vendor will be able to have a very simple implementation that will be able to output a proper citation for all the possible combinations of CEs. I just have to repeat that I have not seen any description of such a simple solution, so I cannot assume it is possible. You are very welcome to prove me wrong, but you can't do that by mentioning just 4 or 5 CEs that you would see in your solution.

Also, reading the description of your solution, it is not as simple as you suggest: a vendor will still have to implement a solution for storing definitions of MSTs and CETs. The only thing you leave out is templates, which you think can be handled by a universal super template. Yet in other discussions you assume there will be templates for use in the "normal cases", but you do not want to transfer them. I suggest you provide an overview of what you expect vendors to implement.

It is interesting that you want to block solutions that are already to a great extent implemented in existing programs with a solution that you have shown no intention of proving will work. If you believe there is a smaller set that would not cause data loss or other problems, I suggest you start compiling a list of a few commonly used CETs (not only for published material) so we can get some idea of the feasibility of your solution.
ttwetmore 2011-12-10T09:31:37-08:00
Geir,

A vendor trying to implement a solution based on a very small set CEs will have a problem with data loss.
Indeed he will.

You mention a few CEs, but in practice I think the number of "general" CEs needed to support all MSTs will be greater than 50.
Fine, that’s not too many. 10 is too few. 100 is too many.

Are you suggesting that data loss is acceptable?
That’s FUD. I will not honor it with a reply.

...And in any case it will not support Evidence Explained.
I don’t believe that.

You suggest that a user should be able to add any CET to a source, and you are again assuming that a vendor will be able to have a very simple implementation that will be able to output a proper citation for all the possible combinations of CEs.
You have previously made this claim, and I have explained in detail that I never said what you say I did. I'm not going to repeat myself.

Also reading the description of your solution, it is not as simple as you suggest, a vendor will still have to implement a solution for storing definitions of MSTs and CETs
All I have ever said is that BG files should hold only sources and source references in order to transmit citation information. Of course the programs will have to understand the source types. Of course the programs will have to be able to interpret templates. I don’t believe in magic.

...the only thing you leave out is templates which you think can be handled by a universal super template.
1. I have never left templates out. 2. I have never suggested a single super template.

Yet, in other discussions you assume there will be templates for use in the "normal cases", but you do not want to transfer them. I suggest you provide an overview of what you expect vendors to implement.
I’ve made my view on this clear from the beginning. The vendors must be able to interpret templates, in whatever format they choose to use, with respect to the source and source references found in BG data. It is not BG’s job to dictate what format those templates should be in. It would be great if there were a standard defined for them so that vendors could all use the same ones, but that is not BG's responsibility.

It is interesting that you want to block solutions that are already to a great extent implemented in existing programs
FUD. I have no idea where you come up with such preposterous statements. I have no interest in another rebuttal after I have explained my views in detail.

.., with a solution that you have shown no intention of proving will work.
I am implementing this in DeadEnds. Where did you get your knowledge of my intentions?
gthorud 2011-12-10T14:01:26-08:00
Tom,

Do you think BG should define a format for transfer of definitions of MSTs, CETs and templates? What about requiring MST instances to refer to them?
ttwetmore 2011-12-10T17:44:31-08:00
Geir,

Do you think BG should define a format for transfer of definitions of MSTs, CETs and templates? What about requiring MST instances to refer to them?

This may be the key difference between our views. In my opinion BG should not transfer definitions of MSTs, CETs & templates. However, BG must define the "official BG" set of MSTs and CETs and write those definitions into the BG specifications. Using those specifications vendors have the information they need to develop their own approaches for generating citation strings. The best approach for them would be to use a standard set of templates, and we can strongly recommend that they do so, but ultimately it is the vendor's responsibility to decide how they wish to process source and citation elements.

Templates are third party entities. If BG chooses to participate in the effort to define those templates, that is a worthwhile task. This is my opinion, of course. My opinions are driven by one overriding idea -- BG should be the format that can fully archive and transport all and only genealogical data. I view information about actual sources, primarily citation elements, as genealogical data, and therefore, must be in BG files. But I don't consider definitions of source types, or the definitions of templates to generate final citation strings to be genealogical data. This says nothing about the importance of source type definitions or template definitions (I believe they are both very important); it just says that I don't believe they belong in BG files.

There are so many source types, defined by so many organizations, that BG can never be inclusive enough to handle them all directly, with 100s if not 1000s of custom source types and custom citation element types. Hence my tasks 3 and 4, which are to determine the best compromise set of source types that can be fit to all the different types of external source types. For this we have to have faith that we can take, for example, a group of related EE source types and use a single BG source type and a single template for that group, and so on with other organizations. If you look at the vast number of EE source types you see that many are simple variations on a theme. I see it as BG's job to boil down those variations into a smaller set of more general source types, one for each "theme." Task 3, in a nutshell, is to enumerate those themes.
louiskessler 2011-12-10T19:24:45-08:00

I totally agree with Tom on this.

Louis
WesleyJohnston 2011-12-11T05:31:40-08:00
I am totally lost on this thread, since it begins with acronyms that are never explained. What is an MST? What is a CE? a CET?
ttwetmore 2011-12-11T06:08:11-08:00
@Wesley,

The acronyms are defined in Geir's document. There are some others defined there too. Here is how I would define the three you are asking about:

MST -- master source type (I prefer dropping the "master"): something to specify the type of source/evidence being described in a source record. Current discussion is whether the MST should be conveyed as a keyword, a type/subtype pair, a structure of keywords or a dotted keyword (both allowing complex subtypes), a URN (which can also convey complex subtypes), or a UUID (which is meaningless to human brains).

CE -- citation element (sometimes called metadata), or what I prefer to just call an attribute. It is an attribute of a source record or a source reference that describes some aspect/property of the source (e.g., title, author), or some aspect of a location in a source where evidence was found (e.g., page, volume).

CET -- citation element type. As attributes, citation elements have keys and values; I'm not sure what a CET is, whether it defines the type of a value with a specific key, or whether it is the specification of what the key's value is intended to mean. In both cases the type would be used to indicate how the string holding the value should be parsed to determine its more specific type. For example, the value of a date attribute in a BG file is a string, but that string should adhere to some standard for expressing date values so that it can be parsed and understood. The GEDCOM name value is another good example, since by adding slashes around the surname the full value of most personal names can be conveyed as a single string that is the value of the NAME attribute.
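The NAME example above is easy to make concrete, since GEDCOM really does delimit the surname with slashes (e.g. "John /Smith/ Jr"). The helper below is a small illustrative parser for that one convention, not a full GEDCOM name handler.

```python
# Illustrative sketch: recover structure from a slash-delimited GEDCOM
# NAME value. The GEDCOM slash convention is real; this parser is a
# simplified illustration, not a complete implementation.
import re

def split_gedcom_name(value: str):
    """Return (given, surname, suffix) from a slash-delimited NAME value."""
    m = re.match(r"^([^/]*)(?:/([^/]*)/)?(.*)$", value)
    given, surname, suffix = m.group(1), m.group(2) or "", m.group(3)
    return given.strip(), surname.strip(), suffix.strip()

print(split_gedcom_name("John /Smith/ Jr"))  # ('John', 'Smith', 'Jr')
```

This is the general point about CETs: the transferred value stays a plain string, and the type tells the receiving program which such parser to apply.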
gthorud 2011-12-06T11:48:34-08:00
Usage scenarios
Once again, I will try to separate out an issue from the mega discussion "Tom's First Comments on Draft 0.4".

Unless you believe that an "element only" (one generic MST) solution, covering all the sources in the world, is the only way to go, you will need some way to communicate definitions of MSTs, CETs and Templates.

Various ways to transfer those data have been proposed, and several alternatives are described in version 0.4 of the model – see "Entity presence – usage scenarios" – but note that after the discussions we have had some of them may not be relevant, instead see below.

It is assumed that the principle of "independent record collections" (see the Requirements Catalog) apply, so that definitions and/or MSI ((Master) Source data) can be transferred without information about persons, places etc.

There are at least three ways to transfer the definitions: in a BG file; by downloading them from a web server or some other file transfer method; or on paper – i.e. the vendors have to implement the first two, or the user has to type the definitions in. And maybe there are others?

The two "non-paper" methods will require encoding in some syntax. If we see benefit in transferring the definitions together with the data, the syntax will be that of "BG", and could even be that of current GEDCOM. The same syntax could be used for download from a server – some good arguments would be needed if we were to choose a syntax different from that of the data. It is simpler for a program to implement only one syntax, independent of the transfer/download method and of how the definitions and genealogy data are bundled. In any case, the data structures that will be encoded must be defined.
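To illustrate the "same syntax as the data" option, here is a toy serializer that writes an MST definition in GEDCOM-like level-number syntax. The record and tag names (_MSTDEF, _CET) are invented placeholders for this sketch only – nothing here is a proposed GEDCOM or BG tag.

```python
# Toy sketch: emit an MST definition in GEDCOM-like level-number syntax.
# _MSTDEF and _CET are invented placeholder tags, not proposed standards.
def serialize_mst(name: str, cets: list) -> str:
    lines = [f"0 _MSTDEF {name}"]
    for cet in cets:
        lines.append(f"1 _CET {cet}")
    return "\n".join(lines)

print(serialize_mst("census", ["jurisdiction", "civil_division", "year"]))
```

The appeal of this option is exactly what the paragraph above says: a program that can already parse the level-number syntax for data needs no second parser for definitions.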

It is also important to note that there are two main usage situations:
- Transfer between the programs of two different users
- Transfer between the different programs used by one user

There have been several relevant postings in the topic "Tom's First Comments on Draft 0.4".

I have described an important principle here http://bettergedcom.wikispaces.com/message/view/A+Data+Model+for+Sources+and+Citations/47134014?o=40#47531988 and will repeat it here:

"In general the importing program should be the one to control how Citations are rendered in reports etc., and will be the one that selects the template or other method. If a program has a preferred template/method in the appropriate language that supports the MST of an imported MSI, it should in general be expected to ignore a Template for the MST transferred together with the MSI, and this is what the exporter of the file should expect, unless there is some special agreement that would apply to all templates with corresponding MSIs in the file. The importer could at his own discretion choose to use the template in the file, if for example he does not have an appropriate template, but the exporter cannot expect this to happen without prior agreement.

My model contains a feature that allows a Citation Instance (reference note) to point to a template that will overrule the one defined for the MST. This template could for example contain just text to be rendered in the citation. The same rule as in the previous paragraph must apply to these overruling templates, so the importing program could choose to use its own template, the one supplied in the file for the MST, or the overruling one.

The user of the importing program should be able to control what is to happen."
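The selection principle quoted above – the importer's own template is preferred, the file's template is a fallback, and an instance-level overruling template is honored only at the user's discretion – can be sketched as a small decision function. All names and parameters below are hypothetical illustrations of the rule, not part of the draft model.

```python
# Illustrative sketch of the importing program's template choice.
# Names and parameters are hypothetical.
def choose_template(own_templates, mst, file_template=None,
                    instance_override=None, user_allows_override=False):
    if instance_override is not None and user_allows_override:
        return instance_override          # per-citation overruling template
    if mst in own_templates:
        return own_templates[mst]         # importer's own preferred template
    return file_template                  # fall back to the one in the file

tpl = choose_template({"book": "<own book tpl>"}, "book",
                      file_template="<file tpl>")
print(tpl)  # <own book tpl>
```

The user_allows_override flag stands in for the last sentence of the quote: the user of the importing program controls what happens.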

Some of the possible scenarios I see are:

1. The base scenario is that the two programs are known (expected) to have the same definitions, so there is no need to transfer definitions – but how do you know that?

2. Transfer of user defined definitions (MSTs, CETs and templates) between the programs of one user, possibly together with MSIs (the genealogical data).

3. Transfer of user defined data (see 2) between different users, when there is prior agreement that this is useful – it may be the only way to do it, even if it may result in different citation styles in reports.

4. Distribution of "standard" definitions agreed within some community/country.

5. Transfer of citation templates for some language only, if the MSTs/CETs are already known by the program.

6. Download of "standard definitions" for a country, definitions that your program does not know.

7. Request the same as in 6 from another user, transferred in a BG file, if it can't be downloaded.

8. Storage of data and definitions for long term storage – the definitions may not be available any other place when you open the file years later.

9. Transfer or download as in 5-7 when you need to convert your MSIs from one standard to another.

10. If a set of MST definitions is based on a common (general) set of CETs, possibly updated over time (bug-fixed), the set of CETs could be downloaded – without transferring the MST definitions. This would require some rules for how much the CETs could be changed.

Are there others?

It should be noted that transfer of definitions allows users to operate independently of vendor support for the various definitions.

If the "file size" added by definitions is not a problem, maximum interoperability will be assured if the definitions needed by the MSIs in a file are transferred in the same file. Remember that most of the "file size" will in many cases be due to multimedia.

It will be better to transfer sets of MSTs, CETs or Templates than, e.g., to download them independently.

Comments and other scenarios are welcome.
ttwetmore 2011-12-06T12:52:51-08:00
@Geir writes:

Unless you believe that an "element only" (one generic MST) solution, covering all the sources in the world...

Is this what you think the "element only" solution is? How would you ever be able to generate citations if there were no source types in the data? Is this what you think I advocate? Have you read my examples, e.g., the article from the journal?

I am very sorry for the misunderstanding, but I have never advocated a system with only one generic source. "Element-only" to me means that source and citation information in BG files is made up of source records and source references that both contain citation elements. The source records have types. That has been explicit and implicit in everything I have said about them. I am very surprised that you didn't know that. "Element-only" means (to me) that the templates are not in the BG file, and that the definitions of the source types are not in the BG file.

Here are the four tasks I have suggested that BG needs to do to "solve" the source and citation problem. I have listed them and alluded to them frequently over the past few days.

1. Define the structure of a source record.
2. Define the structure of the source reference, a substructure used in records to refer to source records.
3. Define the list of tags for source types.
4. Define the list of key/value pairs that can appear as attributes/citation elements/metadata in the sources and source references.


Note task number 3. That is the task of determining the source types. Maybe I was too terse in stating that task. Maybe I should have written: 3. Define the list of tags for source types to be used as the source types in the source records.
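As a sketch only (Python, with all names invented here and no claim to actual BG syntax), the four tasks amount to data structures like these:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of tasks 1-4 above; names are assumptions,
# not BetterGEDCOM syntax.

@dataclass
class SourceRecord:                                # task 1: the source record
    source_id: str
    source_type: str                               # task 3: a tag from an agreed list
    elements: dict = field(default_factory=dict)   # task 4: citation elements (key/value)

@dataclass
class SourceReference:                             # task 2: used inside other records
    source_id: str                                 # points at a SourceRecord
    elements: dict = field(default_factory=dict)   # reference-level elements, e.g. page

census = SourceRecord("S1", "CensusRecord",
                      {"TITLE": "1850 U.S. census", "STATE": "Ohio"})
ref = SourceReference("S1", {"PAGE": "page 59 (stamped)"})
```

Note that in this "element-only" shape the templates and type definitions are nowhere in the file; only typed records and key/value elements travel.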

I was quite frankly shocked to read what you wrote above. If I have caused you to misunderstand such a basic point about what I believe, about something I have tried to describe in excruciating detail, I can only wonder what other misunderstandings I have caused. My sincere apologies.
testuser42 2011-12-08T09:22:46-08:00
Geir, about your scenario 1:
1. The base scenario is that the two programs are known (expected) to have the same definitions, so there is no need to transfer definitions – but how do you know that?
I'd suggest that every definition gets a Universally Unique ID (UUID) or Uniform Resource Name (URN). Then the receiving program checks the incoming data for the UUID or URN and compares this to its own list of already known definitions. That should do it, shouldn't it?
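A minimal sketch of that check (Python; the registry contents and function name are my assumptions):

```python
import uuid

# The importing program keeps a registry of definition UUIDs it already knows.
known_definitions = {
    uuid.UUID("8c5d1f2e-0000-4000-8000-000000000001"),  # e.g. a standard census MST
}

def needs_import(definition_uuid: str) -> bool:
    """True if an incoming definition is unknown and must be read from the file."""
    return uuid.UUID(definition_uuid) not in known_definitions
```

Comparing parsed UUIDs rather than raw strings makes the match robust against case and formatting differences.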
ttwetmore 2011-12-08T11:05:54-08:00
Testuser,

What are these definitions the definitions of?

In none of my proposals are definitions transmitted in BG files, just source records (with their types of course), and source references, both with citation elements (aka attributes, aka metadata).

It's a big leap to decide to also send non-genealogical data in the files. I think we need a very big reason to convince us to do so. I don't see such a reason yet. Geir can speak fairly convincingly about needing them, but frankly I think he has gone way overboard on what the nature of BG should be vis-à-vis sources and citations.

And I would hope that source types don't have to be handled as UUIDs, though there is at least an argument for that. Same argument Tony uses for making them URNs.
testuser42 2011-12-08T14:32:08-08:00
Tom,
What are these definitions the definitions of?

I was thinking of "Master Source Templates" (MSTs) and citation-style-template definitions (as I think Geir was talking about these, but I might be confusing things).
[Aside: I have trouble remembering what exactly these abbreviations mean (MST, CET, MSI, wasn't there a MTST too?) Please, Geir or GeneJ, update the Glossary of Terms... Thank you!]

I wouldn't send them in the BG file, but in an external file inside the BG container.
There wouldn't be a lot of times one would want to send this kind of definition along with BG data. Most probably only when some User constructs his own (Master?) Source Type or a custom Citation Element.
Theoretically, institutions with lots of data could have their data in a BG format and they might want to offer the templates for citing their sources "correctly" (possibly including a Citation style sheet). Of course these recommendations should be separate from the data, and the importing user should be free to use any other style for his citations.

I totally agree it's not our job to write all the (Master) Source Templates and Stylesheets and define _all_ the CEs.
We can only try to define the "core" CEs, as you and others have said.
But I do think we should demonstrate how (M)STs and Stylesheets and custom CEs should be built, so that the syntax is clear for any vendor who wants to support it. (I think this is kind of what you and Tony did in the other thread)


And we might offer fallback rules: if your software can't deal with
a) MSTs and Stylesheets then... just ignore them
b) custom CEs... show them as they are, one after another, key:value or key:type:value. And export them as they are, don't interpret them
(my suggestions, don't know if they're good)
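Fallback (b) could look like this sketch (Python; the field names are my assumptions):

```python
def fallback_render(custom_elements):
    """Render unrecognized custom CEs verbatim, one after another, as
    key:value or key:type:value; a program using this fallback would also
    re-export the elements unchanged, without interpreting them."""
    parts = []
    for ce in custom_elements:
        if "type" in ce:
            parts.append(f'{ce["key"]}:{ce["type"]}:{ce["value"]}')
        else:
            parts.append(f'{ce["key"]}:{ce["value"]}')
    return "; ".join(parts)

print(fallback_render([
    {"key": "PROBATE-COURT", "value": "Oslo byfogd"},
    {"key": "FOLIO", "type": "text", "value": "23b"},
]))
# → PROBATE-COURT:Oslo byfogd; FOLIO:text:23b
```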


The rest is up to others: vendors, libraries, genealogical societies, individuals, "the community" might all come up with their own stylesheets and templates and additional CEs for their own needs. As long as the "language" is defined, everything should be compatible.

The UUIDs/URNs idea comes in here: if someone defines a stylesheet for the Chicago Manual of Style, it gets an ID. If the University of Chicago Press offers an official version of its stylesheet, then they'll have their own ID. Things like that...

Source Types would probably not benefit much from a UUID, you're right.
NeilJohnParker 2011-12-08T15:17:04-08:00
I believe that BetterGEDCOM needs a user selectable option that allows the user to export in BetterGEDCOM format all metadata that can affect the utility of the genealogical data, e.g. Templates or Event Templates and their field definitions. Failure to do this will adversely affect the utility of the data, for example when you transfer a copy of your database to someone else who wants to maintain your family tree using the same or different software.

We need more discussion of this issue rather than just dismissing "metadata" as irrelevant as if it was inherently unimportant.
gthorud 2011-12-10T12:14:55-08:00
Klemens,

UUIDs are already in the model. I hope to get around to the URN alternative shortly.

Your proposal to enclose definitions in a separate file in the "multimedia container" is excellent. You could also upload the same file (for a standard definition) on a file server. It might be useful to have the filename of the definition file in the BG file.
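If the container were, say, a zip archive, reading the bundled definitions file could be as simple as this sketch (Python; the container format and filenames are my assumptions, not anything agreed for BG):

```python
import zipfile

def load_definitions(container_path: str, definitions_filename: str) -> bytes:
    """Return the raw definitions file travelling alongside the BG data in
    the same container; the BG file itself would name definitions_filename."""
    with zipfile.ZipFile(container_path) as container:
        return container.read(definitions_filename)
```

The same definitions file, unchanged, could also be uploaded to a file server for the download scenarios above.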

Fallback


I don't mind having fallbacks as long as it is clear that they are fallbacks, and not expected to be the normal case. If you don't make that distinction clear, someone might assume they would be used in the normal case.

I have described a fallback to GEDCOM here, 1st paragraph. http://bettergedcom.wikispaces.com/message/view/A+Data+Model+for+Sources+and+Citations/47671564#47952656

Tom has suggested what I consider to be another one.

If you were to ignore the MSTs and CETs, you would have to carry the names of these in the MSI and Citation Instance. I don't mind that overhead, but it must still be required to supply UUIDs for the MSTs, and exporters would have to either supply the MSTs and CETs or expect them to be available at the importing end. A solution simply supplying source type names and CE names cannot be expected to work (I expect them to be in Esperanto), and ignoring MSTs and CETs will in general have negative consequences. Why you don't want to change the CET names before export escapes me.

There will not be a single template, or style sheets as you call them, for Chicago, there will be one for each MST and one for each output language.
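In other words, a template lookup would be keyed by both MST and output language, roughly like this sketch (Python; all keys and strings are invented for illustration):

```python
# "Chicago" is not one template but a family of them: one per Master
# Source Type and one per output language. Contents here are assumptions.
templates = {
    ("US-Census-1850", "en"): "[TITLE], [COUNTY], [STATE] (accessed [ACCESS DATE]).",
    ("US-Census-1850", "no"): "[TITLE], [COUNTY], [STATE] (besøkt [ACCESS DATE]).",
}

def find_template(mst: str, language: str, fallback_language: str = "en"):
    """Template for this MST in the requested language, else the fallback language."""
    return templates.get((mst, language)) or templates.get((mst, fallback_language))
```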

Why would MSTs not benefit from a UUID (or some other unique identification)? You should be careful that you don't build two different solutions into the standard and expect both to work; then you will simply have major sources of incompatibility – and you have essentially promoted the fallback solutions to be the normal solutions.

Geir
GeneJ 2011-12-11T20:11:20-08:00
Programmed Citations, a general overview
Sources and citations is a big topic. There are many user requirements to consider.

I am among the users who write citations to document my work. I prepare reference notes and bibliographic citations. In the last century, I would have created free-form reference notes that included formatting and punctuation, and I would have expected GEDCOM to transfer my notes.

Today I use software that supports programmed citations. Rather than enter a free-form reference note into my software, I create an equivalent template (a "citation template") that refers to data fields ("citation elements" =CEs) and includes formatting and punctuation. Sometimes those citation templates include ancillary words. See the examples below. The "citation elements" are in capital letters and enclosed in brackets; "conditional citation elements," including any related conditional punctuation, are further enclosed in angled brackets.

Citation Template for Full Reference Note: [1]
[TITLE], [COUNTY], [STATE], [SUBTITLE], [PAGE], [FILE REFERENCE], [HOUSEHOLD]; [EDITION], [ITAL:][PUBLISHER][:ITAL] ([PUBLISHER LOCATION] : accessed [ACCESS DATE]), [ROLL AND FILM NUMBER]<; [CD]><. [M1]>.

Citation Template for Bibliography:
[STATE]. [COUNTY]. [TITLE], [SUBTITLE]. [SERIES]. [ITAL:][PUBLISHER][:ITAL]. [PUBLISHER LOCATION] : [ACCESS DATE].

The citation templates I create are associated with a particular source type. (Different source types require different templates.) The citation template examples above are used to record an 1850-1870 US census (the source type).

After I develop a citation template, my software compiles the different citation elements from the templates into form-like data entry screens. Using the example described above, these form-like data entry screens would have fields for citation elements "Title," "County," "State," "Subtitle," etc. The wiki page "Citation Mechanics" contains graphics from TMG showing the citation template and form-like data entry screen for this 1850-1870 US census example. [2]

Once all the citation templates and the data entry screens have been developed, I add a new source by filling out the forms on those data-entry screens; the software combines that data with my citation template to render programmed citations. For the two citation templates above, the related programmed citations appear below.

Full Reference Note:
1850 U.S. census, Williams County, Ohio, population schedule, Florence, page 59 (stamped), dwelling 808, family 810, Asa Thomas household; digital image(s), _Ancestry.com_ (http://www.ancestry.com : accessed 26 December 2006), citing National Archives microfilm publication M432, roll 741.

Bibliography:
Ohio. Williams County. 1850 U.S. census, population schedule. Digital image(s). _Ancestry.com_. http://www.ancestry.com : 2006.
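The substitution mechanics described above can be sketched in a few lines (Python; a simplified template and renderer of my own, not TMG's actual engine, and the [ITAL:] formatting codes are left out):

```python
import re

def render(template: str, values: dict) -> str:
    """Fill [CE] placeholders from values; drop a whole <...> conditional
    group (including its punctuation) when any CE inside it is missing."""
    def fill(text: str, require_all: bool) -> str:
        def sub(m):
            key = m.group(1)
            if key in values:
                return values[key]
            if require_all:
                raise KeyError(key)  # signals: drop the enclosing conditional group
            return ""                # a missing unconditional element renders empty
        return re.sub(r"\[([^\]]+)\]", sub, text)

    def conditional(m):
        try:
            return fill(m.group(1), require_all=True)
        except KeyError:
            return ""
    # resolve conditional groups first, then the remaining placeholders
    return fill(re.sub(r"<([^>]*)>", conditional, template), require_all=False)

tpl = "[TITLE], [COUNTY], [STATE], [PAGE]<; [CD]>."
print(render(tpl, {"TITLE": "1850 U.S. census", "COUNTY": "Williams County",
                   "STATE": "Ohio", "PAGE": "page 59 (stamped)"}))
# → 1850 U.S. census, Williams County, Ohio, page 59 (stamped).
```

Supplying a value for CD would bring the "; [CD]" group, semicolon and all, back into the output.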

Compared to free-form citations, it is more up-front work for me to develop the citation templates that support programmed citations, but I use them over and over again (some, hundreds of times). In the long run, then, I save time, and achieve better form (that up-front work) and consistency (applied over and over again).

Err… that is until I want to transfer my work to another program. Even if I assume my citation elements would transfer well to another program, slight variations in the receiving program’s citation templates will “break” my programmed citations. I would have to find some way to identify the differences and make corrections to the templates or my data. That might mean tracking down endless commas, semi-colons, punctuation on the wrong side of a quotation mark, or it might mean making sure all of my data actually made it into the programmed citation. In other words, moving files between programs and Internet services would not be feasible, especially in a world where users want to routinely sync between programs and between programs and services.

There are users who don't use or don't consider citations in documenting the work in their database, and still others who don't rely on programmed citations. Those who do want their documentation to survive a me-to-me transfer/sync. Geir's proposal, "A Data Model for Sources and Citations," considers the user requirements related to programmed citations in a model that is both more functional and flexible.



[1] The examples here cover full reference notes and bibliography citations; I also write templates for the short reference note.

[2] “Citation Mechanics” includes a number of screen shots from The Master Genealogist v7. See the “splitter-like approach” screen shots in the links that follow:
SS3b (citation templates)
http://bettergedcom.wikispaces.com/Citation+Mechanics#SS3b
SS3c (data entry screen)
http://bettergedcom.wikispaces.com/Citation+Mechanics#SS3c