The de facto GEDCOM standard comes in several versions, all of which are backward compatible. Given that backward compatibility, it seems unproductive to go into the minutiae of the older releases. This page discusses the commonly recognized versions, which are version 5.5 and, to a lesser degree, version 5.5.1, as well as draft versions such as the XML-based version 6.0 draft and the GEDCOM (Future Directions) draft.

Anyone with knowledge of previous versions who feels that discussion of their development or other considerations would be of value is encouraged to add their thoughts and information.

Much of the information provided here comes from the FamilySearch Developer Network for Software Programmers GEDCOM wiki page. Given that FamilySearch is the original developer of this standard, I highly recommend studying their resources.


GEDCOM Specifications

GEDCOM versions (that have been used by genealogy software):

XML Versions (never used)

Other:
GEDCOM Global Unique Identifier (GUID) of 8 June 2007

Use of ANSEL vs. UTF-8 Character Set

The GEDCOM 5.5 specification uses the ANSEL character set, which is obsolete. The intermittently implemented 5.5.1 standard adds the UTF-8 character set, which is now nearly universally used by computers. This issue must be resolved in any new standard, and UTF-8 should clearly be the character set of that new standard.
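For illustration, the character set of a GEDCOM file is declared in its header. A minimal sketch of the two variants (all other header lines omitted):

0 HEAD
1 CHAR ANSEL     (GEDCOM 5.5)

0 HEAD
1 CHAR UTF-8     (GEDCOM 5.5.1)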

Use of GEDCOM 5.x File Syntax vs. XML Language Syntax

The file syntax of all used GEDCOM versions (i.e., 5.5.1 and below) is the same, and all those versions are consequently backward compatible (except for some largely inconsequential discontinued attributes). This file syntax is unique to GEDCOM, and it contains weaknesses that would make its future use ill advised. Moreover, the GEDCOM syntax is based on a hierarchical data model just as the XML language is, making the adoption of XML something that can be easily understood and accomplished by developers. XML is a widely used standard language, and thus adopting XML for widespread genealogical use could easily result in many more applications being available to the genealogist as developers discover how easily their products can be adapted for genealogical use.
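To make the comparison concrete, here is a tiny hand-written sketch of the same record in GEDCOM 5.5 syntax and in an illustrative XML rendering. The XML element names are invented for illustration only; BetterGEDCOM has not defined a schema.

0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900

<individual id="I1">
  <name>John /Smith/</name>
  <birth>
    <date>1 JAN 1900</date>
  </birth>
</individual>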

The downside to using XML is that BetterGEDCOM (BG) immediately ceases to be backward compatible. Tools could easily be developed to convert old GEDCOM files to the new BG format, but the two formats would be irrevocably separate. This is probably inevitable, but it is important to note this future incompatibility prominently and explicitly.


(The following was cut from a separate page on this wiki, and that page is now unlinked. The contents are better placed here. The original was by Greg Lamberson. Feel free to edit.)

GEDCOM 5.5 Discussion And Detail


Notes from reading through the GEDCOM 5.5 Standard

Cover Letter


"Needed changes that would cause major compatibilities with prior implementations were postponed until release 6.0, some time in the future. Minor 5.x releases may occur in the interim." This tells me that even by the release of the 5.5 standard, FamilySearch recognized the existing GEDCOM standard's syntax caused limitations that could not be fixed within its framework.

Introduction


The specification describes the standard at a low level in Chapter 1 and a high level in Chapter 2. The lower level is called the GEDCOM data format, and the high level is called a GEDCOM form (there can be many GEDCOM forms). The GEDCOM form for the purposes of the standard is specified as the Lineage-Linked GEDCOM Form.

This table should be useful in seeing how these sections correspond with our future efforts.

                                            | GEDCOM 5.5                            | BetterGEDCOM (BG)
Data Syntax (low level)                     | Chapter 1: GEDCOM Data Format         | XML syntax
Data Model (Genealogical Structure of data) | Chapter 2: Lineage-Linked GEDCOM Form | XML BetterGEDCOM namespace

As illustrated by this table, BetterGEDCOM (BG) will benefit from using the existing XML data format by not having to define a custom low-level data syntax.

Chapter 1


Chapter 2


Comments

greglamberson 2010-11-03T19:23:43-07:00
Source citation mangling through GEDCOM
http://www.geneamusings.com/2010/10/source-citation-mangling-through.html
ttwetmore 2010-11-11T02:40:12-08:00
Two Important Points about the Gedcom Model
There are two points about the Gedcom model I think are important to point out and they both apply to a new genealogical model and a better Gedcom.

First, Gedcom is both a syntactic language and a semantic one. At the syntactic level Gedcom is a set of rules for constructing and arranging into a tree a collection of Gedcom lines. There is nothing in this language that specifies what the tags are, what the tags mean, what tags are required, what formats values should have and so on.

It is only at the semantic level that rules about tags, arrangements of tags, and values of lines mean anything. The Gedcom standards try to point this out by saying they primarily add "lineage linking semantics" to the Gedcom syntax.

As a syntax Gedcom is essentially isomorphic to XML as a general purpose language for structuring information. They are both representations of the same things; you can transform back and forth between Gedcom and XML formats pretty much at will. As old computer scientists like myself say, "it's just syntactic sugar."
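As a rough illustration of that isomorphism (this sketch is not from the original comment, and the element names it emits are arbitrary), here is a short Python function that maps GEDCOM level-number lines onto an equivalent XML tree purely at the syntactic level, with no knowledge of what any tag means. The reverse mapping is just as mechanical.

import xml.etree.ElementTree as ET

def gedcom_to_xml(lines):
    """Turn GEDCOM level/tag/value lines into an XML tree, purely syntactically."""
    root = ET.Element("gedcom")
    stack = [(-1, root)]                      # (level, element) pairs
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):          # level-0 lines may carry an @XREF@ id
            xref = parts[1].strip("@")
            rest = parts[2].split(" ", 1)
            tag = rest[0]
            value = rest[1] if len(rest) > 1 else ""
        else:
            xref = None
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        while stack[-1][0] >= level:          # climb back up to this line's parent
            stack.pop()
        elem = ET.SubElement(stack[-1][1], tag)
        if xref:
            elem.set("id", xref)
        if value:
            elem.text = value
        stack.append((level, elem))
    return root

sample = ["0 @I1@ INDI", "1 NAME John /Smith/", "1 BIRT", "2 DATE 1 JAN 1900"]
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))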

The second point concerns whether Gedcom is a language for data interchange or a language for data storage and representation. If you read the Gedcom documents it is clear that the intention was the former. But if you look at how Gedcom has been used, it often includes the latter. A VERY important point lurking behind this issue is that if Gedcom is JUST a data transfer language, then data in a different format must be translated to Gedcom before a transfer, and translated from Gedcom to a third format after transfer. In almost all cases, and lamented by millions of genealogists around the world daily, both these translations are LOSSY operations. If Gedcom were the storage language at both ends of the transfer then no information would ever be lost. This point is so obvious, and has screamed at you so insistently over the years, that at some point it pays to notice it and do something about it. As far as I know LifeLines and GeditCom are the only two programs that have listened.

There are lessons to learn from both these points when thinking about the better Gedcom. First, the underlying textual format in which model objects are eventually represented must have the same general purpose syntax as Gedcom or XML; it could have either format. Personally I would recommend a new syntax because 1) Gedcom syntax at this point in time just leaves too bad a connotation/taste in too many people's mouths; they would think the better Gedcom effort had really not done something very useful or significant; and 2) XML has TOO MANY WORDS, TOO MANY POINTY THINGS, and IS TOO DARNED HARD TO READ. You don't have to read them, you say, because that's what computers are for. Baloney. And anyway, it's just syntactic sugar after all.

Tom Wetmore
dsblank 2010-11-11T04:31:42-08:00
I am sure that every system that reads and writes GEDCOM is aware that the standard does not allow them to save all of their data. However, many vendors EXTEND GEDCOM in arbitrary ways so that it is no longer lossy for their own data. This is almost as bad as being lossy, because it is still lossy for every other vendor.
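To illustrate what such an extension looks like in a file (the tag below is invented, not any particular vendor's; custom tags begin with an underscore per the spec):

0 @I1@ INDI
1 NAME John /Smith/
1 _FAVCOLOR Blue

A receiving program that does not recognize the underscore tag can only drop it or stash it in a note, and that is where the loss occurs.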

Gramps does not extend GEDCOM, nor does it read any other non-standard extensions of GEDCOM. We have documented all of the bits that we lose:

http://www.gramps-project.org/wiki/index.php?title=Gramps_and_GEDCOM

To your points:

1) Yes, theoretically, any syntax would work

2) XML is the standard manner for representing this kind of data. There are many tools for processing general XML. Combined with UTF encoding, XML is the right choice in 2010 for representing any data to be shared and archived.
Just.another.Gramps.user 2010-11-11T05:02:13-08:00
Tom,

Your text is clear and you make good points, albeit in a somewhat rough tone. Some things I'd like to consider:

If Gedcom is isomorphic to XML, as you say, a 1:1 correspondence could easily be kept _if there were a solid Gedcom specification_. No loss would happen then -- at least nothing significant. What really bothers people is not Gedcom syntax, as you say. It even has some advantages over XML. The real problem is the lack of proper semantics and the semantic chaos emanating from various vendors adopting their own sets of rules.

So if you could reach a semantic consensus, there is no real reason why we should think of the translation operations as "LOSSY". They wouldn't be. Gramps is the proof: there is no loss whatsoever between Gramps XML and a Gramps Family Tree (BSDDB). At least I can't recall any significant cases, because there is a 1:1 correspondence and there is semantic consistency. That is not the case between Gramps and Gedcom, or between Gedcom and Gedcom, which is the problem you're trying to solve. I don't see a case between Gedcom syntax and XML syntax, because the "LOSS" you mention wouldn't exist, or wouldn't exist significantly, with proper semantic specifications.

jagu
greglamberson 2010-11-11T05:16:46-08:00
JAGU,

GEDCOM 5.5 spec Chapter 1 = GEDCOM SYNTAX, analogous to XML in function

GEDCOM 5.5 spec Chapter 2 = GEDCOM Data Model, which is the real issue.

The only reason XML is being thrown about so much is that people simply aren't familiar with it. I suggest you just ignore references to XML, as we will be using XML for this project, and any reference to any genealogical data whatsoever in relation to GEDCOM must be referring to the GEDCOM Data Model (i.e., that part of the spec in Chapter 2 above).

The problem is that in mapping various data models to each other, data is lost due to poor mapping, ambiguities in the data models, or simple incompatibilities. All references to losing data have nothing whatsoever to do with whether XML is used.
greglamberson 2010-11-11T05:24:20-08:00
Tom,

Also, I wanted to suggest you use the free XML editor referred to in the TOOLS section linked on the left if you want to read XML. We will be adding structure to BG, just as other formats have added complexity.

You consider GEDCOM to be more "readable." Well, I guess that's because you're used to it. I personally find XML more readable. But overall, go have a cookie or something. I don't plan on reading either if I can help it, and neither do 99% of the people using this or any other similarly formatted data.
ttwetmore 2010-11-11T07:40:36-08:00
I concede the battle over XML; it was over before it began. I simply don't like XML. The amount of extra characters required to say trivial things stuns me. The plethora of pointy things and other characters interrupts the context so completely that reading is nearly impossible. The answers to my criticisms are that space doesn't matter any more and that you can use special readers. I concede. The rubber-stamp argument that there are thousands of XML readers and thousands of XML tools around, so you'd be an idiot to do anything else, holds no weight with me. Real programmers write parsers in their sleep. Oh, wait; I already conceded.

But you still might like to take a look at the brand new book in the Addison-Wesley Signature Series, "Domain-Specific Languages," by Martin Fowler, who, like me, feels that XML has turned into a tyranny that is preventing constructive thought about better representations in many areas. Genealogy is a very domain-specific area and warrants its own domain-specific languages. But I already did concede, didn't I?

I hope you are learning to take my effluences with large grains of salt. No offense is ever meant.

Tom Wetmore
Just.another.Gramps.user 2010-11-11T07:57:31-08:00
Tom,

You obviously haven't conceded, but as a Gramps user and supporter I couldn't care less whether you concede or not, because, please try to understand, we Gramps users don't really care whether you will use XML or not. Use Gedcom syntax, we don't mind. Let's keep the discussion to semantic similarities, trying to keep them to a maximum so as to facilitate interchange.
ttwetmore 2010-11-11T08:15:14-08:00
What I have conceded is that if the better Gedcom project decides to define its data model as XML schemas I will happily accept that.

You make it sound like Gramps users aren't all that interested in a new Gedcom effort since you already have an excellent solution to the problem. If that's the case the new Gedcom effort should take a close look at the Gramps model. It might be a good idea to have someone from Gramps actively advocate that position.

Tom Wetmore
hrworth 2010-11-11T08:23:43-08:00
Just.another.Gramps.user,

Do you have any interest in sharing your Genealogy Research with someone who does not use Gramps?

The purpose of this project is so that you and I could, if we found we had common research material, share our research in a way that we don't lose our research details when they are shared.

Russ
Just.another.Gramps.user 2010-11-11T08:33:18-08:00
hrworth,

Indeed I do, and that is the point of my intervention. Focus on my main point: we shouldn't bother about syntax. That's not what's keeping us apart. We have to focus on semantics and aim for the best correspondence possible. Then we will be able to share information properly. XML vs. GEDCOM syntax is like an editor war or something; it will never end. We have to forget it. Gramps has it figured out, never mind. Let Gramps developers bother about syntax. I think that is not a problem for them. Use Gedcom syntax, but please, make it more uniform. Semantic consistency: turn Gedcom into a better semantic standard. That will reduce data loss.

JAGU
hrworth 2010-11-11T08:47:30-08:00
JAGU,

I sign my messages, just for clarity. It's Russ.

As a User, I don't care HOW the information is transported between two applications. But I will participate in getting us, this community, to where we can share our information without loss of data.

Clearly the current GEDCOM "standard" no longer meets our needs.

Thank you,

Russ
ttwetmore 2010-11-11T10:19:06-08:00
Quick Gedcom 5.5 Reference
Here are the contents of a file that I use in the DeadEnds program to check the validity of a Gedcom file against the 5.5 specification, which was the latest one to be officially sanctioned by the LDS. This file is the most succinct form I have found for showing all the allowed relationships between tags within 5.5. You can use it when you want to make a quick check of some fine point of the Gedcom rules and you don't want to open up the full document and go rooting around in it.


header_record: HEAD [SOUR [VERS NAME CORP DATA [DATE COPR]]
DEST DATE [TIME] SUBM SUBN FILE COPR GEDC [VERS FORM] CHAR [VERS] LANG PLAC [FORM] NOTE [CONT CONC]];

submission_record: SUBN [SUBM FAMF TEMP ANCE DESC ORDI RIN];

family_record: FAM [family_event [HUSB [AGE] WIFE [AGE]] HUSB WIFE CHIL NCHI SUBM
sealing source medialink note REFN [TYPE] RIN change];

individual_record: INDI [RESN name SEX individual_event attribute ordinance child_family spouse_family
SUBM association ALIA ANCI DESI source medialink note RFN AFN REFN [TYPE] RIN change];

multimedia_record: OBJE [FORM TITL note BLOB [CONT] OBJE REFN [TYPE] RIN change];

note_record: NOTE [CONC CONT source REFN [TYPE] RIN change];

repository_record: REPO [NAME address note REFN [TYPE] RIN change];

source_record: SOUR [DATA [EVEN [DATE PLAC] AGNC note] AUTH [CONT CONC] TITL [CONT CONC] ABBR
PUBL [CONT CONC] TEXT [CONT CONC] repository medialink note REFN [TYPE] RIN change];

submitter_record: SUBM [NAME address medialink LANG REFN RIN change];

address: ADDR [CONT ADR1 ADR2 CITY STAE POST CTRY] PHON;

association: ASSO [TYPE RELA note source];

change: CHAN [DATE [TIME] note];

child_family: FAMC [PEDI note];

event_detail: TYPE DATE place address AGE AGNC CAUS source medialink note;

name: NAME [NPFX GIVN NICK SPFX SURN NSFX source note];

family_event: (ANUL CENS DIV DIVF ENGA MARR MARB MARC MARL MARS EVEN) [event_detail];

note: NOTE [CONC CONT source];

place: PLAC [FORM source note];

attribute: (CAST DSCR EDUC IDNO NATI OCCU PROP RELI RESI SSN TITL NCHI NMR) [event_detail];

individual_event: indi_event1 indi_event2 indi_event3;

indi_event1: (BIRT CHR) [event_detail FAMC];

indi_event2: (DEAT BURI CREM BAPM BARM BASM BLES CHRA CONF FCOM ORDN NATU EMIG IMMI CENS PROB WILL GRAD RETI EVEN)
[event_detail];

indi_event3: ADOP [event_detail FAMC [ADOP]];

ordinance: ordinance1 ordinance2;

ordinance1: (BAPL CONL ENDL) [STAT DATE TEMP PLAC source note];

ordinance2: SLGC [STAT DATE TEMP PLAC FAMC source note];

sealing: SLGS [STAT DATE TEMP PLAC source note];

medialink: OBJE [FORM TITL FILE note];

source: source1 source2;

source1: SOUR [PAGE EVEN [ROLE] DATA [DATE TEXT [CONC CONT]] QUAY medialink note];

source2: SOUR [CONC CONT TEXT [CONC CONT] note];

repository: REPO [note CALN [MEDI]];

spouse_family: FAMS [note];
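Not from Tom's posting: below is a minimal Python sketch of how a table like the one above could drive a quick structural check of a GEDCOM file. The ALLOWED dictionary holds only a tiny, hand-copied subset of the 5.5 rules, and the helper is illustrative, not DeadEnds code.

# Check that each GEDCOM line's tag is allowed under its parent tag.
# ALLOWED maps a parent tag to the set of child tags permitted beneath it;
# only a small subset of the rules listed above is included here.
ALLOWED = {
    "ROOT": {"HEAD", "SUBN", "FAM", "INDI", "OBJE", "NOTE", "REPO", "SOUR", "SUBM"},
    "INDI": {"RESN", "NAME", "SEX", "BIRT", "DEAT", "FAMC", "FAMS", "SOUR", "NOTE",
             "CHAN", "REFN", "RIN"},
    "BIRT": {"TYPE", "DATE", "PLAC", "AGE", "AGNC", "CAUS", "SOUR", "OBJE", "NOTE", "FAMC"},
}

def check(lines):
    """Yield (line_number, message) for tags the subset table does not allow."""
    stack = ["ROOT"]                      # tag context, indexed by level + 1
    for n, raw in enumerate(lines, 1):
        parts = raw.split()
        if not parts:
            continue
        level = int(parts[0])
        tag = parts[2] if parts[1].startswith("@") else parts[1]
        del stack[level + 1:]             # pop back to this line's parent level
        parent = stack[-1]
        if parent in ALLOWED and tag not in ALLOWED[parent]:
            yield n, f"{tag} not allowed under {parent}"
        stack.append(tag)

sample = ["0 @I1@ INDI", "1 BIRT", "2 DATE 1 JAN 1900", "2 COLOR blue"]
for n, msg in check(sample):
    print(n, msg)                         # prints: 4 COLOR not allowed under BIRT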
ttwetmore 2010-11-11T11:05:35-08:00
Making Gedcom Support Evidence and Conclusions
A common criticism of Gedcom is that it is a conclusion only language, an argument I have made for twenty years. One of the main reasons (I believe) the better Gedcom effort exists is to encompass more of the evidence aspects of the genealogy process into its new model.

That being said, let me take a brief devil's advocate position, and mention that there are some ways one can use current Gedcom to support the evidence model; certainly not as well as one would want, but we do live in a beggars can't be choosers world for genealogical standards right now.

Consider a Gedcom INDI record and all the event substructures within it. Let's say we're careful researchers and for every event we discover for someone we record the source of that event. So a basic INDI record might have someone's name and his or her BIRT and DEAT events and each of the two events has its own 2 SOUR line. Later you find another source that gives a birth for the same person. You add a second BIRT event to the same INDI record with its own 2 SOUR line. The main issue that arises now is choosing the "preferred" birth event. The best solution I have found to this problem, now hold your criticism for a moment, is to add a THIRD BIRT event with a 2 SOUR line that indicates that it is your own conclusion. You then make that third BIRT event physically first in the record (assuming your software is sane enough to use the physically first birth event in a record to be the one that is displayed in windows and shown in reports). Now you have an INDI record containing one conclusion BIRT event and two evidence BIRT events.
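A hand-written sketch of what that record might look like (the sources, dates, and annotations after the arrows are illustrative only and not part of the file):

0 @I1@ INDI
1 NAME John /Smith/
1 BIRT                    <- conclusion event, placed physically first
2 DATE 2 MAR 1841
2 SOUR @S9@               <- a "source" representing the researcher's own conclusion
1 BIRT                    <- evidence event from the first source
2 DATE MAR 1841
2 SOUR @S1@
1 BIRT                    <- evidence event from the second source
2 DATE 2 MAR 1841
2 SOUR @S2@
1 DEAT
2 DATE 1900
2 SOUR @S1@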

Admittedly there are problems with this approach. I suppose the main one is that most genealogical systems don't give you any say at all about the Gedcom that it will generate. But most programs do allow you to add more than one event of the same type to a person, and they might put them all in a Gedcom export file, so you could try this approach with your program and see.

Another problem that comes from the approach involves names. What if the evidence for the two BIRT events has slightly different names for the person? How do you choose a preferred name and how do you encode it in your Gedcom? Well, I'll say how I do it in LifeLines. I stick a 2 NAME line (note the level 2) inside the 1 BIRT events, using the (excellent in my opinion) convention that a person's name should often be an attribute of an event (the name the person had when the event occurred), but 5.5 Gedcom does not allow a 2 NAME line. This is not a problem for me with LifeLines since there is no barfing at new conventions. Without this addition to Gedcom you would probably do two things -- first put a NOTE in each BIRT saying what the name recorded with the event was, and then put in multiple 1 NAME lines at the start of the INDI record, with the physically first 1 NAME line being the one you feel is the preferred name for display and reports.
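The LifeLines convention described here would look roughly like this (not legal 5.5; the level 2 NAME line is the extension, and the values are made up):

1 BIRT
2 DATE MAR 1841
2 NAME John /Smyth/       <- the name as recorded in this piece of evidence
2 SOUR @S1@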

Another problem with this approach, fairly serious in my opinion, is what happens when a user decides that some of those events in the INDI really happened to DIFFERENT people with the same name. Software could handle it, but it's pretty subtle, and I doubt any current software has tried to solve this problem. For example, what would Gramps do if a user wanted to remove an event from one of the existing persons and create a new person with that event? Might work fine; I just don't know.

The final problem with this stems from the basic fact that in Gedcom all events must be inside INDI records and don't stand on their own. In the better Gedcom (presumably) there will be event records.

Well, no, the real problem is that the "evidence records" are stuck inside the "conclusion records" (they're all part of one big INDI record), and therefore the "evidence records" have all lost their own identity and integrity.

A reason to point this out in the better Gedcom context is that better Gedcom could presumably decide to simplify its model by finding analogous short-cuts for double-dutying the same entities to represent conclusions and evidence at the same time. I don't recommend this, but I guess consensus could end up flowing in that direction.

It would be interesting if a Gramps user would explain how Gramps users have found ways to support evidence and conclusion within the Gramps context.

Tom Wetmore
ttwetmore 2010-11-11T14:42:02-08:00
Doug,

In my humble opinion, Gramps's approach of treating Events as first-class citizens (that is, records) and keeping the Person and Event records separate at all times were excellent choices in Gramps's design and implementation. It seems to me that the Gramps model is an excellent "model" for the better Gedcom effort to key off from.

And I think I am beginning to get a gut feel on how Gramps could deal with the evidence and conclusion issues out of the box. Not really sure though.

On a less optimistic note, the version of Gramps that I downloaded off the site keeps crashing with messages about not being able to find a global variable in the display world. I can send you that message if you like. It does work for awhile between crashes and what I see looks great.

By the way, I have been going through the Gramps schema this afternoon, and though it takes a bit of work trying to internalize it, and there are parts I probably won't be able to understand without asking some questions, it looks comprehensive. I've also been translating it into a context free grammar as that is a form that I can understand a bit better. (And it gets rid of all the pointy things!!!!!).

Thanks so much for your explanations.

Tom Wetmore
hrworth 2010-11-11T16:37:26-08:00
Tom,

I am guessing we are hung up on the term Conclusion.

To me, a conclusion record would come after I have evaluated all of the evidence that applied to an event.

I do not rely on one source to reach a conclusion that an event took place as entered.

I think your term "conclusion" is different than mine.

Thank you,

Russ
greglamberson 2010-11-11T17:40:38-08:00
Tom,

If you would be willing to evaluate the GRAMPS data model and add the content to the GRAMPS Data Model page, perhaps scribbling a couple of boxes on a page and scanning it in as a diagram or something, it would really go a long, long way towards presenting the GRAMPS model in a format that would help us evaluate it. Is there any chance I could talk you into that? You're clearly very good at understanding and explaining data models and their implications.

As an open source data model GRAMPS is certainly one we could freely use and adapt. If it's a good data model for use in the real world, as everyone says it is, I would like to see how we might use it and map data from various applications and scenarios to it.
ttwetmore 2010-11-11T18:11:18-08:00
Russ,

I can't find any fault with your concept of conclusion. So I conclude (ahem) that my explanation was insufficient.

For me a conclusion record comes into being at the time a researcher says to himself, "yes this collection of information all refers to the same real person." If that happens after a single piece of evidence has been found, fine. If that happens after twenty pieces of evidence have been examined that mention a John Smith and the researcher has decided to link seven of those together because he thinks they all refer to the same John Smith that he is interested in, the conclusion person comes into being then as either the physical merging (wrong in my opinion) or linking together (correct in my opinion) of the information takes place. Conclusions can be nullified, added to, removed from, as further research takes place.

I believe there really is a continuum from pure evidence, info in primary records, to lofty conclusions. It may be that by trying to give a conclusion object some strict boundary there is a problem.

So here is an interesting point. If you look at the GenTech data model you will see that it is divided into three overall sub-models, Administration, Evidence, and Conclusions. What GenTech calls Evidence, however, is information directly in sources. As soon as any "markup" or "extraction" occurs the information is already into the realm of Conclusion. That is, even a simple Event record, created directly from an original parish register, is already Conclusion, presumably because some level of value added human interpretation of carbon marks on a piece of paper had to occur in creating that Event record. To my mind, such Event records (and the corresponding Person records) that are taken directly from physical evidence are still at the Evidence level, but I can certainly understand the GenTech position, and I can therefore certainly understand if the way I am using the terms causes confusion.

In summary, for me Evidence records are the records extracted directly from a SINGLE item of genealogical evidence. Conclusion records are any records that link together information from two or more evidence records, because such conclusion records implicitly assume that the linked together evidence records refer to the same real person. So I guess you could have an operational definition of conclusion record that simply states that it is a record that contains information from more than one source. But this really isn't right (and is contrary to my definition of conclusion in the second paragraph above!) because for many of the people you are researching you only have one source of information, and you really don't want to have to stupidly consider those records to be just evidence records. Look, genealogy is messy; it's humanistic which is always messy. If we get too hung up on formal definitions in a messy world we get caught in a gooey mess. Evidence comes from single items of evidence. Nothing combined from different items of evidence is evidence any more. Conclusions are things built up from evidence, even if only one item of evidence is involved. There, that's the answer. Whew!

I have done another, probably poorer, job of explanation.

Tom Wetmore
hrworth 2010-11-11T18:29:09-08:00
Tom,

I have not looked at GenTech and don't know the terms used there.

To me, a conclusion can't be reached until the Evidence has been evaluated. Usually, that Evidence is not one piece of information.

I would not consider a Parish Register a Conclusion. Much of the information in a Parish Register is NOT primary source material. The person providing a Name, for example, may say and spell out the full name accurately, but the recorder of that information may not record it as it was given. One person may give all of the information in the Register, but that information may not be accurate or first-hand information. For example: the wife was not present when her husband was born.

I could not draw a conclusion of the Birth Date of the Husband based on the Parish Record.

That is why I am raising the issue of the term Conclusion.

I do NOT want my software program marking a record with a Conclusion indicator. I MUST do that.

Oh, I may never have a final conclusion and I might change my conclusion as more information is found.

I do agree that Conclusions are built from Evidence, but I need to evaluate the evidence and draw that conclusion.

I don't think we are too far off, and it certainly can be messy. I am only responding to the Parish Register "is already Conclusion".

Thank you,

Russ
ttwetmore 2010-11-11T18:53:13-08:00
Greg,

I have been analyzing the Gramps schema and am almost done writing a context free grammar for a language that expresses the same information in a non-XML, shorter form. When I have that in reasonable shape I will either post it or try to describe it. I'm afraid that trying to understand the schema in its XML form is difficult for me (though Gramps folk, and you yourself, based on earlier comments, might find it invigorating). I like the CFG form for defining models and languages because it is considerably shorter and doesn't contain pointy things (it's a joke, guys), and is therefore easier for my poor brain to wrap around.

I can already see that it looks like a great model with many of the same ideas found in other modern (and therefore not Gedcom) models. And it's nothing like the GenTech model, which stinks pretty badly in my opinion. For example, Gramps has a nice set of record types and common-sensical ways of linking records together. The record types include persons and events as first-class citizens, but also other things that Gedcom does not make first class, such as places. I've noticed that more and more genealogical systems have also done this promotion of places, and I think it is the right thing to do.

Here's an example of what I'm doing. My top level rule for a Gramps database in a CFG notation now looks like this:

database: header name-formats? tags? events? people? families? sources? places? objects? repositories?

This is basically a concept for concept conversion from the Gramps XML-based schema's definition of the overall structure of a Gramps database (or maybe it would be better to say the structure of an external file that holds the contents of a Gramps database) to a context free grammar for another file format that would not be in XML but would still be concept for concept isomorphic with the Gramps file. This CFG rule says that a Gramps database or file consists of a header, an optional name-formats thingy, an optional tags thingy, and then sets of event, people, family, source, place, object and repository records. I'm assuming that object records are catch-all records that can be used for any kind of thingy that comes up that isn't an event, person, family, source, place or repository. (A nearly unknown and probably never used feature of LifeLines is that it allows users to create top-level Gedcom records with any level 0 tag you want, so you can actually create your own sets of different types of records -- I put all that in but never really told anyone about it, because I didn't have time to fully debug it before I lost access to a UNIX system.) The only reason there are question marks after the labels for all the record types is that the schema/grammar must allow for the cases at the start of a database's life when records of any of the types might not exist yet. I haven't skull-dudged enough yet to know what the name-formats and tags thingies are, but I think the important thing to take away here is that we can see exactly what record types the Gramps designers decided to include in their model. As the Templar monk said to Indiana Jones, they chose wisely.
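For readers who prefer to see the same thing as XML, the rule above corresponds roughly to a file skeleton like the following (element names are taken directly from the rule; the attributes and details of the real Gramps schema are omitted and may differ):

<database>
  <header>...</header>
  <name-formats>...</name-formats>
  <tags>...</tags>
  <events>...</events>
  <people>...</people>
  <families>...</families>
  <sources>...</sources>
  <places>...</places>
  <objects>...</objects>
  <repositories>...</repositories>
</database>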

Tom Wetmore
dsblank 2010-11-11T19:23:25-08:00
Tom, thank you for the kind words re: gramps data model. I will pass those on to the team.

I don't think we have a nice description like the grammar you are reconstructing, largely because the schema has grown organically from our non-relational database backend.

However, as I have just recently written importers/exporters to move Gramps data to SQL tables, this does exist in a more readable format. Have you seen the page on the left here on the Gramps Data Model? That points to http://www.gramps-project.org/wiki/index.php?title=GEPS_013:_GRAMPS_Webapp which does have a low-res image of the relational db schema. I'm working to re-create that in a hi-res version. We also have two Python versions describing the schema... one in Django (a Python-based web db system), and one in pure SQL.

I'm on the road this weekend, but feel free to ask me, or gramps-dev mailing list any question. We are a friendly group, and love to discuss these items.

-Doug
ttwetmore 2010-11-11T19:28:34-08:00
Russ,

I really think we agree and I may now be beating a dead horse into the ground.

I didn't intend to imply that it was the software that made the decisions about what is evidence and what is conclusion. In fact I now think (I've gone back and forth on this a million times in the past twenty years) that exactly the same person records (and event records) can/should be used to hold both evidence and conclusions. Would you feel better if we could say that there is nothing in the records that specifies whether they are evidence or conclusion, and that the distinction exists only in the mind of the user?

What I am concerned about is that we have a model that supports a process in which we gather lots of evidence, add it to our databases as different types of records, and, as we make our own decisions in our own heads about which records refer to which real persons, we can choose to link those records into groups that represent those decisions. Before we start making these decisions we basically have stand-alone evidence records. After we make these decisions, the records formed by grouping together information, the records that now embody our decisions, are conclusion records. The records are organic. We can add evidence records into the mix as we find more. We can regroup the evidence into different "partitions" as we reassess which evidence records really refer to the same persons, and so on. We decide the decision points, not software. When we are finally ready to declare, "These n (where n can be 1) records all refer to the same person," we have reached a conclusion. In order to record that conclusion we have to create a record (at least in the cases where n is greater than 1) that somehow groups together those n records. If we didn't have a way to do that grouping we wouldn't have a system that supports the process. In the Gramps model these groups are simply person records that link to a larger set of event records, so a Gramps person would say we've already got Wetmore's concerns covered, so why doesn't he just shut up already. Actually I don't think my concerns are fully covered, but Gramps is well along the way.

Let me end by saying that evidence records are sacrosanct. The act of making a decision which causes evidence records to be linked together into conclusions or groups should never physically change the evidence records. Evidence records should never be merged with other records (where two records in the database become one). Yeah, you can change an evidence record if you discover you transcribed or recorded it incorrectly, but other than that it must retain its integrity forever.
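A minimal Python sketch of the discipline described here (the record names and fields are illustrative only, not any program's actual model): evidence records are frozen once created, and a conclusion is nothing more than a revisable grouping of links to them.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)                   # frozen: evidence is never edited once recorded
class EvidencePerson:
    source: str                           # the single item of evidence it came from
    facts: Dict[str, str]                 # e.g. {"NAME": "John Smith", "BIRT": "abt 1820"}

@dataclass
class ConclusionPerson:
    # A conclusion only links evidence records together; it never merges them.
    based_on: List[EvidencePerson] = field(default_factory=list)

    def link(self, ev: EvidencePerson):
        self.based_on.append(ev)          # decide these records describe the same real person

    def unlink(self, ev: EvidencePerson):
        self.based_on.remove(ev)          # re-partition later without losing anything

census = EvidencePerson("1850 census", {"NAME": "John Smith", "BIRT": "abt 1820"})
will = EvidencePerson("probate file", {"NAME": "Jno. Smith", "DEAT": "1871"})
john = ConclusionPerson()
john.link(census)
john.link(will)                           # the evidence records themselves are untouched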

Tom Wetmore
greglamberson 2010-11-11T19:35:02-08:00
Yes, GenTech is great theory and awful practice.

My understanding of their data model certainly makes me want to consider it. I would like to compare it to what the methodologists would like and see what gripes developers have about it.

Back (sort of) to the original idea of this thread: my concepts for source/evidence stuff are all different.

I consider Evidence to be a class of objects in which there are items: Evidence-item with subitems evidence-citation and evidence-element (which can have its own elements indicating authorship, veracity, etc.); Evidence-store; Theory; and Assertion.

Evidence-item = source for most folks
Evidence-store = repository
Assertion = conclusion for most folks
Theory functions like an intermediate item between evidence-item and assertion. Whereas an evidence-item is a thing (like a book or an interview), a Theory is an item which provides a way to develop ideas in more free-form fashion. A Theory in my mind is something cited in an Assertion (conclusion) but is not itself equivalent to an assertion. In the same way sources are standalone items that exist but are not assertions/conclusions, so are theories.
greglamberson 2010-11-11T19:36:19-08:00
In the above post I said:

"My understanding of their data model certainly makes me want to consider it. I would like to compare it to what the methodologists would like and see what gripes developers have about it."

This statement refers to the GRAMPS Data Model, definitely NOT the GenTech Data Model.
dsblank 2010-11-11T19:47:15-08:00
Gramps developers would be the best ones to tell you what they don't like and would like to fix, change, and add to the Gramps XML. In the Gramps world we live in a practical world (the XML is real, it has to work right now), but we have ideas of where we want it to go. See our Gramps Enhancement Proposals for some of these. Others are still in the minds of the devs.
romjerome 2010-11-12T04:09:09-08:00
"By the way, I have been going through the Gramps schema this afternoon, and though it takes a bit of work trying to internalize it, and there are parts I probably won't be able to understand without asking some questions, it looks comprehensive. I've also been translating it into a context free grammar as that is a form that I can understand a bit better. (And it gets rid of all the pointy things!!!!!)."

If needed, you can try pasting a URL like this one into this form. This will generate a web of the Gramps data model with an alternate representation.

Regards,
Jérôme
hrworth 2010-11-11T11:47:34-08:00
Tom,

Being a User, I do have a question for clarification. You said:

"evidence records" are stuck inside the "conclusion records"

I am not sure I understand what that means in a GEDCOM, as we know it today.

Today, as you so nicely pointed out, we would or could have SOUR statements in the GEDCOM file supporting the INDI and Events for that individual.

What might I look for as an Evidence Record or a Conclusion Record?

Again, I am certainly not as familiar with the GEDCOM structure as yourself so it may be there, but I haven't seen it.

I certainly have not seen anything like that in the program that I use.

We DO, however, need to address these two issues. I think that they are two issues.

We have a Source and Citation that would support the Individual or an Event for that Individual.

Is a Source / Citation the same as your term "evidence records"?

I know, after listening to and reading Elizabeth Shown Mills, that at some point in time I MUST evaluate my Source / Citations, with hopes that I am able to draw a conclusion.

It would appear to me, that you might have Source information, Evidence information, that helps me draw my conclusion.

Right now, I don't have that tool.

Thank you,

Russ
dsblank 2010-11-11T11:48:06-08:00
Tom said:

"For example, what would Gramps do if a user wanted to remove an event from one of the existing persons and create a new person with that event?"

No problem. These are independent objects, and can be shared, removed, added, merged as the user chooses.

"It would be interesting if a Gramps user would explain how Gramps users have found ways to support evidence and conclusion within the Gramps context."

Someone will probably have to ask that on the gramps-users or gramps-dev mailing lists. I hadn't heard that distinction until I met you. (There are people that preach the message of the Mills approach, so perhaps others can let you know what is possible in Gramps.)

-Doug
ttwetmore 2010-11-11T14:31:58-08:00
Russ,

By saying that evidence records are stuck inside conclusion records I mean that the overall INDI record with the added summary BIRT event is the conclusion record and the three other events (2 BIRT and 1 DEAT events in my example) are the evidence records. Of course the event sub-structures really aren't records, thus the quotes. The fact that Gedcom allows SOUR tags to be used in many places, including inside event sub-structures, means that separate sub-structures within the same overall INDI record can come from different sources. Thus INDI records can be amalgamations of information taken from any number of items of evidence.

I hope that is clear and that it does answer your question.

Tom Wetmore
GeneJ 2010-12-20T20:45:39-08:00
Blogged about place properties transfer from FTM to TMG via GEDCOM
You can read the entry here:

http://bettergedcom.blogspot.com/2010/12/bettergedcom-test1-ftm-to-tmg-place.html
Andy_Hatchett 2010-12-20T23:44:33-08:00
So if one were entering Borrby in TMG Places one would enter:

Detail: Borrby
City: Simirishamn Municipality
County: Sanke (County?)
State: Sanke (Province?)
Country: Sweden

Correct?
GeneJ 2010-12-21T03:43:05-08:00
:)

Since TMG lets you customize the template itself, I'd probably take the time to learn how Sweden's location name parts were and are used, so that I might have a "Swedish Place Styles." The Wikipedia entry indicates Sweden today is a "unitary state," divided into 21 counties ....

Quite possibly:

City: Borrby
Municipality: Simrishamn
County: Sanke
Country: Sweden
ttwetmore 2011-01-11T16:57:10-08:00
Model in the Future Directions Document

I have been reading the GEDCOM Future Directions Document. On page 15 starts the description of the "Genealogical Information Model." It's kind of a kitchen-sink model; you get the idea the authors felt they couldn't let the GenTech Data Model outdo them!

But once you select some of the key record types, and after you go off on a lot of tangents trying to figure out what it all means, it seems to me that the core model is consistent with points we have discussed for Better GEDCOM. And since it IS a GEDCOM model, Louis will be pleased.

If you like pictures you can get an overview of the model starting on page 16. For anyone planning to contribute to the Better GEDCOM model, I think the model part of this document should be considered required reading.

A few comments.

The model has an Event record that can be used for multi-role events. The Event records don't point to their role players, but the role players point to the Events. Because the LDS created the model, of course they have to have special records for their LDS Ordinance events, which I find somewhat offensive, but so what.

The model still has the family record, but it is now called the Family/Couple record. Interestingly, this record no longer points to the father/husband, mother/wife, and children, as done by GEDCOM, but the father, mother and children Individual records do still point to the Family/Couple records. This is the same as GEDCOM except that the links are now one-way.

But the model also has a Social Group record.

Simple attributes have been turned into big characteristic substructures, but that's okay.

The source model is interesting. There are three basic records in the source hierarchy, first Repository and Source, similar to GEDCOM. But then there is the Document record. This record represents specific information taken from a Source, so it's what we've been calling evidence. Note that all records that point to sources point to these Document records, so it is these Document records that serve the role of sources for all other records.

But that's not all about sources. The model also has a record called the Extracted Detail record. This is a record that holds very specific information extracted from a Document, giving a fourth level to the sourcing hierarchy. The Extracted Detail records refer to the Document records they come from. Other records can refer to the Extracted Detail records, indirectly referring to a Document, thus a Source, thus a Repository.
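So, as described, the sourcing chain is four levels deep, and a record citing its evidence points at the bottom of the chain:

Individual / Event  ->  Extracted Detail  ->  Document  ->  Source  ->  Repository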

Then there are a whole bunch of other record types that seem to be heading into gibberish land to me. It really looks like the authors were given free range to wander all over the place, and they sure did. I think this was written at almost the same time the GenTech model was circulating, and it seems this had a very unwelcome influence. Frankly I think the author of this document just went a little nuts with all the power he temporarily possessed, and it's no wonder none of this stuff really made it to the light of day.

Now comes a VERY INTERESTING feature of the model. Event and Individual records can have "BasedOn" structures that are defined as follows:

"+0 BasedOn
+1 Ref=<Record#> /* Record of the same type */ {1:1}
When records which represent the same thing are merged, the original records and a history of how they were merged can be kept by: Merging the duplicate records into a new record, Keeping the old duplicate records, or Linking the new record to the duplicate records from which it was derived. This process, called “merging forward”, uses the BasedOn Link substructure links the old records to the new record."

Follow all that? It's pretty awkwardly stated, but this is the link that I have in the DeadEnds model for allowing the building of evidence and conclusion trees of records. This is the first time (other than in the Assertion record of the GenTech model) that I have seen this link, the one I find critical to implementing the genealogical process. I was rather pleased to find it. Note that they got the final statement in the quote wrong: it's the new records that point to the old ones, as it should be.
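A hypothetical illustration, using the document's own +n notation (the record numbers are invented), of a conclusion Individual "merged forward" from two evidence Individuals; the new record carries the BasedOn links back to the old ones:

Individual #3                 (the new conclusion record)
  +0 BasedOn
  +1 Ref=#1                   (evidence Individual extracted from one source)
  +0 BasedOn
  +1 Ref=#2                   (evidence Individual extracted from another source)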

Enough for now.

Tom
GeneJ 2011-01-11T17:34:58-08:00
@Tom
Under the discussions for the page "Evidence and Conclusion Process," I'm going to post a few comments related to your posting.
louiskessler 2011-01-11T19:26:02-08:00

Tom,

I love the direction you're taking.

No complaints whatsoever.

Of course, the Future Directions Document was a precursor to the GEDCOM XML 6.0 Spec which incorporated most of its ideas.

Together all that is an excellent base to build upon and change what needs to be changed.

Louis
testuser42 2011-01-13T15:15:48-08:00
"BasedOn" structures:
In the later GEDCOM XML 6.0 they use this only for <family> and <LDSOrdinance>, and these can only be based on <event>s:

<!ELEMENT BasedOn (Event*, Note*)>

But even if they didn't understand the possibilities (or they did, and got scared) - we need something like that to connect different pieces and build sustainable trees of conclusions.