



BetterGEDCOM Data Models & Syntax


For the BetterGEDCOM Comparison Spreadsheet, see Data Models Home > BetterGEDCOM Comparisons

Data Models Summary Page

XML Data Syntax


History of GEDCOM

Comments

GeneJ 2011-04-20T07:49:58-07:00
John H Yates, "A Genealogy Data Model (Matrix Algebra) Specification"
See: APG Public List message (19 April 2011)

http://mailman.modwest.com/pipermail/apgpubliclist/2011-April/003389.html

See also,
http://jytangledweb.org/genealogy/model/
GeneJ 2011-04-20T07:53:03-07:00
See also, response to APG Public List by September McCarthy, in part, "Nice work, John! As a mathematician I can appreciate your modeling approach.
It is an interesting abstraction of the genealogy model, particularly when
you look at the larger picture of multi-dimensional matrices, i.e. family
groups, etc. and their intersections, overlaps, etc. In fact the matrix
approach may be better suited as a possible solution to the question of
linking individuals to each other as well as to records, events, etc. within
a genealogy program (or even the Gedcom). Definitely something to think
about! .... There is a group that is working to develop the specs for a new Gedcom.
While, in my humble opinion, they are a bit caught up in the approach of
using XML to describe their new Gedcom (and I myself have been working on my
own ideas), you might be interested in joining us. . .
http://bettergedcom.wikispaces.com. It's a small but active group of
genealogists with programming backgrounds, and while I haven't been as
involved as I'd like to have time for (every time I do I end up spending my
available time catching up/reading about what's been happening), my
impression is that all opinions are welcome. Hope to see you there!"
testuser42 2011-05-19T11:29:54-07:00
Family Pack
I've found something and added it to the
http://bettergedcom.wikispaces.com/Data+Models
page.

The usenet discussion "Event-oriented genealogy software for Linux" has a few more interesting posts.

Among those, Peter J. Seymour mentions the "Gendatam System" (http://www.gendatam.com/index.htm), which might be worth a look, too.

There's also John Prentice who says he's
working on something.
ggpauly 2011-08-28T06:57:30-07:00
XML?
Choosing XML may prove to be a mistake similar to GEDCOM's choice of ANSEL instead of UTF-8.

JSON (http://en.wikipedia.org/wiki/Json) is easier to read for both humans and software.

Effort should be placed, first and foremost, on defining data structure. Serialization of that structure can be done a number of ways. XML was the 1990s fad; JSON is this decade's.

This should help with your task: don't worry about XML (or JSON) details, at least not until the data structure is defined. You have enough on your plate in trying to define a data structure for people's lives.

Serialization and encoding are manifestations of a data structure. Efficient software will have other manifestations, such as a database. Genealogy software will have to be rewritten to deal with a change in data structure, the extent of the rewrite corresponding to the extent of change. A change in serialization will affect only limited parts of well-designed software.
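
For illustration, here is one minimal record serialized both ways (a sketch only -- the element and key names are invented for this example, not a proposal):

<person id="p1">
  <name>John Doe</name>
  <birth date="1850" place="Norwich, CT"/>
</person>

{ "person": { "id": "p1",
              "name": "John Doe",
              "birth": { "date": "1850", "place": "Norwich, CT" } } }

Either form parses into, and can be regenerated from, the identical internal structure.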
gthorud 2011-08-28T09:17:26-07:00
Well said!
gthorud 2011-08-28T11:46:11-07:00
I should perhaps be more precise. The important point is that the syntax is not a major issue before you have some "new" data to encode.
ttwetmore 2011-08-29T23:59:36-07:00
That XML is just one of many possible external syntaxes for representing BG data has been accepted by BG since its inception, as has the concept of separation of the model from its external representation. "Preaching to the choir" is the appropriate term here, I believe.

We might add Google Protocol Buffers to the list of syntaxes. I'm using them now in a genealogy application that I'm helping others with, and they provide another straightforward way of representing, archiving and transmitting data based on models. Protocol buffers' claim to fame is that their archived and transmitted forms use a minimal-length binary format. Since my chosen development platform is Objective-C on a Mac, I'm having to use protocol buffers in a slightly unconventional environment, since most protocol buffer applications are written in C++, Java or Python.

I now have a protocol buffer representation of the DeadEnds model to experiment with. In the experiment I am extracting genealogical data in a rather complex XML format, then running an XSLT "program" to convert the data to DeadEnds style XML, and then converting that DeadEnds style XML to DeadEnds protocol buffer binaries. The Mac environment provides URL connection handling, XML SAX and DOM parsing and processing, XSLT and XPath processing, in such an excellent set of APIs, that this is surprisingly easy stuff to play with. All I had to do was provide some graphics UI, wire up some simple code, and jimmy-up a protocol buffer compiler and a protocol buffer infrastructure library.

There is no real reason to be uni-lingual either. Use 'em all. In my current experiments the external syntax is set by a simple parameter to the output routines -- GEDCOM, XML, JSON or protocol buffers. The only requirement to make this possible is to make sure the internal representation of genealogical objects is at least minimally tree-structured (see the sketch after the list below).

GEDCOM
XML with Schema 1
...
XML with Schema N
JSON
JSON without quotes on the keys
Google Protocol Buffers
CSV tables with column specifications
Custom1
...
CustomN
Desktop Program 1 Custom
...
Desktop Program N Custom
Family Tree Server 1 Custom
...
Family Tree Server N Custom
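
As a sketch of the "same record, different syntax" idea, here is one persona rendered as GEDCOM and as protocol buffer text format (the tags and field names are invented, and the protocol buffer schema is hypothetical):

0 @I1@ INDI
1 NAME John /Doe/
1 BIRT
2 DATE 1850

persona {
  id: "I1"
  name: "John Doe"
  birth { date: "1850" }
}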

(The comment that JSON is easier for software to read is a non-starter for me. Any well-defined syntax (e.g., GEDCOM, XML, protocol buffers, CSV) can be read by a simple parser of equivalent complexity, and such parsers all exist and are easily available. Plus I never feel compelled to anthropomorphize about software! I will certainly go along with the comment that JSON is easier for humans to read than XML or CSV, but not easier than GEDCOM or custom formats designed expressly for genealogical data. That JSON is an excellent candidate for representing BG data is something I will certainly agree with, however.)
ttwetmore 2011-11-09T16:21:13-08:00
Vocabulary for Describing Genealogical Data Models
We should move detailed model discussions away from the GenTech criticism thread. So I’m starting this thread in the discussion area for data models.

Since discussion and agreement is difficult (impossible?) without a common vocabulary, I would like to propose a set of definitions that might prove useful in forming a consistent vocabulary for discussing genealogical data models.

If you are interested, please take a look at the following set of definitions to see whether you like or hate them, would like to suggest changes, or would like to suggest additions. If you think I am unfairly pulling in some direction or other, then please suggest some other arrangement of ideas.

Source -- a source is something in the real world that we inspect for information of genealogical significance. It might be a physical thing, like a book, or a church register, or a land record, or a census record. It might be a representation of any of these things that has been scanned or indexed and made available in some other form, especially on the world wide web.

Source Record -- a model/computer entity/class/record that describes a real world source so that someone else can find the real world source and inspect it for themselves. A source record does not contain any representation of the genealogical information in the source.

Evidence -- evidence is information found in a source that is of genealogical importance. It will mention one or more persons, probably in the context of one or more events. Because the information is important, a genealogist needs to get it into the data model somehow. Evidence is a real world concept, even if we are viewing a representation of it on a web-page.

Evidence Record -- a model/computer record that contains some representation of real world evidence as found in a real world source. There are not a lot of examples of evidence records to be found in current models, but we can easily imagine some forms that evidence records could take. All evidence records should have a source reference to the source record the evidence is from. The reference would hold the information needed to locate the evidence in the source, e.g., page number. An evidence record will also contain some representation of the real world evidence. Many possibilities for representation exist: it could be a literal transcription; it could be a translation from a foreign language; it could be a summary or synopsis; it could be a URL to an image of the evidence. Or it could be tagged markup of the evidence text, or a structured breakdown of the evidence into a specific, standardized format (e.g., XML elements that conform to a specific schema) for expressing dates, places, names, events, persons, roles and relationships. Or an evidence record could hold a combination of any or all of the above. Since an evidence record could be so many things, any model that includes one should specify clearly what it contains in that model.
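
As an illustration only -- the element names here are invented, not a proposed schema -- an evidence record might look like:

<evidence id="v1">
  <sourceref sourceid="s1"> ... page number, entry number ... </sourceref>
  <transcription> ... literal text of the evidence ... </transcription>
  <image> ... URL of a scan of the evidence ... </image>
</evidence>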

Source Reference -- a link in a computer record from the record containing the reference, to a source record. The source reference may contain additional information, primarily to describe where in the source evidence was found.

Person -- a real person in the real world, dead or alive.

Persona Record, or Evidence Person Record -- a model/computer record that contains all the information about a single person that can be extracted from a single item of evidence. The persona record may contain a source reference to the source record it is from. If the model has evidence records, the source reference might be replaced by an evidence reference that refers to the evidence record the persona was extracted from. Since that evidence record will have a source reference, the source for the persona is accessible indirectly through the evidence record. The persona record has not been used in many desktop systems, because those systems were designed primarily to hold only conclusion information. However, some models are now using this record type. For example, the New Family Search model uses personas. The GenTech model also uses personas.

Evidence Event Record -- a model/computer record that contains all the information about a single event that can be extracted from an item of evidence. The record does not contain direct information about the persons in the event, because that information is in persona records. The evidence event record holds role references, which are references to the persona records that are the role players in the event. Therefore the evidence event record and its set of role player persona records form a tightly linked cluster of evidence-based records, all from the same item of evidence. There are different ways that the source references for the cluster can be handled. Obviously every record in the cluster could have a source reference to the same source (or an evidence reference, see above). Or only the event record might have the source reference, since each persona can find the source reference through the event record.

Citation -- a citation is a formatted string that is used in footnotes or bibliographies in reports and other documents. A citation is not a record in the database. The information needed to generate a citation string must be found in the model, of course, and that information is in the source references and in the source records the references refer to. Recursive sources are possible, whereby one source record may also contain a source reference to a higher level source record. A good example of this is a journal article. A journal article should be represented by a source record, and a source reference to the article source record would mention pages in the article. On the other hand, the article itself is inside a journal, and there should be a separate source record for the journal as a whole. The article source record would have a source reference to the journal source record that would contain the volume number of the issue containing the article. The citation for the evidence in the article would then be formed from information about the article and journal and the references to them. There is a second part to generating citations -- the format or template to be used. These templates indicate the elements and their order to be used in the citation (e.g., author, title, publication date, page number), the styles that should be used in formatting them (e.g., quoted, bold face, italic), and the punctuation between elements. Some models have an entity called a citation. This should be avoided since it leads to so much confusion.
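
As a sketch of the recursive journal/article case (element names invented for illustration):

<source id="s1" type="journal">
  <title> ... journal title ... </title>
</source>

<source id="s2" type="article">
  <title> ... article title ... </title>
  <author> ... author ... </author>
  <sourceref sourceid="s1"> ... volume and issue number ... </sourceref>
</source>

A source reference from an evidence or persona record to s2 would carry the page numbers, and a citation template would assemble the citation string by walking from s2 up to s1.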

Person Record -- a computer record that represents a real person. It collects together all the information a researcher has found that he/she believes pertains to a single person. In models that also have persona records, the person record can be little more than references to the persona records that the genealogist has decided refer to that person. In a model without personas, the person will be made up of information that has either been taken from evidence records or, for systems that have neither persona nor evidence records, copied directly from the real world evidence. This latter case is the most common one for desktop genealogical systems today, since few of them support either evidence or persona records.
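
A sketch of such a person record in a model with personas (element names invented for illustration):

<person id="P1">
  <name> ... preferred name ... </name>
  <personaref personaid="p1"/>
  <personaref personaid="p2"/>
  ... notes on the reasoning for combining p1 and p2 ...
</person>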
Andy_Hatchett 2011-11-09T16:35:54-08:00
This will be a good thing for end users. The final version of it could help us understand just what the techs are talking about - or at least give us a start at doing so!
:)
AdrianB38 2011-11-10T09:06:42-08:00
A few tweaks from me...
Source - a conversation is surely a source - and I mean a conversation, not a record of a conversation. That sits uncomfortably with "something ... that we inspect". How about, "something that we might obtain information of genealogical significance from."

Evidence - oh dear, seems like I've got to make the comment that might upset Louis. There are 3 concepts to me - data, information and evidence. Data is just, well, data. Information is data that is (potentially?) meaningful to a realm of study. Thus, a hexadecimal string is just data. When an algorithm is applied to that hex string to turn it into an image, then it becomes information. (Discussion for pedants - if there is the stain of a coffee cup on a census form - is that information? If it's a finger-print, is that information?)
Now, the 3rd concept is evidence and genealogy seems to have adopted this evidence - evidence is information that is relevant to a specific problem.
That definition - which I'm happy with - means evidence _cannot_ exist without a specific problem. Hence you cannot take a source and distil the evidence out of it for publication (e.g.) and then for use by all and sundry, because you don't know what their problems are. What you are doing is distilling all the apparent _information_ out and recording that.
Thus, the Cheshire Parish Register Database, which is a series of (complete) abstracts of Parish Register is a collection of information, not of evidence.
My reference to Louis is because I consider that his evidence records are, in fact, information records. Sorry! (Though when used in a specific problem, they also take on the role of evidence. Does that confuse?)

Tom - feel free to recast evidence and evidence records how you want / need.

Persona - what about relationships, e.g. from censuses?

Citation - "a formatted string that is used in footnotes or bibliographies" - pedantic point, add end notes and in-line notes (is that what they're called?)
ttwetmore 2011-11-10T10:37:16-08:00
Adrian,

I agree with your notions of data, information, and evidence. The way I put it, "evidence is information of genealogical significance." You would prefer to go further and say that "evidence is information of genealogical significance that addresses a genealogical problem we are trying to solve." Theoretically I agree with you, but practically that's a mouthful. For me the term "significance" implies that it applies to a problem -- I hope you can appreciate that nuance. I guess my gut would say to extend the notion of evidence to be a little bit more than just information needed to solve a problem, to include the idea of information that I have reason to believe may address an as yet unknown problem. Yes, it is a little subtle, but let me give an example that I think we can use to discuss this.

Let's say I have the "problem" of determining the detailed lifeline of a Daniel Wetmore who lived in Norwich, Connecticut, in the 19th century. He is listed many times in the Norwich, Connecticut, city directories, from 1860 through the end of the century. But I discover that many other Wetmores were also living in Norwich during the same time period, some even at the same address as Daniel. I obviously assume these are related persons, but up to this point I had never heard of them and so they hadn't been part of any "problem" I had been trying to solve. But seeing those names I can't help but believe that they will become important to me and my research at some time in the future. So, when I start extracting data from the city directories into evidence records, I decide to extract the information about all the Wetmores and not just Daniel. So now I have "evidence" records in my database about people I hope to someday figure out. I would prefer to avoid any semantic shenanigans and just call all these records evidence, even though by your precise definition only the records about Daniel Wetmore are truly evidence. Any further thoughts stimulated by this example?

I will update the source definition to reflect conversations (and interviews and letters).

I will pedantify the citation definition!

Personas and relationships. The idea is that the evidence events have "role references" that point to the persona records. This is the site for the relationships. At least in my mind -- I don't want to force this though -- it is simply the logical place to put that information if evidence is extracted and placed in evidence level event and person (aka persona) records.
AdrianB38 2011-11-10T12:42:23-08:00
Tom - how about we say that for convenience, the term "evidence" includes both evidence relevant to an actual problem and evidence that may be relevant to specific problems? I'm reluctant to contradict either the view from the outside world of data / information and / or contradict the genealogical definition of evidence - hence my concern about the text as written - but extending "evidence" to encompass future, unknown problems seems not to contradict it but extend it in an unobjectionable manner. If that extension suits both you and Louis, then who am I to argue?

Re personas and relationships - it's probably adequate to say that relationships _will_ be recorded through _either_ evidence events _or_ evidence people, rather than try to design the data model in a definition. (My tidy mind has a problem with saying "evidence events" and "personas" - it doesn't match - I'd rather say "evidence events" and "evidence people". I don't think this causes an issue?)
ttwetmore 2011-11-10T13:19:25-08:00
"how about we say that for convenience, the term "evidence" includes both evidence relevant to an actual problem and evidence that may be relevant to specific problems?"

That suits me very well because it makes a great deal of sense -- it passes the smell test!

"Re personas and relationships - it's probably adequate to say that relationships _will_ be recorded through _either_ evidence events _or_ evidence people, rather than try to design the data model in a definition. (My tidy mind has a problem with saying "evidence events" and "personas" - it doesn't match - I'd rather say "evidence events" and "evidence people". I don't think this causes an issue?)"

Well said; I'll find a way to integrate that.
WesleyJohnston 2011-11-10T09:49:10-08:00
Benchmark Cases
While a number of different benchmark cases have shown up in some of the threads that I have read, I have not seen anything in the wiki that specifically focuses on benchmark cases.

So I am starting this thread for that purpose.

I see two main types of benchmarks:

1 - real-world challenging situations

2 - torture tests

Here are some of the ones that I have found.

1 - Real-World Challenging Situations

Adrian has posted one in the GenTech thread: http://bettergedcom.wikispaces.com/message/view/GenTech+Data+Model/31757871#45973620

There are some contained in the "Inferential Genealogy" lesson that is discussed in the wiki at http://bettergedcom.wikispaces.com/message/view/Defining+E%26C+for+BetterGEDCOM/39974836


2 - Torture Tests

Tamura Jones has created some torture tests, which she makes available at http://www.tamurajones.net/ThreeTortureTests.xhtml
WesleyJohnston 2011-11-10T18:53:29-08:00
Gene, thanks for pointing out the Test Suite page.
AdrianB38 2011-11-11T08:21:41-08:00
"How do you think models should handle these kinds of event to person roles that also imply person to person roles?"

My gut feeling is that (in the case of births, etc.) the person to person relationship is just a summary of a pair (or more) of event to person relationships. James Doe is the son of Mary Doe precisely because there is an event "birth of James Doe" where James is the child, Mary is the mother (and John the father).

Simply saying James Doe is the son of Mary Doe actually _hides_ the event. We can be pretty certain that James was born somewhere, sometime.

Therefore, I guess the logical result of that is that it would not be good practice to use a straight person to person relationship to record a parent / child relationship. This may sound a little bit weak - I'm not forbidding it, just giving mild disapproval. However, it would be theoretically possible to use ASSO in GEDCOM style programs to record a parent / child relationship - people don't simply because it's more obvious to use the Family structure, so I'd similarly hope that the apps using BG would encourage the use of a pair (or more) of event to person relationships. On the other hand again, it gets difficult to talk about such software when we don't have practical experience of it.

And on yet another hand, what about recording a grandparent / child relationship? If we have a conversation describing the family, then there's no obvious event, so we seem to fall back onto a straight person to person relationship. (Personally, with what I now realise is a dislike of hidden entities, I actually always create a "dummy" person to be child of the grand-parent and parent of the grand-child, but I appreciate others have a severe distaste for this method).

So, I think we have a "best practice" of un-hiding events by splitting a straight person to person relationship into person to event relationships when the event is meaningful. (Whatever "meaningful" means....)
WesleyJohnston 2011-11-11T12:34:48-08:00
AdrianB38 wrote "My gut feeling is that (in the case of births, etc.) the person to person relationship is just a summary of a pair (or more) of event to person relationships. James Doe is the son of Mary Doe precisely because there is an event "birth of James Doe" where James is the child, Mary is the mother (and John the father).

Simply saying James Doe is the son of Mary Doe actually _hides_ the event. We can be pretty certain that James was born somewhere, sometime.

Therefore, I guess the logical result of that is that it would not be good practice to use a straight person to person relationship to record a parent / child relationship. This may sound a little bit weak - I'm not forbidding it, just giving mild disapproval. However, it would be theoretically possible to use ASSO in GEDCOM style programs to record a parent / child relationship - people don't simply because it's more obvious to use the Family structure, so I'd similarly hope that the apps using BG would encourage the use of a pair (or more) of event to person relationships. On the other hand again, it gets difficult to talk about such software when we don't have practical experience of it."

..........
What about when you have a death certificate as the only source of the name of a parent? I'm not sure what the actual implementation of your notion would look like.

Would you fabricate a presumed birth event so that you could use it as evidence for what the death record is saying about the parent-child connection? Or would you create both a parent-child relationship and a fabricated birth event for which the parent-child relationship is evidence?

Or is there an a priori creation of a birth event for every person? Not sure where this leads when you have to actually implement it. But it seems as if you wind up fabricating birth events.
AdrianB38 2011-11-11T14:44:55-08:00
"What about when you have a death certificate as the only source of the name of a parent?"

Hm.

Following up Tom's census event, where the relationships are the roles in the census event, I guess one option would be to have a death event, with role "deceased" for the person who dies, another role(s) of "parent" for the parent(s). I guess my concern with this approach is that we could be scattering family relationships everywhere.

My temptation would be to create a birth event - within the birth event there would be references to 3 people, each with a role, namely "child", "parent", "parent" (assuming we have the names of both parents). I think that would be all we need.

I don't see there would be a priori creation of a birth event for every person - only if we need to record something about that event. I don't think there's anything wrong with "fabricating" birth events - we always know someone has been born (unless this is Star Trek and they're from an alternate time line...). I am concerned, however, that there might be some sort of event where we can't say that it occurred and the fabrication is less acceptable.
ttwetmore 2011-11-11T15:51:32-08:00
“What about when you have a death certificate as the only source of the name of a parent? I'm not sure what the actual implementation of your notion would look like.”

[ASIDE: I am using XML in the examples I give -- this is because we all understand XML -- DeadEnds IS NOT XML; it is a model; it can be represented in many ways, including GEDCOM for goodness sake (given a few more tags); I just need a way to show the examples, so I pick XML].

The death certificate is evidence of an event, so in DeadEnds I would create an event and two personas, as follows...

<event type="death" id="e1">
  <date> ... date of death ... </date>
  <place> ... place of death ... </place>
  ... attributes for cause of death, etc ...
  <role type="primary" personid="p1"/>
  <role type="father" personid="p2"/>
</event>

<person id="p1">
  <name> ... name of deceased ... </name>
  ... attributes from certificate ...
</person>

<person id="p2">
  <name> ... name of deceased's father ... </name>
</person>

The relationship between the two is implied in the roles, so theoretically at least, nothing more need be encoded; the software can figure out the inter-personal relationship when it becomes important. Note that I am not set in my ways about this. If relations were added to the two personas to make them refer to each other I wouldn't mind. If someone wants a family record a la GEDCOM with links corresponding to HUSB and CHIL as well, I wouldn't mind. But in terms of recording just the evidence, these three records (plus a fourth for the source) seem the right way to do it to me.
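
For comparison, and purely as a sketch (standard GEDCOM has no standalone event records, so the event collapses into the deceased's record), the same evidence forced into GEDCOM might look like:

0 @P1@ INDI
1 NAME ... name of deceased ...
1 DEAT
2 DATE ... date of death ...
2 PLAC ... place of death ...
1 ASSO @P2@
2 RELA father
0 @P2@ INDI
1 NAME ... name of deceased's father ...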

“Would you fabricate a presumed birth event so that you could use it as evidence for what the death record is saying about the parent-child connection? Or would you create both a parent-child relationship and a fabricated birth event for which the parent-child relationship is evidence?”

I would not fabricate a birth event for the deceased. I believe that the only event records that should be created should, by and large, be those directly supported by the evidence. A partial exception I use for this comes from census records, where generally the person's age is given and often a birth place (to some high level such as state or country). Since I believe one of the purposes of the census record is to indirectly document birth information about all the enumerated persons, I feel that creating these birth records for each person in a household is legit.

“Or is there an a priori creation of a birth event for every person? Not sure where this leads when you have to actually implement it. But it seems as if you wind up fabricating birth events.”

I don’t want fabricated birth events. I don’t see any need for them.
ttwetmore 2011-11-11T16:05:32-08:00
"I would not fabricate a birth event for the deceased. I believe that the only event records that should be created, should by and large, be only those directly supported by the evidence. A partial exception I use for this comes from census records, where generally the person’s age is given and often birth place (to some high level such as state or country). Since I believe one of the purposes of the census record is to indirectly document birth information about all the enumerated persons I feel that creating these birth records for each person in a household is legit."

Actually, having just said this, I need to rescind it. If the death certificate gives details of the person's birth, including a birth date, or an age, or a birth place, then the certificate ALSO gives evidence for the person's birth, so it is very legitimate to also generate a birth event.

This gives me an opportunity to bring up another topic that I have mentioned many times.

So let me do more on the birth example from the death certificate. Say the death certificate gave the deceased's birth date and birth place. You could do one of two different things. First you could create a birth event and a persona.

<event type="birth" id="e2">
  <date> date of deceased's birth </date>
  <place> place of deceased's birth </place>
  <role type="primary" personid="p1"/>
</event>

Note the personid is p1, so this event points to the same persona that the death event points to.

This is method one. I don't really like this, for a subtle reason: the purpose of the death certificate is not to document a person's birth, and the full-bore event and persona approach should, to me, be reserved, where possible, for evidence whose actual purpose is to document the event.

So the second method, and the method that I use would be to simply add to the persona record as follows:

<person id="p1">
  <name> name of deceased </name>
  ... attributes from certificate ...
  <event type="birth">
    <date> date of birth </date>
    <place> place of birth </place>
  </event>
</person>

In the DeadEnds model, when an event is placed inside a person record I call it a "vital." It is really just a one-role event that is placed inside the record of the primary role player; the role is not mentioned as it is implied. Other role players could theoretically be added to these vital events -- that is allowed by the DeadEnds model -- though I don't see much use for them.
EssyGreen 2011-11-13T23:46:35-08:00
"I would not fabricate a birth event for the deceased." .... I would ... the way I see it we have life events which really existed in the past and we have life events which were *recorded* as having happened in the past ('recorded' could be as loose a definition as a memory).

The aim of the genealogist is to try to establish what happened from what was recorded. (And in doing so of course they provide yet another layer of recorded data).

How far the records reflect reality is largely (if not wholly) due to interpretation.

Imagine a Venn diagram with Reality in one circle and Records in the other. The bit in the middle is the only bit which represents an accurate reflection of actuality and is pretty small. But if we add a fuzzy bit (interpretation) then we get a bigger picture ... it's myth rather than fact but it's bigger and better than only showing hard proven facts.

It also raises questions as to *why* things were recorded the way they were. Take a simple example of a Birth Cert with a blank for the father's details. ... We don't assume that the child was the immaculate conception! We know there was father, just not who he was. And we then ask why he is not recorded: unknown? illegitimate? adopted?

Apologies for rambling on ... This is probably the wrong place to discuss this so please feel free to shove me somewhere more appropriate :)
ttwetmore 2011-11-14T03:12:36-08:00
EssyGreen,

For full disclosure, I did rescind the statement of mine that you quoted, with a long explanation of why, which agrees with your points here.

For me the question comes down to fairly silly-sounding ones. When and why and how do we record a birth event, and even in more general terms, what is a birth event?

There was a birth event for everyone. Do we automatically add a birth event for everyone when we add them to our database? I don't. What does it take to add a birth event? It obviously must include some information that allows inference of place or date. Any record that mentions birth place or date can be used -- census, death, marriage, arrival, draft registration, ...

Should we think of the birth event as a simple date and place attribute in a person record, or should we think of it as a highly participatory event that includes mother and father? Nearly all genealogical systems force the first interpretation, while the parent-child relationships are encoded independently.

With the advent of systems/models that allow event-based data, is there reason to change this view? If events become independent records with multiple roles, then the interpersonal relationships can be implied by the roles. Does this mean the parent-child links or the family links of today's systems would become obsolete, or are they just redundant, and do they still have an important role (smiley) to play?

I think we need some very simple case studies. Here are a few examples. How would you wish to record the following items (assume they are all found in legitimate sources):

"One of John Williams's sons was named George Escott, named after John's wife's father."

"John Williams died in Orange County, California, in 1918, at the advanced age of 93 years. At his death he had 13 living children, 56 living grandchildren, and 43 living great-grandchildren."

"John Williams is survived by his daughter-in-law, Mrs. George Williams."

"John Williams's mother was Lucy Putnam, of Worcester, Massachusetts, who died in 1897, when John was 6 years old."
ttwetmore 2011-11-14T03:20:49-08:00
Here are a couple related questions I've pondered.

Let's say you have someone in your database who was born in 1960, and you know the person is dead, but you don't know where or when. How do you record that? Here is an event whose mere existence is important in some sense. I record these as "attribute-less" events. In GEDCOM it would be the naked "1 DEAT" line (some people use "1 DEAT Y").

How about marriage events? You know two people were married, but you don't know where or when? Do you record a marriage event? In GEDCOM you might stick a naked "1 MARR" in the FAM record.
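
For concreteness, the two cases might be sketched in GEDCOM as follows ("Y" being the convention some programs use to assert a death with no known details):

0 @I1@ INDI
1 NAME ... name ...
1 BIRT
2 DATE 1960
1 DEAT Y

0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR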

You know I like to live in the world of evidence and conclusions. I talk a lot about personas and persons. I don't talk a lot about the corresponding concepts in the event world, but evidence events exist as well as conclusion events. Are these implied death and marriage events, which are very important in a genealogical database sense, just demonstrating the differences between an evidence event and a conclusion event, or is there something else going on?
EmmanuelBriot 2011-11-14T04:08:15-08:00
Going back to the initial message in this discussion and how registering the "cousin" relationship would be implemented in the GenTech model, I think the answer is exactly the same as Tom's:

- either you create a census event with roles
- or you register a direct persona-to-persona link with a "cousin" type.

(I would personally do the latter, I think).

The more interesting question is that of the GUI itself: I am not sure how it could help you to match existing persons: say you have P1 and P2 in the database marked as cousins. You know P3 is P1's father, who had a brother P4. When you insert P5 as a child of P4, the GUI would be helpful if it could remind you that you also had a P2 as cousin, and ask whether that could be the same person. Tricky to implement right, but that would be awesome...

Another point discussed in this thread is whether using events and roles is enough to describe the pedigree or whether we also need persona-to-persona links. Of course, the idea is not to prevent the latter. In my toy implementation, I have so far relied on birth events and the principal/father/mother relationships to build the pedigree. It would be very easy to also take into account the persona-to-persona relationships when they exist. I wouldn't make those mandatory, though, because of the potential redundancy in the database.
ttwetmore 2011-11-14T06:40:25-08:00
Emmanuel, The user interface possibilities are nearly unlimited and very exciting. I've been thinking about user interface metaphors for dealing with personas for a long time. I just wish I had the level of experience needed to prototype some of the ideas quickly. I have shifted my development environment to Mac OS X and Cocoa as a way to prepare for the experimentation.

Most of my user interface ideas center around the concept of a large desktop window holding a tableau of cards or slips of paper or "stickies" that represent personas and events. Users are free to move the objects around, stacking or grouping them, removing and adding to the tableau as desired. When P5 is added in your example, I would want software to immediately compare P5 and P2, and if there were enough similarity, to bring P5 onto the tableau with a suggested affinity with P2. Frankly, I think this is an easy thing to do.
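
As a sketch of how simple that comparison could be (Python, purely illustrative -- the field names, weights and threshold are assumptions for this example, not part of any model):

# Crude persona-similarity score; returns a value in [0, 1].
def similarity(a, b):
    score = 0.0
    if a.get("surname") and a.get("surname", "").lower() == b.get("surname", "").lower():
        score += 0.5                      # matching surname
    if a.get("given") and a.get("given", "").lower() == b.get("given", "").lower():
        score += 0.3                      # matching given name
    ya, yb = a.get("birth_year"), b.get("birth_year")
    if ya and yb and abs(ya - yb) <= 2:   # birth years within two years
        score += 0.2
    return score

p2 = {"given": "James", "surname": "Wetmore", "birth_year": 1850}
p5 = {"given": "James", "surname": "Wetmore"}

if similarity(p2, p5) >= 0.6:             # threshold is arbitrary
    print("suggest a possible affinity between P5 and P2")

A real implementation would of course use fuzzier name matching (spelling variants, initials) and more fields, but the principle is no more complicated than this.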

I believe the tableau of personas is the best user interface for letting users grok the relationships that drive the decisions that build up conclusion persons from the persona records. It is this idea of effective visualization of the evidence, and of algorithmically using the evidence, in the support of building up conclusions, that makes me such a strong believer in the persona concept. Any system that does not employ personas forces the genealogist to do all this reasoning "in their heads."

Soapbox: We need software to help in the actual process of genealogical research and conclusion making. Our efforts should be focused on making that happen. The persona is the key concept in this, so user interfaces that allow effective visualization of personas will be the user interfaces of the future. Emmanuel has a user interface that is persona driven. The New Family Search user interface is persona driven. What they lack is effective and interesting ways to allow users to truly interact and work with them. This will come.
WesleyJohnston 2011-11-10T10:00:34-08:00
There are a number of different familial relationships that appear in a census that do not fit into the person-spouse or person-child formulation without some fabrication of one or more intermediate pseudo-persons, who may eventually become known but who are not knowable from the record itself.

These terms include: grandson, nephew, brother-in-law, cousin, and more. You want to capture the relationship between such a person and the head of the household in your database in some form that will not simply be buried in a free form textual note, since you would like to see all people connected to a person when you are considering them. You do not want to just ignore the person, since their presence in the record could be the key to something else.

So there are two issues to this as a benchmark case (and let's use "cousin" as the relationship for the benchmark).

1 - How do you store the person and their connection to the head of household in a manner that elevates that connection from a mere textual notation to an explicit relationship that you can then automatically expect to appear when you want to see everyone who is connected to your target person in a later analysis?

2 - How do you modify the representation once you find sufficient evidence for how that relationship is able to be expressed with a combination of person-spouse and person-child relationships?
WesleyJohnston 2011-11-10T10:13:47-08:00
There are a number of relationships between people that show up in particular events. These may or may not be familial relationships. You may not know from the record whether they are or not. But you want to capture the relationship and represent it explicitly in your database, so that it will show up when you want to see who is connected to a person.

These relationships show up with terms such as these: informant, baptismal sponsor / god parent, marriage witness, etc.

- How do you specify such relations as explicit relations, so that they show up when you are seeking to know who is in any way connected to your target person?

Existing systems sometimes do have these terms as events that you can assign as attributes to a person, which can be redundantly entered as events for the persons who are thus connected. But the existing systems do not support an actual defined linkage / connection / relationship between the people involved, which could then later be retrieved no matter which person in the relationship you are looking at.
ttwetmore 2011-11-10T11:11:27-08:00
Showing cousins from census records in the DeadEnds model is done as follows:

<event type="census" id="e1">
  <date> ... </date>
  <place> ... </place>
  <role type="head" personid="p1"/>
  <role type="spouse" personid="p2"/>
  <role type="cousin" personid="p3"/>
</event>

<person id="p1" sex="m">
  <name> ... </name>
  ...
</person>

...

This is a "cluster" of records including an evidence event record (the census event record) and three persona records (only one partially shown). The relations are handled as role types in the "role references" found in the census record, where it is assumed, because the event is a census type, that role types encode the relationships with respect to the head of household.

There is another way to do this in the DeadEnds model when there is no event to bind the role players together (or even if there is). Consider if we just have the evidence that says "James and Daniel were cousins." There's no event implied in the evidence, so all we do in DeadEnds is create a "cluster" of two persona records:

<person id="p1">
  <name> James </name>
  <relation type="cousin" personid="p2"/>
  ...
</person>

<person id="p2">
  <name> Daniel </name>
  <relation type="cousin" personid="p1"/>
  ...
</person>

(I'm not showing the source records and the source references that would be in these persona records; left as an exercise for the reader.)

In the DeadEnds model the inter-person links are called "relation references". If later research shows the common ancestor(s) that actually define how the cousin relationship is established, that's just fine. There will then be more personas in the database showing the more complete relationship, and some of those personas will be combined with the two personas just shown here. In the DeadEnds model personas are permanent. They never change except to correct errors. So the generic cousin relationship exists forever, right alongside the more direct parent/child relationships that will come to be after more research. Note that there is value in the cousin relationship, as it can be used as "proof" that the particular Daniel mentioned in the statement above is the same Daniel as captured later in another persona record.
GeneJ 2011-11-10T12:13:51-08:00
Hi Wesley,

You wrote, "number of different familial relationships that appear in a census that do not fit into the person-spouse or person-child formulation without some fabrication of one or more intermediate pseudo-persons, who may eventually become known but who are not knowable from the record itself."

Non-techie here.

Tom Jones clarifies the genealogical questions by saying these are issues of either (1) identity or (2) relationship. (See the Inferential Genealogy link you updated.)

Your cousin-in-census may raise many questions of both types--Is "James Miller" age XX who resides XXX in XXXX the same James Miller who .... ; what is his relationship to the recorded head of household, Joseph Turner who is ....; who are the parents of James Miller, who resided XXXX, age XX, as "cousin" in XXXX with XXX

I have the most experience with TMG. How/whether I have or will add ol' James Miller to the database due to the new census find will depend on what I believe I know. (Sometimes a single item of evidence just slides right into the family group, other times the value of the information is yet to be determined.)

(1) You wrote, "How do you store....

If I don't "know" this James Miller from the existing evidence (including the census information), I'd probably be reluctant to "conclude" the census was direct evidence (=answers question all by itself) that James and Joseph were "cousins," but the way I structure my tags and roles, that might not even be an issue for us here, as below.

(i) In TMG, I can still add ol' James to the database as a person and associate him with Joseph Turner via the tag/pfact using a "household resident" role. The field "person or item id" in my "Source Record" will identify James as "James Miller, residing as cousin in Joseph Turner household." In the BetterGEDCOM model I'd initiate a research item with focused goals about James identity and his relationship to parents (see the Jones cases; also BetterGEDCOM Requirements Catalog for "Research Log").

You wrote, "elevates that connection from a mere textual notation ..." In TMG, there is an "Associates window" reporting names and summaries of persons with whom the subject is, err... associated.

(ii) In the alternative, perhaps I know much about either/both James Miller and Joseph Turner, so that, together with other documents, I am able to conclude (have sufficient evidence to prove) the relationship between the men. I'd use the same tags and roles for James and Joseph, and the same documentation (source_record). The census might ALSO be indirect evidence of other relationships--perhaps of James to his parents, or Joseph to his parents.

(2) You wrote, "How do you modify ....
As above (a), if I have a research question to identify the parents of ol' James, age XX, residing XXX in XXXX as cousin with XXXX.... As part of that focused goal, I will continue to learn more information that can be documented in the database and either added as pfacts to James or accumulated in the research log, or some combination of the two (i.e., I might be building a proof in the research log).

Of course ... none of this answers Adrian's question about how to reverse errors.

Hope this helps.--GJ

P.S. Putting your "cousin" entry in a US centric context, we don't begin to observe relationship data/information in our federal census until a reasonably modern time (1880+). It doesn't help that our 1890 census was mostly destroyed. Continuing, the availability of early vital records (also church registers) is spotty--New England researchers here are blessed by the practice there of maintaining town records, and so many are extant. Further, it probably goes without saying that the word "cousin" has different meanings in different usage. Sigh.
GeneJ 2011-11-10T13:33:54-08:00
Hi Wesley,

You wrote, "informant, baptismal sponsor / god parent, marriage witness, etc. ... How do you specify such relations as explicit relations, so that they show up when you are seeking to know who is in any way connected to your target person?"

In TMG, for example,* I "associate" others in the event tags using roles. While I may have used some terms just a little differently, I'm sure I've entered roles for all the classic examples you listed.

:-) My mom was born in a small Wisconsin town--it had been settled earlier by mostly Norwegian immigrants. Both her parents were first generation "Americans." Mom's baptismal certificate is entirely written in Norwegian. (As is the church register where the baptism was recorded.) The certificate is a real treasure to me.

When I entered the baptismal certificate, I didn't know the relationship between my mother and all the baptismal sponsors (there were four). Two of the sponsors were listed as Mr. and Mrs Carl FroilXXX; the other two were Palmer PeterXXX and Marie OlsXXX.

Mr. and Mrs. Carl FroilXXX were known to me as mom's paternal aunt and her husband--Carl and Emma. I just linked my mother's baptismal tag (pfact) to Carl and Emma with the role "sponsors."

Palmer and Marie were not yet in my database; they were both added to the database via mom's baptismal tag and associated as "sponsors." I researched both to learn that Marie was mom's maternal aunt. Palmer turned out to be Marie's husband-to-be.

Roles/Associates are powerful tools in the genealogical database.

Do I enter every person's name from every document? No. I have a large holding of estate papers--with pages and pages of estate sale data. I have transcribed those documents, but I only enter tags/roles for selected entries, including circumstances when a transaction is part of my "evidence" for a proof/conclusion.

You'll find other discussions on BetterGEDCOM about the range of evidence--direct evidence, indirect evidence, conflicting evidence, circumstantial evidence and negative evidence. I do try to recognize all the evidence in building proof/reaching conclusions.

Hope this helps.--GJ

*As with my reply just above, I'm more familiar with TMG, but I know other modern vendors support roles and associates, too. See also the BetterGEDCOM Requirements Catalog.
WesleyJohnston 2011-11-10T13:53:12-08:00
I'm finding the limitations of a wiki really fast. I had wanted this thread to be a catalog of benchmarks. Unfortunately, there seems to be no easy way in wikispaces to have hierarchies of threads or otherwise discriminate between main and derivative topics without creating separate wikis for things and putting backward and forward pointers (URLs) into them. Otherwise they become long masses of text as every aspect of the thread is treated as of an equivalent type.

I very much appreciate Tom's and Gene's feedback. But I had not really envisioned this thread as a how-to for each particular product. I had envisioned it as a repository where we could post benchmarks that we would like to see addressed in the separate discussions about the products.

Wikispaces is helpful but it really becomes an overwhelming mass of text very quickly.
ttwetmore 2011-11-10T14:10:12-08:00
I believe the solutions to these issues are fairly well understood and implemented in some current systems.

And I assume we are asking these questions as benchmarks against proposed models for Better GEDCOM, not, for example, as a way to evaluate existing desktop systems.

To handle arbitrary relationships between persons (e.g., "cousin", "friend", "employer", "in-law") that don't have an obvious parent-child form, one needs the person objects to be able to refer to other person objects in a way that states the relationship. (Or in the case of normalized models, one needs at least a three column table holding 1) relationship; 2) person 1 id; and 3) person 2 id.) Even in the case of "first cousin", where we don't yet know the common grandparent, we wouldn't want to manufacture anonymous persons to cast the two persons into an actual pedigree structure; we would want to leave the relationship as a simple person to person link with a tag to name the relationship. GEDCOM has this facility, as Louis has pointed out, in the ASSO tag, which is not used all that much.
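
For reference, GEDCOM's ASSO mechanism for such a generic person-to-person relationship looks like this (names borrowed from the cousin example earlier on this page):

0 @I1@ INDI
1 NAME James /Wetmore/
1 ASSO @I2@
2 RELA cousin
0 @I2@ INDI
1 NAME Daniel /Wetmore/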

To handle roles in events, one needs the event and person objects to be able to link to each other (or at least in one direction) with role-based links. (Or in the case of normalized models, yada yada yada.) I don't think there is anything surprising or unusual here either. Not all models have events as actual objects; some force them down to one-role events (e.g., birth, death, residence) and keep the event info with the person. In systems that allow events to be objects unto themselves there has to be a mechanism to link persons to the events they are role players in, or vice versa. Some roles from events strongly imply biological relationships between the persons, and different models might deal with these situations differently.

Here's a good example of what this means in practice. Say a model has event records. Say you have a birth event that names the father, mother and child, and the midwife. The event thus has four role players: child, mother, father, midwife. In the DeadEnds model, for example, there would be an evidence event record and four persona records (if recording the midwife), and the event would refer to the four personas via role references. So the question: should the child persona also refer to the father and mother personas through additional, direct father and mother links, or is it okay to simply leave the records the way they are and let the software infer the parent-child relationships via the event? The DeadEnds model does not specify an answer to this question. Obviously the event to persona roles will be there, since they come directly from the evidence. I think it is up to the application or the user to add the inter-person relations if they choose. The model accommodates both. One wonders whether we could get into trouble with this somewhat redundant state of information.
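
A sketch of that four-role birth event in DeadEnds-style XML (the four persona records are omitted):

<event type="birth" id="e1">
  <date> ... </date>
  <place> ... </place>
  <role type="child" personid="p1"/>
  <role type="mother" personid="p2"/>
  <role type="father" personid="p3"/>
  <role type="midwife" personid="p4"/>
</event>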

How do you think models should handle these kinds of event to person roles that also imply person to person roles? I've actually been asking this exact question in different fora for fifteen years. Have never gotten anyone to answer!
ttwetmore 2011-11-10T14:20:35-08:00
"I'm finding the limitations of a wiki really fast. I had wanted this thread to be a catalog of benchmarks. Unfortunately, there seems to be no easy way in wikispaces to have hierarchies of threads or otherwise discriminate between main and derivative topics without creating separate wikis for things and putting backward and forward pointers (URLs) into them. Otherwise they become long masses of text as every aspect of the thread is treated as of an equivalent type."

All I can say is "welcome to the club." Nothing ever goes as one would expect. Many of the best discussions in this wiki are hidden in threads with titles that have nothing to do with that discussion, and I know I could never go back and find them!

This wiki is a great example of herding cats. We all wander all over the place. I've owned many cats in my long life. I haven't a clue how to fix this. Good luck!!
louiskessler 2011-11-10T14:32:00-08:00

Wesley: You are not using the Wiki correctly.

If you want to make a catalog of benchmarks, you should make it a page, not a discussion thread.

A page can be edited and modified by you or anyone else and contains a complete history of how it's been edited. Pages are the place you put the work that is being assembled.

Each page can then have its own discussion topics that get associated with the page. That's where you discuss what is on the page. The hope is that the page will summarize the totality of opinions that are validated by the discussions, and the page will link appropriately to the specific discussion post that makes each point. (After all, we're genealogists and know we need to source our information.) When a discussion brings up something new, it should be added to the page.

Unfortunately, that has not been done rigorously on the BetterGEDCOM wiki.

If someone has a number of hours available, they might want to take the points from the many discussions and ensure they exist on pages somewhere.

The pages can be better reorganized as well. Some old pages are unlinked and therefore non-discoverable.

Louis
WesleyJohnston 2011-11-10T14:48:10-08:00
Thanks, Louis, I basically use the Search feature to try to find anything specific. That's how I realized there was only one post that used the word "benchmark". I'll see about using a page to collect benchmarks.

Tom, unfortunately not all existing products do allow creating relationships out of the two cases that I posted. But you are right that I was posting these as benchmarks for BetterGEDCOM rather than for existing products.
GeneJ 2011-11-10T15:47:28-08:00
@Wesley,

You'll find there are cases posted throughout the various wiki discussions. When we were working on E&C, testuser created a page and specifically called for case examples. See his page, "BetterGEDCOM Test Suite." http://bettergedcom.wikispaces.com/BetterGEDCOM+test+suite

See his discussion here
http://bettergedcom.wikispaces.com/message/view/BetterGEDCOM+test+suite/32442114

"Please think about interesting problems that have come up during your family research.
Then just write the whole "case" down in plain text, either on this page or on a new sub-page (but please don't use the discussion pages). You don't have to give all the source data or the true names and places, just as much as is needed to understand the problem(s) and your reasoning on the way to a possible conclusion."

You might consider creating a new page for your catalog and linking it back to testuser's page. Then post a message at the bottom of this thread pointing readers to your new page.

Hope this helps.-GJ
ttwetmore 2011-11-10T17:02:44-08:00
"Tom, unfortunately not all existing products do allow creating relationships out of the two cases that I posted."

Yup, I know. The Better GEDCOM model must, however. I have put DeadEnds forward as a model that could be the basis for the Better GEDCOM model, so I was expressing how basic it is to deal with these relationship cases with DeadEnds. Louis and Emmanuel may make the same points about the models they promote.

I assumed that the purpose of the case studies was to solicit demonstrations on how different models would handle possible problematical cases. Thus my responses.

Thanks.
ttwetmore 2011-11-14T06:12:55-08:00
Relationships at the Conclusion Level
I believe that when extracting information from evidence, the information should by and large be put into clusters of eventa (evidence-level event) and persona records. There is a related question about other objects of genealogical significance, and how to extract them into our databases – objects such as the ship on which an ancestor arrived in a new country, or a battle or war or regiment in which an ancestor fought, and so on – but that's for another day.

I have argued how relationships between personas should be handled in the evidence level records: via eventa-to-persona role references in some cases, and via persona-to-persona relation references in others, depending on the nature of the evidence itself. At the evidence level in our database I believe we should express relationship information as it is found in the evidence. Up to this point we’ve had our historian’s hat on: we record exactly what we find about the persons we are interested in.

Now we put on our genealogist's hat and try to draw conclusions from our evidence. The goal for most genealogists is to determine the true ancestral relationships between parents and children, and the spousal relationships between fathers and mothers. Most genealogists organize this around the concept of family groups, each consisting of two parents and their biological offspring, a task admittedly fraught with difficulties.

As a quick aside I should say that the family group concept is not fully accepted anymore, some people believing it is too artificial a concept, or too Western-culture-centric, and that with good parent-child links we can do away with the concept. My response to this is that the family is one of the most important concepts in the minds of nearly all genealogists; it is an organizing concept that nearly all of them use; and I, for one, have no reason to apologize for my culture, and no reason to give up one of the key concepts in genealogical research in the interest of cultural neutrality. If New Guinea tribesmen or Samoan fishermen are not interested in who their fathers were, but only interested in who their father's sister's brothers were, fine, let them write their own user interface plug-ins.

Back to our hats. The core of evidence-based, or record-based, genealogy, as I see it, is to use the eventa and persona records we have gathered to build up conclusion records. (This is not all that different from how Louis would do it -- he has more inclusive evidence records that encompass [the much more useful] eventa and persona records.) In my mind the key conclusion record type is the person record, which is the record that gathers together all the evidence, in the form of persona records, that we believe applies to the same person. This gathering together is the key inferencing or concluding step that we do. We conclude we should gather those personas together for some reason. We record that reason as the “value” of the conclusion.
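
In sketch form, using GEDCOM-style syntax with invented tags (PERSONA here is nothing official, just a pointer from a conclusion person down to an evidence persona):

0 @CP1@ INDI          <= conclusion person
1 PERSONA @P1@        <= persona from the 1850 census
1 PERSONA @P2@        <= persona from the 1860 census
1 PERSONA @P3@        <= persona from a will
1 NOTE Same name and farm in all three records, ages
2 CONT consistent; concluded to be one person.        <= the "value" of the conclusion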

But there is a big hole remaining to be filled. In the conclusion world, how do we express relationships between persons? Isn't this the key question about relationships? In the evidence world we record what we find as directly as we can. If we find relationships only through eventa roles, we express them that way; if we find relationships stated as direct person-to-person relationships, we record them that way.

But at the conclusion level, we are free to express relationships in the way that we choose to best show them. Dare I mention that this could be done using the good old family record, and that actually this is a very good idea? Or it could be done by requiring each conclusion person to have one father link and one mother link. Or it could be done by having the conclusion-level event records (which would exist in some models) have conclusion-level role references (that is, they point to conclusion persons rather than personas). And to complete the picture, we could have persons refer directly to other persons with conclusion-level relation references.
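
Those alternatives in the same sketch style (tags invented again, except that FAM, HUSB, WIFE and CHIL are of course the good old GEDCOM family structure):

0 @CP2@ INDI       <= option 1: direct parent links on the conclusion person
1 FATH @CP3@
1 MOTH @CP4@

0 @F1@ FAM         <= option 2: a conclusion-level family record
1 HUSB @CP3@
1 WIFE @CP4@
1 CHIL @CP2@

0 @CE1@ EVEN       <= option 3: a conclusion-level event
1 TYPE Birth
1 ROLE @CP2@       <= role points at a conclusion person, not a persona
2 TYPE child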

In other words, at the conclusion level, we have a chance to simplify or unify or leave just as complicated the way we express relationships.

I don’t talk about this particular subject very much because I’m not sure of the best way to do it. I’ve put into the DeadEnds model the capability to handle it any way one might choose, including the possibility of family records.

I know some of you think the key problems left to be solved for Better GEDCOM lie in the area of citations, some in the area of research logs and todo lists, some in the nature of evidence-based records; but for me, the problem of expressing relationships at the conclusion level is the key remaining issue.
testuser42 2011-12-06T06:47:24-08:00
Interesting question... some thoughts:

I guess these options are not really different in their end results. Where do you see problems with going one way or the other?

My gut feeling says there should be relationship links from your Conclusion Persons to wherever you like (RelationRefs sounds good, to distinguish from PersonRefs, which build up one Conclusion Person).
These RelRefs would be the place to attach the reasoning behind the link.
The "RelRef" could be used at the Evidence Level, too, I think. No need for a new type of link.


Now playing devil's advocate:

Of course, normally we'd only add relationships between CPs if there's some evidence for that relationship. So -- do we need the RelRef between CPs at all? Aren't all relationships implied by an "Eventa"? (funny name, like a coffee machine brand...)

Using all of the CP's Personas' relationships to other Personas might automatically give a lot of good relationships. It would also quickly show how well proven a relationship is: if there's lots of Evidence that links the two real-life humans, there will be lots of Personas of CP1 that are linked to Personas of CP2.

But also, it is conceivable that one might want to express doubt about a relationship that's found in some evidence. Not every document is true. So even if an Eventa records a relationship for two Personas, if we conclude that this connection is wrong, we need to keep the ensuing CPs apart from each other.
What would be the smartest way to do it? A negative Relationship Reference? Or just don't imply anything from Personas' relationships, and only show a relationship if it's explicitly stated at the Conclusion Level.


Coming from this negative example, maybe there could be a case for the opposite, when you speculate about a relationship but can't prove it yet with evidence? Only then is a RelRef without a connection to Evidence really necessary.

I'd imagine that it's "cleaner" and maybe more efficient to process if relationships are explicitly recorded with the CPs, even if technically it might be possible to get all of the connections via the Evidence Level. It'd definitely be more human readable ;)
ttwetmore 2011-12-06T07:22:10-08:00
@Testuser

Thanks for your thoughts on this.

I guess these options are not really different in their end results. Where do you see problems with going one way or the other?
I think there are no problems other than some software complexity, and I was worried about how that complexity might be perceived. That is, with the double approach there are different ways that important relationships, e.g., parent/child, could be structured in the BG data.

My gut feeling says there should be relationship links from your Conclusion Persons to wherever you like (RelationRefs sounds good, to distinguish from PersonRefs, which build up one Conclusion Person).
These RelRefs would be the place to attach the reasoning behind the link.
The "RelRef" could be used at the Evidence Level, too, I think. No need for a new type of link.

I agree.

Of course, normally we'd only add relationships between CPs if there's some evidence for that relationship. So -- do we need the RelRef between CPs at all? Aren't all relationships implied by an "Eventa"? (funny name, like a coffee machine brand...)
A most important point. The EPs (evidence persons, or personas) already have their role links to EEs ("eventas"!) and relation links to other EPs. In my opinion these links do not have to be copied up to the conclusion level if the links in the EPs and EEs are sufficient. Wherever possible CPs and CEs (conclusion events) should simply INHERIT their relationships from the evidence layer. The only issue seems to occur when there are conflicts in the relationships at the evidence layer. I believe these conflicts should be resolved by adding links at the conclusion level; these conclusion-level links override the evidence-level relationships. These links should be accompanied by a conclusion (a special kind of source reference in DeadEnds) to explain how and why the researcher resolved the conflict.
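
A sketch of the override idea, with the same invented tags as before (the point is only where the pieces attach, not the tag names):

0 @CP1@ INDI
1 PERSONA @P1@     <= this persona's evidence names one man as father
1 PERSONA @P2@     <= this persona's evidence names another
1 FATH @CP7@       <= conclusion-level link resolving the conflict
2 NOTE The 1860 enumerator confused two households;
3 CONT the will names the father directly.           <= the attached conclusion

Where the evidence layer is consistent, @CP1@ would carry no FATH link at all and would simply inherit the relationship.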

Using all of the CP's Personas' relationships to other Personas might automatically give a lot of good relationships. It would also quickly show how well proven a relationship is: if there's lots of Evidence that links the two real-life humans, there will be lots of Personas of CP1 that are linked to Personas of CP2.
Exactly!

But also, it is conceivable that one might want to express doubt about a relationship that's found in some evidence. Not every document is true. So even if an Eventa records a relationship for two Personas, if we conclude that this connection is wrong, we need to keep the ensuing CPs apart from each other.
What would be the smartest way to do it? A negative Relationship Reference? Or just don't imply anything from Personas' relationships, and only show a relationship if it's explicitly stated at the Conclusion Level.

I have no glib answer. I have thought about this a lot, but don’t have any solution to suggest. Other than the “override” idea above.

Coming from this negative example, maybe there could be a case for the opposite, when you speculate about a relationship but can't prove it yet with evidence? Only then is a RelRef without a connection to Evidence really necessary.
An excellent suggestion, though some would object to creating a conclusion without evidence. But, it has also been suggested that conclusions and hypotheses are similar (not identical!) concepts in this context. If we call these “evidence-less” conclusion relationships hypotheses, we avoid the semantic tangle!

I'd imagine that it's "cleaner" and maybe more efficient to process if relationships are explicitly recorded with the CPs, even if technically it might be possible to get all of the connections via the Evidence Level. It'd definitely be more human readable ;)
This is the source of my comment about software complexity. I am already programming DeadEnds to allow the multiple ways to express important relationships. There is more complexity when relationships can be redundant, but it’s not rocket science or brain surgery. I don’t have concerns about the human readability aspects. When the software is displaying a CP, there is no need to indicate that the relationships being displayed come from particular EPs and EEs or particular sets of them. This info must be available when the user wants to “drill down” into a CP to see how it is constructed, or when he/she wants to add or alter the evidence. In other words, in an application’s UI, there are places where CPs should be shown as simple, integrated entities, and places where CPs should be shown as a complex of relationships with different sources and a set of hypotheses or overridden conclusions.

Personally, I believe the BG model should encompass this complexity of showing relationships and also the complexity of the multi-levels for evidence and conclusions. But it’s not a foregone conclusion BG will embrace this.
testuser42 2011-12-06T08:49:58-08:00
Tom, I think the "override" for relationships is great. It's simple, and consistent with the way PFACTs are normally inherited from the EPs and only overridden (or supplemented with additional conclusions) in CPs if really necessary.
testuser42 2011-12-19T14:38:52-08:00
GEDCOM X
Just to bring it to everyone's attention:
http://www.tamurajones.net/GEDCOMX.xhtml
Thanks for that find, Tamura! I'm curious what this will turn out to be!

Tamura describes how to get the working files and I downloaded a snapshot. I can't really tell what the model will be by looking at and in the files -- that's way over my head. But an entity called "Persona" does show up.

Well -- they'll probably publish their model when they are finished... Hope that it's going to be good :-)
Actually, there is a member here called "heatonra". If that is you, Ryan, can you tell us a bit about your model?
louiskessler 2012-01-25T19:14:31-08:00

GeneJ's Wikipedia definition is bang on:

"Evidence in its broadest sense includes everything that is used to determine or demonstrate the truth of an assertion"

The assertion (which I called a supposition, or, after it is "proven", the conclusion) is required before you can have evidence.

Mills definition is indirectly correct:

"information that is relevant to the problem"

The problem is "who is my grandfather". You gather data (information) from sources. Some is relevant to the problem and some is not. The relevant data will become the evidence.

But it makes no sense to have evidence for a problem. Evidence needs a hypothesis to support. The information is only data (possibly relevant data, but still just data) until it is put into the context of an assertion or supposition.

The evidence/conclusion technique is therefore to:

1. Gather the relevant information for your problem.

2. Go through the data and build up your hypothesis or conclusion.

3. Mark the data that supports (or disproves) your hypothesis as your evidence.

Louis
ttwetmore 2012-01-25T20:29:25-08:00
What Louis calls a source detail, is what I call a source reference, is what others call a citation, is what others call a citation reference.

Louis prefers to place the evidence inside the source details. As I have also indicated, this is one of the reasonable places to put it. I very much prefer to extract the information about persons out of that data, and put it into persona records, and put source references in the persona records that point to the sources and have the other citation elements in them.

Also as I indicated, we should evaluate the pros and cons of where the evidence goes as a function of what we want to do with the evidence. I want to be able to see the evidence organized in "person units" because this is the form that will be most useful when using genealogical software for true research. It is also the best form for implementing matching algorithms and other nominal linking algorithms, which includes things like family reconstruction from evidence and many other things that the current generation of genealogical software cannot do.

I am still waiting for Louis to understand this. I think it is just a matter of waiting for him to think about the requirements of genealogical software that is more advanced than he is used to thinking about.

As I have said, deciding how to store evidence is key to BG's future. I have explained the persona approach, and Louis has explained his source detail approach. There are other ways. I hope it doesn't take BG too long to realize the superiority of the persona approach.
louiskessler 2012-01-25T21:36:08-08:00

Tom still thinks the complexity of adding personas is the be-all and end-all and the best way to represent evidence, and that is the model he has chosen.

But he indicates his lack of understanding of my idea by his statement of "we should evaluate the pros and cons of where the evidence goes".

The source detail / source reference becomes the evidence once it is used in a supposition or conclusion. It is there already. There is no need to recreate it, or copy it, or attach it to anything else, especially a non-conclusion person.

All that is necessary is to document the reasoning for the assumption as notes attached to the person, family, fact or event - and add a link to the evidence (i.e. the source detail / reference)

Behold version 2 will include a very nice and simple, yet very powerful evidence/conclusion system. It will be accessible with either source-based or person-based data entry, and will work very well thank you.

Tom would do fine to implement his persona and data structures in his Deadends program. I hold no doubts that he will do so and will do a very good job of it.

But I really don't think his structure should be something that is added as a required standard - one that no genealogy program today uses and one that few developers, myself included, would want to be forced to support.

I say that the evidence is already in the data (as source details), and only needs a link from the assumption/conclusion to the source detail to indicate that the source detail is the evidence.

Louis
ttwetmore 2012-01-26T13:22:31-08:00
Louis,

You say two odd things in your last post. You first imply that the persona approach is more complex than putting evidence in source details -- they have the same complexity -- it's exactly the same data, just organized differently.

Then you make a fundamental error that has happened throughout history -- you say that because something hasn't been done a particular way in the past it must be wrong. First, it has been done that way for the past 40 years -- have you ever looked into "nominal record linking"? Second, it is being done that way now -- just look into New Family Search. And third, I am promoting a paradigm shift in genealogical software that will truly support records-based research -- we require a model change to support the next generation of features. You don't see the value of personas because you haven't let yourself take the paradigm leap in your mind about how to support research.

Tom
louiskessler 2012-01-26T18:48:23-08:00

Tom,

I'm not putting the evidence in the source details. I only have source details.

Once you make an assumption or a conclusion, the source details used to make it *magically* become the evidence. I think of the evidence as the "link" from the assumption/conclusion to the source details.

Doing it any other way is more complex because you have to create another structure to hold the "evidence".

And this concept leads to the paradigm leap that I'm going to bring about. I will allow users to build their assumptions and conclusions from source data. And I will give them tools that will make it possible to do so in a manner never done before. I just don't agree with you that the cards to move around are individuals. I believe the cards are the individual source details themselves.

So I'm going to implement this and prove that it works and that it works well.

Tom. I want you to do more than just talk about it. Do it! Write DeadEnds and show everyone, including me, that your personas work and work well and will revolutionize genealogy. Seriously!

Louis
ttwetmore 2012-01-27T00:44:01-08:00
Louis,

According to the examples you gave, the TEXT lines in the source details are the evidence.

Your model is a conclusion model. Your person records are conclusions. They point to the sources that provide their details. There is nothing new. If you can make a great user interface to help people add their source records, their source details, their persons, and the links between them all, you will have a conventional program that may compete well with other programs doing the same thing.

When you find new evidence you add the new sources and source details and then you find the right conclusion person to point to the source details. If you decide that the evidence applies to a wholly new person, you create a new person record to point to the evidence, otherwise you add to an existing person record by adding a link to the new source detail to it. You are forced to decide what real person each detail applies to as you add the data to the database. This is what I call a conclusion-based system.

What do you do when you have found twenty records that name persons with the same name, and you believe that those twenty records probably refer to three different persons, but you haven't yet been able to figure out how to partition the twenty records into those three persons? In your system you would have to treat the twenty records as twenty source details. As you added the details to your database, you would have to decide on the spot which person the record applies to. You'd be forced to either add each record as a detail to an existing person, or you would have to create a new person to point to the new source detail. You have to make your conclusions before you add details to your database. This is what I call a conclusion-based system.

I want a system that does more. I want to be able to record all the data that I find in a way that the software can support the full conclusion making process. If I have those twenty records before I have made any decisions about the real persons they represent, and I have twenty persona records in a database from those twenty records, then I have all my pre-conclusion data easily available. I can run matching algorithms to help me decide which personas are most likely to be the same real person. I can form hypotheses about which records refer to the same real person by structuring the personas into tentative persons. When I decide an hypothesis is incorrect, I can rearrange the personas into a different set of persons. At any given time my database contains all my evidence and it also contains the wholly fluid state of all my conclusions. I make all my conclusions from within the context of my program, not before I reach the level of using my program.
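
In sketch form (GEDCOM-style syntax, with the same invented tags as in my earlier posts), the twenty-record case can sit in the database like this while it is still unresolved:

0 @H1@ INDI        <= tentative person: hypothesis that these personas are one man
1 PERSONA @P1@
1 PERSONA @P2@
1 PERSONA @P3@
1 NOTE Hypothesis: the Springfield records all refer to the elder John.

0 @H2@ INDI        <= a second tentative person built from other personas
1 PERSONA @P4@
1 PERSONA @P5@

0 @P6@ INDI        <= a persona not yet assigned anywhere stays a top-level record
1 NAME John /Smith/

Rearranging a hypothesis is then just a matter of moving PERSONA pointers between @H1@ and @H2@; the evidence records themselves are never touched.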

The paradigm shift is the following -- in conventional systems, we collect our evidence, and when we decide which person new evidence applies to, we add that evidence, and the persons we conclude it applies to, to our database -- we don't add to the database until we have a real person in mind. In next generation systems we add our evidence to our databases without being required to make any conclusions about which real persons the evidence applies to, and then we have capabilities in our software to help us make those conclusions.

Before the paradigm shift we are "doing genealogy" either in our heads, or using scraps of paper with notes, and then adding to our databases once we have "done the genealogy."

After the paradigm shift we have all the evidence in the database ahead of time, and then we use our genealogical software as a full partner in the process of "doing genealogy." The software can suggest relationships between records. The software can attempt to find the patterns of families in the evidence before we have made final conclusions about the final persons, and these patterns can then be used to help us see through all the messy details into the world of conclusions that best matches the noisy evidence. The software is not doing anything that an experienced genealogist wouldn't do without computer support, but we are providing the genealogist with exactly the level of support that removes all of the tedium and all of the administrative problems of visualizing and organizing the data.
ACProctor 2012-01-27T02:19:18-08:00
Any new evidence will mostly be associated with a name, and so whether you file it under a Persona or a Person of that name is a "process" difference.

I guess a good example of where you don't have a name is the classic death notice where a "mourning grandson" is mentioned (i.e. someone exists) but no complete name is given.

Anyway, I don't want to fuel this as some Lilliputian "holy war" but there is a side issue I wanted to mention:-

Genealogists will always have different modus operandi - that's a given - so should BG support or stipulate a particular way of working? Who are we to decide whether genealogists should strictly adhere to, say, E&C philosophy or simply collect names & dates like some glorified "train spotters"?

There's a fundamental question here: is it possible to fully separate the structural aspects of our data from the process aspects? In other words, is it actually possible to store data without presumption about the process used to obtain it?

I personally believe it is possible if sufficient degrees of freedom are incorporated into the format but even then I feel it is not obvious - more an act of faith really.

Tony (...trying to be constructive today)
ttwetmore 2012-01-27T14:00:54-08:00
Any new evidence will mostly be associated with a name, and so whether you file it under a Persona or a Person of that name is a "process" difference.

I partly agree, though I might want to clarify the part of the process that is done in one’s head or on paper, versus the part of the process with software support. I don’t think there is much difference in the overall processes. We are all following the research process, which is: collect evidence and make conclusions. The differentiation in software and in models comes in how much of the process is supported in the software and model.

A person and a persona are the same from a record or object-oriented point of view. The practical difference is that a persona holds evidence and a person holds a “real person”. This is oversimplifying, but it gives the essence. Personas are not needed by software that only supports the recording of conclusions, which is what most genealogical software of today does. Personas are required by software in which the evidence itself is used as raw material for matching, searching, family recovery, and other combining algorithms. Software with these features supports more of the overall process.

Genealogists will always have different modus operandi - that's a given - so should BG support or stipulate a particular way of working? Who are we to decide whether genealogists should strictly adhere to, say, E&C philosphy or simply collect names & dates like some glorified "train spotters"?

There is not that much variability in the processes. More detail, less detail, yes. More care, less care, yes. But there is certainly not so much variability as to require adding great complexity to any model.

The conclusion-only model and process is simple. Adding research support is as simple as adding personas and evidence-level event records. And the ability to support personas is as simple as allowing person records to refer to other person records in a particular sub-person context. The DeadEnds model allows this but doesn’t require it. Thus the DeadEnds model is suitable for conclusion-based systems and for systems that want to add records/research-based features.

There's a fundamental question here: is it possible to fully separate the structural aspects of our data from the process aspects? In other words, is it actually possible to store data without presumption about the process used to obtain it?

Definitely possible. There is no magic or complications here. The research process is well understood. Many other processes for handling genealogical records appear in the “literature” (once again I implore anyone who desires to have a good background in this area to look into “nominal record linking”).

I personally believe it is possible if sufficient degrees of freedom are incorporated into the format but even then I feel it is not obvious - more an act of faith really.

No act of faith required.
louiskessler 2012-01-27T20:07:02-08:00

Tom,

I can't believe you've forgotten that our models are identical, except that you put the source details in personas. We've gone over this several times in the past.

My source detail record is there specifically so that I can add these source details (you call evidence) as independent entities. There is no need whatsoever to tie them to my conclusion people.

That was the basis of my big push a number of months ago: to get implemented source records and source detail records that contain "just the facts". Then a repository can make a database of all the source material that they have, so that people can search it for something relevant.

My source details then work exactly like your evidence, and I can similarly say, as you did, that: "we have all the source details in the database ahead of time, and then we use our genealogical software as a full partner in the process of 'doing genealogy.' The software can suggest relationships between source detail records. The software can attempt to find the patterns of people, facts, events, dates and places in the source details before we have made final conclusions about the final persons, and these patterns can then be used to help us see through all the messy details into the world of conclusions that best matches the noisy source details."

The only difference between our systems is that I put these source details into their own source detail record, which is the obvious place for them to go. You place them into personas.

And your method is more complicated because you'll have multiple personas for each source record. And it's a worse method, because in doing so, you lose the relationship between the various people in the source detail.

Here's an example of two source detail records:

1. Ship record. 1500 BC
Arrived in Bedrock
Fred Flintstone, husband
Wilma Flintstone, wife
Pebbles Flintstone, daughter
Dino, pet
Barney Rubble, father of Bamm Bamm
Betty Rubble, mother of Bamm Bamm
Bamm Bamm Rubble, son

This is another source record:

Marriage certificate:
Pebbles Flintstone to Bamm Bamm Rubble 1490 BC
Parents of Pebbles: Fred and Wilma
Parents of Bamm Bamm: Barney and Betty

So I have 2 source records to play with and move around.

You will have to create 6 (or 7 if you want to include Dino) personas for the first source record. And another 6 personas for the 2nd. You have 12 personas you have to move around.

You are repeating the same event information 6 times over. You are losing the connection between the people in the event that the source record maintains.

No conclusions have been made yet in either of our systems. We just have the facts.

So now we come to conclusion time. What I'll do now is give someone tools to sort through all the source details to find a "Pebbles Flintstone" or someone whose name is similar to that. It will come up with these two source records. Your system will come up with only 2 personas.

With my system, it's clear that this marriage connected the families. I can make my 6 conclusion people (and the pet as well!) I will also clearly see that these families must have known each other because they were on the ship together prior to the marriage. I can make a lot of assumptions then. Maybe they grew up together or came from the same place.

You will have only those 2 disconnected personas and will somehow have to piece back together the puzzle that you have carefully broken apart. You'll have to remake the connections of parent/child and husband/wife which you lost. Maybe you can record the links between them in your persona, but then you'll have dozens of links.

Even if you add the complicating linkage, you'll be hard-pressed to figure out that the two families might have known each other prior to the ship's crossing, which is instantaneously recognizable in the source record because their ship record had them next to each other. And you'll never discover the pet Dino if you didn't make a persona for him.

Sorry. But I see no advantages of personas over source records. Only disadvantages.

Louis
ttwetmore 2012-01-27T23:35:32-08:00
Louis,

The persona approach never repeats event information. It never loses connections between people. Have you read the DeadEnds model? Have you looked into nominal record linking?

Yes, I will have seven personas for the first example. And you will have seven TEXT lines hidden away in a source detail. With a good user interface I will enter those seven personas as fast as you will enter those seven TEXT lines. I will be able to use those seven personas for searching, matching, family recovery, etc., far, far better than you will be able to use those seven TEXT lines for the same purposes. As I have implied before, your approach requires a user to do all the genealogy "in their heads" by bouncing around inside your database, looking for TEXT lines that might match, with almost no software support to help. Whereas the persona approach is designed from the ground up to support the user in doing this level of research-based genealogy.

Why do you think there is an issue with "moving seven personas" around? It is the fact that you can move those seven personas around that gives the persona approach its great advantages.

So now we come to conclusion time. What I'll do now is give someone tools to sort through all the source details to find a "Pebbles Flintstone" or someone whose name is similar to that. It will come up with these two source records. Your system will come up with only 2 personas.

This is where the persona approach shines. First you have to create those tools, which have to find names and dates inside general TEXT lines. With personas no tools are required, since you just search for person records matching your criteria. And why do you think the persona system comes up with only 2 personas? Each of those personas is linked directly to its source record, so the two source records are right there too. What you call an advantage for your approach is a significant disadvantage.

With my system, it's clear that this marriage connected the families. I can make my 6 conclusion people (and the pet as well!) I will also clearly see that these families must have known each other because they were on the ship together prior to the marriage. I can make a lot of assumptions then. Maybe they grew up together or came from the same place.

The persona approach has the same source records and the same information immediately available.

I think it best to drop this. I am getting put off by either your actual or your feigned inability to understand the persona approach. If you knew enough about it that we could actually argue about its details it would be one thing. But you so misunderstand and misrepresent it that I have to spend all my time here trying to correct your misunderstandings rather than discussing its pros and cons vis a vis your or any other approach.
louiskessler 2012-01-28T07:38:56-08:00

Don't insult me, Tom.

Any search that can be done on fields in structures can be done just as well with an intelligent text-based search, as Google has well proved.

Besides, there's nothing stopping the same fields you've got in your DeadEnds personRecord or eventRecord from going into my Source Detail record. Then, as you even concluded a year ago in our Yogi Bear discussion when my current vision was being formed, our two methods are identical except for where the data goes.

Your persona method is simply wrong, because it dismembers the relationships in the source detail, and if presented as "people cards" instead of "source detail" cards, you force the programmer to write a super-intelligent system to try to get the user to somehow connect in their mind the people now disconnected.

I say no. We don't want to get the user to spend their time finding personas that seem to match. We want them to spend their time finding source details that seem to match.

Louis
ttwetmore 2012-01-28T18:18:18-08:00
Louis,

Thanks for your cogent reply. I have nothing more to add.

Tom
louiskessler 2012-01-21T08:35:30-08:00

Tom said: "I would say that trying to convince anyone creating a new standard to stick with GEDCOM syntax and not go with XML is an impossible task and I gave up long ago. Louis still sees it as possible."

I've never said we need to stick to GEDCOM syntax. Tom and I previously agreed that GEDCOM and XML can easily be mapped to each other, and translators can be made to convert one syntax to another. Although I'm not familiar with JSON and the other formats, I'm sure they'd be mappable as well.

I simply prefer to express things in GEDCOM syntax because, as Tom says, it's more readable, and because you can more easily relate the changes being proposed to the current 5.5.1 de-facto standard.

Louis
ttwetmore 2012-01-21T09:11:23-08:00
Louis,

I agree that examples from any new genealogical standard could be expressed in GEDCOM syntax, and for some this would make the new standards easier to understand.
ttwetmore 2012-01-21T09:50:25-08:00
Tony,

Maybe my "multiple instances" is not identical to Personas but there is definitely some overlap.

They are similar. I get the impression from what you’ve written, and from the STEMMA document, that you believe that the person records in a database should each represent a single person, and, that if you later discover that two person records are really about the same person you will join their records together. In your example of the two women I believe you only have the two records because you are not yet sure enough to either join them as the same person, or declare that they are two persons and leave them separate.

If this is how you think about your data, then your STEMMA model is what I call a conclusion model. This only means (to me) that the person records are conclusions, not that you can’t refer to the evidence that supports those conclusion persons. It’s just that each person is a conclusion about a person who lived. You want to get rid of duplicate persons as soon as you are able, to keep your database “clean.”

The DeadEnds model allows this same view to be taken, so it works with the same assumptions. However, DeadEnds also supports the idea of personas. A persona is a person record that only contains the information about a person that can be extracted from a single item of evidence. Thus, if I have data about an ancestor from three different censuses then I will have three different persona records for that ancestor in my database. And I have no desire to get rid of those persona records once I decide that they do in fact all refer to the same person. When I make that decision, I create a new person record, and have that person record refer to those three persona records as “sub-persons” in effect. I don’t even have to add other information to the conclusion level person, if all the facts I need about that person can be derived from his persona records (e.g., name, date of birth, etc).

The large issue behind all this is a very simple question: “How do you want to record your evidence in a genealogical standard and therefore a genealogical database?” Better GEDCOM’s inability to answer this question so far is IMHO the largest factor in its being stalled.

I answer this question by saying I want to record event records taken directly from the evidence, and persona records taken directly from the evidence, and I want those records to stay in my database forever. Others say this is overkill, that we only need to store the conclusion persons. Once you understand GeneJ’s ideas you will see that she wants to keep conclusion persons, and to stack up within the citation records all the details of the evidence she finds and all the conclusions she has made about them. Another reasonable approach, but I think of her citation records (better called reference notes) and your STEMMA narratives in somewhat the same light, as catchalls for very important information that would be better captured in more structured ways.

There is a difference between the two instances of the lady I mentioned - one of which is a definite relative and the other is possibly the same woman - but I haven't formalised this other than by the way they're linked in the data at the moment.

Both Evidence and Conclusion can be accumulated for the two 'instances' in this case because both actually existed. Hence, it may not be the same situation you're describing with Personas and conclusion-only Persons. How would DeadEnds deal with this case specifically, Tom?


In my case the woman you know of as a definite relative would be in the database as a conclusion person record, with the persona records “below her” that you have collected about her (e.g., birth records, census records, etc.) The second person would exist in the database as a standalone persona record. In a DeadEnds system a standalone persona record is treated the same as any other “top level” person record, so there is no operational penalty attached to this record. In various displays, only the “top” person in a tree of person records would be shown, as these “top” persons represent the current state of your conclusions about the persons in your database. However, drilling down is easy and searching through all person records (including the personas) is easy.
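
In sketch form (invented tags again):

0 @CP1@ INDI       <= the definite relative: a conclusion person
1 PERSONA @P1@     <= her birth record persona
1 PERSONA @P2@     <= her census persona

0 @P3@ INDI        <= the possibly-identical woman: a standalone persona
1 NAME Mary //

If you later conclude the two women are the same, @P3@ simply becomes a third PERSONA pointer under @CP1@; if not, it stays a top-level person in its own right.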
ACProctor 2012-01-25T04:14:20-08:00
After doing some more reading on the Persona concept, it seems the STEMMA equivalent is actually the EventRef and CitationRef elements inside a Person.

[See diagram at bottom of: http://www.parallaxview.co/familyhistorydata/home/document-structure/dataset-structure]

They don't have identifying names but they embrace the properties from the associated source and may be layered inside the Person.

There is no automatic combination of these layers in STEMMA, and details such as parents, DOB, place of birth, etc., have to be reasoned from the evidence in those layers. The net effect is that there is only one such set of details associated with the Person, although it can be linked back to the available evidence, and to the underlying sources.

...interesting comparison.

Tony
ttwetmore 2012-01-25T10:31:38-08:00
Tony,

After doing some more reading on the Persona concept, it seems the STEMMA equivalent is actually the EventRef and CitationRef elements inside a Person.

A question I have posed often is how do we want to record evidence in our genealogical data. I won't summarize all the ways again here, but this is the key question that must be answered before BG can have a solid philosophy behind its model.

Tony points out one of the "standard" answers to the question, that is, to attach the evidence to the citations or the citation references. This is the approach that extends the idea of a citation into a "reference note." What does this mean? It means that citations have more than citation elements. They also can have the evidence itself attached. And additionally, as advocated by some, the citations can also have the conclusions or other notes attached.

Personas, as separate records, only have value if those records are used for something meaningful that cannot be achieved as easily otherwise. I've pointed out the long history of "nominal record linking" that requires the persona concept, and have also pointed out how key the persona concept is to the New Family Search model. In my DeadEnds software, for what it's worth, which incorporates many ideas from the "nominal record linking" methodologies, the concept is front and center.

Nearly all the answers to the question "where do we store the evidence" actually do store it somewhere. The issue is to determine the best place. And to decide that, we need to answer the questions of what we want to do with the evidence, or what is the most effective way to use the evidence. In a real sense these are user requirements questions that we haven't gotten to. But this is where I come to the firm decision that the persona is the best place to put it, because the persona record becomes the key concept once a genealogist passes the easy parts of the job and reaches the frontiers of their ancestry and must work strictly in the records-based world. At that point, creating persona records from each real-world record allows the genealogist to display and effectively use the available evidence in the best manner, basically as a list of person-records with the bits of information that each brings to the table. This is the perfect way for a genealogist to gestalt on all available data and the most effective way to enable the conclusion making process. I won't bore you with more about this now, since I have bored you all about this many times before.

But I repeat, the most important topic BG should be addressing now is "where does the evidence go?".
louiskessler 2012-01-25T12:43:53-08:00

This is how I believe it must be done:

We have a source.

One item in the source is a source detail. The source detail contains the data - "just the facts". There are names, places, events, and all the material is in its raw form - uninterpreted.

The key thing is this:

The source detail becomes evidence when someone uses it to come to a conclusion.

A person, family, fact or event is a conclusion made on the basis of the information from one or more source details. Those source details are the evidence for the conclusion that is drawn.

Along with the conclusion, the person can note how they came to their conclusion from the evidence (i.e. from the source details).

e.g.

0 @I1@ INDI
1 BIRT
2 SOUR @SD100@
3 NOTE Census provides birthdate
3 NOTE Name spelled incorrectly in census
2 SOUR @SD200@
3 NOTE Birth certificate date same as census
3 QUAL 3 (any other subjective stuff also goes here - not with the source detail)

0 @S1@ SOUR
1 TITL Census

0 @S2@ SOUR
1 TITL Birth Certificate

0 @SD100@ SDET
1 SOUR @S1@
1 PAGE 42
1 TEXT Jon Smith b. Jan 1 1880
2 CONT Mary Smith b. Jan 2 1882

0 @SD200@ SDET
1 SOUR @S2@
1 TEXT John Smith b. Jan 1 1880
2 CONT in Boston, MA

Again, the Text must be the facts - all non-interpreted.

I see making Personas or Events based on the source details as an unnecessary extra step that just adds work and complication. One source detail can give rise to dozens of personas or events at multiple levels. Instead of sorting through n source details, now you are sorting through n x m personas or events and trying to keep them straight.

I will be implementing my evidence/conclusion model precisely in Behold in this simple way I describe. Combined with an option for source-based data entry, Behold will probably be the first genealogy program presenting a mechanism that will not only allow, but encourage, evidence/conclusion modelling.

There will be source data. There will be conclusions. There will be no personas and no event records floating around on their own to confuse the user.

See: http://www.beholdgenealogy.com/blog/?p=858

Not applicable here, but just for completion: a citation is simply a format to formally identify and reference a source detail.

Louis
hrworth 2012-01-25T13:14:02-08:00
@Louis,

Does this mean that IF I only have one piece of Evidence, taken from the Source that you mention, I HAVE made a conclusion?

I am trying to understand the relationship between a single piece of evidence and a conclusion. I have been trying, since BetterGEDCOM started, to understand this concept.

In my research, I have many pieces of evidence, but for the most part, not enough evidence to draw a conclusion. I may never have a conclusion. At best, "best guess so far, based on the evidence at hand".

I am not questioning the "evidence/conclusion" model. But asking how can ONE piece of Evidence be the conclusion OR does it take a series of pieces of evidence to draw that conclusion.

Just one step further: how do I "mark" a piece of evidence as Negative or conflicting?

The other thought is that I don't enter any evidence until it IS the conclusion.

Thank you,

Russ
louiskessler 2012-01-25T17:04:09-08:00

You're almost there, Russ.

"IF I only have one piece of Evidence, taken from the Source that you mention, that I HAVE made a conclusion?"

No. IF you use a Source to make a conclusion, then that Source becomes the piece of evidence, as a result of your conclusion.

"I am trying to understand the relationship between a single piece of evidence and a conclusion. I have been trying, since BetterGEDCOM started, to understand this concept."

The evidence is made up of the collection of source details used to formulate a conclusion.

Evidence does not exist without a conclusion.

A source detail is not considered evidence until it is used to formulate a conclusion.

"In my research, I have many pieces of evidence, but for the most part, not enough evidence to draw a conclusion. I may never have a conclusion. At best, "best guess so far, based on the evidence at hand"."

No. In your research, you have many source details, but you have not been able to formulate any conclusions yet from them. Without a conclusion, the source details are not evidence for anything.

If you make a best guess, then that is a conclusion. It is a conclusion with low surety, but it is a conclusion nonetheless. You would have used some source details to derive that best guess, and those details are the evidence leading you to your supposition.

"... how can ONE piece of Evidence be the conclusion OR does it take a series of pieces of evidence to draw that conclusion."

Aunt Jane told me her father's name was George. Source is Aunt Jane. Detail is what she told me. That becomes my evidence that her father's name is George.

Then I find a family bible. It also says George. Two sources, two pieces of evidence.

Then I find his birth certificate. It says Harry. Three sources now. Conclusion is his name on the birth certificate may be incorrect in this case and his name is George. Evidence: all three source details with highest surety assigned to his daughter and the family bible.

Supposition: That he was born Harry and changed his name or decided to use his middle name. Reason: It is unlikely the birth certificate was wrong, but it is also unlikely his daughter and the family bible were wrong. Evidence: all three source details.

So one conclusion can be drawn from one or more source details which then become the evidence for the conclusion.

Also, one source detail can be used in one or more conclusions and could be used as evidence multiple times.
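
Using the source detail (SDET) records from my earlier post, the George/Harry case might look like this -- still just a sketch of the idea, not a finished syntax:

0 @I1@ INDI
1 NAME George          <= the conclusion
2 SDET @SD1@           <= Aunt Jane's statement
2 SDET @SD2@           <= the family bible
2 SDET @SD3@           <= the birth certificate that says Harry
3 NOTE Certificate likely records a birth name he never used.  <= reasoning stays with the conclusion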

"how to a "mark" a piece of evidence as a Negative or conflicting."

Simply state so in the conclusion. As in the example above: two items say Harry, one says George.

... But one way or another you must come to a conclusion, even if the conclusion is: "His name may be either Harry or George".

Note that I am keeping all the personal analysis and reasoning to be placed in the conclusion, which is attached to the conclusion person, family, fact or event. This is NOT to be placed with the source reference, even though it becomes the evidence for a conclusion.

"I don't enter any evidence until it IS the conclusion."

No. You make a conclusion based on information in one or more sources. Once you make the conclusion, then you refer the conclusion to the sources and that reference basically states that sources are the evidence for this conclusion.

0 @I1@ INDI
1 BIRT
2 SDET @SD1@ <= This states that SD1 is the evidence for the Birth event
3 NOTE blahblah <= This is the reasoning for the conclusion based on the evidence

0 @SD1@ SDET
1 TEXT blahblah <= This is the source detail - just the facts - no interpretation - and is what is used as evidence.

Doesn't that make sense?

Louis
louiskessler 2012-01-25T17:11:17-08:00

One more example. A murder.

They find a gun, then find fingerprints, then blood on a letter.

You might consider these evidence, but they are not. They are only items until a supposition (hypothesis) is formed.

If you suppose Joe is the murderer, then the gun becomes evidence if it is Joe's gun. The fingerprints become evidence if they are Joe's fingerprints. The blood becomes evidence if it can be worked into the story.

But if it is Fred's gun, then it isn't evidence.

You might say Fred's gun is negative evidence. But what if they brought in ten guns? Are they all evidence? Some guns may have nothing to do with the murder at all. They are just guns, not evidence. They only become evidence when they can be used to help validate (or invalidate) some sort of conclusion.

Louis
hrworth 2012-01-25T17:41:55-08:00
Louis,

Thank you. I hope that I get to chat with you at RootsTech, so that you can help me.

You made two statements, in your two replies, that are at the bottom of why I don't get this.

1) Evidence does not exist without a conclusion.

2) But if it is Fred's gun, then it isn't evidence.

To me, #1 is backward. You can't have a Conclusion without Evidence. The same goes for your Gun Example.

IF Fred or Joe owns the gun, that's a piece of Evidence;

Fingerprints are on the Gun from either of them; that's another piece of Evidence;

So far, I have two pieces of evidence. I don't have enough information to draw any conclusion.

If either owns the gun, I would expect that the owners finger prints would be on the gun. But, that does not lead me to a conclusion.

If this was a court case, each would be entered as Evidence. The jury could not come to a conclusion based on the Evidence presented. Right?

Where is my understanding of Evidence and Conclusion wrong? OR can't I use your murder example as a way to look at Genealogy?

You said: "You're almost there, Russ." but what you posted I am not even close. There is something (piece of Evidence) that I haven't seen nor read on this Wiki, to get me close.

Thank you,

Russ
GeneJ 2012-01-25T17:53:47-08:00
Hi there you lucky folks headed to RootsTech 2012!

For its many faults, Wikipedia has some nice things to say at times.

"Evidence," _Wikipedia_ begins, "Evidence in its broadest sense includes everything that is used to determine or demonstrate the truth of an assertion. Giving or procuring evidence is the process of using those things that are either (a) presumed to be true, or (b) were themselves proven via evidence, to demonstrate an assertion's truth. Evidence is the currency by which one fulfills the burden of proof."
http://en.wikipedia.org/wiki/Evidence

Obviously more pointed to our cause, from Mills, _Evidence Explained_, 2007, Evidence is, "information that is relevant to the problem. Common forms used in historical analysis include best evidence (q.v.) direct evidence (q.v.), indirect evidence (q.v.), and negative evidence (q.v.). In a legal context, circumstantial evidence (q.v.) is also common."
http://bettergedcom.wikispaces.com/Supplemental+Glossary+from+_Evidence+Explained_%2C+2007
louiskessler 2012-01-25T19:03:20-08:00

Russ:

You're thinking of the conclusion as being required. It is not. All that is required is a supposition (or hypothesis), which, if "proven" by the evidence, then becomes a conclusion.

If you don't know enough to come to a conclusion, then you enter the supposition.

In the court they do this:

"The supposition is that Joe is the murderer".

Then the data comes and is evidence towards a conclusion that either:

1. Joe is the murderer,
2. Joe is not the murderer, or
3. Not enough info to determine, and more data that can be used as evidence is needed.

You write into your genealogy non-conclusions that are suppositions all the time.

e.g. the name of your great-grandfather is Bill, because you think you heard your grandmother say that once.

So in your genealogy, the name is Bill, even though the evidence is nowhere near enough to prove it. But this still is your conclusion.

Besides, how do you determine the dividing line between a supposition and a conclusion? Is one item of evidence enough? Maybe. Are two? Maybe. Are 100? Maybe. When can you say something is absolutely certain? Actually, very seldom.

So maybe you have 100 conclusions and 10,000 suppositions in your genealogy. I don't really care. I've been using the word "conclusion" to mean "conclusion person" or "conclusion event" or whatever. It is your personal conclusion as to what you think is most likely. It need not be proven. And in fact (as GeneJ recently related to me), the "Preponderance of Evidence" method for conclusions in genealogy is no longer current.

Louis
ttwetmore 2012-01-15T07:59:32-08:00
The positive face on GEDCOMX is that FS is working on a real standard.

There are negatives, but I feel constrained in what I can say. Obviously GEDCOMX is primarily intended to be an internal FS standard, to be used as a way for FS to record their massive data collection as they proceed on their projects to convert all their microfilm based records to indexed, text-based records. Viewed from this perspective, GEDCOMX is a syntax for recording "record-based" data in massive, searchable databases. [Actually, FS data is much more likely to be stored in custom, relational databases that are fairly unique to each major record source, and GEDCOMX will be an internal format that will hide the details of these custom relational databases, providing a cleaner interface between the raw databases and the searching and display subsystems -- this is all conjecture on my part however, not based on any special knowledge.] Since the FS websites will convert all that data to search results screens, any problems with the format will be dealt with by FS user interface engineers deciding how and what to extract from the data and how to display it.

The GEDCOMX engineers have asked a few outsiders for input, but frankly, the GEDCOMX project can be likened to a large momentum vehicle, with very set ideas stemming from previous work that we are generally not aware of, so the opportunities for outsiders to make any significant impact are almost nil. I believe there is no chance at all that Better GEDCOM could ever have any impact on GEDCOMX. FS has a poor track record on actually bringing the formats they are working on up to the level of publishable standards, so who really knows whether GEDCOMX will ever become a suggested external format for genealogical programs.

So whether GEDCOMX would be a practical new format for interchange of genealogical data between ordinary genealogical programs is, in my mind, an open issue. I believe the GEDCOMX designers believe it either should or could be, but I don't think they are really thinking about that issue very much, as I reiterate that GEDCOMX should be thought of as an internal FS format designed to solve specific internal FS problems.

If this leads to any feelings of deja vu, it's not surprising. The original GEDCOM format was also designed to solve a specific FS problem, that being how to easily gather from church members the results of their genealogical researches so that important religious rites could be performed on their ancestors. As such GEDCOM was limited to primarily the information that FS was interested in about persons, which was only the most basic of vital events. And this obviously also tells us why GEDCOM is a strictly conclusion-only format. And why GEDCOM is not source-centric. And why GEDCOM has been so problematic as a general purpose genealogical data format for applications that go beyond the needs of the FS to gather conclusion information about ancestors.

Do we live in interesting times? I'm really not that sure.
testuser42 2012-01-15T10:40:34-08:00
Thank you both for your insights and opinions on Gedcom X. I know it's far from finished and we can't see the overall construction yet, but I like some of the parts that I can see.
One thing I'm glad about is that there is no "Assertion" to be found, so there's hope that they don't copy the GenTech model. I also like the Persona, the Relationship, and ConfidenceLevel. It's good to see these things are thought about.
I don't know the legal aspects, but can FS make the results proprietary, when they are now using an Apache license?
Well -- we'll see what comes out of it.

Tony, the BetterGEDCOM test suite page is the one to use for the test cases. Let's go do some work :)
ACProctor 2012-01-15T11:16:42-08:00
Re: "test suite"

This is what I meant by a "test case" Klemens: http://bettergedcom.wikispaces.com/message/view/Better+GEDCOM+Requirements+Catalog/47575402#47675234

Tony
heatonra 2012-01-17T09:57:16-08:00
I thought I'd just throw out some comments, maybe it'll help curb some speculation and erroneous conjecture.

First of all, while there will be a variety of things said at RootsTech, the message "here's the standard--conform or be assimilated" will never be delivered, now or in the future. So when @ttwetmore says "the opportunities for outsiders to make any significant impact are almost nil", I couldn't disagree with him more. I'll just leave it at that.

I can understand Tom's frustrations, though, because he's kinda been out of the loop for a while now. (Although I also detect some sourness in Tom's comments that I can't quite explain.) Anyway, that's not because we've been deliberately ignoring anyone but instead because we haven't had the resources to engage at the appropriate level. Yet. That will change. The change will be gradual, but it will be significant.

Anyway, take that for what it's worth.

Also, @ACProctor, "simply using the RDF standard to represent your data is no more of a data model than representing it in XML". Indeed. Well said. Couldn't agree more.
ACProctor 2012-01-17T10:21:27-08:00
Thanks Ryan :-)

By "under pressure to push their [FS] solution", I didn't mean direct pressure from FS. I apologise for the ambiguity but I simply meant that the whole World will be looking at your next move and you could have another de facto standard simply because of who you are.

BetterGEDCOM is not trying to compete with anyone. However, I personally feel that its goal of a World-wide neutral common representation requires more kudos than we currently have.

Tony
ttwetmore 2012-01-17T10:42:07-08:00
Ryan,

Sorry, didn't intend to sound sour; hoped to just be realistic. If you can find a way to involve outsiders as FULL VOTING members of a team that makes all the TECHNICAL DECISIONS about the format of GEDCOMX I would applaud you highly and take back everything I said. Then GEDCOMX would be a true industry-based standard. Because of the structure of your organization, and because of your need to get approval from a large bureaucratic, non-technical administration (not intending any disrespect to the church), and from a large technical organization that already has strong technical needs and strong technical traditions in place, I can't see that happening. Which means, as I have experienced, that though you may be willing to listen to comments, you are unable to make any substantial changes based on them. In a situation like that, as kind and decent as you are, you will still be dictating a "standard" and you will continue to be open to the type of criticism that whirls around GEDCOM. Please dissuade me from these conclusions, as I don't like having them!

Tom Wetmore
heatonra 2012-01-18T07:48:17-08:00
@ttwetmore, your observations are fair enough.

Even if I did have authority to make statements about project governance (now and in the future), I'm not sure they'd do any good anyway since the reaction would understandably be "I'll believe it when I see it", given previous track records.

Anyway, I think your vision of project governance has significant virtue and I think it's worth pursuing. But I think even you would admit that it takes a lot of time, patience, money, and resources to set up that kind of structure. So indeed, don't get your hopes up for something like that for RootsTech 2012. But also don't discard your hopes to get there eventually.
louiskessler 2012-01-18T15:04:47-08:00

Ryan:

I'm the other developer (other than Tom) who regularly participates in BetterGEDCOM.

I think GEDCOM is a very successful standard that was very thoughtfully and well developed. It has proved its mettle over the past 20 years, doing its job of allowing basic data transfer between hundreds of different genealogy programs, desktop and online.

I believe the basic standard could use an update to add enhancements such as the separation of sources from conclusions, new requirements for online databases, addition of place records and source detail records and a few other things.

But I think these can be handled as enhancements to GEDCOM rather than a full rewrite. This would allow current developers to transition to a new model, whereas they may be unable or unwilling to support something structured differently with features their programs cannot handle.

I don't mind a single entity developing a new standard, such as you are doing. Such a project structure worked very well 20 years ago, and will work well again. I encourage you to take it forward, and I'm hoping GEDCOM X will be an evolution (rather than a revolution).

Going to public consensus would be wrong for you. Your team are the ones deeply involved and know all aspects of why you are making certain decisions. As long as you listen and take interest in other ideas (which it seems you are), then you'll make intelligent decisions as to what can be included and what should be left out.

I'll be at RootsTech, and very much look forward to your two presentations about GEDCOM X in the first two sessions.

Louis Kessler (developer of Behold)
ttwetmore 2012-01-21T02:43:49-08:00
I like the GEDCOM syntax more than I like XML syntax. When displayed in indented form, GEDCOM is much easier to read than XML, and I consider that a real advantage. I also consider the fact that XML requires twice as many characters to express the same thing as a significant disadvantage for XML. I understand that purists disagree with me on both these points, and I concede they have good arguments. I would say that trying to convince anyone creating a new standard to stick with GEDCOM syntax and not go with XML is an impossible task and I gave up long ago. Louis still sees it as possible.

GEDCOM may have been well developed, but it has not kept up with the times. Louis implies that GEDCOM has made it possible to share data. GEDCOM semantics is woefully inadequate at this job, and it is only the fact that there is nothing that does it better that would ever let someone say that GEDCOM allows successful sharing of data.

Getting GEDCOM semantics up to snuff would require major changes. Louis seems to think that evolving GEDCOM semantics into a new standard that continues to use GEDCOM syntax would somehow make it easier for vendors to track the changes and therefore provide good support. As if shifting to an XML structure would be a negative aspect of a new standard. XML is trivial to parse and process. (But of course, so is GEDCOM, and anyone who claims that ease of parsing, or even schema checking, is an advantage that XML has over GEDCOM is ignorant of the technical issues.)

A new standard would require much better handling of evidence and event data. A new standard would require much better support for relationships and places and the research process. The "record" concept, as manifested by the "persona" concept, is a MANDATORY requirement of any new genealogical model (it has to be added to STEMMA). But, of course, the TRIVIAL addition of a 1 INDI @xxx@ line to GEDCOM fully enables the concept (as would adding a personRef to a Person record in STEMMA).

Sure, you could take the DeadEnds model, which is IMHO currently the best model for genealogical data, and cast it into GEDCOM syntax, and it would be fine. You could reuse INDI and all kinds of other keys, and the result would be pretty much a superset of current GEDCOM semantics. This is so EASY to do that any argument about it seems pointless to me. If Louis wants to insist that he must import any data into a future version of Behold from a GEDCOM syntax, I would be happy to write him an ULTIMATE_GEDSTANDARD_3000 to GEDCOM translator.

The final syntax in which we express our genealogical semantic data is immaterial. To argue to keep GEDCOM because it has a good syntax for expressing its semantics is sophomoric at best. The semantics of genealogical data can be expressed in any hierarchical syntax -- XML, JSON, GEDCOM, Google protocol buffers, millions of custom syntaxes or ALL of them simultaneously.
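
To make that concrete, here is a toy sketch (Python; the tag names are made up for illustration, not a proposed BG vocabulary) that emits the same birth event in three of those syntaxes:

import json
import xml.etree.ElementTree as ET

# One hierarchical fact, independent of any serialization.
birth = {"birth": {"date": "18 December 1949", "place": "New London, Connecticut"}}

# JSON: the dict already is the hierarchy.
print(json.dumps(birth, indent=2))

# XML: build the same tree with ElementTree.
root = ET.Element("birth")
for tag, value in birth["birth"].items():
    ET.SubElement(root, tag).text = value
print(ET.tostring(root, encoding="unicode"))

# GEDCOM-style lines: level number plus tag, children one level deeper.
# (Real GEDCOM would use BIRT/DATE/PLAC under an INDI record, of course.)
def gedcom(node, level=0):
    lines = []
    for tag, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{level} {tag.upper()}")
            lines.extend(gedcom(value, level + 1))
        else:
            lines.append(f"{level} {tag.upper()} {value}")
    return lines

print("\n".join(gedcom(birth)))

The semantics are identical in all three outputs; only the wrapping differs.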

I hope Louis is sitting down when he takes his first careful look at GEDCOMX. It ain't no evolution, baby!!!

I agree with Louis that the FS have needs that go way beyond the needs of sharing data between genealogical vendors, and that FS requires internal standards that allow them to deal with the massive complexity of the data they are getting into indexable form. In my admittedly not fully informed opinion, GEDCOMX is the format that FS is evolving to meet these internal needs. I question the seemingly tacit assumption that the internal formats required by the FS would make a good standard for sharing data in the genealogical industry. I frankly do not believe that FS has the right mind set for setting data sharing standards for the genealogical industry. I hope I don't need to stress that this is in no way a criticism of GEDCOMX or the team putting GEDCOMX together -- they are doing an excellent job solving a difficult problem.

I firmly believe that the genealogical industry requires a standard that allows them to share data. This is clearly not GEDCOM. I also don't believe it will be GEDCOMX. There is still something for Better GEDCOM to do.
ACProctor 2012-01-21T05:40:39-08:00
I agree about GEDCOM evolution Tom. As well as being a political issue, with many people wanting to have a fresh non-proprietary start, it would also be held back by misconceived compatibility concerns. If we went that route then I believe we would simply be ignored and other initiatives would take our place.

Re: STEMMA and Personas, this is a subtle point. If a format needs all its Person entities to be "rooted" to their ancestors then you simply could not make any progress without distinct Personas of a Person [I hasten to add that my current knowledge of DeadEnds is weak because I can't find enough documentation]. However, STEMMA doesn't have to have everyone "joined up", if you know what I mean. Persons can be totally unrelated to anyone in your family, but still be depicted and referenced by narrative elements. There can also be separate sub-trees that are not connected to the main tree.

The reason I mention this is that I have several situations where I use this. I'm currently looking at a relative who the census returns say was born in Nottingham in 1809 and married a James Procter in 1831. However, there's no other record of her to be found before her marriage, and so I'm struggling to identify her parents and children - no relative of hers was a witness at her wedding. There is, however, a good match baptised in Leicestershire, about 40 miles away. Coincidentally, there is no subsequent death or marriage recorded for that lady after her baptism.

My records therefore have two instances of the lady, and the respective families (by marriage in the Nottingham case and by birth in the Leicestershire case) are fleshed out separately. There are also a couple of outliers in Nottingham who I believe are related, due to some very slim clues, but whom I cannot associate with either tree yet.

This all works fine in STEMMA and you may be saying 'but that is the essence of a Persona'. I cannot argue if that's the case but, again, it comes back to the point that any data model is more than just the sum of its syntactic elements. The philosophy behind each one is much harder to quantify without specific test cases expounded in each.

Tony
ttwetmore 2012-01-21T06:30:00-08:00
Tony, Thanks for responding:

If a format needs all its Person entities to be "rooted" to their ancestors then you simply could not make any progress without distinct Personas of a Person

Sorry, don’t understand that. By personas I mean person records based simply on evidence. In DeadEnds these personas can be built into trees (since each person record can have 0 to n references to “lower” level persons), with the interior node personas and the root persona (usually just called a person, though) representing conclusions that the researcher has made that the persons mentioned in different records are the same real person. There is no notion of biological ancestry at play here.

STEMMA doesn't have to have everyone "joined up", if you know what I mean. Persons can be totally unrelated to anyone in your family, but still be depicted and referenced by narrative elements. There can also be separate sub-trees that are not connected to the main tree.

There are things about your narrative concept that I think are great. The idea of putting in references to other entities to get their “titles” displayed is a great use of markup concepts. However, you seem to be using narratives for much more than that. For example, you put relationships between people, other than the parent/child relationships, into narratives. For me that is not the appropriate way to handle such an important concept. If you are also suggesting that STEMMA would place the references to the persona records that a conclusion person is based on inside a narrative in the conclusion person, I would also call that a mistake. It seems you are using the narrative as a general-purpose pointer nexus, intended to hold many important object-to-object relationships that will be difficult for programs to figure out, rather than as a nice way to mark up narrative material.

example...

In DeadEnds I would have two persona records for the woman/women in question, and each would be joined with the family derived with her. If and when I decided that they were the same person I would create another person record that refers to the two persona records, to hold the conclusion that the two personas are the same. Before that point in time I could have them connected to each other via a relationship that means “potentially the same person.” However, if your software has good searching features, you would be able to select one of the persons and ask for a list of all other persons that might be a potential “match” with that one.
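
A minimal sketch of that structure (Python; the field names are my invention for this post, not actual DeadEnds syntax):

from dataclasses import dataclass, field

@dataclass
class Person:
    """Evidence persona (no refs) or conclusion person (refs to lower personas)."""
    name: str
    source: str = ""                       # where this persona's evidence came from
    refs: list["Person"] = field(default_factory=list)

    def leaves(self):
        """All evidence personas beneath this record."""
        if not self.refs:
            return [self]
        return [leaf for ref in self.refs for leaf in ref.leaves()]

# Two evidence personas, possibly the same woman.
p1 = Person("Mary Procter", source="1841 census, Nottingham")
p2 = Person("Mary Proctor", source="baptism register, Leicestershire")

# The conclusion that they are the same person is itself just a person record.
# Undoing the conclusion later means deleting this node; the evidence survives.
mary = Person("Mary Procter", refs=[p1, p2])
print([p.source for p in mary.leaves()])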

When you make the decision that the two persons are the same in STEMMA, what would you do? Would you get rid of the two original person records and replace them with a combined record? I hope you can see the two major potential problems with that -- 1) you have gotten rid of the pure, evidence-based records; and 2) you have set yourself up with a real problem when you decide later that they aren’t the same person.
ACProctor 2012-01-21T06:51:43-08:00
It's hard to convey these ideas between us Tom without a whiteboard (real or virtual) :-)

Don't get me wrong - I'm interested in your approach but we're both trying to clarify the finer points of how we use our formats here.

Maybe my "multiple instances" is not identical to Personas but there is definitely some overlap.

The point I was making [badly] with the terms "rooted" and "joined up" is that not all the Persons in a STEMMA Dataset are necessarily connected to each other in the same tree (be it biological or otherwise), and that you might need to draw distinct trees in order to see them all together. More technically, they're potentially "disjoint".

There is a difference between the two instances of the lady I mentioned - one of which is a definite relative and the other is possibly the same woman - but I haven't formalised this other than by the way they're linked in the data at the moment.

Both Evidence and Conclusion can be accumulated for the two 'instances' in this case because both actually existed. Hence, it may not be the same situation you're describing with Personas and conclusion-only Persons. How would DeadEnds deal with this case specifically, Tom?

Tony
heatonra 2011-12-20T06:14:38-08:00
Hi.

Yep, this is me. Unfortunately, I can't say much at the moment. More to come soon.

However, I can say "thank you" to all the contributors of BetterGEDCOM. You've established some great space for some really good content, discussions, and ideas. Keep up the good work!
GeneJ 2011-12-20T06:51:45-08:00
Ryan

Thank you for your words of encouragement, I'm sure "back at you" is appropriate.

Here's to "soon." Welcome --GeneJ
WesleyJohnston 2011-12-21T17:47:08-08:00
Is there any date set when the RootsTech schedule will become a real schedule and not just a list of sessions with no times or places?
ACProctor 2012-01-14T04:25:33-08:00
{Quote from STEMMA+Model discussion}
Tony and Tom, have you had a look at the GEDCOM X work? You can take a peek at the work-in-progress here:

https://gedcom.ci.cloudbees.com/
You can download the complete snapshot as a zip file
https://gedcom.ci.cloudbees.com/job/gedcomx-snapshot/ws/
or look through the files online. There's already some documentation, e.g.
https://gedcom.ci.cloudbees.com/job/gedcomx-snapshot/ws/gedcomx-rs/target/enunciate/build/gedcomx/model/gx.html
https://gedcom.ci.cloudbees.com/job/gedcomx-snapshot/ws/gedcomx-conclusion/target/enunciate/build/gedcomx/model/index.html
Some good stuff in there.
I don't see the complete model yet, but if FS goes to implement this, IMHO it will be a huge step in the right direction.
Discussion of GEDCOM X should probably go here
{Quote}

The java project hasn't really started yet Klemens. There's not much in there to see yet.

I briefly looked at the documentation but couldn't get a feel for the structure of their schema. They've obviously committed to the RDF representation of information and so must be thinking ahead to the Semantic Web. I've always been a little skeptical of that though. Simply using the RDF standard to represent your data is no more of a "data model" than representing it in XML - loading up someone's data into a DOM doesn't mean you can do anything with it. There has to be more structure to it than those low-level mechanics, and it's that part I don't see in that documentation yet.
ttwetmore 2012-01-14T09:46:54-08:00
Yes, I have looked at GEDCOM-X. I can't say much because I am under a pseudo non-disclosure situation with them. I also know quite a bit about the internal precursor to GEDCOM-X, and have written some XSLT stylesheets that generate DeadEnds like records from that format. I have a kind of metric I try to apply to textual forms of information: a ratio of the number of characters in the format to the number of characters needed to simply express the core information. The GEDCOM-X ratio is very high. Of course every format based on XML has a relatively high ratio, but the GEDCOM-X ratio is much higher than that caused by XML. (The STEMMA ratio is pretty high also!)
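
For what it's worth, the metric is nothing fancier than this (Python; what counts as "core information" is of course a judgement call):

def verbosity_ratio(serialized: str, core: str) -> float:
    """Characters used by the format divided by characters of core information."""
    return len(serialized) / len(core)

xml = "<birth><date>18 Dec 1949</date><place>New London</place></birth>"
core = "birth|18 Dec 1949|New London"
print(round(verbosity_ratio(xml, core), 2))   # roughly 2.3 for this toy case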

The syntax for those two formats, not surprisingly, is XML. In my opinion the models they use make sense, but they are more complicated than they need to be, and I worry that this will affect acceptance once GEDCOM-X goes public.

You mention RDF. GEDCOM-X uses a number of namespaces and I'm not in favor of them. I opt for simplicity at every turn.
ACProctor 2012-01-14T09:55:46-08:00
Another valid metric Tom is the number of discrete element types, or rather syntactic sub-structures. From that POV, STEMMA is much smaller than, say, GedXML.

Expressing 'relationships' is hard to quantify with character-counts alone though. I took the approach that simplicity was in the number of concepts modelled in the data since the character-count is only relevant if you're creating it by hand.

What's your feeling regarding some sort of test-cases? Any suggestions for how we could define them?

Tony
ACProctor 2012-01-15T05:15:45-08:00
A big worry I have over GEDCOM-X is that it is yet another proprietary format. This is not really in FS's best interests and it will not prevent further fragmentation of the electronic data world.

If we're too slow here then we could feel under pressure to push their solution simply because of their size.

Unless we work together then it may end up like many computer standards - there's the accepted one, and then there's the M$oft one (or FS in our case).

Tony
ACProctor 2012-01-02T09:04:29-08:00
STEMMA
I've added a link to a draft of the independent work I was doing, as it may prove useful for comparing & contrasting approaches.

The STEMMA format has already tried to address source+citation, E&C, personal names, dates and date comparisons, time-dependent Place hierarchies, compound citations, and author annotation in citations, so it must be worth at least one visit :-)

Apologies if it's hard to navigate around. I transferred the bulk from a Word doc and tried to break it into pages.

Tony
ACProctor 2012-07-18T07:03:37-07:00
Re: Authorities, I did have a few thoughts on them Tom. In the Person and Places research notes, there's a section on 'Place Authorities'. I think it will happen but I don't yet see how because the industry is so uncoordinated, and this part is not even specific to genealogy. I just fear some consortium or organisation, somewhere, will just "go for it" without talking to us genealogists. I hope FHISO can be proactive here.

The idea of building up a database, and being able to export it in STEMMA format, is working now - although it's a very small amount of data at the moment.

Re: sources holding conclusions, this makes a lot of sense when you realise that a lot of cited material is, in fact, someone else's conclusions.

Re: attribution. I know we had a discussion of this on the wiki. In STEMMA, I merely want to allow an individual (together with contact details) to be cited as another source. From that point of view, it's easy to see it creeping towards the view of generalising sources and conclusions.

Tony
ttwetmore 2012-07-18T09:17:01-07:00
Tony,

Thanks. So an attribution is a human being in the role of a source. Could a source reference in STEMMA point to a Person object? I plan to allow that in DeadEnds.

For me all source and conclusion issues can be dealt with as a chain of Source objects (maybe it would be better to say a chain of objects that implement the Source protocol). Objects like repositories, persons, books, images, journals, articles, pages, registers, certificates, land records, military records, censuses as a whole or censuses as a family group, and tons of other things, can all be thought of as objects that implement a Source protocol.

In such a world, a footnote, or a research note, or a bibliographic entry, or a sidebar (that is, anything that anyone now calls a citation), is simply a way of walking up that chain of Source objects, extracting the info needed for the particular "citation type", and formatting that information for some external purpose.

Voila. One protocol and everything dealing with source, provenance, conclusion, attribution, is covered, and with a set of citation templates any external citation form is covered.
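
As a rough sketch of the idea (Python; the names here are invented, not a worked-out design), the "chain" is just each Source-like object pointing at the object it was derived from, and a citation is a walk up that chain:

from typing import Optional, Protocol

class Source(Protocol):
    """Anything citable: repository, book, page, image, census entry, person..."""
    label: str
    parent: Optional["Source"]

class SimpleSource:
    def __init__(self, label: str, parent: Optional["Source"] = None):
        self.label = label
        self.parent = parent

def cite(source: "Source") -> str:
    """Walk up the chain of Source objects and format the pieces.
    A real system would apply citation templates per "citation type"."""
    parts = []
    node: Optional["Source"] = source
    while node is not None:
        parts.append(node.label)
        node = node.parent
    return ", ".join(parts)

repo = SimpleSource("National Archives")
census = SimpleSource("1850 U.S. census", parent=repo)
entry = SimpleSource("dwelling 12, family 14", parent=census)
print(cite(entry))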
testuser42 2012-07-20T10:43:55-07:00
Hi all!

I've not had the time to read the new STEMMA version yet, so I can't comment on the changes.

But whenever a wish for a place authority comes up, I have to cough and point out the "GOV Genealogical Gazetteer":
http://bettergedcom.wikispaces.com/message/view/Better+GEDCOM+Requirements+Catalog/47039682

I really hope this kind of community project can be used for such an authority. The data is very good already, but not perfect and not complete. Well, no such thing will ever be complete... but the crowd-sourcing principle should be a very good match for such a project.

Ideally, a program would look up the name(s) that have been entered by the user and come back with a list of possible matches. If the user picks one, the program could either import the whole structure for that place, or just add the GOV-ID into its database for future uses.
BTW, this is one more complex place that was mentioned in Gedcom-L recently: http://gov.genealogy.net/item/show/HOLSENJO42JD
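
In pseudo-code, the flow I have in mind looks something like this (Python; lookup_gov() is a hypothetical stand-in, not GOV's real interface):

def lookup_gov(name):
    """Hypothetical gazetteer query returning candidate places with stable IDs.
    The canned ID below is the one from the GOV link above."""
    return [{"id": "HOLSENJO42JD", "name": name, "type": "parish"}]

def resolve_place(name, choose=lambda candidates: candidates[0]):
    candidates = lookup_gov(name)       # 1. look up what the user entered
    picked = choose(candidates)         # 2. user picks from the match list
    # 3. either import the whole place hierarchy now, or store just the
    #    stable ID and re-query it whenever the hierarchy is needed
    return {"display_name": name, "gov_id": picked["id"]}

print(resolve_place("Holsen"))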
GeneJ 2012-07-20T11:22:30-07:00
+ +. :-)
ttwetmore 2012-07-20T12:08:40-07:00
Has anyone thought of using the geonames data as a basis for a place authority? Their data is incredibly complete for the United States (I think they took it from a national file that is maintained within the U.S. govt), and geonames seems to have large files for much of the world. For the U.S. not only are populated places included, but so are buildings, schools, rivers, churches, post offices, lakes, mountains, cemeteries, and so on and so on and so on.

For my current project, processing natural language text (e.g., newspaper obits, newspaper marriage notices, etc) into GEDCOM and JSON output, I have cobbled together a place authority by building lists of all counties in all states, 40,000 cities, lots of Canadian locations, and it is all working surprisingly well. The software recognizes almost all mentions of places in the natural language text, but I'd like to take it all up a notch, to compare all the locations within a single obituary in order to determine their complete hierarchy. To do this I plan to build a U.S. and Canada place authority by pre-processing the geonames files into a full hierarchy. Any experiences with stuff like this out there?
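
In case anyone wants to experiment, the pre-processing step boils down to something like this (Python; I'm using only a handful of the real geonames columns, and the admin-code joins are simplified from what the full files need):

import csv

def load_admin1(path):
    """admin1CodesASCII.txt has lines like 'US.CT<TAB>Connecticut<TAB>...'."""
    names = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            names[row[0]] = row[1]
    return names

def place_hierarchy(places_path, admin1):
    """Yield 'city, state, country' strings from a geonames dump.
    Column positions per the geonames readme: 1=name, 6=feature class,
    8=country code, 10=admin1 code."""
    with open(places_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if row[6] != "P":               # populated places only
                continue
            state = admin1.get(row[8] + "." + row[10], "")
            yield ", ".join(part for part in (row[1], state, row[8]) if part)

# admin1 = load_admin1("admin1CodesASCII.txt")
# for line in place_hierarchy("US.txt", admin1):
#     print(line)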
AdrianB38 2012-07-20T12:47:52-07:00
As far as place authorities goes, I feel like the rebel who doesn't believe in ESM's citation formats... I acknowledge that place authorities can create lists according to certain criteria - they just aren't the criteria for what I want to use as place-names.

So far as I can see, the typical place authority lists states / provinces / territories / counties as the 1st 1 or 2 levels. Then they go off and find some other convenient hierarchies such as parishes in England & Wales (I can point you to a list of those) or the political bodies that were formally created in the 1800s or later for local government. Yes, fine. But that is NOT sufficient for my purposes. My home town of Crewe does not exist in any formal sense until the 1870s when Crewe Town Council is created. Before that, it's a group of houses within the township of Monks Coppenhall in the parish of Coppenhall. If the _authority_ is meaningful, Crewe doesn't exist in the 1840s thru 1860s. Clearly it does, as an informal name for a settlement.

If we have to allow informal names (as we must in the UK), what _is_ the point of an authority? I have no idea whether in other countries there is a self-discipline that ensures place names always contain a formally recognised place - but I need to point out this is not the case in the UK. Heck, we can't even agree on which counties we should use!!
GeneJ 2012-07-20T13:09:43-07:00
Hi Adrian,

A place authority, whether input originates via crowd sourcing or not, should be, err... authoritative, or it would not be useable - indeed, it could be destructive.

You wrote, "My home town of Crewe does not exist in any formal sense until the 1870s when Crewe Town Council is created. Before that, it's a group of houses within the township of Monks Coppenhall in the parish of Coppenhall. If the _authority_ is meaningful, Crewe doesn't exist in the 1840s thru 1860s. Clearly it does, as an informal name for a settlement. "

We've talked about this before a little. As with other cultural heritage markers (for lack of a better word), place names need to be put in their historical context.
ACProctor 2012-07-20T14:19:22-07:00
Re: "GOV Genealogical Gazetteer" (testuser42)

This comes up in my discussion of Place Authorities at: Place Authority

Re: "...but I need to point out this is not the case in the UK. Heck, we can't even agree on which counties we should use!!" (AdrianB38)

I discuss the counties of England & Wales at: Hierarchies. My preference here is to use the registration Counties (& registration Districts) to ensure alignment with the census of England & Wales. This is what my software currently does, but takes it down to the level of a street and household. The previous link for 'Place Authority' mentions a useful TNA resource that could have really helped complete this scheme for England & Wales.

Tony
testuser42 2012-07-20T14:38:19-07:00
Thanks Adrian, interesting perspective.
I think crowd sourcing could work better than relying on "official" data for these things. Still, I don't know how a quality check would work. The GOV is a project of a German "Verein" (association/club); I doubt they have the manpower to check every entry. I guess that's where the magic of crowd intelligence is needed. If someone sees an error, he'll correct it or report it.
FWIW, this is the only result when searching for Crewe on GOV: http://gov.genealogy.net/item/show/object_388814
So if you want, you could start an entry for the town of Crewe!

@Tom
Geonames is used by Open Street Map as well. They also have a thing called Nominatim ( http://nominatim.openstreetmap.org/ )

@all (me included) this is a STEMMA thread, I think we should try and keep things tidy...
testuser42 2012-07-20T14:45:04-07:00
Tony, thanks! Should have read before posting...
louiskessler 2012-07-25T10:26:25-07:00

Tony,

I started looking at your 14 July version of Stemma, but it's going to take me some time and thought to digest it all, both of which are in short supply for me at the current time.

Just a few things for now:

1. I don't mind, and even like, each of the top-level entities being hierarchical.

2. Is your "Citation" entity really supposed to be the formatting of the source detail information? If not, you should call the whole thing "Source" and make the source hierarchical. If so, then you're opening that can of worms.

3. I found today an open-source genealogy software program in beta called Stemma. That may cause you some headaches in the future: http://sourceforge.net/projects/stemma/

4. In the STEMMA document you state that as far as you know, no one has implemented GenTech's complex data model. There are 5 projects that have at least been attempting to do this. See: http://www.gensoftreviews.com/index.php?s=gentech

Tom: you should take a look at these 5 projects. Some specifically indicate that they are planning to implement GenTech's concept of Personas.

Louis
ACProctor 2012-07-27T11:55:24-07:00
Thanks Louis. I'm on vacation at the moment.

My Citation entity represents a cited source of information. It does not format anything itself and the format-string is merely a diagnostic aid, and a default when no proper formatting engine is connected.

I plan to add a new case-study to the Data Model document, when I get back, which demonstrates the generation of a Dublin Core OpenURL for a hierarchical source.

Thanks for the sourceforge link - that *is* a worry for me. I've never heard of that project.

Tony
NeilJohnParker 2012-01-02T09:16:06-08:00
Tony, when I enter www.parallaxview.co/FamilyHistoryData in my browser it takes me back to your covering page. Help
ACProctor 2012-01-02T09:25:07-08:00
I may have been editing at the time Neil. It's always a complete PITA transferring Word content to Google sites. I think they have more than a few bugs.

Can you check it? There should be no more blank pages in there (that's what I was just correcting)

Tony
NeilJohnParker 2012-01-02T09:40:09-08:00
Yes Tony, it's there. Perhaps I did not notice the Table of Contents in the upper left corner! Did not see anything in there on a brief glance about personal names, but perhaps I missed it. Also did not see how to print the document.
ACProctor 2012-01-02T09:48:10-08:00
Try looking under 'Document Structure' -> 'Person' Neil.

There's also a section under 'Place' that explains how the same scheme is used for Place names.

Tony
testuser42 2012-01-03T18:16:51-08:00
Hi Tony,
thanks for all that!
I've looked at all your pages and probably need more time to understand things.
Some questions:

The Narrative Structure seems to be the most powerful thing, but I'm not sure if it's too abstract, or asked to do too much...

Is using the "Inference" value the only way to separate Evidence and Conclusions? How do you record Evidence from a Source, so that it remains unchanged? Inference defaults to 0, so maybe all Persons and Events are "Personas" and "Eventas" by default? Do you still need to have a Narrative inside a Person then?

How do you build a Conclusion Person? I don't see a PersonRef link inside a Person or something like that. So would you use the Narrative Text again? How? Would there be only one Person record that has all the PFACTs inside, evidence and conclusion side by side, with a Narrative text and Inference for each PFACT to point out which is which? That could work, but it sounds more complicated than the trees of Person records in DeadEnds. Maybe I need to see a simple example.

I think the Citation record should be called Source, but it's not a big deal.

I like the simple splitting of names. It's true that in many cases, users won't know what a part of a name is called. Maybe it would still be useful to have a way to indicate the surname at least. The handling of optional name parts is something new, I like it!
ACProctor 2012-01-04T03:57:31-08:00
The Inference attribute is the one and only way to distinguish E & C. It can be permanently associated with Narrative in a Citation definition (or a Citation reference) and the discussion on 'compound citations' and 'author annotation' makes a case for doing this and then regurgitating it in a citation footnote/endnote.

Persons & Places can be associated with E or C type entities and the Narrative becomes all the more important then. Having found that a presumed Person is the same as someone else, or that a misspelled Place is the same as another one, then they can be merged.

I've tried to make the creation of custom 'properties' as simple as possible so that you can create your own when needed. Each of those can represent E or C, and have a value which is a PersonRef or PlaceRef, or has units of measurement, etc.

The handling of names deliberately doesn't mention forename/surname etc. Remember that this is a storage format. A software module loading up the data can make decisions about how to represent names, or even families for that matter. Neil's approach to handling names is not that different to mine. The biggest difference is that I have indicated how it should provide a "relaxed match", including the treatment of character case and diacritical marks, and the removal of certain punctuation before the matching process begins [I believe Neil suggests parsing the punctuation characters rather than simply treating them as token separators]. The other big difference is that I use the same scheme for Place names too.
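
By way of illustration only (a Python sketch of the general idea, not the algorithm as specified in STEMMA), a relaxed match along those lines might be:

import string
import unicodedata

def relaxed_key(name):
    """Case-folded, diacritic-free, punctuation-free token list for matching."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Treat punctuation as token separators rather than parsing it.
    table = str.maketrans({p: " " for p in string.punctuation})
    return tuple(stripped.translate(table).casefold().split())

print(relaxed_key("Müller-Smith, J."))   # ('muller', 'smith', 'j')
print(relaxed_key("MULLER SMITH J") == relaxed_key("Müller-Smith, J."))   # True

The same function would serve for Place names, which is the point of sharing the scheme.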

Remember that this is a first draft of my work. I am happy with the structural flexibility but there are still questions on best practice for utilising it. I was in the process of writing some software to prototype it and see the problems, pro's & con's, etc. before refining the specification. That's the point at which I found BetterGEDCOM - by accident. All I've done here is round-off my working notes and make them available for everyone to think over.

I cannot say this is the way to go at all but I believe the structure is sufficiently different to other lines of thinking that it's worth pondering over. In any format, the applicability to problems is more than the mere sum of the individual syntax parts. It's a sort of Holistic phenomenon and often distinguishes designs originating from one mind against designs from many minds (SQL springs to mind here since the modern standard is a real "dog's dinner") :-)

Tony

[My copy of Shown Mills finally arrived here, literally 5 min ago]
ttwetmore 2012-07-17T17:35:49-07:00
I read the two documents Tony provided about STEMMA. It's a great model with some great ideas. I've written up three pages of comments based on my first quick impressions. You can find the document at:

http://bartonstreet.com/deadends/STEMMAComments.pdf
ACProctor 2012-07-18T01:33:31-07:00
Thanks for the feedback Tom. I'm impressed that you actually read it. I know how difficult it is to read other people's stuff, especially when their vocabulary is different, and "there's a ton of it" to go through. I fully expected this stuff to just sit on the shelf.

Top-level objects
I believe in the "generalisation" approach too so I can see where you're coming from by generalising what I'd termed Resources and Citations. I'd already managed to generalise sources and citations, but then STEMMA hasn't been around as long.

Keys
There are definitely two sides to this approach, and it's hard to say one is 'obviously correct'. Using readable keys was motivated in part by the fact that I was crafting my XML by hand. Although I've now written software to help me, it's still "bits and pieces" of code rather than a "product". My keys are currently generated from the canonical name of the entity followed by removal of any punctuation. It works but I can think of refinements to add to the code.
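
In outline, the key generation is no more than this (Python; the refinements I have in mind, clash handling for one, are omitted):

import string

def make_key(canonical_name):
    """Entity key = canonical name with punctuation removed, then joined up."""
    cleaned = canonical_name.translate(str.maketrans("", "", string.punctuation))
    return cleaned.title().replace(" ", "")

print(make_key("St. Mary's Church, Nottingham"))   # StMarysChurchNottingham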

Datasets
I'm working on import/export of selected keys between Datasets, but for now they're just a 'container' for a self-contained set of records during transmission (as you say).

Relationships between persons
Good analysis! There is a statement about the use of Roles to infer other types of relationships, but it might be in the Research Notes rather than the main specification. I need to check.

Conclusion-base Emphasis
I did mention in the research notes at: Evidence and Conclusion (under 'Persona') that Personae have a strong parallel with STEMMA's Eventlet/EventRef concepts Tom. I don't believe they're as far apart as you suggest here. The relevant section in those notes mentions some pro's & con's for both approaches.

Emphasis on Hierarchical Representation via Nested Persons
If you mean the 'Lineage' section then that is only intended to dismiss that naive idea. In no way does STEMMA rely on it or use it Tom.

References Hold Non-intrinsic Values of Objects
I think I missed the fact that Dead Ends does it this way too. That's reassuring.

Places
STEMMA had to accept that a purely geographic hierarchy is not possible, and admits some 'administrative' levels (see Persons and Places, under 'Place Names'). However, it uses the combined geographical & administrative entities to create a single hierarchy type. That hierarchy is time-dependent, thus allowing for changes of borders etc. However, other potential hierarchical entities such as religious, judicial, etc., are treated as properties of the entities in the main hierarchy (e.g. Parish).

The Events in Places allow me to describe the history of a Place, or even a house. I admit that this might be unusual for genealogists, although not for historians. I do like the symmetry though :-)

Narrative and Rich-Text Notes
STEMMA's mark-up does include the XHTML tags em/strong, and br/p, and ol/ul/li Tom. These are mentioned near the end of the section on Narrative Structure. It says basically what you're saying here about "style-less mark-up".

Resources and Citation Versus Sources
Certainly room for discussion here. STEMMA Resources were generalised to include the usual attachments like scans, images, documents, etc., in addition to physical artefacts. Citations were also generalised to include sources and traditional citations, and I want to also include attribution in a later revision. I'll have a think about what you've done in Dead Ends.

Personal Names
This approach turned out to be a lot simpler than the approach of rigorously categorising the name tokens, and then implementing culturally-dependent rules on top of them Tom. I made a case for it somewhere on the wiki but it didn't generate much discussion. Supporting worldwide cultures was a STEMMA goal, but only in the interests of good design since my own data didn't really need it.

Sex
Point taken - it's evolved this way for a bunch of reasons that aren't as important now. I have to smile though because the use of a Boolean suggests a Y/N interpretation :-)

Tony
ttwetmore 2012-07-18T06:41:55-07:00
Tony,

Thanks. A few re-responses!:

Keys

My old program, LifeLines, is GEDCOM based, and the keys for INDI and FAM records are automatically generated in the way that most GEDCOM systems do it: I1, I2, I3, ..., F1, F2, F3, ... Because this convention is so prevalent, one of the trickier bits of importing GEDCOM files into GEDCOM-based databases is to resolve the "key clash" problems before or during import. LifeLines does this by reassigning keys when necessary during import. This is a tad tricky but can be done in a single pass.

GEDCOM also supports the REFN tag which allows users to assign unique and arbitrary keys to records. The types of keys you are using in STEMMA are perfect for use in REFN type tags. In LifeLines I index the REFN values just as names and regular keys are indexed, so searching by REFN value is just as fast. I use some extremely short REFN values for key (no pun intended) persons in my pedigree so I can snap to them instantly. The way to think of this is that the INDI and FAM keys "belong" to the software, and REFN keys "belong" to the user. You could employ the same idea in STEMMA.
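
The import step, in outline (Python pseudocode of the idea; LifeLines itself is C and the details differ):

import re

XREF = re.compile(r"@([A-Z]+\d+)@")

def remap_keys(incoming_lines, existing_keys):
    """Reassign clashing @I..@/@F..@ keys in one pass over an incoming file."""
    mapping = {}
    counters = {}                       # next number to try per prefix ("I", "F")

    def new_key(old):
        if old not in mapping:
            if old in existing_keys:    # clash: invent a fresh key
                prefix = re.match(r"[A-Z]+", old).group()
                n = counters.get(prefix, 0) + 1
                while f"{prefix}{n}" in existing_keys:
                    n += 1
                counters[prefix] = n
                mapping[old] = f"{prefix}{n}"
            else:
                mapping[old] = old
            existing_keys.add(mapping[old])
        return mapping[old]

    # Definitions and references go through the same map, so links stay intact.
    return [XREF.sub(lambda m: "@" + new_key(m.group(1)) + "@", line)
            for line in incoming_lines]

print(remap_keys(["0 @I1@ INDI", "1 FAMS @F1@"], {"I1"}))
# ['0 @I2@ INDI', '1 FAMS @F1@']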

Datasets

LifeLines has its own built-in programming system. There are operations in the language that allow users to create arbitrary sets of persons from the database. Set operations like union and intersection exist; operations for finding all spouses, all children, all ancestors, all descendants exist, so that users can create powerful and custom sets of persons with just a few lines of code. There is also an operator that generates a GEDCOM file from any subset of persons from the database. It's a bit subtle to make sure the export file is mathematically closed -- LifeLines supports either adding more persons to the set to "close" all FAMC, FAMS, HUSB, WIFE & CHIL links, or removing the links if the targets of those tags wouldn't otherwise be in the set.
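
The closure step, in sketch form (Python; "links" here stands in for the FAMC/FAMS/HUSB/WIFE/CHIL pointers, and the alternative of stripping out-of-set links is just the complement of this):

def close_set(seed, links):
    """Grow a record set until every link target is inside it.
    links maps a record key to the set of keys that record points at."""
    closed = set(seed)
    frontier = list(seed)
    while frontier:
        key = frontier.pop()
        for target in links.get(key, ()):
            if target not in closed:
                closed.add(target)
                frontier.append(target)
    return closed

links = {"I1": {"F1"}, "F1": {"I1", "I2"}, "I2": {"F1"}}
print(sorted(close_set({"I1"}, links)))   # ['F1', 'I1', 'I2']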

Relationships between persons
Conclusion-base Emphasis

I will reread these areas to get a better understanding of your points. Sorry for missing some of the implications.

Emphasis on Hierarchical Representation via Nested Persons

Yes, I meant that "Lineage" section. I didn't see that it had much point, since STEMMA doesn't implement it, nor does any other model I'm aware of. I suppose that a naive programmer just starting out might imagine doing it that way, but after an hour or so of trying to program things they would change their minds!

Places

I think that implementing a time-based geopolitical place multi-hierarchy is a MAJOR challenge. I don't scare easily when it comes to software design and tricky algorithm implementation, but this one scares the bejessis out of me. I sincerely hope you can come up with a model that captures it.

Question: How do you feel about third party place authorities that create these place hierarchies that we can just import and use? For the natural language project I'm working on now, I have built such a third party approach in the form of a number of specification files that list countries, counties, states, provinces, cities, villages, etc, with notation as to which is in which. What if someone created a place hierarchy in STEMMA format and you simply imported it as your place sub-database?

Having Events attached to Place records seems to run counter to the third-party place authority idea, but you could always just add Events to the "authorized" Place records!

Narrative and Rich-Text Notes

"STEMMA's mark-up does include the XHTML tags em/strong, and br/p, and ol/ul/li" I missed this (I guess I wasn't reading with the same concentration throughout). Your Narratitive idea is a real winner in my opinion.

Resources and Citation Versus Sources

Yes, room for discussion. I think so many things can be generalized into a single Source object type. I don't advertise this very often, since most people disagree, but a Source object can also be used to hold conclusions. This allows the source reference to be used in EVERY context where you want to justify a property -- justifying a fact from a physical source, or from a complex conclusion, are just different points on a spectrum in my opinion. Likewise I see it as perfectly reasonable to use a person as the target of a source reference.

On the concept of an attribution -- I simply don't understand what it is and how it differs from other source concepts. Some day I hope to figure it out.
DallanQ 2012-01-09T09:11:08-08:00
De Facto model and open-source parser
I'm not a regular contributor here, so I'm not sure where to put this.

I've just written an open-source GEDCOM parser, which parses GEDCOMs into a "de facto" object model, meaning that the information in most real-world GEDCOMs is represented in this model.

If you wanted to convert existing GEDCOMs into a new model, such as one of the other models proposed here, using this parser to parse them into the "de facto" model first, and then converting the model objects to the new model, would probably be easier than writing your own GEDCOM parser from scratch.
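
The parser itself is Java, but the two-stage idea is easy to show in miniature (Python; this is the generic shape, not my parser's actual API):

def parse_gedcom(lines):
    """Stage 1: raw GEDCOM lines -> a generic tree of (tag, value, children).
    Stage 2, not shown, walks this tree into whatever model you prefer.
    (Record ids like @I1@ land in the tag slot here; a real parser
    special-cases them.)"""
    root = {"tag": "ROOT", "value": "", "children": []}
    stack = [root]
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        level = int(parts[0])
        node = {"tag": parts[1],
                "value": parts[2] if len(parts) > 2 else "",
                "children": []}
        stack[level + 1:] = []              # pop back to this node's parent
        stack[level]["children"].append(node)
        stack.append(node)
    return root

tree = parse_gedcom(["0 @I1@ INDI", "1 NAME John /Smith/", "2 GIVN John"])
print(tree["children"][0]["children"][0]["value"])   # John /Smith/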

Feel free to remove this if you don't think it belongs here.
DallanQ 2012-01-09T09:12:50-08:00
weird - I don't seem to be able to edit my post.

I forgot to put in a link: https://github.com/DallanQ/Gedcom
GeneJ 2012-01-09T09:19:50-08:00
Hi Dallan,

Good to see your name appear on any topic; thank you especially for your contribution.

P.S. "...edit a post." Sigh. It's the nature of Wikispaces "pages" vs "discussions." Pages can be edited, but not discussions. Administrators can delete a discussion or a discussion posting. I think the pair of postings will do the trick, but me know if you'd like me to delete the pair of postings so that you can include all in a single post.
DallanQ 2012-01-09T09:33:32-08:00
You're welcome. No need to delete. The demo will be updated later this week. Ryan Knight and I will be giving a talk on this at RootsTech.
GeneJ 2012-01-09T09:34:24-08:00
Excellent! --GJ
ttwetmore 2012-02-03T06:27:28-08:00
Persona and Structured Records
In the RootsTech 2012 keynote Jay Verkler gave a vision for the future of genealogy. One part of that vision was the ability for organizations (archives, vendors, ...) to provide standard format data to users as the result of running various sophisticated queries.

The data returned by the queries are termed "structured records" which are then integrated into users’ databases. The persona record that we've been discussing in different fora is the most important of these structured records, as they will contain all the data about a person taken from a record that the queries match. (What I have called eventa records are also key structured records.)

I’ve been promoting the necessity of incorporating the persona concept into the Better GEDCOM model since my involvement began, for precisely the reasons outlined in the vision for the future of genealogy. I continue to hope that Better GEDCOM will embrace the paradigm shift into records-based and research-based genealogy that this future vision for genealogical technology requires.
ttwetmore 2012-02-05T06:22:08-08:00
I feel compelled to say that Jay never called any structured records persona records. However the structured records he referred to most were those that contain information exclusive to persons and taken directly from individual genealogical evidence records, which is, of course, the definition of persona records. For those concerned with sources and citations, Jay's examples also had source records being provided along with the persona records when transferred from the servers (search/data providers) to the clients (search users, which include desktop/laptop/smartphone genealogical applications, the primary target for Better GEDCOM).
ACProctor 2012-05-10T05:50:56-07:00
RDF
Has anyone tried using the Resource Description Framework (RDF) to describe structured content as an alternative to the usual XML?

RDF is often viewed as "sitting on top of" XML but it claims to have multiple serialisation formats.

There's an interesting discussion of this technology under the "Gedcom x" topic on soc.genealogy.computing.

I'm not yet convinced of its advantages but would be keen to hear other people's experiences with it.

Tony
ACProctor 2012-05-11T12:22:58-07:00
There's a halfway-house approach, though, Tom. RDFa can encode the triples in any XML schema using attributes.

The stuff I was reading was specific to XHTML but would apply to any XML.

I'm thinking about the simplicity of the representation. It feels (though I could be wrong) as if a regular XML representation would be easier to view/handle than RDF/XML (which is probably 50% of the reason for defining the N3 notation). RDFa sounds like it allows the RDF triples to be added as an "extra" for situations where the original XML schema alone is not enough (e.g. Semantic Web).

I guess it'd be nice to see an example. For instance, an RDF representation of the original GEDCOM schema.

Tony
ttwetmore 2012-05-11T13:42:16-07:00
Tony,

Thanks for the pointers to RDFa. I agree with your assessment.

N3 is a logic language. RDF is best thought of as the subset of N3 that expresses facts. Then N3 can be used to create logical equations that "process" the facts. So with genealogical data in N3 format, one could write N3 equations that would, say, find cousins, or compute ages, etc, etc. N3 is easier to read than RDF/XML, and there are some syntactic sugar extensions that can reduce the size of N3 triples considerably.

It is clear that there are many, many ways that BetterGEDCOM data could be serialized (I guess this is now the modern way of saying "write to a file"). I will always prefer a custom approach with the simplest possible representation (which also means easiest to read and requiring the fewest characters). The argument against this is the need to write custom parsers and the inability to use other vocabularies and namespaces. My response is that the parsers are one-day jobs for any skilled developer, and other vocabularies force square pegs into round holes. Just imagine trying to get the source junkies in this forum to agree to use Dublin Core.

I prefer things like JSON or MongoDB documents for external formats.

Tom
ttwetmore 2012-05-12T10:40:05-07:00
Tony,

Thinking more about the RDF.

For transferring genealogical data between BG compliant systems there is no need to supply the URI and namespace information for the predicates (the "tags"), no need to provide type information for the object values, and no need to supply URIs for the subjects, as long as each subject has a universally unique id. So there is no need to export the BG data in any RDF compliant format.

But if we do create some URIs for BG, including say "http://bettergedcom.org/predicates" for the tags, we could create a vocabulary that defines all the tags we decide upon, and we could define a mapping from the unique ids to unique URIs for each person (similarly for each top-level object), something like "http://bettergedcom.org/persons#23ac458ad8d87c65defa" (a minimum of 22 letters and numbers is needed to represent a 128-bit UUID). With these URIs ready to go, any software handling BG data would be capable of generating BG data in "semantic web ready" RDF/XML files.
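
(The 22-character id is just a base-62 rendering of the UUID's 128 bits; a quick Python illustration:)

import uuid

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def uuid_to_base62(u):
    """128 bits always fit in 22 base-62 digits (62**22 > 2**128)."""
    n = u.int
    digits = []
    for _ in range(22):
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

print("http://bettergedcom.org/persons#" + uuid_to_base62(uuid.uuid4()))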

I think this is the best option. Have a BG requirement for the capability to generate BG data to RDF/XML or RDF/N3. But certainly don't require that BG compliant programs use that as their export and archival format.

All we have to do is invent a few root URIs in our own domain, and then when we define our schema, create a qualified name for each concept. For example, say we have birth, death, date and place in the BG vocabulary. The BG RDF would have to define something like the following URIs:

"http://bettergedcom.org/predicates#birth
"http://bettergedcom.org/predicates#death
"http://bettergedcom.org/predicates#date
"http://bettergedcom.org/predicates#place

Then any RDF/XML file would have to include a name space declaration, say something like:

<bgedcom xmlns:bg="http://bettergedcom.org/predicates#">

and then the RDF/XML form of a birth event would be pretty simple:

<bg:birth>
  <bg:date>December 18, 1949</bg:date>
  <bg:place>New London, Connecticut</bg:place>
</bg:birth>

This is not bad as XML goes.

Tom Wetmore
ACProctor 2012-05-14T01:59:26-07:00
I totally agree with your first paragraph Tom. RDF is irrelevant in certain contexts.

However, your suggestion for meeting RDF half-way is a clever one. I need to read it through a few times and experiment a little.

RDF is clearly an important technology, but I would be as concerned about a new reference model "pinning its hat" totally on it as about a new model totally ignoring it.

Tony
ras52 2012-05-16T04:44:41-07:00
Sorry for picking up this thread a bit late. I think we may be in danger of trying to answer the question "should we use RDF?" without first specifying how we are proposing to use it. There are several ways we might consider using RDF in implementing a better Gedcom. We might find RDF is a useful way of conceptualising the data model, and we might go further and define the data model using RDF terminology. Even if we don't primarily think of the data model in RDF terms, we might choose to define a standard mapping from our data model to RDF.

Once we have a means of expressing our data model in RDF, we get access to the various RDF formats such as RDF/XML, RDFa and the various N3 dialects. We may choose to make one of these the (or a) standard exchange format. But even if we don't choose to use an RDF format as our exchange format and instead use a custom XML format, we might still wish to use a technology such as GRDDL to allow software that does not understand the better Gedcom exchange format to automatically translate the better Gedcom format into RDF. This can be entirely transparent to the average user by putting the necessary GRDDL markup into the XML Schema document, much as XHTML does.

My view is that the exchange format is of secondary importance to the key question of whether a standard RDF mapping exists. A standard way of mapping the data model to RDF could be a very valuable part of a better Gedcom specification as it will allow users (and perhaps more relevantly the implementers of programs and websites) access to general-purpose RDF tools and resources. If one of the standard RDF formats then happens to match our requirements for the better Gedcom exchange format, we can use it; if it doesn't, we don't.

Let me comment on a few of the specific technical points raised here. Tom Wetmore suggests not piggybacking on existing RDF vocabularies so that we can put everything into a single namespace. The normal RDF way of doing things is to have separate namespaces for different subsystems, so that they can be reused independently, and in my experience that works fairly well so long as the reuse isn't excessive. To take an example, a number of vocabularies have predicates to express the fact that person P was born on date D. Should we reuse one of them? Probably not. But if an existing vocabulary implements a rich and well-thought-out model for lifetime events that is compatible with our needs, then really we should use it. In this particular example, such a vocabulary does exist -- it's the 'bio' vocabulary <http://vocab.org/bio/0.1/.html>, and its primary author, Ian Davis, frequently writes on the use of RDF in genealogy, so we can perhaps assume that its similarities to certain aspects of the GEDCOM event model are not accidental.

If we're concerned that use of multiple RDF vocabularies will result in multiple XML namespaces being present in our eventual exchange format, there are plenty of techniques we can use to either completely avoid it or to remove the worst cases of it. Similarly, there are plenty of techniques we can use to hide or remove URIs (and/or replace them with UUIDs) from the data model and exchange format. GRDDL is the most versatile tool we currently have for this. Yes, it (typically) involves XSLT, and I largely agree with Tony Proctor's opinion of XSLT, but it's one small piece of XSLT, written once during the specification process, and that users never need to be aware of. It exists solely to translate a custom XML format into RDF/XML for the benefit of general-purpose RDF tools. But there are other, simpler techniques besides GRDDL. For example, we can use an OWL ontology to define synonyms in the better Gedcom namespace for terms in other namespaces.
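
As a rough sketch of that last technique (the bg: terms are invented for illustration; only the owl: and bio: names are existing vocabulary):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix bg:  <http://bettergedcom.org/predicates#> .
@prefix bio: <http://purl.org/vocab/bio/0.1/> .

bg:Birth owl:equivalentClass bio:Birth .
bg:date  owl:equivalentProperty bio:date .

An OWL-aware processor can then treat data written with the bg: terms as interchangeable with the corresponding bio: terms.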

Richard Smith
ttwetmore 2012-05-16T11:44:07-07:00
Richard,

Thanks for the great input.

It’s never too early to ask questions, but I agree it can sometimes be too early to answer them. Still, I don’t believe we should define the BG data model with RDF, nor should RDF be the BG exchange format (and I don’t necessarily believe that ANY form of XML should be the exchange format). However, neither of these issues is important enough to go to the mat over.

The BG model will have a schema which will be sufficient to define an RDF export format that can be used when the advantages of RDF are needed.

I appreciate all your technical insights.

Tom Wetmore
ras52 2012-05-16T16:02:31-07:00
I can see three reasons, admittedly none of which are hugely compelling, as to why it might be worth considering defining the data model with RDF.

First, I can see advantages to having a formal machine-readable definition of the data model. A formal grammar, whether in BNF, or as an XML Schema, or whatever, is fine as a formal description of the exchange format, but it's heavily oriented around a single format. That's not so true of RDF: yes, an RDF Schema can be thought of as a grammar for a specific vocabulary expressed in RDF/XML or N3; but really an RDF Schema is a description of a data model. I expect there are other good ways of formally describing data models in a machine-readable way, but if there are, I'm not familiar with them. UML diagrams are extremely good for displaying a data model to humans, but the text representations of UML I've come across are either very domain-specific (e.g. Corba IDL) or seem to me poorly supported (e.g. HUTN or XMI). In any case, this isn't specifically an argument for RDF: rather an argument for something like RDF.
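
To give a flavour of this (the bg: terms are invented for illustration, not proposals), an RDF Schema fragment describing a corner of a data model might read:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bg:   <http://bettergedcom.org/predicates#> .

bg:Person a rdfs:Class .
bg:birthDate a rdf:Property ;
    rdfs:domain bg:Person ;
    rdfs:range rdfs:Literal .

Nothing in that fragment commits us to RDF/XML or N3 as an exchange format; it simply records, in a machine-readable way, that persons exist and may have a birth date.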

Secondly, it seems important to me that our data model should be very extensible. Different cultures have different naming conventions, different calendars, different significant life events. The location hierarchy is different in different places. Some cultures involve true polygamous relationships -- that is, relationships involving more than two people, as opposed to concurrent relationships each with two. Clearly we cannot expect to include everything in our data model. We might hope to include enough to allow these specific points to be extended. But sooner or later someone will discover a situation we've just not catered for -- maybe we've assumed marriage-like relationships include just two partners, for example -- and the need arises to extend the data model in a way that wasn't initially foreseen. An RDF data model allows arbitrary third-party extensions without risking them being incompatible, and provides a lot of assistance to applications that might want to manipulate unknown extensions.

Thirdly, and continuing with the idea of extensibility, I can imagine getting to a stage with better Gedcom where we say: yes, it would be nice to have X, Y and Z, but we cannot afford the time needed to specify them competently. Digital signing of individual data fragments, audit trails suitable for collaborative work, and version history all seem likely to end up in that bucket. However, RDF vocabularies already exist for a lot of these things, and we save ourselves a lot of work if we can simply say that you can / should / must use them.

I shan't comment here on RDF as the exchange format, other than to say that I would suggest mocking up some examples to see how it compared to other suggestions. I am, however, interested by your comment that you "don't necessarily believe that any form of XML should be the exchange format". Do you have any reasons for that besides the fairly serious scalability problems in parsing large XML files?

Richard Smith
ttwetmore 2012-05-17T01:19:19-07:00
Richard,

I understand your three reasons why RDF could be used to define the BG schema, and agree they are not compelling. If the decision were to go that way, however, I would support it. There are any number of ways to formally and informally express a model. RDF is rather a Johnny-come-lately.

I don't like XML. I come from an older world. One of the things I did often in yesteryear was design and implement "little languages" for custom applications, what used to be called application generators. The design of these languages was a creative exercise in which there was an opportunity to design a grammar and a language that was a perfect fit for a custom application -- natural yet terse, easy and obvious to read, write and use for practitioners in the area, and so forth. Genealogy is one of those applications that cries out for a custom language. XML with its one-size-fits-all mentality has nearly destroyed this area of computer engineering, though if you look a little below the covers you find a disgruntled backlash.

The argument that with XML you get all kinds of parsing support, validation support and tool support is hype. Anyone with any experience with language processing can have a parser and a high-quality validator for a custom and quite complex language up and running in a day or two.

As a designer of well-engineered languages built for specific purposes, I find XML ugly. Too many characters and conventions that obfuscate the contents and make the files hard to read. Too many little pointy things. I know the argument that the way things look is unimportant for information intended to be read and written by computers, but I don't buy it. And though in this day and age we don't worry about file sizes that are many times larger than they need to be, to a developer from the era (I wrote my first program in 1966) when size and efficiency mattered, XML is a cringe fest.

That being said, I know I live in an era where XML has become a religion, and that my whiny voice has no chance of being heard. I know that the exchange format for BG will be XML, maybe even RDF/XML, and I will put up nothing but token resistance. If I were in charge, the format would be something like JSON, MongoDB, Google protocol buffers, N3 for data, or, God-forbid, GEDCOM.

I intend this to be taken in a light-hearted manner. I don't think any of these issues are important enough to derail work on the BG model. I would be happy to go in any direction that a consensus of views leads. I have accepted that the BG format will be XML and I am happy with it.

Tom Wetmore
ACProctor 2012-05-17T01:54:36-07:00
Re: "Secondly, it seems important to me that our data model should be very extensible. .... (snip) ... Clearly we cannot expect to include everything in our data model."

I'm afraid I have to make a stand on this issue Richard.

I firmly believe that it *is* possible to devise a data model that is not limited by our cultural presumptions. All of the issues you mentioned can be handled cleanly. It does, however, need the designers to take a "step back" to see the bigger picture (and not be focused on things like "GEDCOM does....."), and thoroughly research the variations out there in the real world.

Tony
ras52 2012-05-28T21:46:57-07:00
Sorry for the delay in responding to some of the points raised here by Tom and by Tony.

With regard to Tom's general comments on XML, I agree with almost all of them. XML is ugly, verbose, unnecessarily hard to parse, and comes with lots of baggage. Some of that can be explained in the historical context of XML having evolved as a light(er)-weight subset of SGML, but I'm the first to agree that for a lot of applications, XML is a poor choice and a tiny custom-designed language is better.

However, I don't think this is one of those places. Genealogical data is complex and diverse enough that if we attempted to design a tiny language to express it, I think we'd rapidly end up getting quite complex too: not as much as XML, but still enough that I think parsing the format would become a barrier to acceptance. The JSON-like formats are rather nice in their way, but there are so many dialects with subtle differences: Google protocol buffers aren't quite JSON; MongoDB's BSON is a binary format and doesn't quite map to JSON because of its type system. In any case, I don't think parser libraries are as easily come by for these as for XML. Having said all that, part of the reason I like RDF is that I can hide from the XML syntax. I can read and write it in N3, and use a program like rapper (from librdf.org) to convert between different formats.

Tony, I suspect we're not disagreeing significantly about it being possible to design a culture-independent data model: I suspect when I talk about extensibility, that's part of what you're labelling the data model. For example, I doubt either of us are suggesting that the Better Gedcom specification should have a supposedly-complete list of all possible event types (to use GEDCOM terminology). We could never be sure that the list had everything that genealogists might usefully want to record. So presumably our data model will allow us to define new event types. In my terminology, that's extensibility, though another way of looking at it is that the data model simply includes a definitional level as well as the data level.

A point where I suspect we may not agree is whether we can get it all right on our first attempt. My guess is that we won't entirely. However hard we try not to, my guess is that we'll inadvertently introduce an implicit cultural or technological assumption. To take a different example, it would seem reasonable enough to assume that everyone who really existed was born on some specific date, even if we don't actually know that date, or only know the date in a form that's capable of multiple interpretations. I can imagine that in a decade or two, biotechnology may have advanced to a point where a single date of birth no longer has meaning, and maybe it would become conventional in those rare circumstances to use a date range. Far fetched? Maybe, but the future often is.

However hard we try, I suspect we'll end up making an assumption that in some obscure circumstance turns out to be untrue. A good way of safeguarding against that is to have a format that allows users to ignore part of it and substitute their own alternative (extension, if you will) in its place. They may lose full interoperability with existing systems, but if the data model can cope with the idea that it might be extended (whether by us in a subsequent version, by a third-party vendor, or by a user in their own way), and the exchange format makes it clear whether something is an extension, a processor can at least know which bits to ignore and still understand the remainder of the data.

Richard
ACProctor 2012-05-29T02:42:35-07:00
Thanks Richard.

Re: "...culture-independent data model"

My view was not so much that we could include everything necessary for every culture, but that we had a solid foundational structure (my "data model") with enough degrees of freedom to accommodate cultural variations.

I certainly won't pretend that I know every cultural difference, even though I did a broad range of research for STEMMA, and I've been involved in software globalisation and the writing of locale systems for decades. However, the knowledge I do have helps in finding a generalised model.

Your reference to event types is an easy one because it's a form of extensibility that doesn't really impact the structure of the data model, and that level of extensibility can be accommodated beforehand without a detailed enumeration of all the required types.

(apologies to Louis for taking yet another thread OT)

Tony
ttwetmore 2012-05-29T07:52:14-07:00
Richard,

I am so glad you have joined these discussions! We don't agree on everything, but you bring an experienced perspective that you explain clearly and convincingly.

Tom
ttwetmore 2012-05-10T15:36:05-07:00
Tony,

To answer your direct question, I haven't tried it yet. However, I am working on a contract job now where I am processing the text of obituaries and other announcements, and extracting genealogical information from them. The first part of this is basic natural language processing, where entities like names, dates, places, relations, schools, companies, etc., are recognized as individual and isolated entities. The second part is the more complex part where relationships between the simpler entities are recognized leading to the creation of persons, events, relationships between persons, and so on. Imagine the issues of resolving pronouns and things like that.

The first part is finished, that is, the part that finds entities like names, dates and locations, and I'm debating the exact structure of the second part. Regardless, the operation of both parts will be the generation of entities that fit perfectly within the RDF model. In fact everything fits perfectly within the RDF model, since the purpose of that model is the representation of knowledge of any type as triples of the form (subject, predicate, object), which does in fact cover every bit of knowledge imaginable (given the willingness to deal with enough triples!).

So I am now working on getting the second part of the project to emit networks of RDF-like knowledge triples. These triple networks will represent the persons, events, relationships between persons, and properties of persons (where they went to school, where they worked, the armed service they served in, etc.) exactly as they are mentioned in the text.
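
To give a feel for the output, a sentence such as "John Smith, 62, of New London, died Tuesday; he is survived by his wife Mary" might emerge from the second part as triples roughly like these (the identifiers and predicate names are invented for illustration):

(person-1, name, "John Smith")
(person-1, age, "62")
(person-1, residence, "New London")
(person-1, death-date, [date resolved from "Tuesday"])
(person-2, name, "Mary")
(person-1, spouse, person-2)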

Change in subject. The XML form of RDF could be used as the external format for a genealogical data standard. There would be great temptation to make it too complex, I think, but if the temptation were resisted, it would be a fine format. There are other forms that RDF can take that I would prefer, but that's as may be. (Note that RDF could be put into GEDCOM syntax if one wanted to go through the exercise.) The main temptation comes in the area of vocabularies. The pressure would be to piggyback onto many existing namespaces (e.g., Dublin Core for representing resources, other semantic namespaces that exist for personal and relationship information, namespaces for locations and dates, etc.). I am in favor of a single schema and namespace for the entire genealogical and family history domain of knowledge. It would likely be a losing battle.

Tom Wetmore
ACProctor 2012-05-11T06:14:21-07:00
Thanks for the reply Tom.

I have reservations about RDF but I need to fill in some knowledge gaps first.

For instance, a plain XML representation of family-history data would have all the necessary semantics by virtue of its schema definition, although non-genealogical software would not know of them. There would be no elements or attributes that would have meaning to such software looking "blindly" at the XML.

Creating the same E-R relationships using RDF (not specifically RDF/XML) sounds like it could achieve the same data structure, but in a way that conveys some meaning to a general search engine or query language. My concern here is that many of the genealogical entities and entity-relationships have a degree of vagueness or nuances associated with them. Simply adding a relationship that indicates "son of" may not convey enough information to retrieve meaningful results in a query. It may over-simplify the relationship.

The same is true for entities such as a Person with all its many alternative names, and its contextual information. A genealogical 'person' is not the same as a person in a phone directory - it could be an evidence-person or a conclusion-person. Similarly with a date. In order to describe a date in sufficient detail, you need more than what the ISO standard accommodates, e.g. imprecision, granularity, alternative calendars, etc.

Certainly from an XML point of view (I know little of the other RDF serialisation formats), the introduction of RDF namespaces complicates things. Namespaces are notoriously messy to handle, and older versions of XML libraries processed them differently.

XSLT is undoubtedly the worst language in the universe, but wouldn't moving from regular XML to RDF/XML prevent its usage?

Tony
ttwetmore 2012-05-11T11:56:40-07:00
Tony,

RDF/XML is pure XML, so XSLT remains a programming option.

RDF/XML requires all subjects and predicates to be universally and uniquely distinguishable from all others (and many objects are as well). This is why RDF files are chock-a-block full of URIs. For example, the Dublin Core author entity is unique, and if Better GEDCOM went the RDF route it could choose to employ the Dublin Core for identifying sources, or it could define the author entity another way. But if it did so, that new way would have to be universally unique, with its own URI. Having globally unique predicates (think of these as just the tags of GEDCOM or the elements of XML) is what makes it possible to talk about “semantics.”

Using RDF for a genealogical transport file, all record-level objects (e.g., persons, sources, and whatever else is chosen to be at that level) would have to have unique URIs. Every person in every file everywhere. I would much prefer using UUIDs.

Either XML or RDF/XML for Better GEDCOM requires a schema. If RDF were used, a unique vocabulary/namespace with its own URI would be created. And BetterGEDCOM could decide to support other well-known namespaces.

The triple structure of RDF can represent any ER diagram. I’m not worried about the vagueness issue you bring up as that vagueness is part of the semantics.

The namespace issue is a big one for me too. It might be nice to piggyback on Dublin Core for the source part of BetterGEDCOM (GEDCOM-X is doing that), and define all other tags for ourselves. There are probably namespaces already for locations that could be used. I would guess that any namespace that includes persons and relationships would be too elementary for a genealogical data model.

Tom
ras52 2012-05-28T23:46:31-07:00
Packaging data
The Multimedia01 requirement states that "BetterGEDCOM must use a container specification to hold separate supporting files such as multimedia accompanying the genealogical data." That sounds right to me.

Before I came across this site, I spent quite a while considering exactly this problem. One possibility I considered was a multi-part MIME file -- effectively an email plus attachments. This has quite a few good properties: you can specify the MIME type and name of each resource; MIME types are extensible so when the next new video format comes along, we'll support it; and the mid: URL scheme (defined in RFC 2392) allows one part of the container to reference another part. But the drawbacks are considerable: the format doesn't allow for efficient extraction of a single resource; there are few decent stand-alone tools for handling multi-part MIME; and shipping multi-part MIME over HTTP is confused and under-specified. (It is also verbose, but that can be solved with compression, so I didn't consider that important.)

The .zip and .tar.gz archive formats solve all those drawbacks, but lack some of the advantages of multi-part MIME. While it's possible to specify the name of a resource, there's no way of specifying the type. This is usually worked around by making an assumption based on the file extension, but the number of desirable extensions is small, their use is unregulated, and conflicts do occur. There's also no equivalent to the mid: URL scheme to provide a standard way of allowing one archive member to reference another.

However, the GEDCOM X proposal seems to have solved this in a clever way: they've adopted Java's .jar archive format. This is completely compatible with the .zip format, in that anyone with a zip tool can create a .jar file simply by including a directory called META-INF/. No Java required. The META-INF/ directory contains a file (called MANIFEST.MF) that optionally includes the MIME type of each resource. Finally, there's a de facto standard jar: URL scheme for referencing files in the archive.

I think this provides solutions to quite a few of the Better Gedcom requirements including Multimedia01 (Multimedia container -- the .jar file), Multimedia03 (References to multimedia -- via the jar: URL scheme), and Multimedia04 (Grouping of multimedia in a container -- as .zip has a tree-based directory structure).

This MANIFEST.MF file is a series of name-value pairs in the RFC 822 header style (the header format used in email, HTTP, and MIME), and it's legal to include non-standard headers if Better GEDCOM needs something beyond what Java requires. Although manifest headers technically live in a different namespace to HTTP headers or mail headers, it's pretty clear that there is an informal policy to avoid name conflicts between these namespaces except where the meaning is the same. So if, for example, we borrow an HTTP header, we can be hopeful it won't conflict with a future official Java manifest header.

This provides a mechanism for solving Multimedia02 (Information about multimedia objects). We can put some metadata directly in the manifest: e.g. Content-Length, Last-Modified. If we have more complicated metadata, it can be bundled into a separate file in whatever format we want (I would suggest one of the RDF formats, but that's a separate point), and the Link header (cf. RFCs 2068 and 5988) can be used in the manifest to associate the file of metadata with the original resource.
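
To make that concrete, such a manifest might look something like this (the entry names and the BetterGEDCOM media type are invented; Name and Content-Type are standard jar manifest usage, while Content-Length, Last-Modified and Link are the HTTP-style headers discussed above):

Manifest-Version: 1.0

Name: data/tree.xml
Content-Type: application/x-bettergedcom+xml

Name: media/portrait.jpg
Content-Type: image/jpeg
Content-Length: 48213
Last-Modified: Sat, 26 May 2012 10:15:00 GMT
Link: <metadata/portrait.rdf>; rel="describedby"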

Finally, .jar solves a problem that isn't one of the Better Gedcom requirements (though I think it probably should be): it provides a mechanism for digitally signing resources.
ACProctor 2012-05-29T02:55:55-07:00
Just linking this to the recent activity on an older thread: BG Container Formats.

Tony
ACProctor 2012-06-07T12:23:54-07:00
Anyone know anything about the new proposed standard called Document Container File, or ISO/IEC NP 21320-1?

Tony
AdrianB38 2012-06-08T03:46:38-07:00
GEDCOMX initial look
Tamura has recently posted his comments on the GEDCOM to GEDCOMX converter prototype and, more importantly, the GEDCOMX format. See http://www.tamurajones.net/GEDCOMXConverter.xhtml

He is not encouraging on the format but I suggest you read his comments for yourselves. I have to say that I don't always agree with Tamura but tell myself that I better have a darn good reason not to agree with him. Right now, I have no reasons to disagree with him.
AdrianB38 2012-06-10T15:57:01-07:00
Tom asks: "where is the family? There doesn't seem to be a single element in this model that can represent a family."

I had thought that it had migrated to Relationship but looking at the diagram on conceptual-model.png, that can't be so as the relationship there has only 2 persons in it. Oh - that might work if the 2 are the parents and the children are related separately to their parent(s).

But for me, the major issue with Relationship, I THINK (i.e. I'm still trying to get my brain around it) is that if it is only 2 at a time, then it makes it practically impossible to record business partnerships, etc., since one could grow old putting in the fifteen (6×5÷2) 2-person relationships needed to record a partnership of 6 people, instead of 1 Group with 6 members.

Tom, you also said "I've always thought that an event should be able to be ... a top level element, so that many persons as role players could refer to it in their many roles". That conceptual model diagram doesn't seem to allow for that - but I can't even see how to use it to get from a fact to a Person, so I may be wrong. In fact, if I think about it, the idea that Persons have their own "file", as do Relationships, but not events (again, so far as I can see) makes it impossible for the multi-person event to exist, as it couldn't go in any single person file. I'm _guessing_ that at the moment they are still rejecting the multi-person event because they don't see the need for it.

I'm trying to stay positive but from where I'm sitting, it looks like GEDCOMX is a massive leap forward in technology but zero advance in genealogy (they can't even do dates on names for goodness sake!!!). Or rather, I can't see the genealogical advances. If there are any.
AdrianB38 2012-06-10T16:04:33-07:00
Louis - you said "Events happening to families need to define what the family consists of. It's not worth the trouble. Just have the event be defined to happen to each of the individuals and be done with it"

I'm not certain which way I want to go, having got fed up both with residence events against individual members and with them against the family as a whole (for the reasons you say). But for me, the advance would be the ability to do all sorts of groups such as business partnerships, etc., all of which have events in their own right. Family would then just be an instance of an informal group and one could use family-level events or individual-level, whatever seemed easiest. At least we'd have the choice.

Adrian
Alex-Anders 2012-06-10T16:25:09-07:00
Louis
"No one needs to search through the 100 previous conclusions that happened before the current one."

Would you see all 'conclusions/decisions' being recorded in the current one? So that it may be a long text, or a 1,2,3,4 rating type thing?

If a text, what would you see as happening if a conclusion half-way along is proven incorrect?
ttwetmore 2012-06-10T16:31:22-07:00
Louis,

No arguments from me. I know you don't like personas, multi-level or not. I didn't know about your feelings about the family object, but fine. When there is another genealogical data model that gets widespread acceptance (if there ever is one) I will be able to handle it with little effort, so I can continue on worry-free with my own software projects. Based on where BetterGEDCOM and GEDCOMX seem to be going -- IMHO heading for analysis paralysis and/or unnecessary complexity -- and with FHISO possibly being process-driven to numbness, I am happy with my own projects. I'm doing some interesting stuff using natural language processing right now, extracting genealogical information from obituaries and other notices.

I know you realize that the person in the GEDCOMX record model is the persona record that you don't like.

Thanks again for your wonderful demeanor.

Tom
ttwetmore 2012-06-10T17:04:07-07:00
Adrian, Louis,

I think it's a very valid question to ask whether the person and the relationship are all that is needed at the top level of a conclusion model. Certainly the key concept in the conclusion world is the person, and that's covered.

The next question is what else is key at the conclusion level. Not to head down the long and fruitless path of asking yet again what is genealogy, it is clear that the GEDCOMX model believes that in the final analysis, the purpose of genealogy is to discover persons and to discover relationships between persons, and to annotate those objects with facts that are also conclusions.

It is pretty hard to argue that that is not a good system. However, I have always thought that there should be an event conclusion object at the top level in preference to the relationship object. That said, I believe it is always the case that by just using relationships, or by just using events, one can express the same information. Even so, I believe there are better ways and worse ways to form the expressions.

As an example of my position, I prefer, in my conclusion model, that the parent relationships and the simple birth "fact" of a child all be bound up together into a single, top-level birth event that has three roles: obviously the father, the mother and the child. In the GEDCOMX conclusion model, this information is divided into a birth fact given to the child and two relationships for the father/child and the mother/child relationships. I like the event approach for a number of reasons. For one, there are fewer top-level objects in the database, and by now you all know that I place a premium on simplicity and efficiency of representation. (Not to be facetious, but I think it is clear that simplicity and efficiency are not GEDCOMX goals.)

GEDCOMX -- 3 person records, 2 relationship records yielding 2 directly implied relationships -- score: it takes 5 records to record 3 persons and 2 relationships. If you object to my use of the term record, just substitute your favorite term for top level thingamabob.

My approach -- 3 person records, 1 event record yielding 3 directly implied relationships (the extra one being the now explicit conjugal relationship between the parents) -- score: it takes 1 fewer record to record 3 persons and 1 additional relationship.
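
To illustrate, in a hypothetical XML notation (all element and attribute names invented), that single event record might look like:

<event id="E1" type="birth" date="18 December 1949" place="New London, Connecticut">
  <role type="father" person="P1"/>
  <role type="mother" person="P2"/>
  <role type="child" person="P3"/>
</event>

All three relationships, including the conjugal one between the parents, can be read directly off this one record.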

Tom

Another advantage of my approach, which I hope is obvious, is that the two parents are directly associated with the birth events of their children, something that is not possible in the GEDCOMX model.
ttwetmore 2012-06-11T02:57:10-07:00
Adrian,

On the name not having a date issue. I assume that GEDCOMX will realize that a name is just another kind of fact, and when they do, name will get a date (and a place).

On the multi-role event issue. I agree with you that GEDCOMX has missed an opportunity. Maybe they have it covered in the record model. If it is in that model, I believe they could make an argument that they don't need it in the conclusion model, since the conclusion model should only have the conclusions one would draw from multi-role events, not the events themselves. As you may have seen in my recent post, I don't believe that, because I believe that the top level relationship object should be removed from the conclusion model and replaced with the multi-role event, because multi-role events can do a better job of conveying the information and the conclusions. To get some insight on this, it will be interesting to see how the record model is used to handle census events.

I do agree with the GEDCOMX approach of treating personal events as if they were facts, but not all events can be restricted to a single person (and in GEDCOMX's case, single relationship). This is the distinction I make in the DeadEnds model between vital events which are associated directly with a single person (or other kind of object), and are therefore subsumed as sub-elements under the associated object, and events as first class citizens which are stand alone, top level event objects with any number of roles.

On the giant leap forward in technology point. That one gives me pause. Isn't there an informal metric one should apply to compare the "size of information" with the "size of the technology required to express that information"? Think of something simple, say a nuclear family with names, dates, birth and death and marriage places. Think of that as an amount of information that must be recorded or transmitted. Now first think of legacy GEDCOM as a language for expressing that information in an unambiguous fashion. Create some kind of a metric between the amount of information and the size of the GEDCOM file. Now express the same family in the GEDCOMX XML/RDF representation. It is expressing exactly the same "information" yet it takes 40 times as many characters to do it. I might be thinking about this all wrong, but for me this is a misuse of technology. But as I said before, it would be trivial for GEDCOMX to fix this. All they have to do is to think of the XML/RDF format only as an export format that web-semanticist-gurus might want to play with, not as the actual GEDCOMX transport and archival format, which should be a simple, tag-based approach, best documented with a JSON-like syntax. I expect that there will be a lot of pushback by developers and other persons with common sense about using the XML/RDF/multi-namespace format as the actual transport and archival format.

Tom
AdrianB38 2012-06-11T07:47:43-07:00
"On the multi-role event issue. I agree with you that GEDCOMX has missed an opportunity. Maybe they have it covered in the record model."
I can't see anything in the documentation associated with that other than a constant repetition of "Dublin Core". I couldn't even tell you from that documentation if personas are part of it. The much vaunted UML diagram of the conceptual model (http://familysearch.github.com/gedcomx/uml/conceptual-model.png) seems to ignore the record model.

And it's kind of weird because events give rise to records (even records of characteristics, as someone pointed out in a GitHub exchange that I could understand), so surely events ought to be "higher" to minimise the change moving from record to conclusion model.

"giant leap forward in technology" - yes, I used the phrase to try to invoke the accumulation of three letter acronyms for one's CV. Clearly there are multiple dimensions to the overall value of something, of which use of leading edge technology is just 1.

"XML/RDF format only as an export format that web-semanticist-gurus might want to play with" I think it's more than that to them. I think it is THE file format. We just want to exchange data between people and store it in our personal databases. This REPLACES the personal database, I can find no other reasons for the extra stuff. Why would we ever want to turn our GEDCOM files into HTML without loading it into a program first? Why else would GEDCOMX have the RDF stuff if not to move structural knowledge out of the application and into the file?
ttwetmore 2012-06-11T10:51:57-07:00
Adrian,

Here is the diagram of the "old" record model. (The record model used to be part of the main GEDCOM-X "product line" but they took it off the menu, probably so they could concentrate on and publish the conclusion model more quickly.) I don't think we should necessarily anticipate that this old model will be the same as the new record model when they get back to it. I was against taking it off the menu; I recommended in fact that the two models be unified, based on my belief in the multi-layer persona concept, but there was no support for that idea.

http://record.gedcomx.org/record-uml.png

Note that the person object in this model is called the persona.

And notice that central to the record model is the record object. This represents the extracted evidence "backbone" record, in whatever form it might take. Therefore this object would be the evidence level event object that we've talked about a lot and that I have in the DeadEnds model.

Note that there is no analog of the record object in the current conclusion model. This is the major oversight in GEDCOMX, in my opinion. The record object should evolve into the multi-role event object in the conclusion model.

Tom
AdrianB38 2012-06-11T12:06:44-07:00
Tom - thanks.

"I don't think we should necessarily anticipate that this old model will be the same as the new record model" - I shall bear your warning in mind. But I'm having a hard time trying to see how a normal sort of Record can be used (with minimal interpretation) to derive _paired_ relationships. Take a wedding with bride, groom, three witnesses and a clergyman - and we'll say that I want to record all 6 personas. I can see I've got roles to record - but where? And if the witnesses are related in pairs - well, who to? Saying Witness1 is related to Bride and then Witness1 is related to Groom is pointless repetition.

Oh don't tell me - I've just looked...
"The following relationship types are defined by GEDCOM X:
"URI description
"http://gedcomx.org/Couple
"http://gedcomx.org/ParentChild"

No witness relationship - so no witness role???

So they've not thought ahead to what sorts of relationships might be needed and they've ended up with pairs because that's all GEDCOM has now. If they'd thought just a little bit then they'd never have ended up with pairs, they'd have got groups.
ttwetmore 2012-06-11T12:45:29-07:00
Adrian,

Even though the old record model has a record object that can link to any number of persons, and any number of relationships, there seems to be no way to assign roles to those links.

It may be that the GEDCOMX designers do not believe that the relationship between person and event is necessary or useful. That is why I said it will be interesting to see how the record model handles census data. My guess is that each person mentioned in a family will have its own persona record, and that there will be a separate relationship record linking each person in the family (except the head of household) to the head of household. I would guess that the names of those relationships will be the relationship type fields of the relationship objects.

Note how this set of pair-wise relationships misses all kinds of things. For example, a son and a daughter each would have a head/child relationship with the head, but would have no relationship with the spouse in the household and no relationship with each other. All those have to be inferred by software. By using the multi-role event, all the relationships are inherent in the object. Maybe I should show another scorecard like I did for the conclusion record.

Okay, say there is a family with head, spouse, and 4 children.

In the current GEDCOMX record model there would be 6 personas and 5 relationship objects, which obviously encodes only 5 relationships, each to the head of household.

Replace this with a multi-role event object. Okay, you still have 6 personas, but now only 1 event record, but that 1 event record encodes 15 different relationships, including 1 spouse/spouse relationship, 8 parent child relationships and 6 sibling/sibling relationships. Here's the score card:

GEDCOM X -- 12 records (6 persona records, 1 record record, 5 relationship records) -- encoding 6 personas and 5 relationships.
Multi-event Approach -- 7 records (6 persona records, 1 event record with 6 roles) -- encoding 6 personas and 15 relationships.

If you are keeping score that is 5 fewer records and 10 more relationships.
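
In the same spirit, a hypothetical sketch of that single census event (element and attribute names again invented):

<event id="E2" type="census" date="2 April 1871">
  <role type="head" person="P1"/>
  <role type="spouse" person="P2"/>
  <role type="child" person="P3"/>
  <role type="child" person="P4"/>
  <role type="child" person="P5"/>
  <role type="child" person="P6"/>
</event>

The 15 relationships are not stored anywhere; they are all implied by the six roles on the one record.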

Now, to be fair, there is a strong argument against my approach. One cannot really assume that children are biologically related to the spouse, and therefore one cannot infer that they are natural siblings either. However, the software looking at such an event would be able to understand that, and is under no obligation to make those assumptions.

Tom
louiskessler 2012-06-11T18:02:42-07:00

Adrian, Tom:

I've spent the past two days trying to figure out and formalize the GEDCOM X file format. This was NOT easy to do, and I don't think I'll attempt something like it again. But this is the sort of document they need for developers to be able to implement GEDCOM X - not the unintelligible stuff they have with Java programs that most of us can't use in our applications.

I may come back and address your comments later, but I've fallen behind on everything else (including personal stuff) because of this and I need a few days to catch up.

I hope you scan through what I've done. Feel free to correct anything you see is wrong.

Louis
louiskessler 2012-06-11T18:03:13-07:00

Oh, I forgot to tell you that it's at:
http://bettergedcom.wikispaces.com/GEDCOM+X+Framework
louiskessler 2012-06-09T08:49:27-07:00

Thank you, Ryan, for the correction. That's egg on my face that I didn't even look closely enough at the decompressed files to see they're XML. The use of Java and zip files, and my recollection of your writings that always mentioned both XML and JSON, made me believe they'd be JSON.

I agree GEDCOM and XML are effectively equivalent. I don't think any developer will have a problem with XML. As you say, there are lots of programming tools for XML. I believe they are mappable from one to the other. XML is bloatier mainly because it is indented and contains closing tags.

What hasn't happened yet, Ryan, is for us developers to now look at your constructs, and see whether our data structures will map to them, and how much change we have to make to incorporate them. I'm sure most programs today have data structures mimicking GEDCOM, with INDI, FAM, SOUR, REPO and OBJE records and links between them. Events are part of INDI records, and source details (i.e. citations) are attached to the event rather than placed with the source.

We need to look in detail at what you've done and evaluate what we would want to do:

1. Whether we would leave our data structures as they are, and translate your GEDCOM X to our structures internally, and on export translate back to GEDCOM X, or

2. Whether we would change our data structures to match those of GEDCOM X so that your data seamlessly rolls in and rolls out. In that case, we'd have to translate legacy GEDCOM on input or output, or

3. Change some data structures, and leave others the same.

Data structures are the internal guts of a program. It may be difficult or impossible for some developers to make major changes to them.

I am guessing that your data structure will reflect that of FamilySearch ... or of the way you want FamilySearch to ultimately be. Full analysis of your data structures is something that is hard to do rigorously without getting down and dirty. Once I'm at that point, the way I would personally do it is to compare it to legacy GEDCOM piece by piece. I may even attempt to translate your standard into a GEDCOMish standard.

What will cause me pain (and other developers as well) will be all the constructs that GEDCOM X will support that our programs don't currently support. That's forcing our hand to add support of certain features, otherwise we wouldn't be able to properly round-trip the data. I don't say that's bad. It's probably very worthwhile. But it better be worth the pain.

It's a bit of a gamble for the developer community. Do we think your overall structures are the proper way to represent genealogy data? Are we willing to commit to those ideas? What if GEDCOM X doesn't become the standard, but Ancestry or MyHeritage get more followers with theirs? What if we change our data structures to be GEDCOM X compatible and there's major updates every year that make us spend all our time keeping up and no time left for doing the other development we need to do? There are a lot of unknowns.

After my next release of Behold (adding life events and consistency checking), I'll be doing my export to GEDCOM 5.5.1. That would be a good time for me to review what you've done with your GEDCOM to GEDCOM X translator. I would see what is involved in the translation and it will give me a basis on how I should set up my program's disk based database. If GEDCOM X proves good enough, who knows, I might choose it as my file format. :-)
AdrianB38 2012-06-09T12:52:07-07:00
Ryan - thanks for your comments.

Re your comment that "The reason for the size bloat is because we're using XML", I think this is totally unfair on XML. If we take a sample bit of GEDCOM:
1 NAME George /Washington/
and translate it into XML, we get
<NAME>George /Washington/</NAME>
(though I'm not sure if the notation around the surname will work or whether it would need to "escape" the /)

From https://github.com/FamilySearch/gedcomx/blob/master/specifications/xml-format-specification.md I can see that this particular GEDCOMX name comes in as:
<name rdf:ID="789">
<gx:attribution>
<gx:contributor rdf:resource="https://familysearch.org/platform/contributors/STV-WXZY"/>
</gx:attribution>
<preferred>true</preferred>
<primaryForm>
<fullText>George Washington</fullText>
<part>
<rdf:type rdf:resource="http://gedcomx.org/Given"/>
<text>George</text>
</part>
<part>
<rdf:type rdf:resource="http://gedcomx.org/Surname"/>
<text>Washington</text>
</part>
</primaryForm>
</name>

Now, that's not quite equivalent in meaning to the GEDCOM since <gx:attribution> doesn't have an equivalent in the GEDCOM I quoted, and I've also used the /.../ surname notation whereas the components of the name are explicit in the XML version. But what it does illustrate is that a large part of the increase in size is down to the RDF stuff, not XML as such.

If RDF brings a benefit, then the GEDCOMX team needs to publicise that benefit because unless that's done, all that people will see is the spectacular increase in file size.

In a similar fashion, the file size wasted by repeating the XML version, the character encoding and the namespaces for every individual (because, as Tamura says, each has their own "file" inside the zip) is a function not of XML but of the decision to give each individual their own file. Again, there may be a good reason for this, but unless it's publicised people will be baffled why that path was followed for an interface file that is simply imported into the application all in one go. Or vice versa.

Now you've got some diagrams (for which, thanks), I shall try to get my head round the GEDCOMX concepts.
louiskessler 2012-06-09T13:08:12-07:00

Tamura Jones was having trouble posting to this page and asked me to post the following on his behalf:



I'm happy to read Adrian does not religiously follow me or anyone else,
but has thoughts and opinions of his own. Wouldn't want it any other
way.

Ryan's suggestion that my Quick Look only commented upon the file size
is a misrepresentation. He goes on to blame GEDCOM for the memory hunger and XML for the file bloat. Methinks GEDCOM and XML are innocent :-)
louiskessler 2012-06-09T15:17:18-07:00

Adrian,

Thanks for the example which explains clearly what's going on.

GEDCOM X has explained why they use RDF at: https://github.com/FamilySearch/gedcomx/wiki/RDF-Integration

It says there that "by providing support for RDF, GEDCOM X gains some notable benefits, including:

•assurance of conformance to industry-accepted standards
•a standard way to reference and describe resources
•a standard mechanism to define controlled vocabularies
•compatibility with known standards for describing physical and digital artifacts via Dublin Core
•compatibility with known standards for describing users and organizations via FOAF
•a standard way to embed (genealogical) data into web pages through RDFa
•automatic compliance with powerful tools built to analyze the semantic web"

I don't work at the standards "metalevel" and those benefits relate to things I know very little about, so someone else has to evaluate the future benefits of using them relative to the overhead they add.

The multiple small files scheme zipped up into one big zip file is another matter. I don't know what to say about that either.

Louis
AdrianB38 2012-06-10T07:23:25-07:00
Thanks Louis - I shall have to read and ponder, but certainly, while some items seem interesting ("a standard mechanism to define controlled vocabularies"), others ("a standard way to embed (genealogical) data into web pages through RDFa") concern me, as they imply FS is looking to process the GEDCOMX files directly, rather than (or possibly as well as) loading them into a "database" for processing by an application program. That's a whole different purpose than we have envisaged for BetterGEDCOM files, where we simply envisage BG files being transferred en bloc between application programs. It's an interesting vision.

Adrian
ttwetmore 2012-06-10T08:18:34-07:00
I've been reading the GEDCOMX documents repeatedly, and am still confused, so my comments might smell a little.

I agree with Tamura that the size of the GEDCOMX files is too large, about two orders of magnitude too large. Most of this comes from encoding each person and relationship as a separate file, and then having to repeat all the XML header info in every file, including all the namespaces that were chosen. I expect GEDCOMX will quickly modify the archive format so that all persons and relationships can be grouped into a single file, or at least a few files. This will take care of one order of magnitude in the size area.

I disagree with Ryan's comment that GEDCOMX is more efficient than GEDCOM. It is just plain wrong.

A second reason the files are so large was the decision to use a full-bore, XML/RDF-based, namespaces-up-the-kazoo notation for the archive file format. GEDCOMX could very easily use a much simpler syntax for archival that could be exploded out to the full-bore formats on demand. That is, there is no reason to use the namespaces or the long URI-based strings in the archive format. Simple tags just like GEDCOM would be adequate. When mapping to the full-bore formats, software can supply all the extra characters that balloon the format out to such great size; this would take care of the second order of magnitude in the size area. All that this requires is that there be a map between the simple, short, one-word GEDCOM-like tags that GEDCOMX should be using, and the namespaces and resources that they map to. This can easily be thought of as part of the GEDCOMX schema for converting to the ridiculously long-winded format. Why not just let those who want their data in ridiculously long-winded format have a button they can push in order to get it that way, but let the rest of us poor souls have a version that we can read, understand, write simple importers and exporters for, and that fits into our current generation of software. Archive files in this much simpler format would be much easier for software downstream to handle, and especially for software developers to understand!

I disagree with the implication that GEDCOMX must use a plethora of namespaces and strictly adhere to RDF-formats with lengthy expressions for resources, and carry along typing information, in order to be ready to fit into a more modern set of applications that will have lots of value to genealogists (which seems to be the justification for GEDCOMX taking this route). This is simply not true. First those applications do not exist. And any new or current application that had to understand this new GEDCOMX format would be a nightmare for its developers. What does exist is a veritable plethora of genealogical programs that employ GEDCOM or different, but still, quite basic and simple models for genealogical data. If GEDCOMX were interested in really being easily applicable and useful to the current (and even future) generation of genealogical software, its format should have been geared to fit into this family of formats. And even if GEDCOMX had geared the format for the software systems of today, they could still provide those buttons that die-hard RDF fanatics could push in order to get their files 100 times too big and filled with thousands of pretty URI strings and all the RDF obfuscation they desire.

I hope that GEDCOMX will take a step back from all their hard work, put on their common sense hats, ask themselves who they are really trying to impress with all this RDF/URI/namespace malarky, and then come back with a format that really will be what developers need for the current and next generation of software.

Tom
louiskessler 2012-06-10T12:11:17-07:00
Tom,

You've had many great posts before, and this was another - hitting the nail on the head.

I was working all day yesterday to see if I could flesh out the meat of what's really in GEDCOM X. There's a reason why you might have been confused. It's because it is confusing. It is not rigorously defined and you have to piece it together from diagrams, model specs, code examples, their java repository and their Legacy GEDCOM Migration Path article. Their Developers Guide is much more confusing than it is helpful.

However, I must say that what I have fleshed out so far looks really good. I think there is a really good basis and structure here for a new standard.

The fleshed out good stuff will be coming very soon.

Louis
louiskessler 2012-06-10T12:26:45-07:00

For now, everyone please read:

http://record.gedcomx.org/Conclusion-Record-Distinction.html

and

http://record.gedcomx.org/twomodels-whitepaper.pdf

I agree that this is exactly how the source details (or evidence if you prefer) should be separated from the conclusions.

They are correct in saying: "The GEDCOMX Record Model has no predecessor within the legacy GEDCOM format. It is a new model for sharing and referencing genealogical evidence, not conclusions, in a world of individual online archives."

Louis
ttwetmore 2012-06-10T12:52:56-07:00
Louis,

Thanks.

I don't think the conclusion and record model should be split, not a surprise to you, since I've been saying it for 20 years. I certainly want my software program to be able to handle both evidence and conclusions within an integrated model.

The documents stress the black and whiteness of evidence and conclusions. They need to overstress this in order to justify separating the models. There is no clean boundary between evidence and conclusions, and forcing the models to conform to the black and whiteness will lead to problems.

You know my solution to this -- the multi-layer persona record -- it forms the backbone of the record model, and the conclusion model, and every gray model in between. But this simplifies the model instead of complexifying it.

I am the first to assert that the kind of information found in a raw evidence persona record is wholly different from the kind of information found in a top level pure conclusion person record, but to artificially pull the models apart because there are shades of difference in the different person concepts is, in my opinion, copping out with lots of fancy words to justify it. I think the GEDCOMX people like separating them out so they can have different RDF types for the different person concepts.

As you can tell I am very much against this RDF/URI-ification of genealogical models. We don't need them, and they can be added on with a push of a button for those who want them.

I have another note coming about the conclusion model.

Tom
ttwetmore 2012-06-10T13:25:56-07:00
Louis,

Thanks for your kind words as always.

I'm trying to understand the conclusion model at present. I've come up with the following points I'm not sure of.

What is an attribution and why is it so important? How is it different from a source? It feels like an unnecessary concept to me. I don't have a single item that could be called an attribution in any of my databases, and I do not feel a lack. Am I a lazy genealogist? Am I supposed to both keep track of all my sources AND all the people who referred me to those sources? I don't get it.

It seems that only persons and relationships are "first class citizens" in the conclusion model, though I must admit that my cherished concepts of records and structures are very much in jeopardy after reading the documents.

Along those lines, it seems to me that an event in the conclusion model is "just" a fact, which is fine as far as it goes, except that facts are things that seem, in this model, to really apply only to single persons or single relationships. I've always thought that an event should be able to be (doesn't have to be) a top-level element, so that many persons as role players could refer to it in their many roles (or conversely, the event could refer to its many role players). Maybe this model allows this, though I am not sure. The question would be: must a fact (which includes the concept of event) always be a sub-element (in the XML sense) of a person or a relationship, or can it be a top-level entity of its own?
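For example, here is a hypothetical sketch (my own element names, not GEDCOMX syntax) of the kind of top-level event I mean, with role players referring to it:

  <event id="ev1" type="Marriage" date="4 JUL 1851" place="Boston, Massachusetts">
    <role type="Groom" person="p1"/>
    <role type="Bride" person="p2"/>
    <role type="Witness" person="p3"/>
  </event>

Here the marriage is a first-class record shared by three persons through their roles, rather than a fact duplicated under each person.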

I have the same question about places. The diagram makes it seem that a place is simply an attribute of a fact, which again, is fine as far as it goes. But does this model also allow an independent hierarchy of top-level place entities/objects as well? That is, can places exist if there are no facts to refer to them? Maybe after I swim through the documents further I'll be able to figure this out, but so far I am unable to.
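Again as a hypothetical sketch (invented element names), an independent place hierarchy would look something like:

  <place id="pl1" name="Boston" partOf="pl2"/>
  <place id="pl2" name="Suffolk County" partOf="pl3"/>
  <place id="pl3" name="Massachusetts"/>

so a fact could point at pl1, but pl1 through pl3 would exist as records in their own right whether or not any fact referred to them.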

Is it safe to say that the GEDCOMX designers believe that there are only two key conclusion concepts in genealogy, the person and the relationship, with all other concepts supporting these two? The diagram implies this. It is an interesting idea, and maybe it's a good idea. But where is the family? There doesn't seem to be a single element in this model that can represent a family. I reject the claim that families are secondary concepts, that if you really need to find out what the families are, you can simply use your software to run through all the persons and relationships and infer them. Making families an inferred concept is, in my opinion, one of the worst mistakes one can make in designing a genealogical data model. The family is a key concept in the mind of almost every genealogist. GEDCOM got the family right.

Tom
louiskessler 2012-06-10T13:49:45-07:00

Tom,

Yes, and that has always been the one real difference in our thinking. I agree with the GEDCOM X people and think the evidence should be just the facts and should be independent of all conclusions. And that the conclusion data need not have multiple levels, but is best as the "current" conclusion with links to all the evidence that supports AND denies it. No one needs to search through the 100 previous conclusions that came before the current one. Once that new piece of evidence is added, the previous conclusions are meaningless because they didn't take the new evidence into account.

With regards to the RDF/URI stuff, I completely agree with you. A simple XML structure, clearly defined so programmers can actually implement it, would do very nicely.

Louis
louiskessler 2012-06-10T15:25:05-07:00

Tom,

The attribution appears to be the Proof statement. See: http://www.gedcomx.org/gx_Attribution.html

Yes, it also seems to me that Persons and Relationships are the first class citizens as you say. And I like that concept.

I never personally liked the family concept. What is a family anyway? Father, mother and children? What about adopted children? Surrogate? Foster? Same-sex marriages? Are family events for something that happens to the family? Who in the family? Grandparents? Boarders who live in the house with them? A friend of the family? Children who've died or moved far away or were disowned?

Events happening to individuals are easier. Events happening to families need to define what the family consists of. It's not worth the trouble. Just have the event be defined to happen to each of the individuals and be done with it.

Louis
louiskessler 2012-06-08T16:03:32-07:00

Also see my comments on GEDCOM X:
http://www.beholdgenealogy.com/blog/?p=1096

Louis
heatonra 2012-06-08T23:55:58-07:00
Tamura has a lot of intelligence. He makes a great point in his observation of the file size--that's definitely something that we should think about some more. There are a lot of ways to solve that issue. Smaller file size hasn't been a goal of the GEDCOM X file format.

Personally, I'm kind of encouraged by Tamura's findings. If file size were the only thing he found from which to conclude that the proposal is "spectacularly inefficient," then that's a great place to be. I guess he did mention the memory bloat and heavy processing time of the conversion program, but that doesn't have anything to do with the new file format--that's actually a result of the inefficiencies of the old GEDCOM file.

I think if he had the capacity to perform some additional analysis and comparisons, he'd find that the GEDCOM X file format was actually much more efficient in areas that matter to developers. This would include both processing efficiency and memory efficiency.

The reason for the size bloat is that we're using XML. (Not JSON, Louis.) XML is an inefficient serialization format. So we could think about using JSON or protocol buffers or a more efficient XML infoset, but of course any choice comes with consequences. XML has a really rich feature set that we'd have to figure out how to live without. And everybody's got an XML parser... fewer developers have parsers for JSON or protocol buffers or XML FastInfoset or whatever. Of course, Louis would probably like me to mention that the GEDCOM data format is a completely viable alternative to XML, too. Right Louis?

Anyway, just some trade-offs we'd have to consider.
Alex-Anders 2012-06-09T00:11:06-07:00
"Of course, Louis would probably like me to mention that GEDCOM data format is a completely viable alternative to XML, too."

That seems to be still in discussion, from Louis' last Behold Blog post.
ACProctor 2012-07-20T14:25:40-07:00
STEMMA Extended Vocabularies
As a separate STEMMA discussion, I wanted to point specifically to the section: Extended Vocabularies.

The issue of custom types, sub-types, roles, etc., came up in another wiki thread but I can't find it now. The scheme now used by STEMMA is a generalised one, and is even used to implement custom properties. In the STEMMA draft, there were a number of "core properties", like Age, Occupation, etc., and a separate mechanism for custom ones. In this proper STEMMA release, they have been reduced to a single scheme that makes use of extended vocabularies.

The Data Model section contains a couple of examples that illustrate this. Perhaps the most accessible is: Multi-Role Event

Tony
ACProctor 2012-08-02T08:16:29-07:00
UGM - Unified Genealogical Model
Following a recent conversation regarding the prospect of a Unified Genealogical Model (UGM), I felt compelled to write up some thoughts. I hope we can expand on this ideal since it underpins the primary goal of FHISO and BetterGEDCOM.

This conversation began with a suggestion that every vendor currently uses a different model, that no one wants a model that is effectively a straitjacket, and that it is unlikely that vendors (incl. content providers) will change their internal model to accommodate anything from a standards organisation.

The reason I want to write something up is that I not only see a UGM being possible but I see it being easily assimilated by those vendors, without internal change, and hence contributing to our collective advantage. The last thing we need is unfounded fears about this principle deterring support for any standard.

OK Tony, so how do you see it?

Well, perhaps the most important concept is that of what is being modelled. If we try to model "real life" - that is, when people were born, when they married, when they died, where they lived, etc - then we're immediately on the wrong track. We're dealing with unknowns and we find ourselves having to accommodate uncertainty, conjecture, and even information from which we cannot draw any conclusions. You may be thinking that this is obvious but it's worth just pondering on this subject for a moment.

The world of genealogy is a level removed from the real-life world. Just as in any scientific discipline, there is no absolute proof, and we're primarily dealing with evidence & conclusion. OK, these terms are a little emotive, but we can all agree that there is information out there about our ancestors, and then there are the conclusions and inferences that we form from that information. The essential point I'm trying to make here is that our UGM should not try to model the real-life world; it should model the genealogical world, i.e. what we found, where we found it, and what we infer from it. There's a subtle but important difference between the two worlds.

But why should that be important to a UGM?

The UGM has to be a "representational model", meaning that it simply describes the genealogical data rather than being tied to a particular research process. It should therefore be as capable of describing pure evidence (a good test in itself) as of describing pure conjecture. In effect, it would be describing the structure of generalised genealogical data, but without mandating the presence of all its main components. From this point of view, it would not constitute a straitjacket.

Also, by representing the structure of generalised data, it would facilitate data transfer between products with different internal models - those products would not have to adopt the UGM as their internal operating model.

Aren't you simplifying things too much Tony?

Although these thoughts could benefit from some better structure, and more solid arguments, I do believe that the principle of a UGM is possible and should be discussed further. Far from being a deterrent to any type of standardisation, it should be the holy grail that all vendors and users need.

One area I have glossed over is the scope of the genealogical data. Should it just be the basic BMD vital events, or should it include all aspects of family history? Should it also encompass 'narrative genealogy'? These questions do not detract from the general principle of a UGM but they must be answered at the outset as they affect the universality of a UGM.


Tony
ACProctor 2012-08-12T14:36:38-07:00
A standard data model, based on the UGM principle, would not be the "logical equivalent of the existing stuff" Adrian. As you say, there would be little point.

It has to be far more capable - both functionally and in scope - than GEDCOM, and so would be a superset of other models by implication. This isn't the same as it simply aiming to be a superset of a few existing models.


Tony
ttwetmore 2012-08-12T20:25:12-07:00
AncestorSync is a system that can transform data back and forth between the major desktop systems and online pedigree systems. They transform directly from one native format to another. This meets the original goal of the BG effort, which was to enable sharing of data between genealogical systems.

I believe that AncestorSync was thinking of joining the BG effort. They probably made the decision that BG would not accomplish anything within the time frame of their business plan.
ttwetmore 2012-08-12T20:58:57-07:00
UGM still seems a restart to me. I don't see anything new in it. I continue to point back to my original set of requirements and wonder what the UGM brings to the discussion that is new.

I fail to see why we insist on making things so complicated. We need to record our sources. We need to record our data. We need to record our conclusions. All of these things are easy to define, to understand, and to model.

I have been involved in these discussions in many fora for over 20 years. Every effort starts new. Every effort thinks they are making fundamental insights into the nature of genealogical data. Every effort starts strong and simply peters away. BG has followed the pattern exactly. FHISO is likely to follow suit. GEDCOM/SORD/GEDCOM-X has been limping along for 25 years, though GEDCOM-X, with large organizational support, may get over the hump to reality. GenTech actually made it to paper, but what a mess it is. BG is still hung up on citation formats, which fields should be in italics, etc., when these issues have nothing to do with data models.
louiskessler 2012-08-13T07:19:19-07:00

Tom said: "I fail to see why we insist on making things so complicated. We need to record our sources. We need to record our data. We need to record our conclusions. All of these things are easy to define, to understand, and to model."

Amen!

Louis
Andy_Hatchett 2012-08-13T08:25:51-07:00
The only major problem I have with AncestorSync is that it has to go through online content providers.

When it becomes a stand-alone product that can handle direct end-user to end-user transfers without needing to go online, I'll look more closely at it.
AdrianB38 2012-08-13T08:55:38-07:00
Tony says a UGM "has to be far more capable - both functionally and in scope - than GEDCOM, and so would be a superset of other models by implication. "
OK - but that's not the bit where I have problems, now I understand you. If the supplier of software does not actually need to make their software compliant with the full UGM model, _then_ there is no advance.

Package A could export their data from their existing database into the UGM - let's say this is something that they couldn't export into GEDCOM, so it looks like an advance to them.

But if a user of package B doesn't have the ability to model this new data from A, then, even though they can import part of the UGM file, they won't be able to import the new data from A because they have nowhere to put it. Result - the manufacturers of both A and B claim UGM compliance, the users of B simply turn round and say - "Well that UGM is a load of rubbish, it's no better than GEDCOM was."

There may well be mileage in exploring concepts of partial compliance - e.g. Package B can import multi-person events but not evidence-personas. Maybe there'd be 3 levels of compliance and a package would need to comply with all in a specific level to claim a degree of compliance.

Or there might be value in exploring concepts of an agreed graceful failure on import - e.g. Package C can import multi-person events but only by repeating the event across each of the people (this action would be specified in the standard). This isn't actually a successful import because if you came to re-export, you'd have multiple events for the same day and place that you weren't really sure were the same event, so they couldn't be mashed back together.
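To sketch the Package C case with invented element names (a hypothetical illustration, not a proposed syntax): a single multi-person event on import,

  <event id="ev1" type="Census" date="2 APR 1871" place="Leeds">
    <role type="Head" person="p1"/>
    <role type="Wife" person="p2"/>
  </event>

would have to be stored, and later re-exported, as two independent single-person events:

  <event type="Census" date="2 APR 1871" place="Leeds" person="p1"/>
  <event type="Census" date="2 APR 1871" place="Leeds" person="p2"/>

with no safe way to stitch them back into one event, which is exactly the information loss described above.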
ttwetmore 2012-08-13T09:44:26-07:00
Adrian says,

"... Package A could export their data from their existing database into the UGM - let's say this is something that they couldn't export into GEDCOM, so it looks like an advance to them.... But if a user of package B doesn't have the ability to model this new data from A, then, even though they can import part of the UGM file, they won't be able to import the new data from A because they have nowhere to put it. Result - the manufacturers of both A and B claim UGM compliance"

This is so very true. Is it UGM that makes it true, or is it the nature of the world that makes it true? Is there ANY solution to the sharing problem where this isn't true? Well in China there might be.

For compliance one could require that an importing system must be able to import all data in the sense that it can regurgitate it on export. A system that doesn't provide any software support in a specific area would have to let the unused data go along for the ride so it is available for export. But this still isn't going to make the users happy, because even though all the imported data is there, they would have no access to it.

Compliance is a slippery concept. Are you compliant if you can import, even though you don't keep stuff you are not interested in? Are you compliant if you can export, even though you have more stuff available you could export correctly? To be compliant must you be idempotent (that is, if you import and immediately export you get exactly the same results)? We would all prefer compliance to imply idempotency, but practically speaking this can only be a long term goal, the "top level" of compliance.

This is the problem inherent in all standards. A system complies or it doesn't, and at different levels, usually defined by the vendor to make them look good. There is nothing the standard bearers can do to change this environment, except take the moral high ground and define what the levels of compliance are.
ttwetmore 2012-08-13T09:53:35-07:00
Andy says,

"The onlly major problem I have with AncestorSync is that it has to go thru inline content providers."

This is very true, but it has an implication. AncestorSync must have its own database format that serves as a superset database for all the desktop and online systems that it interacts with. I'll leave it up to you to decide whether this implies there is a genealogical data model behind that superset database or not.

The point I am making is that AncestorSync IS SOLVING THE SHARING PROBLEM, while we are sitting here arguing about citation formats, and whether evidence should be packaged into personae, and whether suppositories are sources. We might not like some aspects of their solution, one that Andy points out, but at least someone out there got off their duff to tackle the problem that BG claims to be so interested in.
louiskessler 2012-08-13T13:22:14-07:00

AncestorSync is "attempting" to solve the sharing problem, just as Wholly Genes (TMG) attempted to do with its GenBridge.

The problem they are having is that translating data structures is very hard. So far they've only managed a few and nowhere near perfectly.

Not only that, but there are several hundred full-fledged genealogy programs out there with their own databases. Each program may have many versions where the database has changed from one version to another. Some of the companies are NOT willing to provide their database structure, so for an AncestorSync-like program to include them, they must be reverse engineered.

Plus this is a never-ending task. New programs come out all the time. Existing programs continue to tweak their databases as they add features.

Everyone wonders why I think GEDCOM has been so good for us - well, it's because it picked a middle ground of complexity that fit most developers' data structures and was not too overbearing for developers to implement.

As everyone has been saying, we've got a damned if you do and damned if you don't situation here. If you include everything in BG (or in a UGM), then all programs must be able to process and store everything or the data will not transfer. If you don't include some things, then the programs that use those things won't be able to transfer it.

Personally, I think a middle ground of complexity is needed. But if it was decided to attempt to include everything (a near to impossible task because the boundaries of what "everything" is are very subjective), there is a possible way to deal with this.

Some programs that don't handle certain GEDCOM tags put them into NOTEs. That is better than dropping the data. They get output as NOTEs and the information is at least not lost.

So a possible solution is to include in BG a specification on how to retain all the input data that your program doesn't handle and then pass it out again. And this retention specification must be written with the idea that *any* element might be one that is not handled, so it must identify, store, and pass through any element in a simple and standard way that any program can easily implement. See the sketch below.
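As a hypothetical sketch of what such a retention rule might look like (the wrapper element and the unhandled element are invented for illustration): a program that doesn't understand some new element would carry it verbatim inside a generic wrapper,

  <person id="p1">
    <name>John Smith</name>
    <unhandled>
      <ResearchTask text="Check the 1881 census"/>
    </unhandled>
  </person>

The program never interprets what is inside <unhandled>; it just stores the text on import and writes it back out on export.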

Louis
ACProctor 2012-08-13T14:14:55-07:00
Re: "So a possible solution ..."

A very constructive suggestion Louis. I have seen this approach work successfully in other fields, but one slight complication here might be if the 'unhandled' tag references, or is referenced by, another (handled) tag. It's not impossible, of course, but the detail may need a little thought.

Tony
Alex-Anders 2012-08-13T14:20:04-07:00
one slight complication here might be if the 'unhandled' tag references, or is referenced by, another (handled) tag.

If a standard eventuates, and software and online services implement it, surely part of it would be the requirement that they do not use 'tags' already defined, AND that any data they do not use is retained and exported, as required.
ACProctor 2012-08-13T14:31:31-07:00
I think one of us is misunderstanding the other Alex.

My sentence that you quoted was merely pointing out that if two records are linked (e.g. with an ID= or Link= attribute) then properly supporting one and not the other might be more involved than if the records were independent.

A field where I have used something vaguely similar to Louis's suggestion was in pre-Unicode character set conversions. If a target set did not support a particular symbol from the source set, then it was associated with an unused code, or code combination. That ensured a reverse translation would occur with zero loss, even though the symbol was not truly supported in the target character set.

That case is much simpler because all characters are independent entities. In a genealogical model, there will be linkages between different element types, so they're not independent. Consider a hypothetical case of a Person record with a ContactDetails record linked to it. If a target model supported Person but not ContactDetails, then I believe Louis was suggesting holding the ContactDetails part in a special 'unhandled' record type, such that it could be fetched back and used appropriately if a reverse conversion ever occurred.
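As a hypothetical sketch of that case (invented names again), suppose the incoming file has:

  <Person id="p1" name="John Smith">
    <Link ref="cd1"/>
  </Person>
  <ContactDetails id="cd1" email="jsmith@example.com"/>

A target product that supports Person but not ContactDetails would have to hold both the whole <ContactDetails> record and the <Link> that points at it in its 'unhandled' store, so that the record and the reference both survive a round trip; keeping one without the other would corrupt the data on re-export.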

Tony
ttwetmore 2012-08-08T08:12:44-07:00
No more long arguments from me on the family issue at this point. I'll just state that I disagree with many of Tony's points in his last message.
ACProctor 2012-08-08T09:29:44-07:00
I'm trying to catch up after a long vacation and was going back through this thread to see where we didn't agree Tom.

I don't want to get bogged down on the family-unit issue either, although it is a good case in point. I avoided it in STEMMA because I could see in my own data that it had no significance at all for several families, and was very hard to define for many others. For instance, father deserts, youngest children put in industrial school, wife has relationship with new man causing more children to be added to household (some being his previous ones and some being their new ones together), elder children leave, or add children of their own to the household, deserted wife dies, partner starts relationship with new woman, etc.

Turning the thread around, how would you have addressed the original criticisms of a standard that I quoted at the outset Tom?

My concept of a UGM is more the principle that would underlie any successful standard, as opposed to a specific standard itself, and I'm genuinely interested in how people see it working given the variety of internal models out there.

Tony
ttwetmore 2012-08-08T17:21:01-07:00
Tony,

You said "Turning the thread around, how would you have addressed the original criticisms of a standard that I quoted at the outset Tom?"

I looked back through this thread and could not figure out what you are referring to in this question.
ACProctor 2012-08-09T02:29:43-07:00
Sorry, 2nd paragraph in my original post Tom:-

"This conversation began with a suggestion that every vendor currently uses a different model, that no one wants a model that is effectively a straight-jacket, and that it is unlikely that vendors (incl. content providers) will change their internal model to accommodate anything from a standards organisation."

Tony
ttwetmore 2012-08-09T03:46:50-07:00
Tony,

I agree. Note my requirement 2, which addresses this point. It is the same point that started Better GEDCOM.
AdrianB38 2012-08-09T15:15:27-07:00
Tony - when you said "the 3 examples you quote are not really part of the genealogical model of the world", I would agree with you. However, we need to be clear to people what that scope is because you can bet that someone WILL say - "I've got a to-do list in my genealogy software, so to-do lists must be part of the genealogical world."

When you say "the essence of the data should be transferred intact, and that it must be possible for the destination product to derive whatever subjective concepts it supports from that essence", then I'm sorry but I don't buy it. You can't go from the nebulous to the precise and then recreate comparable nebulousness.

Or take the Family concept - if my database has X and Y as "husband" and "wife" in a family, but I have no children, no marriage details (because they might NOT be married), no family events, no censuses, etc., no source even, for this, then what is the essence of that? Pretty thin, you would say - justifiably. But the fact is that it's there and it needs to be transferred via the BG file. I fail to see what the precise version of that can be. Or how such preciseness can then be reverted to nebulosity.

I am in total agreement that BG should try to eliminate vagueness - but there always needs to be that ability held in reserve because that's the data that's there in existence.

Moving on to your suggestion that "it is unlikely that vendors (incl. content providers) will change their internal model to accommodate anything from a standards organisation". There are 2 aspects to this -
- exporting to BG
- importing from BG
In the case of exporting to BG, we already have requirements about compatibility designed to cover this - doesn't mean they can't be refined and justified better. But to export to BG, I agree that there should be no requirement for them to change their internal model. Totally agree.

In the case of importing from BG, if they aren't going to change their internal models, then either their imports will fail or BG itself will have failed from the off. To attempt to meet "no change" for import from BG means that BG will only be able to have the common denominator that sits inside all known genealogical software. So we can't have multi-person events in BG because that requires a model change. We can't have data in BG to create proper citations according to Mills / CMoS / whatever because that would require changes to their models. In short, BG could not, by this requirement, be an advance on GEDCOM.

(I am taking your idea of 'no change in their internal model' to mean - no new tables, no new columns, no new keys, to put it all in relational DB terms.)

Certainly, we could discuss how to create a model for BG that would gracefully degrade to that for GEDCOM 5.5 (let's say that's the definition of all the models out there). For instance, multi-person events could be multiplied and assigned as individual events to all the participants - but that's not full compatibility, as one could never recreate the multi-person event on export; and even in the database after import, you've lost information, as this could be multiple events on the same day, not one single event.

I'm hoping I've misunderstood something somewhere, but right now I can't work out the practicalities...
ACProctor 2012-08-10T05:52:14-07:00
Re: "You can't go from the nebulous to the precise and then recreate comparable nebulousness."

I didn't mean to imply that Adrian. If the source product has a FAM concept then I assumed it would also have the relevant "precise" data to back it up, e.g. marriages, births, census, etc.... with citations. It's that data which I was suggesting should be exchanged.

If the target product has a FAM concept then it can infer it from that "precise" less-subjective data. However, if the target product has some very elaborate or better-thought-out alternative to FAM then it should be able to infer that instead. That's what I meant about the data being objective rather than subjective.

I can think of no evidence directly corresponding to the FAM concept. It is a conclusion, and a subjective one at that. However, the 'Roles' describing the part those ancestors played in the recorded events *are* evidence. In your X/Y example, there must be some evidence for X & Y existing, even if you have no marriage/birth/census details, so what was it? What evidence does your data cite in presuming they were husband & wife?

In a way, my UGM principle would allow a "superset" standard to be developed, along the lines Tom is arguing, but I think we [me & Tom] differ in the nature and scope of the data being exchanged. My focus was not in taking the lowest common denominator though - and certainly not a GEDCOM one - but an objective representation weighted in favour of evidence (what-we-found and where-we-found it), supplemented by inferences and conclusions.

Tony
AdrianB38 2012-08-12T09:07:04-07:00
Tony - you said your "UGM principle would allow a "superset" standard to be developed" - that's a relief.

Without wishing to labour the point though, it would be sensible to use words(?) like "superset" to make it clear what the meaning is. When you said "it is unlikely that vendors ... will change their internal model", I interpreted that as "no new tables, no new columns, no new keys" - I suspect that what you were really aiming for was "the possibility of new tables, new columns, new keys" but not altering existing ones in a way incompatible with their existing set-up.

Moving on... I think when we take the FAM concept, that we are still apart. If you were creating new software to handle new data input by a thinking human being, I _would_ be agreeing with you. (Though I might well also be saying, "A courageous decision Minister....") In such circumstances, the software could easily say - OK, hold on, what's the real evidence?

However, BG has to be able to export all the data in all the BG-compatible databases, no matter how poor it is. So, it must be able to support conclusions (such as FAM groupings) that have NO evidence whatsoever. In reality, of course the evidence exists - but it's probably stuck inside someone's head - if you ask them why a couple are married, they might say, "Of course they were - Aunty Nelly told me they were..." Which is less than ideal evidence (though we accept vaguer) but, more to the point, it isn't even in the database. Aunty Nelly's FAM couple have to be exported as valid BG, without manual intervention. We can't concoct a marriage between them because we have no evidence that they were - although I guess a valid option might well be, "Marriage - status unknown", where "unknown" includes the possibility of "never married".
ACProctor 2012-08-12T12:59:44-07:00
Re: "the possibility of new tables, new columns, new keys" but not altering existing ones in a way incompatible with their existing set-up"

On the contrary Adrian, I meant what I said - those existing products should be allowed to keep their existing internal models, and their associated investment.

I saw the UGM as a principle related to data external to those products such as when it is being exchanged with a different product. Hence, no new tables at all. The suggestion of a "superset" standard was so that it could exchange data between disparate products, each of which would have a different internal model.

Ideally, if the resulting standard data model was a good one then someone might adopt it at the core of some new generation of product. That should not be mandated for all the participating stakeholders supporting a standard model though.

Tony
ttwetmore 2012-08-12T13:13:06-07:00
Tony, Adrian,

AncestorSync solves this problem already.
ACProctor 2012-08-12T13:53:17-07:00
From what I know of AncestorSync, it is not a genealogical model but a "vehicle" for transporting the data around.

I believe it also deals with pedigrees rather than generalised family history, and I know for certain it cannot support my own data because of its very different emphasis.

I don't believe it's the sort of "superset" I meant but maybe you can convince me otherwise Tom :-)

Tony
AdrianB38 2012-08-12T14:17:37-07:00
Tony said - "On the contrary ... no new tables at all."

Err - in that case, I'm back to bafflement. If the existing products don't alter their models - what is the point of BG? If BG in a UGM form is the logical equivalent of the existing stuff - how can there be any advance on GEDCOM?

Adrian
ACProctor 2012-08-05T04:48:01-07:00
No, there's a little more to it than that Tom, including the general perception of what it is we're trying to achieve. We [FHISO & BetterGEDCOM] will have to convince people that our goals are achievable, and to their advantage, even when they have important financial commitments related to their existing products.

I feel you may be focusing more on the nuts & bolts, or the mechanics of a UGM, rather than my sometimes-waffly notions. I would summarise the salient points as follows:

1) The UGM must be a generalised representational model that can be used to exchange data between different products, and without their internal operating models having to change.

2) The UGM must be an objective representation of the data, possibly to the exclusion of explicit subjective concepts like a "family unit". Those subjective interpretations often distinguish the available products, and so they must retain the ability to make them from the objective data.

3) As a representational model, the UGM must describe what is actually in the data being exported, and must not constrain any product/vendor to have all categories of data available, or to subscribe to any particular methodology.

As far as E&C, reasoning, narrative, etc., are concerned, I believe our thinking is very aligned.

Tony
ttwetmore 2012-08-05T08:21:15-07:00
Tony,

Your point 1 is a restatement of my requirement 2.

Your point 2 uses the terms "objective representation" and "subjective concepts", which are too loosey-goosey for a nuts and bolts guy. We want to record the evidence we find, where we found it, and the conclusions we made based upon it. Is there anything more to model? By objective representation I think you mean we need our models to allow us to record all pertinent types of data accurately. This is my requirement 3.

Quick aside on the family. Genealogists want to record information that is important to their concepts of genealogy. The family is a critical concept to most genealogists. Is it subjective because it is hard to define? Must models shy away from critical concepts because they are hard to define?

Your point 3 is a tautological comment on the nature of models, followed by another restatement of my requirement 2 with a bit of my requirement 3.

UGM is a fine restatement of requirements and points that I believe are well understood. My fear is UGM being used as the justification for another restart from zero. Though FHISO will do a full restart anyway since that is the nature of all new organizations. The well-known NIH syndrome. Dare I guess you are setting the groundwork for that FHISO restart?
louiskessler 2012-08-05T08:40:50-07:00

I'm very much enjoying the discussion.

Tom. I'd like one clarification. You said: "I simply want conclusions to be able to refer to the evidence they are based upon, and evidence to be able to refer to the sources they were extracted from."

I don't see this as a two step reference. I see it as one step, as in: "I simply want conclusions to be able to refer to the evidence records they are based upon."

I think of source records as information. I think of evidence (or evidence records) as being a source record once it is used as the basis of a conclusion.

So to me, there is no evidence record that is separate from a source record.

Do you have a different idea about this?

Louis
AdrianB38 2012-08-05T08:51:12-07:00
Tony - thanks for your principles, which describe a UGM a bit better for me. Some comments:

1. "UGM must be a generalised representational model ... to exchange data ... without their internal operating models having to change"
Surely this has to take account of direction, and the data in question has to be scoped?

Re scope - if package SPQR happens to have an expenses list, a contact system and an email client then at least 50% of those three items are going to be excluded from the scope of a UGM because the G stands for genealogical. (My saying "50% of 3" indicates we should _not_ be discussing whether expenses lists etc come under the scope of a UGM). So SPQR will not be able to exchange that particular data, I suggest.

Re Direction - It seems to me that if SPQR has data that is in scope for the UGM, then yes, the principle ought to be that it is all exportable _from_ SPQR into a UGM conformant exchange facility. (Not sure how we ever get to confirm this...) However, the reverse is not true. If, by some miracle, our UGM conformant exchange facility has the ability to store ESM compliant citations (again, the issue is not whether such a thing is desirable or possible) then how can SPQR, assuming it to be a non-EE app, import those items from the UGM _without_ altering their data model??

2) "The UGM must be an objective representation of the data, possibly to the exclusion of explicit subjective concepts like a "family unit"." I am sympathetic to those who would replace nebulous concepts by concrete things like births, explicitly stated relationships, co-residency, all of which can be sourced from documents. But, the fact remains - if I have such a nebulous concept in my database, I do _not_ want to lose any nebulous attributes of my nebulous concepts on exporting. If nebulous stuff can be automatically mapped to non-nebulous data, then fine.

3) "must not constrain any product/vendor to have all categories of data available" - not sure at this point what SPQR is then to do with a UGM file that contains data that it does not 'know about'. Is this covered by "must describe what is actually in the data being exported"? So the text might be there after import, e.g., (perhaps not even in English), so it would be human readable, just not readable? Sounds OK to me.

"must not ... subscribe to any particular methodology" - agree with this. As an instance, the UGM should not confine itself to the 5 blocks of the Genealogical Proof Standard, nor to the 11 blocks of ESM's "QuickLesson 8: What Constitutes Proof?". Nor should it require SPQR to implement DublinCore interfaces as another instance (not sure quite what I mean by that but this is only about examples... By the way, tongue in cheek, shouldn't that be DublinOhioCore?)

On the other hand... There is a means of describing all genealogical data, regardless of data models, methodology or whatever... It's called SQL. It just happens to describe _any_ data and its degree of abstraction is, shall we say, large.
GeneJ 2012-08-05T10:06:48-07:00
Tom wrote, "Though FHISO will do a full restart anyway since that is the nature of all new organizations."
https://bettergedcom.wikispaces.com/message/view/Data/55626940#55652680

Not intending to move the discussion OT; hoping only to correct a possible mis-impression. The issues of FHISO beginning work to develop a standard are a little complex, but I've not before seen reference to "full restart" because it is a "new organization."

Heck, the organizers have worked pretty hard so that the community wouldn't have to face a "full restart."

As to the organization structure: Because it is modern, trustworthy, has the best adoption outcomes, and betters the odds of interoperability, the organizers worked to develop the framework for a multi-stakeholder modeled organization. We didn't start from scratch—we worked from some of the most accepted of these models—ANSI, HL7, ISO, NISO, IEEE, etc. We also were not blind to the challenges of the marketplace—diversity, and for some stakeholder groups extreme diversity.

As to the standards process: It's not only possible to transition work on a model that has already progressed into the multi-stakeholder process—the organizers have twice developed plans for just such a transition. Said another way, the organizers don't see any need to have to choose between the rock and hard place (rubber stamp or start from scratch).

Questions, input, suggestions, etc. on the multi-stakeholder structure and process are always welcome. fhiso@fhiso.org
ttwetmore 2012-08-05T10:51:14-07:00
Louis,

I think the difference is our old argument.

I see source records as only holding information about where evidence comes from. I see evidence records as holding the evidence itself. E.g., for me, the most important type of evidence record is the persona. I think you prefer to keep what I would put in a persona in the source record.

You see source records as holding both where the evidence comes from and what the evidence is.

So I see our difference as kind of a lumper/splitter issue. We both want the same info in our databases. You lump what I consider two kinds of info into the same "layer". I split them into two "layers." It's like MVC in software design. Apple's Cocoa interface keeps all three separate. Java's AWT lumps two together. In sum I think we would both keep exactly the same info. I like packaging the evidence into personas because I believe they will prove a useful concept.
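A hypothetical sketch of the difference (invented element names): in my split, the evidence about the person is its own record pointing at the source,

  <source id="s1" title="1850 US Census, New London"/>
  <persona id="e1" source="s1">
    <name>J. Smith</name>
  </persona>

while in your lump the same information sits inside the source record itself:

  <source id="s1" title="1850 US Census, New London">
    <name>J. Smith</name>
  </source>

Same information either way; the difference is whether the evidence is a separate record that can be linked to, matched, and layered.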

How does that agree with your thinking about our differences?

Tom
louiskessler 2012-08-05T11:09:48-07:00

Tom,

Thanks. That makes total sense.

Louis
ttwetmore 2012-08-05T19:41:48-07:00
My comments concerning a restart have nothing to do with FHISO's organization or platitudes. They have to do with the technical work that may be done if FHISO decides to design a genealogical data model. With GEDCOM-X now the center of mass, it is not clear if it makes sense for FHISO to take that path. It would struggle to play catch-up if it did. If it takes that path then my comment about restart applies. If FHISO doesn't take that path it's not clear whether it has a remaining raison d'être. I suppose it could take a certification and recommendation role.
GeneJ 2012-08-05T20:40:25-07:00
Again, not intending to move the discussion off-topic. FHISO is a multi-stakeholder platform where competitive interests gather, but the organization itself is an independent platform--it was never established with the intent of competing with FamilySearch. There has never been a race to the finish line for FHISO, and I believe FamilySearch has known that as long as we have been communicating with them.

Because FHISO is independent, we wouldn't be setting out to do what FamilySearch is doing.

You wrote, what is the "remaining raison d'être"--that may be another way of suggesting there was a race to the finish line, and there wasn't. The multi-stakeholder model IS a raison d'être, though you seem to think of it as platitudes. With independence do come some other options, at least one of which you mention.
ACProctor 2012-08-08T02:15:48-07:00
Thanks @Adrian.

1) I did mention the issue of scope for the data at the end of my original post. However, the 3 examples you quote are not really part of the genealogical model of the world. Although contact information would likely be accommodated as part of some attribution scheme, an expenses list or "To Do" list is part of your private working data. Although a lot can be held in narrative (which, as you know, I very much favour), it is unlikely that a target product would accommodate such private working data in any structured way.

2) The issue with a FAM, or family unit, is that it is purely subjective. Concepts like that can be transferred intact from one database to another when the databases are hosted by the same product. However, if you're going between different products then it becomes inappropriate. The destination product may not have that concept, or it may have a much-refined version of it (i.e. with sub-categories), or it may have subjective concepts of its own which the source product did not support.

What I'm trying to say is that the essence of the data should be transferred intact, and that it must be possible for the destination product to derive whatever subjective concepts it supports from that essence. In the particular example of a family unit, the essence of the data would reflect the births, marriages, divorces, separations (e.g. from newspapers), and census records with corresponding roles. Whether a family unit can be inferred from that data depends on the product loading it, and it would be wrong to force-feed subjective concepts from a different product.

Tony
Andy_Hatchett 2012-08-02T14:50:10-07:00
Great article Tony,
Much to think about here.

Andy
AdrianB38 2012-08-02T15:25:21-07:00
Tony,
I am intrigued by your post but concerned by the degree of abstraction that might be involved when you suggest not modelling real-life.

I've thought that BG proceeded quite happily while we were all talking about real life. OK, there was a debate over what Family was - did it include Granny or children who had left the nest? but that reflects the fuzziness of real life - which may, of course, be your point.

But then we went into talking about modelling evidence and conclusions and the world fell apart. Leaving aside the issues of how evidence was to be implemented (because Louis and Tom disagree, I think, over the implementation rather than the pure logical model), the concept of modelling conclusions as a tree based on evidence from previous conclusions was too much for some folks. If 2 people turned out to be the same person, then there should be 1 person in the database, not 3, for them. Whereas those of us used to seeing data looking totally different from outputs, couldn't see what the fuss was about. Please - this isn't a discussion about evidence/conclusions and nor am I being derogatory about those who were put off by the concept. That's just how some of us are. I spent many years in maths working with the square root of minus one - they're called imaginary numbers for a reason. I can cope with this abstraction. Others simply can't. But, I am worried about how much abstraction a UGM might need. After all, the GenTech assertion concept seems to be close to the sort of thing that a UGM might need and that proves hugely difficult to get over. Even the GenTech documents failed to do so!

The idea of just doing data, not process, is, again, laudable. But my problem there is that if we are not modelling the real world - what are we modelling? And I can't help but respond - we are modelling the process of genealogy, and I find it hugely difficult to get to the data without invoking processes to define it. Which we're trying to get away from. That might just be me... I have this horrible feeling that all we will end up with is a model that essentially contains NO genealogy at all but could equally describe any deductive process.

Please don't take this as being negative - these thoughts are there to be overcome because, while I can be as theoretical as the next Honours Graduate in Maths, I kinda need a (meta)physical crutch to point me in the right direction first....

Adrian
ttwetmore 2012-08-03T05:08:06-07:00
Tony,

I don't agree that genealogy does not model real life. That's its whole purpose in my opinion, possibly restricted to BMD in its minimalistic form. I also insist that we represent evidence, but don't we all? Conclusions are statements about real life.

It seems to me your message boils down to this: we need to represent evidence. I believe most of us agree. But we need evidence solely so we can make conclusions about real life. And we need to represent that real life in our data model via our conclusions.

Here are the original requirements I set forth early in BG's history.

1. The syntax of the Better GEDCOM files shall be a non-proprietary format (e.g., XML, GEDCOM, JSON, …, or a custom application specific format).
2. The data model that underlies Better GEDCOM must be a superset of the models used by existing genealogical applications to the fullest extent deemed possible during design.
3. The data model that underlies Better GEDCOM must provide a set of data entities that will allow genealogical applications to support all conventional genealogical processes.
4. The character set used by Better GEDCOM files must be UTF-8 encoded Unicode.
5. Better GEDCOM files may contain references to external information that may exist as URIs in cyberspace or in container files that accompany the Better GEDCOM files.
6. Better GEDCOM must not impose restrictions on field lengths or value formats except as deemed necessary during design.
7. Better GEDCOM must provide a means to mark-up text that is used in contexts that allow unstructured text (e.g., notes).

Requirement 2 addresses the overall problem you are addressing -- getting the industry to use a new standard. By designing a model that is, as far as possible, a superset of the models that all members of the industry are using, you maximize the possibility of adoption.

Requirement 3 addresses the processes issue. It does NOT state that the model must represent the process. It says the model must support processes. Clearly by storing evidence and sources, and supporting conclusions that rely upon arguments based on evidence, we get that support. This requirement in my mind is the one that requires evidence to be included.
ACProctor 2012-08-04T01:01:32-07:00
I knew this was going to be a subtle point, but I had to write something because I sense that those criticisms levelled against a standard are not uncommon.

@Adrian, the idea of a family unit is probably a good example. In the internal operating model of a product, there may be a specific concept of a 'family unit'. While I believe this is very subjective, and not the way I would go personally, it is the choice of the designer and vendor of that product. However, when that data is passed to another product, you need to put more emphasis on the description of the genealogical world, which as I said is one level removed from the real world.

It would therefore reflect the evidence, plus the conclusions, although my previous description is more accurate: "what we found, where we found it, and what we infer from it". For instance, it might represent events at which the members of the group were present together, and provide a system of roles to describe their participation in that event. You might be asking 'Ah, but isn't a family unit a conclusion?'. Well, that's true, although a conclusion about a BMD event with scant evidence is easier to describe than a 'family unit' which is as much a subjective interpretation of the evidence as a conclusion.

@Tom, I didn't quite say that 'genealogy does not model real life'. What I tried to do is distinguish the genealogical view of the world (which involves evidence and conclusions) from the real world itself (which is very uncertain). Even if we were personally present at some event, we cannot say with 100% certainty what transpired, what the relationships were between different people, what they were thinking, and the context that led to that event. The essence of what I'm saying is that we need to accept the difference and ensure we describe the data in an objective way. It should be possible for the various products to make their own subjective interpretations when they import such data, and so preserve their internal operating models.

Describing what-we-found and where-we-found-it is relatively straightforward. Describing what-we-infer-from-it probably needs a little more thought. I am against any sort of rigid, logic-inspired, step-by-step conclusion model since that could constitute a straitjacket in the eyes of vendors. While some chaining between evidence and conclusions is essential, chaining between conclusions and other conclusions could be accommodated but should not be mandated. I personally believe the conclusions and the reasoning that led to them should be distinguishable, but I favour a narrative approach to the description of the reasoning.

Tony
ttwetmore 2012-08-04T18:25:52-07:00
Tony, I think I understand the distinctions you are making, but I don't feel that they make any real difference in the design of a genealogical data model. Other than maybe arguing for a surety tag sometimes. I think I can sum up your points by saying that evidence and conclusions are fraught with uncertainty, and one should take that uncertainty into account.

At the level of evidence I consider what-we-found and what-we-infer-from-it to be a distinction that very rarely matters. I interpret this to be things like: the evidence says NL and we infer it means New London, Connecticut, because of the nature of the evidence. Or we see something like 4/3/72 and we interpret it to mean 3 April 1872, or we see something like "do" on a census form so we pull down the value from the line above. I don't think most genealogists would really care about being sticklers about these things, but if they were, then they should always have a note capacity to transcribe as precisely as possible (within the strictures of Unicode) what the original value of the actual carbon marks on the paper really seemed to be.

For example, the death record for my gg grandfather says he was born in Yarmouth, New Brunswick. He was almost certainly born in Yarmouth, Nova Scotia. I have recorded his birth place as Nova Scotia, but made a big note about the presumed error in the evidence.
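In a model, that might look like this hypothetical sketch (my own element names, not from any spec):

  <birth>
    <place value="Yarmouth, Nova Scotia">
      <original>Yarmouth, New Brunswick</original>
      <note>The death record says New Brunswick, almost certainly an error for Nova Scotia.</note>
    </place>
  </birth>

keeping the actual carbon marks on the paper alongside the interpretation.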

I make no insistence on the actual data demonstrating the results of any logic-inspired, step-by-step process. I simply want conclusions to be able to refer to the evidence they are based upon, and evidence to be able to refer to the sources they were extracted from. Note that I don't insist that the model enforce either of these linkings. The DeadEnds model allows conclusions with no evidence, and conclusions and evidence with no sources.

I also favor the narrative approach for expressing conclusions. A conclusion should, in my opinion, be able to refer to the evidence records used in making the conclusion, and contain a narrative that explains how the evidence was interpreted and used to justify that conclusion.
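As a hypothetical sketch (invented element names, not the actual DeadEnds syntax), such a conclusion might be recorded as:

  <person id="c1">
    <name>John Smith</name>
    <basedOn ref="e1"/>
    <basedOn ref="e2"/>
    <narrative>Personas e1 and e2 give the same spouse's name and the same
    street address, so I conclude they refer to the same man.</narrative>
  </person>

where the basedOn links and the narrative are both optional, since the model does not enforce the linking.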