DeadEnds Data Model | DeadEnds XML Model | DeadEnds Custom Import Function | DeadEnds Date Formats

Comments

ttwetmore 2010-11-23T08:30:36-08:00
DeadEnds Data Model: General
I've put the second version of the DeadEnds model specifications on my web page and put the link on this wiki page. This is the first update to the specifications in a decade. I have used the model in a few software systems over the years. The model fully supports the evidence and conclusion research process that I believe must be at the heart of any professional-level genealogical software system. It is the testbed model I am using for experimentation with nominal record linkage combination algorithms.

I will edit the wiki page with more info about the classes as time goes on.

I am proposing this model as a candidate model for BG.

Tom Wetmore, Chief Bottle Washer, DeadEnds Software
greglamberson 2010-11-26T18:50:19-08:00
Tom,

I am still reading your specification. The thing that interests me initially is your use of roles.
greglamberson 2010-11-28T10:36:46-08:00
response from "participant" as a Bridge
original thread: https://bettergedcom.wikispaces.com/message/view/Event+entity/30807433?o=20

Let's see if this works...

Tom,

While your model is a drastic improvement, here are the two things I have a problem with right now:

1. There is no dependency tree with your hierarchical linkages.
2. There is still no distinct evidence and consequently no distinction between evidence and conclusions.
hrworth 2010-11-29T05:47:31-08:00
Adrian,

I buy that. The question is HOW.

What "fields" are needed? What are their attributes? What to the attributes mean?

I have an Evidence Record. It contains these attributes, which mean ......

If that field is present, what should be done. If that field is absent, what should be done.

Isn't that what we are trying to do?

Russ
greglamberson 2010-11-29T12:13:40-08:00
testuser42 said, "As far as I get the DeadEnds model, there is a simple way to find out if a record is evidence or conclusion: Look at the things it has references to!"

Everything I am looking at is remarkably similar to this model. My own sketches look remarkably similar to this model. I'm not saying such definition can't be done. I'm saying it simply needs to be done. It's not done yet, but it's not some insurmountable task.
ttwetmore 2010-11-29T13:46:21-08:00

Russ recently wrote: "Thank you for your detailed reply...In my opinion, all of that good stuff belongs in the application or program to End User...What needs to be sent to another EndUser, using the same or different application / program is the DATA, not how the data was entered, nor presented to the end user and any steps in between...If I add a New Person to my database and share that file, including the new person, what information has been added with that new person should be sent along, no matter what state it is in, between the End User and the Program....I should add, that what you are talking about may be helping the end user do family research in a more disciplined way, which is good, but the BetterGEDCOM is not about that. It's about the sharing of data."

The DATA (your emphasis) is everything in the BG model. If that model holds evidence data, which I hope it will, then that is part of the data to be shared. I agree with your idea that the way the data was entered is of no interest to another system, but all the data objects and all the relationships established between the data elements should be. So all those steps I went through that a genealogist might do, true, those steps aren't shared. But all the records and all the connections between records that were created by the user when going through those actions, are shared.

I don't understand objections to discussions about how applications would use the data, since (for me) it is those discussions that lead to an understanding of what the data should be. I believe we must imagine and experiment with how we want to work with our data, in order to derive the full set of requirements on the data. So we talk about what we want our applications to do for us so we can understand what that says about the data that the applications will manipulate.

We just got a great picture showing example source records, evidence persons, conclusion persons and a conclusion event. This shows a snapshot of how some objects from my DeadEnds model can be used to hold data representing sources, evidence and conclusions. My hope is that thinking about how this information is handled by the DeadEnds model will lead to two places: 1) help the BG effort think about the model they want to support, maybe something simpler or more complex than the DeadEnds model; and 2) demonstrate the existence of a model that does indeed handle the evidence and conclusion process in a reasonable way.

It could still be the case that BG ends up only wanting to handle the conclusion world, that is, truly be a better GEDCOM, which seems to be what the "21 German Genealogists" effort is. That is a laudable goal, but so much more seems possible to me, and listening in here on what others are saying, I think many others want BG to solve a richer problem also. I have thought about what a better model for genealogy would be like for twenty years, without ever having any hope that those thoughts would lead to anything other than a clean architecture within my own software programs. I am idiosyncratic and iconoclastic in my own work, which means I'm blind to all kinds of other approaches and ideas. So I don't want to foist DeadEnds on BG as a solution. But I am selfish enough to believe there are some things in the DeadEnds model that apply to BG and might help BG decide what direction to take.

I get the sense that I am starting to repeat myself (again!).

Tom Wetmore
AdrianB38 2010-11-29T14:50:53-08:00
To try to help people understand what evidence and conclusion persons are, I've rewritten a page in my "Data Model Workspace" for want of a better term.

See
http://bettergedcom.wikispaces.com/evidence+vs.+conclusions

It only attempts to define evidence and conclusion persons, not evidence and conclusion events since I've not sorted them rigorously.

The page refers to Elizabeth Shown Mills's Analysis process because I feel that using that process allows me to explain better what evidence and conclusion / working hypothesis persons are.

I sure hope I've not contradicted Tom but maybe the slightly different perspective may help some.
hrworth 2010-11-29T16:24:32-08:00
Tom,

I am not objecting to the discussion. The question I keep asking myself is what belongs between the End User and the Application, and between the Application and the BetterGEDCOM. I suspect that the difference in our "sides" of the discussion is that you are, I am guessing, an experienced programmer. I am a simple End User. Clearly, if that is what we were to accomplish (user to application), then you are right. But we aren't. Not that the understanding of how stuff is done isn't important, but I think we are at least one step ahead of ourselves.

Or - and this is probably the wrong approach, but let's start with what we "know". That is the GEDCOM format. We know it is broken. OK, now, how can we make it better? I am guessing that you already know how to fix the broken GEDCOM. That's cool. But we need some other players at the table before we start developing the application.

Why doesn't the Old / Broken GEDCOM work?

What are the reasons why it doesn't work? Or, from a term I used to know, what is the Root Cause Analysis of why it doesn't work?

If you know that, please share that.

You probably have this whole thing worked out. But we have many End Users, Software Vendors, and Experienced Genealogists that we need to hear from.

That's all I am trying to say.

Russ
greglamberson 2010-11-29T16:56:32-08:00

Russ, Your questions certainly highlight what I believe: We don't have our goals developed enough.

I personally don't think we have to solidify our goals before discussing different data models and their implications, but I totally agree we're not done with our goals. From my standpoint, we don't even have the basics down yet.

Anyway, I'll go stir up some trouble on the goals pages.
ttwetmore 2010-11-29T19:46:35-08:00
Russ,

I don't understand what you mean about being between things, but I don't believe we're a step ahead of ourselves, and wonder why you think we are.

We know GEDCOM is broken because 1) it does not cover a large enough genealogical model to transport all the info that should be transported; 2) it is misinterpreted and extended differently by every program that claims to support it; and 3) nearly every GEDCOM export and import operation by every program is a lossy transformation. But haven't we said these things over and over and over? Isn't this what we "know" and haven't we already started there? By working on BG we are working on making it better.

I don't have it all worked out, but I have thought about it a lot. If there are others to hear from this would be a good time for them to speak. Are you worried that things are going too fast and that might cause mistakes?

You are an end user. What should end users be doing to help BG? I think you should be describing the processes you want your genealogical applications to support, the requirements you want your program to meet, the kinds of data you need your programs to handle, and what you want to be able to do with that data. Only by understanding the requirements of users will we be able to build a data model that can support their requirements. You seem to not want to talk about applications and processes and only talk about BG's role in sharing information. Maybe this is getting a step ahead of ourselves.

Yes I am a software person with many too many years of experience. Genealogy has been my hobby for 22 years and I've written lots of genealogical software. The genealogical application I have used since 1990 is LifeLines, a UNIX-based program I wrote that uses GEDCOM as its native database, so I have been dealing with and working around the limitations in GEDCOM for 20 years. The main limitation in GEDCOM from my point of view is the lack of an event record. Beyond that there are lots of little things that could be added to properly support the evidence and conclusion model (INDI records, with a little tweaking, can serve as both evidence and conclusion records).

The DeadEnds and eventual BG models could easily be represented in GEDCOM syntax, so truly be a Better GEDCOM, but I have explicitly avoided that suggestion because during the past 20 years XML has become the next best thing to a religion, and to suggest that anything besides XML be used to represent hierarchical text-based data is now heresy, and I don't wish to bring the inquisition down upon us.

Tom Wetmore
hrworth 2010-11-29T20:25:49-08:00
Tom,

So, what you are telling me, as an End User is to go talk to my software developer.

OK I will and have been for 10+ years.

In the meantime, please send me a GEDCOM file from DeadEnds with all the bells and whistles.

Thank you,

Russ
ttwetmore 2010-11-29T21:45:46-08:00
Russ,

If I wanted you to talk to your software developer I would have suggested you do that. I don't think that would do anybody any good.

What I said was that end users could help the BG process (and here I quote) by describing the processes you want your genealogical applications to support, the requirements you want your program to meet, the kinds of data you need your programs to handle, and what you want to be able to do with that data. I was suggesting that you might tell us these things here in these discussions to help us understand the data models that would be needed for you to get what you need.

Tom Wetmore

hrworth 2010-11-29T21:51:53-08:00
Tom,

I am not qualified to do that. The folks qualified to make those types of requirements are the Scholarly Genealogists. I am a simple user, trying to share my research with another User. The folks who know how this stuff should work have not been posting too much on these discussions. I only know a few of them, and they aren't here yet.

But, thank you,

Russ
Andy_Hatchett 2010-11-29T22:02:13-08:00
The root cause of GEDCOM's failure was mainly that it was so loosely written that anyone could interpret it pretty much any way they pleased.
louiskessler 2010-11-29T22:47:37-08:00
For basic data (people, births, deaths, marriages, dates, places, notes, and simple sources and citations), GEDCOM was not a failure at all, but a HUGE success. It has been used as input and output by just about every genealogy program, whether desktop, online, or handheld (there are hundreds of them) for the past 20 years. Very few other "standards" have come close.

What we are looking at is not fixing the 90% that works. We are trying to bring GEDCOM to the 21st century (Unicode, XML) and add the 10% that is missing.
ttwetmore 2010-11-28T18:08:06-08:00
Greg,

I will start backing off. I have tried very hard to explain a genealogical data model that I believe will cover the evidence and conclusion process with a limited number of what I believe are easy to understand concepts. I have tried to define my concepts in both the discussion notes I've written and in the documents I have made available. It is probably time for me to back away for awhile, to either let what I have written sink in, if it indeed is worthy enough to do so, or to just let my ideas fade away in favor of new and better ideas that will flow from the BG effort.

I now see more pending questions that ask the same questions that I have tried to answer in detail before. My ideas are either wrong or I am unable to explain things, or both. If it is not yet crystal clear what I mean by an "evidence person" or a "conclusion person" I think it is time to bow out for awhile and see if that lets others have freer rein for their own ideas.

But as I try to disengage I will try one more time to explain the difference between an evidence-person and a conclusion-person.

Evidence-person -- a computer record (in a file, in a database, flowing through the Ether) that represents everything you can learn about a person that can be extracted from a single item of evidence. In some cases all that the evidence-person record will contain is the person's name.

Conclusion-person -- a computer record (in a file, in a database, flowing through the Ether) that combines together, somehow, all the evidence-persons that the user/researcher/genealogist has come to believe, through whatever process has made sense to him/her, refer to the same real person who once lived.

In the DeadEnds model, which I created to be the basis of the software I am writing and to support the research process, I use the same data type to represent an evidence-person and a conclusion-person. There are different ways a software implementation could combine person records. I prefer a non-destructive approach so that the evidence-persons don't get removed when they are combined into conclusion-persons. The way the DeadEnds person object does this is by allowing person objects to refer to any number of other persons, which we'll call sub-persons, where all of these sub-persons are believed, by the user/researcher, to refer to the same real person from the past. Using the simple idea of persons referring to sub-persons one can build up simple or complex trees of persons, and of course, in many cases there may be no sub-trees at all.
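To make the sub-person idea concrete, here is a minimal sketch in Python (my illustration, not actual DeadEnds code; the field names name, source, and sub_persons are assumptions):

    def make_person(name, source=None, sub_persons=None):
        # A person record; sub_persons holds other person records believed to
        # refer to the same real person.
        return {"name": name, "source": source, "sub_persons": sub_persons or []}

    # Two evidence-persons, each extracted from a single item of evidence.
    ep1 = make_person("Thos. Wetmore", source="s1")      # e.g. from a birth certificate
    ep2 = make_person("Thomas T. Wetmore", source="s2")  # e.g. from a census entry

    # A conclusion-person that simply contains both evidence-persons.
    # Nothing is destroyed: ep1 and ep2 survive unchanged inside the tree.
    cp = make_person("Thomas Trask Wetmore",
                     source="note on why these are the same man",
                     sub_persons=[ep1, ep2])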

I have mentioned the earlier work called "nominal record linkage" that is the forerunner of this idea. This is the technique that was used by historians who wanted to reconstruct the family histories of villages from the information contained primarily in church registers. Each item in a church register, that is, each birth record, baptism record, marriage record, death record, burial record was treated as a separate item of evidence. From each item of evidence a set of evidence-persons was created (as well as the evidence-events that related the persons mentioned in the items). After all the evidence-person records were created the researcher applied a set of rules that allowed him/her to group the evidence-persons together into conclusion-persons. The technique was called "nominal record linkage" because the evidence-persons were called "nominal records," because, in most cases, the evidence-persons were nothing but a name! By applying these rules the historian reconstructed families, and from these reconstructed families drew conclusions about what life was like in the village during the times in question (marriage patterns, relatedness of spouses, sizes of families, life spans, mortality rates, etc). Early on, nominal record linkage was done by hand using slips of paper or index cards. Some time in the late fifties or sixties, some researchers started using computers to assist in the work. It is ironic that there were software programs from at least the sixties that dealt with the real genealogical evidence and conclusion research process, but that knowledge of what was done then, so germane to what we might want to do today, is just about lost. One of my goals for DeadEnds is to resurrect these techniques into the current environment.

Tom Wetmore
hrworth 2010-11-28T18:20:42-08:00
Tom,

Thank you. I am slowly understanding Evidence and Conclusion Person.

Is it safe to say that these two terms are in the way the User works with the Application?

You might see the Conclusion Person as you look at a screen that the End User is looking at. Looking at the screen I see all of the information I have on that person and it includes relationships and evidence.

The Evidence Person might be the first entry for a new person.

You might then enter a new Evidence Person, and then either determine that they are the SAME Evidence Person or a Different Evidence Person.

Do you then add additional information to that SAME Evidence Person that you first entered, so that person would then become a Conclusion Person?

Still trying to understand.

Thank you,

Russ
greglamberson 2010-11-28T18:28:34-08:00
Tom,

I get what you mean by those terms. My point is that in addition to defining the concepts, you've got to have rules they follow in your data model that make the distinction clear. Even if they're convertible from one to the other, you've got to define how that happens. You can't just say they represent whichever you like.

Honestly, I have avoided commenting on your model because I knew I wasn't ready to do so. I'm still not.

Basically if you could look at a person record and have it follow rules and have distinctive features that made it an evidence record or a conclusion, that would go a long way. I just don't see it yet.
Andy_Hatchett 2010-11-28T19:02:30-08:00
Tom,

Re: Sample data set

Might I suggest a well known set, such as the first 3-4 generations of Queen Victoria or perhaps the Kennedys.

That way most users would at least be partially familiar with the material and it would be easier to understand.

Just a thought.
ttwetmore 2010-11-28T19:40:30-08:00
Trying to respond to both Greg & Russ,

I am not a fan of rules. I don't think we can force users to follow many rules. When a user enters a person into a genealogical program, 98% of them won't be saying to themselves "now this is an evidence person" or "this is a conclusion person". They just add a person. Between you and me this is almost assuredly a conclusion-person, but the question arises: does the user need to know?

I think the best we can ever hope for is that a user will enter person data and also enter some source data and link the person to the source.

If a user then finds additional data about a person, and they believe the new data applies to the same person, probably 98% will then just go and add the new information to the same person, making that person even more of a conclusion-person than it was before. The user at that point will be hard pressed to make the source pointer make sense since now the data in the person comes from more than one source. This is the problem faced by GEDCOM all the time, so each event in a GEDCOM INDI record can have its own source. I don't like this idea, but we might have to put it in BG anyway to support users or programs that provide this working model. So far things are just like they are in every other program. The BG model should support this kind of action.

The regular user continues along. Let's say he/she decides there are two persons in his/her database that refer to the same person. He/she could, in the Gramps fashion, merge the two together. This pretty much means putting all the data from the two original persons into a brand new record and reordering the sub-parts so that the preferred data comes before what becomes alternate data. This is a destructive operation as the original two persons are removed as the new one emerges. The BG model should support this kind of action.

But now let's say the user could do something else. Let's say he could create a new person record that CONTAINS the two original person objects inside it. He/she still has to decide which parts of one sub-person and which parts of the other sub-person should be displayed as the combined person, but that's just a little software. In this way the user gets exactly the same apparent result, one person from two, but this is not a destructive process. You can think of the two sub-persons as just tucked or hidden away, but still available if the user decides to reverse the decision. This is exactly the idea implemented in the DeadEnds model. (Of course the DeadEnds model would still allow the user to do a destructive merge as well.) The BG model should support this kind of action.
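A rough sketch of what this non-destructive combine and its reversal might look like in code (again my own illustration under the same assumed record shape as the earlier sketch, not DeadEnds itself):

    def combine(persons, preferred=None, source=None):
        # Build a new person that CONTAINS the originals; nothing is deleted.
        # 'preferred' picks which contained person supplies the displayed values.
        shown = preferred if preferred is not None else persons[0]
        return {"name": shown["name"], "source": source, "sub_persons": list(persons)}

    def uncombine(conclusion):
        # Reversing the decision just hands back the tucked-away originals.
        return conclusion["sub_persons"]

    # ep1 and ep2 as in the earlier sketch:
    ep1 = {"name": "Thos. Wetmore", "source": "s1", "sub_persons": []}
    ep2 = {"name": "Thomas T. Wetmore", "source": "s2", "sub_persons": []}

    merged = combine([ep1, ep2], preferred=ep2, source="why I believe they match")
    originals = uncombine(merged)   # ep1 and ep2, unchanged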

So far little mention about evidence or conclusion. This is just a normal user doing normal things, adding data, combining persons when he/she thinks there's a duplicate.

Now imagine the other 2% of the users, real research oriented users who want to be sure they collect their evidence, their evidence remains clean, and the process of deduction they go through in finding ancestors is to be maintained somehow in their data. They do lots of things just like the regular users. First they enter person records for each item of evidence they find. Just like regular users they are entering persons, it's just that they know they are entering evidence-based data so their source pointers point to that evidence. Because they think of these persons as evidence, they never want to change the contents of the records after they create them, so they will never add more data to them. Now say the researcher evaluates the evidence persons they have in the database and they decide that two of them refer to the same real person. Just like the regular user they will combine them using the non-destructive approach, tucking away the two evidence persons inside the third person, choosing how to display the combined person from the other persons. A nice thing about this approach is how the sources get handled. The evidence persons have good source pointers going to their evidence and they never change. The combined person can have its own source pointer and the researcher should probably make that source be the reason he/she felt the combination was a valid thing to do.

Note that at this point the regular user using the non-destructive merge and the researcher using the non-destructive merge have both done exactly the same thing. From the regular user's point of view he/she's just gotten rid of duplicate information. From the researcher's point of view he/she has made an important conclusion about the data he/she's been collecting.

Now let's imagine a third kind of user, probably very much like the few of us on the wiki. We want to be lazy and just enter data for the people who are close to us in time, because they or our family can easily supply all the information, and we just enter it all in quickly. We start with our parents and just create their records. Yeah, they're conclusion records but we don't care, we just want to get on. Maybe we're real lazy and don't even put in a source, or maybe we do put in a source but that source boils down to "because I know them, okay?" But now, let's say we are five generations back in time and we are finally doing "primary research" of our own, searching record sources, collecting data, and so on. At this point we know we have to start doing things right. We don't know who our ancestors were at this point, so we have to reconstruct them from the evidence we gather. So at this point we start using the evidence and conclusion process in earnest. We do know what kind of person records we are creating at each step.

I'll stop there with the examples, but you could imagine it continuing on with more persons added.

Are there many rules in here that the user must follow or that a software program must enforce? I just don't feel this whole process is very rule bound. If the user is a researcher following a strict evidence to conclusion research model, maybe the programs could be configured to help him do it, but basically the only operations he will be performing are creating records, sometimes combining them in non-destructive ways, and sometimes splitting them apart. Yeah, I guess there are rules to this process, but how they apply as being restrictions that BG must deal with I don't see.

If I had to write down some rules they would be something like this.

1. Users can create person records (and other kinds) that must adhere to the structures allowed by the data model.
2. Users can edit person records (and other kinds) in ways that must adhere to the structures allowed by the data model.
3. Users must be able to combine person records (and other kinds) in ways that can be destructive or non-destructive of the records being combined.
4. Users are encouraged to maintain accurate source pointers from their person (and other kinds) records to source records, but the existence of and accuracy of these source pointers are not required.
5. Users must be able to tell when they are dealing with person records that resulted from non-destructive merges and be able to unmerge the original person records.
6. Programs can suggest through the user interface how the user can follow good practices, but in no way force the user to follow them.

I don't know; these sound pretty random, but I can't come up with anything very serious at this point.

Tom Wetmore
testuser42 2010-11-29T03:28:37-08:00
Tom, thank you very much for this post.
It's very important to think of the ways users will work and enter data. I agree with all of what you wrote.

There's a big job ahead for the people who write genealogy software to provide the tools that help people get better at research and documenting their sources and thought processes.
I believe there could be a solution that has a user interface for entering persons just like most software today, only that the things entered are treated more carefully. So that even if the user doesn't realize he's actually entering an evidence record or editing a conclusion person, the software still handles the data accordingly.
testuser42 2010-11-29T03:44:30-08:00
Greg, you said:
"Basically if you could look at a person record and have it follow rules and have distinctive features that made it a evidence record or a conclusion, that would go a long way."

As far as I get the DeadEnds model, there is a simple way to find out if a record is evidence or conclusion: Look at the things it has references to!
If there is only ONE ref that goes to a SOURCE, then it's an evidence record.
Anything else is a conclusion of some sort, derived from the things that are referenced in it (sources, persons, events).

So there's no immediate need for a flag of any kind, even if it might make it more clear at first sight.
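As a tiny illustration of this heuristic, here is a sketch (mine, using the record shape assumed in the earlier Python snippets, and bearing in mind testuser42's own refinement a couple of posts below that an evidence record may also reference its evidence event and participants):

    def looks_like_evidence(person):
        # Heuristic only: no sub-persons plus a single source reference suggests
        # a record extracted directly from evidence; anything containing
        # sub-persons is a conclusion derived from them.
        return not person.get("sub_persons") and person.get("source") is not None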
testuser42 2010-11-29T04:01:14-08:00
Russ, you asked
"Do you the add additional information to that SAME Evidence Person that you first entered, so that person would then become a Conclusion Person?"
I would say yes to that. That would be a good possibility for a user interface that is quick and easy. The user does not need to know that internally there's a new "conclusion" person being created (and at the same time, probably a new evidence-person, too, for the new information you have).
testuser42 2010-11-29T04:05:16-08:00
"If there is only ONE ref that goes to a SOURCE, then it's an evidence record."
oops, I think there would be more references in an evidence record: an event would ref the source and the participants, a person would ref the source and the event.
Tom, could you shine some light on this?
hrworth 2010-11-29T04:27:19-08:00
Tom,

Thank you for your detailed reply.

In my opinion, all of that good stuff belongs in the application or program to End User.

What needs to be sent to another EndUser, using the same or different application / program is the DATA, not how the data was entered, nor presented to the end user and any steps in between.

If I add a New Person to my database and share that file, including the new person, what information has been added with that new person should be sent along, no matter what state it is in, between the End User and the Program.

Russ
hrworth 2010-11-29T04:29:23-08:00
Tom,

I should add, that what you are talking about may be helping the end user do family research in a more disciplined way, which is good, but the BetterGEDCOM is not about that. It's about the sharing of data.

Russ
AdrianB38 2010-11-29T05:05:30-08:00
Russ said: "BetterGEDCOM is ... about the sharing of data"

The thing is, Russ, not merely do I want to record evidence and conclusions / current working hypotheses in my files inside my application program, but also if I transfer data to someone else, I probably want to share with them all my logic and my interim conclusions. (If I didn't I'd probably just run off an unsourced narrative report)

So for me, it's important that the BetterGEDCOM file that I share with someone does contain all the evidence, analysis and conclusions. So it's important to me that the BG data model supports that evidence, analysis and conclusion stuff.

It's also important that the BG data model can accommodate ordinary conventional conclusion-only data (i.e. what 99.9% of us do now), because that's all I have right now and I'll want to load it into a BG compatible application program, and also because that's all 95% of genealogists and family historians will ever want to do.
ttwetmore 2010-11-28T12:24:56-08:00
Greg,

What is an evidence record like? Here is something I wrote ten years ago about possible attributes (PACs) one might find in an evidence record. Before I paste this in let me say that when I wrote this I was thinking that there was a separate kind of record called an Evidence record. At this point I think that an Evidence record is just a variety of Source record. That being said here is what I wrote:

"Evidence Records

There are a number of special Attributes used in Evidence objects listed below. Because Evidence records represent evidence in the world, the Attributes are concerned with representing the external evidence in a number of electronic formats. They include the following:

Description – The Attribute value describes the external evidence without providing details of the information it contains. To use the evidence (e.g., to create Person or Event records) a researcher would have to refer to the original.

Digest – The Attribute value includes only information from the external evidence that the researcher believes is important. If this digest of information does not meet research needs, a researcher would have to refer to the original.

Transcription – The Attribute value provides a literal transcription (or as close as possible) of the information found in the external evidence. It provides a full representation of the original evidence; all information in the external evidence is available in the Evidence record.

FileReference – (maybe URLReference would be a better name) – has a value that refers to electronically stored information outside the bounds of the database."

What I was trying to do with this was to emphasize that an item of evidence exists out in the real world, so what would a computer record be like that represented that item of evidence in a computer system.

Note that I was intending that all of these concepts (description, digest, transcription, fileReference) can be used as tags in a PAC (property/assertion/characteristic, take your pick). Therefore different evidence records could look like this (in the loose DeadEnds syntax I've been using):

source: [id: s1; type: evidence; description: "Birth certificate of Thomas Trask Wetmore IV from the New London, Connecticut, city hall"; sourcep: [id: s99]]

source: [id: s2; type: evidence; digest: "Thomas Trask Wetmore IV was born on ... in ... His parents were ... and so on ..."; sourcep: [id: s99]]

source: [id: s3; type: evidence; transcription: ... line by line transcription of the birth certificate ...; sourcep: [id: s99]]

source: [id: s4; type: evidence; fileReference: ... file system URL ...; sourcep: [id: s99]]

source: [id: s99; type: register; sourcep: [id: s100]]

source: [id: s100; type: cityHall; name: "New London, Connecticut, City Hall"; sourcep: [id: s101]]

Tom Wetmore
mstransky 2010-11-28T12:39:30-08:00
"The system we have now is only suitable for concrete facts, not deliberations." Greg, that is True.

Just like Toms model and mine as well as other can store Concerete Documnets.

One thought for everyone is that they keep trying to take a Book, or documents like they are telling a reseach book of their own findings.

One thought is if a person does create a NON published reasearch say in "word" something to capture the researches findigs like a historians book.

Take that Non-Published research item add it the the source documents. THEN create a "person-evidence" data set to like all persons to the Addintional "document/source"

"YOUR NAME, research findings on John Smith and the people of Manor hill formerly Rose Manor"

I hope I am making sense of it, It may not link source to sources but that NEW item will contain the Historian/researcher/genealogist findings and can be past on to the next party with all the other added source documents.

kind of making use of what is all ready available.
greglamberson 2010-11-28T12:41:01-08:00
Tom,

You lose me when switching to an idea you say isn't quite applicable. Also, I don't understand your syntax. Also, I don't get the idea of an evidence record being a kind of source record.

In short, I don't understand yet.
AdrianB38 2010-11-28T12:41:11-08:00
Greg asks - "But, I ask you, where is the evidence entered? It's not. It's implied. Well, what happens if you only have a letter that talks about Mary Smith 'and that new husband of hers.' How would you enter that as evidence?"

I think that, if I understand Greg correctly, what we are missing is a full record of the analysis.

If we take the Elizabeth Shown Mills' statement, "SOURCES provide INFORMATION from which we select EVIDENCE for ANALYSIS. A sound conclusion may then be considered PROOF" (from "Evidence Analysis: A Research Process Map") then we have a process that we go through and, like 99.99% of processes, it has inputs and outputs.

The input evidence is what is extracted from source records or existing persons (or groups or...)

I have a feeling that some of us have been thinking that the output conclusions are only new events, new characteristics, new versions of persons, etc. (i.e. that's what I thought to start with) But this misses two aspects, I suggest:
1. Implied or negative evidence - e.g. "no marriage can be found in a set of theoretically complete data" - doesn't obviously fit an event because it's about a missing event.
2. All the records of the analysis done need to go somewhere. I can write a whole screenful where I go through all the John Billingtons born in Cheshire in 1780 +/- 10y, examining each in turn to hopefully establish he can't be mine because his father's wrong, he can't because he died as a child, he's in the same census as mine but in a different place, etc, etc. (The exact words aren't the point). I need to document this - the method, conclusions, indirect or negative evidence, etc - and I think this needs to be done in a free-format text document, as well as the new or revised characteristic or event or ...

Maybe I'm missing it but I don't see this documentation of the Analysis process in the models so far. And maybe that's what we need.

By the way, personally I find the ESM process provides a good description of what evidence and "conclusion" persons are, especially as it naturally comes out that the output from one is the input to the next stage of analysis and therefore one person can easily be both at the same time.
mstransky 2010-11-28T12:42:52-08:00
Correction
"THEN create a [person-evidence] data set to <LINK> all personIDs to the additional "document/source"

Simply adding the evidence as its own source document write up?
greglamberson 2010-11-28T13:04:49-08:00
Adrian,

You're right. I haven't even mentioned analysis. Yes, that's another big issue, but I think analysis gets mixed in along the way, too. Analysis is part of every single thing. We're not going to completely separate analysis. However, we can certainly provide an analysis/proof statement object that provides the capability for more detailed analysis. This could be an entity independent of both evidence and an event, or it could result in some declarative statement functioning just like an event. So yes, this is a big piece we need to address, but it's not like we have such a thing now, either. Nevertheless, adding such an item to a process that centralizes evidence is much simpler than shoe-horning analysis into the conclusion model.

Rather than having our information be based on events/conclusions/facts and work backwards, I would have evidence be the center building block on which everything is centered. Obviously the evidence information needs to be sourced properly, but that evidence rather than the events/ideas/facts it supports should be the part of our data that is most clearly defined in the database. Evidence shouldn't be implied. EVENTS should be implied from evidence, if anything.

As to what is in an evidence object, a person would enter what they find to be relevant from the source in the evidence object. Perhaps this is a free-form, vague entry that is later objectified from "grandma's house" to "123 Cherry Street, Billings, MT". Perhaps 90% of the time it is exactly entered like we enter events today. In this way, we can accumulate evidence without destroying any, and we can change our interpretation of that evidence as part of our research process.
AdrianB38 2010-11-28T13:20:08-08:00
This entry from the Ancestry Insider Blog suggests forms of evidence and conclusions. My initial reaction was that it was a lot of typing...

http://ancestryinsider.blogspot.com/2010/05/evidence-management-explained.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+AncestryInsider+%28The+Ancestry+Insider%29
testuser42 2010-11-28T13:33:23-08:00
Adrian, in that example the analysis is just like a note. And I think that would work fine enough in a BG. Just add a note to any assumption you make (if you think it's necessary).
mstransky 2010-11-28T13:36:33-08:00
Look back at my post; three people submitted at the same time.

Submit a work of your own as a non-published document as a source.

Since all sources are LINKED to person-evidence, your work will show/link back to one/all personIDs associated with your findings/work.
ttwetmore 2010-11-28T15:00:44-08:00
Greg says "You lose me when switching to an idea you say isn't quite applicable. Also, I don't understand your syntax. Also, I don't get the idea of an evidence record being a kind of source records. "

You'll have to help me on the switching ideas bit. Don't know what you're referring to.

As a principle I want a model with as few data classes as possible. I take it we need a Source record which I define as an entity in a computer database that records information about a physical thing that provides useful information to genealogists. This is a general idea. A library can be a source (lower case source means the thing out there in the real world), a city hall, an encyclopedia volume, a county history, a city directory, a census enumeration sheet, and the list goes on forever. The range of these physical things is very large and very nested, so the Source records in a database that represent them can be equally nested. A book can be in a library. A page in a chapter in a book. All of these things can be thought of as sources in the real world and represented by Sources in a database.

So the Source records I have defined for DeadEnds can be built into trees, representing say a range of pages in a book in a library, for three levels of sources and Sources. When you get to the bottom of this Source tree I assume you are finally going to get to the information that is going to be extracted and placed in Event and Person records. So for me I say that the final Source records, the most finely grained Sources, are the ones we extract information from. My term for the Source records you actually extract data from is evidence Source records. I hope that is simple enough. I consider an evidence record to be nothing more than a specific type of Source record, thus the notation I used in representing them above.

To reiterate: evidence records are a kind of Source record. Event and Person records are extracted from them. Please don't be confused by my terms of "evidence persons" and "evidence events." These two terms mean the Person records and the Event records that are the ones EXTRACTED DIRECTLY from Source records at the evidence level, and the source references within these Person and Event records would be to the Source records that hold that evidence.

One thing I am having a problem with in all these discussions is people by and large not defining their terms. I try as hard as I can to make sure the words I use are defined with both definitions and with examples. It might be nice if others tried to do the same thing more often. I know I am out on a limb with my definitions since people can see how naive and silly my terms and definitions are, but at least I am putting myself out on those limbs to try to get things moved along.

Tom Wetmore
ttwetmore 2010-11-28T15:16:53-08:00
Adrian says, "This entry from the Ancestry Insider Blog suggests forms of evidence and conclusions. My initial reaction was that it was a lot of typing...

http://ancestryinsider.blogspot.com/2010/05/evidence-management-explained.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+AncestryInsider+%28The+Ancestry+Insider%29

I've just read this blog article. It is the evidence and conclusion process we have been discussing:

Red boxes -- sources in general.

Green boxes -- source records low enough on the source totem pole that we can extract meaningful genealogical data directly from them.

Purple boxes -- person and event records extracted directly from evidence, which I call evidence persons and evidence events.

Blue boxes -- conclusion persons and events inferred by the information in all the purple boxes. The only thing GEDCOM supports.

It is most interesting that in the example in the blog the author is inferring a birth event for a person. I like this because normally examples are inferences (conclusions, deductions or hypotheses if you like; select your favorite synonym) about persons; it's nice to see an example in which an event is the object of the inference. In the DeadEnds model I have both persons and events as these kinds of records (existing at both purple and blue levels), but I've never written up an example for the event inferences because it seems more esoteric. However, the author of this blog did a good job of it for us.

Would it be better if I replaced the terms evidence persons and conclusion persons with the terms purple persons and blue persons? Smile.

Tom Wetmore
greglamberson 2010-11-28T15:47:27-08:00
Tom,

Here's my gut feeling: If you can distinctly define data elements within your model and give rules that manifested data follows in your data model, then you've got an elegant solution. If not, it's just a muddy mess with no rules. You're right to express your angst about the definition of terms. Defining data is the most important part of devising a data model.

We not only need to make a system in which data will fit. We need to define those data units and the entities into which they fit. Maybe you can do this, but this is how I'm looking at this, and I wonder if this structure allows both flexibility and truly defined structure.

This is a situation in which I think more definition is imperative. We can't develop a more robust system by lessening the number of objects and simply generalizing them so that they can be anything.

A source is a contained, recognized unit. So is a library; so is a piece of evidence. Are they the same? Not on your life. There have to be clear rules. I absolutely can't accept that the same data object can be all of these things. Not at all.

This sort of system may work, but I'm not seeing it yet.
ttwetmore 2010-11-28T10:58:11-08:00
Should someone set up an example set of records that we can use to experiment with different models?

Tom Wetmore
testuser42 2010-11-28T11:05:21-08:00
Greg,
I'm not Tom, but I think your point
"2. There is still no distinct evidence and consequently no distinction between evidence and conclusions.."
can be daebated.
If I understand the model correctly, any object (person...) that only refers to one source is an evidence object.
An object that references other objects is a conclusion derived from these referred objects.
I don't know if this distinction should be expressed literally somewhere. As of now, it can only be deduced by looking at the references.

Tom, a question:
How would a "personRecord" look like that is a conclusion of two other personRecords and adds nothing new to these. Will there be only two "personRefs"? Or will the data that is contained in the other personRecords be collected under the new personRecord?
What if there's a new note to be added?
testuser42 2010-11-28T11:06:21-08:00
"Should someone set up an example set of records that we can use to experiment with different models?"
Yes please!
testuser42 2010-11-28T11:10:53-08:00
I was trying to make a diagram with some example data. I gave up ;)
But here's the starting points I used:

Source 1: Phone conversation
notes taken by (user1) on (Date1)
Transcription/Note: Mr A married in 1912

Source 2: Copy of marriage cert.
made on (Date2) by (user1)
Transcription/Note: Mr Aardvark, age 30, of Newtown married Miss Take, age 25, of Oldtown on 1913-02-02 in Newtown.

What would you make of these?
Thanks!
mstransky 2010-11-28T11:13:49-08:00
I thought of the Disney characters, but that would be too corny for something as serious as this, yet people would know Huey, Louie, and Dewey were brothers. It would be great to use a very well-known structure, so that almost anyone could follow the data without knowing the structure of the format, and could comprehend it from the known data breadcrumbs.
testuser42 2010-11-28T11:15:16-08:00

...To clarify: In this example, my hypothesis would be that Mr A is the same as Mr Aardvark, and the marriage did take place on 1913-02-02 and not in 1912. How would this look in DeadEnds' data format?
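Not an authoritative answer, but one way the approach Tom describes might represent this, sketched in Python with assumed field names rather than DeadEnds' real format:

    # Two evidence sources, one evidence-person extracted from each, and a
    # conclusion-person combining them (all names and fields are illustrative).
    s1 = {"id": "s1", "type": "evidence",
          "digest": "Phone conversation: Mr A married in 1912"}
    s2 = {"id": "s2", "type": "evidence",
          "transcription": "Mr Aardvark, age 30, of Newtown married Miss Take, "
                           "age 25, of Oldtown on 1913-02-02 in Newtown."}

    ep_a        = {"name": "Mr A",        "source": "s1", "sub_persons": []}
    ep_aardvark = {"name": "Mr Aardvark", "source": "s2", "sub_persons": []}

    # Conclusion: both records describe the same man; prefer the certificate's date.
    cp = {"name": "Mr Aardvark",
          "marriage_date": "1913-02-02",   # hypothesis: the copied certificate beats memory
          "source": "note explaining why s1 and s2 refer to the same person",
          "sub_persons": [ep_a, ep_aardvark]}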
greglamberson 2010-11-28T11:15:35-08:00
Tom,

I have been working on lots of examples, so I have several rather clean ones. If you have something particular in mind, let me know, and I can probably match you up with something.

Regarding point 2.

I expect that data in a data model will conform to definitions we set for elements of the data model. To conform with an evidence-based research process, I expect we'll define evidence in the data model as a distinct thing. This, as you know, has never been done. Perhaps this evidence data is sometimes free-form and sometimes it is entered much like data in an event. I envision that perhaps data is passed from an evidence object to a conclusion object much like information for a source citation as it appears on a printed genealogy is in fact information derived and passed from a source entity. Perhaps in others, the conclusion information is determined through further analysis and does not directly pass through but is entered with the opportunity to note the analysis involved.

I do think that evidence, as an absolutely critical and independent concept, has to be recognizable in the data model and the resulting data as manifested in the database.
mstransky 2010-11-28T11:18:07-08:00
Then we just add those uncommon things, with data, to the beta data sample, and piece in each "thing(s)"; terms can then be defined in the BG, saying see example PersonID123 Huey birth. Then we can see how each style/format places and stores such data.

At the end of the day BG can take the best concepts, in part or in whole, from such works to create the BG-XML-transfer-file, or whatever they will term that.
mstransky 2010-11-28T11:23:46-08:00
Greg, do you think a sample GEDCOM might be good to capture such a beta sample set/tree? Then if GEDCOM does not support something, just add it, like:

0 INDI
1 NAME John Smith
1 Shoe size, 12   (added in red - something GEDCOM does not support)
1 BIRT 1900
2 PLAC etc...

As examples are termed in the BG for stuff, a sample data set can be added and referenced by BG to show it also.

With each addition of terms and data sets, people like Tom, others, and myself can freely take that data set and build our styles, and everyone can draw the best practices from each to make a better transfer format.
greglamberson 2010-11-28T11:24:28-08:00
testuser said, "Greg,
I'm not Tom, but I think your point
'2. There is still no distinct evidence and consequently no distinction between evidence and conclusions.'
can be debated.
If I understand the model correctly, any object (person...) that only refers to one source is an evidence object."

My response to this: I think this veers into the territory of something that can be made to work but that doesn't comply with a true delineation between evidence and conclusion. I would love to be wrong.
mstransky 2010-11-28T11:38:21-08:00
"distinction between evidence and conclusions" I think I see what others ask from a diffrent view.

A user finds 5 findings "items" of a person.
Each item on is own can not support proof.
Yet combined overlayed aids the proof to a cunclusion.


I have one I can relate too.
1) A women "Mary" married 5+ men, only for each 2-3 years and they died.
2) everyone thought all the children were hers and all were step siblings
3) I found records she was collecting the children and men would die in 2-3 years each.
4) Found that each child mother was born in a diffrent state the MARY's birth state.
5) This cuase HELL to the beliefs that many had, then people though we had the first Black window spaninng from 1818-1860.

I will think about that since I can incorperate an easy (link to documents based on "Observation or the data" (what ever tag would be spun? for it"
greglamberson 2010-11-28T12:05:35-08:00
Mike,

I see what you're getting at, I think, but that's not the issue. Yes, sometimes data is going to lead to what most of us would call an "obvious" conclusion. Sometimes the two are hard to distinguish.

However, what we work with now is an approach that allows us to enter data based upon certain conclusions like events and the like. These events have defined meaning for the most part. However, if we don't quite have confidence to declare that a marriage event has taken place, for example, there's no way to enter inconclusive evidence that might support such a conclusion.

The system we have now is only suitable for concrete facts, not deliberations.

If we're really talking about supporting the research process, we must recognize that our evidence isn't even part of the puzzle right now. Right now, if you enter a marriage event based upon an entry in a marriage register, you might enter the marriage register as a source, then you might enter the marriage event as a tag and tie the two together using a source citation that tells the page number the proper entry is found on in the marriage register. For the most part, that works great. But, I ask you, where is the evidence entered? It's not. It's implied. Well, what happens if you only have a letter that talks about Mary Smith "and that new husband of hers." How would you enter that as evidence? You can enter the source as the letter, but there's no way to enter that information without something like a marriage event with no husband, date or place. And maybe you're ok with that. However, maybe you have another letter from the next week, saying "oops, Mary didn't get married yet. They sure acted married at our house last week." Now add that. How? By removing the faulty event and completely destroying all trace of the first evidence, since the evidence itself is only implied.

I really don't know if we're ready to do this. For about 90% of cases now, a conclusion-based model works fine. If evidence information could be entered and passed through to an event, this would work. However, I certainly don't expect that entering information twice is going to work. Rather, in the first example I'd like to be able to enter that Mary Smith married Tom Jones 12 May 1912 in Hopkinsville, KY according to the marriage register entry as evidence, then check little boxes next to the date, place, and marriage event/fact indicating that these pieces of information are to be reported or somehow represent what we previously used as event tags or conclusions.

This is a totally different way of thinking, but such a model is the only way to jump ship from a conclusion-based model and support both correction of mistakes properly and deliberations and detailed analysis.
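For what it's worth, here is one hypothetical sketch of the "check little boxes" idea above (the field names and the accept mechanism are my assumptions, not part of any existing model):

    # An evidence entry whose individual pieces of information can be flagged
    # ("checked") as accepted conclusions without being retyped.
    evidence = {
        "source": "marriage register entry, Hopkinsville, KY",
        "fields": {
            "event": {"value": "marriage of Mary Smith and Tom Jones", "accepted": False},
            "date":  {"value": "12 May 1912",                          "accepted": False},
            "place": {"value": "Hopkinsville, KY",                     "accepted": False},
        },
    }

    def accept(evidence, field):
        # "Checking the box": mark this piece of evidence as reportable, i.e. as
        # playing the role an event tag / conclusion plays today.
        evidence["fields"][field]["accepted"] = True

    accept(evidence, "date")
    accept(evidence, "place")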
mstransky 2010-11-28T10:41:49-08:00
Greg, thanks. I have not outlined a model yet either, and have not put it up for the firing squad either. I just want to be able to CLEARLY have a comparable data set that is created.

Then Tom, others, and myself each "HAVE TO DISPLAY ALL" the data from the beta sample in their own structure. This way others can follow the data IF they don't grasp a format style right away.
ttwetmore 2010-11-28T10:56:45-08:00
Greg says: "Tom,

While your model is a drastic improvement, here are the two things I have a problem with right now:

1. There is no dependency tree with your hierarchical linkages.
2. There is still no distinct evidence and consequently no distinction between evidence and conclusions."

Greg,

For number 1, I will put together a small example, maybe that one you created.
For number 2, I'm not sure I understand. Do you mean I've left out the evidence? If so I'll contrive to add some evidence source records as well.

Tom W.
testuser42 2010-11-29T03:34:38-08:00
Some explanation please?
Hi Tom,

could you explain what a person-tree would look like in DeadEnds? Maybe look at my feeble attempt of a graphic ( http://bettergedcom.wikispaces.com/file/detail/bg-schema.pdf ) and show me how the different persons and events would be linked.
I guess that a conclusion person would not repeat all of the information from the previous persons. How would the preferred version of an event or a PAC be made clear? Would you "nest" the persons like nodes and leaves in XML, or can the structure be done only using references?
AdrianB38 2010-12-01T14:18:17-08:00
A new Conclusion Person - new events?
Tom (or anyone)
If we follow the process of analysing evidence to produce conclusions, and we document this by creating a new conclusion person entity - what about the entities for the events and characteristics (e.g. name, physical description) that person has? (For clarity I'm assuming a many-to-many relation btw Persons and Events, with a 1-to-many for Person to Characteristics)

(I may have missed a detailed answer - if so I apologise)

For Characteristics it seems to me that there is every chance that the output BG XML file will have Characteristic tags inside Person tags (if this is the right terminology). Therefore, a new person will auto create new Characteristics on the XML and thus the same ought to happen internally if we don't want to create a potential point of failure. I think.

For Events, creating a new version of the Event Entity to match the new Person - even if that Event hasn't altered - seems simple. But it does seem to get expensive if every time we create a new conclusion person, we also create, approx 1 for 1, new Events for that new person looking (most of them) like the old one except for the revised link to the person.

We can't replace the old Person link inside the Event Entity with the new person's link because that's destructive. Just adding the new person is absurd because a birth event would then have 2 fathers (say) - the old evidence and the new conclusion father.

But what if we left the Event alone? (Assuming it's not directly affected by the new analysis, of course). It would still point to the old evidence person - but any report could then follow through from that Event via the link to the old evidence person, to the new conclusion person. Trouble is ... this loses any degree of simplicity...
ttwetmore 2010-12-02T04:03:25-08:00
A good question that I was afraid would eventually be asked. I've worked on some solutions to this and thought about its answer a number of times. Sorry about the length of this reply.

First imagine a conclusion person (CP) as a tree of persons, as I've mentioned in both the DeadEnds model and in my article about automatic combination. Now consider the personal name of the CP. Does the user assign that name when the CP is created, or does the software keep a histogram or "weighted average" or "weighted distribution" (sorry, but here is an area that needs a little technicality) of all of the names from all of the evidence persons (EP) below? For attributes like names, in most cases keeping the distribution works fine. When one wants to know the name of the CP, one uses the distribution to find the most common form of the name, the next most common, and so on. In other words, without the user doing anything, the attributes of CPs can be treated as the weighted sum of all the attributes of the contained EPs. However, the user may also want to directly assign a property to a CP, which is perfectly okay, exactly because this is a CP, where the user is making his beliefs about the person manifest in a conclusion. So this is how I would handle all attributes/traits/properties/characteristics (they aren't PACs any more, now they are PACTs). In the automated algorithms that I mentioned, there is no human intervention at any step (there are many 100s of millions of EPs so there is no effective way to let a human into the process); all attributes of CPs are determined by the weighted distributions of the attributes of the EPs. Even when there is a complex tree structure below the CP, a topic useful to avoid because it can seem daunting to the uninitiated, every person record between the CP at the root of the tree and the EPs at the leaves of the tree has the weighted distribution of the persons below it. To enter the world of data speak for a moment, each person record, from the bottom to the top, caches weighted information about all the persons below it in the tree. This is useful both for the automatic algorithms (which no real genealogical program user would probably want to use) and for non-automatic algorithms that users would use to search for similar persons that might be duplicates.
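
As a tiny sketch of the weighted-distribution idea (this is not DeadEnds code; the names, structures, and data below are invented purely for illustration), here is how a conclusion person's default display name could be read off a tally of the name forms found on its evidence persons:

#include <stdio.h>
#include <string.h>

#define MAX_FORMS 32

/* one entry in the name distribution cached on a conclusion person */
struct NameCount {
    const char *form;   /* a name form as it appeared in evidence */
    int count;          /* number of evidence persons using that form */
};

/* add one evidence-person name to the distribution */
static void tally(struct NameCount dist[], int *n, const char *form)
{
    for (int i = 0; i < *n; i++) {
        if (strcmp(dist[i].form, form) == 0) { dist[i].count++; return; }
    }
    if (*n < MAX_FORMS) { dist[*n].form = form; dist[*n].count = 1; (*n)++; }
}

/* the most common form is what a report would show by default */
static const char *most_common(const struct NameCount dist[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (dist[i].count > dist[best].count) best = i;
    return n > 0 ? dist[best].form : "(unknown)";
}

int main(void)
{
    /* name forms taken from hypothetical evidence persons below one CP */
    const char *evidence_names[] = {
        "John Smith", "John Smith", "Jon Smith", "John Smithe", "John Smith"
    };
    struct NameCount dist[MAX_FORMS];
    int n = 0;

    for (size_t i = 0; i < sizeof evidence_names / sizeof evidence_names[0]; i++)
        tally(dist, &n, evidence_names[i]);

    for (int i = 0; i < n; i++)
        printf("%s -- %d occurrence(s)\n", dist[i].form, dist[i].count);
    printf("default display name: %s\n", most_common(dist, n));
    return 0;
}

A name the user assigns directly to the CP would simply override whatever the distribution suggests.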

But then you must also deal with the entities that persons refer to -- especially events and related persons. In my opinion there is still the notion of a weighted distribution for these concepts as well, but dealing with them is a little more abstract and subtle. Say, for example, you have a CP made up of six EPs, and let's say that three of those EPs refer to their own birth evidence event (EE). These would be three separate EEs from three different sources of evidence. Is there a way to deal with a weighted distribution of EEs? That is, can you think of a conclusion event (CE) as a weighted distribution of EEs? Well, there is no immediately obvious yes or no answer to that, but finding a good answer is a key to making the evidence and conclusion process work.

My first answer to the question is that software is very good at finding weighted distributions, so software could easily be used to determine the most likely values for the CEs (e.g., dates, places). These CEs wouldn't necessarily have to be in the database at all, but could be computed by the software whenever one is dealing with the CPs in question. Or the user could create these CEs explicitly by picking and choosing information from the EEs. This needn't be hard, as lots of software support can be provided through a user interface.

Note that there is an interesting "recursive effect" that happens when one combines EPs into CPs, with the indirect combining of the referred-to EEs into CEs. COMBINING EEs IMPLIES COMBINING THE ROLES WITHIN THE EEs, and EE roles refer to a further set of EPs that are related to the originally combined EPs. It is very important to realize that THE COLLAPSING/COMBINING OF EEs PROVIDES THE STRONGEST EVIDENCE ONE WILL OFTEN FIND ABOUT WHICH EPs TO ACTUALLY COMBINE in the first place (e.g., if you have ten John Smith EPs and five of them have a mother named Nancy, and five of them have a mother named Sally, there are probably at least two John Smith CPs among the ten EPs, and you now have very strong evidence on how to combine those ten EPs into two CPs). So each combination of persons impacts a combination of referred-to events, which impacts the combination of sets of persons formed from the events' role players, which then impacts the combinations of another set of event records, which impacts the combination of another generation of person records, and so on and so on and so on.
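
As a toy version of that ten John Smith example (again just an illustrative sketch with invented data, not DeadEnds code), grouping the evidence persons by the mother named in their birth evidence immediately exposes how many conclusion persons are probably hiding in the pile:

#include <stdio.h>
#include <string.h>

/* an evidence person together with the mother's name taken from
   the birth evidence event it refers to */
struct EvidencePerson {
    const char *id;
    const char *mother;
};

int main(void)
{
    struct EvidencePerson eps[] = {
        {"E1", "Nancy"}, {"E2", "Nancy"}, {"E3", "Sally"}, {"E4", "Nancy"},
        {"E5", "Sally"}, {"E6", "Sally"}, {"E7", "Nancy"}, {"E8", "Sally"},
        {"E9", "Nancy"}, {"E10", "Sally"}
    };
    int n = sizeof eps / sizeof eps[0];
    int used[10] = {0};

    /* each distinct mother's name suggests one candidate conclusion person */
    for (int i = 0; i < n; i++) {
        if (used[i]) continue;
        printf("candidate conclusion person (mother %s):", eps[i].mother);
        for (int j = i; j < n; j++) {
            if (!used[j] && strcmp(eps[j].mother, eps[i].mother) == 0) {
                printf(" %s", eps[j].id);
                used[j] = 1;
            }
        }
        printf("\n");
    }
    return 0;
}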

Handling this "expanding/exploding horizon" of combinations has always seemed to me the most interesting challenge for providing computer support for the evidence and conclusion process. I have rarely written about this challenge, because it seems so confusing at times that I believe it might put off anyone who is evaluating the possibility of providing computer support for such an endeavor. I'd rather they get further along with the idea, fully hooked on its innate goodness, before they reach the point of hitting this fairly high wall. But it looks like Adrian has been staring at that wall, so in the interest of full disclosure I have given these thoughts.

But please be aware that the complexity we are talking about here is the essence of the evidence and conclusion process, the task that good researchers do all the time, performing these "expanding combinations" in their heads all the time as they come to their conclusions.

Tom Wetmore
hrworth 2010-12-02T04:25:01-08:00
Tom,

I have a really simple question. But then, in what we are doing, there isn't such a thing.

I am asking this question so that I don't get hung up or sidetracked because of the terms, while trying to understand them and how to use them.

Aren't the terms that you have been using, Software Terms, and NOT terms that an End User has to deal with?

After reading and trying to understand the GenTech document, the terms that you have been using, and others, are in there as well.

As an End User, I enter names, dates, events, sources, and citations.

Some of the research steps in the GenTech material are not in the programs currently on the market. From what I have read, they should be considered.

So, as I enter data into screens in the program, your terms "kick in" to create the Evidence Person and the Conclusion Person. I, as the End User, don't have to "mark" or flag an entry to say it has been promoted to the next higher level, because I have these fields with data in them.

This might help me understand what the End User sees and does, what the application provides to the End User, what the application does with that data, then what the application does to package that data up to be exported into a BetterGEDCOM for transport. The reverse happens at the other end.

Is this close?

Thank you,

Russ
dsblank 2010-12-02T04:37:15-08:00
Tom,

I have written some code similar to this, and I think we are some years away from having something like this that could be used by humans in an effective, non-confusing manner. For example, in Gramps one can ask when a person lived and died, even though that person doesn't have birth or death events associated with them.

Of course, the software can easily traverse the web of connections looking for evidence for when someone lived. That might include other events, parents, children, etc. This can get complicated very quickly. For example, dates don't have to be exact: a date might be "about 1903" or "between 1870 and 1910" or "Jan 1904". What does "about" mean here? How to weight non-specific dates? How to weight them the further they get from the person in question?

After writing this code, and finding it useful for myself, some further enhancements were needed. For example, users want the system to explain where this information came from and how it was computed. (Some people hate the idea of the software doing any such analysis; there are ways to remove these automatically estimated events and dates.)
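
Purely to make the weighting question concrete -- this is not the Gramps code; the clue ranges and weights below are invented -- one simple approach is to turn each nearby dated event into a plausible birth-year range with a confidence weight, then combine them:

#include <stdio.h>

/* one clue about when a person was born, derived from a dated event:
   a plausible birth-year range plus a confidence weight (invented values) */
struct BirthClue {
    const char *why;
    int low, high;     /* birth-year range implied by the event */
    double weight;     /* vague dates and distant relatives weigh less */
};

int main(void)
{
    struct BirthClue clues[] = {
        {"own marriage 'about 1903' (assume married at 18-40)", 1863, 1885, 0.5},
        {"child born Jan 1904 (assume parent aged 15-50)",      1854, 1889, 1.0},
        {"named as an adult in 1870-1910 tax lists",            1840, 1892, 0.3},
    };
    int n = sizeof clues / sizeof clues[0];
    double sum = 0.0, wsum = 0.0;
    int low = clues[0].low, high = clues[0].high;

    for (int i = 0; i < n; i++) {
        double mid = (clues[i].low + clues[i].high) / 2.0;
        sum += clues[i].weight * mid;          /* weighted midpoint */
        wsum += clues[i].weight;
        if (clues[i].low < low) low = clues[i].low;
        if (clues[i].high > high) high = clues[i].high;
        printf("%-52s -> %d-%d (w=%.1f)\n", clues[i].why, clues[i].low,
               clues[i].high, clues[i].weight);
    }
    printf("weighted estimate: about %.0f (somewhere in %d-%d)\n",
           sum / wsum, low, high);
    return 0;
}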

My comment here is not to discourage you, but to simply comment that theoretically what you describe sounds elegant. However, until someone actually uses a system like that, one should remain skeptical as to its efficacy.

BTW, Gramps probably has the largest group of open source genealogy developers... maybe you want to join that group and collaborate towards some of your goals?

-Doug
mstransky 2010-12-02T06:28:17-08:00
"Handling this "expanding/exploding horizon" of combinations has always seemed to me the most interesting challenge for providing computer support for the evidence and conclusion process."

A few here have been able to see and describe what I have been trying to describe for a few days about the old database ways of controlling and storing data.

"We can't replace the old Person link inside the Event Entity with the new person's link because that's destructive..........But it does seem to get expensive if every time we create a new conclusion person" - AdrianB38

It is my opinion that a user's inputted data should not be anywhere near or inside of a personID data set, to keep the data from being harmed, deleted, or ignored.

I see it done this way: envision user evidence records in a separate area, like so...
e = event or evidence, p = person IDs, s = source record data and image if available.

e1, p123, s22 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet
e2, p123, s26 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet
e65, p123, s29 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet
e75, p124, s36 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet
e165, p124, s44 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet
e221, p124, s48 (all the users data entered from source) [flagged] Conclusion, Hypothesis, Disputed, not reviewed yet

and the Person ID is just a simple marker holding how the USER wants the generic display to be seen: the user's preferred person name, birth date and place, and death date and place.

you can have just two simple lines of code, one for personA(123) and the other for personB(124)

1. Any application can filter, sort, find, and edit. If PersonB is deleted by mistake, all the collected data is not gone forever, as it would be in the old way unless you had a backup.
2. When you do a merge, the USER controls which line of evidence gets pointed to which person, not an app trying to determine which data will overwrite which birth or death data.

Evidence or record keeping is just that: a record which has been reviewed, entered into the data file as is, and which should never be altered.

That record should only point to (link to) a display person marker, not have the display person marker control the data inside its nodes.

People add, delete, modify, merge, and separate display person markers all the time. That can be very destructive to all the collected user data inside those person markers.

Protecting ALL the user data collected from source docs: this area is a user's hard work, and how do we keep it protected? The old way has apps try to choose only one data set for a birth and delete or discard the other forever. Many people may have more than one birth record with different dates. We can have that option if the app structure stops having the person tree marker control the evidence and data. Data is data, and a person marker is just that, a marker in a tree navigation placement. These are two very separate data sets and should be kept very far apart from each other; the one should have no control over holding or destroying data found or structured inside the other's nodes.

It's great that we can put all the data together for A person, but I believe there is another way which is more flexible for the user and the app handling the data.

If you made two persons, those 6 records point to either of them. Say you merge the two people in the persons area; then all the evidence/event record data entered by the user from the sources will point, all six, to one person, while the other person can be deleted WITHOUT losing any data or making a choice between two records' data for a display.

If only one record needs to be moved, you just point the record from p124 to p123 without disturbing any data or choosing one piece of data over another.

If a person is deleted, all those evidence records are not destroyed; they can be deleted later by the user, or re-pointed to create a new shadow person.

Sorry, I am just trying to make a point about not having the INDI areas of dbs control, hold, and store data, and worse, sometimes force a user into choosing which collected data is more important than the other on a merge.

Oh, real quick: if a user wants to split a person with 12 records collected, they split off (create) a new person and choose which data lines of evidence stop pointing at PersonA and point instead to PersonC.

I hope I made the point that evidence or event records, whichever way you want to call them, should be kept separate from the person display, or nav tree marker, record.
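
Here is a minimal sketch of that separation (the record layouts and values are made up for illustration only): evidence rows hold the user's data and merely point at a person marker, so a merge is nothing more than re-pointing rows, and the evidence itself is never touched:

#include <stdio.h>

/* a display-person marker: nothing but an id and the user's preferred display */
struct Person {
    int id;
    const char *display;
};

/* an evidence/event row: the user's transcription from one source,
   pointing at (never owned by) a display-person marker */
struct Evidence {
    int id;
    int person;          /* which person marker this row currently points to */
    int source;
    const char *flag;    /* conclusion, hypothesis, disputed, not reviewed yet ... */
    const char *data;
};

/* merging person `from` into person `to`: just re-point the evidence rows */
static void merge(struct Evidence ev[], int n, int from, int to)
{
    for (int i = 0; i < n; i++)
        if (ev[i].person == from)
            ev[i].person = to;
}

int main(void)
{
    struct Person a = {123, "John Smith"};      /* the two person markers */
    struct Person b = {124, "John Smith (?)"};

    struct Evidence ev[] = {
        {1,   123, 22, "conclusion",       "born 1850, Hopkinsville"},
        {2,   123, 26, "hypothesis",       "married Mary 1871"},
        {65,  123, 29, "not reviewed yet", "1880 census, farmer"},
        {75,  124, 36, "disputed",         "born 1851, Christian Co."},
        {165, 124, 44, "hypothesis",       "died 1913"},
        {221, 124, 48, "conclusion",       "buried Riverside Cem."},
    };
    int n = sizeof ev / sizeof ev[0];

    /* decide b was really a: re-point its rows, then b can be deleted
       without losing a single piece of entered evidence */
    merge(ev, n, b.id, a.id);

    for (int i = 0; i < n; i++)
        printf("e%-3d -> p%d s%d [%s] %s\n",
               ev[i].id, ev[i].person, ev[i].source, ev[i].flag, ev[i].data);
    return 0;
}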
ttwetmore 2010-12-02T08:07:38-08:00
Russ asks "Aren't the terms that you have been using, Software Terms, and NOT terms that an End User has to deal with?"

Yes, but somehow the user interface is going to have to bring the user and the concepts together.

Everything I'm talking about looks like a person or an event or an attribute of one of them. That's what the user sees. But the user, if using evidence and conclusions, is going to have to bring person records together for making conclusions. There have been some ideas around about how to do that.

Tom Wetmore
greglamberson 2010-12-02T10:46:59-08:00
Tom,

Yes, I have seen this problem coming as well. I have been thinking of a slightly different way to handle the problem. My overriding thoughts have been centered around how to properly categorize the data and then what rules that data follows. However, I don't have it ready to share quite yet.

mstransky,

What you propose is somewhat more akin to what I've been thinking. However, I think these different "roles" (for lack of a better word) that an event plays have different data characteristics beyond a flag. And since my underlying ideas are a bit different from all the previous assumptions, it's not really possible to plug them directly into this discussion.
mstransky 2010-12-02T11:36:00-08:00
"What you propose is somewhat more akin to what I've been thinking." - Greg

a. Been trying, LOL. I have updated my mock-up area with examples and broken down the function of the areas; maybe soon I'll post it as a possible model guideline.


"However, I think these different "roles" (for lack of better word) that an event plays have different data characteristics beyond a flag." - Greg

a. I think many of us cross lines in how the role and the event are used in many different ways, in TERMS and in FUNCTIONS. I can see both ways, but to comment on both here would be confusing to the reader.

Everyone -- every user -- starts with one of these two things:
1. An actual person they know, and then they collect sources on them (source documents).
2. A source they find, from which they create people in a tree as place markers (generic tree/nav person markers).

3. It is the middle ground that links these snips of data, which should be kept separate from the source data collected and separate from the preferred default display person area. These snips of data from sources need to be identified by the user as reviewed, OK, bad, conclusion, hypothesis, disputed, or whichever flags BG comes up with.

Beyond that, saying any more I would just be guessing what to say and would look like I am off on a tangent. I look at role in this #3 area, where I can tag a person as head of house, or cousin from a census record, or ex-husband from a divorce record. That is one snip of data PER person that a user can flag however they like;
from that same shared source event I can make a separate snip of data saying ex-wife.

The source is a document with its own identity, e.g. census 1900, sheet 3 of 15.
The event is what groups all the snips of data to it -- one ROLE interpretation per individual.
Then the individual role is linked to the generic outline tree marker, to be displayed per the preference of the user.
ttwetmore 2010-12-02T13:27:36-08:00
Doug,

I agree it could take years to solve the computer issues, both the algorithmic ones and the user interface ones, for the combination approach outlined in my note. Since you have written software like that I'm sure you have a solid appreciation for the problem.

The main reason I took the job at Zoom Information almost seven years ago now, was to get some experience working on an analogous problem involving massive databases. The Zoom application had two entity types, persons and companies, and the algorithms had to combine the persons, combine the companies and then find the companies each person worked for. Lots of similarities, but a little bit simpler because there wasn't the recursive, expanding horizons issue to deal with.

I think good genealogical software must eventually head in this direction of fully supported evidence and conclusion. Just look at Gramps and how popular it is now that it has made a strong move in that direction. And many other programs, e.g., Family Tree Maker, which I am playing with on my Mac, make it easy to add lots of events to the same person, so that's another start. A question for BG is whether its standards should be ready for when that time comes. I think it's safe to say that if BG only wants to support the current generation of software products, all this rigamarole about evidence and conclusion is overkill, but I really think it is in BG's best interest to push ahead and support a full genealogical research process.

Frankly the evidence and conclusion algorithms are the most interesting to me and are what keep me thinking about developing new software. Solving these issues is the main purpose of DeadEnds software. At my age I need something fairly meaty to overcome the growing inertia of laziness that comes when you really don't have to do it anymore!

Tom Wetmore
AdrianB38 2010-12-02T14:52:33-08:00
Well, I did ask the question....

And I need to think about these ideas
greglamberson 2010-12-04T23:04:43-08:00
Storing Nested information in a Relational Database
Tom,

I've been trying to figure out how to translate what you've proposed into something today's software can use. Look at this discussion which addresses part of the problem:

http://www.alandelevie.com/2008/07/12/recursion-less-storage-of-hierarchical-data-in-a-relational-database/

I think you're two generations ahead of GEDCOM. The next generation seems like a model with an objective person and separate evidence objects which could have different properties, functions and definitions based upon an if-then statement (which would simply be represented by adding a layer of complexity in XML).

Do you see where I'm going here?
ttwetmore 2010-12-05T04:30:34-08:00
I read that page plus the evolt page it references. I am imagining where it would apply to genealogical data. Maybe this is one -- say each of the Person records had a reference to its father and mother Person records, and you wanted to find the set of all ancestor Person records for a given Person record. In a straightforward relational implementation the Person records could be in a table with a key UUID column, a column holding the entire record as text or blob, and then father and mother columns with UUIDs.

I feel I should pull out the "BG is not a database nor an application" trump card at this point and say that this isn't a BG issue. An application using a BG model for its basis could choose to implement the various BG model entity types in a set of tables, whether unnormalized, partially normalized or fully normalized. The developers of the application would first decide what operations and user interface metaphors they want their programs to display, and would then decide how to map BG entities into the table structures required for good performance for those operations. If they were good developers and wanted to be faithful to BG, their mapping from BG archival form, which wouldn't be in tabular form, to their internal database format and back, would be idempotent (if the program exported its full database to BG and then imported it into an empty database the two databases would be identical [I think this should be the definition of BG compliance]).

Maybe you are wondering whether there is anything that the BG designers can do while designing the model to make it easier for future applications to 1) import BG data into their database format, 2) keep their internal database consistent with the BG model no matter what their users do, and 3) export their databases to BG format. In my opinion there is nothing in any of the models we are discussing that should lead to any worries about possible application implementations. All of our models are very simple and very regular.

I've gotten this far without ranting once about the automatically assumed use of relational databases for genealogical data. I know that most applications that would buy into the BG model would use relational databases and they would have to decide what parts of the model to normalize. With a well specified BG model I would just leave it to them to figure out how to do it. It would be pretty basic for a competent database guy/gal.

But I can't finish without mentioning my strong preference for non-relational databases for genealogical data. In LifeLines I discovered that by simply storing every record, directly as GEDCOM text, in a B-tree indexed by the record's id (it would be the record's UUID in BG), and then creating a few custom indexes, also stored as records in the B-tree, I was able to get lightning performance in every operation I wanted LifeLines to do.

For example, LifeLines provides the following operations in its programming feature:

Set parentSet (Set s) -- returns the Set of all parents of the persons in Set s.

Set ancestorSet (Set s) -- returns the Set of all ancestors of the persons in Set s.

Set descendantSet (Set s) -- returns the Set of all descendants of the persons in Set s.

Both ancestorSet and descendantSet could have been implemented with recursion, but I chose to do them iteratively for the improved performance, using small, custom functions that work directly against the database.

The advantage of the relational approach comes into play if the application must provide general purpose search capabilities to the user, but genealogical programs don't normally do this. They have a pre-defined user interface that defines what the users can do. If the application did support a powerful programming language, then some of these search capabilities would be very useful, however. The disadvantage of the relational approach is that it often introduces restrictions or limitations on what data can be stored, and it can be difficult for the relational database to remain faithful to the underlying model, BG in our case. The advantage of the LifeLines approach (which is the network database approach) is that the database stores exactly the model's records (GEDCOM in LifeLines, BG records in future applications). You get idempotency for free in such a model. The disadvantage of the network approach is that you have to anticipate the proper indexes and you don't have as general a mechanism as SQL to compose queries. In my mind the balance of advantages and disadvantages for genealogical databases has always fallen on the network side.

I went down memory lane and found the function in LifeLines that implements the ancestorSet function (it is named ancestor_indiseq and works on the data type named INDISEQ). This function does two things -- it finds all the ancestors of the persons in the input set, and for each ancestor it finds, it also records the minimum number of generations that ancestor is away from its nearest relative in the starting set. This is 20-year-old code written in a now archaic version of C, but it should show how easy it is to write custom code against a networked database.

Tom W.

/*=========================================================
 * ancestor_indiseq -- Create ancestor sequence of sequence
 *=======================================================*/
INDISEQ ancestor_indiseq (seq)
INDISEQ seq;
{
    TABLE tab;
    LIST anclist, genlist;
    INDISEQ anc;
    NODE indi, fath, moth;
    STRING key, pkey;
    INT gen;
    if (!seq) return;
    tab = create_table();
    anclist = create_list();
    genlist = create_list();
    anc = create_indiseq();
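    /* seed the work queues with the starting persons, at generation zero */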
    FORINDISEQ(seq, el, num)
        enqueue_list(anclist, skey(el));
        enqueue_list(genlist, 0);
    ENDINDISEQ
    while (!empty_list(anclist)) {
        key = (STRING) dequeue_list(anclist);
        gen = (INT) dequeue_list(genlist) + 1;
        indi = key_to_indi(key);
        fath = indi_to_fath(indi);
        moth = indi_to_moth(indi);
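        /* add each parent not already seen to the result and to the work queues */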
        if (fath && !in_table(tab, pkey = indi_to_key(fath))) {
            pkey = strsave(pkey);
            append_indiseq(anc, pkey, NULL, gen, TRUE, TRUE);
            enqueue_list(anclist, pkey);
            enqueue_list(genlist, gen);
            insert_table(tab, pkey, NULL);
        }
        if (moth && !in_table(tab, pkey = indi_to_key(moth))) {
            pkey = strsave(pkey);
            append_indiseq(anc, pkey, NULL, gen, TRUE, TRUE);
            enqueue_list(anclist, pkey);
            enqueue_list(genlist, gen);
            insert_table(tab, pkey, NULL);
        }
    }
    remove_table(tab, DONTFREE);
    remove_list(anclist, NULL);
    remove_list(genlist, NULL);
    return anc;
}
dsblank 2010-12-05T06:36:16-08:00
Greg,

The article that you link to was not written by a computer scientist, but by someone with only a passing knowledge of recursion. However, his main point is valid: one should always try to decrease the number of queries on a webserver (I've posted a further response at the website, awaiting moderation). The argument doesn't necessarily hold on a desktop (rather than a server)... or at least the effect is diminished.

I absolutely agree with Tom here: paraphrasing, this is the wrong level of analysis for BG to even consider. There are pros and cons to relational vs. hierarchical, but that is irrelevant to the discussion here.

BTW, Gramps uses a hierarchical database. This translates fine into a relational database (which I have done for a web-based version of Gramps, called http://gramps-connect.org/). However, in order to be efficient, many things must change in the handling of the data. A single lookup in the hierarchical version of Gramps is a few dozen touches of the database; in the relational form, it was over 700 queries. This is not to say that the relational version won't work, but it requires a completely different mindset when accessing the data. If you look at the Gramps-Connect link above, you'll see a system that will be fast regardless of the number of entries. It should be able to handle millions of people.

-Doug

PS:

Tom's code of recursively traversing the tree of ancestors will look the same in pretty much any system. Depending on where append_indiseq puts the data, it is either a breadth-first or depth-first search. Table must be a HashTable, I presume, to keep track of visited nodes. Very clean!
ttwetmore 2010-12-05T07:13:03-08:00
Doug,

As you surmise, the ancestor_indiseq function implements a basic, breadth-first traverse of the ancestor tree using two queues, one for the ancestors and one for their generation numbers. It's always nice when someone can appreciate sweet code! I demonstrated this code because it is analogous to the problem mentioned by Greg's URL, and I wanted to show how easy it is in a non-relational, non-recursive context.

A question might be how many of these custom functions a genealogy program would need and whether they could all be anticipated up front. In LifeLines there is a programming subsystem that makes the genealogical operations all available in a programming language, and I think Gramps has exactly the same or better programming and plug-in features. With the programming feature users can implement any algorithms they like on the data. As with Gramps, there are now 100s of LifeLines programs in use that do all kinds of things. But now we're even further away from BG.

Tom W.
greglamberson 2010-12-05T11:20:54-08:00
Tom and Doug,

Thanks for the responses. Yes, this is far outside the realm of BetterGEDCOM, but since no aspect of GEDCOM requires significant recursion be considered, I wanted to see what the response would be since most programs use relational databases.

More than anything, I just wanted to see what you'd say, because with your model, this sort of issue becomes relevant.
AdrianB38 2010-12-05T13:43:49-08:00
Evidence and Conclusion alternatives
I'm still battering my head about how to (non-destructively) combine 2 evidence people, each with characteristics (PACTs if you like) and events into 1 conclusion person in a tree. What has come into my head is an alternative means of accomplishing the same (I hope) thing. First let me summarise how I think the tree of Evidence and Conclusion persons works, just to check in case you have a different idea:

- Source S1 is entered in a fashion similar to current GEDCOM - the text is just copied as text;
- The Evidence from S1 is summarised and the data used to enter a person, call him John-Smith-E1, because he's an evidence person;
- Source S2 is entered in the same fashion;
- The Evidence from S2 is summarised and the data used to enter another person, call him John-Smith-E2;
- Some analysis goes on - John-Smith-E1 and John-Smith-E2 are the input evidence persons. The output is a 3rd person John-Smith-C1, a conclusion person containing the accepted data from 'E1 and 'E2. Characteristics and Events are (I think) copied from 'E1 and 'E2 to 'C1.

Repeat for each source until, if you have 15 sources for John, you probably have John-Smith-C14, which is a combination of John-Smith-C13 and John-Smith-E15, and is sitting on top of
John-Smith-C13, which is sitting on top of John-Smith-C12 and John-Smith-E14, plus
...
Etc

Now, I'm concerned by several things:
- the sheer volume of the number of person entities here;
- the resulting time to navigate the tree, from top to bottom;
- the fact that the "database" is full of stuff that is superseded and needs to be filtered out in any query or report;
- the fact that if I realise that John-Smith-E7 (say) wasn't mine after all, then I throw away John-Smith-C6 upwards and have to rebuild the later ones, 'C6, 'C7, etc.
- the fact that every time I need to write out how I think the analysis process works to ensure the readers understand the model. Describing a high-level data model through the processes that operate on it instinctively suggests the processes are not well known and the model is dangerously fragile.

OK - I could be talking nonsense - the software developers might understand all this; there might be no issues of volume or speed; I could be misunderstanding totally what's being proposed.

Suppose there is a simpler way that still stores the evidence, the analysis, and enables reversion? Suppose that instead of keeping the superseded "records" in the main database, we simply journalize them in some fashion that I seriously don't understand because it probably needs me to understand relational or hierarchical DB journalising... I reckon it would work like this:

- Source S1 is entered in a fashion similar to current GEDCOM - the text is just copied as text;
- The Evidence from S1 is summarised and the data used to enter a person, call him John-Smith-E1, because he's an evidence person;
- Source S2 is entered in the same fashion;
- The Evidence from S2 is summarised and the data used to enter another person, call him John-Smith-E2;
- Some analysis goes on - John-Smith-E1 and John-Smith-E2 are the input evidence persons. The output is a 3rd person John-Smith-C1, a conclusion person containing the accepted data from 'E1 and 'E2. Characteristics and Events are copied from 'E1 and 'E2 to 'C1;

OK nothing changes yet - that's because I seriously want to preserve the concept of coding the evidence from the source and I'm doing it by creating a person. Now...
- Source S3 is entered in the same fashion;
- The Evidence from S3 is summarised and the data used to enter another person, call him John-Smith-E3 (still no change);
- Some analysis goes on - John-Smith-C1 and John-Smith-E3 are the input evidence. We _update_ John-Smith-C1 with the new conclusions from this analysis process. Before each update of a "record" in our database we copy the contents of the old "record" to the journal file.

So how does this address my concerns?
- there are still a large number of evidence persons, but they are the minimum necessary to codify the contents of a source. There is only one conclusion person per carbon-based life-form.
- the tree has shrunk in height;
- the superseded stuff is minimised to the one-source-only evidence persons;
- the concept of journalising old entries for reversion is surely clear to software guys, and the use of the evidence persons to extract data from the source seems clear enough;

What about if I realise that John-Smith-E7 wasn't mine after all? This is tricky - if 'E7 gave us just one characteristic (e.g. red hair), then we could roll back that physical appearances characteristic record on its own, without disrupting what we did with 'E8, 'E9, etc. This might _not_ work if 'E9 was identified to be mine because the source said "He was the only red-head in the village" and I used that to identify the source as referring to mine. The previous tactic of wiping out everything that came after is a pain but it's also safe.

Two things - we need entity types to record the analyses that are going on and if I decide to export my full details on a BG file, I'd want to export the journals as well, to show the intermediate stages.

Guys - please take this as a "Let's just look and see if there are any alternatives" offering. I could be dangerously underestimating the difficulties of journalising or missing something.
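
Just to make the journal idea concrete -- this is only a rough sketch with an invented record layout, not any real DBMS journaling -- the rule is simply: before each update of the conclusion "record", copy the old contents to the journal, so reverting means restoring the newest journal entry:

#include <stdio.h>
#include <string.h>

#define JOURNAL_MAX 16
#define TEXT_MAX    128

/* the single conclusion-person "record" that gets updated in place */
struct Conclusion {
    int version;
    char text[TEXT_MAX];
};

/* superseded versions go here instead of staying in the main database */
static struct Conclusion journal[JOURNAL_MAX];
static int journal_len = 0;

/* copy the old record to the journal, then apply the new conclusions */
static void update(struct Conclusion *c, const char *new_text)
{
    if (journal_len < JOURNAL_MAX)
        journal[journal_len++] = *c;            /* journalise the old version */
    c->version++;
    strncpy(c->text, new_text, TEXT_MAX - 1);
    c->text[TEXT_MAX - 1] = '\0';
}

/* reversion: restore the most recent journalled version */
static void revert(struct Conclusion *c)
{
    if (journal_len > 0)
        *c = journal[--journal_len];
}

int main(void)
{
    struct Conclusion john = {1, "John Smith, b. abt 1850 (from E1, E2)"};

    update(&john, "John Smith, b. 12 May 1850, Hopkinsville (added E3)");
    update(&john, "John Smith, b. 12 May 1850, red hair (added E7)");
    printf("current : v%d %s\n", john.version, john.text);

    revert(&john);                              /* E7 turned out not to be mine */
    printf("reverted: v%d %s\n", john.version, john.text);
    return 0;
}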
dsblank 2010-12-05T14:23:34-08:00
Adrian,

Rather than attempting to figure out what underlying mechanism one would use, I suggest that you just specify the *functionality* of what you want to happen. That is, specify all the different things that you want to be able to do with the data (i.e., "use cases"), and that will tell us what needs to be in the data model, and the file transfer protocol, to support that.

In Software Engineering, this is a Functional Requirements Document:

http://en.wikipedia.org/wiki/Functional_requirement

and is one of the first steps of any design.

You could probably develop your example above into a full Use Case. It might be handy to have a section in the wiki to list these.

-Doug
AdrianB38 2010-12-05T14:41:41-08:00
Doug - yes, thank you for that gentle reminder. I, of all people, should actually understand that, given my previous role but one. Err - the trouble is that this Wiki is still in brain storming mode and I got carried away.
ttwetmore 2010-12-05T14:44:12-08:00
Adrian,

Let me try to take a crack at some of your concerns ...

"I'm still battering my head about how to (non-destructively) combine 2 evidence people, each with characteristics (PACTs if you like) and events into 1 conclusion person in a tree."

Ooo, sounds painful. Can't have head battering among our most prodigious model makers.

My first comment on your example is about your assumption that the person trees must be binary. If you have E1 through E16 John Smiths, you can wait and create C1 John Smith from all of them at once. The source on C1 should state why you are asserting that all 16 are the same. I did not intend to imply that the trees had to be binary and that every decision made had to be to join just two records.

"Now, I'm concerned by several things:
- the sheer volume of the number of person entities here;"

Not a problem. You have exactly the right number of evidence persons, and the right number of conclusion persons, and only if you build up trees with many layers will you have any extra records. In almost all cases there will be none of the middle persons, so there will be no extra persons at all, and I don't think there'd ever be a worrisome number of them.

"- the resulting time to navigate the tree, from top to bottom;"

Not a problem. Computers are FAST, can whip through trees with thousands of records in way less than a second.

"- the fact that the "database" is full of stuff that is superseded and needs to be filtered out in any query or report;"

My thoughts on this are simple. For reports we never want to show persons that are not at the roots of trees, UNLESS the report is for the express purpose of showing the evidence structure.

"- the fact that if I realise that John-Smith-E7 (say) wasn't mine after all, then I throw away John-Smith-C6 upwards and have to rebuild the later ones, 'C6, 'C7, etc."

In most cases I see this is "just software".

"- the fact that I need every time to write out how I think the analysis process works to ensure the readers understand the model. Describing a high-level data model through the processes that operate on it, instinctively suggests the processes are not well known and the model is dangerously fragile."

I hope my comment on the binary tree issue may have made you more comfortable about this point. There are going to be far fewer "analysis points" that need to be documented than you were thinking before.

"OK - I could be talking nonsense - the software developers might understand all this; there might be no issues of volume or speed; I could be misunderstanding totally what's being proposed."

I don't see any nonsense in your concerns, but I think they have been mostly addressed.

Now I want to comment on this statement you made:

"- Some analysis goes on - John-Smith-E1 and John-Smith-E2 are the input evidence persons. The output is a 3rd person John-Smith-C1, a conclusion person containing the accepted data from 'E1 and 'E2. Characteristics and Events are (I think) copied from 'E1 and 'E2 to 'C1."

It's the "I think" part. This is a key point you have singled out. Do we copy? If we copy what do we copy? If we don't copy what do we do?

Here's my take on this, based primarily on the combination algorithms I've written in the past. These algos build these trees with a vengeance, taking 100s of millions of e-persons and creating 10s of thousands of c-persons.

Let's say you copy everything upwards. To me this means every PACT, every role reference, everything, is copied upwards to the higher level persons. Yikes. No sense in doing that, because it's dirt cheap to traverse the tree when needed to gather up the data. So answer one is we don't copy up.

Aha, you say, "if you're not going to copy the info up, WHAT IS THE NAME OF THE ROOT PERSON?" I'm assuming that you might have combined John Smiths with Jon Smithes with Johnathan Smythes with J. Smiths and so on. Every evidence person has a single name (not necessarily! -- what if the evidence used two names?), and the more evidence persons you have for a person the more likely it is that they won't all have the same name.

I see two answers to this. The first is the one I used in the automatic algorithms, where there could be no human intervention. In this case each higher level person had a distribution created for each PACT. So the name PACT might have this info in it:

John Smith -- 12 occurrences
John Smithe -- 1 occurrence
Jon Smith -- 1 occurrence

That was really all there was to it. When it came to showing the person you use the most common member of the distribution, but when you calculate statistics for use in deciding the next levels of combination you take into account the full distribution.

I guess you could say this really is copy up, and I guess it is, but I think of it as building the distribution of all below.

Okay, not much sweat for names. For me the quintessential example of this problem is what happens when E1 has the birth date and E2 has the birth place. You'd sure like any report to show the person's birth with the date and the place. Well, I see this as very similar to the name issue. You're still creating distributions, but now of things that are slightly more complex than just name strings. Well, software geeks are, if not smart, at least persistent, and this isn't too hard.

Aha, but what if the distribution answer is NOT the one you really believe to be the final answer? (This is the second solution mentioned above.) To me this is the only time where a user has to get involved and enter any extra info during the process of making inferences (other than justifying the inference, that is!). In cases like this I think the user should be able to enter PACTs directly on the c-person, and it is these PACTs that are used in reports.

No comments on your journalizing ideas yet because I don't fully grasp the concept, but I hope my answers alleviate your concerns somewhat.

Tom Wetmore
AdrianB38 2010-12-06T13:30:06-08:00
"If you have E1 through E16 John Smiths, you can wait and create C1 John Smith from all of them at once"
OK - seems logical, I initially thought you didn't want to blow people's logic circuits with the concept of a tree that deep. However, I suspect that most people will add a source, and want to fold it into the current conclusion person almost immediately, thus forming a new conclusion person, higher up the tree. I know I certainly would on the basis that I'd forget to do it otherwise... This probably wouldn't apply if I'd got the family members already pretty much defined and I had a list of (say) 3 census sources to load - they could be concluded in (to create a verb) in one go. So without disputing your comments on CPU power, I _personally_ think "In almost all cases there will be none of the middle persons" simply won't be true. Most of us can't delay gratification that much.

I said "if I realise that John-Smith-E7 (say) wasn't mine after all, then I throw away John-Smith-C6 upwards and have to rebuild the later ones, 'C6, 'C7, etc."

You said "In most cases I see this is 'just software'". I can't see that it is - if I'm _manually_ analysing this data (and that's _my_ assumption), I have to look at the analysis (to see if it used the disproved data) and rebuild it manually. Admittedly I have a similar issue in my proposal, but I wasn't envisaging chopping the tree off at a level, only removing specific bricks. Then reviewing manually.

"I think they have been mostly addressed" I'm not as convinced as you about tree depths but I'm willing to take note of people with experience of the type of models necessary. Having come from a mainframe / high-power server background I am always concerned not to use my expectation of power in the wrong place!

Your comments about automatic algorithms examining distributions are interesting, and if you followed that route, then yes, you could have a skimpy tree higher up. I guess my concerns about such algorithms are roughly this:
- I'd like to see the things working on the full complexity of real-life first. I suspect you've seen it working on lots of data but did it all combine together with the same complexity as real life? I suspect given the right algorithms and power, there'd be little problem. (I come from England, and the first (and only, so far) place I visited in the US was St Louis in Missouri, which is known I think as the "Show Me" State, an attitude I have sympathy with, so excuse my need to be convinced!)
- I tend to like to write notes, often extensive, against each event or PACT. I can see how I'd add them to the top c-person - so the auto-generation would need to leave that note alone (these are notes inside each event or PACT). However, this presumably shouldn't be an issue with the right design.
- the serious concern I have is that your auto-generation concept is very much application oriented and very much ahead of what current software does. Further, it's both data-model and algorithm based. So why shouldn't we advance the breed? I think it's a case of what's doable psychologically - if the next BG-compatible generation of just about any genealogy app _has_ to be converted to auto generation, we're going to get minimal enthusiasm for the upgrade. We need to come up with a BG data model that can be used in a manner that advances the breed but also can be used in a more conventional fashion. So - will it fly with tall conclusion trees and lots of manual input? I'm still not sure - but you make a very convincing case for the auto-generation mode.
AdrianB38 2010-12-06T13:45:20-08:00
Tom - one other thing, which is a slightly mischievous question on my part. How well would your concepts work if the app designers used an RDBMS such as MySQL? I think I remember you saying Hierarchical DBMS made more sense - but what if they don't follow that advice? Does it still work, albeit slightly more slowly?
ttwetmore 2010-12-06T18:25:55-08:00
Adrian,

Your longer post deserves more attention than I can give it now but I can respond to your mischievous one first.

My original opposition to relational databases for genealogical databases (first expressed more than 20 years ago) stemmed from the fact that columns in tables have/had to be regular for relational to work. Take a name and a date and a place as three examples. In genealogy these three data types are CRUCIAL, but these three data types are VERY IRREGULAR THROUGH HISTORY AND CULTURE. All early genealogy database systems (and many current ones) force names into the concepts of given and surname, sometimes with prefixes and suffixes; force dates into the concepts of day, month, and year; force places into concepts of city, county, state. This regularity might fit most LIVING citizens of the United States, but genealogists are primarily interested in DEAD PEOPLE who came from all over the place, spoke many languages, had weird kinds of names, expressed dates in a gawd-awfully large number of ways, and could use 123 different ways of describing the same place. Trying to regularize this kind of data to fit into a mold thought up by some 21st century American/English/Norwegian/Czech software geek is exactly the same thing as forcing square pegs into round holes. This forced regularity is RESPONSIBLE FOR MUCH OF THE DATA-LOSSY NATURE OF CONVERTING TO AND FROM GEDCOM by genealogical programs -- irregular data expressed in GEDCOM must be regularized into relational tables on import, and that regularized, vanillaized data is put into GEDCOM records during export. This has always been my key objection to relational databases. I have completely avoided the BG discussion on name, date, and place standards because this is where the most naive solutions get proposed and where I literally cringe when I read them, as they are almost always based on some naive form of regularization. Early on someone always proposes that we should replace places with latitudes and longitudes. Yikes. Everybody wants to regularize names, dates, and places. Well I do too, but the point is you have to let these three concepts be what they are, not what you want them to be. You can't force them into something else. Relational databases almost always force them. Enough on that before I start foaming.

There are techniques that can solve these issues, and some software designers have caught on. For instance, a name, a date, and a place can just be pure strings of characters, and special "expert" software modules can be used to parse them and try to figure them out. It can be a real challenge to try to sort these irregular things. A relational database could store a date as two things: a raw string representing the date exactly as it was expressed, and a sorting key that is computed from the raw string (by admittedly sophisticated software) that allows for ordering and searching.
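
As a sketch of that two-part idea (this is only an illustration; a real date "expert" module would be far more careful), a record could keep the date exactly as written and carry a crude computed sort key beside it -- here, just the first four-digit year found in the text, with a sentinel when nothing can be recognized:

#include <stdio.h>
#include <ctype.h>

#define NO_YEAR 9999   /* sort key for dates we cannot interpret at all */

/* derive a crude sort key from a raw date string: the first 4-digit number */
static int sort_key(const char *raw)
{
    for (int i = 0; raw[i]; i++) {
        if (isdigit((unsigned char)raw[i]) &&
            isdigit((unsigned char)raw[i + 1]) &&
            isdigit((unsigned char)raw[i + 2]) &&
            isdigit((unsigned char)raw[i + 3])) {
            return (raw[i] - '0') * 1000 + (raw[i + 1] - '0') * 100 +
                   (raw[i + 2] - '0') * 10 + (raw[i + 3] - '0');
        }
    }
    return NO_YEAR;
}

int main(void)
{
    /* the raw strings are stored untouched; only the key is computed */
    const char *dates[] = {
        "12 May 1912",
        "about 1903",
        "between 1870 and 1910",
        "a Thursday after the big snow storm in March a few years ago",
    };
    for (size_t i = 0; i < sizeof dates / sizeof dates[0]; i++)
        printf("%4d  <-  \"%s\"\n", sort_key(dates[i]), dates[i]);
    return 0;
}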

And now there are some other solutions. For example, you could keep all your person records in a two-column relational table, where one column is the key and the other is the full text of the XML or GEDCOM or some other syntax that encodes the full record. Then you can use some full-text scheme to search for records. I happen to like dates like "I think it was a Thursday after the big snow storm in March a few years ago", so I use them in LifeLines, but you won't get many other systems to deal with that. Or how about a place like "somewhere near the Montana and North Dakota border while traveling on the Big Sky Express Train."

If you can solve the name, date and place problem I think putting genealogical databases into relational tables can be made to work extremely well.

My point has always been, however, that genealogical data records are naturally hierarchical and naturally highly interrelated/networked, and it is the hierarchy and the interrelatedness that most determines the nature of the software required to process the records. Keeping the database hierarchical and networked in the form that is most natural for the software has always seemed just more natural to me.

I think that there are excellent examples of genealogical systems using relational databases, and excellent examples of systems using non-relational databases. If the problems with assumed regularity in the columns can be worked out, I no longer think that relational databases are wrong for genealogy. It's just that I'm too old to switch and too set in my ways to ever really feel good about relational databases in genealogy.

Tom W.
greglamberson 2010-12-06T18:42:12-08:00
Tom,

Yeah. I agree.

Later in the week I'll have more time to respond to this and other threads.
ttwetmore 2010-12-07T02:35:21-08:00
Trying to respond to Adrian's response to my first response to his list of important points and questions:

"If you have E1 through E16 John Smiths, you can wait and create C1 John Smith from all of them at once"
OK - seems logical, I initially thought you didn't want to blow people's logic circuits with the concept of a tree that deep. However, I suspect that most people will add a source, and want to fold it into the current conclusion person almost immediately, thus forming a new conclusion person, higher up the tree. I know I certainly would on the basis that I'd forget to do it otherwise... This probably wouldn't apply if I'd got the family members already pretty much defined and I had a list of (say) 3 census sources to load - they could be concluded in (to create a verb) in one go. So without disputing your comments on CPU power, I _personally_ think "In almost all cases there will be none of the middle persons" simply won't be true. Most of us can't delay gratification that much.

NEW>> Even if the trees get deep for certain of your key persons, I still don't see a reason for concern. For this multi-evidence to one-conclusion model to work there have to be some new user interface methods to help it work. Imagine a window in which you can see the structure of these trees laid out. Imagine a user interface that lets you rearrange the tree, lets you drag and drop into or out of the tree, that keeps track of the nodes in the tree where your changes have implications for the sources. I don't think there is any software that supports this kind of thing yet. Yet I don't think there is anything really difficult about this kind of software.

I said "if I realise that John-Smith-E7 (say) wasn't mine after all, then I throw away John-Smith-C6 upwards and have to rebuild the later ones, 'C6, 'C7, etc."

You said "In most cases I see this is 'just software'". I can't see that it is - if I'm _manually_ analysing this data (and that's _my_ assumption). I have to look at the analysis (to see if it used the disproved data) and rebuild it manually. Admitted I have a similar issue in my proposal but I wasn't envisaging chopping the tree off at a level, only removing specific bricks. Then reviewing manually.

NEW>> Yes, it will be manual all (or most) of the time. So what if modifying that tree by removing C6 were as easy as putting your mouse on the C6 (inside a visual, graphical representation of the whole tree) and dragging it out of its larger tree, and having the software properly suture up what is left while it turns that single tree into two trees? That's what I mean by "it's just software". After you do the drag-out operation the software knows which of the source references you might have to modify and gets you to do that, though in fact this might not be necessary. Needs some examples worked out to see all the implications. Again there is no software around that does this stuff now that I am aware of. We need to ask ourselves should BG anticipate the fact that some software someday will properly support evidence and conclusion and have a model ready for it? I believe this to be true, but the collective wisdom of BG might decide it is not. I've been slowly heading toward these new user interface ideas in DeadEnds, which is why I have to make sure the DeadEnds model is adequate for that.

"I think they have been mostly addressed" I'm not as convinced as you about tree depths but I'm willing to take note of people with experience of the type of models necessary. Having come from a mainframe / high-power server background I am always concerned not to use my expectation of power in the wrong place!

NEW>> You are right not to trust my statements without your own analyses. It's true that I can't point to any existing system that does the kinds of things I think should be done.

Your comments about automatic algorithms examining distributions are interesting, and if you followed that route, then yes, you could have a skimpy tree higher up. I guess my concerns about such algorithms are roughly this:
- I'd like to see the things working on the full complexity of real-life first. I suspect you've seen it working on lots of data but did it all combine together with the same complexity as real life? I suspect given the right algorithms and power, there'd be little problem. (I come from England, and the first (and only, so far) place I visited in the US was St Louis in Missouri, which is known I think as the "Show Me" State, an attitude I have sympathy with, so excuse my need to be convinced!)
- I tend to like to write notes, often extensive, against each event or PACT. I can see how I'd add them to the top c-person - so the auto-generation would need to leave that note alone (these are notes inside each event or PACT). However, this presumably shouldn't be an issue with the right design.
- the serious concern I have is that your auto-generation concept is very much application oriented and very much ahead of what current software does. Further, it's both data-model and algorithm based. So why shouldn't we advance the breed? I think it's a case of what's doable psychologically - if the next BG-compatible generation of just about any genealogy app _has_ to be converted to auto generation, we're going to get minimal enthusiasm for the upgrade. We need to come up with a BG data model that can be used in a manner that advances the breed but also can be used in a more conventional fashion. So - will it fly with tall conclusion trees and lots of manual input? I'm still not sure - but you make a very convincing case for the auto-generation mode.

NEW>> I know auto-combining is possible since I've seen it in action. (I also know it's possible because of all those green leaves that show up when you use Family Tree Maker!) The only question in my mind is whether we should do it in genealogical software. But think about it this way. Think about running an auto-combination algorithm but NOT doing the resulting combination. Instead think of a user interface that shows the results of the combination as a suggested tree on the screen and lets the user decide whether or not to use it, lets the user rearrange that tree or completely ignore it, or remove parts of it, and only modifies the database where the user approves. Think of the combination algorithm as NOTHING MORE THAN A SOPHISTICATED DUPLICATE RECORD FINDER! I'm with many people who don't want any algorithm to change the state of my database without my absolute and reasoned approval. There has been criticism of the auto-combining algorithm idea because people are not yet thinking about how it could be made to work in a world with more advanced software. I think in this case it's my fault for stressing the auto-combination aspect as a geeky marvel, rather than stressing how it could be made to support the genealogical research process. In the application where it is used there is so much data that human intervention would render the application impossible. In the genealogical application this is not true. The same algorithms apply, but the resulting changes to the database should not be automatically performed.

Tom W.
AdrianB38 2010-12-07T13:26:09-08:00
Tom - re hierarchical and relational databases. Interesting, from several aspects....
AdrianB38 2010-12-07T13:44:58-08:00
Tom: "Imagine a user interface that lets you rearrange the tree".
A: Definitely interesting - albeit I think for it to work usefully, you would need to minimise user intervention and major on automatic??

Tom: "We need to ask ourselves should BG anticipate the fact that some software someday will properly support evidence and conclusion and have a model ready for it? I believe this to be true, but the collective wisdom of BG might decide it is not. ... I'm with many people who don't want any algorithm to change the state of my database without my absolute and reasoned approval"

A: I'm trimming that last bit down a lot, with the final sentence left in as a "case for the defence". To me the question "We need to ask ourselves should BG anticipate the fact that some software someday will properly support evidence and conclusion and have a model ready for it?" is very important. I see 2 points that spring to mind:
(1) Much of this discussion is about the ways in which the application might work and therefore, in a strict sense, outside the scope of the creation of a definition for BG and / or the BG data model. Granted but...
(2) We need to build for the future, in my view. Part of that future is surely evidence management. I'm not convinced we have any sort of a consensus on what that means in terms of user-requirements, other than a few phrases like "non-destruction".
On the other hand we have a number of ideas about what EM might be in various guises and the only way that I can think of testing our BG data model etc ideas out for their usefulness with EM is to see how the data model ideas support the various software "designs".
Or, in other words, if anyone thinks we're going too far away from just data modelling - yes, but I think it's important.

I would not envisage that BG 1.0 supports EM in its full glory - but I'd like to see some "stubs" in the model, so that we might feel reasonably confident that we don't need to rip up entity types and break apart relationships to add EM in.
ttwetmore 2010-12-07T14:21:44-08:00
Adrian,

I agree with your numbered points. We do talk about applications a lot, but I think we need to in order to anticipate the needs of modeling.

The best way to anticipate how managing evidence should influence the model in my opinion would be to get some scholarly genealogists to express exactly how they do things, and then shape and evaluate our proposed models as to how well they support their processes. I'm trying to avoid terms like "use cases" and "requirements analysis", but they are hard to avoid if we wish to do things right. Or maybe there's a book we could all read.

Tom W.
ttwetmore 2010-12-07T06:30:20-08:00
Between Destructive and Non-destructive Merging
We've discussed person merging/combining as if it has to be either destructive or non-destructive. There is a middle ground technique that I and many others have used in GEDCOM for a long time. As a quick recap, destructive merging occurs when one or more records are merged into one in such a way that the original records cannot be reconstructed later. In non-destructive merging, as we've been discussing it, merging occurs by grouping the merged records together and creating another record that holds references to all the group members, none of which are ever removed. There are disadvantages to both. The primary disadvantage of destructive merging is obviously the destruction of information, and the primary disadvantage of non-destructive merging is the large number of records that can be involved and the complexity of working with potentially large, tree-based data structures.
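
(To picture the non-destructive variety concretely: the grouping record might look roughly like the sketch below, where @E1@ and @E2@ stand for evidence person records that are kept unchanged, and the _EVID tag is made up purely for illustration.)

0 @C1@ INDI
1 _EVID @E1@
1 _EVID @E2@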

Before continuing let me show a person record taken directly from my master LifeLines database (LifeLines records are pure GEDCOM that don't have to adhere to any official standard):

0 @I2298@ INDI
1 NAME Emeline Adela /Wetmore/
  2 SOUR JCW, pg 270
  2 SOUR Daphne Baird Wetmore research.
1 NAME Emaline Adele /Wetmore/
  2 SOUR Anne Marie Flewelling
1 NAME Emeline Adele /Wetmore/
  2 SOUR Unknown
1 SEX F
1 BIRT
  2 DATE 10 May 1843
    3 SOUR JCW, pg 270
    3 SOUR Daphne Baird Wetmore research.
  2 PLAC Bloomfield, Kings County, New Brunswick
    3 SOUR Daphne Baird Wetmore research.
  2 PLAC Norton, Kings County, New Brunswick
    3 SOUR JCW, pg 270
1 DEAT
  2 DATE 1917
    3 SOUR 16pp
    3 SOUR Daphne Baird Wetmore research.
  2 PLAC Saint John, New Brunswick
    3 SOUR Daphne Baird Wetmore research.
1 BURI
  2 CEME Christ Church Anglican Cemetery
  2 SOUR Daphne Baird Wetmore research.
1 FAMC @F602@
1 FAMS @F657@
1 FAMS @F5069@

So this is GEDCOM and GEDCOM does not officially support having both evidence and conclusion persons. Every person must be thought of as a conclusion person. This record has SOUR lines all through it. You can tell from the way I use SOURs that I don't maintain SOUR records in the database, I just use text strings that name the sources.

For this person I have found three different names, one birth date, two birth places, one death date, one death place, one cemetery, and all from a combination of four different sources and an unknown source I didn't record. And also note that I didn't record any sources for her sex or for her three family pointers. Nobody's perfect.

So think about this record. Is there enough information in here to reconstruct the evidence it was based on? Well, maybe not quite, but there is certainly something intriguing about the idea and maybe we can learn something from this.

What makes it possible to even consider reconstituting the evidence from this record? It's pretty obviously the presence of SOUR lines wherever we want them. (I believe the only things that are not GEDCOM 5.5 compliant in this record are putting SOUR lines on DATE lines and using the CEME tag, but we can forgive 5.5 for those omissions. LifeLines doesn't restrict tags or their placement.)

An important point about this technique is that you have to place the SOUR lines at the proper spots in the record so they apply only to the parts of the record that they should. For example here is another example record from my database:

0 @I259@ INDI
1 NAME Josiah /Dwight/
1 SEX M
1 BIRT
  2 DATE 19 February 1671 or 8 February 1670/1
  2 PLAC Dedham, Massachusetts
1 GRAD
  2 DATE 1687
  2 INST Harvard College
1 RESI
  2 PLAC Hatfield
1 RESI
  2 PLAC Woodstock
1 WILL
  2 DATE 1724
1 DEAT
  2 DATE 1744 or 1748
  2 PLAC Thompson, Windham County, Connecticut
1 SOUR Dwight Family History  
1 FAMC @F362@
1 FAMS @F49@

In this case I have only found one reference for this person so there is only one SOUR line and it is at level 1 to indicate that it "covers" the entire record.

Maybe you can anticipate where I'm going with this idea. If we allow every PACT (and sub-PACT and sub-sub-PACT) to have its own source reference/s, then, with discipline, we can non-destructively join records together, REDUCING their numbers, by having the software keep track of the sources for all the PACT pieces in the joined records. With this the original records can be reconstituted.

In my DeadEnds experiments I have to create a DeadEnds database to experiment with. (I have currently succumbed to convention and use XML as the format for DeadEnds databases.) But all I have is a GEDCOM database with lots of records with distributed SOUR lines like the above. So I have some custom code that reads through my GEDCOM records and "reconstitutes" multiple DeadEnds evidence records from single GEDCOM records when multiple SOUR lines are found.
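
To make the idea concrete, here is a sketch (illustrative only, not actual output from that code) of two of the evidence persons that could be reconstituted from the Emeline record above, one per source:

0 @E1@ INDI
1 NAME Emeline Adela /Wetmore/
1 BIRT
  2 DATE 10 May 1843
  2 PLAC Norton, Kings County, New Brunswick
1 SOUR JCW, pg 270

0 @E2@ INDI
1 NAME Emaline Adele /Wetmore/
1 SOUR Anne Marie Flewelling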

A big question I've had on the DeadEnds model for a long time is whether I should allow source references to appear everywhere as is almost true in the GEDCOM model. If I did then the DeadEnds model would also support this middle of the road, not quite destructive, and probably almost completely reversible form of record combining.

And isn't this almost the GRAMPS model? I'm not really sure about that, that is, whether GRAMPS is as anal about sources as one would have to be to make things truly reversible, but I think it is close to their model.

Tom Wetmore
hrworth 2010-12-07T07:21:23-08:00
Tom,

In that first GEDCOM, isn't that One (1) person "0 @I2298@ INDI" but that person has three (3) name entries, each Name entry with citations?

Russ
greglamberson 2010-12-07T07:25:06-08:00
Tom,

Your ideas are certainly intriguing, as usual. It is important to be able to source information in granular fashion, but this doesn't allow one to really differentiate between evidence and conclusions. One shouldn't have to "reconstitute" evidence.

I've had a concept on paper for at least two weeks that I hoped to have opportunity to introduce in discussion. I'm going to throw it up and let you guys rip into it, as we need some more fresh meat.
ttwetmore 2010-12-07T08:01:03-08:00
Russ,

Yes, one Person record with three Name pacts and sources for all three. GEDCOM is happy with this. My convention when using GEDCOM for these "combined" persons is to always put the preferred pact first. So my preferred name for this person is Emeline Adela Wetmore, and my preferred birth information for her is 10 May 1843 in Bloomfield, Kings County, New Brunswick. All of LifeLines's display and report generation features know to follow this scheme. The programming language gives full access to all information in the record, however, so detailed reports could list all the names, all the birth place alternatives, and so on.

I don't remember whether GEDCOM 5.5 allows multiple DATE or PLAC lines in an event structure, but I do use them heavily.

Tom W.
hrworth 2010-12-07T08:12:23-08:00
Tom,

Thank you.

Does that not suggest, that in the BetterGEDCOM, there should be a "Preferred" for any or most Tags?

I have many facts / events where there are multiple entries, and I choose which is my preferred. I do not think that there is a preferred marking in GEDCOM 5.5, but at the time 5.5 came out, I am guessing that not all applications allowed for multiples of the same tag in a file (I don't know or don't remember).

Having said that, if we were sharing our research, it might be nice to know which was YOUR preferred entry. I could then see, probably, how you came to that conclusion by looking at your source information.

Russ
ttwetmore 2010-12-07T08:14:20-08:00
Russ,

I think of each item of information, wherever in a genealogical database, as having a provenance chain one can follow through its source references. I don't worry at what point along that chain the distinction between evidence and conclusion occurs, so I don't put any tags in the records to say where it is along the spectrum. This may be too sloppy on my part.

And I agree that one should not have to reconstitute evidence. In the GEDCOM solution one MUST destroy the evidence as one builds up the conclusion, but this doesn't mean the approach I outlined would have to do that in BG. What if the evidence persons were sacrosanct, but the conclusion persons were the modifiable ones? I think this is actually very similar to Adrian's "copy up" idea.

Tom W.
ttwetmore 2010-12-07T08:19:05-08:00
On preferred tags, that sounds like a good idea. My convention is to always list the preferred first, but one might be able to come up with situations where this convention wouldn't be good enough, though I can't think of any right now.
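
If an explicit flag were wanted, a purely hypothetical sketch (this is not GEDCOM 5.5, and not something LifeLines writes) might use a custom subordinate tag on the preferred pact:

1 NAME Emeline Adela /Wetmore/
2 _PREF Y
1 NAME Emaline Adele /Wetmore/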

I always want things to be as simple as possible. As a software developer I know that any complexity in models can magnify many times over in complexity of software. Thus I tend to keep things out if I can. But I could go either direction on a preferred tag as well as a quality tag as well as an evidence/conclusion level tag.

Tom W.
hrworth 2010-12-07T08:44:16-08:00
Tom,

My thought, and this is an End User talking, is that IF an INDI has a Tag used more than one time, One Must be flagged as Preferred.

If a Tag is used One time, IT is Flagged as Preferred.

It seems that you have a convention that the first usage is preferred, but I think it should not be assumed that the first usage is preferred.

Whether or not that Flag is "observed" or acted upon or presented to the end user on the receiving end, would be up to the application.

Russ
AdrianB38 2010-12-07T13:56:08-08:00
Greg
Re "to really differentiate between evidence and conclusions. ... I've had a concept on paper for at least two weeks that I hoped to have opportunity to introduce in discussion"

I look forward to it. And, excuse me if I put my rigorous reviewing hat on, but perhaps you could start by defining what you mean by "differentiate between evidence and conclusions". I can sort of guess what it's about when you're just talking about the first cycle of directly building on sources, but if one has a chain of analyses, where the output conclusions from one iteration form the input evidence to the next iteration of analysis - what does "differentiate between evidence and conclusions" mean then for the records that are both evidence and conclusion? I can guess what I might say but that's just my guess....
ttwetmore 2010-12-08T13:54:20-08:00
DeadEnds Model in XML
Succumbing to the inevitable I have just rewritten the specifications for the DeadEnds data model using XML. You can find the specifications at:

http://deadendssoftware.com/DeadEndsModelXML1.pdf

Tom Wetmore
louiskessler 2010-12-12T08:11:42-08:00
Thank you, Tom. It is much more understandable (at least to me) that way.

Louis
ttwetmore 2011-01-24T02:07:12-08:00
DeadEnds Custom Import Function

The DeadEnds programs have a custom import function that lets users import data from files in many different formats. DeadEnds does have a native file format, as Better GEDCOM will, that DeadEnds normally uses to import data, but often users have data in different file formats with no easy way to convert them into DeadEnds format. The custom import feature allows users to use these data files, usually after some modifications, to import person and event records into the DeadEnds programs.

A similar feature would be valuable for any Better GEDCOM compliant program. Standalone utility programs that convert from user specified file formats into the Better GEDCOM native file format could be written using the same approach as I'm using in the DeadEnds feature, providing another third party marketing possibility in the Better GEDCOM world.

I have written a short document with a few examples of using the custom import function at:

http://deadendssoftware.com/DeadEndsImport.pdf

Tom
ttwetmore 2011-02-17T07:38:30-08:00
DeadEnds Date Formats

Most software applications require components that have been called "software experts" for the different specialized domains handled by the applications. For genealogical applications these domains include personal names, place hierarchies, source citations, and date formats. I have recently been deep into a redesign of the date expert module for the DeadEnds programs, so I wrote down the new date formatting rules that DeadEnds will support as a guide for the new parser I am writing. I have put a link to a short document that describes the format on the DeadEnds model page. Here is the link:

http://bartonstreet.com/deadends/DateFormats.pdf

which is the same as

http://deadendssoftware.com/DateFormats.pdf

Comments always welcome and appreciated.

Tom W.
ttwetmore 2011-02-18T19:40:24-08:00
I'm glad writing up a set of specs has got this discussion going!

Geir, what do you think computed dates should look like? Should they mention how they were computed? Would a new prefix, "computed", be adequate? I've been working lately on a very interesting little program that converts genealogical data collected in CSV files (where the C in my case means colon instead of comma) into either GEDCOM or XML. The program has a language that can be used to compute dates and other values. For example, I can add the following rule:

person.birth.date = event.date - event.person.role.age

which means, I hope fairly obviously, to compute (and place in a record) a person's birth date by taking the date of the event (that all the evidence is coming from) and subtracting off the age that that person has as a role player in the event. I've been experimenting with a file that has lots of data from the 1861 census of New Brunswick (now in Canada). I want a way to quickly extract tons of data in a format that is CUSTOMIZED TO THE EXACT NATURE OF THE EVIDENCE, and then have a software tool that can convert that data to the formats needed by an application. I have discovered that I can get data off Ancestry.com far faster by hand editing a custom CSV text file and then converting that file via software to GEDCOM or XML, than by sitting in front of any application and laboriously filling out new person screens. For the mathematicians out there it's pretty interesting to work through the algebra of date and age computation when, say, the event date is known exactly, but the age is only known to a year. How do you best guess the birth year based on when in the year the event happened?
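
As a sketch of that algebra (this is not DeadEnds code, just an illustration in C of the reasoning): the birth year must be either the event year minus the age or one year earlier, and if birthdays are assumed to be spread evenly over the year, the chance that the birthday had already passed by the event date is roughly the fraction of the year already elapsed.

// Illustrative only: estimate the birth year candidates from an exact event
// date and an age known only in whole years.
#include <stdio.h>

// Days in the year before the first of each month (non-leap year).
static const int daysBefore[] = {0,31,59,90,120,151,181,212,243,273,304,334};

static void guessBirthYear(int eventYear, int eventMonth, int eventDay, int age)
{
    // Fraction of the year elapsed at the event date; under a uniform-birthday
    // assumption this is the probability that the birthday has already occurred.
    double elapsed = (daysBefore[eventMonth - 1] + eventDay) / 365.0;
    printf("born %d (probability %.0f%%) or %d (probability %.0f%%)\n",
           eventYear - age, 100.0 * elapsed,
           eventYear - age - 1, 100.0 * (1.0 - elapsed));
}

int main(void)
{
    guessBirthYear(1861, 3, 6, 29);    // a census taken 6 March 1861, person aged 29
    return 0;
}

For a census taken 6 March 1861 and an age of 29 this prints "born 1832 (probability 18%) or 1831 (probability 82%)", which shows why an "about" qualifier is needed on whichever single year gets chosen.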

On the D +/- 1 issue, my current code handles "before", "after", "on or before" and "on or after". I've been debating getting rid of the two "on or" variants, which is why I left them out of my write-up. The precision issue may argue to keep them.

I really don't like ddmmyyyy or yyyymmdd, but there's nothing new about that. I know you can argue that Better GEDCOM doesn't need to be human readable, but my answer to comments like that is always something like, "yeah, but if you can make it readable with almost no effort, why not?" since many people are going to want to look at the files and understand them. I know I look at GEDCOM files all the time.

I think Geir is suggesting that dates be treated as just strings, but that we can add a sorting date to the date to allow us to sort it. Being an English bigot, and also being a good programmer, I have never flinched from the task of parsing out "arbitrarily" formatted date strings searching for what looks like a good date within them. That's what my old LifeLines program does, and it does it pretty well. The whole thing that got me writing the spec was to come up with a simple grammar for doing that task. So I really don't mind the problem of taking user strings and trying to separate the chaff from the wheat. Maybe one can't expect other programmers to be so masochistic. The issue in my mind would be affected by how easy or hard it would be to come up with similar formats to use in other languages.

All that being said, Geir may be onto something important with the sorting date concept. It's got me thinking this thing over again. Always more issues!
gthorud 2011-02-19T06:11:33-08:00
Tom,

The main use for CALCULATED here is when Latin church dates are used in church (parish) records in the 18th century. By means of a number of tables (specific to each country), you can look up the numeric date. CAL is the alternative in Gedcom that comes closest to this lookup process; there are calculations behind the tables. But since the input to the calculation is, as I understand it, encoded in a string that is transferred together with the resulting numeric date, the type of calculation can be inferred from that string, e.g., if the string contains a Latin date, the receiver will understand what's going on.

The other use may be, as you mention, to calculate the approximate birth year based on age in a probate or census. Some would want to store that year as a sort date, and some as a calculated date (to be printed). In the latter case, how do you get the receiving system to understand and print the year as a birth year if the event is a probate – you can not in general expect all CALculated dates to be birth dates. It could be handled by the prefix/postfix in the discussion that I referred to, but those are language dependent. What you want to say in a report is something like "Peter, 4 years old (born about 1766)". But there could be problems if the rules for outputting these dates (the format), built into some sentence-producing engine, vary between users/programs, and maybe the calculation in the case of age is best done in the receiving system.

Off topic:
Re your conversion program.

As I have said earlier, two years ago I worked on a book listing all families on more than a hundred farms in a parish – from approximately 1600 until today. I have most church records for that parish transcribed in databases. What I would have liked to have when doing that work was a program where I could record my findings (a traditional genealogy program) AND in the same program store records of all births, marriages etc. in the parish, in a way where I could link the evidence/conclusions to the source record.

If the church records were converted to Gedcom and imported into separate data sets/projects in the program, I could go through those sets to find e.g. all births that had not been linked to an evidence/conclusion record in the set/project where I record my work. Also, if I find a birth that may fit some requirements, I could discover that that birth record had already been linked to another person. I can buy a Norwegian program that can do this but it costs at least 10,000-20,000 USD; it is used by some professional projects (and it can't export to Gedcom). I have not checked, but I expect online services to do something like this already. I would like to be able to do this in my own program, and the feature needed is basically to be able to store links between records in different Gedcom files – which should be a relatively simple thing to design based on some unique IDs.

Geir
ttwetmore 2011-02-19T07:11:39-08:00
Geir,

Responding to your point: "The other use may be, as you mention, to calculate the approx birth year based on age in a probate or census. Some would want to store that year as a sort date, and some as a calculated date (to be printed). In the latter case, how do you get the receiving system to understand and print the year as a birth year if the event is a probate – you can not in general expect all CALculated dates to be birth dates."

The formula I put in my email answers your question (I have modified it a bit here):

person.birth.date = "computed about " event.date - person.role.age

I don't want to go into great detail here, but these formulas are used when converting data collected from evidence into genealogical records, either in GEDCOM or XML form. So imagine that I have a data file with custom formatted lines about events and their role players. Say the event line indicated the event occurred on 6 March 1861, and say one of the person lines said the person was aged 29 at the time of the event. The formula above would cause the GEDCOM file generated for the person from this data to include:

1 BIRT
2 DATE about 1832

In other words the formula allows the generation of newly derived information from the given raw data. I can also put in source information. For instance, by adding a little more info to the file of data that gets converted I can have the following generated automatically:

1 BIRT
2 DATE computed about 1832
3 NOTE Computed from age on census
2 SOUR 1861 Census of Norton Parish, Kings County, New Brunswick, Canada

I wrote up the conversion program in http://bartonstreet.com/deadends/DeadEndsImport.pdf However, at the time I wrote that I hadn't fully worked out the handling of expressions as shown above. There were a few design issues I had to get through to understand exactly what those expressions needed to include, but this is now done, so I should update the description to include them. It's really quite clever in what it can do.

I consider the ability to automatically generate inferred/derived events from other events to be important; it simply automates tedious and error-prone steps that a genealogist would otherwise have to do him/herself. I frequently use census data to get birth data about people that I will add as birth information (properly sourced of course) in an evidence person record. This facility allows me to generate those GEDCOM records about people with birth data automatically from a file of info that contains just the raw data extracted from the census sheets.
gthorud 2011-02-19T07:20:31-08:00
Oops, it appears I have made an error. For Latin dates, Gedcom has a construct INT (Interpreted) that can be used for my Latin dates when the equivalent numeric date has been found in a table. CAL should not be used in this case.
gthorud 2011-02-20T07:05:15-08:00
Tom,

You are of course right about encoding the calculated birth date in a birth event.

What I don't like is the note subordinate to the date. If we are going to have notes attached to each element of an event, it will screw up every user interface I know. And what should a sentence template look like that will handle all uses of such a note – where should the note be output?
ttwetmore 2011-02-20T07:57:47-08:00
Geir,

About notes. I have found that it is sometimes useful to allow notes anywhere.

There are two (at least) kinds of notes and we should probably distinguish between them. There are level one notes (thinking about GEDCOM here) that should probably convey important information about the record (person, event, family, whatever) and be properly sourced like any other kind of important information. Notes at other levels are about whatever is one level above them in the structure tree. There are two extremes of notes in my opinion. At one extreme are the notes that we want to see in full reports. They are often sentences that we feel should fit nicely in a biography. At the other extreme are notes about some detail that came up somewhere that are more a "note to self" than something important. Maybe in trying to transcribe a name from a census register you realize that you can't get the exact spelling, so you put in a name but add a note below it to indicate it was difficult to transcribe and what the alternatives might have been.

I've been inconsistent over the years in my use of notes. Sometimes I try to reserve the NOTE tag for the note that I want to see in reports, and use the INFO tag for notes about the data itself, that is, for the "notes to myself". Think of INFO as comments in a programming language. They are notes to the programmer to remind him/her of something important in the record, but that something is not material enough to show up in any actual report. I see the note about the "about date" computed from a census that way. Recall my philosophy about evidence persons. An evidence person record should contain ALL and ONLY the information that can be derived from a single item of evidence. The "about birth date" is something that can definitely be derived from that census information, but it is not as DIRECT a kind of information as the date of the census or the name of the person. So in that sense I don't mind an "info level note" giving a bit of extra explanation about how the info was derived.
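
Here is a tiny made-up illustration of the two kinds, following that convention (the lines themselves are invented):

1 NAME John /Smith/
2 INFO Surname hard to read in the census image; could be Smith or Smyth.
1 NOTE John served as churchwarden of the parish from 1702 to 1710.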

Note that in the first paragraph above I am referring to conclusion records, because these are the ones that require multiple internal sources, because they are the ones that bring together PFACTS gleaned from many sources. The census case in my example is strictly at the evidence level, so there should only be one source, at level one, that sources everything in the record. So the example in my last response, where I also attached a SOUR to the date, was really more an example of what the converter tool can do than anything that I would really do!

Is it worth thinking about different kinds of notes to handle different types of situations?
ttwetmore 2011-02-20T08:07:20-08:00
For you geeks out there, and I know there are a few, I thought you might be interested in the yacc/bison file that defines the DeadEnds date format as described in the document that started this thread. This yacc file is used to automatically generate the parser that is then used to convert general strings into date structures. For those of you who have used yacc in the past, you'll see that the underlying language here is Objective-C rather than vanilla C. This parser is used in the DeadEnds import program, for processing dates in detail.

//  DEDate.ym
//  DeadEndsImport
//  Created by Thomas Wetmore on 2/17/2011.
//  Last changed on 2/20/2011.
//  Copyright 2011 DeadEnds Software. All rights reserved.
 
// This yacc/bison file defines the grammar of DeadEnds date strings.
 
%{
#import <Foundation/Foundation.h>
#import "DEDate.h"
extern int datelex ();
%}
 
%union {
    NSInteger ival;
    id idval;
}
%token OR
%token AND
%token<ival> MONTH
%token ABOUT
%token COMPUTED
%token INTERPRETED
%token BEFORE
%token AFTER
%token PROBABLY
%token POSSIBLY
%token BETWEEN
%token FROM
%token TO
%token ON
%token<ival> INT
%type<idval> dateList date singleDate dateRange year
%type<ival> prefix
 
%%
 
dateList    :    date {
                    datesList = [[NSMutableArray arrayWithObject: $1] retain];
                }
            |    dateList OR date {
                    [datesList addObject: $3];
                    [$3 release];
                }
            |    dateList AND date {
                    [datesList addObject: $3];
                    [$3 release];
                }
            ;
date        :    singleDate
            |    dateRange
            ;
singleDate    :    prefix MONTH INT ',' year {
                    $$ = [[DESingleDate alloc] initPrefix: $1 year: $5 month: $2 day: $3];
                }
            |    prefix INT MONTH year {
                    $$ = [[DESingleDate alloc] initPrefix: $1 year: $4 month: $3 day: $2];
                }
            |    prefix MONTH year {
                    $$ = [[DESingleDate alloc] initPrefix: $1 year: $3 month: $2 day: 0];
                }
            |    prefix year {
                    $$ = [[DESingleDate alloc] initPrefix: $1 year: $2 month: 0 day: 0];
                }
            |    prefix INT '/' INT '/' year {
                    $$ = [[DESingleDate alloc] initPrefix: $1 year: $6 month: $2 day: $4];
                }
            ;
prefix        :    /* empty */ {
                    $$ = 0;
                }
            |    ABOUT {
                    $$ = DEDatePrefixAbout;
                }
            |    BEFORE {
                    $$ = DEDatePrefixBefore;
                }
            |    COMPUTED {
                    $$ = DEDatePrefixComputed;
                }
            |    INTERPRETED {
                    $$ = DEDatePrefixInterpreted;
                }
            |    AFTER {
                    $$ = DEDatePrefixAfter;
                }
            |    ON OR BEFORE {
                    $$ = DEDatePrefixOnOrBefore;
                }
            |    ON OR AFTER {
                    $$ = DEDatePrefixOnOrAfter;
                }
            |    POSSIBLY {
                    $$ = DEDatePrefixPossibly;
                }
            |    PROBABLY {
                    $$ = DEDatePrefixProbably;
                }
            ;
year        :    INT {
                    $$ = [[DEYear alloc] initYear: $1 double: NO];
                }
            |    INT '/' INT {
                    $$ = [[DEYear alloc] initYear: $1 double: YES];
                }
            ;
dateRange    :    BETWEEN singleDate AND singleDate {
                    $$ = [[DEDateRange alloc] initType: DEDatePrefixBetween one: $2 two: $4];
                }
            |    FROM singleDate TO singleDate {
                    $$ = [[DEDateRange alloc] initType: DEDatePrefixFromTo one: $2 two: $4];
                }
            ;
%%
 
// Lexer object created externally that returns tokens.
//--------------------------------------------------------------------------------------------------
extern DEDateLexer* dateLexer;
 
// This is the lexer for the DeadEnds date parser.
//--------------------------------------------------------------------------------------------------
int datelex ()
{
    int tokenType = [dateLexer getToken: &datelval];
    return tokenType;
}
gthorud 2011-02-20T15:43:10-08:00
Tom,
Some programs have "research notes" which I think are the same as your INFO notes – not intended for output – and they are obviously useful. If you only use that type of note for a date, you solve the output issue, but you do not solve my Christmas tree concern – too complex a user interface. Why not put the info in the event note, with type research? I have argued for a type string attached to each note, user defined, possibly with a few predefined values (incl. research) – assuming that you can have several notes in the same "location" in the data structure. In that way you can for example control what types of info to output. But notes have been discussed elsewhere and should probably have their own thread; I guess there are a number of issues that have not been mentioned yet. But I would like to get back to the requirements catalog – we must become more structured, otherwise no one will read this.

It is more than 25 years since I took a course in compiler technology, and I remember reading man pages etc. about yacc during my Unix-only period, but I have not really used it – still, I understand what you are doing.
GeneJ 2011-07-15T16:38:29-07:00
@Geir,

On Feb 18 (in another life), Geir wrote, "I see that GeneJ is opposed to sorting dates, but I don’t know why."

Sorry I didn't catch this earlier.

Sort dates would be a helpful addition. I know my data practices are better for this feature availability. --GJ
NeilJohnParker 2011-12-07T15:21:37-08:00
Comments about Tom Wetmore's DeadEnds Date Standard.

Not sure where to place this so I am commenting on it here. Please feel free to move it to a more appropriate location.

If OK with Tom, I would suggest that we move this to a section on Better GEDCOM Data Standards with a subsection on Date Formats.

It should have a status of Draft Proposal Version 0.0 and be dated.
Its major contributor or Principal Author should be acknowledged to be Tom T. Wetmore, and/or acknowledgement given to him as initial creator and to the product (i.e. DeadEnds).

Its sections should be numbered using a multilevel decimal dot notation so they can be easily referred to for future comment.

The purpose should be clearly stated to be for the BetterGEDCOM Genealogy Data Exchange standard.

Only the standard date format is considered valid. Fields not conforming to the standard should either be converted to the standard (only if this can be done in an error-free manner) or accepted as is, assumed to be in free format, and flagged with a clear warning as to why they are assumed to be invalid.

The following are my specific comments on each section:

Single Dates:
Completely Known Dates:
BetterGEDCOM should conform to widely accepted international standards for purposes of internationalization, i.e. ISO 8601. If this principle is accepted it would favor an all-numeric date of the form YYYYMMDD or YYYY-MM-DD, with the less precise or partially known forms YYYY-MM or YYYY (note YYYYMM is not allowed due to ambiguity). Personally, I find the numeric form without the hyphen separators hard to read, so I would favor the hyphenated form.
If we do not favor the all-numeric form but would rather use the name for the month, then I would recommend that we use YYYY-MMM-DD, MMM-DD-YYYY or DD-MMM-YYYY, where MMM represents the 3-character abbreviation or n-character word for the month (but note that this will be different for every language in the world, negating our commitment to internationalization and to eliminating cultural biases, which is probably why ISO 8601 went with an all-numeric format).
In any event, the separator should always be a hyphen; not a comma and space, slash, colon or space (note the colon is used and reserved for time). Furthermore, the use of the hyphen must be consistent, i.e. YYYYMM-DD is not acceptable. MMM should always be capitalized on the first letter only; accepting jan, JaN, JAN or JanuaRY strikes me as unaesthetic and inappropriate. YYYY and DD should be 4 and 2 digits respectively, padded on the left with zeros if necessary. Coding for B.C. (or B.C.E.) and A.D. is as customary, i.e. A.D. is optional and assumed if not stated; both appear before the date.
Your example should have used January 11, 1953, as the ambiguity would be even more obvious.
The calendar system (i.e. Gregorian, Julian, Hebrew etc.) used to express the date must be explicit, except that Gregorian is assumed by default.

Upper and Lower Bound Dates:
Bounded dates can be preceded by "on or". We should not assume that "before" implies "on or before", as this is counter-intuitive and will only lead to ambiguity.
Both definitions of date ranges must make it clear that the range is inclusive, i.e. includes the end dates.

Double Year Dates:
Can a double-year date of February 1698/9 be ambiguous, i.e. is this a double year or is it the ninth day of February 1698? Perhaps double years should always be written YYYY/YYYY.

NeilJohnParker
ACProctor 2011-12-07T16:31:15-08:00
I think I agree with most of that, Neil, especially the use of ISO 8601 for determinate dates - even if bounded between two error limits.

Just a couple of comments:

1) I agree about using the hyphenated numeric form. Any non-numeric form introduces an unwarranted localisation. [I stress this is the computer-readable data, not the textual transcript of the date]

2) The use of the slash ('/') might cause some ambiguity with start/end dates as defined in the ISO standard.

3) Dates before the introduction of the Gregorian calendar are written using the proleptic Gregorian calendar. However, I thought I remembered there was some suffix that could be used for an explicit Julian date -- I'll check.

4) The ISO 8601 standard allows for Week numbers (e.g. 1956-W34) but for some bizarre reason it does not allow for Quarters (e.g. 1956-Q2). These are very important in some registration schemes in genealogy. The extension would not be ambiguous but it would break with the standard :-(

Tony
ACProctor 2011-12-16T13:24:53-08:00
Re: "However, I thought I remembered there was some suffix that could be used for an explicit Julian date -- I'll check."

No, it's not part of the standard. Someone had used G and J prefixes themselves :-(
ttwetmore 2011-02-17T12:35:18-08:00
Adrian,

I debated adding "on or before" and "on or after" and decided not to yet. I understand your mathematical concerns exactly!

I was hoping the gut feeling of the difference in the phrases "between ... and ..." and "from ... to ... " would be okay, but on second thought it might not be. The between phrase, in my mind, means "some date in between ... and ... ", and the from phrase means "continuously on all dates from ... to ...". I was looking for an easy way to capture the distinction between somewhere in a range and throughout a range. Maybe it really doesn't matter, since any computation I could imagine applying to one form I could imagine applying to the other.

Many years ago one of my early mentors in complex software development told me that I would find my work a continuous stream of difficult compromises. That has turned out to be the case.

Tom W.
gthorud 2011-02-17T15:33:23-08:00
Given that we will have to be backwards compatible with the current Gedcom, it would be best if a discussion related to dates focused on how we want to extend the current capabilities.
SeptemberM 2011-02-17T15:42:51-08:00
Perhaps I can help bring some clarity here. When genealogists use estimated dates like "before 1 Jan 1935" or "after Jan 1935" it is because there exists a piece of evidence indicating the existence (or non-existence in the case of death dates) of an individual as of that date. For example, if a will is written 1 Jan 1935 and the writer names a child for whom no birth record has been found, then that child's birth date would be entered as "before 1 Jan 1935." There is no reason to assume equality here. In fact there is a greater argument to not assume equality as the date used as a boundary to the conditional pertains to another event. Similarly, if the phrase "between . . . and . . ." is used it is because there is other evidence which allows us to narrow the possible dates of an event, i.e. a birth, as being in that range. An example would be a marriage on 12 Sep 1840, a located birth record for the first child born 8 Aug 1841, and a death record for the mother as of 2 Feb 1843, in combination with a marriage record for a person stating their parents as the same as the first child's (b. 8 Aug 1841) -- this would allow us to estimate the second child's birth date as being between 8 Aug 1841 (birth of the first child) and 2 Feb 1843 (death of the mother). This date estimate approach can also be used with years only, as calculated from the age(s) given in other pieces of evidence, i.e. census returns, marriage and death records, etc.

All the conditional phrases used with dates are valid and necessary to the genealogical process, but they do not imply equality. What would be truly helpful to genealogists is the functionality of linking these estimates to the evidence they are being inferred from.
SeptemberM 2011-02-17T15:57:22-08:00
Re: burial date = death date . . . you're going to run into a lot of complaints about that one. In the Jewish tradition, burial must occur within three days of the death, and that is the only one I know that requires it to happen that quickly. In colder climates, burial could be months after death, even today. For a genealogist, burial date <> death date.
gthorud 2011-02-17T16:02:01-08:00
If all programs, at least those I know, keep a numeric representation of a date (12.12.1234) where possible, is there any reason to transfer this as "12 Dec 1234" - i.e. use a string representation of the month?

Unless there are calendars that don't have days, months and years, a numeric representation would solve some problems.

How a numeric date is presented in the user interface - possibly as "Dec 12 1234" - is not BG's concern.
SeptemberM 2011-02-17T16:15:58-08:00
Yes, I agree that a numeric representation of a date is preferable in the gedcom, as long as it is acceptable to have zeros in the month and/or day portions. My formatting of the dates in my examples was simply a combination of my genealogical training and my personal preference for readability.

Thinking about this question from the gedcom point-of-view, however, raises another possible consideration . . . the order of this information might be clearer if it followed the order of granularity, i.e. yyyymmdd. From the genealogy side, the year of an event is more commonly known before the month and day. From the developer side, this ordering might be more acceptable/understandable on an international basis and help reduce potential confusion, i.e. U.S. mmddyyyy vs. European ddmmyyyy.
gthorud 2011-02-17T17:14:08-08:00
I am not sure what ISO has to say about the sequence ddmmyyyy or yyyymmdd - I prefer to not get into that discussion at the moment. It will be a discussion that could probably go on for ever.

Tom, does your spec allow "between abt 1234 and abt 1245", or "after est 1234" - i.e. intervals with approximate endpoints? And then there is "abt 1234 OR abt 1245", and "between 'the 3rd Sunday after Trinity 1234' (calculated to 12.7.1234) and abt 24.8.1234".

I think I want to see some examples before I buy into the need for combinations of several ANDs and ORs.

I will come back with arguments for sorting dates that should be allowed to accompany arbitrary strings, but there is precedent in a number of major genealogy programs.

An example where double dates are used in sources here is probates, which could be held on two (or even more) non-consecutive dates.

How do we handle hours, minutes and seconds? Or are they never important in people's lives?
ttwetmore 2011-02-17T19:53:42-08:00
Geir asks, "Tom, does your spec allow 'between abt 1234 and abt 1245', or 'after est 1234' - i.e. intervals with approximate endpoints? And then there is 'abt 1234 OR abt 1245', and 'between "the 3rd Sunday after Trinity 1234" (calculated to 12.7.1234) and abt 24.8.1234'."

Yes. I did try to make that clear in the document, but as you know, specification documents can be hard to grasp. Note that I first defined a "single date" that allows the about, before, after, and possibly prefixes. And then I defined date ranges as single dates with either between ... and ... or from ... to .... So this allows all the various combinations that one would want, for example "between about 1234 and about 1245" or "about 1234 or about 1245".

I agree that the need for lots of ands and ors seems far-fetched. I just want to capture all the cases, even if the pathological ones would occur only .000001% of the time.

I will go on the record as saying I am NOT in favor of using numbers for months. I believe that using all numbers makes ambiguities and errors more likely rather than less likely. There is the classic case even in the English language where March 4, 2011, would be 3/4/2011 in the U.S. and 4/3/2011 in the U.K. Anything that helps prevent errors is a good thing, and using names for months helps prevent errors.

Another thing to mention is that the Better GEDCOM form of a date is the "external, archival" form. Any application could allow its users to enter dates in any format and would be free to display those dates in any format or use any format in reports. The point being that the Better GEDCOM format really has little real meaning. You need Better GEDCOM to pick one standard, unambiguous, deterministic format. It is only required to be deterministically parseable during import and deterministically generatable during export.

I have just written a context free grammar for the date formats so I can use an automatically generated parser to read the dates using the formats I put in my paper, and that grammar is small and even better, trivially deterministic.

And of course DeadEnds is not Better GEDCOM, just what I think it should be, smile, smile.

Personally I don't think time (hours and minutes) should be part of dates. They don't come up very often. If needed they can be tacked onto a date as a subelement. I have had long, long discussions about the nature of "humanistic" data versus what you might call engineering or scientific data. I am amused by how computers somehow get automatically equated with requiring strict, formal, scientific data formats. For places, someone always wants to throw away all location information and replace it with latitudes and longitudes, or people obsess about time zones, calendars, normalizing all times to universal time, and so on. Just because we now use computers to support our humanistic hobby, we don't need to embrace all the formality of restricted data formats. This is the crux of my argument about trying to shoe-horn genealogical data into relational database tables. Genealogical places and genealogical dates are not nice, neat, well-formatted entities that can be easily mapped into type-restricted and field-length-restricted columns in relational tables. In fact, I would claim that the problems with sharing genealogical data are caused more by the decisions on how to store genealogical data in a fixed number of fixed-format relational table columns than by any other source.

Tom W.
AdrianB38 2011-02-18T13:00:47-08:00
"Re: burial date = death date . . . you're going to run into a lot of complaints about that one"
Hi "SeptemberM" - err - I beg to differ.

Firstly, let's not get too hung up about the example being about death and burial. I'm trying to think about this: if people have evidence about an event X happening on date D, and they know that event Y happens after event X, what are they _likely_ to have written in their GEDCOM file for event Y? (What they have written regardless of the English and regardless of whether they have the faintest understanding of a mathematician muttering about possibilities of equality.)

I suggest that, without much thought on their part, they will write "After D". If events X and Y _can_ happen on the same day, then "After D" actually means "After or on D". Nobody except a mathematical (or IT) geek like myself is ever going to write "After D-1 day". Regardless of the truth of it. Therefore, this line of attack suggests we cannot assume people mean "Strictly after D" when they write "After D" and not "After or on D". They probably haven't thought it through that deeply!

Further, when you talk about burial, my understanding is that a Muslim burial (and I've no idea if that's what all Muslims believe) must take place before sundown(?) on the day of death. I remember this from when Dodi Fayed and Princess Diana were killed and also when King Hussein of Jordan died. (Tricky in the latter case to arrange a State Funeral with the level of attendance desired on all sides).

I also suggest that many battlefield or naval battle burials would take place on the day of death.

Therefore, I suggest that the only safe thing to do is assume that "After D" allows for the possibility of equality. Ditto "Before D". I would agree with anyone who said that the English tends to suggest otherwise but I would also say that many people simply won't have thought things through that deeply and occasionally our two events _will_ happen on the same day.
AdrianB38 2011-02-18T13:13:13-08:00
Re storage of dates in ddmmyyyy or mmddyyyy or yyyymmdd or ....
I would suggest that somewhere in the BG file there needs to be a definition of the date format applied (possibly the default date format applied) and that any given date is assumed to conform with that format unless stated otherwise within the date in question.

Let's remind ourselves that our ddmmyyyy in whatever sequence is a strictly Western concoction and so at the very least we need to brand such a date with the calendar and date-format applicable.

Even in Europe, we have no universal agreement on the calendar, never mind the ordering! (For instance, Wikipedia on the French Revolutionary Calendar: "There were twelve months, each divided into three ten-day weeks called décades ... The five or six extra days needed to approximate the solar or tropical year were placed after the months at the end of each year.")

Thus, we have to brand a date with the calendar used and that branding might as well determine the date format as well.

In fact, I'm going to stick that in the Requirements Catalogue right now! <grin>
SeptemberM 2011-02-18T14:02:14-08:00
Hi "AdrianB38"

Thanks for your response and explanation. I understand what you're trying to establish, and the math/programmer half of me agrees. In fact that half of me often spills over into the genealogist half and my own genealogy is filled with <, >, <=, and >= signs in place of the words and/or abbreviations more commonly used. One thing I would point out is that for myself, and other professional genealogists, precision is important, and in the cases of possible equality, we include that possibility. The reverse is true in that if equality is not possible then it is not included. Genealogists do think these things through that deeply in the interest of certainty and accuracy. That's why the use of conditional modifiers is so prevalent in dates in genealogy. Like many other items, there may not be a "one size fits all" answer to this question.

And, historically, there have been at least a couple of known instances where burial occurred before death! <big grin>
gthorud 2011-02-18T17:25:05-08:00
Tom,

I don't see anything about calculated dates in your spec. Also, a Simple date could be an "arbitrary string" when, for example, one end of a range is a date expressed as a text string. So your spec does not cover all my examples (unless you put them in one string).

I thought the problem of representing dates was solved by ISO standards decades ago. I can understand Tom's arguments against numeric dates from a DeadEnds point of view, but since I do not see a user requirement to enter data directly into a BetterGedcom file, I don't accept Tom's arguments against numeric dates in a BetterGedcom context - but that is perhaps also what Tom is saying in the next paragraph. Numeric dates are the only way, that I have seen so far, that can be used in an international specification.

Numeric dates should be represented just as 8 digit numbers, without separators (eg slash or full stop).

I see that relational databases are widely used – that tells me that we can not act as if they are not used – they are just a fact of life. (I do not think that the date format depends on the type of database.)



Adrian, SeptemberM,

Re. "After D" versus "After or on D", Gedcom seems to have the "after D" variant already – so we want to keep that. I agree with Adrian that many users will not think of D-1. One possibility is to let programs operate with "After or on D" and translate that to "after D-1". Or we could add an "After or on D". (Note that BETWEEN in Gedcom is inclusive, i.e. "between or on".)

It is OK to identify the order of yyyymmdd or ddmmyyyy (or whatever). It should apply to the whole file for that calendar, and should be identified in addition to the calendar (not as part of the calendar definition). A possibility to "escape" from the default value for the file is OK, but I don't expect it to be used, since the simplest solution for a program is to do it only one way. So the only reason for being able to identify the order is that we can't agree on a single way to do things.


Encoding:

I do not want to go into the detailed encoding of dates, since that is currently not needed, and depends on the syntax, but I do not think dates should be encoded as a long string with embedded keywords – they should be broken up into several strings, with for example a single string containing a keyword (e.g. FROM).


Prefix/postfix:

The requirement to have strings as prefixes/suffixes, possibly for each Single date in Tom’s spec, as discussed here http://bettergedcom.wikispaces.com/message/view/BetterGEDCOM+Comparisons/32431138


Sorting dates:

Also, I have previously in that discussion (see above link) argued for sort dates to be used in combination with dates in free-form strings, but I do not see a requirement for a program to try to extract info from that string (cf. Tom's spec). That would be a very complex thing, considering e.g. the different languages used in dates and the different formats used in various countries.

Gedcom currently allows either a single string or a string with a calculated structured date, e.g. ddmmyyyy. A sort date is in my view the same as this calculated structured date, but with the difference that it shall not be output in reports or charts (but is used to control the sequence of output) – so in terms of implementation it is no more complex than what is currently in Gedcom.

One example of when sort dates are useful is when the date is given by a Latin name for a date. In most cases it is possible to calculate a numeric date for these dates, but that calculation is complex, so in many cases I see only the Latin date and an estimated date (often based on the dates before/after) used for sorting – the real numeric date may be calculated later. Another reason to record an estimated sorting date is where it is impossible to interpret the Latin date (e.g. faded ink), or when it may be ambiguous – but you still want the event to sort.
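
To illustrate the shape of the idea (the _SORT tag here is invented, purely for illustration), the transfer could carry the free-form date string together with an estimated structured date that is used only for ordering and never printed:

2 DATE (3rd Sunday after Trinity 1734)
3 _SORT ABT 1734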

I see that GeneJ is opposed to sorting dates, but I don’t know why. Since the only functionality needed is to be able to hide the date in output, I don’t see any problem.

Also, if a receiving program does not support sorting dates, it could ignore them – that will be the same as if BG did not support sort dates – nothing lost by including sort dates in the spec.

RootsMagic, TMG and Genbox (and possibly Legacy internally?) and most likely others – seem to allow the use of sort dates everywhere – maybe everywhere there is a “not exact” Single date.


Notes for dates:
Notes for dates have been proposed – I don’t see why we need them – we can’t have notes everywhere. Comments about a date should go in a note for the event.
AdrianB38 2011-02-17T12:24:47-08:00
Tom
Re "before January 1, 1953"
and
"aft January 1953"

The mathematician in me keeps asking whether "before" and "after" include the possibility of equality? i.e. could the event occur on 1 Jan 1953?

While the sense of the English tends to suggest not in the first case, in practical terms I'd prefer if it did allow equality. For instance, suppose we had "X died on 1 January 1866". Then what would we write for the burial fact? - and I'm allowing for the burial to occur on the day of death.

"X was buried after 1 January 1866" if we allow for equality.
"X was buried after 31 December 1865" if we do not allow for equality.
That last version is perverse and would result in the burial event being sorted in front of the death event in 99.999% of applications.
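
A toy illustration of that point (assuming, as most applications do, that an "after X" date is sorted by X itself):

from datetime import date

# "X died on 1 January 1866" vs "X was buried after 31 December 1865"
death  = ("Death",  date(1866, 1, 1))
burial = ("Burial", date(1865, 12, 31))   # anchor date of the "after" phrase

for name, anchor in sorted([death, burial], key=lambda t: t[1]):
    print(name, anchor.isoformat())
# Prints the Burial line before the Death line, even though the burial
# cannot have preceded the death.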

Further, since the "From To" construction implies equality in English, it seems perverse that the "Btw And" construction doesn't.

And when it comes to interpreting what other people meant in their GEDCOM file by the phrase "Btw And", it seems _safer_ to assume that equality is meant.

(Of course, one could introduce "On or after", "On or before", and .... oh heck, what's the English for the version of "Btw and"??)
ttwetmore 2011-09-06T06:37:49-07:00
DeadEnds Proto File
I have added a new file to the set of DeadEnds files that are on line. The file defines the DeadEnds model in terms of Google Protocol Buffers. This is the format that Google uses to transmit information between its various servers; Google made the format and technology available publicly. The purpose of the format is to allow efficient storage and transmittal of information. A DeadEnds database (and by extension any Better Gedcom database) expressed in this manner would use near minimal amounts of storage and would require near minimal times to transmit genealogical data.

Here is the URL to the file:

http://bartonstreet.com/deadends/DeadEnds.proto

This allows:

1. Persons -- full implementation with relationships, events, sub-persons (if desired for handling the research process).
2. Events -- events can either be attributes of the persons or they can be multi-role event records of their own.
3. Sources -- full hierarchical implementation allowing multi-levels from low level citations, to repositories. Any set of templates possible through a general attribute (key/value) mechanism.
4. Places -- places can either be attributes of events or they can be records of their own. Full multi-hierarchical places are allowed.
5. Attributes -- just about anything can have attributes, hierarchical to any depth.
6. "Information" -- just about anything can refer to media, have sources, and have notes (local or as records) associated with them.
7. UUIDs -- all records are tagged with unique ids.
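
For a rough idea of what using the compiled output could look like (the message and field names below are placeholders invented for illustration, not the actual definitions in DeadEnds.proto; SerializeToString and ParseFromString are the standard protocol buffer calls):

# Hypothetical use of Python classes generated by:
#   protoc --python_out=. DeadEnds.proto
# Person, uuid and name are placeholder names, not the real DeadEnds.proto
# messages and fields.
import DeadEnds_pb2

person = DeadEnds_pb2.Person()
person.uuid = "0b7f3c2e-1d4a-4c8e-9b6f-2a1e5d7c9f30"   # every record carries a unique id
person.name = "Thomas /Wetmore/"

data = person.SerializeToString()    # compact binary form for storage or transmission

received = DeadEnds_pb2.Person()
received.ParseFromString(data)       # the receiving side parses it back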

Tom
AdrianB38 2011-09-06T09:25:45-07:00
One thing I've been struggling to get my head round in the plethora of options re GEDCOM, XML, JSON and now Google Protocol Buffers and anything I've missed...

One of the BG goals is the ability to store this stuff. Now that implies that if something is stored in version n of BG, then sooner or later versions n+1, n+2, etc. appear. (OK - this is theoretical, right?) And Murphy's Law says that while n+2 is designed to be compatible with a file written in version n, somewhere it won't work.

So, as a matter of course it would seem useful that the file should contain something to identify which version of the BG language is being used, and preferably a link to the formal definition. Now, I remember XML has (or had) DTDs, Schemas and at least one other mechanism for doing this sort of thing.

Do we know if JSON and Google Protocol Buffers have similar facilities?
ttwetmore 2011-09-06T11:36:42-07:00
Adrian,

Protocol buffer documentation mentions two future-proofing techniques. One is how to modify a record type so that it can carry more fields. The other is how to allow extensions, say for third-party enhancements of record types. The whole point of these techniques is to allow all earlier versions of the data to be read by all future applications, that is, so no old data ever needs to be converted to a new format. I don't know anything about versioning in JSON.

Notwithstanding this, however, presumably every BG archive or transmission would begin with a header record of some kind where version and other top-level info would be available. Think of the HEAD record of Gedcom. Nothing prevents XML, JSON or protocol buffers from doing the same. This is a different approach, where a future program would need a mode in which it can specifically read earlier versions of the data. Presumably it would also be able to convert old data to the new format.
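
For instance (purely illustrative names and values, not a BG proposal), such a header need carry little more than this:

# Sketch of a leading header record carrying version information; the field
# names, version number and URL are invented for illustration only.
import json

header = {
    "format":  "BetterGEDCOM",
    "version": "0.1",                                  # which BG model revision the file uses
    "schema":  "http://example.org/bettergedcom/0.1",  # hypothetical link to the formal definition
}
print(json.dumps(header, indent=2))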

Use of protocol buffers is a little more complex than using XML or JSON. Protocol buffer specifications must be processed by a protocol buffer compiler, protoc, which generates source code that must then be compiled. The generated source code contains classes for each of the messages defined in the specifications. Since I do my development on a Mac, my version of protoc generates Objective-C classes. The file that I have put up on my website compiles cleanly with the protocol compiler.
AdrianB38 2011-09-08T08:22:52-07:00
Thanks Tom. It all sounds like basic, sensible future-proofing of a design - the sort we've all (well, most of us) done for years. I may be expecting a bit much from XML Schemas, namespaces, etc.