This page is obsolete - please go to BetterGEDCOM Requirements Catalog



What are we trying to accomplish?
There are many possible objectives which may be advisable, desirable or even admirable. They may not fit easily into the current project, however, for technical reasons, because they have more to do with defining core genealogical methodologies, or for other reasons. Below are discussions of goals and the reasons why they may or may not be a good fit for this project.



GOALS
Based upon information provided on this page and others, we will attempt to define the needs and establish a set of goals for the BetterGEDCOM project.

  1. BetterGEDCOM will be a file format for archiving and exchange of genealogical data. [Developers Mtg 3 Jan 2011 status (approved)] Text moved to BetterGEDCOM Requirements Catalogue
  2. BetterGEDCOM should have the following encoding and syntax characteristics [First BetterGEDCOM Developers Meeting status (tabled)] [New discussion topic started]: Text moved to BetterGEDCOM Requirements Catalogue
  3. BetterGEDCOM should define data relating to the study of genealogy. The definitions will describe the XML-based syntax and also be embodied in a data model. The definitions will be capable of extension by software companies and users. The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention, subject to the limits of the applications involved.[being moved to discussion] [First BetterGEDCOM Developers Meeting status (tabled)] [New discussion topic started] Text moved to BetterGEDCOM Requirements Catalogue
  4. The BetterGEDCOM project should provide a test suite of data that will allow software suppliers and users to assess compliance of software, diagnose issues and assist in their resolution. [First BetterGEDCOM Developers Meeting status (tabled)] [New discussion topic started] Text moved to BetterGEDCOM Requirements Catalogue
  5. BetterGEDCOM will support recording of information about real life without bias toward any specific belief system. [First BetterGEDCOM Developers Meeting] Text moved to BetterGEDCOM Requirements Catalogue
  6. BetterGEDCOM will actively encourage the best practices of scholarly genealogy.
  7. BetterGEDCOM should define just one way of doing one thing. More than one way causes ambiguity and extra work for programmers, who would have to handle all methods. BetterGEDCOM's definitions should be general enough to handle all cases, but in just one way. Text moved to BetterGEDCOM Requirements Catalogue


Discussions About Current Goals

Goal 1: BetterGEDCOM should remain a file format technology capable of serving as a data archival repository

BetterGEDCOM's initial, immediate goal is to serve as a file format in which users can store their genealogical data, independent of any software application, for storage, safekeeping or transport. Data stored in BetterGEDCOM format can be placed on media such as a flash drive, DVD or anything of a similar nature. BetterGEDCOM does not seek to be a permanent, efficient data store, or a data store that replaces any storage scheme a software developer uses. The ability to store personal genealogical information in this way is a characteristic of the old GEDCOM standard, and we absolutely agree that it is a feature that should be retained for users.

Goal 2.1: BetterGEDCOM should use an XML-based syntax


Goal 2.2: BetterGEDCOM should use the Unicode character set in UTF-8 encoding, and optionally support other Unicode encoding schemes


Goal 2.3: BetterGEDCOM should utilize a standardized container specification to hold separate supporting files such as multimedia

Questions for this goal: What container specification should be used? What compression method? Should any file type be allowed? Should BetterGEDCOM have any role regarding file formats or should this be entirely left to software developers?
A new Page called Multimedia File Inclusion Issues has been added to discuss this topic in more detail.
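As one possible shape for such a container, here is an illustrative sketch in Python. The archive name "family.bgz", the entry names "data.xml" and "media/", and the choice of ZIP are all assumptions for the sake of the example, not decisions the project has made:

```python
import os
import zipfile

def write_package(xml_text, media_paths, out_path="family.bgz"):
    """Bundle a BetterGEDCOM XML document with its referenced media
    files into a single ZIP container (hypothetical layout)."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("data.xml", xml_text)  # the genealogical data itself
        for p in media_paths:
            # store each supporting file under a media/ folder
            z.write(p, arcname="media/" + os.path.basename(p))
```

An importer would open the archive, read the XML, and resolve media references against the media/ entries, which is essentially how the OpenDocument and Office Open XML container formats work.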

Goal 2.4: BetterGEDCOM should support a markup language such as HTML.

This will allow data to be formatted as desired by the user when displayed. It will allow HTML hyperlinks (href), lists, tables, and most other HTML constructs to be included and displayed as a web browser would display them. Programs may choose to ignore some or all of the HTML markup and display the data as plain text until their authors have time to build in a rich display capability. Since a literal "<" cannot appear in the text sections of XML (and "&" must likewise be escaped), the standard XML entity encodings will be used: "<" becomes "&lt;", ">" becomes "&gt;" and "&" becomes "&amp;".
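As a sketch of that encoding step, Python's standard library is shown here; any XML library provides an equivalent pair of functions:

```python
from xml.sax.saxutils import escape, unescape

note = 'See <a href="http://example.org/">this page</a> for details.'

# escape() converts the XML-reserved characters:
# "&" -> "&amp;", "<" -> "&lt;", ">" -> "&gt;"
encoded = escape(note)

# unescape() restores the original HTML markup for display
assert unescape(encoded) == note
```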

Goal 2.5: Lines should have no length restriction.

There will be no need for anything like the continuation (CONT) tag or the misinterpreted concatenation (CONC) tag that GEDCOM has because of GEDCOM's maximum line length.
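For contrast, here is a sketch of the reassembly logic a GEDCOM 5.5 importer needs for CONC/CONT, and which BetterGEDCOM would make unnecessary. The tuple-based input format is illustrative only, not part of any specification:

```python
def join_value(pieces):
    """Reassemble a value split across GEDCOM CONC/CONT lines.
    `pieces` is a list of (tag, text) pairs for one logical value."""
    text = ""
    for tag, value in pieces:
        if tag == "CONT":
            text += "\n" + value   # CONT: continue on a new line
        elif tag == "CONC":
            text += value          # CONC: concatenate, no separator
        else:
            text = value           # the initial NOTE/TEXT value
    return text
```

The CONC case is the one implementations frequently get wrong, typically by inserting a space that the specification does not call for.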

Goal 4: Test suite of data.

This would include GEDCOM and BetterGEDCOM files for testing:
(A) Proper Translation of GEDCOM to BetterGEDCOM via:
(1) Input GEDCOM, (2) Save to Database, (3) Retrieve from Database, (4) Export to BetterGEDCOM
For this test, the input file will be provided. The exported file is to be compared to a BetterGEDCOM file that will be provided.
(B) Proper Understanding and Flowthrough of all BetterGEDCOM information via:
(1) Input BetterGEDCOM, (2) Save to Database, (3) Retrieve from Database, (4) Export to BetterGEDCOM
For this test, the input file will be provided. The exported file is to be compared to the input file.
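A rough sketch of the comparison step for test (B), assuming an XML-based format. The normalization here is deliberately naive; a real test suite would need full XML canonicalization (e.g. W3C C14N):

```python
import xml.etree.ElementTree as ET

def shape(xml_text):
    """Reduce an XML document to a comparable structure, ignoring
    attribute order and surrounding whitespace. Sketch only."""
    def norm(e):
        return (e.tag, sorted(e.attrib.items()),
                (e.text or "").strip(), [norm(c) for c in e])
    return norm(ET.fromstring(xml_text))

def round_trips(input_xml, exported_xml):
    # Test (B): the exported file should carry the same information
    # as the input file, even if the bytes differ cosmetically.
    return shape(input_xml) == shape(exported_xml)
```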


What The Original GEDCOM Did

GEDCOM, which is an acronym for GEnealogical Data COMmunications, is a bit of a misnomer by today's terminology, because in fact GEDCOM would not really be considered a way to communicate data by most technology professionals. GEDCOM defined a way to format genealogical data and write it to a simple text file, editable by anyone (i.e., the data was not in a binary format or any other machine format that would be difficult to understand). This file could then be physically transported by any standard means (e.g., placed on a disk or any other computer media, attached to an email, etc.), stored, and later imported into a computer program to be used and manipulated.

Some things GEDCOM never sought to accomplish, it is important to note, are:
  1. GEDCOM never sought to define how information could be directly passed between two computer systems. Information in GEDCOM format had to be imported into and exported out of computer programs. This is an important point, because computer systems already have standards for passing information between themselves, and several aspects of the GEDCOM format do not accommodate actual live data transmission (and were not meant to).
  2. GEDCOM did not seek to define anything about how genealogical information or practice itself should be standardized. There are many genealogical standards organizations and bodies which deal with issues related to genealogical methodology. Likewise, there are many large, established genealogical software products and services available which make certain determinations about how to format or relate genealogical information. GEDCOM did not seek to influence either genealogical standards bodies or software companies but rather sought to be a neutral middle-ground format (a genealogical Switzerland, if you will) without seeking to push any party to a particular way of working or formatting genealogical data that was currently being created or worked on.


Thus GEDCOM, as conceived, was a portable storage file format. This aspect of GEDCOM, i.e., that it was conceived as a standardized way to archive as well as transport/transfer data, is an important feature worth retaining.

Shortcomings Of GEDCOM

So, what's wrong with GEDCOM, anyway? Use this page to tell your horror stories, explain what data you want transferred, and even explain what features you want in your genealogy software that aren't there today.



Patch vs. Replace?

Do we seek to restrict this project to addressing the data issues directly related to shortcomings in the GEDCOM standard and other easier fixes or should we tackle the issue of evidence and conclusions?
Should this current effort focus solely on a conclusion-based model, matching what exists in basically every software application today? Should we try to embrace evidence as a fundamental data entity? If embracing evidence, how is it possible to also fully and seamlessly support conclusions in the model?

What Do Genealogists Need?
What immediate needs does the genealogical community have?

What Do Genealogists Want?

What goals should we set for genealogical technology of the future?
Some personal notes on Family History practices in the UK and how they create goals.

What Do Genealogical Standards Require?

What genealogical standards exist which should be met and impact the goals of BetterGEDCOM?

What Do Genealogical Organizations Want/Need?

What needs do particular genealogical organizations have which should be accommodated?

What Do Software Developers Want/Need?

What technical considerations of software developers should be considered?

What Other Goals May Be Desirable That Are Not Otherwise Accommodated?

Are there goals or constituencies that could be considered which are not included in the above?

Comments

rwcrooks 2010-11-09T14:42:34-08:00
Goal 4
Goal 4 states:
"4.BG should require a software application to have a robust conflict resolution facility prior to final import to be compliant"

Do you think that it is reasonable, or even realistic, to try to tell software developers what to do with data? After all, we're defining a standard for a data file. What the developer does with that file is up to him, or her.

Plus, now we'd have to define what a conflict is, what a resolution is and what robust means. Here we are really getting into the functionality of the genealogy database program and not the description of the data file created for exchange of data.
greglamberson 2010-11-09T14:51:35-08:00
You're absolutely correct. That goal is completely subject to change or elimination. It's merely an idea.

The process by which ISO-recognized standardization is pursued will have a direct effect on the sort of standards compliance that is even possible.

As points like this are brought up, the goals will be changed by other members of the wiki, or one of the moderators (of which I am one) will change the main pages to reflect the discussions that go on here on the discussion pages.

This wiki is sort of like making soup, however, so changes may not happen right this instant. The soup needs a little time to cook. We've only put the ingredients in the pot.
brantgurga 2010-11-09T16:47:11-08:00
I think it makes sense to have conformance types or levels. HTML 5 is going that route. There's one set of requirements for HTML consumers defining how to handle HTML which also specifies what to do with older HTML. There's a different set of requirements for HTML authoring tools.
gthorud 2010-11-09T17:36:48-08:00
Maybe we could have Subjects that say more than just Goal 4. Do we need a page telling people how to use this wiki? Sorry brantgurga!
gthorud 2010-11-09T17:53:40-08:00
There are many ways that ISO standards handle support levels, or support for various parts/data/functions of a standard, and even groups of standards. I think you are more or less free to design your own way of doing it.

Several standards have statements about what to do with unsupported data. And several programs available today have functionality that could be considered "conflict resolution" functionality, e.g. mapping of event types.

A standard may also suggest how a program should behave.

I think one should have this aspect in mind, but it may be too early to discuss. We need some concrete problems on the table first.
greglamberson 2010-11-09T19:53:18-08:00
Any standard recognized by ISO has to go through an affiliate organization's process of standardization. Each organization's way of operating and the standards it generates are wildly different. In some organizations, the standards are rigidly enforced by governing bodies, and in others, the standards they develop are essentially optional guidelines.

I have listed the four most likely candidates (in my opinion) through which we could pursue some sort of standards process. (Note that IETF is not part of ISO.) Personally, I think AIIM is the way to go. Be that as it may, I strongly encourage folks to become familiar with these different bodies, how they work, what sorts of standards they have developed, etc. These are very well-defined bureaucratic processes, and the avenue we end up taking will very much define what sort of adherence mechanisms and guidelines we can even think about putting in place.
hrworth 2010-11-14T06:59:40-08:00
rwcrooks,

I agree with you, but as a user, when I receive information that was transported in BetterGEDCOM format and some of it is NOT included in the presentation I am looking at, I should be notified as to what was dropped or was not in the correct format.

This is not a new 'requirement'; it's an enhancement to what may happen in some software packages when a GEDCOM file is imported. All I know now is that something was dropped, with very little detail. This request is to enhance the reporting of what is not extracted from the BetterGEDCOM file.

One user's opinion.

Thank you,

Russ
rwcrooks 2010-11-09T14:47:14-08:00
Goal 3
Goal 3 states:
"3.BG should require a software application to export all data to be in compliance"

But, what if I, as a genealogy database program user, decide that I don't want to export notes, or sources, or a particular branch of the family? Is my BG data file now non-compliant?

Maybe we need to "ensure that all exported data is in accordance with the published BG standards" instead.
greglamberson 2010-11-09T15:05:27-08:00
I think the idea here is that users have up to now been frustrated by not being able to export all their data in the first place. The goal is certainly to be able to export everything out of a particular database application. Is this achievable? I don't know, but your questions are spot-on in dealing with this issue. If it isn't done by you or someone else, I will certainly expand discussion of these goals on the main pages as points of debate and discussion rather than firm goals that have been decided upon.
hrworth 2010-11-09T16:41:51-08:00
rwcrooks,

Just to expand a little bit more, we should have control over what we export and what we import in our effort to share our genealogy research. That 'control' needs to be understood.

But, at this point, we are trying to get folks to the table for discussion and to begin to move forward.

Thank you for your reply on Goal 3.

Russ
GeneJ 2010-11-09T17:30:54-08:00
Would it be advisable to include "round-trip" language in this goal?

By "round-trip" I refer to a program being able to re-import information from a GEDCOM created by that same program.

Tamura Jones, if I recall, has done great work evaluating current programs with the current GEDCOM, noting programs that are reportedly GEDCOM compliant but can't "round trip" data.
JohnWylie 2010-11-11T07:35:47-08:00
15 years ago when we started developing the GenTech Genealogical Data Model (GDM), we knew that the absence of a discipline wide lexicon was a problem. In fact, the project was a spin-off of a larger "Lexicon" project envisioned by Robert Charles Anderson of NEHGS, Bob Velke of TMG and others.

The results of this effort would be stronger if it either developed a lexicon or stimulated another group to do so.

Lexicon-related discussions on other lists have been endless and unresolved; the absence of a structure for reaching consensus may have been more to blame than the failure to actually come to one.

The first instance that comes to mind is 'family', which is much too narrowly defined in GEDCOM. In the GDM we avoided the problem by creating a GROUP entity, leaving the semantics to others.

Will that be the way this group handles these difficult issues? In the absence of saying so, we are at risk of repeating those endless discussions here.
gthorud 2010-11-11T14:39:52-08:00
I am having a big problem with lack of structure in all the pages and discussions on this wiki. Would it be possible to start a separate subject on the possibility of removing the family object/entity/structure/whatever - if that is the intention? Just make a copy of the entry above and write a corresponding subject.
greglamberson 2010-11-11T21:12:49-08:00
Gthorud,

The intention is for you to add, copy, edit and CONTRIBUTE as you see fit. We're going to go through and moderate and clean up in a few days, but right now it's a free-for-all, and we couldn't be happier!

If you see something you want to do, do it! You want to start a new discussion? Do it! You want to split a discussion into two? Great! Want to edit the main pages and add your own completely different direction to something? Go for it!
greglamberson 2010-11-11T21:28:19-08:00
John,

Much like any database effort defines its own terms, so will we. It would be great for common terms to be used, but I've never heard of any database effort using a lexicon other than its own. From a technology point of view, this is inevitable.

There will certainly be reference to methodology and comparisons to other term usages. We do have a glossary, and we will add to it as we define terms within the proposed data model. Please note that even in our own glossary, we recognize there are different definitions for the same terms from different sources, and we embrace this.
One of my favorite things about your work on the GenTech Data Model is how you embrace the concept that it is possible to start with the same data and the same software application but end up with the data entered in vastly different ways, possibly even reaching far different conclusions. In much the same way, I think to strive for a world in which we all mean precisely the same thing using a single word is a futile effort and in fact undesirable. There's just too much nuance, and that's ok with me.
gthorud 2010-11-12T06:37:20-08:00
Greg,

I don't think it would be right for me to take someone else's input and create a new topic.
brantgurga 2010-11-09T16:58:41-08:00
Goal 2
XML can have substantial overhead because every opening tag must have a closing tag. Also, how would binary data like pictures be included? By link or by encoding? Perhaps a ZIPped directory tree is the way to go, which is how OpenOffice.org and Microsoft Word store their data now. That allows you to have one file to deal with, but it easily includes the data as well as referenced data like pictures.

Restricting to UTF-8 is unreasonable because it may not be a preferred format for a region. Simply requiring XML will ensure that implementations have a common input and output character set of UTF-8 without unnecessarily restricting implementations from using other formats where desired.

I would suggest targeting a version of XML, though. XML 1.0 is commonly implemented. I have yet to come across a parser that deals with XML 1.1.
jbbenni 2010-11-11T07:04:00-08:00
I'm consolidating my thread from another discussion area (Home), as the topic was started here, and probably belongs here.

After reading and thinking about responses to the idea of handling multimedia content with a "container and reference" approach, I think I see a bit of internal conflict between goals 1 and 2. Fortunately, it's fixable.

Goal 2 says BG should be XML. Everyone seems to agree with that part, although there is controversy over the UTF-8 provision.
Goal 1 says BG should serve as a data archive repository.

But in reading the discussion, there's pretty good consensus that XML is not a good format for a data archive repository (most seem to agree that XML with a reference to a standardized container is a better way to go). So there's some creative friction in the goals.

Can we -- and should we -- revise Goal 2 to resolve the conflict? Perhaps something like this:
"BG should be XML-based using Unicode [with appropriate character encoding decision here] as much as possible, but BG may also utilize a standardized container specification to hold supporting files such as multimedia."

I may not have the wording right, and there is still the important character encoding question to resolve, but I'm coming around to the thinking that an XML approach that allows an external file reference to a standardized container is the way to go. We should look closely at how it has been done elsewhere, starting with the examples cited in these two discussion threads.

At this point, I suggest that we revise goal 2 accordingly. I have no idea what the process for that would be, and I would only want to proceed if there were support for the idea. Is there a formal process to propose a goal revision?
dsblank 2010-11-11T07:20:09-08:00
jbbenni,

That sounds like a good approach, and in fact is exactly what Gramps has done. Gramps has two versions of export: Gramps XML (data only), and Gramps Package XML (with media). The latter is just a compressed tar file containing the XML and all of the media.
jbbenni 2010-11-11T07:53:24-08:00
On the question of character encoding for BG--

The XML spec includes the following interesting language:
Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration

http://www.w3.org/TR/REC-xml/#NT-XMLDecl

So any application that imports BG must be able to handle UTF-8 and UTF-16, and support for other encodings is optional.

I suggest that the BG specification should permit other encodings, but emphasize that application support of them is entirely optional. Make sense?
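The quoted requirement is easy to observe with Python's standard parser. A minimal sketch, where the name "José" is arbitrary sample data:

```python
import xml.etree.ElementTree as ET

# The same record serialized in two encodings; the XML declaration
# tells the parser which byte interpretation to use.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><name>José</name>'.encode("utf-8")
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><name>José</name>'.encode("iso-8859-1")

# Both byte streams parse to the identical Unicode text.
assert ET.fromstring(utf8_doc).text == "José"
assert ET.fromstring(latin1_doc).text == "José"
```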
DallanQ 2010-11-11T09:22:53-08:00
+1 on two formats: XML (data only) and package XML (with media).
greglamberson 2010-11-11T12:02:36-08:00
You guys are right on target. Feel free to edit the main pages to reflect this!

Let's not bleed over into the Unicode character set encodings here.
greglamberson 2010-11-11T12:04:28-08:00
OK I'm trying to read too many things. I thought this was a different thread than the Unicode one. Bottom line: We'll support whatever is called for. I think I just don't know the ins and outs as well as I thought I did.
gthorud 2010-11-11T12:20:04-08:00

To Greg about UTF-8 etc

I am no expert on character sets used in Asia, but if I remember correctly there are many tens of thousands of characters defined for Chinese and also Japanese, and I was once told that in Japan there are even a few thousand characters that are supposed to be used only when you correspond with the emperor. But I am not sure I understand how this relates to the issue, and I do not understand what the purpose of comparing the number of characters is.

I want to stress that we are talking about encoding, not the set of characters.
See for example
http://unicode.org/faq/utf_bom.html

From a w3.org document about XML:

The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646).

and
"… for UTF-8 encoded content. This is likely to be the best choice of encoding for most purposes, but it is not the only possibility. If not using UTF-8 you should replace the utf-8 text in the examples above with the name of the encoding you have chosen. You can see the full list of character encoding names registered by IANA (long). In practice, a few encodings will be preferred, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-16, the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on."
If I understand this correctly, the choice of encoding is not a question of the number of characters, but of efficiency in coding, depending on the characters used. If I am not mistaken, Windows and Java use UTF-16 encoding. I can see why you want to require UTF-8 support, but I don't understand why you want to exclude other encodings.

I see that while I was writing there has been some development on this topic, but I think that what I have written above may be of interest.
jbbenni 2010-11-11T12:46:52-08:00
Okay, I modified Goal 2 to use XML plus a container for media files, etc. I saw support in this discussion and no serious opposition, so I took a leap. If this turns out to be a bad idea, we can certainly revert to the original statement of the goal. Progress marches on, thanks all.

(I DID NOT CHANGE THE ENCODING STATEMENT IN GOAL 2, BECAUSE THERE IS ONGOING DISCUSSION AND I DON'T UNDERSTAND IT WELL ENOUGH.)

I think there's an opportunity now to identify and consider specific implementations of the container and reference format, but I think that topic may want its own thread, perhaps outside the Goals section (since it's starting to get into implementation).
greglamberson 2010-11-11T16:40:16-08:00
gthorud, the number of characters or symbols used is directly related to how many bytes it takes to render one character. This is in reference to brantgurga's concern of Nov 10 9:32 AM. The number of characters used to render a name is affected a great deal by how many characters you start with. Names in Chinese, for example, can be rendered via a quasi-phonetic form, which requires many characters, as well as via more traditional pictographic symbols. Thus the rendering of Chinese is very likely to always require a more sizeable file. Wasn't that the issue?

I didn't learn Chinese in language school like many of my friends, so I don't remember how many characters are in common use. Needless to say, there are a lot more than in English. In UTF-8, the characters that conform to the ASCII character set come first, then some of the extended western and eastern European characters, then apparently the characters less used by westerners. But no matter whether an encoding puts the ASCII characters at the beginning, mapped to keyboard keys and the like, or leaves them at the end, there is no way to render proper Chinese without at least two-byte character codes. Thus Chinese is always likely to result in a larger file, especially since names are the one thing commonly rendered with many characters of more phonetic origin.

jbbenni, this is exactly what you're supposed to do. Thanks for doing it. I hope others start jumping in and trying it out so I don't have to be the only one modifying the actual pages.


At this point, I think there's not a lot more to be said, but hopefully this will still be useful. Anyway, yes we'll certainly support whatever is used by anyone worldwide regarding Unicode encoding, but doing so isn't going to lessen the file size of anyone's GEDCOM.

OH, let me add one more thing from a point made by jbbenni who said,
"...XML is not a good format for a data archive repository..."

I do not advocate that BG be an especially good format for a data archive; it only needs to work. Just like GEDCOM, BG should have the ability to house a genealogy program's database that has been exported and taken offline so that it can be copied onto some media and placed physically into a safe or whatever other physical location a researcher may desire. That's the only goal.
gthorud 2010-11-12T13:54:23-08:00
I see that Greg has edited the goal, part of it now states:
-Use Unicode character set in UTF-8 encoding by default, and support other encoding schemes of Unicode

This may be a bit too relaxed, I would suggest:
-Use Unicode character set in UTF-8 encoding, and optionally support other encoding schemes of Unicode
greglamberson 2010-11-12T14:04:17-08:00
gthorud, please go right ahead and edit the main page. There's no debate of any significance.

It's a wiki, and even if you edit something that is controversial or whatever, that's part of the process.

Go for it!
mstransky 2010-11-21T19:34:08-08:00
So they say XML is not good at big databases and image archives? Well, I already put that to the test.

I dabble in very large XML files that cross-reference other XML files. I have two projects going on, one at http://www.wartimepress.com, running and making an image and PDF archive to search documents. I have always been told XML cannot compete with SQL. However, that is not the case. One can see how my menu XML links like a pedigree outline to father, mother and children navigations; a second XML file holds images, notes and descriptions; a third XML file holds article text inside each publication. I plan to add a fourth XML file for PDF and search functions later.

Sorry, I am kind of lost on these boards and don't like pasting my URLs over and over.
For an example of function and speed, look at wartimepress.com and how it handles tens of thousands of images and articles. Some XML files are larger than 1 MB and work just fine.

Also, disregard the sad templates at http://www.stranskyfamilytree.net/gen%20project/admin-login.asp; I was just testing the function of pedigree, household and other templates across multiple XML files. Those can always be made to look better. I have the online edit screens locked down, but I guess I could make a sandbox area later to show the edit, modify, add and delete dashboards. Just click public view to enter.

I will try to add those links to my profile instead of the board next.

ONE DRAWBACK: I never figured out how to make XML hold European characters and preserve carriage returns. Well, I am glad I found this site and people who really want to make something happen for a change.
gthorud 2010-11-09T18:15:34-08:00
I agree that restricting to UTF-8 is unreasonable, but it should be one support level.

I think the method of compression, if any, should be kept outside this standard. Let the users choose whatever they want - as far as files are concerned.

Transferring a directory tree is an obvious option.

Remember that it will take years before 'betterGedcom' will be implemented, so if needed, newer versions of XML could be used if they provide needed functionality. If not, stay with 1.0.
greglamberson 2010-11-09T19:34:00-08:00
I believe reference to multimedia content will have to be done via reference to other files, but I don't know the answer to this. This is a pretty common situation (i.e., referencing multimedia content using XML), so I am sure more experienced people have ideas about how best to achieve this.

Regarding UTF-8: UTF-8 encoding of the Unicode character set is part of the XML standard. UTF-8 is capable of representing basically every language in regular use on earth. Perhaps I'm misunderstanding, in which case I would appreciate a lesson.

Regarding compression: An interesting point, but until we consider what data formats can and should be acceptable, it's a little early to discuss compression.
brantgurga 2010-11-10T07:32:52-08:00
If external references are the approach taken, you would want to keep the links correct. Exposing the files and XML to end users gives a good chance of them being separated; that's why the packaging, whether it be ZIP or something else, should be part of the specification. Some archive formats have limitations too. For example, old TAR formats have a maximum filename length.

Yes, UTF-8 is very capable. However, it's rather English-centric too. Once you get outside of the ASCII range, the encoding becomes at least two bytes per character. That means you're guaranteeing that, for a similar family tree, a Chinese person's file is going to be bigger than an English person's. Leaving the encoding flexible lets them use a more native format that doesn't bloat as much, while requiring a common UTF-8 implementation ensures that interoperability at the character-set level is possible.
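[Editorial aside: the byte-size point above is easy to check. A minimal Python sketch, with illustrative sample names:]

```python
# ASCII text stays at 1 byte per character in UTF-8, while CJK
# characters take 3 bytes each, so an equivalent Chinese-language file
# is larger in UTF-8 than in a 2-byte-per-character encoding.
ascii_name = "Smith"
chinese_name = "王小明"  # a hypothetical three-character Chinese name

print(len(ascii_name.encode("utf-8")))        # 5 bytes (1 per char)
print(len(chinese_name.encode("utf-8")))      # 9 bytes (3 per char)
print(len(chinese_name.encode("utf-16-le")))  # 6 bytes (2 per char)
```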
gthorud 2010-11-10T08:49:49-08:00
I thought the first entry in this thread was focusing on the overhead of xml, but realize that the more important issue is the transfer of enclosed files. Transferring a directory will require a "container", which may be a zip file (or one using another compression algorithm), though zip may be the preferred one. Ajlara has mentioned the Open Package Convention http://bettergedcom.wikispaces.com/message/view/Data+Models/29603857#29943377 I have not studied it, but it could be a basis for a ".gedz" format.

The important issue wrt Character set is the statement about Unicode, the set of characters that must be handled inside a program. It is no problem to convert between encodings, so although UTF-8 may be preferred by many, it is not necessary to restrict the choice to this encoding alone, as long as the encoding supports Unicode. You could even require support for UIF-8, while not excluding others. XML mentions several encoding alternatives.
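[Editorial aside: the ".gedz" container idea mentioned above can be sketched with a plain zip archive. The file names and XML content here are hypothetical, not from any BetterGEDCOM specification:]

```python
# A minimal container sketch: the data file plus enclosed multimedia
# packed into one zip archive, so relative links can't be separated
# from the data by end users moving files around.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("tree.xml", "<bettergedcom><person id='I1'/></bettergedcom>")
    z.writestr("media/portrait.jpg", b"\xff\xd8\xff")  # stand-in JPEG bytes

with zipfile.ZipFile(buf) as z:
    print(z.namelist())  # ['tree.xml', 'media/portrait.jpg']
```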
gthorud 2010-11-10T08:53:34-08:00
Sorry, UTF-8, not UIF-8.
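[Editorial aside: gthorud's point that converting between encodings is no problem, as long as each one covers the characters in use, can be shown directly:]

```python
# Re-encoding is lossless between any encodings that cover the text's
# full character repertoire, so a reader can accept several encodings
# and normalize to Unicode internally.
text = "Müller, Łódź, 東京"

for enc in ("utf-8", "utf-16", "utf-32"):
    assert text.encode(enc).decode(enc) == text  # lossless round trip

# A legacy single-byte encoding only works when it covers the repertoire:
print("Müller".encode("latin-1"))  # fine
# "東京".encode("latin-1")         # would raise UnicodeEncodeError
```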
greglamberson 2010-11-10T14:21:23-08:00
Even though I'm a computer guy with pretty serious networking and document management experience, I know nearly nothing about using character sets outside the US and Europe. However, I also know about languages.

I do know that, depending on the particular form of Chinese, for example, their character set is upwards of 2,500 separate characters. Our symbology, i.e., our basic alphabet, has 26. Add some special characters and we get to around 300 commonly used characters.

For the same purposes, the Chinese have about 10 times as many unique characters. How is it exactly that, mathematically, you get that down to something equivalent to our alphabet? I have no idea.

More important, what character sets or encoding would you propose be alternatives? I also am under the impression that UTF-8 is _THE_ character set for XML. What am I missing here?

The only encodings I am aware of that XML refers to as alternatives are larger subsets of the Unicode character set. However, I thought these were essentially theoretical.

I think someone's missing some information here, and while it may be me, I don't think so. Please enlighten me.
greglamberson 2010-11-09T19:57:22-08:00
Approaches To Standardization
Here is the page that is the starting point for looking at ways to pursue and develop standards:

http://bettergedcom.wikispaces.com/Approaches+To+Standardization

This page is nested in the "BetterGEDCOM Sandbox" on the navbar to the left. As mentioned in the "Goal 4" post on this page, the organization through which we pursue standardization will determine a great deal about how the standard is developed, managed and changed.
AdrianB38 2010-11-10T13:36:45-08:00
Compatibility with GEDCOM
I believe that we should have a goal about compatibility with current GEDCOM. If it's too difficult for people to transfer their data - it won't happen.

Clearly users must be able to convert from current GEDCOM to BG. Reverse conversion of selected entities is, I believe, desirable.

To accomplish this goal, I believe that the data model of current GEDCOM must map onto that of BG in a straightforward fashion (i.e. no either-or decisions).

Issues here include - which GEDCOM? I think compatibility with 5.5 is a must. Thereafter, one might say it's up to the software designers???
AdrianB38 2010-11-15T02:15:59-08:00
Greg - re your concerns about tying the models together versus making it practical. That's actually my very personal concern since I use FamilyHistorian from Calico Pie in the UK, which uses GEDCOM 5.5 as its native file format. Calico claim 100% compatibility with 5.5 and I think the only issue over which anyone has disputed that claim is the coding (ANSI, ANSEL, whatever).

Andy - I suggest that we can read the standard for ourselves to determine if a file has been "correctly formed". The fact that, for historical reasons, some software guys have omitted bits, stuck other bits in the wrong area, while no doubt the standard has moved, should not inhibit us from making a start somewhere.

I believe that for theoretical reasons (and for my selfish reasons because I use software that's roughly 99.999% compatible to the Standard), we should start with the GEDCOM 5.5 standard.

If we come up with something that looks incompatible to a well-known flavour of GEDCOM or another program's data model, then if anyone can justifiably tweak the BG model, then fine. We need to bear in mind that the people who will write the bespoke conversion software will probably be the writers of the original software (apart from someone hopefully writing a generic GEDCOM 5.5 to BG routine) and not worry about the last detail of the conversion routine ourselves.

So I suggest we start by delivering a BG data model that's compatible with the common denominator, which I suggest is GEDCOM 5.5 Standard. And then check that against other data models.
louiskessler 2010-11-24T21:39:51-08:00

Andy:

I'm coming in late, but as a software developer of a program that reads in any flavor of GEDCOM and displays it appropriately, I can vouch that it will be possible, as Greg says, to write that translation program.

Once that translation program is developed and available, users will point out any subtleties that the program is not getting quite correct from a GEDCOM output by a certain vendor. Those can be corrected and it won't be long before the translation program does an excellent job.

Maybe that's the way I can make a major contribution to this project. Once the BetterGEDCOM standard is developed, I can add an "Export to BetterGEDCOM" function to my program and make that function available free to those who want it.
louiskessler 2010-11-24T21:42:45-08:00

Adrian:

It doesn't matter if a GEDCOM file is "correctly formed" or not. It is just a matter of that file being correctly interpreted. If it can be interpreted, then it can be translated properly.
mstransky 2010-11-24T21:52:40-08:00
That is what I am looking for also: something to put my hands on and start working with. When/if they start a basic BG model draft, I also would like to start messing around making some code to handle it. As it grows I can make my edits to my coding.
AdrianB38 2010-11-25T02:43:44-08:00
"It doesn't matter if a GEDCOM file is 'correctly formed' or not. It is just a matter of that file being correctly interpreted."

Louis - I totally agree with you. The reason I stuck that "correctly formed" phrase in was to limit what the BG standard and community, including users of BG-type software, have a right to expect of software designers and coders.

If I'd said that the model and conversion software had to be able to import any old garbage of GEDCOM, then I'd be setting unfair expectations and an open-ended commitment. The "correctly formed" phrase went in as a target for an absolute minimum of what the new BG using community would reasonably expect.

If others can interpret other forms of GEDCOMs correctly and translate them, that's great. If it turns out that program X produces incorrect and ambiguous GEDCOM, that can't be loaded, well, you've got my "correctly formed" phrase as a reasonable excuse.
VAURES 2010-11-26T10:09:32-08:00
Question: since almost every vendor has their own 'flavor' of GEDCOM, how and who will determine that a file has been formed correctly according to GEDCOM 5.5?
GEDCOM 5.5 and also 5.5.1 could be followed strictly, but nobody does. This makes export-import quite cumbersome if not impossible.
Besides the problems or incompatibilities caused by the makers, it's the users who cause confusion as they misuse data fields for their own purposes, sometimes by negligence, sometimes because the (eagerly needed) data field is not available.
I think a more appropriate way to a better GEDCOM is a consensus about the interpretation of the meaning of the current GEDCOM tags and how they should be used.
There are only a few tags really missing, such as different ways to celebrate/certify a marriage.
In Europe, 21 German-speaking authors are trying to go this way by vivid discussions and final consenting votes.
I am sure this approach has a much better future than a completely new "betterGEDCOM" that can simultaneously swim, walk, and fly.
VAURES 2010-11-26T10:13:51-08:00
"It doesn't matter if a GEDCOM file is 'correctly formed' or not. It is just a matter of that file being correctly interpreted."

Doing this you must take into consideration the misuse of data fields by users, and there may be millions of them (I'm one too!)
This will make a "translation" quite difficult.
Wulf
hrworth 2010-11-26T10:18:26-08:00
vaures,

What you described is the reason for this BetterGEDCOM Wiki. To get at these issues.

Clearly, the various software vendors / developers need to join us in addressing these issues.

If you read some of Greg's postings, you will see references to various Standards bodies, to help make this International in nature.

That BetterGEDCOM doesn't have to do any of the activities you mention. It is to Transport my Research from my application to another Researcher's application, without doing anything to the content of my data.

Not an easy task, but that is the purpose.

Russ
mstransky 2010-11-26T10:25:48-08:00
Hey Russ, I sent you an email. It would be great for a non-technical person if they can follow it. If I can clarify it in better terms, maybe I can post it here "CLEARLY" without having to babble 5 or 6 times to make a point or opinion.
ttwetmore 2010-11-26T10:38:19-08:00
Vaures,

I don't see much value in trying to patch Gedcom. And the jury is out on whether the BG effort will do anything meaningful. There is not a great track record for these things.

The problem with Gedcom has nothing to do with a few missing tags and a few multi-interpretations of tags. The issue is the Gedcom model. Gedcom was designed to summarize families so LDS ordinances could be performed on them. Every use beyond that is forcing a square peg into a round hole.

Genealogical programs all have their own internal models that their authors designed. When they export Gedcom they are forced to squeeze their models into the very simple Gedcom model. With Gedcom import they are forced to translate Gedcom concepts to their own. Both transformations may involve data loss or data misinterpretation. This is not the fault of the applications; they provide a more complete interface to genealogy than can be archived in a Gedcom file. The only way out of this dilemma is to have an archive and transport format that is built around a much more generalized model that fits well with the implied models behind modern genealogical software.

The BG effort is trying to establish a more complete model that can handle more or all of the genealogical research process. The goal is to make the BG model match as closely as possible the models underlying current genealogical systems, and even more ambitious, that the BG model will be so excellent that the next generation of genealogical software will use the BG model as their own internal models.

Bon voyage on your journey.

Tom Wetmore
greglamberson 2010-11-26T12:37:06-08:00
Regarding "patching" GEDCOM: Taking this sort of approach would be about like putting wings on a car.

Tom said, "The goal is to make the BG model match as closely as possible the models underlying current genealogical systems, and even more ambitious, that the BG model will be so excellent that the next generation of genealogical software will use the BG model as their own internal models."

Exactamundo.
AdrianB38 2010-11-27T10:28:23-08:00
Vaures said "users who cause confusion as they misuse data fields for their own purposes, sometimes by negligence, sometimes because the (eagerly needed) data field is not available.
I think a more appropriate way to better GEDCOM is a consensus about the interpretation of the meaning of the current GEDCOM tags and how they should be used."

The more I think about it, the more I get relaxed about how tags are interpreted.

1. If we tighten up the definition of tags, we don't correct stuff that's already been written.
2. If we tighten up the definition of tags, we risk confusing some people (for instance, in the British Army they _award_ gallantry medals but _issue_ campaign medals - is this worth trying to get through to everyone, given that even the Army got it wrong at times!)
3. So long as the tag prints in a report with the expected sentence made up for it, isn't that enough?
4. Item 3 does NOT always apply - some tags DO need intelligence about them - e.g. the software needs to know that death is pretty final (but, yes, probate, etc, comes after).
5. If we provide user-defined facts, there is no need to keep updating the standard.
AdrianB38 2010-11-11T12:44:18-08:00
jbbenni and DallanQ - absolutely agree with you

Russ - agree with what you say, though my initial thoughts are that exceptions are a matter for the software designer, rather than the designer of the file format. (The BG file format can be defined without any software to implement it). On the other hand, the designer of the file format needs to ensure they don't do anything that would obviously stop exceptions being raised. Not sure how it might happen but who knows...
jbbenni 2010-11-11T13:00:43-08:00
Russ at hrworth,

For exceptions that pop up during conversion, I agree with you. I was actually thinking about a different situation.

The point I was trying to make is that the goals don't actually say what many of us may assume. In particular, there is no goal to say:
"All features and functions of version XX.XX of the GEDCOM standard will be supported in BG."

I wonder: should we commit BG to being a superset of GEDCOM (as its name implies) or are there constructs in GEDCOM that we don't want to bring forward into BG? Either way is fine with me, but I think the goals might want to make the answer to this question explicit.

It's natural to want to "embrace and extend" GEDCOM, and thereby preserve forward compatibility (the ability to convert from GEDCOM to BG without losing any content). But the risk is that we'll inherit baggage from the GEDCOM design that we'd rather leave behind.

GEDCOM wizards, why are we reinventing this wheel? Is there something so objectionable in GEDCOM that we should be careful not to bring it into BG?
DallanQ 2010-11-11T13:53:23-08:00
In my opinion, there are things from the gedcom standard that you may not want to include in future standards. Here are a few issues:

(1) The name structure allows full names, name pieces, or both. Some programs store full names, others store name pieces. If your program stores name pieces and you receive a gedcom from a program that stores only full names, you have to parse the name into its constituent pieces, which is problematic. It would be better to insist that all programs store name pieces, since full names can be generated from the pieces easier than the other way around. (I know about the /surname/ convention, but not all full-name-storage programs use it.)

(2) There are many things in the standard that few (if any) programs actually use. If a program displayed all of the possible attributes of a person or event that are in the standard, it would make the UI extremely complex. So most programs use only a subset. But what happens when you get a gedcom from a program that uses a different subset than your program - what do you do with the fields that are not in your subset? There are so many fields in the standard that it's like a smorgasbord where each program picks and chooses what fields to use. It would be better to have an agreed-upon subset that everyone used.

(3) The standard is recursive. A note can contain a source citation which can contain a note, etc. Try coming up with a UI to represent that to the average user.

I think you'd be better off to survey what gedcom features are actually used by the major records managers and use that as your starting point.
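[Editorial aside: DallanQ's point (1) about the /surname/ convention can be sketched. GEDCOM 5.5 delimits the surname with slashes inside the NAME value, e.g. "John /Smith/ Jr"; without those markers, splitting a full name into pieces is guesswork, which is his argument for storing pieces:]

```python
# Split a GEDCOM-style personal name into pieces using the /surname/
# markers. Names without markers return None: the split is ambiguous.
import re

def split_gedcom_name(name):
    m = re.match(r"\s*([^/]*?)\s*/([^/]*)/\s*(.*)", name)
    if not m:
        return None  # no /surname/ delimiters present
    given, surname, suffix = m.groups()
    return {"given": given, "surname": surname, "suffix": suffix}

print(split_gedcom_name("John /Smith/ Jr"))
# {'given': 'John', 'surname': 'Smith', 'suffix': 'Jr'}
print(split_gedcom_name("Mary Jones"))
# None - cannot tell given name from surname
```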
greglamberson 2010-11-11T14:03:15-08:00
Adrian,

If by compatibility you accept there can be a separate, manual conversion process, then absolutely, the two will be compatible. I am absolutely certain of this. Converting BG to GEDCOM might result in some data loss, but this is due to the fact that GEDCOM has less definition and/or capability than BG will have. I am 100% sure any GEDCOM file will easily be convertible to BG, but this project will not provide that capability.

Compatibility, to me, as a technology project manager for a couple decades, means something is recognized as an equivalent but perhaps earlier version of the same product and is immediately usable with no modification. Anything that requires a separate conversion process is not compatible. To be compatible by my usage, there can be no separate conversion involved.

Regarding GEDCOM's ambiguities, this is the major, nearly universal complaint software developers dealing with GEDCOM have about it. I have had some of them explained to me, but I admit I take these guys at their word and am not capable of addressing these ambiguities myself.

All in all, I really don't think this is any issue whatsoever.

I started this 2 hours ago and have been otherwise engaged. If I don't send this now, this will be lost.
DallanQ 2010-11-11T15:17:47-08:00
FWIW, I have access to over 5,000 GEDCOM's that have been imported into WeRelate over the past several years. I can't make them public, but if someone came up with a GEDCOM -> BG conversion program, I could run it over the GEDCOM's and tell people where the exceptions were.

BTW, I think anyone trying to come up with a better GEDCOM should do some homework and have a good understanding of the existing standard:

http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

Chapter 2 is the key chapter for the data model.
igoddard 2010-11-13T15:48:02-08:00
If you expect existing Gedcom files to map effortlessly onto a new structure then you might as well give up trying to devise a better structure. You'll simply lumber the result with all Gedcom's faults first and foremost amongst which is the fact that it uses exactly the same structure to record evidence and conclusion.
greglamberson 2010-11-13T15:58:18-08:00
igoddard,

Getting existing GEDCOM files into a new format will be no problem, as GEDCOM is a simpler format. It's getting a BetterGEDCOM format into the old GEDCOM format that won't work as well.
AdrianB38 2010-11-14T13:57:08-08:00
"If you expect existing Gedcom files to map effortlessly onto a new structure then you might as well give up trying to devise a better structure"

Apologies for banging on about this but if GEDCOM does not map all but effortlessly onto BG, then we are wasting our time.

Yes, the default conversion will not split X and Y but if I have to re-enter all my data in order to do that split, then I simply won't do it.

We have to create a structure for BG that allows GEDCOM 5.5 (say) to come over pretty much automatically, leaving - naturally - all the extra BG capabilities empty.
greglamberson 2010-11-14T13:59:52-08:00
I believe we're saying the same thing but disagreeing about it. lol
AdrianB38 2010-11-14T14:11:27-08:00
Greg
Re "compatibility" - I think we're in agreement. Almost.

My only issue is where you say "the two will be compatible. I am absolutely certain of this." Unless we state this as an objective, then we can't guarantee it.

I suggest that the goal should read:
"The data models of GEDCOM 5.5 and BG _must_ be such that it be possible to write software to convert a file of data formed correctly according to GEDCOM 5.5 to a file of data formed correctly according to BG with no exceptions.

It is desirable that it be possible to write software to convert a file of data formed correctly according to BG to a file of data formed correctly according to GEDCOM 5.5 with no exceptions. It is acceptable that entity types not represented in GEDCOM 5.5 need not be converted."

I think there may be exceptions - e.g. BLOBs, but they can be recorded as specific objectives - please note these are goals which in my book are higher level.
greglamberson 2010-11-14T14:19:14-08:00
Adrian,

OK, you got me. As desirable as that may be, I am not willing to tie the two together so formally. It's far more important that BetterGEDCOM be a practical tool capable of mapping to today's genealogy software applications' data models directly.

I think we're shooting for the same thing here, but I can only see problems with getting too specific on issues like this too fast and too formally.
Andy_Hatchett 2010-11-14T15:43:54-08:00
"write software to convert a file of data formed correctly according to GEDCOM 5.5 to a file of data formed correctly according to BG with no exceptions."

Question- since almost every vendor has their own 'flavor' of GEDCOM, how and who will determine that a file has been formed correctly according to GEDCOM 5.5?

Or does this mean that BG will have to be able to convert all flavors of GEDCOM into a correctly formed BG format?

I guess what I'm looking for is a definition of "correctly formed".
greglamberson 2010-11-10T14:56:56-08:00
This is an issue that has come up several times. Simply put: It is not reasonable to expect GEDCOM to be compatible with an XML format. It is easy to develop a tool that would convert a GEDCOM file into the new format. In fact, such tools already exist. Such tools will undoubtedly be available once we figure out where we're headed here. However, it is not reasonable to expect BG to be "compatible" with GEDCOM.

As I said, there will no doubt be tools that convert GEDCOM to BG (or whatever), but anything beyond that would be like insisting a new car be equipped with a furnace and boiler to be compatible with steam powered cars.
gthorud 2010-11-10T16:36:28-08:00
I have not studied the issue myself, but I am told that there are many ambiguities in Gedcom and that it has several ways to structure what is essentially the same info. If that is the case, I think a new standard should try to NOT import those problems.
jbbenni 2010-11-11T07:33:13-08:00
There's another angle to AdrianB38's post. Is BG expected to contain all the information that can be encoded in GEDCOM statements? (I suspect the answer is generally yes, but some deprecated GEDCOM constructs like BLOBs might want to be avoided intentionally.)

If there is an expectation that BG >= GEDCOM, then perhaps it does need a goal to say so. If not, what exceptions are allowed?
hrworth 2010-11-11T08:06:24-08:00
jbbenni,

If there is a data element in the BetterGEDCOM file (whatever that ends up being) that is an Exception, the End User receiving it needs to be given complete details of the exception or of the improper formatting, AND the Option to override the Exception.

I see this happening during the "transition" from Old to New, while software vendors get on board and implement 'the new'.

Key here, for me at least: tell me what the problem is, and provide me the choice to accept or reject the data being provided to me.

Thank you,

Russ
DallanQ 2010-11-11T09:19:58-08:00
At WeRelate we go from GEDCOM -> XML when someone uploads a GEDCOM and XML -> GEDCOM when someone exports a GEDCOM. There are a few instances where data elements are converted to notes in the exported GEDCOM, but this is fairly rare.

I believe that GEDCOM compatibility is important, since most software won't understand the new format at first. And I believe that compatibility is possible if you're willing to live with some data being converted into notes.
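[Editorial aside: the "converted into notes" fallback DallanQ describes can be sketched as follows. The field-to-tag mapping is illustrative only, not from any published WeRelate or BetterGEDCOM mapping:]

```python
# Data elements the target GEDCOM subset doesn't model are written out
# as NOTE lines rather than silently dropped.
KNOWN = {"name": "NAME"}  # fields with a direct GEDCOM tag (hypothetical subset)

def to_gedcom_lines(person):
    lines = ["0 @I1@ INDI"]
    for field, value in person.items():
        if field in KNOWN:
            lines.append(f"1 {KNOWN[field]} {value}")
        else:
            # No matching tag: preserve the data as a note instead of losing it.
            lines.append(f"1 NOTE {field}: {value}")
    return lines

for line in to_gedcom_lines({"name": "Ann /Lee/", "dna_haplogroup": "R1b"}):
    print(line)
```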
AdrianB38 2010-11-11T11:47:58-08:00
"It is not reasonable to expect GEDCOM to be compatible with an XML format. It is easy to develop a tool that would convert a GEDCOM file into the new format."

Greg - we clearly understand different things by "compatible". If it is easy to develop a tool that would convert a GEDCOM file into the new format, then in my vocabulary the GEDCOM file format is compatible with the BG format (at least, in one direction it is).

By compatibility I mean the ability to convert a GEDCOM file into a BG file with an acceptable number of exceptions. (Lots of bits skipped over in that definition, e.g. whose version of GEDCOM and what's an acceptable number of exceptions, but that's the high level requirement).

Secondly, no-one can possibly say "It is easy to develop a tool that would convert a GEDCOM file into the new format" when we don't know what the new format is. To take one example, the Individual entity in GEDCOM generally represents a single person in the real world. (Subject to limits of research etc). I could envisage a new format where there is no Individual entity but instead Individual-Source entities, where an Individual-Source represents the information about a person in one and only one source document, and there is no attempt to say that "John Smith" in the 1881 census is the same as the "John Smith" in the 1871. The software might then generate Individuals in the old sense dynamically. It would be very difficult for GEDCOM Individuals to be split up to create the corresponding Individual-Source entities on a conversion.

You may say - "No-one would design it like this" - well, they might unless we put goals on that inhibit them from doing so.
AdrianB38 2010-11-11T11:56:01-08:00
"I have not studied the issue myself, but I am told that there are many ambiguities in Gedcom and that it has several ways to structure what is essentially the same info"

I know of no ambiguities in the GEDCOM 5.5 standard. That doesn't mean there aren't any, just that the circles in which I move consider there aren't any.

"it has several ways to structure what is essentially the same info" - very possibly true. But these are not ambiguities, these are simply flexibilities. Each way is clearly defined, i.e. there is no ambiguity. Just about any computer language and / or standard allows the same flexibilities.

If someone cannot import their GEDCOM data into BG software because we've discarded all but one of the flexible options, then BG is sunk from the beginning.
AdrianB38 2010-11-10T14:25:06-08:00
Scope of BG
In several places at a lower level, discussions have highlighted questions of scope. I think this needs to be settled as a high level goal.

Do we want BG to be a system just used for recording conclusions - pretty much as GEDCOM is now? (Then the question arises - do we want to allow alternative conclusions to be recorded, e.g. for births, even though clearly someone is only born once)

And / or do we want BG to record evidence and / or evidence management?
greglamberson 2010-11-10T15:52:51-08:00
Adrian, I think I'm following you.

There is no doubt that evidence and theory development are key goals this project will seek to meet at some point. The question is, how soon?

The concept of a genealogical proof workspace and standards for the same are things we discuss at every meeting. Mostly these concepts are considered to be accommodated in a subsequent project. Developing a more robust XML-based GEDCOM-like file format is the focus right now. Another major goal that is considered subsequent is adapting this new BG standard into an XML schema that (perhaps as a more limited subset) can be the basis for true genealogical data communications between applications and services directly.

My question at the moment is this: Is it possible to accommodate both a conclusion model and an evidence/theory model within one data model (without being so unwieldy as to be unusable)? If not, where do you draw the line between these two approaches since so many applications have their own ways of accommodating some aspects of evidentiary deduction?
First and foremost, we want to be practical: we want to get users' data out of and into today's applications, but we also want to move toward the future. How can we do this?

Should we develop a GEDCOM conclusion model that has base objects like "assertion" or "theory" or whatever that aren't developed but give us the option of modular expansion in a subsequent release of the standard? Should we essentially build two data models and throw them into the same standard? Is even dealing with both issues doable, or is it too much to handle right now?
gthorud 2010-11-10T16:24:42-08:00
One way or another, I think you must be able to handle both conclusions and evidence/theory info in the same standard - long term. The conclusion model may not be far from a subset of the evidence/theory model. You may perhaps see both methods used in the same file.
AdrianB38 2010-11-11T12:58:49-08:00
Greg, I wholly support the idea of the priority being to create a GEDCOM-like format first, which implies a conclusion model first.

I don't know enough about the entities in an evidence / theory model to be at all certain but I would hope and suggest that a high-level data model could be produced that accommodated both a conclusion model and an evidence / theory model in the same data model. Then priorities could take over and the conclusion parts be fully developed with their attributes, relationships, removal of many to many relationships, etc, etc.
greglamberson 2010-11-11T18:03:47-08:00
Adrian,

You're talking about the real core stuff here. This is the meat of the current work. What we need to do is detail a few data models, then compare them to best practices of some genealogical methodology folks (and we've got some good ones involved with this project). Then we'll essentially stress test the models and see how they can handle what we would like them to do. Also, we will look at some example software applications to see how well they map to these various models and how difficult it would be to resolve any problems in mapping their data elements as the software is currently written.

This is a practical project as well as one that is very concerned about accommodating best practices and the future.

I strongly encourage you to read up on the various previous projects or your favorite application, then map it out or explain its data elements in as much detail as possible.
This is the real work to be done here, and this wiki is not a spectator sport. Jump right on in!
brantgurga 2010-11-11T17:39:24-08:00
Goal 5 (Internationalization)
There are many internationalization mechanisms inherent to XML such as language attributes. However, with genealogical information, we deal with languages no longer used for which there may be no language code defined. We should have a way to indicate such languages in some fallback way. Last I looked, Gedcom really only had a name field, but that's insufficient because one person has one name in a language but perhaps a different one in another language. So I propose a goal that, as much as possible, such internationalization differences are acknowledged.
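[Editorial aside: the per-language name idea above maps naturally onto XML's built-in xml:lang attribute, with private-use "x-" subtags as one fallback for languages that lack a registered code. The element names in this sketch are hypothetical:]

```python
# One <name> element per language, distinguished by xml:lang. The xml:
# prefix is predefined in XML, so no namespace declaration is needed.
import xml.etree.ElementTree as ET

doc = """
<person id="I1">
  <name xml:lang="en">Vladimir</name>
  <name xml:lang="ru">Владимир</name>
</person>
"""
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
root = ET.fromstring(doc)
names = {n.get(XML_LANG): n.text for n in root.iter("name")}
print(names)  # {'en': 'Vladimir', 'ru': 'Владимир'}
```

For a language with no registered code, xml:lang allows private-use tags such as `x-old-novgorodian`, which would satisfy the "fallback way" the post asks for.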
hrworth 2012-01-23T15:57:03-08:00
GeneJ,

Since this topic has already been covered, and it appears that it has been, why is it coming up again? I was only responding to this thread of messages.

Sorry,

Russ
ttwetmore 2012-01-24T02:58:21-08:00
I once again hereby register my disagreement that dates in genealogical data should conform to strict standards of any kind.
ACProctor 2012-01-24T05:50:56-08:00
Re: "I once again hereby register my disagreement that dates in genealogical data should conform to strict standards of any kind."

We're talking about the computer-readable versions Tom, not the transcribed ones.

You of all people should know that the computer format has little to do with the external format that we use day-to-day, or are you saying that no software program should ever convert a date to a binary form either?

I apologise for being terse, but this is a no-brainer that people insist on misinterpreting.

Tony
ACProctor 2012-01-24T05:57:17-08:00
Re: privacy, One way to ensure the privacy of data stored directly in a BG file is to encrypt it.

Obviously attachments such as images, records, etc., are a different case but there's no real problem with encrypting content that's in the BG file - irrespective of whether it's XML-based or otherwise.

I don't see any discussion of this subject elsewhere, except for a brief mention on the 'Container & File Issues' page. There, however, encryption would be an all-or-nothing feature, whereas doing it at the source level allows selected parts to be encrypted separately, using different keys.

Tony
ttwetmore 2012-01-24T08:57:22-08:00
Tony,

Thanks, but I'm not misinterpreting things. I don't believe the INTERNAL values need to meet any strict standards. Obviously, if the date is known exactly, the format should be UNAMBIGUOUS, but this does NOT mean that it must conform to some strict standard. Genealogical dates are often ranges, often partially known, often only probable or possible, often in a form that can't be tied directly to any calendar ("some time in spring a few years ago"), and so on. I believe it is counterproductive to try to force essentially sloppy, humanistic data into strict formats. I even coined a term for this more than twenty years ago -- "the tyranny of relational thinking." An internal date value should be a string, yes, and a string that should be interpretable as a date by reasonably intelligent software in order to generate sorting information for lists. In genealogical data there is almost no reason to get bent out of shape about an imprecise date.

This is another area where I fear I will be inundated by the growing tide of standards-toting techies, but I feel I must at least put up some token resistance against the oncoming tsunami.

My LifeLines software allows arbitrary date strings, and it does a wonderful job of figuring out the best way to parse date information from those strings. I'm really not sure where all the fear of non-standard date strings comes from.

Tom
ACProctor 2012-01-24T09:13:39-08:00
An internal date value is rarely a string Tom. That's why most databases have a "timestamp" data type.

We've been on this topic before but the point is to do with the dual representations of a date (i.e. computer-readable and humanly-readable) and nothing to do with its imprecision.

If you have a document registered somewhere in "the first quarter of 1860" then you can store an ambiguous representation of that for software to handle. It doesn't displace the original form and that would be kept too.

Those internal values are essential to many types of analysis and presentation. It's a pipe-dream to expect to parse all possible date forms, from all possible calendars, and from all possible locales.

Tony
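Tony's dual-representation point can be sketched as a record that keeps the transcribed wording verbatim alongside an optional machine-readable range; the field names here are illustrative only, not from any specification.

```python
# Illustrative only: field names are not from any BetterGEDCOM draft.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenDate:
    original: str                   # exactly as found in the source document
    earliest: Optional[str] = None  # ISO 8601 lower bound, if one can be assigned
    latest: Optional[str] = None    # ISO 8601 upper bound

# "first quarter of 1860" keeps its wording but still sorts and filters:
d = GenDate("registered in the first quarter of 1860", "1860-01-01", "1860-03-31")

# "some time in spring a few years ago" simply has no machine-readable bounds:
vague = GenDate("some time in spring a few years ago")
```

Neither representation displaces the other: the original form is never lost, and software that wants to sort or filter uses the bounds when they exist.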
ttwetmore 2012-01-24T10:53:26-08:00
Tony,

Thanks. I remain steadfast. I don't want a timestamp data type. There are ways around pipe dreams.

I've been in the business of genealogical software for 25 years. I've never had to distinguish between a human readable and a machine readable date. For me this is a manufactured problem that enables an unnecessary solution to be introduced.

As I said, I just want to make my whisper heard against the roar of the incoming tide. I know I will be swept away and am resigned to it.

Tom
ACProctor 2012-01-24T11:01:21-08:00
There's a large arrow with big flames on it currently heading your way from here Tom :-))

Tony

P.S. if you ever find you way onto Skype, I'd love to chat about "past lives".
louiskessler 2012-01-24T14:10:23-08:00

A few months ago, I programmed into Behold the date definitions precisely as they are defined in GEDCOM 5.5.1. To me, it seems to be quite a comprehensive definition. It includes date ranges, estimated dates, interpreted dates, and it also allows date phrases (enclosed in parentheses). It has double dates, alternative calendars and B.C.

I don't see what it might be that the GEDCOM definition of dates would not cover.

I do not store dates internally in Behold as datetimes. You can have two dates, double dates, descriptors, etc. So I store a sortable date string made up of all the GEDCOM date parts. The point is that I don't see any built-in timestamp/date type as adequate.

As well defined as the GEDCOM dates are, most programs still don't export them properly. They use "Jan" instead of "JAN", or the word About instead of ABT, or don't put date phrases in parentheses.

Louis
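The export sloppiness Louis describes ("Jan" for JAN, "About" for ABT) is mechanical to repair. Here is a hedged sketch of such a normaliser; it handles only a few common variants, and real exports vary far more:

```python
# Sketch: map loose forms programs actually emit onto strict GEDCOM 5.5.1 tokens.
QUALIFIERS = {"about": "ABT", "abt": "ABT", "circa": "ABT",
              "est": "EST", "estimated": "EST",
              "before": "BEF", "bef": "BEF", "after": "AFT", "aft": "AFT"}
FULL = {"january": "JAN", "february": "FEB", "march": "MAR", "april": "APR",
        "may": "MAY", "june": "JUN", "july": "JUL", "august": "AUG",
        "september": "SEP", "october": "OCT", "november": "NOV", "december": "DEC"}
ABBRS = {v.lower(): v for v in FULL.values()}  # "jan" -> "JAN", etc.

def normalize_gedcom_date(text: str) -> str:
    out = []
    for tok in text.split():
        low = tok.strip(".,").lower()
        # Qualifier, full month name, 3-letter month, else pass through unchanged:
        out.append(QUALIFIERS.get(low) or FULL.get(low) or ABBRS.get(low) or tok)
    return " ".join(out)
```

Anything not recognised is passed through untouched, in the spirit of never altering data the software doesn't understand.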
ttwetmore 2012-01-24T14:21:15-08:00
Louis,

Watch the sky carefully for flaming arrows! Lucky for you Manitoba is not between Ireland and Massachusetts!

On a similar topic are you getting good auroras up there?

Tom
louiskessler 2012-01-24T17:35:16-08:00

Tom,

If it weren't for the fact that you're such a "Persona", we might actually get along. We agree on almost everything else.

We're about 500 miles south of the main aurora. It's probably visible about 40 nights a year, and I probably only see it about 10 nights a year.

Louis
ACProctor 2012-01-25T04:03:14-08:00
Re: Dates, this page provides a very concise description of the STEMMA approach to dates.

http://www.parallaxview.co/familyhistorydata/home/document-structure/event/dates

This is actually working very well during my trials and is one of the parts I don't expect to be revising as a result of those trials :-)

Tony
GeneJ 2012-01-23T11:58:13-08:00
Or we could just go back to Intercontinentalisationalism. It's a personal fav of both Andy and Robert. --GJ
louiskessler 2012-01-23T11:58:23-08:00
Sorry. Humerus is a bone. I meant humorous.

And I'll now stop wasting our board space with these types of posts.
ACProctor 2012-01-23T12:03:38-08:00
Sorry Louis. I'm too serious sometimes. That's one good reason for keeping a bit of humour in the chat.

P.S. Andy, Intercontinentalisation always makes me think of an ICBM :-)

Tony
louiskessler 2012-01-23T12:09:55-08:00

... but seriously, GeneJ. There are two world standards for this, and we are forced at every step to pick one over the other. (We can't write each word twice.) Whichever is picked will make the other group feel slighted.

See: http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#-ise.2C_-ize_.28-isation.2C_-ization.29

This relates importantly to our Internationalis/zation goal, and the same argument applies to every single decision BetterGEDCOM makes as to how to represent something.

BetterGEDCOM should not allow ambiguity for it causes complexity.

e.g. Writing dates. We will need to choose one format. We can't allow 50 different formats, or 100 different names for "January".

It's the program that should allow the multiple formats, but the data transfer / data storage method must pick just one method for all to follow.

Louis
Andy_Hatchett 2012-01-23T12:11:24-08:00
ICBM?

InterContinational BetterGEDCOM Movement?
;)
ACProctor 2012-01-23T12:27:41-08:00
DEC coined the term i18n to represent both variations of the name, plus all those other variants resulting from poor spelling and plain old typos :-)

Re: dates and stuff - this is exactly what I'm currently writing as the problem has been solved for years in the software industry. It's just that many developers still don't follow best practices. In other words, I agree with you Louis but the chosen formats are now prescribed by standards that need to be documented.

Tony
GeneJ 2012-01-23T12:53:47-08:00
HI Louis,

Actually, those on the DTO found the "s" just a logical choice. As I recall, one commented this will serve as a subtle reminder to some that FHISO has an international mission; for others, it just shows we know how to spell.

The text of the bylaws, as drafted, which we hope will be adopted by the first board, are written in UK English, too.

One question was whether or not translators worked as well with the "s" as the "z"--not for the benefit of English speakers, but for the benefit of those who do not speak English as their first language. I guess if you think about it in that context, that we need to become accommodating to more non-English speakers, perhaps it puts this issue and our little "s" in a different perspective.

More to the point of your posting, though, I believe it has been recommended that an ISO standard be used for date storage/transfers. As you say, this storage/date transfer format is really an issue different from the options my vendor might provide at the interface level (10 Jan 2012, 10 January 2012, January 10, 2012...). Alas, this is consensus-building work for the technical committees to address. --GJ
hrworth 2012-01-23T13:12:57-08:00
@Louis,

I have a 'simple' question about:

"Private information is what the Government uses the "Freedom of Information Act" to access.

Public information is what the Government uses the "Privacy of Information Act" to protect."

It doesn't have anything specific to do with your message, but is BetterGEDCOM going to DO something in this area? Block the transfer of Data?

As a simple user, that 'control' belongs with the application generating a BetterGEDCOM file. The Generating application would (or should) know the "rules" based on settings of the computer generating the file.

Clearly, these types of "rules" haven't been addressed by the US software programs that I have used. AND it is an issue with what is sitting on my Computer and what I might post Online. It's a documented issue for the application that I use.

Russ
GeneJ 2012-01-23T14:18:12-08:00
Hiya Russ!

I'm not Louis, but say you are making a me-to-me transfer between software application a and software application b. You might have some information in the file of application a that is private--you made a mark of some kind so you could later "opt" not to share something.

If you transferred your own file to your own copy of application "b," you might want those same "privacy settings" to transfer over.

An example of this might be someone's "address for private use" in a source field. In a you-to-you transfer, most users would probably want that address for private use to be shared, but they would still want it marked so that if they opt to post information to say the internet from program b, the address would be kept private.

Does that make sense? --GJ
louiskessler 2012-01-23T14:42:28-08:00

Russ,

I feel it is up to the program to give the users choices to filter their data.

They may choose to export all of it, or some of it, but whatever they choose, it should be in BetterGEDCOM format.

There could be a "privacy" value added into the data, but I see that as problematic. There is no guarantee that the reading program will honor the privacy value, and (for that matter) anyone will be able to look at the BetterGEDCOM transmission (if it is in XML, they can use any XML reader) and see all the data, whether it is marked private or not.

And the data marked "private" will be a big beacon to anyone wanting to focus right in on the private data.

The only sure way to keep private data private is not to export or publish it.

Louis
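Louis's point — that filtering must happen in the exporting program, since a marker cannot be enforced downstream — might look like this on the export side. The `private="true"` attribute is hypothetical, not from any BetterGEDCOM draft:

```python
# Sketch of export-side privacy filtering; the "private" attribute is invented.
import xml.etree.ElementTree as ET

def strip_private(elem: ET.Element) -> None:
    """Recursively remove any element marked private before the file is written."""
    for child in list(elem):
        if child.get("private") == "true":
            elem.remove(child)
        else:
            strip_private(child)

doc = ET.fromstring(
    '<Person><Name>Jane Doe</Name>'
    '<Address private="true">12 Elm St</Address></Person>')
strip_private(doc)
```

The private data then simply never appears in the transmitted file, which is the only way to be sure it stays private.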
hrworth 2012-01-23T15:25:39-08:00
Louis,
GeneJ,

Or is it GeneJ / Louis

Within my program, I have the ability to mark information Private. I can also Privatize the file. So, when Sharing, that information will not be made available to the GEDCOM or a same application file.

That is why I am saying the Application should give the User Control and that BetterGEDCOM just pass along what it was given.

The issue that I have, which my application does NOT address, has to do with Media files that are shared, even with GEDCOM 5.5.1 and its links to Media files. Those Media files that are still under Copyright should not be shared.

As users, we are asking the Application to give us that option. That is, don't publish or pass along this Media file.

Again, I would not expect BetterGEDCOM to do so.

Only one User's opinion.

Russ
GeneJ 2012-01-23T15:43:16-08:00
Hi Russ,

You wrote, "we are asking the Application to give on that option. That is, Don't publish or pass along, this Media file."

It's not the feature I'm talking about. I'm talking about the underlying marker or "code" that a vendor would look for _if_ they opt to add an on-off switch (à la "don't send this") supporting a feature. There was an earlier discussion on the Requirements Catalog page. See Admin 12-Support Privacy Settings.
http://bettergedcom.wikispaces.com/message/view/Better+GEDCOM+Requirements+Catalog/40806995
ACProctor 2011-12-07T09:40:50-08:00
Terminology hides a lot of side issues here.

Internationalisation (or "i18n" if you want to avoid z/s spelling variation) is making a product applicable worldwide.

Localisation (or "L10n") is making it address a specific country or culture, e.g. translating its text for that market.

Globalisation is sometimes used to refer to both of these.

Ideally, an XML data file should not have any localised data values in it, or anything related to the presentation of the data (echoes of our Citation thread here). XML primarily uses UTF-8 so characters should not be a problem. Data values should use a 'programming locale' rather than a particular cultural locale. I've mentioned this elsewhere, and it's just the same as with the source code for a programming language - it can include text for a particular language but the source code should compile exactly the same in any country. Developers take this for granted but other folks may want to ponder on that. The representation of a date or a decimal value in the source code is designed to be computer-readable, not humanly-readable, to ensure transportability.

Similarly, BG should be "transportable". Irrespective of whether it contains foreign place names or person names, it should load identically in any (compliant-)product in any country. This was the big issue I was making about ISO 8601 for dates. Similar arguments apply to decimal numbers (e.g. always specify AgeYears=6.5 rather than accepting AgeYears=6,5 sometimes), and Citations.

Tony
ACProctor 2012-01-21T06:18:15-08:00
I have spent a large part of my professional life being involved in the portability of data and software systems between different languages, cultures and locales.

I would like to write something up about the specific issues applicable to genealogy. Initially, this may just be background material for us to reference, but using it as material for a 'white paper' or another magazine article is also on the cards.

Obviously I'm aware of this particular thread but are there any other resources I should look at for previous discussions? I'm especially interested in the distinct areas affected, and the difference between transportability of the data versus culture-neutral data model.

Tony
gthorud 2012-01-22T17:16:59-08:00
I think this is a good idea, and it is something I would like to prioritize with respect to where I would do work.

There are probably several pages when it comes to customs in various countries. One page that I remember is here http://bettergedcom.wikispaces.com/Geir+Thorud%E2%80%99s+personal+notes+on+Genealogy+in+Norway and Adrian wrote a page with a similar title for the UK.

Then there are things like names, event types, calendars, cohabitation (ignored in some cultures), address formats - and I am sure many more, but it will require some thinking.

The other more general category that I am sure you are familiar with, and have mentioned, also contains sorting, "thousand marks", maybe "list separators", in addition to dates and decimal marks.

I am sure such a document would be very useful for our own education, and also for vendors. The first step is to identify the various types of "issues" where there are likely to be differences. Later one could imagine a catalogue listing the variants for each country/culture, but that is a lot of work.

Around 1990 a book was written about the special requirements for software to be localized for the Nordic countries, and I would guess that there are other similar books or documents.
NeilJohnParker 2012-01-23T02:26:54-08:00
Tony, see Section 05.10.02 in Personal Name Data Standard V0.03.docx
re attributes of Locale that may be pertinent to PersonName.

Neil
ACProctor 2012-01-23T05:30:46-08:00
Thanks Geir & Neil. My first task would be to identify the areas affected and highlight some of the wider differences in order to get the message across that 'this is important stuff'.

As you say, Geir, going through every single country/culture's differences would be a huge task, but that would benefit more from individual efforts by people in those geographic/ethnic groups.

Tony
gthorud 2012-01-23T07:19:59-08:00
Agree.

One more issue came to mind, differences in what is considered private info - that is not the same around the world, and it may depend on where you publish it - e.g. a book or on the internet.
ACProctor 2012-01-23T07:56:31-08:00
I hadn't thought of that one Geir. Do you have any more details?

Tony
Andy_Hatchett 2012-01-23T08:00:03-08:00
That subject opens a whole new can of worms. Even here in the United States people don't understand the difference between personal information vs. private information; most assume they are the same thing and in reality they are quite different.
ACProctor 2012-01-23T09:11:13-08:00
It's still worth covering though. I followed some recent threads on the newsgroups about not publishing details of relatives who are still living. There are strong opinions both ways but it would be the legal or moral side that would be of importance.

Tony
louiskessler 2012-01-23T09:49:13-08:00
Simple:

Private information is what the Government uses the "Freedom of Information Act" to access.

Public information is what the Government uses the "Privacy of Information Act" to protect.

Louis
ACProctor 2012-01-23T09:53:16-08:00
That's not a definition that stands up outside the country you're referring to, Louis (whichever one that was).

Tony
louiskessler 2012-01-23T11:56:04-08:00

Tony. That was meant to be humerous.

p.s. Should this thread be called "Internationalisation" with an "s"?
GeneJ 2011-01-14T20:06:30-08:00
Hi gthorud:

It was discussed during the Developer's Meeting.

We didn't record the meeting (it was mostly in voice), but did make a brief record.
http://bettergedcom.wikispaces.com/3+Jan+2011+Dev+Meeting+Notes

I have some other notes about the text input. I'll check that separately.
gthorud 2011-01-14T20:11:59-08:00
I don't understand this. According to the minutes you did not discuss the Internationalization issue.

I would be very interested in knowing why this change was made. Why is eg. internationalization not a goal anymore?
GeneJ 2011-01-14T20:19:38-08:00
The meeting was in voice. It wasn't recorded. We didn't try to capture the pros or cons expressed, only make a record of specific consensus or specific conflicts.

Perhaps you want to put that on the agenda being organized for the next meeting?

http://bettergedcom.wikispaces.com/Developers+Meeting

--GJ
gthorud 2011-01-14T20:22:14-08:00
Well, there must be something wrong with the minutes then.
GeneJ 2011-01-14T20:51:31-08:00
@gthorud:

Thank you for adding it to the agenda. I really think that's the best place to address it.

FYI, the meeting was in voice and we had no means of recording the session. With 15 attendees, it was not possible to otherwise make a record of the comments contributed about the goals discussed.

During the meeting, after discussion about a goal occurred, if consensus was reached, I was asked to enter the consensus changes to the goals on the wiki page. Any change made on the basis of such consensus includes a notation (in red): "[First Developers Meeting]".

Generally, if consensus was not reached, I was asked to add a discussion topic to the Goals page of the Wiki about the lack of consensus, so that discussion could continue.

Hope this helps. --GJ
gthorud 2011-01-14T21:49:03-08:00
It might clarify things if those that argued in favour of the change could give a reason before the meeting. With this change my motivation for working on BG was dramatically reduced.
GeneJ 2011-01-14T22:05:34-08:00
I would like to see that also.

I was doing multiple things at once, and didn't catch all the commentary.

I do recall some question about how broadly the original wording might have been interpreted.

I had expected the consensus to just read "...without bias," and was surprised that it didn't.

I hope the others who were in attendance will chime in.
gthorud 2011-01-14T22:24:52-08:00
Well, the only problem I see is that the original goal cannot be achieved unless we have contributions from people who know ALL possible cultures - and that is unlikely to happen. But one can still have this as an overall goal, and describe the conditions/limitations below.
GeneJ 2011-01-14T22:52:24-08:00
What wording would you propose to solve that issue?
AdrianB38 2011-01-15T08:19:37-08:00
As I recollect, the discussion was one of summarising words, not removing anything. E.g. "cultures, countries, nations" were regarded as roughly synonymous. I think that might have led us just to have "culture" of those 3. Maybe.

And then "culture" and "religion" got subsumed into "belief system".

If we now have words that _appear_ to have removed internationalisation, then (a) that is accidental because it wasn't the intention of the discussion and (b) I suggest we revert to something closer to the original form because clearly the current words aren't explicit enough.

Guys, much as I like short, pithy statements, I think we need specific phrases that are clear. I'd therefore suggest something like:

"Goal 5 BetterGEDCOM aims to support recording of information about real life in an open ended set of cultures, countries, time periods and belief systems. It should not be biased towards any one of these."

"Aims to" and "open-ended set" convey to me that no-one expects to cover all possible cultures.

I've reduced it from "cultures, countries, nations" to "cultures, countries" in the belief that "nation" is covered by those two. I suggest we _do_ need both, because naming structures are part of cultures, while calendars tend to be part of the quasi-legal apparatus of government, i.e. of countries. (E.g. we need to cover the move from the Julian to the Gregorian calendar, the French Revolutionary calendar, etc. - these are all country-based.)

And yes, I think we need "time periods" in because it makes us think about that aspect. Yes, one could argue it was implied by the others, but I'd rather make it clear.

I use "belief system" instead of "religions, incl. no religion" as it's slightly shorter and covers atheism. I think.

Geir - is this better?
gthorud 2011-01-15T14:23:48-08:00
Adrian,

Yes, much better - Excellent!
DearMYRTLE 2011-01-31T11:39:44-08:00
At the 31st Jan 2011 Developers Meeting it was agreed:

"Goal 5 BetterGEDCOM supports recording of information about real life in an open-ended set of cultures, countries, time periods and belief systems. It should not be biased towards any one of these." Passed.
greglamberson 2010-11-11T17:47:49-08:00
I totally agree that one person can have more than one name at the same time. These names may be the result of identity in different cultures or for other reasons.

Certainly the GEDCOM model which allows only one name as a primary name is not something we will emulate. For me, I think the only element of a person record that is static or serves as a key for the person is the UUID (i.e., an arbitrary number which merely serves to separate a person's identity regardless of name changes or duplicates).

For me this doesn't rise to the level of a goal at this level, but it is a concept I wholeheartedly agree with (insofar as I understand it), and I am reasonably certain that multiple names will be fully supported.
hrworth 2010-11-11T17:54:00-08:00
Greg,

I would hope, that in the BetterGEDCOM, we are allowed to pass along those 'other' Names for that individual.

When I find a different spelling or format of a name, I record it and I put a Source-Citation on that entry.

This will present a conflict, at which time I will evaluate the Source and make a conclusion as to which I think is the correct name (at this point in time).

Each spelling of the name should be passed along, with the Source-Citation information and a marking, perhaps, based on my conclusion.

Russ
gthorud 2010-11-12T06:19:22-08:00
In my view the most important aspect of Internationalization is that BG should be able to handle genealogical data that reflects real life in as many countries as possible, as many cultures as possible and as many religions (or no religion) as possible. Although programs may not be able to support all of these, the standard should not be the limiting factor, as is the case today.

The most important thing for me is that the standard is able to represent the customs over time in my country. Further back in the queue of priorities is the ability to “translate” this info into other languages.

Since Gedcom is not able to properly document various types of family relations, various conventions for naming persons, etc. I will give priority to solving such issues.

I believe the above mentioned type of internationalization is important enough to be captured in a new goal.

The following sentence may be a starting point for discussion:
BG should support recording of information about real life in open ended set of cultures, countries, nations, time periods and religions, incl. no religion.


We should be careful with the use of the term UUID, it is used with totally different meanings on this Wiki. It might be a good idea to encourage use of the Glossary.
gthorud 2010-11-12T06:21:42-08:00
The proposed sentence should read "... in AN open ended ..."
greglamberson 2010-11-12T06:56:28-08:00
gthorud,

OK, well, I encourage you to add the goal to the main page and start fleshing it out, both here and on the main page. I love that you're participating and are painfully aware of the implications of these issues, whereas I am not.
gthorud 2010-11-12T10:34:18-08:00
I have added a 5th goal, with an amendment to the proposed sentence. I feel that it may be necessary to add more detail to this later.
gthorud 2011-01-14T19:58:05-08:00
Goal 5 used to read something like "5. BetterGEDCOM will support recording of information about real life in an open ended set of cultures, countries, nations, time periods and religions, incl. no religion. It should not be biased toward any one of these."

This has been changed by someone to only mention religion.

Where is the discussion that has lead to this change???
brantgurga 2010-11-11T17:44:30-08:00
Recognition of the past
I'm not sure if this would be a goal or something else, but things like timezones change over time, so when a date/time is recorded we should have a way to specify the timezone, the calendar system, and whether that date/time is as of 'now' or of the time given. Similarly, locations have different meanings over time. Prussia is something I'd often see as a location on old records, but today no such country exists.
greglamberson 2010-11-11T17:54:42-08:00
Calendar system is something I absolutely think is crucial, even if there is a default/assumed underlying calendar. There should definitely be allowances made for different calendar systems. It has not been that long since Russia was using the Julian calendar. The Middle East still uses the Islamic (Hijri) calendar, in which it is currently the year 1400-something, and the Hebrew calendar, in which it's the year 5000-something. I'm sure there are many other examples.

Regarding places, there are a couple of vigorous discussions about this already. Also, I encourage you to look at the various available XML schema available for description of various time and calendaring systems, as we should definitely adopt at least one such already-defined system.
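One standard way to make multiple calendar systems comparable is to convert everything to Julian Day Numbers, a calendar-neutral day count. Below are the well-known Gregorian and Julian conversion formulas (the Fliegel-Van Flandern style algorithms); this is a generic technique, not anything BetterGEDCOM has adopted:

```python
# Julian Day Number: one calendar-neutral integer per day, good for sorting/comparing.

def gregorian_to_jdn(y: int, m: int, d: int) -> int:
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return (d + (153 * mm + 2) // 5 + 365 * yy
            + yy // 4 - yy // 100 + yy // 400 - 32045)

def julian_to_jdn(y: int, m: int, d: int) -> int:
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

# Russia's switch: Julian 31 Jan 1918 was followed the next day by Gregorian 14 Feb 1918.
```

With dates reduced to day numbers, a Julian-recorded event and a Gregorian-recorded event sort correctly against each other without either being rewritten.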
jbbenni 2010-11-13T13:36:51-08:00
Good points! Don't forget the Gregorian calendar was adopted at different times by different nations. Russia didn't switch until 1918. And where the old-style year began on March 25 (as in England before 1752), dates between January 1 and March 24 are quite tricky! It's a problem similar to changing timezone offsets and boundaries, but in some ways even thornier.

Mention of changes to timezone boundaries opens up the whole place discussion -- not only are there variant spellings and contested names, but geo-political boundaries do move. A family history in Alsace-Lorraine would be a real challenge!

I suggest using UTC for date/time and Lat/Long for the ultimate invariant representation inside the data model. Conversions to and from familiar names and popular time notation will depend on the time and place.

It will be fun when we get to that part of the data model!!
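jbbenni's suggestion — UTC plus lat/long as the invariant internal form, with localisation done only at presentation time — might be sketched like this (the record layout is invented for illustration):

```python
# Invariant internal form: a UTC instant plus coordinates; local names and
# wall-clock renderings are derived views, not stored values.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

record = {
    "instant_utc": datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc),
    "lat": 40.7128, "lon": -74.0060,   # New York (illustrative coordinates)
}

# Presentation converts on demand; the stored instant never changes:
local = record["instant_utc"].astimezone(ZoneInfo("America/New_York"))
```

Conversions back to familiar place names and popular date notation would then depend on the time and place, exactly as the comment says.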
brantgurga 2010-11-11T17:50:37-08:00
Data Tolerability
Nonsense values should be transferable, with a clear rule for what should be done with them. By that I mean: what should be done when I receive a file with a time reference to the non-existent interval when a daylight-saving transition occurs? Should such values be discarded? Should they be warned about? Should they be moved to the nearest actual time?

This is something where existing data out there is going to have issues, but at the same time we want to reduce the amount of such issues in new or modified data. And there should be a consistent expectation across implementations.
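For the specific case raised — a wall-clock time that falls in a daylight-saving gap — one consistent detection rule is that such a time fails to round-trip through UTC, so a reader can at least warn rather than silently shifting or discarding the value. A sketch of that generic technique (not a BetterGEDCOM rule):

```python
# Detect wall-clock times skipped by a DST transition via a UTC round-trip.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def is_nonexistent(dt_naive: datetime, tzname: str) -> bool:
    tz = ZoneInfo(tzname)
    dt = dt_naive.replace(tzinfo=tz)
    # A skipped time comes back as a *different* wall-clock time after the trip.
    roundtrip = dt.astimezone(timezone.utc).astimezone(tz)
    return roundtrip.replace(tzinfo=None) != dt_naive

# 02:30 on 2021-03-14 was skipped in US Eastern (clocks jumped 02:00 -> 03:00).
```

Whether the implementation then warns, rejects, or shifts is the policy question the comment asks; the detection at least makes the choice explicit.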
greglamberson 2010-11-11T18:08:41-08:00
You're certainly hitting on relevant issues. However, this is a little esoteric. We don't even have a data model yet.

Date and time formats are certainly key, but I don't think we're to the point of addressing this sort of thing except to say that data should never be changed because the software doesn't understand it. No data should ever change just because it doesn't fit into some software parameter. Data with errors should be passed through without modification. Period.
jbbenni 2010-11-13T13:09:04-08:00
Goal 2 -- BG container formats
BG isn't the first project to want benefits of XML for structured data, along with an efficient container format for BLOBs such as multimedia, without sacrificing platform independence or incurring license headaches like paying royalties.

The basic idea seems to be to use XML for BG relationship and evidentiary data, but to use a ZIP-like container to encapsulate that XML along with any supporting files (e.g., multimedia) in their native formats. The container provides encapsulation, standardization, and (potentially) compression. It must be durable for archiving, and portable for data exchange.

Importantly, "XML plus a container" is a goal for BG during interchange (import and export) and file archive, but it's understood that application developers can use any data representation they want internally. Nonetheless, sharing data is important, so the interchange format matters a lot.

1. What are the major existing options?
2. Are we okay with any open source license that is royalty free? (e.g, not necessarily GPL)

There is a blizzard of acronyms for open container technology:
OOXML (Office Open XML, aka Open XML) from Microsoft
OPC (Open Packaging Conventions)
ODF (OpenDocument Format)
OpenDocument
ODA (Open Document Architecture)
and even the venerable PDF is related...

Of these, OPC looks like it might be the most recent generation in the meme. I have concerns about platform independence (it's got .Net all over it) and questions about the license.

Anyone know about this stuff? What are the obvious candidates, what are their pros and cons?
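Whatever container is eventually chosen, the mechanics are simple with a plain zip, which is the same packaging trick ODF and OOXML use; the entry names and manifest layout below are invented purely for illustration:

```python
# Sketch of "XML plus a container" as a plain zip; layout is invented, not a BG spec.
import io
import zipfile

payload = io.BytesIO()
with zipfile.ZipFile(payload, "w", zipfile.ZIP_DEFLATED) as zf:
    # A manifest lists each contribution and its content type:
    zf.writestr("manifest.xml",
                "<manifest>"
                "<entry path='data/tree.xml' type='application/xml'/>"
                "<entry path='media/photo1.jpg' type='image/jpeg'/>"
                "</manifest>")
    zf.writestr("data/tree.xml", "<BetterGEDCOM>...</BetterGEDCOM>")
    zf.writestr("media/photo1.jpg", b"\xff\xd8")  # stand-in image bytes

with zipfile.ZipFile(payload) as zf:
    names = zf.namelist()
```

Compression, a defined unpacked folder structure, and ubiquitous tooling all come for free with this approach; per-part properties would live in the manifest.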
ACProctor 2012-05-21T06:52:17-07:00
I'd like to try and resurrect this thread because it's important but the topic doesn't get the discussion time that other family history data topics get.

It's looking at the requirements for a container format (aka wrapper) to hold multi-part data. For instance, a core data file representing entities and entity relationships, plus supporting files like images, photos, recordings, etc. Subject to copyright/permission, these would be transmitted as a whole when family history data is exchanged.

The thread has so far suggested using either a compressed document format (e.g. ODF - OpenDocument Format, or OOXML - Office Open XML), or a MIME-format email container such as MHTML, which bundles an HTML document with its supporting files.

Neither of these is an obvious winner, as both have problems; e.g. a MIME container has no support for optimised access to its body parts. With some help, I would like to tease out the pros and cons here so that everyone has a clearer picture of which way to proceed.

Here are some basic requirements we should consider. They're not all mandatory, and in no specific order:

  • Tools - There should be readily-available tools for manipulating this format. That precludes using a proprietary, non-standard format
  • Encryption - The ability to obfuscate sensitive data
  • Compression - Ability to reduce the overall data size. This would certainly be important for XML contributions, but it has advantages beyond that
  • Properties - Allowing some useful genealogical properties to be associated with each contribution
  • Storage structure - To know the folder structure required when the container is unpacked
  • External references - To be able to address a specific contribution from outside the container, e.g. using a URL fragment
  • Internal references - Some way to relate the contribution names to the identifiers used in the root genealogy file. This is akin to, say, images that are referenced by URL in an MHTML container - they still have to work when unpacked or accessed via the container.

zip files, including ODF/OOXML, obviously support compression. They can also unpack to a given folder layout. MIME containers are used extensively by email systems but there are almost no standalone tools. OOXML has a manifest of the contents but - being from Microsoft - relies on file extensions rather than MIME content types [this would seem to clash with file systems that use internal meta-data rather than visible file extensions]. The individual (body-)parts in a MIME container can be addressed using cid:content-id and mid:message-id/content-id (see RFC 2392).

User ras52 (aka Richard Smith) suggested to me that Java Archives (jar files) might be a better alternative. I hadn't considered these since I assumed the availability of tools might be less, say, on Microsoft platforms.

Jar files are built on the zip file format. They have a manifest file at META-INF/MANIFEST.MF that provides MIME content types and extensible name-value properties. Individual resources in a jar file can also be digitally signed.
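On the tooling question: because a jar is just a zip with a manifest, no Java runtime is needed to create or inspect one. Here is a minimal sketch using Python's standard zipfile module; the entry names and manifest properties are invented for illustration, not part of any BG proposal.

```python
import io
import zipfile

# Build a small jar-style container in memory (hypothetical entry names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
    # The manifest maps each contribution to a MIME content type,
    # the kind of per-entry properties discussed above.
    jar.writestr("META-INF/MANIFEST.MF",
                 "Manifest-Version: 1.0\n"
                 "\n"
                 "Name: data/tree.xml\n"
                 "Content-Type: application/xml\n"
                 "\n"
                 "Name: media/mygran.jpg\n"
                 "Content-Type: image/jpeg\n")
    jar.writestr("data/tree.xml", "<bettergedcom/>")
    jar.writestr("media/mygran.jpg", b"\xff\xd8\xff")  # placeholder bytes

# Any zip-capable tool or library can read it back; no Java required.
with zipfile.ZipFile(buf) as jar:
    names = jar.namelist()
    manifest = jar.read("META-INF/MANIFEST.MF").decode()

print(names)
```

The same applies in reverse: a receiving application that already links against a zip library gets manifest access for free, which speaks to the "readily-available tools" requirement above.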

Has anyone else given much thought to this area?

Tony
ttwetmore 2012-05-21T09:07:21-07:00
Tony,

I would have suggested jar files as well.

Tom
AdrianB38 2012-05-21T12:52:12-07:00
I checked on .jar files and 7-zip (my Windows "zipper" of choice) doesn't have the ability to associate with .jar, suggesting it can't open them. When I searched for how to open .jars in Windows, it talked about installing the Java RTE and associating the extension with "the Java Interpreterjavaw.exe" (not sure if the spacing is right there but I just copied it straight).

In my simplistic way, I was thinking that it should be easily possible for anyone to open the BG container to access the BG file itself. Then the BG file should be capable of being edited in a normal text editor in order to correct errors of format, before compressing it all up again. (See Tony's comments on tools)

If anyone feels there won't be errors of format, could they explain why the situation will change when moving from GEDCOM - where the files typically have errors of some sort - to BG? In fact, one doesn't even need to invoke errors as different software suppliers might extend the BG format in different, incompatible ways that need to be edited to be compatible.

Conversely, looking at the complexity of some of the three-letter acronyms being talked about for various components of BG, you might contend that users like myself should keep our sticky fingers out.
ttwetmore 2012-05-21T14:57:21-07:00
Jar files are normally read and written by the "jar" command line program that is part of the Java environment. However, jar files can be handled by most programs that can handle zip files.

Maybe at this point in time it would be better to suggest that BG will use a zip/jar like file, rather than making a full commitment.

I agree there will be a need for a program that can open a BG container file and allow one to browse the contents. There are analogous programs today for zip files, and probably for jar files. The DropBox browser on an iPad comes to mind as a great example of how such a browser would act. I also agree that a utility that would allow the BG text to be edited directly from within a container would be useful, and would likely be developed quickly if BG became a going concern. Any tool like that, however, would have to do a full job of validation before zipping the container back up when the user was done.

Errors of format are a big subject and there are no easy answers. We would depend on all vendors exporting BG files to adhere to the published standard formats. There is no guarantee that they will. What is done in cases like this in the "real world" is to define a large set of interoperability tests that every vendor must pass in order to be "certified." If you are talking about errors of format introduced by a user entering bad data, we have to depend upon the vendors, once again, doing the validation of fields in order to adhere to the standards.

If vendors unofficially extend BG for their own use they are being very bad and should be shunned. If the bad boy is someone like Ancestry.com, good luck; it will be like the kind of standards usurping that Microsoft has long been famous for. The giants can play fast and loose with standards, and the rest of us can whine all we want, but at the end of the day we must go along.

I personally believe that BG should NOT be extensible, but that changes and additions can be proposed and acted upon by an official process. Taking this position is tantamount to the claim that the BG designers can do an excellent job of anticipating the needs of the industry. I personally believe this to be the case in theory; however, the technical management of the BG process must undergo a radical improvement before this would be possible. You can substitute FHISO for BG in the preceding sentence if you believe that FHISO will end up in charge.
louiskessler 2012-05-21T16:46:02-07:00

I 100% agree with Tom that BetterGEDCOM should NOT be extensible, for the exact reasons he states.

That makes every decision, such as the BetterGEDCOM container decision, a tough one. It will be hard to U-turn once all developers have implemented it one particular way.

The container decision is also affected by technology and what happens outside the genealogy community's control. It's too bad no one has perfect foresight to see which containers will still be used 10 or 20 years from now. Maybe none of them.

Louis
ACProctor 2012-05-22T00:37:43-07:00
That's a good point Louis. It's a direct analogy of the "file format" issue.

In principle, a data reference model can be realised in several alternative file formats, or I should say serialisation formats to be correct. I know some people are violently opposed to XML but there will have to be an XML alternative, and that would have to be defined by some central authority - if XML is ignored then there will be many different variations defined by all-and-sundry.

With the containers, maybe we need to define how each of the popular container formats should best represent the multi-part data. It's not ideal but then neither is the real world. There's no reason why we couldn't produce libraries of software to help with creating such containers, but I'm not clear yet how different the capabilities are for the various possibilities.

Tony
ttwetmore 2012-05-22T03:58:11-07:00
Tony,

You say that if XML is ignored then there will be many variations (of export formats, I assume). This implies that you think that if there is an XML format, there won't be many variations. I don't understand that implication.

I am opposed to XML, but not violently! XML is like a religion, and XML fundamentalism seems an irresistible force to all but cranky iconoclasts like myself. I'm sure XML will be the format used for the textual part of BG transport files. As I've said many times in this forum, I am opposed to it, but have fully accepted it as the solution that will be chosen.

I've used XML constantly in many projects for about a decade, from back when one had to write their own parsers. I have used XML libraries on UNIX, Windows and Macs. I've programmed complex XSLT stylesheets; don't get me started on XSLT. I have used SAX interfaces, DOM interfaces, and my own custom interfaces. I've created schemas in its competing technologies. In other words I am not opposed to XML because I don't want to learn something new, or don't want my comfort level disturbed. I am convinced that a custom format, designed for the specific application of genealogy, would be a better solution.

But XML will work fine. Its real advantage, in an ironic sense, is that because it is the obvious right choice for so many, it will be the only decision BG makes that takes less than a year!

Tom
ttwetmore 2012-05-22T04:28:33-07:00
Just a quick comment, maybe not appropriate here.

I think worrying about external formats and container formats and XML is a lot of fun, and I'm willing to get into the discussions and express my opinions with the rest of you.

But the fundamental task of BG is to design an encompassing data model for genealogy. BG has found a nearly infinite number of ways to avoid this hard task by going off on some pretty odd tangents. (I am not accusing this thread of being such an oddity.)

BG should be deeply involved with comparing and evaluating models for existing products and for proposed products. Every month that BG isn't doing this, and those months are currently ticking by at a rate of one month per month, is wasted time.

The big questions that BG should be addressing are, simply off the top of my head:

1. How should evidence be represented in the model?
2. Should the data model represent only conclusions, or should it also represent evidence and the relationships between them?
3. What is the nature of a personal name, and how should names be represented to allow coverage of different cultural practices?
4. What is the nature of a location, and how should locations be represented to allow for historical changes and language differences?
5. What is the nature of a date from a genealogical perspective, and how should dates be represented to allow for historical and calendar changes?
6. What is the nature of interpersonal relationships, and how should they be represented?
7. What are the important genealogical and family-historical relationships and events?
8. Are events, places, and perhaps other concepts first-class citizens in a genealogical data model, or are they just attributes of other things?
9. How are sources best represented? How should the model accommodate the generation of citations in the formats necessary for publication?
10. Should the model support only summarization of information, or should it also support research-based ("evidence and conclusion") processes?
11. Is the evidence-level person record (the "persona") something that should be in the model, and if so, should personas be multi-level?
12. Should decisions and conclusions be in the model, and if so, what does that mean? If decisions and justifications for conclusions are in the model, what form do they take?
13. Can the model consistently treat the vast plethora of things that we call sources today?
14. Is the concept of a nuclear family something that should be in the model or not?
15. Should the model accommodate publication-quality output, and if so, what does that imply?
16. Should things like schools, courts, cemeteries, ships, armed forces, military ranks, arbitrary groups of people, and many other types of things be in the model or not?
17. Should the model be extensible by third parties or not?
18. How does the model handle quality of data and confidence in data?
19. Should the model ignore copyright issues or not?

Tom
AdrianB38 2012-05-22T04:49:45-07:00
I have added 2 discussion topics on extensibility -
https://bettergedcom.wikispaces.com/message/view/Better+GEDCOM+Requirements+Catalog/54432200
for User Extensibility of events and characteristics

and
https://bettergedcom.wikispaces.com/message/view/Better+GEDCOM+Requirements+Catalog/54432236
for Extensibility by software companies

Both are in the Requirements Catalogue but neither have had a specific discussion topic before. Please discuss extensibility there. Thanks.
ACProctor 2012-05-22T07:15:56-07:00
Re: "This implies that you think that if there is an XML format that there won't be many variations. I don't understand that implication"

I'll just try and fill in the missing blanks in my XML comments Tom.

I agree that a reference model must come first. However, I also believe that the same authority who devised that model (e.g. BetterGEDCOM, FHISO, whoever) must also prescribe the accepted encoding for the various data syntaxes. If, for instance, you have a standard model, but six different ways of rendering it in XML, then you may as well not have a standard at all. This is the case for any data syntax - not just XML.

In effect, I believe that the serialisation formats (aka file formats) must be defined as some sort of appendix to the data model standard.

I guess I'm also suggesting the same for the container formats too.

Tony
ttwetmore 2012-05-22T09:54:45-07:00
Tony,

Thanks. I agree with your points here.

Tom
ACProctor 2012-05-24T08:46:54-07:00
I just found an interesting thread on this subject over with our friends on github. It's a few months old but treading the same ground.

Tony
greglamberson 2010-11-13T13:21:51-08:00
jbbenni,

I have made a new page called Multimedia File Inclusion Issues. I hope you'll add this content there yourself and help move discussions regarding container formats, compression and the like over to that page. Great work, by the way!
gthorud 2010-11-13T16:14:38-08:00
I am working on it, but with a scope that also includes referencing of media outside the container.
greglamberson 2010-11-13T16:21:38-08:00
Excellent. That's all very important, and you clearly have some ideas about how to structure it. I look forward to seeing what you come up with.
gthorud 2010-11-13T18:39:54-08:00
I have made a first attempt, but we need more expertise - I have not followed the standardization work in the "Container" area for many years. And I may have included too many ideas.

What functionality can XML give us in this area?

There are technical things on the page that do not fit under the heading Goals - but we should not fragment the discussion too much - at the moment.

And, please feel free to improve my English!

Good night!
jbbenni 2010-11-14T04:49:08-08:00
Great job gthorud!

I'll be adding some discussion on the new page.
ACProctor 2011-12-07T13:30:25-08:00
Excellent thread, folks!

If all auxiliary data are effectively attachments to the BG data (e.g. images, photographs, documents, etc) then transmission as a whole is an imperative. The onus would be on any export option to bundle the appropriate attachments (subject to privacy control) along with the core BG data.

Using a zip container is the way it's done at the moment, e.g. Microsoft's newer Office tools. The XML compresses down nicely - as it is supposed to - but you end up with a compressed binary file for transmission and for which you need the right tool to decompress on receipt.

There is another scheme that handles a similar problem: MHTML, or MIME HTML. This also bundles a number of attachments with a core and is most often used for the transmission of rich-text email bodies.

A problem both of these schemes have to deal with is adjustment of URLs. I would sincerely hope that BG addresses its attachments via appropriate URLs, e.g. file://mygran.jpg. Those URLs have to be adjusted when the zip is decompressed in a different location.

Tony
theKiwi 2011-12-07T14:30:35-08:00
Replying to "I would sincerely hope that BG addresses its attachments via appropriate URLs, e.g. file://mygran.jpg. Those URLs have to be adjusted when the zip is decompressed in a different location."

This is probably something that also needs better handling at the application level.

The only software I'm very familiar with is Reunion, and its multimedia support includes the ability to scan for missing linked media files. So I can rearrange my media files on my Macintosh to a different filing hierarchy for example which might break the path to some or all of the files, and with a click of a button Reunion can find them again and create the new links.

So were someone to send me a file and the associated media (by GEDCOM currently), I can file the images where I want them, and then rely on Reunion to find them for me (as long as I don't change the actual file name) and sort out the path part of it.

Software that can't do something like this when receiving an import with linked media will probably lose favour to some extent if another program can sort this out.

Roger
ACProctor 2011-12-07T14:48:33-08:00
Thanks Roger.

Another way of handling this is to have 'path roots' for different sets of attachments. The URLs within BG then specify local paths relative to those roots. The path roots could be held in the BG header - in one place - so they're easy to update.
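Tony's 'path roots' idea could be sketched like this; the root names, header layout, and paths are invented for illustration, not part of any BG specification.

```python
from pathlib import PurePosixPath

# Hypothetical 'path roots' as they might appear in a BG header,
# mapping a root name to wherever the user unpacked that set of files.
path_roots = {"media": "/home/roger/genealogy/media"}

def resolve(root: str, relative: str) -> str:
    """Resolve a container-relative reference against its configurable root."""
    return str(PurePosixPath(path_roots[root]) / relative)

print(resolve("media", "mygran.jpg"))

# Relocating the whole set of attachments requires updating only the
# single root entry in the header; every relative URL keeps working.
path_roots["media"] = "/Volumes/Backup/media"
print(resolve("media", "mygran.jpg"))
```

The point of the sketch is that the per-attachment references never change; only the one root entry does, which is what makes the header-held roots easy to update.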

Tony
jhy001 2010-11-13T19:51:04-08:00
Evidence Explained Source Model
Whatever data model is decided upon, it needs to be consistent with the model presented in "Evidence Explained" by Elizabeth Shown Mills. She is the "guru" on this.

The chart on the inside cover of EE is:

=

Sources -> Information -> Evidence

Sources: are... Documents, Registers, Publications, Artifacts, People, Websites, Etc.
Sources(Original(form),Derivative(form))

Information: is judged by... Informant's degree of participation or knowledge.
Information(Primary(firsthand),Secondary(secondhand))

Evidence: is based on... Relevance of information & its adequacy to answer question
Evidence(Direct(relevant/adequate),Indirect(relevant/inadequate))

Analysis of Sources, Information, Evidence -> "Proof"

=


I think this needs to be kept in mind as data models develop to make sure this model is adequately represented. It could even perhaps be elevated to a goal.
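For the sake of discussion, the Sources -> Information -> Evidence chart above might be sketched as a data structure along these lines. The class and field names are my own invention, not part of Evidence Explained or any BetterGEDCOM proposal.

```python
from dataclasses import dataclass
from enum import Enum

class SourceForm(Enum):
    ORIGINAL = "original form"
    DERIVATIVE = "derivative form"

class InformationKind(Enum):
    PRIMARY = "firsthand"       # informant participated directly
    SECONDARY = "secondhand"    # informant's knowledge is indirect

class EvidenceKind(Enum):
    DIRECT = "relevant and adequate to answer the question"
    INDIRECT = "relevant but inadequate on its own"

@dataclass
class Source:
    description: str            # document, register, artifact, person, ...
    form: SourceForm

@dataclass
class Information:
    source: Source
    kind: InformationKind       # judged by informant's degree of knowledge

@dataclass
class Evidence:
    information: Information
    kind: EvidenceKind          # judged against the research question

# Analysis of sources, information, and evidence leads toward "proof".
census = Source("1850 US census, household of a research subject",
                SourceForm.ORIGINAL)
entry = Information(census, InformationKind.PRIMARY)
proof_item = Evidence(entry, EvidenceKind.DIRECT)
print(proof_item.kind.name)
```

Note how each layer wraps the one before it, mirroring the chart: the same source can yield multiple pieces of information, and the same information can serve as direct evidence for one question and indirect for another.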
greglamberson 2010-11-13T20:08:58-08:00
There are a couple of areas in this concept that aren't generally well accommodated by genealogical software. The most obvious is the concept of analysis.

In looking at this diagram, another area of deficiency in genealogy software comes to mind: negative evidence. I believe this is what Elizabeth Shown Mills refers to as Indirect Evidence. (I just got this tome this week so I haven't had time to do much other than look at the inside cover.)

Most of us who have looked at this wonder whether the research process can be supported uniformly within the same framework that most genealogy software programs currently use, without breaking that framework. This will be a key component of the discussions here over the coming weeks.
Andy_Hatchett 2010-11-13T22:55:29-08:00
Elizabeth is the *present* "guru" but there were others before her and there will be others after her. While a great many follow her there are also a great many that don't.

She has changed some things from her first book- what happens if she changes more in her next one?

All that to say I see no reason whatsoever why the Data model should be consistent with her work alone- will we also make it consistent with Lackey's work?
greglamberson 2010-11-13T23:04:06-08:00
Well, there is no authoritative starting point for our work, but I think it's wise to listen to smart people who have good ideas on the subject.
hrworth 2010-11-14T09:31:54-08:00
Greg,

I agree that Evidence Explained! needs to be kept in mind. Clearly, at this point in time, agreed to or not, it is "the standard" for documenting our genealogy information.

As you point out, the evaluation of the Evidence is not supported by many software vendors. They are now, however, stepping up to offer Templates in these software programs to comply with this "standard".

We, as Users of these programs, need to push our vendors to provide us with a platform to evaluate our Evidence and help come to a conclusion.

So, JHY001 has presented us with the model, and this project should prepare for handling of the data elements that might come out of the Evaluation of the Evidence. I am not sure that I want the Software / Program to make that Evaluation, nor do I want it to Draw a Conclusion. It might present me with some options, but I want to make that decision. The outcome, however, might be included in the data I am sharing with another user. That other user might accept or reject my conclusion.

Only One User's opinion.

Russ
GeneJ 2010-11-15T10:13:04-08:00
Another posting with much to chew on. A few thoughts:

(1) Hear me chant, "no more spaghetti sources and citations!" Let me live in a world where the record I entered INTO my software, regardless of the style used, matches the output "at the other end." If I am round-tripping my own data, it should be identical; if someone else is importing my file, then my BetterGEDCOM is his/her source and my formatted entry is the "source of the source" for that citation.

(2) Elizabeth Shown Mills doesn't need me to speak for her, but I think she'd be the first to say templates are often helpful aids for those just getting started. Ditto, I don't think she's been writing the monster _Evidence_ series so that we'd all have "her" templates to use; rather, I believe she hopes the series encourages our understanding of the content, layout, usage, etc. requirements of good source- and citation-level communication.

(3) I'm biased! Yep. With each new source I locate, how I read/interpret the information contained in a source is biased by what I know and don't know. That latter includes what I know I don't know and what I just plain don't know.

In haste for now. --GJ
hrworth 2010-11-15T13:33:54-08:00
GJ,

Actually, I totally agree with you.

I just have one concern, and that is a Professional Genealogist (I'm NOT one of them) who embraces Evidence Explained! Oh, and I agree with "good source and citation level communication".

I have taken, and continue to take, my Source-Citations and moved them into a more uniform output of Citations.

I take your file, I like it (and I would), and with your permission incorporate your research in my file. The BetterGEDCOM provides for marking all of the records, facts / events, source-citations, and media as coming from you.

I then transform your formatted source-citations into my source-citation format.

But now, since we are working on a similar family line, I send my file back to you.

If I understand what you want back is your Source-Citations the SAME way that you sent them.

I am just wondering what that might mean in making that happen. However, I think that I agree with your round-trip thought. Here is why. Back to ESM, I think, I would have figured out how to Evaluate my data based on my Sources. I sent my data to you, you put the Source-Citations into your format and you did your evaluation, then send it back to me. I would then, maybe, have to re-evaluate what I got back. So, I see, the reason for your "requirement".

Mmmmmm ????

Russ
GeneJ 2010-11-15T20:18:55-08:00
Hi Russ,

If you return a BetterGEDCOM to me, I would expect the people/events/media I import to show your BetterGEDCOM as the source; your citation details would import to me as "source of the source".

As part of the merge process, I would expect to be able to review the people/events/media and citations associated with the import.

I'd expect the "stuff" I was effectively re-importing to appear as potential duplicates or to otherwise be presented to me for review.

In the simplest sense, I'd no doubt choose to exclude those citations where the source of the source matched entries that existed in my pre-import file.
GeneJ 2010-11-16T08:59:05-08:00
Double posting this comment here.

As to third part[y] IMPORT level BetterGEDCOM exchanges, see Mills, _Evidence Explained_, [1st ed] (electronic)(2007), p. 156 for "3.44 Research Files & Reports, Personal File Copy" for a discussion and series of examples.

Note the "Source List Entry" example and the TWO citation examples (latter as First Reference Note no. 2 and no 3.)

Where there is no underlying source in the GEDCOM, Mills cites the GEDCOM ("Kincaid GEDCOM file"), specifies referenced individual by name and GEDCOM reference number, ("Lois Kincaid, no 1234"), includes what I assume is the "event" reference ("Biographical Sketch") and then adds this language, "with no citation of source."

When a "source of the source" exists, as is the case of the second "First Reference Note" example, ("Biographical sketch of John Kincaid, no 321"), Mills incorporates that reference in her citation example. She writes, "citing 'Resignations of militia officers, November 1832–January 1833 General Assembly Session, North Carolina State Archives.'”

I would expect the mechanics of BetterGEDCOM to enable software certified as compliant to be just that smart!!

--from the cheering section, GJ
DearMYRTLE 2010-11-16T13:40:00-08:00
There has been work completed by employees over at FS concerning the citation models in Elizabeth's outstanding 1st edition, which has not been published publicly. Apparently there are a number of so-called contradictions. It may be that the "contradictions" mentioned reflect the FS employees' lack of understanding of a genealogical proof standard and their lack of practical research experience.

There can sometimes be a big difference between how a coder looks at data and how a genealogy coder looks at data.

I am in no position to argue the point.

I'd like to mention that Lackey's work was published before we had so many copies of source documents -- digital or otherwise.

Until someone else comes up with a source citation scheme that is as readily applicable to genealogists and historians, we should follow Mills' work.

The Chicago Manual of Style just doesn't cut it for genealogists.
AdrianB38 2010-11-16T14:52:22-08:00
Let me start by saying that I utterly agree that stuff should be cited usefully and analysed thoroughly. (Having said that, I get bemused by the concentration of some folk on the citation while ignoring the analysis)

Point 1 is that ESM's suggested templates (at least in "Evidence", because I've not read "Evidence Explained") are pretty much directed at American practice. The template of a census citation for England will look quite different, as the usual fashion is to quote the class / piece reference from The National Archives - while Scots censuses will look different again. So ESM's formats are useful to some people in the world and baffling to others. We need the ability to concoct our own templates defining items of interest.

Point 2 is going to get me shot down in flames but one reason hobbyist genealogists don't use citations well is that the basics of the Chicago Manual of Style are just plain unintelligible to the casual user. Why oh why should published sources use italics in titles and unpublished ones use quotes? How is the casual user ever going to remember that? Fortunately, there is an answer to that one - keep the data in full in XML, split by item, without confusing matters with quotes or italics (which won't appear anyway), explicitly saying if it's published or not and let the software italicise and quote if desired.
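Adrian's suggestion (store citation parts separately with an explicit published flag, and apply italics or quotes only at output time) might look something like this; the element and attribute names are hypothetical, not from any BG draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical citation fragment: data held in full, split by item,
# with publication status stated explicitly rather than encoded in style.
xml = """
<citation>
  <title published="true">Evidence Explained</title>
  <author>Elizabeth Shown Mills</author>
  <page>42</page>
</citation>
"""

cite = ET.fromstring(xml)
title = cite.find("title")

# The Chicago-style convention is applied only on render, so the casual
# user never has to remember the italics-vs-quotes rule themselves.
if title.get("published") == "true":
    rendered = f"<i>{title.text}</i>"   # italics for published works
else:
    rendered = f'"{title.text}"'        # quotes for unpublished works

print(rendered)
```

The same stored data could equally be rendered to a National Archives class/piece style or any locally defined template, which is the flexibility Adrian's point 1 asks for.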

I suggest, therefore, that the bits of the BG data model relating to citation need to define the (separate) attributes necessary for ESM's citation work but also allow extra models. Somehow.
GeneJ 2010-11-16T17:44:46-08:00
Hiya Adrian!

We want BetterGEDCOM to handle our citations so they are not distorted on "the other end."

My citations do include analytical comments.

Separately, from Elizabeth Shown Mills, _Evidence Explained_, 1st ed.(electronic), p. 42, "Citations, Definition & Purpose," in part :

"Toward that end, source citations have two purposes:
  • to record the specific location of each piece of data; and
  • to record details that affect the use or evaluation of that data."

You wrote, "... and let the software italicise and quote if desired."

Some genealogical software provides templates that support such formatting. Some people use the ready-made templates, while others are more comfortable developing their own.

Cheers! --GJ
jhy001 2010-11-14T15:37:17-08:00
Genealogy Program Specifications
If the goal of writing a data structure adequate to exchange genealogical data among users is met, I submit that you have also written a genealogy program's specifications. All that is needed is for someone to code the data model (which is, after all, all of the genealogy we do, and is program-independent) and add to the model input, output, and presentation.

And I think we should focus on simply designing the best, and not worry at all about whether vendors code to it. If it is the best, they will. And by the simple extensions above (input, output, and presentation layers added to our BG model) one has a genealogy program. An Open Source project could start from the data model description and add the other pieces, which are rote. The intellectually difficult model for exchange is the core. And if it is lossless, then it is a complete representation. If a programmer cannot take the end model and write a program, then some specification is unclear or wrong and needs addressing. Otherwise, a skilled programmer can take the model and deliver a program.

So I see this project as having the ability to be much larger than the simple goal of data exchange. And logically, in my mind, I don't see how we can avoid the larger picture that it really is a full specification of a genealogy program.

Again, my $0.02.
greglamberson 2010-11-14T15:57:00-08:00
jhy said,

"... I think we should focus on simply designing the best, and not worry at all if vendors code to it..."

If I absolutely had to write a statement that embodied the exact opposite idea of what I personally was aiming for when helping begin this effort, that statement would be pretty close to it.

I completely and totally disagree with just about every premise and subsequent statement of the above.

Is this a practical joke? I can't really say much more here.
jhy001 2010-11-14T17:13:55-08:00
No it isn't a joke. And a very odd response to come from a founder. I've now concluded that you are not only aiming too low, but you are really not open to opinions if they differ from yours. I won't be bothering you with my opinions any more. I'll be shopping them on other forums.
hrworth 2010-11-14T17:16:10-08:00
JHY,

We are NOT defining application requirements. This project is to understand how and what to transport from one User to another User.

We, the Users, have problems today exchanging information. For example, I sent a documented file to a user with a different genealogy program. None of my documentation made it to that other user.

The other major issue, from what I have seen from other users is the handling of Media. One application won't deal with Media, others have a way to handle Media.

This project is trying to identify these types of issues and toss around technical solutions to them.

Russ
jhy001 2010-11-14T17:34:11-08:00
I reiterate: once you have defined a container for the data that will successfully transfer to another user, you have implicitly defined the core of a genealogy program, less the input, output, and presentation layers. i.e. the hard part. What is genealogy but the data? It isn't the program. If I convey all the properties of an electron to another person, I have defined that electron completely.
TamuraJones 2010-11-14T18:29:36-08:00

jhy001,

I think both Greg and Russ understand what you are saying, but a data exchange format does not define an application.

A data exchange format is something that should work for an entire application category, for applications that are wildly different from one another.

A data format tells you something, but does not really determine how the app works. For example, the KML file format does not begin to specify what the Google Earth application is like.

By the way, a database with some data entry screens doesn't make a good genealogy app; a good genealogy application is <em>much</em> more. The difference is what you pay developers for :-)
dsblank 2010-11-15T04:37:54-08:00
I, too, think jhy001 is wrong in some of his comments above (e.g., many vendors don't use what is best, but what will make them the most money). But I think he is right on the main point. As has been pointed out, we want the file format to represent important aspects of the process (e.g., evidence/conclusion). But after that, one would want the application to support that as well. How will that happen?

In the end, I think there will have to be a concerted effort from the open source projects to implement the format *and* process. Only then will big vendors pay any attention.

I encourage jhy001 to not give up, but join an open source project to implement what is eventually decided here.

-Doug
SueYA 2010-11-15T14:50:53-08:00
I am of the opinion that there is a big difference between a program specification and its final implementation. It is important that any model structure gets tested to make sure it really is a good solution. Attempting to implement part of the model may reveal shortcomings in the model, the implementation or both.

Any particular application does not necessarily have to perform all the functions I may want to use. One application may have particularly good source citation, another may have a nice data entry interface and a third may offer a certain chart layout. A standard data format is a pre-requisite for being able to switch between programs.

The data model needs to be able to accommodate modular programs.

Sue
cowe 2010-11-24T06:52:51-08:00
I believe John does have a point.

This is touching one of the fundamental questions of a data exchange format: Will the data exchange format be based on an explicit data model or not? I believe it's unusable without an explicit data model. If you try to make it data model independent, the resulting format will be too flexible and impossible to import.

I believe you need to define a best practice data model on which the data exchange format should be based. And this data model will help a lot for anyone wanting to build an application from scratch.

Gedcom has an implicit data model, and I'm sure this has helped many application developers. The data model is however no longer best practice.

Regards,
Christoffer Owe
VAURES 2010-11-26T10:33:15-08:00
A standard data format is a pre-requisite for being able to switch between programs.


That's the point: a short estimate is, that there are some 150 different GEDCOM tags existing plus some 50 or so private tags by the individual maker of the application and there are some ?222? different applications.
Consequently the export-import process would need a matrix of some 300 existing GEDCOM tags to be translated into the specifications of the betterGEDCOM program.
I doubt that many users will be happy with this solution except for the fact that BG is the best app since the dog invented barking.
But how shall a new user learn about the new BG program if it cannot import correctly *and easily*?
Wulf
hrworth 2010-11-26T10:39:05-08:00
vaures,

As an End User .... I don't care .... What I care about is that the Application I use allows me to transport my research to another User. That user's application will then re-create my data so that it fits into that application.

The How to do this, is up to the Rest of the GEDCOM community. This is a community project made up of at least four groups of people. End Users, Technical people, Software Developers, and, in my opinion, some Genealogy Research Experts.

If this community can be developed and chooses to make this happen, my 'simple request' can happen. But how it happens, how to get one bit from my application to another so that the other application "knows" what that bit means, is a community effort.


Russ
mstransky 2010-11-26T10:39:42-08:00
"there are some 150 different GEDCOM tags existing plus some 50 or so private tags by the individual maker "

I want both sides of the aisle to see what all feel is important and useful without hindering db storage.

Vaures, do you have that list handy? It would help me make a selection list for display purposes, showing how a db can look, how compact it can be, and how it could use fewer resources and reduce the complexity for other platforms to incorporate a better BG. This is just a suggestion; other techies may or may not get something out of looking at it from a different angle.
greglamberson 2010-11-19T14:31:43-08:00
Negative Evidence?
Lots of people talk about negative evidence. Indeed, this sort of evidence, which is vital in the research process, is largely absent from today's genealogy software because it doesn't directly support a conclusion.

What is the road map for negative evidence? How does it eventually get incorporated into genealogical data such that it can then be conveyed to our fellow researchers just as our other information is?
greglamberson 2010-11-26T14:22:47-08:00
GeneJ,

EXACTLY. They're distinct. Let's focus on that. You can also say they're not independent because one relies on the other. However, they do have DISTINCTION. That's what I'm getting at!!!

In a computer database, ASSERTIONS AND EVIDENCE AREN'T DISTINCT!!! THAT IS THE PROBLEM!!!!

I'm so excited at the prospect of reaching a starting point here.

Regarding the rest of your comments in reference to Mills statements, I'm sure you mention them for a reason, but I don't grasp how they impact this particular issue except by repeating the need to carefully record information. If we cannot look at such references that Mills and others provide to particular concepts and see those same concepts reflected in how the data is recorded, then we have a huge problem.

100% of this entire issue is and always has been an inability to match data concepts in the research process as defined by Mills and others and find corresponding data concepts in any genealogy software's data model.

This is the essence of the flaws of the conclusion model in genealogy data modeling.
GeneJ 2010-11-26T15:27:48-08:00
Greg,

You and I use the same software, but you don't see your assertion level entries as distinct from your evidence level entries, but I do.

That's not to say the tag portion (by which I intend a software specific meaning) of my software is made up of only assertion level input.

Maybe you can show an example of your practice and why you think there is no distinction.
greglamberson 2010-11-26T15:34:31-08:00
GeneJ,

This isn't about how you or I put data into a database. It's about the fact that there aren't distinct places for that data in the database model.

You may make something work a certain way for yourself, but there is no concept equivalent to evidence data in any genealogy software's database model.
ttwetmore 2010-11-26T16:20:13-08:00
Assertion is a word I don't use much, but it fits in with many discussions.

An assertion is a statement whose truth is being affirmed. In the context of the BG model (as I imagine it) and in the context of the DeadEnds model, an assertion is an event or a person that is supported by one or more items of evidence. The event record is the statement that the event occurred, and its source or evidence is its proof. A person record is the statement that the person existed, and its source or evidence is its proof. The terms conclusion, hypothesis, and assertion are nearly synonymous in the BG context; there are only slight nuances of difference between them.

Including assertion into this set of terms does bring up an issue that I've been avoiding for fear of adding more confusion.

The first step in the evidence and conclusion process is to gather evidence and then extract interrelated event and person records from that evidence, and I have called these events and persons "evidence records" to try to emphasize that they come directly from evidence. However, in reality, these records are the first level of conclusions you make when working with evidence. In the GenTech model you find that the boundary between the evidence and the events and persons extracted from the evidence is indeed identified as the boundary between the evidence and conclusion worlds.

I find it's a little easier to understand the process if we call the "first generation" of event and person records the "evidence records," but it then becomes important to realize that these "evidence records" are not really evidence; they are extracted from evidence. I hope this is not too subtle. The reason this seems important is because the word assertion clearly includes this first set of "evidence" records, whereas these records are not "conclusion records" or "hypothesis records" using the senses of these words that I tried to use in the evidence and conclusion process (even though they are conclusions using GenTech's senses of the words -- you see it can be very confusing). So every event and person record in the DeadEnds model (and maybe by extension the BG model) is an assertion of the existence of the entity described by the record. For the first generation person and event records the proof is the source, which is direct evidence. In the later generation person and event records (those further along the evidence and conclusion process) the proof is the conclusions made by the researcher that multiple items of evidence refer to the same persons and events.

If this were easy someone would have figured it all out already.

Tom Wetmore
greglamberson 2010-11-26T16:45:25-08:00
Tom,

You bring up some interesting points. Right now, we're just using the term assertion to reflect an event tag, fact, conclusion, or what have you. I am trying to illustrate that in existing software, sources (e.g., a book), evidence (e.g., a statement in a book about the birth of John Smith) and events/assertions/conclusions/facts (e.g., a Birth Event stating John Smith was born 1888) are not three distinct entities.

In your discussion, I see more of a mixture in the evidence and source data.

GeneJ and I are close to clearing up a very important issue. GeneJ is very knowledgeable about scholarly genealogical research, and I am trying to get her to understand that the data concepts she knows in her research process do not correspond with data concepts in genealogy software. This is a very well-known fact that she hasn't quite understood yet.
GeneJ 2010-11-26T17:40:07-08:00
So, most modern genealogical software allows us to enter data (your term) representing "assertions" and data that makes up what I call a "citation."

Citations relay information about the source (in a "citation" format for things like, author, title, date, publisher ...), identify where in the source the relevant passages were located and include compiler's comments about the information/evidence.

Most modern genealogical software provides users places to transcribe parts of information from the source that relate to the citation. With some software, you can attach digital images of relevant parts.

What other kinds of data entry would you envision?
greglamberson 2010-11-26T18:23:47-08:00
GeneJ,

OK now we're getting into some other things, but let's see where we end up.

When we start discussing "citations," this term has two different meanings that are long-established usages in the differing realms of scholarly genealogy and genealogy databases. This might be a problem we need to resolve, but my feeling is it's ok. I think all the relevant information is included on both sides, but we do both need to be aware of the differences.

I don't think I could properly form a citation by your definition, but I know what one is. In a computer database, a citation includes only the relationship info back to the appropriate source entry in the database, as well as specific locating information about the page or whatever else is needed. Also, there is provision for a free-form note entry. That's it. All the information about the author of the source, the title, etc., is contained in the source entry. Likewise, all the information about how to present those pieces of information in what you would call a citation is actually within the source entry, not the citation entry. In fact, the citation entry, including the page number and note, is another example of information that doesn't exist distinctly within a computer database. If you remove the "assertion," you have destroyed the citation information, including the page number and the note. I think you're very aware of this.
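
In GEDCOM 5.5 terms, the split Greg describes might be sketched like this (record IDs and values are hypothetical, for illustration only): the source record carries the author, title, and other descriptive elements, while the citation under the event carries only the pointer, the page, and the free-form note.

```gedcom
0 @S1@ SOUR
1 AUTH Fayette County Clerk
1 TITL Birth Register of Fayette County, Illinois
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 SOUR @S1@
3 PAGE p. 112
3 NOTE Free-form note; it exists only inside this event's citation.
```

Delete the BIRT structure and the PAGE and NOTE lines go with it, which is exactly the lack of independence being described.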

The output of a "citation" as you would identify it is a reporting function that relies upon information from the database's source entry (which is where the author, etc., are defined), plus the page number and the note from the citation entry. So these two ideas are somewhat different.

"Most modern genealogical software provides users places to transcribe parts of information from the source that relate to the citation." Perhaps you mean the note in the citation entry within a database. Sure, you could do that, but I think that would be problematic since doing so would pollute your use of those fields when it comes time to output the reporting "citation" that most scholarly genealogists would recognize as a proper citation. More to the point: That citation information is part of the "assertion," not its own independent, distinct piece of information, not its own record within a database. That is, if you remove the assertion/tag/event/fact/conclusion, you have deleted that citation note with your transcribed information.

In fact, I would contend that you might find a place to put such transcribed information, but where are the defined fields for transcribing parts of the information, beyond the possibility of a citation note field? I don't think you'll find defined places for such data.

Here are three separate concepts:

1. data definitions in scholarly genealogy
2. data models defining how data is supposed to be housed in genealogy databases
3. your personal adaptation of genealogy software to accommodate data entry as you need it

Do you see the need for 1. and 2. to be consistent in their definitions and usages of data? Item 3. is needed because 1. and 2. don't match in their data definitions.
GeneJ 2010-11-27T06:52:22-08:00
You wrote, “When we start discussing ‘citations,’ this term has two different meanings that are long-established usages in the differing realms of scholarly genealogy and genealogy databases…”

What is a “citation” to one group of users (say a genealogy program and for its users) is a “citation part” to some other program or tool users (like GEDCOM), but in other places or for other users the term footnote, endnote or reference note might be more prevalent. Perhaps we should all use the terminology advanced by _Evidence Explained_, 1st ed., electronic, ala, see p. 43, for use of “Source Lists,” “Reference Notes,” and “Source Labels.” Perhaps when referring to GEDCOM terminology we should type, “Source_Record” or “Source_Citation,” ala http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gcch2.htm#Example . When referring to the terms in use in specific software, we should say so as well. In other words, let’s equally share in the burden of accuracy.

Terminology differences reflect, in part, reasons I hope BetterGEDCOM adopts terminology that does not “blur” the understanding of ”big G” terminology. If BetterGEDCOM makes this accommodation, perhaps genealogy software developers will follow suit. The actions of a few could be revolutionary for the larger body of “users.”

Thoughts or questions below about issues I think might be more on point for the direction this thread has taken ….

Mills’ _Evidence_ drives home, for general practice, that we must categorize “Sources” by their form (see Evidence Analysis: A Research Process Map….), so that a “Source” (q.v.) is either in “Original” form or it’s in some “Derivative” form.

Thanks to Mills and Aunt Nellie, I have more and more become like a heat-seeking original-form missile, so that I have many originals and well-treated digital images from originals. Indeed, where my “Source List” refers to material other than originals (or well-treated digital images of same), I view those “Sources” as inferior to an extant Original.

Even when I have an “Original” document, I’m reviewing that material for “Secondary” Information (in the meaning of the _… Research Process Map_), always hopeful to locate other “Sources” closer in kind to “Primary” Information.

When a question arises in the research process, I want to be working with the whole “Best” “Source” available to me, not someone’s sliced and diced interpreted (“derived”) version (ala, database entry or indexing) of that Source.

When another user is able to provide me with what I consider an “Original,” I want it communicated WITH a “Source Label.” When the “Source” “Original” contributes to communicating User to User, I likewise seek the “Reference Note” a user developed relative to the research question at hand.

More thoughts … but I’ll wait for your comments/questions back.
GeneJ 2010-11-27T07:19:50-08:00
I should have more completely said, "not someone’s sliced and diced interpreted (“derived”) version that has not been vetted (ala, database entry or indexing of that Source)."
greglamberson 2010-11-27T10:38:01-08:00
GeneJ, this problem with terms is, of course, vitally important. In discussing these terms, I have constantly said things like "in a computer database" and "in data modeling," and I constantly say things like "this data," "this data entity," "a database entry," "a database element," etc. Besides, what we are talking about here is data in a database. I really don't think I personally have been vague about the distinction in terms between scholarly genealogy and terms related to data in a database.

In fact, the point I have been trying to drive home here is that we need to realign the data organized in a database to match what is used in scholarly genealogy. The ironic thing here is that this entire discussion about evidence, evidence persons, conclusion-based models and the like is entirely about realigning the data and terms used in genealogy data modeling to reflect those ideas as espoused by scholarly genealogy, for which I consider you an advocate. Yet you have been almost entirely resistant to it.

As much as possible, I agree we should move toward terms reflected in scholarly genealogy, but this is not always possible or even reasonable. I think use of the term citation is a good example of reasonable differences, as your citation consists mostly of information about a source. Information about a thing (that thing being a source in this instance) is going to reside in that thing in a computer database, and that's not going to change, nor should it.

We are speaking about data in a database. We're going to use terms not found in EE or any other genealogy reference.

"Thoughts or questions below about issues I think might be more on point for the direction this thread has taken …"

Rather than addressing the core issue here, you just want to change the subject?? Well, that's great for another thread, but I would appreciate sticking to the issue at hand: defining evidence IN A COMPUTER DATABASE, since there is no equivalent to evidence as understood by scholarly genealogy in current genealogy software.

I frankly can't believe you're trying to change the subject.
GeneJ 2010-11-27T11:18:21-08:00
Actually, I thought my comments WERE comments about the current subject.

Got life to live. --GJ
greglamberson 2010-11-26T04:24:29-08:00
Here's a simple example of the problem:

Research question: Where and when was John Smith born?

Evidence: Birth Record from Birth Register of Fayette County, IL

John Smith born 12 Nov., 1888 in Vandalia.

How is this entered in any genealogy database?

0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 NOV 1888
2 PLAC Vandalia, Fayette, Illinois, United States


Problem: This is not simply the answer to the research question, nor is it the evidence. It's both, all mashed together.

If there is more than one piece of evidence, one must reexamine each piece of evidence to determine which ones might be relevant to the place of birth and not the date, for example.
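
A sketch of that reexamination problem in GEDCOM terms (source IDs hypothetical): two citations hang off the single conclusion event, and nothing records which source supports the date and which supports the place.

```gedcom
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 NOV 1888
2 PLAC Vandalia, Fayette, Illinois, United States
2 SOUR @S1@
2 SOUR @S2@
```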

If you doubt the importance of this issue, we can discuss that. First you need to see the problem before we discuss why this is important, should it not be apparent.
hrworth 2010-11-26T05:51:19-08:00
Greg,

#2 - Evidence - Evidence Explained!, Elizabeth Shown Mills, 2007, p. 822: information that is relevant to the problem. Common forms used in historical analysis include "best evidence (q.v.)," "direct evidence (q.v.)," "indirect evidence (q.v.)" and "negative evidence (q.v.)". In a legal context "circumstantial evidence (q.v.)" is also common.

I, for one, am only using the term 'evidence' because that is the term that has been put on the table in the BetterGEDCOM wiki.

I have sources where I may or may not find information for my research, I have citations that are attached to information that I put into my database, that is taken from that source.

Isn't the first entry about this person, before it has been documented, the first piece of Evidence that helps define who this person is?

Later in the research, I may need to Evaluate this "evidence" based on other Source and Citation information.

Russ
GeneJ 2010-11-26T07:43:25-08:00
@Greg, Russ, all:

Sounds to me like Greg didn't get enough potatoes yesterday.

In haste, perhaps what you call your "mashed" data, I might call your "assertions?"
ttwetmore 2010-11-26T07:57:30-08:00
Trying to understand the essence of the last part of this thread, I sense concern about two things...

1) Whether BG should support research and
2) If so what does that mean with respect to all the Gedcom data that already exists.

My answers are...

1) Yes and
2) It's not good.

If the answer to 1) isn't yes most of us would have pulled up stakes and gone elsewhere, which we haven't.

The only way a Gedcom file can be said to hold evidence data is if the creator of the file forced himself to use very strict conventions. And this presupposes the user actually has any control over how his/her data would be formatted in its Gedcom form. The only users with this luxury are users of programs that not only use Gedcom as the database format, but also give the user complete access to that data for formatting and editing. I know of only two programs that allow this and have heard rumors of a third.

I am lucky enough to use one of those programs, and I have evolved conventions over the years that allow me to keep conclusions and evidence relatively intact inside Gedcom INDI records. I provided an example of this recently with a birth event. With the convention you include many events of the same type (e.g., BIRT or DEAT for persons) in the same INDI record. You make the first event in the physical record hold the conclusion data, what you yourself believe to be true after studying all the evidence, and you follow this conclusion event with all the same events as you extracted them from the evidence you found. You give each of these evidence events a 2 SOUR line that points to the evidence's source. Because I have used these conventions I have been able to write a Gedcom model to DeadEnds model converter that reconstructs the evidence events and evidence persons from the Gedcom INDI records, and creates the conclusion person trees implied by the events. This causes an increase in the number of records, but this is necessary if you are going to maintain the evidence records independently, which the research process demands.
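
As a sketch of the convention Tom describes (IDs, dates and places hypothetical), the first BIRT in the INDI record holds the conclusion, and each following BIRT is the same event as extracted from one piece of evidence, with its own 2 SOUR line:

```gedcom
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 NOV 1888
2 PLAC Vandalia, Fayette, Illinois
1 BIRT
2 DATE 12 NOV 1888
2 SOUR @S1@
1 BIRT
2 DATE NOV 1888
2 PLAC Fayette County, Illinois
2 SOUR @S2@
```

A converter like the Gedcom-to-DeadEnds one mentioned here could then split the trailing BIRT structures back out as evidence events and keep the first as the conclusion.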

That's all fine and good, but it helps no one but me and maybe a few other pedantic geeks who have been worrying about this issue. And because I have evolved my approach over the years, anticipating the era of BG and DeadEnds, lots of my Gedcom is still the old ugly "mashed" type mentioned earlier in the thread.

Probably the best one can do in converting data from a Gedcom model to a BG model is to simply make every INDI into a BG person, and every event sub-structure (BIRT, DEAT, MARR, etc.) into a BG event, with no supporting evidence records, and not worry about whether to call them evidence or conclusion records. Of course such things look just like evidence records in general, but you've got to live with that. After reaching this starting point the user would, if they so cared, retrofit the evidence records back under the new BG records, presumably by going back to their paper files, or Ancestry.com, or images on their computer. Of course the typical user wouldn't do this at all, and wouldn't even understand the nuances of what we are talking about. They would see exactly the same data that they saw from programs that used to use the Gedcom model, and wouldn't be bothered by what to them has always been obvious: "God d**m it, my grandmother is my grandmother, so stop confusing me with all this stupid crap about evidence. I know she's my grandmother so just go away."

Tom Wetmore
hrworth 2010-11-26T08:10:01-08:00
Tom,

Should the BetterGEDCOM support Research?

Sorry, don't understand the question. Or, the answer is "of course".

I am doing family research, I enter data into a program. I want to share that research with another researcher, so I tell the program to generate a BetterGEDCOM file and send it along. The researcher at the other end chooses to open that BetterGEDCOM file in that program, and what I sent is open and available at "the other end".

I want to be able to control WHAT is sent, the person receiving my file should be able to control What to do with that information.

The support of Research, I think, needs to be in the application that I am using.

What I think is being suggested here is what "things" need to be considered as a result of our research. Some of the definitions of what "research" means, including Evaluation of research information and good genealogical practices, are not yet completely "on the table".

If the researcher's application doesn't support BetterGEDCOM but does support GEDCOM 5.5, then the BetterGEDCOM file sends that information along to the application that may support a BetterGEDCOM file.

I have seen issues, today, about applications not recognizing a GEDCOM 5 or earlier. That format of GEDCOM needs to be "upgraded" to Version 5.5. Today, that is a simple edit of the GEDCOM file. Don't know what that means in a BetterGEDCOM environment.

Russ
ttwetmore 2010-11-26T09:16:18-08:00
Responding to Russ,

Here are my opinions that apply to your points:

BG should support research.

When exporting info from a genealogical system, in Gedcom, or BG, or any other format, the user should be able to control what gets exported.

When importing info into a genealogical system, in Gedcom, or BG, or any other format, the user should be able to control what gets imported.

The support of research needs to be in the application. Enough of the process is on the table that we can begin to talk about what it means. See my description of a possible user interface metaphor using virtual 3x5 cards and virtual slips of paper.

The "physical" results of research are the conclusion/hypothesis objects created by the user to bind together the evidence he/she believes apply to the same real persons.

A BG system can generate a Gedcom 5.5 file that can use conventions to include both evidence and conclusion. I described a possible convention for that above in this thread. Many of today's programs would be able to read those files just fine but would probably throw out lots of data. This is because the BG generated Gedcom would have multiple events of the same kind in the persons. Since most programs throw out all duplicates after the first, having the conclusion form of the event first in the Gedcom at least means the receiving system gets the final conclusion form of the person. If the receiving program uses the last event read of each type it is brain dead and we shouldn't worry about it.

I don't know what to say about all the programs that have mangled Gedcom for their own purposes. Well, I do know what to say, which is scr*w 'em, but that's not politically correct. When I wrote LifeLines my decision was to read Gedcom syntax, not a Gedcom standard, so LifeLines reads every possible form of Gedcom, keeps everything, and makes everything available to the user. This is great, except the flip side is that LifeLines generates Gedcom output with exactly what's in the records as well. LifeLines users get around this particular issue by using "report programs" that generate "Gedcom reports" instead of some other output format, and in those programs they can restrict what gets output to things that obey any given Gedcom standard. There is an important distinction here between the LifeLines Gedcom export feature and the LifeLines Report Generator feature that happens to be generating a report in Gedcom format.

I think the bottom line here is that a BG system should be able to generate pure Gedcom 5.5 files that maximize the use of the official 5.5 standard to include the most information that it can. The result will be that information from many different BG records might have to be squeezed into many fewer Gedcom records, but we're smart people and should be able to figure out the best way to do this. I've written some algorithms that do this in my DeadEnds experiments and they are not onerous.

And of course BG systems should be able to read Gedcom files, 5.5 or otherwise, and not lose a drop of information.

Tom Wetmore
hrworth 2010-11-26T09:32:39-08:00
Tom,

My only question is, Where does this application belong, to support Research?

Like a GEDCOM, the BetterGEDCOM is the transport of the information from one application to another.

Rules / Guidelines / data tables may need to be designed so that developers can take a piece of data and put that data into an "envelope" to be mailed via a BetterGEDCOM file and send it on its way. At the "other end" the "envelope" needs to be opened, and the data re-constructed using the 'rules / guidelines / data tables' to understand what to do with the data.

That has nothing to do with research rules. That also doesn't mean the BetterGEDCOM process prevents us from bringing Research issues to the table.

Sorry, I am not that technical to get any deeper into this process. All I want to do is Share my Research with another, with hopes that the various Genealogy Programs will help us do a better job with our research, like giving us tools so that we can Evaluate our Research with hopes of generating more acceptable research results.

Russ
GeneJ 2010-11-26T10:57:50-08:00
My last post was eaten, apparently by the wiki-monsters.

Will wait to see if it does eventually post. Perhaps didn't get associated with my account.

:) --GJ
greglamberson 2010-11-26T12:14:24-08:00
Russ and Tom,

You both veer off course into other implications. I'm trying to remain focused on the particular problem of evidence not existing in a database independently.

Yes, GeneJ, you might call this data an assertion. An assertion is analogous to the answer to a research problem. What it is NOT is evidence, because there is nothing in the data that lets you separate the assertion (if you're happy with that term) from the evidence data. Don't veer off into this other stuff. Once you get this, the other stuff is easy to discuss.

Tom, a little help here. I know you see the problem exactly. This is a discussion of the long term. Yes, you're right that the implications for existing GEDCOM data aren't all that great (but the implications can be mitigated). What is more important to realize is that GEDCOM and other conclusion-based models will not work moving forward. But I digress. Please help me illustrate this point.
greglamberson 2010-11-26T13:01:01-08:00
I would also point out that there are a lot of long-term issues here. Much of what we're talking about is stuff that has to happen within genealogy programs that BetterGEDCOM cannot really solve.

HOWEVER, if you decide you want to do something, like support the research process, you must provide places for the resulting data. If you're looking to buy a new house and want to put in a pool someday, you need a place to put the pool somewhere on the new property. One of the worst problems for the GEDCOM format was that it didn't evolve. As a result, neither did major software applications.
In fact, these concepts I am discussing and trying to bring attention to have been well understood to be problems for around 15 years. The GEDCOM 5.5 Specification itself makes numerous references to structural changes avoided but considered inevitable simply because they would be so drastic. Both the GEDCOM 6 and the GEDCOM Future Directions documents make reference to these very solutions to the problem of supporting research.

Nothing I'm expounding upon is any great revelation of mine. I'm simply regurgitating what many before me have said. Again, here I go veering off into other things...
GeneJ 2010-11-26T13:50:24-08:00
Hi Greg,

You wrote, “An assertion is … NOT … evidence.”

In this context I would say my assertions are distinct from my evidence, but not independent. (My words.)

From the introduction to _Evidence Explained_, 1st ed. (electronic), Mills writes, “As history researchers, we do not speculate. We test. We critically observe and carefully record. Then we weigh the accumulated evidence, analyzing the individual parts as well as the whole, without favoring any theory.”

Same source, see p. 17 for the entry, “Conclusions: Hypotheses, Theory & Proof,” for Mills's discussion of these concepts. She begins, “Each and every assertion we make as history researchers must be supported by proof. However, proof is not synonymous with a source. The most reliable proof is a composite of information drawn from multiple sources—all being quality materials, independently created, and accurately representing the original circumstances.” She follows with, “For history researchers, there is no such thing as proof that can never be rebutted….,” and then discusses nuances of “Hypothesis,” “Theory,” and “Proof.”

Thereafter (my section 1.4) is comment about “Fact vs. Assertion or Claim.”

Hope this helps.
hrworth 2010-11-24T16:56:57-08:00
Greg,

I don't know "surety"

Sorry,

Russ
GeneJ 2010-11-24T17:04:03-08:00
Hi Russ and Greg:

In Russ' example (father's death date), the SSDI and the death certificate are "conflicting evidence." Mills, _Evidence Explained_, 1st ed. (electronic), p. 820, "conflicting evidence: relevant pieces of information from disparate sources that contradict each other."

Using Russ' example, he locates the SSDI first, and enters that information in his software. After he's made the entry, his "database" reports the date of death reported in the SSDI.

A few days later, Russ finds the death certificate. He enters that source and now has two sources "in conflict" (ie, conflicting sources).

He notes the conflict in the citation for the death certificate. In keeping with the need to continue to reassess our body of evidence as a whole, he'll go back to the citation about the SSDI and add a note about the conflict.

Not all conflicts are born equally, not all conflicts can be resolved quickly. Some persist.

These are my thoughts. Those with more expertise could no doubt improve upon my input.

More on indirect evidence shortly. --GJ
GeneJ 2010-11-24T17:08:50-08:00
oops ... intended "conflicting evidence" where I wrote, in error, conflicting sources.
greglamberson 2010-11-24T17:09:36-08:00
Surety is just the grading, within a source citation, of the quality or veracity of the source in relation to a particular event, attribute, fact, conclusion, etc. This is accommodated in GEDCOM by use of the QUAY tag, which has possible values of 0, 1, 2, or 3.

This surety is something that I think could certainly benefit from some attention, but I haven't considered it.
Andy_Hatchett 2010-11-24T19:46:14-08:00
I've used several programs with Surety values but...

In the end they are only meaningful to the person who enters them due to the fact that they are so subjective in nature. Your 2 may be my 1 (or 3 for that matter).
greglamberson 2010-11-24T20:08:10-08:00
Well, there is a defined way to express such data adequately within the GEDCOM data model. There is a similar, if rudimentary, method to define such data in basically all genealogy software.

Here are the defined meanings that GEDCOM gives:

0 = Unreliable evidence or estimated data
1 = Questionable reliability of evidence (interviews, census, oral genealogies, or potential for bias, for example, an autobiography)
2 = Secondary evidence, data officially recorded sometime after event
3 = Direct and primary evidence used, or by dominance of the evidence

This way of expressing data veracity or quality is certainly rudimentary. It could use some work. However, I don't think we're quite qualified to give standards for evaluating genealogical data.
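To make concrete where this surety value lives in a GEDCOM 5.5 file: the QUAY tag attaches to a source citation inside an event, not to the source record itself. A minimal sketch follows; the individual, the source pointer @S1@, and all record values are invented for illustration:

```
0 @I1@ INDI
1 NAME William /Presson/
1 BIRT
2 DATE 5 AUG 1728
2 PLAC Beverly, Massachusetts
2 SOUR @S1@
3 PAGE 1:266
3 QUAY 2
```

Because the QUAY value grades the citation's support for this particular birth event, the same source cited under a different event could reasonably carry a different QUAY value.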
GeneJ 2010-11-25T11:02:28-08:00
Greg wrote, “A Source_Citation gives a relationship to a source and allows notes, but it does not explicitly manifest the evidence, only the location of the evidence and notes useful in evaluating the evidence. … I've always assumed a separate citation entity is important, but perhaps such an entity could be part of an evidence entity with no problems? This is a question someone else has to answer. … No evidence exists on its own within a database, but since direct evidence often (or even usually) is recognized as immediately manifest in an EVENt or ATTRibute, this isn't really seen as a problem. However, since there is no ability to enter indirect evidence without other supporting evidence, the problem is more easily seen. … This topic screams for some illustrations, so I'll be working on diagramming this out in detail and presenting it on its own page. Please comment so I can properly address your concerns as I work on such an illustrative model in the next few days.”

I am not sure I follow your thought that a citation does not “explicitly manifest the evidence.” A citation is “part of” something. Citations relate to particular passages in a larger body of material that likely includes references to many citations.

From my vantage point, “notes” in a citation are not “just notes.” I sure think citations I read in publications like the Register and the Quarterly are right on the money.

You write, “there is no ability to enter indirect evidence without other supporting evidence, the problem is more easily seen.”

As previously quoted, from _Evidence Explained_, indirect evidence is "relevant information that does not answer the research question all by itself. Rather, it has to be combined with other information to arrive at an answer to the research question."

Below are somewhat straightforward examples of items I consider indirect evidence; both are from my family file:

1) Tombstone data or death record, as part of the evidence for a question pertaining to a person's date of birth (for the purpose of identification).
"Jones County Death Register #1" (1 Sep 1880 - 11 Nov 1897), transcription by …. ; database, Jones County GenWeb ( http://www.rootsweb. …) as p. 52, entry 621 for Mrs. Thomas, died at … 4 April 1888, age "85y (?)" 13 d, citing FHL film #… ; reports she was born in Michigan; buried … at …; from age at death (as 85 years, 13 days), compiler calculates estimated date of birth c22 March 1803 using TMG v6 date calculator.

2) Census data, 1850 to 1870 for example, as part of the evidence for a question pertaining to family relationships.
1850 U.S. Census, Ray County, Missouri, population schedule, no city listed (District 75), sheet 618 (penned), page 310 (printed), dwelling 294, family 284, George F. Carle household, as of 27 Aug 1850; digital image, Ancestry.com (http://www.ancestry.com : accessed 18 Mar 2007), cites National Archives micropublication M432, roll 412. George is ae 39, born Ohio; his apparent wife is Elizabeth Carle, ae 32, also born Ohio. Apparent children are Richard, ae 13, Harriet, ae 10, John, ae 9, Lydia Ann, ae 2; and Elizabeth, ae 2/12, all children born Missouri except the eldest, Richard, said born Ohio.

Some personal observations.

a. I don't "read" my genealogy database entries as though the events are like transactions on my credit card statement—each an island to itself. As well, when I think of "a genealogy," it's not in terms of raw database entries.

I try to maintain evidence records (in notes to the event and/or notes in the footnote/endnote) based on how the record would appear in a narrative genealogy (the kind of flowing text found in family books and scholarly articles), where evidence is presented in a family context (broad sense). In a descendancy narrative, information/evidence isn’t presented in the order in which you discovered it. Rather, information/evidence about the parents is presented before the list of their children. The initial child list entries cover all the known children, so that inter-sibling evidence* is likely being presented at about the same time/same place.

*Say the listing from a family bible under “Births,” or even one child’s elaborate evidence including many items that call names of the siblings and parents.

When I initially record a citation, it's usually long. The comments included, I hope, reflect a pool of then existing evidence. As I gather more and more evidence, my prior comments have to be revised, so that the evidence presentation is current as to the existing pool.

b. I’m not the “expert” and don’t wish to confuse the presentation of “indirect evidence,” with what genealogists call a “proof summary” or even a “proof argument.”

Likewise, I don’t think we can really discuss evidence without placing the work in context of the “Genealogical Proof Standard.”

For better information on many of the topics we are discussing, see BCG Standards Manual, 2000 and the BCG website (www.bcgcertification.org).


c. IMO, indirect evidence looks very different, often almost naked, when taken out of a “family context,” say for the purpose of proving a particular direct line of descent (where we are not going to tell the whole family story). This was the case recently for a cousin who was submitting an application to the Sons of the American Revolution. We had the proof; it seemed apparent in a family context, but the indirect evidence did not present itself well in the strict sense of the SAR application.

In other cases, even in a family context, the evidence (collectively) may just need some ‘splaining. Perhaps you need to explore the meaning of “coming of age in a place in 1684”; you might need to disprove part of someone else’s elaborate research. Lots of times, such material won't present well in small-sized text (à la a citation), or maybe its presentation in the body of the work is thought to detract from the work. Maybe it just needs (and often does) some real formatting muscle, as you find in a word processor. In those cases, you might write a research memorandum or a proof summary, or even a proof argument.

In the case of my cousin’s SAR application, we wrote a humble proof argument (closer to a proof summary). (We elaborated on that proof and created a blog entry. See “Mixing it up: The indirect evidence challenge.”)

For examples of really well done proof summaries and arguments, please see the BCG Genealogical Standards Manual, 2000, and the BCG website “Work Samples.”


The first edition, hardcover, of Elizabeth Shown Mills, _Evidence Explained_, included "Evidence Analysis: A Research Process Map." I have a reproduction, c2007, of that material.
Mills writes, "SOURCES provide INFORMATION from which we select EVIDENCE for ANALYSIS. A sound conclusion may then be considered 'PROOF.'"

On the back side of the Mills "map" are four blocks, one each for Sources, Information, Evidence, and Proof. In the block "Proof," she finishes with, "Quality proof does not rest upon any simple statement of fact conveniently offered by some source. Proof should rest upon the totality of the evidence."
hrworth 2010-11-25T11:38:08-08:00
Greg,

I think that we have at least 3 examples of "Evidence" in this discussion. I won't go into the "surety".

Good Evidence, with a 1 - x Scale, as seen by the user entering data

Bad (Negative) / Wrong Evidence

and GeneJ's excellent example of Indirect Evidence (in her example, the Date Calculator).

I submit that very few genealogy applications allow us to make any complete evaluation of these types of Evidence.

From the best that I can tell, some of the "Genealogy Experts" will be needed to help with some of these 'terms'. Clearly, we (the end users) need tools to do this.

That is an application task, in my opinion, but the "Genealogy Experts" need to help with the definitions, to help get to the BCG or Elizabeth Shown Mills standards.

The task that the BetterGEDCOM project has at hand is to help each of these groups understand the "problem" first, get together for a solution, then the BetterGEDCOM to transport the resulting data from one place to the other.

Record 1 Source-Citation - Bad Data
Record 2 Source-Citation - InDirect Data
Record 3 .....

But, there are Genealogical Standards that need to be brought into play here. We are just trying to bring the issues to the table.

Thank you GeneJ for the excellent examples.

Russ
greglamberson 2010-11-25T20:34:11-08:00
GeneJ,

I really think this discussion is going to be of great benefit. What I am trying to do is be explicit in differentiating between terms you are familiar with and terms describing data in a database. They're entirely different.

I am typing this as I read through your comment. I'm very excited so far.

"Source_Citation" refers to a very specific piece of information in a computer database. It is NOT at all analogous to what you're thinking of as a source citation. Recognizing that we're talking about two completely different things is key. "Source_Citation" with the underscore is a database term defined by GEDCOM and carried over into genealogy software.

What we're dealing with here is data in a computer database. That is the language I am using. In trying to translate the data in a database to match the terms used in scholarly genealogy, I am trying to illustrate that the data needs to be definable using scholarly terms. However, this is not possible. In fact, there is not any bit of data in today's genealogy software that fits the definition of evidence per ESM. There's no such thing.

Just as you wouldn't cite a source saying, "It's on the third shelf on the left in a Green book, a big one, in the library downtown. Page 300 or so," data in a database has to be defined properly, precisely, and consistently with the definitions used in best practice. This is 100% of everything I am getting at. If we can start speaking the same language when referring to data in a database, recognizing that this is necessary to mirror best practices as defined by Elizabeth Shown Mills and many others, then we'll really be getting somewhere. Right now, we're not even talking about the same things.

You said, "...when I think of "a genealogy," it's not in terms of raw database entries."

It's crucial we start to talk about raw database entries because that's what we're dealing with here: genealogy as expressed in a computer database. If the component parts of the data in a computer database don't correspond to the parts of data defined in the research process, then how can the two be complementary? There will always be distortion until this inconsistency is fixed.

Russ,

You're exactly right. There are several ideas mixed up here. When I started this discussion, I was really just trying to get some discussion going without speaking all that intelligently on the matter. In speaking less than intelligently, I succeeded.

Here are 3 concepts I think are mixed up frequently, two of which I managed to mix up in the originating entrée to the discussion:

1. Bad or Disproved data - Accommodated however poorly using surety (QUAY) in GEDCOM.
2. Indirect data needing other data to answer a research question as it would appear in a computer database, for example, as an EVENt entry.
3. No data - Notation that a source has been examined and that which was searched for was not found (e.g., Book A every-word index shows no entries for "Tyler" or common variants.)

I'm just coasting through this right now, but over this long weekend I'll diagram these different parts out to illustrate the problems I am trying to uncover.

In the meantime, I would ask everyone to consider the following:

1. Can you understand the need to define key bits of data in identical or nearly identical fashion in both databases and in genealogical research?
2. What is the definition of evidence?
3. Please try to point to the representation of a piece of evidence in your database. My assertion is that it can't be done because there's no such thing in a computer database. Evidence is all mixed up with EVENts and similar database attributes.
4. When conducting genealogical research, how would you react if someone couldn't distinguish their answers to research questions from the evidence that got them there?
GeneJ 2010-11-26T00:43:37-08:00
Greg,
There is much to comment about from the first part of your posting.

You wrote, “In fact, there is not any bit of data in today's genealogy software that fits the definition of evidence per ESM. There's no such thing.”

Several genealogy programs (Legacy, RootsMagic, FamilyTreeMaker, TMG, etc.) have _Evidence Explained_ citation and source list support tools.

Before I comment more fully, will you further explain what you meant by the quote above. Might you provide some examples?

As to the comments you posed to Russ and then "everyone":

"Bad data," "Indirect data," "No data?"
I've spent hours finding, transcribing and describing authoritative information about evidence terminology in this wiki and this thread. The terms here (Bad Data, etc.) only confuse the issue for me; they seem to be a step backwards.

Did you mean, "Direct Evidence," "Indirect Evidence," and "Negative Evidence?"

You asked, "What is the definition of evidence."
I live in the US. See _Evidence Explained_, 1st ed. (electronic), p. 82, "evidence: information that is relevant to the problem. Common forms used in historical analysis include best evidence (q.v.) direct evidence (q.v.), indirect evidence (q.v.), and negative evidence (q.v.). In a legal context, circumstantial evidence (q.v.) is also common."

You asked, "Please try to point to the representation of a piece evidence ... it can't be done."
I have been posting examples. Might you explain why those examples are not evidence? Yet another citation follows. This one represents direct evidence for the birth of William Presson, Beverly, 5 Aug 1728.

William Presson, born Beverly ...[1] 5 Aug 1728,[1] son of ....

Beverly, Massachusetts, "Vital Records of Beverly Massachusetts to the end of the year 1849, Vols. 1-2," 1: 266, William Presson, born 5 August 1728, entry reads, [Presson], "William, s. William and Mary, Aug. 5, 1728"; digital images, Massachusetts Vital Records Project (http://www.ma-vitalrecords.org/EssexCounty/Beverly/BirthsNtoS.html#P : accessed 12 December 2007).

(This is one of perhaps 6 citations for that date of birth.)

You asked, "how would you react if someone couldn't distinguish their answers to research questions from the evidence that got them there?" We want every assertion to be supported by evidence.
Would depend on the research questions, but I'll bite. I'd suspect we are talking about a limited array of direct evidence that lacked conflict.
greglamberson 2010-11-26T03:45:52-08:00
GeneJ,

Assuming the answer of a research problem or question (per ESM definition) is represented in a genealogy database as an EVENt (for example, a birth), there is no existing data entry (I'm using GEDCOM as a data model, but I could use TMG or something else if it would be helpful) representing a single piece of evidence supporting that EVENt. The evidence data and research problem data are all mashed together rather than each having its own integrity. Without an answer to a particular research question, the evidence doesn't exist at all. There is no way for it to. In other words, without a "conclusion," there is no evidence.

The part addressing Russ's question is a step back from what we were getting at. It basically opens up additional questions. I think it's best to ignore that whole section rather than adding problems to the equation. We're right at the heart of the matter.

The examples you give are great. However, I am talking about the data in a database. Go and see how such data is entered. In TMG, for example, you make a Birth tag. This birth tag is not capable of representing a piece of evidence. Rather, this Birth tag is the representation of an answer to a research question. The supporting evidence, whatever it may be, cannot be independently entered. If you were to remove the Birth tag, you have destroyed any trace of the evidence that was supposedly used to support the birth tag because the evidence does not exist in the database independently.

You're looking right at the problem, but I think you're not seeing it yet. This is completely understandable. Just take this process slowly.

For a simple example, pick a source and a research question you wish to answer. Write down on an index card the evidence you can glean from that source relevant to the research problem. Now, go to your computer database. Look at the bits of data as they are entered. Find the answer to your research problem. Now find your evidence data. The fact is, they don't exist independently. If you remove the answer to the research problem, you also remove the evidence.

Computer databases do not distinguish answers to research problems from evidence. This is a design problem.

If you just look at your data on the computer screen, consider your research problem's answer and try to find the information about your evidence and nothing else, you'll eventually come to realize that these important components do not exist independently in your data.

My last question was simply asked to illustrate that just as researchers distinguish between answers to their research problems and supporting evidence, so computer databases also need to distinguish between these two.
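The design distinction being argued here can be sketched in code. The class and field names below are hypothetical (nothing in GEDCOM or any shipping product uses them); the point is only the structural contrast: in a conclusion model the citations live inside the event record, so deleting the event destroys them, while in an evidence model the evidence is a first-class record that conclusions merely reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# --- Conclusion model (GEDCOM-style): evidence is embedded in the event ---
@dataclass
class ConclusionEvent:
    kind: str                                            # e.g. "BIRT"
    value: str                                           # the concluded answer
    citations: List[str] = field(default_factory=list)   # evidence lives here
# Deleting a ConclusionEvent deletes its citations list with it.

# --- Evidence model: evidence records stand alone ---
@dataclass
class Evidence:
    source_ref: str      # where the information was found
    information: str     # what the source actually says

@dataclass
class Conclusion:
    kind: str
    value: str
    evidence_ids: List[int] = field(default_factory=list)  # references only

database: Dict[str, object] = {"evidence": {}, "conclusions": []}

def add_evidence(ev: Evidence) -> int:
    """Store an evidence record independently and return its id."""
    eid = len(database["evidence"]) + 1
    database["evidence"][eid] = ev
    return eid

# Record one piece of evidence, then a conclusion that cites it.
eid = add_evidence(Evidence("Beverly VR 1:266",
                            "William, s. William and Mary, Aug. 5, 1728"))
database["conclusions"].append(Conclusion("BIRT", "5 AUG 1728", [eid]))

# Removing the conclusion does NOT destroy the evidence record,
# which is exactly the property the conclusion model lacks.
database["conclusions"].clear()
assert eid in database["evidence"]
```

Under this sketch, re-evaluating a research question means building a new Conclusion that points at the surviving evidence records, rather than re-entering the evidence from scratch.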
hrworth 2010-11-19T18:32:21-08:00
Greg,

You are correct in that most of our genealogy software does not support a conclusion. That was a topic we talked about before we got this project off the ground.

Since we are now talking about it, some software allows us to Evaluate a Source with some sort of rating. Let's say, for this discussion, a 5 Star Rating.

Because I haven't evaluated all of my Sources, I would NOT choose a Zero as a Negative, but would choose a 1 Star Rating as my "negative".

That's now, based on what I have now. Who knows what we will have in the future, but from what I can tell in various Genealogy circles there is talk about trying to do this evaluation, GPS being one of those efforts.

For the BetterGEDCOM, whatever Evaluation "flag" is sent from the application should be marked as an Evaluation "flag" and passed along. The receiving end should accept the "flag" and let the end user know what the sender's evaluation is of that source.

Vague answer for untested waters.

Russ
GeneJ 2010-11-20T11:20:29-08:00
Greg wrote, "negative evidence ... is largely absent from today's genealogy software because it doesn't directly support a conclusion."

The genealogical proof standard (GPS / http://www.bcgcertification.org/resources/standard.html ) calls for us to perform a "reasonably exhaustive search."

We find a definition of negative evidence in Elizabeth Shown Mills, _Evidence Explained_, 1st ed., electronic, 2000, pg. 826 (and 25), "negative evidence: an inference one can draw from the absence of information that should exist under given circumstances."

While these might not seem profound, specific examples follow of negative evidence this user has applied:

(1) Negative evidence that creates a conflict you will want to resolve and/or that later supports a record that a man died c1853:
Proof that George F. Carle was the son of Mary (Firestone) Carle, latter dec'd 1869. Extant pages from her probate file make no mention of George/George Carle, the supposed son; there are no "unaccounted" individuals who are therein mentioned.

--How might this be entered in a standard family file:
In George F. Carle's death event, separately from other references located that mention his death, cite the 1869 probate file and comment that he was not named.

(2) Negative evidence supporting family relationship about "Hannah (Presson) (Preston)" ... of Rumney" married 1774 to Asahel Brainerd:
Proof of identity, Hannah^7 (Preston) Brainard and her niece, Hannah^8 Preston. There were maybe eight elements to this proof; one of the elements included checking various records to learn if other families of the same name resided in the town. In this case, traces of Preson/Presson/Preston families other than those related to William^6 and Hannah (Healey/Healy) Presson/Preston were not found in extant Rumney town records, town births, marriages or deaths, and/or registered deeds between the dates XXXX and XXXX. [Where each of the record groups noted can be identified in detail.]

--How might this be entered in a family file:
In this case, a written memorandum of proof was developed to contradict findings in an otherwise published account. That written memorandum is cited a number of times including the link of Hannah^6 Presson to parents, and in the death entry for her niece, Hannah^7 Preston.

I also have the option of a more general approach to record "negative search results" during the research process. Say I am researching an Ahern family from Williams County, Ohio; I begin with an 1850 census record for husband, wife, say John and Elizabeth, and six children. I might next locate a large collection of cemetery records for the county, so I create a master source for those records. If I find a stone entry for "John Ahern, loving husband of Elizabeth" and otherwise confirm that to be my John Ahern, I would include that evidence in his file. If no evidence for Elizabeth or any of the children is found in those records, I can add an "event," say "research note-no evidence," to the other family members' files citing the master source and reporting no record found.

Hope this helps. --GJ
greglamberson 2010-11-24T11:29:10-08:00
Saying something can be accomplished and having a categorized place in a database whose definition is that specific kind of data are entirely different things.

Most of us have found ways to make our software work. That doesn't mean it's a solution that is in keeping with the intended purpose of the data structures we use or that the next user would understand if that data were transported.

This statement from GeneJ illustrates the problem exactly:
"...I can add an 'event,' say 'research note-no evidence...'"

By what definition is any kind of note, research note, etc., an 'event?'
hrworth 2010-11-24T11:55:22-08:00
Greg.

Isn't GeneJ trying to say that we, the End Users, need a way to indicate Negative Evidence? One would hope that we, the end users, can help the developers of our software give us a tool to make an Evidence Entry Negative.

Once we have that, then the BetterGEDCOM needs to pass that along.

Today, my software does allow me to rate my evidence. I have to mark mine on a scale of 1 to 5. So, I have to mark a "negative evidence" with a 1 star.

I need a way to mark it as a Negative number. Not saying that it's a negative 1 to 5, but I do need to say that this Evidence is Junk, I have ignored it, and when I share my research with GeneJ, the software being used needs to display that this evidence is junk.

As I know you understand, this is my evaluation of the evidence that I have for this specific event.

Russ
greglamberson 2010-11-24T12:23:38-08:00
Russ,

It's clear that different people mean different things by the term, "negative evidence."

Depending on what you mean, accommodation of negative evidence varies as one might expect.

Some people use the term negative evidence to indicate evidence that has been disproven. Actually, I think that using a surety indicator, this sort of data can in fact be entered faithfully.

ESM's definition, given above in GeneJ's post, is what I meant and what is not accommodated.

The rest of your post appears to presuppose the first definition of negative evidence, which I think is in fact accommodated.

I really think this point might help bridge the gap in understanding the disconnect here. In a database, you define kinds of data and give definitions of those data pieces and the parameters of data that are valid entries, define relationships between the data, etc.

AS SOON AS data entries start to violate those definitions for the data types, you have a broken data model. Period. End of story. It matters not one bit whether the end user can make it work for them. That's not relevant. The problem is that that data is no longer findable using a "roadmap" to that data, that roadmap being the data model. This means that if the data is moved, transported, etc. the data may in fact be completely obscured.

Data modeling is to a computer guy exactly what proper source citation is to a scholarly genealogist. I'm sure we'll get past this, but it's actually really amusing (in a completely nerdy way) to have to advocate for proper data definitions under these circumstances.
GeneJ 2010-11-24T12:40:40-08:00
Hope this helps:

From Mills, "...Process Map," 2006-7; "Negative Evidence: An inference one draws from the absence of information that should exist under a given set of circumstances."

I'm still looking for examples where it can't be entered or why it doesn't exist in most existing software, said by others to be based on a "conclusion model."

1) George F.^4 Carle (Mary^3 Firestone, Mathias^2 Firestone, Nicholas ....)
Evidence of this son is located in various family records through say 1850; but his name appears nowhere in the probate records of the woman we link as his mother.

In the "event" for Mary (Firestone) Carle's probate, I enter the source. I then tag that source to the son George and comment in the citation to the effect that George's name was not found in the extant records.

When I later learn from George's daughter's obituary that he died c1854, what was previously negative evidence now simply confirms information we learned in the obituary.
greglamberson 2010-11-24T15:57:46-08:00
GeneJ,

This has been most illuminating and helpful.

To express the problematic concept in ESM terms, the correct term is "Indirect Evidence." Negative evidence is most commonly indirect evidence, and for these purposes, your examples, being also indirect evidence, still illustrate the problem well. I should have used the term “Indirect Evidence” to begin with.

For purposes of illustrating the problem, I use the GEDCOM 5.5 data model, which is reflected to a large extent in today's genealogy software data models. I am not aware of any major software application that differs in its approach as regards this problem.

ESM refers to evidence as relevant to solving research questions or problems. In a computer database, resolutions of these research questions or problems (often called “conclusions” in reference to computer data) are manifested in data types such as EVENts, ATTRibutes and others such as parent-child relationships, etc. I don't think it matters which manifestation of such an ESM concept is used, but perhaps some other data type example might give different results (although I doubt it).

The key lies in recognizing that each of these data called evidence by ESM should be an independent entity within the database to support the research process. I assume the need for this is clear. (Perhaps not, though.) Anyway, within a conclusion model, moving ESM evidence from one EVENt, ATTRibute or whatever to another in fact causes the information to be destroyed. Logically, this information must be recreated to be moved, because independent of an EVENt or ATTRibute data type, the evidence doesn't exist.

This concept is perhaps most closely represented by a Source_Citation in a computer database, which is already part of what is destroyed if its parent EVENt is destroyed. This is a very closely related issue, but it's still a separate issue. A Source_Citation gives a relationship to a source and allows notes, but it does not explicitly manifest the evidence, only the location of the evidence and notes useful in evaluating the evidence.
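The structural difference being described can be sketched as follows. The classes and field names here are hypothetical illustrations, not any real data model: in a conclusion model the citation lives inside the event and dies with it, while an independent evidence record survives the deletion of any conclusion that references it.

```python
# Hypothetical sketch: nested citation vs. independent evidence record.

class Event:
    def __init__(self, kind, citations=None):
        self.kind = kind
        self.citations = citations or []   # nested: owned by the event

class EvidenceRecord:
    """Independent, top-level entity referenced by id."""
    def __init__(self, rid, text):
        self.rid, self.text = rid, text

# Conclusion model: the citation is nested under the event.
db = {"events": [Event("BIRT", citations=["1850 census, p. 12"])]}
db["events"].clear()                       # researcher deletes the event...
# ...and the citation is gone with it; nothing else referenced it.

# Evidence model: the evidence is stored once; events merely point at it.
db2 = {
    "evidence": {"E1": EvidenceRecord("E1", "1850 census, p. 12")},
    "events": [{"kind": "BIRT", "evidence_ids": ["E1"]}],
}
db2["events"].clear()                      # delete the conclusion event...
print(db2["evidence"]["E1"].text)          # ...the evidence still exists
```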

I've always assumed a separate citation entity is important, but perhaps such an entity could be part of an evidence entity with no problems? This is a question someone else has to answer.

No evidence exists on its own within a database, but since direct evidence often (or even usually) is recognized as immediately manifest in an EVENt or ATTRibute, this isn't really seen as a problem. However, since there is no ability to enter indirect evidence without other supporting evidence, the problem is more easily seen.

This topic screams for some illustrations, so I'll be working on diagramming this out in detail and presenting it on its own page. Please comment so I can properly address your concerns as I work on such an illustrative model in the next few days.
hrworth 2010-11-24T16:34:42-08:00
Greg,

Page 824 of Evidence Explained!

Indirect Evidence: relevant information that does not answer the research question all by itself. Rather, it has to be combined with other information to arrive at an answer to the research question.

I am not sure that this is the same as Negative Evidence.

How about using negative evidence to mean wrong information, as opposed to inaccurate evidence?

I think that I may have an example:

If I look at the SSDI, I find that my father died on a date. I later find the death certificate with a date 10 days later than the SSDI entry.

What is that SSDI death date?

Russ
greglamberson 2010-11-24T16:45:16-08:00
Russ,

EE 2nd edition, page 25, which I have been reading over and over again all afternoon. Same definition. This is what I'm talking about.

Wrong information, disproved information, etc., are different concepts. The concept I started with is most accurately reflected in "indirect evidence."

Disproved/wrong information can be entered with a surety of 0. This isn't a problem.
greglamberson 2010-12-02T14:48:15-08:00
Patch vs. Replace?
Patch GEDCOM and keep a conclusion model or embrace evidence?

First, this is a practical effort. Anything that doesn't embrace 100% what exists today is a no-go. We're not designing a Utopian world here. Whatever the answer to this question is has to be based upon the goals as already expressed.

However, nearly every single effort to define genealogical data for 15 years has seen this as the issue to address. That said, embracing evidence only causes problems for the current project. I'm not sure it solves that many of our current problems.

Here's what I know: Almost every single genealogy program in existence has a model that is based upon the GEDCOM model. If successful, whatever BetterGEDCOM produces will be the driving force behind advances in genealogy software for many, many years. If we don't even attempt to include evidence in the model, I believe there are at least two problems:

1. Many developers won't see any significant benefit to adopting our standard, since recoding will be significant. If we don't provide a roadmap for the future, we're not going to appear too relevant to some developers.

2. Many technical people will see no reason to participate. A very large number of people have specifically asked me about this issue, saying that the opportunity to tackle this age-old (in computer time) problem is THE reason to be here and do anything.

Also, patching GEDCOM without even looking at this problem just kicks this huge problem down the road. If I wanted to do that, I'd run for Congress.

I would really appreciate everyone who has an opinion on this sharing it.
Andy_Hatchett 2010-12-02T15:47:51-08:00
In my personal opinion we have to embrace evidence - if for no other reason than that is where the genealogical community as a whole is going - with or without the blessing of developers.

BG has to make it easy for the developers to be able to use the present GEDCOM as an input to BG but there is no reason on earth that BG has to then output anything that the old GEDCOM could use. I see this as a one way street to the future.

Merely patching GEDCOM, imho, is to be chained to the past- we must break that chain.
ttwetmore 2010-12-02T17:59:10-08:00
I am in favor of tackling the evidence problem, and as Greg would likely surmise, it is the reason I am here and trying to contribute. A better GEDCOM that patched up the conclusion world would be a worthy effort, but it wouldn't have much of an impact, and would likely not attract much enthusiasm.

Tom Wetmore
Andy_Hatchett 2010-12-18T11:08:56-08:00
Goal 7
While it is all to the good that BetterGEDCOM encourage best practices, it is not, imho, enough.

BetterGEDCOM should also discourage certain practices that today's applications allow but that are not best practices.

For Example:

BetterGEDCOM should mandate that any fact/tag/event/assertion/whatever entered must also have a source for that item entered and that any application using BetterGEDCOM must require that of the enduser.

It should also require that any application that exports such items must also export at least one source for the item and that any application that imports a BetterGEDCOM file must import any sources contained in that export file.

Now- most of you will say this is overkill, but I really believe that such a basic genealogical practice as sourcing absolutely must be required of endusers -be they hobbyist or professional- to improve the field of genealogy in general and online genealogy in particular.
GeneJ 2010-12-18T15:11:01-08:00
I edited 3 and 4 ... and 6/7 on the GOALS. Maybe everybody take a look and provide feedback?
louiskessler 2010-12-18T15:14:16-08:00

Gene:

A better way of doing that is offering a badge or something that can be displayed once a program passes some tests that show their program maintains data through input and output.

That will encourage programmers to make their programs compliant, if people start recognizing the badge and that increases sales.

The requirement of compliance is not part of a standard. Developers will decide how compliant they want to be.

Look how compliant Internet Explorer and other browsers have been to Web Standards. Do the standards require web browsers to be compliant? They can't.
mstransky 2010-12-18T15:26:49-08:00
Kind of like W3C compliance badges.
GeneJ 2010-12-18T15:27:55-08:00
Louis ..

"A better way of doing that is offering a badge or something that can be displayed once a program passes some tests that show their program maintains data through input and output."

... and we shine a big light on it.

Errr... that's the way I'd like to see it.
gthorud 2010-12-18T15:29:17-08:00
I think goal 4 should be deleted. One might recommend such functionality on import - and many programs do - but not require. I am not sure I understand what a log on export is useful for - isn't the log the gedcom file?

Before we start talking about conformance - maybe we should focus on something to be conformant to.
GeneJ 2010-12-18T15:43:43-08:00
gthorud,

Are we saying that average users should analyze their own GEDCOM to discover what didn't export? If it is a third party that imports, how would they know what was missing if there was no log?
gthorud 2010-12-18T16:25:03-08:00
GeneJ,

So, that is what you want to log - then I understand. I agree with your intention, but I doubt that I will ever see a program creating such a log. I think developers will be very reluctant to implement a new feature and at the same time advertise to the world that it is not supported by BGedcom. This is more likely to be disclosed by testing.
gthorud 2010-12-18T16:34:49-08:00
Also, if the info can be transferred by BG, it will probably cost about the same to implement the log as implementing the export of the info.
gthorud 2010-12-18T18:54:58-08:00
I have changed my mind on the IMPORT log, ie. requiring a log of dropped data is a good idea.

Are there any programs that do not do this today?
GeneJ 2010-12-18T18:58:43-08:00
I believe most programs do create a log on import.
louiskessler 2010-12-18T23:29:03-08:00
Many programs create input logs, but most are inadequate. Few of them report the info that is dropped or what is thrown into notes. They do not want to highlight their deficiencies.

I don't think any programs produce output logs to describe the information they don't transfer. Again, how are you going to convince a programmer to state what their program cannot do?
AdrianB38 2010-12-18T12:20:09-08:00
Andy
However well intentioned your desire to improve genealogy is, the idea of mandating sources is a non-starter.

1. This is a hobby, not a profession. (Mandate it for professionals, by all means).

2. Any unwanted mandatory practice results in the opposite of the desired effect. For instance, a requirement for strong passwords simply results in unmemorable passwords that end up being written on a Post-It note next to the screen.

3. Adding sources is about the process of genealogical research. BetterGEDCOM is about the provision of a (static) "template" for data-transfer. This is not, except peripherally, part of the process.

4. There are no means of mandating such requirements technically, so far as I know. One could specify triggers, stored procedures, etc, in a database but that's wholly outside BG's scope.

5. There are no means of mandating such requirements procedurally. We might have an "Approved by BG" sticker, but we can't stop anyone writing software to use the new format if they wouldn't merit the sticker.

6. Even if it could be mandated in some fashion, one of two things will happen - the sources created will be dummies with titles like "Personal knowledge" or, when people load unsourced data into a BG-compatible program, it will be rejected. People will then give up on BG saying "It doesn't work". Note that - it will not be their fault, it will be BG's fault in their view.

7. If BG gets the reputation of rejecting user's data, then people will give up on it and go back to GEDCOM type software and we will have failed to advance the cause one iota.
hrworth 2010-12-18T12:55:12-08:00
Andy,

I don't disagree with you. However, would you agree that what you are talking about belongs in the application? BetterGEDCOM is to transport that data.

What we have tried to do, and will continue to do, is to make sure that we address 'best practices' that the professional genealogists are putting on the table now, and some have been for a while.

Thank you for your comments.

Russ
Andy_Hatchett 2010-12-18T13:35:59-08:00
Russ,

Actually, no- I wouldn't agree.

If that is all BetterGEDCOM is then goal #7 should be removed from the page entirely.

BG should, imho, have a compliance factor built into it so that before it attempts to transfer any file that file should be validated as meeting certain BG standards and that one of those standards should be sources for everything. BG would analyze the file, see no sources and simply delete it before transmission with an error message of "File deleted: Did not comply with BetterGEDCOM standards".

And while we are on the subject I'd like to see the Blogging community take a stand on this.

I'd like to see every reviewer end their review with the following:

"Although an adequate program in many respects, it has one damning fault. It allows data to be entered without sourcing that data. I therefore cannot recommend purchasing this product for genealogical use."

Get Eastman, Myrt, and the others doing that and it won't be long before *every* program will be requiring sources when data is entered.
AdrianB38 2010-12-18T13:53:57-08:00
"BG should, imho, have a compliance factor built into it"

Andy - how?
louiskessler 2010-12-18T14:00:47-08:00
If you require someone to do anything they don't want to do, e.g. must add a source, then they will add whatever junk they need to that will allow the program to accept their input. Then they will swear at the program and say it is stupid.

So by forcing sources, all you'll get is people to use a dummy source, e.g. "Who knows?"

Isn't it better just to leave it an option?

For that matter, everything should be an option. You should even be able to add a person without any info, even a name.
louiskessler 2010-12-18T14:02:53-08:00
... and I don't see Goal 7 as being a practical goal of BetterGEDCOM.

BetterGEDCOM is just to store or transfer data. Most people won't see it. Only the programs and the programmers will.
GeneJ 2010-12-18T14:12:31-08:00
I see a variety of ways that BetterGEDCOM can play a role in support of best practices.

For example, we can change the way sources are transferred from one user to another by asking developers, as part of the option to create a gedcom for a third party, to cite the BetterGEDCOM and report the creator's source as a source of the source.

If there was no source in the original file, on input, the GEDCOM would be cited and it would report, "no further reference" or some language.
Andy_Hatchett 2010-12-18T14:20:51-08:00
Louis,

I'd actually like to see the "Who Knows?"

It would tell me all I'd need to know about that person's research so that I could avoid all of it in the future, as well as warn others away from that particular person.

If everything is an option then you reduce the entire field of genealogy to its lowest common denominator- something I'm not willing to do at this point.

As granny would say, "if it ain't sourced it ain't genealogy; and if it ain't genealogy then whoever is producing it can, and should, be safely ignored as being of no use whatsoever to the genealogical community as a whole."
louiskessler 2010-12-18T14:25:40-08:00

Andy,

So you are going to discount all information that is not sourced? That's 99.99% of information that is out there.

If you were a detective and you discounted all evidence that didn't have a witness, you wouldn't solve very many cases.

Data without sources is not necessarily wrong. It is less reliable for sure, but it is and always will be a viable clue to help you find and work towards the correct conclusion.
gthorud 2010-12-18T15:01:08-08:00
Why don't we burn all genealogy books that don't have sources for every bit of info?
GeneJ 2010-12-18T15:02:40-08:00
As a user (not a programmer), if the program I'm buying is BetterGEDCOM compliant, then I'd expect (realizing much of this is application based):

1) I want to know those fields in my application that are NOT able to be exported, so that I will also know which fields ARE able to be exported. (I don't want to learn this the hard way.)
2) If I follow the applications rules when I enter source information, I want to know the program is capable of faithfully exporting that same source information.
3) If I import something from someone else, I don't want only the option to (a) report only the name of the GED as that source, or (b) the full identity of the source (which, as the importer, I may have never seen).
louiskessler 2010-12-18T14:16:00-08:00
Goals 3 & 4 again
Discussion of Goal 7 (which I think should be removed) has made me look at the goals again.

I don't think Goals 3 & 4 belong in BetterGEDCOM either:

3. BetterGEDCOM should require a software application to export all data to be in compliance
4. BetterGEDCOM should require a software application to have a robust conflict resolution facility prior to final import to be compliant

We want programmers to embrace BetterGEDCOM. We don't want to force rules upon them. The harder you make it for them, the fewer that will adopt it. If few adopt it, it fails.

What must be done is to make BetterGEDCOM simple and easy to implement. That will get it used.

Give them reasons to use BetterGEDCOM. Don't give them reasons not to.

We should remove Goals 3 & 4.
GeneJ 2010-12-18T14:25:27-08:00
I take a stab at a change, not deletion.
AdrianB38 2010-12-19T09:27:25-08:00
I've tweaked Goal 4, viz:
Original
BetterGEDCOM should include a test suite of data that will allow programmers and users to diagnose and resolve issues.

New [comments in square brackets]
The BetterGEDCOM project [to emphasise that the data is not part of the standard itself but sits alongside it] should provide a test suite of data that will allow software suppliers [not just programmers] and users to assess compliance of software [it's not just odd issues but we need to get something to check as much as possible. Is this realistic? Probably not but this is a goal not a realistic objective!], diagnose issues and assist in their resolution.
AdrianB38 2010-12-19T09:52:22-08:00
I've tried to tweak and clarify 3 now, viz:
Original -
3 BetterGEDCOM should sufficiently allow the definition of all types of genealogical data so that any and all data can be transferred faithfully.

New - extra comments in [square brackets]
BetterGEDCOM should define data relating to the study of genealogy
[Yes - we'd not actually said this in the goals! Told you I can be pedantic!].

The definitions will describe the XML-based syntax and also be embodied in a data model.
[Tells us what products we aim to produce]

The definitions will be capable of extension by software companies and users. [Important to aim for extensibility because we can never define all types of data ourselves]

The coverage of the types of genealogical data will allow faithful import of data [faithful is the word from before. But I've restricted it to import, rather than just 'transfer' because exporting is just application stuff. It's import from other formats that's crucial]
from all current, common genealogical software [i.e. we want to enable data to be loaded from GEDCOM and potentially others but don't expect us to define something equivalent to absolutely everything]
with no material manual intervention [people can live with a few minor incompatibilities but requiring them to say, reinput all their families is not on],
subject to the limits of the applications involved [i.e. don't blame us if application X doesn't produce compliant GEDCOM and your new app can't read it.]
louiskessler 2010-12-19T10:04:37-08:00

Adrian:

With respect to Goal 4, I think the most BetterGEDCOM can and should do is make up a bunch of BetterGEDCOM sample data files, that the software needs to import, store in its database, and then export. If the exported file is exactly the same as the one imported, then he is compliant.
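The round-trip check Louis describes here can be sketched with a trivial in-memory "application" standing in for real software. All of the names and the toy file format are hypothetical; the point is only the compliance criterion: import, store, export, and compare byte-for-byte.

```python
# Sketch of a round-trip compliance check (hypothetical toy format).

def app_import(text):
    """Parse a flat 'TAG value' file into the app's database."""
    return [line.split(" ", 1) for line in text.splitlines() if line]

def app_export(db):
    """Write the database back out in the same format."""
    return "\n".join(" ".join(fields) for fields in db) + "\n"

def is_compliant(sample_file):
    """Compliant iff import -> store -> export reproduces the file exactly."""
    return app_export(app_import(sample_file)) == sample_file

sample = "NAME George /Carle/\nBIRT c1820\n"
print(is_compliant(sample))   # True for this toy app
```

A real test suite would run many such sample files through the candidate program; any file that fails the exact-match comparison pinpoints data the program drops or mangles, without imposing any logging requirements on the developer.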

Anything more, such as requirements put upon the software, error logs, messages, or anything, is an imposition on the developer.

This must simply be a tool to assist the developer to help ensure the program's compliance and nothing more.

Louis
mstransky 2010-12-19T10:14:56-08:00
Well, if we ever get a mock-up of data, say a file of 500KB of various data fields, an app can import it with, say, no errors. Then if that app can export the document back to a BG format as it was, I think that would be compliant.

If I took a basic GEDCOM of data from FTM and pulled it into my app, I could very easily say I am compliant.

But a test file of various relations, capturing particular data fields in a BG format, would be better.

The GEDCOM flat file is the universal starting point for most older apps.

BG would need various types of data in the test file to cover the things that need to be transferred or captured.

But before this is even done, the techies and devs need a BG list of data fields which need to be captured - not how to structure them right now, just the fields that will hold data.

After that we can talk about a mock-up file to hold various data as a test sample and add to it as things come up.
AdrianB38 2010-12-19T12:20:19-08:00
Louis
Re "With respect to Goal 4, I think the most BetterGEDCOM can and should do is make up a bunch of BetterGEDCOM sample data files, that the software needs to import, store in its database, and then export. If the exported file is exactly the same as the one imported, then he is compliant."

Yes - round-trip sounds a neat idea. Although it needs to be more than just a few samples, but something that gives the app a good workout.

"Anything more, such as requirements put upon the software, error logs, messages, or anything, is an imposition on the developer."

Totally agree with you - does my phrasing sound as if it does require more? It certainly wasn't my intention to aim at more.
mstransky 2010-12-19T12:38:13-08:00
I 2nd that, but don't we need an achievable data field list first?

This would give the devs a hit list of fields to sync with.
AdrianB38 2010-12-19T15:04:55-08:00
"don't we need an achievable data field list first"
To do the job, yes, absolutely, but this page is about the goals, which describe the end-game - how we get there and in what order is for another day. A lot of other days!
GeneJ 2010-12-19T15:07:09-08:00
There is a "to do list" page at about the bottom of the navigation on the right.

I put family file in there; maybe that's the place to put the data field list?
gthorud 2010-12-18T18:42:21-08:00
Keep goal numbers
Could we keep the goal numbers once they are assigned, otherwise the discussions will become a mess.
GeneJ 2010-12-18T18:43:30-08:00
I'll add them as parens (system is using automatic numbering)
gthorud 2010-12-18T19:48:23-08:00
Good idea! We have to keep the links between the conclusions and the evidence :-)
louiskessler 2010-12-19T00:43:42-08:00

I've attempted to modify the two issues in a way that would allow BetterGEDCOM to include them as goals and avoid putting any extra work on developers.

I've put them back in on the Goals page as Goal 3 and 4.

3. BetterGEDCOM should sufficiently allow the definition of all types of genealogical data so that any and all data can be transferred faithfully.

4. BetterGEDCOM should include a test suite of data that will allow programmers and users to diagnose and resolve issues.

Feel free to modify further or comment here.

Louis
AdrianB38 2010-12-30T03:07:25-08:00
Single way (current goal 7)
Re the (current?) goal 7 "BetterGEDCOM should define just one way of doing one thing. More than one will cause ambiguity and extra programming for programmers who will now have to handle all methods. BetterGEDCOM's definitions should be general enough to handle all cases, but in just one way"

I have concerns over what might be expected from this. It seems entirely logical that if there are 2 different ways of accomplishing exactly the same thing, then having both is pointless.

However, as a for-instance, we might discuss whether "facts" should be sourced or not and whether "facts" should be represented through the "conclusion only" model of current GEDCOM or only via the "evidence and conclusion process". These are, in my view, three different ways of doing genealogy and some might read the (current) goal 7 and say BG should choose only one way. This would, I suggest, be very wrong - and indeed most of us have agreed with the concept of not mandating one of those methodologies.

I suggest we need to refine the goal 7 to be more like "BetterGEDCOM should define just one representation of specific data or information ..."

Comments?

Even that's a bit fraught as anyone who's tried to get their heads round the differences between baptism and christening in the Anglican churches may agree (there is a theoretical difference but the terms often appear to be used indiscriminately in parish registers). The English language is inherently fuzzy over time.
theKiwi 2011-01-03T19:44:56-08:00
Louis wrote:

"The header says GEDCOM 5.5, yet they include tags such as ORIG and EDTR and PERI that were removed from GEDCOM 5.4 and later. "

"They have invalid tags: URL, LOCA, REPT, FRAM, DATV, which they should have made custom tags."

This is not entirely Reunion's fault - it is me that has created at least some of those tags.

ORIG, FRAM and DATV were created by me I'm pretty sure, but perhaps Reunion should have insisted they start with a _ for a Custom tag.

URL and LOCA probably are Reunion standard, but I've been using it so long it's hard to remember for sure, and in the case of some of the others you say were removed from GEDCOM 5.4 and later, they have been in my file since long before Reunion started using GEDCOM 5.5 I think.

"They have invalid tags: URL, LOCA, REPT, FRAM, DATV, which they should have made custom tags. I find it strange they didn't since they include a number of them: _EMI, _HAM, _HAN, _MVR, _OBT, _WIL and about 12 others."

All of the ones starting with _ were created by me, I guess after I had learned that Custom Tags should start with a _

When a user adds a Custom field to Reunion - whether it's an Event, a Note or a Fact, or a Flag or anything else, the user gets to create the GEDCOM tag that will be used, and it certainly doesn't insist, or even suggest that it start with a _

If I get a chance in the next few days I'll try to go through my file, compare it to what a new empty Reunion file might have, properly set all of my custom fields to have a GEDCOM tag starting with a _ and then see what is left. I expect that will take a while as according to TNG my complete GEDCOM file has 110 "Custom Events" in it, of which 48 start with a _ and the rest don't.

Roger
AdrianB38 2011-01-04T04:30:55-08:00
As you probably suspect by now, if Reunion is compliant with GEDCOM 5.5, it should have inserted the underscore itself or demanded you insert it. Though if some were once "real" GEDCOM tags, then whether they get treated as custom tags may depend on what version of GEDCOM Reunion is nominally working to.
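The rule Adrian describes could be enforced mechanically at export time. This is a hedged sketch, not Reunion's actual behavior, and the standard-tag set below is abbreviated for illustration (the full list is in the GEDCOM 5.5 specification):

```python
# Sketch: underscore-prefix any tag not defined by the GEDCOM standard.
# STANDARD_TAGS is abbreviated; a real implementation would use the
# complete tag list from the GEDCOM version the program targets.

STANDARD_TAGS = {"INDI", "NAME", "BIRT", "DEAT", "SOUR", "NOTE", "DATE", "PLAC"}

def as_custom_if_needed(tag):
    """Return the tag unchanged if standard, else underscore-prefix it."""
    if tag in STANDARD_TAGS or tag.startswith("_"):
        return tag
    return "_" + tag

for tag in ["BIRT", "FRAM", "_EMI"]:
    print(as_custom_if_needed(tag))   # BIRT, _FRAM, _EMI
```

This also shows why the GEDCOM version matters: a tag like ORIG that was standard in one version but removed in a later one would be prefixed or not depending on which tag list the exporter loads.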
theKiwi 2011-01-04T06:23:32-08:00
An interim update on Reunion 9 GEDCOM files...

I've posted to my site a screen shot of Reunion's "Optional Fields" screen which is part of the Export GEDCOM process. This is for the Reunion 9 Sample file, so perhaps indicates what Reunion includes as GEDCOM tags "out of the box".

http://roger.lisaandroger.com/TNG_Roger/bg/Reunion9OptionalFields.png

I've also posted the GEDCOM file exported from this small file here

http://roger.lisaandroger.com/TNG_Roger/bg/ReunionSampleFamily.ged

Roger
theKiwi 2011-01-04T06:25:02-08:00
(Sorry - links are wrong - how do I edit an existing post, or delete it??)

An interim update on Reunion 9 GEDCOM files...

I've posted to my site a screen shot of Reunion's "Optional Fields" screen which is part of the Export GEDCOM process. This is for the Reunion 9 Sample file, so perhaps indicates what Reunion includes as GEDCOM tags "out of the box".

http://roger.lisaandroger.com/bg/Reunion9OptionalFields.png

I've also posted the GEDCOM file exported from this small file here

http://roger.lisaandroger.com/bg/ReunionSampleFamily.ged

Roger
testuser42 2011-01-04T13:28:05-08:00
About the problem regarding the British Census:
Maybe there's a misunderstanding because people from the US don't know how the UK did/does its censuses (censi??). I don't really know either of them.

But my abstract idea would be: Have a tree of sources. Then one source could be the information concerning a specific household, and this can link to a higher level source like the complete census of that year.
That way, it should be possible to accommodate different methods in different countries with relative clarity and elegance.
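The tree-of-sources idea above can be sketched minimally. The structure and the example titles are hypothetical, but they show the key move: a household-level source carries a pointer up to the census-year source it belongs to, so a citation can be resolved at whatever level a country's conventions require.

```python
# Hypothetical sketch of a source tree: specific sources point to parents.

sources = {
    "S1": {"title": "1851 Census of England", "parent": None},
    "S2": {"title": "1851 Census, household of John Smith, Chester",
           "parent": "S1"},
}

def source_chain(sid):
    """Walk from a specific source up to its highest-level parent."""
    chain = []
    while sid is not None:
        chain.append(sources[sid]["title"])
        sid = sources[sid]["parent"]
    return chain

print(" <- ".join(source_chain("S2")))
```

A US-style citation might attach evidence at the household level (S2), while a summary-level citation could attach directly to the census year (S1); both resolve through the same tree.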
testuser42 2011-01-04T13:36:50-08:00
Quick summary about notes:
Louis doesn't like a separate entity for "Note", while others do. Both sides have valid arguments. Both inline and separate "Notes" already exist in GEDCOM, and should somehow be preserved / converted in BetterGEDCOM.
How about giving the entity-level "Note" a different name from the inline "Note"? That might sound like a cop-out, but it makes clear that these are different things, and should be good for compatibility.
AdrianB38 2011-01-05T08:28:45-08:00
"How about giving the entity-level "Note" a different name from the inline "Note"? "

Shareable Note?

Not SharED Note because I have to have the concept of a Reminder / Miscellaneous Info Note as an entity in itself
theKiwi 2011-01-05T08:36:06-08:00
"Shared Note?"

In Reunion, what appears in the GEDCOM file as the "inline note" is called the Memo - it's extra information about that particular Event. So for example on this page

http://roger.lisaandroger.com/getperson.php?personID=I16&tree=Roger

the items in the first few blocks of data that start with the hollow circle are in the GEDCOM file as the

2 NOTE xxxxx

and are identified in Reunion as the Memo. They are ONLY applicable to that one event.

So perhaps BetterGEDCOM could have MEMO for the inline note and NOTE for the Shared Notes?

Roger
ttwetmore 2011-01-05T09:03:54-08:00
Okay, we want to do everything one way, so we're worrying about inline versus separate-record Notes. (Well, frankly, I'm not worrying about it because I think it's not an important thing to worry about -- but that never keeps me from blabbing my mouth.)

So think of it this way. Sometimes we want to apply a note to just one particular fact, maybe to a whole record, maybe to a date or a place in a record, maybe to an attribute of a person, whatever. So we have an in-line note for that, and that is obviously the right thing to do.

What if we have a more general note that we want to be referenceable from many different appropriate spots? Well, we put this in a separate note record, because if we didn't we would have to duplicate the note in all the spots we want to reference it.

Is this two ways of doing the same thing? No it's not. We are doing two different things, so it's fine to have two different ways to do them. If we want to call one of them a note and the other a memo, so be it, but I really don't think it's at all confusing to use the same tag for the name of a record and the name of an attribute in a record. After all, it's only software developers who have to understand this distinction, so there certainly aren't any user issues to worry about. If developers can't understand this concept they're too dumb to write a successful application, so it's a totally moot point.

And remember, we still have the exact same issue of in-line sources and places versus the record versions of sources and places. I think you can make exactly the same arguments here: that sometimes there are sources and places that will only be needed once (think about it and I'll bet you can come up with any number of examples where this is so), whereas it's usually better to put places and sources in their own records.

I think it's best, in all three of these cases, to simply say notes, sources and places can be either in-line or in separate records as it seems the most appropriate for the application. The user doesn't even have to be involved. An application can decide on export whether to use in-line or record versions based on how many references to them exist in the current version of the application's database. Frankly I think this is a non-issue.
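The export-time heuristic described above could be sketched like this. This is purely illustrative Python, assuming a database that tracks note references by id; the function and note ids are invented, not from any BetterGEDCOM draft:

```python
from collections import Counter

def plan_note_export(references):
    """Given the note ids referenced by records in the database, decide
    which notes to emit inline (referenced once) and which as shared
    separate records (referenced more than once)."""
    counts = Counter(references)
    inline = {nid for nid, n in counts.items() if n == 1}
    shared = {nid for nid, n in counts.items() if n > 1}
    return inline, shared

# N1 and N3 are referenced once, so they can be written inline;
# N2 is referenced three times, so it becomes a shared note record.
inline, shared = plan_note_export(["N1", "N2", "N2", "N3", "N2"])
```

The point of the sketch is that the choice can be made mechanically at export time, without the user being involved at all.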

Also remember that if we insist that BG put all notes in line, then every application that supports notes in separate records (which a lot of them do) will have to duplicate the contents of those records many times as they convert the notes into an in-line version to put in the BG transport file.

Tom Wetmore
brianjd 2011-01-07T22:43:25-08:00
Adrian,

All of your notes are attached to something. Either to an event, a record, a person, or to the database itself. Unless it's a sticky note stuck on your computer screen, or a note that the software allows you to write that is stored like any other text file on your computer. I would argue any note stored outside of the family history database is irrelevant to the BG standard, except as a multimedia object, but then it's no longer a BG Note.

Notes are attached to something, always. I'm not sure why anyone would want to have a note that is attached to the database, but I can certainly understand allowing it. I think the goal number 7 isn't meant to mean that our objects will function in only one way. I think it means that a note is always a note and not a person record also. That we should use only one format, or only one format within a BG file. Things of that nature.

I actually like the general idea of goal 7 from that perspective. Like the discussion about supporting multiple character sets. Let's pick one and only one. Pick one and only one file format. I'm not crazy about XML, but it's popular, descriptive, readable, and I can live with it.

So, I think goal 7 is a keeper, but we need to be clear on what we mean by it.

Goal 7: BG will have one and only one way of doing things. Objects will have one and only one definition.
(Example: Note:: definition: Is an object. Contains free-form text. May have formatting. May be embedded in other objects. May be inherited by other objects... )
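The pseudocode definition in that example could be sketched, purely as an illustration of "one and only one definition", like this (all field names here are invented for the sketch, not from any actual BetterGEDCOM draft):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Note:
    """A single definition serving both uses of a note."""
    text: str                        # contains free-form text
    formatted: bool = False          # may have formatting
    parent_id: Optional[str] = None  # set when embedded in another object
    tags: List[str] = field(default_factory=list)

# An embedded note carries a parent_id; a free-standing shared note
# simply leaves it as None -- same definition, two uses.
memo = Note("From death notice in Clutha Leader", parent_id="BURI-1")
standalone = Note("List of UK census dates")
```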
AdrianB38 2011-01-08T13:26:05-08:00
Brian - agree with everything you say (once I add the phrase "attached to the database" to my lexicon!)

I also like Goal 7 - it's just that I consider the 2 varieties of notes to be different beasts so they don't breach the Goal, whereas others, ahem, dispute that.
brianjd 2011-01-08T19:11:11-08:00
Adrian,

I'm not sure what the issue is with notes.

Personally, I'd like to see most bits of the definition as objects in themselves. Then pieces can be embedded in other objects, or be their own objects or both. It seems to me, that the less restrictions we put on the definition the more versatile it will be.
theKiwi 2011-01-03T05:55:22-08:00
Re:
>"- your belief that most genealogy software suppliers will simply add BG as another option is perhaps the most crucial comment I've read yet on this Wiki relating to the scope of what we can achieve with BG - or at least, its first few releases."

I would hope this is what happens, particularly for those who now use GEDCOM regularly to move their data around. I use Reunion for Macintosh to maintain my databases, and publish them online using TNG [[ http://tngsitebuilding.com/ ]]. The means of moving from one to the other is GEDCOM file, and due to the flexibility of both Reunion and TNG I am able to move almost all of my data from one to the other using custom tags that Reunion exports and TNG can be taught to import.

The thought that TNG might suddenly move to BetterGEDCOM, abandoning GEDCOM, while Reunion took 18 months (or more) to implement BetterGEDCOM is quite unthinkable.
theKiwi 2011-01-03T06:01:13-08:00
Re Inline v's separate entity for Notes - to clarify with an example.

1 BURI
2 DATE 10 SEP 1895
2 PLAC Balclutha Cemetery, Balclutha, Otago, New Zealand
2 NOTE From death notice in Clutha Leader paper Friday 13 September 1895 “His funeral took place on Tuesday, and the cortege w
3 CONC as one of the largest seen in the district”
2 SOUR @S246@
1 _HAN @N357@
1 _EMI @N18@
1 _WIL @N415@
1 _OBT @N418@
1 _UID 224BA1DFC2F340D798AC20C3DA7CF0DD03DA

Does this snippet show what is being referred to here - the 2 NOTE line about the Burial Event being an "inline note" and then the later 1 lines referring to @Nxx@ entries being "separate entities"?

This is from a GEDCOM file output by Reunion 9.
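For readers unfamiliar with the continuation tags in that snippet: CONT starts a new line of note text, while CONC splices its value directly onto the end of the previous line with no separator, which is why the example can split a word across lines ("...cortege w" + "as one of the largest"). A minimal sketch of the joining logic (not a full GEDCOM parser; the tuple representation is just for illustration):

```python
def join_note(lines):
    """lines: (tag, value) pairs for a NOTE and its CONT/CONC children."""
    text = ""
    for tag, value in lines:
        if tag == "NOTE":
            text = value
        elif tag == "CONC":       # concatenate: no newline, no space added
            text += value
        elif tag == "CONT":       # continue on a new line
            text += "\n" + value
    return text

print(join_note([("NOTE", "the cortege w"), ("CONC", "as one of the largest")]))
# -> the cortege was one of the largest
```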
ttwetmore 2011-01-03T07:26:48-08:00
Mr. Apteryx provided an inline vs entity example from Reunion 9.

The 2 NOTE line is what is being called an inline note; also called a local note. The Reunion additions are new to me, but if the @Nxx@ ids reference separate NOTE records then those four lines are references to "separate entities." Do you know whether the 1 _EMI line points to just a note about an emigration event, or does it point to something more substantial, like a real event?

I am guessing that Reunion does have things like emigration and will and obituary events as separate entities, but in order to squeeze them into a GEDCOM file they are converted to something a little simpler. It would be very interesting to see the actual GEDCOM of one of those records with one of those ids. Maybe you could post one of them as well?

A reference to a separate NOTE entity/record "should look", in my opinion, like:

1 NOTE @N181@

Thanks for the insight into Reunion. Looks like there is something to learn here.

Tom Wetmore
theKiwi 2011-01-03T09:45:14-08:00
Tom the Ornithologist wrote:

> Do you know whether the 1 _EMI line points to just a note about an emigration event, or does it point to something more substantial, like a real event?

Sorry, I should have been more complete. The 1 _EMI line (as well as the other lines that start out 1 _xxx) represents what Reunion calls a Notes field - each person can have up to 100 different notes fields. This particular record, as you can see, has a number of custom notes fields for Emigration, Will, and Obituary. Those Note fields appear at the end of the GEDCOM file like this

0 @N18@ NOTE
1 CONT They sailed almost immediately for Australia. [Source 3]
1 CONT
1 CONT William (Wm) and Ellen Moffat emigrated to Australia in Aug 1857 onboard the ship “Titan”. They are listed as 2 of 66
1 CONC 4 passengers on that voyage. [Source 53]
etc etc.

Assuming this board will "take it", below here is a complete GEDCOM file from Reunion for just this individual.

OK it couldn't - I got "Messages cannot be larger than 20 KB." So I have posted the GEDCOM file here

http://roger.lisaandroger.com/WilliamMoffat.ged

This data is presented by TNG here

http://roger.lisaandroger.com/getperson.php?personID=I16&tree=Roger

Everything on that page was transferred from Reunion to TNG by the GEDCOM file linked above, with the exception of the mapping information, which Reunion doesn't (yet) support, so I have to do that in TNG. TNG does, however, support importing LAT and LONG information in a GEDCOM file in several different formats as exported by programmes like Legacy and RootsMagic.
louiskessler 2011-01-03T10:49:37-08:00
Reunion is completely non-standard in its output. It has defined custom user "events" via the _HAM, _EMI, _WIL and _OBT tags and uses the NOTE structure to contain the text-based info about them.

What some programs do, which I think is better, is define their own custom top-level record:

0 @R14@ _EMI
1 DATE Aug 1857
1 NOTE They sailed almost immediately for Australia.
2 SOUR @S53@
1 NOTE William (Wm) and Ellen Moffat emigrated to Australia in Aug 1857 onboard the ship “Titan”. They are listed as 2 of 66
1 CONC 4 passengers on that voyage.
2 SOUR @S54@

with the reference to it being
1 _EMI @R14@

Notice how doing it this way allows additional information (e.g. dates and sources) to be usable by the program, instead of being embedded within a note where it is lost.
louiskessler 2011-01-03T10:57:27-08:00

Adrian:

You said: "... as far as I'm concerned, each schedule (i.e. each household) is an individual source."

That is really the wrong way to do it. Each census is a source. The individual records from that source (each schedule if you would), are the source-citations.

The definition of citation on BG has been under discussion, but I treat it as a "where within source", because that is the way it is generally used in almost all the GEDCOMs I have seen. It is the thing that would be the footnote, with the source being the ibid., carried down from line to line. Obviously, a specific census would be that thing.
louiskessler 2011-01-03T11:09:25-08:00

>"- your belief that most genealogy software suppliers will simply add BG as another option is perhaps the most crucial comment I've read yet on this Wiki relating to the scope of what we can achieve with BG - or at least, its first few releases."

I feel this is the only way BG can make any headway. No developer is going to drop GEDCOM and implement BG. They may, if we give them enough reason, include BG support as well.

From then on, until BG gets enough "market share" and enough momentum to get supported by the bigger players, it's not going to cross the chasm. (Read Geoffrey Moore's Crossing the Chasm) In the meantime, GEDCOM will be supported by all. I would say it would take a minimum(!!!) of five years before software developers would consider dropping GEDCOM, and that's if everything went perfectly for BG.

The ONLY way BG will succeed is if it has specific and identifiable major benefits over GEDCOM that will entice both developers and users to want it.

To be honest, all our proposals, even an evidence and conclusion process, don't have the earthshaking benefit that will tell the genealogy community that they just have to switch.

GEDCOM is the car that works, has a few problems and is good enough. You are trying to get everyone to buy that new car. What will get them to do that?
theKiwi 2011-01-03T11:19:52-08:00
Adrian:

You said: "... as far as I'm concerned, each schedule (i.e. each household) is an individual source."

Louis Kessler:

You said: "That is really the wrong way to do it. Each census is a source. The individual records from that source (each schedule if you would), are the source-citations."

I'm with Adrian here - for one thing, having the household be the Source allows me to print out a hard copy of that page, and file it under its own unique source number for easy retrieval later if I need to look at it (for example who were the neighbours).

If all of the "pick any year" British Census records I have had a single source number, filing them for easy retrieval would become problematic, at least in the scheme I have implemented in my Reunion software and my file folders. In the example page above http://roger.lisaandroger.com/getperson.php?personID=I16&tree=Roger Source 759 points to a page printed out with my great great grandfather and his parents in the 1841 British Census which is represented online here

http://roger.lisaandroger.com/showsource.php?sourceID=S759&tree=Roger
AdrianB38 2011-01-03T12:22:29-08:00
"That is really the wrong way to do it. Each census is a source"
I'm smiling ruefully here. The fact is that you can have 6 genealogists in a room and get 7 different opinions on what level the source should be at. I guess I was influenced by the idea that the source is what I physically pick up to get the data out of. As such it sits in a Repository and has a call-number (whatever).

So thinking of what do I physically pick up for a census - There's no way I'd get the entire 1851 census of the UK out to look at. But if it's an on-line version of it, I make an enquiry on a person or a household and get an image. So the output from that is my source. (I guess by my own logic I should use the image as the source, not the schedule.)

Whatever - the point is that we need to understand that there are people who describe their sources at all different levels and BetterGEDCOM needs to support them _all_ when it describes sources and citations.
AdrianB38 2011-01-03T12:26:01-08:00
"I feel this is the only way BG can make any headway. No developer is going to drop GEDCOM and implement BG. They may, if we give them enough reason, include BG support as well"

Louis - you're absolutely right on this. I've been too wrapped up in my corporate IT view where we could change the format and dictate its use.
GeneJ 2011-01-03T13:28:31-08:00
"Whatever - the point is that we need to understand that there are people who describe their sources at all different levels and BetterGEDCOM needs to support them _all_ when it describes sources and citations."

Mills' _Evidence Explained_ advances the effort of managing your source list. So, all my e-mails with particular cousins are considered a single correspondence collection for purposes of my bibliography. My citation is quite often to a specifically dated email, and so I usually have a separate entry in my master source list for the individual e-mails (but they all have an identical bibliographic entry).

For census, I "source" to the county level (was an easy way for me to keep track of the roll numbers).

For death records and certificates, I like to source at the collection or jurisdictional authority.
louiskessler 2011-01-03T19:08:47-08:00

Tom:

http://roger.lisaandroger.com/WilliamMoffat.ged

Lots of problems I didn't expect in this file. I thought the Reunion people were better than this.

The header says GEDCOM 5.5, yet they include tags such as ORIG and EDTR and PERI that were removed from GEDCOM 5.4 and later.

They have extra data on Level 0 SRCE records that shouldn't be there.

They have invalid tags: URL, LOCA, REPT, FRAM, DATV, which they should have made custom tags. I find it strange they didn't since they include a number of them: _EMI, _HAM, _HAN, _MVR, _OBT, _WIL and about 12 others.

The Note text is supposed to begin on the NOTE record, but it doesn't.

Basically, I am saying that I would not use this file (or any file from Reunion) as a good example of how a compliant GEDCOM file should be constructed.

I give TNG kudos for being able to read and interpret this GEDCOM and its custom tags correctly.
ttwetmore 2011-01-01T12:58:41-08:00
I've been following the local vs. record note discussion. The same in-line vs. record situation occurs for sources and places. Even, in a sense, vital events within person records vie with external event records. If we were to mandate one way of doing things it would seem to have to apply to all four situations.

Another note about notes. Some programs allow note strings to be formatted (bold, italic, etc). Event GEDCOM uses html tags for this. GRAMPS allows formatting. When notes are so involved that they are formatted, it just seems to me they are complicated enough to deserve being their own records.

Tom Wetmore
ttwetmore 2011-01-01T14:48:29-08:00
My mistake. It's GEDCOM 6.0 that uses html tags for formatting notes.

Tom W.
AdrianB38 2011-01-02T09:22:55-08:00
"The same in-line vs. record situation occurs for sources" Oh gosh - I'd forgotten in-line sources. What's American for "Holds head in hands and groans?" Dunno but Dilbert must have said it sometime...

This is not an immaterial question - the current goal 3 says "coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention". Which roughly translates as "BG must allow programs to convert existing GEDCOM data" - or perhaps more accurately, not stop them doing so. So if someone has a GEDCOM with in-line sources, then we ought to facilitate their import somehow - unless we can show there are only a few such files in the world.

What we do NOT have to do is say that in-line in GEDCOM must be in-line in BG. Especially as the in-line source in the GEDCOM 5.5 standard is pathetically thin (a description, text from source and a note).

I therefore suggest that we do not allow in-line sources in BG, and put a note into the BG standard saying that we expect the writers of GEDCOM-to-BG conversion software to create a free-standing source entity on the output for any in-line source on the input, with values taken from the in-line source, and to substitute a pointer to the new source record at the appropriate point in the output file.

All of which is exactly what the GEDCOM standard says for what source-record capable GEDCOM software should do when faced with in-line sources.
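The conversion proposed above could be sketched roughly as follows. The record layout here is invented purely for illustration (real converters would operate on GEDCOM syntax, not Python dicts), but it shows the hoisting step: each in-line source becomes a free-standing record, and a pointer is substituted in its place:

```python
def hoist_inline_sources(records):
    """Replace each inline 'source' dict with a pointer to a new
    free-standing source record; returns (records, source records)."""
    sources = []
    for rec in records:
        inline = rec.pop("source", None)
        if inline is not None:
            sid = "S%d" % (len(sources) + 1)       # assign a new record id
            sources.append(dict(inline, id=sid))   # free-standing entity
            rec["source_ref"] = sid                # substitute a pointer
    return records, sources

recs, srcs = hoist_inline_sources(
    [{"event": "BURI", "source": {"text": "death notice, Clutha Leader"}}]
)
# recs[0] now points at srcs[0] via "source_ref" instead of embedding it.
```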
GeneJ 2011-01-02T09:32:33-08:00
Hi Adrian, Tom, all:

Sigh, my first note back was eaten by the "dumb user closes window while trying to do two things at once" monster.

Evidence/Conclusions: I live in the US and recognize the Genealogical Proof Standard (GPS) as the scholarly standard relative to a conclusion. See BCG website, http://www.bcgcertification.org/resources/standard.html

There are five elements of the GPS:
*Reasonably exhaustive search
*Complete and accurate citation of sources
*Analysis and correlation of the collected information (i.e., _all_ the evidence/body of evidence)
*Resolution of conflicting evidence
*Soundly reasoned, coherently written conclusion.

Separately, while a host of information in Board for Certification of Genealogists (BCG), _... Genealogical Standards Manual_ (2000) is helpful, see p. 25 in "Teaching Standards," as item/standard no. 72, "Database programs accommodate sound data-collection, evidence-evaluation, and compilation standards and do not force or encourage users to leap to premature conclusions about personal identities or relationships, or to tailor their research findings to the input interface."

There are likely standards in other countries that warrant our consideration.

The BCG's Genealogical Standards Manual (2000) is a great resource. The manual provides examples of how genealogists present names, dates, places and source materials.

There are two "pocket guide" styled materials that cover "techniques" in presentations. These materials are the published style guide information for the Register and the Quarterly, as below. Of course, current editions of the Register and the Quarterly report on the broader application of these very techniques!

Michael Leclerc and Henry Hoff, editors, _Genealogical Writing in the 21st Century: A Guide to Register Style and More_, 2nd ed. (Boston, Mass.: New England Historic Genealogical Society, 2006).

Joan F. Curran, Madilyn C. Crane, and John H. Wray, edited by Elizabeth Shown Mills, _Numbering Your Genealogy: Basic Systems, Complex Families, and International Kin_, rev. ed. (Washington: NGS, 2008).

Hope this helps. --GJ


AdrianB38 2011-01-02T09:35:00-08:00
Incidentally, I stick to my guns that 2 ways to do something is not the same as making things ambiguous. A train ticket marked "Chicago to New York on the 10 p.m. New York Central RR or the 10:30 p.m. Pennsylvania RR" is not ambiguous even though it offers 2 routes (it's probably an unheard of ticket mind you, but that's what you get from someone on my side of the Atlantic).

But if the group really does feel like Notes should only be done in 1 way, then I'd accept all Notes being free-standing entities, instead of mixed in-line and free-standing. With the proviso that free-standing Notes could be linked to nothing at all.

It's more work because the software has to maintain separate sources and citations for both the event / attribute in question and the note for it, but it's better than losing shared-notes altogether.
louiskessler 2011-01-02T11:39:50-08:00
A note is a note about something. So I say, attach it to that something. If it is to be a "fancy" note, allow embedded HTML in it.

For that matter, every data item, not just notes, should be allowed embedded HTML in it.

Sources, places and events are different than notes. They have a variety of attributes that need to be contained together, therefore they deserve to be entities. But a note is just a note, describing something else.
louiskessler 2011-01-02T11:56:22-08:00

Adrian,

Don't worry about the extra work for the programmer.

First, they all currently have a way of importing GEDCOM to their database, and exporting their database to GEDCOM. All they need to do is add an import from BG and export to BG.

Second, don't expect them to change their databases. Maybe future programs will design their databases after BG if we make a good enough model. But existing programs will have their databases too embedded into their code to easily change.

Finally, we don't need to facilitate their import. We just have to make BG general enough to be translated to a host of different data structures upon import. None of these inline/record decisions will affect this. If they do it the other way in their program, it is easy enough to translate. Let's just make sure that when there's no good reason to choose the more complex method, then let's standardize the simpler one. The programmers will thank us profusely.

But they will curse us if there's two ways for the same thing. They don't want to have to decide. They/we/I want the standard to decide for them/us/me.
AdrianB38 2011-01-02T13:33:51-08:00
"A note is a note about something. So I say, attach it to that something."

That's not how I use free-standing, unlinked notes. My first unlinked note is a list of UK census dates so I know what date to apply to the census event (it's not on our forms because UK censuses take place on one day).

The only common thing I could link these unlinked notes to is "The Genealogy Hobby" - which is a fairly pointless link.
AdrianB38 2011-01-02T13:54:29-08:00
Louis - you just said something very interesting - "they all currently have a way of importing GEDCOM to their database, and exporting their database to GEDCOM. All they need to do is add an import from BG and export to BG" and "don't expect them to change their databases"

I confess I had not expected anyone to create software supporting BOTH GEDCOM and BG at the same time. (And that's partly the effect of using FamilyHistorian, which uses GEDCOM as its native file format and would therefore need to move in a big bang to BG).

If we introduce multi-person events (as I believe the consensus is), then they really have to change their database structure. If they don't, then they _will_ be cursing us as they try to keep multi-person events aligned across umpteen copies of the event. And how can they import new entity types like GROUPs, or indeed, multiple names across time of LOCATIONs, if they don't change their database structure? (Those new ones probably aren't much of an issue, mind).

Nor am I convinced that maintaining both the GEDCOM concept of a family and the BG one (where much of the important stuff is in roles relating to individual's events) is simple.

Further, the whole point of converting to XML is to use existing externally supplied routines and end up with less bespoke code. Which won't happen if people attempt to maintain both GEDCOM and BG in the same software.

And I really cannot see how an evidence-and-conclusion model can be thought about if the underlying data is still (tweaks excepted) the current GEDCOM compatible databases.

Now, if someone wants to maintain both GEDCOM and BG in the same software, then the best of luck to them, and I'm not going to stand in their way. But if (virtually) everyone goes down that route, then the software world is going to have a much harder job than I had envisaged. Hm. It's a disturbing thought...
louiskessler 2011-01-02T15:35:51-08:00

Adrian,

As much as I'm sure you're very happy with Family Historian and its embrace of GEDCOM, in my opinion, it was a mistake for them to use GEDCOM as their data structure. We know the limits of GEDCOM, which is why BG is being attempted. Once BG or something else becomes the standard, they will either be stuck with an "old" model, or will have to do a rewrite to go to the "new" model. Or maybe they'll be smart and just pick their own data model, and use GEDCOM and BG as the mechanism to transfer to and from other programs.

By the way. Family Historian is not as GEDCOM compatible as they claim. Read Tamura Jones' article:
http://www.tamurajones.net/FamilyHistorian3.1.2.xhtml
louiskessler 2011-01-02T17:42:57-08:00

Adrian,

I would consider each UK Census a source and create a source for each one. Then I'd attach the note to that source.

Does that not make sense?

Louis
AdrianB38 2011-01-03T04:12:26-08:00
"I would consider each UK Census a source and create a source for each one"

Well, it's the way some people work, but as far as I'm concerned, each schedule (i.e. each household) is an individual source. My "citations" for each of those sources have the date on, but the reason I have that note with census dates on in the first place, is to do a quick look-up on the date for the YYYY census so I can add it to the "citation" for the census source, i.e. procedurally the note precedes creation of the source and its citation.

Incidentally, re FH and its claim for 100% compatibility - we're both old enough and cynical enough I think, never to believe such claims, so I discard the exact detail of such claims before assessing anything. And the advantage of being able to hack into my FH database to rename a custom tag (e.g.) that I spelt wrong outweighs the sort of disadvantages you mention - which are real.

Let me try and summarise:
- discussions like this are valuable for getting our heads round how other people do things;
- I wholly agree with the sense of BG doing things one way only wherever possible;
- on that basis I could settle for NOTEs being only a separate entity (i.e. removing the in-line option for NOTEs from BG);
- I couldn't accept BG's NOTEs being in-line only because this would require NOTEs that are currently shared being copied to multiple events when written to an output BG file, resulting in massive duplication;
- I couldn't accept BG demanding that NOTEs always be linked to something else;

AND...

- your belief that most genealogy software suppliers will simply add BG as another option is perhaps the most crucial comment I've read yet on this Wiki relating to the scope of what we can achieve with BG - or at least, its first few releases.
AdrianB38 2011-01-01T04:50:02-08:00
In other discussion on another page (discussion: "tracking land changes idea" on the page for "Mike's Model") Louis commented (Dec 31, 2010 7:55 am) "NOTE as a record and a NOTE local to a record are two ways to do a note. That is bad."
If you regard such notes as equivalent, then yes, it is. But Louis expects others won't agree - me for a start because I don't believe that they are equivalent. One is a note on a specific event or attribute of a specific person or individual. The other is allowed to be more generic. It may apply to a person but no specific event or attribute. It may also apply to no specific entity. For instance, in lieu of a generic "Other-entity", I keep notes about ships and regiments as Note records.

Louis adds "The way I've seen it in the hundreds of GEDCOMs I've looked at, programs have chosen to do it either one way or the other - always local, or always as records. I've never seen a program that has them mixed."
Well, I have. FamilyHistorian from Calico Pie, one of the leading UK FH apps allows both.

He asks: "Which way to go on notes? ... I think the deciding criteria should be simplicity and help the programmer out ... make all notes inline and they are simple strings. BetterGEDCOM becomes simpler with one less entity to worry about. Everyone will be happier."
While I'm always in agreement with the maxim "Simplicate and add less coding" I think this is a simplification too far. Given the number of possible extra entities we are talking about (groups, locations, ships, events, assertions, etc in some form or another) I can't see that keeping the simple entity type of note is adding much of a burden.
Further, if we are to convert existing GEDCOM files to an in-line style, how do we convert the free-standing note records? I've got some (to-do, techniques, etc) that aren't even linked to anything else.
And conversion of them all to free-standing note records would seem perverse in that it clutters up the list of the note records, since I have notes to just about every fact and event in my file, not to mention notes added to many citations, none of which I would want to see in my list of free-standing note-records.

(I've copied the bits here because I think it belongs under this discussion rather than one about places)
AdrianB38 2011-01-01T06:33:54-08:00
Naturally almost immediately after sending the above post, it occurred to me that I probably would have no issues with eliminating the choice between in-line (local) notes and free-standing note records in some places.

Specifically, an event or attribute in current GEDCOM can currently have either an in-line (local) note or a link to a free-standing note record. Or both. If we allow multi-person events (i.e. events linked to more than one family or more than one person), then off the top of my head, I can't immediately see a need for anything other than an in-line (local) note linked to that specific event.

And I also can't see why an attribute of a single person would need anything other than an in-line (local) note linked to that specific attribute.

Note that I still want the ability to have separately both in-line (local) notes and links to a free-standing note record. It's just that in _some_ cases it appears to be possible to reduce the choice. Which may or may not be worth it.
GeneJ 2011-01-01T08:31:19-08:00
Adrian:

I am not sure you and I see the conclusion only model in the same way.

Models in which ALL the "evidence" has been considered in reaching conclusions that are then documented by disclosure of reasoning and relevant evidence, are evidence-based models. (And yes, there is more than one way to represent such an evidence based conclusion.)

Without reasoning, why would a collection of unreasoned source bits be considered superior to the evidence model above? At some level, doesn't it seem a sort of compilation (whether accurate or not)?

A "conclusion only model" might be undocumented information?

Separately, conclusions in all the models above can be judged historically "right" or "wrong" or "possible"/"impossible."

...and where do we consider the elements of a "reasonably exhaustive search?"
louiskessler 2011-01-01T11:29:11-08:00

Adrian,

I think notes should be inline and see no reason for a general note. It needs to be attached to something.

e.g. You said:
'in lieu of a generic "Other-entity", I keep notes about ships and regiments as Note records.'

If I wanted to record information about a ship, I would want to document the Ship itself as an entity, and place the note under it. Otherwise you have a whole nest of unattached random notes.

Where to put a "ship" is another question which we don't have to answer right now (e.g. Tom's Entity, or a "Place" could be stretched to include a Ship which is a place, albeit a moving place)

"FamilyHistorian from Calico Pie, one of the leading UK FH apps allows both."

Thanks for pointing this out to me. Indeed they do. But it looks like they use inline notes for short notes and record notes for long or multi-line notes. I haven't yet found a record note that is used multiple times, and even if Family Historian allows that, they are rare. So there seems to be no necessary reason for Family Historian not to have them all inline.

If BetterGEDCOM states that notes in the BG file be all inline, the software author can still choose to do them both ways in their software, and then just translate them inline to transport. But BG would need to define a "holder" for that note if it is unreferenced. Unreferenced notes are sort of a ToDo list of tasks and things, and could be attached to a Task entity. If it is unattached to anything, it is really useless.
AdrianB38 2011-01-01T12:02:38-08:00
"it looks like [Family Historian] use inline notes for short notes and record notes for long or multi-line notes."
Length doesn't come into it - my notes against Military Service events are quite huge, as they are, in effect, records of all the service between the dates in question. Just the way I and others do it, though clearly many others don't!

"If it is unattached to anything, it is really useless"
Sorry Louis - can't agree - that's like saying Notes in Microsoft Outlook are useless because they aren't attached to tasks, contacts, etc. Certainly all sorts of entity types could be created, as you suggest, that would soak up many of my notes, but I fear the list would be endless for little benefit. Where, for instance, do I attach the note that contains the dates for the UK censuses? Where do I attach the note that summarises the Genealogical Proof Standard? One could create "Methodology" entities, but this isn't simplifying things, which was your (laudable) objective; rather, we're increasing entity types.

Clearly you feel much more comfortable with all notes attached to something - I feel the reverse, so why not allow both of us to work as we want?
AdrianB38 2011-01-01T12:22:01-08:00
Gene said "I am not sure you and I see the conclusion only model in the same way."
and "Without reasoning, why would a collection of then unreasoned source bits be considered superior to the evidence model above"

Well, I wouldn't be at all surprised if each of us meant slightly different things by just about any term. However, my point in the first discussion note was not meant to imply that any of those methods was superior to any of the others. Rather I was saying that there are three different ways of making genealogical conclusions. The current goal 6 is "BetterGEDCOM will actively encourage the best practices of scholarly genealogy" and following this means we don't mandate one of those methods. Thus we have an _extreme_ example of 3 different ways of doing 1 thing, that goal 6 says we must allow. (Even if we dislike it ourselves!) All I want is to ensure that we don't blindly think the current goal 7 (which, in principle is perfectly sensible) must apply 100% everywhere to everything.
GeneJ 2011-01-03T12:32:39-08:00
**DM GOAL 2 BetterGEDCOM should have the following encoding and syntax characteristics
This topic was tabled at the First BetterGEDCOM Developers Meeting. Please continue the discussion here.

BetterGEDCOM should have the following encoding and syntax characteristics :

* Use an XML-based syntax
* Use Unicode character set in UTF-8 encoding, and optionally support other encoding schemes of Unicode
* Utilize a standardized container specification to hold separate supporting files such as multimedia
* Support a markup language to allow formatting (such as HTML) in all appropriate data fields
* Lines should have no length restriction
gthorud 2011-01-05T18:14:50-08:00
While we are having so much fun ...

"Be accommodating of all possible languages represented in IT"

Maybe "all possible languages represented in IT" could be changed to "all languages covered by modern international character encoding standards" --- otherwise I start thinking about programming languages.
ttwetmore 2011-01-05T19:23:42-08:00
And continuing the happiness ...

The Better GEDCOM file's character code set should be Unicode. We should just say that and be done with it forever.

If some application uses a different character set internally that's fine (if stupid). All it has to do is translate from Unicode to its brain dead format on import, and from its brain dead format to Unicode on export, and that's all there is to it. There is no need for Better GEDCOM itself to ever have to support a character set other than Unicode. Please remember that Better GEDCOM will be the specification for a very specific file format to be used to archive and transport genealogical information.

The character bullet item should simply be:

Use Unicode exclusively.

Tom Wetmore
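Tom's point about transcoding can be sketched in a few lines. This is purely illustrative: the choice of Windows-1252 as the application's "internal" encoding is an assumption for the example, not anything from the BG discussion.

```python
# Sketch of the transcoding Tom describes: an application whose internal
# storage uses a legacy encoding (Windows-1252 here, chosen only as an
# example) converts to UTF-8 on export and back on import.

def export_to_bg(internal_bytes: bytes, internal_encoding: str = "cp1252") -> bytes:
    """Decode the app's internal bytes, re-encode as UTF-8 for the BG file."""
    return internal_bytes.decode(internal_encoding).encode("utf-8")

def import_from_bg(bg_bytes: bytes, internal_encoding: str = "cp1252") -> bytes:
    """Decode the UTF-8 BG file, re-encode in the app's internal encoding."""
    return bg_bytes.decode("utf-8").encode(internal_encoding)

legacy = "Søren Kierkegaard".encode("cp1252")
assert import_from_bg(export_to_bg(legacy)) == legacy  # lossless round trip
```

The round trip is lossless whenever the legacy character set is a subset of Unicode, which is exactly why the file format itself only ever needs to name one encoding.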
AdrianB38 2011-01-06T13:34:15-08:00
OK - to avoid confusion over "languages" (which is a good point) and after thinking about Unicode and the alternatives a bit deeper, I now write my proposal for the revised Goal 2 as:

"BetterGEDCOM should have the following encoding and syntax characteristics :
  • Use an existing, non-proprietary syntax (on which to build genealogical-specific definitions)
  • Be robust in the event of data corruption
  • Be accommodating of all possible data types and lengths
  • Use Unicode (only) for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
  • Utilize a standardized container specification to hold separate supporting files such as multimedia
  • Support a markup language to allow formatting (such as HTML) in all appropriate data fields"

Tom - I've left in the bit in bullet 1 about building the genealogy definitions on top, just to make it clearer to those less adept with terminology that this is about the infrastructure, not the genealogy, albeit put it in brackets to emphasise the important bit.

I've also reverted to mentioning Unicode as it seems such a major physical characteristic across so much that it's easier to specify the answer - I think we needn't worry about whether it's UTF-8 at this level - that should come out of compatibility discussions, etc. But, me being me, I have added a "why" to the bald requirement of Unicode - which I copied from Wikipedia.
brianjd 2011-01-07T20:59:51-08:00
I would have to agree here on just supporting UTF-8. UTF-8 specifically and no other. UTF-8 supports all known languages, does it in the most backward-compatible and efficient way possible, and IS the de facto standard.

Supporting other formats only creates more headache for those who wish to support a new standard. We are working for a standard, are we not? Standards imply using a particular method. Choice is good in some things, but when I go in to buy 1/4" 20 tpi bolts, I really want them to all be 1/4" 20 tpi bolts. The same is true here. The actual text output of the gedcom should be in one and only one format. Simple is good. Keep it simple and it stays easy to implement and test and support.
gthorud 2011-01-14T19:37:15-08:00
I have an issue with the term "standardized" in the bullet "Utilize a standardized container specification to hold separate supporting files such as multimedia".

The container issue was discussed to some extent just after the BG wiki was started. It is my view that the detailed requirements for such a specification have yet to be developed, and that no one has yet evaluated in detail any specifications/standards that might satisfy such a set of requirements. I hope that it will be possible to use an existing standard, but I am not ready to say that today. Some more work must be done before we can make a choice. I therefore suggest that the word "standardized" is removed (for the moment).
AdrianB38 2011-01-15T08:25:44-08:00
Geir - Seems OK to me. "Obviously" we will choose the optimal container, which will presumably be a standardised one if one exists!
brianjd 2011-01-15T15:08:30-08:00
Well, if there is an issue with "standardized" as a term, then let's just choose one. Let's use the zip compression format. It seems to be the de facto Internet standard compression container. Simple, straightforward, good, open, common. Sure, 7-zip provides better compression, but it doesn't have the widespread distribution yet.

My preference is always for the actual "standard" in common use. Whether it is actually blessed by a standards committee or not. Providing it is a "standard" that is open, as opposed to proprietary.

There are very good reasons for staying away from proprietary formats.
brianjd 2011-01-15T15:14:37-08:00
I meant to include a note that we should be open to changing that at a future date, and that we should be working on producing a standard with both today's applications and the future in mind. So the requirement should read something like "Utilize, as a minimum, zip compression as a container specification to hold separate supporting files such as multimedia". That way we lock in a common base, and leave open the opportunity to expand or change it at a future date.
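The zip-as-container idea Brian proposes is easy to picture. A minimal sketch, assuming zip; the archive member names (`data/family.bg`, `media/portrait.jpg`) are invented for illustration and not part of any spec:

```python
# Bundle the genealogical data file together with its supporting media
# files into a single zip archive, then read the listing back.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data/family.bg", "<bettergedcom>...</bettergedcom>")
    z.writestr("media/portrait.jpg", b"\xff\xd8...")  # placeholder bytes

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    assert set(z.namelist()) == {"data/family.bg", "media/portrait.jpg"}
```

Any real container choice would also have to specify a manifest and member-naming rules, which is part of the "more detailed work" Geir asks for below.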
gthorud 2011-01-15T17:57:10-08:00
Zip is a strong candidate, but I think more detailed work is needed before we say Zip or any other word. Also a solution must say more than just Zip. But we can't work on everything at the same ...
gthorud 2011-01-15T17:58:18-08:00
... time.
AdrianB38 2011-01-16T08:04:21-08:00
Brian - zip seems entirely possible, indeed obvious even, but I'd go with Geir's view - we don't gain anything by specifying it at this stage. (And I've already got burnt at least once by disobeying my project manager's injunction and putting solutions into the Goals, rather than requirements!)
brianjd 2011-01-16T09:09:11-08:00
Adrian,

Lol, well said. That would appear to be my tendency to also install solutions into the Goals.

Perhaps we should worry less about the technical, grammatical, and connotational aspects of our goals and focus instead on getting an actual template of the syntax. Someone quite smart, not me, mentioned starting from the existing GEDCOM and adding the new features to it. I'd like to second that motion and, if necessary, call for a vote on it. Just to get the ball rolling. It would give us an instant base to work from. We could then add to and subtract from that shell.

We could at the same time decide which format that should be in, or in the alternative request those who propose a particular format to post an exemplar based on the initial base pre-pre-draft spec.

Doing so should kick-start this process. We spend an awful lot of time nit-picking minor details.
Andy_Hatchett 2011-01-04T08:24:32-08:00
BetterGEDCOM simply *MUST* handle everything GEDCOM does presently or it isn't going anywhere and we can all go on to other things.
ttwetmore 2011-01-04T08:45:40-08:00
Better GEDCOM must create a data model for genealogical data and processes. Everything depends on that so I assume it will happen.

A data model is a relatively simple thing in the abstract. No details here but many models are described in this wiki so you can go see how simple they are.

GEDCOM, XML and JSON are SYNTAXES for expressing information and data objects, and the kinds of information we are interested in are records from our BG data model. All three of these syntaxes are equally suited for representing BG records. Whether we pick one, two, three or even add our own format customized for the genealogical domain, is immaterial to the structure of the data model. Converting from any one to any other is a trivial operation given we know the tag mappings because all three formats come with fully featured libraries of software for parsing them into and generating them out of internal node tree structures.

We worry about the exact spelling and casing rules for tags. When we create our models we should simply choose abstract, implementation-agnostic tags. When we later decide the mappings from the abstract model entities to the physical GEDCOM, XML, JSON, ??? formats, we can define separate tag sets appropriate for each format. The point I am trying to make is that this is not an important issue at this point in time. The "critical path" through the BG Gantt chart right now is working on the abstract model. Every day we don't do that is a day we lose on the schedule.

Tom Wetmore
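Tom's claim that GEDCOM, XML and JSON are interchangeable syntaxes over one node tree can be shown with a toy example. The record content and the tuple representation are invented for illustration; none of this is proposed BG syntax:

```python
# One in-memory node tree, three serialisations. Each node is
# (tag, value, children), mirroring GEDCOM's level/tag/value lines.
import json
from xml.sax.saxutils import escape

node = ("INDI", "", [("NAME", "John /Smith/", []),
                     ("BIRT", "", [("DATE", "1 JAN 1850", [])])])

def to_gedcom(n, level=0):
    tag, value, children = n
    line = f"{level} {tag}" + (f" {value}" if value else "")
    return "\n".join([line] + [to_gedcom(c, level + 1) for c in children])

def to_xml(n):
    tag, value, children = n
    inner = escape(value) + "".join(to_xml(c) for c in children)
    return f"<{tag}>{inner}</{tag}>"

def to_json(n):
    tag, value, children = n
    return {"tag": tag, "value": value, "children": [to_json(c) for c in children]}

print(to_gedcom(node))
print(to_xml(node))
print(json.dumps(to_json(node)))
```

Each converter is a few lines precisely because, as Tom says, the hard part is the model and the tag mappings, not the syntax.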
SandyRumble 2011-01-04T08:47:33-08:00
Does that mean the current GEDCOM definition or should that include obsoleted tags, like NCHI (number of children)?
Also, does it mean that the contents must be covered or does it mean we are building on existing object definitions?
ttwetmore 2011-01-04T09:09:56-08:00
We should claim to be backward compatible to the GEDCOM used in all conventional programs, regardless of how they adhere to standards. We should also cover all the tag set extensions defined by different vendors. To make this happen we need to do three things:

1. Figure out what all those tags are and what they mean.
2. Figure out how to map those tags into the BG model.
3. Provide an example utility program that performs the conversions from all past GEDCOM formats to BG format (XML, GEDCOM, JSON and/or ??? syntaxes).

At the meeting it seemed that some members thought that to be backward compatible meant that BG must use GEDCOM syntax. This is not true. To be backward compatible all one needs is an unambiguous way to map data in the old GEDCOM formats to an equivalent BG format, whatever its syntax. As long as BG has an underlying data model that encompasses the data model of the old GEDCOMs this is straightforward. And that old GEDCOM model is basically the "lineage-linked" semantic model. It is based on person and family records; it is based on placing vital events inside person and family records; it is based on simple notions of names, dates, places, notes, and sources; and it is based on a small set of conventional attributes. All of these concepts will be in BG and therefore encompassed by BG.

Point 3 above implies that it may be useful for the BG effort to be able to demonstrate the prowess of its ideas and models with some demonstration software. It appears that there are enough development-savvy members of the wiki now that such an effort might prove an interesting diversion from everyday life.

Tom Wetmore
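Step 2 of Tom's list, mapping old tags into the BG model, is at heart a lookup table. A hypothetical sketch: the GEDCOM tags on the left are real (NCHI is the obsolete "number of children" tag Sandy mentions), but the BG-side names are placeholders, since no BG model exists yet:

```python
# Hypothetical tag-mapping table. The "bg:" names are invented
# placeholders, not a proposed vocabulary.
TAG_MAP = {
    "NCHI": "bg:numberOfChildren",
    "BIRT": "bg:birthEvent",
    "CHR":  "bg:christeningEvent",
}

def map_tag(old_tag: str) -> str:
    # Unknown vendor extensions pass through with a marker rather than
    # being dropped, so no data is silently lost on conversion.
    return TAG_MAP.get(old_tag, f"bg:unmapped:{old_tag}")

assert map_tag("NCHI") == "bg:numberOfChildren"
assert map_tag("_MYTAG") == "bg:unmapped:_MYTAG"
```

The pass-through default matters for Sandy's "nothing a user hates more than to lose data" point: even tags nobody has catalogued yet survive the round trip.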
Andy_Hatchett 2011-01-04T09:56:44-08:00
Tom,

If it helps here is the list of the 112 "Standard" Tag Types TMG ships with.

List of All Tag Types


Label      Abbr  Tag_group

Address add. Address
Adoption ado. Other
AFN afn. Other
Age age. Other
Anecdote ane. Other
Annulment ann. Other
Associatn ass. Other
Attributes att. Other
Baptism bap. Birth
BaptismLDS bap. Other
BarMitzvah bar. Other
BasMitzvah bas. Other
Birth b. Birth
Birth-Covt bct. Birth
BirthIlleg b. Birth
BirthStill b. Birth
Blessing bls. Other
BlessngLDS bls. Other
Burial bur. Burial
CancelSeal can. Other
Caste cst. Other
Census cen. Other
Christning chr. Birth
Codicil cod. Other
Communion1 com. Other
Confirmatn cnf. Other
ConfirmLDS cnf. Other
Criminal crm. Other
Death d. Death
Descriptn des. Other
Divorce div. Divorce
Divorce Fl dvf. Divorce
Education edu. Other
Emigration emi. Other
Employment emp. Other
Endowment end. Other
Engagement eng. Marriage
Event-Misc msc. Other
Excommuntn exc. Other
Father-Ado Relationship
Father-Bio Relationship
Father-Fst Relationship
Father-God Relationship
Father-Oth Relationship
Father-Ste Relationship
GEDCOM ged. Other
Graduation grd. Other
History his. History
HTML htm: Other
Illness ill. Other
Immigratn imm. Other
JournalCon nar. Other
JournalInt nar. Other
Living liv. Other
Marr Bann mbn. Marriage
Marr Cont mcn. Marriage
Marr Lic mlc. Marriage
Marr Sett mst. Marriage
Marriage m. Marriage
Milit-Beg mlb. Other
Milit-End mle. Other
Misc msc. Other
Mother-Ado Relationship
Mother-Bio Relationship
Mother-Fst Relationship
Mother-God Relationship
Mother-Oth Relationship
Mother-Ste Relationship
Name-Baptm nam. Name
Name-Chg nam. Name
Name-Marr nam. Name
Name-Nick nam. Name
Name-Var nam. Name
Namesake nsk. Other
NarrativeC nar. Other
Nationalty nat. Other
Natlzation nat. Other
Note nt. Other
NullifyLDS nul. Other
Num Child #c. Other
Num Marr #m. Other
Occupation occ. Other
OrdinacLDS ord. Other
Ordinance ord. Other
Ordination ord. Other
Parent-Ado Relationship
Parent-Bio Relationship
Parent-Fst Relationship
Parent-God Relationship
Parent-Oth Relationship
Parent-Ste Relationship
Probate pro. Burial
PrsmCancel prs. Other
Psgr List psg. Other
Ratificatn rat. Other
Rebaptism rbp. Birth
Reference ref. Other
Religion rel. Other
Reseal rsl. Other
Residence res. Other
Restoratn rst. Other
Retirement ret. Other
SealChild sc. Other
SealParent sp. Birth
SealSpouse ss. Marriage
SSN ssn. Other
Stake stk. Address
Telephone tel. Other
VoidLiving vdl. Other
WAC wac. Other
Will wi. Other

Printed on: 4 Jan 2011
Prepared by: Andy Hatchett
gthorud 2011-01-04T12:17:34-08:00
Could we leave the backwards-compatibility issues not related to syntax out of this, also the tags, and focus on the things covered in the first posting?

Please also read previous discussions on these issues; we can't keep on repeating all previous discussions.
gthorud 2011-01-04T12:28:05-08:00
Louis wrote:

"Markup should be allowed in ALL data fields."

I am not sure I see a need for this in ALL data fields. The markup will in most cases be chosen on output by the receiving application. I am not sure I see a need for markup in fields other than notes. Also, as far as I know, most database engines do not support markup in all data types.

Please present some examples, other than notes, where this would be useful.
louiskessler 2011-01-04T22:32:30-08:00

Geir:

Thank you for asking me this. I thought on it, and you are right.

Markup should only be on NOTE and TEXT fields and maybe there is one or two other data fields that I don't recall right now.

The reason is, all the other values will be presented in a report and the programmer will want full control of the way the report looks, and won't want markup interfering with that. Only the notes and text will be items that one might want replicated as they were in the original source, with hyperlinks and formatting and tables, etc.

So I now agree that the word "appropriate" should be left in.

Louis
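One way the "markup in notes" requirement could look, assuming an XML-based syntax: HTML markup inside a note survives transport if it is escaped (or CDATA-wrapped). The `note` element name and `format` attribute are purely illustrative, not proposed BG syntax:

```python
# Escape an HTML-formatted note for embedding in an XML field, then
# parse it back to show the original markup is recovered intact.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

note_html = '<p>See the <a href="http://example.org">ship manifest</a>.</p>'
element = f'<note format="html">{escape(note_html)}</note>'

parsed = ET.fromstring(element)
assert parsed.text == note_html  # markup round-trips untouched
```

This keeps the markup out of the transport syntax's way, which is why restricting it to NOTE/TEXT fields (as Geir and Louis settle on) costs nothing in fidelity.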
AdrianB38 2011-01-05T09:31:16-08:00
I am very tempted to take the advice of my last project manager and say that, unless there is some utterly over-riding reason, we take XML _out_ of the goals, because the goals are what lead to precise requirements, and XML is a solution to those requirements.

That way, we may actually be able to leave this decision and get on with looking at data models (i.e. WHAT are we collecting data about).

So I suggest rewriting Goal 2 to read:

"BetterGEDCOM should have the following encoding and syntax characteristics :

  • Use a standardized, existing, non-proprietary, open syntax
  • Be robust in the event of data corruption
  • Be accommodating of all possible data types and lengths
  • Be accommodating of all possible languages represented in IT
  • Utilize a standardized container specification to hold separate supporting files such as multimedia
  • Support a markup language to allow formatting (such as HTML) in all appropriate data fields"

Comments:
I would like to exclude
"Lines should have no length restriction" because that's a very specific requirement that seems to have come from CONC/CONT experiences. Instead I have "Be accommodating of all possible data types and lengths". Plus "Be robust in the event of data corruption" since some CONC/CONT incidents seem to crop up if characters are lost.


This may upset someone - I've removed
"* Use Unicode character set in UTF-8 encoding, and optionally support other encoding schemes of Unicode"
in favour of
"* Be accommodating of all possible languages represented in IT"
The latter is the requirement - UTF-8 is simply one solution. Isn't it? And even if it's the only solution, it's still a solution, not a requirement.

If lots of IT systems are given to rejecting anything other than UTF-8, then I'd be happy to go with the original wording.

Like I said - I'm trying to reword this to get goals / high level requirements that we can all agree on. This does NOT avoid the XML issue, it simply postpones it to allow us to concentrate on data content.
AdrianB38 2011-01-05T09:35:20-08:00
Oh - looking at it all, I'm now worried that
"Use a standardized, existing, non-proprietary, open syntax"
doesn't convey the idea that the syntax is something we build genealogy stuff on - which was obvious when it said XML, but not now. So how about:

"BetterGEDCOM should have the following encoding and syntax characteristics :

* Use a standardized, existing, non-proprietary, open syntax on which to build definitions unique to genealogy
* Be robust in the event of data corruption
* Be accommodating of all possible data types and lengths
* Be accommodating of all possible languages represented in IT
* Utilize a standardized container specification to hold separate supporting files such as multimedia
* Support a markup language to allow formatting (such as HTML) in all appropriate data fields"
AdrianB38 2011-01-05T09:38:24-08:00
Sandy - re "Is BetterGEDCOM going to be backwards compatible with GEDCOM? "

It's a sensible question that comes under Goal 3. I attempted there to describe what I thought we should be compatible with. (Bearing in mind these are hi-level goals not specific objectives).
ttwetmore 2011-01-05T13:38:15-08:00
The handwriting on the wall definitely says that BG will use XML as its primary and probably only archive and transport file format. Even so, I suggest that the bullet item:

"Use a standardized, existing, non-proprietary, open syntax on which to build definitions unique to genealogy"

be simplified to

"Use a non-proprietary syntax"

Don't need to say "open" if you say "non-proprietary". And you don't need to mention the data at all, since this is just about encoding and syntax. Go ahead and stick "existing" back in if we are truly ready to commit to XML (or GEDCOM or JSON), but I don't think "standardized" is so important since all existing syntaxes are standardized.

Ah, these terminology discussions are so darned fun! Let's have more.

Tom Wetmore
louiskessler 2011-01-03T15:22:58-08:00

As always, I don't like giving options. Pick UTF-8 and stick with it. Or pick something else.

Remove the word "appropriate". Markup should be allowed in ALL data fields.
SandyRumble 2011-01-04T05:13:03-08:00
Is BetterGEDCOM going to be backwards compatible with GEDCOM? The answer to this goal and how support will be provided for existing GEDCOM files will help guide goal definitions for the next two topics.
SandyRumble 2011-01-04T05:14:42-08:00
There are currently a lot (millions?) of GEDCOM files on hard drives all over the world; not all of which are in the GEDCOM 5.0 or GEDCOM 5.5 format. If BetterGEDCOM is backwards compatible with GEDCOM, to what specification number? In addition are we planning to support the custom variations that products have introduced over time as they found the GEDCOM specification deficient? There is nothing a user hates more than to lose data!
SandyRumble 2011-01-04T05:15:54-08:00
What data should be captured or provided for in the BetterGEDCOM data model/specification? The answer to the first goal will have an impact on this goal. Are we extending the current GEDCOM data model, addressing areas that could be more robust, or are we starting over? If we are starting over, should support be provided for all data that is currently defined in the GEDCOM specification? What about historical tags which may be encountered in older data or from GEDCOM files generated by programs that are obsolete, but still in use today?
SandyRumble 2011-01-04T05:16:31-08:00
What file format should BetterGEDCOM use? This is a two-part question, which will generate debate, but which does not need to be resolved until later in the BetterGEDCOM process, i.e. after the data model itself is defined. Additionally, the answers to the two prior goal questions will have an impact on this goal. Postponing the setting of this contentious goal will allow work to progress on the data model without worrying about the handshake.
SandyRumble 2011-01-04T05:17:22-08:00
Which character specification should be used?
The original GEDCOM specification called for the ANSI character set, although some versions of genealogy management software do create their GEDCOM files in Unicode. Unicode has the advantage over ANSI/ASCII of being able to represent any character in all languages. Unicode comes in several variants (UTF-8, UTF-16 or UTF-32). Of these, UTF-8 offers 100% backward compatibility with ASCII and has become a de facto standard on the internet.
SandyRumble 2011-01-04T05:17:56-08:00
Should the current GEDCOM tag system be extended, or should we use XML, JSON or some other tag-oriented variant?
SandyRumble 2011-01-04T05:24:46-08:00
UTF-8 is most likely the best choice as it offers backwards compatibility with the current standards while offering support for characters in all languages. In addition, most developer tools and operating systems provide strong support.
GeneJ 2011-01-03T12:36:18-08:00
**DM GOAL 3 (part) ... coverage of the types of genealogical data ...
This item was tabled at the First BetterGEDCOM Developers Meeting. Please continue the discussion here.

BetterGEDCOM should define data relating to the study of genealogy. The definitions will describe the XML-based syntax and also be embodied in a data model. The definitions will be capable of extension by software companies and users. *The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention, subject to the limits of the applications involved.*
louiskessler 2011-01-03T18:34:39-08:00

Just wanted to note that I originally proposed this goal as:

"BetterGEDCOM should sufficiently allow the definition of all types of genealogical data so that any and all data can be transferred faithfully."

So it has morphed somewhat from that.
GeneJ 2011-01-03T18:46:29-08:00
@Louis:
Perhaps we should again start from your proposed goal and edit from there (leaving issues specific to the second goal for that discussion).

BetterGEDCOM should sufficiently allow the definition of all types of genealogical data so that any and all data can be transferred faithfully.
AdrianB38 2011-01-04T13:22:48-08:00
OK - I think I morphed Goal 3, so perhaps I should explain why...

Firstly please understand this - by training I am a mathematician so I am naturally someone who wants something defined precisely. During my career in IT I never found any instance where someone told me, "Hey, can't you be a bit more vague here?" Although they might have told me my detail was inappropriate and contradictory.

Let's start with the first version:
"BetterGEDCOM should sufficiently allow the definition of all types of genealogical data so that any and all data can be transferred faithfully"

"sufficiently" isn't necessary. Either "any and all data" can be transferred, or it can't. And "sufficiently" is undefined - how sufficient is sufficient?

So we now have "BetterGEDCOM should allow the definition of all types of genealogical data so that any and all data can be transferred faithfully"

The next concern is "allow the definition of". This isn't specific enough. What does it mean if it _allows_ the definition of stuff? Does it mean it allows someone else to do the definition? Certainly not - the BG 'programme' is surely what does the work of the definition.

So we get something like "BetterGEDCOM should define all types of genealogical data so that any and all data can be transferred faithfully".

The next bit I wanted to add in was something about how those definitions would be, err, defined. In IT, when defining requirements on products, we always need to think about just what the end-product is. Is ours just a document? Is it applications that actually do the work? (to which the answer is 'no', based on our scope). I think that the end products are:

(a) a document of the definitions (gives us "BetterGEDCOM should define data relating to the study of genealogy" - the "relating to" bit is because I wanted a clear statement of scope of the data that we're talking about that wasn't in the middle of another sentence.)

(b) a data model that shows how the bits of the 'things' being defined relate to one another, including diagrams of those relationships (trust me - if you're an IT guy, you want this and if you don't get it, you create it yourself. Clearly, we want just one, not one from each software company, each of which is slightly different) (Gives us "The definitions will ... also be embodied in a data model.")

(c) a formalisation of the definitions to save software developers formalising them several times - each time subtly differently. Hence "the definitions will describe the XML-based syntax". Now, at this point I was smart enough to avoid saying "schema" or "DTD". BUT - I committed the cardinal sin of describing a solution ("XML-based") when this goal is about requirements - requirements first, then design the solution to match the requirements. (OK - I apologise to my last Project Manager who was brilliant at spotting where I'd put solutions into my requirements. Sometimes accidentally. Sometimes, ahem, deliberately. Sorry Barbara! Again!) In this instance I was too lazy to think of an equivalent term for non-XML solutions, and thus this bit got tangled up in the XML, GEDCOM, JSON, debate when it shouldn't have been touched by it.

OK - I think that gets us to a (proposed) corrected version ...
"BetterGEDCOM should define data relating to the study of genealogy. The definitions will be embodied in a data model. Their syntax will also be codified in a form appropriate to the technology chosen for BetterGEDCOM. <This is> so that any and all data can be transferred faithfully"

The IT-speak "codified in a form appropriate to the technology" means - IF it's XML we create DTDs or Schemas or.... Or if it's something else, then it's something else...

Next, I want to remind us that the definitions that I've just been talking about aren't fixed. GEDCOM is extensible, so BG also needs to be, or else we need to cover all of the world and all of time. Which is impossible. So I add "The definitions will be capable of extension by software companies and users", giving us:

"BetterGEDCOM should define data relating to the study of genealogy. The definitions will be embodied in a data model. Their syntax will also be codified in a form appropriate to the technology chosen for BetterGEDCOM. The definitions will be capable of extension by software companies and users. <This is> so that any and all data can be transferred faithfully."

Note that the bit "<This is>" is the point at which the definition splits for the purposes of this discussion (see Gene's highlighting at the start of this), so I'll split this response there as well.
AdrianB38 2011-01-04T13:51:17-08:00
Right - at the end of my last response, we had a corrected but still incomplete version of Goal 3 as:

"BetterGEDCOM should define data relating to the study of genealogy. The definitions will be embodied in a data model. Their syntax will also be codified in a form appropriate to the technology chosen for BetterGEDCOM. The definitions will be capable of extension by software companies and users. <This is> so that any and all data can be transferred faithfully."

So why are we doing this definition again? In the original form, it is "so that any and all <genealogy> data can be transferred faithfully."

Now, "any and all data" is an awfully big definition of the scope. As someone asked elsewhere - does this mean we should be able to import files of format equivalent to GEDCOM v1? (If there wasn't such a format, apologies). Well, if there are thousands of such files, and they don't have modern equivalents, we might well want to load them. But if there are only 2 such files left in the world, would we really want to spend the time on altering BG to allow their import when we know that the format was superseded for good reasons? I suggest no - print them out, extract the real information by eye and enter it afresh. It'll take less time.

So if it's not "any and all", what is it? I suggest it's "all current, common genealogical software". "Common" because we can't go chasing every app. "Current" because we can't guarantee that we can know anything about defunct versions of software.

Please note this is a _minimum_ coverage - if someone can point to application XYZ version N that was superseded 10y ago but is still in use and contains hundreds of vital files, then there's nothing to stop anyone _who knows about it_ helping us to ensure the data from XYZ vN can be represented in BG.

The original said "transferred faithfully" - I restricted this to "imported faithfully" because I can't see how we can faithfully export our new entity types to apps that don't have them. Anyone who wants to come up with a form of words to describe faithfully exporting a selection, is welcome to try. I suspect that comes as a given from faithful import but am willing to be proven wrong.

So purpose and scope is now "The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software."

Nearly there - I think we need a couple of clarifications on what a successful "faithful" might be. Clearly if we need to manually correct 50% of the import, it's not going to be viable. Hence, "with no material manual intervention". And since we are not totally in control here because we're relying on someone else's software, we need some weasel words to say that if the app doesn't work, it's not our fault, viz: "subject to the limits of the applications involved."

So, the 2nd part of the goal now reads:
"The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention, subject to the limits of the applications involved"

This gives us why we're doing this, what we're importing, the scope of the import and some sort of idea about success criteria.

So my corrected version now reads:
"BetterGEDCOM should define data relating to the study of genealogy. The definitions will be embodied in a data model. Their syntax will also be codified in a form appropriate to the technology chosen for BetterGEDCOM. The definitions will be capable of extension by software companies and users. The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention, subject to the limits of the applications involved."
AdrianB38 2011-01-05T08:32:57-08:00
Dumb question - did we agree half of this, and if so which half? And if we didn't, why is half in bold in Gene's 1st post and the other half not?
GeneJ 2011-01-05T10:49:25-08:00
@Adrian:

That isn't real clear, is it.

The whole of Goal 3 was tabled during the first developers meeting. A discussion topic was to be opened about part of that goal, specifically, discussion for the part above in bold.

This whole-tabled/part-for-discussion had to be clarified for me during the meeting (see the wiki Goals page, it shows up there in italics).

I didn't entirely follow (was still adding the comments about items being tabled) why the whole discussion topic was tabled, but only part was to be added for discussion.

Maybe others recall better.
GeneJ 2011-01-05T11:13:18-08:00
Why not shorter:
"The BetterGEDCOM data model will define data relating to the study of genealogy. The definitions will be capable of extension by software companies and users. The data model will allow faithful import of data from current, common genealogical software with no material manual intervention, subject to the limits of the applications involved."
AdrianB38 2011-01-05T11:36:02-08:00
I'm all for "adding less words".

However, in this instance, I think we would (in part) lose the bit about what our end product is because
(a) there's nothing about codifying the syntax (yes, this is the techie bit)
(b) the data model is only mentioned implicitly

And pedantically, the data model doesn't do anything - it's just a set of diagrams and words. It's the definitions themselves and how they are used that allow things.

What might be useful is in fact to split this into 2 goals - one about what BG is, the other about what the products are, viz:

Goal 3 (new)
- BetterGEDCOM should define data relating to the study of genealogy.
- The coverage of the types of genealogical data will allow faithful import of data from all current, common genealogical software with no material manual intervention, subject to the limits of the applications involved.

and then Goal X ...
BetterGEDCOM will produce:
- a data model describing the data in scope;
- a documented syntax for that data;
- a syntax codified in a form appropriate to the technology chosen for BetterGEDCOM;
- a means to allow extension of the definitions by software companies and users.
GeneJ 2011-01-05T11:43:02-08:00
@ Adrian

Goal 3 (new): I like your suggestion even better!

As to Goal X: Why wouldn't those matters be considered in Goal 2?
AdrianB38 2011-01-05T13:14:35-08:00
Goal X is close to Goal 2 but I'd like to keep it separate, both to avoid a single goal getting too big and because Goal X is about setting expectations on the products that we come up with, rather than physical requirements on BG.
brianjd 2011-01-07T21:48:18-08:00
For goal 3, why not simplify in clear language?

BetterGEDCOM will fully define data relating to the study of genealogy, in a logical data model. It will be extensible and will faithfully import data from current GEDCOM files with no material manual intervention.

"logical data model" having a very technical and specific definition to software designers. But fully defined should cover it. Goals aren't meant to be technical specifications. Which is the way Adrian is going. Let's leave the technical specs for the technical specs topic. Goals are 10,000 foot elevation views and technical specs are 12" views. ;')

Maybe even make it GEDCOM 5 or greater.

I don't see how we can claim to make our model workable for all the current common genealogical software, since they are largely incompatible with each other now. Are we planning on fixing everyone's proprietary extensions with this model? Surely, not, I hope. It would be my hope that those who have written the programs would write the conversions to the new model. The new standard should be able to pull in any GEDCOM field, in some manner. How much hand-tweaking is needed to put it where one wants becomes another issue.

I probably won't be at any of these meetings. Mondays kind of suck for me, plus, I've never really found a great way to turn speech from multiple speakers into text, and I doubt there is any closed-captioning at these meetings.
AdrianB38 2011-01-08T13:15:06-08:00
"Goals aren't meant to be technical specifications. Which is the way Adrian is going" Yes, I'm probably guilty of putting in too much detail in some of my suggestions for goals. Though I'd suggest they are more like specific objectives and not like technical specifications. In my defence I'm doing it because we don't appear to have a page for specific objectives of the BG "programme" and I don't want to lose my thoughts about that next level.

Does anyone think a split between high level goals and specific objectives would help? (I'd be sympathetic to that view)

Brian - re your concern "I don't see how we can claim to make our model workable for all the current common genealogical software, since they are largely incompatible with each other now." I first tried restricting the scope of BG to converting from GEDCOM 5.5 (then building on that) but the consensus was along the lines of (a) how could we tell anything was GC 5.5 compliant? and (b) why should we restrict it if we knew of a popular extension elsewhere?

My _personal_ belief is that while the data models of all the current common genealogical software are different, we can design a superset that can load almost all of that data. That doesn't mean we have exactly the same attribute attached to the same entity type. Far from it - conversion routines could move stuff around, split entities, etc.

Practically, we can only include proprietary extensions if someone comes to the party with the data and says "I can do this in XYZ - how does BG hold this data?" And yes, I think we should go out and look for users of XYZ.

So, perhaps naively, I would want to aim at making the model workable for all common stuff (NB - the word "common" is, of course, the first weasel word there!)

"It would be my hope that those who have written the programs would write the conversions to the new model." Absolutely - if they don't do it - I don't see who will. All the BG "programme" can do is give them somewhere to put the stuff.
GeneJ 2011-01-03T12:37:54-08:00
**DM GOAL 4: ... test suite of data that will allow software suppliers and users to assess compliance of software, diagnose issues and assist in their resolution
This topic was tabled at the First BetterGEDCOM Developers Meeting. Please continue the discussion below.

The BetterGEDCOM project should provide a test suite of data that will allow software suppliers and users to assess compliance of software, diagnose issues and assist in their resolution.
GeneJ 2011-01-07T22:47:46-08:00
Just wondering if we are getting ahead of ourselves by defining exactly what that suite of data will look like before we know what BetterGEDCOM will look like.

This goal was originally "how" compliance would be determined--would a program be "certified," or not.

Weren't we trying to improve the existing system, where nearly every program claims to be GEDCOM compatible yet, reading Tamura's blog, etc., we know that not to be the case?

Maybe we only need to agree now on that higher principle.
louiskessler 2011-01-08T11:04:54-08:00

Okay. Here's my proposal.

Let the BetterGEDCOM "consortium" (if I may call us that) come up with some draft format of BetterGEDCOM. I will then take my program Behold, and see if I can add the ability to export to our BetterGEDCOM draft format.

That will do a number of things:

First, since my program internally uses an "extended GEDCOM" data structure, which is probably a midpoint in the extremes of the data models other developers use, it will give us an idea of how difficult it will be for developers to transform their model structure to be able to export as BetterGEDCOM. I'll then be able to post what works easily and what is difficult here at the wiki and then decisions can be made as to whether to adjust BetterGEDCOM or not.

Second, my program is designed to read in GEDCOMs of varying types from various vendors. If I get it to export to BetterGEDCOM, then you'll be able to use it to easily create BetterGEDCOM examples.

Once we have a number of BetterGEDCOM examples we like, I can then write the input routines for BetterGEDCOM and comment to the wiki on the problems with that, and further adjustments can be made.

In the meantime, you'll all be able to start playing with the BetterGEDCOM files that Behold creates and I think we'll be able to advance our progress rapidly.

I could see this taking several iterations, but it will probably converge rapidly to an agreed result. Being able to relate that this one developer's work to support BetterGEDCOM was not very difficult, together with the realization by other developers that there are advantages to supporting BG as well as GEDCOM, might give it the push it needs to get on the road to wide adoption.

Behold will be available free to anyone who will want to use it for BetterGEDCOM development.

In so doing, I would stay out of the direct decisions as to what to put into BetterGEDCOM, but will concentrate on giving real feedback as to what works and what is too difficult and developers won't adopt.

And you don't lose too much in my staying out of the direct decisions. My expertise is in the understanding of the existing GEDCOM standards. I really have not looked into the proposed standards because I haven't had to. I've never really thought about the evidence/conclusion model because no GEDCOMs currently contain them.

Tom and I "had it out" and I just wanted to make my point that the existing GEDCOM has a lot of thought and good concepts in it that should not be just thrown away unless there is a reason for it. But I do accept his view and the view of most of the BetterGEDCOM participants that a new better model is needed. I will let you go at it, and I'll do my best to help aid and promote the implementation phase with my above proposal.

Don't worry. During all the discussion, I'll be watching and I'll throw my 2 cents worth in from time to time, but I'd sooner all of you decide on what the future model will look like.

How's that?

Louis
GeneJ 2011-01-08T13:43:11-08:00
Louis,

Among us, it's likely you best understand the potential of existing GEDCOM.

As a user, I love the benefits above, but know we will fall short if you step back.

As far as the data model suite is concerned, and quite separate from your proposal, we may need to also have a suite based on a real genealogy (a la, perhaps, clips of genealogies from the Register or Quarterly), mocked up for an array of family circumstances, stages of proof, and variety in source materials, so that we are not only representing "data" but also how genealogy translates to data.

Thinking out loud.
ttwetmore 2011-01-08T18:41:23-08:00
Louis,

I think it is a great idea for you (or anyone else willing) to provide software help for the Better GEDCOM effort. Let me try to understand what you are offering.

We begin to work on the BG model, deciding on basic record types, substructures within records, tags to use, and so on. Let's say we also decide on a final syntax, which let's assume is not GEDCOM (though why rule it out?), likely XML. And let's also assume that the underlying Better GEDCOM model is a true extension of GEDCOM's, not some crazy thing like the GenTech model. For sake of this example, say it extends GEDCOM by adding EVENT and PLACE records (let's not worry about major changes to INDI or FAM records yet).

Okay, it sounds like you are saying that you will do two things with Behold. First you will extend Behold so it can read special GEDCOM files, that is, GEDCOM files that really hold our new Better GEDCOM records, but we write those records using GEDCOM syntax (instead of, say, XML syntax). Is that right? And since Behold reads extended GEDCOM anyway, this wouldn't be too hard to add. Then second, you would write another piece that could write out these new special "Better GEDCOM GEDCOM" records in the final Better GEDCOM syntax. So if the Better GEDCOM syntax were XML then Behold would be acting like a GEDCOM to XML syntax converter. How far off am I?

My main question in this scenario would be, what's the advantage of writing out the Better GEDCOM example records in GEDCOM first, so they can be translated by Behold into the final format, rather than just writing them in the final syntax directly? In trying to answer my question, one advantage of your approach that I see is that we could take an already existing GEDCOM file that is fairly small but with all kinds of good stuff in it. We would then, by hand, extend that GEDCOM file to add some Better GEDCOM model records, but we would do that using GEDCOM syntax rather than the final syntax. Then Behold would translate these files to "pure Better GEDCOM," whatever the heck that is.

I'll stop at this point, because if I have your basic idea wrong it would be better to know before I say anything else.

(Tech speak alarm on.) I wanted to mention an algorithm that I use in converting GEDCOM to DeadEnds, that I think is a little novel and really quite interesting.

First, once I read either GEDCOM records or DeadEnds records into memory, the internal representation is the same and very simple, just a tree structure of node objects. If you are familiar with the DOM notion, this internal structure is basically a simple DOM structure. To convert a GEDCOM record to DeadEnds here's what I do.

1. Create a copy of the GEDCOM internal tree and create an empty DeadEnds internal tree.
2. Recurse through that GEDCOM copy tree, searching for all sub-tree structures that are known to have DeadEnds equivalents. As those sub-trees are found, remove them from the copy and add the DeadEnds equivalents to the DeadEnds internal tree.
3. When recursion is complete, there may be a skeleton of nodes left in the GEDCOM copy. This skeleton represents all the parts of the original GEDCOM records that, as of yet, have no DeadEnds equivalents.

I use this algorithm to convert more and more complex GEDCOM records to DeadEnds, and whenever I find a common skeleton pattern left over I then figure out the next major addition in doing the conversion. It is the algorithm of "subtracting substructures out of trees while leaving in place all the other nodes" that is so particularly interesting. (Tech speak alarm off.)
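For readers who want the "subtracting substructures out of trees" idea in concrete form, here is a minimal sketch. Everything in it is invented for illustration: the `Node` class, the `KNOWN_EQUIVALENTS` table, and the tag names are placeholders, not Tom's actual DeadEnds code. (Tom copies the GEDCOM tree first; this sketch mutates whatever tree is passed in.)

```python
# Hypothetical sketch of the "subtract substructures" conversion idea.
# Node and the GEDCOM-to-DeadEnds mapping table are invented names.

class Node:
    """A simple DOM-like tree node: a tag, an optional value, child nodes."""
    def __init__(self, tag, value=None, children=None):
        self.tag = tag
        self.value = value
        self.children = children if children is not None else []

# Sub-tree shapes we know how to convert. Here a "shape" is just a tag
# path; a real converter would inspect deeper structure than this.
KNOWN_EQUIVALENTS = {
    ("NAME",): lambda n: Node("PersonName", n.value),
    ("BIRT", "DATE"): lambda n: Node("BirthDate", n.value),
}

def subtract(node, path, out):
    """Recurse through a (copied) GEDCOM tree; move recognised sub-trees
    into the output tree, leaving the unrecognised skeleton behind."""
    remaining = []
    for child in node.children:
        child_path = path + (child.tag,)
        if child_path in KNOWN_EQUIVALENTS:
            # Known shape: convert it and drop it from the GEDCOM copy.
            out.children.append(KNOWN_EQUIVALENTS[child_path](child))
        else:
            subtract(child, child_path, out)
            # Keep the child only if unconverted material remains under it.
            if child.children or child.value is not None:
                remaining.append(child)
    node.children = remaining
    return node  # the leftover skeleton
```

After a run, whatever is left hanging off the returned skeleton is exactly the data with no DeadEnds equivalent yet, which is what makes the leftover patterns useful for deciding the next conversion rule to write.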

I want to thank you for the offer of a copy of Behold, but unfortunately (well, as far as I am concerned, fortunately would be the better word) when I retired I decided I would BE RID OF THE MICROSOFT WORLD FFOORREEVVEERR!!!! I can only run Mac software here at my DeadEnds establishment, that establishment being the tiny bedroom vacated by my youngest son many years ago!

I've read your Behold information now, and I think you have a very worthy goal of software that doesn't require special forms for editing. I agree with your philosophy, and that is one of the reasons that the only way a user can edit data in the LifeLines program is to EDIT THE GEDCOM RECORD DIRECTLY with a simple editor. Of course this is also why LifeLines would never actually be used by anyone sane! By the way, John Nairn's program GEDditCOM also allows users to edit at the GEDCOM level, though I think he can hide that a little from the users. What's interesting with your Behold approach is that I'd say you too are having the user edit directly at the GEDCOM level, but you are hiding that fact from them by making it look like they are just editing text.

I've thought about doing that in the past, but always chickened out, thinking it would be way too hard. One of my DeadEnds programs, called GedcomViewer, has been sitting around for a couple of years waiting for me to try to figure out how to edit the GEDCOM in a way that would be even simpler than you are planning to do with Behold, and I've gotten nowhere with it, so you must be doing some pretty clever stuff. (I was going to try to implement something akin to an XML editor, which wouldn't be anywhere near as clean as what you are trying to do.)

If anything could tempt me to run Windows again, trying out Behold would be it, but at this point I am still an adamant Mac bigot. Or maybe I could do your Mac port! Smiley.

Tom Wetmore
louiskessler 2011-01-08T19:31:51-08:00

Tom,

Yes that's right.

Step 1: read GEDCOM (possibly with extensions for BG) and output BetterGEDCOM.

Step 2: Read BetterGEDCOM

I hadn't thought of adding extensions to the GEDCOM input to allow for BG, but that can be done, and I could then get Behold to display those extensions. However, those GEDCOM files will at first have to be manually edited to place those extensions in somehow.

But adding those extensions into the GEDCOM input files may be a waste of effort. Maybe we should just make sure normal GEDCOMs transfer first. We'll end up with a nice BG library we can use which will at least ensure that BG can handle all the GEDCOMs out there.

Then for testing the extensions, we should take the BetterGEDCOM library, and at first, manually add the extensions there.

Behold is currently only a browser. The editing part of Behold is still a ways away. I'm hoping within a year from now, but that's the eternal programmer optimistic expectation-of-everything-going-perfectly-with-no-problems-along-the-way philosophy. However, once I do add that, then it will edit all the data, including any BetterGEDCOM extensions.

Your DeadEnds input is very strange to me. I simply parse the GEDCOM, make minimal changes to the input internally and store extended GEDCOM records at the record level and create internal reverse indexes. I don't create a tree. I have a flat structure that is lineage-linked via the INDI and FAM connections. I traverse the links as needed to create the report.

There is Windows emulation software for the Mac: Parallels or GuestPC, or Behold will run under Wine.

And there has been talk that Delphi (the language I am using) will be extended to produce executables for Unix and the Mac. So maybe in the future I will make a native Mac version.
testuser42 2011-01-09T04:28:11-08:00
Louis,
I believe this is a very helpful offer, thank you! But please continue giving your input while the first draft of a BG specification is hammered out. Your early input will be very helpful, too!
testuser42 2011-01-09T04:34:55-08:00
Gene, you wrote

..."we may need to also have a suite based on a real genealogy (ala, perhaps clips of genealogies from Register or Quarterly), mocked up for array of family circumstance, stages of proof, and variety in source materials, so that we are not only representing "data" but also how genealogy translates to data."

YES! That's exactly what I was thinking about when I set up the "test suite" page. We should start with real problems and see how they are translated into a BetterGedcom that really captures all of the process.
mstransky 2011-01-09T04:56:53-08:00
(set up the "test suite" page)
(how they are translated into a BetterGedcom)
(that really captures all of the process)

I am on board, and happy that what I meant a while back is now said and understood: a block of test data which can be thrown back and forth between us as things are captured and added from real-life example data.

This way, when other platforms export new data segments that I don't support, I could import and store them (in use, or as parked data) and, when ready, return them in BG format for someone else to bounce off of.
testuser42 2011-01-09T05:15:30-08:00
Mike, yeah, I remember you asking for that a while ago. So, let's try and get a few examples, shall we? Put them on the test-suite page (http://bettergedcom.wikispaces.com/BetterGEDCOM+test+suite) or a subpage there.
I will try and write a bit about some person where I remember some difficulty finding the date of birth, because of different sources having different dates. It took a while to untangle and make a reasonable decision. It might take a while to collect all the info on that "case".
gthorud 2011-01-09T22:45:45-08:00
I don't understand the benefit of developing the conformance testing before we have nearly finished BG. I wonder what the benefits of a conversion exercise with Behold would be, since most of the implications for programs are in processing/display/reports and not that much in conversion - and we don't know much about what BG will look like yet. Let us first see if we can make a first version of a BG document, and keep a high level Goal 4 as it is - without detailing it.

That said, I think testing will have to look at the user interface - not just files.

And, I appreciate Louis's contributions to the discussions too much - I don't want to see him disappearing into program development.

Geir
... who is still recovering from a virus of the season while on immuno-suppressive light, so I don't write much at the moment.
ttwetmore 2011-01-10T03:09:17-08:00
Geir,

My opinion. The test suite data at this point is not for potential application developers. It is for us so we can touch, taste, and feel what the new Better GEDCOM will be. It's a bunch of cases we make up and then use as running examples during Better GEDCOM development.

Tom Wetmore
gthorud 2011-01-10T19:42:54-08:00
Tom,

I may be wrong, but I think at the moment a text editor would be the best tool for creating BG-files. I would in general wait until BG is much more developed.

BUT, if we are talking prototyping rather than conformance testing, there is one area where a prototype would be useful and that is in the evidence-conclusion area. The challenge here is to develop a user interface that is as efficient as possible and supports both the conclusion user and the more advanced users. But then, this is application internal.

There may be other similar complex areas ...?
louiskessler 2011-01-03T18:45:50-08:00

I have added the following on the Goals page under "Discussions About Current Goals":

Goal 4: Test suite of data.
This would include GEDCOM and BetterGEDCOM files for testing:

(A) Proper Translation of GEDCOM to BetterGEDCOM via:

(1) Input GEDCOM, (2) Save to Database, (3) Retrieve from Database, (4) Export to BetterGEDCOM

For this test, the input file will be provided. The exported file is to be compared to a BetterGEDCOM file that will be provided.

(B) Proper Understanding and Flowthrough of all BetterGEDCOM information via:

(1) Input BetterGEDCOM, (2) Save to Database, (3) Retrieve from Database, (4) Export to BetterGEDCOM

For this test, the input file will be provided. The exported file is to be compared to the input file.
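Both tests (A) and (B) boil down to "compare the exported file against a reference file". As a sketch of what that comparison might look like (the function names and the idea of ignoring trailing whitespace and blank lines are my assumptions, not a defined part of any BG test suite):

```python
# Hypothetical sketch of the file comparison in tests (A) and (B):
# check an application's export against a provided reference file,
# ignoring insignificant whitespace and line-ending differences.

def normalize(text):
    """Strip trailing whitespace and drop blank lines, so that only
    content differences remain."""
    lines = [line.rstrip() for line in text.splitlines()]
    return [line for line in lines if line]

def compare_export(exported_text, reference_text):
    """Return a list of (line_no, exported, reference) mismatches;
    an empty list means the export matches the reference."""
    exported = normalize(exported_text)
    reference = normalize(reference_text)
    mismatches = []
    for i in range(max(len(exported), len(reference))):
        got = exported[i] if i < len(exported) else "<missing>"
        want = reference[i] if i < len(reference) else "<missing>"
        if got != want:
            mismatches.append((i + 1, got, want))
    return mismatches
```

A real conformance checker would need to be smarter than a line-by-line diff (record order, for instance, may not be significant), but even this crude form shows the shape of the test: provided input in, exported file out, mechanical comparison against a reference.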
AdrianB38 2011-01-05T08:47:10-08:00
Actually, I have to say that Test B is deficient as it stands. In the extreme and _very_ silly case, suppose the software just loaded all imported BG into a text area on import, then on export, copied it from the text area to the exported BG file. Result is 100% match but zero useful functionality!

Yes - that's a very silly case - but one could envisage cases where BG lines that were not recognised were indeed dumped to text, with a line number linking them back to a relative position in the data. If the stuff were then exported back out, the software might manage to stick the text data into the output file at exactly the right point.

Slightly more plausible. Maybe.

Round-tripping as a test only really makes sense in this case if one starts with a database in the app, exports to BG, reimports into an empty database, and finally compares the two databases. And it's difficult to see where BG could help with that.

But we surely owe it to people to provide sample BG files...

Perhaps a more useful test would be to
(1) Input the provided BetterGEDCOM to the app,
(2) Save to app's Database
(3) Export to GEDCOM;
(4) Reimport same GEDCOM to empty app's database
(5) Export to BG;
(6) Compare the 2 BG files - the differences should all be down to the differences between GEDCOM and BG.... Possibly quite large!
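The six steps above can be sketched as a test driver. Since no application interface exists yet, the app's import/export operations are passed in as functions; `app_import`, `app_export_gedcom`, and the rest are placeholder names, not anything any program actually provides:

```python
# A sketch of the six-step round trip as a test driver. The application
# operations are hypothetical callables supplied by whoever runs the test.
import difflib

def round_trip_test(bg_input, app_import, app_export_gedcom,
                    app_import_gedcom, app_export_bg):
    """Run steps (1)-(6): load the provided BG file, export GEDCOM,
    reimport it into an empty database, export BG again, then diff the
    provided and re-exported BG files. The differences that remain
    should reflect only what GEDCOM cannot carry."""
    db1 = app_import(bg_input)          # (1)+(2) input BG, save to database
    gedcom = app_export_gedcom(db1)     # (3) export to GEDCOM
    db2 = app_import_gedcom(gedcom)     # (4) reimport into empty database
    bg_output = app_export_bg(db2)      # (5) export to BG
    # (6) compare the two BG files
    diff = difflib.unified_diff(bg_input.splitlines(),
                                bg_output.splitlines(), lineterm="")
    return list(diff)
```

With identity functions for each step, the diff is empty; with a lossy GEDCOM export, the diff shows exactly what was lost between GEDCOM and BG, which is the point of Adrian's version of the test.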
ttwetmore 2011-01-05T09:13:59-08:00
Adrian,

You've captured that issue very well! I was going to bring up something similar during the meeting, but decided it would be too teckky and could lead down a useless tangent. So I'm very glad you caught the issue and decided to bring it up.

I think your solution is a good one. I was also trying to think up a solution, and I had another idea, though I certainly don't think it's better than yours.

My idea was to insist that the applications importing the BG test files have enough of a user interface that we can check that all the data in the BG file has been fully read and fully distributed into the right areas of the application's database. That is we can see the people on the person screen, we can see the events somehow, we can see the places and sources somehow, that sort of thing. The reason I don't think this is such a good idea is I don't believe application developers would believe we should have the power to insist on such a thing. Though, in fact, they might be quite proud to be able to demonstrate such a thing.

I think this is still a bit of a conundrum.

Tom Wetmore
AdrianB38 2011-01-05T11:17:14-08:00
Your idea is sensible but it's the protocols associated with the "we can check" bit that concern me. Unless we have adepts in all those bits of software, who can trial beta versions, then we're in no position to check how an app works. And even if we were in such a position, politically it would worry the heck out of me for our "programme" to play such a role.
testuser42 2011-01-07T16:27:48-08:00
I made a page for the topic of a test suite.
http://bettergedcom.wikispaces.com/BetterGEDCOM+test+suite
Please feel free to edit or add, or correct me if I'm wrong...
brianjd 2011-01-07T22:03:31-08:00
Wow.

Ok, how about this for goal 4.

Goal 4: Test suite of data. BetterGEDCOM will provide:
(A) For importing into BetterGEDCOM, a Sample Input GEDCOM and a Sample Compliant Output BetterGEDCOM.

(B) For Importing BetterGEDCOM files a SAMPLE Compliant Input BetterGEDCOM file.

[Optionally] (C) An interface, which will test the compliance of a software produced BetterGEDCOM. This can be the output from the provided sample GEDCOM.

It is to be hoped that makers of software know how to test their own software for compliance. I know this is asking a lot and making assumptions that should not be made. But we have to learn that there is only so much that we can actually control, and people are notoriously hard to control. Especially those that tend to be creative and intellectual with frequent serious issues with authority figures.
hrworth 2011-01-03T13:44:15-08:00
Goal 1 Discussion
Goal 1:

BetterGEDCOM will be a file format for archiving and exchange of genealogical data. [Developers Mtg 3 Jan 2011 status (approved)]

I understand that it was agreed that this goal was approved. I voted against this Goal because of one word. The word is ARCHIVING.

The main reason that I am against that word is that it is very misleading to the common End User.

The BetterGEDCOM goals are becoming clear. But with this word in this Goal, I would assume that EVERYTHING in my file, using whatever program / application I am using, would be Archived.

But, the real purpose of the BetterGEDCOM project is for transporting of information between two end users.

First, IF I choose to only exchange certain portions of my research with another researcher and, for an example, do NOT wish to include any images, then this would NOT be an Archive of my "genealogical data".

Secondly, I think that the most appropriate Archive for my genealogical data should be provided by my software program. However that archive or back up would be. It would be EVERYTHING in my data, not just a piece of it.

In the past, many users have 'backed up their file' using a GEDCOM only to find that no images were included.

It has been stated that a BetterGEDCOM can / will include media, which is wonderful, but the concern is that IF the end user doesn't read or understand the fine print, then when they see Archive they will think everything - when what was "archived" in a BetterGEDCOM file may NOT be everything, only that information that has been shared.

I respectfully submit that the word "archiving and" be removed from this Goal.

The Development of Archiving tools belong with the application and not the transport of that information.

That does NOT mean that the information that was shared isn't "archived"; I am just trying to avoid giving end users the notion that a GEDCOM is a good BACK UP / Archive of information.

Thank you,

Russ
ttwetmore 2011-01-04T01:54:03-08:00
A number of goals for the Better GEDCOM (BG) project are listed in the Goals section of the BG wiki. In my opinion they are all intended to help meet two overarching goals:

1. To allow genealogists to archive all their data using a BG file format that accommodates ALL their information – source information, evidence information, conclusion information and anything else useful.
2. To allow genealogists to share/exchange/transport their data with/between one another by using genealogical software applications that can both fully import the information in BG files, and can fully export the contents of their own databases to a BG file.

For the second goal to be reached two conditions must be met:

1. BG compliant programs must correctly represent all information they import from any BG file (and they must import all of it). This requires the program’s data model to be rich enough to accept all information from BG files.
2. BG files must be able to hold all information held in a BG compliant program’s database. This requires the BG format to be rich enough to hold all information from that database.

These are interesting conditions as the first is a requirement on applications to do at least so much work, and the second is a requirement on applications to do no more than so much work. Frankly, BG would at best be able to require the first, because the second would require applications to not put in any of their own added value. That is impossible. We sweep it under the rug by assuming that the BG model will be so complete and compelling that no vendor would ever want to use a different one. And I'm afraid that's the best we will ever be able to do. The only hope for BG to be a success is to base our file format on a model that is so complete that it will contain everything of importance to us and to applications developers no matter what any idiosyncratic application developers decide to do.

The only hope for BG to be a success is to create this compelling and complete model. Everything else is fluff. Even deciding whether the ultimate BG file format will be GEDCOM syntax based or XML syntax based is fluff.

Tom Wetmore
Andy_Hatchett 2011-01-04T01:56:52-08:00
Here is my take on the "Archiving"...

Suppose that instead of each program having its own backup format- which most other programs can't handle- each program used BetterGEDCOM as their backup format.

Then Users could exchange backup files without worrying if they were in a format their program would accept.

And no- I don't think most end users would be misled. Let's face it, most end users have no real concept of what goes on in a program, all they want to do is input data and get output. They really don't care how it is done.

BetterGEDCOM as a standard backup format for all programs would be a HUGE win.
ttwetmore 2011-01-04T02:03:41-08:00
Andy,

Exactly.

Tom W.
AdrianB38 2011-01-04T06:23:47-08:00
"BetterGEDCOM as a standard backup format for all programs would be a HUGE win"

I'd certainly agree that the idea of all genealogy apps putting out BetterGEDCOM files for _archiving_ would be great. Oh - see what I did with the words there?

I don't want us to use the word back-up in this context because back-ups should be done by back-up software, onto separate media. And that's not what we're talking about here because what I _think_ we're talking about is the regular / irregular / occasional copying of all genealogy data from an app into a file (or set of files), which are then saved and stored in case we need to look at something from several months back when we realise we entered the data against the wrong John Smith and what did it look like beforehand? Which is what I call archiving but if anyone has a better name please shout.

So - I agree with Andy - but please don't use the word "backup" for it.
AdrianB38 2011-01-04T06:54:42-08:00
Re Russ' dislike of the term "archiving":

We need to be as clear as possible with our goals and if we find that "archiving" is dragging in connotations of "back-up", then we need to review its use.

How about using the term "storage" instead?

Russ suggests simply removing "archiving" and says that will cover the future equivalent of GEDCOM submissions to genealogy societies (thanks for that example, Gene). I submit that it will not cover it at all, because only having "exchange" loses the emphasis of the time-scales.

The word "exchange" to me implies something transient and only transient. The (example) phrase "exchange my GEDCOM data with a Gen Soc" has an implied time-scale to me that starts with the export from my software, covers burning the DVD (or whatever), mailing it, and the Gen Soc picking it up, perhaps cataloguing it and putting it into store. The "exchange" then stops. It's taken - what, a couple of weeks at most? most of which is after the receipt at the Gen Soc. Indeed, many people, reading the phrase "exchange my GEDCOM data with a Gen Soc", might consider the exchange actually stops at the point that the data hits the Gen Soc's mail box.

We need to make it very _explicit_ that the time-scales we are talking about cover the long-term storage of the data for the next 10, 20, 30 years...

I would therefore suggest that we consider rephrasing Goal 1 to read:
"BetterGEDCOM will be a file format for exchange and long-term storage of genealogical data"

Note 1 - I have reversed the order of "exchange" and "storage" for clarity because that's the order they happen in.

Note 2 - by making "long-term storage" explicit, it makes us think slightly differently. Well, it makes me think differently at least. For a start, while we might deprecate certain BetterGEDCOM "tags" in future, to show there are better ways of storing the data, it does mean that we should _never_ remove any such tags since if we did, the oldest BetterGEDCOM files would therefore no longer be 100% meaningful when compared against a contemporary BG standard. If we only use the term "exchange" with its implications of transitory time-scales, then we might not think through the impact of a change on ancient BG files.

Please do _not_ say "Yes, but we know that people store the files so we don't need to say it". We need to be _explicit_ about what we want to happen.

In summary, I would be happy if we replace "archiving" by "storing" but cannot accept that we just have "exchange" with no time-scales.
AdrianB38 2011-01-04T07:14:14-08:00
And two more points re storage / archiving that I need to mention:

1. Re long-term storage and the viability of BG data after 30y: We need to make it totally clear that the BG "project" can only influence the pure BG data format aspect. We _cannot_ ensure that the Gen Soc above can read my physical media (5.25 inch floppies anyone?). We cannot ensure that the Gen Soc can read any linked multimedia within the package on that DVD (etc) of BG data, images, transcripts, etc. (Wordstar files anyone?). The fact that we cannot ensure those things is no reason to give up on the bit that we can influence - the BG data and format itself.

By all means put those caveats in the final report - I don't think they need to be in the goals. Though perhaps I should follow my own advice of being explicit!

2. As Tom and Andy indicate, no-one can rely on any genealogy app being around in 20y time. You may have saved the installation media and copied it to a brand new CD every year - but what happens when that software won't install any more because it's not digitally signed through quantum entanglement using the Heisenberg compensator algorithm? (And yes, that was meant to be Trekkie speak)

Whereas, if you have a BG archive / extract, whatever, then even if you can't read the .JPG images, and even if the world has moved on to EvenBetterGEDCOM, the essential data is in plain text and therefore still legible.

Hence, BG is a much better means of storing your genealogy data than a proprietary program and its proprietary format file.
(Always providing it's not on a 5.25 inch floppy.)
gthorud 2011-01-04T11:13:43-08:00
I tend to agree with what Tom and Adrian have written. To me there is no big difference between long term storage and archiving, but long term storage makes it very clear what's supposed to happen - so it may be a better word. (I hate terminology discussions.)

The important thing in this discussion is how the wording may affect the content of the standard, e.g. keep old tags as Adrian suggests, but that is a detail that we will discuss in 10-15 years time. Tom's discussion about "import all" and "export all" is also important, but it does not change the wording.
ttwetmore 2011-01-04T11:42:31-08:00
For me archive and long-term storage are almost synonymous. Personally I think archive is the better term as it has all the connotations we want to convey, of which long-term storage is one.

However, if long-term storage is more palatable I can go with it.

I also dislike these long discussions about terminology; they do nothing but deflect efforts from where they are needed. If BG comes to fruition it will be used for archiving.

Tom Wetmore
Andy_Hatchett 2011-01-04T11:59:17-08:00
To me the word archiving denotes proper care in preserving something whereas long term storage denotes putting something away out of the way (and probably never to be looked at again).
AdrianB38 2011-01-04T14:12:28-08:00
"To me the word archiving denotes proper care in preserving something"

And perhaps since we cannot guarantee the physical media nor the readability of proprietary attachments, then we should just refer to long-term storage.

(And yes, probably 99% is never looked at again - but we never know which the 1% is)
GeneJ 2011-01-03T14:20:08-08:00
Not disagreeing with you; however, I'm sure some societies, perhaps even libraries, accept GEDCOM submissions.
hrworth 2011-01-03T14:34:13-08:00
GeneJ,

We are not talking about GEDCOM submissions. We are talking about the term Archive and the potential misinterpretation of what that means. The Goal, minus 'archiving and', still meets your "accept GEDCOM submissions" example, because the term Exchange is in the Goal. I can Exchange information with a society. THEY may consider it an archive.

But, I am a simple end user. I do not think that "archive and" should be part of this goal. We are about Exchanging our Genealogical Information.

I would probably NOT Exchange everything in my file. So, if I thought that when I did that, I would be Archiving my Genealogical Data, it would be WRONG. I would have a COPY of the data I shared.

I continue to think that Archiving is misleading.
ttwetmore 2011-01-03T15:38:04-08:00
Russ,

I don't believe you have the right to claim that end users will find the term archiving misleading, and since your argument revolves around that premise, I find it unconvincing. You claim archiving is a misleading term. I disagree 100 percent. To archive something means to put it in a format where it is complete and whole, where it will last for a long time (longer than the life time of the current batch of genealogical programs), and where you can retrieve the information it contains when you need it. A Better GEDCOM file holding your genealogical data is the best backup system you could find. Save that file on a backup service or on your own external hard drives and you'll never have to worry about data loss again.

Your argument that you want your genealogical application to be the agent that archives your data is, in my opinion, a mistake. It makes you wholly dependent upon that application, which will have, almost definitely, a proprietary format, and not be around twenty years from now. I thought one purpose of Better GEDCOM was to get away from dependence on software application.

You say, "But, the real purpose of the BetterGEDCOM project is for transporting of information between two end users." I beg to differ. The real purpose of BetterGEDCOM is to meet the goals that the BetterGEDCOM team set for it. If we have agreed upon a goal, it is pretty silly to then tell us that that is not our goal. I certainly agree that transporting data is a main goal of Better GEDCOM, but providing a non-proprietary, complete specification for storing/archiving/backing up genealogical data is a win win win every way you look at it. What's a little silly about this whole argument is that no matter what we do we get the archive for free; we can't avoid it. Taking the word out of the goal changes nothing except making us look like we haven't thought it all the way through yet.

You also say, "That does NOT mean that the information that was shared isn't "archived", but I am just trying to avoid giving end users the notion that a GEDCOM is a good BACK UP / Archive of information." Again you are claiming to speak for the voiceless "end user." And again I differ 100 percent. First, no one claimed that GEDCOM is a good backup of genealogical data (though IT IS the single most important backup method in the genealogical world today). What we should be thinking about is whether a Better GEDCOM file would be a good back up of genealogical information. That was a win win no brainer in the last paragraph, and it's still a win win no brainer in this paragraph. If Better GEDCOM does become the lingua franca for exchanging data between genealogical programs, and if it can hold all the data that is kept by those programs (which should be one of our goals -- it was one of the goals we tabled today because we know it would be a hard one to thrash through) then Better GEDCOM is the best backup and archive mechanism anyone could ever hope for. Such a file holds everything and is instantly readable by every BG compliant program. Wouldn't you rather have an archive that can be read by every application (and by real end users) than an archive that can only be read by your chosen application?

I reject your argumentative technique of making up imaginary end users, then claiming to know what they are thinking, and then using their imaginary thoughts to justify making important decisions. If you don't understand something, fine, but don't then assume that a whole cadre of others wouldn't understand. Decisions should be made based on what words really mean and what genealogists really need, and not on how imaginary persons might react to hearing certain words. I have been an end user of genealogical programs for a long time and I have almost no thoughts in common with your imaginary users.

Sorry to be so frank and cranky, but these endless discussions on obvious and relatively unimportant things get galling, when the real issues of proper data models and their contents await work.

Tom Wetmore
louiskessler 2011-01-03T15:40:22-08:00

BetterGEDCOM will be a file format for archiving and exchange of genealogical data. [Developers Mtg 3 Jan 2011 status (approved)]

I voted against this goal as well, but not for the same reasons as Russ.

I just don't think this sounds right as the "definition" of BetterGEDCOM. I can't really pin down why, but it's just not right.

What is it we really want it to do?

Transport and Storage? Pretty dry stuff. If that's all we really intend it for, then let's simply convert GEDCOM to XML and just add the bits that are missing.
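[Editorial aside: the "convert GEDCOM to XML" step Louis mentions is mechanically straightforward because GEDCOM's level numbers already encode nesting. A minimal sketch follows; the GEDCOM tags (INDI, NAME, BIRT, DATE) are standard, but the resulting XML shape is purely illustrative, not a proposed BetterGEDCOM syntax.]

```python
# Sketch: turn GEDCOM level-number lines into nested XML.
# The XML element names are illustrative only, not a proposed standard.
import xml.etree.ElementTree as ET

def gedcom_to_xml(lines):
    root = ET.Element("gedcom")
    stack = [(root, -1)]                      # (element, level) pairs
    for line in lines:
        parts = line.strip().split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):          # optional @X@ cross-reference id
            xref, tag, value = parts[1], parts[2], None
        else:
            xref = None
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else None
        while stack[-1][1] >= level:          # pop back up to the parent level
            stack.pop()
        elem = ET.SubElement(stack[-1][0], tag.lower())
        if xref:
            elem.set("id", xref.strip("@"))
        if value:
            elem.text = value
        stack.append((elem, level))
    return root

sample = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 12 JAN 1850",
]
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

Running this nests the BIRT/DATE structure under the individual exactly as the level numbers dictate; the "bits that are missing" Louis refers to would then be added on top of such a baseline.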

There's got to be something sparking this initiative that is saying BetterGEDCOM is something more than GEDCOM was.

What is that something more?
louiskessler 2011-01-03T15:53:40-08:00

Okay, here's an example of what I'm thinking of from an Operational Data Model for clinical data. Same purposes as ours, but for clinical data: www.cdisc.org

Paraphrasing from their site, I'd say:

"BetterGEDCOM is a standard to support the acquisition, exchange, submission and archive of genealogy data and metadata."

"BetterGEDCOM is designed to facilitate the archive and interchange of the metadata and data for genealogy."

Something like one of those is better in my opinion.

I do note they talk about metadata as well as data. We may want to, or may not need to.
TamuraJones 2011-01-03T20:43:01-08:00
Methinks there are four scenarios to consider:

1. data exchange between applications
2. data exchange between users
3. public archive in open format (no private stuff)
4. private full archive in open format

Backup is not the same as archive and an independent issue; you want backup of all your files, including your archives. Especially your archives.
ttwetmore 2011-01-04T00:27:43-08:00
Louis,

I wouldn't want that statement as the definition of BG either. But that isn't the definition, is it? Isn't that statement one of BG's goals?

Tom Wetmore
ttwetmore 2011-01-04T01:03:36-08:00
"1. data exchange between applications
2. data exchange between users
3. public archive in open format (no private stuff)
4. private full archive in open format"

I don't understand the importance of the distinction between points 1 & 2. I'd just say data exchange between genealogists and leave it at that. Try to explain the difference in a short sentence and I think you'll see what I mean.

Points 3 & 4 reflect the fact that there are different reasons to extract and/or store genealogical data in BG format. When you export your data you can export all of it or just some of it. This is based on why you are exporting. If you are exporting to share you would probably limit the data to just the information the person you are sharing it with desires, and/or you might limit it for privacy reasons. If you are exporting your data for the purpose of archiving or making a backup you want to export all of it.
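[Editorial aside: Tom's distinction between a full archive export and a trimmed sharing export can be sketched as a simple filter. The record structure and the `private` flag below are invented for illustration; neither is a real BetterGEDCOM structure.]

```python
# Illustrative only: "private" marks data the user would withhold when sharing.
def export(records, purpose="share"):
    if purpose == "archive":
        return list(records)                              # archive: keep everything
    return [r for r in records if not r.get("private")]   # share: drop private data

records = [
    {"name": "John Smith", "private": False},
    {"name": "Living Person", "private": True},
]
archive_copy = export(records, "archive")   # full export, suitable for archiving
shared_copy = export(records, "share")      # trimmed export, suitable for sharing
```

The point of the sketch is Russ's earlier worry in miniature: `shared_copy` is a valid file in the same format as `archive_copy`, but only the latter is a complete archive.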

"Backup is not the same as archive and an independent issue; you want backup of all your files, including your archives. Especially your archives"

The main reason we create archive files, IMHO, is to back-up the data in them. We create the archive file, and then let MOZY or CrashPlan or whatever on-line service we use back up the archive file at our next scheduled backup time. We also create archives/backups every few days (or at least we should) as a safety measure against our genealogical programs crashing and corrupting our databases. So they are different issues, but they don't really seem fully independent to me.

TW
GeneJ 2011-01-04T07:28:08-08:00
The "thing" and Goals 3 (part), 5, 6, and 7 (part)
The first Developers Meeting was great. Thank you, Pat, for organizing and leading that wonderful effort.

Looking at all the goals, even those we didn't get to, in the wake of the meeting discussion, I wonder where we write in the missing "thing," or hook.

I'd like to see something "BetterGEDCOM enabled," a thing or things compelling and universally understood among techs, users, bloggers, scholars and vendors.

"We GOTTA have it...and Yesterday, please!"

I wonder if we could merge part of 3, all of 5, 6, and part of 7 into one goal, and then create a separate goal that defines a "hook."

(a) Part of 3, all of 5 and 6, and part of 7 are all getting at the same thing.
I could probably pick up the BCG Genealogical Standards Manual and find those same objectives listed, maybe almost word for word.
In short, unless I'm missing something, I think the noted items (part of 3, all of ....) should be merged into one goal.

(b) Do we have room in the vision for a "hook."
If we want the scope of BetterGEDCOM forever limited to the faithful transfer of information between programs and systems, then maybe there isn't room for a "hook." I think there is room, and I'd like to see a goal related to some functionality.
We might not be able to realize full functionality of that "hook" in a first release of BetterGEDCOM however, elements that will eventually unlock full functionality of that hook should be obvious, even in the first release (and should not conflict with other functionality in that same release).

(c) If we have room for a "hook," then what is that hook.
I'd like to see that hook be in the source/citation area. Specifically, those building blocks that will be needed to ultimately support developing citations "on the fly" (in the spirit of EndNote, Zotero, etc.). It's a given that those same building blocks would aid and streamline the process by which users share sources.

Related:
Over on the Build a BetterGedcom blog, Russ has been comparing and contrasting how different programs do things, and how those differences relate to GEDCOM. The blog hasn't much discussed the sourcing system differences between programs, but I assure you, those will be significant.
GeneJ 2011-01-05T00:39:12-08:00
Hi Louis:

Golly, I wish I could do just that!!

I don't have the list you want/we need. Actually, that's part of the reason this might be a BetterGEDCOM opportunity.

Zotero doesn't yet recognize Genealogy as a discipline separate from History (http://www.zotero.org/people/). For example, Zotero doesn't yet have elements to record a website owner/title and URL/access date in the way lots of us recognize "web publisher" information (ala Mills at the element level).

Zotero recognizes 15 "default styles" and a _very long_ list of "additional styles." See http://www.zotero.org/styles

APA is on the default styles list as are the CMOS styles listed below:

Chicago Manual of Style (Author-Date format)
Chicago Manual of Style (Full Note with Bibliography)
Chicago Manual of Style (Note with Bibliography)
Chicago Manual of Style (Note without Bibliography)

_Register_ and/or _Quarterly_ are not on the list of scholarly journals in the "additional styles."

To work with the Zotero folks, we need to present them with a list of elements (their fields) we want to use. (Russ and I were working on an element list--work that isn't finished.) In addition, they want us to compare element needs against those already in Zotero. I don't know where to pull such a list, but we need ours first, anyway.

I've used the "EndNote" button at WorldCat, and GoogleBooks to port information into Zotero. I've also pulled in generic "website" information by using the FireFox/Zotero command "Create a new item from current page."

I'll post some screen shots of the raw ported data to the BetterGEDCOM blog in a few minutes.
GeneJ 2011-01-05T11:35:55-08:00
A few Zotero screen shots are posted on Build a BetterGEDCOM blog. Link below:

http://bettergedcom.blogspot.com/2011/01/few-zotero-screen-shots.html
testuser42 2011-01-06T12:54:16-08:00
I guess the USP for a Better Gedcom would be that it actually works. BG should do what Gedcom doesn't: faithfully transport data.

BUT - for this to work, we need software programmers to support BG better than they now support Gedcom. So, what will make that happen?
IMHO, there are only a few things we can do to help the adoption and implementation of a BG:

- make BG big enough to handle everything that would ever be needed in the field of genealogy. (e.g. clean support for evidence-conclusion including reasoning; research documentation and planning; history of places, things and groups; ability to include sources and other files; quality assessment...)

- make BG well documented and give clear examples for how the elements should be used (to reduce the need for custom tags or misuse of standard tags)

- make BG and its documentation available for free forever

- make a lot of noise about it! in blogs, boards, mailing lists etc. Get GRAMPS support, and send a note to independent developers. Convince / nag the big commercial players to come on board.

Getting an official seal of ISO / W3C /... might help to add clout - but I don't know about the costs involved.

I do guess that some of the big players might think they don't need any data-sharing standard. But that's really short-sighted, they would benefit a lot from this, too.
AdrianB38 2011-01-06T13:55:11-08:00
Re the "hook" or "Unique Selling Point" - it might be useful to consider 2 of them - one for the software companies without whom nothing will happen, and one for the customers to direct them into buying BG equipped software when it appears.

I think the hook for customers is about "being able to record and exchange the full richness of the stories, genealogy and family history of people, families and groups of associated people in all cultures, without being restricted to the conventional family"

Note 1 - I'd like to see both genealogy and family history referred to, as they mean different things in different countries, and in some cases FH has a wider meaning, in other G has the more scientific viewpoint.

Note 2 - I've put in "groups of associated people" because this is one big advance on GEDCOM.

Note 3 - "without being restricted to the conventional family". Perhaps one should not be negative, but I want to encompass families with step-children, unofficial adoptions in Western culture, as well as, yes, polygamy, polyandry, same-sex, etc, etc. And I think the phrase "without being restricted to the conventional family" is the simplest way of covering off all possibilities, as well as stepping round the issue of same-sex relationships, which will raise all sorts of issues if I write it down.
GeneJ 2011-01-06T14:00:56-08:00
@Adrian:

You wrote, "genealogy and family history referred to, as they mean different things in different countries"

Might you elaborate for me.

I think of the discipline as "genealogy" ... it is practiced by folks who consider themselves genealogists, family historians (and occasionally, small bears).
GeneJ 2011-01-06T14:04:08-08:00
P.S. Hoping we can fit the various practices within a single discipline, we are a stronger collective. In this country, for example, APG (Association of Professional Genealogists) recognizes an "umbrella," so that the organization welcomes archivists, librarians, etc.
AdrianB38 2011-01-07T14:28:11-08:00
Gene asked me to elaborate on "genealogy and family history referred to, as they mean different things in different countries"

See https://bettergedcom.wikispaces.com/personal+notes+on+Family+History+in+the+UK

where I wrote that in the UK "we tend not to have "genealogy" but "family history".
...
for most of us, if there is a real difference, genealogists were once concerned with the passage of land, titles and coats of arms. "Family history" began with the idea that the study could apply to everybody, not just aristocracy or landed gentry, but also that it was worth covering all aspects of what affected family lives - their working conditions, the social lives they lead, etc. I'm sure most genealogists consider exactly those aspects today"

I suspect that for many people in the UK that difference still applies in their mind - not for any justifiable reason, but just because that's the way they think. Equally, there are people in the UK who think genealogists are serious and family historians are hobbyists.

Let's not try to establish definitions on page one of the BG publicity but just cover both terms!
GeneJ 2011-01-07T15:18:11-08:00
If they don't mean the same thing, not sure how we can cover both terms without understanding the differences.

In the alternative, it doesn't sound like there are differences in practice today.

Apologize in advance if I mis-interpreted. --GJ
AdrianB38 2011-01-08T09:43:16-08:00
Gene - different people _can_ mean different things by "genealogy" and "family history". That's why I'd like both terms to appear, since if we put one, the other lot _might_ consider themselves left out.

As for what it means to have in both, that's covered (I hope) by my phrase "being able to record and exchange the full richness of the stories, genealogy and family history of ..."

So, we cover "everything" and don't worry whether someone might consider it "genealogy" or "family history".

NB - "everything" is in quotes there because I mean "sort of everything that we're interested in".
GeneJ 2011-01-08T09:48:15-08:00
Okay. Believe I'm following now.

You are saying the words "genealogy" and "family history" all point to a single discipline, which we call genealogy or family history, and mean the same thing.

So you are further saying we will be more inclusive if we double up the terminology by using both terms with some consistency.
AdrianB38 2011-01-08T12:26:56-08:00
Gene - absolutely.

Doubling up the terminology is what we would call "belt and braces" over here. (And I think that doesn't come over directly into American English - it's whatever you call both methods of holding up a pair of trousers!)
AdrianB38 2011-01-09T09:19:46-08:00
Re the "thing" / the USP / the hook / the sizzle (if anyone hasn't heard of that phrase, as I recollect it comes from someone saying "You don't sell the sausage, you sell the sizzle")

In my view, the hook for users is about the rich picture of everything affecting a family that we (might) be able to get with BG. It seems to me that it would be possible to easily develop a "demonstrator" / "proof of concept" app that showed off this richness. Maybe even as a web-page? (Don't ask me if it's possible - a Wordpress blog is my limit...)

I'd suggest a family
- emigrating from somewhere in Europe to the US,
- one of the parents dying after arrival,
- the remaining parent marrying again to someone who already has children of their own
- the younger of that 2nd spouse's children become part of the new family, the older ones don't.

This allows
- the emigration to be shown as a multi-person event (advantage: enter that data once, not for each person)
- the ship to be entered as an entity in its own right - naturally, the ship has several names, just to show off these things, and you could link through to the various bits of the ship's history in BG;
- the explicit documentation through roles / associations / whatever that only the younger of that 2nd spouse's children are part of the new family, the older ones aren't;

Naturally, the place they settle in will change county or state / territory, while they live there, and we show the links to the BG history of that place that records this.

And maybe one of the sons fights in a war so we are able to link through to a BG history of the regiment he fights with.

What I'm after is show-casing linkages to (and within) as many of the different entity types as possible that would be held within BG (in fact, one could also add external links to web-sites to further richen the picture).

We don't need the BG data model for this other than some vague idea of the entity types and how they are related. And, ahem, we don't _need_ sources or citations....
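[Editorial aside: Adrian's "enter that data once, not for each person" point about multi-person events can be sketched as a shared event record that each participant references by role. All of the type and field names below are hypothetical; they do not come from any actual BetterGEDCOM model.]

```python
# Hypothetical sketch of a shared multi-person event: the event is stored
# once, and each person links to it with a role, instead of duplicating the
# event data on every individual (as GEDCOM's per-person events require).
from dataclasses import dataclass, field

@dataclass
class Event:
    id: str
    type: str
    date: str
    participants: dict = field(default_factory=dict)  # person id -> role

@dataclass
class Person:
    id: str
    name: str

emigration = Event("E1", "emigration", "1852")
family = [Person("P1", "Hans"), Person("P2", "Greta"), Person("P3", "Karl")]
for p in family:
    emigration.participants[p.id] = "emigrant"   # one event, many links
```

The same linkage pattern would cover the other demonstrator ideas above: the ship, the regiment, and the changing place would each be entities in their own right, referenced from events rather than copied into each person's record.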
AdrianB38 2011-01-04T14:06:57-08:00
"Part of 3, all of 5 and 6, and part of 7 are all getting at the same thing"

If you can define what that "thing" is, then fine, but my feeling is that we need to be a touch specific for the goals. The goals are, to me, more than a mission statement - if we had a mission statment, then that probably _would_ include all of those parts as one - and a hook.
testuser42 2011-01-04T16:46:57-08:00
Gene, what do you mean by that "thing" and "hook"? Something like a unique selling proposition? The one thing that will make software developers want to use BG?
GeneJ 2011-01-04T17:31:15-08:00
Hi all:

Yes, a "unique selling proposition."

During the meeting yesterday, it was remarked that our goals are all worthy and good, but worthy and good don't necessarily translate into excitement motivating program vendors to make the development investment.

One way to deal with that is to dumb down our ideals so that it's dirt cheap for vendors to make the jump.

The other way is to identify particular functionality that BetterGEDCOM enables.

A while back some of us discussed the Endnote/Zotero concept. In a nutshell, Zotero is an open-source browser based source/citation system being developed by the Center for History and New Media at George Mason University. (It doesn't yet recognize genealogy as a separate discipline, so/and there isn't yet a "genealogy" source template database in Zotero.)

Using something like Zotero, I envision one day being able to visit a FamilySearch record collection and with one click of a button, I can port a properly cataloged source into a source "library," where I can also record notes about that source, capture a page image, etc. With a click from my library, I envision being able to port the source data from my library into my genealogy software. I envision being able to "sync" sources from my genealogy software to Zotero.

I'm not endorsing Zotero, or even saying the ability to capture and manage sources on the fly is the "hook."
louiskessler 2011-01-04T22:53:46-08:00

Gene:

Marketers know that nobody cares about functions and functionality.

What are needed are benefits.

I picked this out from somewhere:

"The most important thing to remember about a USP is "benefits, not features." Focus on benefits to your potential customers, not the great features you think are really cool. The technical aspects of your products are great, but your customers are more concerned with whether the product or service is what they need or want."

So how can we make BetterGEDCOM something that all vendors will need and want? What specific benefit or gain will they get from converting to it?

And which one benefit is the no-brainer that says they HAVE to switch? That will be the USP.

Louis
GeneJ 2011-01-04T22:58:13-08:00
Oooo ... Most assuredly, I can talk that language.

Regardless of the term, it is that hook/sizzle/spark, "must offer" benefit I refer to in (c). -GJ
louiskessler 2011-01-04T23:11:07-08:00

My other BIG post elsewhere says I think GEDCOM's sourcing isn't bad as it is. I don't think we gain much by going to "perfect" sourcing. The programs that have implemented all of Elizabeth Shown Mills' templates think they've done that already. And they can export all their sources to the current GEDCOM the way it is.

My problem is I have yet to see that big hook that BetterGEDCOM will offer. I only see little benefits so far.

Louis
GeneJ 2011-01-04T23:33:54-08:00
@ Louis:
I read your comment about GEDCOM sources in the longer posting. Actually, I pulled up the GEDCOM 5.5 materials after reading your comment.

I am not a fan of GEDCOM's sources. BUT, I am not at all a fan of the predefined Mills' source templates built into genealogical software, either.

I understand the Mills' templates in Evidence Explained were provided as examples that showed various elements in use. The templates in Evidence Explained (much less the far fewer QuickCheck models) were never intended to cover the universe of source or citation presentations of those elements.

I don't know that anyone has yet normalized the actual elements in Mills' templates. Russ and I started to work on that, but I don't think that effort is even close to being finished.

Actually, a Zotero-like system isn't an attempt at "perfect sources." It's a technology driven way of communicating source information between a provider (I used the example of FamilySearch) and a user. The information is ported in element/template format.

You can visit WorldCat today and port source information about titles into Endnote or Zotero with a simple click.
louiskessler 2011-01-04T23:42:18-08:00

Gene:

You seem to really like Zotero. Could you go to the GEDCOMs Comparison Spreadsheet and add Zotero to the Source page (we may need a Source-Citation page as well) and fill in the data items that Zotero deals with? Then we can compare and contrast it with what GEDCOM and the other models have.

Louis
ttwetmore 2011-01-05T09:58:02-08:00
Backward compatibility
Do we need another discussion topic for Backward Compatibility? If so, this can start the thread. I have already written a little about it but put it in the wrong place, so I've made a copy of what I wrote here:

We should have the goal that BG is backward compatible to the GEDCOM used in all conventional programs, regardless of which and how they adhere to standards, and what their extension tags are. We should also cover all the tag set extensions defined by different vendors. To make this happen we need to do three things:

1. Figure out what all those tags are and what they all mean; and we might have to figure out idiosyncrasies of individual applications as well to see how they interpret the tags.
2. Figure out how to map those tags into the BG model.
3. Provide an example utility program that performs the conversions from all past GEDCOM formats to BG format (XML, GEDCOM, JSON and/or ??? syntaxes).

At the meeting it seemed that some members thought that to be backward compatible meant that BG must use GEDCOM syntax. This is not true. To be backward compatible all one needs is an unambiguous way to map data in the old GEDCOM formats to an equivalent BG format, whatever BG's ultimate syntax. As long as BG has an underlying data model that encompasses the data model of the old GEDCOMs this is straightforward. And that old GEDCOM model is basically the "lineage-linked" semantic model. It is based on person and family records; it is based on placing vital events inside person and family records; it is based on simple notions of names, dates, places, notes, and sources; and it is based on a small set of conventional attributes. All of these concepts will be in BG and therefore encompassed by BG.
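The mapping idea in steps 1 and 2 can be sketched in code. This is only an illustration: the `bg:` element names below are placeholders I've invented for the example, since the BG model and its ultimate syntax have not been defined, and a real converter would need the full GEDCOM 5.5 tag set plus every vendor's extension tags.

```python
import re

# One GEDCOM line is "level [@xref@] TAG [value]", e.g. "0 @I1@ INDI".
GEDCOM_LINE = re.compile(r"^(\d+)\s+(@[^@]+@\s+)?(\w+)(\s(.*))?$")

# Partial mapping from lineage-linked GEDCOM tags to hypothetical
# BG element names (placeholders, not a real specification).
TAG_MAP = {
    "INDI": "bg:person",
    "FAM":  "bg:family",
    "BIRT": "bg:birth-event",
    "DEAT": "bg:death-event",
    "NAME": "bg:name",
    "SOUR": "bg:source-ref",
}

def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    m = GEDCOM_LINE.match(line.strip())
    if not m:
        raise ValueError("not a GEDCOM line: %r" % line)
    level, xref, tag, _, value = m.groups()
    return int(level), (xref or "").strip() or None, tag, value

def map_tag(tag):
    """Map a GEDCOM tag to a BG element name. Unknown or user-defined
    tags (e.g. _MILT) are preserved under an extension prefix rather
    than dropped, so no data is lost in the conversion."""
    return TAG_MAP.get(tag, "bg:ext-" + tag.lower())
```

For example, `parse_line("0 @I1@ INDI")` yields `(0, "@I1@", "INDI", None)`, which would then become a `bg:person` record; the hard part, as noted above, is cataloguing what every application's tags actually mean.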

Point 3 above implies that it may be useful for the BG effort to be able to demonstrate the prowess of its ideas and models with some demonstration software. It appears that there are enough development-savvy members of the wiki now that such an effort might prove an interesting diversion from everyday life.

Tom Wetmore
AdrianB38 2011-01-05T12:00:03-08:00
Agree with you Tom

Note (1) there was a previous discussion topic "Compatibility with GEDCOM" from this page but that talked about moving in both directions.

Note (2) (apologies Tom - you said you hated terminological discussions - or was that interminable? Or both?) there are 2 uses of the word "compatible" being written on this Wiki. Can we _please_, everyone, agree on a usage?

- use 1 (Backward compatibility - strict sense of the word) - that a GEDCOM file could be _directly_ used in a BG-ONLY program. Clearly that doesn't work if we have BG as XML, JSON or anything other than an extended GEDCOM because the BG program won't understand a GEDCOM file.

- use 2 (Backward compatibility - the loose sense of the word) - that a GEDCOM file could go through a conversion into a BG format file and then be loaded into a BG-ONLY program without loss of data.

Personally, I'm happy with the loose sense in use (2), because I'm happy with taking things to a higher level in abstraction. We can say that a file in GEDCOM can be compatible in this looser, higher sense, with a file in XML or one in JSON. Even though they are emphatically NOT compatible at the lower level.

Or if you like a GEDCOM file can be logically compatible with an XML file and with a JSON file but will never be physically compatible with an XML file or a JSON file.
ttwetmore 2011-01-05T13:18:30-08:00
Adrian,

Thanks for pointing out the "Compatibility with GEDCOM" topic which I missed, though I likely posted on it earlier. This wiki is great on many levels, but I simply cannot keep straight where all the myriad of discussions are going on! Some of the more interesting ones were going on in "Mike's Model" and they had nothing to do with Mike's model!

And thanks for the clarification on the two types of backwards compatibility. I was blindly charging along with the loose interpretation without even thinking about the strict one. Now I understand some of the comments from the meeting! But I'm afraid trying to get this motley bunch to agree on one will be another example of herding cats.

I'm not suggesting this, but I'll bet that if we took Louis's point of view and created Better GEDCOM by enhancing GEDCOM carefully we could almost support the strict approach. I don't want to suggest this because we'd then have to deal with the horrible legacy problem of all the application and user extended tags. Of course someone still has to do this in the conversion program.

I see a third party software possibility here that could make some money -- a converter program that will take any GEDCOM with all the officially added application extension tags and all the tags that have been used in all the official versions of GEDCOM and convert it to BG. All the application developers would be interested in licensing such a tool. I've already written some stuff like this; got to go looking around in my archive of old projects.

(Actually it was terminology, but you can add interminable as well!)

Tom Wetmore
ACProctor 2011-12-07T11:19:10-08:00
What is the UK?
I have to post this informative but humourous video link. It explains the difference between the UK and Britain, but it also shows how difficult the geography of even simple Places can be:

http://www.youtube.com/watch?v=rNu8XDBSn10

Tony
GeneJ 2011-12-07T12:21:33-08:00
Enjoyed this video. And I thought the development of Ohio counties was challenging to understand.....
Andy3rd 2011-12-07T12:56:58-08:00
Love it!
Now, will they do one on Historic counties vs. those administrative counties?
;)
theKiwi 2011-12-07T20:03:37-08:00
Thanks - that was well put together even if as a subject of a former British Colony now a member of the British Commonwealth I say so myself :-)
WesleyJohnston 2011-12-08T00:18:31-08:00
And then there is Cornwall, where there is a push for recognition as a separate nation within the UK ...