UPDATE 21 Jan 2011 10:15pm Pacific US Time
DearMYRTLE: Only 21 members of BetterGEDCOM wikispaces have voted, and we have 2 hours to go on this week's survey. What shall we do since a minimum of 50% of the registered BetterGEDCOM members must vote for this to be considered a consensus? Can you add your thoughts to the DISCUSSION tab on this page?
Reference our 3 Jan 2011 Developers Meeting Notes
for guidelines about voting.
COMMENTS ARE NOW CLOSED.
PLEASE VOTE at our Survey Monkey site between Wednesday 19 Jan 2011 and Friday 21 Jan 2011 Midnight, using the following link:
- You are REQUIRED to include your BetterGEDCOM wikispaces ID.
- Theoretically an individual could vote multiple times with this version of Survey Monkey. However, only your FIRST series of votes will be counted.
But the old comments are found below:
During the 17 January 2011 Developers Meeting, we agreed to discuss and then vote on the wording of a new requirement that BetterGEDCOM:
Be accommodating of all possible types and lengths of data
To cut down on confusion, please use this page, and this page only, to discuss and submit your suggestions concerning the statement in red.
a. Preface your statement with your wikispaces screen name, as per my sample below, in green.
b. Similarly label any subsequent responses you make. There is no limit to the length or number of responses.
c. All comments here close Wednesday morning at 10am US Mountain time. DearMYRTLE will then post the suggested sentences on Survey Monkey, and provide a link to the voting space here.
At that time, this page will no longer be open for added discussion.
d. For a vote to be considered a consensus, at least 50% of the BetterGEDCOM wikispaces membership must participate, with 75% voting “Yes” or “Yes, I can live with it”.
e. Voting will close at midnight Pacific US time, Friday.
f. Results will be posted Monday January 25, 2011.
----------------------------Please do not alter the text above -----------------------
DearMYRTLE: “Accommodate multiple types and lengths of genealogy data.”
: I find the red statement ambiguous to the point of meaninglessness. I believe the intention was to say that certain types of values, most especially names, dates and places, are expressed in many ways by different cultures, and that Better GEDCOM should not impose, knowingly or unknowingly, any restrictions that would make expressing those values awkward or impossible. That is what I believe the phrase "all possible types" referred to. The phrase "lengths of" I believe means that Better GEDCOM should not impose any restrictions on the lengths of values; for example, notes can be as long as necessary. My suggestion is:
Better GEDCOM must not impose restrictions on field lengths or value formats except as deemed necessary during design.
We need to define the purpose of this sub-goal. Is it to identify the value types and formats (e.g., names, places, dates, notes, sources, etc.) and sub-types (e.g., first, last, AKA names)? Or is it to "accommodate" the information transmitted by each existing software program?
I suggest that this be separated into three sub-goals - one for value types and formats, one for interaction with software programs, and one for field lengths. I like Tom's suggestion for field lengths.
Better GEDCOM must not impose restrictions on field lengths except as deemed necessary during design.
- Since I'm guilty of concocting that wording.... The original bullet point read "Lines should have no length restriction". OK, but that's a very specific requirement that seems to have come from CONC / CONT experiences in GEDCOM. The logic seems to have been: if you don't restrict the length of a note (only a note??), there's no reason to have CONC or CONT lines after the first line, and therefore you remove an issue whereby software confuses CONC and CONT. But there are other items as well which seem to have limits imposed on them - for instance, "Cause of death" is supposed to be limited to 90 ch. Pointless to limit it. So it's not just lines of notes, but any textual item (indeed any number as well), that needs to be allowed to be as long as it likes. Hence "Be accommodating of all possible lengths of data". "Accommodating" simply means "allow it".
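For anyone unfamiliar with the CONC/CONT mechanism being discussed, here is a minimal sketch (my own illustration, not from the discussion) of how a GEDCOM 5.5 reader has to rejoin a note that was split across continuation lines because of the line-length limit — the very complexity that disappears if lengths are unrestricted:

```python
# Sketch: rejoining a GEDCOM 5.5 NOTE that was split across CONC/CONT lines.
# CONC continues the previous value with no line break; CONT inserts one.
def join_note(lines):
    text = []
    for line in lines:
        level, _, rest = line.strip().partition(" ")
        tag, _, value = rest.partition(" ")
        if tag == "NOTE":
            text.append(value)
        elif tag == "CONC":
            text.append(value)          # no newline: value was split mid-word
        elif tag == "CONT":
            text.append("\n" + value)   # deliberate line break in the note
    return "".join(text)

record = [
    "1 NOTE This note is long enough that it had to be sp",
    "2 CONC lit across lines.",
    "2 CONT A second paragraph follows.",
]
print(join_note(record))
```

The confusion Adrian mentions comes from software emitting CONT where it meant CONC (or vice versa), which silently corrupts the rebuilt text.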
Not sure why I put in "types" - it wasn't meant to cover internationalisation, which is elsewhere. So if I can't justify it, take it out.
I therefore propose "Better GEDCOM should not impose restrictions on item lengths". Please note I've put the word "should" in, not "must", which removes the necessity for Tom's (sensible) "except as deemed necessary" as one has to justify each occasion when the "should" cannot be delivered.
: Provide both standard and self-referential data types of variable length.
Here, "self-referential" means that any nonstandard data type completely describes itself so other software can process it if it desires. But if AdrianB38 meant the requirement to concern only length, then another requirement could address data types.
I find the phrases "be accommodating" or "accommodate" to be vague. It hints there are degrees of compliance from loose to strict. It also alludes to "how" the requirement should be satisfied without providing any scope or guidelines to do so. The phrase "all possible types" is also vague to the point of being impossible to satisfy as a requirement. It would be difficult to list all data types in use today let alone every possible data type to be invented in the future.
"Types" of data can imply both semantic and syntactic meaning. We know there are different date formats and calendars used in genealogy, but there are also different character sets used to express those dates.
The phrase "except as deemed necessary during design" in the comments above defeats the purpose of a requirement. The design should satisfy the requirement, not dictate what the requirement should be.
Since Adrian cannot remember why ”data TYPES” was included in the text, and I can’t think of a reason why this should be an issue, the ”types” aspect can be removed.
The length aspect is more complicated since Gedcom currently specifies a lot of max lengths for fields. I must admit I am not up to date on all sorts of database technology, but I assume that some database engines offer different functionalities for strings that are, say, 200 characters long versus long notes that are 10,000 characters long, e.g. when sorting. If that is the case, we may still have to specify max lengths – but with somewhat larger values than currently. Also, the max length of a field has some implications on the design of fields in the user interface. I don’t think there is a reason to specify “no limit” for e.g. a person name field if all names are shorter than 200 characters.
But, OK, my knowledge in this may be outdated. Comments are welcome.
If my assumptions are correct, I think rjseavers text is closest to what I want, but not for the reasons rjseavers specifies.
Until I see an example that requires "self-referential" data types I have no opinion on it.
: I think the gist of this requirement is to not impose ANY length restrictions on ANY field and leave it up to the developers of the software to limit their fields as they see fit. That way, the software devs can change field lengths when needed without having to wait for an updated standard.
For example, consider the Social Security Number. It's currently formatted as 999-99-9999. It's been that way since the 1950s. We could write BG to constrain the SSN element/field to only 11 characters. But what happens if Ol' Uncle Sam decides to add another prefix to it and make it 999-999-99-9999? Then we'd have to change the standard to accommodate it.
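The SSN example above can be sketched in code (the expanded 999-999-99-9999 format is the poster's hypothetical, not a real proposal): a schema that hard-codes today's field width breaks the moment the real-world format grows, while a length-unconstrained field does not.

```python
import re

# RIGID bakes today's exact SSN layout into the standard (11 characters);
# RELAXED only constrains the alphabet, leaving length to the application.
RIGID = re.compile(r"^\d{3}-\d{2}-\d{4}$")
RELAXED = re.compile(r"^[\d-]+$")

current = "123-45-6789"
future = "123-456-78-9012"   # the hypothetical expanded format from the text

print(bool(RIGID.match(current)), bool(RIGID.match(future)))      # rigid rejects the new format
print(bool(RELAXED.match(current)), bool(RELAXED.match(future)))  # relaxed accepts both
```

The design point is the one made in the surrounding text: only the rigid schema forces a revision of the standard when the format changes.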
I too like rjseaver's rewording, with regard to TreeTraverser's last paragraph -- we should leave off the "except as deemed necessary during design":
BetterGEDCOM must not impose restrictions on ANY field length.
Louis Kessler: These sentences are so convoluted. Isn't this what we're trying to say:
Data should have no type or length restrictions.
I'm thinking somewhat the same as Geir - my concern here is that without any specification as to length of a field of a particular type, it would be hard for the application developers to know just what to "aim at" when developing their software in terms of field sizes. So strict compliance with a BetterGEDCOM specification that imposes no limits could mean that they have to at least allow for the possibility of a place name of 5,000 characters for example.
But if for example it can be determined that any acceptable date string, in any language is never more than 50 characters, why not at least start with a field limit on dates of 50 characters?
The two programs I'm most familiar with - TNG and Reunion - both have limits on the lengths of many of their fields, particularly place names and people names.
TNG being MySQL based has many of its fields defined as type varchar(x) where x is 50 for dates, 248 for places, 127 for each element of name. Notes fields have no limit.
Reunion allows 255 characters in each name element, 255 in places, 64,000 in notes.
If developers think that the requirements of strict compliance are too onerous, then I guess they're likely to either not comply, or not even implement, which won't help with the widespread adoption of BetterGEDCOM support.
So if program X only allows 10 character names, then we should make the limit 10 characters so that all existing programs will work with it? Gthorud just said in a discussion attached to Tom's Goals page that: "BetterGEDCOM should not be the limiting factor when info is exchanged between applications". Sorry, it's got to be unlimited length. Most modern programming languages have strings that are unlimited length (well not really - but there is no hard-coded limit imposed on them other than the physical limits of the language and operating system).
gthorud: Louis, no one has suggested that BG should adopt the minimum length implemented by any program. Also, my statement was written in a totally different context, so it does not apply here. Many existing programs are based on a relational database, and most of these programs will have problem with unlimited length. Unlimited length is likely to cause interworking problems.
Louis Kessler: Ah. But your statement was so profound and deserves to apply to ALL of BetterGEDCOM.
Re. Louis's proposal. So a jpeg image of a signature would be ok as a person's name?
Genealogy "data" is supposed to be UTF8 in BetterGEDCOM. Put the jpeg image in UTF8 and then make it the person's name, and we should not kick it out. Tamura has super-long names as one of his GEDCOM tests. There are all sorts of languages, and there is no way the BG standard should try to validate them. I sure hope you don't expect to try to embed jpegs and multimedia into BetterGEDCOM. GEDCOM attempted that with the BLOB and immediately realized what a mistake it was and took it out in the next version.
gthorud: Louis, what practical purpose does “Data should have no type restrictions” have? What does type mean in this statement? (If all data is a Unicode string, this seems like a contradiction, but I guess it may be me that does not understand what you mean.)
Louis Kessler: I am not 100% sure what everyone else is thinking in reference to "types", but my thinking is that they refer to whether items are words, numbers, booleans, date and time, etc. Truly, it should be the specification of the BetterGEDCOM constructs that we'll have to come up with, that will define what types of data are allowed where. But no overall restrictions should be imposed.
"The Tyranny of Relational Thinking." I have ranted about this for years. We put restrictions on field lengths because of restrictions in relational database technology. That's tyranny in my hyperbolic way of speaking. We put restrictions in our models in anticipation of the restrictions that users of the model may encounter. Let's be clear about that. Do we want to put ourselves in that position or not? I know what my answer is, but Better GEDCOM must decide its answer. Are we going to be wrapped around the little finger of SQL or are we going to do the right thing? There is no other reason to limit the size of any field (except the fields, like sex, that take their values from a pre-specified, enumerated set of values). The argument that maximum lengths are needed to make graphical user interfaces possible is wrong. There are many examples of user interfaces that allow unlimited text input that work just fine (I am typing into one of them right now!). IMHO Better GEDCOM should impose no limits on field sizes "except as deemed necessary during design". In the LifeLines program I wrote 20 years ago I put no restrictions on field/line lengths. It was an obvious decision made the instant I thought about it. There have never been any bad repercussions from that decision. LifeLines does not use a tyrannical relational database.
Many models have strict or loose ways to store data. A flexible convention offers a possible workaround. Take a person's name: some may have "Smith/ John A" in one field; others may use Given name "John A", Surname "Smith". Now look at dates: "Abt. 1880" or "Bef. 1880". If many formats out there can split, join or separate the data around "/", we could adopt a "/" convention such as date "1880/ abt.", where the date is primary and the secondary half supplements the primary data. This can eliminate the need for a second node and shrink the database size. The same can go for many of the other fields.
My position is that limits should not be set. The problem to resolve is the import and export of such data; one would need to handle it in a consistent way. Suppose you had "This is a long segment of data/ important additional data".
Going from Gedcom to one of the many platforms, it can be kept as one node or split as needed. Upon export to a BG model it could be split as a PRIMary node "This is a long segment of data" and a SECOndary node "important additional data". The other platforms could likewise join or split data as needed.
I believe lengths should not be set. However, if they are, this is the only way I see for platform models following along with BG to import and export data in a way which can be used and understood by other platforms within usable limits.
Bottom line: if a system (software) cannot read long data, then I see that system as limited, and it will absolutely lose data with no regard to its importance.
I am not just trying to force a one-way street, but a workaround that could be used to go from a restrictive style to a relaxed style, so that both import and export platforms can either split or combine data for their own use until it is given back out in a way that other styles can handle with no data loss.
Most devs can get what I am saying, and all can understand that the data after a "/" can be sorted, filtered, joined, split and matched for any purpose. We need to stop thinking about how we want to handle data by our own standards, and think instead about how to transport data to a universal platform. We independent model devs will have the responsibility to manipulate the long data for our own software needs, BUT MUST export it back in a model that others can use quickly without needing to decipher too much.
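The "/" convention proposed above can be sketched as follows. This is only an illustration of the poster's idea (the field names and the PRIM/SECO labels come from the posts, not from any standard): text before "/" is the primary value, text after it supplements the primary.

```python
# Sketch of the proposed "/" primary/secondary convention:
# "1880/ abt." -> primary "1880", secondary qualifier "abt."
def split_field(raw):
    primary, sep, secondary = raw.partition("/")
    return {"PRIM": primary.strip(), "SECO": secondary.strip() if sep else None}

def join_field(parts):
    # Round-trip back to the single-field form on export.
    if parts["SECO"]:
        return f'{parts["PRIM"]}/ {parts["SECO"]}'
    return parts["PRIM"]

d = split_field("1880/ abt.")
print(d)
print(join_field(d))
print(split_field("Smith/ John A"))   # the name example from the post
```

An obvious open question with this scheme, worth flagging, is what happens when a legitimate value itself contains a "/" (e.g. a double date like "1749/50"): any single-character delimiter needs an escaping rule before it could go into a standard.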
I have to say I totally hate this wording. I understand its purpose. I do however have a problem with it on a technical level. Some kinds of data are naturally of a particular type. Names, for instance, are text based, and they should be a text type. We should try to keep the model free of assumptions tied to one storage model or another. Not all implementations will be relational.
Data should have no length restrictions, and be of an appropriate data type without regards to data model.
Data should have no length restrictions.
- Practical not theoretical justification?
I guess I should put my developer's hat on and see things from their side: Suppose the BG standard dictates that someone's first name can be huge in length. What happens if I push back at you all and ask "Why? Give me an example of someone's given name that's over 255 characters long"? If you ask why it matters, then my own developer's hat starts slipping over my eyes a bit and obscuring my vision because of my personal background, but I might mumble something about sorting or data manipulation and the extra time to code it all. And if you still can't give me an answer that justifies things, then if it does seem like a lot of hard work, why will I bother?
So - can a first name (say) be over 255 ch long? Or a "node" of a place name? (I would envisage full names of people or places being several 255 ch fields, concatenated together.)
And what is the impact on programming in any language or database of manipulating, concatenating, sorting, stuff over 255 ch long?
1. I'm using 255 ch as the limit because that appears to be what mySQL uses as the maximum size of a character item.
2. Notes could be 65k long if Text in mySQL or 4.2 billion ch if LongText or infinite if they are concatenated together.
3. Can someone just confirm for me what happens if I have 255 ch available for an English name and I want to use other alphabets? Does this mean I've only got half of 255? (Or indeed, a third or a quarter if I wanted to use some of the more unusual encodings such as that for Klingon?)
4. I'm pondering what happens if we assign 255 ch as the standard size for some items (e.g. each node of a person, place or organisation name, also dates - and what else?) and either 65k or infinite (probably actually multiples of 65k) for descriptive text, causes, personal attribute values.