Evidence Model Attempt:
(I removed my original structure in favor of Geir's simplified, more readable version. Mike)
A syntactically simplified version of the model:
source:
id="UUID"
parentSource? (source ref)
sourceDetail*
sourceDetail:
type
value
extraction:
id="UUID"
dateExtracted
extractor (source structure or source ref; required, and ONLY ONE)
source (structure) required, and ONLY ONE
person*
event*
place*
note*
person:
id="UUID"
sourceDetail*
gender* (TBD)
name* (TBD)
eventRoleRef*
sameAsRef*
note*
eventRoleRef:
role ref="eventUUID"
type="TYPE"
sameAsRef:
sameAs ref="personUUID"
This model has some interesting characteristics:
1. A person only exists in context of an extraction and thus EVERY person is an "evidence" person. There is no distinction between an evidence person and a conclusion person.
2. Because the model only allows (and requires) one source in an extraction, it puts the researcher contributions (and automated computer contributions) at the same level as any of the sources from which information is extracted. In other words, a researcher forming a "conclusion" is just another source and is on the same level as a census record.
3. Because person, place, event etc records are all contained within an extraction, this model would easily support containing an entire gedcom file in a single extraction tag with the source being the gedcom file itself.
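As a rough illustration of characteristic 2, the "exactly one source, exactly one extractor" rule could be checked mechanically. This is only a sketch and not part of the proposal: the function name `check_extraction` and the tiny sample document are assumptions, and the element names simply follow the concrete example in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names follow the concrete example in this post.
SAMPLE = """
<bettergedcom>
  <extraction id="ok">
    <extractor ref="s0"/>
    <source type="CENSUS" parentSourceRef="s5"/>
    <person id="p1"/>
  </extraction>
  <extraction id="bad">
    <extractor ref="s0"/>
    <person id="p2"/>
  </extraction>
</bettergedcom>
"""

def check_extraction(extraction):
    """Return a list of rule violations for one <extraction> element."""
    problems = []
    for tag in ("extractor", "source"):
        count = len(extraction.findall(tag))  # direct children only
        if count != 1:
            problems.append(
                f"{extraction.get('id')}: expected 1 <{tag}>, found {count}"
            )
    return problems

root = ET.fromstring(SAMPLE)
report = [p for ex in root.findall("extraction") for p in check_extraction(ex)]
print(report)
```

Only the second extraction is reported, because it has no source of its own.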
Here is a concrete example:
<bettergedcom>
<source id="s00" type="COMPUTER_SYSTEM">
<sourceDetail type="DESCRIPTION">FamilyPursuit.com</sourceDetail>
<sourceDetail type="URL">http://www.familypursuit.com</sourceDetail>
</source>
<source id="s0" type="RESEARCHER">
<sourceDetail type="DESCRIPTION">Michael Martineau</sourceDetail>
<sourceDetail type="EMAIL">mike at familypursuit dot com</sourceDetail>
</source>
<source id="s1" type="WEBSITE">
<sourceDetail type="DESCRIPTION">FamilySearch</sourceDetail>
<sourceDetail type="URL">http://www.familysearch.org</sourceDetail>
</source>
<source id="s2" type="CENSUS" parentSourceRef="s1">
<sourceDetail type="DESCRIPTION">1880 US Federal Census</sourceDetail>
<sourceDetail type="JURISDICTION">United States</sourceDetail>
<sourceDetail type="YEAR">1880</sourceDetail>
</source>
<source id="s3" parentSourceRef="s2" type="CENSUS">
<sourceDetail type="JURISDICTION">North Carolina</sourceDetail>
</source>
<source id="s4" parentSourceRef="s3" type="CENSUS">
<sourceDetail type="JURISDICTION">Transylvania</sourceDetail>
</source>
<source id="s5" parentSourceRef="s4" type="CENSUS">
<sourceDetail type="JURISDICTION">Little River</sourceDetail>
</source>
<source id="s6" type="CENSUS" parentSourceRef="s1">
<sourceDetail type="DESCRIPTION">1870 US Federal Census</sourceDetail>
<sourceDetail type="JURISDICTION">United States</sourceDetail>
<sourceDetail type="YEAR">1870</sourceDetail>
</source>
<source id="s7" parentSourceRef="s6" type="CENSUS">
<sourceDetail type="JURISDICTION">South Carolina</sourceDetail>
</source>
<source id="s8" parentSourceRef="s7" type="CENSUS">
<sourceDetail type="JURISDICTION">Georgetown</sourceDetail>
</source>
<source id="s9" parentSourceRef="s8" type="CENSUS">
<sourceDetail type="JURISDICTION">Collins</sourceDetail>
</source>
<extraction id="a0">
<dateExtracted>26 March 2011</dateExtracted>
<extractor ref="s0" />
<source type="CENSUS" parentSourceRef="s5">
<sourceDetail type="PAGE" value="222B"/>
<sourceDetail type="NARA_FILM_NUMBER" value="T9-0983"/>
<sourceDetail type="ENTRY" value="1339"/>
<sourceDetail type="FHL_FILM_NUMBER" value="1254983"/>
</source>
<person id="p1">
<sourceDetail type="LINE_NUMBER" value="5"/>
<gender type="MALE" />
<name>John Hunt</name>
<role ref="e1" type="CHILD" />
<attribute type="AGE" value="22"/>
<attribute type="OCCUPATION" value="Farmer"/>
</person>
<event id="e1" type="BIRTH">
<role ref="p1" type="CHILD"/>
<role ref="p2" type="PARENT" />
<role ref="p3" type="PARENT" />
<date><year>1858</year></date>
<place>South Carolina</place>
</event>
<person id="p2">
<sourceDetail type="LINE_NUMBER" value="5"/>
<gender type="MALE" />
<role ref="e1" type="PARENT" />
<role ref="e2" type="CHILD" />
</person>
<event id="e2" type="BIRTH">
<role ref="p2" type="CHILD"/>
<place>South Carolina</place>
</event>
<person id="p3">
<sourceDetail type="LINE_NUMBER" value="5"/>
<gender type="FEMALE" />
<role ref="e1" type="PARENT" />
<role ref="e3" type="CHILD" />
</person>
<event id="e3" type="BIRTH">
<role ref="p3" type="CHILD"/>
<place>South Carolina</place>
</event>
</extraction>
<extraction id="a1">
<dateExtracted>26 March 2011</dateExtracted>
<extractor ref="s0" />
<source type="CENSUS" parentSourceRef="s9">
<sourceDetail type="PAGE" value="75"/>
<sourceDetail type="ENUMERATED" value="1 July 1870"/>
</source>
<person id="p4">
<sourceDetail type="LINE_NUMBER" value="10"/>
<gender type="FEMALE" />
<name>Eliza Hunt</name>
<role ref="e4" type="CHILD" />
<attribute type="AGE" value="20" units="YEARS" />
<attribute type="OCCUPATION" value="Farmer" />
<attribute type="VALUE_OF_REAL_ESTATE" value="320"/>
</person>
<event id="e4" type="BIRTH">
<role ref="p4" type="CHILD"/>
<role ref="p5" type="PARENT" />
<role ref="p6" type="PARENT" />
<place>Carvers Bay, South Carolina</place>
</event>
<person id="p5">
<sourceDetail type="LINE_NUMBER" value="10"/>
<gender type="MALE" />
<role ref="e4" type="PARENT" />
<role ref="e5" type="CHILD" />
</person>
<event id="e5" type="BIRTH">
<role ref="p5" type="CHILD"/>
<place>United States</place>
</event>
<person id="p6">
<sourceDetail type="LINE_NUMBER" value="10"/>
<gender type="FEMALE" />
<role ref="e4" type="PARENT" />
<role ref="e6" type="CHILD" />
</person>
<event id="e6" type="BIRTH">
<role ref="p6" type="CHILD"/>
<place>United States</place>
<note>Census indicated Eliza's parents were not foreign born.</note>
</event>
<person id="p7">
<sourceDetail type="LINE_NUMBER" value="11"/>
<gender type="MALE" />
<name>John Hunt</name>
<role ref="e7" type="CHILD" />
<attribute type="AGE" value="12" units="YEARS" />
</person>
<event id="e7" type="BIRTH">
<role ref="p7" type="CHILD"/>
<role ref="p4" type="PARENT" />
<place>Carvers Bay, South Carolina</place>
</event>
<person id="p8">
<sourceDetail type="LINE_NUMBER" value="12"/>
<gender type="FEMALE" />
<name>Catherine Hunt</name>
<role ref="e8" type="CHILD" />
<attribute type="AGE" value="1" units="YEARS" />
</person>
<event id="e8" type="BIRTH">
<role ref="p8" type="CHILD"/>
<role ref="p4" type="PARENT" />
<place>Carvers Bay, South Carolina</place>
</event>
</extraction>
<extraction id="a3">
<dateExtracted>26 March 2011</dateExtracted>
<extractor ref="s00" />
<source ref="s00"/>
<basedon>
<extraction ref="a1"/>
</basedon>
<person id="p200">
<sameas ref="p7" />
<role ref="e207" type="CHILD" />
</person>
<event id="e207" type="BIRTH">
<role ref="p200" type="CHILD"/>
<date>
<year>1858</year>
<note>Based on age at time of 1870 census.</note>
</date>
</event>
</extraction>
<extraction id="a4">
<dateExtracted>9 April 2011</dateExtracted>
<extractor ref="s0" />
<source type="RESEARCHER" parentSourceRef="s0" />
<basedon>
<extraction ref="a0"/>
<extraction ref="a1"/>
</basedon>
<person id="p300">
<sameas ref="p1" />
<sameas ref="p200" />
<note>I think these two are the same individual because...</note>
</person>
</extraction>
</bettergedcom>
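One thing the parentSourceRef chain in the example makes possible is assembling a citation by walking from the most specific source up to the root. The sketch below assumes the element and attribute names used in the example; the function name `jurisdiction_path` is hypothetical, and the sample is a trimmed copy of sources s1 to s5.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of sources s1..s5 from the example.
SAMPLE = """
<bettergedcom>
  <source id="s1" type="WEBSITE">
    <sourceDetail type="URL">http://www.familysearch.org</sourceDetail>
  </source>
  <source id="s2" type="CENSUS" parentSourceRef="s1">
    <sourceDetail type="DESCRIPTION">1880 US Federal Census</sourceDetail>
    <sourceDetail type="JURISDICTION">United States</sourceDetail>
  </source>
  <source id="s3" parentSourceRef="s2" type="CENSUS">
    <sourceDetail type="JURISDICTION">North Carolina</sourceDetail>
  </source>
  <source id="s4" parentSourceRef="s3" type="CENSUS">
    <sourceDetail type="JURISDICTION">Transylvania</sourceDetail>
  </source>
  <source id="s5" parentSourceRef="s4" type="CENSUS">
    <sourceDetail type="JURISDICTION">Little River</sourceDetail>
  </source>
</bettergedcom>
"""

def jurisdiction_path(root, source_id):
    """Walk parentSourceRef links upward; return jurisdictions, broadest first."""
    by_id = {s.get("id"): s for s in root.iter("source") if s.get("id")}
    path, seen = [], set()
    current = by_id.get(source_id)
    while current is not None and current.get("id") not in seen:
        seen.add(current.get("id"))  # guard against accidental cycles
        for detail in current.findall("sourceDetail"):
            if detail.get("type") == "JURISDICTION":
                path.append(detail.text)
        parent = current.get("parentSourceRef")
        current = by_id.get(parent) if parent else None
    return list(reversed(path))

root = ET.fromstring(SAMPLE)
print(jurisdiction_path(root, "s5"))
```

Each level stores only what is new at that level, so the full jurisdiction for an extraction's source is recovered by following the parent links.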
Second, I'm happy that the DeadEnds model was able to help organize your approach.
Third, I feel the "extraction" level concept might be a bit of an overkill. I hadn't thought about what you are using it for before. You use it as a way to capture when an extraction (the act of creating a set of evidence records from real world evidence) was made and by whom it was made. I see your point, but wonder about its utility. How important is it to "glorify" an extraction activity as a major, high-level entity in a database? I think your example works just as well without that layer.
Fourth, you don't yet address the person-based, conclusion-based world. I applaud your audacity in omitting the conclusion world, since most people only want the conclusion world and not the evidence world at all! Based on the objections I get from people who think the evidence level is an unnecessary, theoretical complication to the right way of doing conclusion-based genealogy (it brings some to tears), I hope you have a good sense of humor.
Fifth, I think adding the conclusion world would be easy for you to do. I believe you only need to think of "justifications" or "proof statements" as another kind of "source", and you need to be able to bind persons together into higher level persons. The DeadEnds model gives you the mechanism needed for these.
Sixth, if you added the conclusion world and didn't require the evidence world you would make everybody happy. You would only have to deal with having conclusion persons refer to sources, but that's all we do today.
Neat stuff.
Tom W.
The important thing that I see in the proposal is that you want some way to tell what info you have extracted from a source (identified by at least part of a citation). Requirement Admin07 in the Req Cat already talks about “BetterGEDCOM shall be able to record information about, or link to records representing, the findings and results produced by the task” – task being the lookup in one source.
Assuming on one hand an Evidence Person (Tom’s model) or a Persona (Gentech) based on the info from one lookup, and on the other hand a free text transcript, an image or a data structure recording e.g. a census household record from a lookup task, this can be achieved by linking the task to the Evidence person or Personas created as a result of the task (rather than keeping the info in one Extraction structure). I may be failing to see some advantage of having it in one structure, since this is for me a new and unfamiliar way of structuring the data.
I don’t see why the researcher making a conclusion should be modeled as a source; it is unnecessarily complicated. I think researchers should be a separate record type; there are other needs for that (cf. Admin0x). I don’t mind calling a program a researcher.
Then there is a question about how transcribed info would look like if you download it from a database. I would get that as table structures from a relational database (Sorry Tom) operated by e.g. our National Archives, so I would like my program to be able to store that as an “excerpt” – not interpreted in any way. Archives have to have a general, not genealogy specific way, to store such info, and a table structure would be more genealogy-user-friendly.
You could also convert the data to Gedcom/BG format, so there may be a need for both, but I am not sure an archive would take the responsibility for that conversion since it may require some interpretation of the original data – it is not always easy to derive the biological relationships from a census record. There is also a lot of info in a census that you would have to create special structures in Gedcom to handle. But if I can’t have a table structure for storage of my excerpt, I would have to use BG, and have separate “evidence BG files” for some simpler source types. (I already have all the census, church records and other stuff for my prime area transcribed, so I would love to have them in my genealogy program, it will be no extra typing work – cf. Tom’s comment – I don’t see his circular problem.)
The hierarchical structure of sources – that is something that should be considered in the Sources&Citation work of BG. I am not sure how useful it is, and don’t know if source meta data bases around the world have hierarchically linked sources? Maybe some have?
Tom, as I said in the last Developers meeting, I think it would be a good idea to create (if it does not exist) a BG Evidence & Conclusion page, and as a start populate it with links to the older/current discussions scattered all over – so interested parties could find the info more easily – cf. Mike’s comments.
I didn't particularly like that either. But it solves the problem of giving a conclusion record a source and a justification.
Every record, including a conclusion record, needs a source. In the conclusion-based world, each conclusion record typically has many sources, one for each of the separate items of information that have been collected together to form that person record (e.g., a source for the name, a source for each birth event, each death event, each residence event, ...). This misses the VERY IMPORTANT point that some agent, probably human, maybe software, made the decision that all those items of information refer to the same real person. Someone who believes in citations and sources would be appropriately shocked, I assume, when they realize that all those nice individual citations have nothing at all to say about the person as a whole! Yikes.
In the evidence and conclusion approach, as shown in Mike's example, this situation is wholly rectified. Each person record, whether it is an evidence level thing or a conclusion level thing, has EXACTLY ONE source. For an evidence person it is the source that provided the evidence. For a conclusion person it is the human (or algorithmic!!) agent that chose to group the evidence persons together, along with the reason/proof statement/justification/algorithmic comparison score/research notes that state why those evidence persons were joined together.
Requiring a person as a source might not be the best solution, but it works and solves the problem. If there are other alternatives we should certainly consider them.
Tom Wetmore
Geir said, "Then there is a question about how transcribed info would look like if you download it from a database. I would get that as table structures from a relational database (Sorry Tom) operated by e.g. our National Archives, so I would like my program to be able to store that as an “excerpt” – not interpreted in any way. Archives have to have a general, not genealogy specific way, to store such info, and a table structure would be more genealogy-user-friendly."
Excellent example. An "excerpt" (in this case, a "table structure from a database") plays the role of an item of evidence taken from a source (the national database). Well, it doesn't play the role, it actually is a real item of evidence from a real source.
If one were using the conclusion-based (aka person-based) methodology, one would take info from that excerpt and add it directly to a conclusion person in a database. That excerpt may be textually based, but you treat it as you would any other item of evidence, for example a birth certificate or an image of a census record -- you are using it ONLY to extract information to add to a conclusion person record.
If one were using the evidence-based (aka records-based) methodology, one would take that excerpt and create an evidence person directly from it. That would then be a new evidence level person record in your database, which would eventually be grouped with other evidence person records to form a conclusion person. In this world we are talking about two things: the excerpt, which is real evidence, and the evidence person that is extracted from the excerpt.
Tom Wetmore
Look at "Code" in the Wikitext help page.
[[code]]
This is plaintext.
[[code]]
I don't see any problem with linking to two different entity types, sources and researchers.
There is a need to look at how do we store "excerpts", and what do we link them to.
Also, in order to have evidence records to link to, I assume we need to have evidence and conclusion groups, places, ships and whatever.
Are we able to link to all interpreted data (not excerpts) by including evidence persons/groups/places/etc.? Have we covered every place where you would record something?
You have not responded to my suggestion to have a separate page for E&C - I am tired of discussing it all over the place, repeating explanations of the model all the time.
If a source reference can point to a researcher, we've got a solution.
I see "excerpt" the same way I see a birth certificate image. It is a REPRESENTATION of evidence that we MIGHT want to store in our database. In the case of Mike's excerpt example, it is the representation of a record that exists somewhere else in a database. Evidence person records can be extracted from this kind of excerpt exactly as evidence person records can be extracted from an image of a birth certificate.
Personally, I would be more likely to just look at the "excerpt" on my computer screen while searching a database, and extract the info from the excerpt directly into an evidence person record in my database. (I have done lots of work with city directories, and this is exactly what I do.) I might be making a mistake by not also wanting to add the EXCERPT ITSELF to my database, but I am confident enough that when I create the evidence person from the excerpt, I will capture the data fully and completely so I won't need to check the excerpt again. If I did have to refer to it again I'd have to go query the database again!
Here is an example. You are researching a person in city directories. You find persons with the names you are interested in across a string of three annual directories. How would you capture that evidence (say it consists of a name, an address and an occupation)?
What I do -- create a source record for each of the three directories -- create an evidence person for each mention of a person with a name I am interested in. The result is a few source records and a few evidence person records that refer to the source records.
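A minimal sketch of that bookkeeping, assuming one source record per directory and one evidence person per mention. The class names, field names, and every value below (directory years, name, address, occupation) are made up for illustration and are not taken from any real directory or from any of the models discussed here.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    id: str
    title: str

@dataclass
class EvidencePerson:
    id: str
    source_id: str  # each evidence person refers to exactly one source
    name: str
    details: dict = field(default_factory=dict)

# Hypothetical data: three annual directories, one mention in each.
sources = [Source(f"s{year}", f"City Directory {year}") for year in (1901, 1902, 1903)]
persons = [
    EvidencePerson(
        id=f"p{i}",
        source_id=src.id,
        name="J. Doe",
        details={"address": "12 Main St", "occupation": "clerk"},
    )
    for i, src in enumerate(sources, start=1)
]

for p in persons:
    print(p.id, "->", p.source_id, p.name, p.details)
```

The result is three source records and three evidence person records, each person pointing at the directory it came from; nothing here yet says the three mentions are the same individual.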
Now I COULD have ALSO created an "excerpt" for these records. That is, I could have clipped an image of the pages where the persons are mentioned and added those images to my database. They would also point to the sources, and the evidence records would point to them. For me that's overkill. I have faith that the evidence person records I create are sufficient. But I fully understand the desire that many genealogists might have to be able to store images of the sources in their databases for ultimate backup.
What do you think? Would you have clipped those images of the directories and added them to your database?
Tom Wetmore
I see your and Geir's point about the extraction level being overkill. It also models better in a relational database to remove it. I'll update my example to get rid of that layer.
With that in mind, where would you suggest I put that information without duplicating it on every person record?
BetterGEDCOM has a moderator for Evidence Explained and GPS--it's me.
Can someone explain why there is another effort going on to redevelop the concepts?
When I read your original example it was clear that <dateExtracted> and <extractor> were new concepts that the <extraction> element added. I wondered about the value of those two things when trying to evaluate the importance of the <extraction> idea. My conclusion was that the ideas were not important enough to justify keeping the <extraction> concept. As far as I know there has been little concern about these two ideas. When getting data from Ancestry.com, you get the source info (the database the data was found in, and you can click on the "more info" button about that source to get a more detailed description of it), but you don't get any details as to who did the individual extractions or when they did it.
If you get rid of the <extraction> element, you have to put the <dateExtracted> and <extractor> elements into the <person> records. (What's wrong with a little duplication? Smile, smile.)
Tom Wetmore
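A minimal sketch of the duplication described in the post above, assuming the element names from the original example: once the extraction layer is removed, each person carries its own copy of dateExtracted and extractor. The function name `flatten` is hypothetical and the sample is trimmed from extraction a1.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed from extraction a1 in the original example.
SAMPLE = """
<bettergedcom>
  <extraction id="a1">
    <dateExtracted>26 March 2011</dateExtracted>
    <extractor ref="s0"/>
    <person id="p4"><name>Eliza Hunt</name></person>
    <person id="p7"><name>John Hunt</name></person>
  </extraction>
</bettergedcom>
"""

def flatten(root):
    """Return standalone <person> copies, each carrying the extraction metadata."""
    people = []
    for extraction in root.findall("extraction"):
        shared = [extraction.find("dateExtracted"), extraction.find("extractor")]
        for person in extraction.findall("person"):
            flat = copy.deepcopy(person)
            for element in shared:
                if element is not None:
                    # One copy per person: this is the duplication in question.
                    flat.append(copy.deepcopy(element))
            people.append(flat)
    return people

people = flatten(ET.fromstring(SAMPLE))
for p in people:
    print(p.get("id"), p.findtext("dateExtracted"), p.find("extractor").get("ref"))
```

The cost is one extra date and one extra reference per person record; whether that is acceptable is the open question in this exchange.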
"I don't se any problem with linking to two different entity types, sources and researchers."
If you want to think of a researcher as different than a source in this context, fine. Just remember that we want a model to define our entites, attributes, and relationships. Having one kind of entity refer to another kind of entity in the "source reference" relationship would be a relationship we define in the model. It is up to us to specify the model entities that can occur at each end of the relationship.
"There is a need to look at how do we store "excerpts", and what do we link them to. "
There has been some discussion on this. I assume we are talking about excerpts that are in a form that can be stored on a computer as either a text or image file. Excerpts can therefore be on the web or in the local file system, so can all be specified by a URL. Most models allow references to these things. In the DeadEnds model they are currently called URLs, and references to URLs can be placed almost anywhere within DeadEnds records. GEDCOM allows the same idea. There has been discussion here about how such things would be put into a Better GEDCOM transport file, most people thinking that the export file, if the user so desired, could be a container (bundle in Mac OS X terminology) that holds copies of those files along with the data.
"Also, in order to have evidence records to link to, I assume we need to have evidence and conclusion groups, places, ships and what ever."
At least persons and events, maybe families and groups if added. OpenGen is talking about the same idea for places, but I don't understand what they would gain from that. (At last rumor they were thinking of doing it with DATES also; please God, help us poor mortals).
"Are we able to link to all interpreted (not exerpts) by including evidence persons/groups/places/etc? Have we covered everything every place that you wold record something?"
I think so.
"You have not responded to my suggestion to have a separate page for E&C - I am tired of discussing it all over the place, repeating explanations of the model all the time."
It's fine with me. I have been discussing this stuff for seventeen years, repeating explanations, ad nauseam. I might be tired of it, but I'm also convinced it is necessary. I guarantee that an E&C page will not put an end to these discussions!
Tom Wetmore
Thanks for your response.
"Third, I feel the "extraction" level concept might be a bit of an overkill."
I disagree. In fact, I think it is the most important (and radical) concept to what I have proposed as I will explain while addressing the rest of your points.
First let me say, the explanation part (first half) of what I proposed was not very complete. I wrote it late last night and was running out of steam. The concrete example I gave was more fleshed out as I had written it a couple days ago.
In reality, the "extraction" is just another source. In fact, I was thinking this morning that the person, event, etc records ought to just reside in a source structure. I think it is VERY important to capture the result of extracting from a source, because what the researcher is doing is in fact creating another source. In my mind there is no difference between a researcher extracting a record from a source and then passing that extraction on (in the form of a gedcom file, etc) to another researcher. The extraction (source) was created from another source, but then again, the source that the extraction was created from probably came from another source as well. And so it continues (if you are fortunate to find the complete chain of sources) until you arrive at a human being who originally created/provided information for a physical source (document etc).
Because extracting information from a source actually creates a source, this recursive data structure can now be used to solve your Fourth point of addressing the "person-based, conclusion-based world" in a way that documents the creation of the conclusions (your Fifth point).
If you look at <extraction id="a3"> and <extraction id="a4"> in the example, you can see a computer algorithm and a researcher forming and documenting a conclusion and creating a "top-level" conclusion person (personId p300).
I apologize for the readability of the example. If anyone can tell me how to get this wiki to maintain white space, I'll make the example more readable.
Mike M.
I still think the extraction element is overkill. I see it as a "sub-source" concept that's not needed. I've read your example carefully. You are putting additional source fields at the top of some of the extraction elements. Those things should in my opinion be in the source records directly. The persons and events only need the additional source details to get to the right place in the sources.
I see one possible advantage of an extraction-like object, as a way of binding together a self-contained set of person records with the event record they are role players in, e.g., a census household or a marriage record. In my software, internally, I have found it useful to use a data type I call a cluster, to be this concept, but at the database level I haven't encountered that need. The event record points to the person records and the person records point to the event record. Nothing additional is needed to call attention to the fact that they form a self-contained unit. One of my absolute rules about evidence records is that they be permanent, so those pointers will never change.
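The mutual pointers described above can be checked mechanically. The sketch below assumes the role element from Mike's example and reports any pointer that has no mirror on the other record; the function name `one_way_links` is hypothetical, and the sample is trimmed from extraction a0 with p3's back-pointer removed on purpose to show a failure.

```python
import xml.etree.ElementTree as ET

# Household from extraction a0, trimmed; p3 deliberately lacks its role pointer.
SAMPLE = """
<extraction id="a0">
  <person id="p1"><role ref="e1" type="CHILD"/></person>
  <person id="p2"><role ref="e1" type="PARENT"/></person>
  <person id="p3"/>
  <event id="e1" type="BIRTH">
    <role ref="p1" type="CHILD"/>
    <role ref="p2" type="PARENT"/>
    <role ref="p3" type="PARENT"/>
  </event>
</extraction>
"""

def one_way_links(extraction):
    """Return (from_id, to_id, role_type) for each pointer lacking its mirror."""
    links = set()
    records = extraction.findall("person") + extraction.findall("event")
    for record in records:
        for role in record.findall("role"):
            links.add((record.get("id"), role.get("ref"), role.get("type")))
    # A link is reciprocal when the same role type points back the other way.
    return sorted(l for l in links if (l[1], l[0], l[2]) not in links)

missing = one_way_links(ET.fromstring(SAMPLE))
print(missing)
```

When every pointer has its mirror, the persons and the event already identify the self-contained unit without an enclosing element.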
And you don't need the basedOn elements because they are redundant with the sameAs fields. From a person you can get to the sub-persons via the sameAs pointers, and from the sub-persons you can get to their sourceRefDetails. I'm not terribly comfortable with the name "sameAs" but I don't know what would be better; perhaps "subPerson". In the DeadEnds model I think I just called it a personRef (too lazy to go look while I'm typing this, though not too lazy to say I'm too lazy).
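The traversal from a conclusion person down to its evidence persons might look like the following sketch. It assumes the lowercase sameas element as spelled in Mike's example; the function name `evidence_persons` is hypothetical and the sample is trimmed from extractions a0, a1, a3 and a4.

```python
import xml.etree.ElementTree as ET

# Trimmed from the example: p300 -> p1 and p200; p200 -> p7.
SAMPLE = """
<bettergedcom>
  <person id="p1"><name>John Hunt</name></person>
  <person id="p7"><name>John Hunt</name></person>
  <person id="p200"><sameas ref="p7"/></person>
  <person id="p300"><sameas ref="p1"/><sameas ref="p200"/></person>
</bettergedcom>
"""

def evidence_persons(root, person_id):
    """Follow sameas pointers and return ids of persons that point no further."""
    by_id = {p.get("id"): p for p in root.iter("person")}
    leaves, stack, seen = [], [person_id], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in by_id:
            continue  # skip repeats and dangling references
        seen.add(pid)
        refs = [s.get("ref") for s in by_id[pid].findall("sameas")]
        if refs:
            stack.extend(refs)
        else:
            leaves.append(pid)
    return sorted(leaves)

root = ET.fromstring(SAMPLE)
print(evidence_persons(root, "p300"))
```

Starting from p300 this reaches p1 and p7, the two census-level records, so the sources behind a conclusion are recoverable from the pointers alone.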
Without an enclosing extraction your person 300 could be:
<person id="p300">
<sourceRef id="s0">
<sourceRefDetail type="CONCLUSION">I think these two are the same individual because...</sourceRefDetail></sourceRef>
<sameAs ref="p1" />
<sameAs ref="p200" />
</person>
I have taken the liberty of adding the sourceRef and sourceRefDetail elements as a replacement and slight tweak on your sourceDetail element to better capture its structure as I feel it.
Frankly, I've never been a fan of XML and I love to remove extraneous pointy thingies.
I get from your message that you might be thinking about the extraction as an extracted form of the source itself. I've thought about this a bit in the past, wondering whether there should be a way to take a source and convert it into some kind of tagged, "marked up" text or computer file. Whenever I go there I end up feeling like I'm grabbing at an extreme idea and I back away from it. I have found that the idea of creating evidence level person and event records is already too extreme for many Better GEDCOM folk. If we try to add the idea of also creating text-based, internal records, out of the sources themselves, I think we would lose almost everybody. What confuses this whole thing, and makes it very difficult to explain to the lay person, is that some sources are already in digital form (e.g., GEDCOM files, FamilySearch index databases)! It gets circular and pretty nasty sorting out these things sometimes.
I am super glad you have joined the discussion as someone else seriously thinking about the modeling problems of multi-tiered evidence and conclusion trees. I've been getting pretty pessimistic that Better GEDCOM will finally embrace this idea, so it's great to have someone new to add to the group of us who think it might be a good idea! I really don't care much that we are disagreeing on the extraction idea, since we are agreeing wholly on the multi-tiered needs of supporting the research process.
Tom W.
To be honest, I don't mind that we are disagreeing on extractions either. I've had this "extreme" idea for some time now and I suspect there are others who also agree that a major paradigm shift needs to occur in the genealogy data model sphere. It is my belief that current models lend themselves by their very nature to corrupting genealogy data. Only the very diligent and experienced can keep their personal databases untainted by their own conclusions and the conclusions of others, because current data models make it so easy to add erroneous conclusions and yet VERY difficult to locate and remove those conclusions and replace them with new conclusions.
My purpose in posting the model to BetterGedcom is to start a discussion about this in a concrete form that can allow people to discuss the model's merits and pitfalls. Thus, I'm very pleased that you are taking what I have posted seriously and taking the time to really internalize it and offer valuable feedback.
"I've been getting pretty pessimistic that Better GEDCOM will finally embrace this idea"
I hope you are wrong and that others will embrace this paradigm shift. I really don't understand why there is a disagreement at all to exploring alternative data models. I haven't taken the time to read all the discussions on this website (there are a lot and I don't have a lot of time), so maybe if I did, I would understand the disagreements. Perhaps those who are disagreeing don't fully understand what you are proposing. It is a rather radical shift in what everyone is used to. My thought is this new data model can very easily offer everything an old model did, but add so much more along the lines of supporting real genealogy research.
My thought is if others on the BetterGedcom website don't agree with our approach, perhaps they are willing to let us and those who are interested have our own little space on the website to continue to explore this evidence-based model. In the end we may all find we are on the same page. I could go off and write it on my own, but I would much rather have feedback, help, collaboration, support, adoption etc, etc. BetterGEDCOM seems the right place to get this. Plus, I find all the other activities on the website very valuable too.
Mike M