1. Standardized Metadata and CSL
Creating reference notes and bibliographies in genealogical software usually begins with the user adding details about sources to their database. When users fill in fields of data about a source—identifying the author, title, date, and so on—they are recording source metadata. Other techniques then manipulate this metadata electronically to create reference notes and bibliographies.
Professionals from different libraries and archives also create metadata, often about the same sources with which genealogists work, and there are formal “metadata standards.” Some familiar metadata standards are MARC21, MODS, and Dublin Core. For example, if you review a title in the //Library of Congress Online Catalog// (US), the associated “MARC TAGS” for the work are displayed in a tab.
The three noted standards, MARC21, MODS, and Dublin Core (there are others), are each different, especially in their targeted level of detail and thus in their descriptors and definitions. In particular, MODS does not reach the level of detail of MARC21, and Dublin Core is greatly simplified relative to MODS.
Taken collectively, standardized metadata is complex; however, third parties have developed ways of extracting typically high-level bibliographic data from a range of standardized metadata. “Reference management software”—products like Thomson’s EndNote—lets users create libraries of information about sources (including bits of extracted standardized metadata), from which bibliographies and reference notes can be created in a variety of recognized citation styles.
Many paper-based genealogists use products like EndNote in their research and writing. Some of these products allow users to “drag and drop” citations from the library into their word processing program.
Zotero is an open-source reference management software product that works similarly to EndNote. The technology behind Zotero is called “Citation Style Language,” or CSL, which is also open source. Like EndNote, Zotero uses CSL protocols to interface with sets of standardized metadata. Said another way, CSL/Zotero define fields (like our citation elements) that map to the more complex standardized metadata. Using CSL technology, Zotero supports different languages and generates bibliographic entries and high-level reference notes in more than 1,500 different citation styles.
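The mapping idea can be sketched in a few lines. This is illustrative only, not actual Zotero or CSL code: the field names and the Chicago-like style rule are assumptions, and real CSL styles are far richer.

```python
# Illustrative sketch of the CSL idea: an item carries named fields,
# and a separate style rule renders those fields into a formatted entry.
ITEM = {
    "type": "book",
    "author": "Mills, Elizabeth Shown",
    "title": "Evidence Explained",
    "publisher": "Genealogical Publishing Co.",
    "issued": "2007",
}

def render_chicago_bibliography(item):
    """Render a bibliography entry in a roughly Chicago-like form (simplified)."""
    return "{author}. {title}. {publisher}, {issued}.".format(**item)

print(render_chicago_bibliography(ITEM))
# Mills, Elizabeth Shown. Evidence Explained. Genealogical Publishing Co., 2007.
```

The point of the separation is that the same item data could be handed to a different style rule (say, a reference-note renderer) without the user re-entering anything.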
The main point here is that at a more scholarly and even institutional level, approaches to standardization exist and their development continues. As well, third-party efforts, including some that are open source, have developed ways to work with this data.
2. Reference Management Software, Stylistic Matters: Substance and Form
Chicago Manual of Style Online, at 14.1, "The purpose of Source Citations," explains, "Ethics, copyright laws, and courtesy to readers require authors to identify the sources of direct quotations or paraphrases and of any facts or opinions not generally known or easily checked." The source continues, "Conventions for documentation vary according to scholarly discipline, the preferences of publishers and authors, and the needs of a particular work. Regardless of the convention being followed, the primary criterion of any source citation is sufficient information either to lead readers directly to the sources consulted or, for materials that may not be readily available, to positively identify the sources used, whether these are published or unpublished, in printed or electronic form."
From Mills, 2007, 42, "Evidence Explained is rooted in [Chicago Manual's Humanities] style. However, most Evidence models treat original or electronic sources not covered by [the CMOS, Bluebook, MLA, Turabian manuals], as well as some modifications that better meet the analytical needs of history researchers."
See introductory paragraphs on this wiki page and the opening comments to the wiki page, "Modern Style Guides." See also, Tamura Jones, "Genealogy Citation Standard," Modern Software Experience, 27 June 2011; James Tanner, "Looking towards a rational philosophy of citations," Genealogy's Star, 17 July 2011.
Where angels fear to tread.
Standardization involves issues of substance (information about the source) and form (given a particular style, how information should be formed into a citation). Matters of form include whether and how to present the various elements, including their punctuation, which tends to vary by country and often by institution or publisher. All the form in the world won't make up for a lack of substance.
Some issues of substance cross discipline lines, but some are unique to particular disciplines. In the simplest sense, I see substance as fields or citation elements that provide information about the source. Some elemental needs (pun intended) are more common in genealogy than in other disciplines. For example, genealogists deal with many documents that don't carry any title (say, a letter), and we work with many documents that carry the same title ("Certificate of Death").
Just as BetterGEDCOM wants to standardize these matters of substance so that genealogists can exchange information, the broader class of reference management software is working toward the same objective on a worldwide basis and across disciplines. As far as I have been able to learn, reference management software approaches substance in the same way Geir's document suggests we would proceed--first setting out "source types" (Zotero calls them item types) and then, for each source type, defining fields (like our citation elements), including references that are similar to Geir's modules. As a result of these approaches, there are relatively few source types in reference management software, and there will be relatively few source types in BetterGEDCOM.
That genealogists need more source types or more fields, including assertion-level fields, is not the point--that they have item types and fields we could build upon is the point. As these same third-party efforts (some open source) develop item types and fields that support access to standardized metadata, aligning our work with the larger reference management software movement might bring BetterGEDCOM's effort closer to online citations.
In reference management software, given the coordinated list of item types and fields, style libraries can be created and managed by a separate effort, based on something akin to Geir's concept of "style rule sets." Once core styles have been developed, additional styles are added based on how each new style compares to a core style or to some other style.
3. Master Sources and Assertions
The source system in genealogical software, which BetterGEDCOM aimed to support, functions somewhat like reference management software--libraries of source data are developed to support the creation of bibliographic and reference note citations. But genealogical software source systems need to store full reference notes using integrated citation mechanics that involve elements at both the “master source” and the “assertion” level, and these mechanics are frequently manipulated by users taking different approaches to managing the master source list. (Conversely, the library in reference management software stores high-level elements, frequently at a level higher than a master source developed by genealogists.) Said another way, to function, genealogical software needs to efficiently store and manage more elements/fields, and likely at more or different levels, than reference management software.
4. Genealogical Software and Source Types
Currently, across genealogy software programs, source types tend to be directly associated with particular citation templates, which may or may not be proprietary. The template determines which citation elements are available for that given source type. Reference management software works quite differently. In CSL/Zotero, the item type (source type) determines the available information fields, which ideally represent a universe of the fields various styles or catalogs require. The styles are built separately, populated by the item type’s information fields.
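The CSL/Zotero approach just described can be sketched as follows. The item types and field names here are illustrative assumptions, not Zotero's actual schema; the point is only that the item type, not any citation template, determines which fields are available.

```python
# Sketch: each item type (source type) defines its available fields;
# styles are built separately on top of those fields.
ITEM_TYPE_FIELDS = {
    "letter": {"author", "recipient", "date", "repository"},
    "newspaperArticle": {"author", "title", "publicationTitle", "date", "page"},
}

def validate(item):
    """Reject any field the item's type does not define."""
    allowed = ITEM_TYPE_FIELDS[item["type"]]
    extra = set(item) - allowed - {"type"}
    if extra:
        raise ValueError(f"fields not defined for {item['type']}: {sorted(extra)}")
    return True

# A letter has no "title" field -- validation enforces the type's field set.
validate({"type": "letter", "author": "Sammy Sue", "date": "1898-03-02"})
```

Any number of styles can then draw on the same validated fields, which is why CSL needs comparatively few item types to serve many styles.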
Because of this difference, hundreds upon hundreds of source types exist in genealogical software, each tied to a particular style. These source types are often localized (US- or UK-centric) and use arrays of citation elements that are also unique to each software vendor.
In contrast to the hundreds of source types in genealogical software, which supports only a few styles, CSL/Zotero recognizes fewer than 50 source types, supports more than 1,500 citation styles, and serves a worldwide market.
5. Flexible Schema
Much like reference management software, BetterGEDCOM could define reasonably high-level source types in accordance with Geir’s approach (I call his approach a schema).
Universal/Country Specific > Source Type Class > Source Type.
We tentatively identified a group of 23 universal source type classes (books, journals, research reports, web pages, newspaper items, etc.). Census and vital records are among the source type classes that would be country specific.
Where CSL has established item types, BetterGEDCOM could/should adopt those named source types and descriptions. BetterGEDCOM would add source types to the BetterGEDCOM model as necessary.
For each Source Type (and thus Source Type Class), BetterGEDCOM could define a set of available elements (citation elements/data types) to ideally support the production of citations for that source type regardless of nuances in form, language or style. Where CSL has established fields, BetterGEDCOM could/should adopt those fields as citation elements, and add unique citation elements to the BetterGEDCOM model as necessary.
Vendors (and users, if the vendor permits) could/should be able to extend the BetterGEDCOM source types using a system of levels, with the lowest level representing the database assertion. All citation elements would be available at any level,* and lower level sources would inherit the properties of the higher-level source.
US > Vital Records > Assertion
US > Vital Records > Sammy Sue > Assertion
US > Vital Records > New Hampshire Deaths > Assertion
US > Vital Records > Massachusetts Vital Records > Assertion
US > Vital Records > New Hampshire Deaths > Sammy Sue, certificate 20632 > Assertion
US > Vital Records > New Hampshire Deaths > Thomas Jones, died 1698 > Assertion
US > Vital Records > State Certificates > Missouri Death Certificates > Mike Jones (1942)
US > Vital Records > State Certificates > Missouri Death Certificates > Assertion
*This includes page, etc., so that no particular citation element should be limited to the "assertion" level. See the lumper vs. splitter graphics on the wiki page Citation Mechanics.
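The leveled-source mechanics above (any citation element available at any level, lower levels inheriting from higher ones) might look something like this sketch. The class and element names are hypothetical, chosen only to mirror the New Hampshire example.

```python
# Sketch of leveled sources: a lower level inherits the citation
# elements of the levels above it, and may add or override elements.
class SourceLevel:
    def __init__(self, name, parent=None, **elements):
        self.name, self.parent, self.elements = name, parent, elements

    def resolved(self):
        """Merge elements from the root down; lower levels override higher ones."""
        merged = self.parent.resolved() if self.parent else {}
        merged.update(self.elements)
        return merged

us = SourceLevel("US Vital Records", jurisdiction="US")
nh = SourceLevel("New Hampshire Deaths", parent=us, series="NH Deaths")
assertion = SourceLevel("Assertion", parent=nh, certificate="20632", page="344")

print(assertion.resolved())
# {'jurisdiction': 'US', 'series': 'NH Deaths', 'certificate': '20632', 'page': '344'}
```

Under this scheme, a lumper would put most elements high in the chain and a splitter would put them low, but the resolved set of elements, and therefore the rendered citation, comes out the same.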
Among others, “Metadata Object Description Schema” [MODS] “is an XML-based bibliographic description schema developed by the United States Library of Congress' Network Development and Standards Office. MODS was designed as a compromise between the complexity of the MARC format used by libraries and the extreme simplicity of Dublin Core.”
Your article is very timely. This is exactly what I'd like BetterGEDCOM to accomplish as in my Proposed Vision Statement discussion.
You reference Tamura's Genealogy Citation Standard Article and in that he says:
"In 2009, Mark Tucker posted a YouTube video that demoed how citing online sources should work; different sites should all use the same standard citation format, one which would enable desktop applications to automatically receive all the pertinent data from the online record collection."
That is exactly what I'm looking for in my proposed Vision for BetterGEDCOM.
Mark himself, in the 3rd comment on his post, answers the question as to what the system would use: "This would either use XML as the file format or an updated version of GEDCOM."
I hope GeneJ (and everyone else), that you're willing to use BetterGEDCOM to start taking this to reality and making it happen.
The only GEDCOM tag that should be used for location information that will translate into a citation is:
+1 PAGE <WHERE_WITHIN_SOURCE>
WHERE_WITHIN_SOURCE = Specific location within the information referenced.
The data in this field should be in the form of a label and value pair, such as: Film: 1234567, Frame: 344, Line: 28
This is perfectly generalized and puts no limitation on the label/value pairs that are allowed (except it doesn't tell you what to do if you want the value to contain a comma).
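A minimal sketch of how a program might split such a PAGE value into label/value pairs follows; it assumes, as the convention above does, that values contain no commas.

```python
# Parse GEDCOM's WHERE_WITHIN_SOURCE label/value convention,
# e.g. "Film: 1234567, Frame: 344, Line: 28".
def parse_page(page_value):
    pairs = {}
    for part in page_value.split(","):
        label, _, value = part.partition(":")  # split on the first colon only
        pairs[label.strip()] = value.strip()
    return pairs

print(parse_page("Film: 1234567, Frame: 344, Line: 28"))
# {'Film': '1234567', 'Frame': '344', 'Line': '28'}
```

A formalized version for BetterGEDCOM would also have to specify an escaping rule for commas inside values, the gap the paragraph above points out.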
This is the part that you can expand on and formalize to get citations to work perfectly and identify the data source as precisely as you want it.
This is also the part that very few (if any) developers have implemented. They missed the construct they were supposed to use, and they mess things up by forcing their information into other GEDCOM tags. Citations would transfer much better if this construct were used. There would be no mistaking what the fields and their values are.
In BetterGEDCOM, we should bring this construct front and center and give concrete examples on how to use it properly.
As far as the Source Record itself, GEDCOM has tags that include: AUTH, TITL, ABBR, PUBL, TEXT, DATA/EVEN/DATE/PLAC/AGNC/<NOTE_STRUCTURE>, <SOURCE_REPOSITORY_CITATION>, REFN/TYPE, RIN, <CHANGE_DATE>, <NOTE_STRUCTURE>, <MULTIMEDIA_LINK>.
These generally transfer well. If there are cases where they don't, we can devise a way in BetterGEDCOM to fix them and add another tag if necessary.
Do you mean move macro (master level) data to the assertion level fields?
As I said, if you have a nice little book that you reviewed as a text edition, or a nice little manuscript that you viewed at a repository, GEDCOM will do just fine.
But, I review books on line and in e-edition now.
I do not want to hunt and peck to find a URL (especially if I want to delete it) ... I don't want to wonder where to put the original publication date.
I want to manage my master sources, not be told by GEDCOM how I should manage my work.
GEDCOM's source structure is broken, Louis, it really is. Trying to add complicating features to a broken system just seems like a bad vision.
Me thinks we can fix the structural issues faster than we can figure out all the possible patches it needs. --GJ
Adding a URL field (if that's what we decide to do) is a trivial change to make to BetterGEDCOM. That's not broken. That's just a normal evolution of the standard.
Nothing in GEDCOM tells you how to manage your work. In fact, when you do your work, you do it with your genealogy program. The program should worry about how to present your source data to you and it should know (but usually doesn't) how to correctly translate its own database to and from GEDCOM.
We won't make any progress unless we start fixing something.
If you think we can fix the structural issues, then let's address that first. Right here and now.
I have one big structural issue that my Vision is addressing - and that is separating the source information / evidence from the conclusions.
What structural issues are the ones you are concerned with?
Downloading evidence records would also require the ability to download source meta data.
I don’t see a conflict (or opposite focus) between downloading evidence and citations.
The leading genealogy programs have implemented Evidence Explained. Exchanging citation data between these programs will in my view require a solution that is much more advanced (in terms of e.g. Citation Elements) than current Gedcom – the old Gedcom would end up being 5% of a new solution. Therefore I don’t see any point in patching Gedcom to handle EE and more advanced solutions, it will just make things more complicated to develop. I have discussed backwards compatibility with current Gedcom in my Architecture document, see Standard Citation elements. A short term solution based on Gedcom, for download of source meta info is in my view a dead end – it will not carry the necessary info in a precise way – I think this is demonstrated in how current more advanced programs use the Gedcom fields today – it’s a mess as documented earlier on this wiki.
As I have indicated in a parallel discussion, I see a potential problem in allowing just about everybody to define source types that can be exported all over the place. I fear anarchy because of far too many similar source types--but at the same time we want user-defined source types. This needs to be looked into.
The splitter/lumper issue – isn’t the additional CEs in the splitter “where in source info”? Why do we need the “Lumper?”
I am not sure I understand the PROBLEM being discussed about micro data (as defined by Gene, perhaps record/entry data is a better term) but if I understand it correctly there is a potential overlap between citation elements and evidence data. If so, I am not sure I see a problem with the overlap.
Louis states that Gedcom already has a solution for “standardized source info.” I have my doubts, but await what Louis has in mind.
Yes, the leading genealogy programs have implemented Evidence Explained. They also export them into GEDCOM but all in somewhat different ways. They can read their own data back, but they don't transfer well. You've already started analyzing that and that's excellent.
Surely each program's GEDCOM export suggests different ways of exporting, and a "best" method could be decided on to be the standard and generalized so that it could work with all of EE or Lackey or other citation methodologies.
"Opposite focus"--if you conceptualize the information as "levels of detail"--from information about how to find and identify the source through to information in the source--there is a sort of continuum. Higher-level (macro) information sits at one end, and fine details in the source (micro) at the other end.
We want "Geir smart" unambiguous fields to hold all that data. By "opposite focus," I was offering my perspective that our discussions about evidence records focus on the micro end of that continuum ... and that the higher, macro-level data is just as important in substance.
Lumper/splitter ... "isn’t the additional CEs in the splitter “where in source info”? Why do we need the “Lumper?” I don't think I understood the question, but an example follows using census.
Current GEDCOM recognizes two levels only--the master source (the source record) and the assertion level (where in source, etc.).
Some users create a master source for census at a high level (say the county or state level or even the census year level), while other users create a master source (source record) for census at the household level.
Those who set up their master source at a higher level are called "lumpers" (they consolidate or lump the source definition and add many details at GEDCOM's only other level-the assertion level ... so they add a lot more than "where in page" at the assertion level). Those who set up their master source at a lower level, say the household level, are called splitters (they split what might be called a single source into many master sources). These splitters enter most of the details (including all the page numbers) at the master source level.
BetterGEDCOM could try to standardize these mechanics, but I'm suggesting that wouldn't work well--lumpers will find a way to lump and splitters will find a way to split, even if that means mis-using our wonderful elements to meet their needs. Ala, I suggest BetterGEDCOM employ the higher-lower level approach to sources, with the last level always representing an assertion. If we make all the fields available to all the levels, with the lower level inheriting the properties of the higher level, we can make lumpers and splitters happy with the same element group.
Make sense? --GJ
Finally I understand your "opposites".
Lumper/splitter: What I was thinking is that if the Splitter contains all the CEs of the lumper, but has added fields for more detail, why not have only the splitter - just not use all the CEs.
I now realize that the "border" between source and "where in source" CEs has been moved in the two cases. I may not see all the implications, but if all CEs that you transfer are uniquely identified, the "function" that produces a reference note may not see this border (it just sees a set of CEs), but it will show up in bibliographies--or?
Am I starting to understand?
Well, there is a problem exchanging the data today if you don't intend to pay for conversion (in a little while). I realize the issue is huge, but we will see ...
Yes, I think you have it.
When we create the broader group of CE's, lumpers will want the more detailed CE's to be at the assertion level, while splitters will want them to be at the master source level.
See the first part of Adrian's post (link below), where we agreed that users want the ability to decide what sits at the source_record level (ie, my term, "Master Source.")
Comparatively, when RootsMagic implemented EE, it used a whole host of CE's at the "master source" level and a whole group of CE's at the "assertion" level. See the link below. The part in yellow color is RM's Master Source level and the part in green is RMs assertion level.
If we were to take screen shots (I wish I had) for this same "source type" (US Census Online Image), we'd find that even if the CE field names are standardized, FTM has a little different set-up than RM. (And TMG will be different from both. Let's assume Legacy, Family Historian, etc. are a little different too, and so on and so forth.)
If we allow the CE's to shift between levels, there is no reason to believe that each vendor's approach wouldn't fit well into BetterGEDCOM.
Now then, if we allow there to be another level (so you could have high-level/low-level entries in the master source list), I think I'm in heaven. I would likely then have my high-level census entry set up like the bibliographic entry = county level. My low-level master source (I'm a census splitter) would be the household-level data (so, page no., civil district, dwelling/household no., person of interest). I could manage both those levels from the Master Source List.
These higher-lower level source mechanics would be great for large series, too. --GJ
(1) I don't believe all programs should be required to use the same citation style. (In truth, as an export format, we can't even require that software generate citations.)
(2) "Standardized Metadata," an effort I would support, has the opposite focus of your E & C approach, Louis. Standardized metadata works to get information about the source developed. While I'm not an expert in the current issues, in looking over the different standards, it seems pretty clear that those who have tried to simplify the metadata fields ultimately fail to create a standard. (The fields become ambiguous.)
So, for example, if you tried to add your micro-level data (that's really what it is) to GEDCOM-styled macro-identifiers, I believe that standard would fail. PS: As you know, I don't know why repositories would be interested in coding to our alternative standard anyway.
This difference in focus--macro or micro--is not a small issue. Focusing on the micro and not the macro is what led one of our providers to post millions of images that can't be traced back to the FHL film from which the images and/or indexing were developed.
(3) I hope that BG will implement the item type/source type, expanded field/citation element and modules in the near term. And I'd be willing to do all I can to identify the fields and help with the modules.
I realize your vision is in conflict with that hope.
(4) In the longer term, I'd like to see an open-source BetterGEDCOM citation library develop--something that would support all styles.
(1) I agree with you. Citation styles should be flexible. I am actually shocked when I go back and read that the line I quoted did say "citation style". Somehow I interpreted it as a standard for online "source information / evidence"--but NOT for the citation. Thanks for pointing this out.
(2) It is standardized source information that we are trying to build up. I don't see that as being very difficult, and GEDCOM almost has it now. It's just a matter of separating the information from the conclusions.
I have no idea what you're talking about with regards to micro/macro. I don't see what's so hard about this.
(3) Continue with your citation work. It will be needed in the future. But a first step is needed to make progress now.
I don't see how my vision is in conflict. I see my vision as the first step.
(4) So would I. But I'd like to see the shorter term happen before the longer term. :-)
(2) ... "Standardized source information .... and GEDCOM almost has it now. It's just a matter of separating the information from the conclusions."
See, I don't follow that logic. We are just on different planets.
I'm guessing we have both read Terry Reigel's article about TMG and GEDCOM. Can't we agree that source macro data becomes ambiguous when processed via GEDCOM?
Simple item types--letters, interviews, photographs, journal articles, digitized census, websites, etc.--do not "GEDCOM" transparently. GEDCOM's fields work for simple published books or unpublished manuscripts you find in a repository. Beyond the most simple forms, one has to get creative with the available fields so that for one source or item type, you find a URL listed as the call number, the website name "might" be in the repository field ... when "date" transfers, it might be the access date, maybe it's the event date or maybe it's the date published. In some other source, you'll find the URL in the field for publisher location, and the access date in the note field ...
By macro data, I mean the fields or elements critical to the identification of a source. These are often higher level fields by which sources are cataloged and identified by repositories, libraries, archives.
By microdata, I'm referring to the bits you sometimes call "evidence"--these bits are at a lower, deeper, micro level.
Say the macro data for a death certificate would be all the information we use to identify the certificate.
The micro data would be fields of information in that death certificate (informant's name, cause of death, street address, disposition of remains).
As I understand it, you want to use GEDCOM fields as a substitute for the standardized metadata discussed in the page posting, and then you hope others will harvest information at an even deeper level to attach to that GEDCOM record.
I'm questioning that logic. If I can't pass my own master source list from me to me without significant confusion of the data, why would we recommend a repository use it as a base for cataloging information?